Bf 02475200

March 26, 2018 | Author: Tareas Y Medios Audiovisuales | Category: Principal Component Analysis, Eigenvalues And Eigenvectors, Factor Analysis, Matrix (Mathematics), Linear Algebra


Comments



Description

CEJC 3(4) 2005 731–741SVD-based principal component analysis of geochemical data Petr Praus∗ Department of Analytical Chemistry and Material Testing, VSB-Technical University Ostrava, 17. listopadu 15, 708 33 Ostrava, Czech Republic Received 2 May 2005; accepted 20 June 2005 Abstract: Principal Component Analysis (PCA) was used for the mapping of geochemical data. A testing data matrix was prepared from the chemical and physical analyses of the coals altered by thermal and oxidation effects. PCA based on Singular Value Decomposition (SVD) of the standardized (centered and scaled by the standard deviation) data matrix revealed three principal components explaining 85.2 % of the variance. Combining the scatter and components weights plots with knowledge of the composition of tested samples, the coal samples were divided into seven groups depending on the degree of their oxidation and thermal alteration. The PCA findings were verified by other multivariate methods. The relationships among geochemical variables were successfully confirmed by Factor Analysis (FA). The data structure was also described by the Average Group dendrogram using Euclidean distance. The found sample clusters were not defined so clearly as in the case of PCA. It can be explained by the PCA filtration of the data noise. c Central European Science Journals. All rights reserved. Keywords: Principal component analysis, singular value decomposition, factor analysis, hierarchical cluster analysis, altered coals, geochemistry 1 Introduction Real chemical data contain not only important information but also confusing noise. They are mostly far from normality with collinear and/or autocorrelated variables, containing outliers and so forth. For extraction of this information with its minimal lost, there are several chemometric methods for the reduction of data dimensionality, such as Principal Component Analysis, Factor Analysis, Independent Component Analysis [1], Independent ∗ E-mail: [email protected] Unauthenticated Download Date | 11/22/15 12:25 PM The content of ash (Ad . SVD is the more robust. In PCA. concentration of elements (Cat . including sampling and preservation. This testing data set was adopted from the work of Klika and Kraussov´a [11] with the authors courtesy. we look for new abstract orthogonal components (eigenvectors) which explain the most of data variation. Computing the SVD consists of finding the eigenvalues and eigenvectors of AAT and AT A. The importance of each singular vector is given by the squares of nonnegative singular values of the matrix S. wt %). The aim of this paper was to analyze geochemical data by the SVD-based PCA (PCA/SVD). S (nxp) is a diagonal matrix with singular values in decreasing order. SVD has already found the wide range of various applications in molecular dynamic and gene expression analysis [7]. The powerful property of SVD is compressing the information contained in A into the first singular vectors which are mutually orthogonal and their importance rapidly decreases after the first columns/rows. humic acids (HAdaf . The parameters were selected in order to classify altered coals. even in cases of ill-conditioned problems. SVD decomposes an arbitrary matrix A (nxp) into three matrices: A = U S VT (1) where U (nxn) and VT (pxp) are orthogonal and normalized matrices. image processing [9]. PCA is based on the Eigenvalue Decomposition (EVD) [4. Coal analyses. Within the red beds and their vicinity. volatile matter (Vdaf . mean reflectance of vitrinite (RO ). all in atom %). spectral analysis [10]. Oat . Nat ..e. moisture (Wa . wt %). The singular values are the square roots of the eigenvalues of AAT or AT A.5] of covariance/correlation matrixes and/or by the SVD of real data matrices. i. Hat . information retrieval. In comparison with EVD. were carried out according to the standard methods. 2 Experimental 2.. The sample summary statistics are given Unauthenticated Download Date | 11/22/15 12:25 PM . combustion heat (Qdaf . The testing data matrix summarises the results of chemical and physical properties of the coal samples taken from the Upper Silesian Coal Basin in the Czech Republic.732 P. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 Factor Analysis [2]. in the technique of Latent Semantic Indexing [8]. UT U=I and VT V=I. There are Carboniferous red bed bodies in this area. SVD is well known for its stability and convergence. e.1 Geochemical data set PCA/SVD was tested on the data matrix of the coal samples (n= 52) taken from the region of red bed bodies of the Upper Silesian Coal Basin. etc. reliable. etc. and precise method with no need to compute the input covariance/correlation matrix [6].g. The columns of U are eigenvectors of AAT and the rows VT are the eigenvectors of AT A. wt %). coals were altered by the oxidation and thermal effects which are manifested in their macroscopic and microscopic characteristics and various chemical composition. The columns of U are the left singular vectors and the rows of the VT are the right singulars vectors. Generative Topographic Mapping [3]. wt %). From a numerical point of view. MJ/kg). Under these conditions. 3.P. The reduction of Hat is associated with the relative increase of Cat . and Hat can express the intensity of the thermal changes in coals. SVD of this matrix A was executed using the standard MATLAB command svds(A. The high reflectance of vitrinite R0 is an effect of the coal oxidation at higher temperatures. The chemical composition of thermally altered coals without an influence of oxygen is characterized by the high concentrations of hydrogen. 85.2 PCA of the relationships among geochemical variables The components weights were calculated as the correlation coefficients of the original variables with the principal components and their plot is displayed in Fig. i. These data are typical by various types and scales of measured variables.6 %. Utah). = P (2) n 2 si 1 where sk is a singular value. Regarding the SVD theory. FA and HCA were performed by the software packages NCSS 97 (Number Cruncher Statistical Systems. and their reciprocity to Hat are evident from their opposite positions in the plot and well agree with the geochemical theory. 3 Results and discussion 3. the data were standardized. the singular values correspond to the square roots of the eigenvalues.2 % of the total data variance. The strong relations among Ro . Cat . To avoid the scaling effects. i.e.1 Determination of the principal components Determination of the components number is given by the characteristics of singular values and is demonstrated in a scree plot (see Fig. PCs) can be expressed according to the equation s2k var. 2. That is why the variance of the singular vectors (principal components.2 %. The singular values sharply decrease within three largest singular vectors and then slowly stabilize for remaining ones which contain a great deal of noise and therefore are not useful. This is in close agreement with the conventional 80% of the variance which should be explained by the principal components. and 11. Ro .2 Multivariate computations The data matrix for PCA/SVD was prepared and treated in Excel 97. 27.4 %. The revealed PC1 to PC3 contain 46. 1). Praus / Central European Journal of Chemistry 3(4) 2005 731–741 733 in Table 1. 2. It is obvious that the data are not normally distributed and scaling effects can be expected. centered the mean (average) and scaled the standard deviation of the original measurement variables [12]. Cat .e. Unauthenticated Download Date | 11/22/15 12:25 PM .k) which computes the k largest values and associated singular vectors of the matrix A. 28 45.58 0.7 152 12.598 17.43 3.50 12. Minimum Maximum Range Skewness Kurtosis 52 10. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 Unauthenticated Download Date | 11/22/15 12:25 PM Parametr .07 24.68 5.1900 1.2330 52 24.2955 52 64.41 5.1 4. dev.46 55.06 32.86 1.3122 -1.34 40.76 9.47 88.70 43.72 6.75 23. = Standard deviation.734 Wa (%) Ad (%) Vdaf (%) Qdaf (MJ/kg) HAdaf (%) R Cat (%) Hat (%) Nat (%) Oat (%) Count Average Variance Stnd. ∗ Table 1 Summary statistics of the tested geochemical data.7 71.88 5.06 3.42 -1.7067 52 13.8078 Concentration bellow the detection limit.3927 52 1.7214 -1.51 52.4920 2.41 0.70 4.0 6.14 1.07 1.55 0.3 2.3599 4.8191 52 2.82 4.98 0.0423 -1.83 55.86 1.362 0.4316 -0.1 135 11.5 8.6 47. Stnd.2678 0.9257 52 26.1052 52 7.59 3.41 88.6 2.6016 52 14.2 7.2286 6.44 35. dev.8529 0.7645 -1.06 -1.4 14.3 712 26.1 60. P.27 13.2066 52 30.72 15.71 4.32 30.728 0.65 23.7 0∗ 88. On the other hand. 2 Principal components weights plot of the geochemical variables. Fig. 2 and can be logically expected.P. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 735 Fig. Humic acids and water are the direct products of oxidation processes in altered coals. the variables Wa . 1 Scree plot of the singular values. the closest relation between HAdaf and Ad is exhibited. The high content of Ad is likely caused by loses of the gaseous oxidation products. In addition. 2. Unauthenticated Download Date | 11/22/15 12:25 PM . The reciprocal relationships of Wa and Oat to Qdaf are obvious from Fig. Their positions are similar to each other in Fig. CO2 . and H2 O. and Ad are considered to indicate the oxidation processes in coal. HAdaf . such as CO. Oat . and Delta(1.5). negatively correlates with Ro and Cat .8197 -0.80) which were proved by the Dean-Dixon test to be the outlying values. and Delta(1.0) were obtained for the Simple Average method (CC=0.5) and Delta(1. these variables can characterize the thermal coal alteration with zero or very low content of oxygen. it is evident that the groups are vertically and horizontally located with respect Unauthenticated Download Date | 11/22/15 12:25 PM .2855.0301 0.0081 0. The loading factors are given in Table 2. 3).0361 -0.9487 0.6348 0.9080 0. Applying the Euclidean distance metric.4159 -0. The sample groups in Fig.2063 0. 3.9364 -0.2.7430 -0.8205 0.0538 0. the content of Hat strongly.1 Factor Analysis of the geochemical data Factor Analysis was carried out to confirm the relations among the variables found by PCA/SVD. the highest CC and the lowest Delta(0. the sample CM-2 seems to be an outlier within the group I (see below).98 %) and Ro (2. On the basis of the diagnostic plots and the composition of the sample groups I to IV/2 (Table 3). Factor 2 reveals the significant correlations among the content of Wa and Oat which are negatively correlated with Qdaf and thus corresponds to the oxidation alteration of coals.19130 -0. A suitable clustering method was chosen according to the three clustering criteria.3369). as well.4268 Table 2 The factor loadings after Varimax rotation. Delta(0. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 3. In factor 1.1139 -0.2720 -0.0734 0.2427 -0. 3 were created exactly according to the clusters of the PC1 and PC2 dendrogram. By looking at this plot.0510 0. It is caused by its atypical content of Vdaf (19.8728 0. As it was given above. Factor 3 is mainly saturated by the content of Ad and HAdaf as the products of coal oxidation.5812 0. Delta(0. PC1 was constructed (Figs.8573 0. it can be concluded that the relationship among variables within all three factors well agree with those discovered by PCA using the components weights plot.1948 -0.4698 0.4502 0. The seven clearly separated groups were found.2857 0.5)=0. Parameter Factor 1 Factor 2 Factor 3 Wa Ad Vdaf Qdaf HAdaf Ro Cat Hat Nat Oat -0.0) [13].1657 0.1503 -0.7113. The coal samples were divided into these groups by means of hierarchical clustering of the two largest principal components scores.7771 -0.736 P. On the basis of these results. 2. such as Cophenetic correlation coefficient (CC).3 PCA clustering of the geochemical data The principal component scatter plot of PC2 vs.0)= 0. Their relation was clearly shown in Fig. 3. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 737 Fig. The groups III/1 and IV/1 are significant by the high content of HAdaf . III/1.5) and Delta(1. The group of samples with the highest concentrations of hydrogen. HAdaf indicating oxidation processes increase in the order II/1. The clusters were denoted in consistency with the group identification in Fig. 2) have the highest content of Cat and R0 . the highest CC and the lowest Delta(0. The magnitudes of the variables Wa .P.0) were obtained for the Group Average method (Table 4). the lowest values of the vitrinite reflectances and the lowest concentrations of water was denoted as I. Oat . As it is obvious from this figure the data structure is not so clearly organized as it was found by PCA. 2). Applying the Euclidean and Manhattan distances. as well. 2). 2). The better clustering results of PCA are likely caused by the its noise filtration which is due to the data dimensionality reduction.1 Hierarchical Clustering of the original geochemical data For verification of the PCA/SVD mapping. The final dendrogram (Fig. and IV (1. The lower content of HAdaf in III/2 and IV/2 is likely caused by their thermal decomposition at the temperatures above 250 ◦ C [14]. Six other groups contain the samples altered both thermally and by oxidation and therefore they can be divided into two subgroups (1 and 2) according to the temperatures (lower and higher) of coal oxidation: II (1. hierarchical clustering of the original geochemical data was carried out. III (1. The samples II (1. to the thermal and oxidation changes during coal alteration. and IV/1. Moreover. specially in the case of the mixed clusters I+III/1 and I+IV/1. respectively. These samples are considered to be mainly thermally altered. 3. Unauthenticated Download Date | 11/22/15 12:25 PM . 4) was constructed utilizing the Group Average method with the Euclidean distance. 3 Principal components scatter plot of the coal samples.3. the content of Wa is always higher in the subgroups 2 (at higher temperatures). 23 0.065 1.548 4.35 13.73 2.54 74.320 1.050 0.626 1.320 0.31 33.141 1.45 0.52 1.321 1.144 1.98 1.82 20.764 3.088 17.01 2.984 27.86 26.12 74.11 1.99 29.94 57.674 1.742 0.450 7.180 1.071 0.104 1.13 38.17 0.61 67.95 21. their units are the same as in Table 1.93 30.067 2.14 58.22 24.94 0.491 10.549 4.090 0.182 1.05 30.71 28.037 0.57 26.40 21.936 4.67 2.01 33.38 1.787 3.341 0.005 0.552 x-average and s-standard deviation.251 2.13 57.62 9.01 63.282 0.55 2.87 34.68 5.257 1.80 1.73 0.80 20.765 9.40 1.90 3.86 2.429 0.36 1.09 25.007 0. Table 3 Basic statistics of the selected coal groups.502 0.52 12.202 0.187 9.209 0.389 0.11 0.05 4. P.427 11.46 5.31 3.095 0. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 Unauthenticated Download Date | 11/22/15 12:25 PM Wa Ad Vdaf Qdaf HAdaf Ro Cat Hat Nat Oat I x .86 3.33 30.846 15.997 1.444 1.20 34.80 0.98 25.549 2.32 7.31 3.84 0.851 0.262 3.272 0.78 7.746 2.08 4.080 1.715 0.12 0.22 9.11 31.03 83.340 0.96 12.67 32.83 11.19 14.067 3.73 14.641 4.17 7.370 0.844 5.01 5.238 1.70 1.78 0.986 19.185 0.954 0.738 Coal type II/1 II/2 III/1 III/2 IV/1 IV/2 s x s x s x s x s x s x s 1.556 9.76 1.928 2.976 0.40 0.813 3.914 10. 6058 0.5789 0.4816 0.0961 0.0) CC Euclidean Euclidean Euclidean Euclidean Euclidean Euclidean Euclidean Manhattan Manhattan Manhattan Manhattan Manhattan Manhattan Manhattan 1.1959 0.3931 0.P.8905 0.0136 0.2911 0.5551 0.6625 0.2476 0.4905 0.2445 0.7682 0.5295 0.4639 0. 4 Average group dendrogram of the coal samples using the Euclidean distance. Variance 739 Distance Delta(0.6916 0.7109 0.2636 0.8745 1.6819 0.6431 0.3212 1.6229 Table 4 Hierarchical clustering of the standardized geochemical data.6306 0. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 Clustering Method Single Linkage Complete Linkage Group Average Simple Average Centroid Median Ward’s Min.5254 0.9081 1. Fig.2981 0.8526 0.2440 0.5884 0.8672 0.5) Delta(1.6912 0.7252 0.6848 0. Unauthenticated Download Date | 11/22/15 12:25 PM .6760 0.5672 0.2342 0.8983 1.4219 0.69 0. Variance Single Linkage Complete Linkage Group Average Simple Average Centroid Median Ward’s Min. pp. Jessup: “Matrices. pp. The samples were clustered into seven groups in accordance with the intensity of thermal and oxidation changes in coal matter. Bishop.P.pdf [10] Safavi and H.. 1–17. pp.R. Granzow (Eds): A Practical Approach to Microarray Data Analysis. The relations among the geochemical variables were recognised by the components weights plot and also confirmed by Factor Analysis. Abdollahi: “Thermodynamic characterization of weak association equilibria accompanied with spectral overlapping by a SVD-based chemometric method”.740 4 P. 11. (1998). July 15-19. M. Rev. New York. Sn´aˇsel: Latent Semantic Indexing for Image Retrieval Systems.siam. This work was supported by the Ministry of Education.. pp.2 % of the variance. MA. Praks P. 335–362. Vol. Kluwer. (2003). Drmaˇc and E. Norwell. 53. [8] M.. 10. a new concept?”. Anal. W. Acknowledgment Author kindly thanks to Pavel Praks for the programming of SVD in MATLAB and to professor Zdenˇek Klika (both from VSB-Technical University Ostrava) for providing the testing data of altered coals. [5] P.M. Acta Part B. pp. (1995). Neural Comput. 58. [6] E. Vector Spaces. Signal Process. Due to the data noise filtration properties PCA provides the better resolved cluster of the coal samples.W. 1001–1007.M. [2] H. Wall. 2nd ed.. Geladi: Chemometrics in spectroscopy. 2003.K. 36. (1998).I. 287–314. Classical chemometrics“.org/meetings/la03/proceedings/Dvorsky. (2001). 185. SIAM Conference on Applied Algebra. Malinowski: Factor Analysis in Chemistry. John Wiley & Sons. Dvorsk´y and V. Neural Comput. [4] P. and Information Retrieval”. ” Spectrochim. Geladi and B. pp. Attias: “Independent factor analysis”. Williamsburg. 767–782. pp. [7] M. Berry. Vol.R. Vol. (1994). References [1] P. Vol. Kowalski: “Partial least square regression: A tutorial”. The PCA clustering was compared with the Average Group dendrogram of the same data. Svens´en and C. Talanta. 803– 851. Dubitzky and M. Part 1. 215–234. Acta.. [9] P. Chim. Rechtsteiner and L. In: D.E. Berrar. (1986). Siam. A. Comon: “Independent Component Analysis. J.R. http://www. 1991. 41. Youth and Sport of the Czech Republic (MSM 6198910016). Vol. 2003. Rocha: “Singular value decomposition and principal component analysis“. Williams: GTM: “The generative topographic mapping”.M. Vol. Vol.. Unauthenticated Download Date | 11/22/15 12:25 PM . [3] C. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 Conclusion The SVD-based Principal Component Analysis of the standardized data has revealed the three principal components which explain 85. Z. pp. Klika. Praus / Central European Journal of Chemistry 3(4) 2005 731–741 741 [11] Z. Klika and J. Z. I. pp. 217–235. Havel: “Humic acids from oxidized coals. Kurkov´a. Int. Geology. (1993). 1237–1245. Klikov´a and J. New York. heavy metals in HA samples.K. John Wiley & Sons. Chichester. Coal.): Encyclopedia of Analytical Chemistry. [14] M. Chemosphere. titration curves. Vol. (2004). Meyers (Ed. John Wiley & Sons. nuclear magnetic resonance spectra of HA and infrared spectroscopy”. [13] P. Elemental composition. 22. In: R. [12] B. 1976. 54. Ch. 2000. Mather: Computational Methods of Multivariate Analysis in Physical Geography. Kraussov´a: “Properties or Altered Coals Associated with Carboniferous Red Beds in the Upper Silesian Coal Basin and their Tentative Classification”. Vol. Lavine: “Clustering and Classification of Analytical Data“.P. J. Unauthenticated Download Date | 11/22/15 12:25 PM .A.
Copyright © 2024 DOKUMEN.SITE Inc.