Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
A fuzzy c-means clustering algorithm is presented which is much faster than the traditional algorithm for data sets in which the number of features is significantly larger than the number of feature vectors. The algorithm is constructed by utilizing the covariance structure of feature vectors and cluster centers. By using results from a previous clustering, modified versions of the new algorithm achieve additional reductions in floating point operations. © 1995 by John Wiley & Sons, Inc.
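For orientation, the traditional fuzzy c-means iteration that this abstract uses as its baseline can be sketched as follows. This is a minimal textbook version, not the accelerated covariance-based algorithm of the paper. The function name `fuzzy_c_means`, the fuzzifier `m = 2`, the random initialisation and the stopping tolerance are all assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means (hypothetical helper); returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Centers: membership-weighted means with weights u_ik ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / wsum
                            for d in range(dim)])
        # Memberships: u_ik proportional to d_ik ** (-2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u
```

Each pass costs work proportional to the number of points times the number of clusters times the number of features, which is the cost the abstract's method aims to reduce when features greatly outnumber feature vectors.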

2.
Principal component analysis (PCA) is a favorite tool in chemometrics for data compression and information extraction. PCA finds linear combinations of the original measurement variables that describe the significant variations in the data. However, it is well-known that PCA, as with any other multivariate statistical method, is sensitive to outliers, missing data, and poor linear correlation between variables due to poorly distributed variables. As a result, data transformations have a large impact upon PCA. In this regard, one of the most powerful approaches to improve PCA appears to be the fuzzification of the matrix data, thus diminishing the influence of outliers. In this paper we discuss a robust fuzzy PCA algorithm (FPCA). The new algorithm is illustrated on a data set concerning interaction of carbon-hydrogen bonds with transition metal-oxo bonds in molybdenum complexes. Considering, for example, a two component model, FPCA accounts for 97.20% of the total variance and PCA accounts only for 69.75%.
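The percentages quoted for a two component model are cumulative explained-variance fractions. A small sketch of that arithmetic is given below, assuming the eigenvalues of the relevant covariance matrix are already available. The function name and the eigenvalues in the usage note are hypothetical and are not taken from the paper.

```python
def explained_variance_fraction(eigenvalues, n_components):
    """Fraction of total variance captured by the n_components largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    # Variance along each principal component equals its eigenvalue.
    return sum(ordered[:n_components]) / total
```

With the illustrative eigenvalues `[5.0, 3.0, 1.5, 0.5]`, a two component model retains 8.0 of a total variance of 10.0, that is 80%.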

3.
Pharmacophore modeling of large, drug-like molecules, such as the dopamine reuptake inhibitor GBR 12909, is complicated by their flexibility. A comprehensive hierarchical clustering study of two GBR 12909 analogs was performed to identify representative conformers for input to three-dimensional quantitative structure–activity relationship studies of closely-related analogs. Two data sets of more than 700 conformers each produced by random search conformational analysis of a piperazine and a piperidine GBR 12909 analog were studied. Several clustering studies were carried out based on different feature sets that include the important pharmacophore elements. The distance maps, the plot of the effective number of clusters versus actual number of clusters, and the novel derived clustering statistic, percentage change in the effective number of clusters, were shown to be useful in determining the appropriate clustering level. Six clusters were chosen for each analog, each representing a different region of the torsional angle space that determines the relative orientation of the pharmacophore elements. Conformers of each cluster that are representative of these regions were identified and compared for each analog. This study illustrates the utility of using hierarchical clustering for the classification of conformers of highly flexible molecules in terms of the three-dimensional spatial orientation of key pharmacophore elements.

4.
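A novel factor analysis algorithm based on the immune algorithm   Cited by: 3 in total (self-citations: 0, citations by others: 3)
Based on the basic idea of the immune algorithm, a new immune principal component analysis method (IPCA) is proposed. The method applies the operation by which antibodies eliminate antigens in the immune algorithm to the orthogonal decomposition of a two-dimensional data matrix, yielding the eigenvalues and eigenvectors of the matrix. The results show that IPCA and conventional principal component analysis give essentially the same results for simulated HPLC-DAD signals. Resolution of experimental HPLC-DAD signals shows that combining IPCA with window factor analysis (WFA) gives stronger resolving power than conventional WFA.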

5.
Determining the rank of a chemical matrix is the first step in many multivariate, chemometric studies. Rank is defined as the minimum number of linearly independent factors after deletion of factors that contribute to random, nonlinear, uncorrelated errors. Adding a matrix of rank 1 to a data matrix not only increases the rank by one unit but also perturbs the primary factor axes, having little effect on the secondary axes associated with the random errors in the measurements. The primary rank of a data matrix can be determined by comparing the residual variances obtained from principal component analysis (PCA) of the original data matrix to those obtained from an augmented matrix. The ratio of the residual variances between adjacent factor levels represents a Fisher ratio that can be used to distinguish the primary factors (chemical as well as instrumental factors) from the secondary factors (experimental errors). The results gleaned from model studies as well as those from experimental studies are used to illustrate the efficacy of the proposed methodology. The method is independent of the nature of the error distribution. Limitations and precautions are discussed. An algorithm, written in MATLAB format, is included. Copyright © 2011 John Wiley & Sons, Ltd.

6.
Sârbu C, Pop HF. Talanta, 2005, 65(5): 1215-1220
Principal component analysis (PCA) is a favorite tool in environmetrics for data compression and information extraction. PCA finds linear combinations of the original measurement variables that describe the significant variations in the data. However, it is well-known that PCA, as with any other multivariate statistical method, is sensitive to outliers, missing data, and poor linear correlation between variables due to poorly distributed variables. As a result, data transformations have a large impact upon PCA. In this regard, one of the most powerful approaches to improve PCA appears to be the fuzzification of the matrix data, thus diminishing the influence of the outliers. In this paper we discuss and apply a robust fuzzy PCA algorithm (FPCA). The efficiency of the new algorithm is illustrated on a data set concerning the water quality of the Danube River for a period of 11 consecutive years. Considering, for example, a two component model, FPCA accounts for 91.7% of the total variance and PCA accounts only for 39.8%. Moreover, PCA showed only a partial separation of the variables and no separation of scores (samples) on the plane described by the first two principal components, whereas a much sharper differentiation of the variables and scores is observed when FPCA is applied.

7.
8.
In environmental chemistry studies, it may be necessary to analyze data sets constituted by different blocks of variables, possibly of different types, measured on the same samples. Multiple factor analysis (MFA) is presented as a tool for exploring such data. The most important features of MFA are shown on a real environmental data set, consisting of two blocks of data, namely heavy metals and polycyclic aromatic hydrocarbons, measured for sediment samples. They are discussed and compared to principal component analysis (PCA). The usefulness of the weighting scheme used in MFA as a preprocessing step for other chemometric methods, such as clustering, is also highlighted.

9.
Thermally driven materials characterized by complex energy landscapes, such as proteins, exhibit motions on a broad range of space and time scales. Principal component analysis (PCA) is often used to extract modes of motion from protein trajectory data that correspond to coherent, functional motions. In this work, two other methods, maximum covariance analysis (MCA) and canonical correlation analysis (CCA), are formulated in a way appropriate to analyze protein trajectory data. Both methods partition the coordinates used to describe the system into two sets (two measurement domains) and inquire as to the correlations that may exist between them. MCA and CCA provide rotations of the original coordinate system that successively maximize the covariance (MCA) or correlation (CCA) between modes of each measurement domain under suitable constraint conditions. We provide a common framework based on the singular value decomposition of appropriate matrices to derive MCA and CCA. The differences between MCA and CCA, and their respective strengths and weaknesses, are discussed and illustrated. The application presented here examines the correlation between the backbone and side chain of the peptide met-enkephalin as it fluctuates between open conformations, found in solution, and closed conformations appropriate to when it is bound to its receptor. Difficulties with PCA carried out in Cartesian coordinates are found and motivate a formulation in terms of dihedral angles for the backbone atoms and selected atom distances for the side chains. These internal coordinates are a more reliable basis for all the methods explored here. MCA uncovers a correlation between combinations of several backbone dihedral angles and selected side chain atom distances of met-enkephalin. It could be used to suggest residues and dihedral angles to focus on to favor specific side chain conformers. These methods could be applied to proteins with domains that, when they rearrange upon ligand binding, may have correlated functional motions or, for multi-subunit proteins, may exhibit correlated subunit motions.
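As an illustration of the MCA idea described above, the leading pair of modes can be read off the cross-covariance matrix between the two coordinate sets. The sketch below is a simplified stand-in rather than the authors' formulation: it uses alternating power iteration instead of a full singular value decomposition so that it needs only the standard library, it returns only the first mode pair, and the function name, the all-ones starting vector and the iteration count are assumptions.

```python
import math


def leading_mca_mode(x_rows, y_rows, n_iter=200):
    """Leading MCA pair: unit vectors (a, b) maximising cov(X a, Y b)."""
    n, p, q = len(x_rows), len(x_rows[0]), len(y_rows[0])
    x_mean = [sum(r[j] for r in x_rows) / n for j in range(p)]
    y_mean = [sum(r[j] for r in y_rows) / n for j in range(q)]
    # Cross-covariance matrix C (p x q) between the two measurement domains.
    c = [[sum((x_rows[t][i] - x_mean[i]) * (y_rows[t][j] - y_mean[j])
              for t in range(n)) / (n - 1) for j in range(q)] for i in range(p)]

    def normalise(v):
        norm = math.sqrt(sum(e * e for e in v))
        return [e / norm for e in v]

    # Alternating power iteration towards the leading singular pair of C.
    # Assumes the all-ones start is not orthogonal to the leading mode.
    b = normalise([1.0] * q)
    for _ in range(n_iter):
        a = normalise([sum(c[i][j] * b[j] for j in range(q)) for i in range(p)])
        b = normalise([sum(c[i][j] * a[i] for i in range(p)) for j in range(q)])
    covariance = sum(a[i] * c[i][j] * b[j] for i in range(p) for j in range(q))
    return a, b, covariance
```

In practice one would call an SVD routine from a numerical library on the same cross-covariance matrix to obtain all mode pairs at once. CCA differs in that each domain is additionally whitened by its own covariance matrix, so that correlation rather than covariance is maximised.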

10.
We describe a method of performing trilinear analysis on large data sets using a modification of the PARAFAC‐ALS algorithm. Our method iteratively decomposes the data matrix into a core matrix and three loading matrices based on the Tucker1 model. The algorithm is particularly useful for data sets that are too large to upload into a computer's main memory. While the performance advantage in utilizing our algorithm is dependent on the number of data elements and dimensions of the data array, we have seen a significant performance improvement over operating PARAFAC‐ALS on the full data set. In one case of data comprising hyperspectral images from a confocal microscope, our method of analysis was approximately 60 times faster than operating on the full data set, while obtaining essentially equivalent results. Copyright © 2008 by John Wiley & Sons, Ltd.

11.
Ya Xiong Zhang. Talanta, 2007, 73(1): 68-75
Two clinical data sets were used for pattern recognition in order to discover the correlation between urinary nucleoside profiles and tumours. The first data set contains 168 clinical urinary samples, of which 84 specimens are from female thyroid cancer patients (malignant tumour group) and the remaining 84 were collected from healthy women (normal group). The second data set also comprises 168 clinical urinary samples: 60 from uterine cervical cancer patients (malignant tumour group), 60 from healthy females (normal group), and 48 from uterine myoma patients (benign tumour group). For the two data sets, the separation and quantitative determination of the clinical urinary nucleosides were performed by capillary electrophoresis (CE). The pattern recognition was achieved by applying multiple layer perceptron artificial neural networks (MLP ANN) based on a conjugate gradient descent training algorithm. Moreover, when the proposed principal component analysis (PCA) input selection scheme was applied to the MLP ANN, the accuracy of the pattern recognition improved to some extent (or at least did not deteriorate), even with a much simpler MLP ANN structure. The study showed that MLP ANN based on PCA input selection was a promising tool for pattern recognition.

12.
ChemCam is a remote laser-induced breakdown spectroscopy (LIBS) instrument that will arrive on Mars in 2012, on-board the Mars Science Laboratory Rover. The LIBS technique is crucial to accurately identify samples and quantify elemental abundances at various distances from the rover. In this study, we compare different linear and nonlinear multivariate techniques to visualize and discriminate clusters in two dimensions (2D) from the data obtained with ChemCam. We have used principal components analysis (PCA) and independent components analysis (ICA) for the linear tools and compared them with the nonlinear Sammon’s map projection technique. We demonstrate that the Sammon’s map gives the best 2D representation of the data set, with optimization values from 2.8% to 4.3% (0% is a perfect representation), together with an entropy value of 0.81 for the purity of the clustering analysis. The linear 2D projections result in three (ICA) and five (PCA) times more stress, and their clustering is less pure, with entropy values of about 1.8, more than twice that of the Sammon’s map. We show that the Sammon’s map algorithm is faster and gives a slightly better representation of the data set if the initial conditions are taken from the ICA projection rather than the PCA projection. We conclude that the nonlinear Sammon’s map projection is the best technique for combining data visualization and clustering assessment of the ChemCam LIBS data in 2D. PCA and ICA projections on more dimensions would improve on these numbers at the cost of the intuitive interpretation of the 2D projection by a human operator.
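The "stress" figures quoted above measure how well inter-point distances survive the projection, with 0 meaning a perfect representation. A minimal sketch of the usual Sammon stress definition follows, assuming Euclidean distances in both spaces. Whether the paper's optimization values use exactly this normalisation is not stated in the abstract, and the function name is hypothetical.

```python
import math
from itertools import combinations


def sammon_stress(high_dim, low_dim):
    """Sammon stress between original points and their low-dimensional images."""
    pairs = list(combinations(range(len(high_dim)), 2))
    d_high = [math.dist(high_dim[i], high_dim[j]) for i, j in pairs]
    d_low = [math.dist(low_dim[i], low_dim[j]) for i, j in pairs]
    if any(d == 0 for d in d_high):
        raise ValueError("duplicate points in the original space")
    # Squared distance errors are weighted by 1 / original distance,
    # so small distances (local structure) count more heavily.
    scale = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / scale
```

For example, projecting the two points (0, 0) and (3, 4), which are 5 apart, onto a line where they end up 4 apart gives a stress of (5 - 4)^2 / 5 / 5 = 0.04, that is 4%.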

13.
Partial least squares (PLS) regression is a linear regression technique developed to relate many regressors to one or several response variables. Robust methods are introduced to reduce or remove the effect of outlying data points. In this paper, we show that if the sample covariance matrix is properly robustified, further robustification of the linear regression steps of the PLS algorithm becomes unnecessary. The robust estimate of the covariance matrix is computed by searching for outliers in univariate projections of the data on a combination of random directions (Stahel-Donoho) and specific directions obtained by maximizing and minimizing the kurtosis coefficient of the projected data, as proposed by Peña and Prieto [1]. It is shown that this procedure is fast to apply and provides better results than other methods proposed in the literature. Its performance is illustrated by Monte Carlo and by an example, where the algorithm is able to show features of the data which were undetected by previous methods. Copyright © 2008 John Wiley & Sons, Ltd.

14.
15.
Motivation: Microarrays have allowed the expression level of thousands of genes or proteins to be measured simultaneously. Data sets generated by these arrays consist of a small number of observations (e.g., 20-100 samples) on a very large number of variables (e.g., 10,000 genes or proteins). The observations in these data sets often have other attributes associated with them such as a class label denoting the pathology of the subject. Finding the genes or proteins that are correlated to these attributes is often a difficult task since most of the variables do not contain information about the pathology and as such can mask the identity of the relevant features. We describe a genetic algorithm (GA) that employs both supervised and unsupervised learning to mine gene expression and proteomic data. The pattern recognition GA selects features that increase clustering, while simultaneously searching for features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. Because the largest principal components capture the bulk of the variance in the data, the features chosen by the GA contain information primarily about differences between classes in the data set. The principal component analysis routine embedded in the fitness function of the GA acts as an information filter, significantly reducing the size of the search space since it restricts the search to feature sets whose principal component plots show clustering on the basis of class. The algorithm integrates aspects of artificial intelligence and evolutionary computations to yield a smart one pass procedure for feature selection, clustering, classification, and prediction.

16.
This paper presents a new multiblock analysis method called OnPLS, a general extension of O2PLS to the multiblock case. The proposed method is equivalent to O2PLS in cases involving only two matrices, but generalises to cases involving more than two matrices without giving preference to any particular matrix: the method is fully symmetric. OnPLS extracts a minimal number of globally predictive components that exhibit maximal covariance and correlation. Furthermore, the method can be used to study orthogonal variation, i.e. local phenomena captured in the data that are specific to individual combinations of matrices or to individual matrices. The method's utility was demonstrated by its application to three synthetic data sets. It was shown that OnPLS affords a reduced number of globally predictive components and increased intercorrelations of scores, and that it greatly facilitates interpretation of the predictive model. Copyright © 2011 John Wiley & Sons, Ltd.

17.
18.
Producing good low‐dimensional representations of high‐dimensional data is a common and important task in many data mining applications. Two methods that have been particularly useful in this regard are multidimensional scaling and nonlinear mapping. These methods attempt to visualize a set of objects described by means of a dissimilarity or distance matrix on a low‐dimensional display plane in a way that preserves the proximities of the objects to whatever extent is possible. Unfortunately, most known algorithms are of quadratic order, and their use has been limited to relatively small data sets. We recently demonstrated that nonlinear maps derived from a small random sample of a large data set exhibit the same structure and characteristics as that of the entire collection, and that this structure can be easily extracted by a neural network, making possible the scaling of data sets orders of magnitude larger than those accessible with conventional methodologies. Here, we present a variant of this algorithm based on local learning. The method employs a fuzzy clustering methodology to partition the data space into a set of Voronoi polyhedra, and uses a separate neural network to perform the nonlinear mapping within each cell. We find that this local approach offers a number of advantages, and produces maps that are virtually indistinguishable from those derived with conventional algorithms. These advantages are discussed using examples from the fields of combinatorial chemistry and optical character recognition. © 2001 John Wiley & Sons, Inc. J Comput Chem 22: 373–386, 2001

19.
Multi-wavelength fingerprints of Cassia seed, a traditional Chinese medicine (TCM), were collected by high-performance liquid chromatography (HPLC) at two wavelengths with the use of diode array detection. The two data sets of chromatograms were combined by the data fusion-based method. This data set of fingerprints was compared separately with the two data sets collected at each of the two wavelengths. It was demonstrated with the use of principal component analysis (PCA), that multi-wavelength fingerprints provided a much improved representation of the differences in the samples. Thereafter, the multi-wavelength fingerprint data set was submitted for classification to a suite of chemometrics methods viz. fuzzy clustering (FC), SIMCA and the rank ordering MCDM PROMETHEE and GAIA. Each method highlighted different properties of the data matrix according to the fingerprints from different types of Cassia seeds. In general, the PROMETHEE and GAIA MCDM methods provided the most comprehensive information for matching and discrimination of the fingerprints, and appeared to be best suited for quality assurance purposes for these and similar types of sample.

20.
The attractor for the dynamics of a complex system can be constructed from the time series measurement of a single variable. A recently proposed procedure is to construct a covariance matrix using an embedding window on the time series. An analysis of the meaning of the eigenvalues of the covariance matrix of the time series is undertaken here. It is argued that each principal eigenvalue can be decomposed into components which describe the time evolution of the correlations of the system along the given principal direction. A one-dimensional iterative map of these components can be constructed in correlation space. Such a map displays the regular or chaotic nature of the dynamics for each principal direction of the attractor. Illustrative examples of such maps are constructed for regular and random time series and for the Lorenz attractor.
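The covariance matrix built from an embedding window can be sketched as follows: slide a window of fixed length along the scalar series, treat each window as a delay vector, and compute the covariance of those vectors. This is a generic construction written for illustration. The function name, the mean-centring and the n - 1 normalisation are assumptions, since the abstract does not state which conventions the paper uses.

```python
def embedding_covariance(series, window):
    """Covariance matrix of the delay vectors (x_t, ..., x_{t+window-1})."""
    n_vectors = len(series) - window + 1
    if n_vectors < 2:
        raise ValueError("series too short for this window")
    # Each row is one delay vector cut from the scalar time series.
    rows = [series[t:t + window] for t in range(n_vectors)]
    means = [sum(r[j] for r in rows) / n_vectors for j in range(window)]
    # Sample covariance between window positions a and b.
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows)
             / (n_vectors - 1) for b in range(window)] for a in range(window)]
```

The eigenvalues and eigenvectors of the returned matrix, obtained with any symmetric eigensolver, are the principal eigenvalues and principal directions that the abstract goes on to analyse.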
