Similar articles
 Found 20 similar articles; search took 31 ms
1.
《Analytica chimica acta》2004,515(1):87-100
The goal of the present work is to analyse the effect of non-informative variables (NIV) in a data set on cluster analysis and to propose a computationally feasible method for detecting and removing these variables. The proposed method uses a genetic algorithm to select the variables that matter for making the presence of groups in the data clear. The procedure has been implemented for use with k-means, with the cluster silhouettes as the fitness function of the genetic algorithm. The main problem that can appear when applying the method to real data is that, in general, the real cluster structure (number and composition of the groups) is not known a priori. The work explores the evolution of the silhouette values computed from the clusters built by k-means when non-informative variables are added to the original data set, both for a literature data set and for simulated data of higher dimension. The procedure has also been applied to real data sets.
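
The idea of scoring a candidate variable subset by its cluster silhouettes can be sketched as follows. This is only an illustration, not the authors' implementation: the cluster labels are assumed to be given (the paper obtains them from k-means), the genetic algorithm itself is omitted, and the names `mean_silhouette` and `subset_fitness` are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from point i to the members of each cluster,
        # leaving point i itself out of its own cluster.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                dists = [math.dist(points[i], points[j]) for j in members]
                mean_dist[c] = sum(dists) / len(members)
        a = mean_dist.get(labels[i])
        others = [d for c, d in mean_dist.items() if c != labels[i]]
        if a is None or not others:
            continue  # singleton cluster or single cluster: counted as 0
        b = min(others)
        denom = max(a, b)
        if denom == 0:
            continue  # all relevant points coincide: counted as 0
        total += (b - a) / denom
    return total / n


def subset_fitness(data, labels, mask):
    """Silhouette of the data restricted to the variables kept by `mask`.

    `mask` is a list of booleans, one per variable (hypothetical encoding
    of a genetic-algorithm chromosome).
    """
    reduced = [[x for x, keep in zip(row, mask) if keep] for row in data]
    return mean_silhouette(reduced, labels)
```

In this sketch, a data set whose second variable is pure noise should score higher once that variable is masked out, which is the behaviour a variable-selecting search would exploit.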

2.
Airborne particulate matter is an important component of atmospheric pollution, affecting human health, climate, and visibility. Modern instruments allow single particles to be analyzed one by one in real time, and offer the promise of determining the sources of individual particles from their mass spectral signatures. The large number of particles to be apportioned makes clustering a necessary step. The goal of this study is to compare, using mass spectral data, the accuracy and speed of several clustering algorithms: ART-2a, several variants of hierarchical clustering, and K-means. Repeated simulations with various algorithms and different levels of data preprocessing suggest that hierarchical clustering methods using derivatives of Ward's algorithm discriminate sources with fewer errors than ART-2a, which itself discriminates much better than point-wise hierarchical clustering methods. In most cases, K-means algorithms do almost as well as the best hierarchical clustering. These efficient algorithms (clustering derived from Ward's algorithm, ART-2a, and K-means) are most accurate when the relative peak areas have been pre-scaled by taking the square root. Analysis times vary within a factor of 30, and when accuracy above 95% is required, run times scale as the square of the number of particles. Algorithms derived from Ward's remain the most accurate under a wide range of conditions and, conversely, for equal accuracy can deliver a shorter list of clusters, allowing faster and perhaps on-the-fly classification.
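
The square-root pre-scaling of relative peak areas mentioned above might look like the following sketch. It assumes that "relative" means each peak area divided by the sum over the spectrum, which the abstract does not spell out; the function name `sqrt_scale` is hypothetical.

```python
import math


def sqrt_scale(peak_areas):
    """Convert peak areas to relative areas, then take the square root."""
    total = sum(peak_areas)
    if total <= 0:
        raise ValueError("spectrum has no positive signal")
    # Assumption: relative area = area / total area of the spectrum.
    return [math.sqrt(area / total) for area in peak_areas]
```

The square root compresses the dynamic range, so a few dominant peaks weigh less heavily in the distance computations that the clustering algorithms rely on.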

3.
A detailed comparison of six multivariate algorithms for analyzing and generating Raman microscopic images that consist of a large number of individual spectra is presented. These include the segmentation algorithms hierarchical cluster analysis, fuzzy C-means cluster analysis, and k-means cluster analysis, and the spectral unmixing techniques principal component analysis and vertex component analysis (VCA). All algorithms are reviewed and compared. Furthermore, comparisons are made to the new approach N-FINDR. In contrast to the related VCA approach, the implementation of N-FINDR used here searches for the original input spectrum in the non-dimension-reduced input matrix and sets it as the endmember signature. The algorithms were applied to hyperspectral data from a Raman image of a single cell. This data set was acquired by collecting individual spectra in a raster pattern with a 0.5-μm step size on a commercial Raman microspectrometer. The results were also compared with a fluorescence staining of the cell, including its mitochondrial distribution. The ability of each algorithm to extract chemical and spatial information on subcellular components of the cell is discussed, together with advantages and disadvantages.

4.
This study describes the analysis, by gas chromatography-mass spectrometry, of total hop essential oils from 18 cultivated varieties of hops, five of which were bred in Lithuania, and 7 wild hop forms. The study sought to organise the hop samples into clusters according to 72 semi-volatile compounds by applying a well-known method, k-means clustering analysis, and to identify the origin of the Lithuanian hop varieties. The bouquet of the hop essential oil was composed of various esters, terpenes, hydrocarbons and ketones. Monoterpenes (mainly β-myrcene), sesquiterpenes (dominated by β-caryophyllene and α-humulene) and oxygenated sesquiterpenes (mainly caryophyllene oxide and humulene epoxide II) were the main compound groups detected in the samples tested. The above compounds, together with α-muurolene, were the only compounds found in all the samples. Qualitative and quantitative differences were observed in the composition of the essential oils of the hop varieties analysed. Successful and statistically significant clustering of the data obtained requires expertise and skill in chemometric analysis methods. The result is also highly dependent on the set of samples used for segmentation into groups (its representativeness), the technique for pre-processing the data, the method selected for partitioning the samples according to the chosen similarity measures, etc. To obtain a large and representative data set for clustering analysis from a small number of measurements, numerical simulation was applied using the Monte Carlo method with normal and uniform distributions and several relative standard deviation values. The grouping was performed using the k-means clustering method, employing several techniques for evaluating the optimal number of clusters (Davies-Bouldin index, distortion function, etc.) and different data pre-processing approaches. The hop samples analysed were separated into 3 or 5 clusters, depending on the data filtering scenario used. However, the targeted Lithuanian hop varieties were clustered identically in both cases and fell into the same group as other cultivated hop varieties from Ukraine and Poland.

5.
This paper proposes a new method for determining the subset of variables that reproduces as well as possible the main structural features of the complete data set. The method can be useful for pre-treatment of large data sets, since it allows variables that contain redundant information to be discarded. Reducing the number of variables often allows one to better investigate data structure and obtain more stable results from multivariate modelling methods. The novel method is based on the recently proposed canonical measure of correlation (CMC index) between two sets of variables [R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications, Anal. Chim. Acta submitted for publication (2009)]. Following a stepwise procedure (backward elimination), each variable in turn is compared to all the other variables and the most correlated one is definitively discarded. Finally, a key subset of variables that are as orthogonal as possible is selected. The performance was evaluated on both simulated and real data sets. The effectiveness of the novel method is discussed by comparison with the results of other well-known methods for variable reduction, such as Jolliffe techniques, McCabe criteria, the Krzanowski approach and its modification based on genetic algorithms, loadings of the first principal component, Key Set Factor Analysis (KSFA), Variance Inflation Factor (VIF), the pairwise correlation approach, and K correlation analysis (KIF). The results obtained are consistent with those of the other methods considered; moreover, the proposed CMC method has the advantage that the calculation is very quick and can easily be implemented in any software application.

6.
A radial basis function (RBF) network improved by forward selection was applied to bacterial classification based on data obtained by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS). The classification of each bacterium cultured for different times is discussed and the effect of the parameters of the RBF network is investigated. The new method uses forward selection to prevent overfitting, with generalized cross-validation (GCV) as the model selection criterion (MSC). The original data were compressed by wavelet transformation to speed up network training and reduce the number of variables of the original MS data. The data were normalized prior to training and testing a network in order to define the region in which the neural network is trained, accelerate training, and narrow the range from which the parameters are selected. The one-out-of-n method was used to split the data set of p samples into a training set of size p−1 and a test set of size 1. With the improved method, the classification accuracies for the five bacteria discussed in the paper are 87.5, 69.2, 80, 92.3, and 92.8%, respectively.
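
The one-out-of-n split described above, in which each of the p samples in turn forms the test set while the remaining p−1 form the training set, can be sketched as below. The function name `leave_one_out` is hypothetical, and the sketch covers only the splitting, not the RBF network.

```python
def leave_one_out(samples):
    """Yield (training_set, test_sample) pairs, one pair per sample."""
    for i in range(len(samples)):
        # Training set: every sample except the i-th (size p - 1).
        training = samples[:i] + samples[i + 1:]
        yield training, samples[i]
```

Each sample is used for testing exactly once, so p samples give p train/test rounds.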

7.
The potential of the redox couple FeIV=O / FeIII–O is of interest for the reactivity of high-valent nonheme iron oxidants in enzymes and bioinspired small-molecule systems but, unfortunately, it has so far been very poorly characterized experimentally. Three computational methods are discussed that are used in combination with available experimental data derived from titrations of FeIV=O species with ferrocene derivatives in dry acetonitrile, and from spectroelectrochemical titrations of FeIII–OH complexes in wet acetonitrile, i.e. describing the FeIV=O / FeIII–OH couple; both data sets are known to have some ambiguities. First, a DFT-based method is used to compute the values of 14 FeIV=O / FeIII–O couples with an error margin of around 110 mV. A subset of four species of the original data set is used to evaluate a DLPNO-CCSD(T)-based approach, and another subset of complexes, for which the spectroelectrochemically determined FeIV=O / FeIII–OH potentials are also known, is used for a Bordwell-Polanyi analysis, which also yields pKa values. The three approaches are shown to lead to a consistent picture but, owing to possible ambiguities in the experimental data, it is currently not possible to fully evaluate the accuracy of the approaches used.

8.
The use of rank annihilation factor analysis (RAFA) for spectrophotometric studies of complex formation equilibria is proposed. One-step complex formation systems and two-step successive, mononuclear complex formation systems were studied successfully by the proposed method. When the complex stability constant acts as the optimization target and is simply combined with the pure spectrum of the ligand, the rank of the original data matrix can be reduced by one by annihilating the information of the ligand from the original data matrix. The residual standard deviation (R.S.D.) of the residual matrix after bilinearization of the background matrix is taken as the evaluation function. The performance of the method has been evaluated using synthetic data. For two-step successive complex formation systems, the effects of the noise level and of the equilibrium constants K1 and K2 on the output of the algorithm are investigated. The applicability of the method to resolving two-step successive complex formation systems with full spectral overlap of the two complex species is also shown. Spectrophotometric studies of murexide-calcium, dithizone-nickel and methyl thymol blue (MTB)-copper are used as experimental model systems with different complexation stoichiometries and spectral overlap of the components involved.

9.
We present a ‘partial decoupling’ scheme by which the collinear collision of two diatomics is reduced to an atom-diatomic collinear collision with new definitions for the parameters E, m, and a. The pseudo atom-diatomic problem is studied using the adiabatic approximation of Thiele and Katz and the first-order distorted wave T- and K-matrix methods. We compare our results to the numerical results of Riley and Kuppermann and conclude that their results are reliable only for the transitions 00 → 01 and 01 → 10. We prepare a set of ‘corrected’ adiabatic transition probabilities and compare the approximate T- and K-matrix results to this set. The accuracies of these approximate methods are then seen to be consistent with earlier conclusions concerning the application of the first-order distorted wave approximation to atom-diatomic collinear collisions.

10.
Fraga CG, Farmer OT, Carman AJ. 《Talanta》2011,83(4):1166-1172
Potassium cyanide was used as a model toxicant to determine the feasibility of using anionic impurities as a forensic signature for matching cyanide salts back to their source. In this study, portions of eight KCN stocks originating from four countries were separately dissolved in water and analyzed by high performance ion chromatography (HPIC) using an anion exchange column and conductivity detection. Sixty KCN aqueous samples were produced from the eight stocks and analyzed for 11 anionic impurities. Hierarchical cluster analysis and principal component analysis were used to demonstrate that KCN samples cluster according to source based on the concentrations of their anionic impurities. The Fisher-ratio method and degree-of-class separation (DCS) were used for feature selection on a training set of KCN samples in order to optimize sample clustering. The optimal subset of anions needed for sample classification was determined to be sulfate, oxalate, phosphate, and an unknown anion named unk5. Using K-nearest neighbors (KNN) and the optimal subset of anions, KCN test samples from different KCN stocks were correctly determined to be manufactured in the United States. In addition, KCN samples from stocks manufactured in Belgium, Germany, and the Czech Republic were all correctly matched back to their original stocks because each stock had a unique anionic impurity profile. The application of the Fisher-ratio method and DCS for feature selection improved the accuracy and confidence of sample classification by KNN.

11.
12.
Relativistically parameterized extended Hückel (REX) calculations of spin-spin coupling tensors 1J(MTe) for cluster models of CdTe, HgTe and PbTe are reported. The relativistic equivalent of Ramsey's theory is used. Electronic densities of states are obtained as a by-product. Two assignments of the experimental NMR data are shown to be possible. The REX results support the original assignment by Nolle for PbTe and the alternative ones for CdTe and HgTe. The increase of ΔJ is attributed to relativistic effects arising from the M(ns)-Te(5p1/2) AO combination, recently discussed by Pyykkö and Wiesenfeld. A principle is proposed that both K and ΔK are positive for systems with dominant bonding-to-antibonding excitations, a case likely for half-filled valence shells. Nearly empty or nearly full valence shells with dominant bonding-to-bonding or antibonding-to-antibonding excitations should lead to negative K and ΔK. Double-zeta REX radial parameters are reported for Zn–Hg and O–Po.

13.
Proficiency testing is an external quality control check in which the quality of an analytical result is checked against criteria that are set independently of the laboratory carrying out the analysis. Participants in a proficiency test are encouraged to use the method of their choice to determine the analyte in question. The collated results submitted by the participants are used to derive the best estimate of the ‘true’ level, or assigned value, of the analyte, as a consensus value of the whole data set. Generally, the data submitted will be normally distributed and from a single population, but if a data set is found to be multimodal, one of the modes can be selected as the assigned value where there is supporting data, typically methodology information. Unless there are independent grounds for preferring one mode over another, it is not possible to set an assigned value or calculate z-scores. However, the analysis of allergens has presented proficiency testers with a new challenge, since it has become apparent that quantitative results may depend on the brand of enzyme-linked immunosorbent assay kit used, the specific analyte targeted (e.g. total content or allergen protein content) and the limit of detection achievable. FAPAS® has run more than 40 proficiency tests for allergen analysis over the past 7 years, during which time methods have been developed and improved and the requirements for determination of food ingredient allergens have increased. Two case studies are presented which highlight some of the issues around the use of allergen measurement methods.
Figure: A selection of food items which might cause allergenic or intolerance reactions
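
The z-scores mentioned above are conventionally computed as the deviation of a participant's result from the assigned value, divided by a target standard deviation. The sketch below uses the median as the consensus value and takes the target standard deviation as an input; both choices are assumptions made for illustration, since the abstract does not describe how FAPAS® derives either quantity.

```python
from statistics import median


def z_scores(results, sigma_p):
    """Return one z-score per submitted result.

    `results` are the participants' reported values; `sigma_p` is the
    target standard deviation for proficiency (supplied by the caller).
    """
    # Assumption: the median stands in for the consensus assigned value.
    assigned = median(results)
    return [(x - assigned) / sigma_p for x in results]
```

With a multimodal data set, such as results grouped by assay kit brand, a single consensus value of this kind is not meaningful, which is the difficulty the abstract describes.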

14.
15.
16.
In multivariate regression, it is often reported that wavelength selection can improve results. Improvement is often based solely on bias measures such as the root mean square error of calibration (RMSEC) and root mean square error of validation (RMSEV), R² for the calibration and validation, etc. Recent studies have shown that when variance measures are included, Pareto optimal models can be determined. However, the variance measures used to date do not make it possible to choose wavelength subset models over full wavelength models when wavelength subset models may be the Pareto models. In this paper, simplex optimization is used with a more complete variance measure to generate Pareto optimal models. The standard basis set is used, as well as a basis set that includes the range and null space of the calibration spectra. Results show that it is possible to identify Pareto optimal models and that, if a wavelength subset is best, these are the models found. Regression coefficients for non-essential wavelengths are zero or near zero.

17.
This paper describes the general design and application of CerBeruS, a computer-based system for supporting the process of sequential screening. CerBeruS stands for cluster-based selection, with cluster analysis forming the pivotal part of the system. CerBeruS uses Ward's clustering method to partition the data set to be screened into smaller, more homogeneous subsets. One representative is picked from each subset and suggested as a screening candidate. Although the number of compounds submitted to screening is most often driven by the capacity of the assay, CerBeruS provides a statistical measure that computes the optimal number of clusters in the data set. This measure forms a point of reference for all screening experiments. Different hierarchies of subsets are stored in an Oracle database. Information about the size and content of a cluster can be retrieved from this database via a Visual Basic application. How these components work together in the CerBeruS system is demonstrated on a large data set. In addition, we show that, using the statistical measure, one can find an optimal trade-off between screening effort and number of hits.

18.
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and eventually more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper, a novel method of variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables by their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks’ Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. Finally, a variable Forward Selection based on the CMC index was used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.

19.
African swine fever virus (ASFV) causes a highly contagious and severe hemorrhagic viral disease with high mortality in domestic pigs of all ages. Although the virus is harmless to humans, the ongoing ASFV epidemic could have severe economic consequences for global food security. Recent studies have found a few antiviral agents that can inhibit ASFV infections. However, currently, there are no vaccines or antiviral drugs. Hence, there is an urgent need to identify new drugs to treat ASFV. Based on the structural information data on the targets of ASFV, we used molecular docking and machine learning models to identify novel antiviral agents. Using principal component analysis and k-means clustering in molecular docking studies of FDA-approved drugs, we confirmed that high-affinity compounds in the region of interest belonged to subsets of the chemical space. These methods predicted pentagastrin as a potential antiviral drug against ASFVs. Finally, it was also observed that the compound had an inhibitory effect on AsfvPolX activity. Results from the present study suggest that molecular docking and machine learning models can play an important role in identifying potential antiviral drugs against ASFVs.

20.