Similar articles
 Found 20 similar articles (search time: 78 ms)
1.
There are many algorithms for detecting epistatic interactions in GWAS. However, most of these algorithms are applicable only for detecting two-locus interactions. Some algorithms are designed to detect only two-locus interactions from the beginning. Others do not have limits to the order of interactions, but in practice take a very long time to detect higher order interactions in real GWAS data. Even the better ones take days to detect higher order interactions in WTCCC data. We propose a fast algorithm for detection of high order epistatic interactions in GWAS. It runs the k-means clustering algorithm on the set of all SNPs. Then candidates are selected from each cluster. These candidates are examined to find the causative SNPs of k-locus interactions. We use mutual information from information theory as the measure of association between genotypes and phenotypes. We tested the power and speed of our method on extensive sets of simulated data. The results show that our method has equal or greater power, and runs much faster than previously reported methods. We also applied our algorithm to each of seven diseases in WTCCC data to analyze up to 5-locus interactions. It takes only a few hours to analyze 5-locus interactions in one dataset. From the results we make some interesting and meaningful observations on each disease in WTCCC data. In this study, a simple yet powerful two-step approach is proposed for fast detection of high order epistatic interaction. Our algorithm makes it possible to detect high order epistatic interactions in GWAS in a matter of hours on a PC.
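As a rough illustration of the association measure named in this abstract, the sketch below computes the mutual information between a genotype sequence and a phenotype sequence from empirical counts. The function names `mutual_information` and `combined_genotype` and the toy data are assumptions made for illustration; this is not the authors' implementation, and the clustering and candidate-selection steps are not shown.

```python
import math
from collections import Counter


def mutual_information(genotypes, phenotypes):
    """Empirical mutual information I(G; P), in bits, of two equal-length sequences."""
    if len(genotypes) != len(phenotypes):
        raise ValueError("sequences must have the same length")
    n = len(genotypes)
    joint = Counter(zip(genotypes, phenotypes))
    g_counts = Counter(genotypes)
    p_counts = Counter(phenotypes)
    mi = 0.0
    for (g, p), c in joint.items():
        # p(g, p) * log2( p(g, p) / (p(g) * p(p)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (g_counts[g] * p_counts[p]))
    return mi


def combined_genotype(snp_columns):
    """Treat k SNP columns as one joint genotype per individual (hypothetical helper)."""
    return list(zip(*snp_columns))


# Toy data: the phenotype is the XOR of two SNPs, a purely epistatic effect.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
single = mutual_information(snp_a, phenotype)
pair = mutual_information(combined_genotype([snp_a, snp_b]), phenotype)
```

In this toy case each SNP alone carries no information about the phenotype, while the two-locus genotype determines it completely, which is the kind of signal a k-locus search looks for.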

2.
A detailed comparison of six multivariate algorithms is presented to analyze and generate Raman microscopic images that consist of a large number of individual spectra. This includes the segmentation algorithms for hierarchical cluster analysis, fuzzy C-means cluster analysis, and k-means cluster analysis and the spectral unmixing techniques for principal component analysis and vertex component analysis (VCA). All algorithms are reviewed and compared. Furthermore, comparisons are made to the new approach N-FINDR. In contrast to the related VCA approach, the used implementation of N-FINDR searches for the original input spectrum from the non-dimension reduced input matrix and sets it as the endmember signature. The algorithms were applied to hyperspectral data from a Raman image of a single cell. This data set was acquired by collecting individual spectra in a raster pattern using a 0.5-µm step size via a commercial Raman microspectrometer. The results were also compared with a fluorescence staining of the cell including its mitochondrial distribution. The ability of each algorithm to extract chemical and spatial information of subcellular components in the cell is discussed together with advantages and disadvantages.

3.
This study describes the analysis of total hops essential oils from 18 cultivated varieties of hops, five of which were bred in Lithuania, and 7 wild hop forms using gas chromatography-mass spectrometry. The study sought to organise the samples of hops into clusters, according to 72 semi-volatile compounds, by applying a well-known method, k-means clustering analysis, and to identify the origin of the Lithuanian hop varieties. The bouquet of the hops essential oil was composed of various esters, terpenes, hydrocarbons and ketones. Monoterpenes (mainly β-myrcene), sesquiterpenes (dominated by β-caryophyllene and α-humulene) and oxygenated sesquiterpenes (mainly caryophyllene oxide and humulene epoxide II) were the main compound groups detected in the samples tested. The above compounds, together with α-muurolene, were the only compounds found in all the samples. Qualitative and quantitative differences were observed in the composition of the essential oils of the hop varieties analysed. For successful and statistically significant clustering of the data obtained, expertise and skills in employing chemometric analysis methods are necessary. The result is also highly dependent on the set of samples (representativeness) used for segmentation into groups, the technique for pre-processing the data, the method selected for partitioning the samples according to the similarity measures chosen, etc. To achieve a large and representative data set for clustering analysis from a small number of measurements, numerical simulation was applied using the Monte Carlo method with normal and uniform distributions and several relative standard deviation values. The grouping was performed using the k-means clustering method, employing several techniques for evaluating the optimal number of clusters (Davies-Bouldin index, distortion function, etc.) and different data pre-processing approaches. The hop samples analysed were separated into 3 and 5 clusters according to the data filtering scenario used. However, the targeted Lithuanian hop varieties were clustered identically in both cases and fell into the same group together with other cultivated hop varieties from Ukraine and Poland.
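The Davies-Bouldin index mentioned in this abstract can be sketched as follows, using its usual definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation, then average over clusters (lower is better). The function name `davies_bouldin` and the toy points are assumptions for illustration, Euclidean distance is assumed, and the sketch does not handle clusters whose centroids coincide.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled point set (lower means better separation)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance of its members to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # compact, well-separated grouping
bad = davies_bouldin(points, [0, 1, 0, 1])   # stretched, overlapping grouping
```

To choose the number of clusters, one would typically compute this value for several candidate k-means partitions and prefer the partition with the smallest index.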

4.
Ion Mobility Spectrometry (IMS) provides a means for analyzing the substances a person exhales. In this paper, we report on an approach to support early diagnosis of bronchial carcinoma based on such IMS measurements. Given the peaks in a set of ion mobility spectra, we first cluster these peaks with a modified k-means algorithm. We then apply probabilistic relational modelling and learning methods to a logical representation of the data obtained from the ion mobility spectra and the peak clusters. Markov Logic Networks and the MLN system Alchemy are employed for various modelling and learning scenarios. These scenarios are evaluated with respect to ease of use, classification accuracy, and knowledge representation aspects.

5.
Representative subset selection   Total citations: 1, self-citations: 0, citations by others: 1
The rapid development of analytical techniques makes it possible to acquire huge amounts of data. Large data sets are difficult to handle, and there is therefore great interest in designing a subset of the original data set that preserves its information and facilitates the computations. There are many subset selection methods and their choice depends on the problem at hand. The two most popular groups of subset selection methods are uniform designs and cluster-based designs. The methods considered in this paper include uniform designs, such as those proposed by Kennard and Stone and OptiSim, and cluster-based designs applying the K-means technique and density-based spatial clustering of applications with noise (DBSCAN). Additionally, a new concept of subset selection with K-means is introduced.
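As one concrete instance of a uniform design, the Kennard and Stone procedure is commonly described as a max-min rule: seed the subset with the two most distant points, then repeatedly add the point whose nearest already-selected neighbour is farthest away. The sketch below follows that description with Euclidean distance; the function name `kennard_stone` and the toy data are assumptions for illustration, not code from the paper.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Seed with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    # min_dist[k] is the distance from point k to its nearest selected point.
    min_dist = [min(dist[k][first], dist[k][second]) for k in range(n)]
    while len(selected) < n_select:
        candidates = [k for k in range(n) if k not in selected]
        nxt = max(candidates, key=lambda k: min_dist[k])
        selected.append(nxt)
        min_dist = [min(min_dist[k], dist[k][nxt]) for k in range(n)]
    return selected


points = [(0, 0), (1, 0), (2, 0), (9, 0), (10, 0)]
subset = kennard_stone(points, 3)
```

Here the two extremes are picked first, followed by the point that best fills the largest remaining gap. The full distance matrix makes this sketch quadratic in memory, so it suits only small data sets.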

6.
The congener profile of samples contaminated with dioxin and dioxin-like compounds allows identifying sources of contamination. This article studies the statistical methods of congener profile analysis reported in the literature with respect to the reliability of obtained results. The performance of customary analysis methods regarding raw data transformation and applied TEF (toxic equivalency factor) values is discussed. In particular, the combination of principal component analysis and k-means clustering is taken as an example and examined in detail. Reasons for occurring inconsistencies such as the dependence of results on raw data transformation and the disregard of measurement uncertainty are described, and it is shown that they also explain inconsistencies in other methods of cluster analysis such as hierarchical cluster analysis and neural networks. It is concluded that these methods cannot be employed to reach court-proof decisions, i.e. decisions which meet court evidentiary standards. An alternative approach to analyzing congener profiles based on mathematical statistics is briefly presented, allowing reliable, court-proof decisions.

7.
8.
This work proposes a modification to the successive projections algorithm (SPA) aimed at selecting spectral variables for multiple linear regression (MLR) in the presence of unknown interferents not included in the calibration data set. The modified algorithm favours the selection of variables in which the effect of the interferent is less pronounced. The proposed procedure can be regarded as an adaptive modelling technique, because the spectral features of the samples to be analyzed are considered in the variable selection process. The advantages of this new approach are demonstrated in two analytical problems, namely (1) ultraviolet–visible spectrometric determination of tartrazine, allure red and sunset yellow in aqueous solutions under the interference of erythrosine, and (2) near-infrared spectrometric determination of ethanol in gasoline under the interference of toluene. In these case studies, the performance of conventional MLR-SPA models is substantially degraded by the presence of the interferent. This problem is circumvented by applying the proposed Adaptive MLR-SPA approach, which results in prediction errors smaller than those obtained by three other multivariate calibration techniques, namely stepwise regression, full-spectrum partial-least-squares (PLS) and PLS with variables selected by a genetic algorithm. An inspection of the variable selection results reveals that the Adaptive approach successfully avoids spectral regions in which the interference is more intense.

9.
The joint use of genetic algorithms and pruning computational neural networks is shown to be an effective means for selecting the number of inputs required to correct temperature variations in kinetic-based determinations. The genetic algorithm uses a pruning procedure based on Bayesian regularization and is highly efficient as a feature selector; it provides quite good results in the generalization process without the need to use a validation set. The fitness function is defined as the sum of two subfunctions: one controls the learning ability of the network and the other its complexity. The training, pruning, and generalization processes were initially tested with simulated data in order to acquire preliminary information for the ensuing work with real data. The performance of the proposed method was assessed by applying it to the determination of the amino acid L-glycine by its classical spectrophotometric reaction with ninhydrin. A straightforward network topology including temperature as input (40+T:2:1 with 19 connections after the pruning process) was used to estimate the L-glycine concentration from kinetic curves affected by temperature variations over the range 60-75 °C, using kinetic data acquired up to only 1.5 half-lives. The trained network estimates this concentration with a standard error of prediction for the testing set of ca. 8%, which is much smaller than those provided by a classical parametric method such as nonlinear regression (even if kinetic data acquired at longer half-lives are used). Finally, a kinetic interpretation of the pruning process is provided in order to better demonstrate its potential for kinetic analysis.

10.
Airborne particulate matter is an important component of atmospheric pollution, affecting human health, climate, and visibility. Modern instruments allow single particles to be analyzed one-by-one in real time, and offer the promise of determining the sources of individual particles based on their mass spectral signatures. The large number of particles to be apportioned makes clustering a necessary step. The goal of this study is to use mass spectral data to compare the accuracy and speed of several clustering algorithms: ART-2a, several variants of hierarchical clustering, and K-means. Repeated simulations with various algorithms and different levels of data preprocessing suggest that hierarchical clustering methods using derivatives of Ward's algorithm discriminate sources with fewer errors than ART-2a, which itself discriminates much better than point-wise hierarchical clustering methods. In most cases, K-means algorithms do almost as well as the best hierarchical clustering. These efficient algorithms (clustering derived from Ward's algorithm, ART-2a and K-means) are most accurate when the relative peak areas have been pre-scaled by taking the square root. Analysis times vary within a factor of 30, and when accuracy above 95% is required, run times scale up as the square of the number of particles. Algorithms derived from Ward's remain the most accurate under a wide range of conditions and conversely, for an equal accuracy, can deliver a shorter list of clusters, allowing faster and maybe on-the-fly classification.

11.
This paper proposes a new method for determining the subset of variables that reproduce as well as possible the main structural features of the complete data set. This method can be useful for pre-treatment of large data sets since it allows discarding variables that contain redundant information. Reducing the number of variables often allows one to better investigate data structure and obtain more stable results from multivariate modelling methods. The novel method is based on the recently proposed canonical measure of correlation (CMC index) between two sets of variables [R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications, Anal. Chim. Acta submitted for publication (2009)]. Following a stepwise procedure (backward elimination), each variable in turn is compared to all the other variables and the most correlated is definitively discarded. Finally, a key subset of variables that are as orthogonal as possible is selected. The performance was evaluated on both simulated and real data sets. The effectiveness of the novel method is discussed by comparison with results of other well known methods for variable reduction, such as Jolliffe techniques, McCabe criteria, Krzanowski approach and its modification based on genetic algorithms, loadings of the first principal component, Key Set Factor Analysis (KSFA), Variable Inflation Factor (VIF), pairwise correlation approach, and K correlation analysis (KIF). The obtained results are consistent with those of the other considered methods; moreover, the advantage of the proposed CMC method is that calculation is very quick and can be easily implemented in any software application.

12.
Accelerated K-means clustering in metric spaces   Total citations: 1, self-citations: 0, citations by others: 1
The K-means method is a popular technique for clustering data into k partitions. In the adaptive form of the algorithm, Lloyd's method, an iterative procedure alternately assigns cluster membership based on a set of centroids and then redefines the centroids based on the computed cluster membership. The most time-consuming part of this algorithm is the determination of which points being clustered belong to which cluster center. This paper discusses the use of the vantage-point tree as a method of more quickly assigning cluster membership when the points being clustered belong to intrinsically low- and medium-dimensional metric spaces. Results will be discussed from simulated data sets and real-world data in the clustering of molecular databases based upon physicochemical properties. Comparisons will be made to a highly optimized brute-force implementation of Lloyd's method and to other pruning strategies.
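The alternation described in this abstract can be sketched as a plain brute-force Lloyd iteration, that is, the baseline that a vantage-point tree would accelerate, not the accelerated method itself. The function name `lloyd_kmeans`, the Euclidean metric, and the caller-supplied starting centroids are assumptions made for illustration.

```python
import math


def lloyd_kmeans(points, initial_centroids, max_iter=100):
    """Brute-force Lloyd iteration; returns (labels, centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: compare every point with every centroid.
        # This is the costly part that tree-based pruning tries to avoid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = lloyd_kmeans(points, [(0, 0), (10, 10)])
```

Each pass costs one distance evaluation per point and centroid, which is why faster membership assignment dominates the achievable speed-up.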

13.
The non-linear regression technique known as the alternating conditional expectations (ACE) method is only applicable when the number of objects available for calibration is considerably greater than the number of considered predictors. Alternating conditional expectations regression with selection of significant predictors by genetic algorithms (GA-ACE), the non-linear regression technique presented here, is based on the ACE algorithm but introduces several modifications to resolve the applicability limitations of the original ACE method, thus facilitating the practical implementation of a very interesting calibration tool. In order to overcome the lack of reliability displayed by the original ACE algorithm when working on data sets characterized by too large a number of variables, and prior to the development of the non-linear regression model, GA-ACE applies genetic algorithms as a variable selection technique to select a reduced subset of significant predictors able to accurately model and predict a considered variable response. Furthermore, GA-ACE actually provides two alternative application approaches, since it allows either the performance of prior data compression computing a number of principal components to be subsequently subjected to GA-selection, or working directly on original variables. In this study, GA-ACE was applied to two real calibration problems, with a very low observation/variable ratio (NIR data), and the results were compared with those obtained by several linear regression techniques usually employed. When using the GA-ACE non-linear method, notably improved regression models were developed for the two response variables modeled, with root mean square errors of the residuals in external prediction (RMSEP) equal to 11.51 and 6.03% for moisture and lipid contents of roasted coffee samples, respectively. The improvement achieved by applying the new non-linear method introduced is even more remarkable taking into account the results obtained with the best performance linear method (IPW-PLS) applied to predict the studied responses (14.61 and 7.74% RMSEP, respectively).

14.
Non-linear absorption spectral data obtained from ternary mixtures of analytes are analyzed by using a linear model, iterative target transformation factor analysis (ITTFA). Transformed original variables are used to correct non-linearities in the original data. Absorbance below a certain limit (k) is described as linear and above this limit as non-linear. The extension of the regressor variables is the squared absorbances above the linear range. The variation of the prediction error as a function of the number of factors and the k-values was considered, and the minimum prediction error was used to find the optimum. In addition to the natural non-negativity constraint, a correlation constraint is also applied to the concentration vector in each iteration of the ITTFA algorithm. The reliability of the method is evaluated using model data for ternary mixtures with spectral overlap and different degrees of non-linearity. Simultaneous spectrophotometric determination of Eu³⁺, UO₂²⁺ and Th⁴⁺ with arsenazo III as chromogenic reagent is used as the experimental model system, with non-linear behavior of the Eu³⁺ and UO₂²⁺ components. The application to both synthetic and real data sets with different degrees of non-linearity demonstrates the ability of the proposed methodology to obtain better results than original data and ITTFA. The relative standard errors of prediction for the proposed method are marginally smaller than those obtained using PLS calibration on the original and extended data.

15.
This article presents a data analysis method for biomarker discovery in proteomics data analysis. In factor analysis-based discriminant models, the latent variables (LVs) are calculated from the response data measured at all employed instrument channels. Since some channels are irrelevant and their responses do not possess useful information, the extracted LVs possess mixed information from both useful and irrelevant channels. In this work, clustering of variables (CLoVA) based on unsupervised pattern recognition is suggested as an efficient method to identify the most informative spectral region, which is then used to construct a more predictive multivariate classification model. In the suggested method, the instrument channels (m/z values) are grouped into different clusters via a self-organizing map. Subsequently, the spectral data of each cluster are separately used as the input variables of classification methods such as partial least squares-discriminant analysis (PLS-DA) and extended canonical variate analysis (ECVA). The proposed method is evaluated by the analysis of two experimental data sets (ovarian and prostate cancer data sets). It is found that our proposed method is able to distinguish cancerous from healthy samples with much higher sensitivity and selectivity than conventional PLS-DA and ECVA methods.

16.
The differentiation of aromas of Chinese liquor is important for their unique flavors. In this work, aromas of Chinese liquor were characterized by gas chromatography and chemometrics. Ten representative aroma compounds, including three alcohols, four esters, two organic acids, and acetal, in 16 Chinese liquors were determined by gas chromatography with flame ionization detection. The relationship between these compounds and six classic aromas was investigated using principal component analysis and k-means clustering. The cumulative contribution of the first three principal components reached 84.607%, which effectively differentiated the liquors. The variables with the highest absolute loadings were acetal and ethyl acetate for principal component 1, ethyl butanoate and ethyl hexanoate for principal component 2, and hexanoic acid and ethyl butanoate for principal component 3. The aromas of the liquors were characterized by k-means clustering with the first three principal component scores, indicating that acetal, ethyl acetate, ethyl butanoate, ethyl hexanoate, and hexanoic acid are important for the aroma of Chinese liquors. This work demonstrated that gas chromatography with chemometrics is effective for the characterization of aromatic liquor.

17.
Cross‐validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the independence between model testing and calibration, the observation‐wise k‐fold operation is commonly implemented in each cross‐validation step. This operation renders the CV algorithm computationally intensive, and it is the main obstacle to applying CV to very large data sets. In this paper, we carry out an empirical and theoretical investigation of the use of this operation in the element‐wise k‐fold (ekf) algorithm, the state‐of‐the‐art CV algorithm. We show that when very large data sets need to be cross‐validated and the computational time is a matter of concern, the observation‐wise k‐fold operation can be skipped. The theoretical properties of the resulting modified algorithm, referred to as column‐wise k‐fold (ckf) algorithm, are derived. Also, its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm to be a valid alternative to the standard ekf to reduce the computational time needed to cross‐validate a data set. Copyright © 2015 John Wiley & Sons, Ltd.

18.
In this work the Successive Projections Algorithm (SPA) is presented for interval selection in N-PLS for three-way data modeling. The proposed algorithm combines the noise-reduction properties of PLS with the possibility of discarding uninformative variables in SPA. In addition, the second-order advantage can be achieved by the residual bilinearization (RBL) procedure when an unexpected constituent is present in a test sample. For this purpose, SPA was modified in order to select intervals for use in trilinear PLS. The ability of the proposed algorithm, namely iSPA-N-PLS, was evaluated on one simulated and two experimental data sets, comparing the results to those obtained by N-PLS. In the simulated system, two analytes were quantitated in two test sets, with and without an unexpected constituent. In the first experimental system, the determination of the four fluorophores (l-phenylalanine; l-3,4-dihydroxyphenylalanine; 1,4-dihydroxybenzene and l-tryptophan) was conducted with excitation-emission data matrices. In the second experimental system, quantitation of ofloxacin was performed in water samples containing two other uncalibrated quinolones (ciprofloxacin and danofloxacin) by high performance liquid chromatography with UV–vis diode array detector. For comparison purposes, a GA algorithm coupled with N-PLS/RBL was also used in this work. In most of the studied cases iSPA-N-PLS proved to be a promising tool for selection of variables in second-order calibration, generating models with smaller RMSEP, when compared to both the global model using all of the sensors in two dimensions and GA-NPLS/RBL.

19.
The α term is a primary parameter that indicates the deviation of the epithermal neutron distribution in the k0-standardization method of neutron activation analysis, k0-NAA. The calculation of α using a mathematical procedure is a challenge for some researchers. The calculation of α by the "bare-triple monitor" method is possible using the dedicated commercial software KAYZERO®/SOLCOI®. However, when this software is not available in the laboratory, α can be calculated by applying a simple iterative linear regression in any spreadsheet. This approach is described in this paper. The experimental data used in the example were obtained by the irradiation of a set of suitable monitors in the NAA #1 irradiation channel of the HANARO research reactor (KAERI, Korea). The results obtained by this iterative linear regression method agree well with the results calculated by the validated mathematical method.

20.
Pierce KM, Hope JL, Hoggard JC, Synovec RE. Talanta 2006, 70(4): 797-804
Comprehensive two-dimensional gas chromatography combined with time-of-flight mass spectrometry (GC × GC-TOFMS) provides high resolution separations of complex samples with a mass spectrum at every point in the separation space. The large volumes of multidimensional data obtained by GC × GC-TOFMS analysis are analyzed using a principal component analysis (PCA) method described herein to quickly and objectively discover differences between complex samples. In this work, we submitted 54 chromatograms to PCA to automatically compare the metabolite profiles of three different species of plants, namely basil (Ocimum basilicum), peppermint (Mentha piperita), and sweet herb stevia (Stevia rebaudiana), where there were 18 chromatograms for each type of plant. The 54 scores of the m/z 73 data set clustered in three groups according to the three types of plants. Principal component 1 (PC 1) separated the stevia cluster from the basil and peppermint clusters, capturing 61.84% of the total variance. Principal component 2 (PC 2) separated the basil cluster from the peppermint cluster, capturing 16.78% of the total variance. The PCA method revealed that relative abundances of amino acids, carboxylic acids, and carbohydrates were responsible for differentiating the three plants. A brief list of the 16 most significant metabolites is reported. After PCA, the 54 scores of the m/z 217 data set clustered in three groups according to the three types of plants, as well, yielding highly loaded variables corresponding with chemical differences between plants that were complementary to the m/z 73 information. The PCA data mining method is applicable to all of the monitored selective mass channels, utilizing all of the collected data, to discover unknown differences in complex sample profiles.
