Similar literature
 20 similar documents found.
1.
Metabolomics datasets generated by modern analytical instruments are becoming increasingly complex. In this study, a recent method, shrunken centroids regularized discriminant analysis (SCRDA), is introduced and applied to the exploration of metabolomics datasets. It is a supervised method for variable selection, discriminant analysis and biomarker screening. By regularizing the estimate of the within‐class covariance matrix, SCRDA can deal with the singularity issue of linear discriminant analysis. A shrinkage estimator is then applied to perform variable selection. The method is illustrated on simulated datasets and three complex metabolomics datasets. The commonly used orthogonal partial least squares discriminant analysis and two similar statistical methods, penalized linear discriminant analysis and nearest shrunken centroids, are used for comparison. The results illustrate that SCRDA has desirable abilities in variable selection, classification and prediction. Moreover, the biomarkers identified by SCRDA are shown to be in accordance with biochemical research. The results indicate that SCRDA is a promising strategy in metabolomics. Copyright © 2014 John Wiley & Sons, Ltd.
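The two ingredients named in this abstract, a regularized covariance estimate and a shrinkage step, can be sketched in a few lines. This is only an illustration under assumptions: the blend with the identity matrix (weight `alpha`) and the soft-thresholding rule (threshold `delta`) are one common form of these operations, and the abstract does not state which exact estimators or parameter values were used.

```python
def regularize_covariance(cov, alpha):
    """Blend a covariance matrix with the identity: alpha*cov + (1-alpha)*I.

    Assumed form of the regularization; it makes a singular matrix invertible.
    """
    p = len(cov)
    return [[alpha * cov[i][j] + (1.0 - alpha) * (1.0 if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def soft_threshold(values, delta):
    """Shrink each value toward zero by delta; values within delta become 0."""
    shrunk = []
    for v in values:
        magnitude = abs(v) - delta
        if magnitude > 0:
            shrunk.append(magnitude if v > 0 else -magnitude)
        else:
            shrunk.append(0.0)
    return shrunk


# A singular 2x2 covariance matrix (determinant 0) becomes invertible.
cov = [[1.0, 1.0], [1.0, 1.0]]
reg = regularize_covariance(cov, alpha=0.5)
det = reg[0][0] * reg[1][1] - reg[0][1] * reg[1][0]

# Variables whose coefficients are thresholded to zero drop out of the model.
coefficients = soft_threshold([2.0, -0.3, 0.5, -1.5], delta=0.5)
```

Variables whose shrunken coefficient is exactly zero no longer contribute to the discriminant, which is how a shrinkage step doubles as variable selection.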

2.
Precise information about protein locations in a cell facilitates understanding of a protein's function and its interactions in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on the fusion of individual classifiers. The proposed approach uses features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of individual learning mechanisms trained on PseAAC based features, namely support vector machine, nearest neighbor, probabilistic neural network and covariant discriminant, is analyzed first. Classifiers are then developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, which improves prediction performance. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated on two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for the 12 and 14 subcellular location datasets, respectively. In the independent dataset test, prediction accuracies are 87.04 and 87.33% for the 12 and 14 class datasets, respectively.
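The fusion of classifiers by voting can be illustrated with a minimal sketch. It assumes plain majority voting over per-sample class labels; the abstract does not say whether the votes were weighted, and the classifier names and location labels below are made-up placeholders.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-sample predictions from several classifiers by majority.

    predictions_per_model is a list with one list of labels per classifier;
    all lists must cover the same samples in the same order.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three base classifiers for four proteins.
svm_pred = ["nucleus", "cytoplasm", "nucleus", "membrane"]
knn_pred = ["nucleus", "nucleus", "nucleus", "membrane"]
pnn_pred = ["cytoplasm", "cytoplasm", "nucleus", "nucleus"]

ensemble_pred = majority_vote([svm_pred, knn_pred, pnn_pred])
```

With an odd number of voters and two candidate labels per sample there are no ties; a real ensemble over many classes would need an explicit tie-breaking rule.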

3.
Because cerebrospinal fluid (CSF) is the biofluid which interacts most closely with the central nervous system (CNS), it holds promise as a reporter of neurological disease, for example multiple sclerosis (MScl). To characterize the metabolomics profile of neuroinflammatory aspects of this disease we studied an animal model of MScl, experimental autoimmune/allergic encephalomyelitis (EAE). Because CSF also exchanges metabolites with blood via the blood-brain barrier, malfunctions occurring in the CNS may be reflected in the biochemical composition of blood plasma. The combination of blood plasma and CSF provides more complete information about the disease. Both biofluids can be studied by use of NMR spectroscopy. It is then necessary to perform combined analysis of the two different datasets. Mid-level data fusion was therefore applied to blood plasma and CSF datasets. First, relevant information was extracted from each biofluid dataset by use of linear support vector machine recursive feature elimination. The selected variables from each dataset were concatenated for joint analysis by partial least squares discriminant analysis (PLS-DA). The combined metabolomics information from plasma and CSF enables more efficient and reliable discrimination of the onset of EAE. Second, we introduced hierarchical models fusion, in which previously developed PLS-DA models are hierarchically combined. We show that this approach enables neuroinflamed rats (even on the day of onset) to be distinguished from either healthy or peripherally inflamed rats. Moreover, progression of EAE can be investigated because the model separates the onset and peak of the disease.
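The concatenation step of mid-level data fusion can be sketched as follows. The sketch assumes the variable selection has already been done and simply takes the selected column indices as input; the data values and index lists are hypothetical, and the feature elimination and PLS-DA steps themselves are not implemented here.

```python
def select_columns(rows, keep):
    """Keep only the listed column indices from each row of a data block."""
    return [[row[j] for j in keep] for row in rows]


def mid_level_fusion(block_a, keep_a, block_b, keep_b):
    """Concatenate the selected variables of two blocks sample by sample.

    Both blocks must describe the same samples in the same order.
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must contain the same number of samples")
    selected_a = select_columns(block_a, keep_a)
    selected_b = select_columns(block_b, keep_b)
    return [row_a + row_b for row_a, row_b in zip(selected_a, selected_b)]


# Hypothetical spectra: 2 samples, 4 plasma variables and 3 CSF variables.
plasma = [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4]]
csf = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]

fused = mid_level_fusion(plasma, [0, 2], csf, [1])
```

The fused matrix is what a joint classifier would then be trained on, one row per animal with variables from both biofluids.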

4.
Data from high-throughput metabolomics experiments are becoming increasingly complex, which poses enormous challenges for existing statistical modeling. There is thus a need for statistically efficient approaches to mining the underlying metabolite information contained in the metabolomics data under investigation. In this work, we developed a novel kernel Fisher discriminant analysis (KFDA) algorithm by constructing an informative kernel based on a decision tree ensemble. The constructed kernel can effectively encode the similarities of metabolomics samples between informative metabolites/biomarkers in specific parts of the measurement space. At the same time, informative metabolites or potential biomarkers can be discovered by variable importance ranking during kernel construction. Moreover, with such a kernel KFDA can also deal, to some extent, with nonlinear relationships in the metabolomics data. Finally, two real metabolomics datasets and a simulated dataset were used to demonstrate the performance of the proposed approach in comparison with other approaches.
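One way a tree ensemble can induce a kernel is through leaf co-occurrence: two samples are similar if many trees route them to the same leaf. The sketch below shows that idea only as an assumption; the abstract does not describe how its kernel was actually constructed, and the leaf assignments are hypothetical inputs rather than the output of a fitted model.

```python
def leaf_cooccurrence_kernel(leaf_ids):
    """Build a similarity matrix from per-tree leaf assignments.

    leaf_ids[t][i] is the leaf that tree t assigns to sample i (assumed to be
    given). Entry [i][j] is the fraction of trees in which samples i and j
    fall into the same leaf.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    return [[sum(1 for tree in leaf_ids if tree[i] == tree[j]) / n_trees
             for j in range(n_samples)]
            for i in range(n_samples)]


# Hypothetical leaf assignments of 3 samples in 3 trees.
leaf_ids = [
    [0, 0, 1],
    [2, 3, 3],
    [1, 1, 1],
]
kernel = leaf_cooccurrence_kernel(leaf_ids)
```

A matrix built this way is symmetric with ones on the diagonal, so it can be passed to any kernel method in place of a fixed kernel such as the Gaussian.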

5.
At present, the rate of tertiary structure discovery lags far behind that of primary structure discovery. Predicting protein structural class with machine learning techniques can help reduce this gap. The Structural Classification of Proteins – extended (SCOPe 2.07) database is the latest and largest dataset currently available. Protein sequences with less than 40% identity to each other are used to predict the α, β, α/β and α + β SCOPe classes. Sensitive features are extracted from the primary and secondary structure representations of proteins. Features are extracted experimentally from secondary structure with respect to its frequency, pitch and spatial arrangement. Primary structure based features contain species information for a protein sequence. The species parameters are further validated against the uniref100 dataset using TaxId. Protein tertiary structure is known to be a manifestation of function, and functional differences are observed between species. Hence, species are expected to correlate strongly with structural class, which is what the current work finds. This enhances prediction accuracy by 7%–10%. A subset of SCOPe 2.07 is used to train a Random Forest classifier on a 65-dimensional feature vector. Testing on the rest of the set gives a consistent accuracy of better than 95%. The accuracy achieved on the benchmark datasets ASTRAL 1.73, 25PDB and FC699 is better than 86%, 91% and 97%, respectively, which is the best reported to our knowledge.

6.
In this study, we have investigated quantitative relationships between critical temperatures of superconductive inorganic materials and the basic physicochemical attributes of these materials (also called quantitative structure-property relationships). We demonstrated that one of the most recent studies (titled "A data-driven statistical model for predicting the critical temperature of a superconductor" and published in Computational Materials Science by K. Hamidieh in 2018) reports on models based on a dataset in which 27% of the entries are duplicates. We aimed to deliver stable models for a properly cleaned dataset using the same modeling techniques (multiple linear regression, MLR, and gradient boosting decision trees, XGBoost). The predictive ability of our best XGBoost model (R2 = 0.924, RMSE = 9.336 using 10-fold cross-validation) is comparable to the XGBoost model by the author of the initial dataset (R2 = 0.920 and RMSE = 9.5 K in ten-fold cross-validation). At the same time, our best model is based on less sophisticated parameters, which allows one to make more accurate interpretations while maintaining a generalizable model. In particular, we found that the highest relative influence is attributed to variables that represent the thermal conductivity of materials. In addition to MLR and XGBoost, we explored the potential of other machine learning techniques (NN, neural networks and RF, random forests).
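Two steps mentioned here, removing duplicate entries and scoring a regression model by R2 and RMSE, can be sketched as below. The sketch treats rows as duplicates only when they are identical in every field, which is an assumption; the abstract does not say how duplicates were defined. The numbers are invented for illustration.

```python
import math


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


rows = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]]
cleaned = drop_duplicates(rows)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
```

Removing duplicates before cross-validation matters because a duplicate that lands in both the training and validation folds makes the validation score look better than it should.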

7.
The Random Projection (RP) technique has been widely applied in many scenarios because it can quickly reduce high-dimensional features to a low-dimensional space and so meet the need for real-time analysis of massive data. With the rapid growth of big genomics data, there is an urgent need for dimensionality reduction. However, the classification performance obtained with RP is usually comparatively low. We attempt to improve the classification accuracy of RP by combining it with other dimensionality reduction methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared the classification accuracy and running time of different combinations on three microarray datasets and a simulation dataset. Experimental results show a remarkable improvement of 14.77% in the classification accuracy of FS followed by RP compared to RP alone on the BC-TCGA dataset. LDA followed by RP also helps RP yield a more discriminative subspace, with an increase of 13.65% in classification accuracy on the same dataset. FS followed by RP outperforms the other combinations in classification accuracy on most of the datasets.
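The "FS followed by RP" pipeline can be sketched with the standard library alone. The feature selection criterion here (highest variance) is a stand-in chosen for brevity, since the abstract does not name the criterion used, and the Gaussian projection matrix is one common way to build a random projection. The data values are made up.

```python
import math
import random


def top_variance_features(rows, n_keep):
    """Return the indices of the n_keep columns with the largest variance."""
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    ranked = sorted(range(len(variances)), key=lambda j: variances[j],
                    reverse=True)
    return sorted(ranked[:n_keep])


def random_projection(rows, n_components, seed=0):
    """Project rows onto n_components random Gaussian directions."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    scale = 1.0 / math.sqrt(n_components)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
              for _ in range(n_components)]
    return [[sum(w * x for w, x in zip(weights, row)) for weights in matrix]
            for row in rows]


# Hypothetical expression matrix: 3 samples, 4 features.
data = [[1.0, 5.0, 0.5, 2.0], [1.0, 9.0, 0.4, 4.0], [1.0, 1.0, 0.6, 6.0]]

keep = top_variance_features(data, n_keep=2)        # feature selection
selected = [[row[j] for j in keep] for row in data]
projected = random_projection(selected, n_components=1)  # random projection
```

Selecting features first means the random directions are drawn only in the subspace of informative variables, which is one plausible reading of why the combination helps.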

8.
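To address data-processing problems in metabolomics research, this study established a mass spectrometry based integrated analysis system (MS-IAS). The system integrates multiple methods, including feature selection, clustering and classification, for processing mass spectrometry data, and provides several statistical analysis methods for comparing the selected feature variables in order to discover potential biomarkers related to the problem under study. MS-IAS supports graphical display of the data and of the results of many of its algorithms, which aids interpretation and analysis of the data. Mass spectrometry metabolomics data from patients with liver disease were used as an example to demonstrate the functions of MS-IAS: two feature selection algorithms screened out 40 feature variables with discriminating power for liver disease, showing the potential of MS-IAS to become a general-purpose mass spectrometry data analysis system for metabolomics research.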

9.
Chinese Chemical Letters, 2022, 33(12): 5184-5188
Exposure to environmental cadmium increases the health risk of residents. Early urine metabolic detection using high-resolution mass spectrometry and machine learning algorithms would be advantageous for predicting the adverse health effects. Here, we applied machine learning approaches to screen potential biomarkers of cadmium exposure in 403 urine samples. In positive and negative ionization mode, 4207 and 3558 features were extracted, respectively. We compared seven machine learning algorithms and found that the extreme gradient boosting (XGBoost) and random forest (RF) classifiers showed better accuracy and predictive performance than the others. Following 5-fold cross-validation, the area under the curve (AUC) of the XGBoost classifier was 0.93 for both positive and negative ionization modes. For the RF classifier, the AUC values were 0.80 and 0.84 for positive and negative ionization modes, respectively. We then identified a biomarker panel based on the XGBoost and RF classifiers. The incorporation of machine learning models into urine analysis using high-resolution mass spectrometry could allow a convenient assessment of cadmium exposure.
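The AUC values quoted here have a simple interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are invented, and the quadratic loop is meant for clarity rather than for datasets of realistic size.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels are 1 (positive) or 0 (negative); ties in score count as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six urine samples.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9; a value of 0.5 would correspond to random ranking.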

10.
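Distinguishing types of stamp ink is an important part of document examination in forensic science, and this work studies a non-destructive, efficient method for distinguishing types of photosensitive stamp ink. The raw spectral data of 33 photosensitive stamp inks of different brands served as the control group. After t-SNE and UMAP dimensionality reduction of the raw data, three classification algorithms (XGBoost, SVM and MLP) were selected, the test and training sets were set at a ratio of 1:4, and both the raw data and the dimension-reduced data were classified, with grid search and five-fold cross-validation used to optimize model performance and generalization ability. The results show that the average accuracy of the three classification algorithms on the dimension-reduced spectral data is higher than on the raw spectral data, and that the UMAP-MLP classification model achieves the highest accuracy, up to 98%. The proposed classification model can be used for rapid discrimination of photosensitive stamp ink types.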
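The evaluation protocol described here (a 1:4 test-to-training split, five-fold cross-validation and a grid of candidate settings) can be sketched with index bookkeeping alone. The sample count and the hyperparameter names `C` and `gamma` are hypothetical, and no classifier is trained in this sketch.

```python
import random
from itertools import product


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def parameter_grid(grid):
    """Yield every combination of the candidate hyperparameter values."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# 1:4 test-to-training ratio means 20% of the samples are held out.
train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
splits = list(k_fold_indices(train_idx, k=5))
candidates = list(parameter_grid({"C": [1, 10], "gamma": [0.1, 1.0]}))
```

In a grid search, each candidate setting would be scored on all five validation folds of the training part, and only the best setting is evaluated once on the held-out test part.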

11.
The pathological diagnosis of benign and malignant follicular thyroid tumors remains a major challenge using the current histopathological technique. To improve diagnosis accuracy, spatially resolved metabolomics analysis based on air flow-assisted desorption electrospray ionization mass spectrometry imaging (AFADESI-MSI) technique was used to establish a molecular diagnostic strategy for discriminating four pathological types of thyroid tumor. Without any specific labels, numerous metabolite features with their spatial distribution information can be acquired by AFADESI-MSI. The underlying metabolic heterogeneity can be visualized in line with the cellular heterogeneity in native tumor tissue. Through micro-regional feature extraction and in situ metabolomics analysis, three sets of metabolic biomarkers for the visual discrimination of benign follicular adenoma and differentiated thyroid carcinomas were discovered. Additionally, the automated prediction of tumor foci was supported by a diagnostic model based on the metabolic profile of 65 thyroid nodules. The model prediction accuracy was 83.3% when a test set of 12 independent samples was used. This diagnostic strategy presents a new way of performing in situ pathological examinations using small molecular biomarkers and provides a model diagnosis for clinically indeterminate thyroid tumor cases.

12.
A probabilistic neural network, a recently developed and powerful classification tool, was used to distinguish cancer patients from healthy persons according to the levels of nucleosides in human urine. Two datasets (containing 32 and 50 patterns, respectively) were investigated and the total consistency rate obtained was 100% for dataset 1 and 94% for dataset 2. To evaluate the performance of the probabilistic neural network, linear discriminant analysis and a learning vector quantization network were also applied to the classification problem. The results showed that the predictive ability of the probabilistic neural network is stronger than that of the other methods in this study. Moreover, the recognition rate for dataset 2 can reach 100% when the three methods are combined, which indicates the promising potential of combining different methods for clinical diagnosis.

13.
(1) Background: Data accuracy plays a key role in determining model performance, and the field of metabolism prediction suffers from a lack of truly reliable data. To enhance the accuracy of metabolic data, we recently proposed a manually curated database collected by a meta-analysis of the specialized literature (MetaQSAR). Here we aim to further increase data accuracy by focusing on publications reporting exhaustive metabolic trees. This selection should reduce the number of false negative data. (2) Methods: A new metabolic database (MetaTREE) was thus collected and used to extract a dataset of metabolic data concerning glutathione conjugation (MT-dataset). After proper pre-processing, this dataset, along with the corresponding dataset extracted from MetaQSAR (MQ-dataset), was used to develop binary classification models with a random forest algorithm. (3) Results: The comparison of the models generated from the two collected datasets reveals the better performance reached with the MT-dataset (MCC raised from 0.63 to 0.67, sensitivity from 0.56 to 0.58). The analysis of the applicability domain also confirms that the model based on the MT-dataset shows more robust predictive power with a larger applicability domain. (4) Conclusions: These results confirm that focusing on metabolic trees is a convenient approach to increasing data accuracy by reducing the false negative cases. The encouraging performance of the models developed from the MT-dataset invites the use of MetaTREE for predictive studies in the field of xenobiotic metabolism.
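The two metrics reported in the results, the Matthews correlation coefficient (MCC) and sensitivity, are both computed from the confusion matrix of a binary classifier. A small sketch follows; the counts are hypothetical and are not taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator


def sensitivity(tp, fn):
    """Fraction of true positives that the model recovers (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical confusion matrix: 50 substrates, 50 non-substrates.
tp, fn, tn, fp = 40, 10, 45, 5
score = mcc(tp, tn, fp, fn)
recall = sensitivity(tp, fn)
```

False negatives enter both formulas, which is why mislabeled negatives in the training data (true substrates recorded as non-substrates) depress both scores.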

14.
We have coupled 2D-NMR and infusion FT-ICR-MS with computer-assisted assignment to profile 13C-isotopologues of glycerophospholipids (GPL) directly in crude cell extracts, resulting in very high information throughput of >3000 isobaric molecules in a few minutes. A mass accuracy of better than 1 ppm combined with a resolution of 100,000 at the measured m/z was required to distinguish isotopomers from other GPL structures. Isotopologue analysis of GPLs extracted from LCC2 breast cancer cells grown on [U-13C]-glucose provided a rich trove of information about the biosynthesis and turnover of the GPLs. The isotopologue intensity ratios from the FT-ICR-MS were accurate to ≈1% or better based on natural abundance background, and depended on the signal-to-noise ratio. The time course of incorporation of 13C from [U-13C]-glucose into a particular phosphatidylcholine was analyzed in detail, to provide a quantitative measure of the sizes of glycerol, acetyl CoA and total GPL pools in growing LCC2 cells. Independent and complementary analysis of the positional 13C enrichment in the glycerol and fatty acyl chains obtained from high resolution 2D NMR was used to verify key aspects of the model. This technology enables simple and rapid sample preparation, allows rapid analysis, and is generally applicable to unfractionated GPLs of almost any head group, and to mixtures of other classes of metabolites.
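The two instrument requirements quoted here translate into simple arithmetic. The sketch below uses the usual definitions, mass error in parts per million and resolution as m divided by the peak width; the m/z values are round hypothetical numbers, not measurements from the study.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Mass error in parts per million relative to the theoretical m/z."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def peak_width_at_resolution(mz, resolution):
    """Peak width implied by a given resolution, using R = m / delta_m."""
    return mz / resolution


# Hypothetical ion at m/z 800 measured 0.0004 too high.
error_ppm = ppm_error(800.0004, 800.0)

# At a resolution of 100,000, peaks at m/z 800 are about 0.008 wide.
width = peak_width_at_resolution(800.0, 100_000)
```

Two species at the same nominal mass can only be told apart if their exact masses differ by more than roughly this width and the mass error is small compared with that difference.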

15.
16.
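A new method for discriminating the flavor types of Chinese liquor (baijiu) based on a liquid-array taste-bionic sensor   Total citations: 2, self-citations: 0, citations by others: 2
By mimicking the mammalian gustatory system, a cross-responsive liquid array sensor was built, providing a new method for discriminating the flavor types of baijiu. Seven dyes and one porphyrin compound were chosen as sensing units to construct the liquid array sensor; the spectral response signals of the eight sensing units were combined to form a fingerprint of the analyte, thereby achieving recognition. Response data were collected with a 96-well microplate reader and processed with pattern recognition methods including principal component analysis (PCA), hierarchical cluster analysis (HCA) and linear discriminant analysis (LDA), and nine representative baijiu samples of different flavor types were discriminated. The PCA results showed that detection of baijiu by this method is based mainly on the trace components of the liquor, among which acids contributed most to recognition (contribution rate 54.3%), while aromatic substances contributed 18.6%; moreover, only 63.4% of the information in the data was needed to distinguish the flavor types. The HCA results showed that replicate samples were all correctly grouped and that the degree of similarity among the liquors was reflected in the dendrogram. The LDA results showed that the array identified the flavor types of the nine baijiu samples with 100% accuracy.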

17.
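To address the overfitting that readily arises when building models from high-dimensional, small-sample mass spectrometry data, the severe collinearity among variables, and the nonlinear relationship between structure and property, a new technique was adopted that combines kernel sliced inverse regression (KSIR) feature extraction with linear discriminant analysis (LDA). The KSIR algorithm first performs nonlinear feature extraction on the mass spectrometry data; a linear discriminant function for the sample classes is then constructed in the low-dimensional space spanned by the new feature vectors and is used to assign each sample to its class. The KSIR-LDA method was applied to the classification of mass spectrometry data from soft drinks. The results show that the method not only accommodates the nonlinear relationship between mass spectrometry data and properties but also achieves higher classification accuracy with fewer, more interpretable feature variables, and enables interpretation and visualization of the data in the low-dimensional feature space.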

18.
19.
The paper presents an approach that applies Partial Least Squares Discriminant Analysis (PLS-DA) to an X-ray powder diffractometry (XRPD) dataset to build a model that recognizes the presence (or absence) of a particular drug substance (acetaminophen) in an unknown mixture (OTC tablet). The dataset consisted of 33 XRPD signals, measured for 12 pure substances and 21 tablets containing them in different quantitative and qualitative ratios, along with unknown excipients. The model was built with an external validation dataset chosen by the Kennard-Stone algorithm. The RMSECV value was 0.3461 (87.8% of explained variance) and the external predictive error (RMSEP) was 0.3123 (86.2% of explained variance). The result suggests that small but properly prepared training datasets make it possible to construct well-working discriminant models on XRPD signals.
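The Kennard-Stone algorithm mentioned here picks a subset of samples that covers the data space evenly. A minimal sketch with Euclidean distances follows; the points are hypothetical, and the abstract does not say whether the selected samples or the remaining ones formed the external validation set.

```python
import math


def kennard_stone(points, n_select):
    """Select n_select sample indices that spread evenly over the data.

    Starts from the two most distant points, then repeatedly adds the point
    whose distance to its nearest already-selected point is largest.
    """
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        candidate = max(remaining,
                        key=lambda i: min(dist[i][s] for s in selected))
        selected.append(candidate)
        remaining.remove(candidate)
    return selected


# Hypothetical samples on a line: the two extremes are chosen first,
# then the point farthest from both of them.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
chosen = kennard_stone(points, n_select=3)
```

Because the selection depends only on the measured signals and not on the class labels, the resulting split is deterministic and reproducible for a given dataset.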

20.
Previous modelling of the median lethal dose (oral rat LD50) has indicated that local class-based models yield better correlations than global models. We evaluated the hypothesis that dividing the dataset by pesticidal mechanisms would improve prediction accuracy. A linear discriminant analysis (LDA)-based approach was used to assign indicators such as the pesticide target species, mode of action, or target species - mode of action combination. LDA models were able to predict these indicators with about 87% accuracy. Toxicity is then predicted using the QSAR model fit to chemicals with that indicator. Toxicity was also predicted using a global hierarchical clustering (HC) approach, which divides the data set into clusters based on molecular similarity. At a comparable prediction coverage (~94%), the global HC method yielded slightly higher prediction accuracy (r2 = 0.50) than the LDA method (r2 ~ 0.47). A single model fit to the entire training set yielded the poorest results (r2 = 0.38), indicating that there is an advantage to clustering the dataset to predict acute toxicity. Finally, this study shows that whilst dividing the training set into subsets (i.e. clusters) improves prediction accuracy, it may not matter which method (expert based or purely machine learning) is used to divide the dataset into subsets.
