Similar Literature
1.
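A feature selection method based on partial least squares discriminant analysis (PLS-DA) and the F-score is proposed and applied to the analysis of proteomic mass spectrometry data. The method comprises three steps: (1) preprocessing the raw data with the LIMPIC algorithm; (2) computing the F-score of each variable and ranking all variables in descending order of F-score; (3) selecting the optimal variable subset by forward selection, using PLS-DA with cross-validation. Applied to an ovarian cancer data set, the method selected 12 feature variables from the original 15154 mass-to-charge ratio variables as potential biomarkers. Their cross-validated specificity and sensitivity on the training set were 98.36% and 98.15%, respectively, and their specificity and sensitivity on an independent test set were 96.67% and 100%, respectively. A PCA performed on the selected variables showed that they separate the samples well, indicating that they capture the class information of the samples. The proposed method can be used for feature selection and sample classification of proteomic mass spectrometry data.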
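Step (2) above can be sketched in a few lines. The abstract does not give the F-score formula, so the block below assumes the common two-class definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). The function names `f_score` and `rank_by_f_score` are illustrative and do not come from the paper.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """Two-class F-score of a single variable (assumed definition).

    pos, neg: values of the variable in the two classes.
    A zero within-class variance in both classes would raise ZeroDivisionError.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within

def rank_by_f_score(X_pos, X_neg):
    """Return (column indices in descending F-score order, list of scores)."""
    n_vars = len(X_pos[0])
    scores = [
        f_score([row[j] for row in X_pos], [row[j] for row in X_neg])
        for j in range(n_vars)
    ]
    order = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy data: column 0 has the same mean in both classes, column 1 differs.
X_pos = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
X_neg = [[1.5, 9.0], [2.5, 9.5], [2.0, 10.5]]
order, scores = rank_by_f_score(X_pos, X_neg)
```

The ranked list would then feed the forward selection of step (3), where variables are added in this order while a cross-validated PLS-DA model is monitored.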

2.
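Stepwise extraction of complex chemical pattern features of natural plants   Total citations: 7, self-citations: 0, citations by others: 7
When neural computing techniques are used to classify complex chemical patterns with high dimensionality and small sample sizes, extracting pattern features to reduce the number of input variables makes the classification problem considerably easier to solve. Following the idea of class-correlated stepwise analysis, a stepwise feature extraction method for complex chemical patterns is proposed that effectively extracts, from the raw pattern data, the features most strongly correlated with the class index. Application to the identification of composition-activity relationships of natural plants shows that this feature extraction method is more practical and reliable than classical principal component analysis.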

3.
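Based on surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) data acquired in prostate cancer detection, a classification method combining probabilistic principal component analysis (PPCA) with a support vector machine (SVM) is proposed. Features were extracted from the mass spectra of 322 clinical serum samples; an SVM discriminant model was built on a randomly selected training set (225 samples) and tested on the remaining samples (97 samples). Using root mean square error, recognition rate and prediction rate as criteria, the PPCA-SVM model was compared with partial least squares (PLS) and PCA-SVM models. The recognition and prediction rates were 90.92% and 76.38% for the PLS model, 99.23% and 84.63% for the PCA-SVM model, and 99.01% and 90.41% for the PPCA-SVM model. SELDI-TOF-MS combined with PPCA-SVM therefore offers accurate and reproducible sample classification and provides a new method for the early diagnosis of prostate cancer.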

4.
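To address data processing in metabolomics research, this study established MS-IAS (Mass spectrometry based integrated analysis system), a data analysis system for mass spectrometry. The system integrates feature selection, clustering, classification and other methods for processing mass spectrometry data, and provides several statistical analysis methods for comparing the selected feature variables in order to discover potential biomarkers related to the problem under study. MS-IAS supports graphical display of the data and of the results of many of its algorithms, which helps in interpreting and analyzing the data. The functions of MS-IAS are demonstrated on mass spectrometry metabolomics data from patients with liver disease: two feature selection algorithms selected 40 feature variables from the data set that discriminate liver disease, showing the potential of MS-IAS as a general-purpose mass spectrometry data analysis system for metabolomics research.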

5.
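Mass spectrometry imaging can obtain, in a single experiment and without labeling, both the molecular information on a sample surface and its spatial distribution, and is a current focus of mass spectrometric analysis. The resulting data are large and complex, which makes their features difficult to extract. Multivariate statistical methods, principal component analysis (PCA) in particular, have been applied to the compression and feature extraction of mass spectrometry imaging data. However, PCA often produces negative values, which are hard to interpret and do not readily decompose into single features. This study developed an extraction method for mass spectrometry imaging data based on non-negative factorization. It extracts single molecular features and their distributions over the sample, and overlays several single-feature distributions in red, green and blue to give a composite feature distribution with clear outlines. Applied to mass spectrometry imaging data from a mouse brain tissue section, the method directly resolved the gray matter, white matter and background regions, and was more intuitive and easier to interpret than PCA. Applied to imaging data from sections of human bladder cancer tissue and adjacent non-cancerous tissue on the same sample target, it showed clear differences between cancerous and non-cancerous tissue. The mass spectrometry imaging software designed in this study is available at http://www.msimaging.net.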
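The contrast with PCA can be illustrated with a small non-negative matrix factorization. The abstract does not say which algorithm the authors used, so the block below is only a sketch that assumes the classic multiplicative update rules for the squared-error objective. Rows of `V` stand for pixels and columns for m/z channels; all names are hypothetical.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates, so W and H stay non-negative throughout.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H

# Four "pixels", three "m/z channels", two underlying components.
V = [[1, 0, 2], [2, 0, 4], [0, 3, 0], [0, 6, 0]]
W, H = nmf(V, rank=2)
```

Each column of `W` can be read as the spatial distribution of one component and each row of `H` as its spectrum. Mapping up to three columns of `W` to the red, green and blue channels would give the kind of overlay the abstract describes.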

6.
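High-dimensional, small-sample spectral data exhibit the characteristics of functional data, a nonlinear relationship with the property parameters, and severe collinearity among variables. To address these, a new technique is adopted that integrates a spline transformation with penalized partial least squares regression. It first applies a cubic B-spline basis transformation to reconstruct the nonlinear spectral data in linearized form, and then passes the reconstructed spectral matrix to penalized partial least squares (Penalized PLS) to build a calibration model relating it to the property variable; the smoothing factor in the penalty is optimized by cross-validation to control the fitting accuracy of the model. Finally, quantitative near-infrared analysis of the moisture content of wheat samples shows that the technique reconstructs the spectral data robustly, removes noise markedly, and effectively overcomes the overfitting and collinearity associated with high-dimensional small samples. The root mean square error of prediction (RMSEP) was 0.1808%, and the predictive ability of the nonlinear calibration model was clearly improved.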
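For reference, RMSEP is the square root of the mean squared difference between reference and predicted values on the prediction set. A minimal sketch follows; the function name and the toy numbers are illustrative and are not data from the study.

```python
import math

def rmsep(y_ref, y_pred):
    """Root mean square error of prediction over a prediction set."""
    if len(y_ref) != len(y_pred) or not y_ref:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(y_ref, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Toy moisture values in percent (made up for illustration).
error = rmsep([12.0, 13.5, 14.0], [12.0, 13.5, 16.0])
```

RMSEP carries the unit of the predicted property, which is why the abstract reports it in percent moisture.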

7.
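Infrared spectra of 35 disposable plastic lunch box samples were collected nondestructively. After data preprocessing with Savitzky-Golay smoothing, multiplicative scatter correction and other steps, the samples were first grouped by the cosine similarity between them. Principal component analysis was then used to extract principal components, the samples were classified by their principal component scores, and correlation analysis was used to interpret and choose among the differing classification results. Finally, Fisher discriminant analysis was used to verify the classification, and the classification quality was examined through the distribution of the class centroids and the between-class distances. The leave-one-out cross-validation accuracy of the discriminant analysis was 97.1%, showing that the discriminant model classifies well and can serve as a model basis for pattern recognition of unknown samples. The work provides a broadly applicable classification approach that is of some reference value for pattern recognition of physical evidence.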
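The preliminary grouping step rests on the cosine similarity between two spectra, which depends only on spectral shape and not on overall intensity. A minimal sketch, with made-up absorbance vectors and an illustrative function name:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Spectrum b is spectrum a at twice the intensity; c has a different shape.
a = [0.10, 0.40, 0.20]
b = [0.20, 0.80, 0.40]
c = [0.50, 0.05, 0.30]
same_shape = cosine_similarity(a, b)   # close to 1
other_shape = cosine_similarity(a, c)  # clearly below 1
```

How the study turned pairwise similarities into groups (for example, by a threshold) is not stated in the abstract, so that step is left out here.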

8.
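In the preprocessing of NMR metabolomics data, the main purpose of scaling is to increase the weight of information from characteristic metabolites and to reduce the influence of noise and irrelevant metabolites, thereby making subsequent pattern recognition analysis easier. This paper proposes a new scaling method that does not insist on bringing all variables to a common scale. Instead, starting from the raw data, it increases the weight of variables that are highly stable and differ significantly between sample classes, so as to enhance the information associated with characteristic metabolites. The performance of the new method was evaluated on both simulated and real metabolomics data and compared with commonly used scaling methods such as Unit Variance, Variable Stability and Level Scaling. The results show that the new method improves the predictive ability of multivariate statistical models, preserves the molecular information of the NMR spectra well, helps in identifying characteristic metabolites, and makes the results of subsequent data analysis more interpretable.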
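Two of the baseline methods named above have simple closed forms, assuming their usual definitions: unit variance scaling divides each mean-centered variable by its standard deviation, and level scaling divides it by its mean. The sketch below covers only these two baselines; the new method is not specified in enough detail in the abstract to reproduce, and the function names are illustrative.

```python
from statistics import mean, stdev

def unit_variance_scale(values):
    """Mean-center one variable and divide by its sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def level_scale(values):
    """Mean-center one variable and divide by its mean (relative change)."""
    m = mean(values)
    return [(v - m) / m for v in values]

# One spectral bin measured across three samples.
intensities = [1.0, 2.0, 3.0]
uv = unit_variance_scale(intensities)  # [-1.0, 0.0, 1.0]
lv = level_scale(intensities)          # [-0.5, 0.0, 0.5]
```

Unit variance scaling gives every variable the same weight, including noisy ones, which is the behaviour the proposed method tries to avoid by favouring stable, class-discriminating variables.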

9.
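A new method for the absolute quantification of proteins was established by combining metal labeling with high-performance liquid chromatography and selected ion monitoring (SIM) mass spectrometry. The labeling efficiency, the stability of the metal label, the chromatographic retention and mass spectrometric behavior of the labeled peptides, and the linear range and accuracy of the new method were examined. The results show that metal labeling offers high labeling efficiency, good stability and consistent chromatographic retention behavior. In addition, the metal labeling SIM absolute quantification method is highly sensitive: its limit of quantification is as low as 1 fmol, its linear range is 1 to 500 fmol, and R² exceeds 0.99 within the linear range, indicating good linearity. The measured recovery of the standard peptide was 117.01%, which the authors take to indicate high accuracy. When the method was applied to the quantification of enolase in Thermoanaerobacter tengcongensis, the relative standard deviation was 5.47%, indicating high precision. These results show that the method can be used for the absolute quantification of proteins in biological samples and offers a new option for absolute protein quantification in relatively simple biological samples.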

10.
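Luo Mingliang, Li Menglong. Acta Chimica Sinica (《化学学报》), 2000, 58(11): 1409-1412
To handle the nonlinear relationships typical of chemistry, a "hybrid" BP network is proposed on the basis of the conventional BP network. It contains two hidden layers and direct connections from the input layer to the output layer. It can account well for linear and nonlinear relationships that coexist in the data, and it performs better than multiple regression and the ordinary BP algorithm.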
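The architecture described above can be sketched as a forward pass in which the output is the sum of a nonlinear path through two hidden layers and a linear path straight from the inputs. The abstract gives no activation function or layer sizes, so the sigmoid units, the linear output and every parameter name below are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, x, b):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def hybrid_forward(x, params):
    """Forward pass: two sigmoid hidden layers plus a direct input-to-output term."""
    h1 = [sigmoid(z) for z in dense(params["W1"], x, params["b1"])]
    h2 = [sigmoid(z) for z in dense(params["W2"], h1, params["b2"])]
    nonlinear = dense(params["W3"], h2, params["b3"])                # via hidden layers
    linear = dense(params["W_direct"], x, [0.0] * len(nonlinear))   # direct connection
    return [n + l for n, l in zip(nonlinear, linear)]

# Hypothetical parameters: 2 inputs, 2 units per hidden layer, 1 output.
params = {
    "W1": [[0.3, -0.2], [0.1, 0.4]], "b1": [0.0, 0.1],
    "W2": [[0.5, 0.5], [-0.3, 0.2]], "b2": [0.0, 0.0],
    "W3": [[0.7, -0.6]], "b3": [0.05],
    "W_direct": [[0.5, -1.0]],
}
y = hybrid_forward([1.0, 2.0], params)
```

With this layout the direct weights can absorb the linear part of the relationship, leaving the hidden layers to model the remaining nonlinear part. Training by backpropagation is omitted here.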

11.
Kernel partial least squares (KPLS), a nonlinear extension of linear PLS in which training samples are transformed into a feature space via a nonlinear mapping, has become a popular technique for regression and classification of complex data sets. The PLS algorithm can then be carried out in the feature space. In the present study, we attempt to develop a novel tree KPLS (TKPLS) classification algorithm by constructing an informative kernel on the basis of decision tree ensembles. The constructed tree kernel can effectively discover the similarities of samples and select informative features by variable importance ranking in the process of building the kernel. At the same time, this kernel allows TKPLS to handle nonlinear relationships in structure–activity relationship data. Finally, three data sets related to different categorical bioactivities of compounds are used to evaluate the performance of TKPLS. The results show that the TKPLS algorithm can be regarded as a promising alternative classification technique. Copyright © 2013 John Wiley & Sons, Ltd.

12.
The feasibility of utilizing an Adaboost algorithm in conjunction with near-infrared (NIR) spectroscopy to automatically distinguish cigarettes of different brands was explored. Simple linear discriminant analysis (LDA) was used as the base algorithm to train all weak classifiers in Adaboost. Both principal component analysis (PCA) and its kernel version (kernel principal component analysis, KPCA) were used for feature extraction and were also compared to each other. The influence of the training set size on the final classification model was also investigated. Using a case study, it was demonstrated that Adaboost coupled with PCA or KPCA can clearly improve the ability to discriminate between samples that cannot be separated by a single linear classifier. However, in terms of overall performance, KPCA appears preferable to PCA for feature extraction, especially when the training set is relatively small. The results also indicate that more training samples should be used, if possible, in order to fully demonstrate the superiority of Adaboost. It seems that the use of an Adaboost algorithm in conjunction with NIR spectroscopy, in combination with KPCA for feature extraction, is a promising tool for distinguishing cigarettes of different brands, especially in situations where there is an obvious overlap between the NIR spectra afforded by cigarettes of different brands.
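The reason boosting can help where a single linear classifier fails lies in its reweighting step: after each round, misclassified samples gain weight so that the next weak classifier concentrates on them. The abstract does not state which Adaboost variant was used, so the sketch below assumes the standard discrete two-class form with labels +1 and -1. The weak learner itself (LDA in the study) is represented only by its predictions, and the function name is illustrative.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost update for labels in {+1, -1}.

    Returns the classifier weight alpha and the renormalized sample weights.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a wrong one.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four samples with uniform weights; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, 1, -1, 1]
)
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right. The final decision is the sign of the alpha-weighted sum of the weak classifiers' votes.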

13.
14.
Idiosyncratic drug toxicity (IDT), considered as a toxic host-dependent event with an apparent lack of a dose-response relationship, is usually not predictable from early phases of clinical trials and represents a particularly confounding complication in drug development. Albeit a rare event (usually <1/5000), IDT is often life-threatening and is one of the major reasons new drugs never reach the market or are withdrawn post marketing. Computational methodologies, like the computer-based approach proposed in the present study, can play an important role in addressing IDT in early drug discovery. We report for the first time a systematic evaluation of classification models to predict idiosyncratic hepatotoxicity based on linear discriminant analysis (LDA), artificial neural networks (ANN), and machine learning algorithms (OneR) in conjunction with a 3D molecular structure representation and feature selection methods. These modeling techniques (LDA, feature selection to prevent over-fitting and multicollinearity, ANN to capture nonlinear relationships in the data, as well as the simple OneR classifier) were found to produce QSTR models with satisfactory internal cross-validation statistics and predictivity on an external subset of chemicals. More specifically, the models reached values of accuracy/sensitivity/specificity over 84%/78%/90%, respectively, in the training series, along with predictivity values ranging from ca. 78 to 86% of correctly classified drugs. An LDA-based desirability analysis was carried out in order to select the levels of the predictor variables needed to trigger the more desirable drug, i.e. the drug with lower potential for idiosyncratic hepatotoxicity. Finally, two external test sets were used to evaluate, respectively, the ability of the models to discriminate toxic from nontoxic structurally and pharmacologically related drugs and the ability of the best model (LDA) to detect potential idiosyncratic hepatotoxic drugs. The computational approach proposed here can be considered a useful tool in early IDT prognosis.
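The three statistics quoted above follow directly from the confusion matrix of a binary classifier. A minimal sketch, using made-up labels in which 1 stands for a toxic drug; the function name is illustrative:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of positives detected
        "specificity": tn / (tn + fp),  # fraction of negatives kept negative
    }

# Toy example: 3 toxic and 5 nontoxic drugs (labels invented for illustration).
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
)
```

Reporting all three matters for a rare event such as IDT, because a model that labels every drug nontoxic would still reach high accuracy while having zero sensitivity.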

15.
In tobacco research, the comparison of different tobacco blends as well as the puff-dependent behaviour of cigarettes is a matter of particular interest. For the investigation of smoke characteristics, GC x GC offers different ways for data analysis, namely, compound target analysis, automated peak-based compound classification and comprehensive pixel-based data analysis. This study will show the application as well as the pros and cons of these types of data analysis for very complex matrices like cigarette particulate matter. In addition, new aspects about the recently discovered puff-dependent behaviour of compounds in cigarette smoke will be presented. Automated peak-based compound classification including mass spectrometric pattern recognition is used for the classification of tobacco particulate matter samples and the puff-dependent investigation of different compound classes. This compound group specific analysis is further reinforced by applying an even more comprehensive pixel-based analysis. This kind of analysis is used to generate fingerprints of different types of cigarettes. The combination of fast feature reduction methods like analysis of variance (ANOVA) and t-test with multivariate feature transformation methods like partial least squares discriminant analysis (PLSDA) for feature selection provides a powerful tool for a detailed inspection of different types of cigarettes.

16.
The possibility provided by Chemometrics to extract and combine (fusion) information contained in NIR and MIR spectra in order to discriminate monovarietal extra virgin olive oils according to olive cultivar (Casaliva, Leccino, Frantoio) has been investigated. Linear discriminant analysis (LDA) was applied as a classification technique on these multivariate and non-specific spectral data both separately and jointly (NIR and MIR data together). In order to ensure a more appropriate ratio between the number of objects (samples) and number of variables (absorbance at different wavenumbers), LDA was preceded either by feature selection or variable compression. For feature selection, the SELECT algorithm was used while a wavelet transform was applied for data compression. Correct classification rates obtained by cross-validation varied between 60% and 90% depending on the followed procedure. Most accurate results were obtained using the fused NIR and MIR data, with either feature selection or data compression. Chemometrical strategies applied to fused NIR and MIR spectra represent an effective method for classification of extra virgin olive oils on the basis of the olive cultivar.

17.
Biomarker discovery is a challenging task in bioinformatics, especially when targeting high-dimensional problems such as SNP (single nucleotide polymorphism) datasets. Various types of feature selection methods can be applied to accomplish this task. Typically, using features versus class labels of samples in the training dataset, these methods aim at selecting feature subsets with maximal classification accuracies. Although finding such class-discriminative features is crucial, selecting relevant SNPs to maximize other properties inherent in population genetics, such as the correlation between genetic diversity and geographical distance of ethnic groups, can be equally important. In this work, a methodology using a multi-objective optimization technique called Pareto Optimal is utilized for selecting SNP subsets offering both high classification accuracy and correlation between genomic and geographical distances. In this method, the discriminatory power of an SNP is determined using mutual information, and its contribution to the genomic–geographical correlation is estimated using its loadings on principal components. Combining these objectives, the proposed method identifies SNP subsets that can better discriminate ethnic groups than those obtained with mutual information alone and yield higher correlation than those obtained with principal components alone on the Human Genome Diversity Project (HGDP) SNP dataset.
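The core of a Pareto-based selection is the dominance test: a candidate is kept only if no other candidate is at least as good on every objective and strictly better on one. The sketch below applies that test to hypothetical SNP subsets scored by two objectives to be maximized (classification accuracy and genomic-geographical correlation). The scores and function names are invented for illustration, and the paper's actual search procedure is not reproduced.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (accuracy, correlation) of four hypothetical SNP subsets.
candidates = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.9)]
front = pareto_front(candidates)
# (0.7, 0.4) is dropped because (0.8, 0.5) beats it on both objectives.
```

The front contains the trade-offs between the two objectives; choosing one subset from it still requires a preference between accuracy and correlation.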

18.
19.
20.
The main triacylglycerol (TAG) composition of different plant oils (almond, avocado, corn germ, grape seed, linseed, mustard seed, olive, peanut, pumpkin seed, sesame seed, soybean, sunflower, walnut and wheat germ) was analyzed using two different mass spectrometric techniques: HPLC/APCI-MS (high-performance liquid chromatography/atmospheric pressure chemical ionization mass spectrometry) and MALDI-TOFMS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry). Linear discriminant analysis (LDA) as a multivariate mathematical statistical method was successfully used to distinguish different plant oils based on their relative TAG composition. With LDA analysis of either APCI-MS or MALDI-MS data, the classification among the almond, avocado, grape seed, linseed, mustard seed, olive, sesame seed and soybean oil samples was 100% correct. In both cases only 6 different oil samples from a total of 73 were not classified correctly.
