Similar Literature
20 similar documents found.
1.
Kernel partial least squares (KPLS), a nonlinear extension of linear PLS, has become a popular technique for chemical and biological modeling. Training samples are transformed into a feature space via a nonlinear mapping, and the PLS algorithm is then carried out in that feature space. One of the main limitations of KPLS, however, is that every feature is given the same importance in the kernel matrix, which explains its poor performance on data with many irrelevant features. In this study, we propose a new strategy that incorporates variable importance into KPLS, termed the WKPLS approach. By modifying the kernel matrix, WKPLS provides a feasible way to differentiate between true and noise variables. Based on the fact that the regression coefficients of a PLS model reflect the importance of the variables, we first obtain normalized regression coefficients by building a PLS model on all variables, and then incorporate this variable importance into the primary kernel. The performance of WKPLS is investigated on one simulated dataset and two structure–activity relationship (SAR) datasets. Compared with standard linear-kernel and Gaussian-kernel PLS, WKPLS yields superior prediction performance. WKPLS can therefore be regarded as a sound mechanism for improving KPLS in SAR modeling by introducing extra information.
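A minimal sketch of the variable-weighting idea described above, assuming a NumPy/scikit-learn environment: the absolute normalized PLS regression coefficients rescale the variables before the Gaussian kernel is formed, and kernel ridge regression stands in for the kernel-PLS step (the paper's own KPLS algorithm is not reproduced). Data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50))                    # 10 informative + 40 noise variables
y = X[:, :10] @ rng.normal(size=10) + 0.1 * rng.normal(size=80)

# Step 1: linear PLS on all variables; |coefficients| reflect variable importance.
pls = PLSRegression(n_components=5).fit(X, y)
w = np.abs(pls.coef_).ravel()
w /= w.max()                                     # normalized importance weights

# Step 2: build a weighted Gaussian kernel by rescaling each variable by its weight.
Xw = X * w
K = rbf_kernel(Xw, gamma=1.0 / Xw.shape[1])

# Step 3: kernel regression on the weighted kernel (stand-in for the KPLS step).
model = KernelRidge(kernel="precomputed", alpha=1e-2).fit(K, y)
print("fitted R^2 on training data:", model.score(K, y))
```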

2.
Data from high-throughput metabolomics experiments are becoming increasingly large and complex, which poses considerable challenges for existing statistical modeling. There is therefore a need for statistically efficient approaches to mining the underlying metabolite information contained in such data. In this work, we developed a novel kernel Fisher discriminant analysis (KFDA) algorithm that constructs an informative kernel from a decision tree ensemble. The constructed kernel effectively encodes the similarities between metabolomics samples with respect to informative metabolites/biomarkers in specific parts of the measurement space. At the same time, informative metabolites or potential biomarkers can be discovered through variable importance ranking during kernel construction. Moreover, with such a kernel, KFDA can also capture nonlinear relationships in the metabolomics data to some extent. Finally, two real metabolomics datasets and one simulated dataset were used to demonstrate the performance of the proposed approach in comparison with alternative methods.
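As a rough illustration of the kernel-construction step, assuming scikit-learn: a random forest supplies both a "same leaf" sample-similarity kernel and a variable-importance ranking, and an SVM with a precomputed kernel stands in for the kernel Fisher discriminant classifier, which scikit-learn does not provide. All data and names are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))                  # 200 "metabolites", few informative
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

# Kernel entry K[i, j] = fraction of trees in which samples i and j land in the same leaf.
leaves = forest.apply(X)                         # shape (n_samples, n_trees)
K = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

clf = SVC(kernel="precomputed").fit(K, y)        # stand-in for the KFDA classifier
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top candidate biomarkers:", ranking[:5])
```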

3.
Kernel partial least squares (KPLS) and support vector regression (SVR) have become popular techniques for the regression of complex non-linear data sets. Modeling is performed by mapping the data into a higher-dimensional feature space through the kernel transformation. The disadvantage of such a transformation, however, is that information about the contribution of the original variables to the regression is lost. In this paper we introduce a method that can retrieve and visualize both the contribution of the variables to the regression model and the way in which they contribute to the regression of complex data sets. The method is based on visualizing trajectories of so-called pseudo samples representing the original variables in the data. We test and illustrate the proposed method on several synthetic and real benchmark data sets. The results show that for both linear and non-linear regression models the important variables were identified, with correspondingly linear or non-linear trajectories. The results were verified by comparison with ordinary PLS regression and by rebuilding a model using only those variables indicated as important.
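A hedged sketch of the pseudo-sample idea, assuming scikit-learn: for one original variable, artificial samples are built that vary only that variable (the others held at their mean), pushed through a kernel model, and the resulting response trajectory is inspected. SVR stands in for the paper's kernel PLS/SVR models; the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(120, 8))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=120)

model = SVR(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

def pseudo_trajectory(model, X, var, n_points=25):
    """Sweep one variable across its observed range with the others fixed at their mean."""
    grid = np.linspace(X[:, var].min(), X[:, var].max(), n_points)
    pseudo = np.tile(X.mean(axis=0), (n_points, 1))
    pseudo[:, var] = grid
    return grid, model.predict(pseudo)

for var in (0, 1, 7):                            # two informative variables and one noise variable
    grid, traj = pseudo_trajectory(model, X, var)
    print(f"variable {var}: trajectory span = {traj.max() - traj.min():.3f}")
```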

4.
5.
To address the overfitting that easily arises when modeling high-dimensional, small-sample mass spectrometry data, the severe collinearity among variables, and the nonlinear relationship between structure and property, a new technique combining kernel sliced inverse regression (KSIR) feature extraction with linear discriminant analysis (LDA) was adopted. The KSIR algorithm is first used to extract nonlinear features from the mass spectrometry data; linear discriminant functions for the sample classes are then constructed in the low-dimensional space spanned by the new feature vectors and used to assign each sample to a class. Applying the KSIR-LDA method to the classification of mass spectrometry data of soft drinks shows that the method not only accommodates the nonlinear relationship between the mass spectra and the property of interest, but also achieves higher classification accuracy with fewer, more interpretable feature variables, and enables interpretation and visualization of the data in the low-dimensional feature space.
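A compact sketch of the two-stage pipeline (nonlinear feature extraction, then LDA in the reduced space), assuming scikit-learn. Kernel PCA stands in for the KSIR step, which scikit-learn does not implement; the data are synthetic stand-ins for mass spectra.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 500))                   # high-dimensional, small-sample "spectra"
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 0.5).astype(int)

# Nonlinear feature extraction (stand-in for KSIR), then a linear discriminant in the low-dimensional space.
pipe = make_pipeline(KernelPCA(n_components=5, kernel="rbf", gamma=1e-3),
                     LinearDiscriminantAnalysis())
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```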

6.
Different calibration techniques are available for spectroscopic applications that show nonlinear behavior. This comprehensive study compares several nonlinear calibration techniques: kernel PLS (KPLS), support vector machines (SVM), least-squares SVM (LS-SVM), relevance vector machines (RVM), Gaussian process regression (GPR), artificial neural networks (ANN), and Bayesian ANN (BANN). Partial least squares (PLS) regression is used as a linear benchmark, and ridge regression (RR) is included to relate the methods to traditional calibration. The performance of the different methods is demonstrated through practical application to three real-life near-infrared (NIR) data sets. Various aspects of the approaches, including computational time, model interpretability, potential over-fitting when non-linear models are applied to linear problems, robustness to small or medium sample sets, and robustness to pre-processing, are discussed. The results suggest that GPR and BANN are powerful and promising methods for handling linear as well as nonlinear systems, even when the data sets are moderately small. LS-SVM is also attractive owing to its good predictive performance for both linear and nonlinear calibrations.
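A small, hedged sketch of benchmarking linear versus nonlinear calibration methods by cross-validation, in the spirit of the comparison above. Only methods available in scikit-learn are used (PLS as the linear benchmark, GPR and SVR as nonlinear examples); the data are synthetic stand-ins for NIR spectra.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(size=(100, 60))                                      # pseudo-spectra
y = np.exp(-2 * X[:, 10]) + X[:, 30] + 0.02 * rng.normal(size=100)   # mildly nonlinear response

models = {
    "PLS (linear benchmark)": PLSRegression(n_components=5),
    "GPR": GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3), normalize_y=True),
    "SVR": SVR(kernel="rbf", C=10.0, gamma="scale"),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5)       # default scorer is R^2 for all three
    print(f"{name}: mean CV R^2 = {r2.mean():.3f}")
```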

7.
8.
Compared with statistical analysis and neural networks, support vector machines based on structural risk minimization offer better classification performance. When they are used for nonlinear classification, the samples are first mapped into a higher-dimensional feature space, which often increases multicollinearity and redundant information, distorts the sample distribution, and degrades the predictive performance of the linear support vector classifier (LSVC). This study proposes a nonlinear classification-related component analysis algorithm (NLCCA) that uses the kernel trick, without requiring an explicit expression for the nonlinear mapping, to extract classification-related components from the sample images in feature space, thereby eliminating redundant information and improving the sample distribution. The NLCCA-LSVC ensemble classifier constructed in this way shows excellent predictive performance. Tests on simulated data and application to two complex chemical pattern recognition problems both gave satisfactory results, confirming the effectiveness of the algorithm.

9.
The nearest shrunken centroid (NSC) classifier has been successfully applied for class prediction in a wide range of studies based on microarray data. The contribution of seemingly irrelevant variables to the classifier is minimized by the so-called soft-thresholding property of the approach. In this paper, we first show that for the two-class prediction problem, the NSC classifier is similar to a one-component discriminant partial least squares (PLS) model with soft shrinkage of the loading weights. We then introduce soft-threshold-PLS (ST-PLS) as a general discriminant PLS model with soft-thresholding of the loading weights of multiple latent components. This method is especially suited for classification and variable selection when the number of variables is large compared to the number of samples, which is typical for gene expression data. A characteristic feature of ST-PLS is its ability to identify important variables along multiple directions in the variable space. Both the ST-PLS and the NSC classifiers are applied to four real data sets. The results indicate that ST-PLS performs better than the shrunken centroid approach when several directions in the variable space are important for classification and there are strong dependencies between subsets of variables.
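A hedged sketch of soft-thresholding a PLS loading-weight vector, assuming NumPy and covering only the first latent component (the paper's ST-PLS thresholds the weights of several components). The data and threshold value are illustrative.

```python
import numpy as np

def soft_threshold(w, delta):
    """Shrink toward zero and clip: sign(w) * max(|w| - delta, 0)."""
    return np.sign(w) * np.maximum(np.abs(w) - delta, 0.0)

rng = np.random.default_rng(6)
n, p = 40, 1000                                   # p >> n, as in gene expression data
y = np.repeat([0.0, 1.0], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :30] += 2.0                             # 30 truly informative "genes"

Xc, yc = X - X.mean(axis=0), y - y.mean()
w = Xc.T @ yc                                     # classical PLS loading weights (component 1)
w = soft_threshold(w / np.abs(w).max(), delta=0.5)
if np.linalg.norm(w) > 0:
    w /= np.linalg.norm(w)

t = Xc @ w                                        # score on the sparse latent component
print("variables retained:", int(np.count_nonzero(w)), "of", p)
print("class separation of scores:", t[y == 1].mean() - t[y == 0].mean())
```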

10.
This paper introduces a technique to visualise the information content of the kernel matrix and a way to interpret the ingredients of the Support Vector Regression (SVR) model. Recently, the use of Support Vector Machines (SVM) for solving classification (SVC) and regression (SVR) problems has increased substantially in the field of chemistry and chemometrics, mainly because of their high generalisation performance and their ability to model non-linear relationships in a unique and global manner. Modeling of non-linear relationships is enabled by applying a kernel function, which transforms the input data, usually non-linearly related to the associated output property, into a high-dimensional feature space where the non-linear relationship can be represented in a linear form. Usually, SVMs are applied as a black-box technique, so the model cannot be interpreted in the way that, for example, Partial Least Squares (PLS) can: the PLS scores and loadings make it possible to visualise and understand the driving force behind the optimal PLS machinery. In this study, we have investigated the possibilities for visualising and interpreting the SVM model, focusing exclusively on Support Vector Regression to demonstrate these visualisation and interpretation techniques. Our observations show that an SVR black-box model can now be turned into a transparent and interpretable regression modeling technique.

11.
Advances in sensory systems have led to many industrial applications with large amounts of highly correlated data, particularly in chemical and pharmaceutical processes. With these correlated data sets, it becomes important to consider advanced modeling approaches built to deal with correlated inputs in order to understand the underlying sources of variability and how this variability will affect the final quality of the product. In addition to their correlated nature, these data matrices commonly contain missing elements and noise. Latent variable regression methods such as partial least squares or projection to latent structures (PLS) have gained much attention in industry for their ability to handle ill-conditioned matrices with missing elements. This feature of the PLS method is accomplished through the nonlinear iterative PLS (NIPALS) algorithm, with a simple modification to account for the missing data. Moreover, in expectation maximization PLS (EM-PLS), imputed values are provided for missing data elements as initial estimates, conventional PLS is then applied to update these elements, and the process iterates to convergence. This study extends our previous work on principal component analysis (PCA), where we introduced nonlinear programming (NLP) as a means to estimate the parameters of the PCA model; here, we focus on the parameters of a PLS model. As an alternative to modified NIPALS and EM-PLS, this paper presents an efficient NLP-based technique to find the model parameters for PLS, in which the desired properties of the parameters can be posed explicitly as constraints in the optimization problem. We also present a number of simulation studies in which we compare the effectiveness of the proposed algorithm with competing algorithms.
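A toy illustration of posing a PLS parameter estimate as a constrained nonlinear program, assuming SciPy. It covers only the first weight vector of a one-component PLS model, without the missing-data handling that is the focus of the paper above, and compares the NLP solution with the known closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)
Xc, yc = X - X.mean(axis=0), y - y.mean()

def neg_cov(w):
    # negative covariance-type objective (X_c w) . y_c, to be minimized
    return -float((Xc @ w) @ yc)

# First PLS weight vector posed as an NLP: maximize cov(Xw, y) subject to ||w|| = 1.
constraint = {"type": "eq", "fun": lambda w: float(w @ w) - 1.0}
res = minimize(neg_cov, x0=np.ones(10) / np.sqrt(10), constraints=[constraint])
w_nlp = res.x

w_closed = (Xc.T @ yc) / np.linalg.norm(Xc.T @ yc)   # classical first PLS weight vector
print("agreement with the closed form:", abs(w_nlp @ w_closed))
```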

12.
李琳  陈德钊  束志恒  叶子青 《分析化学》2005,33(8):1091-1094
Chemical data mining can extract latent knowledge from massive data sets, and the decision tree is an important mining tool. Given the limitations of decision trees in handling continuous data, this study proposes preprocessing the data first by discretizing the continuous attributes and removing redundancy through feature selection, and then building the decision tree on this basis. The approach prevents the tree model from becoming too fine-grained and gives it good predictive performance. Applied to two chemical sample classification examples, the method worked well. Compared with Bayesian analysis and a single decision tree, its prediction accuracy is significantly higher, and its representation is intuitive and explicit, easy to understand and analyze, making it well suited to mining chemical classification knowledge patterns.
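A minimal sketch of the pipeline described above (discretize continuous attributes, drop redundant ones by feature selection, then grow a decision tree), assuming scikit-learn; the particular discretizer, selector, and data are illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 30))                       # continuous chemical measurements
y = (X[:, 0] + X[:, 3] > 0).astype(int)              # only two attributes matter

pipe = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),  # discretize
    SelectKBest(mutual_info_classif, k=5),           # discard redundant attributes
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean().round(3))
```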

13.
14.
With the aim of developing a nonlinear tool for near-infrared spectral (NIRS) calibration, an algorithm called MIKPLS is designed by combining two strategies: mutual information (MI) for interval selection and kernel partial least squares (KPLS) for modeling. Because mutual information can capture linear and nonlinear dependencies between variables simultaneously, the MI between each candidate variable and the target is calculated and used to induce a continuous wavelength interval, on which a parsimonious calibration model is subsequently built with kernel partial least squares. Experiments on two datasets indicate that MI-induced interval selection followed by KPLS forms a very simple and practical tool, allowing a prediction model to be constructed from a much-reduced set of neighboring variables without any loss of generalization and, in fact, with improved prediction performance.
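A hedged sketch of MI-guided interval selection followed by kernel regression, assuming scikit-learn. Kernel ridge regression stands in for KPLS, and the interval is simply a fixed-width window centred on the wavelength with the highest mutual information; the paper's actual interval-induction rule is not reproduced. The data are synthetic pseudo-spectra.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n_wl = 200
X = rng.uniform(size=(120, n_wl))                       # pseudo NIR spectra
y = np.sin(4 * X[:, 95]) + X[:, 100] + 0.05 * rng.normal(size=120)

mi = mutual_info_regression(X, y, random_state=0)       # MI of each wavelength with the target
centre = int(np.argmax(mi))
half_width = 10
interval = slice(max(0, centre - half_width), min(n_wl, centre + half_width + 1))
print("selected wavelength interval:", interval.start, "to", interval.stop - 1)

krr = KernelRidge(kernel="rbf", alpha=1e-2)             # stand-in for the KPLS model
print("CV R^2 on the interval:",
      cross_val_score(krr, X[:, interval], y, cv=5).mean().round(3))
```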

15.
Du W  Gu T  Tang LJ  Jiang JH  Wu HL  Shen GL  Yu RQ 《Talanta》2011,85(3):1689-1694
As a greedy search algorithm, classification and regression tree (CART) modeling easily falls into overfitting when applied to microarray gene expression data. A straightforward solution is to filter out irrelevant genes by identifying significant ones. Because significant genes with multi-modal expression patterns, which exhibit systematic differences among within-class samples, are difficult to identify with existing methods, a strategy of unimodal transformation of variables selected by interval segmentation purity (UTISP) for CART modeling is proposed. First, significant genes exhibiting varied expression patterns are identified by a variable selection method based on interval segmentation purity. Then, a unimodal transform is applied, providing unimodally distributed feature variables for CART modeling via feature extraction. Because significant genes with complex expression patterns can be properly identified and unimodal features extracted in advance, this strategy potentially improves the ability of CART to combat overfitting or underfitting when modeling microarray data. The strategy is demonstrated using two microarray data sets. The results reveal that UTISP-based CART outperforms k-nearest neighbors and CARTs coupled with other gene-identification strategies, indicating that UTISP-based CART holds great promise for microarray data analysis.

16.
We describe the application of particle swarms to the development of quantitative structure-activity relationship (QSAR) models based on k-nearest neighbor and kernel regression. The particle swarm is a population-based stochastic search method based on the principles of social interaction: each individual explores the feature space guided by its previous success and that of its neighbors. Success is measured by leave-one-out (LOO) cross-validation of the resulting model as determined by k-nearest neighbor kernel regression. The technique is shown to compare favorably with simulated annealing on three classical data sets from the QSAR literature.
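A hedged sketch of the search objective described above, assuming scikit-learn: feature subsets are scored by leave-one-out cross-validation of a k-nearest-neighbour regressor. For brevity a crude random bit-flip search replaces the particle swarm itself, and distance-weighted kNN stands in for kernel regression; everything shown is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(10)
X = rng.normal(size=(60, 15))                       # 15 candidate descriptors
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=60)

def loo_score(mask):
    """LOO score of a kNN model built on the selected descriptors (higher is better)."""
    if not mask.any():
        return -np.inf
    knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
    return cross_val_score(knn, X[:, mask], y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

mask = rng.random(15) < 0.5                          # random initial subset
best = loo_score(mask)
for _ in range(100):                                 # simple bit-flip search instead of a swarm
    trial = mask.copy()
    trial[rng.integers(15)] ^= True
    score = loo_score(trial)
    if score > best:
        mask, best = trial, score
print("selected descriptors:", np.flatnonzero(mask))
```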

17.
Support vector machines in water quality management
Support vector classification (SVC) and regression (SVR) models were constructed and applied to surface water quality data to optimize the monitoring program. The data set comprised 1500 water samples representing 10 different sites monitored for 15 years. The objectives of the study were to classify the sampling sites (spatial) and months (temporal) so as to group those that are similar in terms of water quality, with a view to reducing their number, and to develop a suitable SVR model for predicting the biochemical oxygen demand (BOD) of water from a set of variables. The spatial and temporal SVC models grouped the 10 monitoring sites and 12 sampling months into three clusters each, with misclassification rates of 12.39% and 17.61% in training, 17.70% and 26.38% in validation, and 14.86% and 31.41% in test sets, respectively. The SVR model predicted water BOD values in the training, validation, and test sets with reasonably high correlation (0.952, 0.909, and 0.907) with the measured values and low root mean squared errors of 1.53, 1.44, and 1.32, respectively. The values of the performance criteria indicated the adequacy of the constructed models and their good predictive capability. The SVC model achieved a data reduction of 92.5% for redesigning the future monitoring program, and the SVR model provided a tool for predicting water BOD from a small set of measurable variables. The nonlinear models (SVM, KDA, KPLS) performed comparably to one another and better than the corresponding linear classification and regression methods (DA, PLS).
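A small illustration of the two model types used above, assuming scikit-learn: an SVC that groups samples by site and an SVR that predicts BOD from a few measured variables. The data are random stand-ins for the monitoring records, not the study's data set.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(11)
n = 300
site_group = rng.integers(0, 3, size=n)             # 3 latent site clusters
X = rng.normal(size=(n, 6)) + site_group[:, None]   # 6 water-quality variables
bod = 1.5 * X[:, 0] + 0.8 * X[:, 3] + rng.normal(scale=1.0, size=n)

X_tr, X_te, g_tr, g_te, b_tr, b_te = train_test_split(X, site_group, bod, random_state=0)

svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)).fit(X_tr, g_tr)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X_tr, b_tr)

print("site-cluster classification accuracy:", round(svc.score(X_te, g_te), 3))
print("BOD prediction R^2:", round(svr.score(X_te, b_te), 3))
```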

18.
Recently we proposed a new variable selection algorithm based on the clustering of variables concept (CLoVA) for classification problems. Here the same concept is applied to a regression problem, and the results are compared with conventional variable selection strategies for PLS. The basic idea behind the clustering of variables is that the instrument channels are grouped into different clusters by a clustering algorithm, and the spectral data of each cluster are then subjected to PLS regression. Several real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of variable clustering on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS with respect to the other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses combinations of clusters, is proposed as an efficient modification of the CLoVA algorithm. The statistical parameters indicate that variable clustering can separate the useful part of the spectrum from the redundant part, and that a stable model can then be built on the informative clusters.
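A hedged sketch of the clustering-of-variables idea, assuming scikit-learn: wavelengths are clustered (here by k-means on the transposed, standardized data matrix), a PLS model is fitted on each variable cluster, and the clusters are compared by cross-validated error. The data and the choice of k-means are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = rng.normal(size=(80, 120))                               # 120 spectral channels
X[:, 40:60] += np.outer(rng.normal(size=80), np.ones(20))    # one correlated, informative band
y = X[:, 40:60].mean(axis=1) + 0.05 * rng.normal(size=80)

channels = StandardScaler().fit_transform(X).T               # one row per variable
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(channels)

for k in range(4):
    cols = np.flatnonzero(labels == k)
    pls = PLSRegression(n_components=min(3, len(cols)))
    r2 = cross_val_score(pls, X[:, cols], y, cv=5).mean()    # default scorer is R^2
    print(f"cluster {k}: {len(cols)} channels, CV R^2 = {r2:.3f}")
```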

19.
We present a novel algorithm for linear multivariate calibration that can generate good prediction results. This is accomplished through the idea that a test sample can be expressed as a mixture of the calibration samples in appropriate proportions. The algorithm is based on this mixed model of samples and is therefore called the MMS algorithm. With both theoretical support and the analysis of two data sets, it is demonstrated that the MMS algorithm produces lower prediction errors than the PLS2 model and has prediction performance similar to PLS1. In a test of resistance to background interference, MMS performs better than PLS2, and when some component information is missing, MMS shows better robustness than PLS2.
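A toy, hedged reading of the mixture idea stated above, assuming NumPy/SciPy: each test spectrum is decomposed as a non-negative combination of the calibration spectra (here via non-negative least squares), and the same mixing coefficients are applied to the calibration property values. This illustrates the concept only; it is not the published MMS algorithm.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(13)
S = rng.uniform(size=(3, 50))                         # pure-component "spectra"
C_cal = rng.uniform(size=(20, 3))                     # calibration concentrations
X_cal = C_cal @ S + 0.01 * rng.normal(size=(20, 50))
y_cal = C_cal[:, 0]                                   # property: first component's concentration

c_test = rng.uniform(size=3)
x_test = c_test @ S + 0.01 * rng.normal(size=50)

coeffs, _ = nnls(X_cal.T, x_test)                     # x_test ~ X_cal.T @ coeffs
y_pred = coeffs @ y_cal                               # mix the calibration properties the same way
print("predicted:", round(float(y_pred), 3), " true:", round(float(c_test[0]), 3))
```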

20.
He J  Fang G  Deng Q  Wang S 《Analytica chimica acta》2011,704(1-2):57-62
Classification and regression trees (CART) have the advantage of handling large data sets and yielding readily interpretable models. A conventional method of building a regression tree is recursive partitioning, which results in a good but not optimal tree. The ant colony system (ACS), a meta-heuristic algorithm inspired by the behavior of real ants, can be used to overcome this problem. The purpose of this study was to explore the use of CART, and its combination with ACS, for modeling the melting points of a large variety of chemical compounds. Genetic algorithm (GA) operators (e.g., crossover and mutation) were combined with the ACS algorithm to select the best solution model. In addition, at each terminal node of the resulting tree, variable selection was performed by the ACS-GA algorithm to build an appropriate partial least squares (PLS) model. To test the ability of the resulting tree, a set of 4173 structures and their melting points was used (3000 compounds as the training set and 1173 as the validation set). Further, an external test set of 277 drugs was used to validate the predictive ability of the tree. Comparison of the results from both trees showed that the tree constructed by the ACS-GA algorithm performs better than that produced by the recursive partitioning procedure.
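A stripped-down sketch of the "PLS model in every terminal node" idea, assuming scikit-learn. The tree here is grown by ordinary recursive partitioning and each leaf uses all descriptors, i.e. the ACS-GA search of the paper is not reproduced; the data and settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))                       # illustrative molecular descriptors
y = np.where(X[:, 0] > 0, 3 * X[:, 1], -2 * X[:, 2]) + 0.1 * rng.normal(size=300)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30, random_state=0).fit(X, y)
leaf_id = tree.apply(X)

# Fit a local PLS model in every terminal node (the paper selects each leaf's variables
# with ACS-GA; here every leaf simply uses all descriptors).
leaf_models = {leaf: PLSRegression(n_components=2).fit(X[leaf_id == leaf], y[leaf_id == leaf])
               for leaf in np.unique(leaf_id)}

def predict(x_new):
    leaf = tree.apply(x_new.reshape(1, -1))[0]
    return float(leaf_models[leaf].predict(x_new.reshape(1, -1)).ravel()[0])

print("prediction for first sample:", predict(X[0]), " actual:", y[0])
```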
