Similar articles
 Found 20 similar articles; search took 78 ms
1.
Many commercially available software programs claim similar efficiency and accuracy as variable selection tools. Genetic algorithms are commonly used variable selection methods in which the most relevant variables are differentiated from less important ones using evolutionary computing techniques. However, different vendors offer several algorithms, and the puzzling question is: which one is the appropriate method of choice? In this study, several genetic algorithm tools (e.g. GFA from Cerius2, QuaSAR-Evolution from MOE and Parteks genetic algorithm) were compared. Stepwise multiple linear regression models were generated using the most relevant variables identified by the above genetic algorithms. This procedure led to the successful generation of quantitative structure-activity relationship (QSAR) models for (a) proprietary datasets and (b) the Selwood dataset.

2.
3.
A new variable selection algorithm is described, based on ant colony optimization (ACO). The aim of the algorithm is to choose, from a large number of available spectral wavelengths, those relevant to the estimation of analyte concentrations or sample properties when spectroscopic analysis is combined with multivariate calibration techniques such as partial least-squares (PLS) regression. The new algorithm employs the concept of cooperative pheromone accumulation, which is typical of ACO selection methods, and optimizes PLS models using a pre-defined number of variables, employing a Monte Carlo approach to discard irrelevant sensors. The performance has been tested on a simulated system, where it shows a significant superiority over other commonly employed selection methods, such as genetic algorithms. Several near infrared spectroscopic experimental data sets have been subjected to the present ACO algorithm, with PLS leading to improved analytical figures of merit upon wavelength selection. The method could be helpful in other chemometric activities such as classification or quantitative structure-activity relationship (QSAR) problems.
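The pheromone idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `aco_select`, its parameters, and the `score` callback are assumptions made for illustration, and the Monte Carlo step for discarding sensors is omitted. In the abstract's setting, `score` would be derived from the prediction error of a PLS model built on the chosen wavelengths.

```python
import random

def aco_select(n_vars, k, score, n_ants=20, n_iter=30, rho=0.1, seed=0):
    """Pick k of n_vars variables by pheromone-guided sampling.

    score(subset) -> float, higher is better (hypothetical callback).
    """
    rng = random.Random(seed)
    pheromone = [1.0] * n_vars
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        trails = []
        for _ in range(n_ants):
            # Each ant draws k distinct variables, with probability
            # proportional to the current pheromone levels.
            candidates = list(range(n_vars))
            weights = pheromone[:]
            subset = []
            for _ in range(k):
                j = rng.choices(range(len(candidates)), weights=weights)[0]
                subset.append(candidates.pop(j))
                weights.pop(j)
            subset.sort()
            trails.append((score(subset), subset))
        # Evaporate all trails, then let the iteration's best ant deposit.
        pheromone = [(1.0 - rho) * p for p in pheromone]
        it_score, it_subset = max(trails)
        for v in it_subset:
            pheromone[v] += 1.0
        if it_score > best_score:
            best_score, best_subset = it_score, it_subset
    return best_subset, best_score, pheromone

# Toy usage: the score counts how many "informative" variables were picked.
informative = {2, 5, 7}
subset, value, tau = aco_select(10, 3, lambda s: len(informative & set(s)))
```

Depositing pheromone only for the best ant of each iteration is one simple design choice; other ACO variants let every ant deposit an amount that depends on its score.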

4.
The non-linear regression technique known as the alternating conditional expectations (ACE) method is only applicable when the number of objects available for calibration is considerably greater than the number of predictors considered. Alternating conditional expectations regression with selection of significant predictors by genetic algorithms (GA-ACE), the non-linear regression technique presented here, is based on the ACE algorithm but introduces several modifications to resolve the applicability limitations of the original ACE method, thus facilitating the practical implementation of a very interesting calibration tool. To overcome the lack of reliability displayed by the original ACE algorithm on data sets with too large a number of variables, GA-ACE applies genetic algorithms as a variable selection technique before the non-linear regression model is developed, selecting a reduced subset of significant predictors able to accurately model and predict the response variable considered. Furthermore, GA-ACE provides two alternative application approaches: it allows either prior data compression, computing a number of principal components that are subsequently subjected to GA selection, or working directly on the original variables. In this study, GA-ACE was applied to two real calibration problems with a very low observation/variable ratio (NIR data), and the results were compared with those obtained by several commonly employed linear regression techniques. With the non-linear GA-ACE method, notably improved regression models were developed for the two response variables modeled, with root mean square errors of the residuals in external prediction (RMSEP) equal to 11.51 and 6.03% for moisture and lipid contents of roasted coffee samples, respectively. The improvement achieved by the new non-linear method is even more remarkable in view of the results obtained with the best-performing linear method (IPW-PLS) applied to predict the studied responses (14.61 and 7.74% RMSEP, respectively).
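For reference, RMSEP is conventionally computed over the external prediction set as shown below. The symbols are assumptions introduced here: $n_p$ is the number of external samples, $y_i$ the reference value and $\hat{y}_i$ the model prediction. The abstract reports RMSEP in percent but does not state whether that is the unit of the response or a relative error, so the relative form is given only as one possible reading.

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{n_p}\sum_{i=1}^{n_p}\left(\hat{y}_i - y_i\right)^2},
\qquad
\mathrm{RMSEP}_{\%} = 100 \cdot \frac{\mathrm{RMSEP}}{\bar{y}}
\quad \text{(one possible relative form, with } \bar{y} \text{ the mean reference value)}
```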

5.
6.
Attenuated total reflectance-Fourier transform infrared spectrometry, in conjunction with multivariate calibration, was used for determination of reducing sugars, humidity and acidity in honey bee samples. Multivariate calibration models were built using partial least squares (PLS) and were refined through variable selection per interval (iPLS) and genetic algorithms. The calibration models show satisfactory results for all parameters with average relative errors of 6% for acidity, 1% for reducing sugars and 2% for humidity. For the acidity and reducing sugars parameters, variable selection was irrelevant, but for humidity it was essential. For the humidity parameter, it was necessary to use two variable selection techniques (by intervals and genetic algorithm) concomitantly in order to obtain a satisfactory calibration model.

7.
8.
In this work we evaluated the use of different variable selection techniques combined with partial least-squares regression (PLS), namely genetic algorithm PLS (GA-PLS), interval PLS (iPLS), and synergy interval PLS (siPLS), in the simultaneous determination of Cd(II), Cu(II), Pb(II) and Zn(II) by anodic stripping voltammetry at a bismuth film. Generally, variable selection improved the prediction results compared with full-voltammogram PLS. Algorithms based on interval selection proved more adequate than the selection of discrete variables by GA. Excellent analytical performance was obtained despite the inherent complexity of the simultaneous determination.

9.
In multivariate calibration with spectral datasets, variable selection is often applied to identify a relevant subset of variables, leading to improved prediction accuracy and easier interpretation of the selected fingerprint regions. Numerous variable selection methods have been proposed, but a proper choice among them is not trivial. Furthermore, in many cases the set of variables found by those methods might not be robust because of irreproducibility and uncertainty issues, which poses a great challenge to improving the reliability of variable selection. In this study, the reproducibility of five variable selection methods was investigated quantitatively to evaluate their performance. The reproducibility of variable selection was quantified using Monte-Carlo sub-sampling (MCS) techniques together with a quantitative similarity measure designed for highly collinear spectral datasets. The investigation of the reproducibility and prediction accuracy of the variable selection algorithms on two different near-infrared (NIR) datasets showed that the methods varied widely in their performance, especially in their ability to identify a consistent subset of variables from the spectral datasets. A thorough assessment of reproducibility together with the predictive accuracy of the identified variables thus improved the statistical validity and confidence of the selection outcome, which conventional evaluation schemes cannot address.
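A minimal sketch of how selection reproducibility might be quantified by Monte-Carlo sub-sampling follows. Two substitutions are made and should be read as assumptions: the selector is a simple top-k ranking by absolute Pearson correlation, not one of the five methods studied, and the similarity is the plain Jaccard index, not the collinearity-aware measure described in the abstract. All function names are hypothetical.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def top_k_by_correlation(X, y, k):
    """Stand-in selector: the k columns of X most correlated with y."""
    n_vars = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_vars)]
    ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])

def selection_stability(X, y, k, n_runs=50, frac=0.7, seed=0):
    """Mean pairwise Jaccard similarity of subsets selected on random sub-samples."""
    rng = random.Random(seed)
    n = len(X)
    m = max(2, int(frac * n))
    subsets = []
    for _ in range(n_runs):
        idx = rng.sample(range(n), m)  # sub-sample rows without replacement
        subsets.append(
            top_k_by_correlation([X[i] for i in idx], [y[i] for i in idx], k)
        )
    sims = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return sum(sims) / len(sims)
```

A value near 1 means the selector returns almost the same variables on every sub-sample; a value near 0 means the selection is dominated by sampling noise.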

10.
11.
Derivation of quantitative structure-activity relationships (QSAR) usually involves computational models that relate a set of input variables describing the structural properties of the molecules for which the activity has been measured to the output variable representing activity. Many of the input variables may be correlated, and it is therefore often desirable to select an optimal subset of the input variables that results in the most predictive model. In this paper we describe an optimization technique for variable selection based on artificial ant colony systems. The algorithm is inspired by the behavior of real ants, which are able to find the shortest path between a food source and their nest using deposits of pheromone as a communication agent. The underlying basic self-organizing principle is exploited for the construction of parsimonious QSAR models based on neural networks for several classical QSAR data sets.

12.
Comparative molecular field analysis (CoMFA) with partial least squares (PLS) is one of the most frequently used tools in three-dimensional quantitative structure-activity relationship (3D-QSAR) studies. Although many successful CoMFA applications have proved the value of this approach, there are some problems in its proper application. In particular, the inability of PLS to handle the low signal-to-noise ratio (sample-to-variable ratio) has attracted much attention from QSAR researchers as an exciting research target, and several variable selection methods have been proposed. More recently, we have developed a novel variable selection method for CoMFA modeling (GARGS: genetic algorithm-based region selection), and its utility was demonstrated in the previous paper (Kimura, T., et al. J. Chem. Inf. Comput. Sci. 1998, 38, 276-282). The purpose of this study is to evaluate whether GARGS can pinpoint known molecular interactions in 3D space. We used a published set of acetylcholinesterase (AChE) inhibitors as a test example. Applying GARGS to this data set yielded several improved models with high internal prediction and a low number of field variables. External validation was performed to select a final model among them. The coefficient contour maps of the final GARGS model were compared with the properties of the active site in AChE, and the consistency between them was evaluated.

13.


14.
In this study, different methods of variable selection for multilinear stepwise regression (MLR) and support vector regression (SVR) were compared, using genetic algorithms (GAs) with various types of chromosomes. The first method is a GA with a binary chromosome (GA-BC) and the other is a GA with a fixed-length character chromosome (GA-FCC). The overall prediction accuracy for the training set was tested by means of 7-fold cross-validation. All the regression models were evaluated on the test set. The poor prediction for the test set illustrates that the forward stepwise regression (FSR) model is more prone to overfitting the training set. The results using SVR methods showed that the overfitting could be overcome. Further, overfitting occurs more easily with the GA-BC-SVR method because too many variables are quickly introduced into the model. The final optimal model, with good predictive ability (R2 = 0.885, S = 0.469, Rcv2 = 0.700, Scv = 0.757, Rex2 = 0.692, Sex = 0.675), was obtained using the GA-FCC-SVR method. Our investigation indicates that the variable selection method using GA-FCC is the most appropriate for MLR and SVR methods.
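The binary-chromosome idea (GA-BC) can be sketched as follows: each chromosome is a 0/1 mask over the candidate variables. This is a generic illustration, not the procedure of the study. The function `ga_binary_select`, its parameters, and the `fitness` callback are assumptions. In the abstract's setting the fitness would come from a cross-validated MLR or SVR model built on the masked variables.

```python
import random

def ga_binary_select(n_vars, fitness, pop_size=30, n_gen=40, p_mut=0.05, seed=0):
    """Evolve a 0/1 mask of length n_vars (n_vars >= 2) that maximizes fitness(mask)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    def tournament(population):
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(n_gen):
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randint(1, n_vars - 1)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Toy usage: reward masks that match a hypothetical "true" set of variables.
target = [1, 0, 0, 1, 0, 1, 0, 0]
mask, trace = ga_binary_select(len(target), lambda m: sum(a == b for a, b in zip(m, target)))
```

Because a binary mask places no limit on how many variables are switched on, a fitness that does not penalize model size can admit many variables early, which is consistent with the overfitting tendency the abstract reports for GA-BC.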

15.
A new procedure that enhances the prediction of multivariate calibration models using a small number of interpretable variables is presented. The core of this methodology is to sort the variables according to an informative vector and then to investigate PLS regression models systematically, with the aim of finding the most relevant set of variables by comparing the cross-validation parameters of the models obtained. In this work, seven main informative vectors, i.e. regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal-to-noise ratios vector (StN), and their combinations were automated and tested for the purpose of feature selection. Six data sets from different sources were employed to validate this methodology. They originated from near-infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure-activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, the regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than were used when building the model, were the best informative vectors for variable selection in most of the tested data sets. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright © 2008 John Wiley & Sons, Ltd.
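The "sort, then search" core of this procedure might look like the sketch below. It is a simplified illustration under stated assumptions: `informative` is any precomputed informative vector (for example regression coefficients), `cv_error` is a hypothetical callback returning a cross-validation error for a given variable subset (in the abstract, from a PLS model), and only nested prefixes of the sorted list are examined.

```python
def best_sorted_subset(informative, cv_error, step=1):
    """Sort variables by |informative value|, then scan nested prefixes.

    Returns (lowest cross-validation error, variable indices of that prefix).
    """
    order = sorted(
        range(len(informative)), key=lambda j: abs(informative[j]), reverse=True
    )
    best = None
    for size in range(step, len(order) + 1, step):
        subset = order[:size]
        err = cv_error(subset)
        # Keep the first prefix that strictly lowers the error.
        if best is None or err < best[0]:
            best = (err, subset)
    return best

# Toy usage: variables 0 and 2 are "relevant"; the error counts misses and extras.
relevant = {0, 2}
toy_error = lambda s: len(relevant - set(s)) + len(set(s) - relevant)
err, chosen = best_sorted_subset([0.9, 0.1, 0.8, 0.05], toy_error)
```

Scanning nested prefixes keeps the number of models linear in the number of variables, which is what makes a systematic comparison of cross-validation parameters affordable.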

16.
17.
18.
This paper evaluates analytical methods based on near infrared (NIR) and middle infrared (MIR) spectroscopy and multivariate calibration to monitor the stability of biodiesel. The focus was on three parameters: oxidative stability index, acid number and water content. Ethylic and methylic biodiesel from different feedstocks were used in accelerated aging experiments, in order to take into account the wide variety of oilseeds and feedstocks available in Brazil. Partial least squares (PLS) and multiple linear regression (MLR) models were developed. Different pre-processing techniques and spectral variable/region selection algorithms were evaluated. For MLR models, the successive projection algorithm (SPA) was employed. Interval PLS (iPLS) and selection of variables based on the significant regression coefficients were used for PLS models. Results showed that both the near and middle infrared regions, and all variable selection methods tested, were efficient for predicting these three important quality parameters of B100, with root mean square error of prediction (RMSEP) values comparable to the reproducibility of the corresponding standard method for each property investigated.

19.
Rapid and reliable discrimination among clinically relevant pathogenic organisms is a crucial task in microbiology. Microorganism resistance to antimicrobial agents increases the prevalence of infections. The ability of Fourier transform infrared (FT-IR) spectroscopy to assess the overall molecular composition of microbial cells in a non-destructive manner is reflected in the specific spectral fingerprints that are highly typical of different microorganisms. To use FT-IR spectroscopy for routine discrimination between diverse microbial species and strains, a wide range of chemometrics techniques needs to be applied. A major issue in using FT-IR for successful bacteria characterization remains the method of spectra pre-processing. We analyzed different spectra pre-processing methods and their impact on the reduction of spectral variability and on the robustness of chemometrics models. Different types of the Enterococcus faecium bacterial strain were classified according to chromosomal DNA restriction patterns produced by pulsed-field gel electrophoresis (PFGE). Samples were collected from human patients. The collected FT-IR spectra were used to verify whether the same classification was obtained. To further optimize bacteria classification, we investigated whether a selected combination of the most discriminative spectral regions could improve results. Two different variable selection methods (genetic algorithms (GAs) and bootstrapping) were investigated, and their relative merit for bacteria classification is reported by comparison with results obtained using the entire spectra. Discriminant partial least-squares (Di-PLS) models based on corrected spectra showed predictive ability improved by up to 40% compared with equivalent models using the entire spectral range. The uncertainty in estimating scores was reduced by about 50% compared with models using all wavelengths. Spectral ranges with relevant chemical information for Enterococcus faecium bacteria discrimination were outlined.

20.