Similar literature
1.
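In the preprocessing of NMR metabolomics data, the main purpose of scaling is to increase the weight of information from characteristic metabolites and to reduce the influence of noise and irrelevant metabolites, thereby making subsequent pattern recognition analysis easier. This paper proposes a new scaling method that does not emphasize bringing every variable onto a common scale. Instead, starting from the original data, it increases the weights of variables that are highly stable and that differ significantly between sample classes, so as to strengthen the information related to characteristic metabolites. The performance of the new method is evaluated on both simulated data and real metabolomics data and compared with commonly used scaling methods such as unit variance (Unit Variance), variable stability (Variable Stability) and level scaling (Level Scaling). The results show that the new method improves the predictive ability of multivariate statistical models, better preserves the molecular information in the NMR spectra, helps identify characteristic metabolites, and makes the results of subsequent data analysis more interpretable.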
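The three reference scalings named in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual textbook definitions: each variable is mean-centred, then divided by its standard deviation (unit variance), divided by its mean (level scaling), or divided by its standard deviation and multiplied by the ratio mean/standard deviation (variable stability). The function names are hypothetical, and the new class-aware weighting proposed in the abstract is not reproduced because the abstract does not give its formula.

```python
from statistics import mean, stdev


def scale_column(values, method):
    """Scale one variable (one column of intensities across samples)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    centred = [v - m for v in values]
    if method == "uv":      # unit variance: divide by the standard deviation
        return [c / s for c in centred]
    if method == "level":   # level scaling: divide by the mean
        return [c / m for c in centred]
    if method == "vast":    # variable stability: UV scaling times mean / sd
        return [c / s * (m / s) for c in centred]
    raise ValueError(f"unknown method: {method}")


def scale_matrix(rows, method):
    """Apply the chosen scaling column by column to a samples-by-variables table."""
    columns = [scale_column(list(col), method) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

Under these assumed definitions, a variable with a small relative standard deviation receives a larger weight from variable stability scaling than from unit variance scaling, which is the "stability" idea the abstract builds on.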

2.
In this paper, multivariate calibration of complicated process fluorescence data is presented. Two data sets related to the production of white sugar are investigated. The first data set comprises 106 observations and 571 spectral variables, and the second data set 268 observations and 3997 spectral variables. In both applications, a single response, ash content, is modelled and predicted as a function of the spectral variables. Both data sets contain certain features making multivariate calibration efforts non-trivial. The objective is to show how principal component analysis (PCA) and partial least squares (PLS) regression can be used to overview the data sets and to establish predictively sound regression models. It is shown how a recently developed technique for signal filtering, orthogonal signal correction (OSC), can be applied in multivariate calibration to enhance predictive power. In addition, signal compression is tested on the larger data set using wavelet analysis. It is demonstrated that a compression down to 4% of the original matrix size (in the variable direction) is possible without loss of predictive power. It is concluded that the combination of OSC for pre-processing and wavelet analysis for compression of spectral data is promising for future use.

3.
A robust method was developed to cluster similar NMR spectra from partially purified extracts obtained from a range of marine sponges and a plant biota. The NMR data were acquired using microtiter plate NMR (VAST) in protonated solvents. A sample data set which contained several clusters was used to optimize the protocol. The evaluation of the robustness was performed using three different clustering methods: tree clustering analysis, K-means clustering and multidimensional scaling. These methods were compared for consistency using the sample data set and the optimized methodology was applied to clustering of a set of spectra from partially purified biota extracts.

4.
This work describes a quantitative spectroscopic method for the analysis of ternary mixtures of creatine (CER), creatinine (CRE), and uric acid (UA) using multivariate data models based upon ultraviolet spectroscopy. By multivariate calibration methods, such as partial least squares regression, it is possible to obtain a model adjusted to the concentration values of the mixtures used in the calibration range. In this study, the calibration model is based on absorption spectra in the 200–260 nm range for 36 different mixtures of CER, CRE, and UA. Information unrelated to the concentrations was removed by the orthogonal signal correction (OSC) method, and the results were validated. Evaluation of the prediction errors for the prediction set reveals that the OSC-treated data give substantially lower root mean square error of prediction (RMSEP) values than the original data. The RMSEP values for CER, CRE, and UA were 1.1686, 0.2195, and 0.3726 with OSC, and 1.9057, 0.3482, and 0.6164 without OSC, respectively. This procedure allows the simultaneous determination of CER, CRE, and UA in synthetic and real samples.
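The RMSEP comparison in this abstract can be made concrete with a short sketch. It assumes the standard definition of RMSEP as the square root of the mean squared difference between reference and predicted values over the prediction set; `rmsep` and `relative_reduction` are hypothetical helper names, not taken from the paper.

```python
from math import sqrt


def rmsep(reference, predicted):
    """Root mean square error of prediction over a prediction set."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("reference and predicted must be non-empty and equally long")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(before, after):
    """Fraction by which an error measure decreased, e.g. after OSC filtering."""
    return (before - after) / before
```

Applied to the figures quoted in the abstract, `relative_reduction(1.9057, 1.1686)` comes to roughly 0.39, that is, an RMSEP about 39% lower for CER after OSC.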

5.
《Analytica chimica acta》2004,509(2):217-227
In near-infrared (NIR) measurements, some physical features of the sample can be responsible for effects like light scattering, which lead to systematic variations unrelated to the studied responses. These errors can disturb the robustness and reliability of multivariate calibration models. Several mathematical treatments are usually applied to remove systematic noise in data, the most common being derivation, standard normal variate (SNV) and multiplicative scatter correction (MSC). New mathematical treatments, such as orthogonal signal correction (OSC) and direct orthogonal signal correction (DOSC), have been developed to minimize the variability unrelated to the response in spectral data. In this work, these two new pre-processing methods were applied to a set of roasted coffee NIR spectra. Separate calibration models were developed by PLS regression to quantify the ash content and the lipids in roasted coffee samples. The results provided by these correction methods were compared to those obtained with the original data and the data corrected by derivation, SNV and MSC. For both responses, OSC and DOSC treatments gave PLS calibration models with improved prediction abilities (4.9 and 3.3% RMSEP with corrected data versus 7.1 and 8.3% RMSEP with original data, respectively).
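The two classical scatter corrections mentioned in this abstract, SNV and MSC, can be sketched in a few lines. This is a minimal illustration under the usual definitions: SNV centres and scales each spectrum by its own mean and standard deviation, and MSC regresses each spectrum on a reference spectrum (assumed here to be the mean spectrum) and removes the fitted offset and slope. The function names are hypothetical, and OSC and DOSC themselves are not implemented.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    reference = [mean(column) for column in zip(*spectra)]
    ref_mean = mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spectrum in spectra:
        spec_mean = mean(spectrum)
        # Least-squares fit: spectrum ~ intercept + slope * reference
        sxy = sum((r - ref_mean) * (x - spec_mean) for r, x in zip(reference, spectrum))
        slope = sxy / sxx
        intercept = spec_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in spectrum])
    return corrected
```

Both corrections use only the spectra themselves. OSC and DOSC, by contrast, use the response to decide which variation to remove, which is the distinction the abstract draws.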

6.
Multivariate spectral analysis has been widely applied in chemistry and other fields. Spectral data consisting of measurements at hundreds and even thousands of analytical channels can now be obtained in a few seconds. It is widely accepted that, before a multivariate regression model is built, well-performed variable selection can help to improve the predictive ability of the model. In this paper, the concept of traditional wavelength variable selection is extended and the idea of variable weighting is incorporated into the least-squares support vector machine (LS-SVM). A recently proposed global optimization method, the particle swarm optimization (PSO) algorithm, is used to search for the weights of the variables and the hyper-parameters involved in LS-SVM, optimizing both the training of a calibration set and the prediction of an independent validation set. The entire computation is automatic. Two real data sets are investigated and the results are compared with those of PLS, uninformative variable elimination-PLS (UVE-PLS) and LS-SVM models to demonstrate the advantages of the proposed method.

7.
1H nuclear magnetic resonance (NMR)-based metabonomics is a well-established technique used to analyse and interpret complex multiparametric metabolic data, and has a wide number of applications in the development of pharmaceuticals. However, interpretation of biological data can be confounded by extraneous variation in the data such as fluctuations in either experimental conditions or in physiological status. Here we have shown the novel application of a data filtering method, orthogonal signal correction (OSC), to biofluid NMR data to minimise the influence of inter- and intra-spectrometer variation during data acquisition, and also to minimise innate physiological variation. The removal of orthogonal variation exposed features of interest in the NMR data and facilitated interpretation of the derived multivariate models. Furthermore, analysis of the orthogonal variation provided an explanation of the systematic analytical/biological changes responsible for confounding the original NMR data.

8.
Biomarker discovery is one important goal in metabolomics. It is typically modeled as selecting the most discriminating metabolites for classification and is often referred to as variable importance analysis or variable selection. A number of variable importance analysis methods for discovering biomarkers in metabolomics studies have been proposed. However, different methods are most likely to generate different variable ranking results because they rest on different principles. Each method generates a variable ranking list just as an expert presents an opinion. The problem of inconsistency between different variable ranking methods is often ignored. A simple and ideal solution to this problem is to take every ranking into account. In this study, a strategy called rank aggregation was employed. It is an indispensable tool for merging individual ranking lists into a single “super”-list reflective of the overall preference or importance within the population. This “super”-list is regarded as the final ranking for biomarker discovery. It was then used for biomarker discovery and for selecting the variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filtering methods and six multivariate methods. When applied to two metabolic datasets (a childhood overweight dataset and a tubulointerstitial lesions dataset), rank aggregation gave much higher prediction accuracy than using all variables. Moreover, it also outperformed a penalized method, the least absolute shrinkage and selection operator (LASSO), with higher prediction accuracy or fewer selected variables, which are more interpretable.
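The idea of merging several variable rankings into one "super"-list can be sketched with a Borda count. The abstract does not say which aggregation rule the authors used, so Borda counting is an assumption chosen here for its simplicity, and `borda_aggregate` is a hypothetical name.

```python
def borda_aggregate(rankings):
    """Merge rankings (each a best-first list of the same variables) into one list.

    Each variable earns n points for first place, n - 1 for second, and so on,
    where n is the length of the ranking; points are summed over all rankings.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (n - position)
    # Highest total first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

For example, `borda_aggregate([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])` places `"a"` first because it collects the most points across the three lists, even though one method ranked `"b"` highest.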

9.
Multivariate methods, such as principal component analysis (PCA) and multivariate curve resolution (MCR), are often employed to aid the analysis of large complex data sets such as time‐of‐flight secondary ion mass spectrometry (ToF‐SIMS) images. There is, however, much confusion over the most appropriate choice of method for any given application and the effects of data preprocessing, which is exacerbated by the confusing terminologies and the use of jargon in this field. In the present study, a simple model system consisting of a ToF‐SIMS image of an immiscible polymer blend is used to evaluate PCA and MCR in the accurate identification, localisation and quantification of the phase‐separated polymer domains, using four data preprocessing methods (no scaling, normalisation, variance scaling and Poisson scaling). This highlights significant issues and challenges in the quantitative multivariate analysis of mixed organic systems, including the discrimination of chemically significant features from experimental noise, the resolution of weak chemical contributions and potential bias introduced by data preprocessing. Multivariate analysis using Poisson scaling, identified as the most suitable data preprocessing method for both PCA and MCR, demonstrates a marked improvement upon traditional (manual) analysis and provides valuable additional information that is difficult to detect using traditional analysis. Using these results, we present recommendations for the optimum use of multivariate analysis by analysts and provide guidance on selecting the most appropriate methods. Confusing terminology is also clarified. © Crown copyright 2008. Reproduced with the permission of Her Majesty's Stationery Office. Published by John Wiley & Sons, Ltd.

10.
Data fusion in multivariate calibration transfer   Total citations: 1, self-citations: 0, citations by others: 1
We report the use of stacked partial least-squares regression and stacked dual-domain regression analysis with four commonly used techniques for calibration transfer to improve predictive performance from transferred multivariate calibration models. The predictive performance from three conventional calibration transfer methods, piecewise direct standardization (PDS), orthogonal signal correction (OSC) and model updating (MUP), requiring standards measured on both instruments, was significantly improved from data fusion either by stacking of wavelet scales or by stacking of spectral intervals, as demonstrated by transfer of calibrations developed on near-infrared spectra of synthetic gasoline. Stacking did not produce as significant an improvement for calibration transfer using a finite impulse response (FIR) filter, but application of SPLS regression to FIR-transferred spectra improves predictive performance of the transferred model.

11.
Han QJ  Wu HL  Cai CB  Xu L  Yu RQ 《Analytica chimica acta》2008,612(2):121-125
An improved method based on an ensemble of Monte Carlo uninformative variable elimination (EMCUVE) is presented for wavelength selection in multivariate calibration of spectral data. The proposed algorithm introduces a Monte Carlo (MC) strategy into uninformative variable elimination-PLS (UVE-PLS), in place of the leave-one-out strategy, for estimating the contribution of each wavelength variable to the PLS model. In EMCUVE, wavelength variables are evaluated by different Monte Carlo uninformative variable elimination (MCUVE) models. Moreover, fusing MCUVE with a voting rule yields an improvement over the original uninformative variable elimination method. Results obtained from simulated and real data sets demonstrate that EMCUVE can properly carry out wavelength selection in the course of data analysis and improve the predictive ability of multivariate calibration models.

12.
A new variable selection algorithm is described, based on ant colony optimization (ACO). The aim of the algorithm is to choose, from a large number of available spectral wavelengths, those relevant to the estimation of analyte concentrations or sample properties when spectroscopic analysis is combined with multivariate calibration techniques such as partial least-squares (PLS) regression. The new algorithm employs the concept of cooperative pheromone accumulation, which is typical of ACO selection methods, and optimizes PLS models using a pre-defined number of variables, employing a Monte Carlo approach to discard irrelevant sensors. The performance has been tested on a simulated system, where it shows a significant superiority over other commonly employed selection methods, such as genetic algorithms. Several near-infrared spectroscopic experimental data sets have been subjected to the present ACO algorithm, with PLS leading to improved analytical figures of merit upon wavelength selection. The method could be helpful in other chemometric activities such as classification or quantitative structure-activity relationship (QSAR) problems.

13.
A new tool for analyzing compound libraries by NMR has been developed. Aliquots of solution-state samples (between 120 and 350 microL) are directly injected, using a standard liquids handler, into an NMR (LC-NMR) flow probe. Automated NMR software tracks and suppresses intense signals arising from the nondeuterated solvents used (if any) and acquires high-sensitivity one-dimensional 1H NMR spectra. An 88-member combinatorial library, dissolved in DMSO and stored in a 96-well microtiter plate, has been analyzed in a number of ways using this technique. This nondestructive technique, which we call direct-injection NMR (DI-NMR) and which is embodied in our versatile automated sample changer (VAST) hardware, has proven to be both routine and robust. Our success in automatically acquiring the NMR data for entire plates of library compounds (within 4-8 h) has caused us to develop new ways to display and analyze the resulting NMR data, as will be shown here.

14.
Quantitative structure–activity relationship (QSAR) methods are urgently needed for predicting ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties, both to select lead compounds for optimization at the early stage of drug discovery and to screen drug candidates for clinical trials. Use of suitable QSAR models ultimately results in a lower time cost and a lower attrition rate during drug discovery and development. Among ADME/T parameters, drug metabolism is a key determinant of metabolic stability, drug–drug interactions, and drug toxicity. QSAR models for predicting drug metabolism have undergone significant advances recently. However, most of the models used lack sufficient interpretability and offer poor predictability for novel drugs. In this review, we describe some considerations to be taken into account when using QSAR to model drug metabolism, such as the accuracy and consistency of the entire data set, the representation and diversity of the training and test sets, and variable selection. We also describe some novel statistical techniques (ensemble methods, multivariate adaptive regression splines and graph machines) that are not yet used frequently to develop QSAR models for drug metabolism. Subsequently, rational recommendations for developing predictive and interpretable QSAR models are made. Finally, the recent advances in QSAR models for cytochrome P450-mediated drug metabolism prediction, including in vivo hepatic clearance, in vitro metabolic stability, and inhibitors and substrates of cytochrome P450 families, are briefly summarized.

15.
Analyses of multifactorial experimental designs are used as an explorative technique describing hypothesized multifactorial effects based on their variation. The procedure for analyzing multifactorial designs is well established for univariate data, where it is known as analysis of variance (ANOVA), whereas only a few methods have been developed for multivariate data. In this work, we present the weighted-effect ASCA, named WE-ASCA, as an enhanced version of ANOVA-simultaneous component analysis (ASCA) to deal with multivariate data in unbalanced multifactorial designs. The core of our work is to use general linear models (GLMs) to decompose the response matrix into a design matrix and a parameter matrix, while the main improvement in WE-ASCA is to implement weighted-effect (WE) coding in the design matrix. This WE-coding introduces a unique solution to the GLMs and satisfies a constraint in which the sum of all level effects of a categorical variable equals zero. To assess the performance of WE-ASCA, two applications were demonstrated using a biomedical Raman spectral data set consisting of mice colorectal tissue. The results revealed that WE-ASCA is ideally suited to analyzing unbalanced designs. Furthermore, if WE-ASCA is applied as a preprocessing tool, the classification performance and its reproducibility can improve significantly.
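Weighted-effect coding of a single categorical factor can be sketched as below. This is an illustration under the common definition of WE coding, not the authors' implementation: each non-reference level gets one column, observations at that level are coded 1, observations at the reference level are coded minus the ratio of the two group sizes, and all others are coded 0. The function name and the choice of reference level are assumptions.

```python
from collections import Counter


def weighted_effect_columns(labels, reference):
    """Build weighted-effect coded columns for one categorical factor.

    Returns the non-reference levels (one per column) and the coded rows.
    """
    counts = Counter(labels)
    levels = [level for level in sorted(counts) if level != reference]
    rows = []
    for label in labels:
        row = []
        for level in levels:
            if label == level:
                row.append(1.0)
            elif label == reference:
                # The group-size ratio makes each column sum to zero
                # even when the design is unbalanced.
                row.append(-counts[level] / counts[reference])
            else:
                row.append(0.0)
        rows.append(row)
    return levels, rows
```

With group sizes 3, 1 and 2 and the last group as reference, every column of the coded matrix sums to zero, so the size-weighted level effects are constrained to sum to zero. Ordinary effect coding, which uses -1 for the reference level, gives this only for balanced designs.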

16.
LC/MS is an analytical technique that, due to its high sensitivity, has become increasingly popular for the generation of metabolic signatures in biological samples and for the building of metabolic databases. However, to be able to create robust and interpretable (transparent) multivariate models for the comparison of many samples, the data must fulfil certain specific criteria: (i) that each sample is characterized by the same number of variables, (ii) that each of these variables is represented across all observations, and (iii) that a variable in one sample has the same biological meaning or represents the same metabolite in all other samples. In addition, the obtained models must be able to make predictions for, e.g., related and independent samples characterized in the same way as the model samples. This method involves the construction of a representative data set, including automatic peak detection, alignment, setting of retention time windows, summing in the chromatographic dimension and data compression by means of alternating regression, where the relevant metabolic variation is retained for further modelling using multivariate analysis. This approach has the advantage of allowing the comparison of large numbers of samples based on their LC/MS metabolic profiles, but also of creating a means for the interpretation of the investigated biological system. This includes finding relevant systematic patterns among samples, identifying influential variables, verifying the findings in the raw data, and finally using the models for predictions. The presented strategy was here applied to a population study using urine samples from two cohorts, Shanxi (People's Republic of China) and Honolulu (USA). The results showed that the evaluation of the extracted data using partial least squares discriminant analysis (PLS-DA) provided a robust, predictive and transparent model for the metabolic differences between the two populations.
The presented findings suggest that this is a general approach for data handling, analysis, and evaluation of large metabolic LC/MS data sets.

17.
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and eventually more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables on the basis of their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks’ Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. A variable Forward Selection based on the CMC index was finally used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.

18.
A critical step in multivariate calibration is wavelength selection, which is used to build models with better prediction performance when applied to spectral data. Many feature selection techniques have been developed to date. Among them, those based on swarm intelligence optimization are of particular interest, since they typically simulate the behavior of animals and insects, for example in finding the shortest path between a food source and the nest. The decision is made collectively by the swarm, leading to a more robust model that is less prone to becoming trapped in local minima during the optimization cycle.

19.
Data modelling with neural networks: Advantages and limitations   Total citations: 1, self-citations: 0, citations by others: 1
The origins and operation of artificial neural networks are briefly described and their early application to data modelling in drug design is reviewed. Four problems in the use of neural networks in data modelling are discussed, namely overfitting, chance effects, overtraining and interpretation, and examples are given of the means by which the first three of these may be avoided. The use of neural networks as a variable selection tool is shown and the advantage of networks as a nonlinear data modelling device is discussed. The display of multivariate data in two dimensions employing a neural network is illustrated using experimental and theoretical data for a set of charge transfer complexes.

20.
Da C  Wang F  Shao X  Su Q 《The Analyst》2003,128(9):1200-1203
A new hybrid algorithm is proposed to eliminate interference information, including noise, background and systematic spectral variation irrelevant to concentration, in multivariate calibration of near-infrared (NIR) spectra. The method consists of two parts: an approximate derivative based on the continuous wavelet transform (CWT) and orthogonal signal correction (OSC). After the approximate derivative was calculated by CWT, OSC was performed. The method was successfully applied to real complex NIR spectral data to eliminate the interference information. Correcting for the interference in the NIR spectra resulted in a substantial improvement in prediction precision, and a more concise calibration model was obtained. The proposed procedure also compared favourably with several pretreatment methods, and the new method appears to provide a high-performance pretreatment tool for multivariate calibration of NIR spectra. In addition, the strategy proposed here can be applied to various other spectral data for quantitative purposes as well.
