Similar literature
 20 similar articles found (search time: 15 ms)
1.
For exploring and modeling the relationships between a dataset Y and several datasets measured on the same individuals, multiblock Partial Least Squares is a widely used regression technique, particularly in process monitoring, chemometrics and sensometrics. In the same vein, a new multiblock method, called multiblock Redundancy Analysis, is proposed. It is introduced by maximizing a criterion that reflects the objectives to be addressed, and the solution of this maximization problem is derived directly from the eigenanalysis of a matrix. In addition, the method is related to other multiblock methods. Multiblock modeling methods provide the user with a large spectrum of interpretation indices for investigating the relationships among variables and among datasets; these indices are tied to the criterion being maximized and are therefore derived directly from the maximization problem under consideration. The interest of multiblock Redundancy Analysis and the associated interpretation tools is illustrated using a dataset from the field of veterinary epidemiology. Copyright © 2011 John Wiley & Sons, Ltd.
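As a rough illustration of the kind of eigenanalysis involved, the sketch below implements classical single-block redundancy analysis (components of X that best explain Y), which the paper's multiblock method generalizes to several predictor blocks; the simulated data and function name are purely illustrative.

```python
# Minimal sketch of classical (single-block) redundancy analysis, assuming the
# usual definition: components of X maximizing the variance of Y they explain.
import numpy as np

def redundancy_analysis(X, Y, n_comp=2):
    """Components of X that best explain Y, from an eigenanalysis of the fitted Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Fitted values of Y from a least-squares regression on X
    Y_hat = X @ np.linalg.pinv(X) @ Y
    # Redundancy components = principal components of the fitted Y
    U, s, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    scores = U[:, :n_comp] * s[:n_comp]           # individual scores
    y_loadings = Vt[:n_comp].T                    # loadings on the Y variables
    explained = s[:n_comp] ** 2 / np.sum(Y ** 2)  # share of total Y variance explained
    return scores, y_loadings, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(50, 3))
scores, loadings, explained = redundancy_analysis(X, Y)
print("variance of Y explained per component:", np.round(explained, 3))
```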

2.
An approach for the analysis of large experimental datasets in electrochemical impedance spectroscopy (EIS) has been developed. The approach uses the idea of successive Bayesian estimation and splits the multidimensional EIS datasets into parts with reduced dimensionality. Estimation of the parameters of the EIS models is then performed successively, from one part to the next, using the complex nonlinear least squares (CNLS) method. The results obtained in the previous step are used as a priori values (in Bayesian form) for the analysis of the next part. To provide high stability of the sequential CNLS minimisation procedure, a new hybrid algorithm has been developed. This algorithm fits the datasets of reduced dimensionality to the selected EIS models, provides high stability of the fitting, and allows semi-automatic data analysis on a reasonable timescale. The hybrid algorithm consists of two stages in which different zero-order optimisation strategies are used, reducing both the computational time and the probability of overlooking the global optimum. The performance of the developed approach has been evaluated using (i) a simulated large EIS dataset representing a possible output of scanning electrochemical impedance microscopy experiments, and (ii) an experimental dataset in which EIS spectra were acquired as a function of the electrode potential and time. The developed data analysis strategy showed promise and can be further extended to other electroanalytical EIS applications that require multidimensional data analysis.
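The sketch below illustrates the successive Bayesian CNLS idea on a toy problem: an equivalent-circuit model is fitted slice by slice, with the previous slice's estimate entering the next fit as a Gaussian prior penalty. The Randles-type circuit, the prior weights and the simulated data are assumptions for illustration; the paper's two-stage zero-order hybrid optimiser is not reproduced here.

```python
# Hedged sketch: successive Bayesian CNLS fitting of an equivalent circuit.
import numpy as np
from scipy.optimize import least_squares

def z_model(p, omega):
    """Impedance of R_s in series with (R_ct parallel to a CPE)."""
    r_s, r_ct, q, alpha = p
    z_cpe = 1.0 / (q * (1j * omega) ** alpha)
    return r_s + 1.0 / (1.0 / r_ct + 1.0 / z_cpe)

def residuals(p, omega, z_meas, prior_mean, prior_sd):
    z = z_model(p, omega)
    data_res = np.concatenate([z.real - z_meas.real, z.imag - z_meas.imag])
    prior_res = (p - prior_mean) / prior_sd        # Bayesian (Gaussian prior) term
    return np.concatenate([data_res, prior_res])

rng = np.random.default_rng(1)
omega = 2 * np.pi * np.logspace(-1, 4, 40)
true_params = [(10, 200, 1e-4, 0.9), (10, 180, 1e-4, 0.9), (10, 160, 1e-4, 0.9)]

estimate = np.array([8.0, 150.0, 2e-4, 0.8])       # starting guess / first prior mean
prior_sd = np.array([5.0, 100.0, 1e-4, 0.2])       # loose prior for the first slice
for k, p_true in enumerate(true_params):            # successive slices of the dataset
    z_meas = z_model(np.array(p_true), omega)
    z_meas += 0.5 * (rng.normal(size=omega.size) + 1j * rng.normal(size=omega.size))
    fit = least_squares(residuals, estimate, x_scale="jac",
                        args=(omega, z_meas, estimate, prior_sd))
    estimate = fit.x                                 # posterior becomes the next prior
    prior_sd = prior_sd * 0.5                        # tighten the prior after each slice
    print(f"slice {k}: R_ct estimate {estimate[1]:.1f}")
```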

3.
Principal component analysis (PCA) and other multivariate analysis methods have been used increasingly to analyse and understand depth profiles in X-ray photoelectron spectroscopy (XPS), Auger electron spectroscopy (AES) and secondary ion mass spectrometry (SIMS). These methods have proved as useful in fundamental studies as in applied work, where speed of interpretation is very valuable. Until now these methods have been difficult to apply to very large datasets such as spectra associated with 2D images or 3D depth profiles. Existing algorithms for computing PCA matrices have been either too slow or have demanded more memory than is available on desktop PCs. This often forces analysts to 'bin' spectra on a much coarser grid than they would like, perhaps even to unit mass bins even though much higher resolution is available, or to select only part of an image for PCA analysis, even though PCA of the full data would be preferred. We apply the new 'random vectors' method of singular value decomposition proposed by Halko and co-authors to time-of-flight SIMS (ToF-SIMS) data for the first time. This increases the speed of calculation by a factor of several hundred, making PCA of these datasets practical on desktop PCs for the first time. For large images or 3D depth profiles we have implemented a version of this algorithm which minimises memory needs, so that even datasets too large to store in memory can be processed into PCA results on an ordinary PC with a few gigabytes of memory in a few hours. We present results from ToF-SIMS imaging of a citrate crystal and a basalt rock sample, the largest of which is 134 GB in file size, corresponding to 67 111 mass values at each of 512 × 512 pixels. This was processed into 100 PCA components in six hours on a conventional Windows desktop PC. © 2015 The Authors. Surface and Interface Analysis published by John Wiley & Sons Ltd.
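A minimal sketch of the core idea is shown below: PCA of a pixels-by-mass-channels matrix via Halko-style randomized SVD, here using scikit-learn's implementation of that algorithm. The toy matrix size is a stand-in for a real 512 × 512 pixel image and does not reproduce the paper's out-of-memory variant.

```python
# Sketch: PCA of a (simulated) ToF-SIMS image matrix via randomized SVD.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
n_pixels, n_channels, n_comp = 4096, 2000, 10
X = rng.poisson(1.0, size=(n_pixels, n_channels)).astype(np.float64)

X_centered = X - X.mean(axis=0)                 # mean-centering (densifies; the
U, S, Vt = randomized_svd(X_centered,           # memory-lean variant folds this
                          n_components=n_comp,  # step into the iterations)
                          random_state=0)

scores = U * S        # per-pixel scores, reshaped to the image grid for display
loadings = Vt         # per-component mass-spectral loadings
print("component variances:", np.round(S ** 2 / (n_pixels - 1), 2))
```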

4.
In modern omics research, it is the rule rather than the exception that multiple data sets are collected in a study pertaining to the same biological organism. In such cases, it is worthwhile to analyze all data tables simultaneously to arrive at global information about the biological system. This is the area of data fusion or multi-set analysis, which is a lively research topic in chemometrics, bioinformatics, and biostatistics. Most methods for analyzing such complex data focus on group means, treatment effects, or time courses. There is also information present in the covariances among variables within a group, because this relates directly to individual differences, heterogeneity of responses, and changes of regulation in the biological system. We present a framework for analyzing covariance matrices and a new method that fits nicely in this framework. This new method is based on combining covariance prototypes using simultaneous components and is therefore coined Covariances Simultaneous Component Analysis (COVSCA). We present the framework and our new method in mathematical terms, thereby explaining the (dis)similarities of the methods. Systems biology models based on differential equations illustrate the type of variation generated in real-life biological systems and how this type of variation can be modeled within the framework and with COVSCA. The method is subsequently applied to two real-life data sets from human and plant metabolomics studies, showing biologically meaningful results. Copyright © 2015 John Wiley & Sons, Ltd.
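To make the "simultaneous components for covariance matrices" idea concrete, the sketch below approximates several group covariance matrices with one common set of component vectors and group-specific weights. Extracting the common components from the pooled covariance and fitting the weights by projection is an illustrative shortcut under my own assumptions, not the paper's COVSCA fitting algorithm.

```python
# Hedged sketch: S_k roughly Z diag(c_k) Z', with Z shared across groups.
import numpy as np

def simultaneous_cov_components(cov_list, n_comp=2):
    pooled = sum(cov_list) / len(cov_list)
    evals, evecs = np.linalg.eigh(pooled)
    Z = evecs[:, ::-1][:, :n_comp]               # common (simultaneous) components
    weights = []
    for S in cov_list:
        # per-group weight of each component: c_kl = z_l' S_k z_l
        weights.append(np.array([z @ S @ z for z in Z.T]))
    return Z, np.array(weights)

rng = np.random.default_rng(2)
base = rng.normal(size=(6, 2))
covs = [base @ np.diag(w) @ base.T + 0.1 * np.eye(6) for w in ([3, 1], [1, 3], [2, 2])]
Z, C = simultaneous_cov_components(covs, n_comp=2)
print("per-group component weights:\n", np.round(C, 2))
```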

5.
The present work proposes a new approach for evaluating the information content in latent variables and, therefore, for determining the dimensionality of a regression model. Several examples are provided, using simulated, real-world, and reference datasets. The results show that the Durbin-Watson (DW) criterion can be used to determine the number of latent variables. Moreover, the method is straightforward to implement and can help in understanding model behaviour, particularly for complex datasets. A comparison is made with cross-validation techniques for the reference datasets, showing the potential of the Durbin-Watson criterion for characterising the regression model. The advantages and disadvantages of this procedure (compared with cross-validation) are discussed. The information content of the regression vectors (the p, w and b vectors) is also examined, along with how to use these properties for the present purpose.
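The sketch below shows one way to read the criterion, under my own assumptions: the PLS regression vector b stays smooth along the wavelength axis while it carries signal, and becomes noise-like (DW approaching 2) once additional latent variables start fitting noise. The simulated spectra and the selection rule are illustrative, not the paper's exact procedure.

```python
# Hedged sketch: Durbin-Watson statistic of the regression vector vs. number of LVs.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def durbin_watson(v):
    """DW statistic of a vector treated as a series along the variable axis."""
    d = np.diff(v)
    return np.sum(d ** 2) / np.sum((v - v.mean()) ** 2)

rng = np.random.default_rng(3)
wavelengths = np.linspace(0, 1, 200)
peaks = np.vstack([np.exp(-((wavelengths - c) / 0.05) ** 2) for c in (0.3, 0.5, 0.7)])
conc = rng.uniform(0, 1, size=(40, 3))
X = conc @ peaks + 0.01 * rng.normal(size=(40, 200))
y = conc[:, 0]

for n_lv in range(1, 8):
    b = PLSRegression(n_components=n_lv).fit(X, y).coef_.ravel()
    print(f"{n_lv} LVs: DW(b) = {durbin_watson(b):.2f}")
# A sharp rise of DW towards 2 suggests the previous number of LVs is sufficient.
```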

6.
The combination of results from different large-scale datasets of multidimensional biological signals (such as gene expression profiling) presents a major challenge. Methodologies are needed that can efficiently combine diverse datasets but can also test the extent of diversity (heterogeneity) across the combined studies. We developed METa-analysis of RAnked DISCovery datasets (METRADISC), a generalized meta-analysis method for combining information across discovery-oriented datasets and for testing between-study heterogeneity for each biological variable of interest. The method is based on non-parametric Monte Carlo permutation testing. The tested biological variables are ranked in each study according to the level of statistical significance. For each biological variable of interest, METRADISC tests its average rank and the between-study heterogeneity of the study-specific ranks. After accounting for ties and differences in tested variables across studies, we randomly permute the ranks of each study and calculate the simulated metrics of average rank and heterogeneity. The procedure is repeated to generate null distributions for the metrics. The use of METRADISC is demonstrated empirically using gene expression data from seven studies comparing prostate cancer cases and normal controls. We offer a new tool for combining complex datasets derived from massive-testing, discovery-oriented research and for examining the diversity of results across the combined studies.
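A bare-bones sketch of the two METRADISC metrics and their permutation nulls is shown below; the tie and coverage corrections described in the abstract are omitted, and the rank matrix is simulated rather than derived from real expression data.

```python
# Sketch: average rank, between-study heterogeneity, and Monte Carlo nulls.
import numpy as np

def metradisc(rank_matrix, n_perm=2000, seed=0):
    """rank_matrix: genes x studies matrix of within-study ranks."""
    rng = np.random.default_rng(seed)
    avg = rank_matrix.mean(axis=1)
    het = rank_matrix.var(axis=1)                    # between-study heterogeneity
    null_avg = np.empty((n_perm, rank_matrix.shape[0]))
    null_het = np.empty_like(null_avg)
    for b in range(n_perm):
        perm = np.column_stack([rng.permutation(col) for col in rank_matrix.T])
        null_avg[b] = perm.mean(axis=1)
        null_het[b] = perm.var(axis=1)
    p_high_rank = (null_avg >= avg).mean(axis=0)     # one-sided: unusually high average rank
    p_heterog = (null_het >= het).mean(axis=0)       # unusually heterogeneous ranks
    return avg, het, p_high_rank, p_heterog

rng = np.random.default_rng(4)
ranks = np.column_stack([rng.permutation(np.arange(1, 101)) for _ in range(7)])
ranks[0] = [98, 99, 97, 100, 96, 99, 95]             # one consistently top-ranked gene
avg, het, p_rank, p_het = metradisc(ranks, n_perm=500)
print(f"gene 0: mean rank {avg[0]:.1f}, permutation p = {p_rank[0]:.3f}")
```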

7.
An immortal N-(diphenylphosphanyl)-1,3-diisopropyl-4,5-dimethyl-1,3-dihydro-2H-imidazol-2-imine/diisobutyl (2,6-di-tert-butyl-4-methylphenoxy) aluminum (P(NIiPr)Ph2/(BHT)AliBu2)-based frustrated Lewis pair (FLP) polymerization strategy is presented for rapid and scalable synthesis of sequence-controlled multiblock copolymers at room temperature. Without the addition of an extra initiator or catalyst, and without a complex synthetic procedure, this method afforded a tripentacontablock copolymer (n=53, k=4, DPn=50) with the highest reported block number (n=53) and molecular weight (Mn=310 kg mol−1) within 30 min. More importantly, this FLP polymerization strategy provides access to multiblock copolymers with tailored properties by precisely adjusting the monomer sequence and block numbers.

8.
In metabonomics it is difficult to tell which peak is which in datasets with many samples; this is known as the correspondence problem. Data from different samples are not synchronised, i.e., the peak from one metabolite does not appear in exactly the same place in all samples. For datasets with many samples this problem is nontrivial, because each sample contains hundreds to thousands of peaks that shift and are identified ambiguously. Statistical analysis of the data assumes that the peaks from one metabolite are found in one column of a data table. For every error in the data table, the statistical analysis loses power and the risk of missing a biomarker increases. It is therefore important to solve the correspondence problem by synchronising samples, yet no method solves it once and for all. In this review, we analyse the correspondence problem, discuss current state-of-the-art methods for synchronising samples, and predict the properties of future methods.
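As a toy illustration of what "solving the correspondence problem" means in practice, the sketch below matches each sample's peaks to a reference peak list by nearest position within a tolerance window, so that each metabolite ends up in one column of the data table. Real alignment methods reviewed in the paper are far more robust; positions, tolerance and intensities here are invented.

```python
# Hedged sketch: naive peak correspondence by nearest-neighbour matching.
import numpy as np

def match_to_reference(ref_positions, sample_positions, sample_intensities, tol=0.02):
    """One row of the data table: one intensity per reference peak (NaN if unmatched)."""
    row = np.full(len(ref_positions), np.nan)
    for pos, inten in zip(sample_positions, sample_intensities):
        j = np.argmin(np.abs(ref_positions - pos))   # nearest reference peak
        if abs(ref_positions[j] - pos) <= tol:
            row[j] = inten if np.isnan(row[j]) else max(row[j], inten)
    return row

ref = np.array([1.20, 2.35, 3.10, 4.05])             # reference peak positions (e.g. ppm)
samples = [
    (np.array([1.21, 2.34, 4.06]), np.array([10.0, 5.0, 2.0])),
    (np.array([1.19, 3.11, 4.04]), np.array([9.0, 7.0, 2.5])),
]
table = np.vstack([match_to_reference(ref, p, i) for p, i in samples])
print(table)
```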

9.
An immortal N-(diphenylphosphanyl)-1,3-diisopropyl-4,5-dimethyl-1,3-dihydro-2H-imidazol-2-imine/diisobutyl (2,6-di-tert-butyl-4-methylphenoxy) aluminum (P(NIiPr)Ph2/(BHT)AliBu2)-based frustrated Lewis pair (FLP) polymerization strategy is presented for rapid and scalable synthesis of sequence-controlled multiblock copolymers at room temperature. Without the addition of an extra initiator or catalyst, and without a complex synthetic procedure, this method afforded a tripentacontablock copolymer (n=53, k=4, DPn=50) with the highest reported block number (n=53) and molecular weight (Mn=310 kg mol−1) within 30 min. More importantly, this FLP polymerization strategy provides access to multiblock copolymers with tailored properties by precisely adjusting the monomer sequence and block numbers.

10.
Plant-wide process monitoring is challenging because of the complex relationships among the numerous variables in modern industrial processes. Multi-block process monitoring is an efficient approach for plant-wide processes; however, dividing the original space into subspaces remains an open issue. The loading matrix generated by principal component analysis (PCA) describes the correlation between the original variables and the extracted components and reveals the internal relations within the plant-wide process. Thus, a multi-block PCA method is proposed that constructs principal component (PC) sub-blocks according to the generalized Dice coefficient of the loading matrix. PCs corresponding to similar loading vectors are placed in the same sub-block, so that the PCs in a sub-block share similar variational behavior for certain faults. This improves the sensitivity of process monitoring within the sub-block. A monitoring statistic T² is computed for each sub-block and integrated into a final probability index based on Bayesian inference. A corresponding contribution plot is also developed to identify the root cause. The superiority of the proposed method is demonstrated by two case studies: a numerical example and the Tennessee Eastman benchmark. Comparisons with other PCA-based methods are also provided. Copyright © 2014 John Wiley & Sons, Ltd.
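The sketch below illustrates the sub-block construction step: PCs whose loading vectors are similar under a Dice-style coefficient are grouped, and a T² statistic is evaluated per sub-block. The particular Dice formula, the grouping threshold and the simulated data are my own assumptions; the paper's Bayesian combination of the sub-block statistics and the contribution plot are not reproduced.

```python
# Hedged sketch: grouping PCs by loading-vector similarity, then per-block T^2.
import numpy as np

def dice(p, q):
    # One common generalization of the Dice coefficient to continuous vectors
    return 2 * np.sum(np.abs(p * q)) / (np.sum(p ** 2) + np.sum(q ** 2))

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))
X[:, 5:] += X[:, :5]                                 # correlated variable groups
X = (X - X.mean(0)) / X.std(0)

evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(evals)[::-1][:6]
loadings, variances = evecs[:, order], evals[order]

# Greedy grouping of PCs whose loading vectors are Dice-similar
threshold, blocks = 0.5, []
for i in range(loadings.shape[1]):
    for blk in blocks:
        if dice(loadings[:, i], loadings[:, blk[0]]) > threshold:
            blk.append(i)
            break
    else:
        blocks.append([i])

def t2_per_block(x_new):
    t = x_new @ loadings
    return [np.sum(t[blk] ** 2 / variances[blk]) for blk in blocks]

x_fault = X[0].copy()
x_fault[2] += 4.0                                    # simulated sensor fault
print("sub-blocks:", blocks)
print("T2 per sub-block:", np.round(t2_per_block(x_fault), 1))
```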

11.
Untargeted analyses in mass spectrometry imaging produce hundreds of ion images representing the spatial distributions of biomolecules in biological tissues. Because of the large diversity of ions detected in untargeted analyses, normalization standards are often difficult to implement to account for pixel-to-pixel variability in imaging studies. Many normalization strategies exist to account for this variability, but they largely do not improve image quality. In this study, we present a new approach for improving image quality and the visualization of tissue features by applying sequential paired covariance (SPC). The approach is demonstrated using previously published tissue datasets, such as rat brain and human prostate, with different biomolecules such as metabolites and N-linked glycans. Data transformation by SPC improved the ion images, giving smoother biological features compared with commonly used normalization approaches.

12.
The large size of the hyperspectral datasets produced with modern mass spectrometric imaging techniques makes it difficult to analyze the results. Unsupervised statistical techniques are needed to extract relevant information from these datasets and reduce the data to a surveyable overview. Multivariate statistics are commonly used for this purpose. Computational power and computer memory limit the resolution at which the datasets can be analyzed with these techniques. We introduce the use of a data format capable of efficiently storing sparse datasets for multivariate analysis. This format is more memory-efficient and therefore increases the achievable resolution while also decreasing computation time. Three multivariate techniques are compared for both sparse-type and non-sparse data acquired in two different imaging ToF-SIMS experiments and one LDI-ToF imaging experiment. There is no significant qualitative difference between the data formats for the same multivariate algorithms, and all evaluated multivariate techniques could be applied to both the SIMS and the LDI imaging datasets. Principal component analysis is shown to be the fastest choice; however, a small increase in computation time from a VARIMAX optimization improves the decomposition quality significantly. PARAFAC analysis is shown to be very effective in separating different chemical components, but the calculations take a significant amount of time, limiting its use as a routine technique. An effective visualization of the results of the multivariate analysis is as important for the analyst as the computational issues. For this reason, a new visualization technique is presented that combines spectral loadings and spatial scores into one three-dimensional view of the complete datacube.
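The memory argument can be made concrete with a small sketch: spectra that are mostly zeros are held in a compressed sparse format and decomposed with a sparse-aware SVD. The matrix sizes are illustrative, and this is a generic sparse workflow rather than the specific data format used in the paper.

```python
# Sketch: sparse storage plus a PCA-like decomposition that accepts sparse input.
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(6)
n_pixels, n_channels = 10_000, 5_000
X = sparse.random(n_pixels, n_channels, density=0.01, format="csr", random_state=6)

dense_bytes = n_pixels * n_channels * 8
print(f"dense: {dense_bytes / 1e6:.0f} MB, sparse: {X.data.nbytes / 1e6:.0f} MB")

# TruncatedSVD works directly on sparse matrices (no mean-centering, so this is
# a PCA-like decomposition rather than exact PCA)
svd = TruncatedSVD(n_components=5, random_state=0)
scores = svd.fit_transform(X)
print("explained variance ratio:", np.round(svd.explained_variance_ratio_, 3))
```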

13.
In analytical chemistry, evaluating the performance accuracy of an analytical method is an important issue. When an adjusted or new method (the test method) is developed, the linear measurement error model is commonly used to compare it with a reference method. In this routine practice, the measurements from the reference method are placed on the x-axis and those from the test method on the y-axis; the slope of this linear relationship then indicates the agreement between them and hence the performance of the test method. Under the assumption that both variables are subject to heteroscedastic measurement errors, a novel approach based on the concept of a generalized pivotal quantity (GPQ) is proposed to construct confidence intervals for the slope. Its performance is compared with two maximum likelihood estimation (MLE)-based approaches through simulation studies. It is shown that the proposed GPQ-based approach is capable of maintaining empirical coverage probabilities close to the nominal level while yielding reasonable expected lengths. The GPQ-based approach can be recommended for practical use because of its easy implementation and better performance than the MLE-based approaches in most simulation scenarios. Two real datasets are given to illustrate the approaches. Copyright © 2011 John Wiley & Sons, Ltd.

14.
Extended topochemical atom (ETA) indices developed by our group have been applied extensively in our previous reports to toxicity and ecotoxicity modelling in the field of quantitative structure–activity relationships (QSARs). In the present study these indices have been further explored by defining additional novel parameters to model the n-octanol–water partition coefficient (two datasets; n = 168 and 139), water solubility (n = 193), molar refractivity (n = 166), and the aromatic substituent constants π, MR, σm, and σp (n = 99). All the models developed in the present study have undergone rigorous internal and external validation tests, and the models have high statistical significance and prediction potential. In terms of Q² and r² values, the models developed for the datasets of whole molecules are better than those previously reported with topochemically arrived unique (TAU) indices on the same datasets of chemicals. An attempt has also been made to develop models using non-ETA topological and information indices. Interestingly, the ETA and non-ETA models were found to have similar predictive capacity.
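For readers unfamiliar with the validation metrics quoted above, the sketch below computes a leave-one-out Q² for a training set and an external r² for a held-out test set of a multiple linear regression model. The random "descriptor" matrix is only a stand-in; the ETA indices themselves are not computed here.

```python
# Sketch: internal (LOO Q^2) and external (r^2) validation of a QSAR regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(168, 5))                     # stand-in descriptor matrix
y = X @ np.array([0.8, -0.5, 0.3, 0.0, 0.1]) + 0.2 * rng.normal(size=168)
X_train, X_test, y_train, y_test = X[:120], X[120:], y[:120], y[120:]

model = LinearRegression().fit(X_train, y_train)
y_loo = cross_val_predict(LinearRegression(), X_train, y_train, cv=LeaveOneOut())

press = np.sum((y_train - y_loo) ** 2)            # predictive residual sum of squares
q2 = 1 - press / np.sum((y_train - y_train.mean()) ** 2)              # internal Q^2
r2_ext = 1 - np.sum((y_test - model.predict(X_test)) ** 2) / \
             np.sum((y_test - y_train.mean()) ** 2)                   # external r^2
print(f"Q2 = {q2:.3f}, external r2 = {r2_ext:.3f}")
```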

15.
Several approaches for investigating the relationships between two datasets in which the individuals are structured into groups are discussed. These strategies fit within the framework of partial least squares (PLS) regression. Each strategy of analysis is introduced on the basis of a maximization criterion involving the covariances between components associated with the groups of individuals in each dataset. Thereafter, algorithms are proposed to solve these maximization problems. The strategies of analysis can be considered as extensions of multi-group principal components analysis to the context of PLS regression. Copyright © 2014 John Wiley & Sons, Ltd.
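To give the flavour of such a criterion, the sketch below finds one pair of common weight vectors w (for X) and c (for Y) that maximize the sum over groups of cov(X_g w, Y_g c); for that particular criterion the solution is the leading singular pair of the summed group-wise cross-product matrices. This is an illustration under my own choice of criterion, not a reproduction of any of the paper's specific algorithms.

```python
# Hedged sketch: one common multi-group PLS component from summed cross-products.
import numpy as np

def multigroup_pls_component(X_groups, Y_groups):
    M = sum((Xg - Xg.mean(0)).T @ (Yg - Yg.mean(0))
            for Xg, Yg in zip(X_groups, Y_groups))
    U, s, Vt = np.linalg.svd(M)
    w, c = U[:, 0], Vt[0]                         # common X-weights and Y-weights
    scores = [((Xg - Xg.mean(0)) @ w, (Yg - Yg.mean(0)) @ c)
              for Xg, Yg in zip(X_groups, Y_groups)]
    return w, c, scores

rng = np.random.default_rng(8)
X_groups = [rng.normal(size=(30, 6)) for _ in range(3)]
Y_groups = [Xg[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(30, 2))
            for Xg in X_groups]
w, c, scores = multigroup_pls_component(X_groups, Y_groups)
print("common X-weights:", np.round(w, 2))
```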

16.
The selection abilities of two well-known variable selection techniques, synergy interval partial least squares (SiPLS) and genetic algorithm partial least squares (GA-PLS), have been examined and compared. Using different simulated and real (corn and metabolite) datasets, and taking the spectral overlap of the components into account, the influence of selecting either intervals of variables or individual variables on prediction performance was examined. In the simulated datasets, GA-PLS results were better when the overlap of the component spectra decreased and when the components had narrow bands. In contrast, the performance of SiPLS was higher for data with intermediate overlap. For mixtures of highly overlapping analytes, GA-PLS showed slightly better performance. However, significant differences between the results of the two selection methods were not observed in most cases. Although SiPLS gave slightly better prediction performance for the corn dataset, except for the moisture content, the improvement obtained by SiPLS compared with GA-PLS was not significant. For real data with less overlapped components (the metabolite dataset), GA-PLS, which tends to select far fewer variables, did not give significantly better root mean square error of cross-validation (RMSECV), cross-validated R² (Q²), or root mean square error of prediction (RMSEP) than SiPLS. Irrespective of the type of dataset, GA-PLS resulted in models with fewer latent variables (LVs). In terms of computational time, GA-PLS is superior to SiPLS. Copyright © 2010 John Wiley & Sons, Ltd.
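The sketch below illustrates the SiPLS side of the comparison: the spectrum is split into equal-width intervals, PLS models are cross-validated on every combination of a small number of intervals, and the combination with the lowest RMSECV is kept. The simulated spectra, interval count and number of latent variables are illustrative choices.

```python
# Sketch: synergy interval PLS (SiPLS) style interval selection by RMSECV.
import itertools
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(9)
wl = np.linspace(0, 1, 120)
pure = np.vstack([np.exp(-((wl - c) / 0.06) ** 2) for c in (0.2, 0.5, 0.8)])
conc = rng.uniform(size=(60, 3))
X = conc @ pure + 0.005 * rng.normal(size=(60, 120))
y = conc[:, 0]

n_intervals, n_combine, n_lv = 10, 2, 3
intervals = np.array_split(np.arange(X.shape[1]), n_intervals)

best = (np.inf, None)
for combo in itertools.combinations(range(n_intervals), n_combine):
    cols = np.concatenate([intervals[i] for i in combo])
    y_cv = cross_val_predict(PLSRegression(n_components=n_lv), X[:, cols], y, cv=5)
    rmsecv = np.sqrt(np.mean((y - y_cv.ravel()) ** 2))
    best = min(best, (rmsecv, combo))
print(f"best interval pair {best[1]} with RMSECV = {best[0]:.4f}")
```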

17.
18.
A new approach for automatic parallel processing of large mass spectral datasets in a distributed computing environment is demonstrated to decrease the total processing time significantly. The implementation of this novel approach is described and evaluated for large nanoLC-FTICR-MS datasets. The speed benefit is determined only by the network speed and file transfer protocols, allowing almost real-time analysis of complex data (e.g., a 3-gigabyte raw dataset is fully processed within 5 min). Key advantages of this approach are not limited to the improved analysis speed; they also include improved flexibility and reproducibility and the possibility of sharing and reusing the pre- and postprocessing strategies. The storage of all raw data, combined with the massively parallel processing approach described here, allows the scientist to reprocess data with a different set of parameters (e.g., apodization, calibration, noise reduction), as recommended by the proteomics community. This approach to parallel processing was developed in the Virtual Laboratory for e-Science (VL-e), a science portal that aims at allowing access to users outside the computer research community. As such, this strategy can be applied to all types of serially acquired large mass spectral datasets, such as LC-MS, LC-MS/MS, and high-resolution imaging MS results.
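The core idea, that independently acquired spectra can be processed concurrently, is sketched below with a local process pool instead of the paper's distributed VL-e infrastructure; the "processing" step is a trivial threshold peak picker standing in for a real pipeline.

```python
# Hedged sketch: embarrassingly parallel per-spectrum processing with a process pool.
import numpy as np
from multiprocessing import Pool

def process_spectrum(args):
    """Stand-in per-spectrum pipeline: simple threshold-based peak picking."""
    idx, spectrum = args
    noise = np.median(spectrum)
    peaks = np.flatnonzero(spectrum > 10 * noise)
    return idx, len(peaks)

def main():
    rng = np.random.default_rng(10)
    spectra = [np.abs(rng.normal(0.0, 1.0, size=50_000)) for _ in range(200)]
    spectra[17][1000] = 500.0                     # inject one obvious peak
    with Pool(processes=4) as pool:
        results = pool.map(process_spectrum, list(enumerate(spectra)))
    print("spectra with detected peaks:", [i for i, n in results if n > 0])

if __name__ == "__main__":                         # guard needed for multiprocessing
    main()
```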

19.
We have found that molecular shape and electrostatics, in conjunction with 2D structural fingerprints, are important variables for discriminating between classes of active and inactive compounds. The subject of this paper is how to explore the selection of these variables and identify their relative importance in quantitative structure–activity relationship (QSAR) analysis. We show the use of these variables in a form of similarity searching with respect to the crystal structure of a known bound ligand. This analysis is then validated through k-fold cross-validation of enrichments via several common classifiers. Additionally, we show an effective methodology for using the variables in hypothesis generation, namely when the crystal structure of a bound ligand is not known.
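The validation step mentioned above can be sketched generically: a common classifier is cross-validated on shape, electrostatic and fingerprint descriptors, and scored by the enrichment of actives among the top-ranked compounds. The random feature matrix below stands in for real similarity or descriptor values, and the classifier choice is illustrative.

```python
# Sketch: k-fold cross-validated enrichment of actives from stand-in descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(11)
n_actives, n_decoys = 50, 950
X = np.vstack([rng.normal(0.6, 0.2, size=(n_actives, 8)),   # stand-in descriptors
               rng.normal(0.4, 0.2, size=(n_decoys, 8))])
y = np.array([1] * n_actives + [0] * n_decoys)

scores = np.zeros(len(y))
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[train], y[train])
    scores[test] = clf.predict_proba(X[test])[:, 1]

top = np.argsort(scores)[::-1][: len(y) // 20]               # top 5% of the ranked list
ef_5 = y[top].mean() / y.mean()                              # enrichment factor at 5%
print(f"cross-validated enrichment factor at 5%: {ef_5:.1f}")
```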

20.
In spectroscopy the measured spectra are typically plotted as a function of the wavelength (or wavenumber) but analysed with multivariate data analysis techniques (multiple linear regression (MLR), principal components regression (PCR), partial least squares (PLS)) that consider the spectrum as a set of m different variables. From a physical point of view it could be more informative to describe the spectrum as a function rather than as a set of points, thereby taking into account the physical background of the spectrum: a sum of absorption peaks of the different chemical components, in which the absorbances at two neighbouring wavelengths are highly correlated. In the first part of this contribution, a motivating example for this functional approach is given. In the second part, the potential of functional data analysis in the field of chemometrics is discussed and compared with the ubiquitous PLS regression technique using two practical datasets. It is shown that for spectral data, B-splines provide an appealing basis for describing the data accurately. By applying both functional data analysis and PLS to the datasets, the predictive ability of functional data analysis is found to be comparable to that of PLS. Moreover, many chemometric datasets have a specific structure (e.g. replicate measurements on the same object, or objects that are grouped), but this structure is often removed before analysis (e.g. by averaging the replicates). In the second part of this contribution, we also suggest a method to adapt traditional analysis of variance (ANOVA) methods to datasets with spectroscopic data. In particular, the possibilities of exploring and interpreting sources of variation, such as variations in sample and ambient temperature, are examined. Copyright © 2008 John Wiley & Sons, Ltd.
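The functional idea can be sketched as follows: each spectrum is represented by the coefficients of a common B-spline basis and the regression is carried out on those coefficients, with a PLS model on the raw wavelengths fitted alongside for comparison. The simulated spectra, basis size and spline degree are illustrative assumptions, not the settings used in the paper.

```python
# Hedged sketch: B-spline (functional) representation of spectra vs. PLS on raw data.
import numpy as np
from scipy.interpolate import BSpline
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(12)
wl = np.linspace(0, 1, 200)
pure = np.vstack([np.exp(-((wl - c) / 0.07) ** 2) for c in (0.3, 0.6)])
conc = rng.uniform(size=(50, 2))
X = conc @ pure + 0.01 * rng.normal(size=(50, 200))
y = conc[:, 0]

# Common clamped cubic B-spline basis evaluated on the wavelength grid
n_basis, degree = 15, 3
knots = np.concatenate([[0] * degree, np.linspace(0, 1, n_basis - degree + 1), [1] * degree])
basis = BSpline.design_matrix(wl, knots, degree).toarray()   # (200, n_basis)

coefs, *_ = np.linalg.lstsq(basis, X.T, rcond=None)          # spectra -> basis coefficients
X_fun = coefs.T                                              # (50, n_basis) functional data

def rmsecv(model, data):
    pred = cross_val_predict(model, data, y, cv=5)
    return np.sqrt(np.mean((y - np.ravel(pred)) ** 2))

print(f"functional (B-spline) regression RMSECV: {rmsecv(LinearRegression(), X_fun):.4f}")
print(f"PLS (4 LVs) on raw spectra RMSECV:       {rmsecv(PLSRegression(n_components=4), X):.4f}")
```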
