Similar literature
 20 similar documents found (search time: 40 ms)
1.
Single imputation methods have been widely discussed among researchers in the field of bioinformatics. One major shortcoming of the methods proposed so far is the lack of robustness considerations. Like all data, gene expression data can contain outlying values. The presence of these outliers could have negative effects on the imputed values for the missing values. As a result, any statistical analysis of the completed data could lead to incorrect conclusions. Therefore it is important to consider the possibility of outliers in the data set, and to evaluate how imputation techniques will handle these values. In this paper, a simulation study is performed to test existing techniques for data imputation when outlying values are present in the data. To overcome some shortcomings of the existing imputation techniques, a new robust imputation method that can deal with the presence of outliers in the data is introduced. In addition, the robust imputation procedure cleans the data for further statistical analysis. Moreover, this method can be easily extended towards a multiple imputation approach by which the uncertainty of the imputed values is emphasised. Finally, a classification example illustrates the lack of robustness of some existing imputation methods and shows the advantage of the multiple imputation approach of the new robust imputation technique.

2.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression‐based methods, with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.

3.
As missing values are often encountered in gene expression data, many imputation methods have been developed to substitute these unknown values with estimated values. Despite their number, the available techniques have some disadvantages. Some imputation techniques constrain the imputation of missing values to a limited set of genes, whereas other imputation methods optimise a more global criterion whereby the computation time of the method becomes infeasible. Others might be fast but inaccurate. Therefore in this paper a new, fast and accurate estimation procedure, called SEQimpute, is proposed. By minimising a statistical distance rather than a Euclidean distance, the method is intrinsically different from existing imputation methods. Moreover, this newly proposed method can be easily embedded in a multiple imputation technique which is better suited to highlight the uncertainties about the missing value estimates. A comparative study is performed to assess the estimation of the missing values by different imputation approaches. The proposed imputation method is shown to outperform some of the existing imputation methods in terms of accuracy and computation speed.

4.
Cross‐validation has become one of the principal methods to adjust the meta‐parameters in predictive models. Extensions of the cross‐validation idea have been proposed to select the number of components in principal components analysis (PCA). The element‐wise k‐fold (ekf) cross‐validation is among the most used algorithms for principal components analysis cross‐validation. This is the method programmed in the PLS_Toolbox, and it has been stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation, and it can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is conducted to identify in which situations the application of ekf is adequate and, more importantly, in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discarding principal components that carry important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm. Copyright © 2012 John Wiley & Sons, Ltd.
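
As an illustration of the imputation step named in this abstract, the following is a minimal sketch of trimmed score imputation as it is commonly described: scores are computed from the observed variables only (missing ones contribute zero), and the missing entries are then predicted from those scores. The function name `trimmed_score_impute`, the loading values and the sample are hypothetical and are not taken from the paper; the data are assumed to be mean‐centred.

```python
def trimmed_score_impute(x, loadings):
    """Fill None entries of one mean-centred observation x.

    loadings has one row per variable and one column per component
    (illustrative layout, assumed here).
    """
    n_comp = len(loadings[0])
    # Scores use observed variables only; missing ones contribute zero.
    scores = [
        sum(loadings[j][a] * v for j, v in enumerate(x) if v is not None)
        for a in range(n_comp)
    ]
    # Missing entries are predicted as loading row times scores.
    return [
        v if v is not None
        else sum(loadings[j][a] * scores[a] for a in range(n_comp))
        for j, v in enumerate(x)
    ]


# Hypothetical one-component model with three variables.
loadings = [[0.6], [0.0], [0.8]]
sample = [3.0, 0.5, None]
filled = trimmed_score_impute(sample, loadings)
```

Because the score is built from a truncated loading vector, it is generally biased towards zero, which is one way to see why the choice of imputation method can matter inside ekf.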

5.
Missing values can arise for different reasons and, depending on their origin, should be considered differently and dealt with in different ways. In this research, four methods of imputation have been compared with respect to their effects on the normality and variance of data, on statistical significance and on the approximation of a suitable threshold to accept missing data as truly missing. Additionally, the effects of different strategies for controlling the familywise error rate or false discovery rate, and how they work with the different strategies for missing value imputation, have been evaluated. Missing values were found to affect normality and variance of data, and k‐means nearest neighbour imputation was the best method tested for restoring this. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that as low as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a “gray area” and therefore a strategy has been proposed that provides a balance between the optimal imputation strategy, which was k‐means nearest neighbour, and the best approximation of positioning real zeros.
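
To make the nearest neighbour idea concrete, here is a minimal sketch of plain k‐nearest‐neighbour imputation. It is an assumption that this matches the "k‐means nearest neighbour" variant named in the abstract, which gives no algorithmic details; the function `knn_impute` and the toy data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if len(complete) < k:
        raise ValueError("need at least k complete rows")
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        # Distance uses only the columns observed in the incomplete row.
        def dist(other, row=row, observed=observed):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        result.append([
            v if v is not None else sum(n[j] for n in neighbours) / k
            for j, v in enumerate(row)
        ])
    return result


# Toy data: the last row is missing its second value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [5.0, 6.0, 7.0],
    [1.0, None, 3.1],
]
imputed = knn_impute(data, k=2)
```

In this toy case the two closest complete rows are the first two, so the gap is filled with the average of their second column.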

6.
In industrial processes, investigating the root causes of abnormal events is a crucial task when process faults are detected; isolating the faulty variables provides additional information for investigating the root causes of the faults. The traditional contribution plot is a popular and perspicuous tool to isolate faulty variables. However, this method can only determine one faulty variable (the biggest contributor) when there are several variables out of control at the same time. In the presented work, a novel fault diagnosis method is derived using k‐nearest neighbor (kNN) reconstruction on maximize reduce index (MRI) sensors; it is aimed at identifying all fault variables precisely. This method can identify the faulty variables effectively by reconstructing MRI variables one by one. A numerical example first validates the performance of the kNN missing data analysis method; multi‐sensor fault identification results are then given. The Tennessee Eastman process is used to demonstrate that the proposed approach can identify the variables responsible for multiple sensor faults. Copyright © 2015 John Wiley & Sons, Ltd.

7.
Pseudotargeted metabolomics is a novel strategy integrating the advantages of both untargeted and targeted methods. The conventional pseudotargeted metabolomics required two MS instruments, i.e., ultra-high performance liquid chromatography/quadrupole-time-of-flight mass spectrometry (UHPLC/Q-TOF MS) and UHPLC/triple quadrupole mass spectrometry (UHPLC/QQQ-MS), which makes method transformation inevitable. Furthermore, the picking of ion pairs from thousands of candidates and the swapping of the data between two instruments are the most labor-intensive steps, which greatly limit its application in metabolomic analysis. In the present study, we proposed an improved pseudotargeted metabolomics method that could be achieved on a UHPLC/Q-TOF/MS instrument operated in the multiple ion monitoring (MIM) mode with time-staggered ion lists (tsMIM). Full scan-based untargeted analysis was applied to extract the target ions. After peak alignment and ion fusion, a stepwise ion picking procedure was used to generate the ion lists for subsequent single MIM and tsMIM. The UHPLC/Q-TOF tsMIM MS-based pseudotargeted approach exhibited better repeatability and a wider linear range than the UHPLC/Q-TOF MS-based untargeted metabolomics method. Compared to the single MIM mode, the tsMIM significantly increased the coverage of the metabolites detected. The newly developed method was successfully applied to discover plasma biomarkers for alcohol-induced liver injury in mice, which indicated its practicability and great potential in future metabolomics studies.

8.
Processing plants can produce large amounts of data that process engineers use for analysis, monitoring, or control. Principal component analysis (PCA) is well suited to analyze large amounts of (possibly) correlated data, and for reducing the dimensionality of the variable space. Failing online sensors, lost historical data, or missing experiments can lead to data sets that have missing values where the current methods for obtaining the PCA model parameters may give questionable results due to the properties of the estimated parameters. This paper proposes a method based on nonlinear programming (NLP) techniques to obtain the parameters of PCA models in the presence of incomplete data sets. We show the relationship that exists between the nonlinear iterative partial least squares (NIPALS) algorithm and the optimality conditions of the squared residuals minimization problem, and how this leads to the modified NIPALS used for the missing value problem. Moreover, we compare the current NIPALS‐based methods with the proposed NLP with a simulation example and an industrial case study, and show how the latter is better suited when there are large amounts of missing values. The solutions obtained with the NLP and the iterative algorithm (IA) are very similar. However when using the NLP‐based method, the loadings and scores are guaranteed to be orthogonal, and the scores will have zero mean. The latter is emphasized in the industrial case study. Also, with the industrial data used here we are able to show that the models obtained with the NLP were easier to interpret. Moreover, when using the NLP many fewer iterations were required to obtain them. Copyright © 2010 John Wiley & Sons, Ltd.
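
The modified NIPALS iteration mentioned in this abstract is usually described as ordinary NIPALS in which every regression simply skips the missing cells. The sketch below extracts a single component that way; it is an illustrative baseline, not the NLP method proposed in the paper. The function name `nipals_component`, the starting rule and the toy matrix are assumptions, the data are assumed to be mean‐centred already, and no guard is included for columns or rows that are entirely missing.

```python
import math


def nipals_component(x, max_iter=500, tol=1e-12):
    """First component of x (list of rows, None = missing cell)."""
    n, m = len(x), len(x[0])
    # Start from the column with the fewest gaps, gaps set to zero.
    start = min(range(m), key=lambda j: sum(x[i][j] is None for i in range(n)))
    t = [x[i][start] if x[i][start] is not None else 0.0 for i in range(n)]
    p = [0.0] * m
    for _ in range(max_iter):
        # Loadings: regress each column on t using observed rows only.
        for j in range(m):
            rows = [i for i in range(n) if x[i][j] is not None]
            p[j] = sum(x[i][j] * t[i] for i in rows) / sum(t[i] ** 2 for i in rows)
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: regress each row on p using observed columns only.
        t_new = []
        for i in range(n):
            cols = [j for j in range(m) if x[i][j] is not None]
            t_new.append(
                sum(x[i][j] * p[j] for j in cols) / sum(p[j] ** 2 for j in cols)
            )
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return t, p


# Rank-one toy matrix; the hidden cell would be 8.0 if it were observed.
matrix = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 12.0],
]
scores, loading = nipals_component(matrix)
estimate = scores[1] * loading[2]
```

Because the loading vector is renormalised after the per‐column regressions while the scores are computed with per‐row denominators, nothing in this iteration enforces orthogonal or zero‐mean scores when further components are extracted, which is the property the abstract says the NLP formulation guarantees.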

9.
A central problem in the emerging field of metabolomics is how to identify the compounds comprising a chemical mixture of biological origin. NMR spectroscopy can greatly assist in this identification process, by means of multi-dimensional correlation spectroscopy, particularly total correlation spectroscopy (TOCSY). This Communication demonstrates how non-negative matrix factorization (NMF) provides an efficient means of data reduction and clustering of TOCSY spectra for the identification of unique traces representing the NMR spectra of individual compounds. The method is applied to a metabolic mixture whose compounds could be unambiguously identified by peak matching of NMF components against the BMRB metabolomics database.

10.
Biomarker selection through the metabolomics approach involves the acquisition of nontargeted metabolic profiles. In this study, some critical factors that may affect this process were investigated using urine test samples and a UPLC‐TOF system. Repeated injections of a single sample demonstrated that the percentage of undetected and poorly repeatable measurements (intensity RSD > 15%) decreased from 22.5 to 5.8% and from 32.9 to 14.7%, respectively, as the scan time was increased up to 0.6 s (approximately 11 data points per peak). An additional critical factor was identified in the presence of broad concentration differences between the samples; the application of a dilution scheme that minimized these differences reduced the number of missing values in the final datasets by 36%. The impact of missing values was further investigated in the study of two groups of samples produced by using a spike as artificial marker. Missing values weakened the models used for the interpretation of the metabolic profiles, and greatly hindered the identification of possible markers. Finally, a simple strategy for an effective analysis of urine samples was outlined; it proved to limit the need for the post‐acquisition elaboration of the data. The same strategy can easily be adapted to other matrices. Copyright © 2011 John Wiley & Sons, Ltd.
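
The repeatability criterion quoted in this abstract (intensity RSD > 15%) can be sketched as follows. The relative standard deviation is taken here as the sample standard deviation divided by the mean, expressed in percent; the feature names and intensities are invented for illustration and do not come from the study.

```python
from statistics import mean, stdev


def rsd_percent(intensities):
    """Relative standard deviation (sample SD / mean) in percent."""
    return 100.0 * stdev(intensities) / mean(intensities)


def poorly_repeatable(features, limit=15.0):
    """Names of features whose RSD across repeated injections exceeds limit."""
    return [name for name, values in features.items() if rsd_percent(values) > limit]


# Hypothetical peak intensities from four repeated injections of one sample.
features = {
    "feature_a": [100.0, 102.0, 98.0, 101.0],
    "feature_b": [100.0, 140.0, 70.0, 120.0],
}
flagged = poorly_repeatable(features)
```

Here only `feature_b` exceeds the 15% limit; `feature_a` varies by less than 2%.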

11.
Stanimirova I, Walczak B. Talanta, 2008, 76(3): 602-609
Missing elements and outliers can often occur in experimental data. The presence of outliers makes the evaluation of any least squares model parameters difficult, while the missing values influence the adequate identification of outliers. Therefore, approaches that can handle incomplete data containing outliers are highly valued. In this paper, we present the expectation-maximization robust soft independent modeling of class analogy approach (EM-S-SIMCA) based on the recently introduced spherical SIMCA method. Several important issues like the possibility of choosing the complexity of the model with the leverage correction procedure, the selection of training and test sets using methods of uniform design for incomplete data and prediction of new samples containing missing elements are discussed. The results of a comparison study showed that EM-S-SIMCA outperforms the classic expectation-maximization SIMCA method. The performance of the method was illustrated on simulated and real data sets and led to satisfactory results.

12.
The oil sands regions of Northern Alberta, Canada, contain an estimated 1.7 trillion barrels of oil in the form of bitumen, representing the second largest deposit of crude oil in the world. A rapidly expanding industry extracts surface-mined bitumen using alkaline hot water, resulting in large volumes of oil sands process water (OSPW) that must be contained on site due to toxicity. The toxicity has largely been attributed to naphthenic acids (NAs), a complex mixture of naturally occurring aliphatic and (poly-)alicyclic carboxylic acids. Research has increasingly focused on the environmental fate and remediation of OSPW NAs, but an understanding of these processes necessitates an analytical method that can accurately characterize and quantify NA mixtures. Here we report results of an interlaboratory comparison for the analysis of pure commercial NAs and environmental OSPW NAs using direct injection electrospray ionization mass spectrometry (ESI-MS) and high-pressure liquid chromatography/high-resolution mass spectrometry (HPLC/HRMS). Both methods provided very similar characterization of the pure commercial NA mixture; however, the m/z selectivity of HPLC/HRMS was essential to prevent substantial false-positive detections and misclassifications in OSPW NA mixtures. For a range of concentrations encompassing those found in OSPW (10-100 mg/L), both methods produced a linear response, although concentrations of commercial NAs above 50 mg/L resulted in slight non-linearity by HPLC/HRMS. A three-fold lower response factor for total OSPW NAs by HPLC/HRMS was largely attributable to other organic compounds in the OSPW, including hydroxylated NAs, which may explain the substantial misclassification by ESI-MS. For the quantitative analysis of unknown OSPW samples, both methods yielded total NA concentrations that correlated with results from Fourier transform infrared (FTIR), but the coefficients of determination were not high. Quantification by either MS method should therefore be considered semi-quantitative at best, although either method has substantial value in environmental fate experiments where relative concentration changes are the desired endpoints rather than absolute concentrations.

13.
Mass spectrometry-based metabolomics applied to the chemical safety of food   Total citations: 1, self-citations: 0, citations by others: 1
Mass spectrometry (MS)-based metabolomics is emerging as an important field of research in many scientific areas, including chemical safety of food. A particular strength of this approach is its potential to reveal some physiological effects induced by complex mixtures of chemicals present at trace concentrations. The limitations of other analytical approaches currently employed to detect low-dose and mixture effects of chemicals make detection very problematic. Besides this basic technical challenge, numerous analytical choices have to be made at each step of a metabolomics study, and each step can have a direct impact on the final results obtained and their interpretation (i.e. sample preparation, sample introduction, ionization, signal acquisition, data processing, and data analysis). As the application of metabolomics to chemical analysis of food is still in its infancy, no consensus has yet been reached on defining many of these important parameters. In this context, the aim of the present study is to review all these aspects of MS-based approaches to metabolomics, and to give a comprehensive, critical overview of the current state of the art, possible pitfalls, and future challenges and trends linked to this emerging field.

14.
Neomangiferin (NMF) is an unusual xanthone that is simultaneously a C-glycoside and an O-glycoside and has a variety of biological activities, such as anti-inflammatory, antitumor and antipyretic effects. As far as we know, its metabolic profile has not yet been sufficiently characterized. Herein, a Drug Metabolite Cluster Centers (DMCCs)-based strategy has been developed to profile the NMF metabolites in vivo and in vitro. Firstly, the DMCCs were proposed on the basis of the literature and preliminary analysis results. Secondly, the specific metabolic rule was implemented to screen the metabolites of candidate DMCCs from the acquired Ultra High Performance Liquid Chromatography Quadrupole Exactive Orbitrap Mass Spectrometry (UHPLC-Q-Exactive Orbitrap MS) data by the extracted ion chromatography (EIC) method. Thirdly, candidate metabolites were accurately and tentatively identified according to mass spectrometric fragmentation patterns, literature reports, comparison with reference substances, and especially the diagnostic product ions (DPIs) deduced preliminarily. Finally, network pharmacology was adopted to elucidate the anti-inflammatory action mechanism of NMF on the basis of DMCCs. As a result, 3 critical metabolites including NMF, Mangiferin (MF) and Norathyriol (NA) were proposed as DMCCs, and a total of 61 NMF metabolites (NMF included) were finally screened and characterized using 3 different biological sample preparation methods: solid phase extraction (SPE), acetonitrile precipitation and methanol precipitation. Among them, 32 metabolites were discovered in rat urine, 30 in rat plasma, 12 in rat liver, 9 in liver microsomes and 8 in rat faeces. Our results also illustrated that NMF primarily underwent deglucosylation, glucuronidation, methylation, sulfation, dihydroxylation and their composite reactions in vivo and in vitro. Additionally, network pharmacology analysis based on DMCCs revealed 85 common targets of disease-metabolites, and the key targets were TNF, EGFR, ESR1, PTGS2, HIF1A, IL-2, PRKCA and PRKCB. They exerted anti-inflammatory effects mainly through the pathways of inflammatory response, calcium-dependent protein kinase C activity, nitrogen metabolism, pathways in cancer and so on. In general, our study constructed a novel strategy to comprehensively elucidate the biotransformation pathways of NMF in vivo and in vitro, and provided a vital reference for further understanding its anti-inflammatory action mechanism. Moreover, the established strategy could be generalized to the metabolism and action mechanism study of other natural products.

15.
The quenching of pyrene and 1‐methylpyrene fluorescence by nitroanilines (NAs), such as 2‐, 3‐, and 4‐nitroaniline (2‐NA, 3‐NA, and 4‐NA, respectively), 4‐methyl‐3‐nitroaniline (4‐M‐3‐NA), 2‐methyl‐4‐nitroaniline (2‐M‐4‐NA), and 4‐methyl‐3,5‐dinitroaniline (4‐M‐3,5‐DNA), is studied in toluene and 1,4‐dioxane. Steady‐state fluorescence data show the higher efficiency of the 4‐NAs as quenchers and fit with a sphere‐of‐action model. This suggests a tendency of the 4‐NAs to be in close proximity to the fluorophore, which could be connected with their high polarity/hyperpolarizability. In addition, emission and excitation spectra give evidence of the formation of emissive pyrene—NA ground‐state complexes in the case of the 4‐NAs and, to a lesser degree, of 2‐NA. Moreover, time‐resolved fluorescence experiments show that increasing amounts of NA decrease the pyrene fluorescence lifetime to a degree that depends on the NA nature and is larger in the less viscous solvent (toluene). Although the NA absorption and the pyrene (Py) emission overlap, we found no evidence of dipole–dipole energy transfer from the pyrene singlet excited state (1Py) to the NAs; this could be due to the low NA concentration used in these experiments. Transient absorption spectra show that the formation of the pyrene triplet excited state (3Py) is barely affected by the presence of the NAs in spite of their efficiency in 1Py quenching, suggesting the involvement of 1Py—NA exciplexes which, after intersystem crossing, decay efficiently into 3Py.

16.
Advances in sensory systems have led to many industrial applications with large amounts of highly correlated data, particularly in chemical and pharmaceutical processes. With these correlated data sets, it becomes important to consider advanced modeling approaches built to deal with correlated inputs in order to understand the underlying sources of variability and how this variability will affect the final quality of the product. Additional to the correlated nature of the data sets, it is also common to find missing elements and noise in these data matrices. Latent variable regression methods such as partial least squares or projection to latent structures (PLS) have gained much attention in industry for their ability to handle ill‐conditioned matrices with missing elements. This feature of the PLS method is accomplished through the nonlinear iterative PLS (NIPALS) algorithm, with a simple modification to consider the missing data. Moreover, in expectation maximization PLS (EM‐PLS), imputed values are provided for missing data elements as initial estimates, conventional PLS is then applied to update these elements, and the process iterates to convergence. This study is the extension of previous work for principal component analysis (PCA), where we introduced nonlinear programming (NLP) as a means to estimate the parameters of the PCA model. Here, we focus on the parameters of a PLS model. As an alternative to modified NIPALS and EM‐PLS, this paper presents an efficient NLP‐based technique to find model parameters for PLS, where the desired properties of the parameters can be explicitly posed as constraints in the optimization problem of the proposed algorithm. We also present a number of simulation studies, where we compare effectiveness of the proposed algorithm with competing algorithms. Copyright © 2014 John Wiley & Sons, Ltd.

17.
Non-negative matrix factorization (NMF) is a recently developed method for real time data analysis. In the past it has been used for facial recognition and spectral data analysis. Most NMF algorithms do not converge to a stable limit point, and uniqueness of the results is also a problem in NMF. To improve the convergence, a new NMF algorithm with modified multiplicative update (ML-NMFmse) has been proposed in this work for the separation of strongly overlapped and embedded chromatograms. To get the same results for all runs, three different initialization methods have been used instead of random initialization, namely ALS–NMF (robust initialization), NNDSVD based initialization and EFA based initialization. The proposed ML-NMFmse algorithm is applied to the simulated and experimental overlapped chromatograms obtained for an acetone and acrolein mixture, using Gas Chromatography–Flame Ionization Detector. Before applying NMF, Principal Component Analysis (PCA) was applied to determine the number of components in the mixture. The result of the proposed ML-NMFmse is compared with that of the existing Multivariate Curve Resolution-Alternating Least Squares method in optimal conditions for both algorithms. In the case of the embedded chromatogram, the proposed ML-NMFmse with the robust method (ALS-NMF) of initialization performs better than all other methods. For the resolution of severely overlapped chromatograms, the proposed ML-NMFmse with the NNDSVD method of initialization outperforms all other methods.
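
For orientation, the sketch below shows the standard multiplicative update rule for NMF with a squared-error objective (the rule usually attributed to Lee and Seung), which is the kind of baseline a modified multiplicative update would start from. It is not the ML-NMFmse algorithm of the abstract, whose details are not given here. The helper names, the seeded random initialization and the toy matrix are all assumptions for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Sum of squared residuals between v and the product w h."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix v into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy "chromatogram" matrix built from two overlapping profiles.
v = [
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 3.0, 3.0, 1.0],
    [2.0, 4.0, 2.0, 0.0],
]
w0, h0 = nmf(v, rank=2, iterations=0)   # initial guess only
w, h = nmf(v, rank=2, iterations=200)   # after the updates
```

The fixed seed only makes this particular run repeatable; the structured initializations named in the abstract (ALS–NMF, NNDSVD, EFA) address the same reproducibility concern in a more principled way.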

18.
Naphthenic acids (NAs) have been implicated as some of the most toxic substances in oil sands leachates and identified as priority substances impacting on aquatic environments. As a group of compounds, NAs are not well characterized and comprise a large group of saturated aliphatic and alicyclic carboxylic acids found in hydrocarbon deposits (petroleum, oil sands bitumen, and crude oils). Described is an analytical method using negative-ion electrospray ionization mass spectrometry (ES/MS) of extracts. Preconcentration was achieved by using a solid-phase extraction procedure utilizing a crosslinked polystyrene-based polymer with acetonitrile elution. Recovery of the Fluka Chemicals NA mixture was highly pH-dependent, with 100% recovery at pH 3.0, but only 66 and 51% recoveries at pHs 7 and 9, respectively. The dissolved phase of the NA was very dependent on sample pH. It is thus critical to measure the pH and determine the appropriate mass profiles to identify NAs in natural waters. The ES/MS analytical procedure proved to be a fast and sensitive method for the recovery and detection of NAs in natural waters, with a detection limit of 0.01 mg/L.

19.
Multivariate chemical data often contain elements that are missing completely at random and the so-called left-censored elements whose values are only known to be below a definite threshold value (reporting limit). In the last several years, attention has been paid to developing methods for dealing with data containing missing elements and those that can handle data with missing elements and outliers. However, processing data with both missing and left-censored elements is still an ongoing problem.

20.
Oil-sand naphthenic acids (NAs) are organic wastes produced during the oil-sand digestion and extraction processes and are very difficult to separate and analyze as individual components due to their complex compositions. A comprehensive two-dimensional gas chromatography/time of flight mass spectrometry (GC x GC/TOF-MS) system was applied for the characterization of two commercial mixtures of naphthenic acids (Fluka and Acros) and a naphthenic acid sample extracted from the Syncrude tailings. Contour plots of chromatographic distributions of different Z homologous series of the Fluka, Acros and Syncrude NAs were constructed using fragment ions that were characteristic of the NAs' molecular structures. Well-ordered patterns were observed for NAs of Z = 0 and -2, which corresponded to acyclic acids and monocyclic acids, respectively. For NAs of Z = -4, -6, and -8, specific zones were observed which would allow the pattern recognition of these NAs obtained from different origins. As expected, gas chromatographic retention times increase with the number of carbons and the number of rings in the molecules. Little signal was obtained for NAs with Z numbers of -10 or lower. Deconvoluted mass spectra of various NA isomers were derived from the reconstructed GC x GC chromatogram, permitting detailed structural elucidations for NAs in the future. The current study demonstrated that the combination of GC x GC and TOF-MS is a powerful tool to identify the origins of NAs in an effective manner. GC x GC/TOF-MS alone, however, may not be enough to characterize each individual isomer in a complex mixture such as NAs. The use of mass deconvolution software followed by a library search has thus become necessary to separate and study the mass spectrum of each individual NA component, allowing a detailed identification of the toxic components within the NA mixture.

