Similar Literature
20 similar records found
1.
A new method of imputation for left‐censored datasets is reported. This method is evaluated by examining datasets in which the true values of the censored data are known so that the quality of the imputation can be assessed both visually and by means of cluster analysis. Its performance in retaining certain data structures on imputation is compared with that of three other imputation algorithms by using cluster analysis on the imputed data. It is found that the new imputation method benefits a subsequent model‐based cluster analysis performed on the left‐censored data. The stochastic nature of the imputations performed in the new method can provide multiple imputed sets from the same incomplete data. The analysis of these provides an estimate of the uncertainty of the cluster analysis. Results from clustering suggest that the imputation is robust, with smaller uncertainty than that obtained from other multiple imputation methods applied to the same data. In addition, the use of the new method avoids problems with ill‐conditioning of group covariances during imputation as well as in the subsequent clustering based on expectation–maximization. The strong imputation performance of the proposed method on simulated datasets becomes more apparent as the groups in the mixture models are increasingly overlapped. Results from real datasets suggest that the best performance occurs when the requirement of normality of each group is fulfilled, which is the main assumption of the new method. Copyright © 2013 John Wiley & Sons, Ltd.
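The stochastic, multiple-imputation idea described in this abstract can be illustrated with a minimal sketch. This is not the paper's model-based method (which assumes a normal mixture within groups); it only shows the general mechanism of drawing each left-censored value from a distribution truncated above at the detection limit, repeated to give several completed datasets. The function name and parameters (`impute_left_censored`, `lod`, `mu`, `sigma`) are illustrative assumptions.

```python
import numpy as np

def impute_left_censored(x, lod, mu, sigma, n_draws=5, seed=0):
    """Replace left-censored entries (NaN, known only to lie below `lod`)
    with draws from a normal(mu, sigma) truncated above at `lod`,
    returning `n_draws` completed copies of the data."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    censored = np.where(np.isnan(x))[0]
    completed = []
    for _ in range(n_draws):
        xi = x.copy()
        for i in censored:
            d = rng.normal(mu, sigma)
            while d >= lod:          # rejection sampling: keep only draws below lod
                d = rng.normal(mu, sigma)
            xi[i] = d
        completed.append(xi)
    return completed
```

Running any downstream analysis (here, cluster analysis) on each completed copy and comparing the results gives the uncertainty estimate the abstract refers to.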

2.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression‐based methods, with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.
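MLPCA itself reweights residuals by measurement-error variances; a simpler relative of the "very large variance" trick for missing cells is ordinary iterative PCA imputation, sketched below under the assumption of low-rank data structure. The function name and defaults are illustrative, not the MLPCA algorithm itself.

```python
import numpy as np

def iterative_pca_impute(X, n_components=2, n_iter=100):
    """Fill NaN cells by alternating between a rank-k PCA reconstruction
    and replacement of the missing cells with their reconstructed values."""
    X = np.asarray(X, dtype=float).copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])   # start from column means
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        X[mask] = X_hat[mask]                          # overwrite only missing cells
    return X
```

Observed cells are never altered; only the NaN cells iterate toward consistency with the low-rank model.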

3.
As missing values are often encountered in gene expression data, many imputation methods have been developed to substitute these unknown values with estimated values. Despite the number of available techniques, each has disadvantages. Some constrain the imputation of missing values to a limited set of genes, whereas others optimise a more global criterion at a computational cost that becomes infeasible. Others are fast but inaccurate. Therefore, a new, fast and accurate estimation procedure, called SEQimpute, is proposed in this paper. By minimising a statistical distance rather than a Euclidean distance, the method is intrinsically different from existing imputation methods. Moreover, it can easily be embedded in a multiple imputation technique, which is better suited to highlighting the uncertainty of the missing value estimates. A comparative study is performed to assess the estimation of the missing values by different imputation approaches. The proposed method is shown to outperform several existing imputation methods in terms of accuracy and computation speed.

4.
In mass spectrometry (MS)-based metabolomics, missing values (NAs) may be due to different causes, including sample heterogeneity, ion suppression, spectral overlap, inappropriate data processing, and instrumental errors. Although a number of methodologies have been applied to handle NAs, NA imputation remains a challenging problem. Here, we propose a non-negative matrix factorization (NMF)-based method for NA imputation in MS-based metabolomics data, which makes use of both global and local information of the data. The proposed method was compared with three commonly used methods: k-nearest neighbors (kNN), random forest (RF), and outlier-robust (ORI) missing values imputation. These methods were evaluated from the perspectives of accuracy of imputation, retrieval of data structures, and rank of imputation superiority. The experimental results showed that the NMF-based method is well-adapted to various cases of data missingness and the presence of outliers in MS-based metabolic profiles. It outperformed kNN and ORI and showed results comparable with the RF method. Furthermore, the NMF method is more robust and less susceptible to outliers as compared with the RF method. The proposed NMF-based scheme may serve as an alternative NA imputation method which may facilitate biological interpretations of metabolomics data.
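The paper's NMF scheme is not fully specified in this abstract, but its core mechanism — factorise over the observed cells only, then fill the gaps from the low-rank product — can be sketched with masked multiplicative updates. The rank, iteration count, and random seed below are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def nmf_impute(X, rank=2, n_iter=500, eps=1e-9):
    """Weighted NMF: fit W @ H to the observed (non-NaN) cells with
    multiplicative updates, then use W @ H to fill the missing cells."""
    M = ~np.isnan(X)                      # observed-entry mask
    Xf = np.where(M, X, 0.0)              # zeros in place of NaN
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ Xf) / (W.T @ (M * WH) + eps)
        WH = W @ H
        W *= (Xf @ H.T) / ((M * WH) @ H.T + eps)
    return np.where(M, X, W @ H)
```

Because the mask `M` zeroes out the missing cells in every update, they never influence the fit; non-negativity of the factors is what suits intensity-like metabolomics data.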

5.
Single imputation methods have been widely discussed topics among researchers in the field of bioinformatics. One major shortcoming of the methods proposed until now is the lack of robustness considerations. Like all data, gene expression data can possess outlying values. The presence of these outliers can have a negative effect on the values imputed for the missing entries, and any subsequent statistical analysis of the completed data can then lead to incorrect conclusions. It is therefore important to consider the possibility of outliers in the data set and to evaluate how imputation techniques handle such values. In this paper, a simulation study is performed to test existing imputation techniques when outlying values are present in the data. To overcome some shortcomings of the existing techniques, a new robust imputation method that can deal with the presence of outliers is introduced. In addition, the robust imputation procedure cleans the data for further statistical analysis. Moreover, this method can easily be extended towards a multiple imputation approach, in which the uncertainty of the imputed values is emphasised. Finally, a classification example illustrates the lack of robustness of some existing imputation methods and shows the advantage of the multiple imputation approach of the new robust imputation technique.

6.
Sampling, and the uncertainty of sampling, are important tasks when industrial processes are monitored. Missing values and unequal sampling sources can cause problems in almost all industrial fields. One major problem is that samples may not be collected during weekends; alternatively, a single composite sample may be collected over the weekend. These systematically occurring missing values (gaps) affect the uncertainties of the measurements. Another type of missing value is the random gap, caused, for example, by instrument failures. Pierre Gy's sampling theory includes tools to evaluate all error components involved in sampling heterogeneous materials. Variograms, introduced in Gy's sampling theory, have been developed to estimate the uncertainty of auto-correlated process measurements. Variographic experiments are used to estimate the variance of different sample selection strategies: random sampling, stratified random sampling and systematic sampling. In this paper, both systematic and random gaps were studied using simulations and real process data taken from bark boilers of pulp and paper mills (combustion processes). Systematic gaps were filled by linear interpolation, and cases involving composite sampling were also studied. The aims of this paper are to establish: (1) how reliably a variogram calculated from data with systematic gaps estimates the process variogram, and (2) how the uncertainty caused by missing gaps can be estimated when reporting time-averages of auto-correlated time series measurements. The results show that when systematic gaps were filled by linear interpolation, only minor changes in the variogram values were observed; the differences between the variograms were consistently smallest with composite samples.
For random gaps, the results show that for non-periodic processes the stratified random sampling strategy gives more reliable results than the systematic sampling strategy. Stratified random sampling should therefore be used when estimating the uncertainty of random gaps in reporting time-averages of auto-correlated time series measurements.
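Variographic experiments of the kind described above rest on the empirical variogram of the process series. A minimal sketch follows; it is NaN-tolerant, so random gaps simply drop pairs from the lag averages. The function name and interface are assumptions for illustration, not the authors' code.

```python
import numpy as np

def empirical_variogram(x, max_lag):
    """V(j) = 0.5 * mean of (x[t+j] - x[t])^2 over the pairs where both
    points are present; NaN gaps drop out of the mean automatically."""
    x = np.asarray(x, dtype=float)
    v = np.empty(max_lag)
    for j in range(1, max_lag + 1):
        d = x[j:] - x[:-j]              # lag-j differences; NaN if either end missing
        v[j - 1] = 0.5 * np.nanmean(d ** 2)
    return v
```

Comparing `empirical_variogram` of the full series with that of a gapped copy is one way to check, as the paper does, how much systematic gaps distort the variogram.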

7.
This paper reports an experimental design optimization of a recently proposed silylation procedure that avoids the introduction of false positives and false negatives in the simultaneous determination of the steroid hormones estrone (E1) and 17-alpha-ethinylestradiol (EE2) by gas chromatography-mass spectrometry (GC/MS). The figures of merit for several calibration procedures were evaluated under optimum conditions in the silylation step. Internal standardization strategies were applied, and global models were constructed by gathering signals recorded on three non-consecutive days. The calibration models examined included a univariate model based on the sum of six monitored ions and a three-way PARAFAC-based model in which the analyte scores were standardized against the scores of the internal standard. The global PARAFAC-based calibration model showed the best performance, with detection capabilities of 4.3 µg l−1 and 7.0 µg l−1 for E1 and EE2, respectively, when the probability of false positives was fixed at 1% and that of false negatives at 5%. The mean relative error in absolute terms was 11.1% for E1 and 8.5% for EE2, and trueness was likewise confirmed. The optimized derivatization procedure with a three-way calibration function was also applied to the determination of E1, 17-beta-estradiol (E2) and EE2 in bovine urine samples: recovery values were 68.5%, 40.4% and 43.4%, respectively, and the detection capabilities were 18.4, 19.3 and 18.6 µg l−1 under the same false positive and false negative probabilities. The mean relative error in absolute terms was 7.4% for E1, 9.4% for E2 and 8.6% for EE2, and trueness was likewise confirmed.

8.
Cross‐validation has become one of the principal methods for adjusting the meta‐parameters of predictive models. Extensions of the cross‐validation idea have been proposed to select the number of components in principal components analysis (PCA). The element‐wise k‐fold (ekf) cross‐validation is among the most widely used algorithms for PCA cross‐validation. It is the method programmed in the PLS_Toolbox, and it has been reported to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation and can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is carried out to identify the situations in which the application of ekf is adequate and, more importantly, those in which it is not. The results show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discarding principal components that carry important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm. Copyright © 2012 John Wiley & Sons, Ltd.
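The ekf idea — delete a diagonal pattern of individual cells, impute them, and accumulate the squared prediction error — can be sketched as below. For brevity the loadings are taken once from the full data and trimmed score imputation is reduced to "zero out the centred left-out cells before computing scores"; the real algorithm refits the model per fold, so treat this as an illustration of the mechanism only, with all names assumed.

```python
import numpy as np

def ekf_press(X, n_components, n_folds=7):
    """Element-wise k-fold CV for PCA: for each fold, zero a diagonal
    pattern of centred cells (trimmed score imputation), predict them
    from the remaining cells, and accumulate the squared error (PRESS)."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                     # loadings from the full data
    press = 0.0
    for f in range(n_folds):
        # diagonal deletion pattern: cell (i, j) left out when (i + j) % n_folds == f
        mask = (np.add.outer(np.arange(n), np.arange(m)) % n_folds) == f
        Xt = np.where(mask, 0.0, Xc)            # trimmed: left-out cells set to 0
        T = Xt @ P                              # scores from the remaining cells
        Xhat = T @ P.T
        press += np.sum((Xc[mask] - Xhat[mask]) ** 2)
    return press
```

Plotting `press` against `n_components` is the usual way such a criterion is used to pick the number of components; the paper's point is that with trimmed score imputation this criterion can be misleading.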

9.
Data records with equidistant time intervals are fundamental prerequisites for the development of water quality simulation models. Usually long-term water quality data time series contain missing data or data with different sampling intervals. In such cases artificial data have to be added to obtain records based on a regular time grid. Generally, this can be done by interpolation, approximation or filtering of data sets. In contrast to approximation by an analytical function, interpolation methods estimate missing data by means of measured concentration values. In this paper, methods of interpolation and approximation are applied to long-term water quality data sets with daily sampling intervals. Using such data for the water temperature and phosphate phosphorus in some shallow lakes, it was possible to identify the process of phosphate remobilisation from sediment.
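Linear interpolation of gaps onto the regular (here, daily) grid, the simplest of the approaches mentioned, is essentially one line with NumPy. The wrapper below is an illustrative sketch, not the authors' code.

```python
import numpy as np

def fill_gaps(t, y):
    """Linearly interpolate NaN gaps in a sampled series back onto the
    same time grid, so downstream models see an equidistant complete record."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    ok = ~np.isnan(y)                       # indices of measured values
    return np.interp(t, t[ok], y[ok])
```

Approximation by an analytical function (e.g. a seasonal fit) would replace `np.interp` with a model evaluated on `t`; the interpolation shown here passes exactly through the measured concentrations.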

10.
A PC-based interactive programme is described which is designed to help suggest the best estimate of the true value of an analyte content from the results of collaborative studies aimed at deriving consensus values and/or preparing reference materials, by employing combined statistical and analytical considerations. The Grubbs, Dixon and Huber tests, and the coefficients of skewness and kurtosis tests, are used for outlier detection; the Bartlett, Cochran and standard error tests are employed for testing variance homogeneity and/or identifying variance outliers; and the normality of the distribution of results is tested according to the Kolmogoroff-Smirnoff-Lilliefors and Shapiro-Wilk tests. One-way analysis of variance (ANOVA) is employed to test differences among the means of results obtained under different conditions (laboratories, analytical methods, etc.) and to calculate the overall mean and its confidence interval accordingly. Points for analytical discussion are given which should be considered before deciding whether a trace element determination identified as an outlier on statistical grounds should be rejected.

11.
Mantle cell lymphoma (MCL) cell lines have been difficult to generate: only a few have been described so far, and even fewer have been thoroughly characterized. Among them, only one cell line, GRANTA-519, is well established and universally adopted for most lymphoma studies. We succeeded in establishing a new MCL cell line, called MAVER-1, from a leukemic MCL, and performed a thorough phenotypical, cytogenetical and molecular characterization of the cell line. In the present report, the phenotypic expression of the GRANTA-519 and MAVER-1 cell lines has been compared and evaluated by a proteomic approach exploiting 2-D map analysis. By univariate statistical analysis (Student's t-test, as commonly used in most commercial software packages), most of the protein spots were found to be identical between the two cell lines. Thirty spots were found to be unique to GRANTA-519, whereas another 11 polypeptides appeared to be expressed only by the MAVER-1 cell line. A number of these spots could be identified by MS. These data were confirmed and expanded by multivariate statistical tools (principal component analysis and soft independent modelling of class analogy), which allowed identification of a larger number of differentially expressed spots. Multivariate statistical tools have the advantage of reducing the risk of false positives and of identifying spots that are significantly altered in terms of correlated expression rather than absolute expression values. It is thus suggested that, in future work on differential proteomic profiling, both univariate and multivariate statistical tools should be adopted.

12.
This contribution presents and discusses an efficient algorithm for multivariate linear regression analysis of data sets with missing values. The algorithm is based on the insight that multivariate linear regression can be formulated as a set of individual univariate linear regressions. All available information is used and the calculations are explicit. The only restriction is that the independent variable matrix has to be non-singular. There is no need for imputation of interpolated or otherwise guessed values which require subsequent iterative refinement.
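The decomposition this abstract describes — multivariate linear regression treated as a set of univariate regressions, each using every row where that particular response is observed — can be sketched directly. This is a minimal illustration of the insight, with assumed names, not the paper's exact algorithm.

```python
import numpy as np

def mlr_with_missing(X, Y):
    """Fit B in Y ≈ X @ B one response column at a time, using for each
    column only the rows where that response is observed (non-NaN)."""
    p = X.shape[1]
    q = Y.shape[1]
    B = np.empty((p, q))
    for j in range(q):
        rows = ~np.isnan(Y[:, j])          # rows with this response observed
        B[:, j], *_ = np.linalg.lstsq(X[rows], Y[rows, j], rcond=None)
    return B
```

No value is ever imputed: a missing response simply drops its row from that one univariate regression, which is why no iterative refinement is needed.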

13.
A method of comparing predicted and experimental chemical shifts was used to confirm or refute postulated structures. 1H NMR spectra returned all true positives with a false positive rate of 4%. When an analogous procedure was adopted for 13C NMR spectra, the false positive rate dropped to 1%, whereas the more practical HSQC data yielded a false positive rate of 2%. If the HSQC results were combined with 1H results, a false positive rate of 1% resulted, 4 times more accurate than 1H alone.

14.
Autoantibodies obtained from cancer patients have been identified as useful tools for cancer diagnostics, prognostics, and as potential targets for immunotherapy. Serological proteome analysis in combination with 2‐DE is a classic strategy for identification of tumor‐associated antigens in the serum of cancer patients. However, serological proteome analysis cannot always indicate the true antigen out of a complex proteome identified from a single protein spot because the most abundant protein is not always the most antigenic. To address this problem, we utilized multiple parallel separation (MPS) for proteome separation. The common identities present in the fractions obtained using different separation methods were regarded as the true antigens. The merit of our MPS technique was validated using anti‐ARPC2 and anti‐PTEN antibodies. Next, we applied the MPS technique for the identification of glycyl‐tRNA synthetase as the cognate antigen for an autoantibody that was overexpressed in the plasma of breast cancer patients. These results reveal that MPS can unambiguously identify an antibody cognate antigen by reducing false‐positives. Therefore, MPS could be used for the characterization of diagnostic antibodies raised in laboratory animals as well as autoantibodies isolated from diseased patients.

15.
I. Stanimirova, Talanta, 2007, 72(1), 172–178
An efficient methodology for dealing with missing values and outlying observations simultaneously in principal component analysis (PCA) is proposed. The concept described in the paper consists of using a robust technique to obtain robust principal components combined with the expectation maximization approach to process data with missing elements. It is shown that the proposed strategy works well for highly contaminated data containing different amounts of missing elements. The authors come to this conclusion on the basis of the results obtained from a simulation study and from analysis of a real environmental data set.

16.
Biomarker selection through the metabolomics approach involves the acquisition of nontargeted metabolic profiles. In this study, some critical factors that may affect this process were investigated using urine test samples and a UPLC‐TOF system. Repeated injections of a single sample demonstrated that the percentage of undetected and poorly repeatable measurements (intensity RSD > 15%) decreased from 22.5 to 5.8% and from 32.9 to 14.7%, respectively, as the scan time was increased up to 0.6 s (approximately 11 data points per peak). An additional critical factor was identified in the presence of broad concentration differences between the samples; the application of a dilution scheme that minimized these differences reduced the number of missing values in the final datasets by 36%. The impact of missing values was further investigated in the study of two groups of samples produced by using a spike as artificial marker. Missing values weakened the models used for the interpretation of the metabolic profiles, and greatly hindered the identification of possible markers. Finally, a simple strategy for an effective analysis of urine samples was outlined; it proved to limit the need for the post‐acquisition elaboration of the data. The same strategy can easily be adapted to other matrices. Copyright © 2011 John Wiley & Sons, Ltd.

17.
It has been shown that generalized F-statistics can give satisfactory performance in identifying differentially expressed genes in microarray data. For some complex diseases, however, a high proportion of false positives may still be identified because of the modest differential expression of disease-related genes and the systematic noise of microarrays. The main purpose of this study is to develop statistical methods for Affymetrix microarray gene expression data that reduce the impact of non-expressed genes on the false positive rate. I propose two novel generalized F-statistics for identifying differentially expressed genes and a novel approach for estimating adjusting factors. The proposed statistical methods systematically combine the filtering of non-expressed genes with the identification of differentially expressed genes. For comparison, the methods discussed were applied to an experimental data set from a type 2 diabetes study. In both two- and three-sample analyses, the proposed statistics showed improved control of false positives.

18.
A commonly used class of impedance spectrum validity tests is based on fitting serially connected, parallel-coupled resistor–capacitor pairs (RC elements) to a measured impedance spectrum. If the model approximates the spectrum well, the measurement is considered valid; if the approximation is poor, it is considered invalid. Despite being widely used, a theoretical justification for this class of tests is still missing. It is not clear which electrochemical processes can be approximated by such a model and hence for which processes a poor approximation of the spectrum truly indicates a faulty measurement, rather than merely a lack of generality of the model. The scope of this paper is to derive, from a systems-theory point of view, which class of systems can be approximated by serially connected RC elements, and from that to conclude for which electrochemical systems the mentioned class of validity tests is applicable. Moreover, the results yield a theoretical justification for generalizing the concept of the distribution function of relaxation times (DRT) by using positive and negative RC elements, so that its benefits can be utilized not only for strictly capacitive systems but for any non-oscillating electrochemical system.
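The model class in question has a simple closed form: a series resistance plus serially connected parallel RC elements gives Z(ω) = R0 + Σk Rk / (1 + jωRkCk). A sketch of evaluating it (function and parameter names are illustrative):

```python
import numpy as np

def rc_chain_impedance(omega, r_series, rc_pairs):
    """Impedance of a series resistor R0 plus serially connected parallel
    RC elements: Z(w) = R0 + sum_k Rk / (1 + j*w*Rk*Ck)."""
    omega = np.asarray(omega, dtype=float)
    z = np.full_like(omega, r_series, dtype=complex)
    for r, c in rc_pairs:
        z += r / (1.0 + 1j * omega * r * c)   # each parallel RC element in series
    return z
```

A validity test of the kind described fits `r_series` and the `(Rk, Ck)` pairs to a measured spectrum and judges the measurement by the residual; allowing negative Rk is the DRT generalization the paper discusses.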

19.
High-throughput screening (HTS) data are often noisy, containing both false positives and false negatives. Careful triaging and prioritization of the primary hit list can therefore save time and money by flagging potential false positives before incurring the expense of follow-up. Of particular concern are cell-based reporter gene assays (RGAs), where the number of hits may be too high to be scrutinized manually for weeding out erroneous data. Based on statistical models built from the chemical structures of 650 000 compounds tested in RGAs, we created "frequent hitter" models that make it possible to prioritize potential false positives. Furthermore, we followed up the frequent hitter evaluation with chemical structure based in silico target predictions to hypothesize a mechanism for the observed "off target" response. The predicted cellular targets of the frequent hitters were known to be associated with undesirable effects such as cytotoxicity; the most frequently predicted targets relate to apoptosis and cell differentiation, including kinases, topoisomerases, and protein phosphatases. The mechanism-based frequent hitter hypothesis was tested using 160 additional druglike compounds predicted by the model to be nonspecific actives in RGAs. This validation was successful, showing a 50% hit rate compared with a normal hit rate as low as 2%, and it demonstrates the power of computational models for understanding complex relations between chemical structure and biological function.

20.
Simultaneous determination of the fat-soluble vitamins A and E and the water-soluble vitamins B1, B2 and B6 has been carried out using a screening method based on fluorescence contour graphs. These graphs show different colour zones in relation to the fluorescence intensity measured for each pair of excitation/emission wavelengths. Identification of the corresponding excitation/emission wavelength zones allows the detection of the different vitamins in an aqueous medium regardless of the fat or water solubility of each vitamin, owing to the presence of a surfactant that forms micelles in water at the concentration used (above the critical micelle concentration). The micelles dissolve highly water-insoluble compounds, such as the fat-soluble vitamins, inside the aggregates. This approach avoids the use of organic solvents in determining these vitamins and offers the possibility of analysing fat- and water-soluble vitamins simultaneously. The method has been validated in terms of detection limit, cut-off limit, sensitivity, number of false positives, number of false negatives and uncertainty range. The detection limit is about g L–1. The screening method was applied to different samples such as pharmaceuticals, juices and isotonic drinks.

