Similar literature
20 similar articles found.
1.
As missing values are often encountered in gene expression data, many imputation methods have been developed to substitute these unknown values with estimated values. Despite their number, the available techniques have disadvantages. Some constrain the imputation of missing values to a limited set of genes, whereas others optimise a more global criterion, which makes the computation time infeasible. Others might be fast but inaccurate. Therefore, in this paper a new, fast and accurate estimation procedure, called SEQimpute, is proposed. By minimising a statistical distance rather than a Euclidean distance, the method is intrinsically different from existing imputation methods. Moreover, the proposed method can easily be embedded in a multiple imputation technique, which is better suited to highlighting the uncertainty of the missing value estimates. A comparative study is performed to assess the estimation of the missing values by different imputation approaches. The proposed imputation method is shown to outperform some of the existing imputation methods in terms of accuracy and computation speed.

2.
To explore multi-way data, different methods have been proposed. Here, we study the popular PARAFAC (parallel factor analysis) model, which expresses multi-way data in a more compact way without ignoring the underlying complex structure. To estimate the score and loading matrices, an alternating least squares procedure is typically used. It is, however, well known that least squares techniques suffer from outlying observations, making the models useless when outliers are present in the data. In this paper, we present a robust PARAFAC method. Essentially, it searches for an outlier-free subset of the data, on which the classical PARAFAC algorithm can then be performed. An outlier map is constructed to identify outliers. Simulations and examples show the robustness of our approach.

3.
In mass spectrometry (MS)-based metabolomics, missing values (NAs) may be due to different causes, including sample heterogeneity, ion suppression, spectral overlap, inappropriate data processing, and instrumental errors. Although a number of methodologies have been applied to handle NAs, NA imputation remains a challenging problem. Here, we propose a non-negative matrix factorization (NMF)-based method for NA imputation in MS-based metabolomics data, which makes use of both global and local information of the data. The proposed method was compared with three commonly used methods: k-nearest neighbors (kNN), random forest (RF), and outlier-robust (ORI) missing values imputation. These methods were evaluated from the perspectives of accuracy of imputation, retrieval of data structures, and rank of imputation superiority. The experimental results showed that the NMF-based method is well adapted to various cases of data missingness and to the presence of outliers in MS-based metabolic profiles. It outperformed kNN and ORI and showed results comparable with the RF method. Furthermore, the NMF method is more robust and less susceptible to outliers than the RF method. The proposed NMF-based scheme may serve as an alternative NA imputation method that may facilitate biological interpretation of metabolomics data.

4.
In analytical chemistry, proficiency testing (PT) usually consists of tests that laboratories conduct under routine conditions, reporting the result to the PT provider, who then converts it to a score that helps the participant assess the accuracy of the result. The aim of this work is to show PT providers, accreditation bodies, and participating laboratories that different scoring results can be obtained depending on the evaluation system selected. The influence of different evaluation techniques on the results of an interlaboratory comparison for the determination of gold in precious metal alloys was investigated. Results from 19 participating laboratories were evaluated by means of three procedures: (1) classical statistical approach (outlier detection); (2) robust methods: (2A) robust procedure and (2B) ISO 13528; and (3) fitness for purpose. Evaluation of the same PT data revealed notable issues that depended on the scoring system used and on the robustness of the statistical method used for detecting outliers. As a general rule, laboratories with scores of Z > 2 showed clearly poorer performance under the robust approaches than under the classical ones. To support this first evidence, a second data set with results from 24 laboratories (mercury in soil samples) was evaluated by means of the four approaches mentioned (1, 2A, 2B and 3). Scoring systems must be selected and compared very carefully, because they are sometimes neither the best approach for studying the data population nor the most appropriate one for evaluating the distribution of the data. Finally, it should be taken into account that robust scoring systems are not always suitable for evaluating the results of some PT schemes.
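The contrast between classical and robust scoring described in this abstract can be illustrated with a minimal sketch. It assumes a median/MAD-based z-score as the robust variant, which is a simple stand-in and not the ISO 13528 procedure or the authors' own robust procedure; the laboratory results are invented for illustration.

```python
import statistics

def classical_z(results):
    """z-scores from the arithmetic mean and the sample standard deviation."""
    mean = statistics.mean(results)
    sd = statistics.stdev(results)
    return [(x - mean) / sd for x in results]

def robust_z(results):
    """z-scores from the median and the scaled median absolute deviation (MAD)."""
    med = statistics.median(results)
    mad = statistics.median([abs(x - med) for x in results])
    scale = 1.4826 * mad  # makes the MAD comparable to sigma for normal data
    return [(x - med) / scale for x in results]

# Hypothetical gold results (g/kg) from ten laboratories; the last one is aberrant.
labs = [750.1, 749.8, 750.3, 750.0, 749.9, 750.2, 750.1, 749.7, 750.0, 753.0]

z_classical = classical_z(labs)
z_robust = robust_z(labs)

# The aberrant result inflates the classical standard deviation and so masks
# itself; the robust scale is barely affected, so the same result scores far worse.
print(round(z_classical[-1], 2), round(z_robust[-1], 2))
```

With these invented data the aberrant laboratory stays below 3 in the classical system but far exceeds it in the robust one, which is the kind of disagreement between scoring systems the abstract warns about.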

5.
Robustness tests are usually based on an experimental design approach. As designed experiments generally lead to a large variability among the results, erroneous results are often not readily detected. As a consequence, the ordinary least squares (OLS) estimates of the effects of the robustness test can be biased. Here, two robustness tests are studied, each of which contains a suspicious result. Moreover, simulated datasets are considered to examine the influence of the extent of the outlier as well as the influence of multiple outliers. On the one hand, different methods are applied to inspect the results of the experiments for outliers: the half-normal plot of the OLS residuals, the normal probability plot of the effects, and a method based on experimental design reconstruction. On the other hand, two robust regression methods are applied to calculate the effects with a minimum influence of possible outliers. The different methods are compared, and the circumstances under which they can be applied are evaluated.

6.
Parallel factor analysis (PARAFAC) is a widespread method for modeling fluorescence data by means of an alternating least squares procedure. Consequently, the PARAFAC estimates are highly influenced by outlying excitation–emission landscapes (EEM) and element‐wise outliers, such as Raman and Rayleigh scatter. Recently, a robust PARAFAC method that circumvents the harmful effects of outlying samples has been developed. Different techniques exist for removing the scatter effects on the final PARAFAC model, and an automated scatter identification tool has more recently been constructed. However, there is still no robust method for handling fluorescence data that contain both outlying EEM landscapes and scatter. In this paper, we present an iterative algorithm in which the robust PARAFAC method and the scatter identification tool are performed alternately. A fully automated robust PARAFAC method is obtained in that way. The method is assessed by means of simulations and a laboratory‐made data set. Copyright © 2009 John Wiley & Sons, Ltd.

7.
Ortiz MC  Sarabia LA  Herrero A 《Talanta》2006,70(3):499-512
The validation of an analytical procedure means the evaluation of performance criteria such as accuracy, sensitivity, linear range, capability of detection, selectivity, calibration curve, etc. This implies the use of different statistical methodologies, some of them related to statistical regression techniques, which may or may not be robust. The presence of outlier data has a significant effect on the determination of sensitivity, linear range or capability of detection, amongst others, when these figures of merit are evaluated with non-robust methodologies. In this paper some of the robust methods used for calibration in analytical chemistry are reviewed: the Huber M-estimator; the Andrews, Tukey and Welsh GM-estimators; the fuzzy estimators; the constrained M-estimators, CM; the least trimmed squares, LTS. The paper also shows that the mathematical properties of the least median of squares (LMS) regression can be of great interest in the detection of outlier data in chemical analysis. A comparative analysis is made of the results obtained by applying these regression methods to synthetic and real data. There is also a review of some applications where this robust regression works in a suitable and simple way that proves very useful to secure an objective detection of outliers. The use of a robust regression is recommended in ISO 5725-5.
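As an illustration of one estimator named in this abstract, the following is a minimal sketch of a Huber M-estimator for a straight-line calibration, computed by iteratively reweighted least squares. The tuning constant 1.345, the MAD-based residual scale, the fixed iteration count and the calibration data are all assumptions chosen for the example, not details taken from the paper.

```python
import statistics

def weighted_line(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        slope, intercept = weighted_line(x, y, w)
        res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust residual scale from the median absolute residual.
        scale = 1.4826 * statistics.median([abs(r) for r in res])
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, decreasing as 1/|r| outside it.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in res]
    return slope, intercept

# Hypothetical calibration data: true line y = 2x + 1 with small noise
# and one gross error in the response at x = 6.
x = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.05, -0.04, 0.02, -0.03, 0.04, -0.02, 0.03, -0.05]
y = [2.0 * xi + 1.0 + ei for xi, ei in zip(x, noise)]
y[5] += 10.0

ols_slope, ols_intercept = weighted_line(x, y, [1.0] * len(x))
huber_slope, huber_intercept = huber_line(x, y)
print(round(ols_slope, 3), round(huber_slope, 3))
```

Note that Huber-type weights only bound the influence of outliers in the response; an aberrant point at an extreme concentration (a leverage point) calls for estimators such as LTS or LMS, which the abstract also lists.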

8.
del Río FJ  Riu J  Rius FX 《The Analyst》2001,126(7):1113-1117
We developed a robust regression technique that generalizes the least median of squares (LMS) technique to the case in which the errors in both the predictor and the response variables are taken into account. This simple generalization is limited in the sense that the resulting straight line is found by using only two points from the initial data set. For this reason, a simulation step based on the Monte Carlo method is added to generate the best robust regression line. We call this new technique 'bivariate least median of squares' (BLMS), following the notation of the LMS method. We checked the robustness of the new regression technique by calculating its breakdown point, which was 50%. This confirms the robustness of the BLMS regression line. To show its applicability to the chemical field, we tested it on simulated data sets and on real data sets with outliers. The BLMS robust regression line was not affected by many types of outlying points in the data sets.
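The two-point construction mentioned in this abstract can be seen in a minimal sketch of the ordinary LMS line, which considers errors in the response only. It is not the BLMS method: the treatment of errors in both variables and the Monte Carlo step are omitted, and the exhaustive search over all point pairs and the noise-free data are simplifying assumptions for illustration.

```python
from itertools import combinations
import statistics

def lms_line(x, y):
    """Least median of squares line, searched over all two-point lines."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = a + b * x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Criterion: median of the squared residuals over all points.
        criterion = statistics.median(
            [(yk - (intercept + slope * xk)) ** 2 for xk, yk in zip(x, y)]
        )
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    return best[1], best[2]

# Idealised data: six points exactly on y = 3x + 0.5 and three gross outliers.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [3.5, 6.5, 9.5, 12.5, 15.5, 18.5, 30.0, 2.0, 40.0]

slope, intercept = lms_line(x, y)
print(slope, intercept)
```

Because the criterion is a median, the fit ignores the outliers as long as more than half of the points follow the line, which is the origin of the 50% breakdown point quoted in the abstract.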

9.
In this paper a robust version of the partial least squares model (partial robust M-regression, PRM) was built to predict the total antioxidant capacity of green tea extracts. To construct a calibration model, chromatograms obtained by a fast high-performance liquid chromatographic method on a monolithic silica column were related to the total antioxidant capacity of green tea extracts as determined by the Trolox antioxidant capacity method. Since natural samples are the subject of the study, some outlying samples are present in the data, as shown in an earlier work. Therefore, to construct reliable calibration models, they were detected and removed prior to modeling. With the applied robust partial least squares approach, in which a weighting scheme is embedded to down-weight the negative influence of outliers on the model, it is possible to construct a robust calibration model without prior identification of outlying objects. It was shown that a robust model, allowing satisfactory prediction for test samples, can be used to control the antioxidant capacity of green tea on the basis of chromatograms. The constructed robust partial least squares model was shown to have virtually the same fit and predictive power as the classical partial least squares model built after outlying samples had been removed from the data.

10.
The robustness of Shewhart control charts for subgroup means and subgroup ranges was tested with the Monte Carlo method, using training data sets comprising various numbers of points, with two repetitions in each subgroup (as in routine laboratory practice). The following control chart designs were tested: conventional, based on the arithmetic mean and standard deviation; robust, based on the median and/or the trimmed mean and the Winsorized standard deviation; and a two-stage design. The methods were applied to the system in the state of statistical control (outliers excluded) and to the system without statistical control (outliers included). Satisfactory results for both cases were obtained only with the two-stage control charts. The conventional charts led to underestimation of the effect of outliers in the system without statistical control, whereas the robust control charts led to overestimation of the effect of outliers (false alarms) in the system under statistical control. The tests also gave evidence that the training set should include at least 20 points. Received: 13 January 1997 Accepted: 12 February 1997
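The robust estimators named in this abstract can be sketched as follows. This is a minimal illustration, assuming a 10% trimming fraction on each side and invented subgroup means; it does not reproduce the study's chart designs, and the consistency correction that is usually applied to a Winsorized standard deviation is left out.

```python
import statistics

def trimmed_mean(values, fraction=0.1):
    """Mean after discarding the given fraction of points at each end."""
    data = sorted(values)
    k = int(len(data) * fraction)
    return statistics.mean(data[k:len(data) - k])

def winsorized_sd(values, fraction=0.1):
    """Standard deviation after pulling the extreme points in to the
    nearest retained values (no consistency correction applied)."""
    data = sorted(values)
    k = int(len(data) * fraction)
    if k:
        data = [data[k]] * k + data[k:len(data) - k] + [data[-k - 1]] * k
    return statistics.stdev(data)

# Hypothetical training set: 20 subgroup means, the last one an outlier.
means = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1,
         9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0, 14.0]

conventional = (statistics.mean(means), statistics.stdev(means))
robust = (trimmed_mean(means), winsorized_sd(means))

# Three-sigma control limits from each pair of estimates.
conventional_limits = (conventional[0] - 3 * conventional[1],
                       conventional[0] + 3 * conventional[1])
robust_limits = (robust[0] - 3 * robust[1], robust[0] + 3 * robust[1])
print(conventional_limits, robust_limits)
```

The single outlier widens the conventional limits enough to hide itself, while the robust limits stay narrow. Narrow limits are also what makes false alarms more likely when the process is actually in control, in line with the trade-off reported in the abstract.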

11.
Stanimirova I  Walczak B 《Talanta》2008,76(3):602-609
Missing elements and outliers can often occur in experimental data. The presence of outliers makes the evaluation of any least squares model parameters difficult, while the missing values influence the adequate identification of outliers. Therefore, approaches that can handle incomplete data containing outliers are highly valued. In this paper, we present the expectation-maximization robust soft independent modeling of class analogy approach (EM-S-SIMCA) based on the recently introduced spherical SIMCA method. Several important issues like the possibility of choosing the complexity of the model with the leverage correction procedure, the selection of training and test sets using methods of uniform design for incomplete data and prediction of new samples containing missing elements are discussed. The results of a comparison study showed that EM-S-SIMCA outperforms the classic expectation-maximization SIMCA method. The performance of the method was illustrated on simulated and real data sets and led to satisfactory results.

12.
13.
A new method of imputation for left‐censored datasets is reported. This method is evaluated by examining datasets in which the true values of the censored data are known so that the quality of the imputation can be assessed both visually and by means of cluster analysis. Its performance in retaining certain data structures on imputation is compared with that of three other imputation algorithms by using cluster analysis on the imputed data. It is found that the new imputation method benefits a subsequent model‐based cluster analysis performed on the left‐censored data. The stochastic nature of the imputations performed in the new method can provide multiple imputed sets from the same incomplete data. The analysis of these provides an estimate of the uncertainty of the cluster analysis. Results from clustering suggest that the imputation is robust, with smaller uncertainty than that obtained from other multiple imputation methods applied to the same data. In addition, the use of the new method avoids problems with ill‐conditioning of group covariances during imputation as well as in the subsequent clustering based on expectation–maximization. The strong imputation performance of the proposed method on simulated datasets becomes more apparent as the groups in the mixture models are increasingly overlapped. Results from real datasets suggest that the best performance occurs when the requirement of normality of each group is fulfilled, which is the main assumption of the new method. Copyright © 2013 John Wiley & Sons, Ltd.

14.
Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step of most pipelines for virus detection is to map sequence reads against known virus genomes. Because of small differences between the sequences of related viruses, and because of various biological or technical errors, the mapping is subject to uncertainty. As a consequence, the resulting list of detected viruses can lack robustness. A new approach for generating artificial sequencing reads, together with a strategy of resampling from the original findings, is proposed that can help to assess the robustness of the originally identified list of viruses. From the original mapping result in the form of a SAM file, a set of statistical distributions is derived. These are used in the resampling pipeline to generate new artificial reads, which are again mapped against the reference genomes. By summarizing the resampling procedure, the analyst receives information about whether the presence of a particular virus in the sample gains or loses evidence, and thus about the robustness of the original mapping list as well as that of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure, such as the correlation between original and resampling read counts, or the statistical detection of outliers in the differences of read counts. Additionally, graphical illustrations of read count shifts via Sankey diagrams are provided. To demonstrate its use, the resampling approach is applied to three real-world data samples, one of them with laboratory-confirmed Influenza sequences, and to artificially generated data in which virus sequences have been spiked into the sequencing data of a host. When the resampling pipeline is applied, several viruses drop from the original list while new viruses emerge, showing the robustness of those viruses that remain in the list. The evaluation shows that the resampling approach is helpful for analyzing the viral content of a biological sample, for rating the robustness of the original findings and for better showing the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.

15.
An efficient protocol, based on advanced statistical diagnostics and robust fitting techniques applied to the least‐squares processing of kinetic data of chemical reactions, is presented and discussed. The procedure, which is aimed at obtaining highly accurate estimates of the fitting parameters, consists of identifying the outliers that markedly impair the fit by means of the so‐called “leverage analysis” and some related diagnostics. This approach allows the genuinely aberrant observations to be eliminated from the data set and/or robustly weighted to suppress their negative effects on the data fitting, with a consequent reduction of the bias introduced into the parameter estimates. It has been found that the proposed procedure, applied to experimental kinetic data, yields a significant improvement in the regression results. © 2010 Wiley Periodicals, Inc. Int J Chem Kinet 42: 587–607, 2010

16.
Missing values can arise for different reasons and, depending on their origin, should be considered and dealt with in different ways. In this research, four methods of imputation have been compared with respect to their effects on the normality and variance of data, on statistical significance and on the approximation of a suitable threshold to accept missing data as truly missing. Additionally, the effects of different strategies for controlling the familywise error rate or false discovery rate, and how they work with the different strategies for missing value imputation, have been evaluated. Missing values were found to affect the normality and variance of data, and k‐means nearest neighbour imputation was the best method tested for restoring them. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that as little as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a “gray area”, and a strategy has therefore been proposed that provides a balance between the optimal imputation strategy, which was k‐means nearest neighbour, and the best approximation of positioning real zeros.

17.
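Shao Xueguang  Chen Da  Xu Heng  Liu Zhichao  Cai Wensheng 《Chinese Journal of Chemistry》2009,27(7):1328-1332
Partial least squares (PLS) plays an important role in quantitative near-infrared (NIR) spectral analysis, but its predictions are easily affected by factors such as the grouping of samples and the presence of outlying samples, so its robustness is limited. Ensemble (multi-model) PLS (EPLS) improves model robustness, but it cannot identify outliers among the samples. To improve both the prediction accuracy and the robustness of the model, this paper proposes a multi-model PLS method that resamples according to sampling probabilities, called robust ensemble PLS (RE-PLS). The method uses iteratively reweighted PLS (IRPLS) to compute the regression residuals of the samples and thereby obtains a sampling probability for each calibration sample. Training subsets are then selected according to these sampling probabilities to build multiple PLS models, and the predictions of all PLS models are averaged to give the final prediction. The method was applied to NIR spectral modeling of two different kinds of plant samples and compared with the conventional PLS and EPLS methods. The results show that the method effectively avoids the influence of outliers in the calibration set on the model while improving prediction accuracy and robustness. For real tobacco samples with complex NIR spectra and many outliers, for which simple PLS or EPLS models do not predict well, RE-PLS is expected to find wide application in the quantitative analysis of such complex spectra.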

18.
The aim of this study is to show the usefulness of robust multiple regression techniques implemented in the expectation maximization framework for successfully modeling data that contain missing elements and outlying objects. In particular, results from a comparative study of partial least squares and partial robust M-regression models implemented in the expectation maximization algorithm are presented. The performance of the proposed approaches is illustrated on simulated data with and without outliers, containing different percentages of missing elements, and on a real data set. The results suggest that the proposed methodology can be used for constructing satisfactory regression models in terms of their trimmed root mean squared errors.

19.
The bootstrap method is commonly used to estimate the distribution of estimators and their associated uncertainty when explicit analytic expressions are not available or are difficult to obtain. It has been widely applied in environmental and geochemical studies, where the data generated often represent parts of a whole, typically chemical concentrations. This kind of constrained data is generically called compositional data, and it requires specialised statistical methods to properly account for its particular covariance structure. On the other hand, it is not unusual in practice for such data to contain labels denoting nondetects, that is, concentrations falling below detection limits. Nondetects impede the implementation of the bootstrap and represent an additional source of uncertainty that must be taken into account. In this work, a bootstrap scheme is devised that handles nondetects by adding an imputation step within the resampling process and conveniently propagates their associated uncertainty. In doing so, it considers the constrained relationships between chemical concentrations that arise from their compositional nature. Bootstrap estimates using a range of imputation methods, including new stochastic proposals, are compared across scenarios of increasing difficulty. They are formulated to meet compositional principles following the log‐ratio approach, and an adjustment is introduced in the multivariate case to deal with nonclosed samples. Results suggest that nondetect bootstrap based on model‐based imputation is generally preferable. A robust approach based on isometric log‐ratio transformations appears to be particularly suited in this context. Computer routines in the R statistical programming language are provided. Copyright © 2014 John Wiley & Sons, Ltd.
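The idea of placing the imputation step inside the resampling loop can be sketched for a single variable as follows. This is a simplified illustration only: it assumes nondetects are coded as `None`, imputes them with a uniform random draw below the detection limit (a placeholder, not one of the paper's methods), and ignores the compositional, log-ratio treatment entirely. The function name, data and detection limit are hypothetical, and the sketch is in Python rather than the R routines the authors provide.

```python
import random
import statistics

def bootstrap_with_imputation(values, detection_limit, n_boot=1000, seed=0):
    """Percentile bootstrap interval for the median, imputing nondetects
    afresh inside every resample so that their uncertainty is propagated."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Step 1: resample the incomplete data with replacement.
        resample = [rng.choice(values) for _ in values]
        # Step 2: impute each nondetect stochastically below the limit.
        completed = [v if v is not None else rng.uniform(0.0, detection_limit)
                     for v in resample]
        # Step 3: compute the statistic of interest on the completed resample.
        estimates.append(statistics.median(completed))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper

# Hypothetical concentrations (mg/kg); None marks a value below the limit of 0.5.
concentrations = [1.2, None, 0.8, 2.5, None, 1.7, 0.9, 3.1, None, 1.4, 2.0, 0.6]

low, high = bootstrap_with_imputation(concentrations, detection_limit=0.5)
print(low, high)
```

Imputing once before resampling would treat the imputed values as if they had been observed; redrawing them in every resample lets the interval reflect the extra uncertainty due to the nondetects.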

20.