Similar Literature
A total of 20 similar records were retrieved.
1.
Single imputation methods have been widely discussed among researchers in the field of bioinformatics. One major shortcoming of the methods proposed so far is the lack of robustness considerations. Like all data, gene expression data can contain outlying values. The presence of these outliers can distort the values imputed for the missing entries, and any statistical analysis performed on the completed data may then lead to incorrect conclusions. It is therefore important to consider the possibility of outliers in the data set and to evaluate how imputation techniques handle such values. In this paper, a simulation study is performed to test existing imputation techniques when outlying values are present in the data. To overcome some shortcomings of the existing imputation techniques, a new robust imputation method that can deal with the presence of outliers is introduced. In addition, the robust imputation procedure cleans the data for further statistical analysis. Moreover, the method can easily be extended to a multiple imputation approach, by which the uncertainty of the imputed values is emphasised. Finally, a classification example illustrates the lack of robustness of some existing imputation methods and shows the advantage of the multiple imputation approach of the new robust imputation technique.
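As a rough illustration of the robustness issue described above (not the authors' algorithm), the Python sketch below compares column-wise mean imputation with a median-based alternative on a toy expression matrix containing one outlier; all data and names are invented.

import numpy as np

rng = np.random.default_rng(0)

# Toy "expression" matrix: 20 samples x 5 genes, one outlier, one missing cell.
X = rng.normal(loc=10.0, scale=1.0, size=(20, 5))
X[0, 2] = 200.0          # outlying value in gene 2
X[5, 2] = np.nan         # missing value in the same gene

def impute_columnwise(X, stat):
    """Fill NaNs in each column with a column statistic (mean or median)."""
    Xf = X.copy()
    for j in range(Xf.shape[1]):
        col = Xf[:, j]
        col[np.isnan(col)] = stat(col[~np.isnan(col)])
    return Xf

X_mean = impute_columnwise(X, np.mean)     # pulled towards the outlier
X_med = impute_columnwise(X, np.median)    # robust to the outlier

print("mean-imputed value  :", round(X_mean[5, 2], 2))
print("median-imputed value:", round(X_med[5, 2], 2))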

2.
Missing values can arise for different reasons, and depending on their origin they should be considered and dealt with in different ways. In this research, four imputation methods were compared with respect to their effects on the normality and variance of the data, on statistical significance, and on the approximation of a suitable threshold for accepting missing data as truly missing. Additionally, the effects of different strategies for controlling the familywise error rate or the false discovery rate, and how they interact with the different missing value imputation strategies, were evaluated. Missing values were found to affect the normality and variance of the data, and k‐means nearest neighbour imputation was the best of the tested methods for restoring them. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that the threshold for accepting missing data as truly missing could be as low as 40%. The range between 40 and 70% missing values was defined as a "gray area", and a strategy has therefore been proposed that balances the optimal imputation strategy, k‐means nearest neighbour, against the best approximation for positioning real zeros.
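A minimal sketch of the two ingredients compared above, nearest-neighbour imputation followed by Bonferroni-corrected testing, using scikit-learn and SciPy; the simulated data, missingness rate, and thresholds are arbitrary choices, not those of the study.

import numpy as np
from scipy import stats
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n_feat = 50
group_a = rng.normal(0.0, 1.0, size=(12, n_feat))
group_b = rng.normal(0.0, 1.0, size=(12, n_feat))
group_b[:, :5] += 2.0                        # five truly different features

# Knock out 10% of the values at random, then impute with k nearest neighbours.
X = np.vstack([group_a, group_b])
X[rng.random(X.shape) < 0.10] = np.nan
X_imp = KNNImputer(n_neighbors=5).fit_transform(X)

# Feature-wise t-tests with a Bonferroni-corrected threshold.
a, b = X_imp[:12], X_imp[12:]
pvals = np.array([stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_feat)])
alpha = 0.05 / n_feat                        # Bonferroni correction
print("features called significant:", np.where(pvals < alpha)[0])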

3.
In mass spectrometry (MS)-based metabolomics, missing values (NAs) may be due to different causes, including sample heterogeneity, ion suppression, spectral overlap, inappropriate data processing, and instrumental errors. Although a number of methodologies have been applied to handle NAs, NA imputation remains a challenging problem. Here, we propose a non-negative matrix factorization (NMF)-based method for NA imputation in MS-based metabolomics data, which makes use of both global and local information in the data. The proposed method was compared with three commonly used methods: k-nearest neighbors (kNN), random forest (RF), and outlier-robust (ORI) missing value imputation. These methods were evaluated from the perspectives of accuracy of imputation, retrieval of data structures, and rank of imputation superiority. The experimental results showed that the NMF-based method is well adapted to various cases of data missingness and the presence of outliers in MS-based metabolic profiles. It outperformed kNN and ORI and showed results comparable with the RF method. Furthermore, the NMF method is more robust and less susceptible to outliers than the RF method. The proposed NMF-based scheme may serve as an alternative NA imputation method that may facilitate the biological interpretation of metabolomics data.
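The paper's exact NMF scheme is not reproduced here; a generic variant is to alternate between fitting an NMF to the currently completed matrix and refreshing only the missing cells from the low-rank reconstruction, as sketched below (rank, iteration count, and data are illustrative assumptions).

import numpy as np
from sklearn.decomposition import NMF

def nmf_impute(X, rank=3, n_iter=20, seed=0):
    """Iteratively fill NaNs from a rank-`rank` NMF reconstruction (generic sketch)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])    # crude starting values
    for _ in range(n_iter):
        model = NMF(n_components=rank, init="nndsvda", max_iter=500, random_state=seed)
        W = model.fit_transform(filled)
        X_hat = W @ model.components_
        filled[miss] = X_hat[miss]                           # update missing cells only
    return filled

rng = np.random.default_rng(2)
X = rng.gamma(shape=2.0, scale=50.0, size=(40, 15))          # non-negative "intensities"
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan
X_imp = nmf_impute(X_miss)
print("RMSE on imputed cells:",
      round(float(np.sqrt(np.mean((X_imp[np.isnan(X_miss)] - X[np.isnan(X_miss)]) ** 2))), 2))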

4.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression‐based methods, with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.
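To make the "very large variance for non-measured values" idea concrete, the sketch below fits a weighted rank-k model by alternating least squares, giving zero weight (effectively infinite error variance) to missing cells; this is a generic stand-in, not the MLPCA or TSR code assessed in the paper.

import numpy as np

def weighted_rank_k(X, W, k=2, n_iter=50, seed=0):
    """Alternating least squares for a weighted rank-k fit; W ~ 1/error variance,
    with zero weight on missing cells, in the spirit of MLPCA."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.normal(size=(n, k))
    V = rng.normal(size=(m, k))
    X0 = np.where(W > 0, X, 0.0)
    for _ in range(n_iter):
        for i in range(n):                      # update row scores
            Wi = np.diag(W[i])
            U[i] = np.linalg.solve(V.T @ Wi @ V + 1e-8 * np.eye(k), V.T @ Wi @ X0[i])
        for j in range(m):                      # update loadings
            Wj = np.diag(W[:, j])
            V[j] = np.linalg.solve(U.T @ Wj @ U + 1e-8 * np.eye(k), U.T @ Wj @ X0[:, j])
    return U @ V.T

rng = np.random.default_rng(3)
scores = rng.normal(size=(30, 2))
loads = rng.normal(size=(8, 2))
X = scores @ loads.T + rng.normal(scale=0.05, size=(30, 8))
W = np.ones_like(X)
W[rng.random(X.shape) < 0.15] = 0.0             # "missing" = zero weight (huge variance)
X_hat = weighted_rank_k(X, W)
print("max error on missing cells:", round(float(np.abs(X_hat - X)[W == 0].max()), 3))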

5.
Multivariate data sets often contain gaps in the data matrix. Especially with medical data, missing values are not always avoidable. Most techniques of data analysis do not allow for data gaps; a brief overview is given of the methods currently used to cope with this problem. There are two major groups of missing-data handling techniques: preprocessing techniques used before the data analysis, and techniques integrated into the data analysis. Preprocessing techniques can involve deletion of incomplete objects or variables, which loses existing values, or replacement of missing data by estimates, which introduces pseudo-information and bias. Integrated methods are not usually satisfactory. To avoid most of these disadvantages, a new preprocessing technique is proposed for deleting missing data. The algorithm comprises a stepwise deletion of both variables and objects, which retains as much of the data as possible. It is demonstrated on several artificially constructed problem data sets and on some real clinical data collections. It is shown to retain considerably more of the original data sets than other deleting procedures.
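The published deletion criterion is not given in the abstract; one plausible greedy reading of "stepwise deletion of both variables and objects" is sketched below, where the row or column that currently carries proportionally the most missing cells is dropped until the matrix is complete.

import numpy as np

def stepwise_delete(X):
    """Greedily drop the row or column with the highest share of NaNs until none
    remain (a plausible reading of the stepwise deletion idea, not the published rule)."""
    rows = list(range(X.shape[0]))
    cols = list(range(X.shape[1]))
    M = X.copy()
    while np.isnan(M).any():
        row_na = np.isnan(M).sum(axis=1)
        col_na = np.isnan(M).sum(axis=0)
        # Delete whichever (worst row or worst column) removes proportionally
        # more missing cells relative to its length.
        if row_na.max() / M.shape[1] >= col_na.max() / M.shape[0]:
            i = int(row_na.argmax())
            M = np.delete(M, i, axis=0); rows.pop(i)
        else:
            j = int(col_na.argmax())
            M = np.delete(M, j, axis=1); cols.pop(j)
    return M, rows, cols

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 6))
X[rng.random(X.shape) < 0.08] = np.nan
M, kept_rows, kept_cols = stepwise_delete(X)
print("kept", M.shape, "out of", X.shape)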

6.
Many existing molecular simulation tools require the efficient identification of the set of nonbonded interacting atoms. This is necessary, for instance, to compute energy values or steric contacts between atoms. Cell linked-lists can be used to determine the pairs of atoms closer than a given cutoff distance in asymptotically optimal time. Despite this asymptotic optimality, many spurious distances are still computed with this method. Therefore, several improvements have been proposed, most of them aiming to refine the volume of influence of each atom. Here, we suggest a different improvement strategy based on not filling cells with atoms that are always at a constant distance from a given atom. This technique is particularly effective when large groups of the particles in the simulation behave as rigid bodies, as is the case in simplified models that consider only a few of the degrees of freedom of the molecule. In these cases, the proposed technique can reduce the number of distance computations by more than one order of magnitude compared with the standard cell linked-list technique. These benefits are obtained without incurring additional computational cost, because the method carries out the same operations as the standard cell linked-list algorithm, only in a different order. Since the focus of the technique is the order of the operations, it can be combined with existing improvements based on bounding the volume of influence of each atom.
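A bare-bones cell-list neighbour search in Python (a dictionary of cells rather than true linked lists) to make the baseline algorithm concrete; the rigid-body filtering proposed in the paper is not shown, and the coordinates are random.

import numpy as np
from collections import defaultdict
from itertools import product

def neighbour_pairs(coords, cutoff):
    """Cell-list search: bin atoms into cubic cells of edge `cutoff`, then test
    only pairs from the same or adjacent cells (27 cells in 3D)."""
    cells = defaultdict(list)
    idx = np.floor(coords / cutoff).astype(int)
    for a, c in enumerate(map(tuple, idx)):
        cells[c].append(a)
    pairs = []
    for (cx, cy, cz), atoms in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            other = cells.get((cx + dx, cy + dy, cz + dz), [])
            for a in atoms:
                for b in other:
                    if a < b and np.linalg.norm(coords[a] - coords[b]) < cutoff:
                        pairs.append((a, b))
    return pairs

rng = np.random.default_rng(5)
coords = rng.uniform(0.0, 30.0, size=(500, 3))   # toy "atoms" in a 30 A box
print(len(neighbour_pairs(coords, cutoff=4.0)), "pairs within 4 A")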

7.
Cross‐validation has become one of the principal methods for adjusting the meta‐parameters of predictive models. Extensions of the cross‐validation idea have been proposed to select the number of components in principal component analysis (PCA). Element‐wise k‐fold (ekf) cross‐validation is among the most widely used algorithms for PCA cross‐validation. It is the method programmed in the PLS_Toolbox, and it has been stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation and can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is carried out to identify in which situations the application of ekf is adequate and, more importantly, in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discarding principal components that carry important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm. Copyright © 2012 John Wiley & Sons, Ltd.
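A simplified element-wise k-fold loop is sketched below: each matrix element is left out once, imputed from a PCA model fitted to the rest (here by plain iterative reconstruction rather than trimmed score imputation), and the prediction error is accumulated per number of components. It illustrates the ekf idea only; it is not the PLS_Toolbox implementation.

import numpy as np
from sklearn.decomposition import PCA

def pca_impute(X, n_comp, n_iter=30):
    """Fill NaNs by iterating: fit PCA on the completed matrix, rebuild it from
    `n_comp` components, copy the rebuilt values into the missing cells."""
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        pca = PCA(n_components=n_comp)
        X_hat = pca.inverse_transform(pca.fit_transform(Xf))
        Xf[miss] = X_hat[miss]
    return Xf

def ekf_press(X, n_comp, n_folds=7, seed=0):
    """Element-wise k-fold: delete each element once, impute it, sum squared errors."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=X.shape)
    press = 0.0
    for f in range(n_folds):
        Xf = X.copy()
        Xf[folds == f] = np.nan
        press += np.sum((pca_impute(Xf, n_comp)[folds == f] - X[folds == f]) ** 2)
    return press

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(40, 10))
for k in range(1, 6):
    print(k, "components, PRESS =", round(ekf_press(X, k), 2))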

8.
Multivariate classification methods are needed to assist in extracting information from analytical data. The most appropriate method for each problem must be chosen. The applicability of a method mainly depends on the distributional characteristics of the data population (normality, correlations between variables, separation of classes, nature of variables) and on the characteristics of the data sample available (numbers of objects, variables and classes, missing values, measurement errors). The CLAS program is designed to combine classification methods with evaluation of their performance, for batch data processing. It incorporates two-group linear discriminant analysis (SLDA), independent class modelling with principal components (SIMCA), kernel density estimation (ALLOC), and principal component class modelling with kernel density estimation (CLASSY). Most of these methods are implemented so as to give probabilistic classifications. Multiple linear regression is provided for, and other methods are scheduled. CLAS evaluates the classification method using the training set data (resubstitution), independent test data, and pseudo test data (leave-one-out method). This last method is optimized for faster computation. Criteria for classification performance and reliability of the given probabilities, etc. are determined. The package contains flexible possibilities for data manipulation, variable transformation and missing data handling.
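As a small illustration of the evaluation modes mentioned (resubstitution versus leave-one-out "pseudo test data"), the sketch below scores a linear discriminant classifier both ways with scikit-learn; CLAS itself is not used and the data set is a stand-in.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_wine(return_X_y=True)

# Resubstitution: train and evaluate on the same objects (optimistically biased).
resub = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# Leave-one-out: each object is predicted by a model trained on all the others.
loo = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut()).mean()

print(f"resubstitution accuracy: {resub:.3f}")
print(f"leave-one-out accuracy : {loo:.3f}")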

9.
A new method of imputation for left‐censored datasets is reported. This method is evaluated by examining datasets in which the true values of the censored data are known so that the quality of the imputation can be assessed both visually and by means of cluster analysis. Its performance in retaining certain data structures on imputation is compared with that of three other imputation algorithms by using cluster analysis on the imputed data. It is found that the new imputation method benefits a subsequent model‐based cluster analysis performed on the left‐censored data. The stochastic nature of the imputations performed in the new method can provide multiple imputed sets from the same incomplete data. The analysis of these provides an estimate of the uncertainty of the cluster analysis. Results from clustering suggest that the imputation is robust, with smaller uncertainty than that obtained from other multiple imputation methods applied to the same data. In addition, the use of the new method avoids problems with ill‐conditioning of group covariances during imputation as well as in the subsequent clustering based on expectation–maximization. The strong imputation performance of the proposed method on simulated datasets becomes more apparent as the groups in the mixture models are increasingly overlapped. Results from real datasets suggest that the best performance occurs when the requirement of normality of each group is fulfilled, which is the main assumption of the new method. Copyright © 2013 John Wiley & Sons, Ltd.
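The abstract's key assumption is normality within each group. One common way to realise a stochastic left-censored imputation, though not necessarily the authors' exact algorithm, is to draw each censored value from a fitted normal truncated at the detection limit, as sketched below; repeating the draw with different seeds yields the multiple imputed sets mentioned above.

import numpy as np
from scipy import stats

def impute_left_censored(x, lod, seed=0):
    """Replace values below the detection limit by random draws from a fitted
    normal truncated to (-inf, lod]; one stochastic imputation of many."""
    x = np.asarray(x, dtype=float)
    cens = x < lod
    mu, sd = x[~cens].mean(), x[~cens].std(ddof=1)   # crude estimate from the uncensored part
    b = (lod - mu) / sd                              # standardised upper truncation bound
    draws = stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sd,
                                size=int(cens.sum()), random_state=seed)
    out = x.copy()
    out[cens] = draws
    return out

rng = np.random.default_rng(7)
true_vals = rng.normal(5.0, 1.0, size=200)
lod = 4.0
observed = np.where(true_vals < lod, lod - 1.0, true_vals)   # censored readings sit below the LOD
imputed = impute_left_censored(observed, lod)
print("fraction censored:", np.mean(observed < lod))
print("mean after imputation:", round(float(imputed.mean()), 2))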

10.
Processing plants can produce large amounts of data that process engineers use for analysis, monitoring, or control. Principal component analysis (PCA) is well suited to analyzing large amounts of (possibly) correlated data and to reducing the dimensionality of the variable space. Failing online sensors, lost historical data, or missing experiments can lead to data sets with missing values, for which the current methods of obtaining the PCA model parameters may give questionable results due to the properties of the estimated parameters. This paper proposes a method based on nonlinear programming (NLP) techniques to obtain the parameters of PCA models in the presence of incomplete data sets. We show the relationship that exists between the nonlinear iterative partial least squares (NIPALS) algorithm and the optimality conditions of the squared-residuals minimization problem, and how this leads to the modified NIPALS used for the missing value problem. Moreover, we compare the current NIPALS‐based methods with the proposed NLP approach on a simulation example and an industrial case study, and show how the latter is better suited when there are large amounts of missing values. The solutions obtained with the NLP and the iterative algorithm (IA) are very similar. However, when using the NLP‐based method, the loadings and scores are guaranteed to be orthogonal, and the scores will have zero mean; the latter is emphasized in the industrial case study. With the industrial data used here we are also able to show that the models obtained with the NLP were easier to interpret and that many fewer iterations were required to obtain them. Copyright © 2010 John Wiley & Sons, Ltd.
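For reference, the modified NIPALS iteration that the paper compares against can be written compactly by skipping missing cells in each regression step; the single-component sketch below illustrates that baseline only (it is not the proposed NLP formulation), on simulated data.

import numpy as np

def nipals_pc1(X, n_iter=200, tol=1e-10):
    """One principal component by NIPALS, skipping NaN cells in each projection
    (the classic missing-data modification, not the paper's NLP approach)."""
    M = ~np.isnan(X)
    Xz = np.where(M, X, 0.0)
    t = Xz[:, 0].copy()                           # initial score vector
    for _ in range(n_iter):
        p = (Xz.T @ t) / (M.T @ (t ** 2))         # loadings from available cells only
        p /= np.linalg.norm(p)
        t_new = (Xz @ p) / (M @ (p ** 2))         # scores from available cells only
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p

rng = np.random.default_rng(8)
X = np.outer(rng.normal(size=50), rng.normal(size=6)) + 0.05 * rng.normal(size=(50, 6))
X[rng.random(X.shape) < 0.1] = np.nan
t, p = nipals_pc1(X - np.nanmean(X, axis=0))      # mean-centre before NIPALS
print("loading vector:", np.round(p, 3))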

11.
Microarrays have been widely used to identify differentially expressed genes. One related problem is to estimate the proportion of differentially expressed genes. For some complex diseases, the number of differentially expressed genes may be relatively small and these genes may show only subtle differences in expression. For such microarray data, it is generally difficult to estimate the proportion of differentially expressed genes efficiently. In this study, I propose a likelihood-based method coupled with an expectation-maximization (E-M) algorithm for estimating the proportion of differentially expressed genes. The proposed method performs favorably if either (i) the P values of the differentially expressed genes are homogeneously distributed or (ii) the proportion of differentially expressed genes is relatively small. In both of these situations, simulations showed that the proposed method gave satisfactory performance when compared with other existing methods. As applications, the methods were applied to two microarray gene expression data sets generated on different platforms.
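The abstract does not spell out the likelihood; a standard stand-in is the beta-uniform mixture for P values, pi0*U(0,1) + (1-pi0)*Beta(a,1), fitted by EM as sketched below, with the differentially expressed proportion estimated as 1 - pi0. The mixture choice and the simulated data are assumptions for illustration, not the author's model.

import numpy as np

def em_bum(pvals, n_iter=200):
    """EM for the beta-uniform mixture pi0*U(0,1) + (1-pi0)*Beta(a,1);
    returns pi0 (null proportion) and a."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-12, 1.0)
    pi0, a = 0.5, 0.5
    for _ in range(n_iter):
        f1 = a * p ** (a - 1.0)                             # Beta(a,1) density
        w = (1.0 - pi0) * f1 / ((1.0 - pi0) * f1 + pi0)     # E-step: P(non-null | p)
        pi0 = 1.0 - w.mean()                                # M-step for pi0
        a = -w.sum() / np.sum(w * np.log(p))                # weighted MLE for a
    return pi0, a

rng = np.random.default_rng(9)
null_p = rng.uniform(size=9000)                             # 90% null genes
alt_p = rng.beta(0.2, 1.0, size=1000)                       # 10% differentially expressed
pi0, a = em_bum(np.concatenate([null_p, alt_p]))
print(f"estimated DE proportion: {1 - pi0:.3f} (simulated truth 0.100)")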

12.
We have formulated the ab-initio prediction of the 3D structure of proteins as a probabilistic programming problem in which the inter-residue 3D distances are treated as random variables. Lower and upper bounds for these random variables and the corresponding probabilities are estimated by nonparametric statistical methods and knowledge-based heuristics. In this paper we focus on the probabilistic computation of the 3D structure using these distance interval estimates. Validation of the predicted structures shows our method to be more accurate than other computational methods reported so far. Our method is also found to be computationally more efficient than other existing ab-initio structure prediction methods. Moreover, we provide a reliability index for the predicted structures. Because of its computational simplicity and its applicability to any random sequence, our algorithm, called PROPAINOR (PROtein structure Prediction by AI and Nonparametric Regression), has significant scope in computational protein structural genomics.

13.
Omics studies such as metabolomics, lipidomics, and proteomics have become important for understanding the mechanisms at work in living organisms. However, the compounds detected are structurally diverse and include isomers, and each structure or isomer can play a different role in the cell or tissue of the organism. It is therefore important to detect, characterize, and elucidate the structures of these compounds. Liquid chromatography and mass spectrometry have been used for decades in the structure elucidation of key compounds, and prediction models for parameters such as retention time and fragmentation pattern have been developed for these techniques, but they have some limitations. Ion mobility has meanwhile become one of the most promising techniques for fingerprinting these compounds by determining their collision cross section (CCS) values, which reflect their shape and size. Accurate CCS values can be used as a filter for candidate analyte structures. CCS values can be measured experimentally using calibrant-independent and calibrant-dependent approaches. Identifying compounds from experimental CCS values in untargeted analysis typically requires CCS references from standards, which are currently limited and, where available, demand a large amount of experimental measurement time. Researchers therefore use theoretical tools to predict CCS values for untargeted and targeted analysis. In this review, an overview of the different methods for the experimental and theoretical estimation of CCS values is given, where the theoretical prediction tools include computational and machine learning modeling approaches. The limitations of the current experimental and theoretical approaches, and potential ways to mitigate them, are also discussed.

14.
Advances in sensory systems have led to many industrial applications with large amounts of highly correlated data, particularly in chemical and pharmaceutical processes. With these correlated data sets, it becomes important to consider advanced modeling approaches built to deal with correlated inputs in order to understand the underlying sources of variability and how this variability will affect the final quality of the product. In addition to the correlated nature of the data sets, it is also common to find missing elements and noise in these data matrices. Latent variable regression methods such as partial least squares, or projection to latent structures (PLS), have gained much attention in industry for their ability to handle ill‐conditioned matrices with missing elements. This feature of the PLS method is accomplished through the nonlinear iterative PLS (NIPALS) algorithm, with a simple modification to account for the missing data. In expectation maximization PLS (EM‐PLS), imputed values are provided for the missing data elements as initial estimates; conventional PLS is then applied to update these elements, and the process iterates to convergence. This study is an extension of previous work on principal component analysis (PCA), where we introduced nonlinear programming (NLP) as a means of estimating the parameters of the PCA model. Here, we focus on the parameters of a PLS model. As an alternative to modified NIPALS and EM‐PLS, this paper presents an efficient NLP‐based technique to find the PLS model parameters, where the desired properties of the parameters can be explicitly posed as constraints in the optimization problem of the proposed algorithm. We also present a number of simulation studies in which we compare the effectiveness of the proposed algorithm with competing algorithms. Copyright © 2014 John Wiley & Sons, Ltd.
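As background for the EM-PLS baseline described above (impute, fit PLS, refresh the imputed cells, iterate), here is a generic sketch using scikit-learn's PLSRegression as the inner model; it is not the NLP algorithm proposed in the paper, and the data are simulated.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def em_pls_impute(X, y, n_comp=2, n_iter=30):
    """EM-style imputation for missing X cells: start from column means, then
    alternate fitting a PLS model and refreshing the missing cells from its
    reconstruction of X (a generic sketch, not the paper's NLP algorithm)."""
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        pls = PLSRegression(n_components=n_comp).fit(Xf, y)
        X_hat = pls.inverse_transform(pls.transform(Xf))
        Xf[miss] = X_hat[miss]
    return Xf, pls

rng = np.random.default_rng(10)
T = rng.normal(size=(60, 2))                      # latent variables
X = T @ rng.normal(size=(2, 12)) + 0.05 * rng.normal(size=(60, 12))
y = T @ np.array([1.0, -2.0]) + 0.05 * rng.normal(size=60)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan
X_imp, model = em_pls_impute(X_miss, y)
print("imputation RMSE:",
      round(float(np.sqrt(np.mean((X_imp[np.isnan(X_miss)] - X[np.isnan(X_miss)]) ** 2))), 3))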

15.
In a typical high-throughput screening (HTS) campaign, less than 1 % of the small-molecule library is characterized by confirmatory experiments. As much as 99 % of the library’s molecules are set aside—and not included in downstream analysis—although some of these molecules would prove active were they sent for confirmatory testing. These missing experimental measurements prevent active molecules from being identified by screeners. In this study, we propose managing missing measurements using imputation—a powerful technique from the machine learning community—to fill in accurate guesses where measurements are missing. We then use these imputed measurements to construct an imputed visualization of HTS results, based on the scaffold tree visualization from the literature. This imputed visualization identifies almost all groups of active molecules from a HTS, even those that would otherwise be missed. We validate our methodology by simulating HTS experiments using the data from eight quantitative HTS campaigns, and the implications for drug discovery are discussed. In particular, this method can rapidly and economically identify novel active molecules, each of which could have novel function in either binding or selectivity in addition to representing new intellectual property.

16.
Predicting an accurate binding free energy between a target protein and a ligand can be one of the most important steps in a drug discovery process. Often, many molecules must be screened to find the probable high-potency ones, so a low-cost computational technique is highly desirable for estimating the binding free energies of many molecules. Several techniques have thus far been developed for estimating binding free energies. Some provide accurate predictions but at a large computational cost; others give good predictions but require careful tuning of parameters to achieve high accuracy. In this study, we propose a method to predict relative binding free energies with accuracy comparable to prior methods but with lower computational cost and with no parameters that need careful tuning. Our technique is based on the free energy variational principle. FK506 binding protein (FKBP) with 18 ligands is taken as a test system, and our results are compared to those from other widely used techniques. Our method provides a correlation coefficient (r²) of 0.80 between experimental and calculated relative binding free energies and yields an average absolute error of 0.70 kcal/mol relative to the experimental values. These results are comparable to or better than those from other techniques. We also discuss the possibility of improving our method further.

17.
18.
The Hansen solubility parameter (HSP) seems to be a useful tool for the thermodynamic characterization of different materials. Unfortunately, estimation of HSP values can cause some problems. In this work, different procedures using inverse gas chromatography are presented for calculating the solubility parameters of pharmaceutical excipients. The newly proposed procedure, based on the Lindvig et al. methodology and using experimental Flory-Huggins interaction parameter data, can be a reasonable alternative for estimating HSP values. The advantage of this method is that the Flory-Huggins interaction parameter chi values for all test solutes are used in the calculation, so that the diverse interactions between test solute and material are taken into consideration.
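A rough numerical sketch of a Lindvig-type fit: given Flory-Huggins chi values measured by IGC for probe solvents of known HSP, the material's dispersive, polar, and hydrogen-bonding parameters are fitted by least squares to chi ≈ 0.6*(V/RT)*[(dD1-dD2)^2 + 0.25*(dP1-dP2)^2 + 0.25*(dH1-dH2)^2]. The functional form is quoted from memory of the Lindvig et al. approach, and the probe data below are entirely invented, so treat this only as the shape of the calculation.

import numpy as np
from scipy.optimize import least_squares

R = 8.314          # J/(mol K)
T = 313.15         # K, hypothetical column temperature
ALPHA = 0.6        # correction factor used in the Lindvig-type expression (assumed)

# Hypothetical probe solvents: molar volume (cm3/mol), HSP (MPa^0.5), measured chi.
probes = np.array([
    # V      dD    dP    dH    chi_measured
    [106.8, 15.8,  0.0,  0.2, 1.10],   # probe A (hypothetical)
    [ 89.4, 18.0,  1.4,  2.0, 0.45],   # probe B (hypothetical)
    [ 58.5, 15.8, 10.8, 19.4, 0.95],   # probe C (hypothetical)
    [ 74.0, 15.5, 10.4,  7.0, 0.40],   # probe D (hypothetical)
    [ 71.3, 15.8,  8.8,  7.2, 0.35],   # probe E (hypothetical)
])

def chi_model(hsp_material, V, dD, dP, dH):
    """Lindvig-type estimate of chi from Hansen distances (V converted to m3/mol,
    MPa converted to Pa so the result is dimensionless)."""
    dDm, dPm, dHm = hsp_material
    dist2 = (dD - dDm) ** 2 + 0.25 * (dP - dPm) ** 2 + 0.25 * (dH - dHm) ** 2
    return ALPHA * (V * 1e-6) / (R * T) * dist2 * 1e6

V, dD, dP, dH, chi_exp = probes.T
fit = least_squares(lambda h: chi_model(h, V, dD, dP, dH) - chi_exp,
                    x0=[17.0, 5.0, 8.0])
print("fitted HSP of the excipient (dD, dP, dH):", np.round(fit.x, 1))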

19.
S. Tominaga, Radioisotopes, 1984, 33(7): 423-430
A new computational method is described for estimating the exposure-rate spectral distributions of X-rays from attenuation data measured with various filtrations. The estimation of X-ray spectra is formulated as the numerical solution of a set of linear equations with an ill-conditioned nature. In this paper, the singular-value decomposition technique, which differs from the iterative methods, is applied to this singular numerical problem. The analysis method is based on the fact that the response matrix of the filtrations can be decomposed into inherent component matrices. X-ray spectral distributions are then represented as a simple combination of component curves, so that the estimation process can be constructed systematically. The singularity in the computation is removed by selecting the components of the combination, and a performance index is presented for the optimal selection. The feasibility of the proposed method is studied in detail in a computer simulation using a hypothetical X-ray spectrum produced under assumed experimental conditions. Application results are also shown for the spectral distribution from a 140 kV constant-voltage X-ray source.
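The attenuation problem has the generic form y = A s with an ill-conditioned response matrix A; the sketch below shows the truncated-SVD regularisation idea (keep only the dominant singular components and choose their number) on a synthetic system. The kernel and spectrum are hypothetical, not the paper's response matrices or component selection rule.

import numpy as np

rng = np.random.default_rng(11)

# Synthetic ill-conditioned system: y = A s + noise, with a smooth attenuation-like kernel.
n_energy, n_filters = 40, 15
A = np.exp(-np.linspace(0.1, 4.0, n_filters)[:, None]
           * np.linspace(0.5, 3.0, n_energy)[None, :])
s_true = np.exp(-0.5 * ((np.arange(n_energy) - 22.0) / 5.0) ** 2)  # hypothetical spectrum
y = A @ s_true + 1e-4 * rng.normal(size=n_filters)

def tsvd_solve(A, y, k):
    """Least-squares solution keeping only the k largest singular components."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    inv = np.where(np.arange(sv.size) < k, 1.0 / sv, 0.0)
    return Vt.T @ (inv * (U.T @ y))

for k in (3, 6, 10):
    s_hat = tsvd_solve(A, y, k)
    print(f"k={k:2d}  reconstruction error = {np.linalg.norm(s_hat - s_true):.3f}")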

20.
Fluid Phase Equilibria, 2004, 219(2): 245-255
For the computation of chemical and phase equilibrium at constant temperature and pressure, a wide variety of problem formulations and numerical solution procedures have been proposed, involving both direct minimization of the Gibbs energy and the solution of equivalent nonlinear equation systems. Still, with very few exceptions, these methodologies may fail to solve the chemical and phase equilibrium problem correctly. Nevertheless, many existing solution methods are extremely reliable in general and fail only occasionally. To take advantage of this wealth of available techniques, we demonstrate here an approach in which such techniques are combined with procedures that can validate results that are correct and identify results that are incorrect. Furthermore, in the latter case, corrective feedback can be provided until a result that can be validated as correct is found. The validation procedure is deterministic and provides a mathematical and computational guarantee that the global minimum in the Gibbs energy has been found. To demonstrate this validated computing approach to the chemical and phase equilibrium problem, we present several examples involving reactive and nonreactive components at high pressure, using cubic equation-of-state models.
