Similar Documents
 20 similar documents found (search time: 31 ms)
1.
Generalized canonical correlation analysis is a versatile technique that allows the joint analysis of several sets of data matrices. The generalized canonical correlation analysis solution can be obtained through an eigenequation, and distributional assumptions are not required. When dealing with multiple-set data, it frequently happens that some values are missing. In this paper, two new methods for dealing with missing values in generalized canonical correlation analysis are introduced. The first approach, which does not require iterations, is a generalization of the Test Equating method available for principal component analysis. In the second approach, missing values are imputed in such a way that the generalized canonical correlation analysis objective function does not increase in subsequent steps; convergence is achieved when the value of the objective function remains constant. By means of a simulation study, we assess the performance of the new methods and compare the results with those of two available methods: the missing-data passive method, introduced in Gifi’s homogeneity analysis framework, and the GENCOM algorithm developed by Green and Carroll. An application using World Bank data illustrates the proposed methods.

2.
The problem of imputing missing observations under the linear regression model is considered. It is assumed that observations are missing at random and that all observations on the auxiliary (independent) variables are available. Estimates of the regression parameters based on singly and multiply imputed values are given. Jackknife as well as bootstrap estimates of the variance of the singly imputed estimator of the regression parameters are given, and these variance estimators are shown to be consistent. The asymptotic distributions of the imputed estimators are also derived to obtain interval estimates of the parameters of interest. These interval estimates are then compared with the interval estimates obtained from multiple imputation. It is shown that singly imputed estimators perform at least as well as multiply imputed estimators. A new nonparametric multiply imputed estimator is proposed and shown to perform as well as a multiply imputed estimator under normality. The singly imputed estimator, however, still remains at least as good as a multiply imputed estimator.
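A minimal NumPy sketch of the single-imputation estimator and its delete-one jackknife variance, assuming missing responses are replaced by their regression predictions from the complete cases; the function names and this deterministic imputation rule are illustrative, not the paper's exact estimators.

```python
import numpy as np

def impute_and_estimate(X, y, miss):
    """Singly impute missing responses with their regression prediction from
    the complete cases (MAR assumed), then re-estimate beta on the filled data."""
    obs = ~miss
    beta_cc, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    y_filled = y.copy()
    y_filled[miss] = X[miss] @ beta_cc          # deterministic single imputation
    beta_imp, *_ = np.linalg.lstsq(X, y_filled, rcond=None)
    return beta_imp

def jackknife_variance(X, y, miss):
    """Delete-one jackknife estimate of the variance of the singly imputed estimator."""
    n = len(y)
    betas = np.array([impute_and_estimate(np.delete(X, i, axis=0),
                                          np.delete(y, i),
                                          np.delete(miss, i))
                      for i in range(n)])
    return (n - 1) / n * np.sum((betas - betas.mean(axis=0)) ** 2, axis=0)
```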

3.
Gathering information on natural resource inventories is expensive, but lack of data inhibits resource sector modeling and policy analysis. Most work has focused on drawing broader inventory estimates from small survey samples. Other studies have used simple forward forecasting equations to project missing values. This research develops a method to impute missing inventory and growth observations when annual survey observations are not available. A one-way error component model is estimated and missing inventory values are imputed using an optimally weighted combination of forward and backward projections. This method ensures conformity of imputed observations with beginning and ending inventories. Confidence intervals for imputed inventory estimates are formed using the bootstrap method. Empirical results for estimated softwood and hardwood inventories in Louisiana are presented.
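A minimal sketch of the forward/backward projection idea, assuming a simple distance-based weight rather than the optimally estimated one in the paper, and treating the annual growth rates as given (in the paper they come from the fitted one-way error component model).

```python
import numpy as np

def impute_inventory(inv, growth):
    """inv: inventory series with np.nan for unsurveyed years.
    growth: assumed annual growth rates (length >= len(inv) - 1).
    Each missing year gets a weighted combination of a forward projection
    from the last observed year and a backward projection from the next
    observed year, so imputed values conform with both bracketing surveys."""
    inv = np.asarray(inv, dtype=float).copy()
    obs = np.flatnonzero(~np.isnan(inv))
    for a, b in zip(obs[:-1], obs[1:]):
        for t in range(a + 1, b):
            fwd = inv[a] * np.prod(1.0 + growth[a:t])   # grow forward from year a
            bwd = inv[b] / np.prod(1.0 + growth[t:b])   # discount back from year b
            w = (b - t) / (b - a)                       # simple distance weight
            inv[t] = w * fwd + (1.0 - w) * bwd
    return inv
```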

4.
We develop two methods for imputing missing values in regression situations. We examine the standard fixed-effects linear-regression model Y = Xβ + ε, where the regressors X are fixed and ε is the error term. This research focuses on the problem of missing X values. A particular component of market-share analysis has motivated this research, where the price and other promotional instruments of each brand are allowed to have their own impact on the total sales volume in a consumer-products category. When a brand is not distributed in a particular week, only a few of the many measures occurring in that observation are missing. ‘What values should be imputed for the missing measures?’ is the central question this paper addresses. This context creates a unique problem in the missing-data literature, namely that there is no true value for the missing measure. Using influence functions from robust statistics, we develop two loss functions, each of which is a function of the missing and existing X values. These loss functions turn out to be sums of ratios of low-order polynomials. The minimization of either loss function is an unconstrained non-linear optimization problem, and its solution leads to imputed values that have minimal influence on the estimates of the parameters of the regression model. Estimates using this method for replacing missing values are compared with estimates obtained via some conventional methods.
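To convey the flavour of influence-minimising imputation, here is a sketch that fills the missing regressors of one observation by minimising a stand-in loss (the squared norm of the standard case-addition influence on the OLS estimate), not the paper's ratio-of-polynomials criteria; the function name and the column-mean starting values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def impute_min_influence(X, y, row, miss_cols):
    """Choose values for X[row, miss_cols] so that the completed observation
    changes the OLS estimate (fitted on the remaining rows) as little as possible."""
    keep = np.delete(np.arange(len(y)), row)
    XtX_inv = np.linalg.inv(X[keep].T @ X[keep])
    beta = XtX_inv @ X[keep].T @ y[keep]

    def loss(vals):
        x = X[row].copy()
        x[miss_cols] = vals
        resid = y[row] - x @ beta
        lever = x @ XtX_inv @ x
        delta_beta = XtX_inv @ x * resid / (1.0 + lever)   # Sherman-Morrison update of beta
        return float(delta_beta @ delta_beta)

    start = X[keep][:, miss_cols].mean(axis=0)             # start at column means
    return minimize(loss, start).x
```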

5.
In this paper, taking into account the special geometry of compositional data, two new methods for estimating missing values in compositional data are introduced. The first method uses the mean in the simplex space: a k-nearest-neighbour procedure based on the Aitchison distance, combined with the two basic operations on the simplex, perturbation and powering. The second proposal is a principal component regression imputation method that starts from the result of the proposed simplex mean; it applies the ilr transformation to the compositional data set and then uses principal component regression in the transformed space. The proposed methods are tested on real and simulated data sets, and the results show that they work well.
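For readers unfamiliar with the simplex geometry invoked above, here is a minimal NumPy sketch of the ingredients the abstract relies on: the Aitchison distance (via the clr transform), perturbation, powering, a pivot-basis ilr transform, and the geometric mean on the simplex. The function names are illustrative; the paper's own implementation details are not reproduced.

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform of a composition with positive parts."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean(axis=-1, keepdims=True)

def aitchison_dist(x, y):
    """Aitchison distance = Euclidean distance between clr coordinates."""
    return np.linalg.norm(clr(x) - clr(y))

def perturb(x, y):
    """Perturbation, the simplex analogue of addition: closure of x * y."""
    z = np.asarray(x, float) * np.asarray(y, float)
    return z / z.sum()

def power(x, a):
    """Powering, the simplex analogue of scalar multiplication: closure of x**a."""
    z = np.asarray(x, float) ** a
    return z / z.sum()

def ilr(x):
    """Isometric log-ratio coordinates with respect to a standard pivot basis."""
    lx = np.log(np.asarray(x, dtype=float))
    D = lx.shape[-1]
    coords = [np.sqrt(i / (i + 1.0)) * (lx[..., :i].mean(axis=-1) - lx[..., i])
              for i in range(1, D)]
    return np.stack(coords, axis=-1)

def simplex_mean(comps):
    """Closed geometric mean of a set of compositions (the 'mean in the simplex')."""
    g = np.exp(np.log(np.asarray(comps, dtype=float)).mean(axis=0))
    return g / g.sum()
```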

6.
We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, factorial analysis for mixed data, which balances the influence of all the variables, continuous and categorical alike, in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations and the quality of the imputation is assessed using real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
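The sketch below illustrates the iterative principal-component imputation idea on continuous variables only; it is a simplification of the factorial-analysis-for-mixed-data imputation described above (which additionally codes and weights categorical variables), and the function name, rank, and tolerance are assumptions.

```python
import numpy as np

def iterative_pc_impute(X, rank=2, n_iter=200, tol=1e-6):
    """Fill np.nan cells of a numeric matrix with column means, then repeatedly
    replace them by their rank-`rank` principal-component reconstruction until
    the imputed values stop changing."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        new = np.where(miss, recon, X)          # only missing cells are updated
        if np.max(np.abs(new - filled)) < tol:
            return new
        filled = new
    return filled
```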

7.
We explore the use of principal differential analysis as a tool for performing dimensional reduction of functional data sets. In particular, we compare the results provided by principal differential analysis and by functional principal component analysis in the dimensional reduction of three synthetic data sets, and of a real data set concerning 65 three-dimensional cerebral geometries, the AneuRisk65 data set. The analyses show that principal differential analysis can provide an alternative and effective representation of functional data, easily interpretable in terms of exponential, sinusoidal, or damped-sinusoidal functions, and providing a different insight into the functional data set under investigation. Moreover, in the analysis of the AneuRisk65 data set, principal differential analysis is able to detect interesting features of the data, such as the rippling effect of the vessel surface, that functional principal component analysis is not able to detect.

8.
The problem of missing values is common in statistical analysis. One approach to dealing with missing values is to delete the incomplete cases from the data set. This approach may disregard valuable information, especially in small samples. An alternative approach is to reconstruct the missing values using the information in the data set. The major purpose of this paper is to investigate how a neural network approach performs compared with statistical techniques for reconstructing missing values. The backpropagation algorithm is used as the learning method to reconstruct missing values. The results of backpropagation are compared with those of two conventional methods for computing missing values: (1) using averages and (2) using iterative regression analysis. Experimental results show that backpropagation consistently outperforms the other methods on both the training and the test data sets, and suggest that the neural network approach is a useful tool for reconstructing missing values in multivariate analysis.
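A small scikit-learn sketch of the idea of reconstructing one variable's missing entries from the remaining variables with a feed-forward network trained by backpropagation; the architecture and the assumption that the other columns are fully observed are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nn_reconstruct_column(X, target_col, seed=0):
    """Train an MLP on complete rows to predict `target_col` from the other
    variables, then fill that column's missing entries with the predictions."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X[:, target_col])
    others = np.delete(X, target_col, axis=1)        # assumed fully observed
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=seed)
    net.fit(others[~miss], X[~miss, target_col])
    filled = X.copy()
    filled[miss, target_col] = net.predict(others[miss])
    return filled
```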

9.
We establish computationally flexible tools for the analysis of multivariate skew normal mixtures when missing values occur in data. To facilitate the computation and simplify the theoretical derivation, two auxiliary permutation matrices are incorporated into the model for the determination of observed and missing components of each observation and are manifestly effective in reducing the computational complexity. We present an analytically feasible EM algorithm for the supervised learning of parameters as well as missing observations. The proposed mixture analyzer, including the most commonly used Gaussian mixtures as a special case, allows practitioners to handle incomplete multivariate data sets in a wide range of considerations. The methodology is illustrated through a real data set with varying proportions of synthetic missing values generated by MCAR and MAR mechanisms and shown to perform well on classification tasks.
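As a concrete illustration of the auxiliary matrices, the sketch below builds, for one observation, the two selection matrices O and M that extract its observed and missing components (stacked, they form a permutation of the identity); the function name is illustrative.

```python
import numpy as np

def selection_matrices(miss_mask):
    """Return (O, M) such that y_obs = O @ y and y_mis = M @ y for an
    observation y whose missing components are flagged in `miss_mask`."""
    miss_mask = np.asarray(miss_mask, dtype=bool)
    I = np.eye(len(miss_mask))
    return I[~miss_mask], I[miss_mask]

# Example: y = (y1, y2, y3) with y2 missing
O, M = selection_matrices([False, True, False])
y = np.array([1.0, np.nan, 3.0])
print(O @ np.nan_to_num(y))   # observed part -> [1. 3.]
```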

10.
A biplot, which is the multivariate generalization of the two-variable scatterplot, can be used to visualize the results of many multivariate techniques, especially those based on the singular value decomposition. We consider data sets consisting of continuous-scale measurements, their fuzzy coding, and the biplots that visualize them, using a fuzzy version of multiple correspondence analysis. Of special interest is the way the quality of fit of the biplot is measured, since it is well known that regular (i.e., crisp) multiple correspondence analysis seriously underestimates this measure. We show how the results of fuzzy multiple correspondence analysis can be defuzzified to obtain estimated values of the original data, and prove that this implies an orthogonal decomposition of variance. This permits a measure of fit to be calculated in the familiar form of a percentage of explained variance, which is directly comparable to the corresponding fit measure used in principal component analysis of the original data. The approach is motivated initially by its application to a simulated data set, showing how the fuzzy approach can help diagnose nonlinear relationships, and finally it is applied to a real set of meteorological data.
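As a concrete example of fuzzy coding, here is a sketch of the common three-category triangular coding of a continuous variable, with hinge points at the minimum, median and maximum (an assumption; other hinge choices are possible); defuzzification maps the memberships back to an estimated value.

```python
import numpy as np

def fuzzy_code(x, hinges=None):
    """Code a continuous variable into three fuzzy categories whose membership
    weights sum to one per observation.  Assumes lo < mid < hi."""
    x = np.asarray(x, dtype=float)
    lo, mid, hi = hinges if hinges is not None else (x.min(), np.median(x), x.max())
    Z = np.zeros((len(x), 3))
    left = x <= mid
    Z[left, 0] = (mid - x[left]) / (mid - lo)      # membership in the low category
    Z[left, 1] = 1.0 - Z[left, 0]
    Z[~left, 2] = (x[~left] - mid) / (hi - mid)    # membership in the high category
    Z[~left, 1] = 1.0 - Z[~left, 2]
    return Z

def defuzzify(Z, hinges):
    """Reverse mapping: estimated original values from the membership weights."""
    return Z @ np.asarray(hinges, dtype=float)
```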

11.
We establish computationally flexible methods and algorithms for the analysis of multivariate skew normal models when missing values occur in the data. To facilitate the computation and simplify the theoretical derivation, two auxiliary permutation matrices are incorporated into the model for the determination of the observed and missing components of each observation. Under missing at random mechanisms, we formulate an analytically simple ECM algorithm for computing parameter estimates and retrieving each missing value with a single-valued imputation. Gibbs sampling is used to perform Bayesian inference on the model parameters and to create multiple imputations for the missing values. The proposed methodologies are illustrated through a real data set, and comparisons are made with results obtained from fitting the normal counterparts.

12.
Nonlinear least squares optimization problems in which the parameters can be partitioned into two sets such that optimal estimates of parameters in one set are easy to solve for given fixed values of the parameters in the other set are common in practice. Particularly ubiquitous are data fitting problems in which the model function is a linear combination of nonlinear functions, which may be addressed with the variable projection algorithm due to Golub and Pereyra. In this paper we review variable projection, with special emphasis on its application to matrix data. The generalization of the algorithm to separable problems in which the linear coefficients of the nonlinear functions are subject to constraints is also discussed. Variable projection has been instrumental for model-based data analysis in multi-way spectroscopy, time-resolved microscopy and gas or liquid chromatography mass spectrometry, and we give an overview of applications in these domains, illustrated by brief case studies.
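A compact SciPy sketch of variable projection for a sum-of-exponentials model: the linear coefficients are eliminated by a least-squares solve inside the residual, so the outer optimiser only searches over the nonlinear rate parameters. The model, data and starting values are illustrative, not taken from the applications cited above.

```python
import numpy as np
from scipy.optimize import least_squares

def design(t, k):
    """Nonlinear basis: one exponential column per rate parameter."""
    return np.exp(-np.outer(t, k))

def projected_residual(k, t, y):
    """Residual of y after projecting onto span(design(t, k)); the linear
    coefficients are solved for and eliminated (variable projection)."""
    Phi = design(t, k)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return y - Phi @ c

# Synthetic two-exponential data
t = np.linspace(0.0, 5.0, 100)
rng = np.random.default_rng(1)
y = 2.0 * np.exp(-1.3 * t) + 0.5 * np.exp(-0.2 * t) + 0.01 * rng.standard_normal(t.size)

fit = least_squares(projected_residual, x0=[1.0, 0.1], args=(t, y))
k_hat = fit.x                                                  # estimated rates
c_hat, *_ = np.linalg.lstsq(design(t, k_hat), y, rcond=None)   # recovered linear coefficients
```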

13.
The problem of estimating the number of hidden states in a hidden Markov model is considered. Emphasis is placed on cross-validated likelihood criteria. Using cross-validation to assess the number of hidden states makes it possible to circumvent the well-documented technical difficulties of the order identification problem in mixture models. Moreover, in a predictive perspective, it does not require that the sampling distribution belong to one of the models in competition. However, computing the cross-validated likelihood for hidden Markov models, for which only one training sample is available, involves difficulties since the data are not independent. Two approaches are proposed to compute the cross-validated likelihood for a hidden Markov model. The first uses a deterministic half-sampling procedure; the second is an adaptation of the EM algorithm for hidden Markov models that takes into account the randomly missing values induced by cross-validation. Numerical experiments on both simulated and real data sets compare different versions of the cross-validated likelihood criterion and penalised likelihood criteria, including BIC and a penalised marginal likelihood criterion. These numerical experiments highlight a promising behaviour of the deterministic half-sampling criterion.
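The sketch below illustrates only the model-selection idea, with a crude split of the sequence into a training half and a held-out half scored by log-likelihood (using the third-party hmmlearn package); it is a stand-in, not the paper's deterministic half-sampling scheme or its EM adaptation for cross-validation-induced missing values.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def heldout_loglik_by_states(X, candidate_states, seed=0):
    """Fit a Gaussian HMM on the first half of the sequence for each candidate
    number of hidden states and return the log-likelihood of the second half.
    X has shape (n_samples, n_features)."""
    half = len(X) // 2
    train, test = X[:half], X[half:]
    scores = {}
    for k in candidate_states:
        model = GaussianHMM(n_components=k, n_iter=200, random_state=seed)
        model.fit(train)
        scores[k] = model.score(test)    # held-out log-likelihood
    return scores
```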

14.
This paper is concerned with the allocation of multi-attribute records on several disks so as to achieve a high degree of concurrency of disk access when responding to partial match queries. An algorithm to distribute a set of multi-attribute records onto different disks is presented. Since our allocation method uses principal component analysis, this concept is first introduced. We then use it to generate a set of real numbers, namely the projections on the first principal component direction, which can be viewed as hashing addresses. We then propose an algorithm based upon these hashing addresses to allocate multi-attribute records onto different disks. Experimental results show that our method can indeed be used to solve the multi-disk data allocation problem for concurrent accessing.
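A small NumPy sketch of using first-principal-component projections as hashing addresses: records are ordered by their scores and dealt out to the disks in round-robin fashion, so records with similar projections end up on different disks and can be fetched concurrently. The round-robin rule is one plausible reading of the allocation step, not necessarily the paper's exact algorithm.

```python
import numpy as np

def allocate_to_disks(records, n_disks):
    """Return a disk index for each multi-attribute record, based on its
    projection onto the first principal component."""
    R = np.asarray(records, dtype=float)
    Rc = R - R.mean(axis=0)
    _, _, Vt = np.linalg.svd(Rc, full_matrices=False)
    scores = Rc @ Vt[0]                           # first-PC projections ("hashing addresses")
    disks = np.empty(len(R), dtype=int)
    disks[np.argsort(scores)] = np.arange(len(R)) % n_disks
    return disks
```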

15.
The 2004 Basel II Accord has pointed out the benefits of credit risk management through internal models that use internal data to estimate risk components: probability of default (PD), loss given default, exposure at default and maturity. Internal data are the primary data source for PD estimates; banks are permitted to use statistical default prediction models to estimate borrowers’ PD, subject to some requirements concerning accuracy, completeness and appropriateness of data. In practice, however, internal records are usually incomplete or do not contain an adequate history to estimate the PD. Missing data are particularly critical for low-default portfolios, characterised by inadequate default records, making it difficult to design statistically significant prediction models. Several methods can be used to deal with missing data, such as list-wise deletion, application-specific list-wise deletion, substitution techniques or imputation models (simple and multiple variants). List-wise deletion is an easy-to-use method widely applied by social scientists, but it loses substantial data and reduces the diversity of information, resulting in bias in the model's parameters, results and inferences. The choice of the best method to solve the missing data problem largely depends on the nature of the missing values (MCAR, MAR and MNAR processes), but there is a lack of empirical analysis about their effect on credit risk, which limits the validity of the resulting models. In this paper, we analyse the nature and effects of missing data in credit risk modelling (MCAR, MAR and MNAR processes) using a currently scarce data set on consumer borrowers, which includes different percentages and distributions of missing data. The findings are used to analyse the performance of several methods for dealing with missing data, such as list-wise deletion, simple imputation methods, MLE models and advanced multiple imputation (MI) alternatives based on Markov chain Monte Carlo and re-sampling methods. Results are evaluated and compared across models in terms of robustness, accuracy and complexity. In particular, MI models are found to provide very valuable solutions with regard to credit risk missing data.

16.
Incomplete data models typically involve strong untestable assumptions about the missing data distribution. As inference may critically depend on them, the importance of sensitivity analysis is well recognized. Molenberghs, Kenward, and Goetghebeur proposed a formal frequentist approach to sensitivity analysis which distinguishes ignorance due to unintended incompleteness from imprecision due to finite sampling by design. They combine both sources of variation into uncertainty. This article develops estimation tools for ignorance and uncertainty concerning regression coefficients in a complete data model when some of the intended outcome values are missing. Exhaustive enumeration of all possible imputations for the missing data requires enormous computational resources. In contrast, when the boundary of the occupied region is of greatest interest, reasonable computational efficiency may be achieved via the imputation towards directional extremes (IDE) algorithm. This is a special imputation method designed to mark the boundary of the region by maximizing the direction of change of the complete data estimator caused by perturbations to the imputed outcomes. For multi-dimensional parameters, a dimension reduction approach is considered. Additional insights are obtained by considering structures within the region, and by introducing external knowledge to narrow the boundary to useful proportions. Special properties hold for the generalized linear model. Examples from a Kenyan HIV study will illustrate the points.

17.
Semiparametric random censorship (SRC) models (Dikta, 1998) provide an attractive framework for estimating survival functions when censoring indicators are fully or partially available. When there are missing censoring indicators (MCIs), the SRC approach employs a model-based estimate of the conditional expectation of the censoring indicator given the observed time, where the model parameters are estimated using only the complete cases. The multiple imputations approach, on the other hand, utilizes this model-based estimate to impute the missing censoring indicators and form several completed data sets. The Kaplan-Meier and SRC estimators based on these completed data sets are averaged to arrive at the multiple imputations Kaplan-Meier (MIKM) and the multiple imputations SRC (MISRC) estimators. While the MIKM estimator is asymptotically as efficient as, or less efficient than, the standard SRC-based estimator that involves no imputations, here we investigate the performance of the MISRC estimator and prove that it attains the benchmark variance set by the SRC-based estimator. We also present numerical results comparing the performance of the estimators under several misspecified models for the above-mentioned conditional expectation.
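A rough sketch of the multiple-imputations Kaplan-Meier idea: a logistic model for the censoring indicator given the observed time is fitted on the complete cases, the missing indicators are drawn from it several times, and the resulting Kaplan-Meier curves are averaged. The logistic specification, the simple tie handling, and the assumption that inputs are NumPy arrays are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def km_curve(times, deaths):
    """Kaplan-Meier survival estimate evaluated at the sorted observation times
    (ties between deaths and censorings are broken by the sort; fine for a sketch)."""
    order = np.argsort(times)
    d = np.asarray(deaths, dtype=float)[order]
    at_risk = np.arange(len(d), 0, -1)
    return np.sort(times), np.cumprod(1.0 - d / at_risk)

def mi_km(times, delta, n_imp=20, seed=0):
    """Average Kaplan-Meier curve over n_imp imputations of the missing
    censoring indicators (np.nan entries of delta)."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(delta)
    model = LogisticRegression().fit(times[~miss, None], delta[~miss].astype(int))
    p = model.predict_proba(times[miss, None])[:, 1]
    curves = []
    for _ in range(n_imp):
        d = delta.copy()
        d[miss] = rng.binomial(1, p)          # draw the missing indicators
        grid, surv = km_curve(times, d)
        curves.append(surv)
    return grid, np.mean(curves, axis=0)
```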

18.
With contemporary data collection capacity, data sets containing large numbers of different multivariate time series relating to a common entity (e.g., fMRI, financial stocks) are becoming more prevalent. One pervasive question is whether there are patterns or groups of series within the larger data set (e.g., disease patterns in brain scans; mining stocks may be internally similar yet distinct from banking stocks). There is a relatively large body of literature on clustering methods for univariate and multivariate time series, though most do not utilize the time dependencies inherent in time series. This paper develops an exploratory data methodology which, in addition to the time dependencies, simultaneously utilizes the dependency information between the S series themselves and between the p variables within each series, while still retaining the distinctiveness of the two types of variables. This is achieved by combining the principles of canonical correlation analysis and principal component analysis for time series to obtain a new type of covariance/correlation matrix for a principal component analysis, producing a so-called “principal component time series”. The results are illustrated on two data sets.

19.
Model misspecification has significant impacts on data envelopment analysis (DEA) efficiency estimates. This paper discusses the four most widely used approaches to guide variable specification in DEA. We analyze the efficiency contribution measure (ECM), principal component analysis (PCA-DEA), a regression-based test, and bootstrapping for variable selection via Monte Carlo simulations to determine each approach’s advantages and disadvantages. For a three-input, one-output production process, we find that PCA-DEA performs well with highly correlated inputs (correlation greater than 0.8), even for small data sets (less than 300 observations); both the regression and ECM approaches perform well under low correlation (less than 0.2) and with relatively larger data sets (at least 300 observations); and bootstrapping performs relatively poorly. Bootstrapping requires hours of computational time, whereas the other three methods require minutes. Based on these results, we offer guidelines for effectively choosing among the four selection methods.

20.
We address the problem of rationing common components among multiple products in a configure-to-order system with order configuration uncertainty. The objective of this problem is to maximize expected revenue by implementing a threshold rationing policy. Under this policy, a product is available to promise if fulfilling the order for the product will not cause the inventory of any one of its required components to fall below the component’s threshold level for that product. The problem is modeled as a two-stage stochastic integer program and solved using the sample average approximation approach. A heuristic is developed to generate good feasible solutions and lower bound estimates. Using industry data, we examine the benefit of component rationing as compared to a First-Come-First-Served policy and show that this benefit is correlated to the average revenue per product and the variability in the revenue across products whose components are constrained.
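A minimal sketch of the threshold availability-to-promise check described above: an incoming order is promised only if fulfilling it leaves every required component at or above that component's threshold for the ordered product. The data structures and names are hypothetical.

```python
def available_to_promise(order, on_hand, thresholds):
    """order: component -> quantity required by the configured product.
    on_hand: component -> current inventory.
    thresholds: component -> reserve level for this product (default 0)."""
    return all(on_hand[c] - q >= thresholds.get(c, 0) for c, q in order.items())

# Hypothetical example: promise only if no component dips below its reserve
on_hand    = {"cpu": 10, "ram": 4}
order      = {"cpu": 2, "ram": 3}
thresholds = {"cpu": 5, "ram": 0}
print(available_to_promise(order, on_hand, thresholds))   # True
```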
