Similar literature
 20 similar documents found; search time 15 ms
1.
Interest in variable selection for clustering has grown recently, driven by the increasing need to cluster high-dimensional data. Variable selection, in particular, eases both the clustering itself and the interpretation of the results. Existing approaches have demonstrated the importance of variable selection for clustering but turn out to be either very time-consuming or not sparse enough in high-dimensional spaces. This work proposes to select the discriminative variables by introducing sparsity in the loading matrix of the Fisher-EM algorithm. This clustering method was recently proposed for the simultaneous visualization and clustering of high-dimensional data. It is based on a latent mixture model that fits the data into a low-dimensional discriminative subspace. Three different approaches are proposed in this work to introduce sparsity in the orientation matrix of the discriminative subspace through \(\ell_{1}\)-type penalizations. Experimental comparisons with existing approaches on simulated and real-world data sets demonstrate the value of the proposed methodology. An application to the segmentation of hyperspectral images of the planet Mars is also presented.

2.
Supervised clustering of variables   Cited by: 1 (self-citations: 0; citations by others: 1)
In predictive modelling, highly correlated predictors lead to unstable models that are often difficult to interpret. The selection of features, or the use of latent components that reduce the complexity among correlated observed variables, are common strategies. Our objective with the new procedure that we advocate here is to achieve both purposes: to highlight the group structure among the variables and to identify the most relevant groups of variables for prediction. The proposed procedure is an iterative adaptation of a method developed for the clustering of variables around latent variables (CLV). Modification of the standard CLV algorithm leads to a supervised procedure, in the sense that the variable to be predicted plays an active role in the clustering. The latent variables associated with the groups of variables, selected for their “proximity” to the variable to be predicted and their “internal homogeneity”, are progressively added to a predictive model. The features of the methodology are illustrated with a simulation study and a real-world application.

3.
A cluster-based method for constructing sparse principal components is proposed. The method initially forms clusters of variables, using a new clustering approach called the semi-partition, in two steps. First, the variables are ordered sequentially according to a criterion involving the correlations between variables. Then, the ordered variables are split into two parts based on their generalized variance. The first group of variables becomes an output cluster, while the second becomes the input for another run of the sequential process. After the optimal clusters have been formed, sparse components are constructed from the singular value decomposition of the data matrices of each cluster. The method is applied to simple data sets with a smaller number of variables (p) than observations (n), as well as to large gene expression data sets with p ≫ n. The resulting cluster-based sparse principal components are very promising as evaluated by objective criteria. The method is also compared with other existing approaches and is found to perform well.

4.
The aim of this paper is to enlarge the usual domain of cluster analysis. A procedure for clustering time-varying data is presented that takes into account the time dimension and its intrinsic properties.

This procedure consists of two steps. In the first step a dissimilarity between variables is defined and the dissimilarity matrix is calculated for each unit separately. In the second step the dissimilarity between units is calculated in terms of the dissimilarity matrices defined in the first step. The dissimilarity matrix obtained is the base for a suitable clustering method.

The procedure is illustrated on an empirical example.

5.
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.

6.
We consider models for the covariance between two blocks of variables. Such models are often used in situations where latent variables are believed to be present. In this paper we characterize exactly the set of distributions given by a class of models with one-dimensional latent variables. These models relate two blocks of observed variables, modeling only the cross-covariance matrix. We describe the relation of this model to the singular value decomposition of the cross-covariance matrix. We show that, although the model is underidentified, useful information may be extracted. We further consider an alternative parameterization in which one latent variable is associated with each block, and we extend the result to models with r-dimensional latent variables.

7.
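Building on the general factor analysis model, we assume that the continuous latent vector (the common factors) is related to another observed random vector through a multivariate linear regression model, and we analyze the factor analysis model extended in this way. The model parameters are estimated mainly via the EM algorithm, and a detailed derivation is given.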

8.
Suppose that random factor models with k factors are assumed to hold for \(m\) \(p\)-variate populations. A model for factorial invariance has been proposed wherein the covariance or correlation matrices can be written as \(\Sigma_i = L C_i L' + \sigma_i^2 I\), where \(C_i\) is the covariance matrix of the factor variables and \(L\) is a common factor loading matrix, \(i = 1, \ldots, m\). A goodness-of-fit statistic has also been proposed for this model. The asymptotic distribution of this statistic is shown to be that of a quadratic form in normal variables. An approximation to this distribution is given and thus a test for goodness of fit is derived. The problem of dimension is considered and a numerical example is given to illustrate the results.
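
As a rough illustration of the covariance structure in this abstract, the sketch below builds the covariance matrix of each population from a shared loading matrix, a population-specific factor covariance matrix and a population-specific error variance. The helper names and all numerical values are hypothetical and are not taken from the paper.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def invariant_covariance(L, C_i, sigma2_i):
    """Sigma_i = L C_i L' + sigma_i^2 I for a single population."""
    core = matmul(matmul(L, C_i), transpose(L))
    p = len(L)
    return [[core[r][c] + (sigma2_i if r == c else 0.0) for c in range(p)]
            for r in range(p)]


# Hypothetical setup: p = 3 variables, k = 1 factor, m = 2 populations
L = [[1.0], [0.5], [2.0]]          # common loading matrix (p x k)
C = [[[1.0]], [[4.0]]]             # factor covariance matrices C_1, C_2 (k x k)
sigma2 = [0.5, 1.0]                # error variances sigma_1^2, sigma_2^2

covariances = [invariant_covariance(L, C_i, s_i) for C_i, s_i in zip(C, sigma2)]
```

Only the factor covariance and the error variance change between populations; the loading matrix L is shared, which is what the invariance assumption amounts to in this toy setting.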

9.
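We study the variance decomposition of additive and multiplicative models. We prove that when the dependent variable equals the sum of the independent variables, its variance equals the sum of the covariances between each independent variable and the dependent variable; and that when the dependent variable equals the product of the independent variables, the variance of its logarithm equals the sum of the covariances between the logarithm of each independent variable and the logarithm of the dependent variable. Statistical data on per capita household income and its influencing factors for 31 Chinese provinces from 2005 to 2012 are used to illustrate the application of both decompositions.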
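
The two identities stated in this abstract can be checked numerically. The sketch below is only an illustration on randomly generated numbers (not the provincial income data used in the paper), and the function and variable names are hypothetical. It relies on the bilinearity of covariance: if Y is the sum of the X_i, then Var(Y) = Cov(Y, Y) equals the sum of Cov(X_i, Y); the multiplicative case reduces to the additive one after taking logarithms.

```python
import math
import random


def cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def additive_shares(components):
    """For Y = sum of X_i, return Var(Y) and the list of Cov(X_i, Y)."""
    y = [sum(values) for values in zip(*components)]
    return cov(y, y), [cov(x, y) for x in components]


def multiplicative_shares(factors):
    """For Y = product of X_i (all positive), decompose Var(log Y) the same way."""
    logs = [[math.log(v) for v in x] for x in factors]
    return additive_shares(logs)


rng = random.Random(0)
# Hypothetical positive components for 50 observations
wages = [rng.uniform(10, 30) for _ in range(50)]
transfers = [rng.uniform(1, 5) for _ in range(50)]
property_income = [0.1 * w + rng.uniform(0, 2) for w in wages]
parts = [wages, transfers, property_income]

var_y, shares = additive_shares(parts)
var_log_y, log_shares = multiplicative_shares(parts)
```

Dividing each covariance by the total variance gives that component's share of the variance; individual shares can be negative when a component is negatively correlated with the total.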

10.
Clustering can be treated as an optimization problem over a set of feasible clusterings. This paper deals with a clustering problem in which the set of feasible clusterings is determined by constraining a function of the values of a given (constraining) variable in each cluster. It can be shown that agglomerative clustering methods are not suitable for solving problems with constraining variables. Local optimization procedures can be adapted to solve clustering problems of this type. A special coefficient is defined to study the influence of the constraints on the clustering. The proposed procedures for clustering with constraining variables are illustrated by clustering the Slovene communes on the basis of socioeconomic indicators.

11.
In this paper, algorithms that realize some operations on scalar polynomials in one and two variables, together with their computer implementation, are suggested. The following operations are considered: 1) the computation of the GCD of given scalar polynomials and the decomposition of each polynomial into a product of two factors, where the first factor is the GCD and the second factors form a sequence of relatively prime polynomials; 2) the division of polynomials by their common divisor; 3) the decomposition of polynomials in two variables into irreducible factors; 4) the computation of the LCM of given scalar polynomials. Bibliography: 5 titles. Translated from Zapiski Nauchnykh Seminarov POMI, Vol. 219, 1994, pp. 158–175. This work was supported by the Russian Foundation of Fundamental Research (grant 94-01-00919). Translated by V. N. Kublanovskaya.
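
Operations 1, 2 and 4 can be sketched for univariate polynomials with rational coefficients. The abstract does not say which algorithms the paper uses, so the classical Euclidean algorithm below is an assumption, and all function names are hypothetical. Factorization in two variables (operation 3) is not covered. Polynomials are lists of coefficients, highest degree first.

```python
from fractions import Fraction


def trim(p):
    """Drop leading zero coefficients, keeping at least one entry."""
    i = 0
    while i < len(p) - 1 and p[i] == 0:
        i += 1
    return p[i:]


def poly_divmod(a, b):
    """Long division: return (q, r) with a = q*b + r and deg r < deg b."""
    a = [Fraction(c) for c in trim(a)]
    b = [Fraction(c) for c in trim(b)]
    if b == [0]:
        raise ZeroDivisionError("division by the zero polynomial")
    if len(a) < len(b):
        return [Fraction(0)], a
    q, r = [], a[:]
    for _ in range(len(a) - len(b) + 1):
        coeff = r[0] / b[0]
        q.append(coeff)
        for j in range(len(b)):
            r[j] -= coeff * b[j]
        r.pop(0)  # the leading term has been cancelled
    return q, (trim(r) if r else [Fraction(0)])


def poly_gcd(a, b):
    """Monic GCD by the Euclidean algorithm (inputs must not both be zero)."""
    a = [Fraction(c) for c in trim(a)]
    b = [Fraction(c) for c in trim(b)]
    while b != [0]:
        a, b = b, poly_divmod(a, b)[1]
    return [c / a[0] for c in a]


def poly_mul(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def gcd_and_cofactors(polys):
    """Operations 1 and 2: common GCD g and the quotients p / g."""
    g = [Fraction(0)]
    for p in polys:
        g = poly_gcd(g, p)
    return g, [poly_divmod(p, g)[0] for p in polys]


def poly_lcm(a, b):
    """Operation 4 for two polynomials: monic a * b / gcd(a, b)."""
    product = poly_mul(a, poly_divmod(b, poly_gcd(a, b))[0])
    return [c / product[0] for c in product]


f1 = [1, 1, -2]   # (x - 1)(x + 2)
f2 = [1, -4, 3]   # (x - 1)(x - 3)
g, cofactors = gcd_and_cofactors([f1, f2])
lcm = poly_lcm(f1, f2)
```

For these inputs the GCD is x - 1, the cofactors are x + 2 and x - 3, and the LCM is x^3 - 2x^2 - 5x + 6. Exact rational arithmetic avoids the rounding problems that a floating-point Euclidean algorithm would have.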

12.
In multivariate regression models, a sparse singular value decomposition of the regression component matrix is appealing for reducing dimensionality and facilitating interpretation. However, the recovery of such a decomposition remains very challenging, largely due to the simultaneous presence of orthogonality constraints and co-sparsity regularization. By delving into the underlying statistical data-generation mechanism, we reformulate the problem as a supervised co-sparse factor analysis, and develop an efficient computational procedure, named sequential factor extraction via co-sparse unit-rank estimation (SeCURE), that completely bypasses the orthogonality requirements. At each step, the problem reduces to a sparse multivariate regression with a unit-rank constraint. Conveniently, each sequentially extracted sparse and unit-rank coefficient matrix automatically leads to co-sparsity in its pair of singular vectors. Each latent factor is thus a sparse linear combination of the predictors and may influence only a subset of responses. The proposed algorithm is guaranteed to converge, and it ensures efficient computation even with incomplete data and/or when enforcing exact orthogonality is desired. Our estimators enjoy the oracle properties asymptotically; a non-asymptotic error bound further reveals some interesting finite-sample behaviors of the estimators. The efficacy of SeCURE is demonstrated by simulation studies and two applications in genetics. Supplementary materials for this article are available online.

13.
In this paper, the problem of fitting the exploratory factor analysis (EFA) model to data matrices with more variables than observations is reconsidered. A new algorithm named ‘zig-zag EFA’ is introduced for the simultaneous least squares estimation of all EFA model unknowns. As in principal component analysis, zig-zag EFA is based on the singular value decomposition of data matrices. Another advantage of the proposed computational routine is that it facilitates the estimation of both common and unique factor scores. Applications to both real and artificial data illustrate the algorithm and the EFA solutions.

14.
Simple random subsampling is an integral part of S estimation algorithms for linear regression. Subsamples are required to be nonsingular. Usually, discarding a singular subsample and drawing a new one leads to a sufficient number of nonsingular subsamples with a reasonable computational effort. However, this procedure can require so many subsamples that it becomes infeasible, especially if levels of categorical variables have low frequency. A subsampling algorithm called nonsingular subsampling is presented, which generates only nonsingular subsamples. When no singular subsamples occur, nonsingular subsampling is as fast as the simple algorithm, and if singular subsamples do occur, it maintains the same computational order. The algorithm works consistently, unless the full design matrix is singular. The method is based on a modified LU decomposition algorithm that combines sample generation with solving the least squares problem. The algorithm may also be useful for ordinary bootstrapping. Since the method allows for S estimation in designs with factors and interactions between factors and continuous regressors, we study properties of the resulting estimators, both in the sense of their dependence on the randomness of the sampling and of their statistical performance.
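
To make the difficulty concrete, here is a minimal sketch of the simple baseline described in this abstract (draw a subsample, discard it if singular, draw again). It is not the modified LU-based nonsingular subsampling algorithm that the paper proposes. The function names, the tolerance and the toy design matrix are hypothetical.

```python
import random


def is_nonsingular(rows, tol=1e-12):
    """Gaussian elimination with partial pivoting on a square matrix."""
    a = [[float(v) for v in row] for row in rows]
    n = len(a)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            return False  # no usable pivot in this column
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return True


def simple_subsample(design, rng, max_tries=10_000):
    """Draw p distinct rows at random; redraw while the subsample is singular."""
    p = len(design[0])
    for tries in range(1, max_tries + 1):
        idx = rng.sample(range(len(design)), p)
        if is_nonsingular([design[i] for i in idx]):
            return idx, tries
    raise RuntimeError("no nonsingular subsample found")


# Hypothetical design: an intercept plus a dummy for a rare level (2 of 40 rows)
design = [[1, 1 if i < 2 else 0] for i in range(40)]
idx, tries = simple_subsample(design, random.Random(1))
```

In this toy design a subsample of two rows is nonsingular only when exactly one of them carries the rare level, so most draws are rejected. With more rare levels the rejection rate grows quickly, which is the infeasibility the abstract refers to.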

15.
In this paper, a matrix modular neural network (MMNN) based on task decomposition with subspace division by adaptive affinity propagation clustering is developed to solve classification tasks. First, we propose an adaptive version of affinity propagation clustering, which is adopted to divide each class subspace into several clusters. By these divisions of class spaces, a classification problem can be decomposed into many binary classification subtasks between cluster pairs, which are much easier than the classification task in the original multi-class space. Each of these binary classification subtasks is solved by a neural network designed by a dynamic process. Then all designed network modules form a network matrix structure, which produces a matrix of outputs that is fed to an integration machine so that a classification decision can be made. Finally, the experimental results show that our proposed MMNN system has more powerful generalization capability than single three-layer perceptron classifiers and modular neural networks adopting other task decomposition techniques, and requires less training time.

16.
A data analysis method is proposed to derive a latent structure matrix from a sample covariance matrix. The matrix can be used to explore the linear latent effect between two sets of observed variables. Procedures for estimating a set of dependent variables from a set of explanatory variables using the latent structure matrix are also proposed. The proposed method can help researchers improve the effectiveness of SEM models by exploring the latent structure between two sets of variables. In addition, a structure residual matrix can be derived as a by-product of the proposed method, with which researchers can conduct experimental procedures for variable combination and selection to build various models for hypothesis testing. These capabilities can improve the effectiveness of traditional SEM methods in characterizing data properties and testing model hypotheses. Case studies demonstrate, step by step, the procedure for deriving the latent structure matrix, and the latent structure estimation results are quite close to the results of PLS regression. A structure coefficient index is suggested to explore the relationships among various combinations of variables and their effects on the variance of the latent structure.

17.
We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable’s usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.

18.
Nonconvex programming problems are frequently encountered in engineering and operations research. A large variety of global optimization algorithms have been proposed for the various classes of programming problems. A new approach for global optimum search is presented in this paper which involves a decomposition of the variable set into two sets: complicating and noncomplicating variables. This results in a decomposition of the constraint set leading to two subproblems. The decomposition of the original problem induces special structure in the resulting subproblems, and a series of these subproblems is then solved, using the Generalized Benders' Decomposition technique, to determine the optimal solution. The key idea is to combine a judicious selection of the complicating variables with suitable transformations leading to subproblems which can attain their respective global solutions at each iteration. Mathematical properties of the proposed approach are presented. Even though the proposed approach cannot guarantee the determination of the global optimum, computational experience on a number of nonconvex QP, NLP and MINLP example problems indicates that a global optimum solution can be obtained from various starting points.

19.
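Correspondence analysis of regional malignant tumor mortality   Cited by: 1 (self-citations: 0; citations by others: 1)
Objective: To understand the regional distribution and the tumor-type distribution of malignant tumors in a county of Shandong Province during 2000-2002. Methods: Grouped correspondence analysis was applied to the county's malignant tumor mortality data. Results: Common factors and their loading coefficients were obtained for each region and each tumor type, and a factor loading plot was drawn from the loadings on the first and second factors, clearly showing the clustering of malignant tumor mortality and the distribution of high-incidence and low-incidence areas. Conclusion: Correspondence analysis, which combines variables and samples, is a useful complement to factor analysis; it can analyze the relationship between the row factors and column factors of a two-way data matrix and thereby serve the aims of the study.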

20.
In this paper, we consider the problem of selecting the fixed-effect variables in linear mixed models where random effects are present and the observation vectors have been obtained from many clusters. As the variable selection procedure, we use the Akaike Information Criterion (AIC). In the context of linear mixed models, two kinds of AIC have been proposed: marginal AIC and conditional AIC. In this paper, we derive three versions of conditional AIC depending on different estimators of the regression coefficients and the random effects. Simulation studies show that the proposed conditional AICs are superior to the marginal and conditional AICs proposed in the literature in the sense of selecting the true model. Finally, the results are extended to the case when the random effects in all the clusters are of the same dimension but have a common unknown covariance matrix.
