Similar literature
Found 20 similar documents (search time: 15 ms)
1.
When clustering objects, we often collect not only continuous variables but binary attributes as well. This paper proposes a model-based clustering approach for mixed binary and continuous variables in which each binary attribute is generated by a latent continuous variable that is dichotomized at a suitable threshold value, and in which the scores of the latent variables are estimated from the binary data. In economics, such variables are called utility functions, and the assumption is that the binary attributes (the presence or absence of a public service or utility) are determined by low and high values of these functions. In genetics, the latent response is interpreted as the "liability" to develop a qualitative trait or phenotype. The estimated scores of the latent variables, together with the observed continuous variables, allow a multivariate Gaussian mixture model to be used for clustering instead of a mixture of discrete and continuous distributions. After describing the method, this paper presents results on both simulated and real-case data and compares the performance of the multivariate Gaussian mixture model with that of a mixture of joint multivariate and multinomial distributions. Results show that the former model outperforms the latter for variables with different scales, both in terms of classification error rate and in reproducing the cluster means.
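
As a rough illustration of the latent-variable idea in this abstract, the sketch below turns a single binary attribute into continuous scores. It assumes a standard normal latent variable and uses the conditional mean on each side of the threshold as the score; both choices, and the function name `latent_scores`, are assumptions made for this example and are not taken from the paper.

```python
from statistics import NormalDist


def latent_scores(binary):
    """Replace each 0/1 value by the conditional mean of a standard normal
    latent variable, given the side of the threshold it fell on."""
    std = NormalDist()
    p1 = sum(binary) / len(binary)       # observed share of ones
    if not 0.0 < p1 < 1.0:
        raise ValueError("attribute must contain both zeros and ones")
    tau = std.inv_cdf(1.0 - p1)          # threshold with P(Z > tau) = p1
    high = std.pdf(tau) / p1             # E[Z | Z > tau]
    low = -std.pdf(tau) / (1.0 - p1)     # E[Z | Z <= tau]
    return tau, [high if v == 1 else low for v in binary]


# Hypothetical binary attribute observed on ten objects
tau, scores = latent_scores([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
```

The resulting scores could then be placed alongside the observed continuous variables before fitting a Gaussian mixture model.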

2.
We propose a method for classifying instruments from a piece of sound. Features are derived from a pre-filtered time series divided into small windows. Afterwards, features from the (transformed) spectrum, Perceptive Linear Prediction (PLP), and Mel Frequency Cepstral Coefficients (MFCCs), as known from speech processing, are selected. As a clustering method, k-means is applied, yielding a reduced number of features for the classification task. An SVM classifier using a polynomial kernel yields good results: the misclassification error is roughly 19% for 59 different classes of instruments. As expected, the misclassification error is smaller for a problem with fewer classes. The rastamat library (Ellis, PLP and RASTA (and MFCC, and inversion) in Matlab, online web resource, 2005) functionality has been ported from Matlab to R. This means that feature extraction as known from speech processing is now easily available from the statistical programming language R. This software has been used on a cluster of machines for the computationally intensive evaluation of the proposed method.

3.
Interest in variable selection for clustering has increased recently due to the growing need to cluster high-dimensional data. In particular, variable selection eases both the clustering and the interpretation of the results. Existing approaches have demonstrated the importance of variable selection for clustering but turn out to be either very time consuming or not sparse enough in high-dimensional spaces. This work proposes to select the discriminative variables by introducing sparsity in the loading matrix of the Fisher-EM algorithm. This clustering method was recently proposed for the simultaneous visualization and clustering of high-dimensional data. It is based on a latent mixture model which fits the data into a low-dimensional discriminative subspace. Three different approaches are proposed in this work to introduce sparsity in the orientation matrix of the discriminative subspace through \(\ell_{1}\)-type penalizations. Experimental comparisons with existing approaches on simulated and real-world data sets demonstrate the value of the proposed methodology. An application to the segmentation of hyperspectral images of the planet Mars is also presented.

4.
With high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a sound method for analyzing these data. It simultaneously performs clustering and variable selection. The method is inspired by the plaid model. It may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, within this model an observation may be explained by several clusters. This characteristic makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to the gene expression data of kidney renal cell carcinoma taken from The Cancer Genome Atlas validates some previously identified cancer biomarkers.

5.
6.
Numerical Algorithms - The Kaczmarz algorithm is one of the most popular methods for solving large-scale over-determined linear systems due to its simplicity and computational efficiency. This...
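
For orientation, here is a minimal sketch of the classical cyclic Kaczmarz iteration, which projects the current iterate onto the hyperplane defined by one equation at a time. This is the textbook version only; the abstract above is truncated, so nothing here should be read as the variant studied in that paper. The function name `kaczmarz`, the `sweeps` parameter, and the small test system are assumptions made for the example.

```python
def kaczmarz(A, b, sweeps=200):
    """Classical cyclic Kaczmarz iteration for A x = b (A given as a list of rows)."""
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # a zero row carries no constraint
            # Project x onto the hyperplane {z : row . z = rhs}
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            step = residual / norm_sq
            x = [xi + step * a for xi, a in zip(x, row)]
    return x


# Consistent over-determined system whose exact solution is (1, 2)
A = [[2.0, 1.0], [1.0, 3.0], [3.0, 4.0]]
b = [4.0, 7.0, 11.0]
x = kaczmarz(A, b)
```

Each update touches a single row of the matrix, which is why the method is attractive when the system is too large to factorize.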

7.
8.
In this paper, we propose a new kernel-based fuzzy clustering algorithm which tries to find the best clustering results using optimal parameters of each kernel in each cluster. It is known that data with nonlinear relationships can be separated using one of the kernel-based fuzzy clustering methods. Two common fuzzy clustering approaches are clustering with a single kernel and clustering with multiple kernels. While clustering with a single kernel does not work well with "multiple-density" clusters, multiple kernel-based fuzzy clustering tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. Our algorithm is an extension of the single kernel-based fuzzy c-means and the multiple kernel-based fuzzy clustering algorithms. In this algorithm, there is no need to supply "good" parameters for each kernel or a "good" initial number of kernels. Every cluster is characterized by a Gaussian kernel with optimal parameters. To demonstrate its clustering performance, we compare it to other similar clustering algorithms using different databases and different clustering validity measures.

9.
Summary: The analogue of Strassen's functional law of the iterated logarithm is known for many Gaussian processes which have suitable scaling properties, and here we establish rates at which this convergence takes place. We provide a new proof of the best upper bound for the convergence to K by suitably normalized Brownian motion, and then continue with this method to get similar bounds for the Brownian sheet and other self-similar Gaussian processes. The previous method, which produced these results for Brownian motion in [1], was highly dependent on many special properties unavailable when dealing with other Gaussian processes. Supported in part by NSF Grant NSF-88-07121. Supported in part by NSF Grant DMS-85-21586.

10.
A new approach is suggested for the nonparametric estimation of an unknown distribution density under the assumption that the true density has bounded variation. The estimator is a statistic based on applying the maximum likelihood method to the estimation of an infinite-dimensional shift of a Gaussian process with a known correlation function. The quality of the obtained estimate is investigated. Translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V. A. Steklova Akademii Nauk SSSR, Vol. 177, pp. 6–7, 1989.

11.
This paper studies a defect data analysis method for semiconductor yield enhancement. Given the defect locations on a wafer, local defects arising from assignable causes are separated from global defects arising from random causes by model-based clustering, and the clustering methods can identify the characteristics of local defect clusters. The information obtained from this method can facilitate process control, particularly root-cause analysis. The global defects are modeled by a spatial non-homogeneous Poisson process, and the local defects are modeled by a bivariate normal distribution or by a principal curve.

12.
This paper addresses the problem of insufficient performance of statistical classification with a medium-sized database (thousands of classes). Each object is represented as a sequence of independent segments. Each segment is defined as a random sample of independent features with a distribution of multivariate exponential type. To speed up the optimal Kullback–Leibler minimum information discrimination principle, we apply clustering of the training set and an approximate nearest neighbor search for the input object in a set of cluster medoids. Using the asymptotic properties of the Kullback–Leibler divergence, we propose the maximal likelihood search procedure. In this method, the medoid to check is selected from the cluster with the maximal joint density (likelihood) of the distances to the previously checked medoids. Experimental results in image recognition with an artificially generated dataset and the Essex facial database show that the proposed approach is much more effective than an exhaustive search and the known approximate nearest neighbor methods from the FLANN and NonMetricSpace libraries.

13.
14.
Advances in Data Analysis and Classification - We consider model-based clustering methods for continuous, correlated data that account for external information available in the presence of...

15.
The predictive likelihood of a model specified by data is defined when the model satisfies certain conditions. It reduces to the conventional definition when the model is specified independently of the data. The definition is applied to some Gaussian models, and a method of handling improper uniform prior distributions is obtained for the Bayesian modeling of a multi-model situation where the submodels may have different numbers of parameters. The practical utility of the method is checked by a Monte Carlo experiment on some quasi-Bayesian procedures realized by using the predictive likelihoods. The Institute of Statistical Mathematics.

16.
In this paper we investigate various third-order asymptotic properties of maximum likelihood estimators for Gaussian ARMA processes by means of the third-order Edgeworth expansions of the sampling distributions. We define third-order asymptotic efficiency by the highest probability concentration around the true value with respect to the third-order Edgeworth expansion. We then show that the maximum likelihood estimator is not always third-order asymptotically efficient in the class A3 of third-order asymptotically median unbiased estimators. However, if we confine our discussion to an appropriate class D (⊂ A3) of estimators, we can show that an appropriately modified maximum likelihood estimator is always third-order asymptotically efficient in D.

17.
Latent class (LC) analysis is used to construct empirical evidence on the existence of latent subgroups based on the associations among a set of observed discrete variables. One of the tests used to make inferences about the number of underlying subgroups is the bootstrap likelihood ratio test (BLRT). Although power analysis is rarely conducted for this test, it is important to identify, clarify, and specify the design issues that influence statistical inference on the number of latent classes based on the BLRT. This paper proposes a computationally efficient "short-cut" method to evaluate the power of the BLRT and presents a procedure to determine the sample size required to attain a specific power level. Results of our numerical study showed that this short-cut method yields reliable estimates of the power of the BLRT. The numerical study also showed that the sample size required to achieve a specified power level depends on various factors, among which class separation plays a dominant role. In some situations, a sample size of 200 may be enough, while in others 2000 or more subjects are required to achieve the required power.

18.
Joint latent class modeling of disease prevalence and high-dimensional semicontinuous biomarker data has been proposed to study the relationship between diseases and their related biomarkers. However, statistical inference for the joint latent class modeling approach has proved very challenging due to the computational complexity of seeking maximum likelihood estimates. In this article, we propose a series of composite likelihoods for maximum composite likelihood estimation, as well as an enhanced Monte Carlo expectation–maximization (MCEM) algorithm for maximum likelihood estimation, in the context of joint latent class models. Theoretically, the maximum composite likelihood estimates are consistent and asymptotically normal. Numerically, we show that, compared to the MCEM algorithm that maximizes the full likelihood, the composite likelihood approach coupled with the quasi-Newton method substantially reduces the computational complexity and duration while retaining comparable estimation efficiency.

19.
Let \(P_t^f\) be the measure generated by a Gaussian stationary process with spectral density \(f\) on an interval of time length \(t\), and let \(L_t\) be the likelihood function. One investigates the correspondence between the asymptotic behavior of the function \(L_t\) and a regularity condition of the process u. Translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V. A. Steklova AN SSSR, Vol. 119, pp. 203–217, 1982.

20.
This paper is mainly devoted to a precise analysis of what kind of penalties should be used in order to perform model selection via the minimization of a penalized least-squares type criterion within a general Gaussian framework that includes the classical ones. Compared to our previous paper on this topic (Birgé and Massart in J. Eur. Math. Soc. 3, 203–268 (2001)), more elaborate forms of the penalties are given which are shown to be, in some sense, optimal. We provide more precise upper bounds for the risk of the penalized estimators and lower bounds for the penalty terms, showing that the use of smaller penalties may lead to disastrous results. These lower bounds may also be used to design a practical strategy for estimating the penalty from the data when the amount of noise is unknown. We illustrate the method on the problem of estimating a piecewise constant signal in Gaussian noise when neither the number nor the locations of the change points are known.
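
The piecewise constant example mentioned in this abstract can be sketched as follows: dynamic programming finds the best least-squares fit for each number of segments, and a penalized criterion then picks the number of segments. The penalty shape `sigma2 * d * (c1 * log(n / d) + c2)`, the constants `c1` and `c2`, the assumption that the noise variance `sigma2` is known, and all function names are illustrative assumptions for this sketch and are not taken from the abstract.

```python
import math


def segment_costs(y):
    """Return cost(i, j): residual sum of squares of y[i:j] around its mean."""
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    return cost


def best_partitions(y, max_segments):
    """For each segment count d, the minimal RSS and the change points achieving it."""
    n, cost, inf = len(y), segment_costs(y), float("inf")
    # rss[d][j]: minimal RSS of y[:j] split into d segments
    rss = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    rss[0][0] = 0.0
    for d in range(1, max_segments + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):
                c = rss[d - 1][i] + cost(i, j)
                if c < rss[d][j]:
                    rss[d][j], back[d][j] = c, i
    results = {}
    for d in range(1, max_segments + 1):
        cuts, j = [], n
        for k in range(d, 0, -1):  # walk back through the segment starts
            j = back[k][j]
            cuts.append(j)
        results[d] = (rss[d][n], sorted(cuts)[1:])  # drop the leading 0
    return results


def select_model(y, sigma2, max_segments, c1=2.0, c2=5.0):
    """Choose the segment count minimizing RSS plus the assumed penalty."""
    n = len(y)
    fits = best_partitions(y, max_segments)

    def criterion(d):
        return fits[d][0] + sigma2 * d * (c1 * math.log(n / d) + c2)

    d_hat = min(fits, key=criterion)
    return d_hat, fits[d_hat][1]


# Hypothetical signal: level 0 then level 5, with a small alternating perturbation
y = [(0.0 if i < 10 else 5.0) + 0.1 * (-1) ** i for i in range(20)]
d_hat, cuts = select_model(y, sigma2=0.01, max_segments=5)
```

Shrinking the penalty constants makes the criterion favor more segments, which is the over-fitting risk the abstract warns about when penalties are too small.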
