Similar Articles
20 similar articles retrieved.
1.
Two robustness criteria are presented that are applicable to general clustering methods. Robustness and stability in cluster analysis are not only data dependent, but even cluster dependent. Robustness is defined in the present paper as a property not only of the clustering method, but also of every individual cluster in a data set. The main principles are: (a) dissimilarity measurement of an original cluster with the most similar cluster in the induced clustering obtained by adding data points; (b) the dissolution point, an adaptation of the breakdown point concept to single clusters; (c) isolation robustness: given a clustering method, is it possible to join arbitrarily well-separated clusters by the addition of g points? Results are derived for k-means, k-medoids (k estimated by average silhouette width), trimmed k-means, mixture models (with and without a noise component, with and without estimation of the number of clusters by BIC), and single and complete linkage.
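A minimal sketch of principle (a): take a cluster from the original data, add g extra points, re-cluster, and score the original cluster against the most similar cluster in the induced clustering. The use of k-means, Jaccard similarity, and uniformly scattered added points here are illustrative assumptions, not the paper's exact construction.

    # Hedged sketch: cluster-wise stability under point addition (k-means, Jaccard).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

    def clusters_as_sets(labels):
        return [set(np.flatnonzero(labels == k)) for k in np.unique(labels)]

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    base = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    extra = rng.uniform(-3, 9, (10, 2))             # g added points
    aug = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.vstack([X, extra]))
    aug_sets = clusters_as_sets(aug[: len(X)])      # restrict to the original points

    for c in clusters_as_sets(base):
        best = max(jaccard(c, c2) for c2 in aug_sets)
        print(f"cluster of size {len(c)}: best Jaccard match after addition = {best:.2f}")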

2.
In model-based clustering, the density of each cluster is usually assumed to be a certain basic parametric distribution, for example, the normal distribution. In practice, it is often difficult to decide which parametric distribution is suitable to characterize a cluster, especially for multivariate data. Moreover, the densities of individual clusters may be multimodal themselves, and therefore cannot be accurately modeled by basic parametric distributions. This article explores a clustering approach that models each cluster by a mixture of normals. The resulting overall model is a multilayer mixture of normals. Algorithms to estimate the model and perform clustering are developed based on the classification maximum likelihood (CML) and mixture maximum likelihood (MML) criteria. BIC and ICL-BIC are examined for choosing the number of normal components per cluster. Experiments on both simulated and real data are presented.
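As a rough illustration of the multilayer idea (not the article's CML/MML estimation), one can fit a small Gaussian mixture inside each coarse cluster and pick the number of normal components per cluster by BIC; scikit-learn's GaussianMixture is used as a stand-in.

    # Hedged sketch: a "mixture of mixtures" built greedily, not the article's CML/MML fit.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    # One bimodal cluster (two nearby blobs) and one roughly normal cluster.
    X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
                   rng.normal([2, 0], 0.5, (100, 2)),
                   rng.normal([8, 8], 1.0, (150, 2))])

    coarse = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
    for k in np.unique(coarse):
        Xk = X[coarse == k]
        fits = [GaussianMixture(n_components=m, random_state=1).fit(Xk) for m in (1, 2, 3)]
        best = min(fits, key=lambda g: g.bic(Xk))
        print(f"cluster {k}: {best.n_components} normal components chosen by BIC")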

3.
In the cluster analysis problem one seeks to partition a finite set of objects into disjoint groups (or clusters) such that each group contains relatively similar objects, while relatively dissimilar objects are placed in different groups. For certain classes of the problem, or under certain assumptions, the partitioning exercise can be formulated as a sequence of linear programs (LPs), each with a parametric objective function. Such LPs can be solved using the parametric linear programming procedure developed by Gass and Saaty (Gass, S., Saaty, T. (1955), Naval Research Logistics Quarterly 2, 39–45). In this paper, a parametric linear programming model for solving cluster analysis problems is presented. We show how this model can be used to find optimal solutions for certain variations of the clustering problem or, in other cases, to approximate the general clustering problem.

4.
For several years, model-based clustering methods have successfully tackled many of the challenges faced by data analysts. However, as the scope of data analysis has evolved, some problems may be beyond the standard mixture model framework. One such problem arises when observations in a dataset come from overlapping clusters, whereby different clusters possess similar parameters for multiple variables. In this setting, mixed membership models, a soft clustering approach in which observations are not restricted to single cluster membership, have proved to be an effective tool. In this paper, a method for fitting mixed membership models to data generated by a member of an exponential family is outlined. The method is applied to count data obtained from an ultra-running competition and compared with a standard mixture model approach.

5.
In this work, we assess the suitability of cluster analysis for the gene grouping problem confronted with microarray data. Gene clustering is the exercise of grouping genes based on attributes, which are generally the expression levels over a number of conditions or subpopulations. The hope is that similarity with respect to expression is often indicative of similarity with respect to much more fundamental and elusive qualities, such as function. By formally defining the true gene-specific attributes as parameters, such as expected expression across the conditions, we obtain a well-defined gene clustering parameter of interest, which greatly facilitates the statistical treatment of gene clustering. We point out that genome-wide collections of expression trajectories often lack natural clustering structure, prior to ad hoc gene filtering. The gene filters in common use induce a certain circularity to most gene cluster analyses: genes are points in the attribute space, a filter is applied to depopulate certain areas of the space, and then clusters are sought (and often found!) in the “cleaned” attribute space. As a result, statistical investigations of cluster number and clustering strength are just as much a study of the stringency and nature of the filter as they are of any biological gene clusters. In the absence of natural clusters, gene clustering may still be a worthwhile exercise in data segmentation. In this context, partitions can be fruitfully encoded in adjacency matrices and the sampling distribution of such matrices can be studied with a variety of bootstrapping techniques.
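The closing remark about encoding partitions in adjacency matrices can be sketched directly: a partition becomes a binary co-membership matrix, and bootstrap replicates of that matrix summarize clustering variability. The k-means clusterer and the resampling-over-conditions scheme below are illustrative assumptions.

    # Hedged sketch: partitions as co-membership (adjacency) matrices, with a bootstrap.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 5))                    # e.g. 60 genes, 5 conditions

    def comembership(labels):
        return (labels[:, None] == labels[None, :]).astype(int)

    A0 = comembership(KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X))
    diffs = []
    for _ in range(50):                             # bootstrap over conditions (columns)
        cols = rng.integers(0, X.shape[1], X.shape[1])
        Ab = comembership(KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X[:, cols]))
        diffs.append(np.abs(A0 - Ab).mean())
    print(f"mean co-membership disagreement over bootstrap replicates: {np.mean(diffs):.3f}")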

6.
Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well known that in such high-dimensional settings, the existence of many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on a Dirichlet process with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that only distinguish a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering of samples and for regularizing the high-dimensional mean/variance structure. To address the computational challenge brought by this double usage of the DP, we propose a sequential sampling scheme embedded within Markov chain Monte Carlo (MCMC) updates to improve on the naive implementation of existing algorithms for DP mixture models. Our method is demonstrated on a simulation study and illustrated with the leukemia gene expression dataset.
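For intuition only, a truncated Dirichlet process Gaussian mixture can be fit with off-the-shelf variational tools; this sketch does not implement the sparsity prior, the double use of the DP, or the sequential-within-MCMC sampler described above.

    # Hedged sketch: truncated DP Gaussian mixture via variational inference (not the
    # article's sparsity-prior MCMC sampler).
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (80, 10)), rng.normal(3, 1, (80, 10))])

    dpgmm = BayesianGaussianMixture(
        n_components=10,                              # truncation level
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.1,
        random_state=3,
    ).fit(X)
    labels = dpgmm.predict(X)
    print("occupied components:", np.unique(labels).size)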

7.
8.
There exist many data clustering algorithms, but they cannot adequately handle the number of clusters or cluster shapes; their performance mainly depends on the choice of algorithm parameters. Our approach to data clustering does not require such a parameter choice; it can be treated as a natural adaptation to the existing structure of distances between data points. The outlier factor introduced by the author specifies a degree of being an outlier for each data point. The notion is based on the difference between the frequency distribution of interpoint distances in a given dataset and the corresponding distribution for uniformly distributed points. Data clusters can then be determined by maximizing the outlier factor function: the data points are divided into clusters according to the attractor regions of its local optima. An experimental evaluation shows that the proposed method can identify complex cluster shapes. Key advantages of the approach are good clustering properties for datasets with a comparatively large amount of noise (additional data points), and the absence of important parameters whose choice determines the quality of the results.
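A loose illustration of the underlying idea, comparing interpoint distances in the data with distances to uniformly distributed reference points; the per-point score below is a hypothetical stand-in, not the author's outlier factor.

    # Hedged sketch: contrast each point's interpoint-distance profile with a uniform
    # reference; the exact statistic here is illustrative only.
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2)),
                   rng.uniform(-1, 4, (10, 2))])   # two clusters plus background noise

    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, X.shape)                # uniform reference on the bounding box

    d_data = np.sort(cdist(X, X), axis=1)[:, 1:6]   # distances to 5 nearest data points
    d_unif = np.sort(cdist(X, U), axis=1)[:, :5]    # distances to 5 nearest uniform points
    score = d_unif.mean(axis=1) / d_data.mean(axis=1)
    print("points most cluster-like (largest score):", np.argsort(-score)[:5])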

9.
In this paper, we propose a new kernel-based fuzzy clustering algorithm which tries to find the best clustering results using optimal parameters of each kernel in each cluster. It is known that data with nonlinear relationships can be separated using one of the kernel-based fuzzy clustering methods. Two common fuzzy clustering approaches are: clustering with a single kernel and clustering with multiple kernels. While clustering with a single kernel doesn’t work well with “multiple-density” clusters, multiple kernel-based fuzzy clustering tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. Our algorithm is an extension of the single kernel-based fuzzy c-means and the multiple kernel-based fuzzy clustering algorithms. In this algorithm, there is no need to give “good” parameters of each kernel and no need to give an initial “good” number of kernels. Every cluster will be characterized by a Gaussian kernel with optimal parameters. In order to show its effective clustering performance, we have compared it to other similar clustering algorithms using different databases and different clustering validity measures.
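For context, the single-kernel baseline reduces, in the Euclidean case, to standard fuzzy c-means, which can be written in a few lines; the proposed per-cluster optimization of Gaussian kernel parameters is not reproduced here.

    # Hedged sketch: plain fuzzy c-means (the Euclidean baseline, not the proposed
    # per-cluster Gaussian-kernel optimization).
    import numpy as np

    def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))          # membership grades, rows sum to 1
        for _ in range(iters):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / (d ** (2 / (m - 1)))
            U /= U.sum(axis=1, keepdims=True)
        return centers, U

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    centers, U = fuzzy_cmeans(X, c=2)
    print("centers:\n", centers.round(2))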

10.
We develop two kernel-smoothing-based tests of a parametric mean-regression model against a nonparametric alternative when the response variable is right-censored. The new test statistics are inspired by the synthetic data and the weighted least squares approaches for estimating the parameters of a (non)linear regression model under censoring. The asymptotic critical values of our tests are given by the quantiles of the standard normal law. The tests are consistent against fixed alternatives, against local Pitman alternatives, and uniformly over alternatives in Hölder classes of functions of known regularity.

11.
A massive amount of data about individual electrical consumption is now provided by new metering technologies and smart grids. These new data are especially useful for load profiling and load modeling at different scales of the electrical network. A new methodology based on mixtures of high-dimensional regression models is used to perform clustering of individual customers, uncovering clusters that correspond to different regression models. Temporal information is incorporated in order to prepare the next step, the fitting of a forecasting model in each cluster. Only the electrical signal is involved: it is sliced into consecutive curves so that it can be considered a discrete time series of curves. Interpretation of the models is given on a real smart meter dataset of Irish customers.
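A toy EM for a two-component mixture of linear regressions conveys the clustering-by-regression-model idea; the high-dimensional, functional, and temporal aspects of the methodology are omitted, and all names below are illustrative.

    # Hedged sketch: EM for a two-component mixture of linear regressions (a toy stand-in
    # for the article's high-dimensional regression-mixture clustering).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, 200)
    z = rng.integers(0, 2, 200)                       # hidden cluster labels
    y = np.where(z == 0, 1.0 + 2.0 * x, 4.0 - 3.0 * x) + rng.normal(0, 0.2, 200)
    Xd = np.column_stack([np.ones_like(x), x])

    pi = np.array([0.5, 0.5]); beta = rng.normal(size=(2, 2)); sigma = np.array([1.0, 1.0])
    for _ in range(50):
        # E-step: responsibilities of each regression line for each point
        dens = np.stack([pi[k] * norm.pdf(y, Xd @ beta[k], sigma[k]) for k in range(2)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component
        for k in range(2):
            W = r[:, k]
            beta[k] = np.linalg.solve(Xd.T @ (W[:, None] * Xd), Xd.T @ (W * y))
            sigma[k] = np.sqrt(np.average((y - Xd @ beta[k]) ** 2, weights=W))
        pi = r.mean(axis=0)
    print("fitted lines (intercept, slope):\n", beta.round(2))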

12.
Consider a varying-coefficient single-index model, which consists of two parts: a linear part with varying coefficients and a nonlinear part with a single-index structure; such models are hence termed varying-coefficient single-index models. This model includes many important regression models, such as single-index models, partially linear single-index models, varying-coefficient models, and varying-coefficient partially linear models, as special cases. In this paper, we mainly study estimation of the varying-coefficient vector, the nonparametric link function, and the unknown parametric vector describing the single index in the model. A stepwise approach is developed to obtain asymptotically normal estimators of the varying-coefficient vector and the parametric vector, as well as estimators of the nonparametric link function with a given convergence rate. A consistent estimator of the structural error variance is also obtained. In addition, asymptotic pointwise confidence intervals and confidence regions are constructed for the varying coefficients and the parametric vector. The bandwidth selection problem is also considered. A simulation study is conducted to evaluate the proposed methods, and a real data analysis is used to illustrate them.

13.
The goal of clustering is to detect the presence of distinct groups in a dataset and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all datasets used in the examples, and R code producing the figures and numerical results, are available online.
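The connection to single linkage can be seen with a standard hierarchical clustering call; the article's generalized single linkage and excess-mass pruning are not reproduced in this sketch.

    # Hedged sketch: ordinary single-linkage clustering as the graph-based relative of
    # cluster-tree estimation (no excess-mass pruning here).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 0.4, (60, 2)), rng.normal(4, 0.4, (60, 2))])

    Z = linkage(X, method="single")                   # single-linkage dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two branches
    print("cluster sizes:", np.bincount(labels)[1:])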

14.
Customer segmentation is one of the most important purposes of customer base analysis for telecommunication companies. Because companies accumulate very large amounts of data on customer behavior, segmentation is typically achieved by profiling and clustering traffic behavior jointly with demographic data and contract characteristics. Unfortunately, most algorithms and models used for segmentation do not take into account the longitudinal characteristics of the data. In particular, in telecommunication traffic analysis the importance of decreasing patterns of traffic in customers' lives is well known, and it is relevant to aggregate all clients with such a pattern, while other, unknown clusters may be of interest to the marketing manager. Our approach to this problem is based on specifying the distribution of functions as a mixture of a parametric hierarchical model describing the decreasing-pattern segment and a nonparametric contamination that allows unanticipated curve shapes in subjects' traffic. The parametric component is chosen based on prior knowledge, while the contamination is characterized as a functional Dirichlet process.

15.
A new variable selection algorithm is developed for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster and the separation between clusters is usually measured by the ratio of between- and within-component dispersion. In this article, we allow one cluster to contain several components, depending on whether they merge into one mode. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation consists of the recently developed modal expectation–maximization (MEM) algorithm, which solves for the modes of a Gaussian mixture density, and the ridgeline expectation–maximization (REM) algorithm, which solves for the ridgeline passing through the critical points of the mixed density of two unimodal clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise cluster separability. Theoretical analysis of the procedure is provided. We experiment with both simulated and real datasets and compare with several state-of-the-art variable selection algorithms. Supplemental materials, including an R package, datasets, and appendices with proofs, are available online.
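For two Gaussian components the ridgeline has the closed form x(α) = [(1−α)Σ1⁻¹ + αΣ2⁻¹]⁻¹[(1−α)Σ1⁻¹μ1 + αΣ2⁻¹μ2] (the Ray–Lindsay parameterization). The sketch below evaluates the mixture density along this curve to gauge the valley between two modes, as an illustration rather than the article's full MEM/REM procedure.

    # Hedged sketch: Ray-Lindsay ridgeline between two Gaussian components and the
    # mixture density along it (not the article's full MEM/REM machinery).
    import numpy as np
    from scipy.stats import multivariate_normal

    mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
    S1, S2 = np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])
    w1, w2 = 0.5, 0.5

    alphas = np.linspace(0, 1, 201)
    ridge = []
    for a in alphas:
        A = (1 - a) * np.linalg.inv(S1) + a * np.linalg.inv(S2)
        b = (1 - a) * np.linalg.inv(S1) @ mu1 + a * np.linalg.inv(S2) @ mu2
        ridge.append(np.linalg.solve(A, b))
    ridge = np.array(ridge)

    dens = (w1 * multivariate_normal.pdf(ridge, mu1, S1)
            + w2 * multivariate_normal.pdf(ridge, mu2, S2))
    print(f"density at valley / density near the lower mode: {dens.min() / min(dens[0], dens[-1]):.3f}")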

16.
We prove polynomial-time solvability of a large class of clustering problems where a weighted set of items has to be partitioned into clusters with respect to some balancing constraints. The data points are weighted with respect to different features, and the clusters adhere to given lower and upper bounds on the total weight of their points with respect to each of these features. Further, the weight contribution of a vector to a cluster can depend on the cluster it is assigned to. Our interest in these types of clustering problems is motivated by an application in land consolidation, where the ability to perform this kind of balancing is crucial. Our framework maximizes an objective function that is convex in the summed-up utility of the items in each cluster. Despite the hardness of convex maximization and many related problems, for fixed dimension and number of clusters we are able to show that our clustering model is solvable in time polynomial in the number of items if the weight-balancing restrictions are defined using vectors from a fixed, finite domain. We conclude our discussion with a new, efficient model and algorithm for land consolidation.

17.
The field of cluster analysis is primarily concerned with partitioning data points into different clusters so as to optimize a certain criterion. Rapid advances in technology have made it possible to address clustering problems via optimization theory. In this paper, we present a global optimization algorithm to solve the fuzzy clustering problem, where each data point may be assigned to several clusters, with a membership grade for each cluster that reflects the likelihood of the data point belonging to it. The fuzzy clustering problem is formulated as a nonlinear program, for which a tight linear programming relaxation is constructed via the Reformulation-Linearization Technique (RLT) in concert with additional valid inequalities. This construct is embedded within a specialized branch-and-bound (B&B) algorithm to solve the problem to global optimality. Computational experience is reported using several standard data sets from the literature as well as synthetically generated larger problem instances. The results validate the robustness of the proposed algorithmic procedure and exhibit its dominance over the popular fuzzy c-means algorithm and the commercial global optimizer BARON.

18.
We present a dual-view mixture model to cluster users based on their features and latent behavioral functions. Every component of the mixture model represents a probability density over a feature view for observed user attributes and a behavior view for latent behavioral functions that are indirectly observed through user actions or behaviors. Our task is to infer the groups of users as well as their latent behavioral functions. We also propose a non-parametric version based on a Dirichlet Process to automatically infer the number of clusters. We test the properties and performance of the model on a synthetic dataset that represents the participation of users in the threads of an online forum. Experiments show that dual-view models outperform single-view ones when one of the views lacks information.

19.
§1 Introduction. Consider the fixed-design semiparametric nonlinear regression model given by y_i = f(x_i, θ) + λ(t_i) + ε_i, i = 1, ..., n, (1) where f(·, ·) i...

20.
In this article, we propose a novel Bayesian nonparametric clustering algorithm based on a Dirichlet process mixture of Dirichlet distributions, which have been shown to be very flexible for modeling proportional data. The idea is to let the number of mixture components increase as new data to cluster arrive, in such a manner that the model selection problem (i.e., determination of the number of clusters) can be answered without recourse to classic selection criteria. Thus, the proposed model can be considered an infinite Dirichlet mixture model. An expectation propagation inference framework is developed to learn this model by obtaining a full posterior distribution over its parameters. Within this learning framework, the model complexity and all the involved parameters are evaluated simultaneously. To show the practical relevance and efficiency of our model, we perform a detailed analysis using extensive simulations based on both synthetic and real data. In particular, real data are generated from three challenging applications, namely image categorization, anomaly intrusion detection, and video summarization.
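The "infinite" aspect can be illustrated with stick-breaking weights of a Dirichlet process: components receive mass only as the data demand. The snippet merely samples truncated stick-breaking weights; the expectation propagation learning of the Dirichlet component densities is not reproduced.

    # Hedged sketch: truncated stick-breaking weights of a Dirichlet process (illustrates
    # the "infinite mixture" idea only; not the article's EP inference for Dirichlet
    # component densities).
    import numpy as np

    def stick_breaking(alpha, truncation, seed=0):
        rng = np.random.default_rng(seed)
        v = rng.beta(1.0, alpha, truncation)
        remaining = np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])
        return v * remaining

    w = stick_breaking(alpha=2.0, truncation=20)
    print("number of components with weight > 1%:", int((w > 0.01).sum()))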
