Similar Documents
20 similar documents found.
1.
The clusterwise regression model is used to perform cluster analysis within a regression framework. While the traditional regression model assumes the regression coefficient (β) to be identical for all subjects in the sample, the clusterwise regression model allows β to vary across clusters of subjects. Since the cluster membership is unknown, estimating a clusterwise regression model is a hard combinatorial optimization problem. In this research, we propose a “Generalized Clusterwise Regression Model” formulated as a mathematical programming (MP) problem. A nonlinear programming procedure (with linear constraints) is proposed to solve the combinatorial problem and to estimate the cluster membership and β simultaneously. Moreover, by integrating cluster analysis with discriminant analysis, a clusterwise discriminant model is developed that incorporates parameter heterogeneity into traditional discriminant analysis. The cluster membership and the discriminant parameters are again estimated simultaneously by a nonlinear programming model.
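The paper's simultaneous MP estimation is not reproduced here, but a common alternating heuristic conveys the idea: assign each observation to the cluster whose regression line predicts it best, then refit per-cluster least squares. A minimal sketch (my own illustration under these assumptions, not the authors' nonlinear program):

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Alternating heuristic: assign each point to the cluster whose
    regression predicts it best, then refit per-cluster OLS."""
    rng = np.random.default_rng(seed)
    n = len(y)
    labels = rng.integers(0, k, size=n)       # random initial partition
    Xd = np.column_stack([np.ones(n), X])     # add intercept column
    for _ in range(n_iter):
        betas = []
        for j in range(k):
            m = labels == j
            if m.sum() < Xd.shape[1]:         # keep every cluster estimable
                m = rng.random(n) < 0.5
            beta, *_ = np.linalg.lstsq(Xd[m], y[m], rcond=None)
            betas.append(beta)
        resid = np.stack([(y - Xd @ b) ** 2 for b in betas])
        new_labels = resid.argmin(axis=0)     # reassign to best-fitting line
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, betas
```

Like all such alternating schemes, this only finds a local optimum, which is exactly the difficulty the paper's MP formulation is designed to address.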

2.

The primary model for cluster analysis is the latent class model, which yields the mixture likelihood. Due to numerous local maxima, the success of the EM algorithm in maximizing the mixture likelihood depends on its starting point. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of cluster datasets, and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line, and the easiest are those in which the clusters are arranged on an equilateral triangle. The two-step algorithm is assessed with several classification methods and is shown to be able to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line, with 10,000 observations and 8 variables.
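As a rough illustration of the two-step idea (classification followed by EM), the sketch below uses scikit-learn: k-means on a random subsample supplies the starting means for EM on the full data. The data layout mimics the paper's hard case of three clusters on a line; all sizes and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three overlapping clusters arranged on a line (the hard case in the paper).
X = np.vstack([rng.normal(loc=[c, 0], scale=1.0, size=(500, 2))
               for c in (-2, 0, 2)])

# Step 1: classify a random subsample to obtain starting points.
sub = X[rng.choice(len(X), size=300, replace=False)]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sub)

# Step 2: run EM initialized at the subsample classification.
em = GaussianMixture(n_components=3, means_init=km.cluster_centers_,
                     random_state=0).fit(X)
print("total log-likelihood:", em.score(X) * len(X))
```

The baseline in the paper corresponds to starting EM from a random partition instead of the subsample classification.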

3.

In model-based clustering, mixture models are used to group data points into clusters. A useful concept, introduced for Gaussian mixtures by Malsiner Walli et al. (Stat Comput 26:303–324, 2016), is the sparse finite mixture, where the prior on the weight distribution of a mixture with K components is chosen in such a way that a priori the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixtures is very generic and easily extended to cluster various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyper prior is considered for the parameters determining the weight distribution. By suitably matching these priors, it is shown that the choice of hyper prior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is used.

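A toy simulation can illustrate the sparseness mechanism, assuming a symmetric Dirichlet(e0) prior on the weights (the specific hyper priors compared in the paper are not reproduced): the smaller e0 is, the fewer of the K components are occupied a priori.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 10, 200  # K mixture components, n observations

def expected_occupied(e0, draws=2000):
    """Mean number of non-empty components when n cluster labels are drawn
    from mixture weights with a symmetric Dirichlet(e0) prior."""
    counts = []
    for _ in range(draws):
        w = rng.dirichlet(np.full(K, e0))   # prior draw of the weights
        z = rng.choice(K, size=n, p=w)      # prior draw of the labels
        counts.append(len(np.unique(z)))
    return np.mean(counts)

for e0 in (0.05, 0.5, 5.0):
    print(f"e0={e0}: ~{expected_occupied(e0):.1f} occupied components")
```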

4.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.
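The authors' methods are available in the R package seriation; the Python sketch below is only a stand-in that uses hierarchical-clustering leaf order (a cheap surrogate for a proper seriation solver) to reorder a distance matrix within each known cluster before shading it:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in (0, 3, 6)])
labels = np.repeat([0, 1, 2], 40)

order = []
for j in range(3):                        # reorder within each cluster
    idx = np.where(labels == j)[0]
    Z = linkage(pdist(X[idx]), method="average")
    order.extend(idx[leaves_list(Z)])     # leaf order as a cheap seriation

D = squareform(pdist(X))[np.ix_(order, order)]
plt.imshow(D, cmap="gray")  # small distances are dark: tight clusters show
plt.title("Dissimilarity plot (reordered)")  # as dark diagonal blocks
plt.show()
```

Poorly separated or misspecified clusters show up as dark off-diagonal blocks, which is the diagnostic use the paper emphasizes.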

5.
The analysis of data generated by animal habitat selection studies, by family studies of genetic diseases, or by longitudinal follow-up of households often involves fitting a mixed conditional logistic regression model to longitudinal data composed of clusters of matched case-control strata. Estimating the model parameters by maximum likelihood is especially difficult when the number of cases per stratum is greater than one: the denominator of each cluster's contribution to the conditional likelihood then involves a complex high-dimensional integral, which leads to convergence problems in the numerical maximization. In this article we show how these computational complexities can be bypassed using a global two-step analysis for nonlinear mixed effects models. The first step estimates the cluster-specific parameters and can be achieved with standard statistical methods and software based on maximum likelihood for independent data. The second step uses the EM algorithm in conjunction with conditional restricted maximum likelihood to estimate the population parameters. We use simulations to demonstrate that the method works well when the analysis is based on a large number of strata per cluster, as in many ecological studies. We apply the proposed two-step approach to evaluate habitat selection by pairs of bison roaming freely in their natural environment. This article has supplementary material online.

6.
In this article, we propose a novel Bayesian nonparametric clustering algorithm based on a Dirichlet process mixture of Dirichlet distributions, which have been shown to be very flexible for modeling proportional data. The idea is to let the number of mixture components increase as new data to cluster arrive, in such a manner that the model selection problem (i.e., determining the number of clusters) is answered without recourse to classic selection criteria. Thus, the proposed model can be considered an infinite Dirichlet mixture model. An expectation propagation inference framework is developed to learn this model by obtaining a full posterior distribution on its parameters; within this framework, the model complexity and all the involved parameters are evaluated simultaneously. To show the practical relevance and efficiency of our model, we perform a detailed analysis using extensive simulations based on both synthetic and real data. In particular, real data are drawn from three challenging applications: image categorization, anomaly-based intrusion detection, and video summarization.

7.
Data envelopment analysis-discriminant analysis (DEA-DA) has been used for predicting the cluster membership of decision-making units (DMUs). One possible application of DEA-DA is in marketing research. This paper uses cluster analysis to divide customers into two clusters, Gold and Lead, and then applies DEA-DA to predict the cluster membership of new customers. DEA-DA incorporates an arbitrary parameter η that imposes a small gap between the two clusters. It is shown that different values of η lead to different prediction accuracy levels, since an unsuitable value of η leads to an incorrect classification of DMUs; even a data set with no overlap between the two clusters can be misclassified. The aim of this paper is to illustrate these computational difficulties in previous DEA-DA approaches and then to propose a new DEA-DA model that overcomes them. A case study demonstrates the efficacy of the proposed model.

8.
The detection of community structures within network data is a type of graph analysis with increasing interest across a broad range of disciplines. In a network, communities represent clusters of nodes that exhibit strong intra-connections or relationships among nodes in the cluster. Current methodology for community detection often involves an algorithmic approach that partitions a graph into node clusters iteratively until some stopping criterion is met. Other statistical approaches for community detection often require model choices and prior selection in Bayesian analyses, which are difficult without some amount of data inspection and pre-processing. Because communities are often fuzzily defined human concepts, an alternative approach is to leverage human vision to identify them. This work presents a tool for community detection in the form of a web application, called gravicom, which facilitates the detection of community structures through visualization and direct user interaction. In the process of detecting communities, the gravicom application can serve as a standalone tool or as a step to initialize (and/or post-process) another community detection algorithm. In this paper we discuss the design of gravicom and demonstrate its use for community detection with several network data sets. An “Appendix” describes the technical details of this web application, built on the R package Shiny and the JavaScript library D3.

9.

Privacy-preserving data splitting is a technique that aims to protect data privacy by storing different fragments of the data in different locations. In this work we give a new combinatorial formulation of the data splitting problem: data attributes must be split into fragments in a way that satisfies certain combinatorial properties derived from processing and privacy constraints. Using this formulation, we develop new combinatorial and algebraic techniques for solving the problem. We present an algebraic method that builds an optimal data splitting solution using Gröbner bases. Since this method is not efficient in general, we also develop a greedy algorithm for finding solutions that are not necessarily of minimal size.
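The Gröbner-basis construction is beyond a short sketch, but the greedy side can be illustrated. Assume each privacy constraint is a set of attributes that must not all land in one fragment (a simplification of the paper's combinatorial properties); a greedy pass places each attribute in the first fragment that stays safe, opening a new fragment when none does:

```python
def greedy_split(attributes, constraints):
    """Greedy data splitting: each constraint is a set of attributes that
    must not all reside in the same fragment. Place every attribute in the
    first fragment that keeps all constraints satisfied."""
    fragments = []
    for a in attributes:
        placed = False
        for frag in fragments:
            trial = frag | {a}
            # a fragment is safe if it fully contains no constraint set
            if not any(c <= trial for c in constraints):
                frag.add(a)
                placed = True
                break
        if not placed:
            fragments.append({a})
    return fragments

# Hypothetical example: 'name' must not co-occur with 'diagnosis',
# nor together with both 'zip' and 'dob'.
constraints = [{"name", "diagnosis"}, {"zip", "dob", "name"}]
print(greedy_split(["name", "zip", "dob", "diagnosis", "age"], constraints))
```

Greedy placement gives a feasible but not necessarily minimal number of fragments, which is exactly the trade-off the paper notes against the exact algebraic method.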


10.

It is well known that variable selection in multiple regression can be unstable and that the model uncertainty can be considerable. This uncertainty can be quantified and explored by bootstrap resampling; see Sauerbrei et al. (Biom J 57:531–555, 2015). Here, approaches are introduced that use the results of bootstrap replications of the variable selection process to obtain more detailed information about the data. Analyses are based on dissimilarities between the results obtained from different bootstrap samples. Dissimilarities are computed between the vectors of predictions and between the sets of selected variables. The dissimilarities are used to map the models by multidimensional scaling, to cluster them, and to construct heatplots. Clusters can point to different interpretations of the data that arise from different selections of variables supported by different bootstrap samples. A new measure of variable selection instability is also defined. The methodology can be applied to various regression models, estimators, and variable selection methods. It is illustrated by three real data examples, using linear regression and a Cox proportional hazards model, with model selection by AIC and BIC.
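A hedged sketch of the resampling loop, with the lasso standing in for the paper's AIC/BIC-based selection: record the selected variable set per bootstrap sample and compute pairwise Jaccard dissimilarities, which can then be fed to MDS, clustering, or a heatplot.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # two true signals

selected = []
for b in range(100):                     # bootstrap replications
    idx = rng.integers(0, n, size=n)     # resample rows with replacement
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    selected.append(frozenset(np.flatnonzero(fit.coef_)))

def jaccard_dist(a, b):
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

D = np.array([[jaccard_dist(a, b) for b in selected] for a in selected])
# D can now be mapped with MDS, clustered, or displayed as a heatplot.
print("mean pairwise selection dissimilarity:", D.mean())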


11.
Based on the data-cutoff method, we study quantile regression in linear models where the noise process is of Ornstein-Uhlenbeck type, possibly with jumps. In single-level quantile regression, we allow the noise process to be heteroscedastic, while in composite quantile regression, we require that the noise process be homoscedastic so that the slopes are invariant across quantiles. As in the independent-noise case, the proposed quantile estimators are root-n consistent and asymptotically normal. Furthermore, the adaptive least absolute shrinkage and selection operator (LASSO) is applied for variable selection. As a result, the quantile estimators are consistent in variable selection, and the nonzero coefficient estimators enjoy the same asymptotic distribution as their counterparts under the true model. Extensive numerical simulations are conducted to evaluate the performance of the proposed approaches, and foreign exchange rate data are analyzed for illustration.

12.

Test-based variable selection algorithms in regression are often based on sequential comparison of test statistics to cutoff values. A predetermined α level typically is used to determine the cutoffs based on an assumed probability distribution for the test statistic. For example, backward elimination or forward stepwise selection involves comparisons of test statistics to prespecified t or F cutoffs in Gaussian linear regression, while a likelihood ratio, Wald, or score statistic is typically used with standard normal or chi-square cutoffs in nonlinear settings. Although such algorithms enjoy widespread use, their statistical properties are not well understood, either theoretically or empirically. Two inherent problems with these methods are that (1) as in classical hypothesis testing, the value of α is arbitrary, and (2) unlike hypothesis testing, there is no simple analog of the type I error rate corresponding to application of the entire algorithm to a data set. In this article we propose a new method, backward elimination via cross-validation (BECV), for test-based variable selection in regression. It is implemented by first finding the empirical p value α* that minimizes a cross-validation estimate of squared prediction error, and then selecting the model by running backward elimination on the entire data set using α* as the nominal p value for each test. We present results of an extensive computer simulation evaluating BECV and comparing its performance to standard backward elimination and forward stepwise selection.
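A compact sketch of the BECV idea under simplifying assumptions (OLS p-values, a small illustrative α grid, squared-error loss): backward elimination is run inside each cross-validation fold for every candidate α, and the α* with the smallest CV prediction error is then used for one final pass on the full data.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

def backward_eliminate(X, y, alpha):
    """Drop the least significant column until all p-values <= alpha."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = fit.pvalues[1:]                  # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        cols.pop(worst)
    return cols

def becv(X, y, alphas=(0.01, 0.05, 0.1, 0.2, 0.3), folds=5):
    """Pick alpha* minimizing CV squared prediction error, then select."""
    errs = []
    for a in alphas:
        sse = 0.0
        for tr, te in KFold(folds, shuffle=True, random_state=0).split(X):
            cols = backward_eliminate(X[tr], y[tr], a)
            fit = sm.OLS(y[tr], sm.add_constant(X[tr][:, cols])).fit()
            pred = fit.predict(sm.add_constant(X[te][:, cols]))
            sse += np.sum((y[te] - pred) ** 2)
        errs.append(sse)
    best = alphas[int(np.argmin(errs))]
    return best, backward_eliminate(X, y, best)
```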

13.
We introduce a new network-based data mining approach to selecting diversified portfolios by modeling the stock market as a network and utilizing combinatorial optimization techniques to find maximum-weight s-plexes in the obtained networks. The considered approach is based on the weighted market graph model, which is used for identifying clusters of stocks according to a correlation-based criterion. The proposed techniques provide a new framework for selecting profitable diversified portfolios, which is verified by computational experiments on historical data over the past decade. In addition, the proposed approach can be used as a complementary tool for narrowing down a set of “candidate” stocks for a diversified portfolio, which can potentially be analyzed using other known portfolio selection techniques.  
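The s-plex solver itself is not shown; the sketch below only illustrates the weighted market graph construction on synthetic returns: correlate the return series and keep edges whose correlation clears a threshold, with the correlation as the edge weight (the factor model and threshold are illustrative assumptions).

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
market = rng.normal(size=(250, 1))            # common market factor
returns = market + rng.normal(size=(250, 30)) # 250 days x 30 synthetic stocks
corr = np.corrcoef(returns, rowvar=False)

G = nx.Graph()
theta = 0.4                                   # correlation threshold
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if corr[i, j] >= theta:
            G.add_edge(i, j, weight=corr[i, j])

# A maximum-weight s-plex (every vertex adjacent to at least |S|-s others
# in the cluster S) would now be sought in G with an exact solver.
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```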

14.

Consider the general linear model (GLM) Y = Xβ + ε. Suppose Θ1,…, Θk, a subset of the β's, are of interest; Θ1,…, Θk may be treatment contrasts in an ANOVA setting or regression coefficients in a response surface setting. Existing simultaneous confidence intervals for Θ1,…, Θk are relatively conservative or, in the case of the MEANS option in PROC GLM of SAS, possibly misleading. The difficulty lies in the multidimensionality of the integration required to compute the exact coverage probability when X does not follow a nice textbook design. Noting that such exact coverage probabilities can be computed if the correlation matrix R of the estimators of Θ1,…, Θk has a one-factor structure in the factor-analytic sense, it is proposed that approximate simultaneous confidence intervals be computed by approximating R with the closest correlation matrix having a one-factor structure. Computer simulations of hundreds of randomly generated designs in the settings of regression, analysis of covariance, and unbalanced block designs show that the coverage probabilities are practically exact, more so than can be anticipated by even second-order probability bounds.
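As a rough illustration of the approximation step (a simple principal-factor iteration, not necessarily the authors' algorithm), one can fit loadings λ so that R ≈ λλᵀ off the diagonal:

```python
import numpy as np

def one_factor_approx(R, n_iter=100):
    """Fit loadings lam so that R ~ lam lam^T off-diagonal (unit diagonal)."""
    lam = np.full(R.shape[0], 0.5)
    for _ in range(n_iter):
        M = R.copy()
        np.fill_diagonal(M, lam ** 2)    # replace diagonal with communalities
        w, V = np.linalg.eigh(M)
        lam = np.sqrt(max(w[-1], 0)) * V[:, -1]   # leading-eigenvector update
    R1 = np.outer(lam, lam)
    np.fill_diagonal(R1, 1.0)
    return lam, R1

# This 3x3 example is exactly one-factor, so the error should vanish.
R = np.array([[1.0, 0.6, 0.48],
              [0.6, 1.0, 0.4],
              [0.48, 0.4, 1.0]])
lam, R1 = one_factor_approx(R)
print(np.round(lam, 3), "max abs error:", np.abs(R - R1).max())
```

Once R has this structure, the k-dimensional coverage integral factors into one-dimensional integrals over the common factor, which is what makes the intervals cheap to compute.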

15.
The energy distribution of wind-driven ocean waves is of great interest in marine science. Discovering the generating process of ocean waves is often challenging, and direction is key to a better understanding. Typically, wave records are transformed into a directional spectrum, which provides information about the distribution of wave energy across different frequencies and directions. Here, we propose a new time series clustering method for a series of directional spectra to extract the spectral features of ocean waves, and we develop informative visualization tools to summarize the identified wave clusters. We treat directional distributions as functional data of direction and construct a directional functional boxplot to display the main directional distribution of the wave energy within a cluster. We also trace back when these spectra were observed and present color-coded clusters on a calendar plot to show their temporal variability. For each identified wave cluster, we analyze hourly wind speed and wind direction to investigate the link between wind data and wave directional spectra. The performance of the proposed clustering method is evaluated by simulations and illustrated by a real-world dataset from the Red Sea. Supplementary materials for this article are available online.

16.
Partitional clustering is a technique for classifying n objects into k disjoint clusters; it has been developed over many years and is widely used in many applications. In this paper, a new overlapping clustering algorithm is defined. It differs from traditional clustering algorithms in three respects. First, the new clustering is overlapping, because clusters are allowed to overlap with one another. Second, it is non-exhaustive, because an object is permitted to belong to no cluster. Third, the goals considered in this research are the maximization of the average number of objects contained in a cluster and the maximization of the distances among cluster centers, whereas the goals in previous research are the maximization of the similarities of objects in the same cluster and the minimization of the similarities of objects in different clusters. Furthermore, the new clustering also differs from traditional fuzzy clustering, because the object–cluster relationship is represented by a crisp value rather than by a fuzzy membership degree. Accordingly, a new overlapping partitioning cluster (OPC) algorithm is proposed to provide overlapping and non-exhaustive clustering of objects. Finally, several simulated and real-world data sets are used to evaluate the effectiveness and efficiency of the OPC algorithm, and the outcomes indicate that the algorithm can generate satisfactory clustering results.
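A minimal sketch of the crisp, overlapping, non-exhaustive assignment rule, using k-means centers as a stand-in for the OPC optimization: an object joins every cluster whose center lies within a radius r, and joins none if all centers are farther than r. The radius here is an illustrative tuning choice, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, size=(60, 2))
               for c in ((0, 0), (3, 0), (1.5, 2.5))])

centers = KMeans(n_clusters=3, n_init=10,
                 random_state=0).fit(X).cluster_centers_
r = 1.5                                 # membership radius (tuning choice)

dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
membership = dist <= r                  # crisp 0/1; rows may have several 1s
print("objects in >1 cluster:", int((membership.sum(axis=1) > 1).sum()))
print("objects in no cluster:", int((membership.sum(axis=1) == 0).sum()))
```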

17.
Detecting low-diameter clusters is an important graph-based data mining technique used in social network analysis, bioinformatics and text-mining. Low pairwise distances within a cluster can facilitate fast communication or good reachability between vertices in the cluster. Formally, a subset of vertices that induce a subgraph of diameter at most k is called a k-club. For low values of the parameter k, this model offers a graph-theoretic relaxation of the clique model that formalizes the notion of a low-diameter cluster. Using a combination of graph decomposition and model decomposition techniques, we demonstrate how the fundamental optimization problem of finding a maximum size k-club can be solved optimally on large-scale benchmark instances that are available in the public domain. Our approach circumvents the use of complicated formulations of the maximum k-club problem in favor of a simple relaxation based on necessary conditions, combined with canonical hypercube cuts introduced by Balas and Jeroslow.  
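The decomposition and cutting-plane machinery is not reproduced here, but the defining check is easy to state: a vertex set is a k-club exactly when the subgraph it induces is connected with diameter at most k. A small networkx sketch:

```python
import networkx as nx

def is_k_club(G, nodes, k):
    """A set of nodes is a k-club iff the subgraph it induces is connected
    and has diameter at most k (distances measured inside the subgraph)."""
    H = G.subgraph(nodes)
    return nx.is_connected(H) and nx.diameter(H) <= k

G = nx.path_graph(5)                  # 0-1-2-3-4
print(is_k_club(G, [0, 1, 2], 2))     # True: induced path has diameter 2
print(is_k_club(G, [0, 1, 4], 2))     # False: 4 is isolated in the subgraph
```

Note the subtlety that distinguishes k-clubs from k-cliques: distances are measured within the induced subgraph, not in the whole graph, which is why the induced subgraph must itself be checked.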

18.

We consider a weighted local linear estimator based on the inverse selection probability for nonparametric regression with covariates missing at random. The asymptotic distribution of the maximal deviation between the estimator and the true regression function is derived, and an asymptotically accurate simultaneous confidence band is constructed. The estimator for the regression function is shown to be oracally efficient in the sense that it is uniformly indistinguishable from the estimator obtained when the selection probabilities are known. Finite-sample performance is examined via simulation studies, which support our asymptotic theory. The proposed method is demonstrated via an analysis of a data set from the Canada 2010/2011 Youth Student Survey.
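A hedged sketch of an inverse-probability-weighted local linear smoother (selection probabilities assumed known, as in the paper's oracle comparison; the Gaussian kernel, bandwidth, and data-generating process are illustrative choices):

```python
import numpy as np

def ipw_local_linear(x0, x, y, observed, pi, h):
    """Local linear fit at x0 using only complete cases, each weighted by
    the kernel weight times the inverse selection probability 1/pi."""
    m = observed.astype(bool)
    u = (x[m] - x0) / h
    kern = np.exp(-0.5 * u ** 2)            # Gaussian kernel
    w = kern / pi[m]                        # inverse-probability weighting
    Xd = np.column_stack([np.ones(m.sum()), x[m] - x0])
    WX = Xd * w[:, None]
    beta = np.linalg.solve(Xd.T @ WX, WX.T @ y[m])
    return beta[0]                          # intercept = estimate at x0

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)
pi = 1 / (1 + np.exp(-x))                   # known selection probabilities
observed = rng.random(500) < pi
print(ipw_local_linear(0.5, x, y, observed, pi, h=0.3), "vs", np.sin(0.5))
```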


19.
We apply the cross-entropy (CE) method to problems in clustering and vector quantization. The CE algorithm for clustering involves the following iterative steps: (a) generate random clusters according to a specified parametric probability distribution, (b) update the parameters of this distribution according to the Kullback–Leibler cross-entropy. Through various numerical experiments, we demonstrate the high accuracy of the CE algorithm and show that it can generate near-optimal clusters for fairly large data sets. We compare the CE method with well-known clustering and vector quantization methods such as K-means, fuzzy K-means and linear vector quantization, and apply each method to benchmark and image analysis data.  
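A minimal sketch of the CE loop for clustering, under illustrative choices (a Gaussian sampling distribution over sets of centers, within-cluster sum of squares as the score, a standard elite-fraction update):

```python
import numpy as np

def ce_cluster(X, k, n_samples=200, elite_frac=0.1, n_iter=30, seed=0):
    """Cross-entropy method: sample candidate center sets, keep the elite
    by WCSS, and refit the Gaussian sampling distribution to the elite."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    mu = np.tile(X.mean(0), (k, 1))          # sampling means, one per center
    sigma = np.tile(X.std(0), (k, 1)) + 1e-9
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iter):
        C = rng.normal(mu, sigma, size=(n_samples, k, d))  # candidate centers
        # WCSS of each candidate: distance of every point to nearest center
        D = np.linalg.norm(X[None, :, None, :] - C[:, None, :, :], axis=3)
        wcss = D.min(axis=2).sum(axis=1)
        elite = C[np.argsort(wcss)[:n_elite]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6     # CE update
    return mu, wcss.min()   # fitted sampling means serve as center estimates

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ((0, 0), (3, 3))])
centers, best = ce_cluster(X, k=2)
print(np.round(centers, 2), best)
```

As the sampling variance shrinks over iterations, the distribution concentrates on a near-optimal set of centers, which is the sense in which CE can escape the local optima that trap K-means.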

20.
Two robustness criteria are presented that are applicable to general clustering methods. Robustness and stability in cluster analysis are not only data dependent, but even cluster dependent. In the present paper, robustness is defined as a property not only of the clustering method, but also of every individual cluster in a data set. The main principles are: (a) dissimilarity measurement of an original cluster with the most similar cluster in the clustering induced by adding data points, (b) the dissolution point, which adapts the breakdown point concept to single clusters, and (c) isolation robustness: given a clustering method, is it possible to join, by the addition of g points, arbitrarily well-separated clusters? Results are derived for k-means, k-medoids (with k estimated by average silhouette width), trimmed k-means, mixture models (with and without a noise component, with and without estimation of the number of clusters by BIC), and single and complete linkage.
