首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The identification of different dynamics in sequential data has become an every day need in scientific fields such as marketing, bioinformatics, finance, or social sciences. Contrary to cross-sectional or static data, this type of observations (also known as stream data, temporal data, longitudinal data or repeated measures) are more challenging as one has to incorporate data dependency in the clustering process. In this research we focus on clustering categorical sequences. The method proposed here combines model-based and heuristic clustering. In the first step, the categorical sequences are transformed by an extension of the hidden Markov model into a probabilistic space, where a symmetric Kullback–Leibler distance can operate. Then, in the second step, using hierarchical clustering on the matrix of distances, the sequences can be clustered. This paper illustrates the enormous potential of this type of hybrid approach using a synthetic data set as well as the well-known Microsoft dataset with website users search patterns and a survey on job career dynamics.  相似文献   

2.
We present a general framework for treating categorical data with errors of observation. We show how both latent class models and models for doubly sampled data can be treated as exponential family nonlinear models. These are extended generalized linear models with the link function substituted by an observationwise defined non-linear function of the model parameters. The models are formulated in terms of structural probabilities and conditional error probabilities, thus allowing natural constraints when modelling errors of observation. We use an iteratively reweighted least squares procedure for obtaining maximum likelihood estimates. This is faster than the traditionally used EM algorithm and the computations can be made in GLIM.1 As examples we analyse three sets of categorical data with errors of observation which have been analysed before by Ashford and Sowden,2 Goodman3 and Chen,4 respectively.  相似文献   

3.
A clustering method is presented for analysing multivariate binary data with missing values. When not all values are observed, Govaert3 has studied the relations between clustering methods and statistical models. The author has shown how the identification of a mixture of Bernoulli distributions with the same parameter for all clusters and for all variables corresponds to a clustering criterion which uses L1 distance characterizing the MNDBIN method (Marchetti8). He first generalized this model by selecting parameters which can depend on variables and finally by selecting parameters which can depend both on variables and on clusters. We use the previous models to derive a clustering method adapted to missing data. This method optimizes a criterion by a standard iterative partitioning algorithm which removes the necessity either to ignore objects or to substitute the missing data. We study several versions of this algorithm and, finally, a brief account is given of the application of this method to some simulated data.  相似文献   

4.
The analysis of finite mixture models for exponential repeated data is considered. The mixture components correspond to different unknown groups of the statistical units. Dependency and variability of repeated data are taken into account through random effects. For each component, an exponential mixed model is thus defined. When considering parameter estimation in this mixture of exponential mixed models, the EM-algorithm cannot be directly used since the marginal distribution of each mixture component cannot be analytically derived. In this paper, we propose two parameter estimation methods. The first one uses a linearisation specific to the exponential distribution hypothesis within each component. The second approach uses a Metropolis–Hastings algorithm as a building block of a general MCEM-algorithm.  相似文献   

5.
A dynamic clustering based differential evolution algorithm (CDE) for global optimization is proposed to improve the performance of the differential evolution (DE) algorithm. With population evolution, CDE algorithm gradually changes from exploring promising areas at the early stages to exploiting solution with high precision at the later stages. Experiments on 28 benchmark problems, including 13 high dimensional functions, show that the new method is able to find near optimal solutions efficiently. Compared with other existing algorithms, CDE improves solution accuracy with less computational effort.  相似文献   

6.
This paper presents an extension of the standard regression tree method to clustered data. Previous works extending tree methods to accommodate correlated data are mainly based on the multivariate repeated-measures approach. We propose a “mixed effects regression tree” method where the correlated observations are viewed as nested within clusters rather than as vectors of multivariate repeated responses. The proposed method can handle unbalanced clusters, allows observations within clusters to be split, and can incorporate random effects and observation-level covariates. We implemented the proposed method using a standard tree algorithm within the framework of the expectation-maximization (EM) algorithm. The simulation results show that the proposed regression tree method provides substantial improvements over standard trees when the random effects are non negligible. A real data example is used to illustrate the method.  相似文献   

7.
In this paper, we investigate the problem of determining the number of clusters in the k-modes based categorical data clustering process. We propose a new categorical data clustering algorithm with automatic selection of k. The new algorithm extends the k-modes clustering algorithm by introducing a penalty term to the objective function to make more clusters compete for objects. In the new objective function, we employ a regularization parameter to control the number of clusters in a clustering process. Instead of finding k directly, we choose a suitable value of regularization parameter such that the corresponding clustering result is the most stable one among all the generated clustering results. Experimental results on synthetic data sets and the real data sets are used to demonstrate the effectiveness of the proposed algorithm.  相似文献   

8.
A new efficient algorithm based on DC programming and DCA for clustering   总被引:1,自引:0,他引:1  
In this paper, a version of K-median problem, one of the most popular and best studied clustering measures, is discussed. The model using squared Euclidean distances terms to which the K-means algorithm has been successfully applied is considered. A fast and robust algorithm based on DC (Difference of Convex functions) programming and DC Algorithms (DCA) is investigated. Preliminary numerical solutions on real-world databases show the efficiency and the superiority of the appropriate DCA with respect to the standard K-means algorithm.   相似文献   

9.
Digital circuits have grown exponentially in their sizes over the past decades. To be able to automate the design of these circuits, efficient algorithms are needed. One of the challenging stages of circuit design is the physical design where the physical locations of the components of a circuit are determined. Coarsening or clustering algorithms have become popular with physical designers due to their ability to reduce circuit sizes in the intermediate design steps such that the design can be performed faster and with higher quality. In this paper, a new clustering algorithm based on the algebraic multigrid (AMG) technique is presented. In the proposed algorithm, AMG is used to assign weights to connections between cells of a circuit and find cells that are best suited to become the initial cells for clusters, seed cells. The seed cells and the weights between them and the other cells are then used to cluster the cells of a circuit. The analysis of the proposed algorithm proves linear-time complexity, O(N), where N is the number of pins in a circuit. The numerical experiments demonstrate that AMG-based clustering can achieve high quality clusters and improve circuit placement designs with low computational cost.  相似文献   

10.
To take sample biases and skewness in the observations into account, practitioners frequently weight their observations according to some marginal distribution. The present paper demonstrates that such weighting can indeed improve the estimation. Studying contingency tables, estimators for marginal distributions are proposed under the assumption that another marginal is known. It is shown that the weighted estimators have a strictly smaller asymptotic variance whenever the two marginals are correlated. The finite sample performance is illustrated in a simulation study. As an application to traffic accident data the method allows for correcting a well‐known bias in the observed injury severity distribution.  相似文献   

11.
Statistical estimation of the model parameters of component lifetime distribution based on system lifetime data with known system structure is discussed here. We propose the use of stochastic expectation-maximization (SEM) algorithm for obtaining the maximum likelihood estimates of model parameters based on complete and censored system lifetimes. Different ways of implementing the SEM algorithm are also studied. We have shown that the proposed methods are feasible and are easy to implement for various families of component lifetime distributions. The proposed methodologies are then illustrated with two popular lifetime models—the Weibull and Birnbaum-Saunders distributions. Monte Carlo simulation is then used to compare the performance of the proposed methods with the corresponding estimation by direct maximization. Finally, two illustrative examples are presented along with some concluding remarks.  相似文献   

12.
A hybrid Pareto model for asymmetric fat-tailed data: the univariate case   总被引:1,自引:0,他引:1  
Density estimators that can adapt to asymmetric heavy tails are required in many applications such as finance and insurance. Extreme value theory (EVT) has developed principled methods based on asymptotic results to estimate the tails of most distributions. However, the finite sample approximation might introduce a severe bias in many cases. Moreover, the full range of the distribution is often needed, not only the tail area. On the other hand, non-parametric methods, while being powerful where data are abundant, fail to extrapolate properly in the tail area. We put forward a non-parametric density estimator that brings together the strengths of non-parametric density estimation and of EVT. A hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis. Experiments on simulated data show the following. On one hand, the mixture of hybrid Paretos converges faster in terms of log-likelihood and provides good estimates of the tail of the distributions when compared with other density estimators including the GP distribution. On the other hand, the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology. The mixture of hybrids is also evaluated on the Danish fire insurance data set.   相似文献   

13.
Candidate groups search for K-harmonic means data clustering   总被引:2,自引:0,他引:2  
Clustering is a very popular data analysis and data mining technique. K-means is one of the most popular methods for clustering. Although K-mean is easy to implement and works fast in most situations, it suffers from two major drawbacks, sensitivity to initialization and convergence to local optimum. K-harmonic means clustering has been proposed to overcome the first drawback, sensitivity to initialization. In this paper we propose a new algorithm, candidate groups search (CGS), combining with K-harmonic mean to solve clustering problem. Computational results showed CGS does get better performance with less computational time in clustering, especially for large datasets or the number of centers is big.  相似文献   

14.
In this article we present a computational study for solving the distance-dependent rearrangement clustering problem using mixed-integer linear programming (MILP). To address sparse data sets, we present an objective function for evaluating the pair-wise interactions between two elements as a function of the distance between them in the final ordering. The physical permutations of the rows and columns of the data matrix can be modeled using mixed-integer linear programming and we present three models based on (1) the relative ordering of elements, (2) the assignment of elements to a final position, and (3) the assignment of a distance between a pair of elements. These models can be augmented with the use of cutting planes and heuristic methods to increase computational efficiency. The performance of the models is compared for three distinct re-ordering problems corresponding to glass transition temperature data for polymers and two drug inhibition data matrices. The results of the comparative study suggest that the assignment model is the most effective for identifying the optimal re-ordering of rows and columns of sparse data matrices.  相似文献   

15.
A method is proposed for estimating the parameters in a parametric statistical model when the observations are fuzzy and are assumed to be related to underlying crisp realizations of a random sample. This method is based on maximizing the observed-data likelihood defined as the probability of the fuzzy data. It is shown that the EM algorithm may be used for that purpose, which makes it possible to solve a wide range of statistical problems involving fuzzy data. This approach, called the fuzzy EM (FEM) method, is illustrated using three classical problems: normal mean and variance estimation from a fuzzy sample, multiple linear regression with crisp inputs and fuzzy outputs, and univariate finite normal mixture estimation from fuzzy data.  相似文献   

16.
In k-means clustering we are given a set of n data points in d-dimensional space and an integer k, and the problem is to determine a set of k points in  , called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance.

We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9−) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.  相似文献   


17.
Dynamic clustering for interval data based on L 2 distance   总被引:2,自引:0,他引:2  
Summary  This paper introduces a partitioning clustering method for objects described by interval data. It follows the dynamic clustering approach and uses and L 2 distance. Particular emphasis is put on the standardization problem where we propose and investigate three standardization techniques for interval-type variables. Moreover, various tools for cluster interpretation are presented and illustrated by simulated and real-case data.  相似文献   

18.
Transfer algorithms are usually used to optimize an objective function that is defined on the set of partitions of a finite set X. In this paper we define an equivalence relation ? on the set of fuzzy equivalence relations on X and establish a bijection from the set of hierarchies on X to the set of equivalence classes with respect to ?. Thus, hierarchies can be identified with fuzzy equivalence relations and the transfer algorithm can be modified in order to optimize an objective function that is defined on the set of hierarchies on X.  相似文献   

19.
New clustering methods for interval data   总被引:3,自引:0,他引:3  
Summary  In this paper we propose two clustering methods for interval data based on the dynamic cluster algorithm. These methods use different homogeneity criteria as well as different kinds of cluster representations (prototypes). Some tools to interpret the final partitions are also introduced. An application of one of the methods concludes the paper.  相似文献   

20.
This paper discusses the maximum likelihood estimate of βunder linear inequalities A0β≥a in a linear model with missing data, proposes the restricted EM algorithm and proves the convergence.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号