Similar literature (20 results found)
1.
This paper, arising from population studies, develops clustering algorithms for identifying patterns in data. Based on the concept of geometric variability, we have developed one polythetic-divisive and three agglomerative algorithms. The effectiveness of these procedures is shown by relating them to classical clustering algorithms. They are very general since they do not impose constraints on the type of data, so they are applicable to general (economic, ecological, genetic, ...) studies. Our major contributions include a rigorous formulation for novel clustering algorithms, and the discovery of a new relationship between geometric variability and clustering. Finally, these novel procedures give a theoretical framework with an intuitive interpretation to some classical clustering methods, so that they can be applied to any type of data, including mixed data. These approaches are illustrated with real data on Drosophila chromosomal inversions.
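The abstract above does not state a formula, so the following is only a minimal sketch under an assumption: that geometric variability takes its usual sample form, V = (1 / (2 n^2)) * sum over all ordered pairs of d(x_i, x_j)^2, for any chosen distance d. The function names `geometric_variability`, `euclidean` and `mismatch` are hypothetical and are not taken from the paper. Because the quantity depends only on pairwise distances, the same function works for numeric and categorical records.

```python
from itertools import combinations


def geometric_variability(points, dist):
    """Sample geometric variability for an arbitrary distance function.

    Assumed definition: V = (1 / (2 n^2)) * sum_{i,j} dist(x_i, x_j)^2,
    where the sum runs over all ordered pairs (i, j).
    """
    n = len(points)
    # Each unordered pair appears twice in the ordered double sum,
    # so the factor 1/2 cancels and we divide the unordered sum by n^2.
    total = sum(dist(a, b) ** 2 for a, b in combinations(points, 2))
    return total / (n * n)


def euclidean(a, b):
    """Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mismatch(a, b):
    """Simple matching distance for categorical tuples (illustrative choice)."""
    return sum(1 for x, y in zip(a, b) if x != y) ** 0.5


numeric = [(0.0,), (2.0,)]
categorical = [("a", "x"), ("a", "y"), ("b", "y")]
v_numeric = geometric_variability(numeric, euclidean)
v_categorical = geometric_variability(categorical, mismatch)
```

With the Euclidean distance this quantity coincides with the total (population) variance, which is one way to read it as a generalisation of variance to data where only a distance is available.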

2.
Space-Time Point-Process Models for Earthquake Occurrences (cited 5 times: 0 self-citations, 5 by others)
Several space-time statistical models are constructed based on both classical empirical studies of clustering and some more speculative hypotheses. Then we discuss the discrimination between models incorporating contrasting assumptions concerning the form of the space-time clusters. We also examine further practical extensions of the model to situations where the background seismicity is spatially non-homogeneous, and the clusters are non-isotropic. The goodness-of-fit of the models, as measured by AIC values, is discussed for two high-quality data sets from different tectonic regions. AIC also allows the details of the clustering structure in space to be clarified. A simulation algorithm for the models is provided, and used to confirm the numerical accuracy of the likelihood calculations. The simulated data sets show spatial distributions similar to the real ones, but differ from them in some features of space-time clustering. These differences may provide useful indicators of directions for further study.

3.
This work develops a general procedure for clustering functional data which adapts the high-dimensional data clustering (HDDC) method, originally proposed in the multivariate context. The resulting clustering method, called funHDDC, is based on a functional latent mixture model which fits the functional data in group-specific functional subspaces. By constraining model parameters within and between groups, a family of parsimonious models is obtained that can fit a variety of situations. An estimation procedure based on the EM algorithm is proposed for determining both the model parameters and the group-specific functional subspaces. Experiments on real-world datasets show that the proposed approach performs as well as or better than classical two-step clustering methods while providing useful interpretations of the groups and avoiding the difficult choice of the discretization technique. In particular, funHDDC appears to always outperform HDDC applied to spline coefficients.

4.
We have developed a new class of circular distributions named wrapped weighted exponential distributions. The estimation of unknown parameters, along with some characteristics of these distributions, is also investigated. Some theorems that relate the distribution to other circular distributions are established, and we illustrate their modeling potential using a classical data set on movements of sea stars.

5.
In data mining, the unsupervised learning technique of clustering is a useful method for ascertaining trends and patterns in data. Most general clustering techniques do not take into consideration the time-order of data. In this paper, mathematical programming and statistical techniques and methodologies are combined to develop a seasonal clustering technique for determining clusters of time series data. We apply this technique to weather and aviation data to determine probabilistic distributions of arrival capacity scenarios, which can be used for efficient traffic flow management. In general, this technique may be used for seasonal forecasting and planning.

6.
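Data description, also known as one-class classification, characterizes the distribution of existing data in order to determine whether test data conform to that distribution. We first briefly review the principle of kernel-based data description and point out that, with a suitable kernel function and corresponding parameters, data description can be applied to pattern clustering, and that the resulting clustering method has the advantages of tight boundaries and easy removal of noise. To address the problems that data-description-based clustering has in determining the number of clusters and the cluster membership of individual samples, a search-based solution is proposed; both theoretical analysis and worked examples confirm its feasibility. Finally, the clustering algorithm is applied to the evaluation of enterprise relationships, with reasonable results.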

7.
A clustering method is presented for analysing multivariate binary data with missing values. When not all values are observed, Govaert [3] has studied the relations between clustering methods and statistical models. The author has shown how the identification of a mixture of Bernoulli distributions with the same parameter for all clusters and for all variables corresponds to a clustering criterion which uses L1 distance characterizing the MNDBIN method (Marchetti [8]). He first generalized this model by selecting parameters which can depend on variables and finally by selecting parameters which can depend both on variables and on clusters. We use the previous models to derive a clustering method adapted to missing data. This method optimizes a criterion by a standard iterative partitioning algorithm which removes the necessity either to ignore objects or to substitute the missing data. We study several versions of this algorithm and, finally, a brief account is given of the application of this method to some simulated data.

8.
Clustering is one of the most useful methods for understanding similarity among data. However, most conventional clustering methods do not pay sufficient attention to the geometric distributions of data. Geometric algebra (GA) is a generalization of complex numbers and quaternions able to describe spatial objects and the geometric relations between them. This paper uses conformal GA (CGA), which is a part of GA. This paper transforms data from a real Euclidean vector space into a CGA space and presents a new clustering method using conformal vectors. In particular, this paper shows that the proposed method was able to extract the geometric clusters which could not be detected by conventional methods.

9.
Exploratory graphical tools based on trimming are proposed for detecting main clusters in a given dataset. The trimming is obtained by resorting to trimmed k-means methodology. The analysis always reduces to the examination of real-valued curves, even in the multivariate case. As the technique is based on a robust clustering criterion, it is able to handle the presence of different kinds of outliers. An algorithm is proposed to carry out this (computer-intensive) method. As with classical k-means, the method is especially oriented to mixtures of spherical distributions. A possible generalization is outlined to overcome this drawback.
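The trimmed k-means step that the abstract above builds on can be sketched roughly as follows. This is only an illustration of the general idea (discard the fraction alpha of points farthest from their nearest center, then update centers from the remaining points), not the authors' algorithm or their graphical tools. The function name `trimmed_kmeans`, its parameters, and the toy data are all assumptions made here.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trimmed_kmeans(points, centers, alpha, n_iter=20):
    """Lloyd-style trimmed k-means from given initial centers.

    alpha is the trimming fraction; trimmed points get label -1.
    """
    n = len(points)
    n_keep = n - math.floor(alpha * n)
    centers = [tuple(c) for c in centers]
    labels = [-1] * n
    for _ in range(n_iter):
        # Squared distance to the nearest center, for every point.
        nearest = []
        for i, p in enumerate(points):
            d, j = min((sq_dist(p, c), j) for j, c in enumerate(centers))
            nearest.append((d, i, j))
        # Keep only the n_keep points closest to their nearest center.
        nearest.sort()
        labels = [-1] * n
        for _, i, j in nearest[:n_keep]:
            labels[i] = j
        # Recompute each center as the mean of its untrimmed members.
        new_centers = []
        for j, c in enumerate(centers):
            members = [points[i] for i in range(n) if labels[i] == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(c)  # leave an empty cluster's center as is
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
final_centers, final_labels = trimmed_kmeans(data, [(0, 0), (10, 10)], alpha=0.15)
```

In this toy run the isolated point (50, 50) is the one trimmed, so it does not drag either center toward it, which is the robustness property the abstract refers to. In practice the result depends on the initial centers, so several random starts would normally be tried.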

10.
Very often, one needs to perform (classical or Bayesian) inference when essentially nothing is known about the distribution of the dependent variable given certain covariates. The paper proposes to approximate the unknown distribution by its non-parametric counterpart, a step function, and to treat the points of the support and the corresponding density values as parameters whose posterior distributions should be determined based on the available data. The paper proposes Markov chain Monte Carlo methods to perform posterior analysis, and applies the new method to an analysis of stock returns. Copyright © 1999 John Wiley & Sons, Ltd.

11.
A clustering methodology based on biological visual models that imitates how humans visually cluster data by spatially associating patterns has been recently proposed. The method is based on Cellular Neural Networks and some resolution adjustments. The Cellular Neural Network rebuilds low-density areas while different resolutions find the best clustering option. The algorithm has demonstrated good performance compared to other clustering techniques. However, its main drawbacks are its inability to operate on data sets with more than two dimensions and the computational time required for the resolution adjustment mechanism. This paper proposes a new version of this clustering methodology to address these flaws. In the new approach, a pre-processing stage is incorporated featuring a Self-Organizing Map that maps complex high-dimensional relations into a reduced lattice while preserving the topological organization of the initial data set. This reduced representation is employed as the two-dimensional data set for further processing. In the new version, the resolution adjustment process is also accelerated through the use of an optimization method that combines the Hill-Climbing and the Random Search techniques. By incorporating such mechanisms rather than evaluating all possible resolutions, the optimization strategy finds the best resolution for a clustering problem by using a limited number of iterations. The proposed approach has been evaluated on several two-dimensional and high-dimensional datasets. Experimental evidence shows that the proposed algorithm performs the clustering task over complex problems while running 46% faster on average than the original method. The approach is also compared to other popular clustering techniques reported in the literature. Computational experiments demonstrate competitive results in comparison to other algorithms in terms of accuracy and robustness.

12.
We propose minimum volume ellipsoids (MVE) clustering as an alternative clustering technique to k-means for data clusters with ellipsoidal shapes and explore its value and practicality. MVE clustering allocates data points into clusters in a way that minimizes the geometric mean of the volumes of each cluster’s covering ellipsoids. Motivations for this approach include its scale-invariance, its ability to handle asymmetric and unequal clusters, and our ability to formulate it as a mixed-integer semidefinite programming problem that can be solved to global optimality. We present some preliminary empirical results that illustrate MVE clustering as an appropriate method for clustering data from mixtures of “ellipsoidal” distributions and compare its performance with the k-means clustering algorithm as well as the MCLUST algorithm (which is based on a maximum likelihood EM algorithm) available in the statistical package R. Research of the first author was supported in part by a Discovery Grant from NSERC and a research grant from Faculty of Mathematics, University of Waterloo. Research of the second author was supported in part by a Discovery Grant from NSERC and a PREA from Ontario, Canada.

13.
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.

14.
A new weighted version of the Gompertz distribution is introduced. It is noted that the model represents a mixture of classical Gompertz and second upper record value of Gompertz densities, and using a certain transformation it gives a new version of the two-parameter Lindley distribution. The model can also be regarded as a dual member of the log-Lindley-X family. Various properties of the model are obtained, including hazard rate function, moments, moment generating function, quantile function, skewness, kurtosis, conditional moments, mean deviations, some types of entropy, mean residual lifetime and stochastic orderings. Estimation of the model parameters is justified by the method of maximum likelihood. Two real data sets are used to assess the performance of the model against some classical and recent distributions, based on goodness-of-fit statistics. The variance-covariance matrix and confidence intervals of the parameters, along with some theoretical measures, are calculated and discussed for these data under the proposed model.

15.
Correlation coefficients have many applications for studying the relationship among multivariate observations. Classical inferences on correlation coefficients are mainly based on the normality assumption. This assumption is hardly realistic in the real world, which implies that the procedures on correlation coefficients used in many statistical software packages may not be relevant to most data sets in practice. However, we show that the classical procedures, possibly after simple corrections, are also valid in classes of distributions with large skewnesses and heterogeneous marginal kurtoses. A useful class of nonnormal distributions is identified for each of several types of correlation coefficients. The marginals of these distributions may include a variety of univariate distributions with different shapes. The results generalize the classical procedures to much larger classes of distributions than previously known and give a better understanding of the historical controversy regarding the behavior of the sample correlation coefficient. An implication is that one need not be worried so much by the nonnormality of data sets when using these classical procedures, provided that simple corrections are evaluated and possibly undertaken.

16.
We describe a methodology to examine bipartite relational data structures as exemplified in networks of corporate interlocking. These structures can be represented as bipartite graphs of directors and companies, but direct comparison of empirical datasets is often problematic because graphs have different numbers of nodes and different densities. We compare empirical bipartite graphs to simulated random graph distributions conditional on constraints implicit in the observed datasets. We examine bipartite graphs directly, rather than simply converting them to two 1-mode graphs, allowing investigation of bipartite statistics important to connection redundancy and bipartite connectivity. We introduce a new bipartite clustering coefficient that measures tendencies for localized bipartite cycles. This coefficient can be interpreted as an indicator of inter-company and inter-director closeness; but high levels of bipartite clustering have a cost for long range connectivity. We also investigate degree distributions, path lengths, and counts of localized subgraphs. Using this new approach, we compare global structural properties of US and Australian interlocking company directors. By comparing observed statistics against those from the simulations, we assess how the observed graphs are structured, and make comparisons between them relative to the simulated graph distributions. We conclude that the two networks share many similarities and some differences. Notably, both structures tend to be influenced by the clustering of directors on boards, more than by the accumulation of board seats by individual directors; that shared multiple board memberships (multiple interlocks) are an important feature of both infrastructures, detracting from global connectivity (but more so in the Australian case); and that company structural power may be relatively more diffuse in the US structure than in Australia.

17.
For clustering objects, we often collect not only continuous variables, but binary attributes as well. This paper proposes a model-based clustering approach with mixed binary and continuous variables where each binary attribute is generated by a latent continuous variable that is dichotomized with a suitable threshold value, and where the scores of the latent variables are estimated from the binary data. In economics, such variables are called utility functions and the assumption is that the binary attributes (the presence or the absence of a public service or utility) are determined by low and high values of these functions. In genetics, the latent response is interpreted as the "liability" to develop a qualitative trait or phenotype. The estimated scores of the latent variables, together with the observed continuous ones, allow the use of a multivariate Gaussian mixture model for clustering, instead of a mixture of discrete and continuous distributions. After describing the method, this paper presents the results for both simulated and real-case data and compares the performance of the multivariate Gaussian mixture model with that of a mixture of joint multivariate and multinomial distributions. Results show that the former model outperforms the mixture model for variables with different scales, both in terms of classification error rate and reproduction of the cluster means.
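The dichotomization idea in the abstract above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the paper's estimation procedure: the latent variable is taken to be standard normal, the threshold 0.5 is arbitrary, and the helper names `dichotomize` and `estimate_threshold` are hypothetical. It shows only that, under these assumptions, the threshold can be recovered from the observed proportion of ones.

```python
import random
from statistics import NormalDist


def dichotomize(latent, threshold):
    """Binary attribute: 1 when the latent score exceeds the threshold."""
    return [1 if z > threshold else 0 for z in latent]


def estimate_threshold(binary):
    """Recover the threshold assuming a standard normal latent variable.

    If P(z > tau) = p, then tau = Phi^{-1}(1 - p).
    Requires 0 < p < 1 (inv_cdf raises an error otherwise).
    """
    p = sum(binary) / len(binary)
    return NormalDist().inv_cdf(1.0 - p)


rng = random.Random(0)  # fixed seed so the run is repeatable
latent = [rng.gauss(0.0, 1.0) for _ in range(20000)]
binary = dichotomize(latent, 0.5)
tau_hat = estimate_threshold(binary)
```

Only `binary` would be observed in practice; `latent` is generated here purely to check that the estimated threshold lands close to the value used in the simulation.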

18.
A mixture approach to clustering is an important technique in cluster analysis. A mixture of multivariate multinomial distributions is usually used to analyze categorical data with a latent class model. Parameter estimation is an important step for a mixture distribution. Described here are four approaches to estimating the parameters of a mixture of multivariate multinomial distributions. The first approach is an extended maximum likelihood (ML) method. The second approach is based on the well-known expectation maximization (EM) algorithm. The third approach is the classification maximum likelihood (CML) algorithm. In this paper, we propose a new approach using the so-called fuzzy class model and then create the fuzzy classification maximum likelihood (FCML) approach for categorical data. The accuracy, robustness and effectiveness of these four types of algorithms for estimating the parameters of multivariate binomial mixtures are compared using real empirical data and samples drawn from the multivariate binomial mixtures of two classes. The results show that the proposed FCML algorithm presents better accuracy, robustness and effectiveness. Overall, the FCML algorithm is superior to the ML, EM and CML algorithms. Thus, we recommend FCML as another good tool for estimating the parameters of mixture multivariate multinomial models.

19.
The problem of Hybrid Linear Modeling (HLM) is to model and segment data using a mixture of affine subspaces. Different strategies have been proposed to solve this problem; however, rigorous analysis justifying their performance is missing. This paper suggests the Theoretical Spectral Curvature Clustering (TSCC) algorithm for solving the HLM problem and provides careful analysis to justify it. The TSCC algorithm is practically a combination of Govindu’s multi-way spectral clustering framework (CVPR 2005) and Ng et al.’s spectral clustering algorithm (NIPS 2001). The main result of this paper states that if the given data is sampled from a mixture of distributions concentrated around affine subspaces, then with high sampling probability the TSCC algorithm segments the different underlying clusters well. The goodness of clustering depends on the within-cluster errors, the between-clusters interaction, and a tuning parameter applied by TSCC. The proof also provides new insights for the analysis of Ng et al. (NIPS 2001). This work was supported by NSF grant #0612608.

20.
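This paper proposes a new clustering algorithm, a fuzzy projection pursuit algorithm, which can effectively handle the fuzzy clustering of the high-dimensional mixed data often encountered in medicine. The algorithm is applied to syndrome differentiation in chronic renal failure, providing scientific support for the existing classification criteria for chronic renal failure syndromes. The approach opens up new avenues for modernizing research on syndrome differentiation in traditional Chinese medicine and merits further study.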
