Similar documents
 20 similar documents found (search time: 31 ms)
1.
Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches of block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data which is an extension of the Gaussian mixture model for one-way clustering. For a given data table, several candidate models are usually examined, which differ for example in the number of clusters. Model selection then becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show steady performances of these criteria for medium to large data tables.

2.
The analysis of large-scale data sets using clustering techniques arises in many different disciplines and has important applications. Most traditional clustering techniques require heuristic methods for finding good solutions and produce suboptimal clusters as a result. In this article, we present a rigorous biclustering approach, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix. The physical permutations of the rows and columns are accomplished via a network flow model according to a given objective function. This optimal re-ordering model is used in an iterative framework where cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices. The performance of OREO is demonstrated on metabolite concentration data to validate the ability of the proposed method and compare it to existing clustering methods.

3.
Tests of ignoring and eliminating in nonsymmetric correspondence analysis   Cited by: 1 (self-citations: 0; citations by others: 1)
Nonsymmetric correspondence analysis (NSCA) aims to examine predictive relationships between rows and columns of a contingency table. The predictor categories of such tables are often accompanied by some auxiliary information. Constrained NSCA (CNSCA) incorporates such information as linear constraints on the predictor categories. However, imposing constraints also means that part of the predictive relationship is left unaccounted for by the constraints. A method of NSCA is proposed for analyzing the residual part along with the part accounted for by the constraints. The CATANOVA test may be invoked to test the significance of each part. The two tests parallel the distinction between tests of ignoring and eliminating, and help gain some insight into what is known as Simpson’s paradox in the analysis of contingency tables. Two examples are given to illustrate the distinction.

4.
Following a critique of existing algorithms, an algorithm is presented which will re-organize a 2-way data table to bring like rows together, and like columns together. Extensions of the method are described, and justified, to accommodate distances measured in modular arithmetic, and with bipolar columns/rows, as in repertory grid analysis. One value of the algorithms is that the user can see relationships in the tables without the data in the cells themselves ever having been transformed. Thus, users will continue to feel they own their data.

5.
This paper follows a word-document co-clustering model independently introduced in 2001 by several authors such as I.S. Dhillon, H. Zha and C. Ding. This model consists in creating a bipartite graph based on word frequencies in documents, and whose vertices are both documents and words. The created bipartite graph is then partitioned in a way that minimizes the normalized cut objective function to produce the document clustering. The fusion-fission graph partitioning metaheuristic is applied to several document collections using this word-document co-clustering model. Results demonstrate a real problem in this model: the partitions found almost always have a lower normalized cut value than the original document collection clustering. Moreover, measures of the goodness of solutions seem to be relatively independent of the normalized cut values of partitions.
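
The normalized cut objective mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual definition, in which each part contributes its cut weight divided by its volume (total weighted degree). The function name `normalized_cut`, the toy graph, and the edge weights are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def normalized_cut(edges, part_of):
    """Normalized cut of a partition of a weighted undirected graph.

    edges:   iterable of (u, v, weight) triples
    part_of: dict mapping each vertex to its part label
    Returns the sum over parts P of cut(P, rest) / vol(P), where vol(P)
    is the total weighted degree of the vertices in P.
    """
    cut = defaultdict(float)
    vol = defaultdict(float)
    for u, v, w in edges:
        pu, pv = part_of[u], part_of[v]
        # Every edge adds its weight to the degree of both endpoints.
        vol[pu] += w
        vol[pv] += w
        # Only edges crossing between two parts count towards the cut.
        if pu != pv:
            cut[pu] += w
            cut[pv] += w
    return sum(cut[p] / vol[p] for p in vol if vol[p] > 0)


# Toy bipartite word-document graph (assumed data): the weight of an edge
# is the frequency of the word in the document.
edges = [
    ("doc1", "apple", 3), ("doc1", "pear", 2),
    ("doc2", "apple", 1), ("doc2", "pear", 4),
    ("doc3", "engine", 5), ("doc3", "wheel", 2),
    ("doc2", "wheel", 1),
]
part_of = {"doc1": 0, "doc2": 0, "apple": 0, "pear": 0,
           "doc3": 1, "engine": 1, "wheel": 1}
ncut = normalized_cut(edges, part_of)
```

In this toy graph only the `doc2`/`wheel` edge crosses the partition, so the value is 1/21 + 1/15. A partition with a lower value is better under this objective, which is exactly why the abstract treats a lower value than the reference clustering as a warning sign about the model.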

6.
Basing cluster analysis on mixture models has become a classical and powerful approach. It enables some classical criteria such as the well-known k-means criterion to be explained. To classify the rows or the columns of a contingency table, an adapted version of k-means known as Mndki2, which uses the chi-square distance, can be used. Unfortunately, this simple, effective method, which can be used jointly with correspondence analysis based on the same representation of the data, cannot be associated with a mixture model in the same way as the classical k-means algorithm. In this paper we show that the Mndki2 algorithm can be viewed as an approximation of a classifying version of the EM algorithm for a mixture of multinomial distributions. The algorithms arising in this context are compared experimentally using Monte Carlo simulations.
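
As a small illustration of the chi-square distance referred to above, the sketch below computes the squared chi-square distance between two row profiles of a contingency table, using the standard correspondence-analysis definition. The helper name `chi2_distance_sq` and the table are hypothetical, and the exact form used inside Mndki2 may differ.

```python
def chi2_distance_sq(table, i, k):
    """Squared chi-square distance between the profiles of rows i and k.

    table is a list of rows of non-negative counts. Each row is turned
    into a profile (row divided by its total), and squared profile
    differences are weighted by the inverse relative column frequency.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    row_i_total = sum(table[i])
    row_k_total = sum(table[k])
    return sum(
        (n / c) * (table[i][j] / row_i_total - table[k][j] / row_k_total) ** 2
        for j, c in enumerate(col_totals)
        if c > 0  # skip empty columns to avoid dividing by zero
    )


# Assumed toy contingency table: rows 0 and 1 have proportional counts.
table = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 10, 5],
]
d01 = chi2_distance_sq(table, 0, 1)
d02 = chi2_distance_sq(table, 0, 2)
```

Rows 0 and 1 have identical profiles, so their distance is zero even though their totals differ. This is the property that distinguishes the chi-square distance from the Euclidean distance on raw counts used by ordinary k-means.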

7.
8.
We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log-likelihood. We apply an ℓ1 penalty to the means of the biclusters to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performance of our proposals is demonstrated in a simulation study and on a gene expression dataset. This article has supplementary material online.
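
The effect of an ℓ1 penalty on bicluster means can be sketched as follows. The sketch assumes fixed row and column labels and the objective "half the sum of squared residuals plus lam times the sum of absolute bicluster means". Under that assumption each mean is the soft-thresholded bicluster average with threshold lam divided by the number of cells. The function names and data are hypothetical, and the scaling in the paper itself may differ.

```python
def soft_threshold(x, t):
    """Shrink x towards zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_bicluster_means(x, row_labels, col_labels, lam):
    """Penalized mean of every (row cluster, column cluster) block.

    For a block with n cells and average m, minimizing
    0.5 * sum((x_ij - mu) ** 2) + lam * abs(mu) gives
    mu = soft_threshold(m, lam / n).
    """
    sums, counts = {}, {}
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {
        key: soft_threshold(sums[key] / counts[key], lam / counts[key])
        for key in sums
    }


# Assumed toy matrix with one strong block, one weak block and noise.
x = [
    [4.0, 4.0, 0.1],
    [4.0, 4.0, -0.1],
    [0.2, 0.0, -3.0],
]
means = sparse_bicluster_means(x, row_labels=[0, 0, 1],
                               col_labels=[0, 0, 1], lam=1.0)
```

Blocks whose average is small relative to the threshold are set exactly to zero, which is what makes the resulting biclusters sparse. A full algorithm would alternate this step with reassigning row and column labels.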

9.
We develop a set of sequential importance sampling (SIS) strategies for sampling nearly uniformly from two-way zero-one or contingency tables with fixed marginal sums and a given set of structural zeros. The SIS procedure samples tables column by column or cell by cell by using appropriate proposal distributions, and enables us to approximate closely the null distributions of a number of test statistics involved in such tables. When structural zeros are on the diagonal or follow certain patterns, more efficient SIS algorithms are developed which guarantee that every generated table is valid. Examples show that our methods can be applied to make conditional inference on zero-one and contingency tables, and are more efficient than other existing Monte Carlo algorithms.

10.
11.
The development of fast algorithms for the solution of linear systems of equations with a Cauchy matrix has recently received considerable attention. Several of these algorithms factor a Cauchy matrix or its inverse into triangular and possibly diagonal matrices. The numerical properties of the factorization methods depend on the selection of pivots. This note presents elementary derivations of some factorization methods and describes a new strategy for searching both rows and columns for suitable pivots.

12.
The machine-part relation in the group technology problem can be represented by a 0-1 matrix A where the rows represent the machines and the columns stand for the parts. The grouping of machines and parts into families is then equivalent to clustering the rows and the columns of A so that the resulting matrix may reveal some useful patterns of the original data. One frequently used objective function is the total ‘bond energy’ between the rows and the columns, which is a quadratic assignment problem formulation. We will show that this formulation is equivalent to solving two rectilinear travelling-salesman problems. On the basis of this observation, we propose a new approach to solve the group technology problem and establish a new worst-case bound for this problem.
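
The ‘bond energy’ objective can be made concrete with a short sketch. It assumes the usual definition: the sum of products of horizontally adjacent entries plus the sum of products of vertically adjacent entries. The function names and the machine-part matrix below are hypothetical.

```python
def bond_energy(a):
    """Total bond energy of a matrix given as a list of rows."""
    rows, cols = len(a), len(a[0])
    # Bonds between neighbouring columns: depends only on the column order.
    horizontal = sum(a[i][j] * a[i][j + 1]
                     for i in range(rows) for j in range(cols - 1))
    # Bonds between neighbouring rows: depends only on the row order.
    vertical = sum(a[i][j] * a[i + 1][j]
                   for i in range(rows - 1) for j in range(cols))
    return horizontal + vertical


def permute(a, row_order, col_order):
    """Return a copy of a with its rows and columns re-ordered."""
    return [[a[i][j] for j in col_order] for i in row_order]


# Assumed machine-part incidence matrix with two hidden families.
scattered = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
blocked = permute(scattered, [0, 2, 1, 3], [0, 2, 1, 3])
```

Here `bond_energy(scattered)` is 0 and `bond_energy(blocked)` is 8, so re-ordering exposes the two machine-part families. Because the horizontal term depends only on the column order and the vertical term only on the row order, the two orderings can be optimized separately, which is consistent with the decomposition into two travelling-salesman problems described in the abstract.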

13.
In this article we present a computational study for solving the distance-dependent rearrangement clustering problem using mixed-integer linear programming (MILP). To address sparse data sets, we present an objective function for evaluating the pair-wise interactions between two elements as a function of the distance between them in the final ordering. The physical permutations of the rows and columns of the data matrix can be modeled using mixed-integer linear programming and we present three models based on (1) the relative ordering of elements, (2) the assignment of elements to a final position, and (3) the assignment of a distance between a pair of elements. These models can be augmented with the use of cutting planes and heuristic methods to increase computational efficiency. The performance of the models is compared for three distinct re-ordering problems corresponding to glass transition temperature data for polymers and two drug inhibition data matrices. The results of the comparative study suggest that the assignment model is the most effective for identifying the optimal re-ordering of rows and columns of sparse data matrices.

14.
Many data clustering techniques are available for extracting meaningful information from real-world data, but the quality of their clustering results and their running time on real-world data are highly important. This work argues that fuzzy clustering is a suitable technique for finding meaningful information and appropriate groups in real-world datasets. In fuzzy clustering, the objective function controls the groups or clusters and the computational parts of the clustering. Hence, researchers working on fuzzy clustering algorithms aim to minimize an objective function that usually has a number of computational parts, such as the calculation of cluster prototypes, the degrees of membership of objects, and the updating and stopping steps of the algorithm. This paper introduces some new fuzzy objective functions with fuzzy parameters that can help to reduce the running time and to obtain meaningful information or clusters in real-world datasets. Furthermore, this paper introduces a new way of predicting memberships and centres by minimizing the proposed fuzzy objective functions. Experimental results for the proposed algorithms are given to illustrate the effectiveness of the proposed methods.

15.
Correspondence analysis, a data analytic technique used to study two‐way cross‐classifications, is applied to social relational data. Such data are frequently termed “sociometric” or “network” data. The method allows one to model forms of relational data and types of empirical relationships not easily analyzed using either standard social network methods or common scaling or clustering techniques. In particular, correspondence analysis allows one to model:

—two‐mode networks (rows and columns of a sociomatrix refer to different objects)

—valued relations (e.g. counts, ratings, or frequencies).

In general, the technique provides scale values for row and column units, visual presentation of relationships among rows and columns, and criteria for assessing “dimensionality” or graphical complexity of the data and goodness‐of‐fit to particular models. Correspondence analysis has recently been the subject of research by Goodman, Haberman, and Gilula, who have termed their approach to the problem “canonical analysis” to reflect its similarity to canonical correlation analysis of continuous multivariate data. This generalization links the technique to more standard categorical data analysis models, and provides a much‐needed statistical justification.

We review both correspondence and canonical analysis, and present these ideas by analyzing relational data on the 1980 monetary donations from corporations to nonprofit organizations in the Minneapolis-St. Paul metropolitan area. We also show how these techniques are related to dyadic independence models, first introduced by Holland, Leinhardt, Fienberg, and Wasserman in the early 1980's. The highlight of this paper is the relationship between correspondence and canonical analysis, and these dyadic independence models, which are designed specifically for relational data. The paper concludes with a discussion of this relationship, and some data analyses that illustrate the fact that correspondence analysis models can be used as approximate dyadic independence models.

16.
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.

17.
This paper, arising from population studies, develops clustering algorithms for identifying patterns in data. Based on the concept of geometric variability, we have developed one polythetic-divisive and three agglomerative algorithms. The effectiveness of these procedures is shown by relating them to classical clustering algorithms. They are very general since they do not impose constraints on the type of data, so they are applicable to general (economic, ecological, genetic, ...) studies. Our major contributions include a rigorous formulation for novel clustering algorithms, and the discovery of a new relationship between geometric variability and clustering. Finally, these novel procedures give a theoretical frame with an intuitive interpretation to some classical clustering methods to be applied with any type of data, including mixed data. These approaches are illustrated with real data on Drosophila chromosomal inversions.

18.
Maximal margin based frameworks have emerged as a powerful tool for supervised learning. The extension of these ideas to the unsupervised case, however, is problematic since the underlying optimization entails a discrete component. In this paper, we first study the computational complexity of maximal hard margin clustering and show that the hard margin clustering problem can be precisely solved in O(n^(d+2)) time where n is the number of the data points and d is the dimensionality of the input data. However, since it is well known that many datasets commonly ‘express’ themselves primarily in far fewer dimensions, our interest is in evaluating if a careful use of dimensionality reduction can lead to practical and effective algorithms. We build upon these observations and propose a new algorithm that gradually increases the number of features used in the separation model in each iteration, and analyze the convergence properties of this scheme. We report on promising numerical experiments based on a ‘truncated’ version of this approach. Our experiments indicate that for a variety of datasets, good solutions equivalent to those from other existing techniques can be obtained in significantly less time.

19.
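Because recommender systems involve huge numbers of users and items, existing collaborative filtering methods have difficulty handling the data sparsity and computational scalability problems of user-item recommendation. This paper proposes CF-cluMA, a collaborative filtering recommendation method based on clustering and matrix approximation. On the one hand, CF-cluMA clusters users and items separately and uses the resulting user-item block rating matrices to capture the locality of users' interests in items, thereby reducing the global sparsity of the user-item rating matrix. On the other hand, CF-cluMA applies singular value decomposition to the locally dense block matrices and uses a Schmidt transformation to approximate the global user-item rating matrix in order to predict users' ratings of unseen items, thereby reducing the complexity of the collaborative filtering algorithm. Experiments on the real-world EachMovie movie rating dataset show that, compared with existing collaborative filtering methods based on matrix approximation, the proposed CF-cluMA method effectively improves the accuracy of the recommender system and reduces its computational complexity. This study has important managerial implications for e-commerce recommender systems.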

20.
Application of honey-bee mating optimization algorithm on clustering   Cited by: 4 (self-citations: 0; citations by others: 4)
Cluster analysis is an attractive data mining technique that is used in many fields. One popular class of data clustering algorithms is the center-based clustering algorithms. K-means is a popular clustering method due to its simplicity and high speed in clustering large datasets. However, K-means has two shortcomings: it depends on the initial state and converges to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies in clustering have sought to overcome the local optima problem. Over the last decade, modeling the behavior of social insects, such as ants and bees, for the purpose of search and problem solving has been the context of the emerging area of swarm intelligence. Honey-bees are among the most closely studied social insects. Honey-bee mating may also be considered as a typical swarm-based approach to optimization, in which the search algorithm is inspired by the process of marriage in real honey-bees. Honey-bees have been used to model agent-based systems. In this paper, we propose the application of honey-bee mating optimization to clustering (HBMK-means). We compare HBMK-means with other heuristic algorithms for clustering, such as GA, SA, TS, and ACO, by implementing them on several well-known datasets. Our findings show that the proposed algorithm works better than the best of them.
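
The dependence of K-means on its initial state, noted in this abstract, can be seen in a minimal one-dimensional sketch of Lloyd's algorithm. This is plain K-means rather than HBMK-means, and the function name `kmeans_1d`, the data, and the starting centers are hypothetical.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centers, sum of squared errors)."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)),
                          key=lambda c: (x - centers[c]) ** 2)
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    sse = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, sse


# Assumed toy data with three well-separated groups.
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
good_centers, good_sse = kmeans_1d(points, [0.0, 10.0, 20.0])
bad_centers, bad_sse = kmeans_1d(points, [0.0, 1.0, 10.0])
```

Both runs converge, but the second start wastes two centers on the first group and merges the other two groups, ending with a much larger sum of squared errors. Escaping this kind of local optimum is the motivation for combining K-means with a global search heuristic.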
