Similar articles
 20 similar articles found (search time: 15 ms)
1.
2.
In this paper, we use the fuzzy C-means (FCM) method to cluster 3-way gene expression data by optimizing multiple objectives. A reformulation of the total clustering criterion yields an expression with fewer variables than the classical FCM criterion. This transformation allows the use of a direct global optimizer, in contrast to the commonly used alternating search. Gene expression data from microarray technology are generally high-dimensional, and the empty space problem is well known for this kind of data. We propose a transformation that increases the contrast in distances between all pairs of data samples. This increases the likelihood of detecting group structure, if any, in high-dimensional datasets.
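The abstract does not state which reformulation it uses. As a minimal sketch, assuming the well-known reduced form in which the optimal memberships are substituted back into the classical criterion, the objective depends on the cluster centers only: R_m(V) = sum_k (sum_i d_ik^(2/(1-m)))^(1-m). The function names below are hypothetical, and the code assumes no data point coincides exactly with a center.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fcm_memberships(data, centers, m=2.0):
    """Optimal FCM memberships for fixed centers (assumes all distances > 0)."""
    memberships = []
    for x in data:
        weights = [sq_dist(x, v) ** (1.0 / (1.0 - m)) for v in centers]
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships


def fcm_classical(data, centers, memberships, m=2.0):
    """Classical criterion J_m(U, V): variables are memberships AND centers."""
    return sum(
        (u ** m) * sq_dist(x, v)
        for x, row in zip(data, memberships)
        for u, v in zip(row, centers)
    )


def fcm_reformulated(data, centers, m=2.0):
    """Reduced criterion R_m(V): variables are the centers only."""
    total = 0.0
    for x in data:
        s = sum(sq_dist(x, v) ** (1.0 / (1.0 - m)) for v in centers)
        total += s ** (1.0 - m)
    return total


data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.5, 0.1), (5.5, 5.2)]
u = fcm_memberships(data, centers)
# Both values agree when the memberships are optimal for the given centers.
print(fcm_classical(data, centers, u), fcm_reformulated(data, centers))
```

Because the reduced criterion is a function of the centers alone, it can be handed to a general-purpose global optimizer instead of alternating between membership and center updates.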

3.
Group decision making through the AHP has received significant attention in contemporary research, which has focused primarily on consistency and consensus building. In this paper, we concentrate on the latter and present a two-phase algorithm based on the optimal clustering of decision makers (members of a group) into subgroups, followed by consensus building both within and between subgroups. Two-dimensional Sammon’s mapping is proposed as a tool for generating an approximate visualization of the subgroups identified in multidimensional vector space, while the consensus convergence model is suggested for reaching agreement among individuals within and between subgroups. It is assumed that all decision makers evaluate the same decision elements within the AHP framework and produce individual scores for these elements. The consensual scores are obtained through an iterative procedure, and the final scores are declared the group decision. The results of two selected numerical examples are compared with two sets of results: those obtained by the commonly used geometric mean aggregation method, and those obtained when the consensus convergence model is applied directly without prior clustering of the decision makers. The comparisons indicated the expected differences among the aggregation schemes and the final group scores. The matrices of respect values in the consensus convergence model, obtained for cases when the decision makers are optimally clustered and when they are not, show that in the latter case the decision makers receive lower weights of respect from other members of the group. Various tests showed that our approach is efficient in cases when no clusters can be identified visually and unambiguously, especially if the number of group members is high.
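The two aggregation schemes compared in this abstract can be illustrated with a small sketch. It assumes that each decision maker supplies a positive score vector over the same decision elements, and that the consensus convergence model repeatedly averages the score vectors with a row-stochastic matrix of respect weights. That iteration scheme and all names below are assumptions for illustration; the abstract does not spell them out.

```python
import math


def geometric_mean_aggregate(score_vectors):
    """Element-wise geometric mean of positive score vectors, normalized to sum to 1."""
    n_makers = len(score_vectors)
    n_items = len(score_vectors[0])
    gm = [
        math.exp(sum(math.log(v[j]) for v in score_vectors) / n_makers)
        for j in range(n_items)
    ]
    total = sum(gm)
    return [g / total for g in gm]


def consensus_step(respect, scores):
    """One averaging round: each member's new scores are a respect-weighted mean."""
    n = len(scores)
    k = len(scores[0])
    return [
        [sum(respect[i][j] * scores[j][c] for j in range(n)) for c in range(k)]
        for i in range(n)
    ]


def iterate_to_consensus(respect, scores, tol=1e-9, max_iter=1000):
    """Repeat consensus_step until the scores change by less than tol."""
    for _ in range(max_iter):
        new_scores = consensus_step(respect, scores)
        change = max(
            abs(a - b)
            for new_row, old_row in zip(new_scores, scores)
            for a, b in zip(new_row, old_row)
        )
        scores = new_scores
        if change < tol:
            break
    return scores


# Two decision makers scoring two decision elements (illustrative numbers only).
individual = [[0.6, 0.4], [0.2, 0.8]]
respect = [[0.7, 0.3], [0.4, 0.6]]  # each row sums to 1
print(geometric_mean_aggregate(individual))
print(iterate_to_consensus(respect, individual))
```

In the two-phase approach described above, an iteration of this kind would be run first within each subgroup and then between subgroups.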

4.
Supervised clustering of variables   Total citations: 1, self-citations: 0, citations by others: 1
In predictive modelling, highly correlated predictors lead to unstable models that are often difficult to interpret. The selection of features, or the use of latent components that reduce the complexity among correlated observed variables, are common strategies. Our objective with the new procedure that we advocate here is to achieve both purposes: to highlight the group structure among the variables and to identify the most relevant groups of variables for prediction. The proposed procedure is an iterative adaptation of a method developed for the clustering of variables around latent variables (CLV). Modification of the standard CLV algorithm leads to a supervised procedure, in the sense that the variable to be predicted plays an active role in the clustering. The latent variables associated with the groups of variables, selected for their “proximity” to the variable to be predicted and their “internal homogeneity”, are progressively added in a predictive model. The features of the methodology are illustrated based on a simulation study and a real-world application.

5.
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.

6.
The use of a finite mixture of normal distributions in model-based clustering allows us to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and in general either achieved by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior, where the hyperparameters are carefully selected such that they are reflective of the cluster structure aimed at. In addition, this prior allows us to estimate the model using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach allows us to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semiparametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark datasets. Supplementary materials for this article are available online.

7.
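Against the background of massive credit-reporting data, a shrinkage nearest-neighbor imputation method is proposed to reduce the computational cost of imputing missing data. The method completes the imputation in three stages. In the first stage, inclusion probabilities are computed from the missing proportions of the samples and variables, and the data are shrunk by unequal-probability sampling. In the second stage, samples that are nearest neighbors of the sample with missing values, based on inter-sample distance, are selected to form the training set. In the third stage, a random forest model is built for iterative imputation. Simulation studies on the Australian dataset and on datasets from Chinese banks show that, while maintaining a certain level of imputation accuracy, the shrinkage nearest-neighbor method substantially reduces the amount of computation.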

8.
Cluster analysis is an important task in data mining: it groups a set of objects so that similarities among objects within the same group are maximal while similarities among objects from different groups are minimal. The particle swarm optimization (PSO) algorithm is one of the best-known metaheuristic optimization algorithms and has been successfully applied to the clustering problem. However, it has two major shortcomings. The PSO algorithm converges rapidly during the initial stages of the search, but near the global optimum its convergence becomes very slow. Moreover, it may get trapped in a local optimum if the global best and local best values are equal to the particle’s position over a certain number of iterations. In this paper, we hybridize PSO with a heuristic search algorithm to overcome these shortcomings. In the proposed algorithm, called PSOHS, particle swarm optimization is used to produce an initial solution to the clustering problem, and a heuristic search algorithm is then applied to improve the quality of this solution by searching around it. The superiority of the proposed PSOHS clustering method over other popular clustering methods is established on seven benchmark and real datasets: Iris, Wine, Crude Oil, Cancer, CMC, Glass and Vowel.

9.
张璐  孔令臣  陈黄岳 《计算数学》2019,41(3):320-334
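With the arrival of the big data era, massive and structurally complex data have emerged in every field, for example with variables of different dimensions and different scales. In practice, the relationships among variables are often uncertain. The classical Pearson correlation coefficient reflects only the linear correlation between two variables of the same dimension and is insufficient to fully characterize the dependence among variables. The distance correlation coefficient proposed by Szekely et al. in 2007 can describe nonlinear relationships between variables of different dimensions. To explore the intrinsic information among variables, this paper proposes, based on the distance correlation coefficient, a maximum distance correlation method for clustering variables, which has the ultrametric property and the space-contracting property. To take full advantage of the distance correlation coefficient, this method is improved to obtain the cluster-whole distance correlation method. When characterizing the similarity between two clusters, this method merges all the variables in each cluster into a whole and then computes the distance correlation coefficient between these two wholes of different dimensions. Finally, the cluster-whole distance correlation method is applied to several practical problems, verifying the effectiveness of the algorithm.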
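As a minimal sketch of the quantity this abstract builds on, the code below computes the sample distance correlation from double-centered pairwise distance matrices, following the standard sample definition. The function names are hypothetical, the clustering procedure itself is not reproduced, and `math.dist` requires Python 3.8 or later.

```python
import math


def _centered_distances(samples):
    """Double-centered matrix of pairwise Euclidean distances between samples."""
    n = len(samples)
    d = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d]  # equals the column means (d is symmetric)
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between paired samples x[i] and y[i].

    x and y are lists of tuples; their tuple lengths (dimensions) may differ.
    """
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2_xy = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov2_xy, 0.0) / denom)


# A 1-dimensional variable compared with a 2-dimensional block of variables.
x = [(1.0,), (2.0,), (3.0,), (4.0,)]
y = [(2.0, 0.5), (4.0, 0.1), (6.0, 0.9), (8.0, 0.3)]
print(distance_correlation(x, y))
```

Because the inputs may have different dimensions, the same function can compare two clusters "as wholes": stack the variables of each cluster into one tuple per observation and pass the two stacked samples in.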

10.
This article proposes a Bayesian approach for the sparse group selection problem in the regression model. In this problem, the variables are partitioned into different groups. It is assumed that only a small number of groups are active for explaining the response variable, and it is further assumed that within each active group only a small number of variables are active. We adopt a Bayesian hierarchical formulation, where each candidate group is associated with a binary variable indicating whether the group is active or not. Within each group, each candidate variable is also associated with a binary indicator. Thus, the sparse group selection problem can be solved by sampling from the posterior distribution of the two layers of indicator variables. We adopt a group-wise Gibbs sampler for posterior sampling. We demonstrate the proposed method by simulation studies as well as real examples. The simulation results show that the proposed method performs better than the sparse group Lasso in terms of selecting the active groups as well as identifying the active variables within the selected groups. Supplementary materials for this article are available online.

11.
While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have been extensively studied, there is little work on graphical models for datasets with both continuous and discrete variables (mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by focusing on the conditional log-likelihood of each variable given the rest. The parameters have a natural group structure, and sparsity in the fitted graph is attained by incorporating a group lasso penalty, approximated by a weighted lasso penalty for computational efficiency. We demonstrate the effectiveness of our method through an extensive simulation study and apply it to a music annotation dataset (CAL500), obtaining a sparse and interpretable graphical model relating the continuous features of the audio signal to binary variables such as genre, emotions, and usage associated with particular songs. While we focus on binary discrete variables for the main presentation, we also show that the proposed methodology can be easily extended to general discrete variables.

12.
Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.

13.
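Text clustering is an important research area of clustering technology. It clusters texts according to their similar features or similar expressions, so that texts in the same class are maximally similar and texts in different classes are maximally different. Compared with other scripts, the structure and writing style of Mongolian have many distinctive characteristics. Combining K-means with a clonal immune algorithm, this paper proposes a new clustering technique called ICKM. Simulation experiments on four element sets demonstrate the effectiveness of the proposed method for Mongolian text clustering.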

14.
This paper centres on clustering approaches that deal with multiple DNA microarray datasets. Four clustering algorithms for deriving a clustering solution from multiple gene expression matrices studying the same biological phenomenon are considered: two unsupervised clustering techniques based on information integration; a hybrid consensus clustering method combining Particle Swarm Optimization and k-means, which can be regarded as supervised clustering; and a supervised consensus clustering algorithm enhanced by Formal Concept Analysis (FCA). The last algorithm initially produces a list of different clustering solutions, one per experiment; these solutions are then transformed, by partitioning the cluster centres, into a single overlapping partition, which is further analyzed by employing FCA. The four algorithms are evaluated on gene expression time series obtained from a study examining the global cell-cycle control of gene expression in the fission yeast Schizosaccharomyces pombe.

15.
Candidate groups search for K-harmonic means data clustering   Total citations: 2, self-citations: 0, citations by others: 2
Clustering is a very popular data analysis and data mining technique, and K-means is one of the most popular clustering methods. Although K-means is easy to implement and works fast in most situations, it suffers from two major drawbacks: sensitivity to initialization and convergence to a local optimum. K-harmonic means clustering has been proposed to overcome the first drawback, sensitivity to initialization. In this paper, we propose a new algorithm, candidate groups search (CGS), combined with K-harmonic means to solve the clustering problem. Computational results show that CGS achieves better performance with less computational time, especially for large datasets or when the number of centers is large.
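For orientation, the K-harmonic means objective replaces the minimum distance used by K-means with the harmonic mean of the distances from each point to all centers. The sketch below evaluates that objective in its commonly cited form, with an assumed distance exponent `p`. It is illustrative only: the function name is hypothetical, the CGS search from the abstract is not reproduced, and the code assumes no point coincides with a center.

```python
import math


def khm_objective(data, centers, p=2.0):
    """K-harmonic means objective: sum over points of k / sum_j (1 / d_ij^p)."""
    k = len(centers)
    total = 0.0
    for x in data:
        # Harmonic averaging lets every center contribute, not just the nearest one.
        inverse_sum = sum(1.0 / math.dist(x, c) ** p for c in centers)
        total += k / inverse_sum
    return total


data = [(0.0, 0.0), (0.5, 0.2), (4.0, 4.0), (4.5, 3.8)]
good_centers = [(0.25, 0.1), (4.25, 3.9)]
poor_centers = [(2.0, 2.0), (2.5, 2.5)]
print(khm_objective(data, good_centers), khm_objective(data, poor_centers))
```

Since every center influences every point's term, the objective is less sensitive to where the centers start than the K-means objective.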

16.
A powerful data transformation method named guided projections is proposed, creating new possibilities to reveal the group structure of high-dimensional data in the presence of noise variables. Using projections onto a space spanned by a selection of a small number of observations allows measuring the similarity of other observations to the selection based on orthogonal and score distances. Observations are iteratively exchanged from the selection, creating a nonrandom sequence of projections, which we call guided projections. In contrast to conventional projection pursuit methods, which typically identify a low-dimensional projection revealing some interesting features contained in the data, guided projections generate a series of projections that serve as a basis not just for diagnostic plots but also for directly investigating the group structure in data. Based on simulated data, we identify the strengths and limitations of guided projections in comparison to commonly employed data transformation methods. We further show the relevance of the transformation by applying it to real-world datasets.

17.
We investigate the class of σ-stable Poisson–Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs, which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman–Yor process, the normalized inverse Gaussian process, and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of σ-stable Poisson–Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a small number of auxiliary variables per iteration. We apply our sampling scheme to density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes. Supplementary materials for this article are available online.

18.
Binarization has always been a challenging problem in document image processing because of various types of degradation. In this paper, we present a nonlinear reaction–diffusion model for binarization of bleed-through document images: the Perona–Malik equation with a diffusion coefficient based on the structure tensor, along with a nonlinear reaction term. The Perona–Malik diffusion is used to selectively smooth document images while removing bleed-through. Meanwhile, the nonlinear reaction term is responsible for the desired binarization. To solve our model numerically, we develop a parallel–series splitting algorithm by combining finite differencing with two kinds of splitting methods from the literature. Our algorithm is tested on seven publicly available datasets (DIBCO 2009 to 2014 and 2016). The experimental results show that, on average, our method outperforms six relevant models on the nineteen document images with bleed-through in the DIBCO series datasets.

19.
Application of honey-bee mating optimization algorithm on clustering   Total citations: 4, self-citations: 0, citations by others: 4
Cluster analysis is an attractive data mining technique that is used in many fields. One popular class of data clustering algorithms is center-based clustering. K-means is a popular clustering method because of its simplicity and high speed in clustering large datasets. However, K-means has two shortcomings: it depends on the initial state and converges to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies in clustering have been carried out to overcome the local optima problem. Over the last decade, modeling the behavior of social insects, such as ants and bees, for the purpose of search and problem solving has been the context of the emerging area of swarm intelligence. Honey-bees are among the most closely studied social insects. Honey-bee mating may also be considered a typical swarm-based approach to optimization, in which the search algorithm is inspired by the process of mating in real honey-bees. Honey-bees have been used to model agent-based systems. In this paper, we propose the application of honey-bee mating optimization to clustering (HBMK-means). We compared HBMK-means with other heuristic clustering algorithms, such as GA, SA, TS, and ACO, by implementing them on several well-known datasets. Our findings show that the proposed algorithm works better than the best of them.

20.
This paper presents two effective algorithms for clustering n entities into p mutually exclusive and exhaustive groups where the ‘size’ of each group is restricted. As its objective, the clustering model minimizes the sum of distances between each entity and a designated group median. Empirical results using both a primal heuristic and a hybrid heuristic-subgradient method for problems having n ≤ 100 (i.e. 10 100 binary variables) show that the algorithms locate close to optimal solutions without resorting to tree enumeration. The capacitated clustering model is applied to the problem of sales force territorial design.
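The objective and the size restriction described in this abstract can be made concrete with a small evaluator. This is a sketch under assumptions: entities are points in the plane, distance is Euclidean, each entity carries a hypothetical `size`, and every group shares one `capacity`. The code only scores a given assignment and does not implement the paper's heuristics.

```python
import math


def capacitated_cost(points, sizes, medians, assignment, capacity):
    """Total distance from each entity to its assigned median.

    points     : list of coordinate tuples, one per entity
    sizes      : list of entity sizes (demand on the group capacity)
    medians    : indices of the entities chosen as group medians
    assignment : assignment[i] is the median index that entity i belongs to
    capacity   : maximum total size allowed in any one group
    Raises ValueError if the assignment is infeasible.
    """
    load = {m: 0.0 for m in medians}
    cost = 0.0
    for i, m in enumerate(assignment):
        if m not in load:
            raise ValueError(f"entity {i} is assigned to a non-median {m}")
        load[m] += sizes[i]
        cost += math.dist(points[i], points[m])
    for m, used in load.items():
        if used > capacity:
            raise ValueError(f"group with median {m} exceeds capacity")
    return cost


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
sizes = [1.0, 1.0, 1.0, 1.0]
print(capacitated_cost(points, sizes, medians=[0, 2], assignment=[0, 0, 2, 2], capacity=2.0))
```

A search method would call an evaluator like this on candidate assignments and keep the feasible one with the lowest cost.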
