Similar Documents
1.
At the heart of both lossy compression and clustering is a trade-off between the fidelity and size of the learned representation. Our goal is to map out and study the Pareto frontier that quantifies this trade-off. We focus on the optimization of the Deterministic Information Bottleneck (DIB) objective over the space of hard clusterings. To this end, we introduce the primal DIB problem, which we show results in a much richer frontier than its previously studied Lagrangian relaxation when optimized over discrete search spaces. We present an algorithm for mapping out the Pareto frontier of the primal DIB trade-off that is also applicable to other two-objective clustering problems. We study general properties of the Pareto frontier, and we give both analytic and numerical evidence for logarithmic sparsity of the frontier in general. We provide evidence that our algorithm has polynomial scaling despite the super-exponential search space, and additionally, we propose a modification to the algorithm that can be used where sampling noise is expected to be significant. Finally, we use our algorithm to map the DIB frontier of three different tasks: compressing the English alphabet, extracting informative color classes from natural images, and compressing a group theory-inspired dataset, revealing interesting features of the frontier and demonstrating how the structure of the frontier can be used for model selection, with a focus on points previously hidden by the cloak of the convex hull.
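For a fixed set of candidate clusterings, the frontier extraction at the core of this abstract reduces to a dominance filter over (cost, relevance) pairs. A minimal sketch (not the authors' search algorithm; the point values below are illustrative):

```python
def pareto_frontier(points):
    """Keep the (cost, value) points not improved upon by any other point:
    we minimize cost and maximize value.  Sorting by cost ascending and
    value descending lets a single sweep find the frontier."""
    frontier = []
    for cost, value in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or value > frontier[-1][1]:
            frontier.append((cost, value))
    return frontier

# Illustrative candidates: (representation cost, retained information).
points = [(1.0, 0.2), (1.5, 0.9), (1.2, 0.5), (2.0, 0.8), (1.5, 0.4)]
frontier = pareto_frontier(points)
print(frontier)  # → [(1.0, 0.2), (1.2, 0.5), (1.5, 0.9)]
```

Note that (2.0, 0.8) is dropped: (1.5, 0.9) costs less and retains more.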

2.
Accurate clustering is a challenging task with unlabeled data. Ensemble clustering aims to combine sets of base clusterings to obtain a better and more stable clustering and has shown its ability to improve clustering accuracy. Dense representation ensemble clustering (DREC) and entropy-based locally weighted ensemble clustering (ELWEC) are two typical methods for ensemble clustering. However, DREC treats each microcluster equally and hence ignores the differences between microclusters, while ELWEC conducts clustering on clusters rather than microclusters and ignores the sample–cluster relationship. To address these issues, a divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL) is proposed in this paper. Specifically, the DLWECDL consists of four phases. First, the clusters from the base clusterings are used to generate microclusters. Second, a Kullback–Leibler divergence-based ensemble-driven cluster index is used to measure the weight of each microcluster. With these weights, an ensemble clustering algorithm with dictionary learning and the L2,1-norm is employed in the third phase. Meanwhile, the objective function is solved by optimizing four subproblems and a similarity matrix is learned. Finally, a normalized cut (Ncut) is used to partition the similarity matrix and the ensemble clustering results are obtained. In this study, the proposed DLWECDL was validated on 20 widely used datasets and compared to other state-of-the-art ensemble clustering methods. The experimental results demonstrate that the proposed DLWECDL is a very promising method for ensemble clustering.
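The divergence-based weighting step can be sketched in miniature. The rule below is a hypothetical simplification of the paper's ensemble-driven cluster index: each microcluster's assignment distribution is scored by its KL divergence from a consensus distribution, and the scores are normalized into weights:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def microcluster_weights(assign_dists, consensus):
    """Weight each microcluster by how far its cluster-assignment
    distribution diverges from the ensemble consensus, so distinctive
    microclusters count more.  (Hypothetical scheme; the paper's exact
    index differs in detail.)"""
    divs = [kl_divergence(d, consensus) for d in assign_dists]
    total = sum(divs) or 1.0
    return [d / total for d in divs]

# Two microclusters, one sharply assigned and one ambiguous.
weights = microcluster_weights([[0.9, 0.1], [0.5, 0.5]], [0.7, 0.3])
```

The sharply assigned microcluster receives the larger weight.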

3.
In the parameter estimation of limit extreme value distributions, most employed methods use only some of the available data. With the peaks-over-threshold method for the Generalized Pareto Distribution (GPD), only the observations above a certain threshold are considered, so a large amount of information is wasted. The aim of this work is to make the most of the information provided by all the observations in order to improve the accuracy of Bayesian parameter estimation. We present two new Bayesian methods to estimate the parameters of the GPD, taking into account the whole data set from the baseline distribution and the existing relations between the baseline and the limit GPD parameters in order to define highly informative priors. We compare the Bayesian Metropolis–Hastings algorithm using only data over the threshold with the new methods when the baseline distribution is a stable distribution, whose properties allow us to reduce the problem to the study of standard distributions and also to propose new estimators for the parameters of the tail distribution. Specifically, three cases of stable distributions were considered: the Normal, Lévy and Cauchy distributions, as main examples of the different tail behaviors of a distribution. Nevertheless, the methods would be applicable to many other baseline distributions by finding relations between baseline and GPD parameters via simulation studies. To illustrate this situation, we apply the methods to real air-pollution data from Badajoz (Spain), whose baseline distribution fits a Gamma, and show that the new baseline-informed methods improve the estimates compared to the Bayesian Metropolis–Hastings algorithm.
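The peaks-over-threshold step that the abstract improves on can be sketched with a simple method-of-moments GPD fit to the threshold excesses (a standard textbook estimator, not the paper's Bayesian procedure):

```python
import random
import statistics

def gpd_moment_fit(data, threshold):
    """Method-of-moments fit of a Generalized Pareto Distribution to the
    excesses over `threshold` — the classical peaks-over-threshold step
    that discards all observations below the threshold."""
    excesses = [x - threshold for x in data if x > threshold]
    m = statistics.mean(excesses)
    v = statistics.variance(excesses)
    xi = 0.5 * (1.0 - m * m / v)        # shape parameter
    sigma = 0.5 * m * (m * m / v + 1.0)  # scale parameter
    return xi, sigma

# Demo on synthetic exponential data, which is a GPD with shape 0, scale 1.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(50000)]
xi, sigma = gpd_moment_fit(sample, 1.0)
```

By memorylessness of the exponential, the excesses are again Exp(1), so the fit should recover shape near 0 and scale near 1.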

4.
Link prediction in complex networks: a clustering perspective
Link prediction is an open problem in complex networks that currently attracts much research interest. However, little attention has been paid to the relation between network structure and the performance of prediction methods. In order to fill this vital gap, we try to understand how the network structure affects the performance of link prediction methods from the perspective of clustering. Our experiments on both synthetic and real-world networks show that as clustering grows, the accuracy of these methods can be improved remarkably, while for sparse and weakly clustered networks, they perform poorly. We explain this through the increased separation, caused by clustering, between the score distributions of positive and negative instances. Our finding also sheds light on the problem of how to select appropriate approaches for networks with various densities and degrees of clustering.
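A minimal example of the kind of clustering-sensitive predictor discussed above is the common-neighbors score: in a highly clustered graph, true links tend to share many neighbors while non-links share few. A sketch on a toy graph given as neighbor sets:

```python
def common_neighbors_scores(adj, candidate_pairs):
    """Score each candidate link by its number of shared neighbors — one of
    the standard similarity-based predictors whose accuracy depends on how
    clustered the network is."""
    return {(u, v): len(adj[u] & adj[v]) for u, v in candidate_pairs}

# Small toy graph: a clustered core {0, 1, 2, 3} and a pendant node 4.
adj = {0: {1, 2, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}, 4: {0}}
scores = common_neighbors_scores(adj, [(0, 3), (3, 4)])
print(scores)  # → {(0, 3): 2, (3, 4): 0}
```

The missing link inside the clustered core, (0, 3), scores far higher than the non-link (3, 4).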

5.
Brain–computer interface (BCI) technology allows people with disabilities to communicate with the physical environment. One of the most promising signals is the non-invasive electroencephalogram (EEG) signal. However, due to the non-stationary nature of EEGs, a subject's signal may change over time, which poses a challenge for models that work across time. Recently, domain adaptive learning (DAL) has shown superior performance in various classification tasks. In this paper, we propose a regularized reproducing kernel Hilbert space (RKHS) subspace learning algorithm with K-nearest neighbors (KNN) as the classifier for the task of motor imagery signal classification. First, we reformulate the framework of RKHS subspace learning with a rigorous mathematical derivation. Second, since the commonly used maximum mean discrepancy (MMD) criterion measures the distribution difference based on the mean value only and ignores the local information of the distribution, a regularization term of source-domain linear discriminant analysis (SLDA) is proposed for the first time, which reduces the variance of similar data and increases the variance of dissimilar data to optimize the distribution of the source domain data. Finally, the RKHS subspace framework was constructed sparsely in view of the sensitivity of BCI data. We test the proposed algorithm first on four standard datasets, and the experimental results show that adding SLDA improves the average accuracy of the baseline algorithms by 2–9%. In the motor imagery classification experiments, the average accuracy of our algorithm is 3% higher than that of the other algorithms, demonstrating the adaptability and effectiveness of the proposed algorithm.
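The MMD criterion that the SLDA regularizer augments can be sketched for one-dimensional samples with an RBF kernel. This is the standard biased empirical estimator of squared MMD, not the paper's full RKHS subspace framework (the `gamma` value is illustrative):

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased empirical estimate of the squared Maximum Mean Discrepancy
    between two one-dimensional samples: it compares the two distributions
    through their kernel mean embeddings."""
    exx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    eyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    exy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return exx + eyy - 2.0 * exy

same = mmd_squared([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])      # identical samples
shifted = mmd_squared([0.0, 1.0, 2.0], [5.0, 6.0, 7.0])   # shifted samples
```

Identical samples give zero; a clear shift gives a large value, which is what domain-adaptation objectives penalize.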

6.
In research on using Radial Basis Function Neural Networks (RBF NNs) to forecast nonlinear time series, we investigate how different clusterings affect learning and forecasting. We find that k-means clustering is very suitable. To increase precision, we introduce a nonlinear feedback term to escape the local minima of the energy, and then use the model to forecast nonlinear time series produced by the Mackey-Glass equation and by stock prices. By selecting k-means clustering and a suitable feedback term, much better forecasting results are obtained.
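The center-selection step the abstract favors can be sketched as plain Lloyd k-means on scalar inputs; in an RBF network these centers would seed the basis functions. A minimal sketch (not the paper's full forecasting model; the demo data are illustrative):

```python
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Plain Lloyd k-means on scalars — the clustering the abstract finds
    most suitable for placing RBF centers."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # Recompute each center as its group mean; keep old center if empty.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)

# Two well-separated groups: centers should land near 0.1 and 5.1.
centers = kmeans_1d([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], k=2)
```

Each returned center would become the midpoint of one Gaussian basis function in the RBF layer.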

7.
Annals of Physics, 1987, 173(1): 1–29
The amount of a particular type of clustering in a nuclear state is defined in this paper as the norm square of the projection of the wave function onto the particular cluster-model subspace. It is pointed out that, although the clusters cannot be localized in space by measurement, the amount of clustering characterizes the cluster formation in close analogy with a quantum mechanical probability. The cluster-model component of the wave function is proved to have a variational property. This facilitates the computation of the amount of clustering. The model dependence of the amounts of various clusterings and their relationship to the corresponding spectroscopic factors are studied via numerical examples for the nucleus 6Li. It is pointed out that, in a relative sense, the spectroscopic factor, which is more directly related to experiment, is also characteristic of the clustering contents of different states of the same nucleus, but it cannot be used for comparisons between different nuclei or clusterings. An analysis by means of the amount of clustering shows that even apparently minute changes in the cluster-model space modify the ground-state wave function of 6Li appreciably. But the extensions of the model space considered do not invalidate the α + d picture, since the amount of α + d clustering in terms of consistently modified cluster internal wave functions remains close to unity.

8.
We develop a methodology with which to calculate typical network statistics by sampling a network through a random walk. By examining the statistics of degree and the return times of walks which return to the same vertex, we can estimate characteristics of the giant component such as the average clustering coefficient, degree distribution, degree correlations and network size. We confirm the validity of the methods using a variety of available network data sets and then apply these methods to data collected by performing a random walk on the large on-line social networking website Bebo. We find good agreement between our results and the results of previous studies of on-line social networks in which data collection was performed by a BFS (“snow-ball”) sampling algorithm. In particular, we find that the degree distribution exhibits a multi-scaling power-law tail and the network exhibits clustering and positive degree correlations.
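The degree-bias correction underlying walk-based estimation can be shown on a toy graph: a random walk visits nodes in proportion to their degree, so averaging observed degrees directly is biased, and the harmonic-mean correction removes the bias. A sketch of the general idea, not the paper's full estimator set:

```python
import random

def random_walk_mean_degree(adj, steps, seed=1):
    """Estimate mean degree from a random walk.  Because the walk's
    stationary distribution is proportional to degree, the unbiased mean
    degree is the harmonic mean of the observed degrees."""
    rng = random.Random(seed)
    node = next(iter(adj))
    inv_degree_sum = 0.0
    for _ in range(steps):
        node = rng.choice(adj[node])
        inv_degree_sum += 1.0 / len(adj[node])
    return steps / inv_degree_sum

# Star graph: one hub of degree 3, three leaves of degree 1; true mean 1.5.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
estimate = random_walk_mean_degree(star, 1000)
```

A naive average of visited degrees on this graph would give 2 (the walk alternates hub and leaf), while the corrected estimator returns the true mean degree 1.5.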

9.
An important aspect of using entropy-based models and proposed “synthetic languages” is the seemingly simple task of knowing how to identify the probabilistic symbols. If the system has discrete features, then this task may be trivial; however, for observed analog behaviors described by continuous values, this raises the question of how we should determine such symbols. This task of symbolization extends the concept of scalar and vector quantization to consider explicit linguistic properties. Unlike previous quantization algorithms, where the aim is primarily data compression and fidelity, the goal in this case is to produce a symbolic output sequence which incorporates some linguistic properties and hence is useful in forming language-based models. Hence, in this paper, we present methods for symbolization which take such properties into account in the form of probabilistic constraints. In particular, we propose new symbolization algorithms which constrain the symbols to have a Zipf–Mandelbrot–Li distribution that approximates the behavior of language elements. We introduce a novel constrained EM algorithm which is shown to effectively learn to produce symbols that approximate a Zipfian distribution. We demonstrate the efficacy of the proposed approaches on examples using real-world data in different tasks, including the translation of animal behavior into a possible human-understandable language equivalent.
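The target distribution named above can be written down directly. A sketch of normalized Zipf–Mandelbrot rank probabilities (the exponent `a` and offset `b` are illustrative values, not the paper's fitted constants):

```python
def zipf_mandelbrot(n, a=1.0, b=2.7):
    """Normalized Zipf–Mandelbrot probabilities p_k ∝ 1 / (k + b)^a for
    ranks k = 1..n — the shape a constrained symbolizer would push its
    symbol frequencies toward."""
    raw = [1.0 / (k + b) ** a for k in range(1, n + 1)]
    z = sum(raw)
    return [r / z for r in raw]

probs = zipf_mandelbrot(100)
```

The result is a proper, strictly rank-decreasing distribution, which is what makes it a usable constraint inside an EM-style symbolization loop.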

10.
We propose a technique for estimating gene expression values for duplicated data on cDNA microarrays. In the scatter plots, the distribution is constructed from a mixture of two-dimensional normal distributions, which represent fluctuations in gene expression values due to noise. An expectation-maximization (EM) algorithm is used for estimating the modeling parameters. The probability that duplicated data are shifted by noise is calculated using Bayesian estimation. Six data sets of rice cDNA microarray assays were used to test the proposed technique. Genes in the data sets were subjected to clustering based on the probability of the true value. Clustering successfully identified candidate genes regulated by circadian rhythms in rice.
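The Bayesian step described above, deciding whether a duplicated measurement reflects the true value or a noise shift, can be caricatured in one dimension with two normal components. The paper works with two-dimensional mixtures fit by EM; all parameter values here are illustrative:

```python
import math

def gauss(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_true_value(x, mu_true, mu_noise, var, prior_true=0.5):
    """Bayes rule for the probability that an observation comes from the
    'true value' component rather than the noise-shifted component
    (a one-dimensional caricature of the paper's 2-D mixture model)."""
    pt = prior_true * gauss(x, mu_true, var)
    pn = (1.0 - prior_true) * gauss(x, mu_noise, var)
    return pt / (pt + pn)

p_at_true = posterior_true_value(0.0, 0.0, 2.0, 1.0)   # on the true mean
p_midpoint = posterior_true_value(1.0, 0.0, 2.0, 1.0)  # equidistant point
```

An observation sitting on the true-value mean is confidently classified, while the midpoint between the two components is maximally ambiguous.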

11.
In the random quantum walk, a quantum counterpart of the classical walk, data points interact when selecting the appropriate walk strategy by taking advantage of quantum-entanglement features; thus, the results obtained with the quantum walk differ from those obtained with the classical walk. A new space-based quantum walk clustering algorithm is proposed by applying the quantum walk to clustering analysis. In this algorithm, data points are viewed as walking participants, and similar data points are clustered using the walk function in the pay-off matrix according to a certain rule. The walk process is simplified by implementing a space-combining rule. The proposed algorithm is validated by a simulation test and is shown to be superior to the existing clustering algorithms Kmeans, PCA + Kmeans, and LDA-Km. The effects of some of the parameters of the proposed algorithm on its performance are also analyzed and discussed, and specific suggestions are provided.

12.
We introduce a modern, optimized, and publicly available implementation of the sequential Information Bottleneck clustering algorithm, which strikes a highly competitive balance between clustering quality and speed. We describe a set of optimizations that make the algorithm's computation more efficient, particularly for the common case of sparse data representation. The results are substantiated by an extensive evaluation that compares the algorithm to commonly used alternatives, focusing on the practically important use case of text clustering. The evaluation covers a range of publicly available benchmark datasets and a set of clustering setups employing modern word and sentence embeddings obtained from state-of-the-art neural models. The results show that in spite of using the more basic term-frequency representation, the proposed implementation provides a highly attractive trade-off between quality and speed that outperforms the alternatives considered. This new release facilitates the use of the algorithm in real-world applications of text clustering.

13.
Clustering gene expression data is an important research topic in bioinformatics, because knowing which genes act similarly can lead to the discovery of important biological information. Many clustering algorithms have been used in the field of gene clustering. The multivariate Gaussian mixture distribution is frequently used as the component of a finite mixture model for clustering; however, real datasets cannot always be restricted to the normal distribution. To make the clustering algorithm more adaptable, this paper proposes a new scheme for clustering gene expression data based on multivariate elliptical contoured mixture models (MECMMs). To solve the problem of over-reliance on the initialization, we propose an improved expectation maximization (EM) algorithm that adds and deletes initial values for the classical EM algorithm, and the number of clusters can be inferred with the QAIC criterion. The improved EM algorithm based on MECMMs is tested and compared with other clustering algorithms over several simulated and real gene expression datasets. Our results indicate that the improved EM clustering algorithm is superior to the classical EM algorithm and the support vector machine (SVM) algorithm, and can be widely used for gene clustering.

14.
This work presents a comprehensive study of the evolution of the expenditure distribution in India. The consumption process is theoretically modeled based on certain physical assumptions. The proposed statistical model for the expenditure distribution may follow either a double Pareto distribution or a mixture of a log-normal and a Pareto distribution. Goodness-of-fit tests with Indian data, collected from the National Sample Survey Organisation Reports for the years 1983–2007, validate the proposal of a mixture of log-normal and Pareto distributions. The relative weight of the Pareto tail has a remarkable magnitude of approximately 10–20% of the population. Moreover, although the Pareto tail is widening over time for the rural sector only, there is no significant change in the overall inequality measurement across the entire period of study.
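The proposed density can be written out directly. A sketch of the log-normal body plus Pareto tail mixture (all parameter values in the demo are illustrative, not the fitted Indian-data values):

```python
import math

def lognormal_pareto_pdf(x, w, mu, sigma, xm, alpha):
    """Density of a two-part model: weight w on a Pareto tail with support
    x >= xm, and weight 1 - w on a log-normal body — the mixture form the
    abstract finds to fit the expenditure data."""
    body = math.exp(-(math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)) \
        / (x * sigma * math.sqrt(2.0 * math.pi))
    tail = alpha * xm ** alpha / x ** (alpha + 1) if x >= xm else 0.0
    return (1.0 - w) * body + w * tail

# Illustrative parameters: 15% tail weight, cutoff xm = 2, tail index 2.
below_cutoff = lognormal_pareto_pdf(1.0, 0.15, 0.0, 0.5, 2.0, 2.0)
above_cutoff = lognormal_pareto_pdf(3.0, 0.15, 0.0, 0.5, 2.0, 2.0)
```

Below the cutoff only the log-normal body contributes; above it the Pareto term kicks in, producing the heavy tail whose weight the abstract estimates at roughly 10–20%.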

15.
A. Fujihara, M. Uchida. Physica A, 2010, 389(5): 1124–1130
We theoretically and numerically investigated the threshold network model with a generic weight function in the regime of a large number of nodes and a high threshold. Our analysis was based on extreme value theory, which gives a theoretical understanding of the distribution of independent and identically distributed random variables within a sufficiently high range. Specifically, the distribution can generally be expressed by a generalized Pareto distribution, which enabled us to formulate the generic weight distribution function. Using this theory, we obtained exact expressions for the degree distribution and the clustering coefficient, which behave as universal power laws within certain ranges of degrees. We also compared the theoretical predictions with numerical results and found them to be extremely consistent.

16.
We present a novel method for interpolating univariate time series data. The proposed method combines multi-point fractional Brownian bridges, a genetic algorithm, and Takens' theorem for reconstructing a phase space from univariate time series data. The basic idea is to first generate a population of different stochastically interpolated time series, and second, to use a genetic algorithm to find the pieces in the population which generate the smoothest reconstructed phase space trajectory. A trajectory is considered smooth when it has a low variance of second derivatives along the curve. For simplicity, we refer to the developed method as PhaSpaSto-interpolation, an abbreviation for phase-space-trajectory-smoothing stochastic interpolation. The proposed approach is tested and validated with a univariate time series of the Lorenz system and five non-model data sets, and compared with cubic spline and linear interpolation. We find that the smoothness criterion guarantees low errors on known model and non-model data. Finally, we interpolate the discussed non-model data sets and show the corresponding improved phase space portraits. The proposed method is useful for interpolating low-sampled time series data sets for, e.g., machine learning, regression analysis, or time series prediction approaches. Further, the results suggest that the variance of second derivatives along a given phase space trajectory is a valuable tool for phase space analysis of non-model time series data, and we expect it to be useful for future research.
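Two ingredients of the method, the delay embedding and the smoothness score, are easy to sketch in minimal form (the genetic-algorithm search and the fractional Brownian bridges are omitted):

```python
def delay_embed(series, dim=3, tau=1):
    """Takens delay embedding: map a univariate series to `dim`-dimensional
    state vectors (x_i, x_{i+tau}, ..., x_{i+(dim-1)tau})."""
    n = len(series) - (dim - 1) * tau
    return [tuple(series[i + j * tau] for j in range(dim)) for i in range(n)]

def second_derivative_variance(series):
    """Variance of the discrete second differences — the smoothness score
    that the genetic algorithm minimizes over candidate interpolations."""
    d2 = [series[i + 1] - 2.0 * series[i] + series[i - 1]
          for i in range(1, len(series) - 1)]
    m = sum(d2) / len(d2)
    return sum((x - m) ** 2 for x in d2) / len(d2)

embedded = delay_embed(list(range(10)), dim=3, tau=1)
smoothness = second_derivative_variance([0.0, 1.0, 2.0, 3.0, 4.0])
```

A perfectly linear series has zero second-difference variance, the smoothest possible score under this criterion.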

17.
A load-sharing system is a parallel system whose load is redistributed to its surviving components as each of the components fails. Our focus is on statistical inference of the parameters associated with the lifetime distribution of each component in the system. In this paper, we introduce a methodology which integrates the conventional procedure with the assumption that the load-sharing system is built on fundamental hypothetical latent random variables. We then develop an expectation maximization algorithm for performing maximum likelihood estimation of the system with Lindley-distributed component lifetimes. We adopt several standard simulation techniques to compare the performance of the proposed methodology with a Newton–Raphson-type algorithm for the maximum likelihood estimation of the parameters. Numerical results indicate that the proposed method is more effective in that it consistently reaches a global maximum.

18.
Entropy estimation faces numerous challenges when applied to various real-world problems. Our interest is in divergence and entropy estimation algorithms capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach based on a new rank-based analytic Zipf–Mandelbrot–Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce here a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistic-constraint-based Zipf–Mandelbrot–Li model into a new non-equiprobable coincidence counting algorithm, which is shown to be effective for tasks such as entropy rate estimation with limited data.
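Coincidence counting can be illustrated with a simpler cousin of the paper's estimator: the rate at which two independent draws coincide estimates the collision probability sum p_i², whose negative log is the Rényi collision entropy. This is not the paper's ZML-constrained algorithm, only the counting idea behind it:

```python
import math
from collections import Counter

def collision_entropy_estimate(seq):
    """Coincidence-counting estimator of the Rényi collision entropy H2.
    The fraction of ordered sample pairs that coincide estimates the
    collision probability, and H2 = -log of that estimate."""
    n = len(seq)
    coincidences = sum(c * (c - 1) for c in Counter(seq).values())
    pairs = n * (n - 1)
    return -math.log(coincidences / pairs)

# 1000 draws from a uniform 4-symbol source: H2 should be close to log 4.
est = collision_entropy_estimate(list(range(4)) * 250)
```

Because it only counts coincidences, the estimator needs far fewer samples than a full histogram-based entropy estimate, which is the appeal of this family of methods for limited data.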

19.
Large-scale spectroscopic survey projects such as LAMOST have produced massive amounts of observational data of great research value, and how to analyze data at this scale effectively is a current research focus. Clustering algorithms are a class of unsupervised machine learning algorithms that can process data without relying on domain knowledge and discover the patterns and structure within it. Clustering stellar spectra is a very important task in astronomical data processing, which mainly classifies massive spectroscopic survey data according to physical and chemical properties. For the spectra of early-M dwarf stars in the LAMOST survey, we performed clustering analysis with several algorithms, namely K-Means, Bisecting K-Means, and OPTICS, to study how different clustering algorithms perform on early-M star data. Clustering algorithms depend to some extent on the distance metric they use, so we also studied the performance of the Euclidean distance, the Manhattan distance, and a residual-distribution distance in combination with the three clustering algorithms above. The experimental results show that: (1) clustering algorithms can effectively assist the analysis of early-M dwarf spectra, with the cluster-center data agreeing very well with the MK classification; (2) the three clustering algorithms perform differently, with Bisecting K-Means having an advantage in fine-grained stellar spectral classification; and (3) clustering also produces some small clusters, from which rare-object candidates can be found, and OPTICS is relatively better suited for finding such candidates.
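The metric-pluggability studied above, pairing one clustering scheme with Euclidean, Manhattan, or residual-distribution distances, can be sketched with a nearest-center assignment step. A minimal illustration (the residual-distribution distance itself is omitted):

```python
def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign_clusters(points, centers, metric):
    """Nearest-center assignment with a pluggable distance — the step where
    the choice of metric changes a K-Means-style algorithm's behavior."""
    return [min(range(len(centers)), key=lambda j: metric(p, centers[j]))
            for p in points]

centers = [(0.0, 0.0), (2.0, 2.0)]
# (3, 0) is Euclidean-closer to (2, 2) but Manhattan-tied between the two
# centers, and the tie resolves to the first center.
labels_euc = assign_clusters([(3.0, 0.0)], centers, euclidean)
labels_man = assign_clusters([(3.0, 0.0)], centers, manhattan)
```

The same point lands in different clusters under the two metrics, which is exactly why the metric-algorithm pairing matters in such comparisons.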

20.
Functional modules can be predicted using genome-wide protein–protein interactions (PPIs) from a systematic perspective. Various graph clustering algorithms have been applied to PPI networks for this task. In particular, the detection of overlapping clusters is necessary because a protein is involved in multiple functions under different conditions. Graph entropy (GE) is a novel metric for assessing the quality of clusters in a large, complex network. In this study, the unweighted and weighted GE algorithms are evaluated to demonstrate their validity for predicting functional modules. To measure clustering accuracy, the clustering results are compared to protein complexes and Gene Ontology (GO) annotations as references. We demonstrate that the GE algorithm is more accurate on overlapping clusters than the other competitive methods. Moreover, we confirm the biological feasibility of the proteins that occur most frequently in the set of identified clusters. Finally, novel proteins for the additional annotation of GO terms are revealed.
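The entropy-of-a-cluster idea can be sketched per vertex: a vertex is "certain" when its neighbors fall entirely inside or outside a candidate cluster, and maximally uncertain at an even split. This is a simplified reading of the GE definition, not the paper's exact algorithm:

```python
import math

def vertex_entropy(neighbors, cluster):
    """Binary entropy of a vertex given a candidate cluster: p is the
    fraction of its neighbors inside the cluster.  Summing this over
    vertices yields a graph-entropy-style score that seed-growing
    methods would minimize (simplified sketch)."""
    inside = sum(1 for u in neighbors if u in cluster)
    p = inside / len(neighbors)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

e_coherent = vertex_entropy([1, 2], {1, 2})  # all neighbors inside
e_split = vertex_entropy([1, 2], {1})        # even inside/outside split
```

A cluster boundary that cleanly separates neighborhoods drives this score to zero, which is what makes it a useful quality measure for candidate modules.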

