20 similar documents found.
1.
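Two types of problems to note in hierarchical cluster analysis (total citations: 2; self-citations: 0; citations by others: 2)
An interesting example is given in which nine similarity measures are each combined with single-linkage (shortest-distance) clustering and the results all differ from one another. For the same example, once the distance matrix has been computed with Euclidean distance, only single linkage gives a unique clustering result; complete linkage (longest distance), the centroid method, average linkage, and Ward's method (sum of squared deviations) all give non-unique results.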
2.
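Comparing the similarity of clustering results is a branch of cluster analysis that is important but theoretically far from complete. Starting from the similarity index B_k for clustering results given in reference [1], this paper proposes an improved index B_(k1) and shows, both theoretically and through Monte Carlo simulation, that the improved index retains the similarity interpretation of B_k while remedying its shortcomings.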
3.
4.
5.
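Mathematics in Practice and Theory (《数学的实践与认识》), 2013, (20)
Because traditional spectral clustering algorithms are ill-suited to multi-scale problems, a new similarity measure, the density-sensitive similarity measure, is introduced. It enlarges the distances between data points in different high-density regions and shortens the distances between data points within the same high-density region, and so describes the actual cluster distribution of the data effectively. The paper introduces the concept of the eigengap and gives a method for determining the number of clusters automatically. Numerical experiments verify the feasibility and effectiveness of the proposed algorithm.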
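The eigengap idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common formulation in which the eigenvalues of a normalized graph Laplacian are sorted in ascending order and the number of clusters is taken as the position of the largest gap; the paper may define the eigengap differently. The function name and the eigenvalues below are hypothetical, not taken from the paper.

```python
def eigengap_cluster_count(eigenvalues):
    """Estimate the number of clusters from Laplacian eigenvalues.

    Assumption: `eigenvalues` are eigenvalues of a normalized graph
    Laplacian. The estimate is the k for which the gap between the k-th
    and (k+1)-th smallest eigenvalues is largest.
    """
    vals = sorted(eigenvalues)
    if len(vals) < 2:
        raise ValueError("need at least two eigenvalues")
    # gaps[i] is the difference between the (i+2)-th and (i+1)-th smallest values
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    return gaps.index(max(gaps)) + 1


# Hypothetical eigenvalues: three near zero, then a clear jump.
example_eigenvalues = [0.0, 0.01, 0.02, 0.85, 0.90, 0.95]
estimated_k = eigengap_cluster_count(example_eigenvalues)
print(estimated_k)
```

With these made-up values the largest gap follows the third eigenvalue, so the heuristic suggests three clusters.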
6.
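Mathematics in Practice and Theory (《数学的实践与认识》), 2013, (14)
Cluster analysis is a multivariate statistical method for the comprehensive classification of samples or variables. Clustering results are often presented as a dendrogram. Determining the number of clusters reasonably has long been a difficult problem with no good solution to date; in particular, when the sample size is large the dendrogram has many levels and the number of clusters is hard to determine by inspection. A clustering method based on Bayesian theory is introduced that determines the optimal number of clusters and the optimal partition by maximizing the posterior likelihood, which avoids subjectivity in choosing the number of clusters. A real dataset with known classes verifies the effectiveness of the method.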
7.
8.
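Interval-valued symbolic data are an important type of symbolic data. The existing literature usually assumes that the points within an interval follow a uniform distribution, which limits its applicability. Under the assumption of a general distribution, this paper gives an extended Hausdorff distance measure for interval-valued symbolic data with general distributions and, based on it, proposes an SOM clustering algorithm for such data. Random simulation experiments show that the SOM clustering algorithm based on the proposed extended Hausdorff distance is more effective than SOM clustering algorithms based on the traditional Hausdorff distance and on the μσ distance. Finally, the method is applied to the cluster analysis of meteorological data to illustrate its application steps and operability and to further evaluate its effectiveness on practical problems.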
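For orientation, the traditional Hausdorff distance that the abstract above uses as a baseline has a simple closed form for two closed intervals: the larger of the two endpoint differences. The sketch below shows only that classical baseline; the extended measure for general distributions is not described in the abstract and is not reproduced here. The function name and sample intervals are illustrative assumptions.

```python
def interval_hausdorff(a, b):
    """Classical Hausdorff distance between closed intervals a=(lo, hi), b=(lo, hi).

    For intervals on the real line this reduces to the larger of the
    differences between the lower endpoints and between the upper endpoints.
    """
    a_lo, a_hi = a
    b_lo, b_hi = b
    if a_lo > a_hi or b_lo > b_hi:
        raise ValueError("each interval must satisfy lo <= hi")
    return max(abs(a_lo - b_lo), abs(a_hi - b_hi))


# Hypothetical interval observations, e.g. daily (min, max) temperatures.
d = interval_hausdorff((1.0, 3.0), (2.0, 6.0))
print(d)
```

Because this distance depends only on the endpoints, it ignores how values are distributed inside each interval, which is the limitation the abstract sets out to address.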
9.
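Metric equations for pairs of basic figures and their applications. You Xiuying, Yang Chi (Guangdong Institute of Mechanical Engineering) (Shanghai Putuo District Spare-Time University). 1. Cosine of the angle between hyperplanes. Let e be an m-dimensional hyperplane in P, and let g = (...) be a set of mutually orthogonal unit vectors perpendicular to e; then g is called a normal vector set of e. Specify a class of normal vector sets (any ... in the normal vector set ...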
10.
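The maximum matrix element principle of fuzzy clustering is combined with a theoretical model of fuzzy water quality evaluation based on data iteration to form a fuzzy clustering iteration method. The method is used to classify and evaluate groundwater quality in Jinchang City, Gansu, with satisfactory results.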
11.
《Journal of computational and graphical statistics》2013,22(3):511-528
This article proposes a new quantity for assessing the number of groups or clusters in a dataset. The key idea is to view clustering as a supervised classification problem, in which we must also estimate the “true” class labels. The resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well. In the process, we develop novel notions of bias and variance for unlabeled data. Prediction strength performs well in simulation studies, and we apply it to clusters of breast cancer samples from a DNA microarray study. Finally, some consistency properties of the method are established.
12.
13.
Chao-Ming Hwang Miin-Shen Yang Wen-Liang Hung E. Stanley Lee 《Mathematical and Computer Modelling》2011,53(9-10):1788-1797
Similarity measures of type-2 fuzzy sets are used to indicate the similarity degree between type-2 fuzzy sets. Inclusion measures for type-2 fuzzy sets are the degrees to which a type-2 fuzzy set is a subset of another type-2 fuzzy set. The entropy of type-2 fuzzy sets is the measure of fuzziness between type-2 fuzzy sets. Although several similarity, inclusion and entropy measures for type-2 fuzzy sets have been proposed in the literature, no one has considered the use of the Sugeno integral to define those for type-2 fuzzy sets. In this paper, new similarity, inclusion and entropy measure formulas between type-2 fuzzy sets based on the Sugeno integral are proposed. Several examples are used to present the calculation and to compare these proposed measures with several existing methods for type-2 fuzzy sets. Numerical results show that the proposed measures are more reasonable than existing measures. On the other hand, measuring the similarity between type-2 fuzzy sets is important in clustering for type-2 fuzzy data. We finally use the proposed similarity measure with a robust clustering method for clustering the patterns of type-2 fuzzy sets.
14.
Michael P. Windham 《Fuzzy Sets and Systems》1981,5(2):177-185
The proportion exponent is introduced as a measure of the validity of the clustering obtained for a data set using a fuzzy clustering algorithm. It is assumed that the output of an algorithm includes a fuzzy membership function for each data point. We show how to compute the proportion of possible memberships whose maximum entry exceeds the maximum entry of a given membership function, and use these proportions to define the proportion exponent. Its use as a validity functional is illustrated with four numerical examples and its effectiveness compared to other validity functionals, namely, classification entropy and partition coefficient.
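The two comparison functionals named in the abstract above, partition coefficient and classification entropy, have standard definitions that can be sketched briefly. The sketch assumes a membership matrix with one row per data point whose entries sum to 1; the function names and sample memberships are illustrative, and the proportion exponent itself is not implemented here because the abstract does not give its formula.

```python
import math


def partition_coefficient(memberships):
    """Mean of squared memberships; 1.0 for a crisp partition, 1/c if fully fuzzy."""
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def classification_entropy(memberships):
    """Mean of -u*ln(u) summed over memberships; 0.0 if crisp, ln(c) if fully fuzzy."""
    n = len(memberships)
    # Terms with u == 0 contribute nothing (the limit of u*ln(u) as u -> 0 is 0).
    total = sum(u * math.log(u) for row in memberships for u in row if u > 0.0)
    return -total / n


# Hypothetical memberships for three points and two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print(partition_coefficient(crisp), classification_entropy(crisp))
print(partition_coefficient(fuzzy), classification_entropy(fuzzy))
```

A higher partition coefficient and a lower classification entropy both indicate a less fuzzy, and by these criteria more clearly separated, partition.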
15.
《Journal of computational and graphical statistics》2013,22(2):397-418
The goal of clustering is to detect the presence of distinct groups in a dataset and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all datasets used in the examples, and R code producing the figures and numerical results, are available online.
16.
Yunjae Jung Haesun Park Ding-Zhu Du Barry L. Drake 《Journal of Global Optimization》2003,25(1):91-111
Clustering has been widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Though many effective and efficient clustering algorithms have been developed and deployed, most of them still suffer from the lack of automatic or online decision for optimal number of clusters. In this paper, we define clustering gain as a measure for clustering optimality, which is based on the squared error sum as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. Our clustering measure shows good performance producing intuitively reasonable clustering configurations in Euclidean space according to the evidence from experimental results. Furthermore, the measure can be utilized to estimate the desired number of clusters for partitional clustering methods as well. Therefore, the clustering gain measure provides a promising technique for achieving a higher level of quality for a wide range of clustering methods.
17.
Cluster analysis is used in various scientific and applied fields and is a topical subject of research. In contrast to the existing methods, the algorithms offered in this paper are intended for clustering objects described by feature vectors in a space in which the symmetry axiom is not satisfied. In this case, the clustering problem is solved using an asymmetric proximity measure. The essence of the first of the proposed clustering algorithms consists in sequential generation of clusters with simultaneous transfer of the objects clustered from previously created clusters into a current cluster if this reduces the quality criterion. In comparison with the existing algorithms of non-hierarchical clustering, such an approach to cluster generation makes it possible to reduce the computational costs. The second algorithm is a modified version of the first one and makes it possible to reassign the main objects of clusters to further decrease the value of the proposed quality criterion.
18.
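Topic detection is an important research problem in predicting hot topics on online social platforms. Most existing topic detection algorithms measure the similarity between text data with the traditional cosine similarity, which cannot recognize the difference between data objects whose values change proportionally across all dimensions; the resulting text similarity is inaccurate and reduces the accuracy of topic detection. To address this, a topic detection algorithm based on bidirectional improved cosine similarity (TABOC) is proposed. First, cosine similarity is improved in terms of both direction and magnitude to give the bidirectional improved cosine similarity, which can distinguish data objects whose values change proportionally across dimensions while keeping the advantage of traditional cosine similarity in discriminating direction, and thus measures text similarity more accurately. Next, the bidirectional improved cosine feature vector of a set, the addition of such feature vectors, and related definitions and theorems are given, so that irrelevant information is discarded and the feature vector of a newly merged set is computed directly, reducing the time and space cost of topic detection. An incremental clustering framework is also incorporated to handle newly arriving data efficiently. Experiments on Baidu Tieba data show that TABOC is effective and feasible for topic detection and that its accuracy and time efficiency are on the whole better than those of the algorithms compared.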
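The limitation of traditional cosine similarity described in the abstract above is easy to demonstrate: two vectors whose components differ by a constant factor have a cosine similarity of 1 even though their magnitudes differ. The sketch below shows only this baseline behaviour; the bidirectional improved measure is not specified in the abstract and is not reproduced. The vectors are hypothetical term-frequency counts.

```python
import math


def cosine_similarity(u, v):
    """Traditional cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance, used here only to show that the magnitudes differ."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical term-frequency vectors: doc_b is doc_a scaled by 10.
doc_a = [1.0, 2.0, 3.0]
doc_b = [10.0, 20.0, 30.0]

similarity = cosine_similarity(doc_a, doc_b)
distance = euclidean_distance(doc_a, doc_b)
print(similarity, distance)
```

The similarity is 1 up to floating-point rounding while the Euclidean distance is clearly non-zero, which is the kind of difference a purely direction-based measure cannot see.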
19.
20.
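Similarity measures are an effective tool for solving fuzzy multi-attribute decision-making problems. To address the shortcomings of existing similarity measures, a new similarity measure between two vectors is constructed, shown to satisfy the properties of a similarity measure, and applied to multi-attribute decision-making problems with intuitionistic trapezoidal fuzzy preferences. The method represents the information of each alternative by intuitionistic trapezoidal fuzzy numbers of linguistic values, computes the relative similarity of each alternative's expected vector to the expected vectors of the positive ideal and negative ideal alternatives, and ranks the alternatives by relative similarity. Finally, a case study is used to discuss the feasibility of the method; the numerical results show that the method is computationally simple and practical.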