Several methods for compressing suffix trees have been defined, most of them with the aim of obtaining compact (that is, space-economical) index structures. Besides this practical aspect, a compression method can reveal structural properties of the resulting data structure, allowing a better understanding of it and a better estimate of its performance.
In this paper, we propose a simple method to compress suffix trees by merging pairs of nodes. This idea has already been used in the literature, in a context different from ours. The originality of our approach is that the nodes we merge are chosen neither with respect to their subtrees (which is difficult to test algorithmically) nor with respect to the words spelled along branches (which usually requires testing several branches before finding the right one), but with respect to their position in the tree (which is easy to compute). Another particularity of our method is that it does not need to read any edge label: it is based exclusively on the topology of the suffix tree. The compact structure obtained after compression is the factor/suffix oracle introduced by Allauzen, Crochemore and Raffinot, whose accepted language includes the accepted language of the corresponding suffix tree.
The interest of our paper is therefore threefold:
1. A topology-based compression method is defined for (compact) suffix trees.
2. A new property of the factor/suffix oracle is established: like a DAG, it results from the corresponding suffix tree after a linear number of appropriate node merges; unlike a DAG, the merged nodes do not necessarily have isomorphic subtrees.
3. A new algorithm to transform a suffix tree into a factor/suffix oracle is given; it runs in linear time and thus improves on the quadratic complexity previously known for the same task.
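To make the target structure concrete, the following minimal sketch builds a factor oracle with the standard online construction of Allauzen, Crochemore and Raffinot. It is not the suffix-tree-to-oracle transformation proposed in the paper, which is not detailed here. The names `build_factor_oracle`, `accepts`, `trans` and `supply` are illustrative choices, not taken from the source.

```python
def build_factor_oracle(word):
    """Online construction of the factor oracle of `word`.

    Returns (trans, supply): trans[s] maps a letter to the target state,
    and supply[s] is the supply (failure) link of state s.
    States are numbered 0..len(word).
    """
    n = len(word)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i, c in enumerate(word):
        new = i + 1
        trans[i][c] = new              # internal transition i -> i+1
        k = supply[i]
        # Follow supply links, adding external transitions where c is missing.
        while k > -1 and c not in trans[k]:
            trans[k][c] = new
            k = supply[k]
        supply[new] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, s):
    """Return True if `s` labels a path from the initial state (all states accept)."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


if __name__ == "__main__":
    w = "abbbaab"
    trans, _ = build_factor_oracle(w)
    print(accepts(trans, "bba"))   # a factor of w
    print(accepts(trans, "aba"))   # accepted although it is not a factor of w
```

On the word `abbbaab`, the oracle accepts every factor and also the word `aba`, which is not a factor. This illustrates the statement that the oracle's language includes, and may strictly exceed, that of the suffix tree.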
A new clustering method is presented, consisting of a class of objective functions and an algorithm that suboptimizes these objective functions over the whole space of partitions. The objective functions are global in nature, encompassing both the cluster contents and the number of clusters. The accompanying suboptimization algorithm nevertheless follows a simple progressive merger scheme, which produces an indexed hierarchy in a quite natural way. The hierarchy index is not just tacked on to the method (see Diday and Moreau [1]); on the contrary, the algorithm refers directly to its values, which measure, depending on the particular formulation, either the relative affinity or the relative difference of the two clusters merged at a given level of the hierarchy. In this way, the scale of the hierarchy and the hierarchy-wise validity of clusters can easily be established, which is of great importance in analysing unstructured data sets whose generating process is unknown and can only be hypothesized after an initial structure has been established, e.g. by clustering, as is the case in pattern recognition (see Kaminuma [2]).
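The progressive merger scheme can be illustrated with a generic sketch. The abstract does not specify the objective functions, so the code below uses the average pairwise distance between two clusters as a stand-in for the "relative difference" index. That criterion, and the names `average_distance` and `progressive_merge`, are assumptions made for illustration only, not the method of the paper.

```python
import math
from itertools import combinations


def average_distance(a, b):
    """Mean Euclidean distance between points of clusters a and b.

    Placeholder merge criterion (an assumption, not the paper's objective).
    """
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def progressive_merge(points):
    """Merge the two closest clusters repeatedly, recording an indexed hierarchy.

    Returns a list of (index_value, left_cluster, right_cluster), one per merge.
    """
    clusters = [[p] for p in points]
    hierarchy = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest relative difference.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        level = average_distance(clusters[i], clusters[j])
        hierarchy.append((level, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return hierarchy


if __name__ == "__main__":
    for level, left, right in progressive_merge([(0.0,), (1.0,), (10.0,)]):
        print(level, left, right)
```

Each recorded value plays the role of the hierarchy index: a large jump between consecutive values suggests that the partition just before the jump is a natural place to cut the hierarchy.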