首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 625 毫秒
1.
For 30 years the Lempel–Ziv factorization LZ x of a string xx[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally the standard method for computing LZ x was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree ST x of x. Recently Abouelhoda et al. proposed an efficient Lempel–Ziv factorization algorithm based on an “enhanced” suffix array – that is, a suffix array SA x together with supporting data structures, principally an “interval tree”. In this paper we introduce a collection of fast space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family out of this collection achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether. The work of the first and third authors was supported in part by grants from the Natural Sciences & Engineering Research Council of Canada.  相似文献   

2.
Replacing suffix trees with enhanced suffix arrays   总被引:9,自引:0,他引:9  
The suffix tree is one of the most important data structures in string processing and comparative genomics. However, the space consumption of the suffix tree is a bottleneck in large scale applications such as genome analysis. In this article, we will overcome this obstacle. We will show how every algorithm that uses a suffix tree as data structure can systematically be replaced with an algorithm that uses an enhanced suffix array and solves the same problem in the same time complexity. The generic name enhanced suffix array stands for data structures consisting of the suffix array and additional tables. Our new algorithms are not only more space efficient than previous ones, but they are also faster and easier to implement.  相似文献   

3.
There is a class of data compression techniques that involve replacing repeating strings by pointers to previous occurrences of those strings. In order to implement these techniques, it would be useful to have an index which can quickly locate repeats within a fixed window of the text seen so far. One such index is the directed acyclic word graph (DAWG). An algorithm is presented which constructs the DAWG for a fixed window of k letters moving from left to right along the input text. It is shown that this algorithm has worst case behavior proportional to nk, where n is the length of the input text and k is the size of the window. This bound is the best possible for a window moving through the DAWG and also for a window moving through the suffix tree, a functionally related structure. The average case for the algorithm is analyzed under the assumption that input strings are random texts constructed from a fixed alphabet where each letter has equal probability. The resulting upper bound is O(nlogk).  相似文献   

4.
It is shown how to compute the lexicographically maximum suffix of a string of n?2 characters over a totally ordered alphabet using at most (4/3)n−5/3 three-way character comparisons. The best previous bound, which has stood unchallenged for more than 25 years, is (3/2)nO(1) comparisons. We also prove an interesting property of an algorithm for computing the maximum suffix both with respect to a total order < and with respect to its inverse order >.  相似文献   

5.
We give first the representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern string in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the length of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan (in Proceedings of the FST and TCS Preconference Workshop on Randomization, 1997, pp. 23–27). Previous compact representations of suffix trees had either a higher lower order term in space and had some expectation assumption or required more time for searching. When the size of the alphabet k is not viewed as a constant, this structure can be modified to use the same space but take O(m lg k) time for string searching or to use an additional O(n lg k) bits but take the same O(m) time for searching. We then give several index structures for binary texts, with less space including
• a structure that uses a suffix array (lg  bits) and an additional () bits,
• an indexing structure that takes bits of space, and
• an ( lg ) bit structure which answers in () time, the decision question of whether a given pattern of length occurs in the text.
Each of these structures uses a different technique, either in the storage scheme or in the search algorithm, in reducing the space requirement. The first one uses a suffix array, a sparse suffix tree, and a table structure. Finding all the occurrences of a pattern using this structure takes O(m + s) time, where s is the number of occurrences of the pattern in the text. The second structure constructs a sparse suffix tree for all the suffixes that start with the bit that occurs more times in the given binary text. The last structure uses an iterative algorithm to search for the pattern. This structure is the first o(n lg n) bit index to support the decision version of indexing queries in time linear in the length of the pattern. But this does not support the general indexing queries where we want to find the position of the occurrence of the pattern.Our main contribution is the development of techniques to use the succinct tree representation through balanced parentheses for suffix trees.  相似文献   

6.
(δ,γ)-matching is a string matching problem with applications to music retrieval. The goal is, given a pattern P1…m and a text T1…n on an alphabet of integers, find the occurrences P of the pattern in the text such that (i) , and (ii) . The problem makes sense for δγδm. Several techniques for (δ,γ)-matching have been proposed, based on bit-parallelism or on skipping characters. We first present an O(mnlog(γ)/w) worst-case time and O(n) average-case time bit-parallel algorithm (being w the number of bits in the computer word). It improves the previous O(mnlog(δm)/w) worst-case time algorithm of the same type. Second, we combine our bit-parallel algorithm with suffix automata to obtain the first algorithm that skips characters using both δ and γ. This algorithm examines less characters than any previous approach, as the others do just δ-matching and check the γ-condition on the candidates. We implemented our algorithms and drew experimental results on real music, showing that our algorithms are superior to current alternatives with high values of δ.  相似文献   

7.
《Journal of Complexity》1999,15(1):72-127
We say that a data structure is builton-lineif, at any instant, we have the data structure corresponding to the input we have seen up to that instant. For instance, consider the suffix tree of a stringx[1, n]. An algorithm building iton-lineis such that, when we have read the firstisymbols ofx[1, n], we have the suffix tree forx[1, i]. We present a new technique, which we refer to asimplicit updates, based on which we obtain: (a) an algorithm for theon-lineconstruction of the Lsuffix tree of ann×nmatrixA—this data structure is the two-dimensional analog of the suffix tree of a string; (b) simple algorithms implementing primitive operations forLZ1-typeon-line losslessimage compression methods. Those methods, recently introduced by Storer, are generalizations ofLZ1-typecompression methods for strings. For the problem in (a), we get nearly an order of magnitude improvement over algorithms that can be derived from known techniques. For the problem in (b), we do not get an asymptotic speed-up with respect to what can be done with known techniques; rather we show that our algorithms are a natural support for the primitive operations. This may lead to faster implementations of those primitive operations. To the best of our knowledge, our technique is the first one that effectively addresses problems related to theon-lineconstruction of two-dimensional suffix trees.  相似文献   

8.
Given a string xx[1..n] on an alphabet of size α, and a threshold p min ≥ 1, we describe four variants of an algorithm PSY1 that, using a suffix array, computes all the complete nonextendible repeats in x of length pp min . The basic algorithm PSY1–1 and its simple extension PSY1–2 are fast on strings that occur in biological, natural language and other applications (not highly periodic strings), while PSY1–3 guarantees Θ(n) worst-case execution time. The final variant, PSY1–4, also achieves Θ(n) processing time and, over the complete range of strings tested, is the fastest of the four. The space requirement of all four algorithms is about 5n bytes, but all make use of the “longest common prefix” (LCP) array, whose construction requires about 6n bytes. The four algorithms are faster in applications and use less space than a recently-proposed algorithm (Narisawa in Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351, 2007) that produces equivalent output. The suffix array is not explicitly used by algorithms PSY1, but may be required for postprocessing; in this case, storage requirements rise to 9n bytes. We also describe two variants of a fast Θ(n)-time algorithm PSY2 for computing all complete supernonextendible repeats in x.  相似文献   

9.
10.
This paper proposes a new model that generalizes the linear sliding window system to the case of multiple failures. The considered k ‐within‐ m ‐from‐ r / n sliding window system consists of n linearly ordered multi‐state elements and fails if at least k groups out of m consecutive groups of r consecutive multi‐state elements have cumulative performance lower than the demand W . A reliability evaluation algorithm is suggested for the proposed system. In order to increase the system availability, maintenance actions can be performed, and the elements can be optimally allocated. A joint element allocation and maintenance optimization model is formulated with the objective of minimizing the total maintenance cost subjected to the pre‐specified system availability requirement. Basic procedures of genetic algorithms are adapted to solve the optimization problem. Numerical experiments are presented to illustrate the applications. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

11.
New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing. It occupies only O(n) bits for a text of length n; however it also uses the text itself that occupies bits for the alphabet . In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find occ occurrences of any substring P of the text in O(|P|logn+occlogεn) time for any fixed 1ε>0 without access to the text. The index also can decompress a part of the text of length m in O(m+logεn) time. For a text of length n on an alphabet such that , our new index occupies only bits where is the order-0 entropy of the text. Especially for ε=1 the size is bits. Therefore the index will be smaller than the text, which means we can perform fast queries from compressed texts.  相似文献   

12.
13.
Space efficient linear time construction of suffix arrays   总被引:1,自引:0,他引:1  
We present a linear time algorithm to sort all the suffixes of a string over a large alphabet of integers. The sorted order of suffixes of a string is also called suffix array, a data structure introduced by Manber and Myers that has numerous applications in pattern matching, string processing, and computational biology. Though the suffix tree of a string can be constructed in linear time and the sorted order of suffixes derived from it, a direct algorithm for suffix sorting is of great interest due to the space requirements of suffix trees. Our result is one of the first linear time suffix array construction algorithms, which improve upon the previously known O(nlogn) time direct algorithms for suffix sorting. It can also be used to derive a different linear time construction algorithm for suffix trees. Apart from being simple and applicable for alphabets not necessarily of fixed size, this method of constructing suffix trees is more space efficient.  相似文献   

14.
The suffix tree data structure has been intensively described, studied and used in the eighties and nineties, its linear-time construction counterbalancing his space-consuming requirements. An equivalent data structure, the suffix array, has been described by Manber and Myers in 1990. This space-economical structure has been neglected during more than a decade, its construction being too slow. Since 2003, several linear-time suffix array construction algorithms have been proposed, and this structure has slowly replaced the suffix tree in many string processing problems. All these constructions are building the suffix array from the text, and any edit operation on the text leads to the construction of a brand new suffix array. In this article, we are presenting an algorithm that modifies the suffix array and the Longest Common Prefix (LCP) array when the text is edited (insertion, substitution or deletion of a letter or a factor). This algorithm is based on a recent four-stage algorithm developed for dynamic Burrows–Wheeler Transforms (BWT). For minimizing the space complexity, we are sampling the Suffix Array, a technique used in BWT-based compressed indexes. We furthermore explain how this technique can be adapted for maintaining a sample of the Extended Suffix Array, containing a sample of the Suffix Array, a sample of the Inverse Suffix Array and the whole LCP array. Our practical experiments show that it operates very well in practice, being quicker than the fastest suffix array construction algorithm.  相似文献   

15.
Abstract. A graph is called a string graph if its vertices can be represented by continuous curves (``strings') in the plane so that two of them cross each other if and only if the corresponding vertices are adjacent. It is shown that there exists a recursive function f(n) with the property that every string graph of n vertices has a representation in which any two curves cross at most f(n) times. We obtain as a corollary that there is an algorithm for deciding whether a given graph is a string graph. This solves an old problem of Benzer (1959), Sinden (1966), and Graham (1971).  相似文献   

16.
The study and comparison of strings of symbols from a finite or an infinite alphabet is relevant to various areas of science, notably molecular biology, speech recognition, and computer science. In particular, the problem of finding the minimum “distance” between two strings (in general, two blocks of data) is of a practical importance. In this article we investigate the (string) pattern matching problem in a probabilistic framework, namely, it is assumed that both strings form an independent sequences of i.i.d. symbols. Given a text string a of length n and a pattern string b of length m, let Mm,n be the maximum number of matches between b and all m-substrings of a . Our main probabilistic result shows that for a wide range of input parameters in probability (pr.) provided m, n →∞ such that log n = o(m), where P is the probability of a match between any two symbols of these strings, and T is the probability of a match between two positions in the text string and a given position of the pattern string. We also prove that Mm,n/mP almost surely (a.s.) for log n = o(m). © 1993 John Wiley & Sons. Inc.  相似文献   

17.
Summary We propose a fast Monte-Carlo algorithm for calculating reliable estimates of the trace of the influence matrixA involved in regularization of linear equations or data smoothing problems, where is the regularization or smoothing parameter. This general algorithm is simply as follows: i) generaten pseudo-random valuesw 1, ...,w n , from the standard normal distribution (wheren is the number of data points) and letw=(w 1, ...,w n ) T , ii) compute the residual vectorwA w, iii) take the normalized inner-product (w T (wA w))/(w T w) as an approximation to (1/n)tr(I–A ). We show, both by theoretical bounds and by numerical simulations on some typical problems, that the expected relative precision of these estimates is very good whenn is large enough, and that they can be used in practice for the minimization with respect to of the well known Generalized Cross-Validation (GCV) function. This permits the use of the GCV method for choosing in any particular large-scale application, with only a similar amount of work as the standard residual method. Numerical applications of this procedure to optimal spline smoothing in one or two dimensions show its efficiency.  相似文献   

18.
The consensus string problem for a metric is NP-complete   总被引:1,自引:0,他引:1  
Given a set S of strings, a consensus string of S based on consensus error is a string w that minimizes the sum of the distances between w and all the strings in S. In this paper, we show that the problem of finding a consensus string based on consensus error is NP-complete when the penalty matrix is a metric.  相似文献   

19.
In this article we consider tries built from n strings such that each string can be chosen from a pool of k strings, each of them generated by a discrete i.i.d. source. Three cases are considered: k = 2, k is large but fixed, and k ? clog n. The goal in each case is to obtain tries as balanced as possible. Various parameters such as height and fill‐up level are analyzed. It is shown that for two‐choice tries a 50% reduction in height is achieved when compared with ordinary tries. In a greedy online construction when the string that minimizes the depth of insertion for every pair is inserted, the height is only reduced by 25%. To further reduce the height by another 25%, we design a more refined online algorithm. The total computation time of the algorithm is O(nlog n). Furthermore, when we choose the best among k ≥ 2 strings, then for large but fixed k the height is asymptotically equal to the typical depth in a trie. Finally, we show that further improvement can be achieved if the number of choices for each string is proportional to log n. In this case highly balanced trees can be constructed by a simple greedy algorithm for which the difference between the height and the fill‐up level is bounded by a constant with high probability. This, in turn, has implications for distributed hash tables, leading to a randomized ID management algorithm in peer‐to‐peer networks such that, with high probability, the ratio between the maximum and the minimum load of a processor is O(1). © 2008 Wiley Periodicals, Inc. Random Struct. Alg., 2009  相似文献   

20.
The problem of sorting n integers from a restricted range [1…m], where m is a superpolynomial in n, is considered. An o(n log n) randomized algorithm is given. Our algorithm takes O(n log log m) expected time and O(n) space. (Thus, for m = npolylog(n) we have an O(n log log n) algorithm.) The algorithm is parallelizable. The resulting parallel algorithm achieves optimal speedup. Some features of the algorithm make us believe that it is relevant for practical applications. A result of independent interest is a parallel hashing technique. The expected construction time is logarithmic using an optimal number of processors, and searching for a value takes O(1) time in the worst case. This technique enables drastic reduction of space requirements for the price of using randomness. Applicability of the technique is demonstrated for the parallel sorting algorithm and for some parallel string matching algorithms. The parallel sorting algorithm is designed for a strong and nonstandard model of parallel computation. Efficient simulations of the strong model on a CRCW PRAM are introduced. One of the simulations even achieves optimal speedup. This is probably the first optimal speedup simulation of a certain kind.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号