Similar documents (20 results)
1.
New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing. It occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies n log|Σ| bits for the alphabet Σ. In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find the occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time for any fixed 1 ≥ ε > 0 without access to the text. The index can also decompress a part of the text of length m in O(m + log^ε n) time. For a text of length n on an alphabet Σ (under a mild restriction on the alphabet size), our new index occupies only ε^{-1} n H_0 + O(n log log|Σ|) bits, where H_0 is the order-0 entropy of the text; in particular, for ε = 1 the size is n H_0 + O(n log log|Σ|) bits. The index is therefore smaller than the text itself, which means we can perform fast queries on compressed texts.
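For readers unfamiliar with the query interface, the sketch below shows the uncompressed analogue of the search operation: binary search over a plain suffix array. The function names are illustrative and this is not the compressed structure of Grossi–Vitter or of this paper, which supports the same query without storing the text.

```python
def build_suffix_array(text):
    # Quadratic-time construction; fine for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, pattern):
    """Report all occurrences of `pattern` via binary search on a plain suffix array."""
    n, m = len(text), len(pattern)
    lo, hi = 0, n
    while lo < hi:                                   # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = n
    while lo < hi:                                   # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(search(text, sa, "abra"))   # [0, 7]
```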

2.
The indexing problem is one where a text is preprocessed and subsequent queries of the form “Find all occurrences of pattern P in the text” are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns is preprocessed and subsequent queries of the form “Find all occurrences of dictionary patterns in text T” are answered in time proportional to the length of the text and the number of occurrences. There exist efficient worst-case solutions for the indexing problem and the dictionary matching problem, but none that find approximate occurrences of the patterns, i.e., occurrences within a bounded edit (or Hamming) distance of the appropriate text location. In this paper we present a uniform deterministic solution to both the indexing and the general dictionary matching problem with one error. We preprocess the data in time O(n log^2 n), where n is the text size in the indexing problem and the dictionary size in the dictionary matching problem. Our query time for the indexing problem is O(m log n log log n + tocc), where m is the query string size and tocc is the number of occurrences. Our query time for the dictionary matching problem is O(n log^3 d log log d + tocc), where n is the text size and d the dictionary size. The time bounds above apply to both bounded and unbounded alphabets.

3.
Let a text string T of n symbols and a pattern string P of m symbols over an alphabet Σ be given. A swapped version T′ of T is a length-n string derived from T by a series of local swaps (i.e., t_ℓ ← t_{ℓ+1} and t_{ℓ+1} ← t_ℓ), where each element can participate in at most one swap. The pattern matching with swaps problem is that of finding all locations i for which there exists a swapped version T′ of T with an exact match of P at location i of T′. It has been an open problem whether swapped matching can be done in less than O(nm) time. In this paper we show the first algorithm that solves the pattern matching with swaps problem in time o(nm). We present an algorithm whose time complexity is O(n m^{1/3} log m log σ) for a general alphabet Σ, where σ = min(m, |Σ|).
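As a concrete reference point, here is the naive O(nm) check that the above result improves on. The greedy left-to-right scan relies on the swaps being between adjacent positions and disjoint, and it confines swaps to the aligned window (equivalently, applies them to the pattern), which is the usual convention; function names are illustrative.

```python
def matches_with_swaps(window, pattern):
    """Does `pattern` match `window` after disjoint swaps of adjacent symbols?"""
    m = len(pattern)
    j = 0
    while j < m:
        if pattern[j] == window[j]:
            j += 1
        elif (j + 1 < m and pattern[j] == window[j + 1]
              and pattern[j + 1] == window[j]):
            j += 2          # use one swap: window[j] <-> window[j+1]
        else:
            return False
    return True

def swap_match(text, pattern):
    """All locations i at which some swapped version of the text matches the pattern."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if matches_with_swaps(text[i:i + m], pattern)]

print(swap_match("abbaab", "baba"))  # [0, 1, 2]
```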

4.
Character sets of strings
Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains, such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, in particular:
(1) Output all the maximal locations of substrings having a given character set.
(2) Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C.
Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n^2) time using Θ(n) space. The second algorithm has Θ(n|Σ| log|Σ|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13].
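A brute-force illustration of the objects in question: the sketch collects the character set of every substring and, under one common definition of maximality (the location cannot be extended to the left or right without changing the set), the maximal locations of each set. This is only a quadratic baseline for intuition, not the algorithms above, and the exact definition of maximality in the paper may differ in details.

```python
from collections import defaultdict

def maximal_locations(s):
    """Map each character set (fingerprint) of s to its maximal locations (i, j)."""
    locs = defaultdict(list)
    n = len(s)
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            # Maximal: neither neighbour outside [i, j] already belongs to the set.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                locs[frozenset(seen)].append((i, j))
    return locs

for fingerprint, where in maximal_locations("abaca").items():
    print("".join(sorted(fingerprint)), where)
```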

5.
We address the problem of computing homotopic shortest paths in the presence of obstacles in the plane. Problems on the homotopy of paths have received attention very recently [Cabello et al., in: Proc. 18th Annu. ACM Sympos. Comput. Geom., 2002, pp. 160–169; Efrat et al., in: Proc. 10th Annu. European Sympos. Algorithms, 2002, pp. 411–423]. We present two output-sensitive algorithms, one for simple paths and one for non-simple paths. The algorithm for simple paths improves the previous algorithm [Efrat et al., in: Proc. 10th Annu. European Sympos. Algorithms, 2002, pp. 411–423]. The algorithm for non-simple paths achieves O(log^2 n) time per output vertex, an improvement by a factor of O(n / log^2 n) over the previous algorithm [Hershberger, Snoeyink, Comput. Geom. Theory Appl. 4 (1994) 63–98], where n is the number of obstacles. The running time has an overhead of O(n^{2+ε}) for any positive constant ε. In the case k < n^{2+ε}, where k is the total size of the input and output, we improve the running time to O((n + k + (nk)^{2/3}) log^{O(1)} n).

6.
Let a text of u characters over an alphabet of size σ be compressible into n phrases by the LZ78 algorithm. We show how to build a data structure based on the Ziv–Lempel trie, called the LZ-index, that takes 4n log_2 n (1 + o(1)) bits of space (that is, 4 times the entropy of the text for ergodic sources) and reports the R occurrences of a pattern of length m in worst-case time O(m^3 log σ + (m + R) log n). We present a practical implementation of the LZ-index, which is faster than current alternatives when we take into consideration the time to report the positions or text contexts of the occurrences found.
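The phrase count n that governs the index size above comes from the LZ78 parse. The following sketch computes that parse (trie stored as a plain dictionary); it illustrates the parsing only, not the LZ-index itself, and the names are illustrative.

```python
def lz78_parse(text):
    """Return the LZ78 phrases of `text` as (previous-phrase id, new symbol) pairs."""
    trie = {}                # maps (node id, symbol) -> child id; node 0 is the empty phrase
    phrases = []
    node, next_id = 0, 1
    for c in text:
        if (node, c) in trie:            # keep extending the current phrase
            node = trie[(node, c)]
        else:                            # close the phrase and add a new trie node
            trie[(node, c)] = next_id
            phrases.append((node, c))
            node, next_id = 0, next_id + 1
    if node != 0:                        # a final phrase that repeats an earlier one
        phrases.append((node, ""))
    return phrases

phrases = lz78_parse("abababbbab")
print(len(phrases), phrases)   # 6 phrases: a, b, ab, abb, ba, b
```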

7.
We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern string in the given text (over a fixed-size alphabet) in O(m) time, where n is the size of the text and m is the length of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan (in Proceedings of the FST and TCS Preconference Workshop on Randomization, 1997, pp. 23–27). Previous compact representations of suffix trees either had a higher lower-order term in space and relied on an expectation assumption, or required more time for searching. When the size k of the alphabet is not viewed as a constant, this structure can be modified to use the same space but O(m lg k) time for string searching, or to use an additional O(n lg k) bits while keeping the O(m) search time. We then give several index structures for binary texts, using less space, including
• a structure that uses a suffix array (n lg n bits) together with a smaller auxiliary structure,
• an indexing structure that takes fewer bits of space, and
• an o(n lg n)-bit structure which answers in O(m) time the decision question of whether a given pattern of length m occurs in the text.
Each of these structures uses a different technique, either in the storage scheme or in the search algorithm, to reduce the space requirement. The first one uses a suffix array, a sparse suffix tree, and a table structure. Finding all the occurrences of a pattern using this structure takes O(m + s) time, where s is the number of occurrences of the pattern in the text. The second structure constructs a sparse suffix tree for all the suffixes that start with the bit occurring more often in the given binary text. The last structure uses an iterative algorithm to search for the pattern; it is the first o(n lg n)-bit index to support the decision version of indexing queries in time linear in the length of the pattern, but it does not support general indexing queries, where we want to find the positions of the occurrences of the pattern. Our main contribution is the development of techniques for using the succinct tree representation through balanced parentheses for suffix trees.
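The balanced-parentheses idea mentioned at the end is easy to see on a small tree: a depth-first traversal writes '(' on entering a node and ')' on leaving it, so an n-node tree costs 2n bits and subtree queries become arithmetic on the string. The sketch below is only that generic representation, not the suffix-tree machinery of the paper; the tree encoding as a dict is an assumption for the example.

```python
def to_parentheses(tree, root):
    """Encode a rooted tree (node -> list of children) as a balanced-parentheses string."""
    out = []
    def dfs(v):
        out.append("(")
        for child in tree.get(v, []):
            dfs(child)
        out.append(")")
    dfs(root)
    return "".join(out)

def subtree_size(bp, i):
    """Number of nodes in the subtree whose opening '(' sits at position i."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return (j - i + 1) // 2   # each node contributes one '(' and one ')'
    raise ValueError("unbalanced string")

tree = {"r": ["a", "b"], "a": ["c", "d"], "b": []}
bp = to_parentheses(tree, "r")
print(bp)                    # "((()())())"
print(subtree_size(bp, 1))   # subtree rooted at 'a' has 3 nodes
```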

8.
The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length-m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil–Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time O(n√(m log m)). We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time O(n√(k log k)). We also show an algorithm that solves the above problem in time O((n + (nk^3)/m) log k).
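For reference, the naive O(nm) computation of the quantity being bounded, the number of mismatches at every alignment, is shown below; the algorithms quoted above reach the same output much faster. Names are illustrative.

```python
def mismatches_at_every_location(text, pattern):
    """mismatches[i] = Hamming distance between pattern and text[i : i + m]."""
    n, m = len(text), len(pattern)
    return [sum(text[i + j] != pattern[j] for j in range(m)) for i in range(n - m + 1)]

def k_mismatch_locations(text, pattern, k):
    """All locations where the pattern matches with at most k mismatches."""
    return [i for i, d in enumerate(mismatches_at_every_location(text, pattern)) if d <= k]

print(mismatches_at_every_location("abcabcab", "abd"))  # [1, 3, 3, 1, 3, 3]
print(k_mismatch_locations("abcabcab", "abd", 1))       # [0, 3]
```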

9.
Our goal in this paper is to analyze carry propagation in addition using only elementary methods (that is, those not involving residues, contour integration, or methods of complex analysis). Our results concern the length of the longest carry chain when two independent uniformly distributed n-bit numbers are added. First, we show using just first- and second-moment arguments that the expected length C_n of the longest carry chain satisfies C_n = log_2 n + O(1). Second, we use a sieve (inclusion–exclusion) argument to give an exact formula for C_n. Third, we give an elementary derivation of an asymptotic formula due to Knuth, C_n = log_2 n + Φ(log_2 n) + O((log n)^4 / n), where Φ(ν) is a bounded periodic function of ν, with period 1, for which we give both a simple integral expression and a Fourier series. Fourth, we give an analogous asymptotic formula for the variance V_n of the length of the longest carry chain: V_n = Ψ(log_2 n) + O((log n)^5 / n), where Ψ(ν) is another bounded periodic function of ν, with period 1. Our approach can be adapted to addition with the “end-around” carry that occurs in the sign-magnitude and 1s-complement representations. Finally, our approach can be adapted to give elementary derivations of some asymptotic formulas arising in connection with radix-exchange sorting and collision-resolution algorithms, which have previously been derived using contour integration and residues.
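A quick Monte Carlo check of the first claim (E[C_n] ≈ log_2 n): add two random n-bit numbers bit by bit and record the longest run of consecutive carries. The run-length convention below (a chain is a maximal block of positions whose carry-out is 1) is our reading of "carry chain", and the trial counts are arbitrary.

```python
import math
import random

def longest_carry_chain(a_bits, b_bits):
    """Longest run of positions producing carry-out 1 when adding two bit vectors (LSB first)."""
    longest = current = carry = 0
    for a, b in zip(a_bits, b_bits):
        carry = 1 if a + b + carry >= 2 else 0
        current = current + 1 if carry else 0
        longest = max(longest, current)
    return longest

def average_chain_length(n, trials=2000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randint(0, 1) for _ in range(n)]
        b = [rng.randint(0, 1) for _ in range(n)]
        total += longest_carry_chain(a, b)
    return total / trials

for n in (64, 256, 1024):
    print(n, round(average_chain_length(n), 2), "vs log2 n =", round(math.log2(n), 2))
```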

10.
Cheriyan and Hagerup developed a randomized algorithm to compute the maximum flow in a graph with n nodes and m edges in O(mn + n^2 log^2 n) expected time. The randomization is used to efficiently play a certain combinatorial game that arises during the computation. We give a version of their algorithm in which a more general version of their game arises. We then give a strategy for the game that yields a deterministic algorithm for computing the maximum flow in a directed graph with n nodes and m edges that runs in time O(mn log_{m/(n log n)} n). Our algorithm gives an O(mn) deterministic algorithm for all m/n = Ω(n^ε) for any positive constant ε, and is currently the fastest deterministic algorithm for computing maximum flow as long as m/n = ω(log n).

11.
In the collect problem (M. Saks, N. Shavit, and H. Woll, in “Proceedings of the 2nd ACM–SIAM Symposium on Discrete Algorithms,” 1991), n processors in a shared-memory system must each learn the values of n registers. We give a randomized algorithm that solves the collect problem in O(n log^3 n) total read and write operations with high probability, even if timing is under the control of a content-oblivious adversary (a slight weakening of the usual adaptive adversary). This improves on both the trivial upper bound of O(n^2) steps and the best previously known bound of O(n^{3/2} log n) steps, and is close to the lower bound of Ω(n log n) steps. Furthermore, we show how this algorithm can be used to obtain a multiuse cooperative collect protocol that is O(log^3 n)-competitive in the latency model of Ajtai et al. (“Proceedings of the 33rd IEEE Symposium on Foundations of Computer Science,” 1994), and O(n^{1/2} log^{3/2} n)-competitive in the throughput model of Aspnes and Waarts (“Proceedings of the 28th ACM Symposium on Theory of Computing,” 1996). In both cases the competitive ratios are within a polylogarithmic factor of optimal.

12.
We introduce a new multidimensional pattern matching problem that is a natural generalization of string matching, a well studied problem [1]. The motivation for its algorithmic study is mainly theoretical. Let A[1:n_1, …, 1:n_d] be a text matrix with N = n_1 ⋯ n_d entries and B[1:m_1, …, 1:m_r] be a pattern matrix with M = m_1 ⋯ m_r entries, where d ≥ r ≥ 1 (the matrix entries are taken from an ordered alphabet Σ). We study the problem of checking whether some r-dimensional submatrix of A is equal to B (i.e., a decision query). A can be preprocessed and B is given on-line. We define a new data structure for preprocessing A and propose CRCW-PRAM algorithms that build it in O(log N) time with N^2 / n_max processors, where n_max = max(n_1, …, n_d), such that the decision query for B takes O(M) work and O(log M) time. Using known techniques, we would get the same preprocessing bounds but an O((d choose r)·M) work bound for the decision query. The latter bound is undesirable since it can depend exponentially on d; our bound, in contrast, is independent of d and optimal. We can also answer, in optimal work, two further types of queries: (a) an enumeration query retrieving all the r-dimensional submatrices of A that are equal to B, and (b) an occurrence query retrieving only the distinct positions in A that correspond to all of these submatrices. As a byproduct, we also derive the first efficient sequential algorithms for the new problem.
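The d = r = 2 case of the decision query, done naively, is sketched below; it is only meant to pin down what "some r-dimensional submatrix of A equals B" asks (here: a contiguous block), not to reflect the CRCW-PRAM construction of the paper.

```python
def occurs(A, B):
    """Decision query: does the 2-D pattern B occur as a contiguous submatrix of A?"""
    n1, n2 = len(A), len(A[0])
    m1, m2 = len(B), len(B[0])
    for i in range(n1 - m1 + 1):
        for j in range(n2 - m2 + 1):
            if all(A[i + r][j + c] == B[r][c] for r in range(m1) for c in range(m2)):
                return True
    return False

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(occurs(A, [[5, 6], [8, 9]]))  # True
print(occurs(A, [[2, 3], [4, 5]]))  # False
```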

13.
We consider several problems involving points and planes in three dimensions. Our main results are: (i) The maximum number of faces bounding m distinct cells in an arrangement of n planes is O(m^{2/3} n log n + n^2); we can calculate m such cells, specified by a point in each, in worst-case time O(m^{2/3} n log^3 n + n^2 log n). (ii) The maximum number of incidences between n planes and m vertices of their arrangement is O(m^{2/3} n log n + n^2), but this number is only O(m^{3/5−δ} n^{4/5+2δ} + m + n log m), for any δ > 0, for any collection of points no three of which are collinear. (iii) For an arbitrary collection of m points, we can calculate the number of incidences between them and n planes by a randomized algorithm whose expected time complexity is O((m^{3/4−δ} n^{3/4+3δ} + m) log^2 n + n log n log m) for any δ > 0. (iv) Given m points and n planes, we can find the plane lying immediately below each point in randomized expected time O((m^{3/4−δ} n^{3/4+3δ} + m) log^2 n + n log n log m) for any δ > 0. (v) The maximum number of facets (i.e., (d−1)-dimensional faces) bounding m distinct cells in an arrangement of n hyperplanes in d dimensions, d > 3, is O(m^{2/3} n^{d/3} log n + n^{d−1}). This is also an upper bound for the number of incidences between n hyperplanes in d dimensions and m vertices of their arrangement. The combinatorial bounds in (i) and (v) and the general bound in (ii) are almost tight. Work on this paper by the first author has been supported by Amoco Fnd. Fac. Dev. Comput. Sci. 1-6-44862 and by NSF Grant CCR-8714565. Work by the third author has been supported by Office of Naval Research Grant N00014-87-K-0129, by National Science Foundation Grant DCR-82-20085, by grants from the Digital Equipment Corporation and the IBM Corporation, and by a research grant from the NCRD, the Israeli National Council for Research and Development. An abstract of this paper has appeared in the Proceedings of the 13th International Mathematical Programming Symposium, Tokyo, 1988, p. 147.

14.
The compressed matching problem is the problem of finding all occurrences of a pattern in a compressed text. In this paper we discuss the 2-dimensional compressed matching problem in Lempel–Ziv compressed images. Given a pattern P of (uncompressed) size m×m and a text T of (uncompressed) size n×n, both in 2D-LZ compressed form, our algorithm finds all occurrences of P in T. The algorithm is strongly in-place, that is, the amount of extra space used is proportional to the best possible compression of a pattern of size m^2. The best compression that the 2D-LZ technique can obtain for a file of size m^2 is O(m). The time for performing the search is O(n^2) and the preprocessing time is O(m^3). Our algorithm is general in the sense that it can be used for any 2D compression scheme that can be sequentially decompressed in small space.

15.
In this paper we discuss the problem of finding optimal prefix-free codes for unequal letter costs, a variation of the classical Huffman coding problem. Our problem consists of finding a minimal-cost prefix-free code in which the encoding alphabet consists of unequal-cost (length) letters, with lengths α and β. The most efficient previously known algorithm requires O(n^{max(α,β)+2}) time to construct such a minimal-cost set of n codewords, provided α and β are integers. In this paper we provide an O(n·max(α, β))-time algorithm. Our improvement comes from the use of a more sophisticated modeling of the problem, combined with the observation that the problem possesses a “Monge property” and that the SMAWK algorithm on monotone matrices can therefore be applied.
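To make the cost model concrete: a codeword's cost is α times its number of cheap letters plus β times its number of expensive letters, and the code's cost is the sum over all n codewords. The sketch below is a simple greedy heuristic (repeatedly split the currently cheapest leaf of the code tree), given only to illustrate that cost model on small inputs; it is not the paper's Monge/SMAWK-based algorithm, and no optimality claim is made for it here.

```python
import heapq

def greedy_code(n, alpha, beta):
    """Build n prefix-free codewords over letters 'a' (cost alpha) and 'b' (cost beta)
    by repeatedly replacing the cheapest leaf with its two extensions."""
    heap = [(0, "")]                       # (cost, codeword); the root alone is one leaf
    while len(heap) < n:
        cost, word = heapq.heappop(heap)   # split the cheapest leaf into two leaves
        heapq.heappush(heap, (cost + alpha, word + "a"))
        heapq.heappush(heap, (cost + beta, word + "b"))
    return sorted(heap)

code = greedy_code(5, alpha=1, beta=3)
print(code)                                # leaves of a binary tree, hence prefix-free
print("total cost:", sum(c for c, _ in code))
```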

16.
Bioinformatics, the discipline that studies the computational problems arising from molecular biology, poses many interesting problems to the string searching community. We describe two problems arising from Bioinformatics, their preliminary solutions, and the more general problems that they pose. The first problem is searching for α-helices in protein sequences. This particular instance of the search is based on matching of hydrophobicity/hydrophilicity. We give an algorithm that is linear in the sequence length for a fixed helix length and is O(n log n) for arbitrary helix length. The second problem is matching probabilistic sequences against plain sequences or against other probabilistic sequences. In both cases we derive efficient formulas to compute scores according to a Markovian model of evolution.
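One standard way to score a probabilistic (profile) sequence against a plain sequence is to sum per-position log-probabilities; the sketch below does exactly that for a toy profile. It is a generic illustration of the kind of score involved, not the specific Markovian formulas derived in the paper, and the profile values are made up.

```python
import math

def profile_log_score(profile, s):
    """Log-likelihood of string s under a position-wise probability profile."""
    assert len(profile) == len(s)
    return sum(math.log(pos_probs[c]) for pos_probs, c in zip(profile, s))

# A toy length-3 profile over the alphabet {A, C, G, T}.
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
for s in ("AGT", "CCC"):
    print(s, round(profile_log_score(profile, s), 3))
```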

17.
We establish an O(n log^2 n) upper bound on the time for deterministic distributed broadcasting in multi-hop radio networks with unknown topology. This nearly matches the known lower bound of Ω(n log n). The fastest previously known algorithm for this problem works in time O(n^{3/2}). Using our broadcasting algorithm, we develop an O(n^{3/2} log^2 n) algorithm for gossiping in the same network model.

18.
It is well known that one can sort (order) n real numbers in at most F_0(n) = nl − 2^l + 1 steps (comparisons), where l = ⌈log_2 n⌉. We show how to find the strict saddlepoint of an m by n matrix, m ≤ n, or prove its absence, in at most F_0(m) + F_0(m+1) + n + m − 3 + (n − m)⌈log_2(m+1)⌉ steps.
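The comparison count F_0(n) is the familiar binary-insertion-sort bound; the sketch below evaluates it and, for contrast, finds a strict saddlepoint by brute force. The saddlepoint convention used (an entry strictly smallest in its row and strictly largest in its column) is one common choice and may differ from the paper's; the whole snippet is illustrative only.

```python
import math

def F0(n):
    """n*ceil(log2 n) - 2**ceil(log2 n) + 1: comparisons used by binary insertion sort."""
    l = math.ceil(math.log2(n))
    return n * l - 2 ** l + 1

def strict_saddlepoint(M):
    """Brute-force search under one common convention: strictly smallest in its row
    and strictly largest in its column. Returns (row, col, value) or None."""
    rows, cols = len(M), len(M[0])
    for i in range(rows):
        for j in range(cols):
            x = M[i][j]
            row_ok = all(x < M[i][c] for c in range(cols) if c != j)
            col_ok = all(x > M[r][j] for r in range(rows) if r != i)
            if row_ok and col_ok:
                return i, j, x
    return None

print([F0(n) for n in (2, 3, 4, 8)])        # [1, 3, 5, 17]
print(strict_saddlepoint([[1, 2], [3, 4]])) # (1, 0, 3)
```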

19.
We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv–Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions. On LZ78/LZW we need O(mkn + R) time in the worst case and O(k^2 n + mk·min(n, (mσ)^k) + R) on average, where σ is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2× for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.
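On the uncompressed text, the same reporting problem is solved by the classical O(mu) dynamic programming for matching with k differences; the sketch below is that baseline, useful for checking outputs, and is not the compressed-domain algorithm of the paper.

```python
def approx_matches(text, pattern, k):
    """All end positions j such that pattern matches a substring of text ending at j
    with at most k insertions, deletions or substitutions."""
    m = len(pattern)
    prev = list(range(m + 1))          # DP column for the empty text prefix
    hits = []
    for j, t in enumerate(text):
        curr = [0] * (m + 1)           # D[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t else 1
            curr[i] = min(prev[i - 1] + cost,   # substitute / match
                          prev[i] + 1,          # symbol inserted into the text
                          curr[i - 1] + 1)      # symbol deleted from the text
        if curr[m] <= k:
            hits.append(j)
        prev = curr
    return hits

print(approx_matches("surgery", "survey", 2))   # [4, 5, 6]
```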

20.
Let p and q be two permutations over {1, 2, …, n}. We denote by m(p, q) the number of integers i, 1 ≤ i ≤ n, such that p(i) = q(i). For each fixed permutation p, a query is a permutation q of the same size, and the answer a(q) to this query is m(p, q). We investigate the problem of finding the minimum number of queries required to identify an unknown permutation p. A polynomial-time algorithm that identifies a permutation of size n using O(n log_2 n) queries is presented. The lower bound for this problem is also considered. It is proved that the problem of determining the size of the search space created by a given set of queries and answers is #P-complete. Since this counting problem is essential for the analysis of the lower bound, a complete analysis of the lower bound appears infeasible. We conjecture, based on some preliminary analysis, that the lower bound is Ω(n log_2 n).
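The query model is easy to simulate: the sketch below computes m(p, q) and runs a brute-force identification loop on a tiny instance by pruning the candidate set, just to make the search-space notion concrete. It is not the paper's O(n log_2 n)-query algorithm, and the random-query strategy and names are assumptions for the example.

```python
from itertools import permutations
import random

def m(p, q):
    """Number of positions i with p(i) = q(i)."""
    return sum(pi == qi for pi, qi in zip(p, q))

def identify(hidden, n, rng):
    """Identify the hidden permutation with random queries, pruning inconsistent candidates."""
    candidates = list(permutations(range(1, n + 1)))
    queries = 0
    while len(candidates) > 1:
        q = list(range(1, n + 1))
        rng.shuffle(q)                      # a random query permutation
        answer = m(hidden, q)               # the oracle's answer a(q)
        candidates = [c for c in candidates if m(c, q) == answer]
        queries += 1
    return candidates[0], queries

rng = random.Random(7)
hidden = (3, 1, 4, 2, 5)
found, used = identify(hidden, 5, rng)
print(found == hidden, "queries:", used)
```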
