首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We present three new algorithms for on‐line multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length, and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We theoretically analyze when each algorithm should be used, and show their performance experimentally. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50–100 on English text, depending on the pattern length). © 2002 John Wiley & Sons, Inc. Random Struct. Alg., 20, 23–49, 2002  相似文献   

2.
String matching is the problem of finding all the occurrences of a pattern in a text. We present a new method to compute the combinatorial shift function (“matching shift”) of the well-known Boyer–Moore string matching algorithm. This method implies the computation of the length of the longest suffixes of the pattern ending at each position in this pattern. These values constituted an extra-preprocessing for a variant of the Boyer–Moore algorithm designed by Apostolico and Giancarlo. We give here a new presentation of this algorithm that avoids extra preprocessing together with a tight bound of 1.5n character comparisons (where n is the length of the text).  相似文献   

3.
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed-up the performances of classical string matching algorithms. In this model an algorithm operates on words of length w, grouping blocks of characters, and arithmetic and logic operations on the words take one unit of time.In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design a very fast string matching algorithm. We evaluate our solution in terms of efficiency, stability and flexibility, where we propose to use the deviation in running time of an algorithm on distinct equal length patterns as a measure of stability.From our experimental results it turns out that, despite their quadratic worst case time complexity, the new presented algorithm becomes the clear winner on the average in many cases, when compared against the most recent and effective algorithms known in literature.  相似文献   

4.
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Condensed neighborhoods, however, are not a minimal representation of a pattern neighborhood. Super condensed neighborhoods, proposed in this work, are smaller, provably minimal and can be used to locate approximate matches that can later be extended by on-line search. We present an algorithm for generating Super Condensed Neighborhoods. The algorithm can be implemented either by using dynamic programming or non-deterministic automata. The time complexity is O(ms) for the first case and O(kms) for the second, where m is the pattern size, s is the size of the super condensed neighborhood and k the number of errors. Previous algorithms depended on the size of the condensed neighborhood instead. These algorithms can be implemented using Bit-Parallelism and Increased Bit-Parallelism techniques. Our experimental results show that the resulting algorithms are fast and achieve significant speedups, when compared with the existing proposals that use condensed neighborhoods.  相似文献   

5.
We show how to determine whether a given pattern p of length m occurs in a given text t of length n in time (where allows for logarithmic factors in m and n/m) with inverse polynomial failure probability. This algorithm combines quantum searching algorithms with a technique from parallel string matching, called Deterministic Sampling.  相似文献   

6.
The subsequence matching problem is to decide, for given strings S and T, whether S is a subsequence of T. The string S is called the pattern and the string T the text. We consider the case of multiple texts and show how to solve the subsequence matching problem in time linear in the length of the pattern. For this purpose we build an automaton that accepts all subsequences of given texts. This automaton is called the Directed Acyclic Subsequence Graph (DASG). We prove an upper bound for its number of states. Furthermore, we consider a modification of the subsequence matching problem: given a string S and a finite language L, we are to decide whether S is a subsequence of any string in L. We suppose that a finite automaton accepting L is given and present an algorithm for building the DASG for language L. We also mention applications of the DASG to some problems related to subsequences.  相似文献   

7.
In this paper, we study the approximate string matching problem under a string distance whose edit operations are translocations of equal length factors. We extend a graph-theoretic approach proposed by Rahman and Illiopoulos (2008) to model our problem. In the sequel, we devise efficient algorithms based on this model to solve a number of variants of the string matching problem with translocations.  相似文献   

8.
The string matching with mismatches problem requires finding the Hamming distance between a pattern P of length m and every length m substring of text T with length n. Fischer and Paterson's FFT-based algorithm solves the problem without error in O(σnlogm), where σ is the size of the alphabet Σ [SIAM–AMS Proc. 7 (1973) 113–125]. However, this in the worst case reduces to O(nmlogm). Atallah, Chyzak and Dumas used the idea of randomly mapping the letters of the alphabet to complex roots of unity to estimate the score vector in time O(nlogm) [Algorithmica 29 (2001) 468–486]. We show that the algorithm's score variance can be substantially lowered by using a bijective mapping, and specifically to zero in the case of binary and ternary alphabets. This result is extended via alphabet remappings to deterministically solve the string matching with mismatches problem with a constant factor of 2 improvement over Fischer–Paterson's method.  相似文献   

9.
The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil–Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time . We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time . We also show an algorithm that solves the above problem in time O((n+(nk3)/m)logk).  相似文献   

10.
《Journal of Complexity》1993,9(3):339-365
We study the exact number of symbol comparisons that are required to solve the string matching problem and present a family of efficient algorithms. Unlike previous string matching algorithms, the algorithms in this family do not "forget" results of comparisons, what makes their analysis much simpler. In particular, we give a linear-time algorithm that finds all occurrences of a pattern of length m in a text of length n in [formula] comparisons. The pattern preprocessing takes linear time and makes at most 2m comparisons. This algorithm establishes that, in general, searching for a long pattern is easier than searching for a short one. We also show that any algorithm in the family of the algorithms presented must make at least [formula] symbol comparisons, for m = 2k − 1 and any integer k ≥ 1.  相似文献   

11.
Two strings parameterize match if there is a bijection defined on the alphabet that transforms the first string character by character into the second string. The problem of finding all parameterized matches of a pattern in a text has been studied in both one and two dimensions but the research has been centered on developing algorithms with good worst-case performance. We present algorithms that solve this problem in sublinear time on average for moderately repetitive patterns.  相似文献   

12.
Bioinformatics, the discipline which studies the computational problems arising from molecular biology, poses many interesting problems to the string searching community. We will describe two problems arising from Bioinformatics, their preliminary solutions, and the more general problem that they pose. The first problem is searching for α-helices in protein sequences. This particular instance of the search is based on matching of hydrophobicity/hydrophilicity. We find an algorithm which is linear in the sequence length for fixed helix length and is O(nlogn) for any helix length. The second problem is on matching probabilistic sequences against sequences or against other probabilistic sequences. In both cases we derive efficient formulas to compute scores according to a Markovian model of evolution.  相似文献   

13.
Let a text string T of n symbols and a pattern string P of m symbols from alphabet Σ be given. A swapped version T′ of T is a length n string derived from T by a series of local swaps (i.e., t ← tℓ + 1 and tℓ + 1 ← t), where each element can participate in no more than one swap. The pattern matching with swaps problem is that of finding all locations i for which there exists a swapped version T′ of T with an exact matching of P in location i of T′. It has been an open problem whether swapped matching can be done in less than O(nm) time. In this paper we show the first algorithm that solves the pattern matching with swaps problem in time o(nm). We present an algorithm whose time complexity is O(nm1/3 log m log σ) for a general alphabet Σ, where σ = min(m,Σ).  相似文献   

14.
Bitwise operations are executed very fast in computer architecture. Algorithms aiming to benefit from this intrinsic property can be classified as bit-parallel algorithms. Bit-parallelism has been widely investigated in the pattern matching area since the introduction of the Shift-Or algorithm. In the original idea, there is no shift mechanism, and the input pattern length is required to be less than the computer word size (W) to benefit from the full power of bit-parallelism. The lack of the shift mechanism was removed by the succeeding algorithms of this genre, but W limitation has not been overcome in an elegant way. This study proposes a new bit-parallel algorithm, given name BLIM (bit-parallel length independent matching), for exact pattern matching that does not restrict the input pattern to be shorter than the word size. The multiple pattern case is also addressed, and it is shown that up to computer word size number of patterns, whatever their lengths are, can be searched simultaneously in a single bit-parallel framework. Similar to other algorithms of this genre, BLIM is also capable of handling fixed-length gaps and character classes in the input strings as well. The proposed algorithm is compared with the other alternatives of its class, mainly the shift-or and BNDM variants. Experimental results indicate that BLIM is compatible with the previous bit-parallel algorithms with an additional gain of overcoming the word size limitation.  相似文献   

15.
In this note we present three efficient variations of the occurrence heuristic, adopted by many exact string matching algorithms and first introduced in the well-known Boyer–Moore algorithm. Our first heuristic, called improved-occurrence heuristic, is a simple improvement of the rule introduced by Sunday in his Quick-Search algorithm. Our second heuristic, called worst-occurrence heuristic, achieves its speed-up by selecting the relative position which yields the largest average advancement. Finally, our third heuristic, called jumping-occurrence heuristic, uses two characters for computing the next shift. Setting the distance between these two characters optimally allows one to maximize the average advancement. The worst-occurrence and jumping-occurrence heuristics tune their parameters according to the distribution of the characters in the text. Experimental results show that the newly proposed heuristics achieve very good results on average, especially in the case of small alphabets.  相似文献   

16.
Parallel algorithms for evaluating arithmetic expressions generally assume the computation tree form to be at hand. The computation tree form can be generated within the same resource bounds as the parenthesis matching problem can be solved. We provide a new cost optimal parallel algorithm for the latter problem, which runs in time O(log n) using O(n/log n) processors on an . We also prove that the algorithm is the fastest possible independently of the number of processors available.  相似文献   

17.
We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv–Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to kinsertions, deletions and substitutions. On LZ78/LZW we need O(mkn+R) time in the worst case and O(k2n+mkmin(n,(mσ)k)+R) on average where σ is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2X for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.  相似文献   

18.
在基因工程中,经常需要在一个较长的DNA链中寻找一小段DNA片段.本文提出了一个新的匹配算法使得当对一个长为n的DNA链t进行检索时,在最坏的情况下克只需要比较n次就能找到一个预先给定的长为m的DNA片段户在t中所有出现的地方.而且对该算法稍加改动即可用于一般的关键字搜索(或称串匹配).在同类算法中,该算法可能是迄今为止最有效的.  相似文献   

19.
Replacing suffix trees with enhanced suffix arrays   总被引:9,自引:0,他引:9  
The suffix tree is one of the most important data structures in string processing and comparative genomics. However, the space consumption of the suffix tree is a bottleneck in large scale applications such as genome analysis. In this article, we will overcome this obstacle. We will show how every algorithm that uses a suffix tree as data structure can systematically be replaced with an algorithm that uses an enhanced suffix array and solves the same problem in the same time complexity. The generic name enhanced suffix array stands for data structures consisting of the suffix array and additional tables. Our new algorithms are not only more space efficient than previous ones, but they are also faster and easier to implement.  相似文献   

20.
Abstract

A comparison and evaluation is made of recent proposals for multivariate matched sampling in observational studies, where the following three questions are answered: (1) Algorithms: In current statistical practice, matched samples are formed using “nearest available” matching, a greedy algorithm. Greedy matching does not minimize the total distance within matched pairs, though good algorithms exist for optimal matching that do minimize the total distance. How much better is optimal matching than greedy matching? We find that optimal matching is sometimes noticeably better than greedy matching in the sense of producing closely matched pairs, sometimes only marginally better, but it is no better than greedy matching in the sense of producing balanced matched samples. (2) Structures: In common practice, treated units are matched to one control, called pair matching or 1–1 matching, or treated units are matched to two controls, called 1–2 matching, and so on. It is known, however, that the optimal structure is a full matching in which a treated unit may have one or more controls or a control may have one or more treated units. Optimal 1 — k matching is compared to optimal full matching, finding that optimal full matching is often much better. (3) Distances: Matching involves defining a distance between covariate vectors, and several such distances exist. Three recent proposals are compared. Practical advice is summarized in a final section.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号