Similar Documents (20 results)
1.
2.
We consider the problem of fingerprinting text by sets of symbols. Specifically, if S is a string of length n over a finite, ordered alphabet Σ, and S′ is a substring of S, then the fingerprint of S′ is the subset φ of Σ consisting of precisely the symbols appearing in S′. In this paper we show efficient methods of answering various queries on fingerprint statistics. Our preprocessing is done in time O(n|Σ| log n log |Σ|) and enables answering the following queries (a naive reference sketch in Python follows the list):
(1) Given an integer k, compute the number of distinct fingerprints of size k in time O(1).
(2) Given a set φ ⊆ Σ, compute the total number of distinct occurrences in S of substrings with fingerprint φ in time O(|Σ| log n).
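As a point of reference, the following Python sketch (all names are ours, and it is deliberately naive, running in O(n² |Σ|) rather than at the paper's preprocessing bound) enumerates substring fingerprints and answers both queries by table lookup; here an occurrence is counted once per substring interval, which is one reading of query (2):

    from collections import defaultdict

    def fingerprint_tables(s):
        """Naively enumerate the fingerprints of all substrings of s.

        Returns:
          count_by_size: k -> number of distinct fingerprints of size k (query 1)
          occ: fingerprint (frozenset) -> number of substring occurrences
               whose symbol set is exactly that fingerprint (query 2)
        """
        occ = defaultdict(int)
        for i in range(len(s)):
            seen = set()
            for j in range(i, len(s)):
                seen.add(s[j])
                occ[frozenset(seen)] += 1
        count_by_size = defaultdict(int)
        for fp in occ:
            count_by_size[len(fp)] += 1
        return count_by_size, occ

    count_by_size, occ = fingerprint_tables("abac")
    print(count_by_size[2])        # 2 distinct size-2 fingerprints: {a,b}, {a,c}
    print(occ[frozenset("ab")])    # 3 occurrences with fingerprint {a,b}: ab, aba, ba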

3.
Adaptive text mining: inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.

4.
Character sets of strings
Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially:
(1) Output all the maximal locations of substrings having a given character set.
(2) Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C.
Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n²) time using Θ(n) space. The second algorithm has Θ(n|Σ| log |Σ|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13].
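For the first problem, a maximal location of a character set C is an occurrence that cannot be extended on either side without leaving C; within any maximal run of C-symbols, the location is therefore forced to be the whole run. A minimal Python sketch under that definition (function names are ours, not the paper's):

    def maximal_locations(s, C):
        """All maximal locations (i, j), inclusive, of character set C in s.

        Scans the maximal runs of characters drawn from C and keeps the
        runs that use every symbol of C; runs in Θ(n) time for a fixed C.
        """
        C = set(C)
        res, i, n = [], 0, len(s)
        while i < n:
            if s[i] not in C:
                i += 1
                continue
            j = i
            while j < n and s[j] in C:
                j += 1
            if set(s[i:j]) == C:        # the run realizes the full set C
                res.append((i, j - 1))
            i = j
        return res

    print(maximal_locations("abcabbbac", {"a", "b"}))   # [(0, 1), (3, 7)]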

5.
The two-dimensional representation of documents, which allows documents to be represented in a two-dimensional Cartesian plane, has proved to be a valid visualization tool for Automated Text Categorization (ATC): it helps in understanding the relationships between categories of textual documents, and helps users to visually audit the classifier and identify suspicious training data. This paper analyzes a specific use of this visualization approach in the case of the Naive Bayes (NB) model for text classification and the Binary Independence Model (BIM) for text retrieval. For text categorization, the classification decision equation is reformulated so that each coordinate of a document is the sum of two addends: a variable component log P(d|ci), and a constant component log P(ci), the prior of the category. When the documents are plotted in the Cartesian plane according to this formulation, one can see that they are shifted by a constant amount along the x-axis and the y-axis. This shifting effect is more or less evident according to which NB model, Bernoulli or multinomial, is chosen. For text retrieval, the same reformulation can be applied to the BIM model. The visualization helps in understanding the decisions taken to rank the documents, in particular in the case of relevance feedback.
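A minimal sketch of such coordinates, assuming a multinomial NB model with pre-estimated (hypothetical) parameters; each axis is the constant addend log P(ci) plus the variable addend log P(d|ci):

    import math
    from collections import Counter

    def nb_coordinates(doc_tokens, prior, word_logprob):
        """2-D coordinates of a document for a binary multinomial NB classifier.

        prior[c] = P(c); word_logprob[c][w] = log P(w|c). The fallback
        log-probability for unseen words is an arbitrary smoothing choice.
        """
        counts = Counter(doc_tokens)
        coords = []
        for c in ("c1", "c2"):
            variable = sum(n * word_logprob[c].get(w, math.log(1e-6))
                           for w, n in counts.items())
            coords.append(math.log(prior[c]) + variable)
        return tuple(coords)    # plot x for class c1, y for class c2

In such a plot the decision boundary is the line y = x, and changing a prior shifts every document by the same constant along one axis, which is exactly the shifting effect discussed above.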

6.
We present a new index for approximate string matching. The index collects text q-samples, that is, disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. Hence the index points out the text areas that could contain occurrences and must be verified. The index parameters permit load balancing between filtering and verification work, and provide a compromise between the space requirement of the index and the error level for which the filtration is still efficient. We show experimentally that the index is competitive against others that take more space, being in fact the fastest choice for intermediate error levels, an area where no current index is useful.
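The following toy Python sketch shows the flavor of the filtration step: it stores q-samples taken every h positions and reports the text areas a search would need to verify. Unlike the real index, it matches samples inside the pattern exactly rather than approximately, and all names are ours:

    def build_qsample_index(text, q, h):
        """Store the positions of disjoint q-samples taken every h >= q positions."""
        index = {}
        for i in range(0, len(text) - q + 1, h):
            index.setdefault(text[i:i + q], []).append(i)
        return index

    def candidate_areas(index, pattern, q):
        """Start positions of text areas that may contain the pattern."""
        areas = set()
        for j in range(len(pattern) - q + 1):       # q-grams of the pattern
            for pos in index.get(pattern[j:j + q], []):
                areas.add(max(0, pos - j))          # align sample with pattern
        return sorted(areas)                         # verify only these areas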

7.
New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing: it occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies n log |Σ| bits for the alphabet Σ. In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add the new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find the occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time for any fixed 1 ≥ ε > 0 without access to the text. The index can also decompress a part of the text of length m in O(m + log^ε n) time. For a text of length n on an alphabet Σ of at most polylogarithmic size, our new index occupies only ε⁻¹ H₀ n + O(n log log |Σ|) bits, where H₀ is the order-0 entropy of the text. In particular, for ε = 1 the size is H₀ n + O(n log log |Σ|) bits. Therefore the index will be smaller than the text, which means we can perform fast queries on compressed texts.

8.
It is shown how to compute the lexicographically maximum suffix of a string of n ≥ 2 characters over a totally ordered alphabet using at most (4/3)n − 5/3 three-way character comparisons. The best previous bound, which has stood unchallenged for more than 25 years, is (3/2)n − O(1) comparisons. We also prove an interesting property of an algorithm for computing the maximum suffix both with respect to a total order < and with respect to its inverse order >.
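For comparison, the standard linear-time maximum-suffix scan (roughly 2n comparisons, so well above the paper's (4/3)n − 5/3 bound) looks as follows in Python:

    def maximum_suffix(s):
        """Index at which the lexicographically maximum suffix of s starts.

        Classic two-index scan: i is the best start found so far, j a
        challenger, k the length of their current common prefix.
        """
        i, j, k, n = 0, 1, 0, len(s)
        while j + k < n:
            c, d = s[i + k], s[j + k]
            if c == d:
                k += 1
            elif c > d:
                j = j + k + 1           # challenger loses; skip the mismatch
                k = 0
            else:
                i = max(i + k + 1, j)   # previous best loses
                j = i + 1
                k = 0
        return i

    s = "abaab"
    print(s[maximum_suffix(s):])        # baab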

9.
Dimension reduction is a well-known pre-processing step in text clustering to remove irrelevant, redundant and noisy features without sacrificing performance of the underlying algorithm. Dimension reduction methods are primarily classified as feature selection (FS) methods and feature extraction (FE) methods. Though FS methods are robust against irrelevant features, they occasionally fail to retain important information present in the original feature space. On the other hand, though FE methods reduce dimensions in the feature space without losing much information, they are significantly affected by irrelevant features. The one-stage models (FS or FE methods) and the two-stage models (combinations of FS and FE methods) proposed in the literature are not sufficient to fulfil all the above-mentioned requirements of dimension reduction. Therefore, we propose three-stage dimension reduction models to remove irrelevant, redundant and noisy features in the original feature space without losing much valuable information. These models incorporate the advantages of the FS and the FE methods to create a low-dimensional feature subspace. The experiments over three well-known benchmark text datasets of different characteristics show that the proposed three-stage models significantly improve performance of the clustering algorithm as measured by micro F-score, macro F-score, and total execution time.
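A hedged sketch of one possible three-stage pipeline for clustering; the concrete stage choices here (variance filtering, top-variance selection, truncated SVD) are illustrative assumptions, not the paper's models, and X is assumed to be a scipy-sparse tf-idf document-term matrix:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def three_stage_reduce(X, k_top=2000, n_components=100):
        """Reduce a sparse tf-idf document-term matrix X in three stages:
        1. FS: drop noisy, near-constant features (tiny variance);
        2. FS: keep the k_top remaining features with the highest variance;
        3. FE: compress the selected subspace with truncated SVD.
        """
        mean = np.asarray(X.mean(axis=0)).ravel()
        var = np.asarray(X.power(2).mean(axis=0)).ravel() - mean ** 2
        keep = np.flatnonzero(var > 1e-8)                # stage 1
        order = np.argsort(var[keep])[::-1][:k_top]
        X_sel = X[:, keep[order]]                        # stage 2
        return TruncatedSVD(n_components).fit_transform(X_sel)  # stage 3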

10.
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed up the performance of classical string matching algorithms. In this model an algorithm operates on words of length w, grouping blocks of characters, and arithmetic and logic operations on the words take one unit of time.

In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design a very fast string matching algorithm. We evaluate our solution in terms of efficiency, stability and flexibility, where we propose to use the deviation in running time of an algorithm on distinct equal-length patterns as a measure of stability.

From our experimental results it turns out that, despite its quadratic worst-case time complexity, the newly presented algorithm is the clear winner on average in many cases, when compared against the most recent and effective algorithms known in the literature.
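The SSE-specific algorithm cannot be reproduced in a few lines, but the word-RAM idea it builds on can: the classic bit-parallel Shift-And matcher packs the whole pattern state into one machine word (here emulated by a Python int); this is a related word-RAM technique, not the paper's algorithm:

    def shift_and(text, pattern):
        """Bit-parallel Shift-And matching: one logic update per text
        character for patterns of at most the word length w."""
        m = len(pattern)
        mask = {}
        for i, ch in enumerate(pattern):        # per-character bit masks
            mask[ch] = mask.get(ch, 0) | (1 << i)
        state, accept, out = 0, 1 << (m - 1), []
        for j, ch in enumerate(text):
            state = ((state << 1) | 1) & mask.get(ch, 0)
            if state & accept:
                out.append(j - m + 1)           # occurrence ending at j
        return out

    print(shift_and("abracadabra", "abra"))     # [0, 7]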

11.
We use character sums to derive new bounds on the additive energy of the set of distances (counted with multiplicities) between two subsets of a vector space over a given finite field. We also give applications to sumsets of distance sets.
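For reference, the additive energy being bounded is the standard quantity:

    % Additive energy of a finite set D (here, a distance set) of elements
    % of a finite field F_q:
    \[
      E(D) \;=\; \#\bigl\{\, (a,b,c,d) \in D^4 \;:\; a + b = c + d \,\bigr\}.
    \]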

12.
Various difficulties have been encountered in using decision set-based vector maximization methods to solve a multiple-objective linear programming problem (MOLP). Motivated by these difficulties, Benson recently developed a finite, outer-approximation algorithm for generating the set of all efficient extreme points in the outcome set, rather than in the decision set, of problem (MOLP). In this article, we show that the Benson algorithm also generates the set of all weakly efficient points in the outcome set of problem (MOLP). As a result, the usefulness of the algorithm as a decision aid in multiple objective linear programming is further enhanced.

13.
Censored quantile regressions of Powell (1984, 1986) are very powerful inference tools in economics and engineering. As the calculation of censored quantile regressions involves minimizing a nonconvex and nondifferentiable function, global optimization techniques may be the only viable approach. The first implementation of a global optimization technique for this problem, namely the Threshold Accepting method of Fitzenberger and Winker (1998, 2007), is challenged by the Genetic Algorithm (GA) in this paper. The results show that the GA provides substantial improvements over Threshold Accepting for cases with randomly distributed censoring points.
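A compact sketch of the two ingredients, assuming left-censoring at zero: Powell's censored quantile loss and a small real-coded GA to minimize it. Population sizes, operators and tuning here are illustrative guesses, not the paper's configuration:

    import numpy as np

    def censored_qr_objective(beta, X, y, tau=0.5, cens=0.0):
        """Powell's censored quantile regression loss, censoring at `cens`."""
        u = y - np.maximum(cens, X @ beta)
        return np.sum(u * (tau - (u < 0)))

    def ga_minimize(obj, dim, pop=60, gens=200, sigma=0.3, seed=0):
        """Tiny real-coded GA: truncation selection, blend crossover,
        Gaussian mutation."""
        rng = np.random.default_rng(seed)
        P = rng.normal(0.0, 1.0, (pop, dim))
        for _ in range(gens):
            f = np.array([obj(b) for b in P])
            elite = P[np.argsort(f)[: pop // 4]]        # keep the best quarter
            children = []
            while len(children) < pop - len(elite):
                a, b = elite[rng.integers(len(elite), size=2)]
                w = rng.random()
                child = w * a + (1.0 - w) * b           # blend crossover
                child += rng.normal(0.0, sigma, dim)    # Gaussian mutation
                children.append(child)
            P = np.vstack([elite, np.array(children)])
        f = np.array([obj(b) for b in P])
        return P[np.argmin(f)]

    # beta_hat = ga_minimize(lambda b: censored_qr_objective(b, X, y, tau=0.5),
    #                        dim=X.shape[1])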

14.
The main result of this paper is a (2 + ε)-approximation scheme for the minimum dominating set problem on circle graphs. We first present an O(n²) time 8-approximation algorithm for this problem and then extend it to a (2 + ε)-approximation scheme. Here n and m are the number of vertices and the number of edges of the circle graph. We then present simple modifications to this algorithm that yield (3 + ε)-approximation schemes for the minimum connected and the minimum total dominating set problems on circle graphs. Keil (1993, Discrete Appl. Math. 42, 51–63) shows that these problems are NP-complete for circle graphs and leaves open the problem of devising approximation algorithms for them. These are the first O(1)-approximation algorithms for domination problems on circle graphs.

15.
The representation theorem for fuzzy sets is one of the most fundamental results of fuzzy mathematics. Building on the representation theorem, this paper carries out an in-depth study of representation theorems for various classes of fuzzy quantities, including convex fuzzy quantities, normal fuzzy quantities, normal convex fuzzy quantities, convex bounded fuzzy quantities, fuzzy numbers, finite fuzzy numbers, and symmetric fuzzy numbers, thereby establishing connections between the different types of fuzzy quantities and ordinary sets.
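For reference, the classical representation (decomposition) theorem on which these results build states that a fuzzy set is recovered from its λ-cuts:

    \[
      A(x) \;=\; \sup_{\lambda \in (0,1]} \min\bigl(\lambda,\, \chi_{A_\lambda}(x)\bigr),
      \qquad A_\lambda = \{\, x \in X : A(x) \ge \lambda \,\},
    \]
    % where \chi_{A_\lambda} is the characteristic function of the ordinary
    % (crisp) set A_\lambda.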

16.
Three formulations for the Minimum 2-Connected Dominating Set Problem, valid inequalities, a primal heuristic and Branch-and-Cut algorithms are introduced in this paper. The preliminary computational results indicate that these algorithms are quite promising.

17.
In this paper we consider the problem of finding large collections of vertices and edges satisfying particular separation properties in random regular graphs of degree r, for each fixed r ≥ 3. We prove both constructive lower bounds and combinatorial upper bounds on the maximal sizes of these sets. The lower bounds are proved by analyzing a class of algorithms that return feasible solutions for the given problems. The analysis uses the differential equation method proposed by Wormald [Lectures on Approximation and Randomized Algorithms, PWN, Warsaw, 1999, pp. 239–298]. The upper bounds are proved by direct combinatorial means.

18.
This is a summary of the main results presented in the author's PhD thesis. The thesis, written in English, was supervised by Frits Spieksma and defended on September 23, 2005 at the Katholieke Universiteit Leuven. A copy of the thesis is available from the author's website (http://www.econ.kuleuven.be/linda.moonen/public/). The thesis can be roughly split into two parts. The first part is dedicated to the problem of partitioning partially ordered sets into chains of limited size. The second part deals with the structure and the connectivity of the Internet.

19.
Text editing directs students' attention to the problem structure as they classify whether the texts of word problems contain sufficient, missing or irrelevant information for working out a solution. Equation worked examples emphasize the formation of a coherent problem structure to generate a solution; their focus is on the construction of three equation steps, each of which comprises essential units of relevant information. In an experiment, students were randomly assigned to either the text editing or the equation worked examples condition in a regular classroom setting to learn how to solve algebra word problems in a chemistry context. The equation worked examples group outperformed the text editing group on molarity problems, which were more difficult than dilution problems. Empirical evidence supports the theoretical rationale of using equation worked examples to facilitate students' construction of a coherent problem structure, so as to develop problem-solving skills and expertise for molarity problems.

20.