Similar Literature
20 similar articles found.
1.
We consider the distribution of the number of successes in success runs of length at least k in a binary sequence. One important application of this statistic is in the detection of tandem repeats among DNA sequence segments. In the literature, its distribution has been computed for independent sequences and Markovian sequences of order one. We extend these results to Markovian sequences of a general order. We also show that the statistic can be represented as a function of the number of overlapping success runs of lengths k and k + 1 in the sequence, and give immediate consequences of this representation. AMS 2000 Subject Classification 60E05, 60J05
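
As a quick, hedged illustration of the kind of run-counting representation described above (the identity checked here is a straightforward one and not necessarily the paper's exact statement), the sketch below counts the successes lying in runs of length at least k and verifies the relation S_k = k*N_k - (k-1)*N_{k+1}, where N_k and N_{k+1} are the numbers of overlapping success runs of lengths k and k+1.

```python
import random

def successes_in_long_runs(seq, k):
    """Total number of 1s lying in success runs of length >= k."""
    total, run = 0, 0
    for x in list(seq) + [0]:          # sentinel 0 closes the final run
        if x == 1:
            run += 1
        else:
            if run >= k:
                total += run
            run = 0
    return total

def overlapping_runs(seq, k):
    """Number of overlapping windows of k consecutive 1s."""
    return sum(all(seq[i + j] == 1 for j in range(k))
               for i in range(len(seq) - k + 1))

# check the identity S_k = k*N_k - (k-1)*N_{k+1} on random binary sequences
for _ in range(1000):
    seq = [random.randint(0, 1) for _ in range(30)]
    k = random.randint(1, 5)
    nk, nk1 = overlapping_runs(seq, k), overlapping_runs(seq, k + 1)
    assert successes_in_long_runs(seq, k) == k * nk - (k - 1) * nk1
```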

2.
A new family of plug-in classification techniques has recently been developed in the statistics and machine learning literature. A plug-in classification technique (PICT) is a method that takes a standard classifier (such as LDA or TREES) and plugs it into an algorithm to produce a new classifier. The standard classifier is known as the base classifier. These methods often produce large improvements over using a single classifier. In this article we investigate one of these methods and give some motivation for its success.
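
The abstract does not name the specific PICT investigated; as a hedged illustration of the general plug-in idea only, the sketch below takes a decision tree base classifier and plugs it into bagging, one standard wrapper algorithm of this kind.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for any classification task
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base = DecisionTreeClassifier(random_state=0)                    # the base classifier
pict = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                         random_state=0)                         # base classifier plugged into bagging

print("base classifier :", cross_val_score(base, X, y, cv=5).mean())
print("plug-in (bagged):", cross_val_score(pict, X, y, cv=5).mean())
```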

3.
Rough set feature selection (RSFS) can be used to improve classifier performance. RSFS removes redundant attributes whilst retaining important ones that preserve the classification power of the original dataset. Reducts are feature subsets selected by RSFS. The core is the intersection of all the reducts of a dataset. RSFS can only handle discrete attributes; hence, continuous attributes need to be discretized before being input to RSFS. Discretization determines the core size of a discrete dataset. However, current discretization methods do not consider the core size during discretization. Earlier work proposed the core-generating approximate minimum entropy discretization (C-GAME) algorithm, which selects the maximum number of minimum entropy cuts capable of generating a non-empty core within a discrete dataset. The contributions of this paper are as follows: (1) the C-GAME algorithm is improved by adding a new type of constraint to eliminate the possibility that only a single reduct is present in a C-GAME-discrete dataset; (2) performance evaluation of C-GAME in comparison to C4.5, multi-layer perceptrons, RBF networks and k-nearest neighbours classifiers on ten datasets chosen from the UCI Machine Learning Repository; (3) performance evaluation of C-GAME in comparison to Recursive Minimum Entropy Partition (RMEP), Chimerge, Boolean Reasoning and Equal Frequency discretization algorithms on the ten datasets; (4) evaluation of the effects of C-GAME and the other four discretization methods on the sizes of reducts; (5) an upper bound is defined on the total number of reducts within a dataset; (6) the effects of different discretization algorithms on the total number of reducts are analysed; (7) performance analysis of two RSFS algorithms (a genetic algorithm and Johnson's algorithm).
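
C-GAME itself is not reproduced here; as a minimal sketch of the building block it constrains, the code below finds the single minimum-entropy cut point for one continuous attribute, which is the usual entropy-based discretization step (C-GAME additionally selects cuts so that the resulting discrete dataset has a non-empty core).

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_entropy_cut(x, y):
    """Cut point on attribute x minimising the weighted class entropy of the binary split."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut, best_h = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        cut = (x[i] + x[i - 1]) / 2
        h = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h

x = np.array([0.1, 0.4, 0.35, 2.0, 2.2, 1.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_entropy_cut(x, y))      # cut near 1.15, weighted entropy 0.0
```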

4.
Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the solution of the problem. However, previous filters were designed for speeding up pattern matching, or for finding repetitions occurring between two strings or twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when a data set in which one wants to find functional elements with a multiple local alignment tool such as Glam is preprocessed with Nimbus, the overall execution time can be reduced from 7.5 hours to 2 minutes.
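
The bi-factor array is not described in the abstract beyond its name; the sketch below is a toy, dictionary-based stand-in showing how gapped seeds (two k-mers separated by a fixed gap) can be indexed so that positions sharing a seed become candidate repeat regions to verify.

```python
from collections import defaultdict

def gapped_seed_index(s, k, gap):
    """Map each gapped seed (k-mer, gap, k-mer) to its start positions in s.
    A plain dict stands in here for the bi-factor array of the paper."""
    index = defaultdict(list)
    span = 2 * k + gap
    for i in range(len(s) - span + 1):
        seed = (s[i:i + k], s[i + k + gap:i + span])
        index[seed].append(i)
    return index

idx = gapped_seed_index("ACGTACGTTTACGTACGTA", k=3, gap=2)
# seeds occurring at two or more positions point to candidate repeats
candidates = {seed: pos for seed, pos in idx.items() if len(pos) > 1}
print(candidates)
```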

5.
Identification and recognition of specific functionally important DNA sequence fragments such as regulatory sequences are among the most important problems in bioinformatics. One type of such fragment is the promoter, i.e., a short regulatory DNA sequence located upstream of a gene. Detection of regulatory DNA sequences is important for successful gene prediction and gene expression studies. In this paper, a Support Vector Machine (SVM) is used for classification of DNA sequences and recognition of the regulatory sequences. For optimal classification, various SVM learning and kernel parameters (hyperparameters) and their optimization methods are analyzed. In a case study, optimization of the SVM hyperparameters for linear, polynomial and power series kernels is performed using a modification of the Nelder–Mead (downhill simplex) algorithm. The method improves the precision of identification of the regulatory DNA sequences. The results of promoter recognition for the Drosophila sequence datasets are presented.
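
As a hedged sketch of the hyperparameter-optimisation step only (not the paper's modified simplex algorithm or its kernels), the code below tunes an RBF-kernel SVM with the standard Nelder–Mead method, using cross-validated accuracy on toy data in place of the promoter dataset.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# toy feature vectors standing in for encoded promoter / non-promoter sequences
X, y = make_classification(n_samples=300, n_features=40, random_state=0)

def neg_cv_accuracy(log_params):
    C, gamma = np.exp(log_params)                    # search on the log scale
    return -cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=5).mean()

res = minimize(neg_cv_accuracy, x0=[0.0, -3.0], method="Nelder-Mead")
print("best (C, gamma):", np.exp(res.x), " cv accuracy:", -res.fun)
```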

6.
For several decades, much attention has been paid to the two-sample Behrens-Fisher (BF) problem, which tests the equality of the means or mean vectors of two normal populations with unequal variance/covariance structures. Little work, however, has been done for the k-sample BF problem for high dimensional data, which tests the equality of the mean vectors of several high-dimensional normal populations with unequal covariance structures. In this paper we study this challenging problem by extending the famous Scheffé transformation method, which reduces the k-sample BF problem to a one-sample problem. The induced one-sample problem can be easily tested by the classical Hotelling's T^2 test when the size of the resulting sample is very large relative to its dimensionality. For high dimensional data, however, the dimensionality of the resulting sample is often very large, and even much larger than its sample size, which makes the classical Hotelling's T^2 test not powerful or not even well defined. To overcome this difficulty, we propose and study an L^2-norm based test. The asymptotic powers of the proposed L^2-norm based test and Hotelling's T^2 test are derived and theoretically compared. Methods for implementing the L^2-norm based test are described. Simulation studies are conducted to compare the L^2-norm based test and Hotelling's T^2 test when the latter can be well defined, and to compare the proposed implementation methods for the L^2-norm based test otherwise. The methodologies are motivated and illustrated by a real data example. The work was supported by the National University of Singapore Academic Research Grant (Grant No. R-155-000-085-112).
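
The L^2-norm statistic itself is not given in the abstract; as a small reference point, the hedged sketch below implements the classical one-sample Hotelling T^2 test that the induced one-sample problem reduces to when the sample size is large relative to the dimension, i.e. the regime in which the paper's L^2-norm test is not needed.

```python
import numpy as np
from scipy import stats

def hotelling_t2_one_sample(Z, mu0=None):
    """Classical one-sample Hotelling T^2 test of H0: E[Z] = mu0.
    Requires n - 1 >= p; for p comparable to or larger than n the sample
    covariance matrix is (near-)singular, which is the difficulty the
    L^2-norm based test is designed to avoid."""
    n, p = Z.shape
    mu0 = np.zeros(p) if mu0 is None else mu0
    diff = Z.mean(axis=0) - mu0
    S = np.cov(Z, rowvar=False)
    t2 = n * diff @ np.linalg.solve(S, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, stats.f.sf(f_stat, p, n - p)

Z = np.random.default_rng(0).normal(0.2, 1.0, size=(100, 5))
print(hotelling_t2_one_sample(Z))
```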

7.
A nonlinear ensemble forecasting model based on ARIMA and LSSVM
To address the difficulty of forecasting complex time series, a nonlinear ensemble forecasting method based on ARIMA and least squares support vector machines (LSSVM) is proposed, taking both the linear and nonlinear composite characteristics of the series into account. First, an ARIMA model is used to capture the linear trend of the time series and to determine the input order for the LSSVM model. Next, the time series samples are reconstructed according to this input order, and an LSSVM model captures the nonlinear characteristics of the series. Finally, an LSSVM-based nonlinear ensemble technique combines the components into an overall forecast. Applied to forecasting China's GDP, the method achieves substantially higher forecasting accuracy than individual forecasting methods and other popular ensemble forecasting methods, which confirms its effectiveness and feasibility.
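
A minimal sketch of the two-stage idea follows, with several substitutions to keep it self-contained: a synthetic series replaces the GDP data, scikit-learn's SVR stands in for LSSVM, and the final combination is a simple sum rather than the LSSVM-based ensemble step described above.

```python
import numpy as np
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(1.0, 0.5, 80)) + np.sin(np.arange(80) / 3.0)   # synthetic series

p = 4                                        # AR order, reused as the SVR input order
linear = ARIMA(y, order=(p, 1, 0)).fit()     # stage 1: linear component
resid = linear.resid                         # what the linear model leaves unexplained

# stage 2: reconstruct lagged samples of the residuals and fit the nonlinear model
X = np.array([resid[i:i + p] for i in range(len(resid) - p)])
nonlinear = SVR(kernel="rbf").fit(X, resid[p:])

combined = linear.fittedvalues[p:] + nonlinear.predict(X)
print(np.abs(combined - y[p:]).mean())       # in-sample mean absolute error
```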

8.
In many classification applications and face recognition tasks, there exist unlabelled data available for training along with labelled samples. The use of unlabelled data can improve the performance of a classifier. In this paper, a semi-supervised growing neural gas is proposed for learning with such partly labelled datasets in face recognition applications. The classifier is first trained on the labelled data; unlabelled data are then gradually classified and added to the training data, the classifier is retrained, and the process is repeated. The proposed iterative algorithm conforms to the EM framework and is demonstrated, on both artificial and real datasets, to significantly boost the classification rate with the use of unlabelled data. The improvement is particularly great when the labelled dataset is small. A comparison with support vector machine classifiers is also given. The algorithm is computationally efficient and easy to implement.
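
Growing neural gas itself is not sketched here; the code below is a hedged illustration of the surrounding iterative scheme only, using a k-nearest-neighbour classifier as a stand-in: train on the labelled data, label the unlabelled points the model is confident about, add them, and retrain.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

def self_training(X_lab, y_lab, X_unlab, n_rounds=5, thresh=0.8):
    """Iteratively absorb confidently classified unlabelled samples;
    a k-NN classifier stands in for the growing neural gas of the paper."""
    for _ in range(n_rounds):
        if len(X_unlab) == 0:
            break
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= thresh
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
    return KNeighborsClassifier(n_neighbors=5).fit(X_lab, y_lab)

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
model = self_training(X[:30], y[:30], X[30:])    # 30 labelled, 270 unlabelled points
print(model.score(X, y))
```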

9.
Since Suykens proposed the least squares support vector machine (LSSVM), a new statistical learning method, it has attracted wide attention, and its good predictive performance has been widely exploited. This paper improves LSSVM with the group method of data handling (GMDH), a self-organizing data mining method, to raise forecasting accuracy. GMDH is first used to select effective input variables; these variables are then used as the inputs of an LSSVM model for time series forecasting, yielding GLSSVM, a hybrid model combining LSSVM and GMDH. The model is evaluated empirically on exchange rate time series. The results show that the hybrid model clearly improves forecasting accuracy.
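
A rough sketch of the two-stage GLSSVM idea follows, with the stand-ins clearly flagged: mutual-information screening replaces the GMDH self-organising selection of input variables, SVR replaces LSSVM, and a random-walk series replaces the exchange-rate data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.svm import SVR

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(0.0, 1.0, 300))        # stand-in for an exchange-rate series

max_lag = 10
X = np.array([y[i:i + max_lag] for i in range(len(y) - max_lag)])
t = y[max_lag:]

# stage 1: screen candidate lags (GMDH plays this role in the paper)
scores = mutual_info_regression(X, t, random_state=0)
selected = np.argsort(scores)[-3:]              # keep the 3 most informative lags

# stage 2: forecast with a kernel model on the selected inputs (LSSVM in the paper)
model = SVR(kernel="rbf").fit(X[:, selected], t)
print(np.abs(model.predict(X[:, selected]) - t).mean())
```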

10.
Multiple sequence alignment is a task at the heart of much of current computational biology [4]. Several different objective functions have been proposed to formalize the task of multiple sequence alignment, but efficient algorithms are lacking in each case. Thus multiple sequence alignment is one of the most critical, essentially unsolved problems in computational biology. In this paper we consider one of the more compelling objective functions for multiple sequence alignment, formalized as the tree alignment problem. Previously in [13], a ratio-two approximation method was developed for tree alignment, which ran in cubic time (as a function of the number of fixed length strings to be aligned), along with a polynomial time approximation scheme (PTAS) for the problem. However, the PTAS in [13] had a running time which made it impractical to reduce the performance ratio much below two for small size biological sequences (100 characters long). In this paper we first develop a ratio-two approximation algorithm which runs in quadratic time, and then use it to develop a PTAS which has a better performance ratio and a vastly improved worst case running time compared to the scheme in [13] for the case where the given tree is a regular deg-ary tree. With the new approximation scheme, it is now practical to guarantee a ratio of 1.583 for strings of length 200 characters or less.

11.
On the estimation of entropy
Motivated by recent work of Joe (1989, Ann. Inst. Statist. Math., 41, 683–697), we introduce estimators of entropy and describe their properties. We study the effects of tail behaviour, distribution smoothness and dimensionality on convergence properties. In particular, we argue that root-n consistency of entropy estimation requires appropriate assumptions about each of these three features. Our estimators are different from Joe's, and may be computed without numerical integration, but it can be shown that the same interaction of tail behaviour, smoothness and dimensionality also determines the convergence rate of Joe's estimator. We study both histogram and kernel estimators of entropy, and in each case suggest empirical methods for choosing the smoothing parameter.
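
As a hedged illustration of the two estimator families discussed (not the authors' smoothing-parameter selection rules), the sketch below computes a histogram plug-in estimate and a kernel resubstitution estimate of differential entropy for a one-dimensional sample.

```python
import numpy as np
from scipy.stats import gaussian_kde

def entropy_histogram(x, bins=30):
    """Histogram plug-in estimator of differential entropy."""
    counts, edges = np.histogram(x, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

def entropy_kde(x):
    """Kernel resubstitution estimator: -(1/n) * sum_i log f_hat(x_i)."""
    f_hat = gaussian_kde(x)
    return -np.mean(np.log(f_hat(x)))

x = np.random.default_rng(0).normal(size=5000)
true_h = 0.5 * np.log(2 * np.pi * np.e)          # exact entropy of N(0, 1)
print(entropy_histogram(x), entropy_kde(x), true_h)
```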

12.
In the case of the quantum generalization of stable Lévy processes, expressions for the Hermitian operator of momentum and its eigenfunctions are proposed. The normalization constant has been determined and its relation to the translation operator is shown. The interrelation between the momentum and the wave number has been generalized for the processes with a non-integer dimensionality α. The simplest nonlocal superalgebra is introduced.

13.
We propose a new method to find modes based on active information. We develop an algorithm called active information mode hunting (AIMH) that, when applied to the whole space, will say whether there are any modes present and where they are. We show that AIMH is consistent and, given that information increases where probability decreases, that it helps to overcome issues with the curse of dimensionality. AIMH also reduces the dimensionality with no recourse to principal components. We illustrate the method in three ways: with a theoretical example (showing how it performs better than other mode hunting strategies), with a business application on a real dataset, and with a simulation.

14.
15.
A k-means-type algorithm is proposed for efficiently clustering data constrained to lie on the surface of a p-dimensional unit sphere, or data that are mean-zero-unit-variance standardized observations, such as those that occur when using Euclidean distance to cluster time series gene expression data with a correlation metric. We also provide methodology to initialize the algorithm and to estimate the number of clusters in the dataset. Results from a detailed series of experiments show excellent performance, even with very large datasets. The methodology is applied to the analysis of the mitotic cell division cycle of the budding yeast dataset of Cho et al. [Molecular Cell (1998), 2, 65–73]. The entire dataset has not been analyzed previously, so our analysis provides an understanding of the complete set of genes acting in concert and differentially. We also use our methodology on the submitted abstracts of oral presentations made at the 2008 Joint Statistical Meetings (JSM) to identify similar topics. Our identified groups are both interpretable and distinct, and the methodology provides a possible automated tool for efficient parallel scheduling of presentations at professional meetings.

The supplemental materials described in the article are available in the online supplements.
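
A compact, hedged sketch of a k-means-type algorithm for directional data follows (minimal, without the initialization and cluster-number estimation machinery described above): observations are projected to the unit sphere, assigned by cosine similarity, and the centroids are renormalised after each update.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)       # project onto the unit sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)           # assign by cosine similarity
        new = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else X[rng.integers(len(X))]       # re-seed an empty cluster
                         for j in range(k)])
        new /= np.linalg.norm(new, axis=1, keepdims=True)   # renormalise the centroids
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

X = np.random.default_rng(1).normal(size=(200, 10))
labels, centers = spherical_kmeans(X, k=4)
```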

16.
This paper concerns the longest common subsequence (LCS) shared by two sequences (or strings) of length N, whose elements are chosen at random from a finite alphabet. The exact distribution and the expected value of the length of the LCS, k say, between two random sequences is still an open problem in applied probability. While the expected value E(N) of the length of the LCS of two random strings is known to lie within certain limits, the exact value of E(N) and the exact distribution are unknown. In this paper, we calculate the length of the LCS for all possible pairs of binary sequences from N=1 to 14. The length of the LCS and the Hamming distance are represented in color on two all-against-all arrays. An iterative approach is then introduced in which we determine the pairs of sequences whose LCS lengths increased by one upon the addition of one letter to each sequence. The pairs whose score did increase are shown in black and white on an array, which has an interesting fractal-like structure. As the sequence length increases, R(N) (the proportion of sequences whose score increased) approaches the Chvátal–Sankoff constant a_c (the proportionality constant for the linear growth of the expected length of the LCS with sequence length). We show that R(N) converges more rapidly to a_c than E(N)/N does.
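
For readers who want to reproduce the flavour of these computations, here is a hedged sketch: the standard dynamic program for the LCS length, plus a small Monte Carlo estimate of E(N)/N for random binary strings, which approaches the Chvátal–Sankoff constant a_c as N grows.

```python
import random

def lcs_length(a, b):
    """Standard O(N^2) dynamic program for the length of the LCS of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

# Monte Carlo estimate of E(N)/N for random binary strings of length N;
# the limit as N -> infinity is the Chvátal–Sankoff constant a_c.
N, trials = 200, 100
total = sum(lcs_length([random.randint(0, 1) for _ in range(N)],
                       [random.randint(0, 1) for _ in range(N)])
            for _ in range(trials))
print(total / (trials * N))
```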

17.
In A.S. Buch and W. Fulton [Invent. Math. 135 (1999), 665–687] a formula for the cohomology class of a quiver variety is proved. This formula writes the cohomology class of a quiver variety as a linear combination of products of Schur polynomials. In the same paper it is conjectured that all of the coefficients in this linear combination are non-negative, and given by a generalized Littlewood-Richardson rule, which states that the coefficients count certain sequences of tableaux called factor sequences. In this paper I prove some special cases of this conjecture. I also prove that the general conjecture follows from a stronger but simpler statement, for which substantial computer evidence has been obtained. Finally I will prove a useful criterion for recognizing factor sequences.

18.
It has been recognized through theory and practice that uniformly distributed deterministic sequences provide more accurate results than purely random sequences. A quasi Monte Carlo (QMC) variant of a multi level single linkage (MLSL) algorithm for global optimization is compared with the original stochastic MLSL algorithm for a number of test problems of various complexities. Emphasis is placed on high-dimensional problems. Two different low-discrepancy sequences (LDS) are used and their efficiency is analysed. It is shown that the application of LDS can significantly increase the efficiency of MLSL. The dependence of the sample size required for locating global minima on the number of variables is examined. It is found that higher confidence in the obtained solution, and possibly a reduction in computational time, can be achieved by increasing the total sample size N; N should also be increased as the dimensionality of the problems grows. For high-dimensional problems clustering methods become inefficient, and a multistart method can be more computationally expedient.
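
A small, hedged comparison in the same spirit (a plain multistart without MLSL's clustering or single-linkage machinery): local searches are started from a Sobol low-discrepancy sample and from a purely random sample, and the best objective value found is compared.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def rastrigin(x):
    x = np.asarray(x)
    return 10 * len(x) + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

dim, n_starts, lo, hi = 5, 64, -5.12, 5.12
sobol_pts = lo + (hi - lo) * qmc.Sobol(dim, seed=0).random(n_starts)    # low-discrepancy starts
random_pts = np.random.default_rng(0).uniform(lo, hi, (n_starts, dim))  # pseudo-random starts

for name, starts in [("Sobol LDS", sobol_pts), ("pseudo-random", random_pts)]:
    best = min(minimize(rastrigin, x0, method="L-BFGS-B",
                        bounds=[(lo, hi)] * dim).fun for x0 in starts)
    print(name, best)
```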

19.
Many applications call for exhaustive lists of strings subject to various constraints, such as inequivalence under group actions. A k-ary necklace is an equivalence class of k-ary strings under rotation (the cyclic group). A k-ary unlabeled necklace is an equivalence class of k-ary strings under rotation and permutation of alphabet symbols. We present new, fast, simple, recursive algorithms for generating (i.e., listing) all necklaces and binary unlabeled necklaces. These algorithms have optimal running times in the sense that their running times are proportional to the number of necklaces produced. The algorithm for generating necklaces can be used as the basis for efficiently generating many other equivalence classes of strings under rotation and has been applied to generating bracelets, fixed density necklaces, and chord diagrams. As another application, we describe the implementation of a fast algorithm for listing all degree n irreducible and primitive polynomials over GF(2).
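
The recursive, output-proportional algorithms themselves are beyond a short snippet; as a hedged reference implementation for checking small cases, the code below lists k-ary necklaces by keeping each string that is lexicographically least among its rotations (exponential time, unlike the paper's algorithms).

```python
from itertools import product

def necklaces(n, k):
    """All k-ary necklaces of length n, each represented by the
    lexicographically smallest string in its rotation class."""
    reps = []
    for s in product(range(k), repeat=n):
        if all(s <= s[i:] + s[:i] for i in range(1, n)):
            reps.append(s)
    return reps

print(len(necklaces(5, 2)))   # 8 binary necklaces of length 5
print(len(necklaces(4, 3)))   # 24 ternary necklaces of length 4
```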

20.
Since its introduction in the early 1990s, the idea of using importance sampling (IS) with Markov chain Monte Carlo (MCMC) has found many applications. This article examines problems associated with its application to repeated evaluation of related posterior distributions with a particular focus on Bayesian model validation. We demonstrate that, in certain applications, the curse of dimensionality can be reduced by a simple modification of IS. In addition to providing new theoretical insight into the behavior of the IS approximation in a wide class of models, our result facilitates the implementation of computationally intensive Bayesian model checks. We illustrate the simplicity, computational savings, and potential inferential advantages of the proposed approach through two substantive case studies, notably computation of Bayesian p-values for linear regression models and simulation-based model checking. Supplementary materials including the Appendix and the R code for Section are available online.
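
A stripped-down sketch of the basic reweighting step (not the article's modification or its model-checking machinery): draws stored for one posterior are reused to estimate an expectation under a related posterior via self-normalised importance weights; both log-posteriors here are simple Gaussians chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post_base(theta):       # baseline posterior: N(0, 1), for illustration only
    return -0.5 * theta**2

def log_post_related(theta):    # related posterior, e.g. after perturbing the data: N(0.3, 1)
    return -0.5 * (theta - 0.3)**2

# stand-in for stored MCMC output: exact draws from the baseline posterior
draws = rng.normal(0.0, 1.0, 10_000)

log_w = log_post_related(draws) - log_post_base(draws)
w = np.exp(log_w - log_w.max())
w /= w.sum()                    # self-normalised importance weights

print(np.sum(w * draws))        # posterior mean under the related posterior, about 0.3
```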
