Similar literature
 19 similar articles found (search time: 453 ms)
1.
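A new method is proposed for screening differentially expressed genes from gene expression profile data. The paper introduces the false discovery rate (FDR), a common approach to screening differentially expressed genes, analyzes the properties of p-values in multiple hypothesis testing, and builds on those properties to propose a new screening method, the unit measure-γ method (UM-γ). A computer simulation model of gene expression profile data is constructed; the false negative rate, false positive rate, sensitivity, specificity, and overall error rate are adopted as evaluation criteria; and the methods are computed and compared on simulated expression profile data. UM-γ estimates the number of non-differentially expressed genes with high stability and accuracy. It controls false positives, false negatives, and overall errors simultaneously while improving, to some extent, the sensitivity and specificity of the screening results. The proposed method screens differentially expressed genes in simulated data effectively, accurately, and stably.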
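The false discovery rate referred to in entry 1 is most often controlled with the Benjamini-Hochberg step-up procedure. The sketch below illustrates that standard procedure only; it is not the UM-γ method, whose details the abstract does not give. The function name `benjamini_hochberg`, the `alpha` default, and the sample p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure.

    Illustrative sketch: the name and signature are hypothetical, not taken
    from any of the listed papers.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


if __name__ == "__main__":
    # Hypothetical p-values for ten genes.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, alpha=0.05))
```

Because the procedure is step-up, a hypothesis can be rejected even when its own p-value exceeds its rank threshold, as long as a larger-ranked p-value meets its threshold.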

2.
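To extract informative genes from gene expression profiles, the Wilcoxon rank-sum test is used to remove "irrelevant genes". Based on the characteristics of high- and low-level gene expression, a bilinear regression model for high/low expression levels is built, and 19 feature genes are extracted through residual analysis. A heuristic breadth-first search algorithm is used to find the optimal gene subset and determine the gene "signature" of colon cancer. A support vector machine is then used to test classification performance, which is good.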

3.
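Gene expression data contain a wealth of biological information. In the study of gene information, screening for differentially expressed genes whose expression levels change significantly is a key problem in understanding disease mechanisms and supporting targeted drug research. Using gene expression data for acute myeloid leukemia (AML), a sequence of gene mean differences is constructed, a Bayesian hierarchical mixture model is built, and the model parameters are given priors that reflect biological characteristics of the genes. A Markov chain Monte Carlo (MCMC) algorithm is used to estimate the model parameters and to screen for differentially expressed AML genes. In the real-data analysis, an AML gene data set is obtained from the high-throughput gene expression database of the US National Center for Biotechnology Information (NCBI). From 14688 AML genes that passed nonspecific filtering, 711 differentially expressed genes are selected, only 4.84% of the total, a result consistent with the biological principles of differential gene expression.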

4.
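Analysis of Arabidopsis gene expression data based on support vector machines   Total citations: 2 (self-citations: 0, citations by others: 2)
To analyze gene expression data from Arabidopsis roots, this paper proposes a new multi-class support vector machine algorithm based on distance metric learning. Given the special nature of the problem, a suitable distance metric is obtained by minimizing the leave-one-out (LOO) error of a 4-class classifier. Under this metric, several training points belonging to a fifth class (the "other" class) are identified, and a 5-class classifier is constructed to classify all genes. Experiments confirm that the algorithm is feasible and more effective than the clustering methods traditionally used in gene expression analysis.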

5.
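Classification of two-population gene expression data based on Bayesian statistical methods   Total citations: 1 (self-citations: 0, citations by others: 1)
In disease diagnosis, accurate classification of the disease is crucial to improving diagnostic accuracy and cure rates. The advent of DNA chip technology makes it possible to obtain, at the microscopic level, gene function information closely related to disease classification and diagnosis. However, the gene expression pattern data produced by DNA chips have many variables and few samples, which makes classification highly unstable. We therefore first select the genes whose expression patterns change significantly as the feature gene set, reducing the number of variables, and then build a classifier on this set to classify the samples. This paper uses the likelihood ratio test to select feature genes, builds a statistical classification model based on Bayesian methods, and uses Markov chain Monte Carlo (MCMC) sampling to compute the posterior probabilities of class membership. Finally, the model is applied to two real DNA chip data sets and classifies the samples successfully.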

6.
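Multiple testing methods are used for statistical inference on high-dimensional data. First, least absolute deviation (LAD) estimation is applied to multiple testing to construct a new method for estimating the number of true null hypotheses. Second, simulations compare the accuracy of LAD and least squares in estimating the number of true null hypotheses; the results show that LAD is the more accurate of the two. Finally, the estimation method is applied to breast cancer microarray data to find differentially expressed genes. The tests identify 118 differential genes, 85 of which are biologically valid, which shows that the method has some practical value.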

7.
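Statistical tests are used to study feature selection and redundancy removal for gene expression data, and a corresponding model and algorithm are proposed. Compared with models and algorithms in the existing literature, the proposed approach is intuitive and easy to understand, and the algorithm is simple to construct and runs efficiently. Numerical experiments on three two-class gene expression data sets show that the method performs well in both feature selection and redundancy removal. The selected feature genes are then classified with the class-center distance method, and the results further show that the proposed method achieves high classification accuracy on two-class gene expression data.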

8.
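Acute leukemia is divided into two major subtypes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), and accurate diagnosis is the prerequisite and key to treatment. Based on gene chip data for acute leukemia, this paper combines the two-sample t-test, the Wilcoxon rank-sum test, hierarchical clustering, and a variable selection method, the supervised group lasso (SGLasso), to select genes that are significant for distinguishing AML from ALL. A logistic regression model for acute leukemia subtype is built on the training set and used to fit and predict the subtypes of patients in the training and test sets, verifying the model's predictive accuracy.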

9.
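Data from breast cancer gene chip experiments are analyzed to find genes that are differentially expressed between normal and cancerous tissue. Significance analysis of microarrays (SAM) is used to screen for differentially expressed genes, and a permutation algorithm is used to compute the false discovery rate (FDR). A number of differentially expressed genes are identified, some of which have been reported in several publications as being associated with breast cancer. SAM is well suited to preliminary screening of relevant genes from gene chip results, and the selected genes can serve as candidates for further study.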

10.
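In association tests for candidate genes, multi-marker haplotype methods often convey more information than single-marker methods. However, the number of haplotypes tends to grow sharply as the number of SNP markers increases, which greatly increases the degrees of freedom of the test statistic. Principal component analysis is used to reduce the dimension of the haplotype space in order to test the association between a quantitative trait and multiple haplotypes, and the approach is compared with traditional methods. Simulation results show that the test has a good type I error rate and good power.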

11.
Because of the high costs of microarray experiments and the availability of only limited biological materials, microarray experiments are often performed with a small number of replicates. Investigators, therefore, often have to perform their experiments with low replication or without replication. However, the heterogeneous error variability observed in microarray experiments increases the difficulty in analyzing microarray data without replication. No current analysis techniques are practically applicable to such microarray data analysis. We here introduce a statistical method, the so-called unreplicated heterogeneous error model (UHEM), for microarray data analysis without replication. This method is made possible by utilizing many adjacent-intensity genes for estimating local error variance after nonparametric elimination of differentially expressed genes between different biological conditions. We compared the performance of UHEM with three empirical Bayes prior specification methods: between-condition local pooled error, pseudo standard error, or adaptive standard error-based HEM. We found that our unreplicated HEM method is effective for microarray data analysis when replication of an array experiment is impractical or prohibited.

12.
Cluster analysis has been widely used to explore thousands of gene expressions from microarray analysis and identify a small number of similar genes (objects) for further detailed biological investigation. However, most clustering algorithms tend to identify loose clusters with too many genes. In this paper, we propose a Bayesian tight clustering method for time course gene expression data, which selects a small number of closely-related genes and constructs tight clusters only with these closely-related genes.

13.
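Gene prediction is performed on gene data from humans, rodents, and other mammals, comprising 1264 exons and 1553 introns in total, to determine the threshold of the DNA sequence signal-to-noise ratio. Nonparametric confidence interval methods based on the Mann-Whitney test, the sign test, and the Wilcoxon signed-rank test are proposed to compute the signal-to-noise ratio threshold, giving 1.108 for humans, 1.0971 for rodents, and 1.1754 for other mammals. Prediction accuracy is compared with the mean-averaging method, the standard-deviation-weighted averaging method, and the method that fixes the threshold at 2. The results show that the threshold computed by the nonparametric confidence interval method gives the highest accuracy in gene prediction.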

14.
We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question on the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets, as considered previously in the bioinformatics literature.

15.
The need to estimate a positive definite solution to an overdetermined linear system of equations with multiple right hand side vectors arises in several process control contexts. The coefficient and the right hand side matrices are respectively named data and target matrices. A number of optimization methods were proposed for solving such problems, in which the data matrix is unrealistically assumed to be error free. Here, considering error in measured data and target matrices, we present an approach to solve a positive definite constrained linear system of equations based on the use of a newly defined error function. To minimize the defined error function, we derive necessary and sufficient optimality conditions and outline a direct algorithm to compute the solution. We provide a comparison of our proposed approach and two existing methods, the interior point method and a method based on quadratic programming. Two important characteristics of our proposed method as compared to the existing methods are computing the solution directly and considering error both in data and target matrices. Moreover, numerical test results show that the new approach leads to smaller standard deviations of error entries and smaller effective rank as desired by control problems. Furthermore, in a comparative study, using the Dolan-Moré performance profiles, we show the approach to be more efficient.

16.
Hierarchical and empirical Bayes approaches to inference are attractive for data arising from microarray gene expression studies because of their ability to borrow strength across genes in making inferences. Here we focus on the simplest case where we have data from replicated two colour arrays which compare two samples and where we wish to decide which genes are differentially expressed and obtain estimates of operating characteristics such as false discovery rates. The purpose of this paper is to examine the frequentist performance of Bayesian variable selection approaches to this problem for different prior specifications and to examine the effect on inference of commonly used empirical Bayes approximations to hierarchical Bayes procedures. The paper makes three main contributions. First, we describe how the log odds of differential expression can usually be computed analytically in the case where a double tailed exponential prior is used for gene effects rather than a normal prior, which gives an alternative to the commonly used B-statistic for ranking genes in simple comparative experiments. The second contribution of the paper is to compare empirical Bayes procedures for detecting differential expression with hierarchical Bayes methods which account for uncertainty in prior hyperparameters to examine how much is lost in using the commonly employed empirical Bayes approximations. Third, we describe an efficient MCMC scheme for carrying out the computations required for the hierarchical Bayes procedures. Comparisons are made via simulation studies where the simulated data are obtained by fitting models to some real microarray data sets. The results have implications for analysis of microarray data using parametric hierarchical and empirical Bayes methods for more complex experimental designs: generally we find that the empirical Bayes methods work well, which supports their use in the analysis of more complex experiments when a full hierarchical Bayes analysis would impose heavy computational demands.

17.
We study the asymptotic behavior of the reduced rank estimator of the cointegrating space and adjustment space for vector error correction time series models with nonindependent innovations. It is shown that the distribution of the adjustment space can be quite different for models with iid innovations and models with nonindependent innovations. It is also shown that the likelihood ratio test remains valid when the assumption of iid Gaussian errors is relaxed. Monte Carlo experiments illustrate the finite sample performance of the likelihood ratio test using various kinds of weak error processes.

18.
The signature file is a well-studied method in information retrieval for indexing large text databases. Because of the small index size in this method, it is a good candidate for environments where memory is scarce. This small index size, however, comes at the cost of a high false positive error rate. In this paper we address the problem of the high false positive error rate of signature files by introducing COCA filters, a new variation of Bloom filters which exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that by using this technique in real document collections we can reduce the false positive error by up to 21 times, for the same index size. It is also shown that in some extreme cases this technique is even able to completely eliminate the false positive error. COCA filters can be considered as a good replacement for Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.
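For readers unfamiliar with the baseline that entry 18 builds on, the following is a minimal sketch of a plain Bloom filter. It is not the COCA filter, whose construction the abstract does not describe. The class name `BloomFilter`, the defaults `num_bits=1024` and `num_hashes=3`, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (illustrative sketch; names and defaults are hypothetical)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    index = BloomFilter()
    for word in ["gene", "expression", "microarray"]:
        index.add(word)
    print("gene" in index)
```

A membership test can return a false positive when all of an absent item's bits happen to be set by other items, but it never returns a false negative. That false positive rate is the quantity the abstract reports reducing.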

19.
Multiple hypotheses testing is concerned with appropriately controlling the rate of false positives, false negatives or both when testing several hypotheses simultaneously. Nowadays, the common approach to testing multiple hypotheses calls for controlling the expected proportion of falsely rejected null hypotheses referred to as the false discovery rate (FDR) or suitable measures based on the positive false discovery rate (pFDR). In this paper, we consider the problem of determining levels at which both false positives and false negatives can be controlled simultaneously. As our risk function, we use the expected value of the maximum between the proportions of false positives and false negatives, with the expectation being taken conditional on the event that at least one hypothesis is rejected and one is accepted, referred to as hybrid error rate (HER). We then develop, based on HER, an analog of the p-value, termed the h-value, to test the individual hypotheses. The use of the new procedure is illustrated using the well-known public data set by Golub et al. [Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531-537] with Affymetrix arrays of patients with acute lymphoblastic leukemia and acute myeloid leukemia.
