Similar literature
 Found 20 similar articles (search time: 637 ms)
1.
Lists of detected differentially expressed genes (DEGs) often show low reproducibility, even in technical replicate experiments. Reproducibility is lower still for real cancer data, which have large biological variation and a limited number of samples. Because existing methods for identifying differentially expressed genes treat each gene separately, they cannot circumvent the problem of low reproducibility. Considering the correlation structure of genes may help to mitigate the effect of errors on individual gene estimates and thus yield more reliable lists of DEGs. We borrowed information from a large amount of existing microarray data to define the expression dependencies amongst genes, and we use this prior knowledge to adjust the significance rank of DEGs. We applied our method and four popular ranking algorithms, namely mean fold change (FC), SAM, the t-statistic and the Wilcoxon rank-sum test, to two cancer microarray datasets. Our method achieved higher reproducibility than the other methods across a range of sample sizes. Furthermore, it obtained higher accuracy than the other methods, especially when the sample size is small. The results demonstrate that considering the dependencies amongst genes helps to adjust the significance rank of genes and to find the truly differentially expressed genes.
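
To make the ranking and reproducibility ideas in the abstract above concrete, here is a minimal sketch of the simplest baseline it mentions, mean fold change ranking, together with one possible reproducibility measure (the overlap of two top-k lists). The function names, the log2-scale input and the overlap measure are illustrative assumptions; the abstract does not specify how reproducibility was computed.

```python
from statistics import mean


def rank_by_fold_change(case, control):
    """Rank genes by absolute mean log2 fold change, largest first.

    `case` and `control` map gene name -> list of log2 expression values
    (hypothetical input layout, chosen for illustration).
    """
    scores = {g: abs(mean(case[g]) - mean(control[g])) for g in case}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of genes shared by the top-k entries of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k
```

Computing `top_k_overlap` between rankings obtained from two replicate experiments gives a rough reproducibility score between 0 and 1.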

2.
Microarrays have been widely used to identify differentially expressed genes. A related problem is to estimate the proportion of differentially expressed genes. For some complex diseases, the number of differentially expressed genes may be relatively small, and these genes may show only subtle differential expression. For such microarray data, it is generally difficult to estimate the proportion of differentially expressed genes efficiently. In this study, I propose a likelihood-based method coupled with an expectation-maximization (E-M) algorithm for estimating this proportion. The proposed method performs favorably if either (i) the P values of differentially expressed genes are homogeneously distributed or (ii) the proportion of differentially expressed genes is relatively small. In both situations, simulations showed that the proposed method performed satisfactorily compared with other existing methods. As applications, these methods were applied to two microarray gene expression data sets generated on different platforms.
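
The abstract above does not give the likelihood it uses, so the following is only a sketch of how an E-M estimate of the proportion could look under one common assumption: null P values are uniform on (0, 1) and P values of differentially expressed genes follow a Beta(a, 1) density. The model choice, the function name and the starting values are all assumptions made for illustration, not the author's method.

```python
import math


def estimate_de_proportion(p_values, n_iter=500):
    """E-M for the mixture f(p) = (1 - pi1) * 1 + pi1 * a * p**(a - 1).

    Returns (pi1, a), where pi1 is the estimated proportion of
    differentially expressed genes. Assumes every p-value is > 0.
    """
    pi1, a = 0.5, 0.5  # arbitrary starting values
    logs = [math.log(p) for p in p_values]
    for _ in range(n_iter):
        # E-step: posterior probability that each p-value is non-null
        weights = []
        for lp in logs:
            alt = pi1 * a * math.exp((a - 1.0) * lp)
            weights.append(alt / ((1.0 - pi1) + alt))
        # M-step: closed-form updates for the mixing weight and shape
        total = sum(weights)
        pi1 = total / len(weights)
        a = -total / sum(w * lp for w, lp in zip(weights, logs))
    return pi1, a
```

The null component is held fixed as uniform, so only the mixing proportion and the single shape parameter are updated in each iteration.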

3.
Real-time RT-PCR is frequently used in quantitative research in molecular biology and bioinformatics. It provides a remarkably useful technology for assessing gene expression. Although mathematical models of the gene amplification process have been studied, statistical models and methods for analyzing real-time RT-PCR data have received little attention. In this paper, we briefly introduce current mathematical models and study statistical models for real-time RT-PCR data. We propose a generalized estimating equations (GEE) model that properly reflects the structure of repeated data in RT-PCR experiments for both cross-sectional and longitudinal data. The GEE model takes the correlation between observations within the same subject into consideration and helps to prevent false positives and false negatives. We further demonstrate with a set of actual real-time RT-PCR data that different statistical models yield different estimates of fold change and confidence interval. The SAS program for data analysis using the GEE model is provided to facilitate computation for non-statisticians.

4.
With the rapid development of DNA microarray technology and next-generation technology, a large amount of genomic data has been generated. How to extract more differentially expressed genes from genomic data has therefore become an urgent problem. Because Low-Rank Representation (LRR) performs well in studying low-dimensional subspace structures, it has attracted considerable attention in recent years. However, it does not take into consideration the intrinsic geometric structures in data. In this paper, a new method named Laplacian regularized Low-Rank Representation (LLRR), which introduces graph regularization into LRR, is proposed and applied to genomic data. By taking full advantage of the graph regularization, the LLRR method can capture the intrinsic non-linear geometric information in the data. The LLRR method decomposes the observation matrix of genomic data into a low-rank matrix and a sparse matrix by solving an optimization problem. Because the significant genes can be considered as sparse signals, the differentially expressed genes are viewed as sparse perturbation signals. Therefore, the differentially expressed genes can be selected according to the sparse matrix. Finally, we use the GO tool to analyze the selected genes and compare the P-values with those of other methods. The results on simulation data and two real genomic data sets illustrate that this method outperforms some other methods in differentially expressed gene selection.

5.
Our ability to detect differentially expressed genes in a microarray experiment can be hampered when the number of biological samples of interest is limited. In this situation, we propose the use of information from self-self hybridizations to sharpen our inference of differential expression. A unified modelling strategy is developed to allow better estimation of the error variance. This principle is similar to the use of a pooled variance estimate in the two-sample t-test. The results from real dataset examples suggest that we can detect more differentially expressed genes with the combined models. Our simulation study provides evidence that this method increases sensitivity compared to using the information from comparative hybridizations alone, given the same control of the false discovery rate. The largest increase in sensitivity occurs when the amount of information in the comparative hybridization is limited.
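
The pooled variance estimate that the abstract above uses as an analogy can be sketched as follows. This is the textbook two-sample t-statistic with equal-variance pooling, not the unified model of the paper; the function name is a hypothetical one chosen for illustration.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Two-sample t-statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    # Weight each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return (mean(x) - mean(y)) / standard_error
```

Pooling borrows degrees of freedom from both groups, which is the same motivation the abstract gives for adding self-self hybridizations to the error variance estimate.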

6.
With the proliferation of related microarray studies by independent groups, a natural approach to analysis is to combine the results across studies. In this article, we address a meta-analysis of the gene expression data on imatinib resistance in chronic myelogenous leukemia (CML). First, an analysis of the overlap among 6 published studies revealed that only 3 genes were coincident between 2 studies. A later reprocessing of 4 publicly available datasets using different methods revealed that 2 additional genes overlapped between two sets. Both poor overlaps may be due to large differences in the sample source and the microarray platforms used, and to a small difference in gene expression between imatinib non-responder and responder patients. A search for common genes in the 4 public datasets yielded 404 well-defined genes. Nevertheless, this necessary condition for meta-analysis caused the loss of many genes of possible interest. The expression signals of the common genes in the four datasets were reanalyzed using three summary statistical methods for combining quantitative information: Fisher, Stouffer and effect-size. Taking the three methods together and using an FDR < 0.10 threshold, a gene list with 33 differentially expressed genes was found. Considering all the reanalysis approaches used in this work, a final gene list with 38 differentially expressed genes is reported. Despite the important limitations of this microarray meta-analysis, the presented procedures and integrated gene list may have some potential value as regards imatinib resistance in CML patients, since this is the first attempt to integrate evidence about gene lists in this area.
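
Two of the combination methods named in the abstract above, Fisher and Stouffer, can be sketched in a few lines. The sketch assumes independent, one-sided P values strictly between 0 and 1 and an unweighted Stouffer statistic; the abstract does not say which variants were used, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for an even number of df
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def stouffer_combine(p_values):
    """Stouffer's method: sum of z-scores divided by sqrt(k)."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - normal.cdf(z)
```

Both functions return a single combined P value per gene, which could then be passed to a false discovery rate procedure.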

7.
Cell-based biosensors utilize functional changes in cellular response to identify biological threats in a physiologically relevant manner. Cell-based sensors have been used for a wide array of applications, including toxicological assessment and drug screening. In this paper, we utilize DNA arrays to identify differential gene expression events induced by toxin exposure for the purpose of developing a reporter gene assay system compatible with insertion into a cell-based sensor platform. HT29, an intestinal epithelial cell line, was used as a cell model to study cholera toxin (CT)-induced host cell modulation using DNA array analysis. A false positive model was generated from analysis of housekeeping genes in untreated control experiments to characterize our system and to minimize the number of false positives in the data. A threshold probability score (−3.72), which gives <0.02% false positives for up/down regulation under the false positive model, was used to identify 73 and 25 known genes/expressed sequence tags (ESTs) that were up- and down-regulated, respectively, in cells exposed to 23 nM of CT. Using a quantitative multiplex PCR assay, the gene expression levels of several genes shown to be modulated in the microarray experiments, such as apolipoprotein D (Apol D), E-cadherin, and cyclin A2, were confirmed. The differential expression of genes encoding cytochrome P450, glutathione transferase (GST), and MGAT2 was noteworthy and consistent with previous studies. Our study provides an approach to analyzing cDNA microarray data with defined false positive rates. The utility of cDNA microarray information for the design of cell-based sensors using a reporter gene approach is discussed.

8.
Du W, Gu T, Tang LJ, Jiang JH, Wu HL, Shen GL, Yu RQ. Talanta, 2011, 85(3): 1689-1694
As a greedy search algorithm, the classification and regression tree (CART) is prone to overfitting when modeling microarray gene expression data. A straightforward solution is to filter out irrelevant genes by identifying the significant ones. Because some significant genes with multi-modal expression patterns, which exhibit systematic differences among within-class samples, are difficult to identify with existing methods, a strategy for CART modeling is proposed: unimodal transform of variables selected by interval segmentation purity (UTISP). First, significant genes exhibiting varied expression patterns are identified by a variable selection method based on interval segmentation purity. Then, a unimodal transform is applied as a feature extraction step to provide unimodal variables for CART modeling. Because significant genes with complex expression patterns can be properly identified and unimodal features extracted in advance, the strategy potentially improves the performance of CART in combating overfitting or underfitting when modeling microarray data. The strategy is demonstrated using two microarray data sets. The results reveal that UTISP-based CART performs better than k-nearest neighbors or CART coupled with other gene identification strategies, indicating that UTISP-based CART holds great promise for microarray data analysis.

9.
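Screening and cluster analysis of differentially expressed genes during differentiation of rat BMSCs induced by Shuanglong formula components   Total citations: 2 (self-citations: 1; citations by others: 1)
Gene chips were used to screen for genes differentially expressed while the active components of Shuanglong formula (total ginsenosides and total salvianolic acids) induced rat bone marrow mesenchymal stem cells (BMSCs) to differentiate into cardiomyocyte-like cells, and cluster analysis was performed on these genes, in order to study the effect of the Shuanglong formula components on rat BMSC differentiation at the gene level. Rat BMSCs were cultured in groups, and cell samples were collected at 10, 20, 30 and 40 d. tRNA was extracted and assayed by gene chip, the genes differentially expressed during the changes in the BMSCs were screened and analyzed bioinformatically, and the samples were subjected to hierarchical clustering based on the differentially expressed genes. During BMSC differentiation, 179 differentially expressed genes were identified; analysis showed that they are closely related to several classes of genes, including those for energy metabolism and signal transduction. Cluster analysis grouped the samples into two major clusters: the 10 and 20 d samples formed one cluster, and the 30 and 40 d samples formed the other. This indicates that the BMSCs may undergo a marked change between 20 and 30 d.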

10.
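A feature-variable selection method based on disjoint principal component analysis (Disjoint PCA) and a genetic algorithm (GA) was established and used to identify differentially expressed genes from gene expression profile data. In this method, disjoint PCA evaluates how well a group of genes discriminates between two classes of samples; the GA searches for the gene group with the strongest discriminating ability; and the chance correlation of the identified genes is assessed statistically. Because the method takes into account the cooperative effects among genes, which is closer to the biological processes in which genes take part, the identified genes show better differential expression. The method was applied to gene-chip data from hepatocellular carcinoma (HCC) samples. The results show that the identified genes have strong discriminating ability and that the method outperforms the commonly used significance analysis of microarrays (SAM) method.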

11.
High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probability of the identification of novel biologically active compounds by mining the HTS data. HTS data is notoriously noisy, and therefore, the selection of the optimal data mining method is important for the success of such an analysis. Here, we describe a retrospective analysis of four HTS data sets using three mining approaches: Laplacian-modified naive Bayes, recursive partitioning, and support vector machine (SVM) classifiers with increasing stochastic noise in the form of false positives and false negatives. All three of the data mining methods at hand tolerated increasing levels of false positives even when the ratio of misclassified compounds to true active compounds was 5:1 in the training set. False negatives in the ratio of 1:1 were tolerated as well. SVM outperformed the other two methods in capturing active compounds and scaffolds in the top 1%. A Murcko scaffold analysis could explain the differences in enrichments among the four data sets. This study demonstrates that data mining methods can add a true value to the screen even when the data is contaminated with a high level of stochastic noise.

12.
Missing values can arise for different reasons, and depending on their origin they should be considered and handled in different ways. In this research, four imputation methods were compared with respect to their effects on the normality and variance of the data, on statistical significance, and on the approximation of a suitable threshold for accepting missing data as truly missing. Additionally, different strategies for controlling the familywise error rate or the false discovery rate, and how they interact with the different strategies for missing value imputation, were evaluated. Missing values were found to affect the normality and variance of the data, and k‐means nearest neighbour imputation was the best method tested for restoring them. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that as little as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a “gray area”, and a strategy is therefore proposed that balances the optimal imputation strategy, k‐means nearest neighbour, against the best approximation for positioning real zeros.
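
The two families of multiple-testing control mentioned in the abstract above can be illustrated with a minimal sketch: Bonferroni correction for the familywise error rate and the Benjamini-Hochberg step-up procedure as one common way to control the false discovery rate. The abstract does not name the false discovery procedure it evaluated, so the Benjamini-Hochberg choice and the function names are assumptions.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypotheses whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject
```

Bonferroni applies a single strict threshold to every test, while the step-up procedure relaxes the threshold with rank, so it typically rejects more hypotheses at the same nominal level.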

13.
The aim of this study was to identify molecular markers associated with oncogenic differentiation in hepatocellular carcinoma (HCC). Using an unsupervised clustering method with a cDNA microarray, gene expression profiles of HCC tissues (T) and corresponding non-tumor tissues (NT) from 40 patients were analyzed. Of a total of 217 genes, 72 were expressed preferentially in HCC tissues. Among 186 differentially regulated genes, there were molecular chaperone and tumor suppressor gene clusters in the Edmondson grades I and II (GI/II) subclass compared with the liver cirrhosis (LC) subclass. The Edmondson grades III and IV (GIII/IV) subclass, which had poor survival (P=0.0133), contained 122 differentially regulated genes, with a cluster containing various metastasis- and invasion-related genes, compared with the GI/II subclass. Immunohistochemical analysis revealed that ANXA2, one of the 72 genes preferentially expressed in HCC, was over-expressed in the sinusoidal endothelium and in malignant hepatocytes in HCC. The genes identified in the HCC subclasses will be useful molecular markers for the genesis and progression of HCC. In addition, ANXA2 might be a novel marker for tumor angiogenesis in HCC.

14.
It has recently been shown that cancer genes (oncogenes) tend to have heterogeneous expression across disease samples. It is therefore reasonable to assume that in microarray data only a subset of disease samples will be activated (often referred to as outliers), which presents some new challenges for statistical analysis. In this paper, we study multi-class cancer outlier differential gene expression detection. Statistical methods are proposed that take the expression heterogeneity into account. Through simulation studies and application to public microarray data, we show that the proposed methods can provide more comprehensive analysis results and improve upon traditional differential gene expression detection methods, which often ignore the expression heterogeneity and may lose power. Supplementary information can be found at http://www.biostat.umn.edu/~baolin/research/orf.html.

15.
High throughput screening (HTS) data is often noisy, containing both false positives and false negatives. Thus, careful triaging and prioritization of the primary hit list can save time and money by identifying potential false positives before incurring the expense of follow-up. Of particular concern are cell-based reporter gene assays (RGAs), where the number of hits may be too high to scrutinize manually for erroneous data. Based on statistical models built from the chemical structures of 650 000 compounds tested in RGAs, we created "frequent hitter" models that make it possible to prioritize potential false positives. Furthermore, we followed up the frequent hitter evaluation with chemical structure based in silico target predictions to hypothesize a mechanism for the observed "off target" response. It was observed that the predicted cellular targets for the frequent hitters were known to be associated with undesirable effects such as cytotoxicity. More specifically, the most frequently predicted targets relate to apoptosis and cell differentiation, including kinases, topoisomerases, and protein phosphatases. The mechanism-based frequent hitter hypothesis was tested using 160 additional druglike compounds predicted by the model to be nonspecific actives in RGAs. This validation was successful (showing a 50% hit rate compared to a normal hit rate as low as 2%), and it demonstrates the power of computational models toward understanding complex relations between chemical structure and biological function.

16.
Clustering analysis of data from DNA microarray hybridization studies is an essential task for identifying biologically relevant groups of genes. The attribute cluster algorithm (ACA) has provided an attractive way to group and select meaningful genes. However, ACA needs considerable prior knowledge about the genes to set the number of clusters. In practical applications, if the number of clusters is misspecified, the performance of ACA deteriorates rapidly. We propose the Cooperative Competition Cluster Algorithm (CCCA) in this paper. In the algorithm, we assume that cooperation and competition exist simultaneously between clusters during the clustering process. By using this principle of Cooperative Competition, the number of clusters can be found in the process of clustering. Experimental results on synthetic and gene expression data are presented. The results show that CCCA can choose the number of clusters automatically and achieves excellent performance relative to competing methods.

17.
Radiotherapy (RT) is a common cancer treatment approach that accounts for nearly 50% of patient treatment; however, tumor relapse after radiotherapy is still a major issue. To study the crucial role of tumor-associated macrophages (TAMs) in the regulation of tumor progression post-RT, microarray experiments comparing TAM gene expression profiles between unirradiated and irradiated tumors were conducted to discover possible roles of TAMs in initiation or contribution to tumor recurrence following RT, taking into account the relationships among gene expression, tumor microenvironment, and immunology. A single dose of 25 Gy was given to TRAMP C-1 prostate tumors established in C57/B6 mice. CD11b-positive macrophages were extracted from the tumors at one, two and three weeks post-RT. Gene ontology (GO) term analysis using the DAVID database revealed that genes that were differentially expressed at one and two weeks after irradiation were associated with biological processes such as morphogenesis of a branching structure, tube development, and cell proliferation. Analysis using Short Time-Series Expression Miner (STEM) revealed the temporal gene expression profiles and identified 13 significant patterns in four main groups of profiles. The genes in the upregulated temporal profile have diverse functions involved in the intracellular signaling cascade, cell proliferation, and cytokine-mediated signaling pathway. We show that tumor irradiation with a single 25-Gy dose can initiate a time-series of differentially expressed genes in TAMs, which are associated with the immune response, DNA repair, cell cycle arrest, and apoptosis. Our study helps to improve our understanding of the function of the group of genes whose expression changes temporally in an irradiated tumor microenvironment.

18.
Single nucleotide polymorphism (SNP) arrays were used to detect chromosomal regions with DNA copy number alterations. Current statistical methods for microarray-based comparative genomic hybridization (array-CGH) analysis generally assume certain relationships among adjacent markers on the same chromosome, and these assumptions may be questionable. For an SNP-array-based CGH study, multiple normal reference SNP arrays were collected. In order to utilize these normal reference SNP arrays, we derived an empirical distribution of signal ratios for each SNP marker. With an assumed threshold value for the overall error rate control and the defined signal ratio ranges for chromosomal amplification and deletion, we proposed a procedure to identify chromosomal alteration regions based on several bootstrapped one-sample t-tests and the false discovery rate control. When we have multiple arrays for different individuals with the same disease, our method can also be used to detect SNP markers for chromosomal alteration regions that are common among these individuals. We applied our method to a published SNP array data set for breast carcinoma cell lines. For an individual with breast cancer, numerous chromosomal alteration regions were identified. Compared to results of previous studies, our method identified more chromosomal alteration regions, with some being implicated in the literature to harbor genes associated with breast cancer. For multiple cancer arrays, our results suggested the existence of common chromosomal alteration regions. However, a high proportion of false positives also indicated that genetic variations among different individuals with breast cancer can be present.

19.
Mantle cell lymphoma (MCL) cell lines have been difficult to generate, since only few have been described so far and even fewer have been thoroughly characterized. Among them, there is only one cell line, called GRANTA-519, which is well established and universally adopted for most lymphoma studies. We succeeded in establishing a new MCL cell line, called MAVER-1, from a leukemic MCL, and performed a thorough phenotypical, cytogenetical and molecular characterization of the cell line. In the present report, the phenotypic expression of GRANTA-519 and MAVER-1 cell lines has been compared and evaluated by a proteomic approach, exploiting 2-D map analysis. By univariate statistical analysis (Student's t-test, as commonly used in most commercial software packages), most of the protein spots were found to be identical between the two cell lines. Thirty spots were found to be unique for the GRANTA-519, whereas another 11 polypeptides appeared to be expressed only by the MAVER-1 cell line. A number of these spots could be identified by MS. These data were confirmed and expanded by multivariate statistical tools (principal component analysis and soft-independent model of class analogy) that allowed identification of a larger number of differently expressed spots. Multivariate statistical tools have the advantage of reducing the risk of false positives and of identifying spots that are significantly altered in terms of correlated expression rather than absolute expression values. It is thus suggested that, in future work in differential proteomic profiling, both univariate and multivariate statistical tools should be adopted.

20.
We present a novel method of statistical analysis for the comparison of electrophoretic data. The method is based on the squared Euclidean distance between normalized signal data vectors of electrophoretic lanes. The differences in the electrophoretic patterns are evaluated by a statistical test based on Hubert's statistics, which measures the significance of the signal grouping. We demonstrate the validity and applicability of the method on a large data set derived from automated fluorescent mRNA differential display analysis of the expression of acute-phase proteins during experimental Escherichia coli infection in mice. The testing method finds theoretically similar natural groupings to be similar in a statistically significant way, whereas theoretically dissimilar or random groupings are recognized as artifactual. We also show how the calculated pairwise signal distances can be utilized in methodological problem solving. These analytical methods can be applied to other related problems of similarity analysis of electrophoretic patterns, and they also provide useful tools for the development of automated recognition of differentially expressed mRNAs.
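
The distance measure named in the abstract above can be sketched as follows. The abstract does not say how the lane signals are normalized, so scaling each lane to unit total intensity is an assumption, as are the function names.

```python
def normalize(signal):
    """Scale a lane's signal vector so that its entries sum to 1.

    Assumes a non-empty vector with a positive total intensity.
    """
    total = sum(signal)
    return [value / total for value in signal]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

With this normalization, two lanes that differ only in overall intensity have a distance of approximately zero, so the measure compares band patterns rather than loading amounts.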
