Creating hierarchical models of protein families based on Expressed Sequence Tags: the "Sprockets" analysis pipeline
Authors: Gordon Paul M K; Weinel Christian; Jacobi Carsten; Kämpf Udo; Kriventseva Evgenia; Sensen Christoph W
Institution: a University of Calgary, Faculty of Medicine, Sun Center of Excellence for Visual Genomics, 3330 Hospital Drive NW, Calgary, AB, Canada T2N 4N1
b Computational Biology and Chemistry GVC/C, BASF Aktiengesellschaft, 67056 Ludwigshafen, Germany
c BASF Plant Science GmbH, Agricultural Center, 67117 Limburgerhof, Germany
Abstract: We have created an analysis pipeline called Sprockets, which can be used to classify proteins into hierarchical “families” and to build searchable models of these families. The construction of these families is based on data from Expressed Sequence Tags (ESTs) and Coding DNA Sequences (CDSs), making Sprockets clusters especially suitable for studying gene families in organisms for which a completely sequenced genome does not (yet) exist. The pipeline consists of two main parts: pair-wise analysis and grouping of sequences with Z-score statistics, followed by hierarchical splitting of clusters into alignable protein families. The computational and statistical techniques applied in Sprockets allow it to act as a massive and selective multiple sequence alignment engine for combining individual sequence collections with related public sequences. The end result is a database of gene Hidden Markov Models, each related to the others by three levels of similarity: secondary structure, function and evolutionary origin. For a sample set of 20,000 ESTs from Lactuca spp., Sprockets provided a 9% improvement in mapping of function to unknown sequences over traditional pair-wise search methods and InterPro mapping.
Keywords: EST assembly; Protein families; Hidden Markov Models; Sequence clustering; Multiple sequence alignments; Sprockets
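The first pipeline stage named in the abstract, pair-wise analysis and grouping of sequences with Z-score statistics, can be illustrated with a small sketch. The abstract does not say how Sprockets computes its Z-scores or which alignment scoring it uses, so everything below is an assumption made for illustration: a toy Smith-Waterman score with fixed match/mismatch/gap values, a Z-score estimated by shuffling one sequence, a threshold of 8.0, and single-linkage grouping. The function names `local_alignment_score`, `z_score` and `group_by_z` are hypothetical and are not part of Sprockets.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not Sprockets settings.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, shuffles=100, rng=None):
    """Estimate how far the real score lies above scores against shuffled b."""
    rng = rng or random.Random(0)
    observed = local_alignment_score(a, b)
    letters = list(b)
    background = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        background.append(local_alignment_score(a, "".join(letters)))
    sd = statistics.pstdev(background)
    if sd == 0:
        # Degenerate background (e.g. a homopolymer): report no evidence.
        return 0.0
    return (observed - statistics.mean(background)) / sd


def group_by_z(seqs, threshold=8.0, shuffles=100):
    """Single-linkage grouping: join any pair whose Z-score meets the threshold."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if z_score(seqs[x], seqs[y], shuffles, rng) >= threshold:
                parent[find(x)] = find(y)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

Calling `group_by_z` on a dictionary that maps sequence names to protein strings returns a sorted list of name groups. A Z-score is used here instead of a raw alignment score because it corrects for sequence length and composition, which is presumably why the pipeline relies on it; the second stage described in the abstract, hierarchical splitting into alignable families, is not covered by this sketch.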
This article is indexed in ScienceDirect, PubMed and other databases.