Similar articles
 Found 20 similar articles; search took 328 ms
1.
Protein structures are evolutionarily more conserved than sequences, and sequences with very low sequence identity frequently share the same fold. This leads to the concept of protein designability: some folds are more designable, and many sequences can assume them. Elucidating the relationship between protein sequence and the three-dimensional (3D) structure that the sequence folds into is an important problem in computational structural biology. Lattice models have been utilized in numerous studies to model protein folds and predict the designability of certain folds. In this study, all possible compact conformations within a set of two-dimensional and 3D lattice spaces are explored. Complementary interaction graphs are then generated for each conformation and are described using a set of graph features. The full HP sequence space for each lattice model is generated and contact energies are calculated by threading each sequence onto all the possible conformations. The unique conformation giving minimum energy is identified for each sequence, and the number of sequences folding to each conformation (its designability) is obtained. Machine learning algorithms are used to predict the designability of each conformation. We find that the highly designable structures can be distinguished from other non-designable conformations based on certain graphical geometric features of the interactions. This finding confirms that the topology of a conformation is an important determinant of the extent of its designability and suggests that the interactions themselves are important for determining the designability.
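
The designability count described in abstract 1 can be illustrated with a minimal sketch on a tiny 2D square lattice. The assumptions here are not taken from the abstract: an energy of -1 per non-bonded H-H contact, conformations identified by their contact maps (so rotations and reflections collapse to one entry), and the function names `compact_contact_maps` and `designability`, which are hypothetical.

```python
from itertools import product

def compact_contact_maps(w, h):
    """Return the distinct contact maps of all compact (space-filling) walks on a w x h grid."""
    n = w * h
    maps = set()

    def extend(path, visited):
        if len(path) == n:
            index = {cell: i for i, cell in enumerate(path)}
            contacts = set()
            for (x, y), i in index.items():
                # Look right and up only, so each lattice edge is examined once.
                for nb in ((x + 1, y), (x, y + 1)):
                    j = index.get(nb)
                    if j is not None and abs(i - j) > 1:  # skip chain-bonded neighbours
                        contacts.add((min(i, j), max(i, j)))
            maps.add(frozenset(contacts))
            return
        x, y = path[-1]
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nb[0] < w and 0 <= nb[1] < h and nb not in visited:
                visited.add(nb)
                path.append(nb)
                extend(path, visited)
                path.pop()
                visited.remove(nb)

    for start in product(range(w), range(h)):
        extend([start], {start})
    return sorted(maps, key=sorted)

def designability(w, h):
    """Count, per conformation, the HP sequences whose unique minimum-energy fold it is."""
    maps = compact_contact_maps(w, h)
    counts = [0] * len(maps)
    for seq in product("HP", repeat=w * h):
        # Assumed energy model: -1 for every non-bonded H-H contact.
        energies = [-sum(seq[i] == "H" and seq[j] == "H" for i, j in m) for m in maps]
        best = min(energies)
        if energies.count(best) == 1:  # sequences with degenerate ground states are ignored
            counts[energies.index(best)] += 1
    return maps, counts
```

Calling `designability(3, 3)` threads all 512 HP sequences of length 9 onto every compact conformation of a 3 x 3 lattice. Larger lattices grow quickly in both sequence and conformation count, so this brute-force form is only meant for toy sizes.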

2.
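Genome projects have produced a large amount of DNA sequence information, and how to use this information effectively to study the structure and function of gene products, namely proteins, has become a prominent research area. Homology-based protein structure prediction and protein fold recognition are effective methods for predicting protein structure at the genome level; for about 50% of the genes in the yeast genome, the structure of the expressed protein product can be determined by such methods [1]. However, the methods currently in use still face considerable difficulty in predicting the structures of proteins with low homology.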

3.
4.
As several structural proteomic projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein functions from protein structures are urgently needed. In this paper, we present a method to explore the clustering patterns of amino acids in 3-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented using a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.

5.
Assignment of function to protein sequence is a task of growing importance in the life sciences, as new high-throughput DNA sequencing technologies generate ever increasing quantities of genomic and meta-genomic data. Patterns within the sequence space, caused by the evolutionary conservation and assembly of protein domains, make possible the inference of function from sequence similarity. Clustering similar sequences is a useful technique for finding conserved sequences; the CluSTr database is a publicly available database arranging proteins in a hierarchy structured by similarity. The protein classification tool InterProScan builds on this approach by applying a range of methods to detect proteins that contain signatures indicative of the presence of particular conserved domains. The use of ontologies to describe protein function provides a flexible and abstract language to classify proteins. Together, these techniques can provide an understanding of the shape of the protein space, and can be used to explore the uncharted waters of the emerging metagenomic world.

6.
Selecting folded proteins from a library of secondary structural elements   Total citations: 1, self-citations: 0, citations by others: 1
A protein evolution strategy is described by which double-stranded DNA fragments encoding defined Escherichia coli protein secondary structural elements (alpha-helices, beta-strands, and loops) are assembled semirandomly into sequences comprising as many as 800 amino acid residues. A library of novel polypeptides generated from this system was inserted into an enhanced green fluorescent protein (EGFP) fusion vector. Library members were screened by fluorescence activated cell sorting (FACS) to identify those polypeptides that fold into soluble, stable structures in vivo that comprised a subset of shorter sequences (approximately 60 to 100 residues) from the semirandom sequence library. Approximately 10^8 clones were screened by FACS, a set of 1149 high fluorescence colonies was characterized by dPCR, and four soluble clones with varying amounts of secondary structure were identified. One of these is highly homologous to a domain of aspartate racemase from a marine bacterium (Polaromonas sp.) but is not homologous to any E. coli protein sequence. Several other selected polypeptides have no global sequence homology to any known protein but show significant alpha-helical content, limited dispersion in 1D nuclear magnetic resonance spectra, pH-sensitive ANS binding and reversible folding into soluble structures. These results demonstrate that this strategy can generate novel polypeptide sequences containing secondary structure.

7.
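Classification modeling and recognition of protein fold types   Total citations: 2, self-citations: 0, citations by others: 2
刘岳, 李晓琴, 徐海松, 乔辉. 《物理化学学报》 (Acta Physico-Chimica Sinica), 2009, 25(12): 2558-2564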
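How the amino acid sequence of a protein determines its spatial structure is one of the central questions in the life sciences today. The fold type reflects the topological pattern of a protein's core structure, and fold recognition is an important part of protein sequence-structure research. We studied 36 fold types that cannot be modeled independently and that account for 41.8% of the α, β and α/β class proteins in the Astral 1.65 sequence database. Samples with sequence identity below 25% were selected as the training set and hierarchically clustered using root-mean-square deviation (RMSD) as the metric, producing several fold subclasses; for each subclass, a profile hidden Markov model (profile-HMM) was built from structural alignments generated by the multiple structural alignment algorithm MUSTANG. With 9505 samples from Astral 1.65 with sequence identity below 95% as the test set, the mean recognition sensitivity over the 36 fold types was 90%, the specificity 99%, and the Matthews correlation coefficient (MCC) 0.95. The results show that for fold types with many members, for which no single unified model can be built, RMSD-based hierarchical classification modeling achieves recognition with high accuracy; this offers new methods and ideas for protein fold recognition and lays a foundation for further research.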

8.
We present different means of classifying protein structure. One is made rigorous by mathematical knot invariants that coincide reasonably well with ordinary graphical fold classification, and another classification is by packing analysis. Furthermore, when constructing our mathematical fold classifications, we utilize standard neural network methods for predicting protein fold classes from amino acid sequences. We also analyse the redundancy of the structural classifications in relation to function and ligand binding. Finally, we advocate combining measurements of the VA, VCD, Raman, ROA, EA and ECD spectra with the primary sequence as a way to improve both the accuracy and reliability of fold class prediction schemes.

9.
A protein's function is related to its chemical interactions with the surrounding environment, including other proteins. These in turn depend on the spatial shape and tertiary structure of the protein and on the folding of its constituent components in space. The correct identification of a protein domain fold using only information extracted from the protein sequence is a complicated and controversial task in current computational biology. In this article, a combined classifier based on the information content of features extracted from the primary structure of the protein is introduced to address this challenging problem. In the first stage of our proposed two-tier architecture, there are several classifiers, each of which is trained with a different sequence-based feature vector. Apart from the predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and different dimensions of pseudo-amino acid composition vectors applied in similar studies, the position specific scoring matrix (PSSM) has also been used to improve the correct classification rate (CCR) in this study. Using K-fold cross validation on a training dataset related to 27 well-known SCOP folds, the 28-dimensional probability output vector from each evidence-theoretic K-NN classifier is used to determine the information content, or expertness, of the corresponding feature for discrimination in each fold class. In the second stage, the outputs of the classifiers for the test dataset are fused using the Sugeno fuzzy integral operator to make a better decision on the target fold class. The expertness factor of each classifier in each fold class has been used to calculate the fuzzy integral operator weights. The results make it possible to provide a deeper interpretation of the effectiveness of each feature for discrimination in target classes for query proteins.

10.
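A structural genomics target selection study was carried out on 泉生热袍菌 (a Thermotoga species). Twenty proteins were chosen from its proteome as the first batch of targets for structure determination, with the aim of discovering new protein folds. Target selection relied mainly on BLAST searches, PSI-BLAST searches, and searches of the ProtoNet database. In addition, the PredictProtein program was used to predict the secondary structure and overall shape of the selected proteins. Of the 20 selected proteins, 8 were cloned, expressed, and purified, and 2 of these yielded single crystals from which X-ray diffraction data were collected. The experimental results, together with results recently reported in the literature, show that some of the selected proteins have new folds, which demonstrates the effectiveness of this target selection strategy.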

11.
12.
The positions of a given fold always occupied by strong hydrophobic amino acids (V, I, L, F, M, Y, W), which we call “topohydrophobic positions”, were detected and their properties demonstrated within 153 non-redundant families of homologous domains, through 3D structural alignments. Sets of divergent sequences possessing at least four to five members appear to be as informative as larger sets, provided that their mean pairwise sequence identity is low. Amino acids in topohydrophobic positions exhibit several interesting features: they are much more buried than their equivalents in non-topohydrophobic positions; their side chains are far less dispersed; and they often constitute a lattice of close contacts in the inner core of globular domains. In most cases, each regular secondary structure possesses one to three topohydrophobic positions, which cluster in the domain core. Moreover, using sensitive alignment processes such as hydrophobic cluster analysis (HCA), it is possible to identify topohydrophobic positions from only a small set of divergent sequences. Amino acids in topohydrophobic positions, which can be identified directly from sequences, constitute key markers of protein folds and define long-range structural constraints, which, together with secondary structure predictions, limit the number of possible conformations for a given fold. Received: 24 April 1998 / Accepted: 4 August 1998 / Published online: 16 November 1998

13.
A significant challenge in homology detection is to identify sequences that share a common evolutionary ancestor, despite significant primary sequence divergence. Remote homologs will often have less than 30% sequence identity, yet still retain common structural and functional properties. We demonstrate a novel method for identifying remote homologs using a support vector machine (SVM) classifier trained by fusing sequence similarity scores and subcellular location prediction. SVMs have been shown to perform well in a variety of applications where binary classification of data is the goal. At the same time, data fusion methods have been shown to be highly effective in enhancing the discriminative power of data. Combining these two approaches in the application SVM-SimLoc resulted in identification of significantly more remote homologs (p-value < 0.006) than using either sequence similarity or subcellular location independently.

14.
The functions of many proteins are mediated by specific conformational changes, and therefore the ability to design primary sequences capable of secondary and tertiary changes is an important step toward the creation of novel functional proteins. To this end, we have developed an algorithm that can optimize a single amino acid sequence for multiple target structures. The algorithm consists of an outer loop, in which sequence space is sampled by a Monte Carlo search with simulated annealing, and an inner loop, in which the effect of a given mutation is evaluated on the various target structures by using the rotamer packing routine and composite energy function of the protein design software, RosettaDesign. We have experimentally tested the method by designing a peptide, Sw2, which can be switched from a 2Cys-2His zinc finger-like fold to a trimeric coiled-coil fold, depending upon the pH or the presence of transition metals. Physical characterization of Sw2 confirms that it is able to reversibly adopt each intended target fold.

15.
Conclusions: As each of the monoclonal antibodies binds to deglycosylated CEA, the antigenic determinants reside on the protein moiety of the molecule. For this reason, homologous amino acid sequences should be responsible for the presence of repetitive epitopes on CEA. Most probably, the crossreactive determinants on NCA are also protein in nature. Assuming single epitopes on NCA and homologous repetitive epitopes on CEA residing on the protein moieties of both molecules, an ancestral NCA-related gene may have undergone a recent duplication to form a single gene with subsequent divergence of the CEA specific sequences. The extensive homology of the aminoterminal sequence of NCA with that of CEA [6] is in accordance with this conclusion.
Monoclonal antibodies bind to repetitive cross-reacting and single specific epitopes on the protein moiety of carcinoembryonic antigen

16.
Evolutionarily related proteins have similar sequences. Such similarity is called homology and can be described using substitution matrices such as Blosum 60. Naturally occurring homologous proteins usually have similar stable tertiary structures, and this fact is used in so-called homology modeling. In contrast, the artificial protein designed by the Regan group has a sequence 50% identical to that of the B1 domain of Streptococcal IgG-binding protein and a structure similar to the protein Rop. In this study, we asked whether artificial similar protein sequences (pseudohomologs) tend to encode similar protein structures, as proteins existing in nature do. To answer this question, we designed sets of protein sequences (pseudohomologs) homologous to sequences having known three-dimensional structures (template structures), with the same number of identities, the same composition and an equal level of homology, according to the Blosum 60 substitution matrix, as the known natural homolog. We compared the structural features of homologs and pseudohomologs by fitting them to the template structure. The quality of such structures was evaluated by threading potentials. The packing quality was measured using three-dimensional homology models. The packing quality of the models was worse for the “pseudohomologs” than for real homologs. The native homologs have better threading potentials (indicating better sequence-structure fit) in the native structure than the designed sequences. Therefore, we have shown that threading potentials and proper packing are evolutionarily more strongly conserved than sequence homology measured using the Blosum 60 matrix. Our results indicate that three-dimensional protein structure is evolutionarily more conserved than expected due to sequence conservation.

17.
Recent studies suggest that protein folding should be revisited as the emergent property of a complex system and that nature allows only a very limited number of folds that seem to be strongly influenced by geometrical properties. In this work we explore the principles underlying this new view and show how helical protein conformations can be obtained starting from simple geometric considerations. We generated a large data set of C-alpha traces made of 65 points, by computationally solving a backbone model that takes into account only topological features of the all-alpha proteins; then, we built corresponding tertiary structures, by using the sequences associated with the crystallographic structures of four small globular all-alpha proteins from PDB, and analysed them in terms of structural and energetic properties. In this way we obtained four poorly populated sets of structures that are reasonably similar to the conformational states typical of the experimental PDB structures. These results show that our computational approach can capture the native topology of all-alpha proteins; furthermore, it generates backbone folds without the influence of the side chains and uses the protein sequence to select a specific fold among the generated folds. This agrees with the recent view that the backbone plays an important role in the protein folding process and that the amino acid sequence chooses its own fold within a limited total number of folds.

18.
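Proteins are the material basis of all living organisms and the main carriers of life activities, taking part in the regulation of many physiological functions. Designing proteins with specific functions is of great significance in protein engineering, biomedicine, materials science, and other fields. The goal of protein sequence design is to design amino acid sequences that fold into a desired structure and have the corresponding function; it is the core problem of all rational protein engineering and has great research and application potential. With the exponential growth of protein sequence data and...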

19.
We explore automation of protein structural classification using supervised machine learning methods on a set of 11,360 pairs of protein domains (up to 35% sequence identity) consisting of three secondary structure elements. Fifteen algorithms from five categories of supervised algorithms are evaluated for their ability to learn, for a pair of protein domains, the deepest common structural level within the SCOP hierarchy, given a one-dimensional representation of the domain structures. This representation encapsulates evolutionary information in terms of sequence identity and structural information characterising the secondary structure elements and lengths of the respective domains. The evaluation is performed in two steps, first selecting the best performing base learners and subsequently evaluating boosted and bagged meta learners. The boosted random forest, a collection of decision trees, is found to be the most accurate, with a cross-validated accuracy of 97.0% and F-measures of 0.97, 0.85, 0.93 and 0.98 for classification of proteins to the Class, Fold, Super-Family and Family levels in the SCOP hierarchy. The meta learning regime, especially boosting, improved performance by more accurately classifying the instances from less populated classes.

20.
We introduce a method for ungapped local multiple alignment (ULMA) in a given set of amino acid or nucleotide sequences. This method explores two search spaces using a linked optimization strategy. The first search space M consists of all possible words of a given length W, defined on the residue alphabet. An evolutionary algorithm searches this space globally. The second search space P consists of all possible ULMAs in the sequence set, each ULMA being represented by a position vector defining exactly one subsequence of length W per sequence. This search space is sampled with hill-climbing processes. The searches of both spaces are coupled by projecting high scoring results from the global evolutionary search of M onto P. The hill-climbing processes then refine the optimization by local search, using the relative entropy between the ULMA and background residue frequencies as an objective function. We demonstrate some advantages of our strategy by analyzing difficult natural amino acid sequences and artificial datasets. A web interface is available at
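
The objective function named in abstract 20, the relative entropy between the aligned columns and background residue frequencies, can be sketched as follows. This is an illustration only: the pseudocount, the base-2 logarithm, and the name `relative_entropy` are assumptions, and the abstract does not say how the authors estimate column frequencies.

```python
import math
from collections import Counter

def relative_entropy(seqs, positions, width, background, pseudocount=1.0):
    """Score an ungapped local multiple alignment given one start position per sequence.

    Returns the sum over columns of sum_a f(a) * log2(f(a) / q(a)), where f is the
    pseudocount-smoothed column frequency and q the background frequency.
    """
    words = [s[p:p + width] for s, p in zip(seqs, positions)]
    alphabet = sorted(background)
    n = len(words) + pseudocount * len(alphabet)
    total = 0.0
    for col in range(width):
        counts = Counter(w[col] for w in words)
        for a in alphabet:
            f = (counts[a] + pseudocount) / n
            total += f * math.log2(f / background[a])
    return total
```

A hill-climbing step in the position space P would then change one entry of `positions` and keep the move if this score increases. On toy DNA sequences with a planted `ACGT` word, the position vector that aligns the planted copies scores higher than a misaligned one.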
