Similar literature
Found 20 similar documents (search time: 406 ms)
1.
All currently leading protein secondary structure prediction methods use a multiple protein sequence alignment to predict the secondary structure of the top sequence. In most of these methods, alignment positions showing a gap in the top sequence are deleted prior to prediction, which shrinks the alignment and loses position-specific information. In this paper we investigate the effect of this removal of information on secondary structure prediction accuracy. To this end, we have designed SymSSP, an algorithm that post-processes the predicted secondary structure of all sequences in a multiple sequence alignment by (i) making use of the alignment's evolutionary information and (ii) re-introducing most of the information that would otherwise be lost. The post-processed information is then given to a new dynamic programming routine that produces an optimally segmented consensus secondary structure for each of the multiple alignment sequences. We have tested our method on the state-of-the-art secondary structure prediction methods PHD, PROFsec, SSPro2 and JNET using the HOMSTRAD database of reference alignments. Our consensus-deriving dynamic programming strategy is consistently better at improving the segmentation quality of the predictions than the commonly used majority voting technique. In addition, we have applied several weighting schemes from the literature to our novel consensus-deriving dynamic programming routine. Finally, we have investigated the level of noise introduced by prediction errors into the consensus and show that predictions at the edges of helices and strands are wrong half of the time for all four tested prediction methods.
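For orientation, the majority voting baseline mentioned in this abstract can be sketched as below. This is a minimal illustration only, not the SymSSP dynamic programming routine. The function name, the three-state alphabet (H, E, C), the gap symbol and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def majority_vote_consensus(aligned_predictions, gap="-"):
    """Column-wise majority vote over aligned secondary structure strings.

    Hypothetical helper: each string holds one predicted state per alignment
    column (e.g. H, E, C), with `gap` where the sequence has no residue.
    """
    if not aligned_predictions:
        return ""
    length = len(aligned_predictions[0])
    if any(len(p) != length for p in aligned_predictions):
        raise ValueError("all aligned predictions must have the same length")

    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(state for state in column if state != gap)
        if not votes:
            # Every sequence has a gap in this column.
            consensus.append(gap)
            continue
        # Assumption: ties are broken alphabetically to keep the result deterministic.
        consensus.append(max(sorted(votes), key=lambda s: votes[s]))
    return "".join(consensus)


predictions = ["HHHEE-C", "HHCEEEC", "HCCEE-C"]
print(majority_vote_consensus(predictions))
```

A vote of this kind treats every column independently, which is why it can produce fragmented segments; the abstract reports that a segmentation-aware dynamic programming consensus does better on that criterion.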

2.
Identification and prediction of RNA-binding residues (RBRs) provides valuable insights into the mechanisms of protein-RNA interactions. We analyzed the contributions of a wide range of factors, including amino acid sequence, evolutionary conservation, secondary structure and solvent accessibility, to the prediction and characterization of RBRs. Five feature sets were designed and feature selection was performed to find and investigate relevant features. We demonstrate that (1) interactions with the positively charged amino acids Arg and Lys are preferred by the negatively charged nucleotides; (2) Gly provides flexibility for the RNA-binding sites; (3) Glu, with its negatively charged side chain, and several hydrophobic residues such as Leu, Val, Ala and Phe are disfavored in the RNA-binding sites; (4) coil residues, especially in long segments, are more flexible than other secondary structures and more likely to interact with RNA; (5) helical residues are more rigid and consequently less likely to bind RNA; and (6) residues partially exposed to the solvent are more likely to form RNA-binding sites. We introduce a novel sequence-based predictor of RBRs, RBRpred, which utilizes the selected features. RBRpred is comprehensively tested on three datasets with varied atom distance cutoffs by performing both five-fold cross validation and jackknife tests, and achieves Matthews correlation coefficients (MCC) of 0.51, 0.48 and 0.42, respectively. The quality is comparable to or better than that of state-of-the-art predictors that apply the distance-based cutoff definition. We show that the most important factor for RBR prediction is evolutionary conservation, followed by the amino acid sequence, predicted secondary structure and predicted solvent accessibility. We also investigate the impact of using native vs. predicted secondary structure and solvent accessibility. The predictions are sufficient for RBR prediction, and knowledge of the actual solvent accessibility helps in predictions for lower distance cutoffs.
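The Matthews correlation coefficient used to report these results is computed from the four confusion-matrix counts. The sketch below shows the standard formula; the function names and the 0/1 label encoding (1 = RNA-binding residue) are assumptions for illustration, and the numbers are toy values rather than data from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, not taken from the paper.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
print(matthews_corrcoef(*confusion_counts(y_true, y_pred)))
```

MCC is preferred over plain accuracy for this kind of task because binding residues are a small minority of all residues, so a predictor that labels everything as non-binding would still score high accuracy but an MCC of 0.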

3.
As several structural proteomics projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein function from protein structure are urgently needed. In this paper, we present a method that explores the clustering patterns of amino acids in three-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented as a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.
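The first step of this pipeline, grouping residues by spatial proximity, can be illustrated with a small agglomerative clustering routine. This is a sketch under stated assumptions: the abstract does not name the linkage criterion or the stopping rule, so single linkage and a fixed distance cutoff are used here purely for illustration, and the coordinates are invented.

```python
import math


def agglomerative_clusters(coords, cutoff):
    """Single-linkage agglomerative clustering of 3D points.

    Assumptions (not from the paper): single linkage, and merging stops once
    the two closest clusters are farther apart than `cutoff`.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(coords))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(coords[i], coords[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break
        # Merge cluster y into cluster x (y > x, so deleting y keeps x valid).
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Invented residue coordinates (e.g. C-alpha positions in angstroms).
points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0), (30, 0, 0)]
print(agglomerative_clusters(points, cutoff=2.0))
```

In the method described above, each resulting cluster would become one node of the protein graph; the node labelling and the shortest-path kernel are not shown here.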

4.
Modern protein secondary structure prediction methods are based on exploiting evolutionary information contained in multiple sequence alignments. Critical steps in the secondary structure prediction process are (i) the selection of a set of sequences that are homologous to a given query sequence, (ii) the choice of the multiple sequence alignment method, and (iii) the choice of the secondary structure prediction method. Because of the close relationship between these three steps and their critical influence on the prediction results, secondary structure prediction has received increased attention from the bioinformatics community over the last few years. In this treatise, we discuss recent developments in computational methods for protein secondary structure prediction and multiple sequence alignment, focus on the integration of these methods, and provide some recommendations for state-of-the-art secondary structure prediction in practice.

5.
RNA function annotation is often based on alignment to a previously studied template. In contrast to the study of proteins, there are not many alignment-free methods to predict RNA functions if alignment fails. The use of topological indices (TIs) of RNA complex networks (CNs) to find quantitative structure-activity relationships (QSAR) may be an alternative way to incorporate secondary structure or sequence-to-sequence similarity. Here, we introduce new QSAR-like techniques using RNA macromolecular CNs (mmCNs), where nodes are nucleotides, or RNA supramolecular CNs (smCNs), where nodes are RNA sequences. We studied a data set of 198 sequences including 18S-rRNAs (important phylogenetic molecular biomarkers). We constructed three types of RNA mmCNs: sequence-linear (SL), Cartesian-lattice (CL), and sequence-folding CNs (SF-CNs), and two smCNs: sequence-sequence disagreement CN (SSD) and sequence-sequence similarity (SSS-smCN). We reported the first comparative QSAR study with all these CIs and CNs, which includes: (i) spectral moments (^(i)μ_d(w)) of SL-mmCNs (accuracy = 75.3%), (ii) electrostatic CIs (ξ_d) of CL-mmCNs (>90%), (iii) thermodynamic parameters (ΔG, ΔH, ΔS, and T_m) of SF-mmCNs (64.7%), (iv) disagreement-distribution moments (M_k) of the SSD-smCN (79.3%), and (v) node centralities of the SSD-smCN (78.0%). Furthermore, we reported the experimental isolation of a new RNA sequence from Psidium guajava leaf tissue and its QSAR and BLAST prediction to illustrate the practical use of these methods. We also investigated the use of these CNs to explore rRNA diversity in bacteria, plants, and parasites from the Dactylogyrus genus. The HPL-mmCNs model was the best of all found. All the CNs and TIs, except SF-mmCNs, were introduced here for the first time for the QSAR study of RNA, which allowed a comparative study for RNA classification.

6.
Knowledge of structural classes is useful in understanding folding patterns in proteins. Although existing structural class prediction methods have applied virtually all state-of-the-art classifiers, many of them use a relatively simple protein sequence representation that often includes amino acid (AA) composition. To this end, we propose a novel sequence representation that incorporates evolutionary information encoded using PSI-BLAST profile-based collocation of AA pairs. We used six benchmark datasets and five representative classifiers to quantify and compare the quality of the structural class prediction with the proposed representation. The best classifier, a support vector machine, achieved 61-96% accuracy on the six datasets. These predictions were comprehensively compared with a wide range of recently proposed methods for prediction of structural classes. Our comprehensive comparison shows the superiority of the proposed representation, which results in error rate reductions that range between 14% and 26% when compared with predictions of the best-performing, previously published classifiers on the considered datasets. The study also shows that, for the benchmark datasets that include sequences characterized by low identity (i.e., 25%, 30%, and 40%), the prediction accuracies are 20-35% lower than for the other three datasets, which include sequences with a higher degree of similarity. In conclusion, the proposed representation is shown to substantially improve the accuracy of the structural class prediction. A web server that implements the presented prediction method is freely available at http://biomine.ece.ualberta.ca/Structural_Class/SCEC.html.

7.
The ability to predict protein folding rates constitutes an important step in understanding the overall folding mechanisms. Although many of the prediction methods are structure based, successful predictions can also be obtained from the sequence. We developed a novel method called prediction of protein folding rates (PPFR) for the prediction of protein folding rates from protein sequences. PPFR implements a linear regression model for each of the mainstream folding dynamics, including two-, multi-, and mixed-state proteins. The proposed method provides predictions characterized by strong correlations with the experimental folding rates, which equal 0.87 for the two- and multistate proteins and 0.82 for the mixed-state proteins when evaluated with an out-of-sample jackknife test. Based on in-sample and out-of-sample tests, PPFR's predictions are shown to be better than those of most other sequence-only and structure-based predictors and complementary to the predictions of the most recent sequence-based QRSM method. We show that simultaneous incorporation of several characteristics, including the sequence, physicochemical properties of residues, and predicted secondary structure, provides improved quality. This hybridized prediction model was analyzed to reveal the complementary factors that can be used in tandem to predict folding rates. We show that bigger proteins require more time for folding; that higher helical and coil content and the presence of Phe, Asn, and Gln may accelerate the folding process; that the inclusion of Ile, Val, Thr, and Ser may slow down the folding process; and that, for the two-state proteins, increased beta-strand content may decelerate the folding process. Finally, PPFR provides strong correlation when predicting sequences with low similarity.
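The evaluation protocol named in this abstract, a linear regression model scored by its correlation with experimental values under a jackknife (leave-one-out) test, can be sketched as follows. This is not the PPFR model: it uses a single hypothetical feature and synthetic numbers, and all function names are assumptions made for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def jackknife_predictions(xs, ys):
    """Predict each sample with a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data: a hypothetical sequence-derived feature vs. a folding-rate value.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rate = [5.1, 4.2, 2.9, 2.1, 0.8, 0.1]
print(pearson(jackknife_predictions(feature, rate), rate))
```

Because every prediction comes from a model that never saw the sample being predicted, the resulting correlation is an out-of-sample estimate, which is the sense in which the abstract reports its 0.87 and 0.82 values.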

8.
The evolutionary relationships of organisms are traditionally delineated by alignment-based methods using DNA or protein sequences. In the post-genome era, the phylogenetics of life can be inferred from many sources, such as genomic features, and not just from comparison of one or several genes. To investigate the possibility that the physicochemical properties of protein sequences might reflect their phylogenetic properties, an alignment-free method using a support vector machine (SVM) classifier is implemented to establish the phylogenetic relationships between protein sequences. Two types of datasets, namely “Enzymatic” (assigned by an EC accession) and “Proteins”, are used to train the SVM classifiers. By computing the F-score for feature selection, we find that the classification accuracies of the trained SVM classifiers can be significantly enhanced, to 84% and 80% for the enzymatic and “proteins” datasets, respectively, if the protein sequences are represented with the top 255 selected features. These results show that selected physicochemical features of amino acid sequences are sufficient for inferring the phylogenetic properties of the protein sequences. Moreover, we find that the selected physicochemical features appear to correlate with the physiological characteristics of the taxonomic classes classified. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010
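The abstract does not say which F-score definition was used for feature selection. The sketch below assumes the per-feature F-score that is commonly paired with SVM classifiers (between-class separation of the means divided by the sum of within-class sample variances); that choice, the function name and the toy values are all assumptions.

```python
def f_score(pos_values, neg_values):
    """Per-feature F-score for a two-class problem.

    Assumed definition: ((mean+ - mean)^2 + (mean- - mean)^2) divided by the
    sum of the sample variances of the positive and negative classes.
    A larger value suggests a more discriminative feature.
    """
    if len(pos_values) < 2 or len(neg_values) < 2:
        raise ValueError("each class needs at least two samples")
    all_values = list(pos_values) + list(neg_values)
    mean_all = sum(all_values) / len(all_values)
    mean_pos = sum(pos_values) / len(pos_values)
    mean_neg = sum(neg_values) / len(neg_values)

    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((v - mean_pos) ** 2 for v in pos_values) / (len(pos_values) - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg_values) / (len(neg_values) - 1)
    denominator = var_pos + var_neg
    if denominator == 0:
        # No within-class spread: perfectly separating or completely constant feature.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


# Toy feature values for the two classes (not from the paper).
print(f_score([3.0, 4.0, 5.0], [0.0, 1.0, 2.0]))
```

Features would then be ranked by this score and the top-ranked ones kept; the abstract reports using the top 255.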

9.
Multiple sequence alignment (MSA) is one of the fundamental research topics in computational biology. Alignments help us to find functional assignments, evolutionary history and conserved regions. Previous methods use a substitution matrix and do not incorporate knowledge of the sequences being aligned. Therefore, they do not ensure the alignment of similar structures and common patterns in the sequences. We have been investigating a solution to this problem in multiple sequence alignment that makes use of knowledge of the sequences being aligned, including patterns in the Prosite databank and the Blocks+ and eBlocks databases, as well as motif and structural information. A pattern-constrained algorithm has been developed. Experiments with protein sequences have shown more accurate alignments when the domain knowledge available for the sequences is incorporated.

10.
11.
12.
In NMR spectroscopy, residual dipolar couplings (RDCs) have emerged as one of the most exquisite probes of biological structure and dynamics. The measurement of RDCs relies on the partial alignment of the molecule of interest, for example by using a liquid crystal as a solvent. Here, we establish bacterial type 1 pili as an alternative liquid-crystalline alignment medium for the measurement of RDCs. To achieve alignment at pilus concentrations that allow for efficient NMR sample preparation, we elongated wild-type pili by recombinant overproduction of the main structural pilus subunit. Building on the extraordinary stability of type 1 pili against spontaneous dissociation and unfolding, we show that the medium is compatible with challenging experimental conditions such as high temperature, the presence of detergents, organic solvents or very acidic pH, setting it apart from most established alignment media. Using human ubiquitin, HIV-1 TAR RNA and camphor as spectroscopic probes, we demonstrate the applicability of the medium for the determination of RDCs of proteins, nucleic acids and small molecules. Our results show that type 1 pili represent a very useful alternative to existing alignment media and may readily assist the characterization of molecular structure and dynamics by NMR.

13.
Solvent accessibility prediction from amino acid sequences has been pursued by several researchers. Such a prediction typically starts by transforming the amino acid category (or type) information into numerical representations. All twenty amino acids can be completely and uniquely represented by 20-dimensional vectors. Here, we investigate whether the amino acid space defined in this way really requires twenty dimensions, and we try to develop corresponding representations in fewer dimensions. A method for searching for optimal codification schemes in an arbitrary space using neural networks was developed. The method is used to obtain optimal encodings of amino acids at various levels of dimensionality, and is applied to optimize the amino acid codifications for the prediction of the solvent accessibility values of proteins using feed-forward neural networks. The traditional 20-dimensional codification seems to be redundant for the solvent accessibility prediction problem, since a 1-dimensional codification is able to achieve almost the same degree of accuracy as the 20-dimensional codification. Optimal coding in far fewer dimensions could therefore be used to predict accessible surface area with almost the same degree of accuracy as that obtained with a fully unique 20-dimensional coding. The 1-dimensional amino acid codification for solvent accessibility prediction, obtained in a purely mathematical way based on neural networks, is highly correlated with a physical property of the amino acids, namely their average solvent accessibility. The method developed to find the optimal codification is general, although the codification thus produced depends on the type of estimated property.
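The 20-dimensional representation that this abstract takes as its starting point is the usual one-hot encoding, sketched below together with a generic lookup for a lower-dimensional codebook. The function names and the alphabetical residue ordering are assumptions, and the codebook argument is a placeholder: the optimized low-dimensional values from the study are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def one_hot(residue):
    """Return the 20-dimensional one-hot vector for a single residue."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown residues
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def encode_sequence(sequence, codebook=None):
    """Encode a sequence residue by residue.

    With no codebook, the 20-dimensional one-hot vectors are used. Otherwise
    `codebook` maps each residue to a vector of any dimensionality, for
    example a learned 1-dimensional code (values must be supplied by the caller).
    """
    if codebook is None:
        return [one_hot(r) for r in sequence]
    return [list(codebook[r]) for r in sequence]


encoded = encode_sequence("ACD")
print(len(encoded), len(encoded[0]))
```

The one-hot form makes every pair of residues equally dissimilar, which is the redundancy the abstract questions: a network given a single well-chosen number per residue reportedly does almost as well on this task.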

14.
The program ALIGN_MTX aligns two textual sequences, allowing sequence elements to be designated by any number of characters and supporting arbitrary user-defined substitution matrices. It can be used not only for the alignment of amino acid and nucleotide sequences but also for the sequence-structure alignment used in threading, for amino acid sequence alignment using a previously computed PSSM, and in other cases where alignment of biological or non-biological textual sequences is required. This distinguishes it from the majority of similar alignment programs, which as a rule align only amino acid or nucleotide sequences represented as strings of single alphabetic characters. ALIGN_MTX is available as a downloadable zip archive at http://www.imbbp.org/software/ALIGN_MTX/ and is free to use. As an application of the program, the results of a comparison of different types of substitution matrix with respect to alignment quality on sets of distantly related protein pairs are presented. The threading matrix SORDIS, which is based on side-chain orientation relative to hydrophobic core centers, the evolutionary change-based substitution matrix BLOSUM, and position-specific score matrices (PSSM), which use multiple sequence alignment information, were tested for alignment accuracy. PSSM shows the best performance, but on the reduced set with lower sequence similarity the threading matrix SORDIS shows the same performance, and a combined potential using SORDIS and PSSM can improve alignment quality for evolutionarily distantly related protein pairs.
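The general idea of aligning sequences whose elements are multi-character tokens, scored with a user-supplied substitution matrix, can be sketched with a textbook global alignment. This is not the ALIGN_MTX implementation, whose algorithm and file formats are not described in the abstract. The function names, the linear gap penalty, the dictionary-based matrix and the toy scores are all assumptions.

```python
def make_scorer(matrix, default):
    """Build a symmetric scoring function from a dict keyed by token pairs."""
    def score(a, b):
        if (a, b) in matrix:
            return matrix[(a, b)]
        return matrix.get((b, a), default)
    return score


def global_align(seq_a, seq_b, score, gap=-2):
    """Needleman-Wunsch global alignment over lists of arbitrary tokens.

    Returns (total score, aligned tokens of seq_a, aligned tokens of seq_b),
    with "-" marking gap positions. Integer scores keep the traceback exact.
    """
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner, preferring substitutions.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], out_a[::-1], out_b[::-1]


# Toy matrix over three-letter tokens: 2 for identity, -1 for anything else.
toy_matrix = {("Ala", "Ala"): 2, ("Gly", "Gly"): 2, ("Ser", "Ser"): 2}
scorer = make_scorer(toy_matrix, default=-1)
print(global_align(["Ala", "Gly", "Ser"], ["Ala", "Ser"], scorer))
```

Working on token lists rather than character strings is what lets the same routine handle three-letter residue names, structural state labels or non-biological text, provided the matrix has entries for the tokens involved.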

15.
RNA secondary structure prediction is a key technology in RNA bioinformatics. Most algorithms for RNA secondary structure prediction use probabilistic models, in which the model parameters are trained with reliable RNA secondary structures. Because of the difficulty of determining RNA secondary structures by experimental procedures, such as NMR or X-ray crystal structural analyses, there are still many RNA sequences that could be useful for training whose secondary structures have not been experimentally determined. In this paper, we introduce a novel semi-supervised learning approach for training parameters in a probabilistic model of RNA secondary structures in which we employ not only RNA sequences with annotated secondary structures but also ones with unknown secondary structures. Our model is based on a hybrid of generative (stochastic context-free grammars) and discriminative models (conditional random fields) that has been successfully applied to natural language processing. Computational experiments indicate that the accuracy of secondary structure prediction is improved by incorporating RNA sequences with unknown secondary structures into training. To our knowledge, this is the first study of a semi-supervised learning approach for RNA secondary structure prediction. This technique will be useful when the number of reliable structures is limited.

16.
RNA structure is hierarchical. Secondary structure contacts, i.e. the canonical base pair contacts, are generally stronger and form faster than the tertiary structure. Therefore, RNA secondary structures can be predicted independently of tertiary structure prediction. Furthermore, the stability of a given RNA secondary structure can be quantified using nearest neighbor free energy parameters. These parameters are the basis of a number of free energy minimization algorithms that predict RNA secondary structure for either a single sequence or multiple sequences. This article reviews the progress of RNA secondary structure prediction by free energy minimization and describes many of the algorithms that have been developed.
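The dynamic programming structure shared by single-sequence folding algorithms can be illustrated with a much simpler objective than the one this review covers. The sketch below maximizes the number of nested base pairs (the Nussinov recursion) instead of minimizing a nearest neighbor free energy, so it is a swapped-in simplification, not one of the reviewed algorithms. The function name, the allowed pair set (including G-U wobble pairs) and the minimum hairpin loop length of 3 are assumptions.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion).

    Simplification: counts pairs rather than summing nearest neighbor free
    energies. `min_loop` is the minimum number of unpaired bases enclosed
    by any pair.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    # case 2: j pairs with k, splitting the interval in two
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

Free energy minimization methods follow the same interval decomposition but replace the "+1 per pair" term with stacking and loop energies that depend on neighboring pairs, which is where the nearest neighbor parameters enter.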

17.
We present the results of the first quantum chemical investigations of ¹H NMR hyperfine shifts in the blue copper proteins (BCPs): amicyanin, azurin, pseudoazurin, plastocyanin, stellacyanin, and rusticyanin. We find that very large structural models that incorporate extensive hydrogen bond networks, as well as geometry optimization, are required to reproduce the experimental NMR hyperfine shift results, the best theory vs experiment predictions having R² = 0.94, a slope of 1.01, and an SD of 40.5 ppm (or approximately 4.7% of the overall shift range of approximately 860 ppm). We also find interesting correlations between the hyperfine shifts and the bond and ring critical point properties computed using atoms-in-molecules theory, in addition to finding that hyperfine shifts can be well predicted by using an empirical model, based on the geometry-optimized structures, which in the future should be of use in structure refinement.

18.
Genome sequencing projects resulted in the identification of a large number of new sequence homologs of archaeal rhodopsins in marine bacteria, fungi, and unicellular algae. It is an important task to unambiguously predict the functions of these new rhodopsins, as it is difficult to perform individual experiments on every newly discovered sequence. The transmembrane segments of rhodopsins have similar three-dimensional structures where the seven transmembrane helices form a tightly packed scaffold to accommodate a covalently bound retinal. We use geometric computations to accurately define the retinal-binding pockets in high-resolution structures of rhodopsins and to extract residues forming the wall of the retinal-binding pocket. We then obtain a tree defining the functional relationship of rhodopsins based on the short sequences of residues forming the wall of the retinal-binding pocket concatenated from the primary sequence, and show that these sequence fragments are often sufficient to discriminate phototactic vs transporting function of the bacterial and unicellular algal rhodopsins. We further study the evolutionary history of retinal-binding pockets by estimating the pocket residue substitution rates using a Bayesian Monte Carlo method. Our findings indicate that every functional class of rhodopsins has a specific allowed set of fast-rate amino acid substitutions in the retinal-binding pocket that may contribute to spectral tuning or photocycle modulation. The substitution rates of the amino acid residues in a putative retinal-binding pocket of marine proteorhodopsins together with the clustering of pocket sequences indicate that green-absorbing and blue-absorbing proteorhodopsins have similar function. Our results demonstrate that the evolutionary patterns of the retinal-binding pockets reflect the functional specificity of the rhodopsins. The approach we describe in this paper may be useful for large-scale functional prediction of rhodopsins.

19.
Efforts to use computers in predicting the secondary structure of proteins based only on primary structure information started over a quarter century ago [1-3]. Although the results were encouraging initially, the accuracy of the pioneering methods generally did not attain the level required for using predictions of secondary structures reliably in modelling the three-dimensional topology of proteins. During the last decade, however, the introduction of new computational techniques as well as the use of multiple sequence information has led to a dramatic increase in the success rate of prediction methods, such that successful 3D modelling based on predicted secondary structure has become feasible [e.g., Ref 4]. This review is aimed at presenting an overview of the scale of the secondary structure prediction problem and associated pitfalls, as well as the history of the development of computational prediction methods. As recent successful strategies for secondary structure prediction all rely on multiple sequence information, some methods for accurate protein multiple sequence alignments will also be described. While the main focus is on prediction methods for globular proteins, the prediction of trans-membrane segments within membrane proteins will also be briefly summarised. Finally, an integrated iterative approach tying secondary structure prediction and multiple alignment will be introduced [5].

20.
Binding affinity prediction is frequently addressed using computational models constructed solely with molecular structure and activity data. We present a hybrid structure-guided strategy that combines molecular similarity, docking, and multiple-instance learning such that information from protein structures can be used to inform models of structure–activity relationships. The Surflex-QMOD approach has been shown to produce accurate predictions of binding affinity by constructing an interpretable physical model of a binding site with no experimental binding site structural information. We introduce a method to integrate protein structure information into the model induction process in order to construct more robust physical models. The structure-guided models accurately predict binding affinities over a broad range of compounds while producing more accurate representations of the protein pockets and ligand binding modes. Structure-guidance for the QMOD method yielded significant performance improvements, both for affinity and pose prediction, especially in cases where predictions were made on ligands very different from those used for model induction.
