Similar literature
 20 similar documents found
1.
Evolutionarily related proteins have similar sequences. Such similarity is called homology and can be described using substitution matrices such as Blosum 60. Naturally occurring homologous proteins usually have similar stable tertiary structures, and this fact is used in so-called homology modeling. In contrast, the artificial protein designed by the Regan group has a sequence 50% identical to the B1 domain of Streptococcal IgG-binding protein and a structure similar to the protein Rop. In this study, we asked whether artificial similar protein sequences (pseudohomologs) tend to encode similar protein structures, as proteins existing in nature do. To answer this question, we designed sets of protein sequences (pseudohomologs) homologous to sequences having known three-dimensional structures (template structures), with the same number of identities, the same composition, and the same level of homology, according to the Blosum 60 substitution matrix, as the known natural homolog. We compared the structural features of homologs and pseudohomologs by fitting them to the template structure. The quality of such structures was evaluated by threading potentials. The packing quality was measured using three-dimensional homology models. The packing quality of the models was worse for the pseudohomologs than for real homologs. The native homologs have better threading potentials (indicating better sequence-structure fit) in the native structure than the designed sequences. Therefore, we have shown that threading potentials and proper packing are evolutionarily more strongly conserved than sequence homology measured using the Blosum 60 matrix. Our results indicate that three-dimensional protein structure is evolutionarily more conserved than would be expected from sequence conservation alone.

2.
As more and more protein sequences become available, homolog identification becomes increasingly important for functional, structural, and evolutionary studies of proteins. Many homologous proteins diverged a very long time ago in their evolutionary history, and thus their sequences share low sequence identity. These remote homologs have become a research focus in bioinformatics over the past decade, and some significant advances have been achieved. In this paper, we provide a comprehensive review of computational techniques used in remote homolog identification based on different methods, including sequence-sequence comparison, sequence-structure comparison, and structure-structure comparison. Other miscellaneous approaches are also summarized. Pointers to the online resources of these methods and their related databases are provided. Comparisons among different methods in terms of their technical approaches, strengths, and limitations follow. Studies on proteins in SARS-CoV are shown as an example application of remote homolog identification.

3.
Due to the exponential growth of sequenced genomes, the need to quickly provide accurate annotation for existing and new sequences is paramount to facilitate biological research. Current sequence comparison approaches fail to detect homologous relationships when sequence similarity is low. Support vector machine (SVM) algorithms approach this problem by transforming all proteins into a feature space of equal dimension based on protein properties, such as sequence similarity scores against a basis set of proteins or motifs. This multivariate representation of the protein space is then used to build a classifier specific to a pre-defined protein family. However, this approach is not well suited to large-scale annotation. We have developed an SVM approach that formulates remote homology detection as a single classifier that answers the pairwise comparison problem: the two feature vectors for a pair of sequences are integrated into a single vector representation, which is used to build a classifier that separates sequence pairs into homologs and non-homologs. This pairwise SVM approach significantly improves remote homology detection on the benchmark dataset, quantified as the area under the receiver operating characteristic curve: 0.97 versus 0.73 and 0.70 for PSI-BLAST and Basic Local Alignment Search Tool (BLAST), respectively.
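The area under the ROC curve quoted above can be read as the probability that a randomly chosen homolog pair scores higher than a randomly chosen non-homolog pair. The sketch below computes it that way on made-up scores; the function name `roc_auc`, the scores, and the labels are illustrative assumptions, not data or code from the paper.

```python
def roc_auc(scores, labels):
    """AUC as a rank statistic: fraction of (positive, negative) pairs
    in which the positive scores higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six sequence pairs (1 = homologs).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The quadratic loop is fine for a toy example; a sort-based rank computation would be the usual choice for large benchmarks.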

4.
Protein-Protein Interaction (PPI) prediction is a well-known problem in Bioinformatics, for which a large number of techniques have been proposed in the past. However, prediction results have not been sufficiently satisfactory for guiding biologists in wet-lab experiments. One reason is that not all useful information, such as pairwise protein interaction information based on sequence alignment, has been integrated in PPI prediction. Alignment is a basic concept for measuring sequence similarity in Proteomics that has been used in a number of applications ranging from protein recognition to protein subcellular localization. In this article, we propose a novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM). SVM is a machine learning technique used in a wide range of Bioinformatics applications, thanks to its ability to alleviate overfitting problems. We demonstrate that in our approach the two methods, SA-kNN and SVM, are complementary and can be combined in an ensemble to overcome their respective limitations. While the SVM is trained on Amino Acid (AA) compositions and protein signatures mined from literature, the SA-kNN makes use of the similarity of two protein pairs through alignment. Experimentally, our technique leads to significant gains in accuracy, precision and sensitivity of approximately 5%, 16% and 10%, respectively.

5.
The function of a eukaryotic protein is closely correlated with its subcellular location. The number of newly found protein sequences entering data banks is rapidly increasing with the success of the Human Genome Project. It is highly desirable to predict a protein's subcellular location automatically from its amino acid sequence. In this paper, amino acid hydrophobic patterns and average power-spectral density (APSD) are introduced to define pseudo amino acid composition. The covariant-discriminant predictor is used to predict subcellular location. An immune-genetic algorithm (IGA) is used to find the fittest weight factors, which are very important in this method. High success rates are obtained by both the self-consistency test (86%) and the jackknife test (73%). More than 80% predictive accuracy is achieved in an independent dataset test. The results demonstrate that the proposed method is practical. The method also indicates that protein subcellular location can be predicted from the surface physicochemical characteristics of protein folding.

6.
7.
The subcellular location of a protein is closely correlated with its biological function. In this paper, two new pattern classification methods, termed Nearest Feature Line (NFL) and Tunable Nearest Neighbor (TNN), have been introduced to predict the subcellular location of proteins based on their amino acid composition alone. The simulation experiments were performed with the jackknife test on a previously constructed data set, which consists of 2,427 eukaryotic and 997 prokaryotic proteins. All protein sequences in the data set fall into four eukaryotic subcellular locations and three prokaryotic subcellular locations. The NFL classifier reached total prediction accuracies of 82.5% for the eukaryotic proteins and 91.0% for the prokaryotic proteins. The TNN classifier reached total prediction accuracies of 83.6% and 92.2%, respectively. Compared with Support Vector Machine (SVM) and Nearest Neighbor methods, these two methods display similar or even higher prediction accuracies. Hence, we conclude that NFL and TNN can be used as complementary methods for prediction of protein subcellular locations.
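As a rough illustration of the Nearest Feature Line idea named in the abstract above, the sketch below assigns a query vector to the class whose pair of prototypes spans the closest line. The function names, the three-component toy vectors, and the class labels are assumptions made for illustration; the abstract does not give the authors' implementation, and real amino acid composition vectors would have 20 components.

```python
import math
from itertools import combinations


def feature_line_distance(x, p1, p2):
    """Euclidean distance from x to the infinite line through p1 and p2."""
    direction = [b - a for a, b in zip(p1, p2)]
    denom = sum(c * c for c in direction)
    if denom == 0.0:
        # Degenerate line (identical prototypes): fall back to point distance.
        return math.dist(x, p1)
    # Position of the orthogonal projection of x along the line.
    mu = sum((xi - a) * c for xi, a, c in zip(x, p1, direction)) / denom
    projection = [a + mu * c for a, c in zip(p1, direction)]
    return math.dist(x, projection)


def nfl_classify(x, prototypes):
    """prototypes maps a class label to a list of at least two vectors."""
    best_label, best_dist = None, math.inf
    for label, vectors in prototypes.items():
        for p1, p2 in combinations(vectors, 2):
            dist = feature_line_distance(x, p1, p2)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical prototypes, two per location.
toy = {
    "cytoplasm": [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]],
    "nucleus": [[0.1, 0.4, 0.5], [0.2, 0.3, 0.5]],
}
label, dist = nfl_classify([0.55, 0.25, 0.2], toy)
print(label, dist)
```

Because every pair of same-class prototypes defines a line, the number of candidate lines grows quadratically with class size, which is worth keeping in mind for data sets of a few thousand proteins.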

8.
High energy (4 keV) collision-induced dissociation (CID) product ion spectra have been obtained for a series of isomeric sugar molecules of close structural similarity. The reproducibility of the approach has been established and the spectra shown to have significant differences. These differences have been rationalised in terms of conventional mass spectrometric fragmentation rules. The data have also been subjected to analysis using chemometric methods, which require no specialist mass spectrometric input. The resulting classification of the data shows good agreement with the conventional interpretation approach.

9.

Background  

Genetic variants in the FTO (fat mass and obesity associated) gene have been associated with an increased risk of obesity. However, the function of its protein product has not been experimentally studied and previously reported sequence similarity analyses suggested the absence of homologs in existing protein databases. Here, we present the first detailed computational analysis of the sequence and predicted structure of the protein encoded by FTO.

10.
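Application of similarity system theory to the similarity evaluation of chromatographic fingerprints of traditional Chinese medicines   Total citations: 27, self-citations: 0, citations by others: 27
Liu Yongsuo, Meng Qinghua, Jiang Shumin, Hu Yuzhu. Chinese Journal of Chromatography (《色谱》), 2005, 23(2): 158-163
Methods for evaluating the similarity of chromatographic fingerprints of traditional Chinese medicines were studied, and a method for calculating an improved extent similarity was proposed. Simulated and experimental data were used to compare the correlation coefficient, the cosine of the included angle, and the improved extent similarity. The correlation coefficient and the cosine were found to be insufficiently sensitive to differences in the data, and they remained insensitive after preprocessing. The improved extent similarity reflects differences in the data and can therefore be used to evaluate the similarity of the common peaks in chromatographic fingerprints of traditional Chinese medicines.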
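One way to see the insensitivity reported in the abstract above: both the correlation coefficient and the cosine are unchanged when every peak area is multiplied by the same factor, so a sample with twice the content of every component still scores 1. The sketch below shows this on invented peak areas. The improved extent similarity itself is not reproduced, because the abstract does not state its formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two peak-area vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson r, computed as the cosine of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return cosine_similarity(centred_a, centred_b)


# Hypothetical areas of five common peaks in a reference fingerprint.
reference = [120.0, 45.0, 300.0, 15.0, 80.0]
# Same profile, but every component present at twice the amount.
doubled = [2.0 * area for area in reference]

cos_val = cosine_similarity(reference, doubled)
r_val = pearson_correlation(reference, doubled)
print(f"cosine = {cos_val:.6f}, correlation = {r_val:.6f}")
```

Both values come out as 1 (up to rounding), even though the two samples differ by a factor of two in absolute content.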

11.
Biopolymer sequence comparison to identify evolutionarily related proteins, or homologs, is one of the most common tasks in bioinformatics. Support vector machines (SVMs) represent a new approach to the problem in which statistical learning theory is employed to classify proteins into families, thus identifying homologous relationships. Current SVM approaches have been shown to outperform iterative profile methods, such as PSI-BLAST, for protein homology classification. In this study, we demonstrate that the utilization of a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation over a benchmark dataset consisting of 54 unique protein families. The SVM-BALSA algorithm returns a higher area under the receiver operating characteristic (ROC) curve for 37 of the 54 families and achieves an improved overall performance curve at a significance level of 0.07.

12.
Predicting the location where a protein resides within a cell is important in cell biology. Computational approaches to this issue have attracted increasing attention from the biomedical community. Among the protein features used to predict the subcellular localization of proteins, the feature derived from Gene Ontology (GO) has been shown to be superior to others. However, most work in this field focuses on the presence or absence of some predefined GO terms. We propose a method to derive information from the intrinsic structure of the GO graph. The feature vector was constructed with each element representing the information content of a GO term annotating the protein under investigation, and a support vector machine was used as the classifier to test the extracted features. Evaluation experiments were conducted on three protein datasets, and the results show that our method improves eukaryotic and human subcellular location prediction accuracy by up to 1.1% over previous studies that also used GO-based features. In particular, in the scenario where the cellular component annotation is absent, our method achieves satisfactory results, with an overall accuracy of more than 87%.

13.
Sirtuins are a family of proteins that play a key role in regulating a wide range of cellular processes including DNA regulation, metabolism, aging/longevity, cell survival, apoptosis, and stress resistance. Sirtuins are protein deacetylases and are included in the class III family of histone deacetylase enzymes (HDACs). The class III HDACs contain seven members of the sirtuin family, SIRT1 to SIRT7. The seven members of the sirtuin family have various substrates and are present in nearly all subcellular localizations including the nucleus, cytoplasm, and mitochondria. In this study, a deep neural network approach using one-dimensional Convolutional Neural Networks (CNN) was proposed to build a prediction model that can accurately identify the outcome of the sirtuin protein by targeting their subcellular localizations. To this end, the function and localization of sirtuin targets were analyzed and annotated to compartmentalize them into distinct subcellular localizations. We further reduced the sequence similarity between protein sequences, and three feature extraction methods were applied to the datasets. Finally, the proposed method was tested and compared with various machine-learning algorithms. The proposed method was validated on two independent datasets and showed an average of up to 85.77% sensitivity, 97.32% specificity, and 0.82 MCC for the seven members of the sirtuin family of proteins.

14.
A new method based on probabilistic suffix trees (PSTs) is defined for pairwise comparison of distantly related protein sequences. The new definition is adopted in a discriminative framework for protein classification using pairwise sequence similarity scores in feature encoding. The framework uses support vector machines (SVMs) to separate structurally similar and dissimilar examples. The new discriminative system, which we call SVM-PST, has been tested on the SCOP family classification task and compared with the existing discriminative methods SVM-BLAST and SVM-Pairwise, which use BLAST similarity scores and dynamic-programming-based alignment scores, respectively. Results have shown that SVM-PST is more accurate than SVM-BLAST and competitive with SVM-Pairwise. In terms of computational efficiency, PST-based comparison is much better than dynamic-programming-based alignment. We also compared our results with the original family-based PST approach by which we were inspired. The present method provides a significantly better solution for protein classification in comparison with the family-based PST model.

15.
In numerous studies charge remote fragmentation (CRF) has been shown to be a powerful technique for determination of primary structure by allowing location of double bonds, various functional groups, and branching in a variety of compound types directly by mass spectrometry. Instrumentation and ionization methods traditionally used for CRF, however, are becoming rare, in large part because ESI and MALDI have to a significant extent replaced them. Here we demonstrate that by selecting a matrix that promotes rather than suppresses ionization of fatty acids (FA) by lithium ion adduction, and using a TOF-TOF mass spectrometer for high-energy collisional activation, CRF ions are produced that allow location of double-bond and branching positions. Further, we show that by using solvent-free MALDI sample preparation methods, thus eliminating the inherent segregation of the hydrophobic fatty acid from the hydrophilic LiCl that can occur during the evaporation of solvent, the desired [FA-H+2Li]+ ions are greatly enhanced. Because FAs can be vaporized using laser desorption, matrix assistance in desorption of the fatty acid may occur, but is not necessary. However, the matrix plays a crucial role in enhancing or suppressing ionization. For example, matrix materials with acid (e.g., 2,5-dihydroxybenzoic acid) or hydroxy groups (e.g., dithranol) compete with the FA for Li+ and because of the high ratio of matrix to analyte, FA lithium adduction is minimized. However, highly electron-deficient matrix materials (e.g., TCNQ) readily donate Li+ to FAs because of the instability associated with being positively charged.

16.
Similarity-based methods for virtual screening are widely used. However, conventional searching using 2D chemical fingerprints or 2D graphs may retrieve only compounds which are structurally very similar to the original target molecule. Of particular current interest then is scaffold hopping, that is, the ability to identify molecules that belong to different chemical series but which could form the same interactions with a receptor. Reduced graphs provide summary representations of chemical structures and, therefore, offer the potential to retrieve compounds that are similar in terms of their gross features rather than at the atom-bond level. Using only a fingerprint representation of such graphs, we have previously shown that actives retrieved were more diverse than those found using Daylight fingerprints. Maximum common substructures give an intuitively reasonable view of the similarity between two molecules. However, their calculation using graph-matching techniques is too time-consuming for use in practical similarity searching in larger data sets. In this work, we exploit the low cardinality of the reduced graph in graph-based similarity searching. We reinterpret the reduced graph as a fully connected graph using the bond-distance information of the original graph. We describe searches, using both the maximum common induced subgraph and maximum common edge subgraph formulations, on the fully connected reduced graphs and compare the results with those obtained using both conventional chemical and reduced graph fingerprints. We show that graph matching using fully connected reduced graphs is an effective retrieval method and that the actives retrieved are likely to be topologically different from those retrieved using conventional 2D methods.

17.
Recently a method (RASCAL) for determining graph similarity using a maximum common edge subgraph algorithm has been proposed which has proven to be very efficient when used to calculate the relative similarity of chemical structures represented as graphs. This paper describes heuristics which simplify a RASCAL similarity calculation by taking advantage of certain properties specific to chemical graph representations of molecular structure. These heuristics are shown experimentally to increase the efficiency of the algorithm, especially at more distant values of chemical graph similarity.

18.
Apoptosis is a fundamental process controlling normal tissue homeostasis by regulating a balance between cell proliferation and death. Predicting the subcellular location of apoptosis proteins is very helpful for understanding the mechanism of programmed cell death. Predicting protein subcellular localization with bioinformatic techniques provides quite a few opportunities in related fields. In this work, we propose the use of a hierarchical extreme learning machine (H-ELM) to classify high-dimensional input data without requiring a dimension reduction process, which yields acceptable results. Features are extracted from different perspectives, and a feature fusion process is carried out. Regarding the position-specific scoring matrix, the first type of feature depicts the correlation within the sequence using the autocorrelation function for relatively random sections of the sequence, and the second type is the Kullback-Leibler (K-L) divergence of the two distributions formed by the amino acids' constituent proportions. Experiments show that mixing features from different sources by simple concatenation yields a poor result, whereas the synthetic feature fused with t-distributed stochastic neighbor embedding (t-SNE) greatly improves the classification. Finally, after adjusting the hyper-parameters of H-ELM, the highest overall accuracy is 87.5% on ZD98 and 92.4% on CL317.
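To make the Kullback-Leibler feature mentioned above concrete, the sketch below computes the divergence between the amino acid composition distributions of two sequences. Which two distributions the authors compare is not stated in the abstract, so the pairing used here, the pseudocount smoothing, the function names, and the two example sequences are all assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq, pseudocount=1.0):
    """Smoothed amino acid frequencies; the pseudocount avoids zeros,
    which would make the K-L divergence undefined."""
    counts = Counter(seq)
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; not symmetric in P and Q."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


# Hypothetical sequences, not taken from ZD98 or CL317.
seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
seq_b = "GGSGGSLLLLVVVAAAPPPWWW"
p = composition(seq_a)
q = composition(seq_b)

d_pq = kl_divergence(p, q)
d_pp = kl_divergence(p, p)
print(f"D(P||Q) = {d_pq:.4f}, D(P||P) = {d_pp:.4f}")
```

The divergence is zero only when the two compositions coincide and grows as they differ, which is what makes it usable as a scalar feature.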

19.
Non-specific lipid transfer proteins (nsLTPs) are common allergens and are particularly widespread within the plant kingdom. They have a highly conserved three-dimensional structure that generates strong cross-reactivity among the members of this family. In recent years several web tools for predicting the allergenicity of new molecules based on their homology with known allergens have been released, and guidelines for assessing the potential allergenicity of proteins through bioinformatics have been established. Even though such tools are still only partially reliable, they can provide important indications when other kinds of molecular characterization are lacking. The potential allergenicity of 28 amino acid sequences of LTP homologs, either retrieved from the UniProt database or deduced in silico from the corresponding EST coding sequence, was predicted using 7 publicly available web tools. Moreover, their degree of similarity to their closest known LTP allergens was calculated in order to evaluate their potential cross-reactivity. Finally, all sequences were studied for their degree of identity with the peach allergen Pru p 3, considering the regions involved in the formation of its known conformational IgE-binding epitope. Most of the analyzed sequences displayed a high probability of being allergenic according to all the software employed. The analyzed LTPs from bell pepper, cassava, mango, mungbean and soybean showed high homology (>70%) with some known allergenic LTPs, suggesting a potential risk of cross-reactivity for sensitized individuals. Other LTPs, for example those from canola, cassava, mango, mungbean, papaya or persimmon, displayed a high degree of identity with Pru p 3 within the consensus sequence responsible for the formation, at the three-dimensional level, of its major conformational epitope. Recent studies have highlighted that, in patients mono-sensitized to peach LTP, IgE levels seem directly proportional to the chance of developing cross-reactivity to LTPs from non-Rosaceae foods, and that this chance increases the more similar the protein is to Pru p 3. These proteins should therefore be given special consideration in future studies aimed at evaluating the risk of cross-allergenicity in highly sensitized individuals.

20.
Profile-profile alignment algorithms have proven powerful for recognizing remote homologs and generating alignments by effectively integrating sequence evolutionary information into scoring functions. In comparison to scoring functions, the development of gap penalty functions has rarely been addressed in profile-profile alignment algorithms. Although indel frequency profiles have been used to construct profile-based variable gap penalties in some profile-profile alignment algorithms, there is still no fair comparison between variable gap penalties and traditional linear gap penalties to quantify the improvement in alignment accuracy. We compared two linear gap penalty functions, the traditional affine gap penalty (AGP) and the bilinear gap penalty (BGP), with two profile-based variable gap penalty functions, the Profile-based Gap Penalty used in SP(5) (SPGP) and a new Weighted Profile-based Gap Penalty (WPGP) developed by us, on some well-established benchmark datasets. Our results show that profile-based variable gap penalties yield only limited improvements over linear gap penalties, whether or not secondary structure information is incorporated. Secondary structure information appears to be less powerful when incorporated into gap penalties than into scoring functions. Analysis of gap length distributions indicates that gap penalties stably maintain corresponding distributions of gap lengths in their alignments, but the difference between these distributions and those of the reference alignments does not reflect the performance of gap penalties. There is useful information in indel frequency profiles, but it is still not good enough to improve alignment accuracy when used in profile-based variable gap penalties. All of the methods tested in this work are freely accessible at http://protein.cau.edu.cn/gppat/.
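For readers unfamiliar with the affine gap penalty (AGP) used as the baseline above, a minimal sketch follows. It assumes the common convention in which the first gapped position costs `gap_open` and each further position costs `gap_extend`; the parameter values are illustrative and are not those used in the paper. Some tools instead charge `gap_open + gap_extend * length`. The bilinear and profile-based penalties are not sketched because the abstract does not define them.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one contiguous gap of `length` positions.

    Assumed convention: the opening position costs gap_open and every
    additional position costs gap_extend.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One gap of three positions is much cheaper than three separate
# single-position gaps, which is why affine penalties favour fewer,
# longer indels.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

A profile-based variable penalty would replace the two constants with position-specific values derived from indel frequencies, which is the design question the study evaluates.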
