Similar articles
Found 20 similar articles; search took 46 ms.
1.
A significant challenge in homology detection is to identify sequences that share a common evolutionary ancestor despite significant primary sequence divergence. Remote homologs often have less than 30% sequence identity, yet still retain common structural and functional properties. We demonstrate a novel method for identifying remote homologs using a support vector machine (SVM) classifier trained by fusing sequence similarity scores and subcellular location predictions. SVMs have been shown to perform well in a variety of applications where binary classification of data is the goal. At the same time, data fusion methods have been shown to be highly effective in enhancing the discriminative power of data. Combining these two approaches in the application SVM-SimLoc resulted in the identification of significantly more remote homologs (p-value < 0.006) than using either sequence similarity or subcellular location independently.

2.
Protein and peptide sequences contain clues for functional prediction. A challenge is to predict the function of sequences that show low or no homology to proteins or peptides of known function. A machine learning method, the support vector machine (SVM), has recently been explored for predicting the functional class of proteins and peptides from sequence-derived properties irrespective of sequence similarity, and has shown impressive performance for a wide range of protein and peptide classes, including certain low- and non-homologous sequences. This method serves as a new and valuable addition that complements the extensively used alignment-based, clustering-based, and structure-based functional prediction methods. This article evaluates the strategies, current progress, reported prediction performances, available software tools, and underlying difficulties in using SVMs for predicting the functional class of proteins and peptides.

3.
A new method based on probabilistic suffix trees (PSTs) is defined for pairwise comparison of distantly related protein sequences. The new definition is adopted in a discriminative framework for protein classification that uses pairwise sequence similarity scores for feature encoding. The framework uses support vector machines (SVMs) to separate structurally similar and dissimilar examples. The new discriminative system, which we call SVM-PST, has been tested on the SCOP family classification task and compared with the existing discriminative methods SVM-BLAST and SVM-Pairwise, which use BLAST similarity scores and dynamic-programming-based alignment scores, respectively. Results show that SVM-PST is more accurate than SVM-BLAST and competitive with SVM-Pairwise. In terms of computational efficiency, PST-based comparison is much better than dynamic-programming-based alignment. We also compared our results with the original family-based PST approach that inspired this work. The present method provides a significantly better solution for protein classification than the family-based PST model.

4.
Protein structural class prediction for low-similarity sequences is a significant challenge and a deeply explored subject. It plays an important role in drug design, protein fold recognition, functional analysis and several other biological applications. In this paper, we applied our proposed method for predicting protein structural class to two benchmark databases from the literature: (1) 25PDB and (2) 1189. We first transformed protein sequences into DNA sequences and then into binary sequences. We then applied symmetrical recurrence quantification analysis (the new approach), obtaining 8 features from each symmetry plot computation. Machine learning algorithms, namely Linear Discriminant Analysis (LDA), Random Forest (RF) and Support Vector Machine (SVM), were then used and compared to find the best classifier for protein structural class prediction. Results show that symmetrical recurrence quantification as the feature extraction method with the RF classifier outperformed existing methods, with a reported overall accuracy of 100% without overfitting.

5.
Protein-Protein Interaction (PPI) prediction is a well-known problem in Bioinformatics, for which a large number of techniques have been proposed in the past. However, prediction results have not been sufficiently satisfactory for guiding biologists in wet-lab experiments. One reason is that not all useful information, such as pairwise protein interaction information based on sequence alignment, has been integrated in PPI prediction. Alignment is a basic concept for measuring sequence similarity in Proteomics and has been used in a number of applications ranging from protein recognition to protein subcellular localization. In this article, we propose a novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM). The SVM is a machine learning technique used in a wide range of Bioinformatics applications, thanks to its ability to alleviate overfitting. We demonstrate that in our approach the two methods, SA-kNN and SVM, are complementary, and we combine them in an ensemble to overcome their respective limitations. While the SVM is trained on Amino Acid (AA) compositions and protein signatures mined from the literature, the SA-kNN makes use of the similarity of two protein pairs through alignment. Experimentally, our technique leads to significant gains in accuracy, precision and sensitivity of ~5%, 16% and 10%, respectively.

6.
In this paper, we propose a method to create a 60-dimensional feature vector for protein sequences via the general form of pseudo amino acid composition. The construction of the feature vector is based on the contents of amino acids, the total distance of each amino acid from the first amino acid in the protein sequence, and the distribution of the 20 amino acids. The resulting cosine distance matrix (also called the similarity matrix) is used to construct the phylogenetic tree by the neighbour joining method. To show the applicability of our approach, we tested it on three protein datasets: 1) ND5 protein sequences from nine species, 2) ND6 protein sequences from eight species, and 3) 50 coronavirus spike proteins. The results are in agreement with known history and with the output of the widely used multiple sequence alignment program ClustalW. We have also compared our phylogenetic results with six other recently proposed alignment-free methods. These comparisons show that our proposed method gives a more consistent biological relationship than the others. In addition, the time complexity is linear and less space is required than for other alignment-free methods that use graphical representation. It should be noted that the multiple sequence alignment method has exponential time complexity.
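
As a rough illustration of the idea in this abstract, the sketch below builds only the two components that the abstract states unambiguously (per-residue counts and the total distance of each residue type from the first position) and compares two vectors by cosine distance. The third 20-dimensional block ("distribution") is omitted because the abstract does not define it, and the names `features` and `cosine_distance`, as well as the absence of any normalization, are assumptions made for illustration rather than the authors' method.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def features(sequence):
    """Return a 40-dim vector: 20 residue counts, then 20 total distances.

    Assumption: "distance from the first amino acid" is taken as the
    zero-based index of the residue. Non-standard letters are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    distances = {aa: 0 for aa in AMINO_ACIDS}
    for position, residue in enumerate(sequence):
        if residue in counts:
            counts[residue] += 1
            distances[residue] += position
    return [counts[a] for a in AMINO_ACIDS] + [distances[a] for a in AMINO_ACIDS]

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

Computing `cosine_distance(features(a), features(b))` for every pair of sequences would give a distance matrix of the kind a neighbour joining implementation takes as input.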

7.
Proteins are the macromolecules responsible for almost all biological processes in a cell. With the availability of a large number of protein sequences from different sequencing projects, the challenge for scientists is to characterize their functions. As wet-lab methods are time-consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and Nearest Neighborhood classification on protein–protein interaction networks have been proposed. The support vector machine is one such method that has been used successfully for several problems such as protein fold recognition and protein structure prediction. Cai et al. (2003) used SVMs to classify proteins into different functional classes and to predict their function. They used the physico-chemical properties of proteins to represent the protein sequences. In this paper, a model comprising feature subset selection followed by a multiclass Support Vector Machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine the features that contribute significantly to functional classification, the Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used, and it is observed that, of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes with an accuracy ranging from 91% to 94%. On comparison, OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.

8.
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. This linear support vector machine formulation performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is that the cost of a prediction is, on average, linear in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison with literature results showed that LIBLINEAR is competitive but does not achieve results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
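
The leave-cluster-out cross-validation mentioned in this abstract can be sketched as follows: each fold holds out every sample belonging to one cluster, so that no chemotype appears in both training and test data. This is a minimal sketch under stated assumptions; the function name `leave_cluster_out_splits` and the plain list of cluster labels are hypothetical, and the nested variant and the Feature Trees clustering used in the study are not shown.

```python
from collections import defaultdict

def leave_cluster_out_splits(cluster_ids):
    """Yield (held_out_cluster, train_indices, test_indices), one fold per cluster.

    cluster_ids[i] is the (hypothetical) cluster label of sample i.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    for cluster in sorted(members):
        test = members[cluster]
        held_out = set(test)
        # Everything outside the held-out cluster is used for training.
        train = [i for i in range(len(cluster_ids)) if i not in held_out]
        yield cluster, train, test
```

Splitting by cluster rather than by random sample tends to give a more conservative performance estimate, because close analogues of a test compound cannot leak into the training set.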

9.
Tyrosine sulfation is a post‐translational modification of many secreted and membrane‐bound proteins. It governs protein‐protein interactions that are involved in leukocyte adhesion, hemostasis, and chemokine signaling. However, the intrinsic features of sulfated proteins remain elusive and have yet to be delineated. This investigation presents SulfoSite, a computational method based on a support vector machine (SVM) for predicting protein sulfotyrosine sites. The approach was developed to consider structural information such as the secondary structure and solvent accessibility of the amino acids that surround the sulfotyrosine sites. One hundred sixty‐two experimentally verified tyrosine sulfation sites were identified using UniProtKB/SwissProt release 53.0. The results of a five‐fold cross‐validation evaluation suggest that the solvent accessibility around the sulfotyrosine sites contributes substantially to predictive accuracy. The SVM classifier achieves an accuracy of 94.2% in five‐fold cross‐validation when the sequence positional weighted matrix (PWM) is coupled with values of the accessible surface area (ASA). The proposed method significantly outperforms previous methods for predicting the location of tyrosine sulfation sites. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2009

10.
Biopolymer sequence comparison to identify evolutionarily related proteins, or homologs, is one of the most common tasks in bioinformatics. Support vector machines (SVMs) represent a new approach to the problem in which statistical learning theory is employed to classify proteins into families, thus identifying homologous relationships. Current SVM approaches have been shown to outperform iterative profile methods, such as PSI-BLAST, for protein homology classification. In this study, we demonstrate that using a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared with the traditional dynamic programming implementation on a benchmark dataset consisting of 54 unique protein families. The SVM-BALSA algorithm returns a higher area under the receiver operating characteristic (ROC) curve for 37 of the 54 families and achieves an improved overall performance curve at a significance level of 0.07.
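
For readers unfamiliar with the per-family metric used here, the area under the ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the example scores are hypothetical, and this is not the evaluation code of the study.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for family members and non-members.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

The pairwise form is quadratic in the number of examples; a rank-based computation would be preferable for large benchmarks, but the value is the same.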

11.
Protein methylation is involved in dozens of biological processes and plays an important role in adjusting protein physicochemical properties, conformation and function. However, with the rapid increase in protein sequences entering databanks, the gap between the number of known sequences and the number of known methylation annotations is widening rapidly. Therefore, it is vitally important to develop a computational method for quick and accurate identification of methylation sites. In this study, a novel predictor (Methy_SVMIACO) based on a support vector machine (SVM) and an improved ant colony optimization algorithm (IACO) is developed to identify methylation sites. The IACO is used to find the optimal feature subset and SVM parameters, while the SVM performs the identification of methylation sites. Comparison of the IACO with conventional ACO shows that the IACO converges quickly toward the global optimal solution and is a more useful tool for feature selection and SVM parameter optimization. In a 10-fold cross-validation test, Methy_SVMIACO achieves a sensitivity of 85.71%, a specificity of 86.67%, an accuracy of 86.19% and a Matthews correlation coefficient (MCC) of 0.7238 for lysine, and a sensitivity of 89.08%, a specificity of 94.07%, an accuracy of 91.56% and an MCC of 0.8323 for arginine. Analysis of the optimal feature subset shows that some upstream and downstream residues play an important role in the methylation of arginine and lysine. Compared with other existing methods, Methy_SVMIACO provides higher accuracy, sensitivity and specificity, indicating that the current method may serve as a powerful complementary tool to other existing approaches in this area. Methy_SVMIACO can be acquired freely on request from the authors.
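
The four measures reported in this abstract are all derived from the confusion matrix of a binary classifier. The sketch below shows the standard definitions; the function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the study.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc

# Hypothetical counts, chosen only to exercise the formulas.
sen, spe, acc, mcc = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```

Unlike accuracy, the MCC stays informative when the positive and negative classes differ greatly in size, which is why it is commonly reported alongside the other three measures.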

12.
The evolutionary relationships of organisms are traditionally delineated by alignment‐based methods using DNA or protein sequences. In the post‐genome era, the phylogenetics of life can be inferred from many sources such as genomic features, not just from the comparison of one or several genes. To investigate the possibility that the physicochemical properties of protein sequences might reflect their phylogenetic properties, an alignment‐free method using a support vector machine (SVM) classifier is implemented to establish the phylogenetic relationships between protein sequences. Two types of datasets, namely “Enzymatic” (assigned by an EC accession) and “Proteins”, are used to train the SVM classifiers. By computing the F‐score for feature selection, we find that the classification accuracies of the trained SVM classifiers can be significantly enhanced, to 84% and 80% for the “Enzymatic” and “Proteins” datasets, respectively, if the protein sequences are represented with the top 255 selected features. This shows that selected physicochemical features of amino acid sequences are sufficient for inferring the phylogenetic properties of the protein sequences. Moreover, we find that the selected physicochemical features appear to correlate with the physiological characteristics of the taxonomic classes classified. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010

13.
Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results show that SPMap is capable of high-accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets.
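
The mapping step described in this abstract (the distribution of a protein's fixed-length subsequences over clusters) can be illustrated with a small sketch. It assumes that cluster representatives are already available as strings of equal length and that each window is assigned to the nearest representative by Hamming distance; both choices, and the names `subsequences`, `hamming` and `profile_map`, are simplifying assumptions for illustration and not the clustering or similarity measure used by SPMap.

```python
def subsequences(sequence, length):
    """Return all overlapping windows of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def profile_map(sequence, representatives):
    """Fraction of the sequence's windows assigned to each cluster representative."""
    length = len(representatives[0])
    counts = [0] * len(representatives)
    windows = subsequences(sequence, length)
    for window in windows:
        # Assign the window to the closest representative (first one on ties).
        nearest = min(range(len(representatives)),
                      key=lambda j: hamming(window, representatives[j]))
        counts[nearest] += 1
    if not windows:
        return [0.0] * len(representatives)
    return [c / len(windows) for c in counts]
```

The resulting vector has one entry per cluster regardless of sequence length, which is what allows sequences of different lengths to be fed to a fixed-dimensional discriminative classifier.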

14.
Knowledge of structural classes is useful in understanding folding patterns in proteins. Although existing structural class prediction methods have applied virtually all state-of-the-art classifiers, many of them use a relatively simple protein sequence representation that often includes amino acid (AA) composition. To address this, we propose a novel sequence representation that incorporates evolutionary information encoded using PSI-BLAST profile-based collocation of AA pairs. We used six benchmark datasets and five representative classifiers to quantify and compare the quality of the structural class prediction with the proposed representation. The best classifier, the support vector machine, achieved 61-96% accuracy on the six datasets. These predictions were comprehensively compared with a wide range of recently proposed methods for prediction of structural classes. Our comparison shows the superiority of the proposed representation, which results in error rate reductions that range between 14% and 26% when compared with predictions of the best-performing, previously published classifiers on the considered datasets. The study also shows that, for the benchmark datasets that include sequences characterized by low identity (i.e., 25%, 30%, and 40%), the prediction accuracies are 20-35% lower than for the other three datasets, which include sequences with a higher degree of similarity. In conclusion, the proposed representation is shown to substantially improve the accuracy of structural class prediction. A web server that implements the presented prediction method is freely available at http://biomine.ece.ualberta.ca/Structural_Class/SCEC.html.

15.
This work provides an introduction to support vector machines (SVMs) for predictive modeling in heterogeneous catalysis, describing the methodology step by step and highlighting the points that make this technique an attractive approach. We first investigate linear SVMs, working in detail through a simple example based on experimental data from a study that aimed to optimize olefin epoxidation catalysts using high-throughput experimentation. This case study was chosen because the small number of catalytic variables investigated allows SVM features to be shown visually. It is shown how SVMs transform the original data into another representation space of higher dimensionality. The concepts of Vapnik-Chervonenkis dimension and structural risk minimization are introduced. The SVM methodology is evaluated with a second catalytic application, light paraffin isomerization. Finally, we discuss why SVMs are a strategic method compared with other machine learning techniques, such as neural networks or induction trees, and why emphasis is put on the problem of overfitting.

16.
RNA-binding proteins (RBPs) perform fundamental and diverse functions within the cell. Approximately 15% of protein sequences are annotated as RNA-binding, but with a significant number of proteins lacking functional annotation, many RBPs are yet to be identified. A percentage of uncharacterised proteins can be annotated by transferring functional information from proteins sharing significant sequence homology. However, genomes contain a significant number of orphan open reading frames (ORFs) that do not share significant sequence similarity with other ORFs but correspond to functional proteins. Hence, methods for protein function annotation that go beyond sequence homology are essential. One method of annotation is the identification of ligands that bind to proteins, through the characterisation of binding site residues. In the current work, RNA-binding residues (RBRs) are characterised in terms of their evolutionary conservation and the patterns they form in sequence space. The potential for such characteristics to be used to identify RBPs from sequence is then evaluated. The conservation of residues in 261 RBPs is compared for (a) RBRs vs. non-RBR surface residues, and (b) specific vs. non-specific RBRs. The analysis shows that RBRs are more conserved than other surface residues, and that RBRs hydrogen-bonded to the RNA backbone are more conserved than those making hydrogen bonds to RNA bases. This observed conservation of RBRs was then used to inform the construction of RBR sequence patterns from known protein–RNA structures. A series of RBR patterns was generated for a case study protein, aspartyl-tRNA synthetase bound to tRNA, and used to differentiate between RNA-binding and non-RNA-binding protein sequences. Six sequence patterns performed with high precision values of >80% and recall values 7 times that of a homology search.
When the method was expanded to the complete dataset of 261 proteins, many patterns were of poor predictive value, as they had not been manipulated on a family-specific basis. However, two patterns with precision values ≥85% were used to make function predictions for a set of hypothetical proteins. This revealed a number of potential RBPs that require experimental verification.

17.
With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable and large-scale protein annotation and biological knowledge discovery. The Protein Information Resource (PIR) provides an integrated public resource of protein informatics to support genomic and proteomic research. PIR produces the Protein Sequence Database of functionally annotated protein sequences. The annotation problems are addressed by a classification-driven and rule-based method with evidence attribution, coupled with an integrated knowledge base system being developed. The approach allows sensitive identification, consistent and rich annotation, and systematic detection of annotation errors, as well as distinction of experimentally verified and computationally predicted features. The knowledge base consists of two new databases, sequence analysis tools, and graphical interfaces. PIR-NREF, a non-redundant reference database, provides a timely and comprehensive collection of all protein sequences, totaling more than 1,000,000 entries. iProClass, an integrated database of protein family, function, and structure information, provides extensive value-added features for about 830,000 proteins with rich links to over 50 molecular databases. This paper describes our approach to protein functional annotation with case studies and examines common identification errors. It also illustrates that data integration in PIR supports exploration of protein relationships and may reveal protein functional associations beyond sequence homology.

18.
19.
Advances in protein crystallography and homology modeling techniques are producing vast amounts of high resolution protein structure data at ever increasing rates. As such, the ability to quickly and easily extract structural similarities is a key tool in discovering important functional relationships. We report on an approach for creating and maintaining a database of pairwise structure alignments for a comprehensive database comprising the PDB and homology models for the human and select pathogen genomes. Our approach consists of a novel, multistage method for determining pairwise structural similarity coupled with an efficient clustering protocol that approximates a full NxN assessment in a fraction of the time. Since biologists are commonly interested in recently released structures, and the homology models built from them, an automatically updating database of structural alignments has great value. Our approach yields a querying system that allows scientists to retrieve databank-wide protein structure similarities as easily as retrieving protein sequence similarities via BLAST or PSI-BLAST. Basic, noncommercial access to the database can be requested at https://tip.eidogen-sertanty.com/.

20.
