Protein sumoylation sites prediction based on two-stage feature selection
Authors:Lin Lu  Xiao-He Shi  Su-Jun Li  Zhi-Qun Xie  Yong-Li Feng  Wen-Cong Lu  Yi-Xue Li  Haipeng Li  Yu-Dong Cai
Institution:1. Institute of System Biology, Shanghai University, 99 Shang-Da Road, 200244, Shanghai, China
2. Department of Biomedical Engineering, Shanghai Jiao Tong University, 200240, Shanghai, China
3. CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 200031, Shanghai, China
4. Life Science and Technology, School of Shanghai Jiao Tong University, 200240, Shanghai, China
5. Institute of Health Science, Shanghai Institute for Biological Science, Chinese Academy of Science, 225 South ChongQing Road, 200025, Shanghai, China
6. Department of Chemistry, College of Sciences, 99 Shang-Da Road, 200444, Shanghai, China
7. Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 200031, Shanghai, China
Abstract:Protein sumoylation is one of the most important post-translational modifications, and accurate prediction of sumoylation sites is valuable for proteome analysis. Although the putative motif ΨKXE can be used, optimizing prediction models remains a challenge. In this study, we developed a prediction system based on a feature selection strategy. A total of 1,272 peptides of 14 residues each, taken from SUMOsp (Xue et al., Nucleic Acids Res 34:W254–W257, 2006), were investigated, comprising 212 substrates and 1,060 non-substrates. Among the substrates, only 162 comply with the motif ΨKXE. First, the 1,272 peptides were divided into a training set and a test set. All peptides were encoded into feature vectors using hundreds of amino acid properties collected in the Amino Acid Index Database (AAIndex, http://www.genome.jp/aaindex). Then, the mRMR (minimum redundancy–maximum relevance) method was applied to extract the most informative features. Finally, the Nearest Neighbor Algorithm (NNA) was used to build the prediction models. Evaluated by leave-one-out (LOO) cross-validation, the optimal prediction model reaches an accuracy of 84.4% for the training set and 76.4% for the test set. Notably, 180 substrates were correctly predicted, 18 more than with the motif ΨKXE alone. The final selected features indicate that the residues two positions downstream and one position upstream of the sumoylation site play the most important role in determining whether sumoylation occurs. Because it is built on feature selection, our prediction system can be used not only for high-throughput prediction of sumoylation sites but also as a tool to investigate the mechanism of sumoylation.
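The workflow described in the abstract (rank features by minimum redundancy and maximum relevance, then score a nearest-neighbor classifier by leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. Several things here are assumptions made for illustration: absolute Pearson correlation stands in for the mutual-information terms normally used in mRMR, squared Euclidean distance stands in for whatever distance the paper's NNA uses, and the tiny numeric matrix is synthetic toy data, not AAIndex-encoded peptides. The function names `mrmr_rank` and `loo_accuracy` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_rank(X, y, k):
    """Greedy mRMR-style selection of k feature indices.

    Assumption: relevance and redundancy are measured with |Pearson r|
    instead of mutual information. Score = relevance - mean redundancy.
    """
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    relevance = [abs(pearson(c, y)) for c in cols]
    selected = []
    while len(selected) < min(k, n_feat):
        best, best_score = None, -math.inf
        for j in range(n_feat):
            if j in selected:
                continue
            if selected:
                redundancy = sum(
                    abs(pearson(cols[j], cols[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    Assumption: squared Euclidean distance over the chosen features.
    """
    correct = 0
    for i in range(len(X)):
        best_d, pred = math.inf, None
        for j in range(len(X)):
            if i == j:
                continue  # the held-out sample may not vote for itself
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_d, pred = d, y[j]
        correct += pred == y[i]
    return correct / len(X)


# Synthetic toy data: feature 1 is an exact rescaling of feature 0 (redundant),
# feature 2 carries partly complementary information.
f0 = [0.9, 0.8, 0.9, 0.2, 0.1, 0.2, 0.1, 0.8]
f2 = [0.5, 0.6, 0.4, 0.9, 0.5, 0.4, 0.6, 0.1]
X = [[a, 2 * a, b] for a, b in zip(f0, f2)]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = substrate, 0 = non-substrate

selected = mrmr_rank(X, y, k=2)
acc = loo_accuracy(X, y, selected)
print("selected features:", selected)
print("LOO accuracy:", acc)
```

On this toy matrix the redundancy penalty steers the second pick away from the rescaled copy of the first feature, which is the behavior mRMR is meant to provide. On real data, the feature vectors would come from encoding each residue position with AAIndex property values, and the number of retained features `k` would need to be tuned.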
This article is indexed in SpringerLink and other databases.