Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries
Authors: Pankaj Kumar, Xiaohua Ma, Xianghui Liu, Jia Jia, Han Bucong, Ying Xue, Ze Rong Li, Sheng Yong Yang, Yu Quan Wei, Yu Zong Chen
Institution: (1) Bioinformatics and Drug Design Group, Centre for Computational Science and Engineering, Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117546, Singapore; (2) College of Chemistry, Sichuan University, Chengdu, 610064, People’s Republic of China; (3) State Key Laboratory of Biotherapy, Sichuan University, Chengdu, 610064, People’s Republic of China
Abstract: Various in vitro and in-silico methods have been used for drug genotoxicity tests, but they show limited genotoxicity (GT+) and non-genotoxicity (GT−) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The rates of in-silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT− compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machines (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperformed L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames test only). H-SVM, trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT− compounds excluding clinical trial drugs, correctly identified 81.6% of the 38 GT+ compounds reported since 2008, and predicted 83.1% of the 2,008 clinical trial drugs as GT−, 23.96% of 168K MDDR compounds as GT+, and 27.23% of 17.86M PubChem compounds as GT+. These are comparable to the 43.1–51.9% GT+ and 75–93% GT− rates of existing in-silico methods, the 58.8% GT+ and 79% GT− rates of the Ames method, and the estimated percentages of 23% in vivo and 31–33% in vitro GT+ compounds in the “universe of chemicals”. There is a substantial level of agreement between the H-SVM and L-SVM predictions of GT+ and GT− MDDR compounds and the predictions from TOPKAT. SVM showed good potential for identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
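As a rough reading aid, the percentages reported in the abstract can be converted into approximate compound counts. The sketch below is not from the paper: the helper name `implied_count` and the dictionary labels are hypothetical, and the library sizes "168K" and "17.86M" are rounded in the source, so the resulting counts are only estimates.

```python
# Approximate compound counts implied by the percentages in the abstract.
# Assumption: "168K" and "17.86M" are taken as 168,000 and 17,860,000;
# both are rounded in the source, so the counts are estimates only.

def implied_count(percent: float, total: int) -> int:
    """Return the nearest whole number of compounds for a percentage of a total."""
    return round(percent / 100 * total)

# (reported percentage, size of the evaluated set), as stated in the abstract
reported = {
    "GT+ compounds since 2008 identified": (81.6, 38),
    "clinical trial drugs predicted GT-": (83.1, 2008),
    "MDDR compounds predicted GT+": (23.96, 168_000),
    "PubChem compounds predicted GT+": (27.23, 17_860_000),
}

counts = {label: implied_count(p, n) for label, (p, n) in reported.items()}

for label, count in counts.items():
    print(f"{label}: ~{count:,}")
```

Under these assumptions, 81.6% of 38 corresponds to 31 compounds, which gives a sense of how small the post-2008 GT+ test set is relative to the screened libraries.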
This article is indexed in PubMed, SpringerLink, and other databases.