

How to winnow actives from inactives: introducing molecular orthogonal sparse bigrams (MOSBs) and multiclass Winnow
Authors: Florian Nigsch, John B. O. Mitchell
Institution: Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom.
Abstract: In the present paper we combine the Winnow algorithm and an advanced scheme for feature generation into a tool for multiclass classification. The Winnow algorithm, designed in the late 1980s specifically to work well with high-dimensional data, ignores most of the irrelevant features when scoring each individual training or test case. To augment the pool of available molecular features, we use the Winnow algorithm in conjunction with a process that creates additional features from a set of given ones. We adapt a technique previously employed in text classification, termed "orthogonal sparse bigrams", and extend its use to the domain of cheminformatics. Using circular molecular fingerprints as initial features, we create "molecular orthogonal sparse bigrams" (MOSBs) and report their successful application to the classification of bioactive molecules. Additionally, we introduce a memory-efficient way of bagging individual classifiers that avoids the need to hold the complete training data set in memory. To compare the performance of our method with published results, we use the Hert data set of 8293 active molecules in 11 classes. We compare our method to Random Forest and find that it is not only comparable or better in classification accuracy (up to 50% higher in MCC [Matthews correlation coefficient], 98% higher in fraction of correct predictions) but also quicker to train (by a factor between 2 and 18, depending on the feature generation), more memory efficient, and able to cope more easily with large data sets when we seeded the actives into a pool of 94290 inactive molecules. We also show that this method can be used with different fingerprints.
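To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the initial features arrive as an ordered list of string identifiers and forms sparse bigrams the way the text-classification technique does, by pairing each feature with the ones that follow it inside a sliding window and recording the gap. How the paper orders or pairs fingerprint features is not stated in the abstract, so that part is an assumption. The multiclass Winnow shown here is one common variant (one weight vector per class, multiplicative promotion and demotion on mistakes). The names `sparse_bigrams`, `featurize`, `MulticlassWinnow`, the feature strings, and the parameters `window`, `promote`, and `demote` are all hypothetical.

```python
def sparse_bigrams(features, window=5):
    """Pair each feature with the next (window - 1) features.

    Each pair records the number of skipped positions, so that
    'a ... c' with one feature in between differs from 'a c' adjacent.
    """
    pairs = []
    for i, left in enumerate(features):
        for gap in range(1, window):
            j = i + gap
            if j >= len(features):
                break
            pairs.append(f"{left}|{gap - 1}|{features[j]}")
    return pairs


def featurize(features, window=5):
    """Original features plus their sparse bigrams, as a set."""
    return set(features) | set(sparse_bigrams(features, window))


class MulticlassWinnow:
    """Mistake-driven multiclass Winnow over binary (present/absent) features."""

    def __init__(self, classes, promote=1.5, demote=0.5):
        self.classes = list(classes)
        self.promote = promote  # factor > 1 applied to the true class
        self.demote = demote    # factor < 1 applied to the wrongly predicted class
        # Unseen features have an implicit weight of 1.0.
        self.weights = {c: {} for c in self.classes}

    def score(self, label, features):
        # Only the features present in this example contribute to the score.
        w = self.weights[label]
        return sum(w.get(f, 1.0) for f in features)

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(c, features))

    def update(self, features, label):
        """Adjust weights only when the current prediction is wrong."""
        guess = self.predict(features)
        if guess == label:
            return False
        true_w, wrong_w = self.weights[label], self.weights[guess]
        for f in features:
            true_w[f] = true_w.get(f, 1.0) * self.promote
            wrong_w[f] = wrong_w.get(f, 1.0) * self.demote
        return True


# Toy data: made-up feature identifiers and class labels, for illustration only.
train = [
    (["a", "b", "c"], "x"),
    (["d", "e", "f"], "y"),
]

model = MulticlassWinnow(classes=["x", "y"])
for _ in range(3):  # a few passes over the toy data
    for feats, label in train:
        model.update(featurize(feats, window=3), label)
```

Because the weights live in dictionaries keyed by feature and only the features present in an example are ever touched, the cost of scoring or updating depends on the number of active features rather than on the total size of the feature space, which is the property the abstract highlights for high-dimensional data.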
This article is indexed in PubMed and other databases.