Classification ensembles for unbalanced class sizes in predictive toxicology |
| |
Authors: | J J Chen, C A Tsai, J F Young, R L Kodell |
| |
Institution: | 1. Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration, Jefferson, Arkansas 72079, USA jchen@nctr.fda.gov; 3. Institute of Statistical Science, Academia Sinica, Taipei, 11529, Taiwan; 4. Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration, Jefferson, Arkansas 72079, USA |
| |
Abstract: | This paper investigates the effect of the positive-to-negative sample ratio on sensitivity, specificity, and concordance. When the class sizes in the training samples are not equal, the derived classification rule favors the majority class, resulting in low sensitivity for minority-class predictions. We propose an ensemble classification approach to adjust for differential class sizes in a binary classifier system. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of the individual classifications made by the base classifiers. Two re-sampling methods, augmentation and abatement, are proposed to generate different bootstrap samples of equal class size for building the base classifiers. The augmentation method balances the two class sizes by bootstrapping additional samples from the minority class, whereas the abatement method balances them by sampling only a subset of the majority class. The proposed procedure is applied to two data sets, one for predicting estrogen receptor binding activity and one for predicting animal liver carcinogenicity, using SAR (structure-activity relationship) models as base classifiers. The abatement method appears to perform well in balancing sensitivity and specificity. |
| |
Keywords: | Bagging; Cross validation; Ensemble classification; Imbalanced data; Sensitivity; Specificity |
|
|
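The two re-sampling schemes described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the one-dimensional nearest-class-mean rule is only a stand-in for the SAR base classifiers, and a simple majority vote is assumed as the "summary measure" because the abstract does not specify one. The sketch also assumes that augmentation keeps every original sample and adds bootstrap draws from the minority class only.

```python
import random
from collections import Counter


def split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    counts = Counter(labels)
    minority_label = min(counts, key=counts.get)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    return minority, majority


def augmentation_sample(labels, rng):
    """Balance classes by drawing extra minority samples with replacement."""
    minority, majority = split_by_class(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra + majority


def abatement_sample(labels, rng):
    """Balance classes by keeping only a random subset of the majority."""
    minority, majority = split_by_class(labels)
    subset = rng.sample(majority, k=len(minority))
    return minority + subset


def fit_centroid(xs, ys):
    """Stand-in base classifier (assumption): 1-D nearest class mean."""
    mean = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mean[label] = sum(vals) / len(vals)
    return lambda x: min(mean, key=lambda label: abs(x - mean[label]))


def fit_ensemble(xs, ys, sampler, n_base=25, seed=0):
    """Fit n_base base classifiers, each on its own balanced re-sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_base):
        idx = sampler(ys, rng)
        models.append(fit_centroid([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Majority vote over the base classifiers (assumed summary measure)."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

Calling `fit_ensemble(xs, ys, abatement_sample)` or `fit_ensemble(xs, ys, augmentation_sample)` switches between the two schemes while leaving the base classifier and voting rule unchanged, which makes it straightforward to compare their sensitivity and specificity on held-out data.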