Relative density-based classification noise detection |
| |
Authors: | Shu-yin Xia Zhong-yang Xiong Yun He Kuang Li Li-mei Dong Man Zhang |
| |
Affiliation: | 1. College of Computer Science, Chongqing University, Chongqing 400044, China; 2. Institute of Electrical Engineering and Information, Sichuan University, Chengdu 400015, China; 3. Department of Electronics and Information Engineering, Chongqing Technology and Business Institute, Chongqing 400042, China |
| |
Abstract: | Classification noise is a common byproduct of traditional data mining approaches, and no specialized approach for detecting classification noise is currently available. Methods for outlier detection are well developed, but outliers and classification noise differ enough in their characteristics to make outlier detection algorithms unsuitable for classification noise detection. In this paper, a new, specialized approach to detecting classification noise is proposed, named relative density-based classification noise detection (RDBCND). Computational experiments on artificial data sets show that RDBCND has a time complexity of O(n log n), making it more efficient than traditional approaches, whose time complexity is at least O(n^2). The use of classification noise detection to improve the generalization ability of common classifier algorithms is also described. In particular, a new unified approach based on RDBCND is compared to a cross-validation approach applied to a BP neural network. Trials on both artificial and real-life datasets show that the RDBCND-based approach can greatly accelerate the process of identifying the best decision function. The novel method can also eliminate underfitting, as the algorithm simply searches for the highest training accuracy. The experiments also show that the RDBCND-based method achieves greater accuracy and lower CPU time in reaching global solutions than the cross-validation method. Because relative density is a local concept, the new approach can be applied directly to nonlinear datasets without data transformation. This is a significant advantage over some linear classifier algorithms, which need kernel functions or other transformations to handle nonlinear datasets, at the cost of added complexity. |
| |
Keywords: | Classification noise; Relative density; RDBCND; Generalizability; Overfitting |
This article is indexed in ScienceDirect and other databases. |
|