Variable ranking based on the estimated degree of separation for two distributions of data by the length of the receiver operating characteristic curve
Authors: Waleed M. Maswadeh, A. Peter Snyder
Institution: 1. U.S. Army Edgewood Chemical Biological Center (ECBC), RDECOM, ATTN: RDCB-DRD-P, Building E3160, Edgewood Area, Aberdeen Proving Ground, MD 21010-5424, USA; 2. Bel Air, MD 21015, USA
Abstract: Variable responses are fundamental to all experiments, and they can be information-rich, redundant, or of low signal intensity. A dataset can consist of a collection of variable responses over multiple classes or groups. Usually, variables that contain very little information are removed from a dataset; sometimes all of the variables are used in the data analysis phase. It is common practice to discriminate between two distributions of data; however, there is no formal algorithm to arrive at a degree of separation (DS) between two distributions of data. The DS is defined herein as the average of the sum of the areas from the probability density functions (PDFs) of A and B that contain at least a given percentage of A and/or B. Thus, DS90 is the average of the sum of the PDF areas of A and B that contain ≥90% of A and/or B. To arrive at a DS value, two synthesized PDFs or very large experimental datasets are required. Experimentally, however, it is common practice to generate relatively small datasets. Therefore, the challenge was to find a statistical parameter that can be computed on small datasets and that both estimates and correlates highly with the DS90 parameter. Established statistical methods include the overlap area of the two data distribution profiles, Welch's t-test, the Kolmogorov–Smirnov (K–S) test, the Mann–Whitney–Wilcoxon test, and the area under the receiver operating characteristic (ROC) curve (AUC). The area between the ROC curve and the diagonal (ACD) and the length of the ROC curve (LROC) are introduced. The established, ACD, and LROC methods were correlated to the DS90 when applied to many pairs of synthesized PDFs. The LROC method provided the best linear correlation with, and estimation of, the DS90. As an example, the DS90 estimated from the LROC (DS90–LROC) is applied to a database of three Italian wines, consisting of thirteen variable responses, for variable ranking consideration.
An important highlight of the DS90–LROC method is that it uses the LROC methodology to test all variables one at a time with all pairs of classes in a dataset.
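The idea of ranking variables by ROC curve length, one variable and one class pair at a time, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact LROC formula or the mapping from LROC to DS90, so the sketch assumes that LROC is the arc length of the empirical (piecewise-linear) ROC curve and stops short of estimating DS90. The function names `roc_points`, `roc_length`, and `rank_variables`, the input layout, and the toy data are all hypothetical.

```python
import math
from itertools import combinations


def roc_points(a, b):
    """Empirical ROC points for sample a (treated as negatives) and b (positives).

    The threshold sweeps from the largest value down; tied values are
    consumed together, so a tie produces one diagonal segment.
    """
    labeled = sorted([(x, 0) for x in a] + [(x, 1) for x in b],
                     key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    fp = tp = 0
    i = 0
    while i < len(labeled):
        value = labeled[i][0]
        while i < len(labeled) and labeled[i][0] == value:
            if labeled[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / len(a), tp / len(b)))
    return points


def roc_length(a, b):
    """Arc length of the empirical ROC curve (assumed definition of LROC)."""
    pts = roc_points(a, b)
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))


def rank_variables(samples_by_class):
    """Score every (variable, class pair) by ROC length, longest first.

    samples_by_class maps a class label to a list of rows, each row being
    a list of variable responses (hypothetical layout).
    """
    n_vars = len(next(iter(samples_by_class.values()))[0])
    scores = []
    for j in range(n_vars):
        for c1, c2 in combinations(sorted(samples_by_class), 2):
            a = [row[j] for row in samples_by_class[c1]]
            b = [row[j] for row in samples_by_class[c2]]
            scores.append((roc_length(a, b), j, (c1, c2)))
    return sorted(scores, key=lambda s: -s[0])


# Made-up toy data: variable 0 separates the classes, variable 1 does not.
toy = {
    "class_x": [[1.0, 5.0], [2.0, 6.0], [3.0, 5.0]],
    "class_y": [[10.0, 5.0], [11.0, 6.0], [12.0, 5.0]],
}
for length, var, pair in rank_variables(toy):
    print(f"variable {var}, classes {pair}: ROC length = {length:.3f}")
```

Under this assumed definition the length lies between √2 (the ROC curve coincides with the diagonal, so the two samples are indistinguishable) and 2 (the samples are perfectly separated). The length is also the same whichever class has the larger values, so no choice of "positive" class is needed when looping over class pairs.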
Keywords: Degree of separation; Variable ranking; Probability density function; Length of receiver operating characteristic curve; Area between receiver operating characteristic curve and diagonal
This article is indexed in ScienceDirect and other databases.