1.

Effective PCA for high-dimension, low-sample-size data with singular value decomposition of cross data matrix

Kazuyoshi Yata 《Journal of Multivariate Analysis》2010,101(9):2060-2077

In this paper, we propose a new methodology to deal with PCA in high-dimension, low-sample-size (HDLSS) data situations. We give an idea of estimating eigenvalues via singular values of a cross data matrix. We provide consistency properties of the eigenvalue estimation as well as its limiting distribution when the dimension d and the sample size n both grow to infinity in such a way that n is much lower than d. We apply the new methodology to estimating PC directions and PC scores in HDLSS data situations. We give an application of the findings in this paper to a mixture model to classify a dataset into two clusters. We demonstrate how the new methodology performs by using HDLSS data from a microarray study of prostate cancer.
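
As a rough illustration of the cross-data-matrix idea in the abstract above, the following Python sketch splits the sample into two halves, centres each half, and uses the singular values of their scaled cross-product as eigenvalue estimates. The function name `cross_data_eigenvalues`, the even split, and the scaling by sqrt((n1 - 1)(n2 - 1)) are assumptions made for illustration; the paper should be consulted for the exact construction.

```python
import numpy as np

def cross_data_eigenvalues(X):
    """Estimate eigenvalues from a d x n data matrix X (columns are samples).

    Assumed construction: split the columns into two halves, centre each
    half by its own mean, and take the singular values of the scaled
    n1 x n2 cross-product matrix.
    """
    d, n = X.shape
    if n < 4:
        raise ValueError("need at least 4 samples so each half has 2 or more")
    n1 = int(np.ceil(n / 2))
    n2 = n - n1
    X1, X2 = X[:, :n1], X[:, n1:]
    # Centre each half separately.
    X1c = X1 - X1.mean(axis=1, keepdims=True)
    X2c = X2 - X2.mean(axis=1, keepdims=True)
    # Cross data matrix: small (n1 x n2) even when d is very large.
    S = X1c.T @ X2c / np.sqrt((n1 - 1) * (n2 - 1))
    # Singular values come back sorted in descending order.
    return np.linalg.svd(S, compute_uv=False)
```

Because the two halves are independent, their noise terms tend to cancel in the cross-product, which is the intuition behind using this matrix rather than the usual sample covariance when d is much larger than n.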

2.

A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data

Makoto Aoshima Kazuyoshi Yata 《Annals of the Institute of Statistical Mathematics》2014,66(5):983-1010

In this paper, we consider a scale adjusted-type distance-based classifier for high-dimensional data. We first give such a classifier that can ensure high accuracy in misclassification rates for two-class classification. We show that the classifier is not only consistent but also asymptotically normal for high-dimensional data. We provide sample size determination so that misclassification rates are no more than a prespecified value. We propose a classification procedure called the misclassification rate adjusted classifier. We further develop the classifier to multiclass classification. We show that the classifier can still enjoy asymptotic properties and ensure high accuracy in misclassification rates for multiclass classification. Finally, we demonstrate the proposed classifier in actual data analyses by using a microarray data set.

3.

A nonparametric version of Wilks’ lambda—Asymptotic results and small sample approximations

Chunxu Liu Arne C. Bathke Solomon W. Harrar 《Statistics & Probability Letters》2011,81(10):1502-1506

We propose a nonparametric version of Wilks’ lambda (the multivariate likelihood ratio test) and investigate its asymptotic properties under the two different scenarios of either large sample size or large number of samples. For unbalanced samples, a weighted and an unweighted variant are introduced. The unweighted variant of the proposed test appears to be novel also in the normal-theory context. The theoretical results are supplemented by a simulation study with parameter settings that are motivated by clinical and agricultural data, considering in particular the performance for small sample sizes, small number of samples, and varying dimensions. Inference methods based on the asymptotic sampling distribution and a small sample approximation are compared to permutation tests and to other parametric and nonparametric procedures. Application of the proposed method is illustrated by examples.

4.

On robust classification using projection depth

Subhajit Dutta Anil K. Ghosh 《Annals of the Institute of Statistical Mathematics》2012,64(3):657-676

This article uses projection depth (PD) for robust classification of multivariate data. Here we consider two types of classifiers, namely, the maximum depth classifier and the modified depth-based classifier. The latter involves kernel density estimation, where one needs to choose the associated scale of smoothing. We consider both the single scale and the multi-scale versions of kernel density estimation, and investigate the large sample properties of the resulting classifiers under appropriate regularity conditions. Some simulated and real data sets are analyzed to evaluate the finite sample performance of these classification tools.

5.

Aggregating multiple classification results using fuzzy integration and stochastic feature selection (Cited by: 1; self-citations: 0; citations by others: 1)

Nick J. Pizzi Witold Pedrycz 《International Journal of Approximate Reasoning》2010,51(8):883-894

Classifying magnetic resonance spectra is often difficult due to the curse of dimensionality: scenarios in which a high-dimensional feature space is coupled with a small sample size. We present an aggregation strategy that combines predicted disease states from multiple classifiers using several fuzzy integration variants. Rather than using all input features for each classifier, these multiple classifiers are presented with different, randomly selected, subsets of the spectral features. Results from a set of detailed experiments using this strategy are carefully compared against classification performance benchmarks. We empirically demonstrate that the aggregated predictions are consistently superior to the corresponding prediction from the best individual classifier.

6.

The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise

Yingtao Bi 《Journal of Multivariate Analysis》2010,101(7):1622-1637

In many real-world classification problems, class-conditional classification noise (CCC-Noise) frequently deteriorates the performance of a classifier that is naively built by ignoring it. In this paper, we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distribution of the misclassification error rate of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR is less deteriorated by CCC-Noise compared to NDA. Under CCC-Noise contexts, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of these two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise compared to NDA.

7.

Regularized classification for mixed continuous and categorical variables under across-location heteroscedasticity

Chi-Ying Leung 《Journal of Multivariate Analysis》2005,93(2):358-374

A regularized classifier is proposed for a two-population classification problem of mixed continuous and categorical variables in a general location model (GLOM). The limiting overall expected error for the classifier is given. It can be used in an optimization search for the regularization parameters. For a heteroscedastic spherical dispersion across all locations, an asymptotic error is available which provides an alternative criterion for the optimization search. In addition, the asymptotic error can serve as a baseline for practical comparisons with other classifiers. Results based on a simulation and two real datasets are presented.

8.

Testing procedures for detection of linear dependencies in efficiency models

Antonio Peyrache Tim Coelli 《European Journal of Operational Research》2009

The validity of many efficiency measurement methods relies upon the assumption that variables such as input quantities and output mixes are independent of (or uncorrelated with) technical efficiency; however, few studies have attempted to test these assumptions. In a recent paper, Wilson (2003) investigates a number of independence tests and finds that they have poor size properties and low power in moderate sample sizes. In this study we discuss the implications of these assumptions in three situations: (i) bootstrapping non-parametric efficiency models; (ii) estimating stochastic frontier models and (iii) obtaining aggregate measures of industry efficiency. We propose a semi-parametric Hausman-type asymptotic test for linear independence (uncorrelation), and use a Monte Carlo experiment to show that it has good size and power properties in finite samples. We also describe how the test can be generalized in order to detect higher order dependencies, such as heteroscedasticity, so that the test can be used to test for (full) independence when the efficiency distribution has a finite number of moments. Finally, an empirical illustration is provided using data on US electric power generation.

9.

Empirical likelihood test for high dimensional linear models

《Statistics & Probability Letters》2014

We propose an empirical likelihood method to test whether the coefficients in a possibly high-dimensional linear model are equal to given values. The asymptotic distribution of the test statistic is independent of the number of covariates in the linear model.

10.

A formula for multiple classifiers in data mining based on Brandt semigroups

A. V. Kelarev J. L. Yearwood M. A. Mammadov 《Semigroup Forum》2009,78(2):293-309

A general approach to designing multiple classifiers represents them as a combination of several binary classifiers in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of errors of binary classifiers, which can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups.

11.

Bootstrapping Pseudolikelihood Models for Clustered Binary Data

Marc Aerts Gerda Claeskens 《Annals of the Institute of Statistical Mathematics》1999,51(3):515-530

Asymptotic properties of the parametric bootstrap procedure for maximum pseudolikelihood estimators and hypothesis tests are studied in the general framework of associated populations. The technique is applied to the analysis of toxicological experiments which, based on pseudolikelihood inference for clustered binary data, fits into this framework. It is shown that the bootstrap approximation can be used as an interesting alternative to the classical asymptotic distribution of estimators and test statistics. Finite sample simulations for clustered binary data models confirm the asymptotic theory and indicate some substantial improvements.

12.

A Feature Selection Newton Method for Support Vector Machine Classification (Cited by: 4; self-citations: 1; citations by others: 3)

Glenn M. Fung O.L. Mangasarian 《Computational Optimization and Applications》2004,28(2):185-202

A fast Newton method that suppresses input space features is proposed for a linear programming formulation of support vector machine classifiers. The proposed stand-alone method can handle classification problems in very high dimensional spaces, such as 28,032 dimensions, and generates a classifier that depends on very few input features, such as 7 out of the original 28,032. The method can also handle problems with a large number of data points and requires no specialized linear programming packages but merely a linear equation solver. For nonlinear kernel classifiers, the method utilizes a minimal number of kernel functions in the classifier that it generates.

13.

Tuning membership functions of kernel fuzzy classifiers by maximizing margins

Kazuya Morikawa Seiichi Ozawa Shigeo Abe 《Memetic Computing》2009,1(3):221-228

We propose two methods for tuning membership functions of a kernel fuzzy classifier based on the idea of SVM (support vector machine) training. We assume that in a kernel fuzzy classifier a fuzzy rule is defined for each class in the feature space. In the first method, we tune the slopes of the membership functions at the same time so that the margin between classes is maximized under the constraints that the degree of membership to which a data sample belongs is the maximum among all the classes. This method is similar to a linear all-at-once SVM. We call this AAO tuning. In the second method, we tune the membership function of a class one at a time. Namely, for a class the slope of the associated membership function is tuned so that the margin between the class and the remaining classes is maximized under the constraints that the degrees of membership for the data belonging to the class are large and those for the remaining data are small. This method is similar to a linear one-against-all SVM. This is called OAA tuning. According to the computer experiment for fuzzy classifiers based on kernel discriminant analysis and those with ellipsoidal regions, usually both methods improve classification performance by tuning membership functions and classification performance by AAO tuning is slightly better than that by OAA tuning.

14.

A boosting method for maximization of the area under the ROC curve

Osamu Komori 《Annals of the Institute of Statistical Mathematics》2011,63(5):961-979

We discuss receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) for binary classification problems in clinical fields. We propose a statistical method for combining multiple feature variables, based on a boosting algorithm for maximization of the AUC. In this iterative procedure, various simple classifiers that consist of the feature variables are combined flexibly into a single strong classifier. We consider a regularization to prevent overfitting to data in the algorithm using a penalty term for nonsmoothness. This regularization method not only improves the classification performance but also helps us to get a clearer understanding about how each feature variable is related to the binary outcome variable. We demonstrate the usefulness of score plots constructed componentwise by the boosting method. We describe two simulation studies and a real data analysis in order to illustrate the utility of our method.
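
To make the objective in the abstract above concrete, the sketch below computes the empirical AUC of a score function as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as one half. It illustrates only the quantity being maximized, not the boosting procedure itself; the function name `empirical_auc` and the 0/1 label coding are assumptions made for illustration.

```python
def empirical_auc(scores, labels):
    """Empirical AUC: share of positive/negative pairs ranked correctly.

    labels are assumed to be coded 1 (positive) and 0 (negative).
    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(empirical_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Because this pairwise count is a step function of the scores, methods that maximize the AUC directly typically work with a smoothed surrogate of it.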

15.

Size distortion of the bootstrap LM-Error test in linear regression models

欧变玲 龙志和 林怡坚 《数理统计与管理》2013,32(1):35-41

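Based on OLS residuals, this paper applies the bootstrap method to the LM-Error test for spatial error correlation. Taking into account the number of bootstrap replications, the spatial contiguity structure, and the sample size, we study and compare the size distortion of the bootstrap LM-Error test and of the asymptotic test for spatial error correlation. Extensive Monte Carlo experiments show that when the model errors do not satisfy the assumption of independent normal distributions, the asymptotic LM-Error test for spatial error correlation has a large size distortion, which the bootstrap method reduces well; whether or not the model errors satisfy the independent-normality assumption, the bootstrap method effectively reduces the size distortion of the asymptotic LM-Error test.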
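
A minimal sketch of how a bootstrap LM-Error test might be set up is given below. It uses the standard LM-Error statistic computed from OLS residuals and a spatial weight matrix W, and approximates its null distribution by resampling the centred residuals. The residual-resampling scheme, the function names `lm_error` and `bootstrap_lm_error`, and the default of 199 replications are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def lm_error(e, W):
    """LM-Error statistic from OLS residuals e and spatial weights W."""
    n = len(e)
    sigma2 = e @ e / n                      # ML-type variance estimate
    T = np.trace(W.T @ W + W @ W)
    return (e @ W @ e / sigma2) ** 2 / T

def bootstrap_lm_error(y, X, W, B=199, seed=0):
    """Return the LM-Error statistic and a residual-bootstrap p-value."""
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    stat = lm_error(e, W)
    e_centred = e - e.mean()
    exceed = 0
    for _ in range(B):
        # Under the null of no spatial error correlation the errors are
        # exchangeable, so resample them with replacement (assumed scheme).
        e_star = rng.choice(e_centred, size=len(e), replace=True)
        y_star = X @ beta + e_star
        b_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        exceed += lm_error(y_star - X @ b_star, W) >= stat
    return stat, (1 + exceed) / (1 + B)
```

The asymptotic version of the test would instead compare the statistic with a chi-squared distribution with one degree of freedom; the bootstrap p-value replaces that reference distribution with the empirical one.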

16.

Marginal and simultaneous predictive classification using stratified graphical models

Henrik Nyman Jie Xiong Johan Pensar Jukka Corander 《Advances in Data Analysis and Classification》2016,10(3):305-326

An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based on solely conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.

17.

Asymptotic second-order consistency for two-stage estimation methodologies and its applications

Makoto Aoshima Kazuyoshi Yata 《Annals of the Institute of Statistical Mathematics》2010,62(3):571-600

We consider fixed-size estimation for a linear function of means from independent and normally distributed populations having unknown and respective variances. We construct a fixed-width confidence interval with required accuracy about the magnitude of the length and the confidence coefficient. We propose a two-stage estimation methodology having the asymptotic second-order consistency with the required accuracy. The key is the asymptotic second-order analysis about the risk function. We give a variety of asymptotic characteristics about the estimation methodology, such as asymptotic sample size and asymptotic Fisher-information. With the help of the asymptotic second-order analysis, we also explore a number of generalizations and extensions of the two-stage methodology to problems such as bounded risk point estimation, multiple comparisons among components between the populations, and power analysis in equivalence tests to plan the appropriate sample size for a study.

18.

The effect of across-location heteroscedasticity on the classification of mixed categorical and continuous data

Chi-Ying Leung 《Journal of Multivariate Analysis》2003,84(2):369-386

Classification of mixed categorical and continuous data is often performed using the location linear discriminant function which assumes across-location homoscedasticity. In this paper, we investigate the hazard arising from a routine application of the classifier under across-location heteroscedasticity. A limiting and a first-order asymptotic performance index are proposed and studied in a general setting. The first index studies the limiting behavior. The second index corrects the bias due to the finite sample size. Both indexes are illustrated under the assumption of unequal spherical covariance matrices across all the locations. This is likely to be the case in most classification problems dealing with mixed categorical and continuous data. Results of a numerical study are reported.

19.

Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations (Cited by: 1; self-citations: 0; citations by others: 1)

Kazuyoshi Yata Makoto Aoshima 《Journal of Multivariate Analysis》2012,105(1):193-215

In this article, we propose a new estimation methodology to deal with PCA for high-dimension, low-sample-size (HDLSS) data. We first show that HDLSS datasets have different geometric representations depending on whether a ρ-mixing-type dependency appears in variables or not. When the ρ-mixing-type dependency appears in variables, the HDLSS data converge to an n-dimensional surface of unit sphere with increasing dimension. We pay special attention to this phenomenon. We propose a method called the noise-reduction methodology to estimate eigenvalues of an HDLSS dataset. We show that the eigenvalue estimator holds consistency properties along with its limiting distribution in the HDLSS context. We consider consistency properties of PC directions. We apply the noise-reduction methodology to estimating PC scores. We also give an application in the discriminant analysis for HDLSS datasets by using the inverse covariance matrix estimator induced by the noise-reduction methodology.

20.

A method for identifying fake reviews under class imbalance based on the BalanceCascade-GBDT algorithm

陶朝杰 杨进 《经济数学》2020,37(3):214-220

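Fake reviews are an unavoidable problem in the development of e-commerce. To address the class imbalance found in online review data, a fake-review identification method based on the BalanceCascade-GBDT algorithm is proposed. The BalanceCascade algorithm gradually shrinks the majority-class sample space by setting the false-positive rate of the classifiers, and then combines all base classifiers into the final classifier. GBDT is widely used in classification problems for its high accuracy and interpretability and, being an algorithm that is unstable under sample perturbation, it is a very suitable base classifier. The model is built on the Yelp review dataset, uses the AUC as the evaluation metric, and is compared with logistic regression, random forest, and neural network algorithms; the experiments demonstrate the effectiveness of the method.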