Similar Literature
20 similar documents found.
1.
We propose a two-step variable selection procedure for censored quantile regression with high dimensional predictors. To account for censored data in the high dimensional case, we employ effective dimension reduction together with the idea of an informative subset. Under some regularity conditions, we show that our procedure enjoys model selection consistency. A simulation study and a real data analysis are conducted to evaluate the finite sample performance of the proposed approach.

2.
Feature screening plays an important role in ultrahigh dimensional data analysis. This paper is concerned with conditional feature screening when one is interested in detecting the association between the response and ultrahigh dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical or environmental variables). To this end, we first propose a new index to measure conditional independence, and then develop a conditional screening procedure based on the newly proposed index. We systematically study the theoretical properties of the proposed procedure and establish its sure screening and ranking consistency properties under some very mild conditions. The newly proposed screening procedure enjoys several appealing properties: (a) it is model-free, in that its implementation does not require specification of a model structure; (b) it is robust to heavy-tailed distributions or outliers in both the response and the predictors; and (c) it handles both unconditional and conditional feature screening in a unified way. We study the finite sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the proposed method through two real data examples.
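For intuition, a minimal sketch of conditional screening, substituting an ordinary partial-correlation measure for the paper's conditional-independence index (the function name `conditional_screen` and all inputs are hypothetical; the paper's actual index is model-free and robust to outliers, which this stand-in is not):

```python
import numpy as np

def conditional_screen(X, y, Z, top_k):
    """Rank predictors by the absolute correlation between the residuals
    of y and of each column of X after regressing out the exposure Z.
    A partial-correlation stand-in for a conditional-independence index."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    # residualize the response and every predictor on the exposure
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    rX = X - Z1 @ np.linalg.lstsq(Z1, X, rcond=None)[0]
    corr = np.abs(rX.T @ ry) / (
        np.linalg.norm(rX, axis=0) * np.linalg.norm(ry) + 1e-12
    )
    return np.argsort(corr)[::-1][:top_k]  # indices of top-ranked predictors
```

Predictors whose residual correlation with the residualized response is largest survive the screen; the sure screening property would then require the retained set to contain all truly active predictors with high probability.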

3.
In this paper we propose a new test procedure for sphericity of the covariance matrix when the dimensionality, p, exceeds the sample size, N = n + 1. Under the assumptions that (A) tr(Σ^i)/p converges to a finite limit as p → ∞ for i = 1, …, 16 and (B) p/n → c < ∞, known as the concentration, a new statistic is developed utilizing the ratio of the fourth and second arithmetic means of the eigenvalues of the sample covariance matrix. The newly defined test has many desirable general asymptotic properties, such as normality and consistency as (n, p) → ∞. Our simulation results show that the new test is comparable to, and in some cases more powerful than, the tests for sphericity in the current literature.
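The core quantity of the statistic, the ratio of the fourth to the squared second arithmetic mean of the sample-covariance eigenvalues, can be sketched as follows (the centering and scaling constants of the actual test are omitted; `eigen_mean_ratio` is an illustrative name):

```python
import numpy as np

def eigen_mean_ratio(eigvals):
    """Ratio of the fourth arithmetic mean to the squared second
    arithmetic mean of covariance eigenvalues.  The ratio equals 1
    exactly when all eigenvalues are equal (the spherical case) and
    exceeds 1 otherwise, so large values signal departure from sphericity."""
    lam = np.asarray(eigvals, dtype=float)
    return np.mean(lam**4) / np.mean(lam**2) ** 2

def sphericity_ratio(X):
    """The same ratio computed from the eigenvalues of the sample
    covariance matrix of a data matrix X whose rows are observations."""
    S = np.cov(X, rowvar=False)
    return eigen_mean_ratio(np.linalg.eigvalsh(S))
```

Under sphericity the population ratio is 1; the test's asymptotic normality concerns the suitably centered and scaled sample version of this quantity.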

4.
For high dimensional data sets, the sample covariance matrix is unbiased but noisy if the sample is not large enough. Shrinking the sample covariance towards a constrained, low dimensional estimator can mitigate this sampling variability: we introduce bias, but reduce variance. In this paper, we give details on feasible optimal shrinkage that allows for time series dependence in the observations.
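A minimal sketch of linear shrinkage toward a scaled identity target, with the shrinkage intensity `delta` supplied by the user rather than derived from the paper's feasible optimal formula (which accounts for time series dependence and is not reproduced here):

```python
import numpy as np

def shrink_covariance(X, delta):
    """Convex combination of the sample covariance S and the scaled
    identity target mu * I, where mu is the average eigenvalue of S.
    delta in [0, 1] trades variance (delta -> 1) against bias (delta -> 0)."""
    S = np.cov(X, rowvar=False)            # p x p sample covariance
    p = S.shape[0]
    mu = np.trace(S) / p                   # average sample eigenvalue
    return (1.0 - delta) * S + delta * mu * np.eye(p)
```

At `delta = 0` the estimator is the raw sample covariance; at `delta = 1` it collapses to the low dimensional target, illustrating the bias-variance trade-off the abstract describes.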

5.
In this article, the problem of classifying a new observation vector into one of two known groups Π_i, i = 1, 2, distributed as multivariate normal with a common covariance matrix, is considered. The total number of observation vectors from the two groups is, however, less than the dimension of the observation vectors. A sample squared distance between the two groups, based on the Moore-Penrose inverse, is introduced. A classification rule based on the minimum distance is proposed to classify an observation vector into two or several groups. An expression for the error of misclassification when there are only two groups is derived for large p and n = O(p^δ), 0 < δ < 1.
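The form of such a minimum-distance rule can be sketched as below, with a Moore-Penrose inverse replacing the ordinary inverse of the singular (since n < p) pooled covariance; inputs and names are hypothetical, and the paper's distributional analysis is not reflected:

```python
import numpy as np

def mp_distance(x, m, S):
    """Squared sample distance from x to the mean m, induced by the
    Moore-Penrose inverse of a possibly singular covariance estimate S."""
    d = x - m
    return float(d @ np.linalg.pinv(S) @ d)

def mp_classify(x, group_means, S):
    """Minimum-distance rule: assign x to the group whose sample mean is
    nearest under the Moore-Penrose distance."""
    return int(np.argmin([mp_distance(x, m, S) for m in group_means]))
```

Because `np.linalg.pinv` is defined for rank-deficient matrices, the rule remains computable when the pooled sample covariance is singular, which is exactly the n < p regime of the abstract.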

6.
This paper deals with the problem of choosing the optimal criterion for selecting the best of a set of nested binary choice models. Special attention is given to procedures derived in a decision-theoretic framework, called model selection criteria (MSC). We propose a new criterion, which we call C2, whose theoretical behaviour is compared with that of the AIC and SBIC criteria. The theoretical study shows that the SBIC is the best criterion in every situation we consider, while the AIC and C2 are adequate only in some cases. The Monte Carlo experiment that is carried out corroborates the theoretical results and adds others concerning finite sample behaviour and robustness to changes in some aspects of the data generating process. The classical hypothesis testing procedures LR and LM are included and compared with the three criteria of the MSC category. The authors gratefully acknowledge financial support from the Spanish Department of Education under project BEC 2003-01757.

7.
We propose using graph theoretic results to develop an infrastructure that tracks movement from a display of one set of variables to another. The illustrative example throughout is the real-time morphing of one scatterplot into another. Hurley and Oldford (J Comput Graph Stat, 2010) made extensive use of the graph having variables as nodes and edges indicating a paired relationship between them. The present paper introduces several new graphs derivable from this one whose traversals can be described as particular movements through high dimensional spaces. These are connected to known results in graph theory, and the graph theoretic results are applied to the problem of visualizing high-dimensional data.

8.
A new method for the numerical integration of very high dimensional functions is introduced and implemented, based on the Metropolis Monte Carlo algorithm. The logarithm of the high dimensional integral is reduced to a one-dimensional integral, over the unit interval, of a certain statistical function of a scale parameter. The improvement in accuracy is found to be substantial compared with conventional crude Monte Carlo integration. Several numerical demonstrations are made, and the variability of the estimates is shown.

9.
In high-dimensional data settings where p ≫ n, many penalized regularization approaches have been studied for simultaneous variable selection and estimation. However, in the presence of covariates with weak effects, many existing variable selection methods, including the Lasso and its generalizations, cannot distinguish covariates with weak contributions from those with none. Thus, prediction based only on a subset model of selected covariates can be inefficient. In this paper, we propose a post-selection shrinkage estimation strategy to improve the prediction performance of a selected subset model. Such a post-selection shrinkage estimator (PSE) is data adaptive and is constructed by shrinking a post-selection weighted ridge estimator in the direction of a selected candidate subset. Under an asymptotic distributional quadratic risk criterion, its prediction performance is explored analytically. We show that the proposed PSE performs better than the post-selection weighted ridge estimator. More importantly, it significantly improves the prediction performance of any candidate subset model selected by most existing Lasso-type variable selection methods. The relative performance of the PSE is demonstrated by both simulation studies and real-data analysis. Copyright © 2016 John Wiley & Sons, Ltd.

10.
Classification of high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection, to choose only informative features, or feature construction, to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study of the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. Cases in which overfitting occurred are analysed via the distribution of features, and further analysis shows why the constructed features can achieve promising classification performance.

11.
We present the design of more effective and efficient genetic algorithm based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach, where every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs both mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying the above techniques to a real world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.

12.
Previous studies on financial distress prediction (FDP) have mostly constructed FDP models from balanced data sets, or have applied only traditional classification methods to imbalanced data sets, which often overestimates an FDP model's ability to recognize distressed companies. Our study focuses on support vector machine (SVM) methods for FDP based on imbalanced data sets. We propose a new imbalance-oriented SVM method that combines the synthetic minority over-sampling technique (SMOTE) with the Bagging ensemble learning algorithm and uses SVM as the base classifier. It is named the SMOTE-Bagging-based SVM-ensemble (SB-SVM-ensemble) and is theoretically more effective for FDP modelling on imbalanced data sets with a limited number of samples. For comparison, the traditional SVM method as well as three classical imbalance-oriented SVM methods, cost-sensitive SVM, SMOTE-SVM, and data-set-partition-based SVM-ensemble, are also introduced. We collect an imbalanced FDP data set from Chinese publicly traded companies and carry out 100 experiments to empirically test its effectiveness. The experimental results indicate that the new SB-SVM-ensemble method outperforms the traditional methods and is a useful tool for imbalanced FDP modelling.
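The SMOTE step of such a pipeline, generating synthetic minority samples by interpolating toward nearest minority neighbours, might be sketched as follows (the bagging and SVM layers are omitted; all names are illustrative):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples.  Each synthetic point
    lies on the segment between a randomly chosen minority point and one
    of its k nearest minority neighbours, which is the core SMOTE idea."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1 : k + 1]        # skip the point itself
        j = rng.choice(nbrs)
        t = rng.random()                       # interpolation fraction in [0, 1)
        out.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(out)
```

In an SB-SVM-ensemble-style scheme, each bagging round would draw a bootstrap sample, oversample its minority class this way, and fit one SVM base classifier to the rebalanced sample.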

13.
14.
In this paper, we propose a new random forest (RF) algorithm for classification of high dimensional data, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building the decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments have been conducted on high dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrate that the proposed approach to learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.

15.
16.
We develop a walk-on-sphere method for fractional Poisson equations with Dirichlet boundary conditions in high dimensions. The walk-on-sphere method is based on a probabilistic representation of the fractional Poisson equation. We propose efficient quadrature rules to evaluate the integral representation in the ball, and apply rejection sampling to draw from the computed probabilities in general domains. Moreover, we provide an estimate of the mean number of walks for the method when the domain is a ball, and show that this number increases with the fractional order and with the distance of the starting point from the origin. We also give the relationship between the Green function of the fractional Laplace equation and that of the classical Laplace equation. Numerical results for problems in 2-10 dimensions verify our theory and the efficiency of the modified walk-on-sphere method.
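For intuition, the classical walk-on-sphere recursion that the fractional method generalizes can be sketched on a disk (this is the classical Laplace version with Dirichlet data, not the paper's fractional algorithm; names and parameters are illustrative):

```python
import numpy as np

def walk_on_spheres(x0, boundary_g, domain_radius=1.0, eps=1e-3, rng=None):
    """One walk of the classical walk-on-sphere estimator for the
    harmonic function u on a disk with Dirichlet data g: from the current
    point, jump to a uniform point on the largest circle contained in the
    domain, and stop once within eps of the boundary.  Returns the
    boundary value sampled and the number of jumps taken."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    steps = 0
    while True:
        r = domain_radius - np.linalg.norm(x)   # distance to the boundary
        if r < eps:
            return boundary_g(x), steps
        theta = rng.uniform(0.0, 2.0 * np.pi)
        x = x + r * np.array([np.cos(theta), np.sin(theta)])
        steps += 1
```

Averaging the returned boundary values over many walks estimates u(x0); the expected number of jumps grows only logarithmically in 1/eps, which is what makes the approach attractive in high dimensions.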

17.
Extreme learning machine (ELM) is not only an effective classifier in supervised learning, but can also be applied to unsupervised and semi-supervised learning. The model structures of unsupervised extreme learning machine (US-ELM) and semi-supervised extreme learning machine (SS-ELM) are the same as that of ELM; the difference between them lies in the cost function. We introduce kernel functions into US-ELM and SS-ELM, yielding unsupervised and semi-supervised extreme learning machines with kernel (US-KELM and SS-KELM). Wavelet analysis has the characteristics of multivariate interpolation and sparse representation, and wavelet kernel functions have been widely used in support vector machines. Therefore, combining the wavelet kernel function with US-ELM and SS-ELM, unsupervised and semi-supervised extreme learning machines with wavelet kernel function (US-WKELM and SS-WKELM) are proposed in this paper. The experimental results show the feasibility and validity of US-WKELM and SS-WKELM in clustering and classification.

18.
Classification of samples into two or more classes is of interest to scientists in almost every field. Traditional statistical methodology for classification does not work well when there are more variables (p) than samples (n), and it is highly sensitive to outlying observations. In this study, a robust partial least squares based classification method is proposed to handle data containing outliers where n ≪ p. The proposed method is applied to well-known benchmark datasets, and its properties are explored by an extensive simulation study.
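A minimal, non-robust one-component PLS classifier illustrating the underlying mechanism (the paper's contribution is a robustified version of such steps, which this sketch does not include; all names are illustrative):

```python
import numpy as np

def pls1_fit(X, y):
    """First PLS component for a +1/-1 coded class label y: after
    centering, the weight vector w is proportional to X^T y, the score is
    t = Xw, and a scalar slope b regresses y on the score.  PLS works
    when n << p because it never inverts a p x p covariance matrix."""
    xmean, ymean = X.mean(axis=0), y.mean()
    Xc, yc = X - xmean, y - ymean
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)
    t = Xc @ w
    b = (t @ yc) / (t @ t)
    return w, b, xmean, ymean

def pls1_predict(X, w, b, xmean, ymean):
    """Classify by the sign of the fitted response along the PLS score."""
    return np.where((X - xmean) @ w * b + ymean >= 0, 1, -1)
```

A robust variant would replace the means and cross-products above with outlier-resistant counterparts, which is where the abstract's method departs from this sketch.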

19.
This paper investigates the feature subset selection problem for binary classification using a logistic regression model. We develop a modified discrete particle swarm optimization (PSO) algorithm for the feature subset selection problem. This approach embodies an adaptive feature selection procedure that dynamically accounts for the relevance and dependence of the features included in the feature subset. We compare the proposed methodology with tabu search and scatter search algorithms using publicly available datasets. The results show that the proposed discrete PSO algorithm is competitive in terms of both classification accuracy and computational performance.
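One velocity-and-position update of a generic binary PSO for subset search might look as follows (a standard sigmoid-transfer binary PSO, not the paper's modified variant; parameter values and names are illustrative):

```python
import numpy as np

def bpso_step(positions, velocities, pbest, gbest,
              w=0.7, c1=1.5, c2=1.5, rng=None):
    """One binary PSO update.  Each row of `positions` is a 0/1 mask over
    features (1 = feature included).  Velocities follow the standard PSO
    rule toward personal bests (pbest) and the global best (gbest), and
    positions are resampled through a sigmoid transfer of the velocity."""
    rng = np.random.default_rng(rng)
    r1 = rng.random(positions.shape)
    r2 = rng.random(positions.shape)
    velocities = (w * velocities
                  + c1 * r1 * (pbest - positions)
                  + c2 * r2 * (gbest - positions))
    prob = 1.0 / (1.0 + np.exp(-velocities))   # sigmoid transfer function
    positions = (rng.random(positions.shape) < prob).astype(int)
    return positions, velocities
```

In a wrapper setting, each particle's fitness would be the cross-validated accuracy of a logistic regression fitted on the features its mask selects, and `pbest`/`gbest` would track the best masks found so far.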

20.
We consider the problem of recovering, from the final data u(x, y, T) = g(x, y), the initial data u(x, y, 0) of a temperature function u(x, y, t), (x, y) ∈ I = (0, π) × (0, π), t ∈ [0, T], satisfying a two-dimensional heat-conduction system on I.
The problem is severely ill-posed. In this paper, a simple and convenient new regularization method for solving this problem is considered, and some quite sharp error estimates between the approximate and exact solutions are provided. A numerical example also shows that the method works effectively.
