Similar Literature (20 results)
1.
The multinomial logit model is the most widely used model for nominal multi-category responses. One problem with the model is that it involves many parameters; another is that, because the model is nonlinear, its parameters are much harder to interpret than those of linear models. Both problems can profit from graphical representations. We propose to visualize the effect strengths by star plots, where one star collects all the parameters connected to one term in the linear predictor. In simple models, one star refers to one explanatory variable. In contrast to conventional star plots, which are used to represent data, the plots represent parameters and are considered as parameter glyphs. The set of stars for a fitted model makes the main features of the effects of explanatory variables on the response variable easily accessible. The method is extended to ordinal models and illustrated by several datasets. Supplementary materials are available online.
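A minimal sketch of the idea in Python (not the authors' implementation; the dataset, the exponential spoke scaling, and the styling are my own choices): fit a multinomial logit and draw one star per predictor, with one spoke per response category.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial logit
coef = model.coef_                                   # (n_categories, n_predictors)

n_cat, n_pred = coef.shape
angles = np.linspace(0, 2 * np.pi, n_cat, endpoint=False)
fig, axes = plt.subplots(1, n_pred, subplot_kw={"projection": "polar"},
                         figsize=(3 * n_pred, 3))
for j, ax in enumerate(axes):
    # One star per predictor; spoke length exp(coef) makes the unit circle
    # correspond to "no effect" on that category (my scaling choice).
    r = np.exp(coef[:, j])
    ax.plot(np.r_[angles, angles[0]], np.r_[r, r[0]])
    ax.set_title(f"predictor {j}")
plt.tight_layout()
plt.show()
```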

2.
This article presents a likelihood-based boosting approach for fitting binary and ordinal mixed models. In contrast to common procedures, this approach can be used in high-dimensional settings where a large number of potentially influential explanatory variables are available. Constructed as a componentwise boosting method, it is able to perform variable selection with the complexity of the resulting estimator being determined by information criteria. The method is investigated in simulation studies both for cumulative and sequential models and is illustrated by using real datasets. The supplementary materials for the article are available online.
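The componentwise mechanism is easy to see in a toy setting. The sketch below is my own simplification (the paper boosts a penalized likelihood for binary and ordinal mixed models, not squared-error loss): each iteration updates only the single best-fitting component, which is what yields variable selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

nu, n_iter = 0.1, 200            # step length and number of boosting iterations
beta = np.zeros(p)
resid = y - y.mean()
for _ in range(n_iter):
    # Fit each component separately; update only the best one.
    b = X.T @ resid / (X ** 2).sum(axis=0)   # univariate LS coefficients
    scores = b * (X.T @ resid)               # decrease in RSS per component
    j = int(np.argmax(scores))
    beta[j] += nu * b[j]
    resid -= nu * b[j] * X[:, j]

print("selected components:", np.nonzero(beta)[0])  # selection comes for free
```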

3.
In developing travel demand models it is generally assumed that the base-year data used in developing the parameters, as well as the forecasted data to be used as independent variables for the design year, are of acceptable quality. The purpose of this paper is to present the application of error propagation theory in assessing the predictive quality of one type of travel demand forecasting model (multinomial logit models) and to demonstrate how error considerations can be used as a tool for identifying the optimal model. The general conclusions of this study are that: (1) it is indeed possible to quantify errors in dependent variables in logit models as a consequence of errors in independent variables; and (2) error consideration can be used as a tool for identifying the optimal model from a set of candidate models. Further research is recommended to develop better insights into the phenomenon of error propagation so that the consideration of errors can be a factor in decisions on model selection.
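The core mechanism, first-order (delta-method) error propagation, can be written down directly. Below is a hedged binary-logit sketch with made-up coefficients and input-error variances; the paper works with multinomial models, but the mechanics are the same. With P = 1/(1 + exp(-x'b)), the chain rule gives dP/dx_k = P(1-P)b_k, so independent input errors propagate as var(P) ≈ Σ_k (P(1-P)b_k)² var(x_k).

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.8])   # assumed fitted coefficients
x = np.array([1.0, 2.0, -0.5])      # forecasted inputs for the design year
var_x = np.array([0.1, 0.3, 0.05])  # assumed input error variances

p = 1.0 / (1.0 + np.exp(-x @ beta))
grad = p * (1.0 - p) * beta          # dP/dx by the chain rule
var_p = np.sum(grad ** 2 * var_x)    # first-order (delta-method) variance
print(f"P = {p:.3f}, sd(P) ~= {np.sqrt(var_p):.3f}")
```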

4.
In biomedical research, boosting-based regression approaches have gained much attention in the last decade. Their intrinsic variable selection procedure and ability to shrink the estimates of the regression coefficients toward 0 make these techniques appropriate for fitting prediction models to high-dimensional data, e.g., gene expression data. Their prediction performance, however, depends strongly on specific tuning parameters, in particular on the number of boosting iterations to perform. This crucial parameter is usually selected via cross-validation. The cross-validation procedure, in turn, may depend heavily on a completely random component, namely the considered fold partition. We empirically study how much this randomness affects the results of the boosting techniques, in terms of the selected predictors and the prediction ability of the related models. We use four publicly available datasets related to four different diseases. In these studies, the goal is to predict survival end-points when a large number of continuous candidate predictors are available. We focus on two well-known boosting approaches implemented in the R packages CoxBoost and mboost, assuming the validity of the proportional hazards assumption and the linearity of the effects of the predictors. We show that the variability in the selected predictors and the prediction ability of the model is reduced by averaging over several repetitions of cross-validation when selecting the tuning parameters.
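A sketch of the proposed remedy using scikit-learn stand-ins (the study itself uses the R packages CoxBoost and mboost on survival data): average cross-validated scores over several random fold partitions before selecting the number of boosting iterations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=300, n_features=50, random_state=0)
grid = {"n_estimators": [25, 50, 100, 200, 400]}
# 10 repetitions of 5-fold CV: the selected tuning value no longer hinges
# on a single random fold partition.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=cv)
search.fit(X, y)
print("chosen number of boosting iterations:", search.best_params_)
```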

5.
Semiparametric models with a diverging number of predictors arise in many contemporary scientific areas. Variable selection for these models consists of two components: model selection for the nonparametric components and selection of significant variables for the parametric portion. In this paper, we consider a variable selection procedure that combines basis function approximation with the SCAD penalty. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency and sparsity of this procedure.
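For concreteness, here is the SCAD penalty derivative (Fan and Li's form) and the weighted-Lasso reduction it induces via local linear approximation; a minimal sketch, not the authors' full semiparametric procedure.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lambda(|t|) of the SCAD penalty (Fan & Li)."""
    t = np.abs(t)
    return np.where(t <= lam, lam,
                    np.maximum(a * lam - t, 0.0) / (a - 1.0))

# One LLA step turns SCAD into a weighted Lasso: coefficient j gets weight
# w_j = p'_lambda(|beta0_j|) from a pilot estimate beta0. Small pilot
# coefficients stay fully penalized, large ones become nearly unpenalized,
# which is what drives the oracle-type behaviour.
beta0 = np.array([0.01, 0.5, 2.0])
print(scad_deriv(beta0, lam=0.2))   # -> [0.2, 0.089, 0.0]
```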

6.
The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. We study the problem of variable selection and estimation in this model in the sparse, high-dimensional case. We develop a concave group selection approach for this problem using basis function expansion and study its theoretical and empirical properties. We also apply the group least absolute shrinkage and selection operator (Lasso) for variable selection and estimation in this model and study its properties. Under appropriate conditions, we show that the group Lasso selects a model whose dimension is comparable to that of the underlying model, regardless of the large number of unimportant variables. In order to improve the selection results, we show that the group minimax concave penalty (MCP) has the oracle selection property in the sense that it correctly selects important variables with probability converging to one under suitable conditions; by comparison, the group Lasso does not have the oracle selection property. Both approaches are evaluated using simulation and demonstrated on a data example.
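A compact proximal-gradient sketch of the group Lasso step (in the paper each group would collect the basis-expansion coefficients of one covariate; the data, groups, and penalty level here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 15
groups = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[0:5] = 1.0    # only group 0 is active
y = X @ beta_true + rng.standard_normal(n)

lam = 50.0
L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
beta = np.zeros(p)
for _ in range(500):                 # proximal gradient (ISTA)
    z = beta - X.T @ (X @ beta - y) / L
    for g in groups:                 # groupwise soft-thresholding
        norm_g = np.linalg.norm(z[g])
        shrink = max(0.0, 1.0 - lam / (L * norm_g)) if norm_g > 0 else 0.0
        beta[g] = shrink * z[g]      # zeroes out entire inactive groups

print("group norms:", [round(float(np.linalg.norm(beta[g])), 3) for g in groups])
```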

7.
We consider multiple linear regression in which special properties of the coefficients are imposed by parameterizing them with exponential, logit, and multinomial functions. To obtain always-positive coefficients, the exponential parameterization is applied. To keep coefficients in an assigned range, the logistic parameterization is used. Such coefficients permit us to evaluate the impact of individual predictors in the model. The coefficients obtained by the multinomial-logit parameterization equal the shares of the predictors, which is useful for interpreting their influence. The considered regression models are constructed by nonlinear optimization techniques, have stable solutions and good quality of fit, have a simple structure of the linear aggregates, demonstrate high predictive ability, and suggest a convenient way to identify the main predictors.
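A sketch of the multinomial (softmax) parameterization with scipy (function and data names are mine): the unconstrained parameters are mapped to coefficients that are nonnegative shares summing to one, which makes the fitted coefficients directly interpretable as predictor shares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
y = X @ np.array([0.1, 0.2, 0.3, 0.4]) + 0.05 * rng.standard_normal(300)

def rss(theta):
    b = np.exp(theta) / np.exp(theta).sum()   # coefficients as shares
    return np.sum((y - X @ b) ** 2)

res = minimize(rss, np.zeros(4), method="Nelder-Mead")
shares = np.exp(res.x) / np.exp(res.x).sum()
print("estimated predictor shares:", shares.round(3))  # ~ [0.1, 0.2, 0.3, 0.4]
```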

8.
A general single-index model with high-dimensional predictors is considered. An additive structure of the unknown link function and the error is not assumed in this model. The consistency of predictor selection and estimation is investigated in this model. The index is formulated in the sufficient dimension reduction framework. A distribution-based LASSO estimation is then suggested. When the dimension of predictors can diverge at a polynomial rate of the sample size, the consistency holds under an irrepresentable condition and mild conditions on the predictors. The new method has no requirement, other than independence from the predictors, for the distribution of the error. This property results in robustness of the new method against outliers in the response variable. The conventional consistency of index estimation is provided after the dimension is brought down to a value smaller than the sample size. The importance of the irrepresentable condition for consistency, as well as the robustness, is examined by a simulation study and two real-data examples.

9.
Variable selection has consistently been a hot topic in linear regression models, especially when facing high-dimensional data. Variable ranking, an advanced form of selection, is actually more fundamental, since selection can be realized by thresholding once the variables are suitably ranked. In recent years, ensemble learning has gained significant interest in the context of variable selection due to its great potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, a novel ensemble method, PBoostGA, is developed in this paper to implement variable ranking and selection in linear regression models. In PBoostGA, a weight distribution is maintained over the training set and a genetic algorithm is adopted as its base learner. Initially, equal weight is assigned to each instance. According to a weight-updating and ensemble-member-generating mechanism like that of AdaBoost.RT, a series of slightly different importance measures is sequentially produced for each variable. Finally, the candidate variables are ordered in light of the average importance measure, and some significant variables are then selected by a thresholding rule. Both simulation results and a real-data illustration show the effectiveness of PBoostGA in comparison with some existing counterparts; in particular, PBoostGA has stronger ability to exclude redundant variables.
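A much-simplified sketch of the reweighting-and-averaging idea (PBoostGA itself uses a genetic algorithm as base learner with AdaBoost.RT-style updates; here each ensemble member is a Lasso fit on a weight-driven resample, which is my stand-in):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 120, 30
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)

w = np.full(n, 1.0 / n)                  # initial instance weights
importance = np.zeros(p)
for _ in range(20):
    idx = rng.choice(n, size=n, replace=True, p=w)   # weight-driven resample
    m = Lasso(alpha=0.1).fit(X[idx], y[idx])
    err = np.abs(y - m.predict(X))
    w = w * np.exp(err / err.max())      # up-weight poorly fitted cases
    w /= w.sum()
    importance += np.abs(m.coef_)        # accumulate an importance measure

ranking = np.argsort(importance)[::-1]   # rank first, threshold afterwards
print("top-ranked variables:", ranking[:5])
```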

10.
We consider the problem of variable selection for the single-index varying-coefficient model, and present a regularized variable selection procedure combining basis function approximations with the SCAD penalty. The proposed procedure simultaneously selects significant covariates with functional coefficients and significant local variables with parametric coefficients. With appropriate selection of the tuning parameters, the consistency of the variable selection procedure and the oracle property of the estimators are established. The proposed method can naturally be applied to the pure single-index model and the varying-coefficient model. The finite-sample performance of the proposed method is illustrated by a simulation study and a real data analysis.

11.
Random forests are a commonly used tool for classification and for ranking candidate predictors based on the so-called variable importance measures. These measures attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example, approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches require the repeated computation of random forests. While for low-dimensional settings those approaches might be computationally tractable, for high-dimensional settings, typically including thousands of candidate predictors, the computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies based on real data from high-dimensional binary classification settings. In these studies the new approach controls the type I error and has at least comparable power at substantially smaller computation time. It might thus be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
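A sketch of the permutation-based testing baseline in the spirit of Altmann and colleagues (the paper's own heuristic avoids exactly this per-permutation refitting, and its reference implementation is the R package vita, not the code below): build a null distribution for each variable's importance by refitting the forest on permuted responses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)

obs = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
null = np.array([
    # Refitting once per permutation is the expensive step the paper avoids.
    RandomForestClassifier(random_state=r)
    .fit(X, rng.permutation(y)).feature_importances_
    for r in range(50)                 # 50 permutations: illustrative only
])
pvals = (1 + (null >= obs).sum(axis=0)) / (1 + len(null))
print("variables with p < 0.05:", np.where(pvals < 0.05)[0])
```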

12.
We propose a new binary classification and variable selection technique especially designed for high-dimensional predictors. Among many predictors, typically only a small fraction of them have significant impact on prediction. In such a situation, more interpretable models with better prediction accuracy can be obtained by variable selection along with classification. By adding an ℓ1-type penalty to the loss function, common classification methods such as logistic regression or support vector machines (SVM) can perform variable selection. Existing penalized SVM methods all attempt to solve for all the parameters involved in the penalization problem jointly. When the data dimension is very high, the joint optimization problem is very complex and involves a lot of memory allocation. In this article, we propose a new penalized forward search technique that reduces the high-dimensional optimization problem to one-dimensional optimization by iterating the selection steps. The new algorithm can be regarded as a forward selection version of the penalized SVM and its variants. The advantage of optimizing in one dimension is that the location of the optimum solution can be obtained by intelligent search, exploiting the convexity and piecewise linear or quadratic structure of the criterion function. In each step, the predictor most able to predict the outcome is added to the model. The search is then repeated in an iterative fashion until convergence. Comparison of our new classification rule with the ℓ1-SVM and other common methods shows very promising performance, in that the proposed method leads to much leaner models without compromising misclassification rates, particularly for high-dimensional predictors.
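A heavily simplified sketch of the forward search (the article's method solves a one-dimensional penalized SVM problem with a fast structured search at each step; here each step simply refits a linear SVM on the candidate active set, which is my stand-in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=150, n_features=100, n_informative=4,
                           random_state=1)
active, max_vars = [], 5
for _ in range(max_vars):
    best_j, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in active:
            continue
        cols = active + [j]
        # Score the candidate model; the paper instead minimizes the
        # penalized criterion with a one-dimensional search.
        m = LinearSVC(C=0.5, max_iter=5000).fit(X[:, cols], y)
        score = m.score(X[:, cols], y)
        if score > best_score:
            best_j, best_score = j, score
    active.append(best_j)              # greedily grow the model
print("selected predictors:", active)
```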

13.
High-dimensional data have frequently been collected in many scientific areas including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental for high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While the penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional in the literature. This work aims to present a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening in specific models and on the motivation for model-free feature screening procedures.
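The simplest marginal utility discussed in this literature is the absolute marginal correlation of sure independence screening (SIS); a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 2000                      # ultrahigh dimension: p >> n
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 7] + rng.standard_normal(n)

Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()
utility = np.abs(Xs.T @ ys) / n       # |marginal correlation| per predictor
d = int(n / np.log(n))                # a common choice of screening size
kept = np.argsort(utility)[::-1][:d]  # keep the top-d ranked predictors
print("signal variables survive screening:", {0, 7} <= set(kept))
```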

14.
We consider variable selection for high-dimensional longitudinal linear regression models when the response variable is subject to monotone missingness. Based on inverse probability weighted generalized estimating equations, we propose an automatic variable selection method that uses no existing penalty function and therefore avoids the nonconvex optimization problems penalty functions entail; it automatically eliminates zero regression coefficients while simultaneously estimating the nonzero ones. Under certain regularity conditions, the proposed method is shown to possess the oracle property. Finally, a simulation study verifies the finite-sample properties of the proposed method.
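A minimal sketch of the inverse-probability-weighting ingredient (the paper builds weighted generalized estimating equations for longitudinal data with monotone missingness; the cross-sectional weighted least squares below only illustrates how the weights are formed and used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
X = rng.standard_normal((n, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)
observed = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))  # missingness depends on X

# Step 1: model the probability of being observed.
pi = LogisticRegression().fit(X, observed).predict_proba(X)[:, 1]
# Step 2: weight each observed case by the inverse of that probability.
Xo, yo = X[observed], y[observed]
wo = 1.0 / np.clip(pi[observed], 1e-3, None)
XtW = (Xo * wo[:, None]).T
beta_hat = np.linalg.solve(XtW @ Xo, XtW @ yo)  # weighted normal equations
print("IPW estimate:", beta_hat.round(2))       # ~ [1.0, -1.0, 0.5]
```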

15.
In multinomial logit models, the identifiability of parameter estimates is typically obtained by side constraints that specify one of the response categories as reference category. When parameters are penalized, shrinkage of estimates should not depend on the reference category. In this paper we investigate ridge regression for the multinomial logit model with symmetric side constraints, which yields parameter estimates that are independent of the reference category. In simulation studies the results are compared with the usual maximum likelihood estimates and an application to real data is given.
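The symmetric side constraint can be illustrated numerically: coefficients from a reference-category parameterization can be recentred to sum to zero across response categories without changing the fitted probabilities, and penalizing the recentred coefficients makes the shrinkage independent of which category was the reference. The numbers below are made up.

```python
import numpy as np

# Assumed coefficients of one predictor over four response categories,
# with the first (reference) category fixed at 0.
beta_ref = np.array([0.0, 1.2, -0.4, 0.7])   # reference parameterization
beta_sym = beta_ref - beta_ref.mean()        # symmetric parameterization
# Adding a constant to all category coefficients leaves the multinomial
# probabilities unchanged, so both vectors describe the same model.
print(beta_sym.round(3), beta_sym.sum())     # sums to zero by construction
```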

16.
A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.

17.
Variable-selection control charts are an important method for high-dimensional statistical process monitoring. Traditional variable-selection control charts rarely account for the spatial correlation of high-dimensional processes, which lowers monitoring efficiency. To address this, we propose a monitoring model for high-dimensional spatially correlated processes based on the fused LASSO. First, the likelihood-ratio test is modified using the fused-LASSO algorithm; next, a monitoring statistic based on the penalized likelihood ratio is derived; finally, the performance of the proposed monitoring model is analyzed through simulations and a real case study. Both the simulations and the real case show that, in high-dimensional spatially correlated processes, the proposed method accurately identifies potentially anomalous variables when adjacent monitored variables shift simultaneously, achieving good monitoring performance.
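A hedged sketch of the fused-LASSO building block (the paper embeds it in a penalized likelihood-ratio statistic; the convex program below, solved with the cvxpy package, only shows why the fusion term picks up simultaneous shifts in adjacent variables):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
p = 50
mu_true = np.zeros(p); mu_true[20:25] = 2.0    # a shift in adjacent variables
x = mu_true + rng.standard_normal(p)           # one monitored observation

mu = cp.Variable(p)
lam1, lam2 = 0.5, 1.0
obj = (cp.sum_squares(x - mu) / 2
       + lam1 * cp.norm1(mu)                   # sparsity of shifted variables
       + lam2 * cp.norm1(mu[1:] - mu[:-1]))    # fusion over neighbours
cp.Problem(cp.Minimize(obj)).solve()
print("flagged variables:", np.where(np.abs(mu.value) > 0.5)[0])
```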

18.
A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three different roles for each possible predictor are considered: a variable can be a relevant classification predictor or not, and the irrelevant classification variables can be linearly dependent on a part of the relevant predictors or independent of them. This variable selection model was inspired by previous work on variable selection in model-based clustering. A BIC-like model selection criterion is proposed. It is optimized through two embedded forward stepwise variable selection algorithms for classification and linear regression. The model identifiability and the consistency of the variable selection criterion are proved. Numerical experiments on simulated and real datasets illustrate the interest of this variable selection methodology. In particular, it is shown that this well-grounded variable selection model can be of great interest for improving the classification performance of quadratic discriminant analysis in a high-dimensional context.

19.
We study the properties of the Lasso in the high-dimensional partially linear model where the number of variables in the linear part can be greater than the sample size. We use truncated series expansion based on polynomial splines to approximate the nonparametric component in this model. Under a sparsity assumption on the regression coefficients of the linear component and some regularity conditions, we derive the oracle inequalities for the prediction risk and the estimation error. We also provide sufficient conditions under which the Lasso estimator is selection consistent for the variables in the linear part of the model. In addition, we derive the rate of convergence of the estimator of the nonparametric function. We conduct simulation studies to evaluate the finite sample performance of variable selection and nonparametric function estimation.

20.
Classical robust statistical methods dealing with noisy data are often based on modifications of convex loss functions. In recent years, nonconvex loss-based robust methods have become increasingly popular. A nonconvex loss can provide robust estimation for data contaminated with outliers; the significant challenge is that a nonconvex loss can be numerically difficult to optimize. This article proposes a quadratic majorization algorithm for nonconvex loss (QManc). QManc decomposes a nonconvex loss into a sequence of simpler optimization problems. Subsequently, QManc is applied to a powerful machine learning algorithm: the quadratic majorization boosting algorithm (QMBA). We develop QMBA for robust classification (binary and multi-category) and regression. In high-dimensional cancer genetics data and simulations, QMBA is comparable with convex loss-based boosting algorithms for clean data, and outperforms them for data contaminated with outliers. QMBA is also superior to boosting that directly optimizes nonconvex loss functions. Supplementary material for this article is available online.
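A generic sketch of the quadratic-majorization idea on the nonconvex Welsch loss rho(r) = 1 - exp(-r^2/2) (my choice of loss; QManc is more general and the paper couples it with boosting): each iteration minimizes a quadratic upper bound of the loss, i.e., a weighted least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.c_[np.ones(n), rng.standard_normal((n, 2))]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)
y[:10] += 15.0                                 # gross outliers

beta = np.linalg.lstsq(X, y, rcond=None)[0]    # LS start (outlier-sensitive)
for _ in range(50):
    r = y - X @ beta
    w = np.exp(-r ** 2 / 2)                    # majorizer weights at current fit
    WX = X * w[:, None]
    beta = np.linalg.solve(WX.T @ X, WX.T @ y) # minimize the quadratic surrogate
print("robust estimate:", beta.round(2))       # close to [1, 2, -1]
```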
