Similar Literature
A total of 20 similar documents were retrieved.
1.
This paper introduces a model-based approach to multivariate adaptive regression splines (MARS), an important data mining tool that was originally organized in a largely model-free way. MARS is a modern statistical learning methodology, important in both classification and regression, with a growing number of applications in many areas of science, economy and technology. It is very useful for high-dimensional problems and shows great promise for fitting nonlinear multivariate functions. The MARS algorithm for estimating the model function consists of two parts: the forward and the backward stepwise algorithm. In this paper, we propose not to use the backward stepwise algorithm. Instead, we construct a penalized residual sum of squares for MARS as a Tikhonov regularization problem, also known as ridge regression. We treat this problem with continuous optimization techniques, which we consider an important complementary technology and model-based alternative to the backward stepwise algorithm. In particular, we apply the elegant framework of conic quadratic programming, a very well-structured area of convex optimization that resembles linear programming and hence permits the use of powerful interior point methods. Based on these theoretical and algorithmic studies, the paper also contains an application to diabetes data. We evaluate and compare the performance of the established MARS and our new CMARS in classifying diabetic persons, where CMARS turns out to be very competitive and promising.
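For concreteness, a minimal sketch of the penalized objective and its conic reformulation, in notation of our own (the basis matrix B, penalty matrix L, and bound M are illustrative assumptions, not taken verbatim from the paper):

```latex
% Penalized residual sum of squares for MARS, treated as Tikhonov
% regularization (notation assumed): B collects the basis functions
% from the forward step, L discretizes the roughness penalty.
\min_{\theta}\ \mathrm{PRSS}(\theta)
  = \lVert y - B\theta \rVert_2^2 + \lambda\,\lVert L\theta \rVert_2^2

% An equivalent conic quadratic program, amenable to interior point
% methods (M is a hypothetical bound on the penalty term):
\min_{t,\,\theta}\ t
\quad\text{s.t.}\quad
\lVert y - B\theta \rVert_2 \le t, \qquad
\lVert L\theta \rVert_2 \le \sqrt{M}
```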

2.
Multivariate adaptive regression splines (MARS) has become a popular data mining (DM) tool due to its flexible model-building strategy for high-dimensional data. Compared to other well-known methods, it performs better in many areas such as finance, informatics, technology and science. Many studies have been conducted on improving its performance. To this end, an alternative to the backward stepwise algorithm was proposed in the Conic-MARS (CMARS) method, which treats a penalized residual sum of squares for MARS as a Tikhonov regularization problem. Additionally, by modifying the forward step of MARS via a mapping approach, a time-efficient procedure was introduced in S-FMARS. Drawing on the advantages of MARS, CMARS and S-FMARS, two hybrid methods are proposed in this study, aiming to produce time-efficient DM tools without degrading performance, especially for large datasets. The resulting methods, called SMARS and SCMARS, are tested on several performance criteria such as accuracy, complexity, stability and robustness using simulated and real-life datasets. As a DM application, the hybrid methods are also applied to an important problem in finance: predicting the interest rates offered by a Turkish bank to its customers. The results show that the proposed hybrid methods, being the most time-efficient while remaining competitive in performance, are powerful choices, particularly for large datasets.

3.
In this article we propose a modification to the output from Metropolis-within-Gibbs samplers that can lead to substantial reductions in the variance over standard estimates. The idea is simple: at each time step of the algorithm, introduce an extra sample into the estimate that is negatively correlated with the current sample, the rationale being that this provides a two-sample numerical approximation to a Rao–Blackwellized estimate. As the conditional sampling distribution at each step has already been constructed, the generation of the antithetic sample often requires negligible computational effort. Our method is implementable whenever one subvector of the state can be sampled from its full conditional and the corresponding distribution function may be inverted, or the full conditional has a symmetric density. We demonstrate our approach in the context of logistic regression and hierarchical Poisson models. The data and computer code used in this article are available online.
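As a hedged illustration of the inverse-distribution-function case described above (the exponential conditional and all names here are hypothetical, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def antithetic_pair(inv_cdf):
    """Draw a sample and its antithetic counterpart from a full
    conditional whose distribution function can be inverted.
    `inv_cdf` maps u in (0, 1) to a quantile of the conditional."""
    u = rng.uniform()
    x = inv_cdf(u)           # ordinary draw
    x_anti = inv_cdf(1 - u)  # negatively correlated antithetic draw
    return x, x_anti

# Hypothetical example: exponential full conditional with rate lam.
lam = 2.0
x, x_anti = antithetic_pair(lambda u: -np.log1p(-u) / lam)
# Averaging g(x) and g(x_anti) gives the two-sample numerical
# approximation to the Rao-Blackwellized estimate described above.
```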

4.
In high-dimensional data modeling, multivariate adaptive regression splines (MARS) is a popular nonparametric regression technique used to describe the nonlinear relationship between a response variable and the predictors with the help of splines. MARS uses piecewise linear functions for local fits and applies an adaptive procedure to select the number and location of breaking points (called knots). The function estimate is generated via a two-step procedure: forward selection and backward elimination. In the first step, a large number of local fits is obtained by selecting a large number of knots via a lack-of-fit criterion; in the second, the least-contributing local fits or knots are removed. In the conventional adaptive spline procedure, knots are selected from the set of all distinct data points, which makes the forward selection procedure computationally expensive and leads to high local variance. To avoid this drawback, the knot points can be restricted to a subset of the data points. In this context, a new method is proposed for knot selection based on a mapping approach such as self-organizing maps. With this method, fewer but more representative data points become eligible to serve as knots for function estimation in the forward step of MARS. The proposed method is applied to many simulated and real datasets, and the results show that it yields a time-efficient forward step for knot selection and model estimation without degrading model accuracy or prediction performance.

5.
Additive isotonic regression attempts to determine the relationship between a multidimensional observation variable and a response, under the constraint that the estimate is the additive sum of univariate component effects that are monotonically increasing. In this article, we present a new method for such regression called LASSO Isotone (LISO). LISO adapts ideas from sparse linear modeling to additive isotonic regression. Thus, it is viable in many situations with high-dimensional predictor variables, where selection of significant versus insignificant variables is required. We suggest an algorithm involving a modification of the backfitting algorithm CPAV. We give a numerical convergence result, and finally examine some of its properties through simulations. We also suggest some possible extensions that improve performance, and allow calculation to be carried out when the direction of the monotonicity is unknown. Supplemental materials are available online for this article.

6.
Support vector machines (SVMs), a kind of statistical learning method, were successfully applied in this research to predict occupational accidents. First, semi-parametric principal component analysis (SPPCA) was used to perform a dimensionality reduction, but no satisfactory results were obtained. Next, a dimensionality reduction was carried out using the multivariate adaptive regression splines (MARS) model, with good results. The variables selected as important by the MARS model were then taken as input variables for an SVM model. The SVM was able to classify, according to their working conditions, those workers that had suffered a work-related accident in the last 12 months and those that had not. The SVM technique does not overfit the experimental data and performs better than back-propagation neural network models. Finally, the results and conclusions of the study are presented.

7.
8.
Locally weighted regression is a technique that predicts the response for new data items from their neighbors in the training data set, where closer data items are assigned higher weights in the prediction. However, the original method may suffer from overfitting and fail to select the relevant variables. In this paper we propose combining a regularization approach with locally weighted regression to achieve sparse models. Specifically, the lasso is a shrinkage and selection method for linear regression. We present an algorithm that embeds the lasso in an iterative procedure that alternately computes weights and performs lasso regression. The algorithm is tested on three synthetic scenarios and two real data sets. Results show that the proposed method outperforms linear and local models in several kinds of scenarios.
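A minimal sketch of a single weight-then-lasso step, assuming a Gaussian kernel for the local weights (the article's algorithm alternates these steps iteratively; the function names and kernel choice here are ours):

```python
import numpy as np
from sklearn.linear_model import Lasso

def local_lasso_predict(X, y, x0, bandwidth=1.0, alpha=0.1):
    """One weight-then-lasso step (a sketch, not the authors' exact
    algorithm): weight training points by a Gaussian kernel around
    the query x0, then fit a weighted lasso for a sparse local fit."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))  # closer points weigh more
    model = Lasso(alpha=alpha)
    model.fit(X, y, sample_weight=w)
    return model.predict(x0.reshape(1, -1))[0]

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)
print(local_lasso_predict(X, y, X[0]))
```

Passing the kernel weights through `sample_weight` keeps the fit local without modifying the lasso solver itself.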

9.
We propose a penalized likelihood method that simultaneously fits the multinomial logistic regression model and combines subsets of the response categories. The penalty is nondifferentiable when pairs of columns in the optimization variable are equal. This encourages pairwise equality of these columns in the estimator, which corresponds to response category combination. We use an alternating direction method of multipliers algorithm to compute the estimator and we discuss the algorithm’s convergence. Prediction and model selection are also addressed. Supplemental materials for this article are available online.

10.
High-dimensional data with hundreds of thousands of observations are becoming commonplace in many disciplines. The analysis of such data poses many computational challenges, especially when the observations are correlated over time and/or across space. In this article, we propose flexible hierarchical regression models for analyzing such data that accommodate serial and/or spatial correlation. We address the computational challenges involved in fitting these models by adopting an approximate inference framework. We develop an online variational Bayes algorithm that works by incrementally reading the data into memory one portion at a time. The performance of the method is assessed through simulation studies. The methodology is applied to analyze signal intensity in MRI images of subjects with knee osteoarthritis, using data from the Osteoarthritis Initiative. Supplementary materials for this article are available online.

11.
We propose an algorithm, semismooth Newton coordinate descent (SNCD), for the elastic-net penalized Huber loss regression and quantile regression in high-dimensional settings. Unlike existing coordinate descent type algorithms, the SNCD updates a regression coefficient and its corresponding subgradient simultaneously in each iteration. It combines the strengths of the coordinate descent and the semismooth Newton algorithm, and effectively solves the computational challenges posed by dimensionality and nonsmoothness. We establish the convergence properties of the algorithm. In addition, we present an adaptive version of the “strong rule” for screening predictors to gain extra efficiency. Through numerical experiments, we demonstrate that the proposed algorithm is very efficient and scalable to ultrahigh dimensions. We illustrate the application via a real data example. Supplementary materials for this article are available online.

12.
Maximum likelihood estimation in finite mixture distributions is typically approached as an incomplete data problem to allow application of the expectation-maximization (EM) algorithm. In its general formulation, the EM algorithm involves the notion of a complete data space, in which the observed measurements and incomplete data are embedded. An advantage is that many difficult estimation problems are facilitated when viewed in this way. One drawback is that the simultaneous update used by standard EM requires overly informative complete data spaces, which leads to slow convergence in some situations. In the incomplete data context, it has been shown that the use of less informative complete data spaces, or equivalently smaller missing data spaces, can lead to faster convergence without sacrificing simplicity. However, in the mixture case, little progress has been made in speeding up EM. In this article we propose a component-wise EM for mixtures. It uses, at each iteration, the smallest admissible missing data space by intrinsically decoupling the parameter updates. Monotonicity is maintained, although the estimated proportions may not sum to one during the course of the iteration. However, we prove that the mixing proportions will satisfy this constraint upon convergence. Our proof of convergence relies on the interpretation of our procedure as a proximal point algorithm. For performance comparison, we consider standard EM as well as two other algorithms based on missing data space reduction, namely the SAGE and AECME algorithms. We provide adaptations of these general procedures to the mixture case. We also consider the ECME algorithm, which is not a data augmentation scheme but still aims at accelerating EM. Our numerical experiments illustrate the advantages of the component-wise EM algorithm relative to these other methods.
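To make the decoupling concrete, a minimal sketch of one component-wise update for a univariate Gaussian mixture (an illustration of the idea, not the authors' exact scheme; the function and variable names are ours):

```python
import numpy as np
from scipy import stats

def componentwise_em_step(x, w, mu, sigma, k):
    """Refresh only the parameters of component k, using
    responsibilities computed from the current full parameter set
    (a sketch of component-wise EM, not the authors' exact scheme)."""
    dens = np.array([w[j] * stats.norm.pdf(x, mu[j], sigma[j])
                     for j in range(len(w))])
    r = dens[k] / dens.sum(axis=0)   # responsibilities of component k
    w, mu, sigma = w.copy(), mu.copy(), sigma.copy()
    w[k] = r.mean()                  # proportions may not sum to one
    mu[k] = np.sum(r * x) / r.sum()  # until convergence, as noted above
    sigma[k] = np.sqrt(np.sum(r * (x - mu[k]) ** 2) / r.sum())
    return w, mu, sigma

# Hypothetical usage: cycle k over components, one update per iteration.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 50)])
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for t in range(100):
    w, mu, sigma = componentwise_em_step(x, w, mu, sigma, t % 2)
```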

13.
We propose Near-optimal Nonlinear Regression Trees with hyperplane splits (NNRTs) that use a polynomial prediction function in the leaf nodes, which we solve by stochastic gradient methods. On synthetic data, we show experimentally that the algorithm converges to the global optimum. We compare NNRTs, ORT-LH, Multivariate Adaptive Regression Splines (MARS), Random Forests (RF) and XGBoost on 40 real-world datasets and show that overall NNRTs have a performance edge over all other methods.

14.
We present a massively parallel algorithm for the fused lasso, powered by multiple graphics processing units (GPUs). Our method is suitable for a class of large-scale sparse regression problems in which a two-dimensional lattice structure is imposed on the coefficients. This structure is important in many statistical applications, including image-based regression, in which a set of images is used to locate image regions predictive of a response variable such as human behavior. Such large datasets are increasingly common. We employ the split Bregman method and the fast Fourier transform, which together expose a high degree of data-level parallelism distinctive of the two-dimensional setting. Our multi-GPU parallelization achieves remarkably improved speed: we obtained speedups of up to 433 times over the reference CPU implementation. We demonstrate the speed and scalability of the algorithm using several datasets, including 8100 samples of 512 × 512 images. Compared to its single-GPU counterpart, our method also showed improved computing speed and high scalability. We describe the various elements of our study, as well as our experience with the subtleties of selecting an existing algorithm for parallelization; it is critical that memory bandwidth be carefully considered for multi-GPU algorithms. Supplementary material for this article is available online.
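For reference, a common statement of the two-dimensional fused lasso objective on a lattice of coefficients (a standard formulation; the exact variant solved in the article may differ, e.g. by including a design matrix in the fidelity term):

```latex
% Two-dimensional fused lasso (signal-approximator form, assumed):
\min_{\beta}\ \tfrac{1}{2}\lVert y - \beta \rVert_2^2
 + \lambda \sum_{i,j} \left( \lvert \beta_{i+1,j} - \beta_{i,j} \rvert
 + \lvert \beta_{i,j+1} - \beta_{i,j} \rvert \right)
```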

15.
In this article we give a general definition of residuals for regression models with independent responses. Our definition produces residuals that are exactly normal, apart from sampling variability in the estimated parameters, by inverting the fitted distribution function for each response value and finding the equivalent standard normal quantile. Our definition includes some randomization to achieve continuous residuals when the response variable is discrete. Quantile residuals are easily computed in computer packages such as SAS, S-Plus, GLIM, or LispStat, and allow residual analyses to be carried out in many commonly occurring situations in which the customary definitions of residuals fail. Quantile residuals are applied in this article to three example data sets.
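A minimal sketch of randomized quantile residuals for a discrete (Poisson) response, assuming fitted means are available (the function name and Poisson example are ours, not from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def quantile_residuals_poisson(y, mu):
    """Randomized quantile residuals for a fitted Poisson model
    (a sketch of the construction described above). For a discrete
    response, u is drawn uniformly between F(y-1) and F(y) to make
    the residuals continuous."""
    F_lo = stats.poisson.cdf(y - 1, mu)  # F(y-1); zero when y == 0
    F_hi = stats.poisson.cdf(y, mu)      # F(y)
    u = rng.uniform(F_lo, F_hi)
    return stats.norm.ppf(u)             # exactly N(0,1) if model is true

# Hypothetical usage with fitted means mu_hat from some Poisson fit:
y = rng.poisson(3.0, size=100)
r = quantile_residuals_poisson(y, np.full(100, 3.0))
```

If the model is correct, a normal QQ-plot of `r` should lie close to the identity line.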

16.
An open challenge in nonparametric regression is finding fast, computationally efficient approaches to estimating local bandwidths for large datasets, in particular in two or more dimensions. In the work presented here, we introduce a novel local bandwidth estimation procedure for local polynomial regression, which combines the greedy search of the regularization of the derivative expectation operator (RODEO) algorithm with linear binning. The result is a fast, computationally efficient algorithm, which we refer to as the fast RODEO. We motivate the development of our algorithm by using a novel scale-space approach to derive the RODEO. We conclude with a toy example and a real-world example using data from the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite validation study, where we show the fast RODEO’s improvement in accuracy and computational speed over two other standard approaches.

17.
For high-dimensional supervised learning problems, exploiting problem-specific assumptions can often lead to greater accuracy. For problems with grouped covariates that are believed to have sparse effects both at the group level and within groups, we introduce a regularized model for linear regression with ℓ1 and ℓ2 penalties. We discuss the sparsity and other regularization properties of the optimal fit for this model, and show that it has the desired effect of group-wise and within-group sparsity. We propose an algorithm to fit the model via accelerated generalized gradient descent, and extend this model and algorithm to convex loss functions. We also demonstrate the efficacy of our model and the efficiency of our algorithm on simulated data. This article has online supplementary material.
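A standard statement of such a group-wise plus within-group penalty is the sparse group lasso objective (a common formulation consistent with the description above; the mixing parameter alpha and the group-size weights sqrt(p_g) are conventional assumptions):

```latex
% Sparse group lasso objective (notation assumed): beta_g is the
% coefficient subvector of group g, p_g its size, alpha in [0, 1].
\min_{\beta}\ \tfrac{1}{2}\lVert y - X\beta \rVert_2^2
 + (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2
 + \alpha\,\lambda\,\lVert \beta \rVert_1
```

The ℓ2 term zeroes out whole groups, while the ℓ1 term selects variables within the groups that survive.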

18.
Multivariate adaptive regression spline (MARS) is a statistical modeling method used to represent a complex system. More recently, a version of MARS was modified to be piecewise linear. This paper presents a mixed integer linear program, called MARSOPT, that optimizes a non-convex piecewise linear MARS model subject to constraints that include both linear regression models and piecewise linear MARS models. MARSOPT is customized for an automotive crash safety system design problem for a major US automaker and solved using branch and bound. The solutions from MARSOPT are compared with those from customized genetic algorithms.

19.
Domain experts can often quite reliably specify the sign of influences between variables in a Bayesian network. If we exploit this prior knowledge in estimating the probabilities of the network, it is more likely to be accepted by its users and may in fact be better calibrated with reality. We present two algorithms that exploit prior knowledge of qualitative influences in learning the parameters of a Bayesian network from incomplete data. The isotonic regression EM, or irEM, algorithm adds an isotonic regression step to standard EM in each iteration, to obtain parameter estimates that satisfy the given qualitative influences. In an attempt to reduce the computational burden involved, we further define the qirEM algorithm that enforces the constraints imposed by the qualitative influences only once, after convergence of standard EM. We evaluate the performance of both algorithms through experiments. Our results demonstrate that exploitation of the qualitative influences improves the parameter estimates over standard EM, and more so if the proportion of missing data is relatively large. The results also show that the qirEM algorithm performs just as well as its computationally more expensive counterpart irEM.

20.
Estimation of Taylor’s power law for species abundance data may be performed by linear regression of the log empirical variances on the log means, but this method suffers from a problem of bias for sparse data. We show that the bias may be reduced by using a bias-corrected Pearson estimating function. Furthermore, we investigate a more general regression model allowing for site-specific covariates. This method may be efficiently implemented using a Newton scoring algorithm, with standard errors calculated from the inverse Godambe information matrix. The method is applied to a set of biomass data for benthic macrofauna from two Danish estuaries.
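For orientation, a sketch of the naive log-log regression the abstract starts from (the bias-corrected Pearson estimating function itself is not reproduced here; the data layout is a hypothetical assumption):

```python
import numpy as np

def taylor_power_law_ols(counts):
    """Naive OLS fit of Taylor's power law, log s2 = log a + b log m,
    with rows as replicates and columns as species/sites (assumed).
    This is the simple regression the abstract warns is biased for
    sparse data."""
    m = counts.mean(axis=0)
    s2 = counts.var(axis=0, ddof=1)
    keep = (m > 0) & (s2 > 0)  # logs require positive values
    b, log_a = np.polyfit(np.log(m[keep]), np.log(s2[keep]), 1)
    return np.exp(log_a), b

# Hypothetical usage on simulated abundance data:
rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.3, size=(20, 50))
a, b = taylor_power_law_ols(counts)
```

Note how zero means and variances must be dropped before taking logs; that discarding is one symptom of the sparsity problem the paper addresses.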
