Similar Documents
20 similar documents retrieved (search time: 15 ms)
1.
A measure of dissimilarity of real observations is introduced that is used to identify clusters and is based on "gaps" and on averages of selected subgroups of the observations. This measure is surprisingly associated with the sample variance in a way that leads to a new identity and interpretation of the notion of variance.
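The abstract leaves the identity unstated; a classical identity of exactly this flavor, relating the sample variance to averaged pairwise differences, is shown below for concreteness (whether it is the paper's identity cannot be confirmed from the abstract):

$$
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
\;=\; \frac{1}{2n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i-x_j\right)^2 .
$$

In words: up to a constant, the variance is the average squared gap over all pairs of observations.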



2.
A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three different roles for each possible predictor are considered: a variable can be a relevant classification predictor or not, and an irrelevant variable can be either linearly dependent on a subset of the relevant predictors or independent of them. This variable selection model was inspired by previous work on variable selection in model-based clustering. A BIC-like model selection criterion is proposed. It is optimized through two embedded forward stepwise variable selection algorithms for classification and linear regression. The model identifiability and the consistency of the variable selection criterion are proved. Numerical experiments on simulated and real data sets illustrate the usefulness of this variable selection methodology. In particular, it is shown that this well-grounded variable selection model can be of great interest for improving the classification performance of quadratic discriminant analysis in a high-dimensional context.

3.
The aim of this paper is to enlarge the usual domain of cluster analysis. A procedure for clustering time-varying data is presented that takes into account the time dimension and its intrinsic properties.

This procedure consists of two steps. In the first step a dissimilarity between variables is defined and the dissimilarity matrix is calculated for each unit separately. In the second step the dissimilarity between units is calculated in terms of the dissimilarity matrices defined in the first step. The dissimilarity matrix obtained is the basis for a suitable clustering method.

The procedure is illustrated with an empirical example.
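The abstract does not specify either dissimilarity; the sketch below fills both steps with plausible stand-ins (1 − |correlation| between variables within a unit, and the Frobenius distance between the resulting matrices across units), so everything beyond the two-step structure is an assumption.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def variable_dissimilarity(unit):
    # unit: array of shape (T, p) -- p variables observed over T time points.
    # One plausible choice (an assumption, not the paper's definition):
    # dissimilarity = 1 - |correlation| between the variables' time series.
    corr = np.corrcoef(unit.T)
    return 1.0 - np.abs(corr)

def unit_dissimilarity(units):
    # Step 2: compare units through their variable-dissimilarity matrices,
    # here via the Frobenius norm of the matrix difference (again an assumption).
    mats = [variable_dissimilarity(u) for u in units]
    n = len(mats)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.linalg.norm(mats[i] - mats[j], "fro")
    return D

rng = np.random.default_rng(0)
units = [rng.normal(size=(50, 4)) for _ in range(10)]  # 10 units, 4 variables, 50 time points
D = unit_dissimilarity(units)
Z = linkage(squareform(D), method="average")  # feed into any dissimilarity-based clusterer
```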

4.
The multinomial logit model is the most widely used model for unordered multi-category responses. However, applications are typically restricted to few predictors because, in the high-dimensional case, maximum likelihood estimates frequently do not exist. In this paper we develop a boosting technique called multinomBoost that performs variable selection and fits the multinomial logit model even when predictors are high-dimensional. Since in multi-category models the effect of one predictor variable is represented by several parameters, one has to distinguish between variable selection and parameter selection. A special feature of the approach is that, in contrast to existing approaches, it selects variables rather than parameters. The method can also distinguish between mandatory and optional predictors, and it adapts to metric, binary, nominal, and ordinal predictors. Regularization within the algorithm allows the inclusion of nominal and ordinal variables with many categories; for ordinal predictors, the order information is used. The performance of the boosting technique with respect to mean squared error, prediction error, and the identification of relevant variables is investigated in a simulation study. The method is applied to the National Indonesia Contraceptive Prevalence Survey and to glass identification data. Results are also compared with the lasso approach, which selects parameters.
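multinomBoost itself is not reproduced here; the following is a minimal sketch of the core idea the abstract describes: componentwise boosting that selects whole variables, updating the parameters of all response categories at once, rather than individual parameters. The function name and all tuning constants are illustrative.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multinom_boost(X, y, n_classes, n_iter=200, nu=0.1):
    """Componentwise boosting for the multinomial logit: at each step the
    single predictor that most improves the fit is chosen, and its
    coefficients for *all* classes are updated together (variable selection,
    not parameter selection)."""
    n, p = X.shape
    Y = np.eye(n_classes)[y]              # one-hot response matrix
    B = np.zeros((p, n_classes))          # coefficients, one row per variable
    intercept = np.zeros(n_classes)
    for _ in range(n_iter):
        P = softmax(intercept + X @ B)
        R = Y - P                         # negative gradient of the log-loss
        best_j, best_gain, best_b = None, -np.inf, None
        for j in range(p):                # score each variable as a whole
            xj = X[:, j]
            b = (xj @ R) / (xj @ xj + 1e-12)
            gain = np.sum((xj[:, None] * b) * R)
            if gain > best_gain:
                best_j, best_gain, best_b = j, gain, b
        B[best_j] += nu * best_b          # weak update of the selected variable
        intercept += nu * R.mean(axis=0)
    return intercept, B
```

Variables never chosen keep all-zero rows in `B`, which is what makes the procedure a variable (rather than parameter) selector; mandatory predictors could be handled by refitting them unpenalized at every step.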

5.
With advanced capability in data collection, applications of linear regression analysis now often involve a large number of predictors. Variable selection has thus become an increasingly important issue in building a linear regression model. For a given selection criterion, variable selection is essentially an optimization problem that seeks the optimal solution over the 2^m possible linear regression models, where m is the total number of candidate predictors. When m is large, exhaustive search becomes practically impossible. Simple suboptimal procedures such as forward addition, backward elimination, and the backward-forward stepwise procedure are fast but can easily be trapped in a local solution. In this article we propose a relatively simple algorithm for selecting explanatory variables in a linear regression for a given variable selection criterion. Although the algorithm is still suboptimal, it has been shown to perform well in an extensive empirical study. The main idea of the procedure is to partition the candidate predictors into a small number of groups. Working with various combinations of the groups and iterating the search through random regrouping, the search space is substantially reduced, which increases the probability of finding the global optimum. By identifying and collecting “important” variables throughout the iterations, the algorithm finds increasingly better models until convergence. The proposed algorithm performs well in simulation studies with 60 to 300 predictors. As a by-product of the proposed procedure, we are able to study the behavior of variable selection criteria when the number of predictors is large; such a study has not been possible with traditional search algorithms.

This article has supplementary material online.
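As a rough illustration of the group-and-regroup idea (not the authors' exact algorithm), the sketch below partitions predictors into random groups, scores unions of groups with BIC, and carries "important" variables forward across regroupings; the pruning refinements of the published procedure are omitted.

```python
import itertools
import numpy as np

def bic(X, y, subset):
    # BIC of the OLS fit on the given subset of predictors (plus intercept)
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset]) if subset \
         else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + (len(subset) + 1) * np.log(n)

def grouped_search(X, y, n_groups=5, n_rounds=20, seed=0):
    """Partition predictors into random groups, search over unions of groups,
    keep the 'important' variables found so far, and regroup; a simplified
    rendering of the idea (the published algorithm also prunes)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    best_subset, best_score = (), bic(X, y, ())
    for _ in range(n_rounds):
        groups = np.array_split(rng.permutation(p), n_groups)
        for r in range(1, n_groups + 1):
            for combo in itertools.combinations(groups, r):
                cand = tuple(sorted(set(np.concatenate(combo)) | set(best_subset)))
                score = bic(X, y, cand)
                if score < best_score:
                    best_subset, best_score = cand, score
    return best_subset, best_score
```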

6.
In this paper we present a new method for clustering categorical data sets, named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters: it compares each object with each mode, for every attribute that they have in common, and then chooses the most appropriate mode and its corresponding cluster for that object. The robustness of the proposed method is verified on seven widely used data sets using six clustering evaluation measures.

7.
This article proposes a new quantity for assessing the number of groups or clusters in a dataset. The key idea is to view clustering as a supervised classification problem, in which we must also estimate the “true” class labels. The resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well. In the process, we develop novel notions of bias and variance for unlabeled data. Prediction strength performs well in simulation studies, and we apply it to clusters of breast cancer samples from a DNA microarray study. Finally, some consistency properties of the method are established.
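A hedged sketch of how prediction strength can be computed with k-means: cluster a training half, classify the test half by the training centroids, and record the worst per-cluster proportion of test pairs that remain co-assigned. The use of KMeans and the 0.8 threshold are conventional choices, not details taken from the abstract.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans

def prediction_strength(X, k, n_splits=10, seed=0):
    """Cluster a training half; classify the test half by the training
    centroids; return the (averaged) worst per-cluster proportion of test
    pairs that the training model keeps together."""
    rng = np.random.default_rng(seed)
    strengths = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        tr, te = idx[: len(X) // 2], idx[len(X) // 2:]
        km_tr = KMeans(n_clusters=k, n_init=10).fit(X[tr])
        km_te = KMeans(n_clusters=k, n_init=10).fit(X[te])
        pred = km_tr.predict(X[te])       # test points, training-model labels
        worst = 1.0
        for c in range(k):
            members = np.where(km_te.labels_ == c)[0]
            if len(members) < 2:
                continue
            pairs = list(combinations(members, 2))
            same = sum(pred[a] == pred[b] for a, b in pairs)
            worst = min(worst, same / len(pairs))
        strengths.append(worst)
    return float(np.mean(strengths))

# usage: pick the largest k for which prediction_strength(X, k) stays high (~0.8+)
```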

8.
Treed Regression     

Given a data set consisting of n observations on p independent variables and a single dependent variable, treed regression creates a binary tree with a simple linear regression function at each of the leaves. Each node of the tree consists of an inequality condition on one of the independent variables. The tree is generated from the training data by a recursive partitioning algorithm. Treed regression models are more parsimonious than CART models because there are fewer splits. Additionally, monotonicity in some or all of the variables can be imposed.
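A minimal sketch of the treed-regression idea under stated simplifications: leaves carry a simple (one-predictor) linear regression, candidate splits are inequality conditions at a few quantiles, and a split is accepted only if it beats the single-leaf fit. The original algorithm's split search and stopping rules are more elaborate.

```python
import numpy as np

def best_simple_fit(X, y):
    # simple (one-predictor) linear regression at a leaf: pick the best variable
    best = None
    for j in range(X.shape[1]):
        A = np.column_stack([np.ones(len(y)), X[:, j]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = np.sum((y - A @ beta) ** 2)
        if best is None or sse < best[0]:
            best = (sse, j, beta)
    return best                                  # (sse, variable, [intercept, slope])

def grow(X, y, min_leaf=20):
    """Recursive partitioning with inequality splits; a split is kept only
    if the two leaf regressions beat the single-leaf regression."""
    leaf_sse = best_simple_fit(X, y)[0]
    if len(y) >= 2 * min_leaf:
        best_split = None
        for j in range(X.shape[1]):
            for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
                L = X[:, j] <= t
                if L.sum() < min_leaf or (~L).sum() < min_leaf:
                    continue
                sse = best_simple_fit(X[L], y[L])[0] + best_simple_fit(X[~L], y[~L])[0]
                if best_split is None or sse < best_split[0]:
                    best_split = (sse, j, t)
        if best_split is not None and best_split[0] < leaf_sse:
            _, j, t = best_split
            L = X[:, j] <= t
            return {"split": (j, t), "left": grow(X[L], y[L], min_leaf),
                    "right": grow(X[~L], y[~L], min_leaf)}
    return {"leaf": best_simple_fit(X, y)[1:]}   # (variable, coefficients)
```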

9.
It has been known for many years that an optimal discrete nonlinear filter may be synthesized for systems whose plant dynamics, sensor characteristics and signal statistics are known by applying Bayes' Rule to sequentially update the conditional probability density function from the latest data. However, it was not until 1969 that a digital computer algorithm implementing the theory for a one-state variable one-step predictor appeared in the literature. This delay and the continuing scarcity of multidimensional nonlinear filters result from the overwhelming computational task which leads to unrealistic data processing times. For many nonlinear filtering problems analog and digital computers (a hybrid computation) combine to yield a higher data rate than can be obtained by conventional digital methods. This paper describes an implementation of the theory by means of a hybrid computer algorithm for the optimal nonlinear one-step predictor.

The hybrid computer algorithm presented reduces the overall solution time per prediction because:

1) Many large computations of identical form are executed on the analog computer in parallel.

2) The discrete running variable in the digital algorithm may be replaced with a continuous analog computer variable in one or more dimensions leading to increased computational speed and finer resolution of the exponential transformation.

3) The modern analog computer is well suited to generate functions such as the exponential at high speed with modest equipment.

4) The arithmetic, storage, and control functions performed rapidly by the digital computer are utilized without introducing extensive auxiliary calculations.

To illustrate pertinent aspects of the algorithm developed, the scalar cubed sensor problem previously described by Bucy is treated extensively. The hybrid algorithm is described. Problems associated with partitioning of equations between analog and digital computers, machine representations of variables, setting of initial conditions and floating of grid base are discussed. The effects of analog component bandwidths, digital-to-analog and analog-to-digital conversion times, analog computer mode switching times and digital computer I/O data rates on overall processing time are examined. The effect of limited analog computer dynamic range on accuracy is discussed. Results from a simulation of this optimal predictor using MOBSSL, a continuous system simulation language, are given. Timing estimates are presented and compared against similar estimates for the all digital algorithm.

For example, given a four-state variable optimal 1-step predictor utilizing 7 discrete points in each dimension, the hybrid algorithm can be used to generate predictions accurate to 2 decimal places once every 10 seconds. An analog computer complement of 250 integrators and multipliers and a high-speed 3rd generation digital computer such as the CDC 6600 or IBM 360/85 are required. This compares with a lower bound of about 3 seconds per all digital prediction which would require 49 CDC 6600's operating in parallel. Analytical and simulation work quantifying errors in one state variable filters is presented. Finally, the use of an interactive graphic system for real time display and for filter evaluation is described.
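For the all-digital version of the computation the paper accelerates, a grid-based Bayes one-step predictor looks roughly as follows. Only the cubed sensor z = x³ + v follows Bucy's example; the random-walk dynamics and the noise levels are illustrative assumptions.

```python
import numpy as np

# Grid-based Bayes one-step predictor (the computation the paper maps onto
# hybrid hardware). Only the cubed sensor z = x**3 + v follows Bucy's example;
# the random-walk dynamics and noise levels are illustrative assumptions.
x = np.linspace(-3, 3, 301)                  # discretized state axis
dx = x[1] - x[0]
p = np.exp(-0.5 * x ** 2)
p /= p.sum() * dx                            # prior density on the grid

def gauss(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (np.sqrt(2 * np.pi) * s)

def predict_update(p, z, q=0.3, r=0.5):
    # time update: discrete Chapman-Kolmogorov with x_{k+1} = x_k + w, w ~ N(0, q^2)
    kern = gauss(x[:, None] - x[None, :], q) * dx
    p_pred = kern @ p                        # one-step-ahead (predicted) density
    # measurement update: Bayes' rule with likelihood of z = x^3 + v, v ~ N(0, r^2)
    post = gauss(z - x ** 3, r) * p_pred
    return p_pred, post / (post.sum() * dx)

p_pred, p = predict_update(p, z=0.7)
print("one-step-ahead mean:", (x * p_pred).sum() * dx)
```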

10.
The elastic net (supervised enet henceforth) is a popular and computationally efficient approach for performing the simultaneous tasks of selecting variables, decorrelation, and shrinking the coefficient vector in the linear regression setting. Semisupervised regression, currently unrelated to the supervised enet, uses data with missing response values (unlabeled) along with labeled data to train the estimator. In this article, we propose the joint trained elastic net (jt-enet), which incorporates the benefits of semisupervised regression into the supervised enet. Both the variable selection and decorrelation components of the supervised enet inherently rely on the pairwise correlation structure in the feature data. In circumstances in which the number of variables is high, the feature data are relatively easy to obtain, and the response is expensive to generate, it seems reasonable to use any existing unlabeled observations to define these correlations more accurately. The supervised enet, however, cannot incorporate this information and focuses only on the labeled data. The jt-enet allows the unlabeled data to influence the variable selection, decorrelation, and shrinkage capabilities of the linear estimator. In addition, we investigate the impact of unlabeled data on the risk and bias of the proposed estimator. The jt-enet is demonstrated on two applications with encouraging results. Online supplementary material is available for this article.

11.

We demonstrate that, in a regression setting with a Hilbertian predictor, a response variable tends to be more highly correlated with the leading principal components of the predictor than with the trailing ones, even though the extraction procedure is unsupervised. Our results are established under the conditional independence model, which includes linear regression and single-index models as special cases, with some assumptions on the regression vector. These results generalise earlier work which showed that this phenomenon holds for predictors that are real random vectors. A simulation study is used to quantify the phenomenon.


12.
Sequential clustering aims at determining homogeneous and/or well-separated clusters within a given set of entities, one at a time, until no more such clusters can be found. We consider a bi-criterion sequential clustering problem in which the radius of a cluster (or maximum dissimilarity between an entity chosen as center and any other entity of the cluster) is chosen as a homogeneity criterion and the split of a cluster (or minimum dissimilarity between an entity in the cluster and one outside of it) is chosen as a separation criterion. An O(N^3) algorithm is proposed for determining radii and splits of all efficient clusters, which leads to an O(N^4) algorithm for bi-criterion sequential clustering with radius and split as criteria. This algorithm is illustrated on the well-known Ruspini data set.
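The two criteria themselves are direct to compute from a dissimilarity matrix, as the toy sketch below shows; the paper's contribution, the O(N^3) enumeration of radii and splits of all efficient clusters, is not reproduced.

```python
import numpy as np

def radius(D, cluster):
    """Radius: smallest, over candidate centers in the cluster, of the maximum
    dissimilarity from that center to any other entity of the cluster."""
    sub = D[np.ix_(cluster, cluster)]
    return sub.max(axis=1).min()

def split(D, cluster, n):
    """Split: minimum dissimilarity between an entity inside the cluster
    and one outside it."""
    outside = [i for i in range(n) if i not in set(cluster)]
    return D[np.ix_(cluster, outside)].min()

D = np.array([[0, 1, 5, 6],
              [1, 0, 5, 7],
              [5, 5, 0, 2],
              [6, 7, 2, 0]], float)
c = [0, 1]
print(radius(D, c), split(D, c, 4))  # 1.0 and 5.0 for this toy matrix
```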

13.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.
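A rough sketch of a dissimilarity plot: reorder objects within each cluster by a seriation heuristic and shade the reordered dissimilarity matrix. Hierarchical-clustering leaf order stands in for the dedicated seriation solvers the article relies on, and the between-cluster ordering is left trivial.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist, squareform

def dissimilarity_plot(X, labels):
    """Reorder objects cluster by cluster (leaf order of a hierarchical
    clustering stands in for a dedicated seriation solver) and shade the
    reordered dissimilarity matrix."""
    D = squareform(pdist(X))
    order = []
    for c in np.unique(labels):                  # between-cluster order left trivial
        idx = np.where(labels == c)[0]
        if len(idx) > 2:
            sub = squareform(D[np.ix_(idx, idx)], checks=False)
            idx = idx[leaves_list(linkage(sub, method="average"))]
        order.extend(idx)
    plt.imshow(D[np.ix_(order, order)], cmap="gray_r")
    plt.title("dissimilarity plot (darker = more dissimilar)")
    plt.show()
```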

14.
Fuzzy Sets and Systems, 2004, 141(2): 301–317
This paper presents fuzzy clustering algorithms for mixed features of symbolic and fuzzy data. El-Sonbaty and Ismail proposed fuzzy c-means (FCM) clustering for symbolic data, and Hathaway et al. proposed FCM for fuzzy data. In this paper we give a modified dissimilarity measure for symbolic and fuzzy data and then give FCM clustering algorithms for these mixed data types. Numerical examples and comparisons are given, illustrating that the modified dissimilarity yields better results. Finally, the proposed clustering algorithm is applied to real data with mixed feature variables of symbolic and fuzzy data.

15.
Random forest (RF) methodology is a nonparametric methodology for prediction problems. A standard way to use RFs is to generate a global RF to predict all test cases of interest. In this article, we propose growing different RFs specific to different test cases, namely case-specific random forests (CSRFs). In contrast to the bagging procedure used in building standard RFs, the CSRF algorithm takes weighted bootstrap resamples to create individual trees, assigning large weights a priori to the training cases in close proximity to the test case of interest. Tuning methods are discussed to avoid overfitting. Both simulation and real data examples show that the weighted bootstrap resampling used in CSRF construction can improve predictions for specific cases. We also propose a new case-specific variable importance (CSVI) measure as a way to compare the relative importance of predictor variables for predicting a particular case. The idea of building a predictor case-specifically may also be generalized to other areas.
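A sketch of the CSRF resampling idea under assumptions: a Gaussian kernel on the distance to the test case supplies the bootstrap weights (the paper's weighting scheme may differ), and feature subsetting is done per tree rather than per node for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def csrf_predict(X, y, x0, n_trees=200, bandwidth=1.0, seed=0):
    """Weighted-bootstrap forest for one test case: cases near x0 get large
    resampling weights. Kernel weighting and per-tree feature subsetting are
    simplifications, not the paper's exact scheme."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * bandwidth ** 2))
    w /= w.sum()                                   # bootstrap probabilities
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weighted bootstrap
        feats = rng.choice(p, size=max(1, p // 3), replace=False)
        tree = DecisionTreeRegressor(random_state=int(rng.integers(10**9)))
        tree.fit(X[idx][:, feats], y[idx])
        preds.append(tree.predict(x0[None, feats])[0])
    return float(np.mean(preds))
```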

16.
Abstract

A new algorithm, backward elimination via repeated data splitting (BERDS), is proposed for variable selection in regression. Initially, the data are partitioned into two sets {E, V}, and an exhaustive backward elimination (BE) is performed in E. For each p-value cutoff α used in BE, the corresponding fitted model from E is validated in V by computing the sum of squared deviations of observed from predicted values. This is repeated m times, and the α minimizing the sum of the m sums of squares is used as the cutoff in a final BE on the entire data set. BERDS is a modification of the BECV algorithm proposed by Thall, Simon, and Grier (1992). An extensive simulation study shows that, compared to BECV, BERDS has smaller model error and higher probabilities of excluding noise variables, of selecting each of several uncorrelated true predictors, and of selecting exactly one of two or three highly correlated true predictors. BERDS is also superior to standard BE with cutoffs .05 or .10, and this superiority increases with the number of noise variables in the data and the degree of correlation among true predictors. An application is provided for illustration.
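A compact sketch of BERDS as the abstract describes it, using statsmodels OLS p-values for the backward elimination; the cutoff grid, split proportion, and number of repetitions m are illustrative choices.

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(X, y, cols, alpha):
    # drop the least significant predictor until all p-values are <= alpha
    cols = list(cols)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols], has_constant="add")).fit()
        pvals = np.asarray(fit.pvalues)[1:]     # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        cols.pop(worst)
    return cols

def berds(X, y, alphas=(0.01, 0.05, 0.10, 0.15, 0.25), m=25, seed=0):
    """Repeatedly split into {E, V}; run backward elimination in E for each
    cutoff; score the fitted model on V; use the cutoff with the smallest
    total validation SSE for a final elimination on all the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sse = np.zeros(len(alphas))
    for _ in range(m):
        idx = rng.permutation(n)
        E, V = idx[: n // 2], idx[n // 2:]
        for a, alpha in enumerate(alphas):
            cols = backward_eliminate(X[E], y[E], range(p), alpha)
            fit = sm.OLS(y[E], sm.add_constant(X[E][:, cols], has_constant="add")).fit()
            pred = sm.add_constant(X[V][:, cols], has_constant="add") @ fit.params
            sse[a] += np.sum((y[V] - pred) ** 2)
    best = alphas[int(np.argmin(sse))]
    return backward_eliminate(X, y, range(p), best), best
```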

17.
Clustering multimodal datasets can be problematic when a conventional algorithm such as k-means is applied, due to its implicit assumption of a Gaussian distribution for the dataset. This paper proposes a tandem clustering process for multimodal data sets. The proposed method first divides the multimodal dataset into many small pre-clusters by applying the k-means or fuzzy k-means algorithm. These pre-clusters are then clustered again by an agglomerative hierarchical clustering method using the Kullback–Leibler divergence as an initial measure of dissimilarity. Benchmark results show that the proposed approach is not only effective at extracting the multimodal clusters but also efficient in computational time and relatively robust in the presence of outliers.
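A sketch of the tandem process under an assumption the abstract only implies: each k-means pre-cluster is summarized as a Gaussian, so the Kullback–Leibler divergence between pre-clusters has a closed form (symmetrized here for the linkage step). Pre-clusters are assumed large enough for a stable covariance estimate.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def kl_gauss(m0, S0, m1, S1):
    # KL divergence between multivariate Gaussians N(m0, S0) and N(m1, S1)
    d = len(m0)
    iS1 = np.linalg.inv(S1)
    dm = m1 - m0
    return 0.5 * (np.trace(iS1 @ S0) + dm @ iS1 @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def tandem_cluster(X, n_pre=30, n_final=3):
    """Tandem sketch: k-means pre-clusters, each summarized as a Gaussian,
    then agglomerative clustering on symmetrized KL divergences."""
    km = KMeans(n_clusters=n_pre, n_init=10).fit(X)
    stats = []
    for c in range(n_pre):
        pts = X[km.labels_ == c]             # assumed to contain several points
        stats.append((pts.mean(axis=0), np.cov(pts.T) + 1e-6 * np.eye(X.shape[1])))
    D = np.zeros((n_pre, n_pre))
    for i in range(n_pre):
        for j in range(i + 1, n_pre):
            D[i, j] = D[j, i] = 0.5 * (kl_gauss(*stats[i], *stats[j])
                                       + kl_gauss(*stats[j], *stats[i]))
    Z = linkage(squareform(D), method="average")
    pre_to_final = fcluster(Z, t=n_final, criterion="maxclust")
    return pre_to_final[km.labels_]          # final label for every observation
```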

18.

It is well known that variable selection in multiple regression can be unstable and that the model uncertainty can be considerable. This uncertainty can be quantified and explored by bootstrap resampling; see Sauerbrei et al. (Biom J 57:531–555, 2015). Here, approaches are introduced that use the results of bootstrap replications of the variable selection process to obtain more detailed information about the data. Analyses are based on dissimilarities between the results of the analyses of different bootstrap samples. Dissimilarities are computed between the vectors of predictions and between the sets of selected variables. The dissimilarities are used to map the models by multidimensional scaling, to cluster them, and to construct heatplots. Clusters can point to different interpretations of the data that could arise from different selections of variables supported by different bootstrap samples. A new measure of variable selection instability is also defined. The methodology can be applied to various regression models, estimators, and variable selection methods. It is illustrated by three real data examples, using linear regression and a Cox proportional hazards model, with model selection by AIC and BIC.
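A sketch of one of the described analyses: bootstrap the selection process (forward selection by BIC stands in for the selectors used in the paper), compute Jaccard dissimilarities between the selected variable sets, and map the replications with multidimensional scaling.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.manifold import MDS

def select_bic(X, y):
    # forward selection by BIC -- a stand-in for the paper's AIC/BIC selectors
    n, p = X.shape
    chosen, best = [], sm.OLS(y, np.ones((n, 1))).fit().bic
    improved = True
    while improved:
        improved = False
        for j in set(range(p)) - set(chosen):
            bic = sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit().bic
            if bic < best:
                best, add, improved = bic, j, True
        if improved:
            chosen.append(add)
    return frozenset(chosen)

def selection_instability(X, y, B=100, seed=0):
    """Bootstrap the selection, compute pairwise Jaccard dissimilarities
    between the selected variable sets, and embed them with MDS."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sets = []
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap sample
        sets.append(select_bic(X[idx], y[idx]))
    D = np.array([[1 - len(s & t) / max(1, len(s | t)) for t in sets] for s in sets])
    coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)
    return sets, D, coords
```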


19.
Multiblock component methods are applied to data sets in which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. Multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they provide suboptimal results when the observations actually come from different populations. A strategy to alleviate this problem, presented in this article, is to use a technique such as clusterwise regression to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters with their own sets of regression coefficients. The combination of clustering and regression improves the overall quality of the prediction and facilitates interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.

20.
A fundamental problem in data analysis is that of fitting a given model to observed data. It is commonly assumed that only the dependent variable values are in error, and the least squares criterion is often used to fit the model. When significant errors occur in all the variables, an alternative approach frequently suggested for this errors-in-variables problem is to minimize the sum of squared orthogonal distances between each data point and the curve described by the model equation. It has long been recognized that the use of least squares is not always satisfactory, and the l1 criterion is often superior when estimating the true form of data which contain some very inaccurate observations. In this paper the measure of goodness of fit is taken to be the l1 norm of the errors. A Levenberg-Marquardt method is proposed, and the main objective is to take full advantage of the structure of the subproblems so that they can be solved efficiently.
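In symbols, a standard way to write the criterion the abstract combines (errors in all variables, l1 measure of fit) is the following; the exact formulation used in the paper may differ in details:

$$
\min_{\theta,\,\{\delta_i,\varepsilon_i\}} \;\sum_{i=1}^{n}\Bigl(\lVert \delta_i \rVert_1 + \lvert \varepsilon_i \rvert\Bigr)
\quad \text{subject to} \quad y_i + \varepsilon_i = f(x_i + \delta_i;\,\theta),
$$

where δ_i and ε_i are the corrections applied to the observed predictors and response of the i-th data point, and each Levenberg-Marquardt subproblem linearizes f around the current iterate.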

