Similar Articles (20 results)
1.
Estimating the number of clusters is one of the most difficult problems in cluster analysis. Most previous approaches require knowing the data matrix and may not work when only a Euclidean distance matrix is available. Other approaches suffer from the curse of dimensionality and work poorly in high dimension. In this article, we develop a new statistic, called the GUD statistic, based on the idea of the Gap method, but using the determinant of the pooled within-group scatter matrix instead of the within-cluster sum of squared distances. We develop theory showing that this statistic works well when only the Euclidean distance matrix is known; more generally, it works for any dissimilarity matrix satisfying certain properties. We also propose a modification for high-dimensional datasets, called the R-GUD statistic, which gives robust estimates in high-dimensional settings. Simulations show that our method needs less information yet is generally more accurate and robust than the other methods considered in the study, especially in difficult settings.
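A minimal sketch of the determinant-based Gap idea (hypothetical code: the actual GUD statistic is defined so it can be computed from a distance matrix alone, while this sketch works from the data matrix, uses k-means as the clusterer, and assumes more observations than dimensions so the scatter matrix is nonsingular; `gap_determinant` is our own name):

```python
import numpy as np
from sklearn.cluster import KMeans

def pooled_scatter_logdet(X, labels):
    # log|W|, where W is the pooled within-group scatter matrix; this
    # replaces the within-cluster sum of squares of the original Gap method
    p = X.shape[1]
    W = np.zeros((p, p))
    for g in np.unique(labels):
        C = X[labels == g] - X[labels == g].mean(axis=0)
        W += C.T @ C
    return np.linalg.slogdet(W)[1]

def gap_determinant(X, k_max=8, n_ref=20, seed=0):
    # Gap-style curve: reference log|W_k| minus observed log|W_k|;
    # the k with the largest gap is the estimated number of clusters
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit(X).labels_
        obs = pooled_scatter_logdet(X, labels)
        ref = np.mean([
            pooled_scatter_logdet(Xr, KMeans(n_clusters=k, n_init=10).fit(Xr).labels_)
            for Xr in (rng.uniform(lo, hi, X.shape) for _ in range(n_ref))
        ])
        gaps.append(ref - obs)
    return int(np.argmax(gaps)) + 1
```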

2.
High-dimensional multivariate time series are challenging due to the dependent and high-dimensional nature of the data, but in many applications there is additional structure that can be exploited to reduce computing time along with statistical error. We consider high-dimensional vector autoregressive processes with spatial structure, a simple and common form of additional structure. We propose novel high-dimensional methods that take advantage of such structure without making model assumptions about how distance affects dependence. We provide nonasymptotic bounds on the statistical error of parameter estimators in high-dimensional settings and show that the proposed approach reduces the statistical error. An application to air pollution in the USA demonstrates that the estimation approach reduces both computing time and prediction error and gives rise to results that are meaningful from a scientific point of view, in contrast to high-dimensional methods that ignore spatial structure. In practice, these high-dimensional methods can be used to decompose high-dimensional multivariate time series into lower-dimensional multivariate time series that can be studied by other methods in more depth. Supplementary materials for this article are available online.
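A hedged sketch of how spatial structure can shrink a high-dimensional VAR fit (the neighborhood-restriction rule, the lasso penalty, and the function name `spatial_var1` are illustrative assumptions, not the paper's estimator):

```python
import numpy as np
from sklearn.linear_model import Lasso

def spatial_var1(Y, coords, radius, alpha=0.01):
    # Row-by-row lasso fit of a VAR(1) transition matrix, restricting each
    # series' candidate predictors to the series within `radius` of it.
    # Y: (T, p) observations; coords: (p, 2) spatial locations.
    X, Z = Y[:-1], Y[1:]                         # lagged design / response
    p = Y.shape[1]
    D = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    A = np.zeros((p, p))
    for j in range(p):
        nbrs = np.flatnonzero(D[j] <= radius)    # spatial candidate set
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, nbrs], Z[:, j])
        A[j, nbrs] = fit.coef_
    return A
```

Restricting each row to a spatial candidate set reduces both the per-row design width and the total computation, which mirrors the computational gains the abstract describes.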

3.
We present an ensemble tree-based algorithm for variable selection in high-dimensional datasets, in settings where a time-to-event outcome is observed with error. This work is motivated by self-reported outcomes collected in large-scale epidemiologic studies, such as the Women’s Health Initiative. The proposed methods apply equally to imperfect outcomes that arise in other settings, such as data extracted from electronic medical records. To evaluate the performance of our proposed algorithm, we present results from simulation studies considering both continuous and categorical covariates. We illustrate this approach by discovering single nucleotide polymorphisms associated with incident Type 2 diabetes in the Women’s Health Initiative. A freely available R package, icRSF, has been developed to implement the proposed methods. Supplementary material for this article is available online.

4.
We develop the technique of reduced word manipulation to give a range of results concerning reduced words and permutations more generally. We prove a broad connection between pattern containment and reduced words, which specializes to our previous work for vexillary permutations. We also analyze general tilings of Elnitsky’s polygon and demonstrate that these are closely related to the patterns in a permutation. Building on previous work for commutation classes, we show that reduced word enumeration is monotonically increasing with respect to pattern containment. Finally, we give several applications of this work. We show that a permutation and a pattern have equally many reduced words if and only if they have the same length (equivalently, the same number of 21-patterns) and that they have equally many commutation classes if and only if they have the same number of 321-patterns. We also apply our techniques to enumeration problems of pattern avoidance and give a bijection between 132-avoiding permutations of a given length and partitions of that same size, as well as refinements of these data and a connection to the Catalan numbers.
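A small sketch illustrating two objects from this abstract: enumerating the reduced words of a permutation by peeling off descents, and checking the Catalan count of 132-avoiding permutations (function names are ours):

```python
from itertools import combinations, permutations

def reduced_words(w):
    # All reduced words of w (one-line notation): every reduced word of w
    # ends with a descent position i, and stripping that letter leaves a
    # reduced word of the shorter permutation w*s_i.
    descents = [i for i in range(len(w) - 1) if w[i] > w[i + 1]]
    if not descents:
        return [()]                       # the identity has the empty word
    words = []
    for i in descents:
        v = list(w)
        v[i], v[i + 1] = v[i + 1], v[i]   # right-multiply by s_{i+1}
        words += [word + (i + 1,) for word in reduced_words(tuple(v))]
    return words

def avoids_132(w):
    # w avoids 132 iff no positions i < j < k satisfy w[i] < w[k] < w[j]
    return not any(w[i] < w[k] < w[j]
                   for i, j, k in combinations(range(len(w)), 3))

# sanity checks: the longest element of S_3 has exactly the two reduced
# words (1,2,1) and (2,1,2), and 132-avoiders of length 4 number C_4 = 14
assert sorted(reduced_words((3, 2, 1))) == [(1, 2, 1), (2, 1, 2)]
assert sum(avoids_132(w) for w in permutations(range(1, 5))) == 14
```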

5.
In some approximation problems, sampling from the target function can be both expensive and time-consuming. It would therefore be useful to have a method for indicating where approximation quality is poor, so that new data can be generated to give the user greater accuracy where needed. In this paper, we propose a new adaptive algorithm for radial basis function (RBF) interpolation, which assesses the local approximation quality and adds or removes points as required to reduce the error in the specified region. For Gaussian and multiquadric approximation, we have the flexibility of a shape parameter, which we can use to keep the condition number of the interpolation matrix at a moderate size. Numerical results for test functions from the literature are given in dimensions 1 and 2 to show that our method performs well. We also give a three-dimensional example from finance, since we would like to advertise RBF techniques as useful tools for approximation in the high-dimensional settings one often meets in finance.
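A sketch of multiquadric RBF interpolation with a shape parameter, plus a greedy refinement loop. The local error indicator here is Rippa's leave-one-out shortcut and the point-insertion rule is an illustrative stand-in for the paper's algorithm; nodes and pool points are assumed distinct so the system stays nonsingular:

```python
import numpy as np

def rbf_interpolant(X, y, c=1.0):
    # Multiquadric interpolant; the shape parameter c trades accuracy
    # against the conditioning of the interpolation matrix.
    r = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    A = np.sqrt(r ** 2 + c ** 2)
    coef = np.linalg.solve(A, y)
    def s(Xq):
        rq = np.linalg.norm(Xq[:, None] - X[None, :], axis=-1)
        return np.sqrt(rq ** 2 + c ** 2) @ coef
    return s, A, coef

def loo_residuals(A, coef):
    # Rippa's shortcut: the leave-one-out error at node i equals
    # coef[i] / (A^{-1})[i, i], with no refitting of n interpolants.
    return coef / np.diag(np.linalg.inv(A))

def adaptive_rbf(f, X0, pool, c=1.0, tol=1e-4, max_iter=30):
    # Greedy refinement: add the unused pool point closest to the node
    # with the largest estimated local error (an illustrative rule only).
    X, y = X0.copy(), f(X0)
    for _ in range(max_iter):
        s, A, coef = rbf_interpolant(X, y, c)
        err = np.abs(loo_residuals(A, coef))
        if err.max() < tol or len(pool) == 0:
            return s
        d = np.linalg.norm(pool - X[err.argmax()], axis=1)
        X = np.vstack([X, pool[d.argmin()]])
        y = np.append(y, f(pool[d.argmin()][None, :]))
        pool = np.delete(pool, d.argmin(), axis=0)
    return rbf_interpolant(X, y, c)[0]
```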

6.
In this paper, we consider a scale-adjusted, distance-based classifier for high-dimensional data. We first give a classifier that ensures high accuracy in misclassification rates for two-class classification, and show that it is not only consistent but also asymptotically normal for high-dimensional data. We provide a sample size determination so that misclassification rates are no more than a prespecified value, and propose a classification procedure called the misclassification-rate-adjusted classifier. We then extend the classifier to multiclass classification, and show that it still enjoys these asymptotic properties and ensures high accuracy in misclassification rates. Finally, we demonstrate the proposed classifier in real data analyses using a microarray data set.
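A sketch of a bias-corrected distance rule in the spirit of such classifiers (our simplification: this version only subtracts the dominant tr(S_k)/n_k bias term, whereas the published classifier includes a further scale adjustment):

```python
import numpy as np

def adjusted_distance(x, Xk):
    # Squared distance to the class mean minus tr(S_k)/n_k, which removes
    # the bias term that dominates when the dimension is large relative
    # to the class sample size n_k.
    n_k = len(Xk)
    mu = Xk.mean(axis=0)
    tr_S = ((Xk - mu) ** 2).sum() / (n_k - 1)  # trace of the sample covariance
    return ((x - mu) ** 2).sum() - tr_S / n_k

def classify(x, class_samples):
    # assign x to the class with the smallest adjusted distance
    return int(np.argmin([adjusted_distance(x, Xk) for Xk in class_samples]))
```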

7.

We propose a rudimentary taxonomy of interactive data visualization based on a triad of data analytic tasks: finding Gestalt, posing queries, and making comparisons. These tasks are supported by three classes of interactive view manipulations: focusing, linking, and arranging views. This discussion extends earlier work on the principles of focusing and linking and sets them on a firmer footing. Next, we give a high-level introduction to a particular system for multivariate data visualization, XGobi. This introduction is not comprehensive but emphasizes XGobi tools that are examples of focusing, linking, and arranging views; namely, high-dimensional projections, linked scatterplot brushing, and matrices of conditional plots. Finally, in a series of case studies in data visualization, we show the power and limitations of particular focusing, linking, and arranging tools. The discussion is dominated by high-dimensional projections, which form an extremely well-developed part of XGobi. Of particular interest are the illustration of asymptotic normality of high-dimensional projections (a theorem of Diaconis and Freedman), the use of high-dimensional cubes for visualizing factorial experiments, and a method for interactively generating matrices of conditional plots with high-dimensional projections. Although there is a unifying theme to this article, each section, in particular the case studies, can be read separately.

8.
We propose a new binary classification and variable selection technique especially designed for high-dimensional predictors. Among many predictors, typically only a small fraction have a significant impact on prediction. In such a situation, more interpretable models with better prediction accuracy can be obtained by performing variable selection along with classification. By adding an ℓ1-type penalty to the loss function, common classification methods such as logistic regression or support vector machines (SVM) can perform variable selection. Existing penalized SVM methods attempt to solve for all the parameters of the penalized problem jointly. When the data dimension is very high, this joint optimization problem is complex and memory-intensive. In this article, we propose a new penalized forward search technique that reduces the high-dimensional optimization problem to a sequence of one-dimensional optimizations by iterating the selection steps. The new algorithm can be regarded as a forward selection version of the penalized SVM and its variants. The advantage of optimizing in one dimension is that the location of the optimum can be found by an intelligent search that exploits the convexity and piecewise linear or quadratic structure of the criterion function. At each step, the predictor most able to predict the outcome is added to the model, and the search is repeated iteratively until convergence. Comparisons of our new classification rule with the ℓ1-SVM and other common methods show very promising performance: the proposed method leads to much leaner models without compromising misclassification rates, particularly for high-dimensional predictors.
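A rough sketch of the one-dimensional forward search idea (hypothetical simplifications: labels in {-1, +1}, a bounded scalar search instead of the structured search described above, no revisiting of selected coefficients, and the penalty applied only to the new coefficient at each step):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def hinge(margins):
    # mean hinge loss; labels y are assumed to be -1/+1
    return np.maximum(0.0, 1.0 - margins).mean()

def penalized_forward_svm(X, y, lam=0.1, max_steps=10, tol=1e-6):
    n, p = X.shape
    beta, active = np.zeros(p), []
    score = hinge(np.zeros(n))            # criterion at the empty model
    for _ in range(max_steps):
        base = y * (X @ beta)             # margins of the current model
        best_fun, best_j, best_b = np.inf, None, 0.0
        for j in range(p):
            if j in active:
                continue
            # 1-D search over the single new coefficient b
            obj = lambda b, j=j: hinge(base + y * X[:, j] * b) + lam * abs(b)
            res = minimize_scalar(obj, bounds=(-10.0, 10.0), method="bounded")
            if res.fun < best_fun:
                best_fun, best_j, best_b = res.fun, j, res.x
        if best_j is None or best_fun >= score - tol:
            break                         # no remaining step helps
        score = best_fun
        beta[best_j] = best_b
        active.append(best_j)
    return beta, active
```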

9.
Inference for spatial generalized linear mixed models (SGLMMs) for high-dimensional non-Gaussian spatial data is computationally intensive. The computational challenge is due to the high-dimensional random effects, and because Markov chain Monte Carlo (MCMC) algorithms for these models tend to be slow mixing. Moreover, spatial confounding inflates the variance of fixed effect (regression coefficient) estimates. Our approach addresses both the computational and confounding issues by replacing the high-dimensional spatial random effects with a reduced-dimensional representation based on random projections. Standard MCMC algorithms mix well, and the reduced-dimensional setting speeds up computation per iteration. We show, via simulated examples, that Bayesian inference for this reduced-dimensional approach works well in terms of both inference and prediction; our methods also compare favorably to existing “reduced-rank” approaches. We also apply our methods to two real-world data examples, one on bird count data and the other classifying rock types. Supplementary material for this article is available online.
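A sketch of the random-projection step (a generic randomized range-finder applied to an exponential covariance; the paper's construction, including how it addresses spatial confounding, is more involved, and the covariance choice and function names are our assumptions):

```python
import numpy as np

def exp_cov(coords, range_=1.0, var=1.0):
    # exponential covariance over spatial locations (illustrative choice)
    D = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    return var * np.exp(-D / range_)

def random_projection_basis(Sigma, m, seed=0):
    # Randomized range-finder: project the covariance onto m random
    # directions and orthonormalize, giving an n x m basis Q with
    # Sigma ~= Q (Q^T Sigma Q) Q^T for m far below n.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((Sigma.shape[0], m))
    Q, _ = np.linalg.qr(Sigma @ Omega)
    return Q

# The n-dimensional random effect w is replaced by Q @ delta, with delta
# of dimension m << n, so MCMC updates m coordinates per iteration.
coords = np.random.default_rng(1).uniform(0, 1, size=(500, 2))
Q = random_projection_basis(exp_cov(coords), m=50)
```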

10.
Cross-validation has long been used for choosing tuning parameters and for other model selection tasks. It generally performs well provided the data are independent, or nearly so. Improvements have been suggested to address ordinary cross-validation’s (OCV) shortcomings with correlated data. While these techniques have merit, they can still lead to poor model selection with correlated data, or are not readily generalizable to high-dimensional data.

The proposed solution, far-casting cross-validation (FCCV), addresses these problems. FCCV withholds correlated neighbors in every aspect of the cross-validation procedure. The result is a technique that stresses a fitted model’s ability to extrapolate rather than interpolate, which generally leads to better model selection in correlated datasets.

While FCCV is less than optimal in the independence case, our improvement of OCV applies more generally to higher-dimensional error processes and to both parametric and nonparametric model selection problems. To facilitate its introduction, we consider only one application, namely estimating global bandwidths for curve estimation with local linear regression. We provide theoretical motivation and report comparative results from a simulation experiment and from a time series of annual global temperature deviations. For such data, FCCV generally has lower average squared error when disturbances are correlated.

Supplementary materials are available online.
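A sketch of the far-casting idea for one-dimensional data (the withholding radius `h`, the Gaussian-kernel local linear smoother, and the function names are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def local_linear(t_train, y_train, t_test, bw=0.5):
    # Gaussian-kernel local linear smoother with global bandwidth bw
    preds = []
    for t0 in t_test:
        w = np.exp(-0.5 * ((t_train - t0) / bw) ** 2)
        Xd = np.column_stack([np.ones_like(t_train), t_train - t0])
        beta = np.linalg.solve(Xd.T @ (w[:, None] * Xd), Xd.T @ (w * y_train))
        preds.append(beta[0])            # intercept = fitted value at t0
    return np.array(preds)

def fccv_score(t, y, h, bw):
    # Far-casting CV: to predict at t[i], withhold every observation
    # within distance h, forcing extrapolation rather than interpolation.
    errs = []
    for i in range(len(t)):
        keep = np.abs(t - t[i]) > h
        if keep.sum() < 3:
            continue
        pred = local_linear(t[keep], y[keep], np.array([t[i]]), bw)[0]
        errs.append((y[i] - pred) ** 2)
    return np.mean(errs)

# choose a global bandwidth by minimizing the FCCV score over a grid:
# best_bw = min(np.linspace(0.2, 2.0, 10),
#               key=lambda b: fccv_score(t, y, h=0.5, bw=b))
```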

11.
Variable and model selection are major concerns in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection. We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure, and show that variable selection may be biased if the covariates are of different nature. Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate, which is preferentially selected even if it is noninformative. Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate: the additional flexibility of the nonlinear base-learner again yields a preference for the more complex modeling alternative. We investigate these problems from a theoretical perspective and suggest a framework for bias correction based on a general class of penalized least squares base-learners. Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed with naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations. Supplemental materials, including an application to forest health models, additional simulation results, additional theorems, and proofs, are available online.
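A sketch of the degrees-of-freedom matching idea: choose the ridge penalty of a penalized least squares base-learner so that its hat-matrix trace equals a common target df (the bracketing interval, dummy coding, and names are our assumptions; `target_df` must lie below the learner's unpenalized df):

```python
import numpy as np
from scipy.optimize import brentq

def ridge_df(X, lam):
    # df(lam) = trace of the ridge hat matrix  X (X'X + lam I)^{-1} X'
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(H)

def lambda_for_df(X, target_df):
    # solve df(lam) = target_df; df decreases in lam, so bracket widely
    return brentq(lambda lam: ridge_df(X, lam) - target_df, 1e-8, 1e8)

# example: give a 10-level factor (dummy-coded, ~10 unpenalized df) the
# same 4 df as, say, a linear base-learner for a continuous covariate
rng = np.random.default_rng(0)
X_cat = np.eye(10)[rng.integers(0, 10, 200)]   # dummy-coded factor
lam = lambda_for_df(X_cat, target_df=4.0)
```

Giving every base-learner the same effective df is what removes the bias toward the more flexible learner in the selection step.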

12.
In combinatorial commutative algebra and algebraic statistics, many toric ideals are constructed from graphs. Keeping the categorical structure of graphs in mind, we give previous results a more functorial context and generalize them by introducing the ideals of graph homomorphisms. For this new class of ideals, we investigate how the topology of the graphs influences the algebraic properties. We describe explicit Gröbner bases for several classes, generalizing results by Hibi, Sturmfels, and Sullivant. One of our main tools is the toric fiber product, and we employ results by Engström, Kahle, and Sullivant. The lattice polytopes defined by our ideals include important classes in optimization theory, such as stable set polytopes.

13.
In this paper, we propose a new random forest (RF) algorithm for classification of high-dimensional data, using a subspace feature sampling method and feature-value search. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building the decision trees, allowing trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments on high-dimensional real data sets, including standard machine learning and image data sets, demonstrate that the proposed approach significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
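A minimal random-subspace forest sketch (bootstrap rows plus a per-tree feature subset; the paper's sampling scheme and categorical-split heuristic are more elaborate, and nonnegative integer class labels are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SubspaceRF:
    def __init__(self, n_trees=100, subspace=0.3, seed=0):
        self.n_trees, self.subspace = n_trees, subspace
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n, p = X.shape
        m = max(1, int(self.subspace * p))
        for _ in range(self.n_trees):
            feats = self.rng.choice(p, size=m, replace=False)  # feature subspace
            rows = self.rng.integers(0, n, size=n)             # bootstrap sample
            tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
            self.trees.append((feats, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X[:, f]) for f, t in self.trees])
        # majority vote across trees for each test point
        return np.apply_along_axis(
            lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```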

14.
We focus on inference about high-dimensional mean vectors when the sample size is much smaller than the dimension. Such data situations occur in many areas of modern science, such as genetic microarrays, medical imaging, text recognition, finance, and chemometrics. First, we give a given-radius confidence region for mean vectors; this inference can be used for variable selection in high-dimensional data. Next, we give a given-width confidence interval for the squared norm of a mean vector; this inference can be used in classification procedures for high-dimensional data. In order to assure a prespecified coverage probability, we propose a two-stage estimation methodology and determine the required sample size for each inference. Finally, we demonstrate how the new methodologies perform using a microarray data set.

15.
This work proposes a modified forward-backward splitting algorithm that combines an inertial technique for solving the monotone variational inclusion problem. A weak convergence theorem is established under suitable conditions in Hilbert spaces, and a new step size is presented to speed up the convergence of our algorithm. We give an example and numerical results supporting our main theorem in infinite-dimensional spaces. We also provide an application to breast cancer prediction, using the proposed algorithm to update the optimal weights in a machine learning model. Using the Wisconsin original breast cancer data set as the training set, we show its efficiency compared with three other algorithms in terms of three key metrics: accuracy, recall, and precision.
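A sketch of inertial forward-backward splitting with a fixed inertia parameter and step size (our simplification; the paper's adaptive step size rule is not reproduced), applied to a lasso problem where the gradient of the smooth term is the forward operator and soft-thresholding is the backward/resolvent step:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def inertial_forward_backward(grad_f, prox_g, x0, step, theta=0.3, n_iter=500):
    # Solve 0 in (A + B)x with A = grad f and B having resolvent prox_g:
    #   y_k     = x_k + theta * (x_k - x_{k-1})        (inertial extrapolation)
    #   x_{k+1} = prox_g(y_k - step * grad_f(y_k))     (forward-backward step)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        y = x + theta * (x - x_prev)
        x_prev, x = x, prox_g(y - step * grad_f(y), step)
    return x

# example: lasso  min 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((50, 100)), rng.standard_normal(50), 0.1
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f
x = inertial_forward_backward(
    grad_f=lambda x: A.T @ (A @ x - b),
    prox_g=lambda v, s: soft_threshold(v, lam * s),
    x0=np.zeros(100), step=1.0 / L)
```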

16.
In this paper, we give a classification of (finite or countable) ℵ0-categorical coloured linear orders, generalizing Rosenstein's characterization of ℵ0-categorical linear orderings. We show that they can all be built from coloured singletons by concatenation and ℚn-combinations (for n ≥ 1). We give a method using coding trees to describe all structures in our list.

17.
Numerical differentiation of scattered data and its error estimation
1. Background and statement of the problem. The derivative is a fundamental concept in mathematical analysis. For mathematicians, computing derivatives is not particularly difficult work. For scientists working on practical problems, however, the task is far from simple: first of all, numerical differentiation is a classical ill-posed problem in the sense of Hadamard ([5], [12], etc.)
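A tiny numerical illustration of the ill-posedness claim: differentiating noisy samples by finite differences amplifies the noise by roughly 1/h, while a Tikhonov-style smoothing step (our illustrative choice, not this paper's method) restores much of the accuracy:

```python
import numpy as np

h = 1e-3
t = np.arange(0.0, 1.0, h)
rng = np.random.default_rng(0)
y = np.sin(t) + 1e-4 * rng.standard_normal(t.size)  # data with tiny noise

naive = np.diff(y) / h                    # noise amplified by ~1/h
print(np.abs(naive - np.cos(t[:-1])).max())   # error ~ noise/h, not ~ noise

# Tikhonov-style remedy (sketch): smooth the data before differencing
lam, n = 1e-8, t.size
D2 = np.diff(np.eye(n), n=2, axis=0) / h ** 2    # second-difference operator
y_smooth = np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)
reg = np.diff(y_smooth) / h
print(np.abs(reg - np.cos(t[:-1])).max())        # much smaller error
```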

18.
Forecasting mortality rates is a problem that involves the analysis of high-dimensional time series. Most common mortality models propose decomposing the mortality rates into several latent factors to reduce this complexity. These approaches, in particular those using cohort factors, fit well but are less reliable for forecasting purposes. One of the major challenges is to determine the spatial–temporal dependence structure between mortality rates given a relatively moderate sample size. This paper proposes a large vector autoregressive (VAR) model fitted on the differences of the log-mortality rates, ensuring the existence of long-run relationships between mortality rate improvements. Our contribution is threefold. First, sparsity in the fitted model is ensured by using high-dimensional variable selection techniques without imposing arbitrary constraints on the dependence structure. The main interest is that the structure of the model is driven directly by the data, in contrast to the main factor-based mortality forecasting models; this approach is therefore more versatile and should provide good forecasting performance for any population considered. Our estimation also allows a one-step procedure, as we do not need to estimate hyperparameters; the variance–covariance matrix of the residuals is then estimated through a parametric form. Second, our approach can be used to detect nonintuitive age dependence in the data, beyond the cohort and period effects that are implicitly captured by our model. Third, our approach can be extended to model several populations in a long-run perspective without complicating the estimation process. Finally, in an out-of-sample forecasting study of mortality rates, we obtain good performance and more relevant forecasts than classical mortality models on French, US, and UK data, and we show that our results shed light on the so-called cohort and period effects for these populations.

19.
In this paper, a retarded impulsive n-species Lotka–Volterra competition system with feedback controls is studied. Sufficient conditions are obtained to guarantee the global exponential stability (GES) and global asymptotic stability (GAS) of a unique equilibrium for such a high-dimensional biological system. The problem considered here is in many respects more general and incorporates as special cases various problems that have been extensively studied in the literature. Moreover, applying the obtained results to some special cases, we derive new criteria that generalize and greatly improve some well-known results. A method is proposed to investigate biological systems subject to the effect of both impulses and delays; it is based on Banach fixed point theory and matrix spectral theory, as well as Lyapunov functions, and some novel analytic techniques are employed to study GAS and GES. It is believed that the method can be extended to other high-dimensional biological systems and complex neural networks. Finally, two examples show the feasibility of the results.

20.
We present a very fast algorithm for general matrix factorization of a data matrix, for use in the statistical analysis of high-dimensional data via latent factors. Such data are prevalent across many application areas and generate an ever-increasing demand for methods of dimension reduction that make the statistical analysis of interest feasible. Our algorithm uses a gradient-based approach that can be used with any loss function, provided the latter is differentiable. The speed and effectiveness of our algorithm for dimension reduction are demonstrated in the context of supervised classification of real high-dimensional data sets from the bioinformatics literature.
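A gradient-based factorization sketch in the spirit of this abstract (plain gradient descent with a user-supplied loss gradient; the learning rate, initialization, and function names are our assumptions, not the paper's algorithm):

```python
import numpy as np

def factorize(X, rank, loss_grad, n_iter=500, lr=1e-3, seed=0):
    # Fit X ~ A @ B for any differentiable loss: loss_grad(R) must return
    # the n x p gradient of the loss evaluated at the reconstruction R = A @ B.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = 0.1 * rng.standard_normal((n, rank))
    B = 0.1 * rng.standard_normal((rank, p))
    for _ in range(n_iter):
        G = loss_grad(A @ B)
        A -= lr * (G @ B.T)    # chain rule: dL/dA = G B^T
        B -= lr * (A.T @ G)    # chain rule: dL/dB = A^T G
    return A, B

# squared-error example; swap in any other differentiable loss gradient
X = np.random.default_rng(1).standard_normal((100, 40))
A, B = factorize(X, rank=5, loss_grad=lambda R: R - X)
```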
