Similar Documents
20 similar documents found
1.
Fitting semiparametric clustering models to dissimilarity data (total citations: 1; self-citations: 0; citations by others: 1)
The cluster analysis problem of partitioning a set of objects from dissimilarity data is handled here with the statistical model-based approach of fitting the “closest” classification matrix to the observed dissimilarities, where a classification matrix represents a clustering structure expressed in terms of dissimilarities. Cluster analysis lacks widely used methodologies for directly partitioning a set of objects from dissimilarity data. In real applications, a hierarchical clustering algorithm is applied to the dissimilarities and a partition is then chosen by visual inspection of the dendrogram. Alternatively, a “tandem analysis” is used: a Multidimensional Scaling (MDS) algorithm is applied first, followed by a partitioning algorithm such as k-means on the dimensions produced by the MDS. However, neither hierarchical clustering nor tandem analysis is specifically designed to solve the statistical problem of fitting the closest partition to the observed dissimilarities. This gap motivates the present paper: the introduction and study of three new object-partitioning models for dissimilarity data, their estimation via least squares, and three new fast algorithms.
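The "tandem analysis" that this abstract contrasts against can be sketched in a few lines. Below is a minimal illustration with NumPy (classical MDS via double centering, followed by a plain k-means loop) — not the least-squares partitioning models proposed in the paper; the toy dissimilarity matrix and all function names are assumptions of this sketch:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed a symmetric dissimilarity matrix D into k dimensions
    via double centering and an eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]                # largest eigenvalues first
    w, V = np.clip(w[idx], 0, None), V[:, idx]
    return V * np.sqrt(w)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means on embedded coordinates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated groups of points, converted to a dissimilarity matrix.
pts = np.vstack([np.zeros((5, 2)), 10 + np.zeros((5, 2))]) + \
      np.random.default_rng(1).normal(scale=0.1, size=(10, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

labels = kmeans(classical_mds(D, k=2), k=2)
```

As the abstract notes, this pipeline is not fitted to the dissimilarities directly — the MDS step and the k-means step each optimize their own criterion.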

2.
Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. In this article, we consider high-dimensional predictors and a high-dimensional response and propose two procedures to cluster observations according to the link between predictors and response. To reduce the dimension, we propose to use the Lasso estimator, which accounts for sparsity, and a maximum likelihood estimator penalized by the rank, which accounts for the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models varying these two parameters and select a model from this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions, using a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods on both simulated and real datasets to understand how they work in practice.
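The Lasso dimension-reduction step can be illustrated in isolation. The following is a minimal coordinate-descent sketch for a plain single-population Lasso — not the mixture-regression procedure of the article — and the toy data are invented for illustration:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]       # partial residual w/o b_j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Sparse ground truth: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding update zeroes out coordinates whose partial correlation with the residual falls below the penalty level, which is the sparsity mechanism the abstract relies on.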

3.
This short paper takes up the problem of El-Shishiny and Ghabbour [1] in order to highlight some of the problems in data analysis aimed at recognizing basic data structures, of hierarchy or partition type, when there is vagueness in the initial data and/or in the procedure applied. A response to the considered ‘soil fauna taxa versus Sudan sites’ problem is also suggested.

4.
Fuzzy Sets and Systems, 1987, 24(3): 363–375
Since fuzzy data can be regarded as possibility distributions, this paper proposes fuzzy data analysis by possibilistic linear models. Possibilistic linear systems are defined by the extension principle. Fuzzy parameter estimation is discussed in possibilistic linear systems, and possibilistic linear models are employed for fuzzy data analysis with non-fuzzy inputs and fuzzy outputs defined by fuzzy numbers. The estimated possibilistic linear system can be obtained by solving a linear programming problem. This approach can be regarded as fuzzy interval analysis.
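In the special case of crisp inputs and crisp outputs, a Tanaka-style possibilistic regression reduces to a linear program of the kind mentioned above. The sketch below (using scipy.optimize.linprog) fits an interval model ŷ(x) = a·x ± c·|x| so that every observation lies inside its predicted interval while total spread is minimized — a simplifying assumption, not the fuzzy-number formulation of the paper, and the toy data are invented:

```python
import numpy as np
from scipy.optimize import linprog

def possibilistic_fit(X, y):
    """Fit the interval model y(x) = a.x +/- c.|x| by an LP:
    minimize total spread subject to each y_i lying in its interval."""
    n, p = X.shape
    A = np.abs(X)
    # Decision vector z = [a (p entries, free), c (p entries, >= 0)].
    cost = np.concatenate([np.zeros(p), A.sum(axis=0)])   # total spread
    A_ub = np.vstack([np.hstack([-X, -A]),    # enforce a.x + c.|x| >= y
                      np.hstack([ X, -A])])   # enforce a.x - c.|x| <= y
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p], res.x[p:]

# Toy data roughly following y = 2x, with an intercept column.
X = np.column_stack([np.ones(5), np.array([1.0, 2, 3, 4, 5])])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
a, c = possibilistic_fit(X, y)
lower = X @ a - np.abs(X) @ c
upper = X @ a + np.abs(X) @ c
```

The two inequality blocks are exactly the interval-inclusion constraints, so the LP output is the tightest possibilistic band covering all observations.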

5.
In this paper, a novel memetic algorithm (MA) named GS-MPSO is proposed by combining particle swarm optimization (PSO) with a Gaussian mutation operator and a simulated annealing (SA)-based local search operator. In GS-MPSO, the particles are organized as a ring lattice. The Gaussian mutation operator is applied to stagnant particles to prevent GS-MPSO from becoming trapped in local optima. The SA-based local search strategy is combined with the cognition-only PSO model to perform a fine-grained local search around promising regions. Experimental results show that GS-MPSO outperforms several other PSO variants on benchmark functions when computing resources are limited. Data clustering is studied as a real case study to further demonstrate its optimization ability and usability.
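The PSO-plus-Gaussian-mutation idea can be sketched as follows. This is not GS-MPSO itself (no ring lattice and no SA local search) — just a global-best PSO that perturbs stagnant particles, minimizing the sphere function, with all parameter values assumed for illustration:

```python
import numpy as np

def pso(f, dim=5, n_particles=20, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))    # positions
    v = np.zeros_like(x)                          # velocities
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    stall = np.zeros(n_particles, dtype=int)      # iterations w/o improvement
    for _ in range(iters):
        g = pbest[np.argmin(pbest_f)]             # global best position
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = 0.72 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        stall = np.where(improved, 0, stall + 1)
        # Gaussian mutation of stagnant particles to escape local optima.
        stuck = stall > 10
        x[stuck] += rng.normal(scale=0.5, size=(int(stuck.sum()), dim))
        stall[stuck] = 0
    return pbest[np.argmin(pbest_f)], float(pbest_f.min())

best_x, best_f = pso(lambda p: float((p ** 2).sum()))
```

Mutating only the positions (not the personal bests) keeps the best-so-far record monotone while still injecting diversity, which is the role the Gaussian operator plays in the abstract.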

6.
7.
The development of solid tumors is associated with the acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic alterations seen in tumors reflect underlying failures in the maintenance of genetic stability, as well as selection for changes that provide a growth advantage. In order to investigate genomic alterations we use microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so define copy number phenotypes and associate them with known biological markers. To exploit the spatial coherence between nearby clones, we use an unsupervised hidden Markov model approach in which the clones are partitioned into states representing the underlying copy number of each group of clones. The method is demonstrated on two cell line datasets, one with known copy number alterations. The biological conclusions drawn from the analyses are discussed.
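The state-decoding step of such an HMM can be illustrated with a fixed-parameter Viterbi pass over simulated log2-ratios. This is a sketch only — three assumed states (loss/normal/gain) with hand-picked emission means and a sticky transition matrix, not the unsupervised fitting used in the paper:

```python
import numpy as np

def viterbi(obs, means, sigma, trans, init):
    """Most likely state path for Gaussian emissions (log domain)."""
    n, k = len(obs), len(means)
    log_em = -0.5 * ((obs[:, None] - means) / sigma) ** 2  # up to a constant
    log_tr = np.log(trans)
    delta = np.log(init) + log_em[0]
    psi = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + log_tr      # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_em[t]
    path = np.empty(n, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(n - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path

# States: 0 = loss, 1 = normal, 2 = gain (assumed log2-ratio means).
means = np.array([-1.0, 0.0, 0.58])
trans = np.full((3, 3), 0.01) + np.eye(3) * 0.97   # sticky: rows sum to 1

rng = np.random.default_rng(0)
truth = np.array([1] * 30 + [2] * 20 + [1] * 30)   # one gained segment
obs = means[truth] + rng.normal(scale=0.15, size=len(truth))
path = viterbi(obs, means, sigma=0.15, trans=trans, init=np.ones(3) / 3)
```

The sticky diagonal of the transition matrix is what encodes the spatial coherence between nearby clones: isolated noisy probes are absorbed into the surrounding segment.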

8.
Estimating the probability of extreme temperature events is difficult because of limited records across time and the need to extrapolate the distributions of these events, as opposed to just the mean, to locations where observations are not available. A related issue is the need to characterize the uncertainty in the estimated probability of extreme events at different locations. Although the tools for statistical modeling of univariate extremes are well developed, extending these tools to model spatial extreme data is an active area of research. In this paper, in order to make inference about spatial extreme events, we introduce a new nonparametric model for extremes. We present a Dirichlet-based copula model that is a flexible alternative to parametric copula models such as the normal and t-copula. The proposed modelling approach is fitted using a Bayesian framework that allows us to take into account different sources of uncertainty in the data and models. We apply our methods to annual maximum temperature values in the east-south-central United States.

9.
Advances in Data Analysis and Classification - Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for...

10.
11.
This paper addresses the non-parametric estimation of the stochastic process related to the classification problem that arises in robot programming by demonstration of compliant motion tasks. Robot programming by demonstration is a paradigm in which a human operator demonstrates the task to be performed by the robot. In such a demonstration, several observable variables, such as velocities and forces, can be modeled non-parametrically in order to classify the current state of contact between an object manipulated by the robot and the environment in which it operates. The essential actions in compliant motion tasks are the contacts that take place; it is therefore important to recover the sequence of contact states made during a demonstration, a task called contact classification. We propose a contact classification algorithm based on the random forest algorithm. The main advantage of this approach is that it does not depend on the geometric model of the objects involved in the demonstration, nor does it rely on the kinestatic model of the contact interactions. A comparison with state-of-the-art contact classifiers shows that the random forest classifier is more accurate. Copyright © 2015 John Wiley & Sons, Ltd.

12.

A new class of survival frailty models based on the generalized inverse-Gaussian (GIG) distributions is proposed. We show that the GIG frailty models are flexible and mathematically convenient like the popular gamma frailty model. A piecewise-exponential baseline hazard function is employed, yielding flexibility for the proposed class. Although a closed-form observed log-likelihood function is available, simulation studies show that employing an EM algorithm is advantageous compared with direct maximization of this function. Further simulation results address the comparison of different methods for obtaining standard errors of the estimates and confidence intervals for the parameters. Additionally, the finite-sample behavior of the EM estimators is investigated and the performance of the GIG models under misspecification is assessed. We apply our methodology to TARGET (Therapeutically Applicable Research to Generate Effective Treatments) data on the survival time of patients with neuroblastoma and show some advantages of the GIG frailties over existing models in the literature.


13.
In this paper we study the asymptotic behavior of Bayes estimators for hidden Markov models as the number of observations goes to infinity. The theorem that we prove is similar to the Bernstein–von Mises theorem on the asymptotic behavior of the posterior distribution in the case of independent observations. We show that our theorem is applicable to a wide class of hidden Markov models. We also discuss the implications of the theorem's assumptions for several models used in practical applications, such as ion channel kinetics.

14.
In this paper, we study the structured nonnegative matrix factorization problem: given a square, nonnegative matrix P, decompose it as P = VAV^T with V and A nonnegative matrices and with the dimension of A as small as possible. We propose an iterative approach that minimizes the Kullback-Leibler divergence between P and VAV^T subject to the nonnegativity constraints on A and V, with the dimension of A given. The approximate structured decomposition P ≈ VAV^T is closely related to the approximate symmetric decomposition P ≈ VV^T. It is shown that the approach for finding an approximate structured decomposition can be adapted to solve the symmetric decomposition problem approximately. Finally, we apply the nonnegative decomposition VAV^T to the hidden Markov realization problem and to the clustering of data vectors based on their distance matrix.
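One rough way to compute such a structured factorization is projected gradient descent on the Frobenius analogue of the objective. The sketch below is an assumption-laden simplification — the paper minimizes the Kullback-Leibler divergence with its own iterative scheme, not this one — and the toy target matrix is invented:

```python
import numpy as np

def structured_nmf(P, r, iters=2000, lr=1e-3, seed=0):
    """Approximate a nonnegative n x n matrix P as V @ A @ V.T
    (V: n x r, A: r x r, both nonnegative) by projected gradient
    descent on the squared Frobenius error."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    V, A = rng.random((n, r)), rng.random((r, r))
    errs = []
    for _ in range(iters):
        R = P - V @ A @ V.T                      # residual
        errs.append(float(np.linalg.norm(R)))
        grad_V = -2.0 * (R @ V @ A.T + R.T @ V @ A)
        grad_A = -2.0 * (V.T @ R @ V)
        V = np.maximum(V - lr * grad_V, 0.0)     # project onto the
        A = np.maximum(A - lr * grad_A, 0.0)     # nonnegative orthant
    return V, A, errs

# A small target with an exact structured factorization of inner dimension 2.
rng = np.random.default_rng(1)
V0, A0 = rng.random((6, 2)), rng.random((2, 2))
P = V0 @ A0 @ V0.T
V, A, errs = structured_nmf(P, r=2)
```

The gradient of ||P − VAV^T||² with respect to V has the two terms shown because V appears on both sides of A; setting A = I recovers the symmetric P ≈ VV^T case the abstract mentions.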

15.
In this paper we present a robust conjugate duality theory for convex programming problems in the face of data uncertainty within the framework of robust optimization, extending the powerful conjugate duality technique. We first establish robust strong duality between an uncertain primal parameterized convex programming model problem and its uncertain conjugate dual by proving strong duality between the deterministic robust counterpart of the primal model and the optimistic counterpart of its dual problem under a regularity condition. This regularity condition is not only sufficient for robust duality but also necessary whenever robust duality holds for every linear perturbation of the objective function of the primal model problem. More importantly, we show that robust strong duality always holds for partially finite convex programming problems under scenario data uncertainty and that the optimistic counterpart of the dual is a tractable finite-dimensional problem. As an application, we also derive a robust conjugate duality theorem for support vector machines, an important class of convex optimization models for classifying two labelled data sets. The support vector machine has emerged as a powerful modelling tool for machine learning problems of data classification that arise in many areas of application in information and computer science.

16.
This paper provides an introduction to and an overview of Bayesian estimation, based on Markov chain Monte Carlo techniques. Autoregressive time series models are considered in some detail. Finally, an example involving the modelling of a metal pollutant concentration in sludge is presented.
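A minimal instance of such MCMC-based Bayesian estimation is random-walk Metropolis for the coefficient of an AR(1) model. The sketch below assumes a flat prior on (−1, 1) and known unit noise variance — illustrative choices, not the models of the paper — with simulated data:

```python
import numpy as np

def ar1_metropolis(y, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis for the AR(1) coefficient phi, with
    known noise variance 1 and a flat prior on (-1, 1)."""
    rng = np.random.default_rng(seed)

    def loglik(phi):
        resid = y[1:] - phi * y[:-1]
        return -0.5 * np.sum(resid ** 2)

    phi = 0.0
    samples = []
    for _ in range(n_iter):
        prop = phi + step * rng.normal()          # random-walk proposal
        # Accept with probability min(1, posterior ratio); flat prior
        # means the ratio is just the likelihood ratio.
        if -1 < prop < 1 and np.log(rng.random()) < loglik(prop) - loglik(phi):
            phi = prop
        samples.append(phi)
    return np.array(samples)

# Simulate an AR(1) series with true coefficient 0.7.
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

samples = ar1_metropolis(y)
phi_hat = samples[1000:].mean()   # posterior mean after burn-in
```

The posterior mean after burn-in should sit near the true coefficient; richer AR(p) models extend this by proposing each coefficient in turn.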

17.
Local mean-field Markov processes are constructed from local mean-field dynamical semigroups of Markov transition operators. This provides a general scheme for the convergence of empirical measure processes for tagged particles in the thermodynamic limit of classical interacting particle systems. As an application, the Poissonian approximation for message-switching queueing networks is investigated.

18.
There are many well-known document classification and clustering algorithms. This paper focuses on compression-based distances between documents, in particular the normalized compression distance (NCD), a popular and powerful metric between strings. A new distance $D_\alpha $ with one parameter $\alpha $ between strings is designed on the basis of the NCD, and several properties of $D_\alpha $ are studied. It is also proved that every pair of strings $(x,y)$ can be plotted on the contour graphs of the NCD and $D_\alpha $ (and some other compression-based distances) in a 2-dimensional plane. The distance $D_\alpha (x,y)$ is defined to take a relatively small value if a string $x$ is a portion of a string $y$. Literary works $x$ and $y$ are usually assumed to be written by the same author(s) if $x$ is a portion of $y$; it is therefore appropriate to use literary work classification by authorship as a benchmark for $D_\alpha $. An algorithm to determine an appropriate value of $\alpha $ is presented using the contour graphs; this algorithm does not require knowledge of the authors' names for each work. Experimental results on the area under receiver operating characteristic curves and on clustering show that $D_\alpha $ with such an appropriate value of $\alpha $ performs somewhat better in literary work classification by authorship.
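The NCD itself is easy to compute with any off-the-shelf compressor: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. A sketch using zlib follows; the test strings are invented for illustration, and $D_\alpha$ itself is not implemented here:

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length in bytes (zlib at maximum compression level)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"the quick brown fox jumps over the lazy dog " * 50
c = bytes(range(256)) * 10   # unrelated content

same = ncd(a, b)   # small: concatenating a duplicate adds little new information
diff = ncd(a, c)   # large: the two strings share no structure
```

Because real compressors add per-stream overhead, NCD values for identical strings are small but not exactly zero, which is one motivation for studying alternative compression-based distances such as $D_\alpha$.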

19.
This paper presents a new approach to consumer credit scoring that tailors a profit-based classification performance measure to credit risk modeling. This performance measure takes into account the expected profits and losses of credit granting and thereby better aligns the model developers' objectives with those of the lending company. It is based on the Expected Maximum Profit (EMP) measure and is used to find a trade-off between the expected losses – driven by the exposure of the loan and the loss given default – and the operational income given by the loan. A major advantage of the proposed measure is that it makes it possible to calculate the optimal cutoff value, which is necessary for model implementation. To test the approach, we use a dataset of loans granted by a government institution and benchmark the accuracy and monetary gain of using EMP, accuracy, and the area under the ROC curve as measures for selecting model parameters and for determining the respective cutoff values. The results show that our proposed profit-based classification measure outperforms the alternative approaches in terms of both accuracy and monetary value in the test set, and that it facilitates model deployment.
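The cutoff-selection idea can be sketched with a simplified fixed-profit variant: each good loan earns a fixed gain and each accepted defaulter incurs a fixed loss. This is not the EMP measure itself, which integrates over a loss-given-default distribution, and all numbers below are invented:

```python
import numpy as np

def best_cutoff(scores, defaulted, gain=200.0, loss=1000.0):
    """Sweep cutoffs: accept loans scored below the cutoff; each accepted
    good loan earns `gain`, each accepted defaulter costs `loss`."""
    cutoffs = np.unique(scores)
    profits = []
    for t in cutoffs:
        accept = scores < t
        profits.append(gain * np.sum(accept & ~defaulted)
                       - loss * np.sum(accept & defaulted))
    i = int(np.argmax(profits))
    return float(cutoffs[i]), float(profits[i])

# Synthetic scores: defaulters tend to receive higher (riskier) scores.
rng = np.random.default_rng(0)
n = 1000
defaulted = rng.random(n) < 0.2
scores = np.clip(rng.normal(0.3, 0.15, n) + 0.35 * defaulted, 0, 1)

cutoff, profit = best_cutoff(scores, defaulted)
```

Because the loss on one defaulter outweighs the gain on several good loans, the profit-maximizing cutoff is stricter than an accuracy-maximizing one — exactly the misalignment the abstract argues a profit-based measure corrects.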

20.

Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号