Similar Literature
20 similar documents found.
1.
Clustering and classification are important tasks in the analysis of microarray gene expression data. Classification of tissue samples can be a valuable diagnostic tool for diseases such as cancer. Clustering samples or experiments may lead to the discovery of subclasses of diseases, and clustering genes can help identify groups of genes that respond similarly to a set of experimental conditions. Validation tools are also needed for clustering and classification. Here, we focus on the identification of outliers: units that may have been misallocated or mislabeled, or that are not representative of their classes or clusters. We present two new methods, DDclust and DDclass, for clustering and classification. These non-parametric methods are based on the intuitively simple concept of data depth. We apply the methods to several gene expression and simulated data sets, and we also discuss a convenient visualization and validation tool, the relative data depth plot.
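Below is a minimal sketch of the maximum-depth classification idea behind DDclass, assuming Mahalanobis depth as the depth function (the paper's exact depth notion may differ); a point is assigned to the class within which it is deepest.

```python
import numpy as np

def mahalanobis_depth(x, sample):
    """Depth of point x within `sample`: 1 / (1 + squared Mahalanobis distance)."""
    mu = sample.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(sample, rowvar=False))
    d2 = (x - mu) @ cov_inv @ (x - mu)
    return 1.0 / (1.0 + d2)

def max_depth_classify(x, class_samples):
    """Return the label of the class where x attains maximal depth."""
    depths = {label: mahalanobis_depth(x, s) for label, s in class_samples.items()}
    return max(depths, key=depths.get)

rng = np.random.default_rng(0)
classes = {"A": rng.normal(0, 1, (50, 3)), "B": rng.normal(3, 1, (50, 3))}
print(max_depth_classify(np.array([2.5, 2.8, 3.1]), classes))  # likely "B"
```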

2.
In this work, we assess the suitability of cluster analysis for the gene grouping problem confronted with microarray data. Gene clustering is the exercise of grouping genes based on attributes, which are generally the expression levels over a number of conditions or subpopulations. The hope is that similarity with respect to expression is often indicative of similarity with respect to much more fundamental and elusive qualities, such as function. By formally defining the true gene-specific attributes as parameters, such as expected expression across the conditions, we obtain a well-defined gene clustering parameter of interest, which greatly facilitates the statistical treatment of gene clustering. We point out that genome-wide collections of expression trajectories often lack natural clustering structure, prior to ad hoc gene filtering. The gene filters in common use induce a certain circularity to most gene cluster analyses: genes are points in the attribute space, a filter is applied to depopulate certain areas of the space, and then clusters are sought (and often found!) in the “cleaned” attribute space. As a result, statistical investigations of cluster number and clustering strength are just as much a study of the stringency and nature of the filter as they are of any biological gene clusters. In the absence of natural clusters, gene clustering may still be a worthwhile exercise in data segmentation. In this context, partitions can be fruitfully encoded in adjacency matrices and the sampling distribution of such matrices can be studied with a variety of bootstrapping techniques.
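A brief sketch of the adjacency-matrix encoding of a partition, with a crude bootstrap over conditions; k-means stands in here for whatever clustering rule is actually used, and the paper's specific resampling schemes are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def adjacency(labels):
    """A[i, j] = 1 iff genes i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))                # 100 genes, 8 conditions
base = adjacency(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X))

# Bootstrap the conditions (columns) and measure co-clustering stability.
agreements = []
for _ in range(50):
    cols = rng.integers(0, X.shape[1], X.shape[1])
    A = adjacency(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols]))
    agreements.append((A == base).mean())
print(np.mean(agreements))   # average entrywise agreement with the original partition
```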

3.
In this article, the problem of classifying a new observation vector into one of the two known groups Π_i, i = 1, 2, distributed as multivariate normal with common covariance matrix is considered. The total number of observation vectors from the two groups is, however, less than the dimension of the observation vectors. A sample-squared distance between the two groups, using the Moore-Penrose inverse, is introduced. A classification rule based on the minimum distance is proposed to classify an observation vector into two or several groups. An expression for the error of misclassification when there are only two groups is derived for large p and n = O(p^δ), 0 < δ < 1.
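A sketch of the minimum-distance rule, assuming the Moore-Penrose inverse is taken of the pooled sample covariance matrix (which is singular when p exceeds the total sample size).

```python
import numpy as np

def classify_mp(x, groups):
    """Assign x to the group with smallest sample-squared distance
    (x - mean_i)' S^+ (x - mean_i), S^+ = Moore-Penrose inverse of pooled cov."""
    pooled = np.vstack([g - g.mean(axis=0) for g in groups])
    n = sum(len(g) for g in groups)
    S = pooled.T @ pooled / (n - len(groups))
    S_pinv = np.linalg.pinv(S)
    dists = [(x - g.mean(axis=0)) @ S_pinv @ (x - g.mean(axis=0)) for g in groups]
    return int(np.argmin(dists))

rng = np.random.default_rng(2)
p = 50                                   # dimension exceeds total sample size
g1, g2 = rng.normal(0, 1, (10, p)), rng.normal(1, 1, (10, p))
print(classify_mp(rng.normal(1, 1, p), [g1, g2]))  # expected: 1
```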

4.
This paper centres on clustering approaches that deal with multiple DNA microarray datasets. Four clustering algorithms for deriving a clustering solution from multiple gene expression matrices studying the same biological phenomenon are considered: two unsupervised cluster techniques based on information integration; a hybrid consensus clustering method combining Particle Swarm Optimization and k-means that can be regarded as supervised clustering; and a supervised consensus clustering algorithm enhanced by Formal Concept Analysis (FCA), which first produces a list of clustering solutions, one per experiment, and then transforms them by partitioning the cluster centres into a single overlapping partition, which is further analyzed using FCA. The four algorithms are evaluated on gene expression time series obtained from a study examining the global cell-cycle control of gene expression in the fission yeast Schizosaccharomyces pombe.
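A simplified consensus-clustering sketch in the spirit of the centre-partitioning step: each expression matrix is clustered separately, the cluster centres are pooled, and the pooled centres are re-partitioned. The PSO and FCA refinements are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_centres(matrices, k):
    centres = np.vstack([
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(M).cluster_centers_
        for M in matrices
    ])
    # Partition the pooled centres into k consensus centres.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(centres).cluster_centers_

rng = np.random.default_rng(3)
experiments = [rng.normal(size=(200, 12)) for _ in range(3)]  # 3 expression matrices
print(consensus_centres(experiments, k=4).shape)              # (4, 12)
```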

5.
Ogawa (1951) considered the efficiency of estimation of the population mean from suitably chosen order statistics in large samples. Cox (1957) considered the relative amount of information retained by grouping the normal curve. Cochran and Hopkins (1961) determined the discriminating power retained after partitioning normally distributed variates into qualitative ones in multivariate classification problems. And Connor (1972) discussed the asymptotic efficiency of tests for trend using m groups formed from a continuous variable. The same expression appears in all these investigations. This note throws some more light on the occurrence of the same expression in these seemingly unrelated problems.

6.
Multiple hypotheses testing is concerned with appropriately controlling the rate of false positives, false negatives or both when testing several hypotheses simultaneously. Nowadays, the common approach to testing multiple hypotheses calls for controlling the expected proportion of falsely rejected null hypotheses, referred to as the false discovery rate (FDR), or suitable measures based on the positive false discovery rate (pFDR). In this paper, we consider the problem of determining levels at which both false positives and false negatives can be controlled simultaneously. As our risk function, we use the expected value of the maximum of the proportions of false positives and false negatives, with the expectation taken conditional on the event that at least one hypothesis is rejected and one is accepted, referred to as the hybrid error rate (HER). Based on HER, we then develop an analog of the p-value, termed the h-value, for testing the individual hypotheses. The use of the new procedure is illustrated using the well-known public data set of Golub et al. [Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531-537] with Affymetrix arrays of patients with acute lymphoblastic leukemia and acute myeloid leukemia.
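The h-value procedure itself is not reproduced here; for orientation, the sketch below implements the standard Benjamini-Hochberg step-up rule controlling the FDR that the abstract contrasts with.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (FDR controlled at alpha)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(0, 0.001, 10), rng.uniform(0, 1, 90)])
print(benjamini_hochberg(pvals).sum())   # roughly 10 rejections
```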

7.
Hierarchical and empirical Bayes approaches to inference are attractive for data arising from microarray gene expression studies because of their ability to borrow strength across genes in making inferences. Here we focus on the simplest case where we have data from replicated two-colour arrays which compare two samples and where we wish to decide which genes are differentially expressed and obtain estimates of operating characteristics such as false discovery rates. The purpose of this paper is to examine the frequentist performance of Bayesian variable selection approaches to this problem for different prior specifications and to examine the effect on inference of commonly used empirical Bayes approximations to hierarchical Bayes procedures. The paper makes three main contributions. First, we describe how the log odds of differential expression can usually be computed analytically in the case where a double-tailed exponential prior is used for gene effects rather than a normal prior, which gives an alternative to the commonly used B-statistic for ranking genes in simple comparative experiments. The second contribution of the paper is to compare empirical Bayes procedures for detecting differential expression with hierarchical Bayes methods which account for uncertainty in prior hyperparameters, to examine how much is lost in using the commonly employed empirical Bayes approximations. Third, we describe an efficient MCMC scheme for carrying out the computations required for the hierarchical Bayes procedures. Comparisons are made via simulation studies where the simulated data are obtained by fitting models to some real microarray data sets. The results have implications for the analysis of microarray data using parametric hierarchical and empirical Bayes methods for more complex experimental designs: generally we find that the empirical Bayes methods work well, which supports their use in the analysis of more complex experiments when a full hierarchical Bayes analysis would impose heavy computational demands.
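A hedged sketch of the posterior log odds of differential expression for a single gene, assuming a normal N(0, τ²) prior on the true effect and known error variance; the paper's double-tailed exponential prior yields a different, though still analytic, expression.

```python
import numpy as np
from scipy.stats import norm

def log_odds(dbar, n, sigma2, tau2, pi1):
    """dbar: mean log-ratio over n replicate arrays; pi1: prior P(differential)."""
    se2 = sigma2 / n
    m1 = norm.pdf(dbar, 0.0, np.sqrt(se2 + tau2))   # marginal under 'differential'
    m0 = norm.pdf(dbar, 0.0, np.sqrt(se2))          # marginal under 'not differential'
    return np.log(pi1 * m1) - np.log((1 - pi1) * m0)

print(log_odds(dbar=1.2, n=4, sigma2=0.5, tau2=1.0, pi1=0.05))
```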

8.
Euclidean distance-based classification rules are derived within a certain nonclassical linear model approach and applied to elliptically contoured samples having a density generating function g. A geometric measure-theoretical method to evaluate exact probabilities of correct classification for multivariate uncorrelated feature vectors is then developed. When doing this, one has to measure suitably defined sets with certain standardized measures. The geometric key point is that the intersection percentage functions of the areas under investigation coincide with those of certain parabolic-cylinder-type sets. The intersection percentage functions of the latter sets can be described as threefold integrals. It turns out that these intersection percentage functions simultaneously yield geometric representation formulae for the doubly noncentral g-generalized F-distributions. Hence, beyond new formulae for evaluating probabilities of correct classification, we obtain new geometric representation formulae for the doubly noncentral g-generalized F-distributions. A numerical study concerning several aspects of evaluating both probabilities of correct classification and values of the doubly noncentral g-generalized F-distributions demonstrates the advantageous computational properties of the present new approach. This impression is supported by comparison with the literature. It is shown that probabilities of correct classification depend on the parameters of the underlying sample distribution through a certain well-defined set of secondary parameters. If the underlying parameters are unknown, we propose to estimate the probabilities of correct classification.

9.
Feature selection consists of choosing a subset of available features that capture the relevant properties of the data. In supervised pattern classification, a good choice of features is fundamental for building compact and accurate classifiers. In this paper, we develop an efficient feature selection method using the zero-norm ℓ0 in the context of support vector machines (SVMs). The discontinuity of ℓ0 at the origin makes the corresponding optimization problem difficult to solve. To overcome this drawback, we use a robust DC (difference of convex functions) programming approach, a general framework for non-convex continuous optimization. We consider an appropriate continuous approximation to ℓ0 such that the resulting problem can be formulated as a DC program. Our DC algorithm (DCA) has finite convergence and requires solving one linear program at each iteration. Computational experiments on standard datasets, including challenging feature-selection problems from the NIPS 2003 feature selection challenge and gene selection for cancer classification, show that the proposed method is promising: while it suppresses more than 99% of the features, it still provides good classification. Moreover, the comparative results illustrate the superiority of the proposed approach over standard methods such as classical SVMs and feature selection concave (FSV).
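A rough DCA-style sketch under the assumption that the concave surrogate Σ(1 − exp(−a|w_j|)) replaces the zero-norm; DCA then reduces to iteratively re-weighted ℓ1-SVM linear programs. This is one standard DCA instance, not necessarily the paper's exact approximation.

```python
import numpy as np
from scipy.optimize import linprog

def dca_l0_svm(X, y, C=1.0, a=5.0, iters=5):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        weights = a * np.exp(-a * np.abs(w))    # DCA linearization of the concave part
        # variables: [w (p), v (p), b, xi (n)]; minimize weights'v + C*sum(xi)
        c = np.concatenate([np.zeros(p), weights, [0.0], C * np.ones(n)])
        A1 = np.hstack([-y[:, None] * X, np.zeros((n, p)), -y[:, None], -np.eye(n)])
        A2 = np.hstack([np.eye(p), -np.eye(p), np.zeros((p, 1)), np.zeros((p, n))])
        A3 = np.hstack([-np.eye(p), -np.eye(p), np.zeros((p, 1)), np.zeros((p, n))])
        A = np.vstack([A1, A2, A3])
        b_ub = np.concatenate([-np.ones(n), np.zeros(2 * p)])
        bounds = [(None, None)] * p + [(0, None)] * p + [(None, None)] + [(0, None)] * n
        res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds, method="highs")
        w = res.x[:p]
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 20))
beta = np.zeros(20); beta[:3] = 2.0
y = np.sign(X @ beta + 0.1 * rng.normal(size=60))
w = dca_l0_svm(X, y)
print((np.abs(w) > 1e-6).sum(), "features kept")   # roughly the informative features
```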

10.
In this paper we consider categorical data that are distributed according to a multinomial, product-multinomial or Poisson distribution whose expected values follow a log-linear model and we study the inference problem of hypothesis testing in a log-linear model setting. The family of test statistics considered is based on the family of φ-divergence measures. The unknown parameters in the log-linear model under consideration are also estimated using φ-divergence measures: minimum φ-divergence estimators. A simulation study is included to find test statistics that offer an attractive alternative to the Pearson chi-square and likelihood-ratio test statistics.
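For illustration, the Cressie-Read power-divergence subfamily of φ-divergences is available directly in scipy; λ = 1 recovers the Pearson chi-square statistic and λ = 0 the likelihood-ratio statistic.

```python
import numpy as np
from scipy.stats import power_divergence

observed = np.array([28, 52, 20])
expected = np.array([25, 50, 25])
for lam, name in [(1, "Pearson"), (0, "likelihood ratio"), (2 / 3, "Cressie-Read")]:
    stat, pval = power_divergence(observed, expected, lambda_=lam)
    print(f"{name}: stat={stat:.3f}, p={pval:.3f}")
```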

11.
An exhaustive search, as required by traditional variable selection methods, is impractical in high-dimensional statistical modeling. Thus, to conduct variable selection, various forms of penalized estimators with good statistical and computational properties have been proposed during the past two decades. The attractive properties of these shrinkage and selection estimators, however, depend critically on the size of the regularization, which controls model complexity. In this paper, we consider the problem of consistent tuning parameter selection in high-dimensional sparse linear regression where the dimension of the predictor vector is larger than the sample size. First, we propose a family of high-dimensional Bayesian Information Criteria (HBIC), and investigate their selection consistency, extending the results for the extended Bayesian Information Criterion (EBIC) of Chen and Chen (2008) to ultra-high-dimensional situations. Second, we develop a two-step procedure, SIS+AENET, to conduct variable selection in p > n situations. The consistency of tuning parameter selection is established under fairly mild technical conditions. Simulation studies are presented to confirm the theoretical findings, and an empirical example illustrates the use of the method on internet advertising data.
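A sketch of criterion-based tuning along a lasso path. The specific penalty used here, log(RSS/n) + df · C_n · log(p)/n with C_n = log(log n), is an assumed member of such an HBIC-type family, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(6)
n, p = 100, 300
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.5
y = X @ beta + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
Cn = np.log(np.log(n))
hbic = []
for k in range(len(alphas)):
    w = coefs[:, k]
    rss = np.sum((y - X @ w) ** 2)
    df = np.count_nonzero(w)                    # model size along the path
    hbic.append(np.log(rss / n) + df * Cn * np.log(p) / n)
best = int(np.argmin(hbic))
print("alpha:", alphas[best], "selected:", np.count_nonzero(coefs[:, best]))
```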

12.
We focus on a well-known classification task with expert systems based on Bayesian networks: predicting the state of a target variable given an incomplete observation of the other variables in the network, i.e., an observation of a subset of all the possible variables. To provide conclusions robust to near-ignorance about the process that prevents some of the variables from being observed, a new rule, called conservative updating, has recently been derived. In this paper we address the problem of efficiently computing the conservative updating rule for robust classification with Bayesian networks. We first show that the general problem is NP-hard, thus establishing a fundamental limit on the possibility of doing robust classification efficiently. We then define a wide subclass of Bayesian networks that does admit efficient computation, by developing a new classification algorithm for this class which substantially extends the limits of efficient computation relative to the previously existing algorithm. The algorithm is formulated as a variable elimination procedure whose computation time is linear in the input size.
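A toy sketch of the underlying task, classifying a target variable from a partial observation, done by brute-force enumeration on a three-node chain; the conservative updating rule and the linear-time elimination scheme are not reproduced.

```python
import itertools

P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # keyed (b, a)
P_C_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # keyed (c, b)

def posterior_C(evidence):
    """evidence: dict possibly fixing 'A' and/or 'B'; unobserved nodes are summed out."""
    scores = {0: 0.0, 1: 0.0}
    for a, b, c in itertools.product([0, 1], repeat=3):
        if any(evidence.get(v) not in (None, val) for v, val in [("A", a), ("B", b)]):
            continue
        scores[c] += P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]
    z = scores[0] + scores[1]
    return {c: s / z for c, s in scores.items()}

print(posterior_C({"A": 1}))   # B is missing from the observation
```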

13.
The ratio of the largest eigenvalue to the trace of a p×p random Wishart matrix with n degrees of freedom and an identity covariance matrix plays an important role in various hypothesis testing problems, both in statistics and in signal processing. In this paper we derive an approximate explicit expression for the distribution of this ratio, by considering the joint limit as both p, n → ∞ with p/n → c. Our analysis reveals that even though asymptotically in this limit the ratio follows a Tracy-Widom (TW) distribution, one of the leading error terms depends on the second derivative of the TW distribution, and is non-negligible for practical values of p, in particular for determining tail probabilities. We thus propose to explicitly include this term in the approximate distribution for the ratio. We illustrate empirically, using simulations, that adding this term to the TW distribution yields a quite accurate approximation to the empirical distribution of the ratio, even for small values of p and n.
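The explicit TW-based approximation is not reproduced here; the Monte Carlo sketch below simply generates the statistic itself, the largest eigenvalue of a white Wishart matrix over its trace, against which any candidate approximation can be checked.

```python
import numpy as np

def eig_ratio_samples(p, n, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for r in range(reps):
        X = rng.normal(size=(n, p))
        eig = np.linalg.eigvalsh(X.T @ X)      # Wishart(n, I_p) eigenvalues, ascending
        out[r] = eig[-1] / eig.sum()           # largest eigenvalue over trace
    return out

samples = eig_ratio_samples(p=20, n=100)
print(samples.mean(), np.quantile(samples, 0.95))
```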

14.
This paper deals with the problem of quantization of a random variable X taking values in a separable and reflexive Banach space, and with the related question of clustering independent random observations distributed as X. To this end, we use a quantization scheme with a class of distortion measures called Bregman divergences, and provide conditions ensuring the existence of an optimal quantizer and an empirically optimal quantizer. Rates of convergence are also discussed.
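A sketch of Bregman hard clustering in the style of Lloyd's algorithm, assuming the generalized KL divergence as the distortion; for any Bregman divergence the optimal cluster representative remains the arithmetic mean.

```python
import numpy as np

def gen_kl(x, c):
    """Generalized KL (I-)divergence between positive vectors x and c."""
    return np.sum(x * np.log(x / c) - x + c, axis=-1)

def bregman_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.stack([gen_kl(X, c) for c in centres], axis=1), axis=1)
        centres = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centres

rng = np.random.default_rng(7)
X = np.vstack([rng.gamma(2.0, 1.0, (50, 4)), rng.gamma(8.0, 1.0, (50, 4))])
labels, _ = bregman_kmeans(X, k=2)
print(np.bincount(labels))
```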

15.
A general depth measure, based on the use of one-dimensional linear continuous projections, is proposed. The applicability of this idea in different statistical setups (including inference in functional data analysis, image analysis and classification) is discussed. Special emphasis is placed on the possible usefulness of this method in statistical problems where the data are elements of a Banach space. The asymptotic properties of the empirical approximation of the proposed depth measure are investigated. In particular, its asymptotic distribution is obtained through U-statistics techniques. The practical aspects of these ideas are discussed through a small simulation study and a real-data example.
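A Monte Carlo sketch of a projection-based depth, assuming a random-direction approximation with a worst-case (halfspace-style) one-dimensional depth; the paper's measure may aggregate over projections differently.

```python
import numpy as np

def projection_depth(x, sample, n_dirs=500, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, sample.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
    proj_s = sample @ U.T                           # (n, n_dirs)
    proj_x = U @ x                                  # (n_dirs,)
    frac_below = (proj_s <= proj_x).mean(axis=0)
    return np.min(np.minimum(frac_below, 1 - frac_below))

rng = np.random.default_rng(8)
data = rng.normal(size=(200, 5))
print(projection_depth(np.zeros(5), data))      # deep central point
print(projection_depth(np.full(5, 3.0), data))  # shallow outlying point
```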

16.
In this article, we propose a new estimation methodology for PCA with high-dimension, low-sample-size (HDLSS) data. We first show that HDLSS datasets have different geometric representations depending on whether a ρ-mixing-type dependency appears in the variables or not. When such a dependency appears, the HDLSS data converge to an n-dimensional surface of the unit sphere as the dimension increases. We pay special attention to this phenomenon. We propose a method called the noise-reduction methodology to estimate the eigenvalues of an HDLSS dataset, and we show that the eigenvalue estimator enjoys consistency properties along with a limiting distribution in the HDLSS context. We also consider consistency properties of the PC directions, apply the noise-reduction methodology to estimating PC scores, and give an application to discriminant analysis for HDLSS datasets using the inverse covariance matrix estimator induced by the noise-reduction methodology.
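A sketch of the noise-reduction idea via the n × n dual covariance matrix; the correction shown, subtracting the average of the trailing (noise) eigenvalues, follows the form usually reported for this methodology but should be checked against the paper.

```python
import numpy as np

def noise_reduced_eigenvalues(X, m):
    """X: n x p data with n << p; return first m corrected eigenvalues."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    S_dual = Xc @ Xc.T / (n - 1)               # n x n dual of the covariance
    lam = np.sort(np.linalg.eigvalsh(S_dual))[::-1]
    out = []
    for j in range(m):
        noise = (lam.sum() - lam[: j + 1].sum()) / (n - 1 - (j + 1))
        out.append(lam[j] - noise)
    return np.array(out)

rng = np.random.default_rng(9)
n, p = 20, 1000
V = np.linalg.qr(rng.normal(size=(p, 2)))[0]          # two orthonormal spike directions
scores = rng.normal(size=(n, 2)) * np.sqrt([30.0, 10.0])
X = scores @ V.T + rng.normal(size=(n, p))
print(noise_reduced_eigenvalues(X, m=2))              # roughly [31, 11]
```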

17.
Homogeneity tests based on several progressively Type-II censored samples
In this paper, we discuss the problem of testing the homogeneity of several populations when the available data are progressively Type-II censored. Defining a univariate counting process for each sample, we can adapt all the methods developed during the last two decades (see e.g. [P.K. Andersen, Ø. Borgan, R. Gill, N. Keiding, Statistical Models Based on Counting Processes, Springer, New York, 1993]) for use on this problem. An important aspect of these tests is that they are based on either linear or non-linear functionals of a discrepancy process (DP) built from the comparison of the cumulative hazard rate (chr) estimated from each sample with the chr estimated from the whole sample (viz., the aggregation of all the samples), leading to either linear tests or non-linear tests. Both kinds of tests suffer from some serious drawbacks; for example, it is difficult to extend non-linear tests to the K-sample situation when K ≥ 3. For this reason, we propose a new class of non-linear tests, based on a chi-square-type functional of the DP, that can be applied to the K-sample problem for any K ≥ 2.
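A much-simplified sketch of the discrepancy-process idea using ordinary right-censored data: a Nelson-Aalen cumulative hazard per sample is compared with the pooled estimate through a sup-type functional. Progressive Type-II censoring and the paper's exact functionals are not handled.

```python
import numpy as np

def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard, linearly interpolated onto `grid`."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    at_risk = len(t) - np.arange(len(t))          # everyone with time >= t_i
    cumhaz = np.cumsum(np.where(d == 1, 1.0 / at_risk, 0.0))
    return np.interp(grid, t, cumhaz, left=0.0)

rng = np.random.default_rng(10)
def make_sample(n):
    t = rng.exponential(1.0, n)
    e = (rng.uniform(size=n) < 0.8).astype(int)   # crudely mark ~20% as censored
    return t, e

samples = [make_sample(40), make_sample(40)]
pooled_t = np.concatenate([s[0] for s in samples])
pooled_e = np.concatenate([s[1] for s in samples])
grid = np.linspace(0.05, 2.0, 100)
pooled = nelson_aalen(pooled_t, pooled_e, grid)
stat = max(np.max(np.abs(nelson_aalen(t, e, grid) - pooled)) for t, e in samples)
print("sup-discrepancy:", stat)
```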

18.
In this paper, we are interested in the calculation of the Haezendonck-Goovaerts risk measure, which is defined via a convex Young function and a parameter q ∈ (0,1) representing the confidence level. We mainly focus on the case in which the risk variable follows a distribution function from a max-domain of attraction. For this case, we restrict the Young function to be a power function and derive exact asymptotics for the Haezendonck-Goovaerts risk measure as q → 1. As a subsidiary result, we also consider the case of an exponentially distributed risk variable and a general Young function, and obtain an analytical expression for the Haezendonck-Goovaerts risk measure.
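A numerical sketch for a power Young function φ(t) = t^k on a sample: for each candidate x the scale h solves mean(((X − x)₊/h)^k) = 1 − q in closed form, and the risk measure is the minimum of x + h(x).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def hg_risk(sample, q, k=2.0):
    def objective(x):
        excess = np.maximum(sample - x, 0.0)
        h = (np.mean(excess ** k) / (1 - q)) ** (1 / k)   # closed form for phi(t)=t^k
        return x + h
    res = minimize_scalar(objective, bounds=(np.min(sample), np.max(sample)),
                          method="bounded")
    return res.fun

rng = np.random.default_rng(11)
losses = rng.pareto(3.0, 100_000) + 1.0        # heavy-tailed risk variable
for q in (0.9, 0.99):
    print(q, hg_risk(losses, q))               # grows as q approaches 1
```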

19.
Given a random sample from a continuous variable, it is observed that the copula linking any pair of order statistics is independent of the parent distribution. To compare the degree of association between two such pairs of ordered random variables, a notion of relative monotone regression dependence (or stochastic increasingness) is considered. Using this concept, it is proved that for i<j, the dependence of the jth order statistic on the ith order statistic decreases as i and j draw apart. This extends earlier results of Tukey (Ann. Math. Statist. 29 (1958) 588) and Kim and David (J. Statist. Plann. Inference 24 (1990) 363). The effect of the sample size on this type of dependence is also investigated, and an explicit expression is given for the population value of Kendall's coefficient of concordance between two arbitrary order statistics of a random sample.
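A quick empirical check of the distribution-free claim: Kendall's tau between the ith and jth order statistics, estimated over repeated samples, comes out essentially the same for two different parent distributions.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_order_stats(sampler, n, i, j, reps=5000, seed=0):
    rng = np.random.default_rng(seed)
    draws = np.sort(sampler(rng, (reps, n)), axis=1)   # rows are sorted samples
    return kendalltau(draws[:, i - 1], draws[:, j - 1])[0]

n, i, j = 10, 2, 8
print(tau_order_stats(lambda r, s: r.normal(size=s), n, i, j))
print(tau_order_stats(lambda r, s: r.exponential(size=s), n, i, j))  # ~ same value
```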

20.
This paper offers a comparative appraisal of four fuzzy classification methods: Fuzzy C-Means, K-Nearest Neighbours, a method based on fuzzy rules, and the Fuzzy Pattern Matching method. It presents the results obtained by applying these methods to the three types of data described in the second part of the article. The classification rate and the computing time are compared across methods. The paper describes the advantages of fuzzy classifiers for application to a diagnosis problem, and concludes with a synthesis of the study that can serve as a basis for choosing an algorithm for real-time process diagnosis. It also shows how unsupervised and supervised methods can be combined in a diagnosis algorithm.
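A compact sketch of Fuzzy C-Means (fuzzifier m = 2), the first of the four classifiers compared; the other three methods are not reproduced.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))           # membership matrix
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return U, centres

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
U, centres = fuzzy_c_means(X, k=2)
print(centres.round(2))   # near (0, 0) and (4, 4)
```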
