Similar Documents (20 results found)
1.
Statistical methods of discrimination and classification are used for the prediction of protein structure from amino acid sequence data. This provides information for establishing new paradigms of carcinogenesis modeling on the basis of gene expression. Feed-forward neural networks and standard statistical classification procedures are used to classify proteins into fold classes. Logistic regression, additive models, and projection pursuit regression (from the family of methods based on a posteriori probabilities); linear, quadratic, and flexible discriminant analysis (from the class of methods based on class-conditional probabilities); and the nearest-neighbor classification rule are applied to a data set of 268 sequences. From the prediction error obtained with a test sample (n = 125) and with a cross-validation procedure, we conclude that standard linear discriminant analysis and nearest-neighbor methods are both statistically feasible and potent competitors to the more flexible feed-forward neural networks. Further research is needed to explore the gain obtainable from statistical methods by applying them to larger sets of protein sequence data and to compare the results with those from biophysical approaches.
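As a rough illustration of the kind of comparison described above (a minimal sketch, not the authors' code: synthetic data of the same size stand in for the protein feature vectors, and only LDA and a 1-nearest-neighbor rule are shown), the test-sample and cross-validation errors might be estimated as follows:

```python
# Minimal sketch: compare LDA and 1-NN by cross-validation and on a held-out
# test sample of size 125, mirroring the study's setup. Synthetic data stand in
# for the 268 protein sequence feature vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=268, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=125, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("1-NN", KNeighborsClassifier(n_neighbors=1))]:
    cv_error = 1 - cross_val_score(clf, X_tr, y_tr, cv=5).mean()
    test_error = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: CV error {cv_error:.3f}, test error {test_error:.3f}")
```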

2.
This article introduces a classification tree algorithm that can simultaneously reduce tree size, improve class prediction, and enhance data visualization. We accomplish this by fitting a bivariate linear discriminant model to the data in each node. Standard algorithms can produce fairly large tree structures because they employ a very simple node model, wherein the entire partition associated with a node is assigned to one class. We reduce the size of our trees by letting the discriminant models share part of the data complexity. Being themselves classifiers, the discriminant models can also help to improve prediction accuracy. Finally, because the discriminant models use only two predictor variables at a time, their effects are easily visualized by means of two-dimensional plots. Our algorithm does not simply fit discriminant models to the terminal nodes of a pruned tree, as this does not reduce the size of the tree. Instead, discriminant modeling is carried out in all phases of tree growth and the misclassification costs of the node models are explicitly used to prune the tree. Our algorithm is also distinct from the “linear combination split” algorithms that partition the data space with arbitrarily oriented hyperplanes. We use axis-orthogonal splits to preserve the interpretability of the tree structures. An extensive empirical study with real datasets shows that, in general, our algorithm has better prediction power than many other tree or nontree algorithms.
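A heavily simplified sketch of the node-model idea (assumptions: one fixed axis-orthogonal split and an arbitrary choice of two predictors per node; the article's variable selection, full tree growth, and cost-based pruning are omitted):

```python
# Illustrative sketch only (not the article's algorithm): one axis-orthogonal
# split, then a bivariate LDA fitted in each child node, showing how a
# discriminant node model can absorb part of the data complexity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=1)
split_var, split_val = 0, np.median(X[:, 0])        # a simple axis-orthogonal split
node_models = {}
for side, mask in [("left", X[:, split_var] <= split_val),
                   ("right", X[:, split_var] > split_val)]:
    # bivariate node model: LDA on two predictors (chosen here arbitrarily)
    node_models[side] = LinearDiscriminantAnalysis().fit(X[mask][:, 1:3], y[mask])

def predict(x):
    side = "left" if x[split_var] <= split_val else "right"
    return node_models[side].predict(x[1:3].reshape(1, -1))[0]

print(predict(X[0]), y[0])
```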

3.
The quadratic discriminant function is often used to separate two classes of points in a multidimensional space. When the two classes are normally distributed, this results in the optimum separation. In some cases, however, the assumption of normality is a poor one and the classification error is increased. The current paper derives an upper bound for the classification error due to a quadratic decision surface. The bound is strict when the class means and covariances and the quadratic discriminant surface satisfy certain specified symmetry conditions.
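For reference, the textbook Gaussian quadratic discriminant function that defines such a decision surface can be written down directly from the class means and covariances (illustrative parameter values; this is the standard form, not the paper's error bound):

```python
# Sketch of a quadratic decision surface between two classes, using the
# standard Gaussian quadratic discriminant g(x): g(x) > 0 assigns x to class 1,
# and g(x) equals the log-likelihood ratio under normality.
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, -0.4], [-0.4, 1.5]])

def quad_discriminant(x):
    d1, d2 = x - mu1, x - mu2
    return (-0.5 * d1 @ np.linalg.solve(S1, d1)
            + 0.5 * d2 @ np.linalg.solve(S2, d2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)))

print(quad_discriminant(np.array([1.0, 0.5])))
```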

4.
When clustering multivariate observations that follow a mixture of Gaussian distributions, projections of the observations onto a lower-dimensional linear subspace, called the discriminant space (DS), frequently contain all the statistical information about the cluster structure of the model. In this case, the reduction of data dimensionality substantially facilitates the solution of various classification problems. In this paper, attention is devoted to statistical testing of hypotheses about the DS and its dimension. The characterization of the DS and methods for its identification are also briefly discussed.

5.
This paper studies how observations in the training sample affect the misclassification probability of a quadratic discriminant rule. An approach based on partial influence functions is followed, which allows one to quantify the effect of observations in the training sample on the performance of the associated classification rule. The focus is on the effect of outliers on the misclassification rate, rather than on the estimates of the parameters of the quadratic discriminant rule. The expression for the partial influence function is then used to construct a diagnostic tool for detecting influential observations. Applications to real data sets are provided.

6.
The problem of classification of a multivariate observation X drawn from a mixture of Gaussian distributions is considered. A linear subspace of the least dimension containing all information about the cluster structure of X is called a discriminant space (DS). Estimation of the DS is based on characterizations of the DS via projection pursuit with an appropriate projection index. An estimator of the DS is obtained simply by applying projection pursuit with the projection index replaced by its nonparametric estimator. We discuss the asymptotic behavior of the estimator obtained in this way.

7.
This paper presents an application of knowledge discovery via rough sets to a real-life case study of global investing risk in 52 countries using 27 indicator variables. The aim is explanation of the classification of the countries according to financial risks assessed by Wall Street Journal international experts, and knowledge discovery from data via decision-rule mining, rather than prediction; that is, to capture the explicit or implicit knowledge or policy of international financial experts rather than to predict the actual classifications. Suggestions are made about the most significant attributes for each risk class and country, as well as the minimal set of decision rules needed. Our results compared favorably with those from discriminant analysis and several variations of preference disaggregation MCDA procedures. The same approach could be adapted to other problems with missing data in data mining and knowledge extraction, and to different multi-criteria decision problems such as sorting, choice, and ranking.

8.
In developing a classification model for assigning observations of unknown class to one of a number of specified classes using the values of a set of features associated with each observation, it is often desirable to base the classifier on a limited number of features. Mathematical programming discriminant analysis methods for developing classification models can be extended for feature selection. Classification accuracy can be used as the feature selection criterion by using a mixed integer programming (MIP) model in which a binary variable is associated with each training sample observation, but the binary variable requirements limit the size of problems to which this approach can be applied. Heuristic feature selection methods for problems with large numbers of observations are developed in this paper. These heuristic procedures, which are based on the MIP model for maximizing classification accuracy, are then applied to three credit scoring data sets.
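A textbook-style statement of the accuracy-maximization MIP that such heuristics build on (a sketch with assumed notation, not the paper's exact model): each binary variable z_i flags whether observation i may be misclassified by a linear rule with weights w and cut-off c.

```latex
% Sketch of an MCA-type mixed integer program (notation assumed):
\begin{align*}
\min_{w,\,c,\,z}\;& \sum_{i=1}^{n} z_i \\
\text{s.t.}\;& w^{\top} x_i \ge c + \varepsilon - M z_i, && i \in G_1,\\
             & w^{\top} x_i \le c - \varepsilon + M z_i, && i \in G_2,\\
             & z_i \in \{0,1\}, && i = 1,\dots,n,
\end{align*}
% where M is a large constant and \varepsilon > 0 a small separation tolerance.
```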

9.
The relationship between canonical correlation and classification accuracy in linear discriminant analysis is explored mathematically. The discriminant score is assumed to follow a uniform distribution on the interval (0, 1]. This distribution is used as a reference distribution to extract the minimum correlation required for a given classification accuracy. Four different cases are analyzed. First, the case of equal group sizes is considered for an overall accuracy of 100%. Second, the results are generalized to unequal group sizes. Third, the existence of discordant observations is allowed. Fourth, the effect of concentration is analyzed for the first case. The results are demonstrated by numerical examples. In addition, a sample of 2092 default and 63,072 non-default Finnish firms is used to empirically illustrate the results in the context of failure prediction. The results show that the group size of default firms, the number of discordant observations, and bipolar concentration of observations strongly affect both canonical correlation and classification accuracy.
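In the two-group case, the canonical correlation coincides (up to sign) with the Pearson correlation between the linear discriminant score and the group indicator, so the two quantities studied above can be computed side by side. A small simulated sketch (illustrative only; the Finnish-firm data are not reproduced here):

```python
# Sketch: canonical correlation (as the correlation between the LDA score and
# the group indicator) versus classification accuracy on simulated data with
# strongly unequal group sizes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           weights=[0.97, 0.03], random_state=0)  # unequal groups
lda = LinearDiscriminantAnalysis().fit(X, y)
score = lda.decision_function(X)
canon_corr = np.corrcoef(score, y)[0, 1]
accuracy = lda.score(X, y)
print(f"canonical correlation {canon_corr:.3f}, accuracy {accuracy:.3f}")
```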

10.
The current study provides a simple algorithm for finding the optimal ROC curve for a linear discriminant between two point distributions, given only information about the classes' means and covariances. The method makes no assumptions concerning the exact type of distribution and is shown to provide the best possible discrimination for any physically reasonable measure of the classification error. This very general solution is shown to specialise to results obtained in other papers which assumed multi-dimensional Gaussian distributed classes or minimised the maximum classification error. Some numerical examples are provided which show the improvement in classification achieved by this method over previously used methods.
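As a concrete special case of this setting (a sketch under an added Gaussian assumption, which the paper itself does not require), the ROC curve of a linear discriminant direction w follows directly from the score means w'mu_i and variances w'Sigma_i w:

```python
# Sketch of the Gaussian special case: trace the ROC curve of a linear
# discriminant score by sweeping the threshold, using only class means and
# covariances. Parameter values are illustrative.
import numpy as np
from scipy.stats import norm

mu0, mu1 = np.array([0.0, 0.0]), np.array([1.5, 1.0])
S0 = np.eye(2)
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
w = np.linalg.solve(0.5 * (S0 + S1), mu1 - mu0)    # a Fisher-type direction

m0, s0 = w @ mu0, np.sqrt(w @ S0 @ w)
m1, s1 = w @ mu1, np.sqrt(w @ S1 @ w)
t = np.linspace(m0 - 4 * s0, m1 + 4 * s1, 400)     # thresholds on the score
fpr = norm.sf(t, loc=m0, scale=s0)                 # false positive rate
tpr = norm.sf(t, loc=m1, scale=s1)                 # true positive rate
fpr, tpr = fpr[::-1], tpr[::-1]                    # sort by increasing FPR
auc = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)   # trapezoidal AUC
print(f"AUC under the Gaussian special case: {auc:.3f}")
```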

11.
The concept of a quadratic subspace is introduced as a helpful tool for dimension reduction in quadratic discriminant analysis (QDA). It is argued that an adequate representation of the quadratic subspace may lead to better methods for both data representation and classification. Several theoretical results describe the structure of the quadratic subspace, which is shown to contain some of the subspaces previously proposed in the literature for finding differences between the class means and covariances. A suitable assumption of orthogonality between the location and dispersion subspaces allows us to derive a convenient reduced version of the full QDA rule. The behavior of these ideas in practice is illustrated with three real data examples.

12.
The use of definitive screening designs (DSDs) has been increasing since their introduction in 2011. These designs are used to screen factors and to make predictions. We assert that the choice of analysis method for these designs depends on the goal of the experiment: screening or prediction. In this work, we present simulation results to address both the explanatory (screening) use and the predictive use of DSDs. To address the predictive ability of DSDs, we use two case studies in which 5-factor DSDs and central composite designs are run simultaneously, and on these we compare several common analysis methods. Overall, we find that for screening purposes the Dantzig selector using the Bayesian information criterion (BIC) is a good analysis choice; however, when the goal of the analysis is prediction, forward selection using the BIC produces models with a lower mean squared prediction error.
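A minimal sketch of forward selection by BIC, the predictive analysis the study recommends (illustrative code, not the authors' implementation; the simulated response and design below are placeholders):

```python
# Forward selection using BIC: at each step, add the candidate factor that most
# reduces BIC; stop when no candidate improves it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))                       # e.g. a 5-factor design matrix
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.standard_normal(30)

selected, remaining = [], list(range(X.shape[1]))
best_bic = sm.OLS(y, np.ones((len(y), 1))).fit().bic   # intercept-only model
while remaining:
    bics = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().bic
            for j in remaining}
    j_best = min(bics, key=bics.get)
    if bics[j_best] >= best_bic:
        break                                          # no BIC improvement: stop
    best_bic = bics[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print("selected factors:", selected)
```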

13.
Application of Bayes discriminant analysis to medical data processing
Using the basic principles and methods of discriminant analysis, this paper builds a mathematical model for clinical data on liver cirrhosis and then solves the model with SPSS 16.0, obtaining three meaningful discriminant functions capable of assigning cases to classes.
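The study itself used SPSS 16.0; purely as an illustrative analogue, a Bayes (linear) discriminant rule with class priors can be fitted in Python roughly as follows (synthetic data stand in for the cirrhosis records, which are not available here):

```python
# Illustrative analogue only: a Bayes (linear) discriminant rule with equal
# class priors; the posterior class probabilities play the role of the
# discriminant functions used for assignment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
lda = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3]).fit(X, y)
print(lda.predict_proba(X[:3]).round(3))
```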

14.
We consider the problem of estimating the discriminant coefficients η = Σ^{-1}(μ^{(1)} − μ^{(2)}) based on two independent normal samples from N_p(μ^{(1)}, Σ) and N_p(μ^{(2)}, Σ). We are concerned with the estimation of η as the gradient of the log-odds between two extreme situations. A decision-theoretic approach is taken with the quadratic loss function. We derive an unbiased estimator of the essential part of the risk which is applicable to general estimators. We propose two types of new estimators and prove their dominance over the traditional estimator using this unbiased estimator.
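The traditional estimator that the proposed estimators are shown to dominate is the plug-in rule using sample means and a pooled covariance; a minimal sketch (illustrative sample sizes and parameters):

```python
# Plug-in estimator of eta = Sigma^{-1}(mu1 - mu2) from two samples with a
# pooled covariance matrix (the baseline the paper improves on).
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 4, 40, 50
X1 = rng.multivariate_normal(np.zeros(p), np.eye(p), n1)
X2 = rng.multivariate_normal(np.ones(p), np.eye(p), n2)

S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
            + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
eta_hat = np.linalg.solve(S_pooled, X1.mean(axis=0) - X2.mean(axis=0))
print(eta_hat)
```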

15.
In this article, the Stein-Haff identity is established for a singular Wishart distribution with a positive definite mean matrix but with the dimension larger than the degrees of freedom. This identity is then used to obtain estimators of the precision matrix improving on the estimator based on the Moore-Penrose inverse of the Wishart matrix under the Efron-Morris loss function and its variants. Ridge-type empirical Bayes estimators of the precision matrix are also given and their dominance properties over the usual one are shown using this identity. Finally, these precision estimators are used in a quadratic discriminant rule, and it is shown through simulation that discriminant methods based on the ridge-type empirical Bayes estimators provide higher correct classification rates.
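A bare-bones sketch of the two kinds of precision estimates being compared when the dimension exceeds the degrees of freedom (the paper's Efron-Morris-loss analysis and empirical Bayes ridge constants are far more refined than the fixed ridge constant assumed here):

```python
# Sketch: with dimension p larger than sample size n, the sample covariance is
# singular; compare a Moore-Penrose-based precision estimate with a simple
# ridge-type one, both usable in a quadratic discriminant score.
import numpy as np

rng = np.random.default_rng(0)
p, n = 30, 20
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)                      # singular: rank at most n - 1

prec_mp = np.linalg.pinv(S)                      # Moore-Penrose-based estimator
lam = 0.1                                        # illustrative ridge constant
prec_ridge = np.linalg.inv(S + lam * np.eye(p))  # ridge-type estimator

x = rng.standard_normal(p)
print(x @ prec_mp @ x, x @ prec_ridge @ x)       # quadratic forms used in QDA scores
```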

16.
A new algorithm for ordered discriminant analysis and its application
Discriminant analysis builds a model from data with known class labels in order to classify data with unknown labels; in the standard setting, neither the data nor the classes carry an ordering. To apply discriminant analysis to data that are both ordered and periodic, new methods of ordered discrimination must be explored: the classes should be ordered, and the method must be able to eliminate the interference caused by the periodicity of the underlying process. This paper presents the principles, modeling workflow, application workflow, and application examples of a new method for ordered discriminant analysis of multivariate data. The method separates classification modeling from discriminant assignment: when modeling multivariate data, it builds multiple sliding sub-models within the overall multi-class model; in application, domain knowledge is used to make a preliminary estimate of a sample's class range, and the program then selects the relevant sub-model for discriminant classification. This approach solves the problem of reversed sample classifications caused by the periodicity of multivariate time-series data, opens a new path for the classification and prediction of time-series data, and has achieved good results on difficult practical problems.

17.
Research on mathematical programming approaches to the classification problem has focused almost exclusively on linear discriminant functions with only first-order terms. While many of these first-order models have displayed excellent classificatory performance when compared to Fisher's linear discriminant method, they cannot compete with Smith's quadratic discriminant method on certain data sets. In this paper, we investigate the appropriateness of including second-order terms in mathematical programming models. Various issues are addressed, such as the performance of models with small to moderate sample sizes, the need for cross-product terms, and the loss of power by the mathematical programming models under conditions ideal for the parametric procedures. A simulation study is conducted to assess the relative performance of first-order and second-order mathematical programming models against the parametric procedures. The simulation study indicates that mathematical programming models using polynomial functions may be prone to overfitting on the training samples, which in turn may cause rather poor fits on the validation samples. It also indicates that inclusion of cross-product terms may hurt a polynomial model's accuracy on the validation samples, although omitting them means that the model is not invariant to nonsingular transformations of the data.
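For concreteness, the full set of second-order terms (squares and cross-products) for a small design matrix can be generated as follows (an illustrative sketch only; the squares-only variant simply drops the cross-product column):

```python
# Sketch: second-order expansion of two predictors x1, x2 into
# x1, x2, x1^2, x1*x2, x2^2 (cross-product term included).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly.fit_transform(X)
print(poly.get_feature_names_out(["x1", "x2"]))
print(X2)
```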

18.
Mathematical programming (MP) discriminant analysis models are widely used to generate linear discriminant functions that can be adopted as classification models. Nonlinear classification models may have better classification performance than linear classifiers, but although MP methods can be used to generate nonlinear discriminant functions, functions of specified form must be evaluated separately. Piecewise-linear functions can approximate nonlinear functions, and two new MP methods for generating piecewise-linear discriminant functions are developed in this paper. The first method uses maximization of classification accuracy (MCA) as the objective, while the second uses an approach based on minimization of the sum of deviations (MSD). The use of these new MP models is illustrated in an application to a test problem and the results are compared with those from standard MCA and MSD models.
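For reference, the classical single-piece MSD model that the piecewise-linear version generalizes can be stated as a linear program (a sketch with assumed notation; the paper's piecewise-linear constraints are not reproduced here):

```latex
% Sketch of the classical MSD linear program (notation assumed):
% d_i are nonnegative deviations of misclassified points from the cut-off c.
\begin{align*}
\min_{w,\,c,\,d}\;& \sum_{i=1}^{n} d_i \\
\text{s.t.}\;& w^{\top} x_i + d_i \ge c, && i \in G_1,\\
             & w^{\top} x_i - d_i \le c, && i \in G_2,\\
             & d_i \ge 0, && i = 1,\dots,n,
\end{align*}
% plus a normalisation constraint to rule out the trivial solution w = 0, c = 0.
```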

19.
Credit scoring is a risk-evaluation task considered a critical decision for financial institutions, which need to avoid wrong decisions that may result in huge losses. Classification models are one of the most widely used groups of data mining approaches and greatly help decision makers and managers reduce the credit risk of granting credit to customers, rather than relying on intuitive experience or portfolio management alone. Accuracy is one of the most important criteria for choosing a credit-scoring model, and hence research aimed at improving the effectiveness of credit-scoring models has never stopped. In this article, a hybrid binary classification model, namely FMLP, is proposed for credit scoring, based on the basic concepts of fuzzy logic and artificial neural networks (ANNs). In the proposed model, fuzzy numbers are used instead of the crisp weights and biases of traditional multilayer perceptrons (MLPs) in order to better model the uncertainties and complexities in financial data sets. Empirical results on three well-known benchmark credit data sets indicate that the proposed hybrid model outperforms its components as well as other classification models such as support vector machines (SVMs), K-nearest neighbor (KNN), quadratic discriminant analysis (QDA), and linear discriminant analysis (LDA). Therefore, it can be concluded that the proposed model can be an appropriate alternative tool for financial binary classification problems, especially under high uncertainty. © 2013 Wiley Periodicals, Inc. Complexity 18: 46–57, 2013
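The fuzzy-weight FMLP itself is not reproduced here, but the benchmark classifiers it is compared against can be evaluated on a synthetic, illustrative credit-style data set along these lines:

```python
# Illustrative baseline comparison only: cross-validated accuracy of the
# benchmark classifiers mentioned in the abstract on a synthetic binary
# (good/bad credit) data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)
for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis()),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```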

20.
Mathematical programming (MP) discriminant analysis models can be used to develop classification models for assigning observations of unknown class membership to one of a number of specified classes using values of a set of features associated with each observation. Since most MP discriminant analysis models generate linear discriminant functions, these MP models are generally used to develop linear classification models. Nonlinear classifiers may, however, have better classification performance than linear classifiers. In this paper, a mixed integer programming model is developed to generate nonlinear discriminant functions composed of monotone piecewise-linear marginal utility functions for each feature and the cut-off value for class membership. It is also shown that this model can be extended for feature selection. The performance of this new MP model for two-group discriminant analysis is compared with statistical discriminant analysis and other MP discriminant analysis models using a real problem and a number of simulated problem sets.
