Similar Articles
20 similar articles found (search time: 31 ms).
1.
Unsupervised classification is a highly important task in machine learning. Although support vector machines (SVMs) have achieved great success in supervised classification, they are far less often used to classify unlabeled data, and SVM-based clustering suffers from several drawbacks: sensitivity to nonlinear kernels and random initializations, high computational cost, and poor suitability for imbalanced datasets. In this paper, to exploit the advantages of SVMs while overcoming these drawbacks, we propose a completely new two-stage unsupervised classification method that requires no initialization: a new unsupervised kernel-free quadratic surface SVM (QSSVM) model avoids the selection of kernels and related kernel parameters, and a golden-section algorithm then generates an appropriate classifier for balanced and imbalanced data. By studying certain properties of the proposed model, we develop a convergent decomposition algorithm that implements this non-convex QSSVM model effectively and efficiently (in terms of computational cost). Numerical tests on artificial and public benchmark data indicate that the proposed unsupervised QSSVM method outperforms well-known clustering methods (including SVM-based and other state-of-the-art methods), particularly in classification accuracy. Moreover, we extend and apply the proposed method to credit risk assessment by incorporating t-test-based feature weights. Promising numerical results on benchmark personal credit data and real-world corporate credit data demonstrate the effectiveness, efficiency, and interpretability of the proposed method, and indicate its significant potential in real-world applications.
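The kernel-free idea above replaces the kernel trick with an explicit quadratic decision surface f(x) = ½xᵀWx + bᵀx + c, so no kernel or kernel parameter has to be chosen. The following minimal numpy sketch only illustrates how such a surface separates points; the parameters W, b, c are set by hand here, whereas the paper learns them from unlabeled data via its decomposition algorithm.

```python
import numpy as np

def quadratic_surface_decision(X, W, b, c):
    """Evaluate f(x) = 0.5 * x^T W x + b^T x + c for each row of X."""
    quad = 0.5 * np.einsum("ni,ij,nj->n", X, W, X)
    return quad + X @ b + c

# toy illustration: separate an inner blob from a ring with the circle x^T x = r^2
rng = np.random.default_rng(0)
inner = rng.normal(scale=0.5, size=(50, 2))
outer = rng.normal(size=(50, 2))
outer = 3.0 * outer / np.linalg.norm(outer, axis=1, keepdims=True)
X = np.vstack([inner, outer])

W = np.eye(2)          # hypothetical surface parameters; the paper learns
b = np.zeros(2)        # W, b, c from the unlabeled data instead
c = -0.5 * 2.0 ** 2
labels = (quadratic_surface_decision(X, W, b, c) > 0).astype(int)
print(np.bincount(labels))   # inner points land in one cluster, ring points in the other
```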

2.
In recent years, kernel-based methods have proved very successful for many real-world learning problems. One of the main reasons for this success is their efficiency on large data sets, which stems from the fact that kernel methods like support vector machines (SVMs) are based on a convex optimization problem. Solving a new learning problem can then often be reduced to choosing an appropriate kernel function and kernel parameters. However, even the most powerful kernel methods can fail on quite simple data sets when the feature space induced by the chosen kernel function is not sufficient. In such cases, an explicit feature space transformation or the detection of latent variables has proved more successful. Since explicit feature construction is often not feasible for large data sets, the ultimate goal of efficient kernel learning would be the adaptive creation of new and appropriate kernel functions. It cannot, however, be guaranteed that such a kernel function still leads to a convex optimization problem for support vector machines. Therefore, the optimization core of the learning method itself must be enhanced before it can be used with arbitrary, i.e., non-positive-semidefinite, kernel functions. This article motivates the use of appropriate feature spaces and discusses the consequences that lead to non-convex optimization problems. We show that these non-convex SVMs are at least as accurate as their quadratic programming counterparts on eight real-world benchmark data sets in terms of generalization performance, and that they always outperform traditional approaches in terms of the original optimization objective. Additionally, the proposed algorithm is more generic than existing solutions, since it also works for non-positive-semidefinite or indefinite kernel functions.
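Whether a kernel yields a convex SVM problem hinges on the Gram matrix being positive semidefinite. A quick way to probe this, sketched below, is to compute the smallest eigenvalue of the Gram matrix; the sigmoid (tanh) kernel used here is a classic example that is indefinite for many parameter settings.

```python
import numpy as np

def sigmoid_kernel(X, gamma=1.0, c=-1.0):
    """Gram matrix of the sigmoid kernel k(x, z) = tanh(gamma * <x, z> + c)."""
    return np.tanh(gamma * X @ X.T + c)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
K = sigmoid_kernel(X)
eigvals = np.linalg.eigvalsh(K)  # symmetric matrix -> real eigenvalues
print(f"smallest eigenvalue: {eigvals[0]:.4f}")
# a negative value means K is indefinite, so the standard SVM dual is no
# longer a convex QP and needs the kind of non-convex solver discussed above
```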

3.
Solving the classification problem for imbalanced data has far-reaching practical significance. The Mahalanobis-Taguchi System (MTS) builds a reference (benchmark) space and a measurement scale from the normal class alone and constructs its classification model on that basis, which makes it well suited to imbalanced classification problems. Building on the traditional MTS, this paper combines the signal-to-noise ratio with classification accuracy measures such as the F-value and G-mean to establish a genetic-algorithm-based optimization model for the reference space, and applies the Bagging ensemble algorithm to construct GBMTS, an improved MTS algorithm. Experimental analysis across different classification methods and the relevant data sets shows that GBMTS handles imbalanced data classification more effectively than the other algorithms.
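At the core of MTS is the Mahalanobis distance computed against statistics of the normal class only: an observation whose distance exceeds a threshold is flagged as abnormal. A minimal numpy sketch of that reference-space step follows; the GA-based reference-space optimization and Bagging from the abstract are omitted, and the threshold is an illustrative choice, not from the paper.

```python
import numpy as np

def fit_reference_space(X_normal):
    """Mean and inverse covariance of the normal class define the reference space."""
    mu = X_normal.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_normal, rowvar=False))
    return mu, cov_inv

def mahalanobis_distance(X, mu, cov_inv):
    d = X - mu
    # scaled squared Mahalanobis distance, as commonly used in MTS
    return np.einsum("ni,ij,nj->n", d, cov_inv, d) / X.shape[1]

rng = np.random.default_rng(2)
X_normal = rng.normal(size=(200, 4))                 # majority / normal class
X_test = np.vstack([rng.normal(size=(5, 4)),         # normal-like points
                    rng.normal(loc=4.0, size=(5, 4))])  # abnormal points

mu, cov_inv = fit_reference_space(X_normal)
md = mahalanobis_distance(X_test, mu, cov_inv)
print((md > 3.0).astype(int))  # 3.0 is an illustrative threshold
```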

4.
Credit scoring is a method of modelling the potential risk of credit applications. Traditionally, logistic regression and discriminant analysis are the most widely used approaches for creating scoring models in industry. However, these methods are associated with quite a few limitations, such as instability with high-dimensional data and small sample sizes, intensive variable selection effort, and an inability to handle non-linear features efficiently. Most importantly, with these algorithms it is difficult to automate the modelling process, and when population changes occur, the static models usually fail to adapt and may need to be rebuilt from scratch. In recent years, the kernel learning approach has been investigated to solve these problems. However, the existing applications of this type of method (in particular the SVM) in credit scoring have all focused on the batch model and have not addressed the important problem of how to update the scoring model on-line. This paper presents a novel and practical adaptive scoring system based on an incremental kernel method. With this approach, the scoring model is adjusted by an on-line update procedure that always converges to the optimal solution without information loss or numerical difficulties. Non-linear features in the data are automatically included in the model through a kernel transformation. The approach requires no variable reduction effort and is robust for scoring data with a large number of attributes and highly unbalanced class distributions. Moreover, a new potential kernel function is introduced to further improve the predictive performance of the scoring model, and a kernel attribute ranking technique is used that adds transparency to the final model. Experimental studies using real-world data sets have demonstrated the effectiveness of the proposed method.
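The paper's incremental kernel method updates the scoring model online without retraining from scratch. As a rough stand-in (not the authors' algorithm), the sketch below combines a kernel feature approximation with a linear model that supports incremental updates via `partial_fit`, so new applicant batches adjust the model as they arrive.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 12))                     # synthetic applicant features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)   # non-linear "default" rule

# fixed kernel map fitted once; the linear scorer is then updated online
kernel_map = Nystroem(kernel="rbf", gamma=0.2, n_components=100, random_state=0)
kernel_map.fit(X[:200])
scorer = SGDClassifier(loss="log_loss", random_state=0)

for start in range(0, len(X), 100):                 # batches arriving over time
    Xb = kernel_map.transform(X[start:start + 100])
    scorer.partial_fit(Xb, y[start:start + 100], classes=[0, 1])

print(scorer.score(kernel_map.transform(X), y))
```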

5.
Classification systems are very important for decision making and have attracted the attention of many researchers. Traditional classifiers tend to be either domain specific or to produce unsatisfactory results on classification problems with large and imbalanced data. Hence, genetic algorithms (GAs) have recently been combined with traditional classifiers to find useful knowledge for decision making. The main concerns with such GA-based systems, however, are limited coverage of the search space and the growth of computational cost with population size. In this paper, a rule-based knowledge discovery model is introduced that combines C4.5 (a decision-tree-based rule induction algorithm) with a new parallel genetic algorithm based on the idea of massive parallelism. The primary goal of the model is to produce a compact set of informative rules for any kind of classification problem. More specifically, the proposed model uses C4.5 as a base method to generate rules, which are then refined by the proposed parallel GA. The strength of the developed system is compared with pure C4.5 as well as with a hybrid system (C4.5 + sequential genetic algorithm) on six real-world benchmark data sets collected from the UCI (University of California at Irvine) machine learning repository. Experiments on these data sets validate the effectiveness of the new model, and the results especially indicate that the model is powerful for voluminous data sets.
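The refinement step can be pictured as a GA searching over subsets of induced rules for a compact set that still classifies well. The sketch below is a hedged toy: the candidate rules, fitness, and GA settings are invented for illustration, and the GA is sequential rather than parallel. A rule subset is encoded as a bitmask and rewarded for accuracy minus a size penalty.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 2))
y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)  # XOR-style target

# candidate rules (feature, threshold, predicted class); toy stand-ins
# for what C4.5 would induce
rules = [(0, 0.5, 1), (1, 0.5, 1), (0, 0.3, 0), (1, 0.7, 0),
         (0, 0.9, 1), (1, 0.1, 0), (0, 0.5, 0), (1, 0.5, 0)]

def predict(mask):
    votes = np.zeros(len(X))
    for on, (f, t, cls) in zip(mask, rules):
        if on:
            votes += np.where(X[:, f] > t, 1, -1) * (1 if cls else -1)
    return (votes > 0).astype(int)

def fitness(mask):
    # accuracy minus a penalty favouring compact rule sets
    return (predict(mask) == y).mean() - 0.02 * mask.sum()

pop = rng.integers(0, 2, size=(30, len(rules)))
for _ in range(50):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-15:]]            # truncation selection
    cut = rng.integers(1, len(rules), size=15)
    children = np.array([np.concatenate([parents[i][:c], parents[(i + 1) % 15][c:]])
                         for i, c in enumerate(cut)])  # one-point crossover
    children ^= (rng.uniform(size=children.shape) < 0.05).astype(children.dtype)  # mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print(best, fitness(best))
```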

6.
Classification of imbalanced data sets, in which negative instances outnumber positive instances, is a significant challenge. Such data sets are commonly encountered in real-life problems, yet the performance of well-known classifiers is limited on them. Various solution approaches have been proposed for the class imbalance problem, using either data-level or algorithm-level modifications. Support vector machines (SVMs), despite their solid theoretical background, also suffer a dramatic decrease in performance when the data distribution is imbalanced. In this study, we propose an L1-norm SVM approach based on a three-objective optimization problem, so as to incorporate the error sums for the two classes into the formulation independently. Motivated by the inherent multi-objective nature of SVMs, the solution approach reduces the problem to two-criteria formulations and investigates the efficient frontier systematically. The results indicate that a comprehensive treatment of distinct positive and negative error levels may lead to performance improvements, at varying degrees of increased computational effort.
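Because the L1-norm SVM objective and constraints are piecewise linear, fixing the trade-off weights turns each point of the efficient frontier into a linear program. The sketch below is a scalarized stand-in for the paper's three-objective treatment: it solves ||w||₁ + C₊Σξᵢ(+) + C₋Σξᵢ(−) with scipy's LP solver, splitting w = u − v to linearize the 1-norm.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C_pos=1.0, C_neg=1.0):
    """L1-norm soft-margin SVM with separate penalties per class (y in {-1, +1})."""
    n, d = X.shape
    # variables: u (d), v (d), b_plus, b_minus, xi (n); w = u - v, b = b_plus - b_minus
    c = np.concatenate([np.ones(2 * d), [0.0, 0.0],
                        np.where(y > 0, C_pos, C_neg)])
    # margin constraints y_i (w.x_i + b) + xi_i >= 1, written as A_ub z <= b_ub
    Yx = y[:, None] * X
    A_ub = -np.hstack([Yx, -Yx, y[:, None], -y[:, None], np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    z = res.x
    return z[:d] - z[d:2 * d], z[2 * d] - z[2 * d + 1]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=-1, size=(90, 2)), rng.normal(loc=1, size=(10, 2))])
y = np.array([-1] * 90 + [1] * 10)                 # imbalanced: few positives
w, b = l1_svm(X, y, C_pos=9.0, C_neg=1.0)          # penalize positive-class errors more
print(np.mean(np.sign(X @ w + b) == y))
```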

7.
In many medical applications, longitudinal data sets are available. Longitudinal data, as well as observations from paired organs, show a dependency structure which should be respected in the evaluation. Adler et al. (Comput Stat Data Anal 53(3):718-729, 2009) proposed various bootstrapping strategies for ensemble methods based on classification trees for two measurements of paired organs. These strategies have been shown to improve classification performance compared to the traditional approach, in which only one observation per subject is used. We extend the methodology to the situation where an arbitrary number of observations per individual is available, and investigate the performance of the proposed methods with bagged classification trees (bagging) and random forests on longitudinal data. Moreover, we adapt the estimation of classification performance criteria to repeated-measurements data. The clinical data set consists of morphological examinations of both eyes of glaucoma patients and healthy controls over a period of up to 7 years. The performance of our modified classifiers is evaluated by a subject-based leave-one-out bootstrap ROC analysis. Simulation results and results for the glaucoma data set demonstrate that our proposal improves on ad hoc strategies and on the strategy of using all measurements of each subject as a block.
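The key to respecting the dependency structure is to resample subjects, not individual measurements: a bootstrap replicate either contains all of a subject's repeated observations or none of them. A minimal sketch of such a subject-based bootstrap with a random forest follows (the synthetic data and names are illustrative, and repeated draws of the same subject are not duplicated, for brevity).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n_subjects = 60
subject_ids = np.repeat(np.arange(n_subjects), 5)      # 5 visits per subject
subject_effect = rng.normal(size=n_subjects)
X = subject_effect[subject_ids, None] + rng.normal(size=(len(subject_ids), 3))
y = (subject_effect[subject_ids] > 0).astype(int)       # label is subject-level

accs = []
for _ in range(25):                                     # subject-based bootstrap
    boot = rng.choice(n_subjects, size=n_subjects, replace=True)
    in_bag = np.isin(subject_ids, boot)                 # all visits of drawn subjects
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[in_bag], y[in_bag])
    if (~in_bag).any():                                 # out-of-bag subjects as test set
        accs.append(clf.score(X[~in_bag], y[~in_bag]))

print(np.mean(accs))
```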

8.
Multivariate data modelling problems consist of a number of nodes with associated function (class) values. The main purpose of these problems is to construct an analytical model that represents the characteristics of the problem under consideration. Because the devices, tools, and/or algorithms used to collect the data may have limited capabilities, the data set is likely to contain unavoidable errors. That is, each component of the data is reliable only within an interval containing the data value. Hence, when an analytical structure is needed for the given data, a band structure should be determined instead of a unique structure. As the multivariance of the given data set increases, divide-and-conquer methods become important in multivariate modelling problems. HDMR-based methods allow us to partition the given multivariate data into less-variate data sets to reduce the complexity of the problem. This paper focuses on the Interval Factorized HDMR method, developed to determine an approximate band structure for a multivariate data modelling problem with uncertainties in its nodes and function values.
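HDMR expresses a multivariate function as a hierarchy of lower-variate components, f(x) = f₀ + Σᵢ fᵢ(xᵢ) + Σᵢ<ⱼ fᵢⱼ(xᵢ,xⱼ) + …, and truncating this expansion is what reduces the complexity. The sketch below estimates a first-order HDMR from gridded data by marginal averaging; it illustrates the plain (non-interval) decomposition, not the paper's Interval Factorized variant.

```python
import numpy as np

def hdmr_first_order(F):
    """First-order HDMR of an n-dimensional array of function values on a full grid."""
    f0 = F.mean()
    components = []
    for axis in range(F.ndim):
        other_axes = tuple(a for a in range(F.ndim) if a != axis)
        components.append(F.mean(axis=other_axes) - f0)  # univariate component
    return f0, components

x = np.linspace(0, 1, 21)
X1, X2 = np.meshgrid(x, x, indexing="ij")
F = 2.0 + np.sin(np.pi * X1) + X2 ** 2          # additive test function
f0, comps = hdmr_first_order(F)

approx = f0 + comps[0][:, None] + comps[1][None, :]
print(np.max(np.abs(approx - F)))               # tiny: first order captures it exactly
```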

9.
We develop new higher-order numerical one-step methods and apply them to several examples to investigate approximate discrete solutions of nonlinear differential equations. These new algorithms are derived from the Adomian decomposition method (ADM) and the Rach-Adomian-Meyers modified decomposition method (MDM) as an alternative to classic schemes such as the explicit Runge-Kutta methods for engineering models that require high accuracy at low computational cost for repetitive simulations, in contrast to a one-size-fits-all philosophy. The new approach incorporates the notion of analytic continuation, which extends the region of convergence without resorting to other techniques used to accelerate convergence, such as diagonal Padé approximants or iterated Shanks transforms. Hence global solutions, instead of only local solutions, are directly realized, albeit in a discretized representation. We observe that one of the difficulties in applying explicit Runge-Kutta one-step methods is that there is no general procedure for generating higher-order versions: deriving higher-order explicit Runge-Kutta formulas is a time-consuming, tedious endeavor, because it is constrained by the traditional Picard formalism used to represent nonlinear differential equations. The ADM and the MDM rely instead on Adomian's representation and the Adomian polynomials, which permit a straightforward, universal procedure for generating numeric methods of any desired order, such as a 12th-order or 24th-order one-step method. Another key advantage is that the maximum step size can easily be estimated before computing the discretized solution, because the radius of convergence can be approximated from the solution approximants, unlike in the Runge-Kutta approach with its intrinsic linearization between computed data points. We propose new variable step-size, variable-order algorithms for automatic step-size control, to increase computational efficiency and reduce computational costs even further for critical engineering models.
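For a concrete picture, consider y′ = y², y(0) = 1, whose exact solution 1/(1 − t) blows up at t = 1. The MDM-style one-step scheme sketched below builds the local power-series coefficients with the Cauchy-product recursion (the Adomian polynomials for the nonlinearity y²) and advances by summing the series; order 12 is an arbitrary choice that echoes the abstract's point that higher orders come essentially for free.

```python
import numpy as np

def mdm_step(y0, h, order=12):
    """One series step for y' = y^2: coefficients a_{n+1} = (1/(n+1)) * sum_i a_i a_{n-i}."""
    a = [y0]
    for n in range(order):
        a.append(sum(a[i] * a[n - i] for i in range(n + 1)) / (n + 1))
    return sum(c * h ** k for k, c in enumerate(a))

t, y, h = 0.0, 1.0, 0.05
while t < 0.8 - 1e-12:       # step toward the singularity at t = 1
    y = mdm_step(y, h)
    t += h

exact = 1.0 / (1.0 - t)
print(f"t={t:.2f}  mdm={y:.10f}  exact={exact:.10f}  err={abs(y - exact):.2e}")
```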

10.
Existing support vector machines (SVMs) all assume that every feature of the training samples contributes equally to constructing the optimal separating hyperplane. However, in a given real-world data set, some features may be more relevant to the classification information while others are less relevant. In this paper, the linear feature-weighted support vector machine (LFWSVM) is proposed to deal with this problem. The model is constructed in two phases. First, a mutual information (MI) based approach assigns an appropriate weight to each feature of the given data set. Second, the model is trained on samples whose features have been weighted by the obtained feature weight vector. Meanwhile, the feature weights are embedded in the quadratic program through a detailed theoretical deduction that yields the dual solution to the original optimization problem. Although calculating the feature weights adds some computational cost, the proposed model generally exhibits better generalization performance than the traditional SVM with a linear kernel function. Experimental results on one synthetic data set and several benchmark data sets confirm the benefits of the proposed method. The experiments also show that the proposed MI-based approach to determining feature weights is superior to the two other most commonly used methods.
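In spirit, feature weighting amounts to rescaling each input column by its estimated relevance before training a linear SVM; the sketch below uses scikit-learn's mutual-information estimator for the weights. This is a convenient stand-in: the paper embeds the weights in the dual formulation rather than rescaling the data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 400
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 8))                      # irrelevant features
X = np.hstack([informative, noise])
y = (informative.sum(axis=1) > 0).astype(int)

weights = mutual_info_classif(X, y, random_state=0)  # MI-based feature weights
weights /= weights.max() + 1e-12                     # normalize to [0, 1]

plain = cross_val_score(LinearSVC(), X, y, cv=5).mean()
weighted = cross_val_score(LinearSVC(), X * weights, y, cv=5).mean()
print(f"plain: {plain:.3f}   feature-weighted: {weighted:.3f}")
```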

11.
Cluster analysis, the determination of natural subgroups in a data set, is an important statistical methodology used in many contexts. A major problem with the hierarchical clustering methods in use today is their tendency to make classification errors when the empirical data depart from the ideal of compact, isolated clusters. Many empirical data sets have structural imperfections that confound the identification of clusters. We use a self-organizing map (SOM) neural network clustering methodology and demonstrate that it is superior to hierarchical clustering methods. The performance of the neural network and of seven hierarchical clustering methods is tested on 252 data sets with various levels of imperfection, including data dispersion, outliers, irrelevant variables, and nonuniform cluster densities. The superior accuracy and robustness of the neural network can improve the effectiveness of decisions and research based on clustering messy empirical data.
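A SOM maps inputs onto a low-dimensional grid of prototype vectors: each sample pulls its best-matching unit (BMU) and, with Gaussian-decaying strength, that unit's grid neighbours. A compact numpy sketch of the standard online training loop (not the paper's exact configuration) follows.

```python
import numpy as np

rng = np.random.default_rng(8)
# three messy clusters in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(100, 2)) for c in (-3, 0, 3)])

rows, cols, dim, n_iters = 6, 6, X.shape[1], 3000
weights = rng.normal(size=(rows, cols, dim))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for it in range(n_iters):
    frac = it / n_iters
    lr = 0.5 * (1 - frac)                   # decaying learning rate
    sigma = 3.0 * (1 - frac) + 0.5          # decaying neighbourhood radius
    x = X[rng.integers(len(X))]
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)   # best-matching unit
    grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
    influence = np.exp(-grid_dist2 / (2 * sigma ** 2))    # Gaussian neighbourhood
    weights += lr * influence[..., None] * (x - weights)

# cluster assignment = index of each sample's BMU on the grid
bmus = [np.unravel_index(np.linalg.norm(weights - x, axis=-1).argmin(),
                         (rows, cols)) for x in X]
print(len(set(bmus)), "occupied map units")
```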

12.
In this paper we investigate to what extent random search methods, equipped with an archive of bounded size to store a limited amount of solutions and other data, are able to obtain good Pareto front approximations. We propose and analyze two archiving schemes that allow for maintaining a sequence of solution sets of given cardinality that converge with probability one to an ε-Pareto set of a certain quality, under very mild assumptions on the process used to sample new solutions. The first algorithm uses a hierarchical grid to define a family of approximate dominance relations to compare solutions and solution sets. Acceptance of a new solution is based on a potential function that counts the number of occupied boxes (on various levels) and thus maintains a strictly monotonous progress to a limit set that covers the Pareto front with non-overlapping boxes at the finest resolution possible. The second algorithm uses an adaptation scheme to modify the current value of ε based on the information gathered during the run. This way it is possible to achieve convergence to the best (smallest) ε value, and to a corresponding solution set of k solutions that ε-dominate all other solutions, which is probably the best possible result regarding the limit behavior of random search methods or metaheuristics for obtaining Pareto front approximations.
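The grid-based acceptance rule can be made concrete with ε-box dominance: each objective vector is mapped to the box index ⌊f/ε⌋, the archive keeps only non-dominated boxes, and at most one representative per box. A greatly simplified sketch (one grid level, fixed ε, minimization) follows.

```python
import numpy as np

def box(f, eps):
    return tuple(np.floor(np.asarray(f) / eps).astype(int))

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def archive_update(archive, f, eps):
    """archive: dict mapping box index -> objective vector; returns updated archive."""
    bf = box(f, eps)
    if any(dominates(b, bf) for b in archive):      # an occupied box eps-dominates f
        return archive
    archive = {b: v for b, v in archive.items() if not dominates(bf, b)}
    if bf not in archive or sum(f) < sum(archive[bf]):  # one representative per box
        archive[bf] = f
    return archive

rng = np.random.default_rng(9)
archive = {}
for _ in range(5000):                               # plain random search
    x = rng.uniform(size=2)
    f = (x[0], (1 - x[0]) ** 2 + 0.1 * x[1])        # toy bi-objective problem
    archive = archive_update(archive, f, eps=0.05)

print(len(archive), "archived points")              # bounded by the grid resolution
```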

13.
When a traditional support vector machine (SVM) is used to classify imbalanced data, the classification results are often unsatisfactory, because the minority class yields too few genuine support-vector samples and these are hard to identify. To address this problem, a mixed-sampling method for imbalanced data classification based on the SVM (BSMS) is proposed. The method first partitions the original imbalanced data, after an initial SVM classification, according to location into a support-vector region (SV), a majority-class non-support-vector region (MN...

14.
Optimization, 2012, 61(7): 1099-1116
In this article we study support vector machine (SVM) classifiers in the face of uncertain knowledge sets and show how data uncertainty in knowledge sets can be treated in SVM classification by employing robust optimization. We present knowledge-based SVM classifiers with uncertain knowledge sets using convex quadratic optimization duality. We show that the knowledge-based SVM, where prior knowledge takes the form of uncertain linear constraints, results in an uncertain convex optimization problem with a set-containment constraint. Using a new extension of Farkas' lemma, we reformulate the robust counterpart of the uncertain convex optimization problem, in the case of interval uncertainty, as a convex quadratic optimization problem. We then reformulate the resulting convex optimization problems as a simple quadratic optimization problem with non-negativity constraints using Lagrange duality. We obtain the solution of the converted problem by a fixed-point iterative algorithm and establish the convergence of the algorithm. Finally, we present some preliminary results from our computational experiments with the method.
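The final computational step, a QP with non-negativity constraints solved by fixed-point iteration, can be illustrated with a projected fixed-point scheme: iterate x ← max(0, x − α(Qx + q)), which converges for a suitable step α when Q is positive definite. This is a generic sketch, not the paper's exact iteration.

```python
import numpy as np

def projected_fixed_point(Q, q, alpha=None, iters=2000, tol=1e-10):
    """Minimize 0.5 x'Qx + q'x subject to x >= 0 via x <- max(0, x - alpha*(Qx + q))."""
    if alpha is None:
        alpha = 1.0 / np.linalg.eigvalsh(Q)[-1]   # safe step from the largest eigenvalue
    x = np.zeros(len(q))
    for _ in range(iters):
        x_new = np.maximum(0.0, x - alpha * (Q @ x + q))
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(10)
A = rng.normal(size=(8, 8))
Q = A @ A.T + 0.5 * np.eye(8)                     # positive definite
q = rng.normal(size=8)
x = projected_fixed_point(Q, q)
print(x.round(4))   # KKT check: x >= 0, and (Qx + q) >= 0 wherever x = 0
```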

15.
The performance of kernel-based methods such as the support vector machine (SVM) is greatly affected by the choice of kernel function. Multiple kernel learning (MKL) is a promising family of machine learning algorithms that has attracted much attention in recent years. MKL combines multiple sub-kernels to seek better results than single-kernel learning. To improve the efficiency of SVM and MKL, in this paper the Kullback-Leibler kernel function is derived to develop the SVM. The proposed method employs an improved ensemble learning framework, named KLMKB, which applies AdaBoost to learning multiple kernel-based classifiers. In the experiment on hyperspectral remote sensing image classification, we classify the satellite image using features selected through the Optimum Index Factor (OIF). We extensively examine the performance of our approach in comparison with relevant state-of-the-art algorithms on a number of benchmark classification data sets and a hyperspectral remote sensing image data set. Experimental results show that our method behaves stably and achieves noticeable accuracy across different data sets.
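Boosting kernel classifiers works like ordinary AdaBoost except that each round fits a kernel SVM under the current sample weights. A compact loop over a pool of sub-kernels is sketched below; the KL kernel itself is not reproduced, and RBF kernels of different widths stand in for the paper's kernel set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y - 1                                     # labels in {-1, +1}
w = np.ones(len(y)) / len(y)                      # AdaBoost sample weights
learners, alphas = [], []

for gamma in (0.01, 0.1, 1.0, 0.01, 0.1, 1.0):    # pool of sub-kernel widths
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y, sample_weight=w)
    pred = clf.predict(X)
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)         # classic AdaBoost learner weight
    w *= np.exp(-alpha * y * pred)                # up-weight misclassified samples
    w /= w.sum()
    learners.append(clf)
    alphas.append(alpha)

agg = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
print("ensemble training accuracy:", np.mean(np.sign(agg) == y))
```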

16.
Shipping companies are forced by current EU regulation to set up a system for monitoring, reporting, and verification of harmful emissions from their fleets. Against this regulatory background, data collected from onboard sensors can be used to assess a ship's operating conditions and quantify its CO2 emission levels. The standard approach for analyzing such data sets summarizes the measurements obtained during a given voyage by their average value. However, this compression step may lead to significant information loss, since most variables exhibit a dynamic profile that is not well approximated by the average alone. In this work, we therefore test two feature-oriented methods that are able to extract additional features: profile-driven features (PdF) and statistical pattern analysis (SPA). A real data set from a Ro-Pax ship is used to test the selected methods, segregated by voyage distance into short, medium, and long routes. Both PdF and SPA are compared with the standard approach, and the results demonstrate the benefits of employing more systematic and informative feature-oriented methods. For the short route, no method predicts CO2 emissions satisfactorily, whereas for the medium and long routes, regression models built using features obtained from both PdF and SPA improve prediction performance. In particular, for the long route the standard approach fails to provide reasonably good predictions.
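The contrast between the standard approach and the feature-oriented ones boils down to what is computed per voyage: a single mean versus a richer statistical summary of each signal's profile. A pandas sketch of SPA-style per-voyage feature extraction follows; the column names and synthetic data are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
# synthetic high-frequency onboard measurements for a handful of voyages
df = pd.DataFrame({
    "voyage_id": np.repeat(np.arange(5), 200),
    "speed": rng.normal(18, 2, 1000),
    "draft": rng.normal(6, 0.3, 1000),
    "wind": rng.gamma(2, 3, 1000),
})

# standard approach: one mean per voyage
standard = df.groupby("voyage_id").mean()

# SPA-style features: several statistics per signal per voyage
spa = df.groupby("voyage_id").agg(["mean", "std", "min", "max",
                                   lambda s: s.quantile(0.9)])
spa.columns = ["_".join(map(str, c)) for c in spa.columns]
print(standard.shape, "->", spa.shape)   # far more predictors per voyage
```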

17.
Professionals in neuropsychology usually state diagnoses of patients' behaviour in verbal rather than numerical form. This fact generates interest in decision support systems that process verbal data, and it motivates us to develop methods for classifying such data. In this paper, we describe ways of aiding the classification of a discrete set of objects, evaluated on a set of criteria that may have verbal estimations, into ordered decision classes. In some situations no explicit additional information is available, while in others the criteria can be ordered lexicographically; we consider both cases. The proposed Dichotomic Classification (DC) method is based on the principles of Verbal Decision Analysis (VDA). VDA methods are especially helpful when verbal criterion values are to be handled. Compared to previously developed VDA classification methods, the Dichotomic Classification method performs better on the same data sets and is able to cope with larger object sets. We present an interactive classification procedure, estimate the effectiveness and computational complexity of the new method, and compare it to one of the previously developed VDA methods. The methods developed and studied are implemented in the framework of a decision support system, and the results of testing on artificial data sets are reported.
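One way to read the "dichotomic" idea is as binary search over the ordered class scale: instead of asking an expert about every class boundary, each question halves the remaining range, so an object is placed into one of k ordered classes with about log₂k questions. The toy sketch below, with a simulated expert, is only this loose reading; the actual DC procedure involves VDA-style preference elicitation and consistency checks.

```python
import math

def classify_dichotomic(obj, n_classes, at_or_below):
    """Binary search over ordered classes 0..n_classes-1.

    at_or_below(obj, c) is the expert's verbal judgement, here simulated:
    True iff obj belongs to class c or lower.
    """
    lo, hi, questions = 0, n_classes - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if at_or_below(obj, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, questions

# simulated expert: the object's true class is hidden in the closure
def make_expert(true_class):
    return lambda obj, c: true_class <= c

cls, q = classify_dichotomic("patient-42", 8, make_expert(5))
print(cls, q, "questions (log2(8) =", int(math.log2(8)), ")")
```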

18.
We address randomized methods for control and optimization based on generating points uniformly distributed in a set. For control systems, these sets are either the stability domain in the space of feedback controllers, the quadratic stability domain, the robust stability domain, or a level set of a performance specification. By generating random points in the prescribed set, one can optimize an additional performance index. To implement this approach we exploit two modern Monte Carlo schemes for generating points that are approximately uniformly distributed in a given convex set. Both methods use a boundary oracle to find the intersection of a ray with the set. The first method is Hit-and-Run; the second is sometimes called Shake-and-Bake. We estimate the rate of convergence of these methods and demonstrate their link with the center-of-gravity method. Numerical simulation results look very promising.
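Hit-and-Run is easy to state: from the current point, draw a uniformly random direction, use the boundary oracle to find where the resulting line exits the set, then jump to a uniform point on that chord. A sketch for a polytope {x : Ax ≤ b}, where the oracle reduces to a few ratio tests, follows.

```python
import numpy as np

def hit_and_run(A, b, x0, n_samples=1000, rng=None):
    """Approximate uniform samples from {x : A x <= b}; x0 must be interior."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, samples = x0.astype(float), []
    for _ in range(n_samples):
        d = rng.normal(size=len(x))
        d /= np.linalg.norm(d)                 # uniform random direction
        # boundary oracle: a_i.(x + t d) <= b_i bounds the step t along the chord
        ad, slack = A @ d, b - A @ x
        t_hi = np.min(slack[ad > 1e-12] / ad[ad > 1e-12])
        t_lo = np.max(slack[ad < -1e-12] / ad[ad < -1e-12])
        x = x + rng.uniform(t_lo, t_hi) * d    # uniform point on the chord
        samples.append(x.copy())
    return np.array(samples)

# unit box [-1, 1]^2 written as a polytope
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
pts = hit_and_run(A, b, np.zeros(2), n_samples=2000)
print(pts.mean(axis=0).round(3))               # close to the center of gravity
```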

19.
Applied Mathematical Modelling, 2014, 38(11-12): 2800-2818
Electrical discharge machining (EDM) is inherently a stochastic process, and predicting its output with reasonable accuracy is rather difficult. Modern learning-based methodologies, capable of capturing the underlying, unseen effects of control factors on responses, appear to be effective in this regard. In the present work, the support vector machine (SVM), one of the supervised learning methods, is applied to develop a model of the EDM process. A Gaussian radial basis function and the ε-insensitive loss function are used as the kernel function and loss function, respectively. Separate models of material removal rate (MRR) and average surface roughness (Ra) are developed by minimizing the mean absolute percentage error (MAPE) on training data obtained for different SVM parameter combinations. Particle swarm optimization (PSO) is employed to optimize the SVM parameter combinations. The models thus developed are then tested on disjoint test data sets. Optimum parameter settings for maximum MRR and minimum Ra are further investigated by applying PSO to the developed models.
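Tuning an ε-insensitive RBF SVR with a particle swarm means treating (C, γ, ε) as particle positions and validation MAPE as the fitness. A compact sketch with a hand-rolled PSO follows; the data, hyperparameter ranges, and swarm settings are illustrative, not from the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.uniform(size=(200, 3))                       # toy stand-in for EDM settings
y = 2.0 + 5 * X[:, 0] + np.sin(6 * X[:, 1]) + 0.1 * rng.normal(size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

def mape(params):
    C, gamma, eps = params
    m = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=eps).fit(X_tr, y_tr)
    return np.mean(np.abs((y_va - m.predict(X_va)) / y_va)) * 100

lo, hi = np.array([0.1, 0.01, 0.001]), np.array([100.0, 10.0, 1.0])
pos = rng.uniform(lo, hi, size=(15, 3))              # 15 particles over (C, gamma, eps)
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([mape(p) for p in pos])
gbest = pbest[pbest_f.argmin()]

for _ in range(20):                                   # standard PSO velocity update
    r1, r2 = rng.uniform(size=(2, 15, 3))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    f = np.array([mape(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()]

print("best (C, gamma, epsilon):", gbest.round(4), " MAPE:", pbest_f.min().round(3))
```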

20.
Interior-point methods are among the most efficient approaches for solving large-scale nonlinear programming problems. At the core of these methods, highly ill-conditioned symmetric saddle-point problems have to be solved. We present combinatorial methods for preprocessing these matrices in order to establish more favorable numerical properties for the subsequent factorization. Our approach is based on symmetric weighted matchings and is used in a sparse direct LDL^T factorization method in which the pivoting is restricted to static supernode data structures. In addition, we dynamically expand the supernode data structure in cases where additional fill-in helps to select better numerical pivot elements. This technique can be seen as an alternative to the more traditional threshold pivoting techniques. We demonstrate the competitiveness of this approach within an interior-point method on a large set of test problems from the CUTE and COPS sets, as well as on large optimal control problems based on partial differential equations. The largest nonlinear optimization problem solved has more than 12 million variables and 6 million constraints.
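The matching-based idea is to find a permutation that moves large entries onto (or near) the diagonal before factorization, so that static pivoting rarely meets a tiny pivot. Below is a greatly simplified greedy sketch of that preprocessing step; production codes use weighted bipartite matching in the style of MC64 and symmetrize the permutation, not this heuristic.

```python
import numpy as np

def greedy_large_diagonal_perm(A):
    """Greedy column permutation that puts large |entries| on the diagonal."""
    n = A.shape[0]
    perm = -np.ones(n, dtype=int)
    used = np.zeros(n, dtype=bool)
    # visit rows in order of their largest entry (most constrained rows first)
    for i in np.argsort(-np.abs(A).max(axis=1)):
        candidates = np.where(~used)[0]
        j = candidates[np.argmax(np.abs(A[i, candidates]))]
        perm[i] = j
        used[j] = True
    return perm

rng = np.random.default_rng(13)
A = rng.normal(size=(6, 6))
np.fill_diagonal(A, 1e-8 * rng.normal(size=6))   # tiny diagonal: bad for static pivoting
perm = greedy_large_diagonal_perm(A)
B = A[:, perm]                                    # B[i, i] = A[i, perm[i]]
print(np.abs(np.diag(A)).min(), "->", np.abs(np.diag(B)).min())
```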
