Similar Literature
20 similar records found
1.
We present the design of more effective and efficient genetic algorithm based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach where every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs the tasks of mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying the above techniques to a real-world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.
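
As a rough illustration of the idea, the sketch below (with toy data and an invented evaluate_rule fitness that stands in for the mining algorithm) evolves a binary feature mask alongside the genes encoding a rule; all names and parameters are hypothetical.

```python
# Minimal sketch (toy data, stand-in fitness) of a GA that evolves a binary
# feature mask alongside the genes that encode a rule.
import random

N_FEATURES = 8          # hypothetical number of candidate features
RULE_GENES = N_FEATURES # one threshold gene per feature in this toy encoding

def random_individual():
    mask = [random.randint(0, 1) for _ in range(N_FEATURES)]
    thresholds = [random.random() for _ in range(RULE_GENES)]
    return mask, thresholds

def evaluate_rule(mask, thresholds, data):
    """Stand-in fitness: fraction of records whose selected features all
    exceed their thresholds and that carry the positive label."""
    hits = 0
    for features, label in data:
        fired = all(f >= t for f, t, m in zip(features, thresholds, mask) if m)
        hits += int(fired == label)
    # small penalty per selected feature encourages compact subsets
    return hits / len(data) - 0.01 * sum(mask)

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)
    return (a[0][:cut] + b[0][cut:], a[1][:cut] + b[1][cut:])

def mutate(ind, rate=0.05):
    mask, thr = ind
    mask = [g ^ 1 if random.random() < rate else g for g in mask]
    thr = [min(1.0, max(0.0, t + random.gauss(0, 0.1))) if random.random() < rate else t
           for t in thr]
    return mask, thr

def run_ga(data, pop_size=30, generations=50):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: evaluate_rule(*ind, data), reverse=True)
        parents = pop[: pop_size // 2]
        pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=lambda ind: evaluate_rule(*ind, data))

if __name__ == "__main__":
    toy = [([random.random() for _ in range(N_FEATURES)], random.randint(0, 1))
           for _ in range(200)]
    best_mask, best_thresholds = run_ga(toy)
    print("selected features:", [i for i, m in enumerate(best_mask) if m])
```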

2.
Advanced Genetic Programming Based Machine Learning
A Genetic Programming based approach for solving classification problems is presented in this paper. Classification is understood as the act of placing an object into a set of categories, based on the object’s properties; classification algorithms are designed to learn a function which maps a vector of object features into one of several classes. This is done by analyzing a set of input-output examples (“training samples”) of the function. Here we present a method based on the theory of Genetic Algorithms and Genetic Programming that interprets classification problems as optimization problems: Each presented instance of the classification problem is interpreted as an instance of an optimization problem, and a solution is found by a heuristic optimization algorithm. The major new aspects presented in this paper are advanced algorithmic concepts as well as suitable genetic operators for this problem class (mainly the creation of new hypotheses by merging already existing ones and their detailed evaluation). The experimental part of the paper documents the results produced using new hybrid variants of Genetic Algorithms as well as investigated parameter settings. Graphical analysis is done using a novel multiclass classifier analysis concept based on the theory of Receiver Operating Characteristic curves. The work described in this paper was done within the Translational Research Project L282 “GP-Based Techniques for the Design of Virtual Sensors” sponsored by the Austrian Science Fund (FWF).

3.
This study shows how data envelopment analysis (DEA) can be used to reduce the vertical dimensionality of certain data mining databases. The study illustrates the basic concepts using a real-world graduate admissions decision task. It is well known that cost-sensitive mixed integer programming (MIP) problems are NP-complete. This study shows that heuristic solutions for cost-sensitive classification problems can be obtained by solving a simple goal programming problem that reduces the vertical dimension of the original learning dataset. Using simulated datasets and a misclassification cost performance metric, the performance of the proposed goal programming heuristic is compared with the extended DEA-discriminant analysis MIP approach. The holdout sample results of our experiments show that the proposed heuristic approach outperforms the extended DEA-discriminant analysis MIP approach.

4.
In today’s competitive electronic marketplace, companies try to create long-lasting relations with their online customers. Log files and registration forms generate millions of online transactions. Companies use new techniques to “mine” these data and establish optimal online storefronts to maximize their web presence. Several criteria, such as minimization of download time, maximization of web-site visualization, and product association level, can be used for the optimization of virtual storefronts. This paper introduces a genetic algorithm to be used in a model-driven decision-support system for web-site optimization. The algorithm supports multiple-criteria web-site optimization, and the genetic search provides dynamic and timely solutions independent of the number of objects to be arranged.
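
As a hedged illustration of multi-criteria storefront optimization by genetic search, the toy sketch below orders products while trading off page download cost against an association score for adjacently placed products; the sizes, association matrix, weights, and page length are invented and do not reflect the paper's model.

```python
# Toy sketch of a multi-criteria GA that orders products on a storefront page.
import random

def fitness(order, sizes, assoc, w_time=0.5, w_assoc=0.5, page_len=4):
    # download cost of the first page (smaller is better)
    download = sum(sizes[p] for p in order[:page_len])
    # association level of neighbouring products (larger is better)
    association = sum(assoc[a][b] for a, b in zip(order, order[1:]))
    return -w_time * download + w_assoc * association

def order_crossover(a, b):
    cut = random.randrange(1, len(a))
    head = a[:cut]
    return head + [x for x in b if x not in head]   # keeps a valid permutation

def swap_mutation(order, rate=0.1):
    order = order[:]
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def optimise(n_products=10, pop_size=40, generations=100):
    sizes = [random.uniform(10, 200) for _ in range(n_products)]   # KB per object
    assoc = [[random.random() for _ in range(n_products)] for _ in range(n_products)]
    pop = [random.sample(range(n_products), n_products) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: fitness(o, sizes, assoc), reverse=True)
        elite = pop[: pop_size // 2]
        pop = elite + [swap_mutation(order_crossover(random.choice(elite), random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return pop[0]

if __name__ == "__main__":
    print("best product order:", optimise())
```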

5.
The paper presents a simulation–optimization modeling framework for the evacuation of large-scale pedestrian facilities with multiple exit gates. The framework integrates a genetic algorithm (GA) and a microscopic pedestrian simulation–assignment model. The GA searches for the optimal evacuation plan, while the simulation model guides the search through evaluating the quality of the generated evacuation plans. Evacuees are assumed to receive evacuation instructions in terms of the optimal exit gates and evacuation start times. The framework is applied to develop an optimal evacuation plan for a hypothetical crowded exhibition hall. The obtained results show that the model converges to a superior optimal evacuation plan within an acceptable number of iterations. In addition, the obtained evacuation plan outperforms conventional plans that implement nearest-gate immediate evacuation strategies.

6.
Data mining aims to find patterns in organizational databases. However, most mining techniques do not take into account knowledge about the quality of the database. In this work, we show how to incorporate into classification mining recent advances in the data quality field that view a database as the product of an imprecise manufacturing process whose flaws/defects are captured in quality matrices. We develop a general-purpose method of incorporating data quality matrices into the data mining classification task. Our work differs from existing data preparation techniques: while other approaches detect and fix errors to ensure consistency with the entire data set, our work makes use of a priori knowledge of how the data is produced/manufactured.
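
One simple way to exploit a known quality matrix, sketched below purely for illustration (it is not the general-purpose method of the paper), is to de-bias observed class frequencies: if Q gives the probability of each recorded class given the true class, the observed priors can be corrected by solving a small linear system. The matrix Q and the observed priors are hypothetical.

```python
# Minimal sketch: de-bias class priors with an a priori data quality matrix.
# Q[i, j] = P(recorded class j | true class i), so Q^T p_true = p_observed.
import numpy as np

# hypothetical 3-class quality matrix: rows = true class, columns = recorded class
Q = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.85, 0.10],
              [0.03, 0.07, 0.90]])

# class frequencies as they appear in the (imperfect) database
p_observed = np.array([0.40, 0.35, 0.25])

# estimate the true class priors implied by the quality matrix
p_true = np.linalg.solve(Q.T, p_observed)
p_true = np.clip(p_true, 0.0, None)
p_true /= p_true.sum()

print("observed priors:", p_observed)
print("quality-corrected priors:", p_true.round(3))
```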

7.
Geometric coordinates are an integral part of many data streams. Examples include sensor locations in environmental monitoring, vehicle locations in traffic monitoring or battlefield simulations, scientific measurements of earth or atmospheric phenomena, etc. This paper focuses on the problem of summarizing such geometric data streams using limited storage so that many natural geometric queries can be answered faithfully. Some examples of such queries are: report the smallest convex region in which a chemical leak has been sensed, or track the diameter of the dataset, or track the extent of the dataset in any given direction. One can also pose queries over multiple streams: for instance, track the minimum distance between the convex hulls of two data streams, report when datasets A and B are no longer linearly separable, or report when points of data stream A become completely surrounded by points of data stream B, etc. These queries are easily extended to more than two streams.

In this paper, we propose an adaptive sampling scheme that gives provably optimal error bounds for extremal problems of this nature. All our results follow from a single technique for computing the approximate convex hull of a point stream in a single pass. Our main result is this: given a stream of two-dimensional points and an integer r, we can maintain an adaptive sample of at most 2r+1 points such that the distance between the true convex hull and the convex hull of the sample points is O(D/r^2), where D is the diameter of the sample set. The amortized time for processing each point in the stream is O(log r). Using the sample convex hull, all the queries mentioned above can be answered approximately in either O(log r) or O(r) time.
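
For intuition, the sketch below keeps the most extreme point seen so far in each of r fixed directions; this simplified, non-adaptive variant costs O(r) per point and does not achieve the adaptive O(D/r^2) bound of the paper, but it shows the kind of hull summary being maintained over a stream.

```python
# Simplified, non-adaptive directional sketch of a streaming convex hull sample.
import math

class DirectionalHullSketch:
    def __init__(self, r):
        self.dirs = [(math.cos(2 * math.pi * k / r), math.sin(2 * math.pi * k / r))
                     for k in range(r)]
        self.extreme = [None] * r          # most extreme point per direction
        self.best = [-math.inf] * r        # best dot product per direction

    def add(self, p):
        """Process one streaming point in O(r) time (the adaptive scheme does better)."""
        x, y = p
        for k, (dx, dy) in enumerate(self.dirs):
            score = x * dx + y * dy
            if score > self.best[k]:
                self.best[k] = score
                self.extreme[k] = p

    def sample(self):
        """Return the stored extreme points; their hull approximates the true hull."""
        return [p for p in self.extreme if p is not None]

if __name__ == "__main__":
    import random
    sketch = DirectionalHullSketch(r=16)
    for _ in range(10000):
        sketch.add((random.gauss(0, 1), random.gauss(0, 1)))
    print(len(sketch.sample()), "sample points kept out of 10000")
```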


8.
With the broad development of the World Wide Web, various kinds of heterogeneous data (including multimedia data) are now available for decision support tasks. A data warehousing approach is often adopted to prepare data for relevant analysis. Data integration and dimensional modeling indeed allow the creation of appropriate analysis contexts. However, the existing data warehousing tools are well-suited to classical, numerical data and cannot handle complex data. In our approach, we adapt the three main phases of the data warehousing process to complex data. In this paper, we particularly focus on two main steps in complex data warehousing. The first step is data integration. We define a generic UML model that helps represent a wide range of complex data, including their possible semantic properties. Complex data are then stored in XML documents generated by a piece of software we designed. The second important phase we address is the preparation of data for dimensional modeling. We propose an approach that exploits data mining techniques to assist users in building relevant dimensional models.

9.
Dimensionality reduction is used to preserve significant properties of data in a low-dimensional space. In particular, data representation in a lower dimension is needed in applications where information comes from multiple high-dimensional sources. Data integration, however, is a challenge in itself. In this contribution, we consider a general framework to perform dimensionality reduction taking into account that data are heterogeneous. We propose a novel approach, called Deep Kernel Dimensionality Reduction, which is designed for learning layers of new compact data representations simultaneously. The method can also be used to learn shared representations between modalities. We show by experiments on standard and on real large-scale biomedical data sets that the proposed method embeds data in a new compact meaningful representation, and leads to a lower classification error compared to the state-of-the-art methods.
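
The deep, multi-layer kernel architecture itself is not reproduced here; the sketch below only shows a single kernel-based reduction layer (scikit-learn's KernelPCA on synthetic data) feeding a simple classifier, which is the kind of building block such methods stack.

```python
# Single-layer kernel dimensionality reduction sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=200, n_informative=15, random_state=0)

model = make_pipeline(
    KernelPCA(n_components=10, kernel="rbf", gamma=1e-3),  # compact nonlinear representation
    LogisticRegression(max_iter=1000),
)

print("accuracy with 10-D kernel representation:",
      cross_val_score(model, X, y, cv=5).mean().round(3))
```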

10.
11.
Recently developed SAGE technology enables us to simultaneously quantify the expression levels of thousands of genes in a population of cells. SAGE data is helpful in the classification of different types of cancers. However, one main challenge in this task is the availability of a small number of samples compared to a huge number of genes, many of which are irrelevant for classification. Another main challenge is the lack of appropriate statistical methods that consider the specific properties of SAGE data. We propose an efficient solution by selecting relevant genes by information gain and building a multinomial event model for SAGE data. Promising results, in terms of accuracy, were obtained for the proposed model.
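
A sketch of the general recipe only (information-gain style gene selection followed by a multinomial event model), written with scikit-learn stand-ins and synthetic tag counts rather than the authors' implementation:

```python
# Gene selection by mutual information + multinomial naive Bayes on count data.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 2000                          # few samples, many genes
X = rng.poisson(lam=3.0, size=(n_samples, n_genes))    # synthetic SAGE-like tag counts
y = rng.integers(0, 2, size=n_samples)                 # synthetic tumour labels

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=50),  # keep the 50 most informative genes
    MultinomialNB(),                          # multinomial event model over counts
)

scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean().round(3))
```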

12.
This paper describes a method and the corresponding algorithms for the simplification of large-scale linear programming models. It consists of the elimination of the balance constraints (i.e. constraints with zero RHS term). The idea is to apply some linear transformations to the original problem in order to nullify the balance constraints. These transformations can eliminate several balance rows simultaneously. The core of this contribution is the introduction of the reduction matrix and the associated theorems on the equivalent linear programs (original and reduced). The numerical experiments with this method of simplification proved this approach to be beneficial for a large class of LP problems. This research work was done while the first author was at Duisburg University, Mess-, Steuer und Regelungstechnik, Germany, under the greatly appreciated financial assistance given by the Alexander-von-Humboldt Foundation.
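
The paper's reduction matrix is not reconstructed here; the sketch below shows a generic null-space substitution that serves the same purpose on a toy LP: balance rows E x = 0 are eliminated by parameterizing x = Z y, where Z is a basis of the null space of E. All problem data are invented.

```python
# Eliminate balance constraints E x = 0 from  min c^T x  s.t.  E x = 0, A x <= b
# by substituting x = Z y with Z a null-space basis of E.
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog

c = np.array([1.0, 2.0, -1.0])
E = np.array([[1.0, -1.0, 0.0]])          # balance row: x0 - x1 = 0
A = np.array([[1.0, 1.0, 1.0],
              [-1.0, 0.0, 0.0]])
b = np.array([10.0, 0.0])

Z = null_space(E)                          # any x = Z @ y satisfies E x = 0 automatically
reduced = linprog(c @ Z, A_ub=A @ Z, b_ub=b, bounds=[(None, None)] * Z.shape[1])

x = Z @ reduced.x                          # recover a solution of the original LP
print("reduced LP has", Z.shape[1], "variables instead of", len(c))
print("x =", x.round(4), "objective =", round(float(c @ x), 4))
```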

13.
In this paper, we present a new method of model reduction for large-scale dynamical systems, which belongs to the SVD-Krylov based method category. It is a two-sided projection where one side reflects the Krylov part and the other side reflects the SVD (observability gramian) part. The reduced model matches the first r+i Markov parameters of the full order model, and the remaining ones are approximated in a least squares sense without being explicitly computed, where r is the order of the reduced system and i is a nonnegative integer such that 1≤i<r. The reduced system minimizes a weighted ?2 error. By the definition of a shift operator, the proposed approximation is also obtained by solving an equality constrained least squares problem. Moreover, the method is generalized for moment matching at arbitrary interpolation points. Several numerical examples verify the effectiveness of the approach.
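
As a hedged illustration of the Krylov side only (the paper's method is two-sided, with an observability-gramian projection on the other side), the sketch below builds a one-sided Krylov projection that matches the first r Markov parameters of a random stable test system.

```python
# One-sided Krylov model reduction: project onto span{B, AB, ..., A^{r-1}B}.
import numpy as np

def krylov_reduce(A, B, C, r):
    n = A.shape[0]
    K = np.zeros((n, r))
    v = B.ravel()
    for k in range(r):
        K[:, k] = v
        v = A @ v
    V, _ = np.linalg.qr(K)                 # orthonormal basis of the Krylov subspace
    Ar, Br, Cr = V.T @ A @ V, V.T @ B, C @ V
    return Ar, Br, Cr

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, r = 200, 8
    A = -np.diag(rng.uniform(0.5, 5.0, n)) + 0.01 * rng.standard_normal((n, n))
    B = rng.standard_normal((n, 1))
    C = rng.standard_normal((1, n))
    Ar, Br, Cr = krylov_reduce(A, B, C, r)
    # the first r Markov parameters C A^k B should agree up to rounding
    for k in range(r):
        full = (C @ np.linalg.matrix_power(A, k) @ B).item()
        red = (Cr @ np.linalg.matrix_power(Ar, k) @ Br).item()
        print(k, full, red)
```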

14.
A general approach to designing multiple classifiers represents them as a combination of several binary classifiers in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of binary classifier errors that can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups.
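
The Brandt-semigroup construction is not reproduced here; the toy sketch below only illustrates the underlying error-correction mechanism with an ordinary output-code scheme: each class gets a codeword, each column corresponds to a binary classifier, and decoding to the nearest codeword corrects up to floor((d-1)/2) classifier errors when d is the minimum Hamming distance. The codebook is hypothetical.

```python
# Error-correcting output codes: decode binary classifier outputs to the
# nearest class codeword.
from itertools import combinations

CODEBOOK = {                 # hypothetical 3-class code with 6 binary classifiers
    "class_a": (0, 0, 0, 0, 0, 0),
    "class_b": (1, 1, 1, 0, 0, 1),
    "class_c": (0, 1, 1, 1, 1, 0),
}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def min_distance(codebook):
    return min(hamming(u, v) for u, v in combinations(codebook.values(), 2))

def decode(bits):
    """Map the outputs of the binary classifiers to the closest class codeword."""
    return min(CODEBOOK, key=lambda c: hamming(CODEBOOK[c], bits))

if __name__ == "__main__":
    d = min_distance(CODEBOOK)
    print("minimum distance:", d, "-> corrects", (d - 1) // 2, "classifier error(s)")
    corrupted = list(CODEBOOK["class_b"])
    corrupted[2] ^= 1                       # one binary classifier answers wrongly
    print("decoded despite the error:", decode(tuple(corrupted)))
```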

15.
Rough set theory provides a powerful tool for dealing with uncertainty in data. The application of a variety of rough set models to mining data stored in a single table has been widely studied. However, the analysis of data stored in a relational structure using rough sets is still an extensive research area. This paper proposes compound approximation spaces and their constrained versions, which are intended for handling uncertainty in relational data. The proposed spaces extend tolerance approximation spaces to the relational case. Compared with compound approximation spaces, the constrained version makes it possible to derive new knowledge from relational data. The proposed approach can improve the mining of relational data that is uncertain, incomplete, or inconsistent.
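
The compound and constrained approximation spaces are not reconstructed here; the sketch below only shows the single-table building block they extend, namely lower and upper approximations of a concept under an attribute-induced indiscernibility relation, computed on an invented decision table.

```python
# Lower and upper rough set approximations of a concept in a single table.
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    """Group objects whose values agree on the chosen attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)
    return list(classes.values())

def approximations(objects, attributes, concept):
    lower, upper = set(), set()
    for block in indiscernibility_classes(objects, attributes):
        if block <= concept:
            lower |= block        # block certainly inside the concept
        if block & concept:
            upper |= block        # block possibly inside the concept
    return lower, upper

if __name__ == "__main__":
    objects = {                   # hypothetical decision table
        "o1": {"colour": "red",  "size": "small"},
        "o2": {"colour": "red",  "size": "small"},
        "o3": {"colour": "blue", "size": "small"},
        "o4": {"colour": "blue", "size": "large"},
    }
    concept = {"o1", "o3"}        # objects known to belong to the concept
    low, up = approximations(objects, ["colour", "size"], concept)
    print("lower approximation:", sorted(low))
    print("upper approximation:", sorted(up))
```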

16.
The facial reduction algorithm reduces the size of the positive semidefinite cone in SDP. The elimination method for a sparse SOS polynomial [M. Kojima, S. Kim, H. Waki, Sparsity in sums of squares of polynomials, Math. Program. 103 (2005) 45-62] removes monomials which do not appear in any SOS representations. In this paper, we establish a relationship between a facial reduction algorithm and the elimination method for a sparse SOS polynomial.

17.
There are several classes of interior point algorithms that solve linear programming problems in O(√nL) iterations. Among them, several potential reduction algorithms combine both theoretical (O(√nL) iterations) and practical efficiency as they allow the flexibility of line searches in the potential function, and thus can lead to practical implementations. It is a significant open question whether interior point algorithms can lead to better complexity bounds. In the present paper we give some negative answers to this question for the class of potential reduction algorithms. We show that, even if we allow line searches in the potential function, and even for problems that have network structure, the bound O(√nL) is tight for several potential reduction algorithms, i.e., there is a class of examples with network structure, in which the algorithms need at least Θ(√nL) iterations to find an optimal solution. The research of the author was partially supported by a Presidential Young Investigator Award DDM-9158118 with matching funds from Draper Laboratory.
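
For reference, the sketch below evaluates a standard primal-dual potential function of the kind used by this algorithm class (the specific functions analyzed in the paper may differ); with rho = n + sqrt(n), a constant decrease per iteration yields the O(√nL) iteration bound.

```python
# Tanabe-Todd-Ye style primal-dual potential function.
import numpy as np

def primal_dual_potential(x, s, rho=None):
    """Phi_rho(x, s) = rho * ln(x^T s) - sum_i ln(x_i * s_i), for x, s > 0."""
    x, s = np.asarray(x, float), np.asarray(s, float)
    n = x.size
    if rho is None:
        rho = n + np.sqrt(n)
    return rho * np.log(x @ s) - np.sum(np.log(x * s))

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 0.5])
    s = np.array([0.3, 0.1, 1.2])
    print("potential:", round(primal_dual_potential(x, s), 4))
```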

18.
Data mining is generally defined as the science of nontrivial extraction of implicit, previously unknown, and potentially useful information from datasets. There are many websites on the Internet that provide extensive information about products and allow users to post comments on various products and rate the product on a scale of 1 to 5. During the past decade, the need for intelligent algorithms for calculating and organizing extremely large sets of data has grown exponentially. In this article we investigate the extent to which a product’s average user rating can be predicted, using a manageable subset of a data set. For this we use a linearization-algorithm based prediction model and sketch how an inverse problem can be formulated to yield a smooth local volatility function of user ratings. The MAPLE programs that implement the proposed algorithm show that the method is reasonably accurate for the reconstruction of volatility of user ratings, which is useful both for accurate user predictions and for computing sensitivity.

19.
We propose and discuss a new Lepp-surface method able to produce a small triangular approximation of huge sets of terrain grid data by using a two-goal strategy that assures both small approximation error and well-shaped 3D triangles. This is a refinement method which starts with a coarse initial triangulation of the input data, and incrementally selects and adds data points into the mesh as follows: for the edge e having the highest error in the mesh, one or two points close to (one or two) terminal edges associated with e are inserted in the mesh. The edge error is computed by adding the triangle approximation errors of the two triangles that share e, while each L2-norm triangle error is computed by using a curvature tensor (a good approximation of the surface) at a representative point associated with both triangles. The method produces triangular approximations that capture well the relevant features of the terrain surface by naturally producing well-shaped triangles. We compare our method with a pure L2-norm optimization method.

20.
The traditional development of conjugate gradient (CG) methods emphasizes notions of conjugacy and the minimization of quadratic functions. The associated theory of conjugate direction methods, strictly a branch of numerical linear algebra, is both elegant and useful for obtaining insight into algorithms for nonlinear minimization. Nevertheless, it is preferable that favorable behavior on a quadratic be a consequence of a more general approach, one which fits in more naturally with Newton and variable metric methods. We give new CG algorithms along these lines and discuss some of their properties, along with some numerical supporting evidence. I acknowledge partial support by ONR Contract #N00014-76-C-0013 at the Center for Pure & Applied Mathematics, University of California, Berkeley in the dissemination of this article.
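
As a classical baseline only (the new CG variants of the paper are not reproduced), here is the textbook Fletcher-Reeves iteration with a backtracking line search, smoke-tested on a small convex quadratic.

```python
# Classical nonlinear conjugate gradient (Fletcher-Reeves) with Armijo backtracking.
import numpy as np

def nonlinear_cg(f, grad, x0, max_iter=200, tol=1e-8):
    x = np.asarray(x0, float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # backtracking (Armijo) line search along d
        t, fx = 1.0, f(x)
        while f(x + t * d) > fx + 1e-4 * t * (g @ d):
            t *= 0.5
            if t < 1e-14:
                break
        x_new = x + t * d
        g_new = grad(x_new)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves coefficient
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x

if __name__ == "__main__":
    # small smoke test on the convex quadratic 0.5 x^T Q x - b^T x
    Q = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    f = lambda z: 0.5 * z @ Q @ z - b @ z
    grad = lambda z: Q @ z - b
    x_star = nonlinear_cg(f, grad, np.zeros(2))
    print("CG solution:", x_star.round(6), "direct solve:", np.linalg.solve(Q, b).round(6))
```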

