Found 20 similar documents; search took 15 ms.
1.
2.
3.
We present the design of more effective and efficient genetic-algorithm-based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach, where every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying these techniques to a real-world data mining problem show that combining the two feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.
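The abstract does not say how the Hausdorff distance enters the wrapper, so the following is only a minimal sketch under an assumed reading: a binary feature mask (like the binary code evolved by the GA) is scored by the Hausdorff distance between two classes after projecting onto the selected features. The names `hausdorff`, `project`, `class_a`, and `class_b` and the toy data are hypothetical, not taken from the paper.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def project(points, mask):
    """Keep only the coordinates whose mask bit is 1 (assumed feature encoding)."""
    return [tuple(x for x, keep in zip(p, mask) if keep) for p in points]

# Hypothetical toy data: two classes described by three features.
class_a = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.1)]
class_b = [(0.0, 3.0, 5.0), (1.0, 4.0, 5.2)]

# Score two candidate masks: a larger class separation suggests a more useful subset.
score_feature_2 = hausdorff(project(class_a, (0, 1, 0)), project(class_b, (0, 1, 0)))
score_feature_3 = hausdorff(project(class_a, (0, 0, 1)), project(class_b, (0, 0, 1)))
```

In this toy data the second feature separates the classes far better than the third, so a mask selecting it receives the higher score. A real wrapper would combine such a score with the accuracy of the mined rules.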
4.
The paper presents a review of the basic concepts of the Logical Analysis of Data (LAD), along with a series of discrete optimization
models associated with the implementation of various components of its general methodology, as well as an outline of applications
of LAD to medical problems. The combinatorial optimization models described in the paper represent variations on the general
theme of set covering, including some with nonlinear objective functions. The medical applications described include the development
of diagnostic and prognostic systems in cancer research and pulmonology, risk assessment among cardiac patients, and the design
of biomaterials.
5.
6.
This paper presents a dual-objective evolutionary algorithm (DOEA) for extracting multiple decision rule lists in data mining,
which aims at satisfying the classification criteria of high accuracy and ease of user comprehension. Unlike existing approaches,
the algorithm incorporates the concept of Pareto dominance to evolve a set of non-dominated decision rule lists each having
different classification accuracy and number of rules over a specified range. The classification results of DOEA are analyzed
and compared with existing rule-based and non-rule-based classifiers on 8 test problems obtained from the UCI Machine
Learning Repository. The DOEA is shown to produce comprehensible rules with classification accuracy competitive
with many methods in the literature. Box plots and t-tests further examine its invariance to random partitioning of the datasets.
An erratum to this article is available at .
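The Pareto dominance idea in this abstract can be sketched as follows. This is an illustrative filter only, not the DOEA itself: it assumes the two objectives are classification accuracy (to maximize) and number of rules (to minimize), and the `RuleList` type, the candidate names, and their scores are made up for the example.

```python
from typing import NamedTuple

class RuleList(NamedTuple):
    name: str
    accuracy: float  # higher is better
    n_rules: int     # lower is better

def dominates(a: RuleList, b: RuleList) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.n_rules <= b.n_rules
    strictly_better = a.accuracy > b.accuracy or a.n_rules < b.n_rules
    return no_worse and strictly_better

def non_dominated(candidates):
    """Return the candidates that no other candidate dominates (the Pareto front)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical candidate rule lists with invented scores.
candidates = [
    RuleList("A", 0.90, 8),
    RuleList("B", 0.85, 3),
    RuleList("C", 0.84, 5),
    RuleList("D", 0.90, 10),
]
front = non_dominated(candidates)
```

Here `C` is dominated by `B` and `D` by `A`, so the front keeps `A` and `B`: one more accurate, one more compact. That is the kind of accuracy/comprehensibility trade-off set the abstract describes.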
7.
8.
Data mining aims to find patterns in organizational databases. However, most mining techniques do not consider knowledge of the quality of the database. In this work, we show how to incorporate into classification mining recent advances in the data quality field that view a database as the product of an imprecise manufacturing process where the flaws/defects are captured in quality matrices. We develop a general-purpose method of incorporating data quality matrices into the data mining classification task. Our work differs from existing data preparation techniques: while other approaches detect and fix errors to ensure consistency with the entire data set, our work makes use of a priori knowledge of how the data is produced/manufactured.
9.
10.
Disaggregation methods have become popular in multicriteria decision aiding (MCDA) for eliciting preferential information and constructing decision models from decision examples. From a statistical point of view, data mining and machine learning are also involved with similar problems, mainly with regard to identifying patterns and extracting knowledge from data. Recent research has also focused on the introduction of specific domain knowledge in machine learning algorithms. Thus, the connections between disaggregation methods in MCDA and traditional machine learning tools are becoming stronger. In this paper the relationships between the two fields are explored. The differences and similarities between the two approaches are identified, and a review is given regarding the integration of the two fields.
11.
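An information disclosure system is one under which listed companies, in order to protect investors' interests and accept public supervision, are legally required to disclose or announce information such as changes in their finances and their operating status to the public and to regulators, so that investors can be fully informed. XBRL, an XML-based extensible business reporting language, is now widely used in financial information disclosure and has gradually become its standard data format. This paper studies the XBRL specification, taxonomies, and instance documents, and proposes a parallel frequent-pattern mining method for massive XBRL data based on MapReduce and HDFS. Experiments on XBRL instance data from listed companies in China achieved good results.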
12.
《International Journal of Approximate Reasoning》2014,55(7):1519-1534
Methods for analyzing or learning from “fuzzy data” have attracted increasing attention in recent years. In many cases, however, existing methods (for precise, non-fuzzy data) are extended to the fuzzy case in an ad-hoc manner, and without carefully considering the interpretation of a fuzzy set when being used for modeling data. Distinguishing between an ontic and an epistemic interpretation of fuzzy set-valued data, and focusing on the latter, we argue that a “fuzzification” of learning algorithms based on an application of the generic extension principle is not appropriate. In fact, the extension principle fails to properly exploit the inductive bias underlying statistical and machine learning methods, although this bias, at least in principle, offers a means for “disambiguating” the fuzzy data. As an alternative, we propose a method which is based on the generalization of loss functions in empirical risk minimization, and which performs model identification and data disambiguation simultaneously. Elaborating on the fuzzification of specific types of losses, we establish connections to well-known loss functions in regression and classification. We compare our approach with related methods and illustrate its use in logistic regression for binary classification.
13.
The S-Net System for Internet Packet Streams: Strategies for Stream Analysis and System Architecture
《Journal of computational and graphical statistics》2013,22(4):865-892
The traffic on an Internet link is a packet stream: packets of varying sizes arriving for transmission on the link. Each packet has an arrival time, and contained within the packet are headers that carry many critical variables. Packet traces, which consist of captured headers and measurements of the arrival times, convey substantial information about the Internet: security, usage, network performance, and the performance of engineering protocols. This article discusses strategies for the analysis of very large databases of packet traces, and the architecture of a software system that facilitates the use of these strategies. The system has a pipeline: (1) raw packet traces; (2) a database with objects tailored to ensuing analyses; and (3) an environment with tools for data analysis: statistical methods, model fitting, and visualization. The pipeline addresses the full set of tasks in the study of packet streams, from the initial processing of raw packet traces to the final output, often a visual display. S-Net, an extensible, open-source software implementation of this architecture, is based on the R implementation of the S language for graphics and data analysis, and has been developed on Linux.
14.
This paper presents a framework for finding optimal modules in a delayed product differentiation scenario. Historical product sales data is utilized to estimate demand probability and customer preferences. Then this information is used by a multiple-objective optimization model to form modules. An evolutionary computation approach is applied to solve the optimization model and find the Pareto-optimal solutions. An industrial case study illustrates the ideas presented in the paper. The mean number of assembly operations and expected pre-assembly costs are the two competing objectives that are optimized in the case study. The mean number of assembly operations can be significantly reduced while incurring relatively small increases in the expected pre-assembly cost.
15.
Xin Wang Xiaodong Liu Witold Pedrycz Xiaolei Zhu Guangfei Hu 《European Journal of Operational Research》2012,218(1):202-210
In this paper, we propose a novel method for mining association rules for classification problems, namely AFSRC (AFS association rules for classification), realized in the framework of axiomatic fuzzy set (AFS) theory. This model provides a simple and efficient rule generation mechanism. It can also retain meaningful rules for imbalanced classes by fuzzifying the concept of the class support of a rule. In addition, AFSRC can handle different data types occurring simultaneously. Furthermore, the new model can produce membership functions automatically by processing available data. An extensive suite of experiments is reported, offering a comprehensive comparison of the method's performance with that of other methods available in the literature. The experimental results show that AFSRC outperforms most other methods in terms of accuracy and interpretability. AFSRC forms a classifier with high accuracy and a smaller, more interpretable rule base while retaining a sound balance between these two characteristics.
16.
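This paper proposes a new method for solving the cluster-center problem in data mining. Problems of this kind are nonconvex, nonsmooth global optimization problems. We first use a smoothing method to approximate the nonsmooth clustering function by a smooth function, and then apply a filled function to the smoothed problem to search for its global optimum. Numerical experiments on different databases show that the proposed algorithm is feasible and effective.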
17.
A Graph b-coloring Framework for Data Clustering
Haytham Elghazel Hamamache Kheddouci Véronique Deslandres Alain Dussauchoy 《Journal of Mathematical Modelling and Algorithms》2008,7(4):389-423
The graph b-coloring is an interesting technique that can be applied to various domains. The proper b-coloring problem is the assignment of colors (classes) to the vertices of one graph so that no two adjacent vertices have the same
color, and for each color class there exists at least one dominating vertex which is adjacent (dissimilar) to all other color classes. This paper presents a new graph b-coloring framework for clustering heterogeneous objects into
groups. A number of cluster validity indices are also reviewed. Such indices can be used for automatically determining the
optimal partition. The proposed approach has interesting properties and gives good results on a benchmark data set as well as
on a real medical data set.
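The b-coloring definition in this abstract can be checked mechanically. The sketch below only verifies whether a given coloring is a proper b-coloring and is not the paper's clustering framework; the adjacency-dict representation, the function name `is_b_coloring`, and the path-graph example are assumptions made for illustration.

```python
def is_b_coloring(adj, color):
    """Check the two conditions from the definition.

    adj:   dict mapping each vertex to the set of its neighbours (undirected graph)
    color: dict mapping each vertex to its color (cluster label)
    """
    # Condition 1: proper coloring, so no two adjacent vertices share a color.
    for v, neighbours in adj.items():
        if any(color[u] == color[v] for u in neighbours):
            return False

    # Condition 2: every color class contains a dominating vertex, i.e. a vertex
    # with at least one neighbour in each of the other color classes.
    colors = set(color.values())
    for c in colors:
        others = colors - {c}
        if not any(
            color[v] == c and others <= {color[u] for u in adj[v]}
            for v in adj
        ):
            return False
    return True

# Hypothetical example: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
two_colors = {"a": 0, "b": 1, "c": 0, "d": 1}
three_colors = {"a": 0, "b": 1, "c": 2, "d": 0}

ok_two = is_b_coloring(path, two_colors)
ok_three = is_b_coloring(path, three_colors)
```

On this path the alternating 2-coloring satisfies both conditions, while the 3-coloring is proper but fails the second condition, because neither vertex of color 0 has neighbours in both other classes.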
18.
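"Data mining" is a new field of data processing. The support vector machine (SVM) is a new data mining method that has been applied successfully in many fields. However, SVMs still have many limitations: when the training set contains fuzzy information, a standard SVM cannot handle it. To address the general case in which an SVM involves fuzzy information (fuzzy parameters), this paper studies fuzzy chance-constrained programming and the fuzzy features in fuzzy classification together with their representation, establishes a theory of fuzzy support vector classification machines, and gives an algorithm for the fuzzy support vector classification machine in the fuzzy linearly separable case.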
19.
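"Data mining" is a new field of data processing. The support vector machine (SVM) is a new data mining method that has been applied successfully in many fields. However, SVMs still have many limitations: when the training set contains fuzzy information, a standard SVM cannot handle it. To address the general case in which an SVM involves fuzzy information (fuzzy parameters), this paper studies fuzzy chance-constrained programming and the fuzzy features in fuzzy classification together with their representation, establishes a theory of fuzzy support vector classification machines, and gives an algorithm for the fuzzy support vector classification machine in the fuzzy linearly separable case.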
20.
Andreas Christmann 《Acta Mathematicae Applicatae Sinica (English Series)》2005,21(2):193-208
The goals of this paper are twofold: we describe common features in data sets from motor vehicle insurance companies and we investigate a general strategy which exploits the knowledge of such features. The results of the strategy are a basis to develop insurance tariffs. We use a nonparametric approach based on a combination of kernel logistic regression and ε-support vector regression which both have good robustness properties. The strategy is applied to a data set from motor vehicle insurance companies.