Similar Literature (20 results)
1.
Data mining aims to find patterns in organizational databases. However, most mining techniques do not take into account knowledge about the quality of the database. In this work, we show how to incorporate into classification mining recent advances in the data quality field that view a database as the product of an imprecise manufacturing process whose flaws/defects are captured in quality matrices. We develop a general-purpose method for incorporating data quality matrices into the data mining classification task. Our work differs from existing data preparation techniques: whereas other approaches detect and fix errors to ensure consistency with the entire data set, our work exploits a priori knowledge of how the data is produced/manufactured.
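To make the idea concrete, here is a minimal sketch of one way a label-quality matrix could enter classification mining: each training instance is weighted by the probability that its recorded label is correct, derived from an assumed quality matrix Q and class priors via Bayes' rule. The matrix values, priors, and weighting scheme are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical label-quality matrix from the data "manufacturing" process:
# Q[i, j] = P(recorded label = j | true label = i).  Values are made up.
Q = np.array([[0.95, 0.05],
              [0.20, 0.80]])

def quality_weighted_fit(X, y_recorded, Q, priors):
    """Fit a classifier with each instance weighted by P(true == recorded),
    derived from Q and assumed class priors via Bayes' rule."""
    # P(true = i | recorded = j) is proportional to Q[i, j] * priors[i]
    posterior = Q * priors[:, None]
    posterior /= posterior.sum(axis=0, keepdims=True)
    weights = posterior[y_recorded, y_recorded]   # confidence the label is right
    return LogisticRegression().fit(X, y_recorded, sample_weight=weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = quality_weighted_fit(X, y, Q, priors=np.array([0.5, 0.5]))
print(clf.score(X, y))
```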

2.
Data mining is generally defined as the science of nontrivially extracting implicit, previously unknown, and potentially useful information from datasets. Many websites on the Internet provide extensive information about products and allow users to post comments on various products and rate them on a scale of 1 to 5. During the past decade, the need for intelligent algorithms for calculating and organizing extremely large sets of data has grown exponentially. In this article we investigate the extent to which a product's average user rating can be predicted using a manageable subset of a data set. For this we use a prediction model based on a linearization algorithm and sketch how an inverse problem can be formulated to yield a smooth local volatility function of user ratings. The Maple programs that implement the proposed algorithm show that the method is reasonably accurate in reconstructing the volatility of user ratings, which is useful both for accurate user predictions and for computing sensitivity.

3.
We present the design of more effective and efficient genetic-algorithm-based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach, where every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs mining and feature selection simultaneously by evolving a binary feature mask alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying these techniques to a real-world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.
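A minimal sketch of the wrapper idea with an evolved binary feature mask follows; here a k-NN classifier stands in for the paper's rule-mining component, and the GA operators, rates, and synthetic data are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=1)

def fitness(mask):
    """Wrapper evaluation: run the mining algorithm (here, 3-NN) on the subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(3), X[:, mask], y, cv=3).mean()

pop = rng.random((30, X.shape[1])) < 0.5           # population of binary masks
for gen in range(25):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]        # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, X.shape[1])          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.02       # bitwise mutation
        children.append(child ^ flip)
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```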

4.
A general approach to designing multiple classifiers represents them as a combination of several binary classifiers, in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of binary-classifier errors that can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups.
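The underlying error-correction mechanism can be illustrated with generic error-correcting output codes decoded by Hamming distance; the code matrix below is an arbitrary example for illustration, not the paper's Brandt-semigroup construction.

```python
import numpy as np

# Assumed codeword matrix: one row per class, one column per binary classifier.
# With minimum pairwise Hamming distance d, up to floor((d - 1) / 2)
# binary-classifier errors can still be decoded to the right class
# (here d = 3, so one error is correctable).
code = np.array([[0, 0, 0, 0, 0],
                 [0, 1, 1, 0, 1],
                 [1, 0, 1, 1, 0],
                 [1, 1, 0, 1, 1]])

def decode(binary_outputs):
    """Assign the class whose codeword is nearest in Hamming distance."""
    dist = (code != np.asarray(binary_outputs)).sum(axis=1)
    return int(np.argmin(dist))

# Class 2's codeword is (1, 0, 1, 1, 0); flip one classifier's output:
print(decode([1, 0, 1, 0, 0]))   # still decodes to 2
```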

5.
Statistical theories are not expected to generate significant conclusions when applied to very small data sets, and knowledge derived from the limited data gathered in early stages is considered too fragile for long-term production decisions. Unfortunately, such work is necessary in competitive industry and business environments. Our previous research aimed at learning from small data sets for scheduling flexible manufacturing systems; this article focuses on developing an incremental learning procedure for small sequential data sets. The main considerations concern two properties of the data: the data size is very small and the data are time-dependent. For this reason, we propose an extended algorithm named the Generalized-Trend-Diffusion (GTD) method, based on fuzzy theories, which develops a unique backward tracking process for exploring predictive information through the strategy of shadow data generation. The extra information extracted from the shadow data has proven useful in accelerating the learning task and dynamically correcting the derived knowledge in a concurrent fashion.
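A hedged sketch in the spirit of shadow data generation follows; the GTD method's fuzzy machinery and backward tracking are not detailed in the abstract, so the diffusion envelope and sampling rule below are assumptions of this sketch only.

```python
import numpy as np

def shadow_samples(x, n_shadow=5, rng=None):
    """Diffuse a very small time-ordered sample into extra 'shadow' points.

    Each observation spawns points drawn around it, within an envelope
    [min - spread, max + spread] estimated from the data; the envelope
    width and per-point count are illustrative choices, not GTD's rules."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, float)
    spread = (x.max() - x.min()) / max(len(x) - 1, 1)
    lo, hi = x.min() - spread, x.max() + spread
    shadows = []
    for xi in x:
        s = rng.triangular(lo, xi, hi, size=n_shadow)  # peak at the real point
        shadows.append(s)
    return np.clip(np.concatenate(shadows), lo, hi)

small = [12.1, 12.4, 13.0, 13.9]          # tiny sequential data set
aug = np.concatenate([small, shadow_samples(small)])
print(len(aug), aug.mean().round(2))
```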

6.
Supervised classification is an important part of corporate data mining to support decision making in customer-centric planning tasks. The paper proposes a hierarchical reference model for support-vector-machine-based classification within this discipline. The approach balances the conflicting goals of transparent yet accurate models and compares favourably to alternative classifiers in a large-scale empirical evaluation on real-world customer relationship management applications. Recent advances in support vector machine research are incorporated to approach feature, instance and model selection in a unified framework.

7.
Sets of “positive” and “negative” points (observations) in n-dimensional discrete space, given along with their non-negative integer multiplicities, are analyzed from the perspective of the Logical Analysis of Data (LAD). A set of observations satisfying upper and/or lower bounds imposed on certain components is called a positive pattern if it contains some positive observations and no negative ones. The number of variables on which such restrictions are imposed is called the degree of the pattern. A polynomial total-time algorithm is proposed for the enumeration of all patterns of limited degree, and special efficient variants of it for enumerating all patterns with certain “sign” and “coverage” requirements are presented and evaluated on a publicly available collection of benchmark datasets.
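A brute-force restatement of the definitions (not the paper's polynomial total-time algorithm): enumerate every value assignment on at most d variables and keep it if it covers at least one positive observation and no negative one. The toy 0/1 data are made up.

```python
from itertools import combinations, product

pos = [(1, 0, 1), (1, 1, 1)]        # "positive" observations
neg = [(0, 0, 1), (1, 0, 0)]        # "negative" observations

def patterns_up_to_degree(pos, neg, d):
    """Enumerate positive patterns of degree <= d on 0/1 data.

    A pattern fixes the values of the chosen variables; it is positive
    if it covers >= 1 positive observation and no negative one."""
    n = len(pos[0])
    covers = lambda obs, idx, vals: all(obs[i] == v for i, v in zip(idx, vals))
    out = []
    for k in range(1, d + 1):
        for idx in combinations(range(n), k):
            for vals in product((0, 1), repeat=k):
                if any(covers(p, idx, vals) for p in pos) and \
                   not any(covers(q, idx, vals) for q in neg):
                    out.append(dict(zip(idx, vals)))
    return out

for pat in patterns_up_to_degree(pos, neg, 2):
    print(pat)
```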

8.
Discretization techniques can be used to reduce the number of values of a given continuous attribute, and a concept hierarchy can be used to define such a discretization. Traditional methods of building a concept hierarchy from a continuous attribute are usually based on a level-wise approach. Unfortunately, this approach suffers from three weaknesses: (1) it only seeks a locally optimal solution instead of a global optimum, (2) it is usually subject to the constraint that each interval can only be partitioned into a fixed number of subintervals, and (3) the constructed tree may be unbalanced. In view of these weaknesses, this paper develops a new algorithm based on a dynamic-programming strategy for constructing concept hierarchies from continuous attributes. The constructed trees have three merits: (1) they are globally optimal, (2) each interval is partitioned into the most appropriate number of subintervals, and (3) the trees are balanced. Finally, we carry out an experimental study using real data to show the algorithm's efficiency and effectiveness.
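The flavor of the dynamic-programming step can be seen in a one-level version: optimally split a sorted attribute into k contiguous intervals with globally minimal within-interval sum of squares. The cost function is an assumed stand-in; the paper builds full multi-level hierarchies.

```python
import numpy as np

def optimal_partition(values, k):
    """Split sorted values into k contiguous intervals minimizing the
    summed within-interval sum of squared deviations, via DP."""
    x = np.sort(np.asarray(values, float))
    n = len(x)
    pre = np.concatenate([[0.0], x.cumsum()])
    pre2 = np.concatenate([[0.0], (x ** 2).cumsum()])
    def sse(i, j):                      # cost of the interval x[i:j]
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m
    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, n + 1):
        for m in range(1, min(j, k) + 1):
            for i in range(m - 1, j):   # last interval is x[i:j]
                c = cost[i][m - 1] + sse(i, j)
                if c < cost[j][m]:
                    cost[j][m], cut[j][m] = c, i
    bounds, j = [], n                   # recover boundaries by backtracking
    for m in range(k, 0, -1):
        bounds.append((cut[j][m], j))
        j = cut[j][m]
    return [(x[i], x[e - 1]) for i, e in reversed(bounds)]

print(optimal_partition([1, 2, 3, 10, 11, 12, 30, 31], 3))
```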

9.
Early-Warning Analysis of Financial Distress in Listed Companies: A Study Based on Data Mining
刘旻, 罗慧. 《数理统计与管理》 (Journal of Applied Statistics and Management), 2004, 23(3): 51-56, 68
Taking Chinese listed companies as the research object, this paper selects 73 companies that received ST (special treatment) designation during 1999-2001 and 73 normal companies as the training sample, and 43 ST companies and 43 normal companies from 2002 as the test sample, and analyzes 15 financial indicators of the two groups for each of the two years preceding the onset of financial distress. In the data mining stage we applied three independent methods: discriminant analysis, logistic regression, and neural networks; the neural network turned out to predict better than the other two methods. Finally, combining the strengths of these methods, we built a hybrid model whose prediction accuracy is shown to exceed that of each individual method, thereby improving the early-warning performance of the model.
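One natural reading of the hybrid model is soft voting over the three classifiers compared in the study; the combination rule and the synthetic stand-in data below are assumptions, since the paper does not state the exact form of the hybrid here.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for the 15 financial indicators; 146 training + 86 test firms
X, y = make_classification(n_samples=232, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=86 / 232, random_state=0)

hybrid = VotingClassifier(
    estimators=[("lda", LinearDiscriminantAnalysis()),
                ("logit", LogisticRegression(max_iter=1000)),
                ("nn", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                     random_state=0))],
    voting="soft")                       # average predicted probabilities
hybrid.fit(X_tr, y_tr)
print("hybrid accuracy:", hybrid.score(X_te, y_te))
```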

10.
Recently developed SAGE technology enables us to simultaneously quantify the expression levels of thousands of genes in a population of cells. SAGE data are helpful in the classification of different types of cancers. However, one main challenge in this task is the availability of only a small number of samples compared to the huge number of genes, many of which are irrelevant for classification. Another main challenge is the lack of appropriate statistical methods that consider the specific properties of SAGE data. We propose an efficient solution that selects relevant genes by information gain and builds a multinomial event model for SAGE data. Promising results, in terms of accuracy, were obtained for the proposed model.
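A sketch of the two stages with scikit-learn stand-ins: mutual information (an information-gain analogue) ranks the genes, then a multinomial naive Bayes event model classifies the selected tag counts. The synthetic counts and all parameter values are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 2000                      # few samples, many genes
X = rng.poisson(2.0, size=(n_samples, n_genes))    # SAGE-like tag counts
y = rng.integers(0, 2, n_samples)
X[y == 1, :25] += rng.poisson(4.0, size=((y == 1).sum(), 25))  # informative genes

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=50),  # keep the 50 most informative genes
    MultinomialNB())                         # multinomial event model on counts
print(cross_val_score(model, X, y, cv=5).mean())
```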

11.
This paper presents a novel four-stage algorithm for measuring the rank correlation coefficients between pairwise financial time series. In the first stage, the returns of the financial time series are fitted with skewed-t distributions using a generalized autoregressive conditional heteroscedasticity (GARCH) model. In the second stage, the joint probability density function (PDF) of the fitted skewed-t distributions is computed using the symmetrized Joe–Clayton copula. The joint PDF is then used as the scoring scheme for pairwise sequence alignment in the third stage. After solving the optimal sequence alignment problem with dynamic programming, we obtain the aligned pairs of the series. Finally, we compute the rank correlation coefficients of the aligned pairs in the fourth stage. To the best of our knowledge, the proposed algorithm is the first to use a sequence alignment technique to pair numerical financial time series directly, without first transforming numerical values into symbols. Experiments on practical financial data illustrate the method and demonstrate the advantages of the proposed algorithm.
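The third stage in isolation: Needleman-Wunsch-style global alignment by dynamic programming, with a generic similarity score standing in for the copula-based joint density and an assumed gap penalty.

```python
import numpy as np

def align(a, b, score, gap=-1.0):
    """Global alignment of two numeric series by dynamic programming.
    Returns the aligned index pairs; score(x, y) plays the role the
    copula joint density plays in the paper."""
    n, m = len(a), len(b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0], F[0, :] = gap * np.arange(n + 1), gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1, j] + gap, F[i, j - 1] + gap)
    pairs, i, j = [], n, m              # trace back the optimal path
    while i > 0 and j > 0:
        if F[i, j] == F[i - 1, j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i, j] == F[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

r1 = [0.01, -0.02, 0.03, 0.01]          # made-up return series
r2 = [-0.02, 0.03, 0.00, 0.01]
print(align(r1, r2, score=lambda x, y: -abs(x - y)))
```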

12.
With the broad development of the World Wide Web, various kinds of heterogeneous data (including multimedia data) are now available to decision support tasks. A data warehousing approach is often adopted to prepare data for relevant analysis; data integration and dimensional modeling indeed allow the creation of appropriate analysis contexts. However, existing data warehousing tools are well-suited to classical, numerical data and cannot handle complex data. In our approach, we adapt the three main phases of the data warehousing process to complex data. In this paper we focus on two main steps in complex data warehousing. The first is data integration: we define a generic UML model that helps represent a wide range of complex data, including their possible semantic properties, and complex data are then stored in XML documents generated by a piece of software we designed. The second is the preparation of data for dimensional modeling: we propose an approach that exploits data mining techniques to assist users in building relevant dimensional models.

13.
We propose a new algorithm for the total variation based image denoising problem. The split Bregman method is used to convert the unconstrained minimization problem into a linear system in the outer iteration, and an algebraic multigrid method is applied to solve the linear system in the inner iteration. Furthermore, Krylov subspace acceleration is adopted to improve convergence in the outer iteration. Numerical experiments demonstrate that this algorithm is efficient even for images with a large signal-to-noise ratio.
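A minimal sketch of the outer split Bregman iteration for anisotropic TV denoising; where the paper solves the inner linear system by algebraic multigrid with Krylov acceleration, this sketch uses an exact FFT solve under assumed periodic boundaries, and all parameter values are illustrative.

```python
import numpy as np

def tv_denoise_split_bregman(f, lam=8.0, mu=4.0, n_outer=40, tol=1e-4):
    """Anisotropic TV denoising, min_u lam/2*||u-f||^2 + |D_x u| + |D_y u|,
    by split Bregman.  Periodic boundaries; the u-subproblem is solved
    exactly with the FFT (the paper uses algebraic multigrid instead)."""
    grad = lambda v, ax: np.roll(v, -1, axis=ax) - v      # forward differences
    gradT = lambda wx, wy: (np.roll(wx, 1, 0) - wx) + (np.roll(wy, 1, 1) - wy)
    shrink = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    n1, n2 = f.shape
    w1 = 2 * np.pi * np.fft.fftfreq(n1)[:, None]
    w2 = 2 * np.pi * np.fft.fftfreq(n2)[None, :]
    denom = lam + mu * ((2 - 2 * np.cos(w1)) + (2 - 2 * np.cos(w2)))
    f_hat = lam * np.fft.fft2(f)

    u = f.copy()
    dx, dy = np.zeros_like(f), np.zeros_like(f)
    bx, by = np.zeros_like(f), np.zeros_like(f)
    for _ in range(n_outer):
        u_old = u
        # u-subproblem: (lam + mu*grad^T grad) u = lam*f + mu*grad^T(d - b)
        rhs = f_hat + mu * np.fft.fft2(gradT(dx - bx, dy - by))
        u = np.real(np.fft.ifft2(rhs / denom))
        # d-subproblem: componentwise soft shrinkage
        ux, uy = grad(u, 0), grad(u, 1)
        dx, dy = shrink(ux + bx, 1 / mu), shrink(uy + by, 1 / mu)
        # Bregman variable update
        bx, by = bx + ux - dx, by + uy - dy
        if np.linalg.norm(u - u_old) <= tol * np.linalg.norm(u):
            break
    return u

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                 # piecewise-constant test image
noisy = clean + 0.1 * rng.normal(size=clean.shape)
print(np.abs(tv_denoise_split_bregman(noisy) - clean).mean())
```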

14.
In this paper, we present a new algorithm to estimate a regression function in a fixed-design regression model by piecewise (standard and trigonometric) polynomials, with an automatic choice of the knots of the subdivision and of the degrees of the polynomials on each sub-interval. First we give the theoretical background underlying the method: the theoretical performance of our penalized least-squares estimator rests on non-asymptotic evaluations of a mean-square-type risk. Then we explain how the algorithm is built and possibly accelerated (to handle a large number of observations), how the penalty term is chosen, and why it contains constants requiring empirical calibration. Lastly, a comparison with some well-known or recent wavelet methods brings out that our algorithm behaves very competitively in terms of denoising and compression.
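The selection principle in a reduced form: fit a standard polynomial on each cell of a dyadic subdivision and choose the number of pieces by a penalized least-squares criterion. The penalty shape and constant below are illustrative assumptions (the paper calibrates its constants empirically and also selects degrees and non-dyadic knots).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
y = np.sin(6 * np.pi * x) * (x > 0.5) + 0.2 * rng.normal(size=x.size)

def piecewise_fit(x, y, n_pieces, deg=2):
    """Least-squares polynomial of degree deg on each of n_pieces equal cells."""
    yhat, dim = np.empty_like(y), 0
    edges = np.linspace(x.min(), x.max(), n_pieces + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        m = (x >= a) & (x <= b)
        yhat[m] = np.polyval(np.polyfit(x[m], y[m], deg), x[m])
        dim += deg + 1                  # parameters used by this piece
    return yhat, dim

best = None
for n_pieces in (1, 2, 4, 8, 16, 32):
    yhat, dim = piecewise_fit(x, y, n_pieces)
    rss = np.sum((y - yhat) ** 2)
    # assumed penalty: const * model dimension * estimated noise variance
    crit = rss + 2.0 * dim * rss / (x.size - dim)
    if best is None or crit < best[0]:
        best = (crit, n_pieces)
print("selected number of pieces:", best[1])
```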

15.
In this paper, a new algorithm for the conceptual analysis of mixed incomplete data sets is introduced. This is a tool based on logical combinatorial pattern recognition (LCPR) for the conceptual structuralization of spaces. Starting from the limitations of existing conceptual algorithms, our laboratories are working on applying the methods, the techniques and, in general, the philosophy of logical combinatorial pattern recognition to overcome those limitations. An extension of Michalski's concept of l-complex to any similarity measure, a generalization operator for symbolic variables, and an extension of Michalski's refunion operator are introduced. Finally, the performance of the RGC algorithm is analyzed and a comparison with several known conceptual algorithms is presented.

16.
Pathology ordering by general practitioners (GPs) is a significant contributor to rising health care costs both in Australia and worldwide. A thorough understanding of the nature and patterns of pathology utilization is therefore an essential requirement for effective decision support for pathology ordering. In this paper a novel methodology that integrates data mining and case-based reasoning for decision support for pathology ordering is proposed, and it is demonstrated how this methodology can facilitate intelligent decision support that is both patient-oriented and deeply rooted in practical peer-group evidence. Comprehensive data collected by professional pathology companies provide a system-wide profile of patient-specific pathology requests by various GPs, as opposed to data limited to an individual GP practice. Using real data provided by XYZ Pathology Company in Australia, which contain more than 1.5 million records of pathology requests by GPs, we illustrate how knowledge extracted from these data through data mining with Kohonen's self-organizing maps constitutes a base that, with the further assistance of modern data visualization tools and on-line processing interfaces, can provide “peer-group consensus” evidence for solving new cases of the pathology test ordering problem. The conclusion is that a formal methodology integrating case-based reasoning principles, which are inherently close to GPs' daily practice, with data-driven, computationally intensive knowledge discovery mechanisms, which can be applied to the massive amounts of pathology request data routinely available at professional pathology companies, can facilitate more informed evidential decision making by doctors in the area of pathology ordering.
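The knowledge-extraction core in miniature: a minimal Kohonen self-organizing map trained on request-like feature vectors, so that similar requests land on nearby map units for visualization. Grid size, learning schedule, and data are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # stand-in for pathology-request features

h, w, d = 8, 8, X.shape[1]               # 8x8 map of code vectors
W = rng.normal(size=(h, w, d))
grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)

n_steps = 3000
for t in range(n_steps):
    x = X[rng.integers(len(X))]
    # best-matching unit: the code vector nearest to the sample
    dist = ((W - x) ** 2).sum(axis=2)
    bmu = np.unravel_index(dist.argmin(), dist.shape)
    # learning rate and neighbourhood radius both decay over time
    lr = 0.5 * (1 - t / n_steps)
    sigma = 3.0 * (1 - t / n_steps) + 0.5
    g = np.exp(-((grid - np.array(bmu)) ** 2).sum(axis=2) / (2 * sigma ** 2))
    W += lr * g[..., None] * (x - W)     # pull the neighbourhood towards the sample

print("map trained; unit (0,0) code vector:", np.round(W[0, 0], 2))
```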

17.
Customer churn prediction models aim to identify the customers with the highest propensity to attrite, allowing companies to improve the efficiency of customer retention campaigns and to reduce the costs associated with churn. Although cost reduction is their prime objective, churn prediction models are typically evaluated using statistically based performance measures, resulting in suboptimal model selection. Therefore, in the first part of this paper, a novel, profit-centric performance measure is developed by calculating the maximum profit that can be generated by including the optimal fraction of customers with the highest predicted probabilities to attrite in a retention campaign. The novel measure selects the optimal model and fraction of customers to include, yielding a significant increase in profits compared to statistical measures.

In the second part an extensive benchmarking experiment is conducted, evaluating various classification techniques on eleven real-life data sets from telecom operators worldwide, using both the profit-centric and statistically based performance measures. The experimental results show that a small number of variables suffices to predict churn with high accuracy, and that oversampling generally does not improve performance significantly. Finally, a large group of classifiers is found to yield comparable performance.
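A sketch of the profit-centric idea: contact customers in decreasing order of predicted churn probability, compute campaign profit at every targeted fraction, and keep the maximum. The cost model and every parameter value below (contact cost, incentive, customer lifetime value, acceptance rate) are made-up assumptions; the paper's measure is more detailed.

```python
import numpy as np

def max_profit(scores, churned, clv=200.0, delta=10.0, c=1.0, gamma=0.3):
    """Maximum retention-campaign profit over all targeted fractions.

    Assumed cost model: every contact costs c; a reached churner accepts
    the retention offer (incentive cost delta) with probability gamma and
    then retains a customer lifetime value clv."""
    order = np.argsort(-np.asarray(scores, float))
    hits = np.cumsum(np.asarray(churned)[order])       # churners reached so far
    n = np.arange(1, len(order) + 1)                   # customers contacted
    profit = gamma * hits * (clv - delta) - n * c
    best = int(profit.argmax())
    return profit[best], (best + 1) / len(order)

rng = np.random.default_rng(0)
churned = rng.random(1000) < 0.10                      # 10% base churn rate
scores = churned + rng.normal(0, 0.8, 1000)            # imperfect model scores
p, frac = max_profit(scores, churned)
print(f"maximum profit {p:.0f} at targeted fraction {frac:.1%}")
```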

18.
Revenue management (RM) enhances the revenues of a company by means of demand-management decisions. An RM system must take into account the possibility that a booking may be canceled, or that a booked customer may fail to show up at the time of service (a no-show). We review the Passenger Name Record (PNR) data-mining-based cancellation-rate forecasting models proposed in the literature, which mainly address the no-show case. Using a real-world dataset, we illustrate how the set of variables relevant to describing cancellation behavior differs greatly across stages of the booking horizon, which not only confirms the dynamic aspect of this problem but will also help revenue managers better understand the drivers of cancellation. Finally, we examine the performance of state-of-the-art data mining methods when applied to PNR-based cancellation-rate forecasting.

19.
Data envelopment analysis (DEA) is a method to estimate the relative efficiency of decision-making units (DMUs) performing similar tasks in a production system that consumes multiple inputs to produce multiple outputs. A number of DEA models with interval data have been developed so far; the CCR, BCC and FDH models with interval data are well known as the basic ones. In this study, we propose a model called the interval generalized DEA (IGDEA) model, which treats these basic DEA models with interval data in a unified way. In addition, by establishing the theoretical properties of the relationships between the IGDEA model and those DEA models with interval data, we prove that the IGDEA model makes it possible to calculate the efficiency of DMUs while incorporating various preference structures of decision makers.
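The building block behind all of these models: the input-oriented CCR efficiency of a single DMU as a linear program in multiplier form, solved here with scipy on made-up crisp data. Roughly speaking, the interval variants would re-solve such programs at favourable and unfavourable bounds of the data intervals; that extension is not shown.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up data: 4 DMUs, 2 inputs (rows of X), 2 outputs (rows of Y)
X = np.array([[2.0, 4.0, 3.0, 5.0],
              [3.0, 2.0, 4.0, 3.0]])
Y = np.array([[1.0, 2.0, 1.5, 2.5],
              [2.0, 1.0, 2.0, 1.5]])

def ccr_efficiency(j):
    """Input-oriented CCR, multiplier form:
       max u'y_j  s.t.  v'x_j = 1,  u'y_k - v'x_k <= 0 for all k,  u, v >= 0."""
    m, s = X.shape[0], Y.shape[0]
    c = np.concatenate([-Y[:, j], np.zeros(m)])       # maximize u'y_j
    A_ub = np.hstack([Y.T, -X.T])                     # one row per DMU k
    b_ub = np.zeros(X.shape[1])
    A_eq = np.concatenate([np.zeros(s), X[:, j]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    return -res.fun

for j in range(X.shape[1]):
    print(f"DMU {j}: efficiency {ccr_efficiency(j):.3f}")
```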

20.
Support Vector Machines (SVMs) are known to be a powerful nonparametric classification technique, even for high-dimensional data. Although predictive ability is important, obtaining an easy-to-interpret classifier is also crucial in many applications. A linear SVM provides a classifier based on a linear score. In the case of functional data, however, the coefficient function that defines this linear score usually exhibits many irregular oscillations, making it difficult to interpret.
