Similar articles
 Found 20 similar articles; search took 0 ms
1.
The recently developed SAGE technology enables us to simultaneously quantify the expression levels of thousands of genes in a population of cells. SAGE data are helpful in classifying different types of cancer. However, one main challenge in this task is that far fewer samples are available than genes, many of which are irrelevant for classification. Another main challenge is the lack of appropriate statistical methods that account for the specific properties of SAGE data. We propose an efficient solution that selects relevant genes by information gain and builds a multinomial event model for SAGE data. The proposed model achieved promising results in terms of accuracy.
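The gene-selection step mentioned in this abstract can be illustrated with a minimal sketch of information gain. This is not the authors' implementation: the toy labels, the tag names, and the reduction of each tag to presence or absence (1/0) are assumptions made only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: one class label per sample, one 0/1 value per sample and tag.
labels = ["tumor", "tumor", "normal", "normal"]
informative_tag = [1, 1, 0, 0]    # separates the two classes perfectly
uninformative_tag = [1, 0, 1, 0]  # carries no information about the class

gain_informative = information_gain(informative_tag, labels)
gain_uninformative = information_gain(uninformative_tag, labels)
```

Ranking tags by this score and keeping the top few is one common way to discard features that are irrelevant for classification before fitting a model.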

2.
With the broad development of the World Wide Web, various kinds of heterogeneous data (including multimedia data) are now available to decision support tasks. A data warehousing approach is often adopted to prepare data for relevant analysis. Data integration and dimensional modeling indeed allow the creation of appropriate analysis contexts. However, existing data warehousing tools are well suited to classical, numerical data and cannot handle complex data. In our approach, we adapt the three main phases of the data warehousing process to complex data. In this paper, we focus on two main steps in complex data warehousing. The first step is data integration. We define a generic UML model that helps represent a wide range of complex data, including their possible semantic properties. Complex data are then stored in XML documents generated by a piece of software we designed. The second step we address is the preparation of data for dimensional modeling. We propose an approach that exploits data mining techniques to assist users in building relevant dimensional models.

3.
Denoising analysis poses a new challenge for mining high-frequency financial data because of the data's irregularities and roughness. Inefficient decomposition of high-frequency data into the systematic pattern (the trend) and noise leads to erroneous conclusions, as the irregularities and roughness of the data make traditional methods difficult to apply. In this paper, we propose the local linear scaling approximation (LLSA) algorithm, a new nonlinear filtering algorithm based on the linear maximal overlap discrete wavelet transform (MODWT), to separate the systematic pattern from the noise. We show several properties of this new algorithm, namely local linearity, computational complexity, and consistency. We conduct a simulation study to confirm the properties shown analytically and to compare the performance of LLSA with that of MODWT. We then apply the new algorithm to real high-frequency data from the German equity market to investigate its use in forecasting. We show the superior performance of LLSA and conclude that it can be applied with flexible settings and is suitable for high-frequency data mining.

4.
Classification and rule induction are two important tasks for extracting knowledge from data. In rule induction, knowledge is represented as IF-THEN rules, which are easily understood and applied by problem-domain experts. In this paper, we propose MEPAR-miner (Multi-Expression Programming for Association Rule Mining), a new chromosome representation and solution technique for rule induction based on Multi-Expression Programming (MEP). MEP is a relatively new technique in evolutionary programming, first introduced in 2002 by Oltean and Dumitrescu. MEP uses a linear chromosome structure. In MEP, multiple logical expressions of different sizes are used to represent different logical rules. MEP expressions can be encoded and implemented in a flexible and efficient manner. MEP is generally applied to prediction problems; in this paper, a new algorithm is presented that enables MEP to discover classification rules. The performance of the developed algorithm is tested on nine publicly available binary and n-ary classification data sets. Extensive experiments demonstrate that MEPAR-miner can discover effective classification rules that are as good as (or better than) those obtained by traditional rule induction methods. It is also shown that an effective gene encoding structure directly improves the predictive accuracy of logical IF-THEN rules.

5.
As a consequence of heightened competition in the education market, the management of educational institutions often attempts to collect information on what drives student satisfaction, for example by organizing large-scale surveys among the student population. Until now, this source of potentially very valuable information has remained largely untapped. In this study, we address this issue by investigating the applicability of different data mining techniques to identify the main drivers of student satisfaction in two business education institutions. In the end, the resulting models are to be used by the management to support the strategic decision-making process. Hence, model comprehensibility is considered at least as important as model performance. It is found that data mining techniques are able to select a surprisingly small number of constructs that require attention in order to manage student satisfaction.

6.
Data mining is generally defined as the science of nontrivial extraction of implicit, previously unknown, and potentially useful information from datasets. Many websites on the Internet provide extensive information about products and allow users to post comments on them and rate them on a scale of 1 to 5. During the past decade, the need for intelligent algorithms for calculating and organizing extremely large sets of data has grown exponentially. In this article, we investigate the extent to which a product’s average user rating can be predicted using a manageable subset of a data set. For this we use a prediction model based on a linearization algorithm and sketch how an inverse problem can be formulated to yield a smooth local volatility function of user ratings. The MAPLE programs that implement the proposed algorithm show that the method is reasonably accurate for reconstructing the volatility of user ratings, which is useful both for accurate user predictions and for computing sensitivity.

7.
A general approach to designing multiple classifiers represents them as a combination of several binary classifiers in order to enable correction of classification errors and increase reliability. This method is explained, for example, in Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques, 2005, Sect. 7.5). The aim of this paper is to investigate representations of this sort based on Brandt semigroups. We give a formula for the maximum number of errors of binary classifiers that can be corrected by a multiple classifier of this type. Examples show that our formula does not carry over to larger classes of semigroups.

8.
Revenue management (RM) enhances the revenues of a company by means of demand-management decisions. An RM system must take into account the possibility that a booking may be canceled, or that a booked customer may fail to show up at the time of service (no-show). We review the cancellation rate forecasting models based on Passenger Name Record data mining that have been proposed in the literature, which mainly address the no-show case. Using a real-world dataset, we illustrate how the set of relevant variables describing cancellation behavior differs greatly across stages of the booking horizon, which not only confirms the dynamic aspect of this problem but will also help revenue managers better understand the drivers of cancellation. Finally, we examine the performance of state-of-the-art data mining methods when applied to Passenger Name Record based cancellation rate forecasting.

9.
The identification of different dynamics in sequential data has become an everyday need in scientific fields such as marketing, bioinformatics, finance, and the social sciences. In contrast to cross-sectional or static data, this type of observation (also known as stream data, temporal data, longitudinal data, or repeated measures) is more challenging because data dependency has to be incorporated into the clustering process. In this research we focus on clustering categorical sequences. The method proposed here combines model-based and heuristic clustering. In the first step, the categorical sequences are transformed by an extension of the hidden Markov model into a probabilistic space, where a symmetric Kullback–Leibler distance can operate. In the second step, the sequences are clustered by hierarchical clustering on the matrix of distances. This paper illustrates the potential of this type of hybrid approach using a synthetic data set as well as the well-known Microsoft dataset of website users' search patterns and a survey on job career dynamics.
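The distance-matrix step of this two-step method can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes each sequence has already been mapped to a discrete probability vector (the vectors below are made up), it assumes strictly positive probabilities, and it defines the symmetric distance as the sum of the two directed divergences (some authors use the average instead).

```python
import math

def kl_divergence(p, q):
    """Directed Kullback-Leibler divergence KL(p || q) in nats.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def distance_matrix(distributions):
    """Pairwise symmetric KL distances, suitable as input to hierarchical clustering."""
    return [[symmetric_kl(p, q) for q in distributions] for p in distributions]

# Hypothetical probability vectors standing in for the transformed sequences.
sequences = [
    [0.5, 0.3, 0.2],
    [0.45, 0.35, 0.2],
    [0.1, 0.1, 0.8],
]
distances = distance_matrix(sequences)
```

A matrix like `distances` is what an agglomerative clustering routine would consume; here the first two vectors end up much closer to each other than either is to the third.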

10.
With the rapid growth of databases in many modern enterprises, data mining has become an increasingly important approach to data analysis. The operations research community has contributed significantly to this field, especially through the formulation and solution of numerous data mining problems as optimization problems, and several operations research applications can also be addressed using data mining methods. This paper provides a survey of the intersection of operations research and data mining. The primary goals of the paper are to illustrate the range of interactions between the two fields, present some detailed examples of important research work, and provide comprehensive references to other important work in the area. The paper thus looks both at the different optimization methods that can be used for data mining and at the data mining process itself, showing how operations research methods can be used in almost every step of this process. Promising directions for future research are also identified throughout the paper. Finally, the paper looks at some applications related to the management of electronic services, namely customer relationship management and personalization.

11.
In many industrial processes, hundreds of noisy and correlated process variables are collected for monitoring and control purposes. The goal is often to correctly classify production batches into classes, such as good or failed, based on the process variables. We propose a method for selecting the best process variables for classifying process batches using multiple criteria, including classification performance measures (i.e., sensitivity and specificity) and the measurement cost. The method applies Partial Least Squares (PLS) regression to the training set to derive an importance index for each variable. Then an iterative classification/elimination procedure using k-Nearest Neighbor is carried out. Finally, Pareto analysis is used to select the best set of variables and avoid excessive retention of variables. The proposed method consistently selects process variables important for classification, regardless of the batches included in the training data. Further, we demonstrate the advantages of the proposed method using six industrial datasets.

12.
Supervised classification is an important part of corporate data mining to support decision making in customer-centric planning tasks. The paper proposes a hierarchical reference model for support vector machine based classification within this discipline. The approach balances the conflicting goals of transparent yet accurate models and compares favourably to alternative classifiers in a large-scale empirical evaluation in real-world customer relationship management applications. Recent advances in support vector machine oriented research are incorporated to approach feature, instance and model selection in a unified framework.

13.
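The role and problems of statistical methods in data mining   Total citations: 4, self-citations: 0, citations by others: 4
This paper discusses the relationship between data mining and statistics, and introduces the statistical methods commonly used in data mining together with their problems. It raises the question of how statistics should adapt to data mining.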

14.
We present the design of more effective and efficient genetic algorithm based data mining techniques that use the concepts of feature selection. Explicit feature selection is traditionally done as a wrapper approach in which every candidate feature subset is evaluated by executing the data mining algorithm on that subset. In this article we present a GA that performs both mining and feature selection simultaneously by evolving a binary code alongside the chromosome structure used for evolving the rules. We then present a wrapper approach to feature selection based on the Hausdorff distance measure. Results from applying these techniques to a real-world data mining problem show that combining both feature selection methods provides the best performance in terms of prediction accuracy and computational efficiency.

15.
Progress in bioinformatics and biotechnology has generated a huge number of sequences that require detailed analysis. Several data mining techniques can be used to discover patterns in large databases. This paper describes the development of a tool/methodology to extract hydrophobicity patterns/profiles that achieve a specific secondary structure in proteins. The results indicate that association rules can be an efficient method for investigating this kind of problem. This work contributes to two areas: prediction of protein structure and protein folding.

16.
Sets of “positive” and “negative” points (observations) in n-dimensional discrete space, given along with their non-negative integer multiplicities, are analyzed from the perspective of the Logical Analysis of Data (LAD). A set of observations satisfying upper and/or lower bounds imposed on certain components is called a positive pattern if it contains some positive observations and no negative ones. The number of variables on which such restrictions are imposed is called the degree of the pattern. A total polynomial algorithm is proposed for the enumeration of all patterns of limited degree, and special efficient variants of it for the enumeration of all patterns with certain “sign” and “coverage” requirements are presented and evaluated on a publicly available collection of benchmark datasets.

17.
Data reduction is an important issue in the field of data mining. The goal of data reduction techniques is to extract a subset of data from a massive dataset while maintaining the properties and characteristics of the original data in the reduced set. This allows an otherwise difficult or impossible data mining task to be carried out efficiently and effectively. This paper describes a new method for selecting a subset of data that closely represents the original data in terms of its joint and univariate distributions. A pair of distance criteria, motivated by the χ² statistic, are used for measuring the goodness-of-fit between the distributions of the reduced and full datasets. Under these criteria, the data reduction problem can be formulated as a bi-objective quadratic program. A genetic algorithm technique is used in the search/optimization process. Experiments conducted on several real-world data sets demonstrate the effectiveness of the proposed method.
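The abstract does not give the exact distance criteria, so the following is only a minimal sketch of one χ²-style goodness-of-fit measure between the univariate distribution of a reduced set and that of the full dataset. The function name, the single categorical variable, and the toy data are assumptions for illustration.

```python
from collections import Counter

def chi2_distance(full, subset):
    """Chi-square-style distance between category proportions of subset and full data.

    Sums (observed - expected)^2 / expected over the categories seen in the full
    data, where both terms are proportions rather than raw counts.
    """
    full_counts = Counter(full)
    subset_counts = Counter(subset)
    n_full, n_sub = len(full), len(subset)
    distance = 0.0
    for category, count in full_counts.items():
        expected = count / n_full
        observed = subset_counts.get(category, 0) / n_sub
        distance += (observed - expected) ** 2 / expected
    return distance

# Hypothetical categorical variable with a 60/40 split in the full dataset.
full_data = ["a"] * 6 + ["b"] * 4
representative_subset = ["a"] * 3 + ["b"] * 2  # preserves the 60/40 split
skewed_subset = ["a"] * 5                      # loses category "b" entirely

d_representative = chi2_distance(full_data, representative_subset)
d_skewed = chi2_distance(full_data, skewed_subset)
```

A search procedure such as a genetic algorithm would then prefer candidate subsets with smaller distances; a joint-distribution criterion could be built the same way over combinations of variables.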

18.
Pathology ordering by general practitioners (GPs) is a significant contributor to rising health care costs both in Australia and worldwide. A thorough understanding of the nature and patterns of pathology utilization is an essential requirement for effective decision support for pathology ordering. In this paper, a novel methodology for integrating data mining and case-based reasoning for decision support for pathology ordering is proposed. It is demonstrated how this methodology can facilitate intelligent decision support that is both patient-oriented and deeply rooted in practical peer-group evidence. Comprehensive data collected by professional pathology companies provide a system-wide profile of patient-specific pathology requests by various GPs, as opposed to one limited to an individual GP practice. Using real data provided by XYZ Pathology Company in Australia that contain more than 1.5 million records of pathology requests by GPs, we illustrate how knowledge extracted from these data through data mining with Kohonen’s self-organizing maps constitutes the base that, with further assistance from modern data visualization tools and on-line processing interfaces, can provide “peer-group consensus” evidence support for solving new cases of the pathology test ordering problem. The conclusion is that a formal methodology that integrates case-based reasoning principles, which are inherently close to GPs’ daily practice, with data-driven, computationally intensive knowledge discovery mechanisms, which can be applied to the massive amounts of pathology request data routinely available at professional pathology companies, can facilitate more informed evidential decision making by doctors in the area of pathology ordering.

19.
The paper is concerned with the problem of binary classification of data records, given an already classified training set of records. Among the various approaches to the problem, the methodology of the logical analysis of data (LAD) is considered. This approach is based on discrete mathematics, with special emphasis on Boolean functions. Enhancements to the standard LAD procedure based on probability considerations are presented. In particular, the problem of selecting the optimal support set is formulated as a weighted set covering problem. Testable statistical hypotheses are used. The accuracy of the modified LAD procedure is compared with that of the standard LAD procedure on datasets from the UCI repository. Encouraging results are obtained and discussed.
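To make the weighted set covering formulation concrete, here is a minimal sketch using the standard greedy heuristic, which is a swapped-in approximation and not the solution method of the paper. It assumes that the elements to cover are positive/negative observation pairs, that each candidate feature covers the pairs it distinguishes, and that the weights are given; all names and numbers are hypothetical.

```python
def greedy_weighted_set_cover(universe, subsets, weights):
    """Greedy heuristic: repeatedly pick the subset with the lowest weight per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(uncovered & members)
            if gain == 0:
                continue  # this subset covers nothing new
            ratio = weights[name] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# Hypothetical instance: elements 1..5 stand for positive/negative observation pairs,
# and each feature maps to the set of pairs it can tell apart.
pairs = {1, 2, 3, 4, 5}
distinguishes = {
    "f1": {1, 2, 3},
    "f2": {3, 4},
    "f3": {4, 5},
    "f4": {1, 2, 3, 4, 5},
}
feature_weights = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 5.0}

support_set = greedy_weighted_set_cover(pairs, distinguishes, feature_weights)
```

On this instance the heuristic selects `f1` and then `f3`, covering every pair at total weight 2.0 instead of paying 5.0 for `f4`. The greedy choice is not guaranteed to be optimal in general.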

20.
Support Vector Machines (SVMs) are known to be a powerful nonparametric classification technique, even for high-dimensional data. Although predictive ability is important, obtaining an easy-to-interpret classifier is also crucial in many applications. A linear SVM provides a classifier based on a linear score. In the case of functional data, the coefficient function that defines this linear score usually has many irregular oscillations, making it difficult to interpret.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号