Similar Documents
A total of 20 similar documents were found.
1.
Summary  Increasing amounts of climate data require new analysis techniques. The area of data mining investigates new paradigms and methods, including factors like scalability, flexibility and problem abstraction, for large data sets. The field of visual data mining in particular offers valuable methods for analyzing large amounts of data intuitively. In this paper we describe our approach to integrating cluster analysis and visualization methods for the exploration of climate data. We integrated cluster algorithms, appropriate visualization techniques and sophisticated interaction paradigms into a general framework.

2.
Multidimensional multivariate data have been studied in different areas for quite some time. Commonly, the analysis goal is not to look into individual records but to understand the distribution of the records at large and to find clusters of records that exhibit correlations between dimensions or variables. We propose a visualization method that operates on density rather than on individual records. So as not to restrict our search for clusters, we compute density in the given multidimensional space. Clusters are formed by areas of high density. We present an approach that automatically computes a hierarchical tree of high-density clusters. For visualization purposes, we propose a method to project the multidimensional clusters to a 2D or 3D layout. The projection method uses an optimized star coordinates layout. The optimization procedure minimizes the overlap of projected clusters and maximally maintains the cluster shapes, compactness, and distribution. The star coordinates visualization allows for an interactive analysis of the distribution of clusters and comprehension of the relations between clusters and the original dimensions. Clusters are visualized using nested sequences of density level sets, leading to a quantitative understanding of information content, patterns, and relationships.
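A minimal sketch of the star-coordinates projection mentioned above, assuming evenly spaced axis vectors and per-dimension min-max scaling; the paper's optimization of the axis layout (to reduce cluster overlap and preserve cluster shape) is not reproduced here, and the function name is ours.

```python
import numpy as np

def star_coordinates(X, angles=None):
    """Project an n x d data matrix onto 2D via star coordinates.

    Each dimension j is assigned a 2D axis vector a_j; a record x maps to
    sum_j x_j * a_j after per-dimension min-max scaling to [0, 1].
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # min-max scale each dimension so all contributions are comparable
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    Xs = (X - mins) / span
    if angles is None:
        angles = 2 * np.pi * np.arange(d) / d            # evenly spaced axes
    axes = np.column_stack([np.cos(angles), np.sin(angles)])  # d x 2
    return Xs @ axes                                      # n x 2 layout

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    Y = star_coordinates(X)
    print(Y.shape)  # (200, 2)
```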

3.
Development of methods for visualisation of high-dimensional data where the number of observations, n, is small compared to the number of variables, p, is of increasing importance. One major application is the burgeoning field of microarray (gene expression) experiments. Because of their high cost, the number of chips (n) is O(10–10^2) while the number (p) of genes (including expressed sequence tags) on each chip is O(10^3–10^4). Based on synthetic data simulated in accord with current biological interpretation of microarray data, we have adapted the biplot that simultaneously plots the genes and the chips to display relevant experimental information. Other ordination techniques are also useful for visually exploring microarray data. The biological information that can be revealed by applying these exploratory, visual techniques is illustrated using data from gene expression experiments. When ordination methods, or dimension reduction methods such as PCA and its many variants, are used in association with gene selection methods, it is well known that "selection bias" can result. We show an application of bootstrap methodology to ordination methods that can be used to account for this bias. Such methods are invaluable when visualization methods are used for pattern recognition, such as when identifying previously unknown sub-classes of tumours in molecular classification. A colour version of the paper is available at: DOI:. The sample numbers shown on the plots can also be used for identifying the different classes if a colour version is not available. The sample numbers for the ALL B-cells are 1, 4, 5, 7, 8, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, and 27. Those for the ALL T-cells are 2, 3, 6, 9, 10, 11, 14 and 23, and for the AML the samples are 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38.

4.
In this paper, we investigate the problem of determining the number of clusters in the k-modes based categorical data clustering process. We propose a new categorical data clustering algorithm with automatic selection of k. The new algorithm extends the k-modes clustering algorithm by introducing a penalty term to the objective function to make more clusters compete for objects. In the new objective function, we employ a regularization parameter to control the number of clusters in a clustering process. Instead of finding k directly, we choose a suitable value of the regularization parameter such that the corresponding clustering result is the most stable one among all the generated clustering results. Experimental results on synthetic and real data sets demonstrate the effectiveness of the proposed algorithm.
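A hedged illustration of the kind of penalized objective the abstract describes: within-cluster simple-matching dissimilarity plus a regularization term that grows with the number of non-empty clusters. The exact penalty used in the paper may differ; the function and variable names here are ours.

```python
import numpy as np

def mismatch(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return int(np.sum(a != b))

def penalized_cost(X, modes, labels, gamma):
    """Within-cluster mismatch cost plus gamma times the number of
    non-empty clusters. This penalty form is our guess at the general
    shape of the objective; the paper's exact term may differ."""
    within = sum(mismatch(X[i], modes[labels[i]]) for i in range(len(X)))
    return within + gamma * len(set(labels))

if __name__ == "__main__":
    X = np.array([["a", "x"], ["a", "y"], ["b", "y"], ["b", "y"]])
    modes = np.array([["a", "y"], ["b", "y"]])
    labels = [0, 0, 1, 1]
    print(penalized_cost(X, modes, labels, gamma=1.5))  # 1 mismatch + 1.5 * 2 = 4.0
```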

5.
Summary  The paper introduces the idea of generalising a cumulative frequency curve to show arbitrary cumulative counts. For example, in demographic studies generalised cumulative curves can represent the distribution of population or area. Generalised cumulative curves can be a valuable instrument for exploratory data analysis. The use of cumulative curves in an investigation of population statistics in Northwest England allowed us to discover interesting facts about relationships between the distribution of national minorities and the degree of deprivation. We found that, while high concentrations of national minorities occur, in general, in underprivileged districts, there are some differences related to the origin of the minorities. The paper sets out the applicability conditions for generalised cumulative curves and compares them with other graphical tools for exploratory data analysis.
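A small sketch of a generalised cumulative curve: records are ordered by one variable (for example a deprivation index) and an arbitrary count (for example minority population) is accumulated along that ordering. The names and scaling choices are ours, not the paper's.

```python
import numpy as np

def generalized_cumulative_curve(rank_by, counts):
    """Sort records by `rank_by` and return cumulative shares of `counts`.

    With counts = 1 for every record this reduces to an ordinary cumulative
    frequency curve; with e.g. district population it shows how population
    accumulates along the ranking variable.
    """
    order = np.argsort(rank_by)
    c = np.asarray(counts, dtype=float)[order]
    x = np.arange(1, len(c) + 1) / len(c)   # cumulative share of records
    y = np.cumsum(c) / c.sum()              # cumulative share of counts
    return x, y

if __name__ == "__main__":
    deprivation = [0.8, 0.3, 0.5, 0.9, 0.1]     # ranking variable per district
    minority_pop = [1200, 300, 450, 2000, 100]  # counts to accumulate
    x, y = generalized_cumulative_curve(deprivation, minority_pop)
    print(list(zip(x.round(2), y.round(2))))
```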

6.
New challenges in knowledge extraction include interpreting and classifying data sets while simultaneously considering related information to confirm results or identify false positives. We discuss a data fusion algorithmic framework targeted at this problem. It includes separate base classifiers for each data type and a fusion method for combining the individual classifiers. The fusion method is an extension of current ensemble classification techniques and has the advantage of allowing data to remain in heterogeneous databases. In this paper, we focus on the applicability of such a framework to the protein phosphorylation prediction problem.
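A rough sketch of the framework's two ingredients: one base classifier per data type, and a fusion step over their outputs. Logistic regression and probability averaging are stand-ins chosen for illustration; the paper's fusion method extends ensemble classification techniques and need not be a plain mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_base_classifiers(views, y):
    """Fit one base classifier per data type ('view'); `views` is a list of
    feature matrices that share the same rows and labels."""
    return [LogisticRegression(max_iter=1000).fit(X, y) for X in views]

def fuse_predict(models, views):
    """Late fusion by averaging class-membership probabilities.
    All models were fitted on the same label vector, so their class order
    (models[0].classes_) is identical."""
    probs = np.mean([m.predict_proba(X) for m, X in zip(models, views)], axis=0)
    return models[0].classes_[probs.argmax(axis=1)]
```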

7.
Summary  A statistical analysis using the forward search produces many graphs. For multivariate data an appreciable proportion of these are a variety of plots of the Mahalanobis distances of the individual observations during the search. Each unit, originally a point in v-dimensional space, is then represented by a curve in two dimensions connecting the almost n values of the distance for each unit calculated during the search. Our task is now to recognise and classify these curves: we may find several clusters of data, or outliers or some unexpected, non-normal, structure. We look at the plots from five data sets. Statistical techniques include cluster analysis and transformations to multivariate normality.
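A simplified sketch of how the distance curves behind these plots arise: a subset is grown one unit at a time, and at every step the Mahalanobis distance of every unit from the current subset is recorded, giving one curve per unit. The crude initial subset used here (the units closest to the coordinate-wise median) is an assumption; the published forward search chooses it robustly.

```python
import numpy as np

def forward_search_distances(X, m0=None):
    """Record each unit's Mahalanobis distance at every step of a basic
    forward search. Returns an (n_steps, n) array: one distance curve
    per unit, as in the plots described above (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    n, v = X.shape
    m = m0 if m0 is not None else v + 1
    # naive start: the m units closest to the coordinate-wise median
    start = np.argsort(np.linalg.norm(X - np.median(X, axis=0), axis=1))[:m]
    subset = list(start)
    curves = []
    while True:
        mu = X[subset].mean(axis=0)
        inv = np.linalg.pinv(np.cov(X[subset], rowvar=False))
        diff = X - mu
        d = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff))
        curves.append(d)
        if len(subset) == n:
            break
        subset = list(np.argsort(d)[:len(subset) + 1])  # grow by one unit
    return np.array(curves)
```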

8.
We propose a modified adaptive multiresolution scheme for solving d-dimensional hyperbolic conservation laws which is based on cell-average discretization in dyadic grids. Adaptivity is obtained by interrupting the refinement at the locations where appropriate scale (wavelet) coefficients are sufficiently small. One important aspect of such a multiresolution representation is that we can use the same binary tree data structure for domains of any dimension. The tree structure allows us to succinctly represent the data and efficiently navigate through it. Dyadic grids also provide a more gradual refinement as compared with the traditional quad-trees (2D) or oct-trees (3D) that are commonly used for multiresolution analysis. We show some examples of adaptive binary tree representations, with significant savings in data storage when compared to quad-tree based schemes. As a test problem, we also consider this modified adaptive multiresolution method, using a dynamic binary tree data structure, applied to a transport equation in a 2D domain, based on a second-order finite volume discretization.

9.
Summary  The problem of detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has led to the development of techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast and growing. In this paper, we propose to use the artificial intelligence technique of the self-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto a lower-dimensional, visualizable plane, provides an easy way of detecting multidimensional outliers in the data at respective levels of leverage. The proposed SOM-based method for outlier detection not only identifies the multidimensional outliers but also provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, SOM-based outlier detection is non-parametric and can be used to detect outliers from very large multidimensional datasets. The method is applied to detect outliers from varied types of simulated multivariate datasets, a benchmark dataset and also a real-life cheque-processing dataset. The results show that SOM can effectively be used as a technique for multidimensional outlier detection.
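A bare-bones sketch of SOM-based screening: a small map is trained and records with unusually large quantization error (distance to their best-matching unit) are flagged as candidate outliers. This is one simple reading of SOM-based detection, not necessarily the paper's rule, and all parameter choices here are illustrative.

```python
import numpy as np

def train_som(X, rows=10, cols=10, n_iter=2000, seed=0):
    """Train a small rectangular SOM with online updates: Gaussian
    neighbourhood, linearly decaying learning rate and radius."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    W = rng.normal(size=(rows * cols, d))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)
        radius = max(1.0, (max(rows, cols) / 2) * (1 - t / n_iter))
        x = X[rng.integers(n)]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))       # best-matching unit
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * radius ** 2))
        W += lr * h[:, None] * (x - W)
    return W

def quantization_errors(X, W):
    """Distance of each record to its best-matching unit; unusually large
    values point at candidate multidimensional outliers."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))
```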

10.
This article presents a method for visualization of multivariate functions. The method is based on a tree structure, called the level set tree, built from separated parts of level sets of a function. The method is applied to the visualization of estimates of multivariate density functions. With different graphical representations of level set trees we may visualize the number and location of modes, excess masses associated with the modes, and certain shape characteristics of the estimate. Simulation examples are presented where projecting data to two dimensions does not help to reveal the modes of the density, but with the help of level set trees one may detect the modes. I argue that level set trees provide a useful method for exploratory data analysis.
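The core idea can be illustrated in one dimension: for a density estimate on a grid, count the connected components of each superlevel set; the level set tree records how these components appear, split and vanish as the level rises. The sketch below only counts components and is our simplification, not the article's construction.

```python
import numpy as np

def level_set_components(density, levels):
    """For a density estimate evaluated on a regular 1-D grid, count the
    connected components of each superlevel set {x : f(x) >= lam}."""
    f = np.asarray(density, dtype=float)
    counts = []
    for lam in levels:
        above = f >= lam
        # a component starts wherever `above` switches from False to True
        starts = np.sum(above & ~np.r_[False, above[:-1]])
        counts.append(int(starts))
    return counts

if __name__ == "__main__":
    x = np.linspace(-4, 4, 400)
    f = np.exp(-(x + 1.5) ** 2) + 0.8 * np.exp(-(x - 1.5) ** 2)  # two modes
    # [1, 2, 1]: one merged component, then two separate modes, then only the higher one
    print(level_set_components(f, levels=[0.1, 0.5, 0.9]))
```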

11.
The records of a database can be accessed from other records or from a set of data items (inverted access, primary and secondary indexes of IMS, search keys of CODASYL, etc.) which we call selectors. These selectors can be implemented with different techniques such as hash coding, inverted lists or hierarchical indexes (indexed sequential, B-trees, etc.). We consider here the last one and search, for a given set of selectors, for an optimal index structure. We show how this problem can be cast as the search for an optimal rooted tree among the partial subgraphs of a given graph G (known in graph theory as the Steiner problem) and we give several properties which allow the graph G to be notably reduced. We then present a branch-and-bound algorithm to solve this problem.

12.
In the case of large-scale surveys, such as a Census, data may contain errors or missing values. An automatic error correction procedure is therefore needed. We focus on the problem of restoring the consistency of agricultural data concerning cultivation areas and numbers of livestock, and we propose an optimization-based approach to this balancing problem. Possible alternative models, either linear, quadratic or mixed-integer, are presented. The mixed-integer linear model was preferred and used for the treatment of possibly unbalanced data records. Results on real-world Agricultural Census data show the effectiveness of the proposed approach.
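A toy version of the linear balancing model, assuming the simplest kind of constraint (sub-items must sum to a reported total) and an L1 adjustment cost; the Census application involves many more constraints as well as the quadratic and mixed-integer variants mentioned above.

```python
import numpy as np
from scipy.optimize import linprog

def balance_record(parts, total):
    """Minimally adjust reported sub-items so they sum to the reported total
    (L1 cost, linearized with positive/negative deviation variables).
    Returns the corrected sub-items."""
    r = np.asarray(parts, dtype=float)
    k = len(r)
    # variables: corrected values x (k), deviations dp, dm (k each)
    c = np.concatenate([np.zeros(k), np.ones(k), np.ones(k)])  # minimize sum |x - r|
    A_eq = np.zeros((k + 1, 3 * k))
    b_eq = np.zeros(k + 1)
    for i in range(k):                       # x_i - dp_i + dm_i = r_i
        A_eq[i, i] = 1.0
        A_eq[i, k + i] = -1.0
        A_eq[i, 2 * k + i] = 1.0
        b_eq[i] = r[i]
    A_eq[k, :k] = 1.0                        # sum_i x_i = total
    b_eq[k] = total
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (3 * k))
    return res.x[:k]

if __name__ == "__main__":
    # corrected parts sum to 100 (total absolute adjustment of 5)
    print(balance_record([40, 35, 30], total=100))
```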

13.
Missing data recurrently affect datasets in almost every field of quantitative research. The subject is vast and complex and has given rise to a literature rich in very different approaches to the problem. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI), or procedures involving multivariate data analysis (MVDA) techniques, seem to treat the problem properly. In NNI, the metric and the number of donors can be chosen at will. MVDA-based procedures expressly account for variable associations. The new approach proposed here, called Forward Imputation, combines these features. It is designed as a sequential procedure that imputes missing data in a step-by-step process involving subsets of units according to their "completeness rate". Two methods within this context are developed for the imputation of quantitative data. One applies NNI with the Mahalanobis distance, the other combines NNI and principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed, also in comparison with alternative imputation methods. To this purpose, a simulation study in the presence of different data patterns along with an application to real data are carried out, and practical hints for users are also provided.
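A one-pass simplification of the NNI-with-Mahalanobis variant: each incomplete record receives the missing values of its nearest complete donor, with distance computed on the recipient's observed variables. The sequential, completeness-rate-ordered structure of Forward Imputation and the PCA-based variant are not reproduced here.

```python
import numpy as np

def nni_mahalanobis(X):
    """Nearest-neighbour imputation of missing entries (NaN) using donors
    among the complete records and the Mahalanobis distance restricted to
    each recipient's observed variables (simplified sketch)."""
    X = np.asarray(X, dtype=float).copy()
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]
    cov = np.cov(donors, rowvar=False)
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        sub_inv = np.linalg.pinv(cov[np.ix_(obs, obs)])   # covariance of observed vars
        diff = donors[:, obs] - X[i, obs]
        d2 = np.einsum('ij,jk,ik->i', diff, sub_inv, diff)
        X[i, ~obs] = donors[np.argmin(d2), ~obs]          # copy donor's values
    return X
```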

14.
A new approach to assess product lifetime performance for small data sets
Because of cost and time constraints, the number of samples is usually small in the early stages of manufacturing systems, and the scarcity of actual data causes problems in decision-making. In order to solve this problem, this paper constructs a counter-intuitive hypothesis testing method that chooses the maximal p-value based on a two-parameter Weibull distribution to enhance the estimate of the nonlinear, asymmetrical shape of the product lifetime distribution. Further, we systematically generate virtual data to extend the small data set and improve the robustness of learning product lifetime performance. This study provides simulated data sets and two practical examples to demonstrate that the proposed method is a more appropriate technique to increase the estimation accuracy of product lifetime for normal or non-normal data with small sample sizes.
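One way to realise the "choose the parameters with the maximal p-value" idea: grid-search the two-parameter Weibull (shape, scale) pair that maximises a goodness-of-fit p-value on the small sample. The Kolmogorov-Smirnov test and the grid ranges are our assumptions; the paper's test statistic and its virtual-data generation step are not reproduced.

```python
import numpy as np
from scipy.stats import kstest

def max_p_weibull(data, shapes=None, scales=None):
    """Grid search for the two-parameter Weibull (shape, scale) that
    maximises the Kolmogorov-Smirnov p-value on a small sample.
    Returns (shape, scale, p_value)."""
    data = np.asarray(data, dtype=float)
    if shapes is None:
        shapes = np.linspace(0.5, 5.0, 46)
    if scales is None:
        m = data.mean()
        scales = np.linspace(0.5 * m, 2.0 * m, 31)
    best = (None, None, -1.0)
    for c in shapes:
        for s in scales:
            # args = (shape, loc, scale); loc fixed at 0 for the two-parameter form
            p = kstest(data, "weibull_min", args=(c, 0, s)).pvalue
            if p > best[2]:
                best = (float(c), float(s), float(p))
    return best

if __name__ == "__main__":
    sample = np.array([1.2, 0.9, 2.1, 1.7, 1.4, 0.8])  # a small data set
    print(max_p_weibull(sample))
```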

15.
Compositional data are considered as data where relative contributions of parts to a whole, conveyed by (log-)ratios between them, are essential for the analysis. In Symbolic Data Analysis (SDA), we are in the framework of interval data when elements are characterized by variables whose values are intervals on \(\mathbb {R}\) representing inherent variability. In this paper, we address the special problem of the analysis of interval compositions, i.e., when the interval data are obtained by the aggregation of compositions. It is assumed that the interval information is represented by the respective midpoints and ranges, and both sources of information are considered as compositions. In this context, we introduce the representation of interval data as three-way data. In the framework of the log-ratio approach from compositional data analysis, it is outlined how interval compositions can be treated in an exploratory context. The goal of the analysis is to represent the compositions by coordinates which are interpretable in terms of the original compositional parts. This is achieved by summarizing all relative information (log-ratios) about each part into one coordinate of the coordinate system. Based on an example from the European Union Statistics on Income and Living Conditions (EU-SILC), several possibilities for an exploratory data analysis approach for interval compositions are outlined and investigated.

16.
Summary  In the last decade, factorial and clustering techniques have been developed to analyze multidimensional interval data (MIDs). In classic data analysis, PCA and clustering of the most significant components are usually performed to extract cluster structure from data. The clustering of the projected data is then performed, once the noise is filtered out, in a subspace generated by a few orthogonal variables. In the framework of interval data analysis, we propose the same strategy. Several computational questions arise from this generalization. First of all, the representation of data in a factorial subspace: in classic data analysis projected points remain points, but projected MIDs do not remain MIDs. Further, the choice of a distance between the represented data: many distances between points can be computed, but few distances between convex sets of points are defined. We here propose optimized techniques for representing data by convex shapes, for computing the Hausdorff distance between convex shapes based on an L2 norm, and for performing a hierarchical clustering of projected data.
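A minimal sketch of the L2-based Hausdorff distance mentioned above, computed here between two finite sets of projected points as a stand-in for the convex shapes the paper works with; the function name and the demo data are ours.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two sets of projected points
    (maximum of the two directed Hausdorff distances, Euclidean metric)."""
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(loc=0.0, size=(50, 2))   # one projected group
    B = rng.normal(loc=3.0, size=(50, 2))   # another, shifted group
    print(round(hausdorff(A, B), 3))
```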

17.
Pathology ordering by general practitioners (GPs) is a significant contributor to rising health care costs both in Australia and worldwide. A thorough understanding of the nature and patterns of pathology utilization is an essential requirement for effective decision support for pathology ordering. In this paper we propose a novel methodology for integrating data mining and case-based reasoning for decision support for pathology ordering. We demonstrate how this methodology can facilitate intelligent decision support that is both patient-oriented and deeply rooted in practical peer-group evidence. Comprehensive data collected by professional pathology companies provide a system-wide profile of patient-specific pathology requests by various GPs, as opposed to one limited to an individual GP practice. Using real data provided by XYZ Pathology Company in Australia, which contain more than 1.5 million records of pathology requests by GPs, we illustrate how knowledge extracted from these data through data mining with Kohonen's self-organizing maps constitutes a base that, with the further assistance of modern data visualization tools and on-line processing interfaces, can provide "peer-group consensus" evidence for solving new cases of the pathology test ordering problem. The conclusion is that a formal methodology that integrates case-based reasoning principles, which are inherently close to GPs' daily practice, with data-driven, computationally intensive knowledge discovery mechanisms, which can be applied to the massive amounts of pathology request data routinely available at professional pathology companies, can facilitate more informed, evidence-based decision making by doctors in the area of pathology ordering.

18.
Data reduction is an important issue in the field of data mining. The goal of data reduction techniques is to extract a subset of data from a massive dataset while maintaining the properties and characteristics of the original data in the reduced set. This allows an otherwise difficult or impossible data mining task to be carried out efficiently and effectively. This paper describes a new method for selecting a subset of data that closely represents the original data in terms of its joint and univariate distributions. A pair of distance criteria, motivated by the χ²-statistic, are used for measuring the goodness-of-fit between the distributions of the reduced and full datasets. Under these criteria, the data reduction problem can be formulated as a bi-objective quadratic program. A genetic algorithm technique is used in the search/optimization process. Experiments conducted on several real-world data sets demonstrate the effectiveness of the proposed method.
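A sketch of the univariate side of such a criterion: a χ²-style distance between the binned distribution of a variable in the full data set and in a candidate subset. The joint-distribution criterion, the bi-objective formulation and the genetic-algorithm search are not reproduced; the names and binning choices are ours.

```python
import numpy as np

def chi2_distance(full_col, subset_col, bins=10):
    """Chi-square-style distance between the binned distribution of one
    variable in the full data set and in a candidate reduced subset."""
    edges = np.histogram_bin_edges(full_col, bins=bins)
    f, _ = np.histogram(full_col, bins=edges)
    s, _ = np.histogram(subset_col, bins=edges)
    expected = f / f.sum() * s.sum()   # expected subset counts if it matched the full data
    mask = expected > 0
    return float(np.sum((s[mask] - expected[mask]) ** 2 / expected[mask]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=5000)
    subset = rng.choice(full, size=200, replace=False)
    print(round(chi2_distance(full, subset), 2))  # small value: subset fits the marginal well
```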

19.
Abstract

An important goal of visualization technology is to support the exploration and analysis of very large amounts of data. This article describes a set of pixel-oriented visualization techniques that use each pixel of the display to visualize one data value and therefore allow the visualization of the largest amount of data possible. Most of the techniques have been specifically designed for visualizing and querying large databases. The techniques may be divided into query-independent techniques that directly visualize the data (or a certain portion of it) and query-dependent techniques that visualize the data in the context of a specific query. Examples of the class of query-independent techniques are the screen-filling curve and recursive pattern techniques. The screen-filling curve techniques are based on the well-known Morton and Peano–Hilbert curve algorithms, and the recursive pattern technique is based on a generic recursive scheme, which generalizes a wide range of pixel-oriented arrangements for visualizing large data sets. Examples of the class of query-dependent techniques are the snake-spiral and snake-axes techniques, which visualize the distances with respect to a database query and arrange the most relevant data items in the center of the display. In addition to describing the basic ideas of our techniques, we provide example visualizations generated by the various techniques, demonstrating their usefulness and showing some of their advantages and disadvantages.
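A small sketch of the screen-filling-curve idea using the simpler Morton (Z-order) curve: the i-th data value (in sort or query-relevance order) is placed at the i-th position of the curve, so every pixel encodes one value. The article's Peano-Hilbert and recursive pattern arrangements follow the same principle with different orderings; the helper names here are ours.

```python
import numpy as np

def morton_decode(i, bits=8):
    """Return (x, y) pixel coordinates for index i on a Morton (Z-order)
    curve covering a 2^bits x 2^bits display (de-interleave the bits)."""
    x = y = 0
    for b in range(bits):
        x |= ((i >> (2 * b)) & 1) << b
        y |= ((i >> (2 * b + 1)) & 1) << b
    return x, y

def pixel_layout(values, bits=8):
    """One data value per pixel: the i-th value is placed at the i-th
    position of the screen-filling curve; unused pixels stay NaN."""
    side = 1 << bits
    img = np.full((side, side), np.nan)
    for i, v in enumerate(values[: side * side]):
        x, y = morton_decode(i, bits)
        img[y, x] = v
    return img

if __name__ == "__main__":
    vals = np.sort(np.random.default_rng(0).normal(size=1000))[::-1]
    img = pixel_layout(vals, bits=5)   # 32 x 32 display, 1000 of 1024 pixels used
    print(img.shape)
```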

20.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.
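A minimal stand-in for a dissimilarity plot: the pairwise distance matrix is reordered by the leaf order of a hierarchical clustering and shaded. The article's stepwise seriation (within each cluster and between clusters) is replaced here by a single global reordering, so this is only a sketch of the idea.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list
import matplotlib.pyplot as plt

def dissimilarity_plot(X):
    """Shaded, reordered dissimilarity matrix: rows and columns are permuted
    by the leaf order of an average-linkage hierarchical clustering."""
    d = pdist(X)                               # condensed pairwise distances
    order = leaves_list(linkage(d, method="average"))
    D = squareform(d)[np.ix_(order, order)]    # reordered full matrix
    plt.imshow(D, cmap="gray_r")
    plt.title("Reordered dissimilarity matrix")
    plt.show()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(4, 1, (30, 4))])
    dissimilarity_plot(X)   # two dark diagonal blocks indicate the two clusters
```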
