Similar Documents
20 similar documents found (search time: 31 ms)
1.
Summary  An increasingly important problem in exploratory data analysis and visualization is that of scale; more and more data sets are much too large to analyze using traditional techniques, either in terms of the number of variables or the number of records. One approach to addressing this problem is the development and use of multiresolution strategies, where we represent the data at different levels of abstraction or detail through aggregation and summarization. In this paper we present an overview of our recent and current activities in the development of a multiresolution exploratory visualization environment for large-scale multivariate data. We have developed visualization, interaction, and data management techniques for effectively dealing with data sets that contain millions of records and/or hundreds of dimensions, and propose methods for applying similar approaches to extend the system to handle nominal as well as ordinal data.
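The abstract does not specify the aggregation scheme; a minimal sketch of one common multiresolution strategy is to build a pyramid of successively coarser summaries by averaging adjacent records, so the viewer can drill from a one-value overview down to the raw data (the function name and pairwise-mean choice are illustrative assumptions, not the authors' method):

```python
def multiresolution_levels(values):
    """Build a pyramid of successively coarser summaries of `values`
    by averaging adjacent pairs; level 0 is the raw data."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Average adjacent pairs; a trailing odd element is kept as-is.
        coarser = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2:
            coarser.append(prev[-1])
        levels.append(coarser)
    return levels

# 8 records collapse through 4 and 2 summaries to a single overview value.
pyramid = multiresolution_levels([1, 3, 5, 7, 2, 4, 6, 8])
```

Each level halves the record count, which is what lets such an environment render millions of records at interactive rates: only the level matching the current zoom needs to be drawn.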

2.
Abstract

An important goal of visualization technology is to support the exploration and analysis of very large amounts of data. This article describes a set of pixel-oriented visualization techniques that use each pixel of the display to visualize one data value and therefore allow the visualization of the largest amount of data possible. Most of the techniques have been specifically designed for visualizing and querying large databases. The techniques may be divided into query-independent techniques that directly visualize the data (or a certain portion of it) and query-dependent techniques that visualize the data in the context of a specific query. Examples of the class of query-independent techniques are the screen-filling curve and recursive pattern techniques. The screen-filling curve techniques are based on the well-known Morton and Peano–Hilbert curve algorithms, and the recursive pattern technique is based on a generic recursive scheme, which generalizes a wide range of pixel-oriented arrangements for visualizing large data sets. Examples of the class of query-dependent techniques are the snake-spiral and snake-axes techniques, which visualize the distances with respect to a database query and arrange the most relevant data items in the center of the display. In addition to describing the basic ideas of our techniques, we provide example visualizations generated by the various techniques, which demonstrate their usefulness and show some of their advantages and disadvantages.
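The screen-filling curve techniques rest on mapping a 1-D position along the curve to 2-D pixel coordinates. A minimal sketch of the standard Hilbert-curve index-to-coordinate conversion (this is the well-known generic algorithm, not the paper's specific implementation) illustrates the key property exploited by pixel-oriented displays, namely that consecutive data values land on adjacent pixels:

```python
def d2xy(n, d):
    """Map position d along a Hilbert curve filling an n x n grid
    (n a power of two) to pixel coordinates (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate the quadrant so sub-curves join up end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Sorting records (e.g. by query relevance) and then plotting record i at `d2xy(n, i)` keeps similar records in spatially coherent pixel clusters, which is what makes patterns in very large data sets visible at a glance.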

3.
In this paper we compare and contrast the new data mining activity of pattern search with more traditional cluster analysis methods of data mining, in the context of credit data. In particular, we examine a set of behavioural data from a large UK bank relating to the status of current accounts over a twelve month period. We show how conventional clustering approaches can be used, for example to define broad categories of behaviour, whereas pattern search can be used to find small groups of accounts that exhibit distinctive behaviour.

4.
Irregularities are widespread in large databases and often lead to erroneous conclusions with respect to data mining and statistical analysis. For example, considerable bias often results from parameter estimation procedures that do not properly handle significant irregularities. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation when multiple irregularities are hidden in large volumes of data in general and cross sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least square loss function to aid the selection of key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of data better and faster before further data modeling and analysis.

5.
Mathematical theory of optimization has found many applications in the area of medicine over the last few decades. Several data analysis and decision making problems in medicine can be formulated using optimization and data mining techniques. The significance of the mathematical models has been greatly realized in recent years owing to the growing technological capabilities and the large amounts of data available. In this paper, we attempt to give a brief overview of some of the most interesting applications of mathematical programming and data mining in medicine. In the overview, we include applications like radiation therapy treatment, microarray data analysis, and computational neuroscience.

6.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.
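The core of a dissimilarity plot is a seriation step that permutes the dissimilarity matrix so similar objects sit next to each other before shading. The paper uses modern (approximate) seriation solvers; as a minimal stand-in, the sketch below uses a greedy nearest-neighbour ordering, which already makes block structure visible on small examples (the greedy heuristic and function names are illustrative assumptions):

```python
def seriate(dissim):
    """Greedy nearest-neighbour seriation: starting from object 0, repeatedly
    append the closest not-yet-placed object. A crude stand-in for proper
    seriation solvers, but enough to surface cluster blocks."""
    n = len(dissim)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: dissim[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def shade(dissim, order):
    """Permute rows and columns by `order`; a heatmap of the result is the
    matrix-shading view, with dark blocks on the diagonal marking clusters."""
    return [[dissim[i][j] for j in order] for i in order]

# Two hidden clusters {0, 2} and {1, 3}: seriation pulls them together.
D = [[0, 9, 1, 8],
     [9, 0, 8, 1],
     [1, 8, 0, 9],
     [8, 1, 9, 0]]
order = seriate(D)
```

In the reordered matrix, within-cluster dissimilarities line up in diagonal blocks, so a misspecified cluster count shows up as blocks that split or merge unexpectedly.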

7.
The degree of correlation between variables is used in many data analysis applications as a key measure of interdependence. The most common techniques for exploratory analysis of pairwise correlation in multivariate datasets, like scatterplot matrices and clustered heatmaps, however, do not scale well to large datasets, either computationally or visually. We present a new visualization that is capable of encoding pairwise correlation between hundreds of thousands of variables, called the s-CorrPlot. The s-CorrPlot spatially encodes correlation between variables as points on a scatterplot using the geometric structure underlying Pearson’s correlation. Furthermore, we extend the s-CorrPlot with interactive techniques that enable animation of the scatterplot to new projections of the correlation space, as illustrated in the companion video in supplementary materials. We provide the s-CorrPlot as an open-source R package and validate its effectiveness through a variety of methods including a case study with a biology collaborator. Supplementary materials for this article are available online.
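The geometric structure the abstract refers to is the standard fact that, after centering and scaling each variable to unit length, Pearson's correlation is simply the dot product (the cosine of the angle) between the variable vectors, which is what allows correlation to be encoded spatially. A minimal sketch of that identity (helper names are illustrative, not the package's API):

```python
import math

def standardize(v):
    """Center a variable and scale it to unit length; correlation between
    two variables is then just the dot product of their unit vectors."""
    m = sum(v) / len(v)
    centered = [x - m for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def pearson(u, v):
    """Pearson's r computed as a cosine: dot product of standardized vectors."""
    return sum(a * b for a, b in zip(standardize(u), standardize(v)))
```

Because every standardized variable lives on a unit sphere, projecting those points onto a plane gives a scatterplot in which spatial proximity encodes correlation, the basic idea behind the s-CorrPlot view.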

8.
Visual data mining is an efficient way to involve humans in the search for an optimal decision. This paper focuses on the optimization of the visual presentation of multidimensional data. A variety of methods for projecting multidimensional data onto the plane have been developed, and a tendency toward their joint use is now observed. In this paper, two sequential combinations of the self-organizing map (SOM) with two other well-known nonlinear projection methods, Sammon’s mapping and multidimensional scaling (MDS), are examined theoretically and experimentally. The investigations showed that the two combinations (SOM_Sammon and SOM_MDS) have similar efficiency. This supports applying MDS together with the SOM, since until now most research has paired the SOM with Sammon’s mapping. Problems concerning the quality and accuracy of such combined visualization are discussed. Three criteria of different natures are selected for evaluating the efficiency of the combined mapping; their joint use allows the best visualization result to be chosen from several possible ones. Several different initialization strategies for nonlinear mapping are examined, and a new one is suggested, along with a new approach to SOM visualization. The obtained results allow better decisions to be made when optimizing the data visualization.
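One standard criterion for judging such combined projections (though the abstract does not name the paper's three criteria, so this is an illustrative assumption) is Sammon's stress, which measures how well pairwise distances in the low-dimensional plot preserve those in the original space, weighting small original distances more heavily:

```python
def sammon_stress(d_high, d_low):
    """Sammon's stress between corresponding pairwise distances in the
    original space (d_high) and the 2-D projection (d_low), both given as
    flat lists over the same pairs. 0 means distances are preserved exactly;
    small original distances are weighted more heavily."""
    total = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / total
```

In a SOM_Sammon or SOM_MDS pipeline, a criterion like this would be evaluated on the distances between SOM codebook vectors versus their plotted positions, letting the analyst pick the best of several candidate layouts.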

9.
The use of Virtual Reality (VR) techniques for the investigation of complex flow phenomena offers distinct advantages in comparison to conventional visualization techniques. Especially for unsteady flows, VR methodology provides an intuitive approach for the exploration of simulated fluid flows. However, the visualization of Computational Fluid Dynamics (CFD) data is often too time-consuming to be carried out in real-time, and thus violates essential constraints concerning real-time interaction and visualization. To overcome this obstacle, we make use of the fact that typically a multi-block approach is employed for domain decomposition, and we use the corresponding data structures for the computation of path lines and for parallelization. In this paper, we present the synthesis of fragmented multi-block data sets and our implementation of an accurate path line integration scheme in order to speed up path line computations. We report on the results of our efforts and describe a combination of this algorithm with a highly efficient visualization approach of large amounts of particle traces, thus considerably improving interactivity when exploring large scale CFD data sets.
Mathematics Subject Classifications (2000): 76Mxx, 76M27, 76M28, 65M55, 65L05, 65L06, 65D05, 65Y05, 68U05.
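The abstract does not detail the integration scheme; a common choice for accurate path line computation in an unsteady field is classical fourth-order Runge-Kutta, sketched below for a 2-D time-dependent velocity field (the 2-D restriction and function names are simplifications for illustration; real CFD codes work in 3-D across multi-block grids):

```python
def rk4_step(vel, pos, t, dt):
    """One fourth-order Runge-Kutta step for a path line through the
    time-dependent velocity field vel(pos, t) -> (vx, vy)."""
    k1 = vel(pos, t)
    k2 = vel((pos[0] + 0.5 * dt * k1[0], pos[1] + 0.5 * dt * k1[1]), t + 0.5 * dt)
    k3 = vel((pos[0] + 0.5 * dt * k2[0], pos[1] + 0.5 * dt * k2[1]), t + 0.5 * dt)
    k4 = vel((pos[0] + dt * k3[0], pos[1] + dt * k3[1]), t + dt)
    return (pos[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            pos[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def path_line(vel, seed, t0, dt, steps):
    """Trace a particle released at `seed` at time t0 through `steps` steps."""
    pos, t, trace = seed, t0, [seed]
    for _ in range(steps):
        pos = rk4_step(vel, pos, t, dt)
        t += dt
        trace.append(pos)
    return trace
```

In a multi-block setting, each step additionally requires locating the block containing `pos` and interpolating the stored velocity there, which is where the paper's block data structures and parallelization come in.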

10.
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is two-fold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three reproducible case studies, each based on publicly available data sources.

11.

Asymmetric pairwise relationships are frequently observed in experimental and non-experimental studies. They can be analysed with different aims and approaches. A brief review of models and methods of multidimensional scaling and cluster analysis able to deal with asymmetric proximities is provided taking a ‘data-analytic’ approach and emphasizing data visualization.


12.
This article proposes a new hybrid visualization technique that integrates a frequency-based model and a generalized parallel coordinate plot (GPCP), thus mitigating the visual cluttering of GPCP. In the new technique, a GPCP’s profile lines (or curves) with similar frequencies are detected and saturated with color intensity corresponding to those frequencies. The technique may be employed to enhance a family of visualization tools—the Andrews plot and scatterplot matrix, for example. In addition to its efficiency in reducing visual clutter in multivariate data visualization, the new technique is computationally feasible, easy to implement, and has important mathematical and statistical properties. The reliability and accuracy of the technique are demonstrated through extensive experiments on challenging datasets, both simulated and real. These datasets are high-dimensional and large enough that they cannot be explored with GPCP or frequency-based techniques alone.

The datasets for pollen, OUT5D, and California housing are available in the online supplements.
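The frequency-based side of such a hybrid can be sketched simply: discretize each record into a bin pattern, count how many records share each pattern, and use the relative frequency to set the colour intensity of the corresponding profile line. The bin count and value range below are illustrative assumptions, not the paper's parameters:

```python
from collections import Counter

def profile_frequencies(records, bins=4, lo=0.0, hi=1.0):
    """Discretize each multivariate record (values assumed in [lo, hi])
    into a tuple of bin indices and return each pattern's relative
    frequency; high-frequency patterns get saturated profile lines,
    rare ones fade, which is what reduces over-plotting clutter."""
    width = (hi - lo) / bins
    def pattern(rec):
        return tuple(min(int((x - lo) / width), bins - 1) for x in rec)
    freq = Counter(pattern(r) for r in records)
    return {p: c / len(records) for p, c in freq.items()}
```

Rendering one representative line per pattern, weighted by frequency, draws far fewer primitives than one line per record, which is how frequency models make parallel-coordinate displays scale.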

13.
Summary  Jasp is an experimental general-purpose Java-based statistical system that adopts several new computing technologies. It has a function-based, object-oriented language, an advanced user interface, flexible extensibility, and a server/client architecture with distributed computing abilities. DAVIS, on the other hand, is a stand-alone Java-based system designed to provide advanced data visualization functions through easy GUI operations. In this paper, we make it possible to use DAVIS tools from within Jasp, so that the integrated system can handle not only data filtering and statistical analysis but also data visualization. We develop a mechanism for extending the Jasp server/client system to realize efficient collaboration with DAVIS on the client side, and show that the mechanism is straightforward and simple.

14.
Advanced statistical techniques and data mining methods have been recognized as a powerful support for mass spectrometry (MS) data analysis. Particularly, due to its unsupervised learning nature, data clustering has attracted increasing interest for exploring, identifying, and discriminating pathological cases from MS clinical samples. Supporting biomarker discovery in protein profiles has drawn special attention from biologists and clinicians. However, the huge amount of information contained in a single sample, that is, the high-dimensionality of MS data, makes the effective identification of biomarkers a challenging problem. In this paper, we present a data mining approach for the analysis of MS data, in which the mining phase is developed as a task of clustering MS data. Under the natural assumption of modeling MS data as time series, we propose a new representation model of MS data which allows for significantly reducing their high-dimensionality while preserving the relevant features. Besides the reduction of high-dimensionality (which typically affects the effectiveness and efficiency of computational methods), the proposed representation model also alleviates the critical task of preprocessing the raw spectra in the whole process of MS data analysis. We evaluated our MS data clustering approach on publicly available proteomic datasets, and experimental results have shown the effectiveness of the proposed approach, which can be used to aid clinicians in studying and formulating diagnoses of pathological states.
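The paper's representation model is not specified in this abstract; a standard illustrative stand-in for reducing the dimensionality of a series while preserving coarse shape is piecewise aggregate approximation (PAA), which replaces runs of intensities with their segment means:

```python
def paa(spectrum, segments):
    """Piecewise aggregate approximation: compress a long intensity series
    into `segments` mean values. Peaks broader than one segment survive,
    so coarse spectral shape is preserved at a fraction of the length."""
    n = len(spectrum)
    out = []
    for s in range(segments):
        lo = s * n // segments
        hi = (s + 1) * n // segments
        seg = spectrum[lo:hi]
        out.append(sum(seg) / len(seg))
    return out
```

Clustering the reduced vectors instead of raw spectra (often tens of thousands of m/z readings each) is what makes distance computations between samples tractable.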

15.
Nowadays, with the volume of data growing at an unprecedented rate, large-scale data mining and knowledge discovery have become a new challenge. Rough set theory for knowledge acquisition has been successfully applied in data mining. The recently introduced MapReduce technique has received much attention from both the scientific community and industry for its applicability in big data analysis. To mine knowledge from big data, we present parallel large-scale rough set based methods for knowledge acquisition using MapReduce, and implement them on several representative MapReduce runtime systems: Hadoop, Phoenix and Twister. Performance comparisons on these runtime systems are reported. The experimental results show that (1) computational time is generally lowest on Twister when employing the same number of cores; (2) Hadoop achieves the best speedup on larger data sets; (3) Phoenix achieves the best speedup on smaller data sets. The excellent speedups also demonstrate that the proposed parallel methods can effectively process very large data on different runtime systems. Pitfalls and advantages of these runtime systems are also illustrated through our experiments, which can help users decide which runtime system to use in their applications.
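A basic rough-set building block that parallelizes naturally under MapReduce is computing the equivalence classes of the indiscernibility relation: map each row to its condition-attribute values, then reduce by grouping rows with identical keys. The toy sketch below mimics that map/shuffle/reduce shape in plain Python (the function names and data layout are illustrative, not the paper's implementation):

```python
from collections import defaultdict

def map_phase(rows, attrs):
    """Map: emit (condition-attribute values, row id) pairs."""
    for rid, row in enumerate(rows):
        yield tuple(row[a] for a in attrs), rid

def reduce_phase(pairs):
    """Reduce: group row ids sharing identical attribute values, i.e. the
    equivalence classes of the indiscernibility relation over `attrs`."""
    groups = defaultdict(list)
    for key, rid in pairs:
        groups[key].append(rid)
    return dict(groups)

rows = [{"a": 1, "b": 0}, {"a": 1, "b": 0}, {"a": 2, "b": 1}]
classes = reduce_phase(map_phase(rows, ["a", "b"]))
```

Because the map step is embarrassingly parallel over rows and the reduce step groups by key, the same pattern ports directly to Hadoop, Phoenix, or Twister; approximations and reducts are then built from these classes.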

16.
Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces. In this paper, we present an adaptation of the visual analytics framework to the context of software understanding for maintenance. We discuss the similarities and differences of the general visual analytics context with the software maintenance context, and present in detail an instance of a visual software analytics application for the build optimization of large-scale code bases. Our application combines and adapts several data mining and information visualization techniques in answering several questions that help developers in assessing and reducing the build cost of such code bases by means of user-driven, interactive analysis techniques.

17.
Fast detection of string differences is a prerequisite for string clustering problems. An example of such a problem is the identification of duplicate information in the data cleansing stage of the data mining process. The relevant algorithms allow the application of large-scale clustering techniques in order to create clusters of similar strings. The vast majority of comparisons, in such cases, are between very dissimilar strings, so methods that perform better at detecting large differences are preferable. This paper presents approaches which comply with this requirement, based on a reformulation of the underlying shortest path problem. It is believed that such methods can lead to a family of new algorithms. An upper bound algorithm is presented, as an example, which produces promising results.
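The paper's shortest-path reformulation is not given here; a widely used technique with the same goal of rejecting very dissimilar strings quickly is the banded (threshold-limited) Levenshtein test, which computes only a diagonal band of the edit-distance table and bails out early via the length check. This sketch is a stand-in illustrating that idea, not the paper's algorithm:

```python
def within_distance(s, t, k):
    """Banded Levenshtein test: return True iff edit_distance(s, t) <= k.
    Only a band of width 2k+1 per row is computed (cells outside it stay
    at the k+1 sentinel), so very dissimilar strings -- the common case in
    large-scale clustering -- are rejected in O(k * len(s)) time."""
    if abs(len(s) - len(t)) > k:
        return False  # lengths alone force more than k edits
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [k + 1] * len(t)
        for j in range(max(1, i - k), min(len(t), i + k) + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1])) # substitution
        prev = cur
    return prev[len(t)] <= k
```

For clustering, pairs failing `within_distance` at a small k can be discarded without ever computing the exact distance, which is where most of the comparison time is saved.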

18.
Abstract

The community of researchers studying global climate change is preparing to launch the first Earth observing system (EOS) satellite, EOS Terra. The satellite will generate huge amounts of data, filling gaps in the information available to address critical questions about Earth's climate. But many data handling and data analysis problems must be solved if we are to make best use of the new measurements. In key areas, the experience and expertise of the statistics community could be of great help.

19.
Pathology ordering by general practitioners (GPs) is a significant contributor to rising health care costs both in Australia and worldwide. A thorough understanding of the nature and patterns of pathology utilization is an essential requirement for effective decision support for pathology ordering. In this paper a novel methodology for integrating data mining and case-based reasoning for decision support for pathology ordering is proposed. It is demonstrated how this methodology can facilitate intelligent decision support that is both patient-oriented and deeply rooted in practical peer-group evidence. Comprehensive data collected by professional pathology companies provide a system-wide profile of patient-specific pathology requests by various GPs, as opposed to one limited to an individual GP practice. Using real data provided by XYZ Pathology Company in Australia, containing more than 1.5 million records of pathology requests by GPs, we illustrate how knowledge extracted from these data through data mining with Kohonen’s self-organizing maps constitutes the base that, with further assistance of modern data visualization tools and on-line processing interfaces, can provide “peer-group consensus” evidence for solving new cases of the pathology test ordering problem. The conclusion is that this formal methodology, which integrates case-based reasoning principles inherently close to GPs’ daily practice with data-driven, computationally intensive knowledge discovery mechanisms applicable to the massive amounts of pathology request data routinely available at professional pathology companies, can facilitate more informed evidential decision making by doctors in the area of pathology ordering.

20.
We introduce methods for visualization of data structured along trees, especially hierarchically structured collections of time series. To this end, we identify questions that often emerge when working with hierarchical data and provide an R package to simplify their investigation. Our key contribution is the adaptation of the visualization principles of focus-plus-context and linking to the study of tree-structured data. Our motivating application is to the analysis of bacterial time series, where an evolutionary tree relating bacteria is available a priori. However, we have identified common problem types where, if a tree is not directly available, it can be constructed from data and then studied using our techniques. We perform detailed case studies to describe the alternative use cases, interpretations, and utility of the proposed visualization methods.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号