Similar Documents
10 similar documents were found.
1.
An increasingly important problem in exploratory data analysis and visualization is that of scale; more and more data sets are much too large to analyze using traditional techniques, either in terms of the number of variables or the number of records. One approach to addressing this problem is the development and use of multiresolution strategies, where we represent the data at different levels of abstraction or detail through aggregation and summarization. In this paper we present an overview of our recent and current activities in the development of a multiresolution exploratory visualization environment for large-scale multivariate data. We have developed visualization, interaction, and data management techniques for effectively dealing with data sets that contain millions of records and/or hundreds of dimensions, and propose methods for applying similar approaches to extend the system to handle nominal as well as ordinal data.
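The aggregation-and-summarization idea behind such multiresolution views can be illustrated with a minimal sketch (this is not the authors' system; the table, column names, and grouping variable are hypothetical):

```python
import numpy as np
import pandas as pd

# Simulated table standing in for a large multivariate data set.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice(["N", "S", "E", "W"], size=n),  # hypothetical grouping variable
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})

# Coarse level of detail: one summary row per group (mean, spread, count),
# which is what a multiresolution display would render first.
coarse = df.groupby("region").agg(
    x1_mean=("x1", "mean"), x1_std=("x1", "std"),
    x2_mean=("x2", "mean"), x2_std=("x2", "std"),
    records=("x1", "size"),
)

# Finer level of detail: drill down into one group only when the user requests it.
detail_north = df[df["region"] == "N"]
print(coarse)
```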

2.
Data reduction is an important issue in the field of data mining. The goal of data reduction techniques is to extract a subset of data from a massive dataset while maintaining the properties and characteristics of the original data in the reduced set. This allows an otherwise difficult or impossible data mining task to be carried out efficiently and effectively. This paper describes a new method for selecting a subset of data that closely represents the original data in terms of its joint and univariate distributions. A pair of distance criteria, motivated by the χ²-statistic, are used for measuring the goodness-of-fit between the distributions of the reduced and full datasets. Under these criteria, the data reduction problem can be formulated as a bi-objective quadratic program. A genetic algorithm technique is used in the search/optimization process. Experiments conducted on several real-world data sets demonstrate the effectiveness of the proposed method.
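A minimal sketch of the univariate goodness-of-fit criterion is given below; the joint-distribution criterion, the bi-objective quadratic program, and the genetic search are omitted, and the data and binning choices are illustrative assumptions:

```python
import numpy as np

def chi2_distance(full, subset, bins=20):
    """Chi-square-style distance between the univariate distribution of a
    candidate subset and that of the full data set."""
    edges = np.histogram_bin_edges(full, bins=bins)
    full_counts, _ = np.histogram(full, bins=edges)
    sub_counts, _ = np.histogram(subset, bins=edges)
    # Expected subset counts if the subset followed the full distribution.
    expected = full_counts * (len(subset) / len(full))
    mask = expected > 0
    return float(np.sum((sub_counts[mask] - expected[mask]) ** 2 / expected[mask]))

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)
candidate = rng.choice(data, size=500, replace=False)
print(chi2_distance(data, candidate))
```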

3.
Compositional data are data in which the relative contributions of parts to a whole, conveyed by (log-)ratios between them, are essential for the analysis. In Symbolic Data Analysis (SDA), interval data arise when elements are characterized by variables whose values are intervals on \(\mathbb{R}\) representing inherent variability. In this paper, we address the special problem of the analysis of interval compositions, i.e., when the interval data are obtained by the aggregation of compositions. It is assumed that the interval information is represented by the respective midpoints and ranges, and both sources of information are considered as compositions. In this context, we introduce the representation of interval data as three-way data. In the framework of the log-ratio approach from compositional data analysis, it is outlined how interval compositions can be treated in an exploratory context. The goal of the analysis is to represent the compositions by coordinates that are interpretable in terms of the original compositional parts. This is achieved by summarizing all relative information (log-ratios) about each part into one coordinate of the coordinate system. Based on an example from the European Union Statistics on Income and Living Conditions (EU-SILC), several possibilities for an exploratory data analysis approach for interval compositions are outlined and investigated.
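A minimal sketch of the midpoint/range representation in log-ratio coordinates follows; the interval bounds are hypothetical, and centered log-ratio (clr) coordinates are used here for brevity rather than the pivot coordinates that summarize each part into a single coordinate:

```python
import numpy as np

def clr(parts):
    """Centered log-ratio coordinates of a composition (all parts > 0)."""
    logs = np.log(parts)
    return logs - logs.mean(axis=-1, keepdims=True)

# Hypothetical interval composition: lower and upper bounds of three parts.
lower = np.array([0.20, 0.30, 0.40])
upper = np.array([0.30, 0.40, 0.50])

midpoints = (lower + upper) / 2   # first composition: interval midpoints
ranges = upper - lower            # second composition: interval ranges

# Both sources of information are treated as compositions and expressed in
# log-ratio coordinates; a pivot (ilr) basis could be used instead so that all
# relative information about one part is gathered in one coordinate.
print("clr(midpoints):", clr(midpoints))
print("clr(ranges):   ", clr(ranges))
```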

4.
Data envelopment analysis (DEA) is a method to estimate the relative efficiency of decision-making units (DMUs) performing similar tasks in a production system that consumes multiple inputs to produce multiple outputs. So far, a number of DEA models with interval data have been developed; the CCR, BCC, and FDH models with interval data are well known as the basic ones. In this study, we suggest a model called the interval generalized DEA (IGDEA) model, which can treat these basic DEA models with interval data in a unified way. In addition, by establishing the theoretical properties of the relationships between the IGDEA model and those DEA models with interval data, we prove that the IGDEA model makes it possible to calculate the efficiency of DMUs while incorporating various preference structures of decision makers.
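For concreteness, the sketch below solves the basic CCR multiplier model with SciPy and evaluates it under the most favorable data scenario for the evaluated DMU, which gives the upper efficiency bound in interval DEA. It illustrates one of the basic models the IGDEA model unifies, not the IGDEA model itself, and the interval data are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(x_dmus, y_dmus, o):
    """Input-oriented CCR efficiency (multiplier form) of DMU o for crisp data:
    max u.y_o  s.t.  v.x_o = 1,  u.y_j - v.x_j <= 0 for all j,  u, v >= 0."""
    n, m = x_dmus.shape                 # n DMUs, m inputs
    s = y_dmus.shape[1]                 # s outputs
    c = np.concatenate([-y_dmus[o], np.zeros(m)])            # maximize u.y_o
    A_ub = np.hstack([y_dmus, -x_dmus])                       # u.y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.zeros(s), x_dmus[o]])[None, :]  # v.x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m), method="highs")
    return -res.fun

# Hypothetical interval data: x_lo <= x <= x_hi and y_lo <= y <= y_hi.
x_lo = np.array([[2.0], [3.0], [4.0]]); x_hi = np.array([[2.5], [3.5], [5.0]])
y_lo = np.array([[1.0], [1.5], [1.8]]); y_hi = np.array([[1.2], [1.7], [2.2]])

o = 0
# Upper bound: DMU o at its most favorable data, the others at their least favorable.
x_best = x_hi.copy(); x_best[o] = x_lo[o]
y_best = y_lo.copy(); y_best[o] = y_hi[o]
print("upper efficiency bound for DMU", o, ":", ccr_efficiency(x_best, y_best, o))
```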

5.
This paper extends the classical cost efficiency (CE) models to include data uncertainty. We believe that many research situations are best described by the intermediate case, where some uncertain input and output data are available. In such cases, the classical cost efficiency models cannot be used, because the input and output data appear in the form of ranges. When the data are imprecise in this way, the cost efficiency measure calculated from them should be uncertain as well. So, in the current paper, we develop a method for estimating upper and lower bounds for the cost efficiency measure in situations of uncertain input and output data. We also extend the theory of efficiency measurement to accommodate incomplete price information by deriving upper and lower bounds for the cost efficiency measure. The practical application of these bounds is illustrated by a numerical example.
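A minimal sketch of the classical CE linear program under constant returns to scale is shown below (hypothetical data and prices; in the interval-data setting, the upper and lower CE bounds are obtained by re-solving the same program under the most and least favorable data scenarios for the evaluated DMU):

```python
import numpy as np
from scipy.optimize import linprog

def cost_efficiency(X, Y, prices, o):
    """Classical cost efficiency of DMU o: minimize the cost p.x of an input
    bundle x that can still produce DMU o's outputs within the DEA technology,
    then divide by DMU o's observed cost.  X is (n, m) inputs, Y is (n, s) outputs."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision variables: [x (m inputs), lambda (n intensity weights)].
    c = np.concatenate([prices, np.zeros(n)])
    # Technology constraints:  X^T lambda - x <= 0   and   -Y^T lambda <= -y_o.
    A_ub = np.block([[-np.eye(m), X.T],
                     [np.zeros((s, m)), -Y.T]])
    b_ub = np.concatenate([np.zeros(m), -Y[o]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (m + n), method="highs")
    return res.fun / float(prices @ X[o])

# Hypothetical crisp data (3 DMUs, 2 inputs, 1 output) and input prices.
X = np.array([[2.0, 3.0], [4.0, 1.0], [3.0, 3.0]])
Y = np.array([[1.0], [1.0], [1.0]])
p = np.array([1.0, 1.0])
print("cost efficiency of DMU 2:", cost_efficiency(X, Y, p, 2))
```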

6.
Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, considerable bias often results from parameter estimation procedures that do not properly handle significant irregularities. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation where multiple irregularities are hidden in large volumes of data in general and cross-sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least-squares loss function to aid the selection of key irregularities in consecutive steps and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.
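One concrete way to read the penalized least-squares step is a mean-shift model with an L1 penalty on per-observation shifts, sketched below under simplifying assumptions (simulated data, and a single linear trend removed beforehand instead of the full decomposition into basic components):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical series: linear trend plus noise, with a few injected irregularities.
rng = np.random.default_rng(2)
n = 200
t = np.arange(n)
y = 0.05 * t + rng.normal(scale=0.3, size=n)
y[[40, 120, 175]] += np.array([4.0, -5.0, 6.0])   # planted irregularities

# Mean-shift model: y_t = trend_t + gamma_t + noise, with an L1 penalty on the
# per-observation shifts gamma; nonzero gamma_t flag candidate irregularities.
detrended = y - np.poly1d(np.polyfit(t, y, 1))(t)
# sklearn's Lasso objective uses 1/(2n) on the squared loss, so the effective
# soft-threshold on each shift is alpha * n = 2.0 here.
model = Lasso(alpha=0.01, fit_intercept=False).fit(np.eye(n), detrended)
irregular_idx = np.flatnonzero(model.coef_)
print("flagged irregularities at indices:", irregular_idx)
```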

7.
Suppose the parametric form of a curve is not known, but only a set of observations. Quadrature formulae can be used to integrate a function known only from a set of data points. However, the results will be unreliable if the data contain measurement errors (noise). The method presented here fits an even-degree piecewise polynomial to the data, using all the data points as knot points, with a smoothing parameter that is optimal for the indefinite integral of the curve, which happens to be a smoothing spline. After the smoothing parameter has been chosen, this approach is less computationally expensive than fitting a smoothing spline and integrating it.
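For orientation, the baseline the paper compares against (fit a smoothing spline to the noisy observations and integrate the fit) can be sketched as follows; the data, noise level, and smoothing factor are illustrative, and the paper's own estimator avoids this second, more expensive fit:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Noisy samples of an unknown curve (here sin(x) on [0, pi] for illustration).
rng = np.random.default_rng(3)
x = np.linspace(0.0, np.pi, 60)
y = np.sin(x) + rng.normal(scale=0.05, size=x.size)

# Smoothing spline with s chosen roughly as n * noise variance, then integrated.
spline = UnivariateSpline(x, y, k=3, s=x.size * 0.05**2)
estimate = spline.integral(0.0, np.pi)
print("estimated integral:", estimate, " true value:", 2.0)
```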

8.
In this paper, we investigate DEA with interval input-output data. First, we present various extensions of efficiency and show that 25 of them are essential. Second, we formulate the efficiency test problems as mixed integer programming problems. We prove that 14 of the 25 problems can be reduced to linear programming problems and that the other 11 efficiencies can be tested by solving a finite sequence of linear programming problems. Third, in order to obtain efficiency scores, we extend the SBM model to interval input-output data. Fourth, to moderate a possible positive overassessment by DEA, we introduce the inverted DEA model with interval input-output data. Using efficiency and inefficiency scores, we propose a classification of DMUs. Finally, we apply the proposed approach to Japanese bank data and demonstrate its advantages.
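The classification step can be illustrated with a small sketch that combines an efficiency score from the DEA model with an inefficiency score from the inverted model; the scores and the threshold of 1.0 are hypothetical stand-ins, and the paper's exact rule may differ:

```python
# Hypothetical (efficiency, inefficiency) score pairs per DMU; a score of 1.0
# is taken to mark the respective frontier, following the usual DEA convention.
scores = {
    "A": (1.00, 0.62),
    "B": (1.00, 1.00),
    "C": (0.71, 1.00),
    "D": (0.80, 0.75),
}

for dmu, (eff, ineff) in scores.items():
    if eff >= 1.0 and ineff < 1.0:
        label = "efficient"
    elif eff < 1.0 and ineff >= 1.0:
        label = "inefficient"
    elif eff >= 1.0 and ineff >= 1.0:
        label = "on both frontiers (mixed)"
    else:
        label = "intermediate"
    print(dmu, label)
```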

9.
In the statistical analysis of environmental data, space and time are often disregarded through the use of classical methods such as hydrological frequency analysis or factor analysis. But these methods, based on the assumption of independent, identically distributed observations, cannot be efficient. This article discusses more appropriate approaches that account for spatial and temporal influences, and surveys some important proposals for modeling environmental data. Three examples show the workability of the presented theory. In the first example, a system is developed to detect abnormal occurrences in water quality as early as possible based on quasi-continuous data. A second example decomposes a water quality time series into three unobservable components. Finally, it is shown how the factor model can be extended to time series data.
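A minimal sketch of the second example's idea, decomposing a series into unobservable components (trend, seasonal, irregular), is given below using a structural time series model; the simulated monthly series and the model orders are assumptions, not the article's actual data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical monthly water-quality series: trend + annual seasonality + noise.
rng = np.random.default_rng(4)
n = 120
t = np.arange(n)
y = 5.0 + 0.01 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.2, size=n)

# Structural time series model with a local linear trend and a period-12 seasonal.
model = sm.tsa.UnobservedComponents(y, level="local linear trend", seasonal=12)
fit = model.fit(disp=False)

trend = fit.level.smoothed          # smoothed trend component
seasonal = fit.seasonal.smoothed    # smoothed seasonal component
irregular = y - trend - seasonal    # remainder (irregular) component
print("first smoothed trend values:", np.round(trend[:5], 3))
```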

10.
This research attempts to solve the problem of dealing with missing data via the interface of Data Envelopment Analysis (DEA) and human behavior. Missing data is under continuing discussion in various research fields, especially those highly dependent on data. In practice and research, necessary data often cannot be obtained, for example because of procedural factors or a lack of needed responses. This raises the question of how to deal with missing data. In this paper, modified DEA models are developed to estimate an appropriate value for a missing datum within its interval, based on DEA and the Inter-dimensional Similarity Halo Effect. The estimated value is determined by the General Impression of the original DEA efficiency. To evaluate the effectiveness of this method, an impact factor is proposed. In addition, the advantages of the proposed approach are illustrated in comparison with previous methods.
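One possible reading of the halo-effect idea is sketched below: search the missing datum's interval for the value whose resulting efficiency is closest to a target "general impression" score. The one-input/one-output efficiency function, the data, and the target value are hypothetical simplifications, not the paper's modified DEA models:

```python
import numpy as np

def ccr_single(x, y, o):
    """CCR efficiency in the one-input/one-output case: a DMU's output/input
    ratio relative to the best ratio in the sample."""
    ratios = y / x
    return ratios[o] / ratios.max()

# Hypothetical data: DMU 2's input is missing but known to lie in [3.0, 5.0].
x = np.array([2.0, 4.0, np.nan])
y = np.array([1.0, 1.8, 1.5])
interval = (3.0, 5.0)

# Stand-in "general impression": the target efficiency suggested by the DMU's
# other information (the paper derives it from the original DEA efficiency).
general_impression = 0.8

# Grid-search the interval for the value whose resulting efficiency is closest
# to the general impression.
candidates = np.linspace(*interval, 201)
best = min(candidates,
           key=lambda v: abs(ccr_single(np.where(np.isnan(x), v, x), y, 2)
                             - general_impression))
print("estimated missing input:", best)
```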
