Similar Documents
10 similar documents were found.
1.
An increasingly important problem in exploratory data analysis and visualization is that of scale; more and more data sets are much too large to analyze using traditional techniques, either in terms of the number of variables or the number of records. One approach to addressing this problem is the development and use of multiresolution strategies, where we represent the data at different levels of abstraction or detail through aggregation and summarization. In this paper we present an overview of our recent and current activities in the development of a multiresolution exploratory visualization environment for large-scale multivariate data. We have developed visualization, interaction, and data management techniques for effectively dealing with data sets that contain millions of records and/or hundreds of dimensions, and propose methods for applying similar approaches to extend the system to handle nominal as well as ordinal data.
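The aggregation-and-summarization idea behind such multiresolution views can be illustrated with a minimal sketch (this is not the authors' system; the table, column names, and grouping variable are hypothetical):

```python
import numpy as np
import pandas as pd

# Simulated table standing in for a large multivariate data set.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice(["N", "S", "E", "W"], size=n),  # hypothetical grouping variable
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})

# Coarse level of detail: one summary row per group (mean, spread, count),
# which is what a multiresolution display would render first.
coarse = df.groupby("region").agg(
    x1_mean=("x1", "mean"), x1_std=("x1", "std"),
    x2_mean=("x2", "mean"), x2_std=("x2", "std"),
    records=("x1", "size"),
)

# Finer level of detail: drill down into one group only when the user requests it.
detail_north = df[df["region"] == "N"]
print(coarse)
```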

2.
Data reduction is an important issue in the field of data mining. The goal of data reduction techniques is to extract a subset of data from a massive dataset while maintaining the properties and characteristics of the original data in the reduced set. This allows an otherwise difficult or impossible data mining task to be carried out efficiently and effectively. This paper describes a new method for selecting a subset of data that closely represents the original data in terms of its joint and univariate distributions. A pair of distance criteria, motivated by the χ²-statistic, are used for measuring the goodness-of-fit between the distributions of the reduced and full datasets. Under these criteria, the data reduction problem can be formulated as a bi-objective quadratic program. A genetic algorithm technique is used in the search/optimization process. Experiments conducted on several real-world data sets demonstrate the effectiveness of the proposed method.
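A minimal sketch of the univariate goodness-of-fit criterion is given below; the joint-distribution criterion, the bi-objective quadratic program, and the genetic search are omitted, and the data and binning choices are illustrative assumptions:

```python
import numpy as np

def chi2_distance(full, subset, bins=20):
    """Chi-square-style distance between the univariate distribution of a
    candidate subset and that of the full data set."""
    edges = np.histogram_bin_edges(full, bins=bins)
    full_counts, _ = np.histogram(full, bins=edges)
    sub_counts, _ = np.histogram(subset, bins=edges)
    # Expected subset counts if the subset followed the full distribution.
    expected = full_counts * (len(subset) / len(full))
    mask = expected > 0
    return float(np.sum((sub_counts[mask] - expected[mask]) ** 2 / expected[mask]))

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)
candidate = rng.choice(data, size=500, replace=False)
print(chi2_distance(data, candidate))
```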

3.
Compositional data are data in which the relative contributions of parts to a whole, conveyed by (log-)ratios between them, are essential for the analysis. In Symbolic Data Analysis (SDA), interval data arise when elements are characterized by variables whose values are intervals on \(\mathbb{R}\) representing inherent variability. In this paper, we address the special problem of the analysis of interval compositions, i.e., when the interval data are obtained by the aggregation of compositions. It is assumed that the interval information is represented by the respective midpoints and ranges, and both sources of information are considered as compositions. In this context, we introduce the representation of interval data as three-way data. In the framework of the log-ratio approach from compositional data analysis, it is outlined how interval compositions can be treated in an exploratory context. The goal of the analysis is to represent the compositions by coordinates that are interpretable in terms of the original compositional parts. This is achieved by summarizing all relative information (log-ratios) about each part into one coordinate of the coordinate system. Based on an example from the European Union Statistics on Income and Living Conditions (EU-SILC), several possibilities for an exploratory data analysis approach for interval compositions are outlined and investigated.
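A minimal sketch of the midpoint/range representation in log-ratio coordinates follows; the interval bounds are hypothetical, and centered log-ratio (clr) coordinates are used here for brevity rather than the pivot coordinates that summarize each part into a single coordinate:

```python
import numpy as np

def clr(parts):
    """Centered log-ratio coordinates of a composition (all parts > 0)."""
    logs = np.log(parts)
    return logs - logs.mean(axis=-1, keepdims=True)

# Hypothetical interval composition: lower and upper bounds of three parts.
lower = np.array([0.20, 0.30, 0.40])
upper = np.array([0.30, 0.40, 0.50])

midpoints = (lower + upper) / 2   # first composition: interval midpoints
ranges = upper - lower            # second composition: interval ranges

# Both sources of information are treated as compositions and expressed in
# log-ratio coordinates; a pivot (ilr) basis could be used instead so that all
# relative information about one part is gathered in one coordinate.
print("clr(midpoints):", clr(midpoints))
print("clr(ranges):   ", clr(ranges))
```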

4.
Data envelopment analysis (DEA) is a method to estimate the relative efficiency of decision-making units (DMUs) performing similar tasks in a production system that consumes multiple inputs to produce multiple outputs. So far, a number of DEA models with interval data have been developed; the CCR, BCC, and FDH models with interval data are well known as the basic ones. In this study, we suggest a model called the interval generalized DEA (IGDEA) model, which can treat these basic DEA models with interval data in a unified way. In addition, by establishing the theoretical properties of the relationships between the IGDEA model and those DEA models with interval data, we prove that the IGDEA model makes it possible to calculate the efficiency of DMUs while incorporating various preference structures of decision makers.
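For concreteness, the sketch below solves the basic CCR multiplier model with SciPy and evaluates it under the most favorable data scenario for the evaluated DMU, which gives the upper efficiency bound in interval DEA. It illustrates one of the basic models the IGDEA model unifies, not the IGDEA model itself, and the interval data are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(x_dmus, y_dmus, o):
    """Input-oriented CCR efficiency (multiplier form) of DMU o for crisp data:
    max u.y_o  s.t.  v.x_o = 1,  u.y_j - v.x_j <= 0 for all j,  u, v >= 0."""
    n, m = x_dmus.shape                 # n DMUs, m inputs
    s = y_dmus.shape[1]                 # s outputs
    c = np.concatenate([-y_dmus[o], np.zeros(m)])            # maximize u.y_o
    A_ub = np.hstack([y_dmus, -x_dmus])                       # u.y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.zeros(s), x_dmus[o]])[None, :]  # v.x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m), method="highs")
    return -res.fun

# Hypothetical interval data: x_lo <= x <= x_hi and y_lo <= y <= y_hi.
x_lo = np.array([[2.0], [3.0], [4.0]]); x_hi = np.array([[2.5], [3.5], [5.0]])
y_lo = np.array([[1.0], [1.5], [1.8]]); y_hi = np.array([[1.2], [1.7], [2.2]])

o = 0
# Upper bound: DMU o at its most favorable data, the others at their least favorable.
x_best = x_hi.copy(); x_best[o] = x_lo[o]
y_best = y_lo.copy(); y_best[o] = y_hi[o]
print("upper efficiency bound for DMU", o, ":", ccr_efficiency(x_best, y_best, o))
```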

5.
This paper extends the classical cost efficiency (CE) models to include data uncertainty. We believe that many research situations are best described by the intermediate case, where some uncertain input and output data are available. In such cases, the classical cost efficiency models cannot be used, because the input and output data appear in the form of ranges. When the data are imprecise in this way, the cost efficiency measure calculated from them should be uncertain as well. So, in the current paper, we develop a method for estimating upper and lower bounds for the cost efficiency measure in situations of uncertain input and output data. We also extend the theory of efficiency measurement to accommodate incomplete price information by deriving upper and lower bounds for the cost efficiency measure. The practical application of these bounds is illustrated by a numerical example.
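A minimal sketch of the classical CE linear program under constant returns to scale is shown below (hypothetical data and prices; in the interval-data setting, the upper and lower CE bounds are obtained by re-solving the same program under the most and least favorable data scenarios for the evaluated DMU):

```python
import numpy as np
from scipy.optimize import linprog

def cost_efficiency(X, Y, prices, o):
    """Classical cost efficiency of DMU o: minimize the cost p.x of an input
    bundle x that can still produce DMU o's outputs within the DEA technology,
    then divide by DMU o's observed cost.  X is (n, m) inputs, Y is (n, s) outputs."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision variables: [x (m inputs), lambda (n intensity weights)].
    c = np.concatenate([prices, np.zeros(n)])
    # Technology constraints:  X^T lambda - x <= 0   and   -Y^T lambda <= -y_o.
    A_ub = np.block([[-np.eye(m), X.T],
                     [np.zeros((s, m)), -Y.T]])
    b_ub = np.concatenate([np.zeros(m), -Y[o]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (m + n), method="highs")
    return res.fun / float(prices @ X[o])

# Hypothetical crisp data (3 DMUs, 2 inputs, 1 output) and input prices.
X = np.array([[2.0, 3.0], [4.0, 1.0], [3.0, 3.0]])
Y = np.array([[1.0], [1.0], [1.0]])
p = np.array([1.0, 1.0])
print("cost efficiency of DMU 2:", cost_efficiency(X, Y, p, 2))
```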

6.
Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, considerable bias often results from parameter estimation procedures that do not properly handle significant irregularities. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation where multiple irregularities are hidden in large volumes of data in general and cross-sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least-squares loss function to aid the selection of key irregularities in consecutive steps and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.
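One concrete way to read the penalized least-squares step is a mean-shift model with an L1 penalty on per-observation shifts, sketched below under simplifying assumptions (simulated data, and a single linear trend removed beforehand instead of the full decomposition into basic components):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical series: linear trend plus noise, with a few injected irregularities.
rng = np.random.default_rng(2)
n = 200
t = np.arange(n)
y = 0.05 * t + rng.normal(scale=0.3, size=n)
y[[40, 120, 175]] += np.array([4.0, -5.0, 6.0])   # planted irregularities

# Mean-shift model: y_t = trend_t + gamma_t + noise, with an L1 penalty on the
# per-observation shifts gamma; nonzero gamma_t flag candidate irregularities.
detrended = y - np.poly1d(np.polyfit(t, y, 1))(t)
# sklearn's Lasso objective uses 1/(2n) on the squared loss, so the effective
# soft-threshold on each shift is alpha * n = 2.0 here.
model = Lasso(alpha=0.01, fit_intercept=False).fit(np.eye(n), detrended)
irregular_idx = np.flatnonzero(model.coef_)
print("flagged irregularities at indices:", irregular_idx)
```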

7.
Suppose the parametric form of a curve is not known, but only a set of observations. Quadrature formulae can be used to integrate a function known only from a set of data points. However, the results will be unreliable if the data contain measurement errors (noise). The method presented here fits an even-degree piecewise polynomial to the data, using all the data points as knot points, with a smoothing parameter that is optimal for the indefinite integral of the curve, which happens to be a smoothing spline. After the smoothing parameter has been chosen, this approach is less computationally expensive than fitting a smoothing spline and integrating it.
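For orientation, the baseline the paper compares against (fit a smoothing spline to the noisy observations and integrate the fit) can be sketched as follows; the data, noise level, and smoothing factor are illustrative, and the paper's own estimator avoids this second, more expensive fit:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Noisy samples of an unknown curve (here sin(x) on [0, pi] for illustration).
rng = np.random.default_rng(3)
x = np.linspace(0.0, np.pi, 60)
y = np.sin(x) + rng.normal(scale=0.05, size=x.size)

# Smoothing spline with s chosen roughly as n * noise variance, then integrated.
spline = UnivariateSpline(x, y, k=3, s=x.size * 0.05**2)
estimate = spline.integral(0.0, np.pi)
print("estimated integral:", estimate, " true value:", 2.0)
```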

8.
In this paper, we investigate DEA with interval input-output data. First, we present various extensions of efficiency and show that 25 of them are essential. Second, we formulate the efficiency test problems as mixed integer programming problems. We prove that 14 of the 25 problems can be reduced to linear programming problems and that the other 11 efficiencies can be tested by solving a finite sequence of linear programming problems. Third, in order to obtain efficiency scores, we extend the SBM model to interval input-output data. Fourth, to moderate a possible positive overassessment by DEA, we introduce the inverted DEA model with interval input-output data. Using efficiency and inefficiency scores, we propose a classification of DMUs. Finally, we apply the proposed approach to Japanese bank data and demonstrate its advantages.
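The classification step can be illustrated with a small sketch that combines an efficiency score from the DEA model with an inefficiency score from the inverted model; the scores and the threshold of 1.0 are hypothetical stand-ins, and the paper's exact rule may differ:

```python
# Hypothetical (efficiency, inefficiency) score pairs per DMU; a score of 1.0
# is taken to mark the respective frontier, following the usual DEA convention.
scores = {
    "A": (1.00, 0.62),
    "B": (1.00, 1.00),
    "C": (0.71, 1.00),
    "D": (0.80, 0.75),
}

for dmu, (eff, ineff) in scores.items():
    if eff >= 1.0 and ineff < 1.0:
        label = "efficient"
    elif eff < 1.0 and ineff >= 1.0:
        label = "inefficient"
    elif eff >= 1.0 and ineff >= 1.0:
        label = "on both frontiers (mixed)"
    else:
        label = "intermediate"
    print(dmu, label)
```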

9.
In the statistical analysis of environmental data, space and time are often disregarded through the use of classical methods such as hydrological frequency analysis or factor analysis. But these methods, based on the assumption of independent, identically distributed observations, cannot be efficient. This article discusses more appropriate approaches that account for spatial and temporal influences, and surveys some important proposals for modeling environmental data. Three examples show the workability of the presented theory. In the first example, a system is developed to detect abnormal occurrences in water quality as early as possible based on quasi-continuous data. A second example decomposes a water quality time series into three unobservable components. Finally, it is shown how the factor model can be extended to time series data.
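A minimal sketch of the second example's idea, decomposing a series into unobservable components (trend, seasonal, irregular), is given below using a structural time series model; the simulated monthly series and the model orders are assumptions, not the article's actual data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical monthly water-quality series: trend + annual seasonality + noise.
rng = np.random.default_rng(4)
n = 120
t = np.arange(n)
y = 5.0 + 0.01 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.2, size=n)

# Structural time series model with a local linear trend and a period-12 seasonal.
model = sm.tsa.UnobservedComponents(y, level="local linear trend", seasonal=12)
fit = model.fit(disp=False)

trend = fit.level.smoothed          # smoothed trend component
seasonal = fit.seasonal.smoothed    # smoothed seasonal component
irregular = y - trend - seasonal    # remainder (irregular) component
print("first smoothed trend values:", np.round(trend[:5], 3))
```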

10.
This research attempts to solve the problem of dealing with missing data via the interface of Data Envelopment Analysis (DEA) and human behavior. Missing data is under continuing discussion in various research fields, especially those highly dependent on data. In practice and research, necessary data often cannot be obtained, for example because of procedural factors or a lack of needed responses. This raises the question of how to deal with missing data. In this paper, modified DEA models are developed to estimate an appropriate value for a missing datum within its interval, based on DEA and the Inter-dimensional Similarity Halo Effect. The estimated value is determined by the General Impression of the original DEA efficiency. To evaluate the effectiveness of this method, an impact factor is proposed. In addition, the advantages of the proposed approach are illustrated in comparison with previous methods.
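One possible reading of the halo-effect idea is sketched below: search the missing datum's interval for the value whose resulting efficiency is closest to a target "general impression" score. The one-input/one-output efficiency function, the data, and the target value are hypothetical simplifications, not the paper's modified DEA models:

```python
import numpy as np

def ccr_single(x, y, o):
    """CCR efficiency in the one-input/one-output case: a DMU's output/input
    ratio relative to the best ratio in the sample."""
    ratios = y / x
    return ratios[o] / ratios.max()

# Hypothetical data: DMU 2's input is missing but known to lie in [3.0, 5.0].
x = np.array([2.0, 4.0, np.nan])
y = np.array([1.0, 1.8, 1.5])
interval = (3.0, 5.0)

# Stand-in "general impression": the target efficiency suggested by the DMU's
# other information (the paper derives it from the original DEA efficiency).
general_impression = 0.8

# Grid-search the interval for the value whose resulting efficiency is closest
# to the general impression.
candidates = np.linspace(*interval, 201)
best = min(candidates,
           key=lambda v: abs(ccr_single(np.where(np.isnan(x), v, x), y, 2)
                             - general_impression))
print("estimated missing input:", best)
```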
