Similar documents
Found 20 similar documents.
1.
In the statistical analysis of environmental data, space and time are often disregarded through the use of classical methods such as hydrological frequency analysis or factor analysis. But these methods, which rest on the assumption of independent, identically distributed observations, cannot be efficient here. This article discusses more appropriate approaches that account for spatial and temporal influences, and surveys some important proposals for modeling environmental data. Three examples show the workability of the presented theory. In the first example, a system is developed to detect abnormal occurrences in water quality as early as possible based on quasi-continuous data. A second example decomposes a water quality time series into three unobservable components. Finally, it is shown how the factor model can be extended to time series data.
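The decomposition of a water-quality series into unobservable components can be illustrated with a classical additive decomposition into trend, seasonal component, and remainder. This is a generic sketch, not the article's model; the series, period, and noise level below are invented for illustration.

```python
import numpy as np

def decompose_additive(y, period):
    """Classical additive decomposition: trend (centered moving average),
    seasonal component (periodic means of the detrended series), remainder."""
    n = len(y)
    k = period // 2
    trend = np.full(n, np.nan)
    for t in range(k, n - k):
        if period % 2 == 0:
            # 2x(period) centered moving average: half-weight the end points
            window = y[t - k:t + k + 1].copy()
            window[0] *= 0.5
            window[-1] *= 0.5
            trend[t] = window.sum() / period
        else:
            trend[t] = y[t - k:t + k + 1].mean()
    detrended = y - trend
    seasonal = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    seasonal -= seasonal.mean()                 # seasonal effects sum to zero
    seasonal_full = np.tile(seasonal, n // period + 1)[:n]
    remainder = y - trend - seasonal_full
    return trend, seasonal_full, remainder

# synthetic "water quality" series: linear trend + annual cycle + noise
rng = np.random.default_rng(0)
t = np.arange(240)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 240)
trend, seas, rem = decompose_additive(y, period=12)
```

The moving average absorbs the trend, the periodic means absorb the cycle, and what remains is close to the injected noise.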

2.
Denoising poses a new challenge for mining high-frequency financial data because of its irregularities and roughness. Inefficient decomposition of the systematic pattern (the trend) and the noise in high-frequency data leads to erroneous conclusions, as the irregularities and roughness of the data make traditional methods difficult to apply. In this paper, we propose the local linear scaling approximation (LLSA) algorithm, a new nonlinear filtering algorithm based on the linear maximal overlap discrete wavelet transform (MODWT), to decompose the systematic pattern and the noise. We show several unique properties of this new algorithm, namely its local linearity, computational complexity, and consistency. We conduct a simulation study to confirm the properties shown analytically, and compare the performance of LLSA with that of MODWT. We then apply the new algorithm to real high-frequency data from the German equity market to investigate its use in forecasting. We show the superior performance of LLSA and conclude that it can be applied with flexible settings and is suitable for high-frequency data mining.
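As a rough illustration of the MODWT machinery that LLSA builds on (not the LLSA algorithm itself), one level of a Haar MODWT splits a series into a smooth (local trend) part and a detail (roughness) part, with perfect reconstruction and energy preservation. The data below are synthetic.

```python
import numpy as np

def haar_modwt_level1(x):
    """One level of the maximal overlap DWT with the Haar filter.
    Uses circular filtering; unlike the decimated DWT, the output
    has the same length as the input."""
    x = np.asarray(x, dtype=float)
    x_lag = np.roll(x, 1)               # X_{t-1 mod N}
    detail = (x - x_lag) / 2.0          # wavelet coefficients: roughness / noise
    smooth = (x + x_lag) / 2.0          # scaling coefficients: local trend
    return smooth, detail

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=512))     # a rough random-walk-like "price" path
smooth, detail = haar_modwt_level1(x)
```

For this one-level Haar case, x = smooth + detail at every index, and the energy of the series splits exactly between the two components.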

3.
With the broad development of the World Wide Web, various kinds of heterogeneous data (including multimedia data) are now available to decision support tasks. A data warehousing approach is often adopted to prepare data for relevant analysis. Data integration and dimensional modeling indeed allow the creation of appropriate analysis contexts. However, existing data warehousing tools are well suited to classical, numerical data; they cannot handle complex data. In our approach, we adapt the three main phases of the data warehousing process to complex data. In this paper, we focus on two main steps in complex data warehousing. The first step is data integration: we define a generic UML model that helps represent a wide range of complex data, including their possible semantic properties, and complex data are then stored in XML documents generated by a piece of software we designed. The second phase we address is the preparation of data for dimensional modeling: we propose an approach that exploits data mining techniques to assist users in building relevant dimensional models.

4.
We explore the use of data mining for lead time estimation in make-to-order manufacturing. The regression tree approach is chosen as the specific data mining method. Training and test data are generated from variations of a job shop simulation model. Starting with a large set of job and shop attributes, a reasonably small subset is selected based on its contribution to estimation performance. Data mining with the selected attributes is compared with linear regression and three other lead time estimation methods from the literature. Empirical results indicate that our data mining approach, coupled with the attribute selection scheme, outperforms these methods.
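A minimal sketch of the regression-tree-versus-linear-regression comparison: the job/shop attributes (queue length, processing time, machine utilization) and the lead time formula below are hypothetical stand-ins, not the paper's simulation model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 2000
# hypothetical job/shop attributes
queue = rng.integers(0, 20, n).astype(float)   # jobs ahead in the queue
proc = rng.uniform(1, 10, n)                   # processing time of the job
util = rng.uniform(0.5, 0.95, n)               # shop utilization at release
# made-up nonlinear lead time: congestion grows sharply at high utilization
lead = proc + 0.1 * queue * proc / (1 - util) + rng.normal(0, 1, n)

X = np.column_stack([queue, proc, util])
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X[:1500], lead[:1500])
lin = LinearRegression().fit(X[:1500], lead[:1500])
tree_err = np.mean((tree.predict(X[1500:]) - lead[1500:]) ** 2)
lin_err = np.mean((lin.predict(X[1500:]) - lead[1500:]) ** 2)
```

On this strongly nonlinear target the tree's piecewise-constant fit captures the congestion interaction that a linear model cannot.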

5.
With contemporary data collection capacity, data sets containing large numbers of different multivariate time series relating to a common entity (e.g., fMRI scans, financial stocks) are becoming more prevalent. One pervasive question is whether there are patterns or groups of series within the larger data set (e.g., disease patterns in brain scans; mining stocks may be internally similar yet distinct from banking stocks). There is a relatively large body of literature on clustering methods for univariate and multivariate time series, though most do not utilize the time dependencies inherent to time series. This paper develops an exploratory data analysis methodology which, in addition to the time dependencies, simultaneously utilizes the dependency information between the S series themselves and between the p variables within each series, while still retaining the distinctiveness of the two types of variables. This is achieved by combining the principles of canonical correlation analysis and principal component analysis for time series to obtain a new type of covariance/correlation matrix, on which a principal component analysis produces a so-called "principal component time series". The results are illustrated on two data sets.

6.
Increasing amounts of large climate data require new analysis techniques. The area of data mining investigates new paradigms and methods, including factors like scalability, flexibility, and problem abstraction, for large data sets. The field of visual data mining in particular offers valuable methods for intuitively analyzing large amounts of data. In this paper we describe our approach to integrating cluster analysis and visualization methods for the exploration of climate data. We integrated clustering algorithms, appropriate visualization techniques, and sophisticated interaction paradigms into a general framework.

7.
Data mining involves extracting interesting patterns from data and can be found at the heart of operational research (OR), as its aim is to create and enhance decision support systems. Even in the early days, some data mining approaches relied on traditional OR methods such as linear programming and forecasting, and modern data mining methods are based on a wide variety of OR methods including linear and quadratic optimization, genetic algorithms and concepts based on artificial ant colonies. The use of data mining has rapidly become widespread, with applications in domains ranging from credit risk, marketing, and fraud detection to counter-terrorism. In all of these, data mining is increasingly playing a key role in decision making. Nonetheless, many challenges still need to be tackled, ranging from data quality issues to the problem of how to include domain experts' knowledge, or how to monitor model performance. In this paper, we outline a series of upcoming trends and challenges for data mining and its role within OR.

8.
In data mining, the unsupervised learning technique of clustering is a useful method for ascertaining trends and patterns in data. Most general clustering techniques do not take into consideration the time-order of data. In this paper, mathematical programming and statistical techniques and methodologies are combined to develop a seasonal clustering technique for determining clusters of time series data. We apply this technique to weather and aviation data to determine probabilistic distributions of arrival capacity scenarios, which can be used for efficient traffic flow management. In general, this technique may be used for seasonal forecasting and planning.
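The idea of clustering days into capacity scenarios and reading off their empirical probabilities can be sketched as follows. The weather features (temperature, visibility) and the cluster structure are invented for illustration, not taken from the paper's data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# hypothetical daily weather features for one season: (temperature, visibility)
clear = rng.normal([25.0, 9.0], [2.0, 0.5], size=(60, 2))   # clear days
foggy = rng.normal([10.0, 2.0], [2.0, 0.5], size=(40, 2))   # low-visibility days
days = np.vstack([clear, foggy])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(days)
# empirical probability of each arrival-capacity scenario within the season
probs = np.bincount(km.labels_) / len(days)
```

Each cluster stands for a capacity scenario; the relative cluster sizes give the scenario probabilities that a traffic flow manager would plan against.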

9.
To distinguish data features within time series data over specific time intervals, time series segmentation is often required. This research divides time series data into segments of varying lengths. A time series segmentation algorithm based on the Ant Colony Optimization (ACO) algorithm is proposed to capture the changing behavior of the time series data. To verify the effect of the proposed algorithm, we compare it with the Bottom-Up method, which has been reported in the literature to give good results for time series segmentation. Both simulated data and real stock price data are used in the experiments. The results show that time series segmentation by the ACO algorithm not only identifies the number of segments automatically, but also achieves a lower segmentation cost than the Bottom-Up method. More importantly, the ACO process also incurs less data loss than the Bottom-Up method.
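For reference, the Bottom-Up baseline mentioned above can be sketched in a few lines: start from tiny segments and repeatedly merge the adjacent pair whose combined linear fit is cheapest. This is a generic textbook implementation, not the paper's code.

```python
import numpy as np

def fit_cost(y):
    """Residual sum of squares of a least-squares line through one segment."""
    if len(y) < 3:
        return 0.0
    x = np.arange(len(y))
    coef = np.polyfit(x, y, 1)
    return float(np.sum((np.polyval(coef, x) - y) ** 2))

def bottom_up(y, k):
    """Merge adjacent segments with the cheapest combined fit until k remain."""
    bounds = list(range(0, len(y), 2)) + [len(y)]   # start from 2-point segments
    while len(bounds) - 1 > k:
        costs = [fit_cost(y[bounds[i]:bounds[i + 2]])
                 for i in range(len(bounds) - 2)]
        bounds.pop(int(np.argmin(costs)) + 1)        # drop cheapest boundary
    return bounds

# piecewise-linear test signal with breakpoints at 50 and 100
y = np.concatenate([np.linspace(0, 10, 50),
                    np.linspace(10, -5, 50),
                    np.linspace(-5, 0, 50)])
bounds = bottom_up(y, k=3)
```

On this noise-free signal the surviving boundaries land exactly on the true breakpoints, because merges inside a straight piece cost essentially nothing while merges across a kink do not.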

10.
Time series data with periodic trends, like daily temperatures or sales of seasonal products, fluctuate between highs and lows throughout the year. Generalized least squares estimators are often computed for such time series data, as these estimators have minimum variance among all linear unbiased estimators. However, the generalized least squares solution can require extremely demanding computation when the data set is large. This paper studies an efficient algorithm for generalized least squares estimation in periodic trended regression with autoregressive errors. We develop an algorithm that can substantially simplify generalized least squares computation by manipulating large sets of data into smaller sets. This is accomplished by constructing a structured matrix for dimension reduction. Simulations show that the new computation methods using our algorithm can drastically reduce computing time. Our algorithm can be easily adapted to big data with periodic trends, which are pertinent to economics, environmental studies, and engineering practice.
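Generalized least squares with AR(1) errors can be computed by quasi-differencing (whitening): after the transform, the errors are uncorrelated, so ordinary least squares on the transformed data is the GLS estimator. This is a textbook sketch with a known AR coefficient, not the paper's structured-matrix algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600
t = np.arange(n)
# periodic trended regression: intercept plus an annual-style cycle
X = np.column_stack([np.ones(n),
                     np.sin(2 * np.pi * t / 12),
                     np.cos(2 * np.pi * t / 12)])
beta_true = np.array([2.0, 1.5, -0.5])

# AR(1) errors with known coefficient rho
rho = 0.8
e = np.zeros(n)
eps = rng.normal(0, 0.5, n)
for i in range(1, n):
    e[i] = rho * e[i - 1] + eps[i]
y = X @ beta_true + e

# quasi-differencing (Cochrane-Orcutt form): the transformed model has
# white errors, so OLS on (Xs, ys) is the GLS estimator
ys = y[1:] - rho * y[:-1]
Xs = X[1:] - rho * X[:-1]
beta_gls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
```

With 600 observations the whitened estimate recovers the true coefficients closely, which is the baseline any faster GLS computation must match.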

11.
Methods designed for second-order stationary time series can be misleading when applied to nonstationary series, often resulting in inaccurate models and poor forecasts. Hence, testing time series stationarity is important especially with the advent of the ‘data revolution’ and the recent explosion in the number of nonstationary time series analysis tools. Most existing stationarity tests rely on a single basis. We propose new tests that use nondecimated basis libraries which permit discovery of a wider range of nonstationary behaviours, with greater power whilst preserving acceptable statistical size. Our tests work with a wide range of time series including those whose marginal distributions possess heavy tails. We provide freeware R software that implements our tests and a range of graphical tools to identify the location and duration of nonstationarities. Theoretical and simulated power calculations show the superiority of our wavelet packet approach in a number of important situations and, hence, we suggest that the new tests are useful additions to the analyst's toolbox.

12.
Multi-step prediction remains an open challenge in time series prediction. Moreover, practical observations are often incomplete because sensor failures or outliers cause missing data. It is therefore important to study multi-step prediction of time series with randomly missing data. Based on nonlinear filters and multilayer perceptron artificial neural networks (ANNs), a novel approach for multi-step prediction of time series with randomly missing data is proposed in this study. Starting from the original nonlinear filters, which do not consider missing data, we first obtain generalized nonlinear filters by using a sequence of independent Bernoulli random variables to model random interruptions. Then the multi-step prediction model for time series with randomly missing data, suited to the online training of the generalized nonlinear filters, is established by using the ANN's weights to represent the state vector and the ANN's outputs to represent the observation equation. The performance of the ANN model based on the original nonlinear filters is compared with that of the ANN model based on the generalized nonlinear filters. Numerical results demonstrate that the generalized nonlinear filter-based ANN is superior to the original nonlinear filter-based ANN for multi-step prediction of time series with missing data.
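The Bernoulli interruption model used above for random missing data can be sketched as follows: an i.i.d. Bernoulli indicator decides whether each sample is observed. This is a toy illustration with simple last-observation-carry-forward gap handling, not the paper's generalized-filter training.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = np.sin(np.linspace(0, 12 * np.pi, n)) + rng.normal(0, 0.1, n)

# model random sensor interruptions with i.i.d. Bernoulli indicators:
# gamma[t] = 1 -> x[t] observed, gamma[t] = 0 -> x[t] missing
gamma = rng.binomial(1, 0.85, n)            # ~15% of samples lost at random
observed = np.where(gamma == 1, x, np.nan)

# simple gap handling before training any predictor:
# carry the last observation forward through each interruption
filled = observed.copy()
if np.isnan(filled[0]):
    filled[0] = np.nanmean(observed)        # seed the carry-forward if needed
for t in range(1, n):
    if np.isnan(filled[t]):
        filled[t] = filled[t - 1]
```

The generalized filters in the paper fold the indicator sequence into the observation equation instead of imputing; this sketch only shows the interruption model itself.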

13.
Many time series, such as economic time series like capital data, are the joint result of numerous intertwined factors and involve various linear and nonlinear mechanisms. Spectral analysis and its variants should therefore not be the tool of choice for analyzing the periodicity of such series. R/S analysis, which unlike spectral analysis makes no sine or cosine assumptions, has a clear advantage here. An R/S analysis of the Shanghai Composite Index shows that the index exhibits long-range positive correlation and a cycle of roughly five months.
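A minimal R/S (rescaled range) analysis can be implemented directly: the slope of log(R/S) against log(n) estimates the Hurst exponent H, with H > 0.5 indicating the long-range positive correlation reported for the index. Synthetic i.i.d. noise is used here, not the Shanghai Composite data.

```python
import numpy as np

def rescaled_range(x, n):
    """Average R/S statistic over non-overlapping blocks of length n."""
    rs = []
    for b in range(len(x) // n):
        seg = x[b * n:(b + 1) * n]
        dev = np.cumsum(seg - seg.mean())   # mean-adjusted cumulative deviations
        r = dev.max() - dev.min()           # range of the deviations
        s = seg.std()                       # standard deviation of the block
        if s > 0:
            rs.append(r / s)
    return np.mean(rs)

def hurst(x, sizes=(8, 16, 32, 64, 128)):
    """Slope of log(R/S) against log(n) estimates the Hurst exponent H."""
    ln = [np.log(n) for n in sizes]
    lr = [np.log(rescaled_range(x, n)) for n in sizes]
    return np.polyfit(ln, lr, 1)[0]

rng = np.random.default_rng(6)
h_noise = hurst(rng.normal(size=4096))      # i.i.d. noise: H should be near 0.5
```

For small block sizes the classical R/S estimate is known to be biased somewhat above 0.5 even for uncorrelated data, so only clearly larger values signal persistence.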

14.
Candidate groups search for K-harmonic means data clustering
Clustering is a very popular data analysis and data mining technique, and K-means is one of the most popular clustering methods. Although K-means is easy to implement and fast in most situations, it suffers from two major drawbacks: sensitivity to initialization and convergence to local optima. K-harmonic means clustering has been proposed to overcome the first drawback, sensitivity to initialization. In this paper we propose a new algorithm, candidate groups search (CGS), combined with K-harmonic means to solve the clustering problem. Computational results show that CGS achieves better performance with less computational time, especially for large data sets or when the number of centers is large.

15.
As new technologies or products are invented, customers migrate from a legacy product to a new product from time to time. This paper discusses a time series data mining framework for product and service migration analysis. In order to identify who migrates, what migrations look like, and the relationship between the legacy product and the new product, we first discuss certain characteristics of customer spending data associated with product migration. By exploring interesting patterns and defining a number of features that capture the associations between the spending time series, we develop a cointegration-based classifier to identify customers associated with migration and summarize their time series patterns before, during, and after the migration. Customers can then be scored with a migration index that integrates the statistical significance and business impact of migration. We illustrate the research through a case study of internet protocol (IP) migration in telecommunications and compare it with likelihood-ratio-based tests for change point detection.

16.
Advanced statistical techniques and data mining methods are recognized as powerful supports for mass spectrometry (MS) data analysis. In particular, due to its unsupervised learning nature, data clustering has attracted increasing interest for exploring, identifying, and discriminating pathological cases from clinical MS samples. Supporting biomarker discovery in protein profiles has drawn special attention from biologists and clinicians. However, the huge amount of information contained in a single sample, that is, the high dimensionality of MS data, makes the effective identification of biomarkers a challenging problem. In this paper, we present a data mining approach for the analysis of MS data in which the mining phase is developed as a clustering task. Under the natural assumption of modeling MS data as time series, we propose a new representation model of MS data which significantly reduces their high dimensionality while preserving the relevant features. Besides reducing dimensionality (which typically affects the effectiveness and efficiency of computational methods), the proposed representation model also alleviates the critical task of preprocessing the raw spectra in the overall MS data analysis process. We evaluated our clustering approach on publicly available proteomic datasets, and experimental results show the effectiveness of the proposed approach, which can be used to aid clinicians in studying and formulating diagnoses of pathological states.

17.
Data mining aims to find patterns in organizational databases. However, most mining techniques do not consider knowledge of the quality of the database. In this work, we show how to incorporate into classification mining recent advances in the data quality field that view a database as the product of an imprecise manufacturing process whose flaws/defects are captured in quality matrices. We develop a general-purpose method of incorporating data quality matrices into the data mining classification task. Our work differs from existing data preparation techniques: while other approaches detect and fix errors to ensure consistency with the entire data set, our work makes use of a priori knowledge of how the data is produced/manufactured.

18.
Desulfurization systems in coal-fired power stations often suffer from high operating costs caused by rule-of-thumb control strategies, which implies great potential for optimizing their operation. Due to the complex desulfurization mechanism, frequently fluctuating unit load, and severe disturbances, it is challenging to determine the optimal operating parameters with traditional mechanistic models, yet these parameters are closely related to the operational efficiency of the flue gas desulfurization system. In this paper, an operation strategy optimization method for the desulfurization process is proposed based on a data mining framework, which determines online the optimal operating parameter settings from a large amount of historical data. First, Principal Component Analysis (PCA) is used to reduce data redundancy by mapping the data into a new vector space. In that space, an enhanced fuzzy C-means clustering algorithm (Enhanced-FCM) is developed to cluster the historical data into groups sharing similar characteristics. Taking the sulfur dioxide emission concentration as a constraint, the system is optimized with economic benefit and desulfurization efficiency as the objective function. During optimization, the group to which the current operating conditions belong is determined first; the best-performing operating parameters are then searched for within that group and returned as the optimization result. The method is validated and tested on data from the wet flue gas desulfurization (WFGD) system of a 1000 MWe supercritical coal-fired power plant in China. The results indicate that the proposed strategy appropriately obtains operating parameter settings under different conditions and effectively reduces the desulfurization cost while meeting emission requirements.
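The PCA-then-fuzzy-C-means pipeline can be sketched generically: SVD-based PCA for redundancy reduction followed by textbook FCM membership/center updates. This runs on synthetic data and is not the paper's Enhanced-FCM.

```python
import numpy as np

def pca(X, k):
    """Project onto the top-k principal components via SVD of centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy C-means: soft memberships u (rows sum to 1) and cluster centers."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.maximum(np.linalg.norm(X[:, None] - centers[None], axis=2), 1e-8)
        # u_ik = d_ik^{-2/(m-1)} / sum_j d_ij^{-2/(m-1)}
        u = 1.0 / (d ** (2 / (m - 1)) *
                   (d ** (-2 / (m - 1))).sum(axis=1, keepdims=True))
    return u, centers

# two synthetic high-dimensional "operating condition" profiles
rng = np.random.default_rng(8)
base1 = np.sin(np.linspace(0, 4 * np.pi, 100))
base2 = np.cos(np.linspace(0, 4 * np.pi, 100))
X = np.vstack([base1 + rng.normal(0, 0.2, (40, 100)),
               base2 + rng.normal(0, 0.2, (40, 100))])

Z = pca(X, 5)               # redundancy reduction before clustering
u, centers = fcm(Z, 2)
labels = u.argmax(axis=1)
```

In the paper's setting, the group containing the current operating conditions is then looked up and searched for the best historical parameter settings; here the sketch stops at the grouping step.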

19.
Mathematical theory of optimization has found many applications in the area of medicine over the last few decades. Several data analysis and decision making problems in medicine can be formulated using optimization and data mining techniques. The significance of the mathematical models is greatly realized in the recent years owing to the growing technological capabilities and the large amounts of data available. In this paper, we attempt to give a brief overview of some of the most interesting applications of mathematical programming and data mining in medicine. In the overview, we include applications like radiation therapy treatment, microarray data analysis, and computational neuroscience.

20.
Data preprocessing is an important and critical step in the data mining process and it has a huge impact on the success of a data mining project. In this paper, we present an algorithm DB-HReduction, which discretizes or eliminates numeric attributes and generalizes or eliminates symbolic attributes very efficiently and effectively. This algorithm greatly decreases the number of attributes and tuples of the data set and improves the accuracy and decreases the running time of the data mining algorithms in the later stage.
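Equal-frequency discretization of a numeric attribute, one of the basic operations such a preprocessing algorithm performs, can be sketched as follows. This is a generic illustration, not DB-HReduction itself; the attribute and bin count are made up.

```python
import numpy as np

def discretize_equal_freq(col, bins):
    """Equal-frequency discretization: numeric values -> integer bin labels.
    Interior quantiles become the cut points, so bins hold equal counts."""
    edges = np.quantile(col, np.linspace(0, 1, bins + 1)[1:-1])
    return np.searchsorted(edges, col)

rng = np.random.default_rng(9)
age = rng.uniform(18, 80, 1000)     # hypothetical numeric attribute
labels = discretize_equal_freq(age, 4)
```

Replacing the continuous column with four labels shrinks the attribute's domain, which is what lets later mining stages run faster with little loss of information.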

