Similar Articles
 20 similar articles found (search time: 31 ms)
1.
When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA poses theoretical and algorithmic pitfalls. A principal curve acts as a nonlinear generalization of PCA, and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called slpc, for sequential learning principal curves), which incorporates both sleeping-experts and multi-armed-bandit ingredients, is presented, along with its regret computation and its performance on synthetic and real-life data.
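As a toy illustration of the sequential flavour, a minimal numpy sketch of a greedy polyline update on streaming points; this is illustrative only, not the slpc algorithm or its sleeping-experts/bandit machinery:

```python
import numpy as np

def update_polyline(vertices, x, lr=0.05):
    """Toy sequential principal-curve step: pull the nearest vertex
    (and, more weakly, its neighbours) towards the new observation x.
    `vertices` is a (k, d) array of polyline nodes."""
    d2 = np.sum((vertices - x) ** 2, axis=1)
    i = int(np.argmin(d2))                 # nearest vertex
    vertices[i] += lr * (x - vertices[i])  # move it towards x
    for j in (i - 1, i + 1):               # smooth the neighbours
        if 0 <= j < len(vertices):
            vertices[j] += 0.5 * lr * (x - vertices[j])
    return vertices

rng = np.random.default_rng(0)
vertices = np.linspace([-1.0, 0.0], [1.0, 0.0], num=10)  # initial segment
for _ in range(2000):                                    # simulated stream
    t = rng.uniform(-1, 1)
    x = np.array([t, t ** 2]) + 0.05 * rng.normal(size=2)
    vertices = update_polyline(vertices, x)
```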

2.
This paper presents new approaches to fit regression models for symbolic interval-valued variables, which are shown to improve and extend the center method suggested by Billard and Diday and the center and range method proposed by Lima-Neto, E.A. and De Carvalho, F.A.T. Like the previously mentioned methods, the proposed regression models consider the midpoints and half of the length of the intervals as additional variables. We considered various methods to fit the regression models, including tree-based models, K-nearest neighbors, support vector machines, and neural networks. The approaches proposed in this paper were applied to a real dataset and to synthetic datasets generated with linear and nonlinear relations. For an evaluation of the methods, the root-mean-squared error and the correlation coefficient were used. The methods presented herein are available in the RSDA package written in the R language, which can be installed from CRAN.
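A minimal sketch of the center-and-range idea with scikit-learn on synthetic interval data (illustrative only; the RSDA implementation and its API are not reproduced here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 200
mid_x = rng.normal(size=(n, 1))             # interval midpoints of X
rng_x = rng.uniform(0.1, 0.5, size=(n, 1))  # half-lengths of X
mid_y = 2.0 * mid_x[:, 0] + rng.normal(scale=0.1, size=n)
rng_y = 0.8 * rng_x[:, 0] + rng.uniform(0, 0.05, size=n)

# Center-and-range idea: use midpoint and half-length as predictors
X = np.hstack([mid_x, rng_x])
knn_mid = KNeighborsRegressor(n_neighbors=5).fit(X, mid_y)
knn_rng = KNeighborsRegressor(n_neighbors=5).fit(X, rng_y)

rmse_mid = mean_squared_error(mid_y, knn_mid.predict(X)) ** 0.5
print(f"in-sample RMSE (midpoints): {rmse_mid:.3f}")
```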

3.
This paper investigates the asymptotic properties of estimators obtained from the so-called CVA (canonical variate analysis) subspace algorithm proposed by Larimore (1983) in the case when the data are generated by a minimal state space system containing unit roots at the seasonal frequencies, such that the yearly difference is a stationary vector autoregressive moving average (VARMA) process. The empirically most important special cases of such data generating processes are the I(1) case as well as the case of seasonally integrated quarterly or monthly data. Increasingly, however, datasets with a higher sampling rate, such as hourly, daily or weekly observations, are also available, for example for electricity consumption. In these cases the vector error correction (VECM) representation of the vector autoregressive (VAR) model is not very helpful, as it demands the parameterization of one matrix per seasonal unit root. Even for weekly series this amounts to 52 matrices at yearly periodicity; for hourly data it is prohibitive. For such processes, estimation by quasi-maximum likelihood is extremely hard, since the Gaussian likelihood typically has many local maxima while the parameter space is often high-dimensional. Additionally, estimating a large number of models to test hypotheses on the cointegrating rank at the various unit roots becomes practically impossible, for example for weekly data. This paper shows that in this setting CVA provides consistent estimators of the transfer function generating the data, making it a valuable initial estimator for subsequent quasi-likelihood maximization. Furthermore, the paper proposes new tests for the cointegrating rank at the seasonal frequencies which are easy to compute and numerically robust, making the method suitable for automatic modelling. A simulation study demonstrates by example that, for processes of moderate to large dimension, the new tests may outperform traditional tests based on long VAR approximations in sample sizes typically found in quarterly macroeconomic data. Further simulations show that the unit root tests are robust with respect to different distributions for the innovations as well as to GARCH-type conditional heteroskedasticity. Moreover, an application to Kaggle data on hourly electricity consumption by different American providers demonstrates the usefulness of the method in practice. The CVA algorithm therefore provides a very useful initial guess for subsequent quasi-maximum likelihood estimation and also delivers relevant information on the cointegrating ranks at the different unit root frequencies, making it a useful tool in (but not limited to) automatic modelling applications where a large number of time series involving a substantial number of variables must be modelled in parallel.
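A minimal sketch of the CVA core step (canonical correlations between stacked past and future, computed via SVD), assuming only numpy; this toy omits the paper's seasonal-unit-root treatment and rank tests:

```python
import numpy as np

def cva_state_estimate(y, p, f, n):
    """Toy CVA core step (Larimore-style): canonical correlations between
    the stacked past and future of a vector series y of shape (T, s).
    Returns an n-dimensional state estimate and the leading canonical
    correlations."""
    T, s = y.shape
    rows = T - p - f + 1
    past = np.hstack([y[p - 1 - i: p - 1 - i + rows] for i in range(p)])
    fut = np.hstack([y[p + i: p + i + rows] for i in range(f)])

    def whiten(z):                    # orthonormal coordinates of z
        u, _, _ = np.linalg.svd(z - z.mean(0), full_matrices=False)
        return u

    up, uf = whiten(past), whiten(fut)
    _, sing, vt = np.linalg.svd(uf.T @ up, full_matrices=False)
    state = up @ vt[:n].T             # dominant canonical variates of the past
    return state, sing[:n]

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=(500, 2)), axis=0)   # toy I(1) data
state, cc = cva_state_estimate(y, p=5, f=5, n=2)
print("leading canonical correlations:", np.round(cc, 3))
```

For unit-root data, the leading canonical correlations approach one, which is what the proposed rank tests exploit.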

4.
Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, some based on computation of minimum spanning trees and some on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to select their hyperparameters. The main idea is to avoid comparing graphs directly, and instead first to induce a clustering of the data point cloud from the graph approximation and then to use well-established methods to compare and score the data cloud partitionings induced by the graphs. In particular, mutual information-based approaches prove useful in this context. The induced clustering is based on decomposing a graph into non-branching segments and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library, which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single-cell datasets.
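A minimal sketch of the comparison idea with scikit-learn and scipy: induce a partition of the point cloud from each graph and score the two partitions with adjusted mutual information. Here the "graphs" are just k-means node sets with illustrative per-node segment labels, not principal graphs or minimum spanning trees:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # data point cloud

def induced_partition(node_coords, node_segment, X):
    """Label each data point with the segment of its nearest graph node."""
    _, idx = cKDTree(node_coords).query(X)
    return node_segment[idx]

# Two toy 'graph approximations': k-means centroids standing in for graph
# nodes, each node being its own segment (in the paper, segments come from
# decomposing the graph into non-branching paths).
nodes_a = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
nodes_b = KMeans(n_clusters=12, n_init=10, random_state=1).fit(X).cluster_centers_
labels_a = induced_partition(nodes_a, np.arange(10), X)
labels_b = induced_partition(nodes_b, np.arange(12), X)

print("AMI between induced partitions:",
      adjusted_mutual_info_score(labels_a, labels_b))
```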

5.
The strong interest in recent years in analyzing chaotic dynamical systems according to their asymptotic behavior has led to various definitions of fractal dimension and corresponding methods of statistical estimation. In this paper we first provide a rigorous mathematical framework for the study of dimension, focusing on the pointwise dimension d(x) and the generalized Rényi dimensions D(q), and give a rigorous proof of inequalities first derived by Grassberger and Procaccia and by Hentschel and Procaccia. We then specialize to the problem of statistical estimation of the correlation dimension and the information dimension. It has been recognized for some time that the error estimates accompanying the usual procedures (which generally involve least squares methods and nearest neighbor calculations) grossly underestimate the true statistical error involved. In least squares analyses of these dimensions we identify sources of error not previously discussed in the literature and address the problem of obtaining accurate error estimates. We then develop an estimation procedure which corrects for an important bias term (the local measure density) and provides confidence intervals. The general applicability of this method is illustrated with various numerical examples.
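A minimal Grassberger-Procaccia style sketch, assuming numpy/scipy; note that, as the abstract stresses, the naive least-squares error bars of such fits understate the true statistical error:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Grassberger-Procaccia style estimate: slope of log C(r) vs log r,
    where C(r) is the fraction of point pairs closer than r."""
    d = pdist(X)
    C = np.array([np.mean(d < r) for r in radii])
    mask = C > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope

# Toy example: points on a noisy circle (true dimension close to 1)
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 2000)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.01 * rng.normal(size=(2000, 2))
radii = np.logspace(-1.5, -0.3, 10)
print("estimated correlation dimension:",
      round(correlation_dimension(X, radii), 2))
```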

6.
Liu Hui, Yan Zhiwei, Xiao Meng, Zhu Shining. Acta Optica Sinica, 2021, 41(1): 317-336
Synthetic dimensions in photonics have become a focus of recent research in nano-optics and topological photonics. Ordinarily, the physical dimension of an optical system is limited by its spatial, geometric dimension, which greatly restricts the physical phenomena the system can support. By introducing synthetic dimensions, researchers can break this geometric constraint on the dimensionality of a physical system and study physics in higher-dimensional spaces. At the same time, the high controllability of synthetic dimensions and the rich variety of ways to construct them greatly simplify system design and facilitate the observation of high-dimensional physical phenomena. This paper introduces the basic concepts of synthetic dimensions in photonics, reviews the design schemes proposed in recent years for realizing them, and offers a preliminary discussion of their prospects in fundamental physics research and in applications.
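As a toy illustration of the concept (not taken from the review itself): N resonator modes coupled by modulation behave like a one-dimensional tight-binding lattice along a synthetic frequency axis. A minimal numpy sketch:

```python
import numpy as np

# Synthetic frequency dimension: N resonator modes coupled by modulation
# act like a 1D tight-binding chain with hopping amplitude kappa.
N, kappa = 21, 1.0
H = np.zeros((N, N))
for n in range(N - 1):
    H[n, n + 1] = H[n + 1, n] = kappa   # nearest-mode coupling

eigvals = np.linalg.eigvalsh(H)
# Open-chain tight-binding spectrum: E_j = 2*kappa*cos(j*pi/(N+1))
j = np.arange(1, N + 1)
theory = 2 * kappa * np.cos(j * np.pi / (N + 1))
print(np.allclose(np.sort(eigvals), np.sort(theory)))   # True
```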

7.
Shannon’s entropy measure is a popular means of quantifying ecological diversity. We explore how one can use information-theoretic measures (often called indices in ecology) on joint ensembles to study the diversity of species interaction networks. We leverage the little-known balance equation to decompose the network information into three components describing the species abundance, specificity, and redundancy. This balance reveals a fundamental trade-off between these components. The decomposition can be straightforwardly extended to analyse networks through time as well as space, leading to the corresponding notions of alpha, beta, and gamma diversity. Our work aims to provide an accessible introduction for ecologists. To this end, we illustrate the interpretation of the components on numerous real networks. The corresponding code is made available to the community in the specialised Julia package EcologicalNetworks.jl.
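A minimal numpy sketch of one common form of such a balance, on toy interaction counts (the exact decomposition used in the paper may differ):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (possibly unnormalised) array."""
    p = p[p > 0] / p[p > 0].sum()
    return -np.sum(p * np.log2(p))

# Toy interaction counts (rows: one guild, columns: the other)
counts = np.array([[10., 2., 0.],
                   [1.,  8., 3.],
                   [0.,  1., 6.]])
P = counts / counts.sum()                  # joint distribution
px, py = P.sum(axis=1), P.sum(axis=0)      # marginal (abundance) terms

H_joint = H(P.ravel())
I = H(px) + H(py) - H_joint                # mutual information (specificity)
H_x_given_y = H_joint - H(py)              # conditional entropies
H_y_given_x = H_joint - H(px)

# Balance: the joint entropy decomposes exactly into these pieces
assert np.isclose(H_joint, I + H_x_given_y + H_y_given_x)
print(round(I, 3), round(H_x_given_y, 3), round(H_y_given_x, 3))
```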

8.
A new software package for the Julia language, CountTimeSeries.jl, is under review; it provides likelihood-based methods for integer-valued time series. The package’s functionalities are showcased in a simulation study on the finite-sample properties of maximum likelihood (ML) estimation and in three real-life data applications. First, the number of newly infected COVID-19 patients is predicted. Then, previous findings on the need for overdispersion and zero inflation are reviewed in an application on animal submissions in New Zealand. Further, information criteria are used for model selection to investigate patterns in corporate insolvencies in Rhineland-Palatinate. Theoretical background and implementation details are described, and complete code for all applications is provided online. The CountTimeSeries package is available in the general Julia package registry.
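A minimal Python analogue of the kind of model such packages fit (a Poisson INARCH(1) estimated by maximum likelihood with scipy; this is not the CountTimeSeries.jl API):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def ingarch_negloglik(theta, y):
    """Negative log-likelihood of a Poisson INARCH(1):
    lambda_t = beta0 + alpha1 * y_{t-1}."""
    beta0, alpha1 = theta
    lam = beta0 + alpha1 * y[:-1]
    return -np.sum(poisson.logpmf(y[1:], lam))

rng = np.random.default_rng(0)
y = np.empty(500, dtype=int)
y[0] = 3
for t in range(1, 500):                       # simulate an INARCH(1) path
    y[t] = rng.poisson(2.0 + 0.4 * y[t - 1])

res = minimize(ingarch_negloglik, x0=[1.0, 0.1], args=(y,),
               bounds=[(1e-6, None), (0.0, 0.99)])
print("ML estimates (beta0, alpha1):", np.round(res.x, 3))
```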

9.
Scaling phenomena have been intensively studied during the past decade in the context of complex networks. As part of this work, novel methods have recently appeared for measuring the dimension of abstract and spatially embedded networks. In this paper we propose a new dimension measurement method for networks which does not require global knowledge of the embedding of the nodes; instead, it exploits link-wise information (link lengths, link delays or other physical quantities). Our method can be regarded as a generalization of the spectral dimension that grasps the network’s large-scale structure through local observations made by a random walker while traversing the links. We apply the presented method to synthetic and real-world networks, including road maps, the Internet infrastructure and the Gowalla geosocial network. We analyze the theoretically and empirically designated case in which the length distribution of the links has the form P(ρ) ∼ 1/ρ. We show that while previous dimension concepts are not applicable in this case, the new dimension measure still exhibits scaling with two distinct scaling regimes. Our observations suggest that the link length distribution is not by itself sufficient to entirely control the dimensionality of complex networks, and we show that the proposed measure provides information that complements other known measures.
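A toy, distance-based cousin of the idea, assuming networkx and numpy: count the nodes reachable within traversed link length ell and fit the scaling N(ell) ~ ell^D (the paper's estimator is random-walk based and more general):

```python
import numpy as np
import networkx as nx

# For a 2D grid with unit link lengths, D should come out close to 2.
G = nx.grid_2d_graph(40, 40)
nx.set_edge_attributes(G, 1.0, "length")       # unit link lengths

dist = nx.single_source_dijkstra_path_length(G, (20, 20), weight="length")
ells = np.array([2.0, 4.0, 8.0, 16.0])
N = np.array([sum(d <= ell for d in dist.values()) for ell in ells])

D, _ = np.polyfit(np.log(ells), np.log(N), 1)
print("estimated dimension:", round(D, 2))
```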

10.
A dimension reduction method for spectral analysis
In visible/near-infrared spectral analysis, extracting the useful information in the spectral data is a prerequisite for building robust and accurate models. ISOMAP is an effective dimension reduction method for extracting the intrinsic dimension of the data, but it is sensitive to noise and to the neighborhood parameter. This paper proposes an improved, supervised ISOMAP dimension reduction method that uses the correlations within the spectral data themselves to guide the construction of the neighborhood graph, reducing the sensitivity to noise and to the neighborhood parameter so that the neighborhood structure of the data is expressed correctly. The method was used to reduce the dimension of two spectral datasets, followed by PLS modeling. The results show that the improved algorithm weakens the influence of the neighborhood size, extracts a smaller intrinsic dimension, and at the same time improves model accuracy.
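A minimal sketch of the unimproved pipeline with scikit-learn (plain Isomap followed by PLS on toy "spectra"; the paper's supervised neighborhood construction is not reproduced):

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

# Toy stand-in for spectra: smooth curves driven by a latent variable
rng = np.random.default_rng(0)
latent = rng.uniform(0, 1, 300)
wavelengths = np.linspace(0, 1, 200)
X = np.sin(4 * np.pi * np.outer(latent, wavelengths)) \
    + 0.05 * rng.normal(size=(300, 200))
y = 3.0 * latent + 0.1 * rng.normal(size=300)

Z = Isomap(n_neighbors=12, n_components=3).fit_transform(X)  # reduce
pls = PLSRegression(n_components=2).fit(Z, y)                # then model
rmse = mean_squared_error(y, pls.predict(Z).ravel()) ** 0.5
print(f"in-sample RMSE: {rmse:.3f}")
```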

11.
We present example quantum chemistry programs written with JaqalPaq, a Python metaprogramming package used to code in Jaqal (Just Another Quantum Assembly Language). These JaqalPaq algorithms are intended to be run on the Quantum Scientific Computing Open User Testbed (QSCOUT) platform at Sandia National Laboratories. Our exemplars use the variational quantum eigensolver (VQE) quantum algorithm to compute the ground state energies of the H2, HeH+, and LiH molecules. Since the exemplars focus on how to program in JaqalPaq, the calculations of the second-quantized Hamiltonians are performed with the PySCF Python package, and the mappings of the fermions to qubits are obtained from the OpenFermion Python package. Using the emulator functionality of JaqalPaq, we emulate how these exemplars would be executed on an error-free QSCOUT platform and compare the emulated computation of the bond-dissociation curves for these molecules with their exact forms within the relevant basis.

12.
Estimating the effective signal dimension of resting-state functional MRI (fMRI) data sets (i.e., selecting an appropriate number of signal components) is essential for data-driven analysis. However, current methods are prone to overestimating the dimension, especially for concatenated group data sets. This work aims to develop improved dimension estimation methods for group fMRI data generated by data reduction and grouping procedures at multiple levels. We propose a “noise-blurring” approach to suppress intragroup signal variations and to correct the spectral alterations caused by the data reduction, which appear to be responsible for the group dimension overestimation. The technique was evaluated on both simulated group data sets and in vivo resting-state fMRI data sets acquired from 14 normal human subjects during five different scan sessions. Reduction and grouping procedures were repeated at three levels in either “scan–session–subject” or “scan–subject–session” order. Compared with traditional estimation methods, our approach exhibits stronger immunity to intragroup signal variation, less sensitivity to group size, and better agreement between the two grouping orders on the dimensions at the third level.

13.
We present a Python package developed for computing optical properties of non-spherical particles. It provides a user-friendly, flexible framework that takes advantage of a modern programming language supported by an abundant library of scientific packages. The framework is designed to include the methods, and the interfaces to third-party codes, required to treat scatterers of different shape and structure. We describe the current state of our package, called ScattPy, briefly outline its range of applicability, and note its outstanding accuracy for inhomogeneous particles with a multilayered structure. We also demonstrate some advantages of ScattPy, in particular when performing large-scale computations. Languages such as Python are known to simplify data input and allow one to include new classes and objects (e.g. those required to define new scatterer shapes) without recompiling the code. The main benefits come from the ability to organize the output data easily as a database. In ScattPy we use an SQLite database, and we illustrate how it is utilized in our investigation of the dependence of the phase function on the shape, size and structure of spheroids. By comparing the time consumption of ScattPy to that of an equivalent code written entirely in FORTRAN, we show that there need be no essential performance loss when using Python.
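A minimal sqlite3 sketch of the results-database idea from Python (the schema and names are illustrative only, not ScattPy's):

```python
import sqlite3

# Store each scattering computation as a row, then query by shape.
con = sqlite3.connect("scattering.db")
con.execute("""CREATE TABLE IF NOT EXISTS runs (
                   shape TEXT, size_param REAL, n_layers INTEGER,
                   q_ext REAL, q_sca REAL)""")
con.execute("INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
            ("spheroid", 3.5, 2, 2.13, 1.98))
con.commit()

for row in con.execute(
        "SELECT size_param, q_ext FROM runs WHERE shape = ?", ("spheroid",)):
    print(row)
con.close()
```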

14.
In the spectral analysis of artificial space objects, limits on observation distance and on the spatial resolution of the observing instrument mean that, within a single instantaneous field of view, the spectral signatures of several pure materials are combined into one pixel, forming a "mixed spectrum". Decomposing these mixed spectra into the spectra of the individual materials and estimating their proportions is therefore a central topic in the spectral analysis of artificial space objects. Most existing unmixing methods for space-object spectra assume that the number of pure materials contained in the mixed spectrum (the "number of endmembers") is known a priori, which is unrealistic for unknown space objects. Correctly estimating the number of pure materials is thus crucial for the accuracy of subsequent spectral analysis. Moreover, existing methods for determining the number of endmembers are designed under the assumption of white Gaussian noise and give poor results when the noise is spectrally correlated. This paper adopts a robust eigenvalue maximum-likelihood method based on the intrinsic dimension of the data and likelihood maximization. Because the intrinsic dimension is closely related to the statistical distribution of the differences between the eigenvalues of the signal covariance matrix and the signal correlation matrix, a maximum-likelihood function built from this distribution can determine the number of endmembers in the mixed spectrum. The method has two steps. First, the raw spectral data are preprocessed with multiple regression and an improved minimum noise fraction method to estimate the noise characteristics and whiten the noise, effectively suppressing spectrally correlated noise. Second, the number of endmembers is determined by maximizing a discrete log joint-likelihood function; the procedure requires no input parameters and runs quickly. Experiments were carried out on mixed-spectrum simulations built from laboratory visible/near-infrared spectra of five artificial space-object materials and from United States Geological Survey spectral library data. The results show that the method effectively suppresses both correlated and white noise, and that the estimated number of pure materials is accurate and stable.
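A much-simplified eigenvalue sketch in the same spirit (a plain eigenvalue-gap count on simulated mixed spectra, assuming numpy; not the paper's whitening-plus-likelihood procedure):

```python
import numpy as np

def estimate_endmembers(X):
    """Toy eigenvalue-gap estimate: count covariance eigenvalues that sit
    well above the noise floor (taken here as the median eigenvalue).
    The threshold factor 10 is an illustrative heuristic."""
    lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    noise_floor = np.median(lam)
    return int(np.sum(lam > 10 * noise_floor))

rng = np.random.default_rng(0)
A = rng.uniform(size=(2000, 3))          # abundances of 3 endmembers
S = rng.uniform(size=(3, 50))            # 3 endmember spectra, 50 bands
X = A @ S + 0.01 * rng.normal(size=(2000, 50))
print("estimated number of endmembers:", estimate_endmembers(X))
```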

15.
An open-source framework for conducting a broad range of virtual X-ray imaging experiments, syris, is presented. The simulated wavefield created by a source propagates through an arbitrary number of objects until it reaches a detector. The objects in the light path and the source are time-dependent, which enables simulations of dynamic experiments, e.g. four-dimensional time-resolved tomography and laminography. The high-level interface of syris is written in Python, and its modularity makes the framework very flexible. The computationally demanding parts behind this interface are implemented in OpenCL, which enables fast calculations on modern graphics processing units. The combination of flexibility and speed opens new possibilities for studying novel imaging methods and for systematic searches of optimal combinations of measurement conditions and data processing parameters, which can help to increase the success rate and efficiency of valuable synchrotron beam time. To demonstrate the capabilities of the framework, various experiments have been simulated and compared with real data. To show the use case of measurement and data processing parameter optimization based on simulation, a virtual counterpart of a high-speed radiography experiment was created, and the simulated data were used to select a suitable motion estimation algorithm; one of its parameters was optimized in order to achieve the best motion estimation accuracy when applied to the real data. syris was also used to simulate tomographic data sets under various imaging conditions that impact tomographic reconstruction accuracy, and it is shown how the accuracy may guide the selection of imaging conditions for particular use cases.
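A minimal numpy sketch of the generic wavefield-propagation building block such simulators rest on (angular-spectrum free-space propagation; this is not syris's API):

```python
import numpy as np

def angular_spectrum_propagate(u0, wavelength, dx, z):
    """Propagate a complex wavefield u0 (N, N) over distance z in free
    space using the angular-spectrum method."""
    N = u0.shape[0]
    fx = np.fft.fftfreq(N, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    kz = 2 * np.pi * np.sqrt(np.maximum(
        0.0, 1.0 / wavelength ** 2 - FX ** 2 - FY ** 2))
    return np.fft.ifft2(np.fft.fft2(u0) * np.exp(1j * kz * z))

# Plane wave through a 10 um circular aperture, propagated 5 cm
N, dx = 512, 1e-7                           # 0.1 um pixels
x = (np.arange(N) - N // 2) * dx
X, Y = np.meshgrid(x, x)
u0 = (X ** 2 + Y ** 2 < (10e-6) ** 2).astype(complex)
u = angular_spectrum_propagate(u0, wavelength=5e-11, dx=dx, z=0.05)

# Propagation is unitary here, so total intensity is conserved
print(np.isclose(np.sum(np.abs(u0) ** 2), np.sum(np.abs(u) ** 2)))
```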

16.
17.
In this contribution a new method for improving the accuracy of classification and identification experiments is presented. For this purpose the four most widely applied dimension reduction methods (principal component analysis, independent component analysis, partial least squares dimension reduction and linear discriminant analysis) are used as the starting point for the optimization. The optimization is done by a specially designed genetic algorithm, which is well suited to this kind of experiment. The presented multi-level chemometric approach has been tested on a Raman dataset containing over 2200 Raman spectra of eight classes of bacterial species (Bacillus anthracis, Bacillus cereus, Bacillus licheniformis, Bacillus mycoides, Bacillus subtilis, Bacillus thuringiensis, Bacillus weihenstephanensis and Paenibacillus polymyxa). Optimizing the dimension reduction improved the classification accuracy by 6% compared with the standard dimension reduction, and the identification rate by 14%. Testing in a classification and identification experiment showed the robustness of the algorithm.
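A much-simplified sketch of the scheme with scikit-learn: a toy genetic algorithm selecting PCA components to maximize cross-validated LDA accuracy (the paper's specially designed GA is more elaborate):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)          # stand-in for Raman spectra
Z = PCA(n_components=10).fit_transform(X)
rng = np.random.default_rng(0)

def fitness(mask):
    """Cross-validated LDA accuracy on the selected PCA components."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LinearDiscriminantAnalysis(),
                           Z[:, mask.astype(bool)], y, cv=5).mean()

pop = rng.integers(0, 2, size=(20, 10))            # random component masks
for _ in range(15):                                # generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]        # select the fittest
    mates = parents[rng.permutation(len(parents))]
    children = np.where(rng.random(parents.shape) < 0.5,
                        parents, mates)            # uniform crossover
    flip = rng.random(children.shape) < 0.1        # mutation
    children = np.where(flip, 1 - children, children)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected components:", np.flatnonzero(best))
```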

18.
The existing estimate of the upper critical dimension of the Abelian sandpile model is based on a qualitative consideration of avalanches as self-avoiding branching processes. We find an exact representation of an avalanche as a sequence of spanning subtrees of two-component spanning trees. Using the equivalence between chemical paths on the spanning tree and loop-erased random walks, we reduce the problem to the determination of the fractal dimension of spanning subtrees. The upper critical dimension d_u = 4 then follows from Lawler's theorems for intersection probabilities of random walks and loop-erased random walks.
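A compressed reading of the argument in standard notation (a sketch, not quoted from the paper):

```latex
% Chemical paths on the spanning tree behave like loop-erased random
% walks (LERW), so avalanche geometry is controlled by the LERW fractal
% dimension d_{\mathrm{LERW}}. Mean-field (Gaussian) behaviour sets in
% once the LERW scales like a simple random walk, which by Lawler's
% intersection results holds from dimension four on:
\[
  d_{\mathrm{LERW}}(d) = 2 \quad \text{for } d \ge 4
  \;\Longrightarrow\; d_u = 4 .
\]
```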

19.
The trust region method, which originated from the Levenberg–Marquardt (LM) algorithm, is considered for mixed-effect model estimation in the context of second-level functional magnetic resonance imaging (fMRI) data analysis. We first present the mathematical and optimization details of the method for mixed-effect model analysis, and then compare the proposed methods with the conventional expectation-maximization (EM) algorithm on a series of datasets (synthetic and real human fMRI datasets). From simulation studies, we found that a higher damping factor for the LM algorithm is better than a lower one for fMRI data analysis. More importantly, in most cases the expectation trust region algorithm is superior to the EM algorithm in terms of accuracy when the random-effect variance is large. We also compare these algorithms on real human datasets comprising repeated fMRI measures in phase-encoded and random block experimental designs. We observed that the proposed method is faster in computation and robust to Gaussian noise for fMRI analysis. The advantages and limitations of the suggested methods are discussed.
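A minimal numpy sketch of a generic Levenberg-Marquardt loop with an adaptive damping factor, applied to a toy curve fit (the mixed-effect fMRI machinery is not reproduced):

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam=1.0, n_iter=50):
    """Generic LM loop; the abstract's observation is that a larger
    damping factor lam tends to work better on fMRI data."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r, J = residual(x), jacobian(x)
        step = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), -J.T @ r)
        if np.sum(residual(x + step) ** 2) < np.sum(r ** 2):
            x, lam = x + step, lam * 0.7      # accept step, relax damping
        else:
            lam *= 2.0                        # reject step, damp harder
    return x

# Toy nonlinear fit: y = a * exp(b * t) + noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * t) + 0.05 * rng.normal(size=50)
res = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.column_stack([np.exp(p[1] * t),
                                 p[0] * t * np.exp(p[1] * t)])
print(np.round(levenberg_marquardt(res, jac, [1.0, 1.0]), 3))
```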

20.
There is great demand for inferring causal effect heterogeneity and for open-source statistical software that is readily available to practitioners. The mcf package is an open-source Python package that implements the Modified Causal Forest (mcf), a causal machine learner. We replicate three well-known studies in the fields of epidemiology, medicine, and labor economics to demonstrate that our mcf package produces aggregate treatment effects that align with previous results and, in addition, provides novel insights on causal effect heterogeneity. The mcf package provides inference at all resolutions of treatment effect estimation that can be identified. We conclude that the mcf constitutes a practical and extensive tool for modern analysis of heterogeneous causal effects.
