Similar literature (20 results)
1.
Many interesting datasets available on the Internet are of a medium size: too big to fit into a personal computer’s memory, but not so large that they would not fit comfortably on its hard disk. In the coming years, datasets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Supplementary material for this article is available online.
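The general idea of pushing medium-sized data into a SQL database and querying it there, rather than holding it in memory, can be sketched as follows. This is only an illustrative sketch in Python with the standard-library sqlite3 module, not the R framework the abstract proposes; the table name `measurements`, the helper `ingest_csv`, and the tiny synthetic dataset are all assumptions made for the example.

```python
import csv
import io
import sqlite3
from itertools import islice


def ingest_csv(conn, table, rows, chunk_size=1000):
    """Insert an iterable of (name, value) rows into `table`, one chunk at a time.

    Only `chunk_size` rows are held in memory at once, so the source can be
    much larger than RAM. `table` is a trusted, hypothetical table name.
    """
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (name TEXT, value REAL)")
    it = iter(rows)
    total = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", chunk)
        total += len(chunk)
    conn.commit()
    return total


# Synthetic stand-in for a CSV file that would be too large to load at once.
raw = io.StringIO("name,value\n" + "".join(f"g{i % 3},{i}\n" for i in range(10)))
reader = csv.reader(raw)
next(reader)  # skip the header row

conn = sqlite3.connect(":memory:")  # a file path would be used for real data
n_rows = ingest_csv(
    conn, "measurements", ((n, float(v)) for n, v in reader), chunk_size=4
)

# The aggregation runs inside the database; only the small summary comes back.
summary = conn.execute(
    "SELECT name, COUNT(*), SUM(value) FROM measurements GROUP BY name ORDER BY name"
).fetchall()
print(n_rows, summary)
```

Keeping ingestion in one scripted function and analysis in one query is what makes such a workflow easy to rerun and share.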

2.
Exploratory data analysis (EDA) is as much a matter of strategy as of selecting specific statistical operations. We have developed a knowledge-based planning system, called AIDE, to help users with EDA. AIDE strikes a balance between conventional statistical packages, which need guidance for every step in the exploration, and autonomous systems, which leave the user entirely out of the decision-making process. AIDE's processing is based on artificial intelligence planning techniques, which give us a useful means of representing some types of statistical strategy. In this article we describe the design of AIDE and its behavior in exploring a small, complex data set.

3.
We develop a general ontology of statistical methods and use it to propose a common framework for statistical analysis and software development built on and within the R language, including R's numerous existing packages. This framework offers a simple unified structure and syntax that can encompass a large fraction of existing statistical procedures. We conjecture that it can be used to encompass and present simply a vast majority of existing statistical methods, without requiring changes in existing approaches, and regardless of the theory of inference on which they are based, notation with which they were developed, and programming syntax with which they have been implemented. This development enabled us, and should enable others, to design statistical software with a single, simple, and unified user interface that helps overcome the conflicting notation, syntax, jargon, and statistical methods existing across the methods subfields of numerous academic disciplines. The approach also enables one to build a graphical user interface that automatically includes any method encompassed within the framework. We hope that the result of this line of research will greatly reduce the time from the creation of a new statistical innovation to its widespread use by applied researchers whether or not they use or program in R.

4.
The importance of graphical displays in statistical practice has been recognized sporadically in the statistical literature over the past century, with wider awareness following Tukey's Exploratory Data Analysis and Tufte's books in the succeeding decades. But statistical graphics still occupy an awkward in-between position: within statistics, exploratory and graphical methods represent a minor subfield and are not well integrated with larger themes of modeling and inference. Outside of statistics, infographics (also called information visualization or Infovis) are huge, but their purveyors and enthusiasts appear largely to be uninterested in statistical principles.

We present here a set of goals for graphical displays discussed primarily from the statistical point of view and discuss some inherent contradictions in these goals that may be impeding communication between the fields of statistics and Infovis. One of our constructive suggestions, to Infovis practitioners and statisticians alike, is to try not to cram into a single graph what can be better displayed in two or more. We recognize that we offer only one perspective and intend this article to be a starting point for a wide-ranging discussion among graphic designers, statisticians, and users of statistical methods. The purpose of this article is not to criticize but to explore the different goals that lead researchers in different fields to value different aspects of data visualization.

5.
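R, a free and open-source software environment from the GNU project, is an excellent tool for statistical computing and statistical graphics that deserves wider adoption. For trend surface analysis of large geochemical datasets, we use R with the spatial package, which supports kriging and point pattern analysis. The package's surf.gls method fits the trend surface by least squares, and the anova method compares the goodness of fit of several nested models. Together these yield the best-fitting trend surface function, with computation and plotting performed automatically, which improves the reliability of the analysis. As an example, we use test data from 586 soil samples collected over 215.86 km² of a major commercial grain base in central Quanjiao County, Anhui Province. Trend surface analysis of Zn, Cu, and the Zn/Cu ratio shows that the third-order trend surface fits best and that its plot broadly agrees with the geological and environmental conditions, confirming the advantages of R for this application.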
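The workflow described above (fit polynomial trend surfaces of increasing order by least squares, then compare nested models) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the R spatial package's implementation: the helper names `design_matrix`, `fit_trend_surface`, and `nested_f` are hypothetical, the data are synthetic rather than the soil samples from the study, and ordinary least squares is used throughout.

```python
import numpy as np


def design_matrix(x, y, order):
    """Columns x**i * y**j for all i + j <= order (full polynomial surface)."""
    cols = [x**i * y**j for i in range(order + 1) for j in range(order + 1 - i)]
    return np.column_stack(cols)


def fit_trend_surface(x, y, z, order):
    """Ordinary least-squares fit.

    Returns (coefficients, residual sum of squares, number of parameters).
    """
    A = design_matrix(x, y, order)
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    resid = z - A @ coef
    return coef, float(resid @ resid), A.shape[1]


def nested_f(rss_small, p_small, rss_big, p_big, n):
    """F statistic for a smaller model nested inside a bigger one."""
    return ((rss_small - rss_big) / (p_big - p_small)) / (rss_big / (n - p_big))


# Synthetic data: a surface with a genuine third-order term plus noise.
rng = np.random.default_rng(0)
n = 200
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
z = 1.0 + 2.0 * x - y + 3.0 * x**2 * y + rng.normal(0, 0.1, n)

fits = {k: fit_trend_surface(x, y, z, k) for k in (1, 2, 3)}
f_2_vs_3 = nested_f(fits[2][1], fits[2][2], fits[3][1], fits[3][2], n)

for k, (_, rss, p) in fits.items():
    print(f"order {k}: {p} parameters, RSS = {rss:.3f}")
print(f"F (order 2 vs order 3) = {f_2_vs_3:.1f}")
```

A large F statistic for the step from order 2 to order 3 indicates that the extra cubic terms explain substantially more variation than noise alone would, which is the kind of evidence the abstract cites for preferring the third-order surface.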

6.
The last few years have seen a significant increase in publicly available software specifically targeted to the analysis of extreme values. This reflects the increase in the use of extreme value methodology by the general statistical community. The software that is available for the analysis of extremes has evolved in essentially independent units, with most forming extensions of larger software environments. An inevitable consequence is that these units are spread about the statistical landscape. Scientists seeking to apply extreme value methods must spend considerable time and effort in determining whether the currently available software can be usefully applied to a given problem. We attempt to simplify this process by reviewing the current state, and suggest future approaches for software development. These suggestions aim to provide a basis for an initiative leading to the successful creation and distribution of a flexible and extensible set of tools for extreme value practitioners and researchers alike. In particular, we propose a collaborative framework for which cooperation between developers is of fundamental importance. AMS 2000 Subject Classification: Primary 62P99.

7.
Interactive web graphics are great for communication and knowledge sharing, but are difficult to leverage during the exploratory phase of a data science workflow. Even before the web, interactive graphics helped data analysts quickly gather insight from data, discover the unexpected, and develop better model diagnostics. Although web technologies make interactive graphics more accessible, they are not designed to fit inside an exploratory data analysis (EDA) workflow where rapid iteration between data manipulation, modeling, and visualization must occur. To better facilitate exploratory web graphics that are easily distributed, we need better interfaces between statistical computing environments (e.g., the R language) and client-side web technologies. We propose the R package animint for rapid creation of linked and animated web graphics through a simple extension of ggplot2’s implementation of the Grammar of Graphics. The extension allows one to write ggplot2 code and produce a standalone web page with multiple linked views. Supplementary material for this article is available online.

8.
This article proposes a goodness-of-fit test for the null hypothesis of a functional linear model with scalar response. The test is based on a generalization to the functional framework of a previous one, designed for the goodness-of-fit of regression models with multivariate covariates using random projections. The test statistic is easy to compute using geometrical and matrix arguments, and simple to calibrate in its distribution by a wild bootstrap on the residuals. The finite sample properties of the test are illustrated by a simulation study for several types of basis and under different alternatives. Finally, the test is applied to two datasets for checking the assumption of the functional linear model and a graphical tool is introduced. Supplementary materials are available online.

9.
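A nonstationarity measure quantifies how stationary a time series is. Based on this measure, we construct a C test and, together with the measured values, analyze historical data from China's sports lotteries "Pailie 5" (排列五) and "Qixingcai" (七星彩) and from the Arizona lottery game "Pick3". We find that the digits 0 to 9 at each position occur with stable probabilities, but not with the equal probability 1/10; the distribution deviates slightly from the i.i.d. uniform distribution. Qixingcai is the most uniform, Pick3 comes next, and Pailie 5 is slightly less uniform.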
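The question of whether the digits 0 to 9 occur with equal probability 1/10 can be illustrated with a classical Pearson chi-square goodness-of-fit test. This is a swapped-in, textbook technique and not the C test or the nonstationarity measure from the abstract, whose definitions are not given here; the function name `digit_chi_square` and both sample sequences are hypothetical.

```python
from collections import Counter

# Upper 5% critical value of the chi-square distribution with 9 degrees
# of freedom (10 digit categories minus 1).
CHI2_CRIT_DF9_5PCT = 16.919


def digit_chi_square(digits):
    """Pearson chi-square statistic against the uniform law on 0..9."""
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


# Two artificial draw histories: perfectly balanced, and biased toward 0.
uniform_draws = [i % 10 for i in range(1000)]
skewed_draws = [0] * 200 + [i % 10 for i in range(800)]

stat_uniform = digit_chi_square(uniform_draws)
stat_skewed = digit_chi_square(skewed_draws)

for label, stat in (("uniform", stat_uniform), ("skewed", stat_skewed)):
    verdict = "reject" if stat > CHI2_CRIT_DF9_5PCT else "do not reject"
    print(f"{label}: chi-square = {stat:.1f} -> {verdict} uniformity at 5%")
```

A test of this kind only checks the marginal digit frequencies at one position; it says nothing about stability over time, which is what a nonstationarity measure is meant to capture.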

10.
Hypothesis-error (or “HE”) plots, introduced by Friendly (J Stat Softw 17(6):1–42, 2006a; J Comput Graph Stat 16:421–444, 2006b), permit the visualization of hypothesis tests in multivariate linear models by representing hypothesis and error matrices of sums of squares and cross-products as ellipses. This paper describes the implementation of these methods in the heplots package for R, as well as their extension, for example from two to three dimensions and by scaling hypothesis ellipses and ellipsoids in a natural manner relative to error. This is a paper for the proceedings of the Directions in Statistical Computing conference.

11.
The massive flood of numbers in ongoing large-scale periodic economic and social surveys commonly leaves little time for anything but a cursory examination of the quality of the data, and few techniques exist for giving an overview of data activity. At the U.S. Bureau of Labor Statistics, a graphical and query-based solution to these problems has recently been adopted for data review in the Current Employment Statistics survey. Chief among the motivations for creating the new system were: (1) Reduce or eliminate the arduous paper review of thousands of sample reports by review analysts; (2) allow the review analysts a more global view of sample activity and at the same time make outlier detection less of a strain; and (3) present global views of estimates over time and among groups of subestimates. The specific graphics approaches used in the new system were designed to quickly portray both time series and cross-sectional aspects of the data, as these are both critical elements in the review process. The described system allows the data analysts to track down suspicious sample members by first graphically pinpointing questionable estimates, and then pinpointing questionable sample data used to produce those estimates. Query methods are used for cross-checking relationships among different sample data elements. Although designed for outlier detection and estimation, the data-representation methods employed in the system have opened up new possibilities for further statistical and economic uses of the data. The authors were torn between the desire for a completely automatic system of data review and the practical demands of an actual survey operating under imperfect conditions, and thus viewed the new system as an evolutionary advance, not as an ideal final solution. Possibilities opened up by the new system prompted some further thinking on finding an ideal state.

12.
Research is an incremental, iterative process, with new results relying and building upon previous ones. Scientists need to find, retrieve, understand, and verify results to confidently extend them, even when the results are their own. We present the trackr framework for organizing, automatically annotating, discovering, and retrieving results. We identify sources of automatically extractable metadata for computational results, and we define an extensible system for organizing, annotating, and searching for results based on these and other metadata. We present an open-source implementation of these concepts for plots, computational artifacts, and woven dynamic reports generated in the R statistical computing language. Supplementary materials for this article are available online.

13.
In this paper, we consider lines of development in computer-assisted decision support (including knowledge-based approaches) for data analysis problems. First, we discuss some situations where computer-assisted decision support is clearly appropriate for data analysis tasks. Then, a brief historical retrospective reviews the development of this area of research and its interfaces to knowledge-based approaches. Against this background, we illustrate two prototypes of knowledge-based decision support systems for specific data analysis problems from our own fields of interest. Finally, we indicate possible progress and future activities in this area.

14.
15.
Statistical software provides essential support for statisticians and others who are analyzing data or doing research on new statistical techniques. Those supported typically regard themselves as “users” of the software, but as soon as they need to express their own ideas computationally, they in fact become “programmers.” Nothing is more important for the success of statistical software than enabling this transition from user to programmer, and on to gradually more ambitious software design. What does the user need? How can the design of statistical software help? This article presents a number of suggestions based on past experience and current research. The evolution of the S system reflects some of these opinions. Work on the Omegahat software provides a promising direction for future systems that reflect similar motivations.

16.
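Cloud computing and big data have become research hotspots in IT, and applying the data storage and data processing strengths of cloud computing to big data has significant practical value. The open-source cloud platform OpenStack makes it easy to build a private cloud from the hardware management side, and its storage module Swift supports PB-scale data storage. The open-source cloud platform Hadoop is strong in data processing but falls short in supporting very large-scale data storage. By comparing OpenStack's storage module Swift with Hadoop's file-handling module HDFS, we propose combining Swift with Hadoop's MapReduce to build a private cloud computing system for enterprise big data processing. The analysis shows that the scheme is feasible and that such a heterogeneous private cloud can combine the strengths of different cloud computing platforms to process big data efficiently.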

17.
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is two-fold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three reproducible case studies, each based on publicly available data sources.

18.
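This paper proposes a multi-attribute group decision-making method based on the linguistic distribution assessment weighted averaging (DAWA) operator. It defines measures of ordinal consistency and numerical consistency between an individual decision maker's evaluation results and those of the decision group, and uses them to analyze the reliability of the group's results. Finally, a concrete example verifies the effectiveness and practicality of the method and analyzes the ordinal and numerical consistency between individual and group evaluation results.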

19.
We consider alternate formulations of recently proposed hierarchical nearest neighbor Gaussian process (NNGP) models for improved convergence, faster computing time, and more robust and reproducible Bayesian inference. Algorithms are defined that improve CPU memory management and exploit existing high-performance numerical linear algebra libraries. Computational and inferential benefits are assessed for alternate NNGP specifications using simulated datasets and remotely sensed light detection and ranging data collected over the U.S. Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska. The resulting data product is the first statistically robust map of forest canopy for the TIU. Supplemental materials for this article are available online.

20.