Similar Documents

20 similar documents found.
1.
A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice.

An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl 1 RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables.

We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
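For concreteness, here is a minimal sketch of the unsupervised RF dissimilarity described above, using scikit-learn. The function name and the synthetic-data scheme shown (sampling each variable independently from its observed marginal, in the spirit of Addcl 1) are illustrative assumptions, and proximities are computed over all training points rather than only out-of-bag cases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=500, random_state=0):
    """Unsupervised RF dissimilarity via the observed-vs-synthetic contrast."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    # Synthetic data: sample each variable independently from its marginal,
    # destroying the dependence structure among variables (Addcl 1 idea).
    X_syn = np.column_stack([rng.choice(X[:, j], size=n) for j in range(p)])
    X_all = np.vstack([X, X_syn])
    y_all = np.r_[np.ones(n), np.zeros(n)]        # 1 = observed, 0 = synthetic
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    rf.fit(X_all, y_all)
    # Proximity between observed points: fraction of trees in which the two
    # points land in the same terminal node.
    leaves = rf.apply(X)                          # shape (n, n_trees)
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return np.sqrt(1.0 - prox)                    # dissimilarity = sqrt(1 - proximity)
```

The returned matrix can then be passed to any dissimilarity-based clustering method, such as PAM.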

2.
Long waits, cancellations, and resource overload frequently occur in healthcare, especially in sectors involving patients who pass through the operating theatre, in both the United States and the European Union. Since more and more hospitals seek to improve overall patient pathways rather than the effectiveness of “isolated” departments, a key first task is to define suitable patient groups on which the process-management and simulation tools developed in recent decades can be employed. In this study, we propose a data-mining method, an auto-stopped Bisecting K-Medoids clustering algorithm, to classify patients into groups with homogeneous trajectories. The method classifies patient trajectories in two stages. At the first stage, patients are classified by the complexity of their outpatient visits; at the second stage, the groups obtained at the first stage are further classified using the original trajectory information, in which all medical appointments, including outpatient ones, are taken into account. Using a real data set collected from a medium-size Belgian hospital, we demonstrate how the proposed approach works and examine which kinds of trajectories are grouped into the same clusters. According to the experimental results, the proposed method can classify patients into manageable groups with homogeneous trajectories, which can serve as a basis for process-modelling techniques and simulation tools.
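A minimal sketch of such a bisecting loop, assuming a precomputed trajectory dissimilarity matrix `D`. The 2-medoid splitting step and the auto-stop rule shown (stop splitting a cluster once its mean within-cluster dissimilarity falls below a threshold) are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

def two_medoid_split(D, idx, n_iter=20, seed=0):
    """Split the points in idx into two clusters with a small PAM-style loop."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(idx, size=2, replace=False)
    labels = np.zeros(len(idx), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(D[np.ix_(idx, medoids)], axis=1)
        parts = [idx[labels == k] for k in (0, 1)]
        if min(len(p) for p in parts) == 0:       # degenerate split
            break
        # New medoid of each part: the point minimizing total dissimilarity.
        new = np.array([p[np.argmin(D[np.ix_(p, p)].sum(axis=0))] for p in parts])
        if set(new) == set(medoids):
            break
        medoids = new
    return [idx[labels == k] for k in (0, 1)]

def bisecting_kmedoids(D, threshold):
    """Recursively bisect until every cluster is homogeneous enough."""
    to_split, final = [np.arange(D.shape[0])], []
    while to_split:
        c = to_split.pop()
        # Auto-stop rule: keep the cluster once it is homogeneous enough.
        if len(c) < 2 or D[np.ix_(c, c)].mean() <= threshold:
            final.append(c)
            continue
        parts = two_medoid_split(D, c)
        if min(len(p) for p in parts) == 0:       # could not split further
            final.append(c)
        else:
            to_split.extend(parts)
    return final
```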

3.
The motivation for this paper stems from signal/image processing, where one wishes to measure various attributes or physical quantities such as position, scale, direction and frequency of a signal or an image. These physical quantities are measured via a signal transform; for example, the short time Fourier transform measures the content of a signal at different times and frequencies. There are well-known obstructions to completely accurate measurements, formulated as “uncertainty principles”. It has been shown recently that “conventional” localization notions, based on variances associated with Lie-group generators and their corresponding uncertainty inequality, might be misleading if they are applied to transformation groups which differ from the Heisenberg group, the latter being the prevailing one in signal analysis and quantum mechanics. In this paper we describe a generic signal transform as a procedure of measuring the content of a signal at different values of a set of given physical quantities. This viewpoint sheds light on the relationship between signal transforms and uncertainty principles. In particular we introduce the concepts of “adjoint translations” and “adjoint observables”. We show that the fundamental issue of interest is the measurement of physical quantities via the appropriate localization operators, termed “adjoint observables”. It is shown how one can define, for each localization operator, a family of related “adjoint translation” operators that translate the spectrum of that localization operator. The adjoint translations in the examples of this paper correspond to well-known transformations in signal processing, such as the short time Fourier transform (STFT), the continuous wavelet transform (CWT) and the shearlet transform. We show how the means and variances of states transform appropriately under the translation action and compute associated minimizers and equalizers for the uncertainty criterion. Finally, the concept of adjoint observables is used to estimate concentration properties of ambiguity functions, the latter being an alternative localization concept frequently used in signal analysis.
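As a point of reference, these are the standard definitions (in common notation, not necessarily the paper's) of two of the transforms mentioned: the STFT, which measures signal content at time $b$ and frequency $\omega$, and the CWT, which measures it at position $b$ and scale $a$, for a window/wavelet $\psi$.

```latex
\[
  (V_{\psi}f)(b,\omega) = \int_{\mathbb{R}} f(t)\,\overline{\psi(t-b)}\,
      e^{-i\omega t}\,dt,
  \qquad
  (W_{\psi}f)(b,a) = \frac{1}{\sqrt{|a|}}\int_{\mathbb{R}} f(t)\,
      \overline{\psi\!\left(\frac{t-b}{a}\right)}\,dt .
\]
```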

4.
Three-dimensional dynamic scatterplots can reveal certain features of data that cannot be apprehended in marginal two-dimensional displays. Using graduate students as subjects, we sought to establish whether the detection of clusters and nonlinearity in 3-D plots varies by easily characterized properties of the data and the design of the display. We found that the probability of detection of clusters increased smoothly with cluster separation, and that, at a fixed level of separation, “diagonally” displaced clusters were easier to detect than “horizontally” displaced clusters. Cluster detection appeared to be affected to a smaller extent by the design of the display. Three further experiments addressed the detection of nonlinearity in 3-D dynamic scatterplots. Most subjects were able to respond in a reasonable manner to properties of the data, so that the probability of detection of nonlinearity increased with its level, particularly when the signal was strong. As in the experiment on cluster detection, subjects' performance was also affected, though to a lesser extent, by characteristics of the displays; for example, spinning the display horizontally in the regression plane was particularly effective. We discuss the implications of these results for the design of statistical software incorporating dynamic 3-D scatterplots.

5.
We study a variational problem for the perimeter associated with the Grushin plane, called the minimal partition problem with trace constraint. This consists in studying how to enclose three prescribed areas in the Grushin plane, using the least amount of perimeter, under an additional “one-dimensional” constraint on the intersections of their boundaries. We prove existence of regular solutions for this problem, and we characterize them in terms of isoperimetric sets, showing differences with the Euclidean case. The problem arises from the study of quantitative isoperimetric inequalities and has connections with the theory of minimal clusters.

6.
More than 50 years ago, John Tukey called for a reformation of academic statistics. In “The Future of Data Analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” Ten to 20 years ago, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland and Wu even suggested the catchy name “data science” for this envisioned field. A recent and growing phenomenon has been the emergence of “data science” programs at major universities, including UC Berkeley, NYU, MIT, and most prominently, the University of Michigan, which in September 2015 announced a $100M “Data Science Initiative” that aims to hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; yet many academic statisticians perceive the new programs as “cultural appropriation.” This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next 50 years. Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers, and Breiman, I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals. Based on a presentation at the Tukey Centennial Workshop, Princeton, NJ, September 18, 2015.

7.
At first we model the way an intelligence “I” constructs statements from phrases, and then how “I” interlocks these statements to form a string of statements to attain a concept. These strings of statements are called progressions. That is, starting with an initial stimulating relation between two phrases, we study how “I” forms the first statement of the progression and continues from this first statement to form the remaining statements in these progressions to construct a concept. We assume that “I” retains the progressions that it has constructed. We then show how these retained progressions provide “I” with a platform from which to incrementally construct more and more sophisticated conceptual structures. These conceptual structures are constructed in order to attain additional concepts. Choice plays a very important role in progression and concept formation. We show that as “I” forms new concepts, it enriches its conceptual structure and makes further concepts attainable. This incremental attainment of concepts is a way in which we humans learn, and this paper studies the attainability of concepts from previously attained concepts. We also study the ability of “I” to apply its progressions, and the ability of “I” to electively manipulate its conceptual structure to achieve new concepts. Application and elective manipulation require ingenuity and insight of “I”. We also show that as “I” attains new concepts, the conceptual structures change, and circumstances arise in which unanticipated conceptual discoveries become attainable. As the conceptual structure of “I” develops, the logical and structural relationships between the concepts embedded in this structure also develop. These relationships help “I” understand concepts in the context of other concepts, and help one intelligence “I1” communicate information and concept structures to another “I2”. The conceptual structures formed by “I” give rise to a directed web of statement paths, which is called a convolution web. The convolution web provides “I” with the paths along which it can reason and obtain new concepts, as well as alternative ways to attain a given concept. This paper is an extension of the ideas introduced in [1]. It is written to be self-contained, and the required background is supplied as needed.

8.
In this paper, we propose a new kernel-based fuzzy clustering algorithm that tries to find the best clustering results using optimal parameters for each kernel in each cluster. It is known that data with nonlinear relationships can be separated using one of the kernel-based fuzzy clustering methods. Two common fuzzy clustering approaches are clustering with a single kernel and clustering with multiple kernels. While clustering with a single kernel does not work well with “multiple-density” clusters, multiple-kernel fuzzy clustering tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. Our algorithm is an extension of the single-kernel fuzzy c-means and the multiple-kernel fuzzy clustering algorithms. In this algorithm, there is no need to supply “good” parameters for each kernel, nor an initial “good” number of kernels. Every cluster is characterized by a Gaussian kernel with optimal parameters. To demonstrate its effective clustering performance, we compare it to similar clustering algorithms using different data sets and different cluster validity measures.
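For background, a minimal sketch of plain Gaussian-kernel fuzzy c-means with a single fixed kernel width sigma, the baseline that such methods extend; the per-cluster kernel-parameter optimization the paper proposes is not implemented here, and the function name is illustrative.

```python
import numpy as np

def kernel_fcm(X, c, sigma=1.0, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Gaussian-kernel fuzzy c-means with one fixed kernel width sigma."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # prototypes
    for _ in range(n_iter):
        # Gaussian kernel between every point and every prototype.
        K = np.exp(-((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
        # Squared feature-space distance: ||phi(x) - phi(v)||^2 = 2(1 - K(x, v)).
        d2 = np.maximum(2.0 * (1.0 - K), 1e-12)
        # Standard fuzzy membership update.
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        # Kernel-weighted prototype update.
        W = (U ** m) * K
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]
        if np.abs(V_new - V).max() < tol:
            return U, V_new
        V = V_new
    return U, V
```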

9.
Performance of Jigs and a Statistical Model
In the international standard for evaluating gravity separation equipment for coal (ISO 923), and in the national standard modeled on it (GB/T 15715), the first evaluation index, the “probable error”, and the related “imperfection” are determined by the following steps: 1. a single float-and-sink test yields one set of data (a single observation), from which the partition ratios of the products of a two-stage separation are computed; 2. the two groups of 6 to 8 partition-ratio points are connected by hand, guided by an imagined “S-shaped curve”, to obtain two “partition curves”; 3. the 25% and 75% quantile points are “read off” these two rough curves, giving the two indices that reflect the separating performance of gravity separation equipment: the “probable error” E and the “imperfection” I. It is hard to imagine that indices obtained in this way, from a single observation and by model-free manual curve drawing, can have any practical meaning. This paper proposes fitting a logistic regression model to as many observations as possible, deriving the partition curve from the fitted model, and computing the parameters E and I from it. A set of real data is used to illustrate the reasonableness of our method.
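A minimal sketch of the advocated approach, using made-up float-and-sink data for illustration: fit a logistic model to the partition ratios, read the 25%, 50% and 75% density quantiles off the fitted curve, and compute E and I. The convention I = E/(d50 − 1) for the imperfection is an assumption here; conventions vary between standards.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(d, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * d)))

# Density of each fraction (g/cm^3) and observed partition ratio to sinks.
# These numbers are illustrative, not real data.
density   = np.array([1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
partition = np.array([0.02, 0.05, 0.15, 0.40, 0.70, 0.90, 0.97, 0.99])

(b0, b1), _ = curve_fit(logistic, density, partition, p0=(-10.0, 6.0))

def quantile_density(q):
    """Density at which the fitted partition ratio equals q."""
    return (np.log(q / (1 - q)) - b0) / b1

d25, d50, d75 = (quantile_density(q) for q in (0.25, 0.50, 0.75))
E = (d75 - d25) / 2          # probable error
I = E / (d50 - 1)            # imperfection (assumed convention)
print(f"E = {E:.3f}, I = {I:.3f}")
```

With many observations pooled into one fit, E and I come from a smooth model rather than from hand-drawn curves through a single observation.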

10.
In this paper, we report the results of a series of experiments on a version of the centipede game in which the total payoff to the two players is constant. Standard backward induction arguments lead to a unique Nash equilibrium outcome prediction, which is the same as the prediction made by theories of “fair” or “focal” outcomes. We find that subjects frequently fail to select the unique Nash outcome prediction. While this behavior was also observed in McKelvey and Palfrey (1992) in the “growing pie” version of the game they studied, the Nash outcome was not “fair”, and there was the possibility of Pareto improvement by deviating from Nash play. Their findings could therefore be explained by small amounts of altruistic behavior. There are no Pareto improvements available in the constant-sum games we examine. Hence, explanations based on altruism cannot account for these new data. We examine and compare two classes of models to explain these data. The first class consists of non-equilibrium modifications of the standard “Always Take” model. The other class we investigate, the Quantal Response Equilibrium model, describes an equilibrium in which subjects make mistakes in implementing their best replies and assume other players do so as well. One specification of this model fits the experimental data best, among the models we test, and is able to account for all the main features we observe in the data.
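For reference, the logit specification of Quantal Response Equilibrium, the variant most often fitted to such data (the paper's exact specification is not reproduced here): player $i$ chooses action $j$ with probability

```latex
\[
  p_{ij} \;=\; \frac{\exp\!\bigl(\lambda\,\bar{u}_{ij}(p)\bigr)}
                    {\sum_{k}\exp\!\bigl(\lambda\,\bar{u}_{ik}(p)\bigr)},
\]
```

where $\bar{u}_{ij}(p)$ is the expected payoff of action $j$ given that the other players also respond noisily according to $p$; $\lambda = 0$ gives uniform randomization, and $\lambda \to \infty$ recovers best-response (Nash) behavior.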

11.
The problem of merging Gaussian mixture components is discussed in situations where a Gaussian mixture has been fitted but the mixture components are not separated enough from each other to be interpreted as “clusters”. The merging problem is not statistically identifiable; therefore, merging algorithms have to be based on subjective cluster concepts. Cluster concepts based on unimodality and on misclassification probabilities (“patterns”) are distinguished. Several different hierarchical merging methods are proposed for the different cluster concepts, based on the ridgeline analysis of the modality of Gaussian mixtures, the dip test, the Bhattacharyya dissimilarity, a direct estimator of misclassification, and the strength of prediction of pairwise cluster memberships. The methods are compared in a simulation study and applied to two real datasets. A new method for visualising the separation of Gaussian mixture components, the ordered posterior plot, is also introduced.
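A minimal sketch of one of the criteria listed above: greedy hierarchical merging of fitted Gaussian components by Bhattacharyya distance. The moment-matched merge and the stopping threshold are illustrative choices, not the paper's exact method.

```python
import numpy as np

def bhattacharyya(m1, S1, m2, S2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    S = (S1 + S2) / 2
    d = m1 - m2
    term1 = d @ np.linalg.solve(S, d) / 8
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

def merge_components(weights, means, covs, threshold):
    """Greedily merge the closest pair of components until none is closer
    than the threshold."""
    w, m, S = list(weights), list(means), list(covs)
    while len(w) > 1:
        pairs = [(bhattacharyya(m[i], S[i], m[j], S[j]), i, j)
                 for i in range(len(w)) for j in range(i + 1, len(w))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        # Moment-matched merge of components i and j into one Gaussian.
        wi, wj = w[i], w[j]
        mu = (wi * m[i] + wj * m[j]) / (wi + wj)
        Sig = (wi * (S[i] + np.outer(m[i] - mu, m[i] - mu)) +
               wj * (S[j] + np.outer(m[j] - mu, m[j] - mu))) / (wi + wj)
        for k in sorted((i, j), reverse=True):
            del w[k], m[k], S[k]
        w.append(wi + wj); m.append(mu); S.append(Sig)
    return np.array(w), m, S
```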

12.
This paper analyzes how several geometric theorems that were considered to be disconnected from each other at the beginning of the nineteenth century have progressively been recognized as elements of a bigger whole called “the theorems of closure.” In particular, we show that the constitution of this set of theorems was grounded in the use of encompassing words, as well as in observations of analogies and searches for unifying points of view. In the concluding remarks, we discuss the relevance of the notion of “family resemblance” for describing the categorization process of the theorems of closure during the nineteenth century.

13.
We consider the problem of estimating the optimal steady effort level from a time series of catch and effort data, taking account of errors in the observation of the “effective effort” as well as randomness in the stock-production function. The “total least squares” method ignores the time series nature of the data, while the “approximate likelihood” method takes it into account. We compare estimation schemes based upon these two methods by applying them to artificial data for which the “correct” parameters are known. We use a similar procedure to compare the effectiveness of a “power model” for stock and production with the “Ricker model.” We apply these estimation methods to some sets of real data, and obtain an interval estimate of the optimal effort.
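One common parameterization of the two stock-production models being compared, with stock $S_t$, production $P_t$, and multiplicative noise $\varepsilon_t$ (the paper's exact error structure may differ):

```latex
\[
  \text{power model: } P_t = \alpha\,S_t^{\beta}\,\varepsilon_t,
  \qquad
  \text{Ricker model: } P_t = \alpha\,S_t\,e^{-\beta S_t}\,\varepsilon_t .
\]
```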

14.
Spatial scan statistics are commonly used for geographic disease cluster detection and evaluation. We propose and implement a modified version of the simulated annealing spatial scan statistic that incorporates a “non-compactness” penalty for clusters that are very irregular in shape. We evaluate the power of the penalized simulated annealing scan and compare it with the circular and elliptic spatial scan statistics. We observe that, with the non-compactness penalty, the simulated annealing method is competitive with the circular and elliptic scan statistics, and both have good power performance. The elliptic scan statistic is computationally faster and is well suited for mildly irregular clusters, but the simulated annealing method deals better with highly irregular cluster shapes. The new method is applied to breast cancer mortality data from the northeastern United States.
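A minimal sketch of a shape-penalized spatial scan statistic: Kulldorff's Poisson log-likelihood ratio for a candidate zone, multiplied by the zone's geometric compactness (4·pi·area/perimeter², which equals 1 for a circle and shrinks for irregular shapes). Using the compactness ratio directly as the penalty is an illustrative choice; the paper's exact penalty may differ, and the function names are hypothetical.

```python
import numpy as np

def poisson_llr(cases_in_zone, expected_in_zone, total_cases):
    """Kulldorff's Poisson log-likelihood ratio for one candidate zone."""
    c, mu, C = cases_in_zone, expected_in_zone, total_cases
    if c <= mu:                      # only elevated-risk zones are of interest
        return 0.0
    return (c * np.log(c / mu) +
            (C - c) * np.log((C - c) / (C - mu)))

def penalized_scan_statistic(c, mu, C, zone_area, zone_perimeter):
    """LLR weighted by geometric compactness to penalize irregular zones."""
    compactness = 4 * np.pi * zone_area / zone_perimeter ** 2
    return poisson_llr(c, mu, C) * compactness
```

A search procedure (circular, elliptic, or simulated annealing over connected zones) then maximizes this penalized statistic over candidate zones.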

15.
Recently Papadimitriou has proposed a randomized “bit-flipping” method for solving the 2-satisfiability problem, and the author has proposed a randomized recoloring method which, given a 3-colorable graph, finds a 2-coloring of the vertices so that no triangle is monochromatic. Both methods involve finding a “bad” configuration (unsatisfied clause, monochromatic triangle) and randomly changing one of the bits involved. In this paper we see how these problems and methods fit naturally in a more general geometrical context in which we seek a vector which “agrees” with a given collection of vectors; and we propose a simple “bit-flipping” method for the more general problem, which extends the solution methods for the two problems mentioned above. Further, we consider deterministic methods to handle such problems, and in particular we see how to solve the above “triangle problem” for 3-colorable graphs deterministically in polynomial time. © 1996 John Wiley & Sons, Inc.
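A minimal sketch of the randomized “bit-flipping” 2-SAT method referred to above (Papadimitriou's random walk): while some clause is unsatisfied, pick one and flip one of its two variables at random. For a satisfiable formula on n variables this succeeds within O(n²) expected flips; the flip budget below is an illustrative cutoff.

```python
import random

def random_walk_2sat(clauses, n_vars, max_flips=None):
    """clauses: list of pairs of nonzero ints; literal +i means x_i, -i means not x_i."""
    if max_flips is None:
        max_flips = 100 * n_vars * n_vars
    assign = [random.choice([False, True]) for _ in range(n_vars + 1)]

    def satisfied(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_flips):
        unsat = [cl for cl in clauses if not (satisfied(cl[0]) or satisfied(cl[1]))]
        if not unsat:
            return assign[1:]                        # satisfying assignment found
        lit = random.choice(random.choice(unsat))    # random literal of a random bad clause
        assign[abs(lit)] = not assign[abs(lit)]      # flip that variable
    return None                                      # probably unsatisfiable

# Example: (x1 or not x2) and (not x1 or x2)
print(random_walk_2sat([(1, -2), (-1, 2)], n_vars=2))
```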

16.
In the framework of the evolutionary dynamics of the Prisoner's Dilemma game on complex networks, we investigate the possibility that the average level of cooperation shows hysteresis under quasi-static variations of a model parameter (the “temptation to defect”). Under the “discrete replicator” strategy-updating rule, for both Erdős–Rényi and Barabási–Albert graphs we observe cooperation hysteresis cycles provided one reaches tipping-point values of the parameter; otherwise, perfect reversibility is obtained. The selective fixation of cooperation at certain nodes, and its organization into cooperator clusters surrounded by fluctuating strategists, allows us to rationalize the observed “lagging behind” behavior.
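A minimal sketch of one round of a proportional-imitation (“discrete replicator”-style) update on a network Prisoner's Dilemma. The payoff convention assumed here (R = 1, S = P = 0, temptation b > 1, a common “weak PD” normalization) and the normalization of the imitation probability are illustrative; the paper's exact rule may differ.

```python
import random
import networkx as nx

def payoff(G, strat, i, b):
    """Accumulated payoff of node i; strat maps node -> 1 (cooperate) / 0 (defect)."""
    total = 0.0
    for j in G[i]:
        if strat[i] == 1 and strat[j] == 1:
            total += 1.0          # mutual cooperation (R = 1)
        elif strat[i] == 0 and strat[j] == 1:
            total += b            # defector exploits cooperator (T = b)
        # C vs D (S = 0) and D vs D (P = 0) contribute nothing
    return total

def replicator_round(G, strat, b):
    """One synchronous round of proportional imitation."""
    new = dict(strat)
    for i in G:
        neighbors = list(G[i])
        if not neighbors:
            continue
        j = random.choice(neighbors)
        pi_i, pi_j = payoff(G, strat, i, b), payoff(G, strat, j, b)
        if pi_j > pi_i:
            # Imitate with probability proportional to the payoff difference,
            # normalized so that it never exceeds 1.
            prob = (pi_j - pi_i) / (b * max(G.degree[i], G.degree[j]))
            if random.random() < prob:
                new[i] = strat[j]
    return new
```

Sweeping b slowly up and then back down, and recording the cooperator fraction at each step, is the quasi-static protocol under which the hysteresis cycles described above can be looked for.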

17.
Many economic and financial applications lead, from the mathematical point of view, to deterministic optimization problems depending on a probability measure. These problems can be static (one-stage), dynamic with finite (multistage) or infinite horizon, and single-objective or multiobjective. We focus on the one-stage case in a multiobjective setting. Well-known results from deterministic optimization theory can evidently be employed when the “underlying” probability measure is completely known. However, the assumption of complete knowledge of the probability measure is seldom fulfilled, so we must usually analyze the mathematical models on the basis of data to obtain stochastic estimates of the corresponding “theoretical” characteristics. The investigation of such estimates has mostly been carried out in the one-objective case. In this paper we investigate the relationship between the “characteristics” obtained under complete knowledge of the probability measure and the estimates obtained from the data, mainly in the multiobjective case. In doing so we also relate the data-based analysis of economic process characteristics to the “real” economic process. To this end, we employ results from deterministic multiobjective optimization theory together with results obtained for stochastic single-objective problems.

18.
19.
Recently a new statistical methodology, developed over the last three decades, has become available to practitioners. This methodology is called “ranking and selection” theory. In this article we review procedures for completely ranking a set of populations (from “best”, “second best”, etc., down to “worst”); we also give new tables needed to implement these procedures, and we consider several practical examples using real data.

20.
In summer 2006 the University of Education in Weingarten, Germany, and East China Normal University, Shanghai, ran a semi-virtual seminar with mathematics students on “Mathematics and Architecture”. The goal was the joint development of teaching materials for German or Chinese schools, based on different buildings such as the “Nanpu Bridge” or the “Eiffel Tower”. The purpose of the seminar was to provide a learning environment, supported by information and communication technology (ICT), in which students could understand how the mathematics hidden in buildings relates to school mathematics; experience the multicultural potential of the international language “Mathematics”; develop “media competence” while communicating with others and using technology in mathematics education; and recognize the differences in mathematics teaching between the two cultures. In this paper we present our ideas, experiences and results from the seminar.
