Similar Articles
20 similar articles found.
1.
Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), and strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound on the auxiliary variable in IB, making its computation more tractable for discrete random variables. In the second part, we introduce a general family of optimization problems, termed “bottleneck problems”, by replacing mutual information in IB and PF with other notions of mutual information, namely f-information and Arimoto’s mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. While the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of the lower convex or upper concave envelopes of certain functions. By applying this technique to the binary case, we derive closed-form expressions for several bottleneck problems.
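For orientation, one common way to state the two problems (a sketch based on standard treatments, not taken verbatim from this abstract; the Markov chain Y – X – T and the constraint level r are assumed notation):

```latex
% Information bottleneck and privacy funnel as constrained optimizations over the
% test channel P_{T|X}, with Markov chain Y - X - T and constraint level r:
\mathrm{IB}(r) \;=\; \max_{P_{T\mid X}\,:\; I(X;T)\,\le\, r} I(Y;T),
\qquad
\mathrm{PF}(r) \;=\; \min_{P_{T\mid X}\,:\; I(X;T)\,\ge\, r} I(Y;T).
```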

2.
We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural dependency with heterogeneity. CEDA is developed upon all features’ categorical nature via histograms and is guided by all features’ associative patterns (order-2 dependence) in a mutual conditional entropy matrix. Higher-order structural dependency of k (≥3) features is exhibited through block patterns within heatmaps that are constructed by permuting contingency-kD-lattices of counts. By growing k, the resultant heatmap series contains the global and large-scale structural dependency that constitutes the data matrix’s information content. When continuous features are involved, principal component analysis (PCA) extracts fine-scale information content from each block in the final heatmap. Our mimicking protocol coherently simulates this heatmap series by preserving global-to-fine-scale structural dependency. At every step of the mimicking process, each accepted simulated heatmap is subject to constraints with respect to all of the reliable observed categorical patterns. For reliability and robustness in the sciences, CEDA with mimicking enhances data visualization by revealing deterministic and stochastic structures within each scale-specific structural dependency. For inference in Machine Learning (ML) and Statistics, it clarifies at which scales which covariate feature-groups have major-vs.-minor predictive powers on response features. For the social justice of Artificial Intelligence (AI) products, it checks whether a data matrix incompletely prescribes the targeted system.
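As a rough illustration of the order-2 step, the sketch below computes a mutual conditional entropy matrix for categorical features, assuming the directional association is the rescaled conditional entropy H(A|B)/H(A); the symmetrization by averaging is our illustrative choice, not necessarily the authors' exact definition.

```python
import numpy as np
import pandas as pd

def entropy(counts):
    """Shannon entropy (nats) of a vector of category counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def conditional_entropy(a, b):
    """H(A | B) for two categorical series, computed from their contingency table."""
    table = pd.crosstab(a, b)
    n = table.values.sum()
    return sum(table[c].sum() / n * entropy(table[c].values) for c in table.columns)

def mutual_ce_matrix(df):
    """Rescaled conditional-entropy associations for all feature pairs, symmetrized."""
    cols = list(df.columns)
    m = pd.DataFrame(0.0, index=cols, columns=cols)
    for a in cols:
        ha = entropy(df[a].value_counts().values)
        for b in cols:
            if a != b and ha > 0:
                m.loc[a, b] = conditional_entropy(df[a], df[b]) / ha
    return (m + m.T) / 2   # the heatmap of this matrix is the CEDA roadmap
```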

3.
Users of social networks have a variety of social statuses and roles. For example, the users of Weibo include celebrities, government officials, and social organizations. At the same time, these users may be senior managers, middle managers, or workers in companies. Previous studies on this topic have mainly focused on using the categorical, textual, and topological data of a social network to predict users’ social statuses and roles. However, this cannot fully reflect the overall characteristics of users’ social statuses and roles in a social network. In this paper, we consider which social network structures reflect users’ social statuses and roles, since social networks are designed to connect people. Taking an Enron email dataset as an example, we analyzed a preprocessing mechanism for social network datasets that can extract users’ dynamic behavior features. We further designed a novel social network representation learning algorithm that infers users’ social statuses and roles in social networks through an attention and gate mechanism on users’ neighbors. Extensive experimental results obtained on four publicly available datasets indicate that our solution achieves an average accuracy improvement of 2% over GraphSAGE-Mean, the best applicable inductive representation learning method.
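As a toy illustration of attention-plus-gate aggregation over a user's neighbors (the names, dimensions, and parameter matrices below are ours for illustration, not the paper's architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate(h_self, h_neighbors, W_att, W_gate):
    """Attention-weighted neighbor summary, blended with the self embedding by a gate.

    h_self: (d,), h_neighbors: (k, d), W_att and W_gate: (d, d) illustrative parameters.
    """
    scores = h_neighbors @ (W_att @ h_self)   # (k,) compatibility with the target user
    alpha = softmax(scores)                   # attention weights over neighbors
    neigh = alpha @ h_neighbors               # (d,) weighted neighbor summary
    g = sigmoid(W_gate @ h_self)              # (d,) per-dimension gate
    return g * neigh + (1.0 - g) * h_self

# tiny usage example with random data
rng = np.random.default_rng(0)
d, k = 8, 5
out = aggregate(rng.normal(size=d), rng.normal(size=(k, d)),
                rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```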

4.
Network data analysis is a crucial method for mining complicated object interactions. In recent years, random walk and neural-language-model-based network representation learning (NRL) approaches have been widely used for network data analysis. However, these NRL approaches suffer from the following deficiencies: firstly, because the random walk procedure is based on symmetric node similarity and a fixed probability distribution, the sampled vertex sequences may lose local community structure information; secondly, because the feature extraction capacity of the shallow neural language model is limited, they can only extract the local structural features of networks; and thirdly, these approaches require specially designed mechanisms for different downstream tasks to integrate vertex attributes of various types. We conducted an in-depth investigation to address the aforementioned issues and propose a novel general NRL framework called dynamic structure and vertex attribute fusion network embedding. It first defines an asymmetric similarity and an h-hop dynamic random walk strategy to guide the random walk process so that the network’s local community structure is preserved in the walked vertex sequences. Next, we train a self-attention-based sequence prediction model on the walked vertex sequences to simultaneously learn the vertices’ local and global structural features. Finally, we introduce an attributes-driven Laplacian space optimization to make the structural feature extraction and attribute feature extraction processes converge. The proposed approach is exhaustively evaluated by means of node visualization and classification on multiple benchmark datasets, and achieves superior results compared to baseline approaches.
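The sketch below shows the flavor of a similarity-biased random walk, in which the next step follows an (here arbitrary) asymmetric score s(u → v); the paper's actual asymmetric similarity and h-hop dynamic strategy are not reproduced.

```python
import random

def biased_walk(adj, sim, start, walk_len):
    """adj: node -> list of neighbors; sim: (u, v) -> asymmetric score s(u -> v)."""
    walk, cur = [start], start
    for _ in range(walk_len - 1):
        nbrs = adj.get(cur, [])
        if not nbrs:
            break
        weights = [max(sim.get((cur, v), 0.0), 1e-12) for v in nbrs]
        cur = random.choices(nbrs, weights=weights)[0]
        walk.append(cur)
    return walk

# toy graph: two triangles joined by a bridge; scores favor within-community steps,
# so walks tend to stay inside a community and preserve its local structure
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
sim = {(u, v): (2.0 if (u < 3) == (v < 3) else 0.5) for u in adj for v in adj[u]}
print(biased_walk(adj, sim, start=0, walk_len=10))
```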

5.
Detecting outliers is a widely studied problem in many disciplines, including statistics, data mining, and machine learning. All anomaly detection activities are aimed at identifying cases of unusual behavior compared to most observations. There are many methods to deal with this issue, and their applicability depends on the size of the data set, the way it is stored, and the type of attributes and their values. Most of them focus on traditional datasets with a large number of quantitative attributes. While there is a multitude of solutions for detecting outliers in quantitative data, detecting outliers in data containing only qualitative variables remains a problem with only a small number of research solutions. This article compares three different categorical data clustering algorithms: the K-modes algorithm derived from MacQueen’s K-means algorithm, and the STIRR and ROCK algorithms. The comparison concerns the way each algorithm divides the set into clusters and, in particular, the outliers it detects. During the research, the authors analyzed the clusters detected by the indicated algorithms, using several datasets that differ in the number of objects and variables, and conducted experiments on the parameters of the algorithms. The presented study made it possible to check whether the algorithms detect outliers in the data in a similar way and how much they depend on individual parameters and on properties of the set, such as the number of variables, tuples, and categories of a qualitative variable.
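For concreteness, a minimal K-modes (matching dissimilarity plus per-attribute mode updates) is sketched below; flagging points that are far from their cluster mode is one plausible way to surface outliers, not necessarily the criterion used in the article.

```python
import numpy as np

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return int(np.sum(a != b))

def kmodes(X, k, n_iter=20, seed=0):
    """Minimal K-modes for a categorical data matrix X (n x m array of labels)."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, row in enumerate(X):                      # assignment step
            labels[i] = min(range(k), key=lambda c: mismatch(row, modes[c]))
        new_modes = modes.copy()                         # mode-update step
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            for j in range(X.shape[1]):
                vals, counts = np.unique(members[:, j], return_counts=True)
                new_modes[c, j] = vals[np.argmax(counts)]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes

def outlier_scores(X, labels, modes):
    """One possible outlier score: distance of each object to its own cluster mode."""
    return np.array([mismatch(row, modes[labels[i]]) for i, row in enumerate(X)])
```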

6.
Estimates based on expert judgements of quantities of interest are commonly used to supplement or replace measurements when the latter are too expensive or impossible to obtain. Such estimates are commonly accompanied by information about the uncertainty of the estimate, such as a credible interval. To be considered well-calibrated, an expert’s credible intervals should cover the true (but unknown) values a certain percentage of the time, equal to the percentage specified by the expert. To assess expert calibration, so-called calibration questions may be asked in an expert elicitation exercise; these are questions with known answers used to assess and compare experts’ performance. A common approach to assessing experts’ performance with these questions is to directly compare the stated coverage percentage with the actual coverage. We show that this approach has statistical drawbacks when considered in a rigorous hypothesis testing framework. We generalize the test to an equivalence testing framework and discuss the properties of this new proposal. We show that comparisons made on even a modest number of calibration questions have poor power, which suggests that formal testing of the calibration of experts in an experimental setting may be prohibitively expensive. We contextualise the theoretical findings with a couple of applications and discuss the implications of our findings.
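As a small sketch of the contrast being drawn (exact binomial tails; the equivalence margin delta and the two-one-sided construction are our illustrative choices, not the paper's exact test):

```python
from scipy.stats import binom

def coverage_pvalue(hits, n, stated=0.90):
    """Two-sided exact binomial p-value for H0: true coverage equals the stated level."""
    tail = binom.cdf(hits, n, stated) if hits / n <= stated else binom.sf(hits - 1, n, stated)
    return min(1.0, 2.0 * tail)

def equivalence_pvalue(hits, n, stated=0.90, delta=0.05):
    """TOST-style check: evidence that coverage lies within [stated - delta, stated + delta].
    The reported p-value is the larger of the two one-sided tails."""
    p_low = binom.sf(hits - 1, n, stated - delta)   # H0: coverage <= stated - delta
    p_high = binom.cdf(hits, n, stated + delta)     # H0: coverage >= stated + delta
    return max(p_low, p_high)

# e.g., an expert whose 90% credible intervals covered 16 of 20 calibration questions
print(coverage_pvalue(16, 20), equivalence_pvalue(16, 20))
```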

7.
All features of any data type are universally equipped with a categorical nature revealed through histograms. A contingency table framed by two histograms affords directional and mutual associations based on rescaled conditional Shannon entropies for any feature-pair. The heatmap of the mutual association matrix of all features becomes a roadmap showing which features are highly associative with which others. We develop our data analysis paradigm, called categorical exploratory data analysis (CEDA), with this heatmap as a foundation. CEDA is demonstrated to provide new resolutions for two topics: multiclass classification (MCC) with one single categorical response variable and response manifold analytics (RMA) with multiple response variables. We compute visible and explainable information content with multiscale and heterogeneous deterministic and stochastic structures in both topics. MCC involves all feature-group-specific mixing geometries of labeled high-dimensional point-clouds. Upon each identified feature-group, we devise an indirect distance measure, a robust label embedding tree (LET), and a series of tree-based binary competitions to discover and present asymmetric mixing geometries. A chain of complementary feature-groups then offers a collection of mixing geometric pattern-categories with multiple perspective views. RMA studies a system’s regulating principles via multi-dimensional manifolds jointly constituted by targeted multiple response features and selected major covariate features. This manifold is marked with categorical localities reflecting major effects, and diverse minor effects are checked and identified across all localities for heterogeneity. The information content of both MCC and RMA is computed as the data’s information content, with predictive inferences as by-products. We illustrate CEDA developments via the Iris data and demonstrate its applications on data taken from the PITCHf/x database.

8.
Text classification is a fundamental research direction that aims to assign tags to text units. Recently, graph neural networks (GNN) have exhibited some excellent properties in textual information processing, and pre-trained language models have also achieved promising results in many tasks. However, many text processing methods cannot model a single text unit’s structure or ignore its semantic features. To solve these problems and comprehensively utilize a text’s structural information and semantic information, we propose a Bert-Enhanced text Graph Neural Network model (BEGNN). For each text, we construct a text graph separately according to the co-occurrence relationships of words and use a GNN to extract text features; moreover, we employ Bert to extract semantic features. The former part takes the structural information into account, while the latter focuses on modeling the semantic information. Finally, we interact and aggregate these two features of different granularity to obtain a more effective representation. Experiments on standard datasets demonstrate the effectiveness of BEGNN.
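A small sketch of one typical way such a per-document text graph is built, using a sliding co-occurrence window (window size and counting scheme are illustrative assumptions; the Bert branch and the GNN itself are not shown):

```python
from collections import defaultdict
from itertools import combinations

def build_text_graph(tokens, window=3):
    """Nodes are word types; edge weights count co-occurrences within a sliding window."""
    edges = defaultdict(int)
    for start in range(max(len(tokens) - window + 1, 1)):
        span = set(tokens[start:start + window])
        for u, v in combinations(sorted(span), 2):
            edges[(u, v)] += 1
    return sorted(set(tokens)), dict(edges)

nodes, edges = build_text_graph("graph neural networks model each text as a graph".split())
print(nodes)
print(edges)
```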

9.
The development of Internet technology has provided great convenience for data transmission and sharing, but it also brings serious security problems related to data protection. As detailed in this paper, an enhanced steganography network was designed to protect secret image data that contains private or confidential information; this network consists of a concealing network and a revealing network, which achieve image embedding and recovery separately. To reduce the system’s computational complexity, we constructed the network’s framework using a down–up structure in order to compress the intermediate feature maps. To mitigate the loss of input information caused by a sequence of convolution blocks, a long skip concatenation method was designed to pass the raw information to the top layer, thus synthesizing high-quality hidden images with fine texture details. In addition, we propose a novel strategy called non-activated feature fusion (NAFF), which is designed to provide stronger supervision for synthesizing higher-quality hidden and recovered images. To further boost the hidden image’s visual quality and enhance its imperceptibility, an attention-mechanism-based enhancement module was designed to reconstruct and enhance the salient target, thus covering up and obscuring the embedded secret content. Furthermore, a hybrid loss function composed of pixel-domain loss and structure-domain loss was designed to boost the hidden image’s structural quality and visual security. Our experimental results demonstrate that, due to the elaborate design of the network structure and loss function, our proposed method achieves high levels of imperceptibility and security.

10.
Raman spectroscopy has the potential to significantly aid in the research and diagnosis of cancer. The information-dense, complex spectra generate massive datasets in which subtle correlations may provide critical clues for biological analysis and pathological classification. Therefore, implementing advanced data mining techniques is imperative for complete, rapid and accurate spectral processing. Numerous recent studies have applied various data mining methods to Raman spectra for classification and biochemical analysis. However, as Raman datasets from biological specimens are often characterized by high dimensionality and low sample numbers, many of these classification models are subject to overfitting. Furthermore, attempts to reduce dimensionality result in transformed feature spaces, making the biological evaluation of significant and discriminative spectral features problematic. We have developed a novel data mining framework optimized for Raman datasets, called Fisher-based Feature Selection Support Vector Machines (FFS-SVM). This framework provides simultaneous supervised classification and user-defined Fisher criterion-based feature selection, reducing overfitting and directly yielding significant wavenumbers from the original feature space. Herein, we investigate five cancerous and non-cancerous breast cell lines using Raman microspectroscopy and our unique FFS-SVM framework. Our framework's classification performance is then compared to several other frequently employed classification methods on four classification tasks. The four tasks were constructed by an unsupervised clustering method yielding four different categories of cell line groupings (e.g. cancer vs non-cancer). FFS-SVM achieves both high classification accuracies and the extraction of biologically significant features. The top ten most discriminative features are discussed in terms of cell-type-specific biological relevance. Our framework provides comprehensive cellular-level characterization and could potentially lead to the discovery of cancer biomarker-type information, which we have informally termed ‘Raman-based spectral biomarkers’. The FFS-SVM framework along with Raman spectroscopy will be used in future studies to investigate in-situ dynamic biological phenomena. Copyright © 2013 John Wiley & Sons, Ltd.
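A simplified two-class sketch of the idea (Fisher-score ranking followed by a linear SVM on the selected wavenumbers); the actual FFS-SVM framework couples selection and classification more tightly, so treat this only as the flavor:

```python
import numpy as np
from sklearn.svm import SVC

def fisher_scores(X, y):
    """Two-class Fisher criterion per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 / (X0.var(axis=0) + X1.var(axis=0) + 1e-12)

def select_and_classify(X_train, y_train, X_test, top_k=10):
    """Keep the top_k wavenumbers by Fisher score and train a linear SVM on them;
    the returned indices point back to the original (untransformed) feature space."""
    idx = np.argsort(fisher_scores(X_train, y_train))[::-1][:top_k]
    clf = SVC(kernel="linear").fit(X_train[:, idx], y_train)
    return clf.predict(X_test[:, idx]), idx
```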

11.
With the growing availability of position data in sports, spatiotemporal analysis in soccer is a topic of rising interest. The aim of this study is to validate a performance indicator, namely D-Def, that measures passing effectiveness. D-Def calculates the change of the team’s centroid, the centroids of formation lines (e.g., the defensive line), the team’s surface area, and the team’s spread in the three seconds following a pass, and therefore yields a measure of the disruption of the opponents’ defense caused by a pass. While this measure was introduced earlier, in this study we aim to demonstrate its usefulness for evaluating attacking sequences. In this study, 258 games of the 2018/19 Dutch Eredivisie season were included, resulting in 13,094 attacks. D-Def, pass length, pass velocity, and pass angle of the last four passes of each attack were calculated and compared between successful and unsuccessful attacks. D-Def showed higher values for passes of successful compared to unsuccessful attacks (0.001 < p ≤ 0.029, 0.06 ≤ d ≤ 0.23). This difference showed the largest effect sizes in the penultimate pass (d = 0.23) and the maximal D-Def value of an attack (d = 0.23). Passing length (0.001 < p ≤ 0.236, 0.08 ≤ d ≤ 0.17) and passing velocity (0.001 < p ≤ 0.690, −0.09 ≤ d ≤ 0.12) showed inconsistent results in discriminating between successful and unsuccessful attacks. The results indicate that D-Def is a useful indicator for measuring pass effectiveness in attacking sequences, highlighting that successful attacks are connected to disruptive passing. Within successful attacks, at least one highly disruptive action (a pass with D-Def > 28) needs to be present. In addition, the penultimate pass (“hockey assist”) of an attack seems crucial in characterizing successful attacks.
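A sketch of the kind of disruption computation described (defending-team centroid, convex-hull surface area, and spread compared between the moment of the pass and three seconds later); how these components are weighted into the single D-Def value is not stated in this abstract, so only the raw changes are returned:

```python
import numpy as np
from scipy.spatial import ConvexHull

def team_descriptors(positions):
    """positions: (n_players, 2) x/y coordinates of the defending team."""
    centroid = positions.mean(axis=0)
    area = ConvexHull(positions).volume                    # for 2-D points, .volume is the area
    spread = np.linalg.norm(positions - centroid, axis=1).mean()
    return centroid, area, spread

def disruption(def_at_pass, def_after_3s):
    """Change of the defensive descriptors from the pass to three seconds later."""
    c0, a0, s0 = team_descriptors(def_at_pass)
    c1, a1, s1 = team_descriptors(def_after_3s)
    return {"centroid_shift": float(np.linalg.norm(c1 - c0)),
            "area_change": float(a1 - a0),
            "spread_change": float(s1 - s0)}
```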

12.
We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. We then resolve these topics’ data analysis tasks by discovering the major factors underlying such Re-Co dynamics, making use only of the data’s categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon’s conditional entropy (CE) and mutual information (I[Re;Co]) as the two key information-theoretic measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re;Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly work through six examples of Re-Co dynamics, within each of which several extended scenarios are also explored and discussed.
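The plug-in computation on the contingency-table platform can be sketched as below (ranking candidate covariate groups by I[Re;Co] = H(Re) − H(Re|Co)); the paper's [C1:confirmable] guidelines go beyond this naive evaluation, so the sketch only fixes ideas.

```python
import numpy as np
import pandas as pd

def H(series):
    """Plug-in Shannon entropy of a categorical column (nats)."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def CE(df, re, co_group):
    """H(Re | Co-group): group rows by the joint categories of the covariate group."""
    n = len(df)
    return sum(len(block) / n * H(block[re]) for _, block in df.groupby(list(co_group)))

def rank_major_factors(df, re, candidate_groups):
    """Rank candidate covariate feature-groups by I[Re;Co] = H(Re) - H(Re|Co)."""
    h_re = H(df[re])
    scores = {tuple(g): h_re - CE(df, re, g) for g in candidate_groups}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```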

13.
We construct a 2-categorical extension of the relative entropy functor of Baez and Fritz, and show that our construction is functorial with respect to vertical morphisms. Moreover, we show that such a ‘2-relative entropy’ satisfies natural 2-categorical analogues of convex linearity, vanishing under optimal hypotheses, and lower semicontinuity. While relative entropy is a relative measure of information between probability distributions, we view our construction as a relative measure of information between channels.

14.
One of the tasks of data science is the decomposition of large matrices in order to understand their structures. A special case of this is when we decompose relations, i.e., logical matrices. In this paper, we present a method based on the similarity of rows and columns, which uses correlation clustering to cluster the rows and columns of the matrix, facilitating the visualization of the relation by rearranging the rows and columns. In this article, we compare our method with Gunther Schmidt’s problems and solutions. Our method produces the original solutions by selecting its parameters from a small set. However, with other parameters, it provides solutions with even lower entropy.
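A simplified sketch of the idea (agreement similarity between 0/1 rows, a greedy grouping, and a reordering that makes block structure visible); the article's correlation-clustering formulation and its parameter set are richer than this.

```python
import numpy as np

def row_similarity(M):
    """Fraction of positions on which each pair of 0/1 rows agrees."""
    return (M[:, None, :] == M[None, :, :]).sum(axis=2) / M.shape[1]

def greedy_clusters(S, threshold=0.7):
    """A row joins the first cluster whose representative it matches at least `threshold`,
    otherwise it starts a new cluster (a crude stand-in for correlation clustering)."""
    clusters = []
    for i in range(S.shape[0]):
        for cl in clusters:
            if S[i, cl[0]] >= threshold:
                cl.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def reorder_rows(M, threshold=0.7):
    """Rearrange rows by cluster; applying the same to M.T reorders the columns."""
    order = [i for cl in greedy_clusters(row_similarity(M), threshold) for i in cl]
    return M[order, :], order
```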

15.
In many hypothesis testing applications, we have mixed priors, with well-motivated informative priors for some parameters but not for others. The Bayesian methodology uses the Bayes factor and is helpful for the informative priors, as it incorporates Occam’s razor via the multiplicity or trials factor in the look-elsewhere effect. However, if the prior is not known completely, the frequentist hypothesis test via the false-positive rate is a better approach, as it is less sensitive to the prior choice. We argue that when only partial prior information is available, it is best to combine the two methodologies by using the Bayes factor as a test statistic in the frequentist analysis. We show that the standard frequentist maximum likelihood-ratio test statistic corresponds to the Bayes factor with a non-informative Jeffreys prior. We also show that mixed priors increase the statistical power in frequentist analyses over the maximum likelihood test statistic. We develop an analytic formalism that does not require expensive simulations and generalize Wilks’ theorem beyond its usual regime of validity. In specific limits, the formalism reproduces existing expressions, such as the p-value of linear models and periodograms. We apply the formalism to an example of exoplanet transits, where the multiplicity can exceed 10^7. We show that our analytic expressions reproduce the p-values derived from numerical simulations. We offer an interpretation of our formalism based on statistical mechanics. We introduce the counting of states in a continuous parameter space using the uncertainty volume as the quantum of the state. We show that both the p-value and the Bayes factor can be expressed as an energy-versus-entropy competition.
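For orientation, the standard frequentist ingredient referenced here is the maximum likelihood-ratio statistic with its Wilks limit, alongside the generic Bayes factor (a sketch of textbook definitions; the paper's generalization beyond this regime is not reproduced):

```latex
% k = number of parameters fixed under the null hypothesis \theta_0
\lambda \;=\; 2\,\ln\frac{\max_{\theta} L(\theta)}{L(\theta_0)}
\;\xrightarrow{\;d\;}\; \chi^2_k ,
\qquad
B \;=\; \frac{\int L(\theta)\,\pi(\theta)\,d\theta}{L(\theta_0)} .
```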

16.
Quantum error correction (QEC) is an effective way to overcome quantum noise and decoherence; meanwhile, the fault tolerance of the encoding circuit, syndrome measurement circuit, and logical gate realization circuit must be ensured in order to achieve reliable quantum computing. The Steane code, proposed in 1996, is one of the most famous codes; however, its classical encoding circuit based on stabilizer implementation is not fault-tolerant. In this paper, we propose a method to design a fault-tolerant encoding circuit for Calderbank-Shor-Steane (CSS) codes based on stabilizer implementation and “flag” bits. We use the Steane code as an example to depict in detail the fault-tolerant encoding circuit design process, including the logical operation implementation, the stabilizer implementation, and the “flag” qubit design. The simulation results show that, assuming only one quantum gate can fail with a certain probability p, the classical encoding circuit will have logic errors proportional to p; our proposed circuit is fault-tolerant because, with the help of the “flag” bits, all types of errors in the encoding process can be accurately and uniquely determined and then fixed. If all gates can fail with a certain probability p, which is the actual situation, the proposed encoding circuit will also fail with a certain probability, but its error rate is greatly reduced from p to p^2 compared with the original circuit. This encoding circuit design process can be extended to other CSS codes to improve the correctness of the encoding circuit.

17.
One of the consequences of the big data revolution is that data are more heterogeneous than ever. A new challenge appears when mixed-type data sets evolve over time and we are interested in comparisons among individuals. In this work, we propose a new protocol that integrates robust distances and visualization techniques for dynamic mixed data. In particular, given a time t ∈ T = {1, 2, …, N}, we start by measuring the proximity of n individuals in heterogeneous data by means of a robustified version of Gower’s metric (proposed by the authors in a previous work), yielding a collection of distance matrices {D(t), t ∈ T}. To monitor the evolution of distances and outlier detection over time, we propose several graphical tools: first, we track the evolution of pairwise distances via line graphs; second, a dynamic box plot is obtained to identify individuals that showed minimum or maximum disparities; third, to visualize individuals that are systematically far from the others and to detect potential outliers, we use proximity plots, which are line graphs based on a proximity function computed on {D(t), t ∈ T}; fourth, the evolution of the inter-distances between individuals is analyzed via dynamic multiple multidimensional scaling maps. These visualization tools were implemented in a Shiny application in R, and the methodology is illustrated on a real data set related to COVID-19 healthcare, policy, and restriction measures across EU Member States during the 2020–2021 pandemic.
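For reference, the plain (non-robust) Gower distance for mixed data can be sketched as below; the authors' robustified version modifies this baseline, and one such matrix D(t) would be computed per time point t.

```python
import numpy as np

def gower_matrix(df, numeric_cols, categorical_cols):
    """Baseline Gower distance: range-normalized absolute differences for numeric
    features, simple mismatches for categorical features, averaged over all features."""
    n, p = len(df), len(numeric_cols) + len(categorical_cols)
    D = np.zeros((n, n))
    for col in numeric_cols:
        x = df[col].to_numpy(dtype=float)
        rng = x.max() - x.min()
        if rng > 0:
            D += np.abs(x[:, None] - x[None, :]) / rng
    for col in categorical_cols:
        x = df[col].to_numpy()
        D += (x[:, None] != x[None, :]).astype(float)
    return D / p
```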

18.
Automatic speech recognition (ASR) for children is a rapidly evolving field, as children become more accustomed to interacting with virtual assistants such as Amazon Echo, Cortana, and other smart speakers, and it has advanced human–computer interaction in recent generations. Furthermore, non-native children are observed to exhibit a diverse range of reading errors during second language (L2) acquisition, such as lexical disfluency, hesitations, intra-word switching, and word repetitions, which are not yet well addressed, so ASR systems struggle to recognize non-native children’s speech. The main objective of this study is to develop a non-native children’s speech recognition system on top of feature-space discriminative models, such as feature-space maximum mutual information (fMMI) and boosted feature-space maximum mutual information (fbMMI). Harnessing speed perturbation-based data augmentation on the original children’s speech corpora yields effective performance. The corpus covers different speaking styles of children, including read speech and spontaneous speech, in order to investigate the impact of non-native children’s L2 speaking proficiency on speech recognition systems. The experiments revealed that feature-space MMI models with steadily increasing speed perturbation factors outperform traditional ASR baseline models.
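A dependency-free sketch of speed perturbation by waveform resampling (factors 0.9/1.0/1.1 are values typically used; real pipelines rely on dedicated audio tooling, and the fMMI training itself is not shown):

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample a 1-D waveform so it plays `factor` times faster
    (0.9 slows it down, 1.1 speeds it up); pitch shifts along with speed."""
    n_out = int(round(len(wave) / factor))
    new_t = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_t, np.arange(len(wave)), wave)

def augment(wave, factors=(0.9, 1.0, 1.1)):
    """Speed-perturbed copies typically added to the training corpus."""
    return {f: speed_perturb(wave, f) for f in factors}
```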

19.
20.
Psychotherapy involves the modification of a client’s worldview to reduce distress and enhance well-being. We take a human dynamical systems approach to modeling this process, using Reflexively Autocatalytic foodset-derived (RAF) networks. RAFs have been used to model the self-organization of adaptive networks associated with the origin and early evolution of biological life, as well as the evolution and development of the kind of cognitive structure necessary for cultural evolution. The RAF approach is applicable in these seemingly disparate cases because it provides a theoretical framework for formally describing under what conditions systems composed of elements that interact and ‘catalyze’ the formation of new elements collectively become integrated wholes. In our application, the elements are mental representations, and the whole is a conceptual network. The initial components—referred to as foodset items—are mental representations that are innate, or were acquired through social learning or individual learning (of pre-existing information). The new elements—referred to as foodset-derived items—are mental representations that result from creative thought (resulting in new information). In clinical psychology, a client’s distress may be due to, or exacerbated by, one or more beliefs that diminish self-esteem. Such beliefs may be formed and sustained through distorted thinking and the tendency to interpret ambiguous events as confirmation of these beliefs. We view psychotherapy as a creative, collaborative process between therapist and client, in which the output is not an artwork or invention but a more well-adapted worldview and approach to life on the part of the client. In this paper, we model a hypothetical albeit representative example of the formation and dissolution of such beliefs over the course of a therapist–client interaction using RAF networks. We show how the therapist is able to elicit this worldview from the client and create a conceptualization of the client’s concerns. We then formally demonstrate four distinct ways in which the therapist is able to facilitate change in the client’s worldview: (1) challenging the client’s negative interpretations of events, (2) providing direct evidence that runs contrary to and counteracts the client’s distressing beliefs, (3) using self-disclosure to provide examples of strategies one can use to defuse a negative conclusion, and (4) reinforcing the client’s attempts to assimilate such strategies into their own ways of thinking. We then discuss the implications of such an approach for expanding our knowledge of the development of mental health concerns and the trajectory of therapeutic change.
