Similar Documents
20 similar documents found (search time: 15 ms)
1.
The complex networks approach for authorship attribution of books (Cited in total: 1; self-citations: 0; citations by others: 1)
Authorship analysis by means of textual features is an important task in linguistic studies. We employ complex network theory to tackle this long-debated problem. In this work, we focus on measurable quantities of the word co-occurrence network of each book for authorship characterization. Based on the network features, an attribution probability is defined for authorship identification. Furthermore, two scaling exponents, the q-parameter and the α-exponent, are combined to classify personal writing style with acceptably high resolving power. The q-parameter, generally known as the nonextensivity measure, is calculated for the degree distribution, and the α-exponent comes from a power-law relationship between the number of links and the number of nodes in the co-occurrence networks constructed for different books written by each author. The applicability of the presented method is evaluated in an experiment with thirty-six books by five Persian litterateurs. Our results show a high accuracy rate in authorship attribution.
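As a concrete illustration of the α-exponent described above, here is a minimal Python sketch (our reading, not the authors' code): it builds word co-occurrence networks over growing prefixes of a tokenized text and fits the power law linking edge count to node count. The window size and prefix-slicing scheme are assumptions.

```python
# A minimal sketch (our reading, not the authors' code) assuming a tokenized
# text of reasonable length. Requires networkx and numpy.
import networkx as nx
import numpy as np

def cooccurrence_network(words, window=2):
    """Link words that co-occur within `window` positions of each other."""
    g = nx.Graph()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            g.add_edge(w, words[j])
    return g

def alpha_exponent(words, n_slices=10):
    """Fit log(links) = alpha * log(nodes) + c over growing text prefixes."""
    nodes, links = [], []
    for frac in np.linspace(0.1, 1.0, n_slices):
        g = cooccurrence_network(words[: int(frac * len(words))])
        nodes.append(g.number_of_nodes())
        links.append(g.number_of_edges())
    alpha, _ = np.polyfit(np.log(nodes), np.log(links), 1)
    return alpha
```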

2.
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts, and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which combines semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.
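The Katz similarity mentioned above has a standard closed form, S = (I − βA)⁻¹ − I; the sketch below computes it for a network adjacency matrix. The choice β = 0.5/λ_max is an assumption that keeps the series convergent, and the paper's fusion with semantic features is not reproduced.

```python
# A hedged sketch of the Katz index S = (I - beta*A)^(-1) - I on an adjacency
# matrix A; beta = 0.5 / lambda_max is an assumed choice inside the
# convergence radius. The paper's combination with semantics is omitted.
import numpy as np

def katz_similarity(adj, beta=None):
    adj = np.asarray(adj, dtype=float)
    lam_max = max(abs(np.linalg.eigvals(adj)))   # spectral radius of A
    if beta is None:
        beta = 0.5 / lam_max
    n = adj.shape[0]
    return np.linalg.inv(np.eye(n) - beta * adj) - np.eye(n)
```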

3.
We present in this paper a numerical investigation of literary texts by various well-known English writers, covering the first half of the twentieth century, based upon the results obtained through corpus analysis of the texts. A fractal power law is obtained for the lexical wealth, defined as the ratio between the number of different words and the total number of words of a given text. By considering as a signature of each author the exponent and the amplitude of the power law, and the standard deviation of the lexical wealth, it is possible to discriminate works of different genres and writers and show that each writer has a very distinct signature, whether considered among other literary writers or compared with writers of non-literary texts. It is also shown that, for a given author, the signature is able to discriminate between short stories and novels.
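A minimal sketch of the lexical wealth statistic as defined above (the type/token ratio); the fixed-window sampling is our assumption about how its mean and standard deviation would be estimated per author.

```python
# A minimal illustration of lexical wealth: distinct words / total words,
# computed over consecutive fixed-size windows (the windowing is our
# assumption about how the statistic would be sampled).
def lexical_wealth(words, window=1000):
    ratios = []
    for start in range(0, len(words) - window + 1, window):
        chunk = words[start:start + window]
        ratios.append(len(set(chunk)) / len(chunk))
    return ratios
```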

4.
We introduce novel Information Theory quantifiers in a computational linguistic study that involves a large corpus of English Renaissance literature. The 185 texts studied (136 plays and 49 poems in total), with first editions that range from 1580 to 1640, form a representative set of the period. Our data set includes 30 texts unquestionably attributed to Shakespeare; we also included A Lover’s Complaint, a poem which generally appears in Shakespeare collected editions but whose authorship is currently in dispute. Our statistical complexity quantifiers combine the power of the Jensen-Shannon divergence with entropy variations computed from the probability distribution of the observed word-use frequencies. Our results show, among other things, that for a given entropy poems display higher complexity than plays, that Shakespeare’s work falls into two distinct clusters in entropy, and that his work is remarkable for its homogeneity and for its closeness to overall means.
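A hedged sketch of one such entropy/complexity pair: the normalized Shannon entropy of the word-frequency distribution together with its Jensen-Shannon divergence from the uniform distribution, whose product is a common statistical complexity quantifier. The paper's exact functional form may differ.

```python
# Normalized Shannon entropy plus Jensen-Shannon divergence to the uniform
# distribution; their product is one common complexity quantifier and may
# differ from the paper's exact form.
import numpy as np
from collections import Counter

def shannon(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def entropy_complexity(words):
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    u = np.full_like(p, 1.0 / len(p))            # uniform reference
    h_norm = shannon(p) / np.log(len(p))         # normalized entropy
    jsd = shannon(0.5 * (p + u)) - 0.5 * (shannon(p) + shannon(u))
    return h_norm, h_norm * jsd
```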

5.
A series of phenomena pertaining to economics, quantum physics, language, literary criticism, and especially architecture is studied from the standpoint of synergetics (the study of self-organizing complex systems). It turns out that a whole series of concrete formulas describing these phenomena is identical across these different situations. This is the case for formulas relating the Bose-Einstein distribution of particles to the distribution of words in a frequency dictionary. This also makes it possible to apply a “quantized” form of the Zipf law to the problem of the authorship of Quiet Flows the Don and to the “blending in” of new architectural structures in an existing environment.
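As background for the above (the "quantized" variant itself is not reproduced here), a minimal sketch of the ordinary Zipf rank-frequency fit f(r) ∝ r^(−s):

```python
# Background only: the ordinary Zipf rank-frequency fit f(r) ~ r^(-s); the
# paper's "quantized" variant is not reproduced here.
import numpy as np
from collections import Counter

def zipf_exponent(words):
    freqs = np.array(sorted(Counter(words).values(), reverse=True), float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope    # slope is negative; return the positive exponent s
```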

6.
Complexity measures are used in a number of applications, including extraction of information from data such as ecological time series, detection of non-random structure in biomedical signals, testing of random number generators, language recognition, and authorship attribution. Different complexity measures proposed in the literature, such as Shannon entropy, relative entropy, Lempel-Ziv, Kolmogorov, and algorithmic complexity, are mostly ineffective in analyzing short sequences that are further corrupted with noise. To address this problem, we propose a new complexity measure, ETC, defined as the “Effort To Compress” the input sequence by a lossless compression algorithm. Here, we employ the lossless compression algorithm known as Non-Sequential Recursive Pair Substitution (NSRPS) and define ETC as the number of iterations needed for NSRPS to transform the input sequence into a constant sequence. We demonstrate the utility of ETC in two applications. ETC is shown to correlate better with the Lyapunov exponent than Shannon entropy does, even for relatively short and noisy time series. The measure also has a greater rate of success in automatic identification and classification of short noisy sequences, compared to entropy and a popular measure based on Lempel-Ziv compression (implemented by Gzip).
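Because ETC is defined operationally in the abstract, it can be sketched directly: repeatedly substitute the most frequent adjacent pair with a fresh symbol (NSRPS) and count the iterations until the sequence is constant. This is an unofficial illustration, not the authors' implementation.

```python
# An unofficial sketch of ETC as the abstract defines it: count NSRPS
# iterations -- replace the most frequent adjacent pair with a fresh symbol --
# until the sequence is constant.
from collections import Counter

def etc(seq):
    seq, steps, fresh = list(seq), 0, 0
    while len(set(seq)) > 1:
        target = Counter(zip(seq, seq[1:])).most_common(1)[0][0]
        new_sym, fresh = f"<{fresh}>", fresh + 1
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == target:
                out.append(new_sym)     # left-to-right, non-overlapping
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, steps = out, steps + 1
    return steps
```

For example, etc("aab") substitutes ('a','a') and then the remaining pair, returning 2.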

7.
Physica A, 2006, 361(2): 405–415
A new approach to describing the correlation properties of complex dynamic systems with long-range memory, based on the concept of additive Markov chains (Phys. Rev. E 68 (2003) 061107), is developed. An equation connecting the memory function and the correlation function of the system under study is presented. This equation allows one to reconstruct the memory function from the correlation function of the system. The effectiveness and robustness of the proposed method are demonstrated with simple model examples. Memory functions of specific coarse-grained literary texts are found, and their universal power-law behavior at long distances is revealed.
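The memory-function equation itself is not given in the abstract; the sketch below only illustrates the input it requires, namely the normalized correlation function C(r) of a coarse-grained text. The vowel/consonant binary coding is our assumption about the coarse-graining.

```python
# Estimate the normalized correlation function C(r) of a binary
# coarse-grained text (vowel = 1, consonant = 0 -- our assumed coding).
import numpy as np

def correlation_function(text, r_max=200):
    bits = np.array([1.0 if c in "aeiou" else 0.0
                     for c in text.lower() if c.isalpha()])
    bits -= bits.mean()
    var = bits.var()
    return [float((bits[:-r] * bits[r:]).mean() / var)
            for r in range(1, r_max + 1)]        # assumes len(bits) > r_max
```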

8.
We demonstrate an accurate procedure based on linear discriminant analysis that allows automatic authorship classification of opinion column articles. First, we extract the following stylometric features from 157 column articles by four authors: statistics on high-frequency words, number of words per sentence, and number of sentences per paragraph. Then, by systematically ranking these features with an effect-size criterion, we show that we can achieve an average classification accuracy of 93% on the test set. In comparison, frequency-based ranking has an average accuracy of 80%. The highest average classification accuracy achievable on our data by chance alone is ∼31%. By carrying out a sensitivity analysis, we show that the effect-size criterion is superior to frequency ranking because there exist low-frequency words that contribute significantly to successful author discrimination. Consistent results are seen when the procedure is applied to classifying the undisputed Federalist papers of Alexander Hamilton and James Madison. To the best of our knowledge, this work is the first attempt at classifying opinion column articles, which, being shorter (compared to novels or short stories), are more prone to over-fitting issues. The near-perfect classification of the longer papers supports this claim. Our results provide an important insight into authorship attribution that has been overlooked in previous studies: ranking discriminant variables by word frequency counts is not necessarily an optimal procedure.
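A schematic two-class rendering of this pipeline with a hypothetical feature matrix; Cohen's d stands in for the paper's effect-size criterion, which may differ.

```python
# Rank stylometric features by effect size (Cohen's d here, as one plausible
# choice), keep the top ones, and fit LDA. Requires scikit-learn.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def cohens_d(x, y):
    pooled = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return abs(x.mean() - y.mean()) / pooled

def rank_and_classify(X, y, n_keep=10):
    """X: (articles x stylometric features), y: binary author labels."""
    d = np.array([cohens_d(X[y == 0, j], X[y == 1, j])
                  for j in range(X.shape[1])])
    top = np.argsort(d)[::-1][:n_keep]          # largest effect sizes first
    return LinearDiscriminantAnalysis().fit(X[:, top], y), top
```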

9.
In the last few decades, text mining has been used to extract knowledge from free texts. Applying neural networks and deep learning to natural language processing (NLP) tasks has led to many accomplishments on real-world language problems over the years. The developments of the last five years have produced techniques that allow the practical application of transfer learning in NLP. The advances in the field have been substantial, and the milestone of outperforming the human baseline on the General Language Understanding Evaluation benchmark has been achieved. This paper implements a targeted literature review to outline, describe, explain, and put into context the crucial techniques that helped achieve this milestone. The research presented here is a targeted review of neural language models that represent vital steps towards a general language representation model.

10.
杨超, 刘大刚, 王学琼, 王小敏, 夏蒙重, 彭凯. Acta Physica Sinica (物理学报), 2012, 61(10): 105204
We theoretically analyze the transport characteristics of neutral particles in a negative hydrogen ion source and the physical process by which negative hydrogen ions (H−) are produced at the surface of the extraction electrode, study the influence of the extraction-aperture transmission rate on hydrogen atom transport, and examine in depth the physics of hydrogen atoms colliding with conductor walls of different properties and being reflected after collision. Based on the CHIPIC software platform, a fully three-dimensional particle-in-cell with Monte Carlo collision simulation algorithm for hydrogen atom transport and negative hydrogen ion production is developed, and it is validated by simulating the JAEA 10A negative hydrogen ion source. After the simulation reaches steady state, the average hydrogen atom energy is about 0.57 eV and the H atoms exhibit a +Y drift; when the non-uniform hydrogen atom beam bombards the extraction wall, the resulting spatial distribution of negative hydrogen ions is non-uniform. These simulation results agree with the literature, verifying the reliability of the algorithm.

11.
A long-term pattern of the Czechoslovak Journal of Physics in terms of size, breakdown by major subfields, multiple authorship, contributors' productivity, and citation structure is presented. Selected features are compared with the primary Czechoslovak mathematical and chemical journals over an 18-year period, with the coverage of Physics Abstracts for 1961, 1965 and 1969, and with 50 physics journals in 20 countries. Additional information is offered on the incidence of multiple authorship with respect to different physics domains, subjects, and publication rates, as provided by an examination of 25,000 physics papers issued in 1969, and on the output of physics publications measured relative to gross national product.

12.
Stability analysis and transition prediction for flow over a blunt cone (Cited in total: 1; self-citations: 1; citations by others: 0)
We study numerical methods for analyzing the stability of supersonic flow over a blunt cone and predicting the transition point. First, the Euler equations are solved for the basic flow field around the blunt cone, and the resulting surface pressure distribution is used as the outer-edge pressure distribution of the viscous boundary layer, providing the initial values of the basic flow field. Then, the stability equations of the blunt-cone boundary layer are solved by an inverse iteration method asymptotically matched to the boundary layer, yielding transition data for the blunt-cone boundary layer. The method improves computational accuracy and saves computing time.

13.
The string-matching paradigm is applied throughout computer science and in science generally. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes; resource usage in algorithm execution can be determined, and for maximum efficiency the goal is to minimize it. Standard measures of algorithm efficiency, such as execution time, depend directly on the number of executed actions. Setting aside power consumption and memory usage, which also depend on the algorithm type and the techniques used in its development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. The efficiency of string-searching algorithms is usually evaluated independently of the domain texts being searched. This paper presents the idea that algorithm efficiency depends on the properties of the searched string and of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through a character comparison count metric: a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain using appropriate domain data (patterns and texts) and ranks algorithms for that domain according to the entropy of the search patterns. The proposed approach is limited to on-line exact string-matching problems based on the information entropy of a search pattern. Thorough empirical testing illustrates the implementation of the methodology and supports its soundness.
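A minimal sketch of the two quantities involved: instrument a naive exact matcher to count character comparisons, and compute the Shannon entropy of the search pattern used for ranking. Production string-matching algorithms would be instrumented in the same way.

```python
# Character-comparison count for a naive exact matcher, plus the Shannon
# entropy of the pattern that the methodology uses to rank algorithms.
import math
from collections import Counter

def naive_search_comparisons(text, pattern):
    comparisons = 0
    for i in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            comparisons += 1                    # one character comparison
            if text[i + j] != pattern[j]:
                break
    return comparisons

def pattern_entropy(pattern):
    n = len(pattern)
    return -sum(c / n * math.log2(c / n) for c in Counter(pattern).values())
```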

14.
Experimental findings concerning liquid-state relaxation processes of electronically excited molecules, coming from time-resolved spectroscopy and travelling-wave dye lasers, offer a formidable basis for restructuring some of the theories dealing with relaxation processes. Comparison of the primary research literature with standard texts shows a severe time lag in the dissemination of essential concepts, while comparison of ‘traditional’ fluorescence with intra-cavity photon emission reveals basic differences involving relativistic effects and vector-space construction, militating against the practice of directly incorporating laser-based data into ordinary chemical dynamics.

15.
Patterns of publication output, multiple authorship and individual productivity are measured for three physics institutions. The findings are compared with earlier results on publication rates in journals and some other research laboratories.

16.
Complex network theory is used to investigate the structure of meaningful concepts in the written texts of individual authors. Networks are constructed after a two-phase filtering, in which words with little semantic content are eliminated and all remaining words are reduced to their canonical form, without any number, gender, or tense inflection. Each sentence in the text is added to the network as a clique. A large number of written texts have been scrutinised, and it is found that texts have small-world as well as scale-free structures. The growth process of these networks has also been investigated, and a universal evolution of network quantifiers has been found across the set of texts written by distinct authors. Further analyses, based on shuffling procedures applied either to the texts or to the constructed networks, provide hints on the role played by the word-frequency and sentence-length distributions in shaping the network structure.
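The clique construction described above is straightforward to sketch; the lower-casing "lemmatizer" below is a stub standing in for the authors' two-phase filtering.

```python
# A hedged sketch: each (pre-filtered) sentence enters the network as a
# clique over its canonical word forms. Requires networkx.
import itertools
import networkx as nx

def text_network(sentences, lemmatize=lambda w: w.lower()):
    """`sentences`: list of word lists, stopwords already removed."""
    g = nx.Graph()
    for sent in sentences:
        canon = set(lemmatize(w) for w in sent)
        g.add_nodes_from(canon)
        g.add_edges_from(itertools.combinations(canon, 2))   # clique
    return g
```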

17.
Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information about the spatial use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. Using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices.
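One plausible reading of such a spatial-entropy detector (not necessarily the paper's exact index): split the text into equal parts and compute each word's Shannon entropy across parts; relevant words cluster spatially and therefore score lower for a given frequency, with the shuffled text providing the calibration baseline.

```python
# Each word's Shannon entropy over its distribution across P equal parts of
# the text; lower entropy (for a given frequency) = more spatially clustered.
import math
from collections import Counter

def spatial_entropies(words, parts=32):
    size = max(1, len(words) // parts)
    counters = [Counter(words[i:i + size])
                for i in range(0, len(words), size)]
    ent = {}
    for w in set(words):
        per_part = [c[w] for c in counters]
        total = sum(per_part)
        ps = [c / total for c in per_part if c]
        ent[w] = -sum(p * math.log(p) for p in ps)
    return ent
```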

18.
The observation of single-monolayer well-size fluctuations in a superlattice is reported. Luminescence experiments show that several free-exciton peaks occur, each of them corresponding, in the same layers, to a discrete well width differing by one monolayer from the next one. This attribution is confirmed by the intentional introduction of larger wells in the structure. Photoluminescence excitation experiments confirm this interpretation.

19.
Some recent criticisms of a proposed dynamical reduction theory are considered and are shown not to be cogent. By considering the visual perception process, it is made plausible that, at least at the perceptive level, the conditions required by the above-mentioned theory for dynamical reduction to occur are satisfied. This does not imply attributing a specific role to the act of conscious perception in the reduction process.

20.
The Variational AutoEncoder (VAE) has made significant progress in text generation, but it has focused on short texts (typically a single sentence). Long texts consist of multiple sentences, and there are particular relationships between the sentences, especially between the latent variables that control their generation. The relationships between these latent variables help in generating continuous and logically connected long texts, yet very few studies address them. We propose a method that combines a Transformer-based Hierarchical Variational AutoEncoder and a Hidden Markov Model (HT-HVAE) to learn multiple hierarchical latent variables and their relationships, improving long-text generation. We use a hierarchical Transformer encoder to encode the long texts in order to obtain better hierarchical information about the long text. HT-HVAE's generation network uses an HMM to learn the relationships between latent variables. We also propose a method for calculating perplexity under the multiple-hierarchical-latent-variable structure. The experimental results show that our model is more effective on datasets with strong logic, alleviates the notorious posterior collapse problem, and generates more continuous and logically connected long texts.
