Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code
Authors: Kenneth Ward Church, Jonathan Isaac Helfman
Institution: AT&T Bell Laboratories, Murray Hill, NJ 07974-2070, USA
Abstract:

An interactive program, dotplot, has been developed for browsing millions of lines of text and source code, using an approach borrowed from biology for studying homology (self-similarity) in DNA sequences. With conventional browsing tools such as a screen editor, it is difficult to identify structures that are too big to fit on the screen. In contrast, with dotplots we find that many of these structures show up as diagonals, squares, textures, and other visually recognizable features, as will be illustrated in examples selected from biology and two new application domains, text (AP news, Canadian Hansards) and source code (5ESS®). In an attempt to isolate the mechanisms that produce these features, we have synthesized similar features in dotplots of artificial sequences. We also introduce an approximation that makes the calculation of dotplots practical for use in an interactive browser.
Keywords: Biology, Corpora, Duplication, Scatterplot, Software engineering, String matching
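
The core idea in the abstract can be illustrated with a minimal Python sketch, assuming the simplest exact-match formulation: a dot is placed at (i, j) whenever token i equals token j. The names `dotplot` and `render` are hypothetical and this is not the authors' program; in particular it builds the full quadratic matrix and omits the approximation the abstract mentions for interactive use.

```python
def dotplot(tokens):
    """Return an n x n 0/1 matrix with 1 at (i, j) when tokens[i] == tokens[j].

    Assumption: exact token equality, no weighting or sampling.
    """
    n = len(tokens)
    return [[1 if tokens[i] == tokens[j] else 0 for j in range(n)]
            for i in range(n)]


def render(matrix):
    """Draw the matrix as text: '#' for a match, '.' for a mismatch."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in matrix)


# A repeated phrase produces off-diagonal lines parallel to the main diagonal.
repeated = "a b c a b c".split()
repeated_grid = dotplot(repeated)
print(render(repeated_grid))

print()

# A run of one token produces a filled square.
run = "x x x y".split()
run_grid = dotplot(run)
print(render(run_grid))
```

In the first grid the main diagonal is always filled, since every token matches itself, and the repeated phrase `a b c` shows up as a second diagonal offset by three positions. In the second grid the three consecutive `x` tokens form a solid 3 x 3 square, a small-scale version of the squares described in the abstract.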