Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code |
| |
Authors: | Kenneth Ward Church, Jonathan Isaac Helfman |
| |
Institution: | AT&T Bell Laboratories, Murray Hill, NJ 07974-2070, USA |
| |
Abstract: | An interactive program, dotplot, has been developed for browsing millions of lines of text and source code, using an approach borrowed from biology for studying homology (self-similarity) in DNA sequences. With conventional browsing tools such as a screen editor, it is difficult to identify structures that are too big to fit on the screen. In contrast, with dotplots we find that many of these structures show up as diagonals, squares, textures, and other visually recognizable features, as will be illustrated in examples selected from biology and two new application domains, text (AP news, Canadian Hansards) and source code (5ESS®). In an attempt to isolate the mechanisms that produce these features, we have synthesized similar features in dotplots of artificial sequences. We also introduce an approximation that makes the calculation of dotplots practical for use in an interactive browser. |
| |
Keywords: | Biology, Corpora, Duplication, Scatterplot, Software engineering, String matching |
|
|
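To make the idea in the abstract concrete, here is a minimal Python sketch of an exact dotplot: the input is split into tokens, and a dot is placed at cell (i, j) whenever token i equals token j. The function names `dotplot` and `render` are hypothetical and chosen only for illustration. This is the naive exact computation, not the approximation the abstract mentions, whose details are not given in this record.

```python
from collections import defaultdict


def dotplot(tokens):
    """Return the set of (i, j) cells where tokens[i] == tokens[j].

    Illustrative sketch only: this is the exact, quadratic-in-repeats
    version, not the approximation mentioned in the abstract.
    """
    # Group the positions of each distinct token.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    # Every pair of positions sharing a token produces a dot.
    dots = set()
    for idxs in positions.values():
        for i in idxs:
            for j in idxs:
                dots.add((i, j))
    return dots


def render(tokens):
    """Render the dotplot as rows of text: '#' for a match, '.' otherwise."""
    dots = dotplot(tokens)
    n = len(tokens)
    return [
        "".join("#" if (i, j) in dots else "." for j in range(n))
        for i in range(n)
    ]


if __name__ == "__main__":
    # A sequence repeated twice yields an off-diagonal line parallel
    # to the main diagonal, the kind of feature the abstract describes.
    for row in render("a b c a b c".split()):
        print(row)
```

In this toy input the repeated run `a b c` shows up as a second diagonal offset by three positions from the main one. For inputs of the size discussed in the abstract, materializing every matching pair this way would likely be too expensive, which is presumably what motivates an approximation.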