A Network Analysis Model for Disambiguation of Names in Lists |
| |
Authors: | Bradley Malin (email author), Edoardo Airoldi, Kathleen M. Carley |
| |
Institution: | (1) Data Privacy Laboratory, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (2) Center for the Computational Analysis of Social and Organizational Systems, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA |
| |
Abstract: | In research and application, social networks are increasingly extracted from relationships inferred by name collocations in
text-based documents. Despite the fact that names represent real entities, names are not unique identifiers and it is often
unclear when two name observations correspond to the same underlying entity. One confounder stems from ambiguity, in which
the same name correctly references multiple entities. Prior name disambiguation methods measured similarity between two names
as a function of their respective documents. In this paper, we propose an alternative similarity metric based on the probability
of walking from one ambiguous name to another in a random walk of the social network constructed from all documents. We experimentally
validate our model on actor-actor relationships derived from the Internet Movie Database. Using a global similarity threshold,
we demonstrate that random walks achieve a significant increase in disambiguation capability in comparison to prior models.
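The random-walk similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the details here are assumptions made for illustration. Specifically, each ambiguous name observation is treated as its own node (tagged with a hypothetical `@docN` suffix), unambiguous names are shared nodes, edge weights count co-occurrences, and similarity is the probability that a walk from one observation reaches the other within `max_steps` steps. The names `build_network`, `walk_probability`, and the `threshold` value are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(documents):
    """Build an undirected weighted graph from lists of co-occurring names.

    Edge weight = number of documents in which the two names appear together.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def walk_probability(graph, source, target, max_steps=3):
    """Probability that a random walk from `source` first reaches `target`
    within `max_steps` steps (assumed similarity measure, see lead-in)."""
    if source == target:
        return 1.0
    current = {source: 1.0}  # probability mass that has not yet hit target
    reached = 0.0
    for _ in range(max_steps):
        following = defaultdict(float)
        for node, mass in current.items():
            neighbours = graph.get(node, {})
            total = sum(neighbours.values())
            if total == 0:
                continue  # isolated node: the walk cannot move on
            for neighbour, weight in neighbours.items():
                step = mass * weight / total
                if neighbour == target:
                    reached += step  # absorb mass once it arrives
                else:
                    following[neighbour] += step
        current = following
    return reached


# Toy data (invented): three observations of the ambiguous name "smith".
documents = [
    ["smith@doc0", "alice", "bob"],
    ["smith@doc1", "alice", "carol"],
    ["smith@doc2", "dave", "erin"],
]
network = build_network(documents)
p_same = walk_probability(network, "smith@doc0", "smith@doc1")
p_diff = walk_probability(network, "smith@doc0", "smith@doc2")

# A single global threshold (value chosen arbitrarily here) decides
# whether two observations are treated as the same entity.
threshold = 0.1
print(p_same >= threshold, p_diff >= threshold)
```

In this toy network the first two observations share a collaborator and so receive a non-zero walk probability, while the third sits in a disconnected component and receives zero. A document-level comparison would see only that the name lists differ, whereas the walk can also use indirect connections through the whole network.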
Bradley A. Malin is a Ph.D. candidate in the School of Computer Science at Carnegie Mellon University. He is an NSF IGERT fellow in the Center
for Computational Analysis of Social and Organizational Systems (CASOS) and a researcher at the Laboratory for International
Data Privacy. His research is interdisciplinary and combines aspects of bioinformatics, data forensics, data privacy and security,
entity resolution, and public policy. He has developed learning algorithms for surveillance in distributed systems and designed
formal models for the evaluation and the improvement of privacy enhancing technologies in real world environments, including
healthcare and the Internet. His research on privacy in genomic databases has received several awards from the American Medical
Informatics Association and has been cited in congressional briefings on health data privacy. He currently serves as managing
editor of the Journal of Privacy Technology.
Edoardo M. Airoldi is a Ph.D. student in the School of Computer Science at Carnegie Mellon University. Currently, he is a researcher in the
CASOS group and at the Center for Automated Learning and Discovery. His methodology is based on probability theory, approximation
theorems, discrete mathematics and their geometries. His research interests include data mining and machine learning techniques
for temporal and relational data, data linkage and data privacy, with important applications to dynamic networks, biological
sequences and large collections of texts. His research on dynamic network tomography is the state-of-the-art for recovering
information about who is communicating to whom in a network, and was awarded honors from the ACM SIG-KDD community. Several
companies focusing on information extraction have adopted his methodology for text analysis. He is currently investigating
practical and theoretical aspects of hierarchical mixture models for temporal and relational data, and an abstract theory
of data linkage.
Kathleen M. Carley is a Professor of Computer Science in ISRI, School of Computer Science at Carnegie Mellon University. She received her Ph.D.
from Harvard in Sociology. Her research combines cognitive science, social and dynamic networks, and computer science (particularly
artificial intelligence and machine learning techniques) to address complex social and organizational problems. Her specific
research areas are computational social and organization science, social adaptation and evolution, social and dynamic network
analysis, and computational text analysis. Her models meld multi-agent technology with network dynamics and empirical data.
Three of the large-scale tools she and the CASOS group have developed are: BioWar, a city-scale model of weaponized biological
attacks and response; Construct, a model of the co-evolution of social and knowledge networks; and ORA, a statistical toolkit
for dynamic social network data. |
| |
Keywords: | disambiguation social networks link analysis random walks clustering |
This article is indexed in SpringerLink and other databases. |
|