首页 | 本学科首页   官方微博 | 高级检索  
     检索      


A Network Analysis Model for Disambiguation of Names in Lists
Authors:Email author" target="_blank">Bradley?MalinEmail author  Edoardo?Airoldi  Kathleen?M?Carley
Institution:(1) Data Privacy Laboratory, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA;(2) Center for the Computational Analysis of Social and Organizational Systems, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract:In research and application, social networks are increasingly extracted from relationships inferred by name collocations in text-based documents. Despite the fact that names represent real entities, names are not unique identifiers and it is often unclear when two name observations correspond to the same underlying entity. One confounder stems from ambiguity, in which the same name correctly references multiple entities. Prior name disambiguation methods measured similarity between two names as a function of their respective documents. In this paper, we propose an alternative similarity metric based on the probability of walking from one ambiguous name to another in a random walk of the social network constructed from all documents. We experimentally validate our model on actor-actor relationships derived from the Internet Movie Database. Using a global similarity threshold, we demonstrate random walks achieve a significant increase in disambiguation capability in comparison to prior models. Bradley A. Malin is a Ph.D. candidate in the School of Computer Science at Carnegie Mellon University. He is an NSF IGERT fellow in the Center for Computational Analysis of Social and Organizational Systems (CASOS) and a researcher at the Laboratory for International Data Privacy. His research is interdisciplinary and combines aspects of bioinformatics, data forensics, data privacy and security, entity resolution, and public policy. He has developed learning algorithms for surveillance in distributed systems and designed formal models for the evaluation and the improvement of privacy enhancing technologies in real world environments, including healthcare and the Internet. His research on privacy in genomic databases has received several awards from the American Medical Informatics Association and has been cited in congressional briefings on health data privacy. He currently serves as managing editor of the Journal of Privacy Technology. Edoardo M. Airoldi is a Ph.D. student in the School of Computer Science at Carnegie Mellon University. Currently, he is a researcher in the CASOS group and at the Center for Automated Learning and Discovery. His methodology is based on probability theory, approximation theorems, discrete mathematics and their geometries. His research interests include data mining and machine learning techniques for temporal and relational data, data linkage and data privacy, with important applications to dynamic networks, biological sequences and large collections of texts. His research on dynamic network tomography is the state-of-the-art for recovering information about who is communicating to whom in a network, and was awarded honors from the ACM SIG-KDD community. Several companies focusing on information extraction have adopted his methodology for text analysis. He is currently investigating practical and theoretical aspects of hierarchical mixture models for temporal and relational data, and an abstract theory of data linkage. Kathleen M. Carley is a Professor of Computer Science in ISRI, School of Computer Science at Carnegie Mellon University. She received her Ph.D. from Harvard in Sociology. Her research combines cognitive science, social and dynamic networks, and computer science (particularly artificial intelligence and machine learning techniques) to address complex social and organizational problems. Her specific research areas are computational social and organization science, social adaptation and evolution, social and dynamic network analysis, and computational text analysis. Her models meld multi-agent technology with network dynamics and empirical data. Three of the large-scale tools she and the CASOS group have developed are: BioWar a city, scale model of weaponized biological attacks and response; Construct a models of the co-evolution of social and knowledge networks; and ORA a statistical toolkit for dynamic social Network data.
Keywords:disambiguation  social networks  link analysis  random walks  clustering
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号