Information Genealogy

NSF-Project IIS-0812091

Cornell University
Department of Computer Science

Project Goals

In many areas of life, we now have almost complete electronic archives reaching back for well over a decade. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people’s personal email. However, we have only rather limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in these archives, we still lack methods for understanding a corpus as a whole. In particular, these archives have grown through an "evolutionary" process, where new documents are influenced by the content of already existing documents and where ideas are iteratively refined. While this dependency structure is important for a high-level understanding of a collection and for improved retrieval (e.g., PageRank), little is explicitly represented or known about the influence between documents, their authors, and their effects on each other.

This project addresses the task of automatically detecting the influence structure and flow of ideas in document corpora that have grown over time (e.g., scientific literature, political debates, news, email, wikis, blogs). We call this the problem of Information Genealogy, where we trace the origin and development of ideas over time. A key difference from most prior work is that we will not require the existence of a formal citation network, but will rely primarily on the content of the documents.




Related Publications

[Sipos/etal/12a] R. Sipos, P. Shivaswamy, T. Joachims, Large-Margin Learning of Submodular Summarization Models. Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[PDF] [BibTeX] [Software]
[Sipos/etal/12b] R. Sipos, A. Swaminathan, P. Shivaswamy, T. Joachims, Temporal corpus summarization using submodular word coverage. International Conference on Information and Knowledge Management (CIKM), 2012.
[PDF] [BibTeX]
[Xu/etal/10a] Zhao Xu, K. Kersting, T. Joachims, Fast Active Exploration for Link-Based Preference Learning using Gaussian Processes, Proceedings of the European Conference on Machine Learning (ECML), 2010.
[PDF] [BibTeX]
[Shaparenko/Joachims/09b] B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, Proceedings of the European Conference on Machine Learning (ECML), 2009.
[PDF] [BibTeX]
[Yue/Joachims/09a] Yisong Yue, T. Joachims, Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, Proceedings of the International Conference on Machine Learning (ICML), 2009.
[PDF] [BibTeX]
[Yue/etal/09a] Yisong Yue, J. Broder, R. Kleinberg, T. Joachims, The K-armed Dueling Bandits Problem, Proceedings of the Conference on Learning Theory (COLT), 2009.
[PDF] [BibTeX]
[Shaparenko/Joachims/09a] B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, poster abstract, Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2009.
[PDF] [BibTeX]
[Shaparenko/Joachims/07a] B. Shaparenko, T. Joachims, Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2007.
[PDF] [BibTeX]
[Pohl/etal/07a] S. Pohl, F. Radlinski, T. Joachims, Recommending Related Papers Based on Digital Library Access Records, Proceedings of the Joint Conference on Digital Libraries (JCDL), 2007.
[PDF] [BibTeX]
B. Shaparenko, R. Caruana, J. Gehrke, and T. Joachims, Identifying Temporal Patterns and Key Players in Document Collections, Proceedings of the IEEE ICDM Workshop on Temporal Data Mining: Algorithms, Theory and Applications (TDM-05), pp. 165–174, 2005.
[PDF] [BibTeX]

Acknowledgement and Disclaimer

This material is based upon work supported by the National Science Foundation under Award IIS-0812091. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).