Information Genealogy

Project Goals

In many areas of life, we now have almost complete electronic archives reaching back well over a decade. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people’s personal email. However, we have only limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in these archives, we still lack methods for understanding a corpus as a whole. In particular, these archives have grown through an "evolutionary" process, where new documents are influenced by the content of already existing documents and where ideas are iteratively refined. While this dependency structure is important for a high-level understanding of a collection and for improved retrieval (e.g. PageRank), little is explicitly represented or known about how documents and their authors influence one another.

This project addresses the task of automatically detecting the influence structure and flow of ideas in document corpora that have grown over time (e.g. scientific literature, political debates, news, email, wikis, blogs). We call this the problem of Information Genealogy, where we trace the origin and development of ideas over time. A key difference from most prior work is that we will not require the existence of a formal citation network, but will rely primarily on the content of the documents.
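To make the idea of content-based influence detection concrete, the following is a minimal illustrative sketch, not the method used in the publications below. It assumes each document is a (timestamp, text) pair and uses plain bag-of-words cosine similarity as a stand-in for a real influence model; the function names `cosine` and `likely_influences` are hypothetical. For each document, it ranks strictly earlier documents by textual similarity, so no citation links are needed.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def likely_influences(docs, top_k=1):
    """Rank earlier documents by content similarity.

    docs: list of (timestamp, text) pairs (an assumed input format).
    Returns a dict mapping each document index to a list of
    (earlier_index, score) pairs, best first, with at most top_k entries.
    """
    vecs = [Counter(text.lower().split()) for _, text in docs]
    result = {}
    for i, (t_i, _) in enumerate(docs):
        # Only documents that strictly precede document i can have influenced it.
        earlier = [j for j, (t_j, _) in enumerate(docs) if t_j < t_i]
        scored = sorted(((cosine(vecs[i], vecs[j]), j) for j in earlier), reverse=True)
        result[i] = [(j, s) for s, j in scored[:top_k] if s > 0]
    return result


if __name__ == "__main__":
    corpus = [
        (2001, "kernel methods for text classification"),
        (2003, "kernel methods for ranking documents"),
        (2002, "bandit algorithms for online learning"),
    ]
    for idx, sources in likely_influences(corpus).items():
        print(idx, sources)
```

Raw word overlap cannot distinguish genuine influence from a shared topic or shared function words, so this is best read as a baseline for the problem statement rather than a solution to it.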

Related Publications

[Sipos/etal/12a]

R. Sipos, P. Shivaswamy, T. Joachims, Large-Margin Learning of Submodular Summarization Models. Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[PDF] [BibTeX] [Software]

[Sipos/etal/12b]

R. Sipos, A. Swaminathan, P. Shivaswamy, T. Joachims, Temporal corpus summarization using submodular word coverage. International Conference on Information and Knowledge Management (CIKM), 2012.
[PDF] [BibTeX]

[Xu/etal/10a]

Zhao Xu, K. Kersting, T. Joachims, Fast Active Exploration for Link-Based Preference Learning using Gaussian Processes, Proceedings of the European Conference on Machine Learning (ECML), 2010.
[PDF] [BibTeX]

[Shaparenko/Joachims/09b]

B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, Proceedings of the European Conference on Machine Learning (ECML), 2009.
[PDF] [BibTeX]

[Yue/Joachims/09a]

Yisong Yue, T. Joachims, Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, Proceedings of the International Conference on Machine Learning (ICML), 2009.
[PDF] [BibTeX]

[Yue/etal/09a]

Yisong Yue, J. Broder, R. Kleinberg, T. Joachims, The K-armed Dueling Bandits Problem, Proceedings of the Conference on Learning Theory (COLT), 2009.
[PDF] [BibTeX]

[Shaparenko/Joachims/09a]

B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, poster abstract, Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2009.
[PDF] [BibTeX]

[Shaparenko/Joachims/07a]

B. Shaparenko, T. Joachims, Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2007.
[PDF] [BibTeX]

[Pohl/etal/07a]

S. Pohl, F. Radlinski, T. Joachims, Recommending Related Papers Based on Digital Library Access Records, Proceedings of the Joint Conference on Digital Libraries (JCDL), 2007.
[PDF] [BibTeX]

[Shaparenko/etal/05a]

B. Shaparenko, R. Caruana, J. Gehrke, and T. Joachims, Identifying Temporal Patterns and Key Players in Document Collections. Proceedings of the IEEE ICDM Workshop on Temporal Data Mining: Algorithms, Theory and Applications (TDM-05), pp. 165–174, 2005.
[PDF] [BibTeX]
