Cornell University
Department of Computer Science
In many areas of life, we now have almost complete electronic archives reaching back well over a decade. These include, for example, the body of research papers in computer science, all news articles written in the US, and most people’s personal email. However, we have only limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in these archives, we still lack methods for understanding a corpus as a whole. In particular, these archives have grown through an "evolutionary" process, in which new documents are influenced by the content of existing documents and ideas are iteratively refined. While this dependency structure is important for a high-level understanding of a collection and for improved retrieval (e.g. PageRank), little is explicitly represented or known about the influence between documents, their authors, and their effects on each other.
This project addresses the task of automatically detecting the influence structure and flow of ideas in document corpora that have grown over time (e.g. scientific literature, political debates, news, email, wikis, blogs). We call this the problem of Information Genealogy: tracing the origin and development of ideas over time. A key difference from most prior work is that we will not require the existence of a formal citation network, but will rely primarily on the content of the documents.
[Sipos/etal/12a]
R. Sipos, P. Shivaswamy, T. Joachims, Large-Margin Learning of Submodular Summarization Models, Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[Sipos/etal/12b]
R. Sipos, A. Swaminathan, P. Shivaswamy, T. Joachims, Temporal corpus summarization using submodular word coverage, International Conference on Information and Knowledge Management (CIKM), 2012.
[Xu/etal/10a]
Zhao Xu, K. Kersting, T. Joachims, Fast Active Exploration for Link-Based Preference Learning using Gaussian Processes, Proceedings of the European Conference on Machine Learning (ECML), 2010.
[Shaparenko/Joachims/09b]
B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, Proceedings of the European Conference on Machine Learning (ECML), 2009.
[Yue/Joachims/09a]
Yisong Yue, T. Joachims, Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, Proceedings of the International Conference on Machine Learning (ICML), 2009.
[Yue/etal/09a]
Yisong Yue, J. Broder, R. Kleinberg, T. Joachims, The K-armed Dueling Bandits Problem, Proceedings of the Conference on Learning Theory (COLT), 2009.
[Shaparenko/Joachims/09a]
B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, poster abstract, Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2009.
[Shaparenko/Joachims/07a]
B. Shaparenko, T. Joachims, Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2007.
[Pohl/etal/07a]
S. Pohl, F. Radlinski, T. Joachims, Recommending Related Papers Based on Digital Library Access Records, Proceedings of the Joint Conference on Digital Libraries (JCDL), 2007.
[Shaparenko/etal/05a]
This material is based upon work supported by the National Science Foundation under Award IIS-0812091. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation (NSF).