Department of Computer Science

Thursday, October 21, 2004
B17 Upson Hall

Raymond Mooney
University of Texas at Austin

Learning to Extract Proteins and their Interactions from Medline Abstracts

Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data on human genes from the 11 million abstracts in Medline. We have developed and evaluated a variety of learned information-extraction systems for identifying human proteins and their interactions in Medline abstracts. We demonstrate that machine-learning approaches using support-vector machines, maximum entropy, and conditional random fields are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule-induction methods are able to identify protein interactions more accurately than manually developed rules. I will also discuss our recent results on collectively extracting all protein names in an abstract using Relational Markov Networks that utilize specific relations between possible protein references.

Joint work with Razvan Bunescu, Edward Marcotte, Ruifang Ge, Rohit Kate, Yuk-Wah Wong, and Arun Ramani.


Bio Sketch:

Raymond J. Mooney is a Professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana-Champaign. He is an author of over 100 published papers in artificial intelligence, primarily in the area of machine learning, a former co-chair of the International Conference on Machine Learning, and a former editor of the Machine Learning journal. His recent research has focused on text mining, learning for natural-language processing, information extraction, text categorization and clustering, recommender systems, relational learning and inductive logic programming, and semi-supervised learning. Additional information is available on the web at