CS Colloquium
Tuesday, January 14, 2003
4:15 PM
5130 Upson Hall

Dr. Soumya Raychaudhuri
Stanford University

Towards incorporating free text scientific literature into the analysis of biological data

With the completion of the genome projects, complete lists of genes in many organisms are rapidly becoming available. Additionally, high-throughput technologies permit rapid characterization of many genes simultaneously; methods include gene expression measurement by microarray assay and identification of protein interactions by two-hybrid screens. The current challenge in bioinformatics is to devise methods to interpret the results of such large-scale experimental assays so that the properties and interactions of individual genes can be identified. To do this effectively, computational methods must integrate significant background information. Since all biological discoveries are recorded primarily in the scientific literature, the corpus of biological scientific text contains almost all of the necessary background information. I propose devising computational approaches that automatically access the corpus of scientific literature to analyze large-scale biological data. These methods draw on concepts from machine learning and natural language processing. Specifically, I will focus on examples from my work in gene expression analysis. The methods presented here are relevant to the analysis of any data for which significant textual documentation is available.