Computational Molecular Biology

1999-2000 CS Annual Report: Interdisciplinary

Computational Molecular Biology & Computer Science

The recent completion of the human genome project underlines the need for new computational and theoretical tools in modern biology. These tools are essential for analyzing, understanding, and manipulating the detailed information about life that we now have at our disposal. Problems in computational molecular biology range from understanding sequence data to analyzing shapes and predicting biological function.

Cornell has a university-wide plan in the science of genomics, and the Department of Computer Science is playing a critical role in this initiative. Researchers in the computer science department are engaged in a wide range of computational biology projects. A few of the ongoing research endeavors are described below.

Sequence analysis

David Shmoys is studying approximation algorithms for genetic mapping. Identifying the locations of markers on the genome (genetic linkage mapping) is a hard computational problem. The algorithms developed significantly reduce the cost of the wet-lab experiments and improve the accuracy of the resulting maps.
Golan Yona, who will join the computer science department in January 2001, is working on clustering the "protein universe." Known protein sequences are classified into families, and new or interesting families are identified for further investigation.
Ron Elber has developed a new system for threading in three dimensions, called LOOPP. Threading matches a protein sequence to a protein shape. The system was trained on tens of millions of data points and can detect highly remote evolutionary relationships between proteins. In a recent publication in Science, LOOPP was used to propose an evolutionary link between the biological mechanism that controls the size of the tomato fruit and the mechanism responsible for the development of cancer.
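To make the idea of threading concrete, here is a minimal sketch of scoring how well a sequence fits a shape. It is not LOOPP's method: the burial-based description of a shape, the `HYDROPHOBIC` residue set, the scoring rule, and the template names are all illustrative assumptions, and real threading systems also allow gaps and use potentials learned from data.

```python
# Toy threading score: how well does a sequence fit a known protein shape?
# Everything here is a hypothetical illustration, not LOOPP's energy function.

# Assumed set of hydrophobic amino acids (one-letter codes).
HYDROPHOBIC = set("AVLIMFWC")

# Each hypothetical template shape is reduced to one flag per site:
# True if the site is buried in the protein core, False if it is exposed.
TEMPLATES = {
    "fold_A": [True, True, False, False, True, False],
    "fold_B": [False, False, True, True, False, True],
}


def threading_score(sequence, burial):
    """Score an ungapped placement of `sequence` on a shape (lower is better).

    A hydrophobic residue on a buried site, or a polar residue on an exposed
    site, is rewarded with -1; any mismatch is penalized with +1.
    """
    if len(sequence) != len(burial):
        raise ValueError("sequence and shape must have the same length")
    score = 0
    for residue, buried in zip(sequence, burial):
        score += -1 if (residue in HYDROPHOBIC) == buried else 1
    return score


def best_fold(sequence, templates):
    """Return the name of the template shape with the lowest score."""
    return min(templates, key=lambda name: threading_score(sequence, templates[name]))


if __name__ == "__main__":
    print(best_fold("LVKDIE", TEMPLATES))
```

The design choice worth noting is that the shape, not the sequence, defines the scoring environment: the same sequence is evaluated against every candidate shape, and the best-scoring shape is taken as the proposed match.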

Studies of protein shapes

Paul Chew, Klara Kedem, Jon Kleinberg, and Dan Huttenlocher develop algorithms for matching and identifying structural similarities in proteins. Ideally, we wish to manipulate three-dimensional objects, such as proteins, with the same ease with which strings (sequences) are studied. The new algorithm, URMS, provides an efficient and accurate measure of protein similarity. It is used to study complete protein chains and to find similar fragments.
Jon Kleinberg studies simple models of threading. He made the intriguing observation that an existing simplified protein model (called the H/P model) can be mapped to the Max Flow problem. The resulting exact algorithms give significantly better results than the heuristic approaches used in earlier studies.
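The following sketch shows one standard way a two-label (H or P) assignment problem can be reduced to Max Flow through a minimum cut. It is an illustration under stated assumptions, not necessarily the construction used in the work described above: the energy function (a reward `alpha` for each contact between two H residues and a penalty `beta` times solvent exposure for each H residue), the contact list, and the exposure values are all hypothetical.

```python
from collections import deque

INF = float("inf")


def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow. Returns (flow, nodes on the source side of a min cut)."""
    res = {u: dict(nb) for u, nb in cap.items()}      # residual capacities
    for u, nb in cap.items():
        for v in nb:
            res.setdefault(v, {}).setdefault(u, 0)    # add reverse edges
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:              # BFS for an augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                           # no path left: flow is maximal
            return flow, set(parent)
        bottleneck, v = INF, t
        while parent[v] is not None:                  # find the path bottleneck
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:                  # push flow along the path
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def energy(labels, contacts, exposure, alpha=-2, beta=1):
    """Assumed energy: alpha per H-H contact plus beta * exposure per H residue."""
    e = sum(beta * exposure[i] for i, c in enumerate(labels) if c == "H")
    return e + sum(alpha for i, j in contacts if labels[i] == "H" and labels[j] == "H")


def design_hp(n, contacts, exposure, alpha=-2, beta=1):
    """Return a minimum-energy H/P labeling for a fixed shape via minimum cut."""
    cap = {"s": {}, "t": {}}
    for i, j in contacts:
        pair = ("pair", i, j)
        cap["s"][pair] = -alpha                       # reward for an H-H contact
        cap[pair] = {("res", i): INF, ("res", j): INF}  # reward needs both to be H
    for i in range(n):
        cap.setdefault(("res", i), {})["t"] = beta * exposure[i]  # cost of H at i
    _, source_side = max_flow_min_cut(cap, "s", "t")
    return "".join("H" if ("res", i) in source_side else "P" for i in range(n))


if __name__ == "__main__":
    contacts = [(0, 3), (0, 5), (2, 5), (1, 4)]       # hypothetical contact pairs
    exposure = [1, 3, 1, 1, 3, 1]                     # hypothetical exposure values
    labels = design_hp(6, contacts, exposure)
    print(labels, energy(labels, contacts, exposure))
```

Residues that end up on the source side of the cut are labeled H. Because the cut is computed exactly, the labeling is a true minimum of the assumed energy rather than the result of a heuristic search.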

Dynamics and function of biological molecules

Ron Elber studies algorithms for simulating the long-time behavior of biological molecules. The new SDE algorithm provides an additional link between studies of structures and studies of function.

Collaborations outside the department

Other key elements of the computational biology initiative include the Computational Genomics Institute and the NIH National Center for Research Resources (NCRR) at the Cornell Theory Center. The NIH NCRR provides an extensive set of software servers that identify protein families from sequences. These tools were developed in part in the Department of Computer Science.

Computational biology is also a crucial part of the recently announced $160M collaboration between Cornell and the Rockefeller and Sloan-Kettering institutes. The computer science department plays an important role in establishing the collaboration and enhancing the intellectual links between the different institutes.

Education

A new cross-college graduate program in Computational Molecular Biology was initiated with the participation of the computer science field. A concentration in Structural Biology was also established for undergraduate students majoring in computer science.