Date Posted: 3/18/2005

By Bill Steele (Cornell Chronicle, March 17, 2005)

Not long ago a biologist had to understand Latin and memorize the vast array of Latin names for plants and animals. Today a biologist has to understand computer databases and work with the names and numbers that describe thousands of known genes and proteins and their functions in living things.

Now a computational biologist at Cornell has created a new tool that ties much of this information together. Biozon works something like Google applied to databases of biochemical knowledge. It allows researchers to search published data on DNA sequences, protein sequences, protein structures, protein-protein interactions, cellular pathways and protein families, and provides tools to combine information from all these sources.

"Each biological entity can be viewed in its extended biological context, through its relations to other biological entities," said Golan Yona, Cornell assistant professor of computer science, who is heading the Biozon project. For example, genes can be connected to the proteins for which they code, and to the biological processes in which those proteins participate. A researcher might call up the structures of proteins related to cancer, then identify the genes that code for those proteins. Or, to take a more technical example, search for "3D structures of proteins that are involved in phosphorylation interactions and are part of the Prostaglandin and leukotriene metabolism pathway." The results returned are ranked, using a first-of-a-kind biological ranking system that resembles the methods implemented in Google.
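
To make the idea of relations between entities concrete, here is a minimal sketch of how genes, proteins, pathways and structures might be linked and queried. It is an illustration only, not Biozon's actual data model: the class `EntityGraph`, the relation names (`encodes`, `member_of`, `has_structure`) and the identifiers (`G1`, `P1`, `S1`) are all hypothetical, and the ranking step mentioned above is left out.

```python
from collections import defaultdict


class EntityGraph:
    """Toy typed graph. Nodes are (kind, id) tuples; edges carry a relation name.

    Hypothetical structure for illustration, not Biozon's schema.
    """

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, source, relation, target):
        # Store both directions so a query can start from either end.
        self._edges[(source, relation)].add(target)
        self._edges[(target, "inverse:" + relation)].add(source)

    def neighbors(self, node, relation):
        # Return an empty set when the node has no such relation.
        return self._edges.get((node, relation), set())


def build_example():
    """Build a tiny graph with made-up identifiers."""
    g = EntityGraph()
    g.relate(("gene", "G1"), "encodes", ("protein", "P1"))
    g.relate(("gene", "G2"), "encodes", ("protein", "P2"))
    g.relate(("protein", "P1"), "member_of", ("pathway", "example pathway"))
    g.relate(("protein", "P1"), "has_structure", ("structure", "S1"))
    g.relate(("protein", "P2"), "has_structure", ("structure", "S2"))
    return g


def structures_for_pathway(graph, pathway):
    """Structures of every protein that takes part in the given pathway."""
    structures = set()
    for protein in graph.neighbors(pathway, "inverse:member_of"):
        structures |= graph.neighbors(protein, "has_structure")
    return structures


def genes_for_pathway(graph, pathway):
    """Genes that code for the proteins in the given pathway."""
    genes = set()
    for protein in graph.neighbors(pathway, "inverse:member_of"):
        genes |= graph.neighbors(protein, "inverse:encodes")
    return genes


if __name__ == "__main__":
    graph = build_example()
    target = ("pathway", "example pathway")
    print(structures_for_pathway(graph, target))
    print(genes_for_pathway(graph, target))
```

Each query is a short walk across typed edges: from a pathway back to its proteins, then on to their structures or to the genes that encode them. A complex search of the kind quoted above would chain more such steps together.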

Beyond complex searches, Biozon also supports "fuzzy" searches that will find things similar to the search key. There are tools that allow users to store queries and download the resulting data. In addition, researchers may incorporate their own datasets, allowing them to search and combine their own data and public data without making their proprietary data visible to others.

Biozon also includes analysis tools, such as methods for comparing protein shapes or predicting what common building blocks may appear in a protein based on its amino acid sequence.

The searchable databases include some 38 million nucleic acid sequences, 1.8 million protein sequences, 28,000 protein structures, 73,000 interactions, 2.25 billion sequence alignments and 8 million structural alignments. Sources include PDB, Genbank, Uniprot, KEGG and BIND, among others.

The information is stored in many different formats. The secret of Biozon is its ability to read the many formats and translate the results into a model that allows the datasets to be linked. Creating the infrastructure, the data model, the transformation functions and the search tools has required more than three years of work, Yona said, and it was the result of a team effort, most notably involving programmer/developer Aaron Birkland.
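
The general technique of reading several formats into one linkable model can be sketched briefly. The example below is an assumption-laden illustration, not Biozon's code: the two input formats (a FASTA-style header line and a tab-separated row), the source labels `sequence_db` and `structure_db`, the `Record` type and all accession values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Common shape that every source is translated into (hypothetical)."""
    source: str
    kind: str
    accession: str


def parse_fasta_header(line, source):
    # Assumed input such as ">P1 example protein": the accession is the
    # first token after the leading ">".
    accession = line[1:].split()[0]
    return Record(source, "protein", accession)


def parse_tab_row(line, source):
    # Assumed input such as "S1<TAB>P1": a structure id followed by the
    # accession of the protein it describes.
    structure_id, protein_accession = line.rstrip("\n").split("\t")
    return Record(source, "structure", structure_id), protein_accession


def link(protein_lines, structure_lines):
    """Translate both inputs into Records and pair up those that share a protein."""
    proteins = {}
    for line in protein_lines:
        record = parse_fasta_header(line, "sequence_db")
        proteins[record.accession] = record

    links = []
    for line in structure_lines:
        structure, protein_accession = parse_tab_row(line, "structure_db")
        # Keep only structures whose protein is known from the other source.
        if protein_accession in proteins:
            links.append((proteins[protein_accession], structure))
    return links


if __name__ == "__main__":
    pairs = link([">P1 example protein", ">P2 another protein"],
                 ["S1\tP1", "S9\tP7"])
    for protein, structure in pairs:
        print(protein.accession, "->", structure.accession)
```

Once every source is reduced to the same record shape, linking becomes a lookup on a shared identifier, whatever format the data arrived in.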