Date Posted: 1/18/2001

By Blaine P. Friedlander Jr.

A weedy, inedible member of the mustard family, related to broccoli and cauliflower, has become the first plant to yield the secrets of its primordial origins. In a computational research effort at Cornell, the plant, Arabidopsis thaliana, was shown to contain genetic evidence of its emergence between 50 million and 200 million years ago. The finding, say the Cornell researchers, will be invaluable to those using Arabidopsis as a genetic model for other plant species, unlocking genes for important traits in agricultural crops like corn, tomatoes and wheat. The researchers report on their discovery in the Dec. 15 edition of the journal Science.

A decade ago, Arabidopsis was widely adopted by plant scientists as an easily manipulated model for other plants because it is simple to grow in the laboratory, has a short life cycle and has a small genome -- only about 140 million base pairs of DNA compared with wheat, which might have as many as 16 billion pairs. This year, the entire DNA sequence of the plant was completed, and for the first time researchers were able to understand the sequence of the 25,000 genes necessary for an organism to function as a flowering plant. Using this genome sequence -- which is in the public domain on the Internet -- the Cornell researchers used computers to sort through the plant's DNA and find its genetic roots. "We can take the entire genome of one plant and look back at it," said Steven D. Tanksley, the Liberty Hyde Bailey professor of plant breeding at Cornell and an author on the paper. "We are going back into genetic time, and we can see what the ancient genome looked like. If we can understand what the ancestral gene content in one plant is, then we can use that to learn the gene content in other plants."

Tanksley and the lead researcher, Todd Vision, a Cornell visiting scientist, explained that many plant genomes contain a lot of empty material between the genes. Tanksley suggested that understanding a genome is like driving along a highway. On the East Coast, you do not have to drive far before you reach another city, while out west, there are long distances between cities. The point of the analogy is that the genes of Arabidopsis are packed close together, so scientists can gather more general genetic information from it in a shorter period of time. Said Tanksley: "Arabidopsis is the East Coast of DNA sequencing."

The researchers used a computer program called BLAST to classify the thousands of genes in Arabidopsis into gene families. BLAST (an acronym for Basic Local Alignment Search Tool) is a sequence similarity program designed to support analysis of nucleotide and protein databases. It was developed at the National Center for Biotechnology Information, part of the National Institutes of Health, in Bethesda, Md. The researchers then used novel algorithms to find large chunks of the chromosomes that were duplicated long ago. In the process of duplication, all the genetic material in a species doubles, creating what is known as a polyploid. The researchers inferred that Arabidopsis was an ancient polyploid because it contained evidence of multiple duplications.
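For readers who want a feel for the two steps described above, here is a minimal Python sketch. It is not the researchers' method: their algorithms are described only as "novel," and everything here (the function names `gene_families` and `duplicated_blocks`, the `max_gap` and `min_pairs` parameters, and the toy gene names and positions) is hypothetical. The sketch assumes the similarity search has already been reduced to a list of gene pairs that matched each other, groups those pairs into families by simple single-linkage clustering, and then looks for runs of family members that sit close together, in order, on two chromosomes.

```python
from collections import defaultdict
from itertools import combinations


def gene_families(hits):
    """Group genes into families: any two genes linked by a hit share a family."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # shorten the path as we walk
            gene = parent[gene]
        return gene

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    return list(families.values())


def duplicated_blocks(families, position, max_gap=5, min_pairs=3):
    """Find runs of related gene pairs that lie near each other on two chromosomes.

    position maps gene -> (chromosome, gene order index). Both thresholds are
    illustrative assumptions, not values from the paper.
    """
    # Collect related gene pairs, keyed by the two chromosomes they sit on.
    pairs = defaultdict(list)
    for family in families:
        for a, b in combinations(sorted(family), 2):
            (chrom_a, idx_a), (chrom_b, idx_b) = sorted([position[a], position[b]])
            pairs[(chrom_a, chrom_b)].append((idx_a, idx_b))

    blocks = []
    for (chrom_a, chrom_b), hits in pairs.items():
        hits.sort()
        chain = [hits[0]]
        for idx_a, idx_b in hits[1:]:
            prev_a, prev_b = chain[-1]
            # Extend the run only if the next pair is close by on both sides.
            if 0 < idx_a - prev_a <= max_gap and 0 < abs(idx_b - prev_b) <= max_gap:
                chain.append((idx_a, idx_b))
            else:
                if len(chain) >= min_pairs:
                    blocks.append((chrom_a, chrom_b, chain))
                chain = [(idx_a, idx_b)]
        if len(chain) >= min_pairs:
            blocks.append((chrom_a, chrom_b, chain))
    return blocks
```

Given three related gene pairs lying side by side on two chromosomes, the sketch reports them as one candidate duplicated block and ignores an isolated pair elsewhere. A real analysis would also have to cope with genes that were lost, rearranged or duplicated in place, which this greedy approach does not attempt.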

Although duplicated chromosomes diverged from one another and became scrambled over the eons, the research team was able to find 103 duplicated chromosome segments that ranged in age from 50 million to 200 million years. "We figured out where gene family members are located and used that information to find the ancient duplicated segments," said Vision, who is a molecular biologist at the Center for Agricultural Bioinformatics (CAB) at Cornell. The CAB is supported by the U.S. Department of Agriculture, Agricultural Research Service, in partnership with the College of Agriculture and Life Sciences and the Theory Center at Cornell.

With help from the dating estimates obtained by paleobotanists, the team was able to look at the duplicated gene sequences and deduce when the genome duplications in Arabidopsis occurred. The team found that a few large duplication events were responsible for the pattern they saw. "Our work was entirely computational, but a lot of other researchers' laboratory work went into it before that," said Vision.

He draws an analogy between finding prehistoric genetic relationships and the development of language. Many words in Romance languages like Spanish, Italian, French and Portuguese are derived from Latin. "We can see the roots of the modern words as being derived from Latin," he said. "In our case, we are finding the genetic roots of the genes before they duplicated and diverged."

The paper, "The Origins of Genomic Duplications in Arabidopsis," was authored by Vision, Tanksley and Daniel G. Brown of the Whitehead Institute at the Massachusetts Institute of Technology. Brown participated in the research while completing his doctoral degree, which he earned from the Department of Computer Science at Cornell last spring. The research was funded by the USDA Agricultural Research Service and grants from the National Science Foundation and the Office of Naval Research.

How the Theory Center's computing resources help to BLAST researchers ahead

Running a massive BLAST search on the Arabidopsis genome was easier for Cornell researcher Todd Vision than it might have been for many other genomics researchers, thanks to the Cornell Theory Center (CTC). The center maintains a special computing resource in Rhodes Hall in conjunction with the U.S. Department of Agriculture's Center for Agricultural Bioinformatics (CAB). The resource, loosely named the 'genomics cluster,' consists of 12 computers, each made up of four 500-MHz Pentium III processors running the Windows 2000 operating system. With software developed for CTC, eight of the machines run as a parallel-processing cluster, effectively a supercomputer. The cluster is primarily used for searches using BLAST (an acronym for Basic Local Alignment Search Tool), a program that searches gene and protein databases for pattern matches, much the same way a text searcher will match words and phrases. BLAST servers are available elsewhere to the worldwide research community through World Wide Web interfaces, but they are not suitable for running a large batch of queries such as the one Vision used to track the genetic history of Arabidopsis.
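To illustrate what running a large batch of queries locally can look like, here is a rough Python sketch. It is an assumption-laden illustration, not a description of the Theory Center's software. It reads protein sequences in FASTA format, deals them out evenly to a number of workers, and builds one command line per chunk. The command uses the `blastp` program from NCBI's present-day BLAST+ package, which postdates this work, and the file and database names shown are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_round_robin(records, n_workers):
    """Deal records out to n_workers lists so each gets a near-equal share."""
    buckets = [[] for _ in range(n_workers)]
    for i, record in enumerate(records):
        buckets[i % n_workers].append(record)
    return buckets


def blast_command(query_path, db_name, out_path, evalue=1e-5):
    """Build (but do not run) a BLAST+ protein search for one chunk of queries.

    Assumes NCBI BLAST+ is installed; the default E-value cutoff is illustrative.
    """
    return [
        "blastp",
        "-query", query_path,    # FASTA file holding this chunk's proteins
        "-db", db_name,          # name of a preformatted protein database
        "-evalue", str(evalue),
        "-outfmt", "6",          # tab-separated output, one hit per line
        "-out", out_path,
    ]
```

Each command list could then be handed to something like Python's `subprocess.run`, one per processor, so that thousands of queries are searched without anyone clicking through a Web form.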

"The BLAST server on the web takes one query and 'blasts' it against the database," Vision explained. "But I needed to run twenty-something thousand proteins. Imagine sitting there and clicking the mouse that many times. Doing it on Theory Center computers allowed us to do it in a batch. Doing it all on local computers allows you more speed, more flexibility and less hands-on processing."

Vision also assembled on a Theory Center computer a special version of the Arabidopsis genome database that would tell him the location on the genome of each protein it found. The processing, he said, took only about half a day on just one of the four-processor Pentiums.

Other computers in the resource are used for databases. One is a server for the CAB Web site known as Demeter's Genomes, which makes extensive databases of plant genomes available to the research community. The genomics cluster was established several years ago with about $400,000 in funding from the USDA and is maintained by an annual USDA grant. David Schneider, Theory Center staff researcher, is principal investigator of the genomics cluster.