Assignment #4


Due date: 12/4/04

Go to the Pfam database site and open the entry for the globin family. Download the "seed" alignment file, which contains a multiple alignment similar to the one in the book. Estimate a profile HMM from this alignment and use it to scan, as described below, the UniProt/SWISS-PROT database, which you can download in FASTA format (28MB).
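Scanning the database means visiting every record in the FASTA file. The following is a minimal sketch of a FASTA reader, not a required implementation. It assumes each record starts with a title line beginning with ">" and that the sequence may wrap over several lines; the function name read_fasta is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title_line, sequence) pairs from FASTA-formatted lines."""
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line, []
        elif title is not None and line:
            # Sequence lines are concatenated; blank lines are skipped.
            chunks.append(line.strip())
    # Emit the final record once the input is exhausted.
    if title is not None:
        yield title, "".join(chunks)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so the 28MB file is streamed rather than loaded into memory at once.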

For each sequence whose LLR (log-likelihood ratio) score is greater than 0, find the longest word in its title line that contains the string "globin". That word could be null (no word in the title matches), "Globin", or something else. For example, for GLBB_OLIMA, the word is Hemoglobin:

>GLBB_OLIMA (Q7M419) Hemoglobin, extracellular, major globin chain b
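The word lookup can be sketched as below. This is one possible reading of the rule, under assumptions the assignment does not spell out: words are separated by whitespace, surrounding punctuation is stripped (so "Hemoglobin," becomes "Hemoglobin"), the match ignores case (so "Globin" qualifies), the first of several equally long matches wins, and the null word is represented by the empty string. The function name longest_globin_word is made up for illustration.

```python
import string


def longest_globin_word(title: str) -> str:
    """Return the longest word in a title line containing "globin", or ""."""
    # Assumption: split on whitespace, then strip surrounding punctuation.
    words = (token.strip(string.punctuation) for token in title.split())
    # Assumption: case-insensitive match, so "Globin" and "Hemoglobin" count.
    matches = [word for word in words if "globin" in word.lower()]
    # max() keeps the first of several equally long words; "" is the null word.
    return max(matches, key=len, default="")


example = ">GLBB_OLIMA (Q7M419) Hemoglobin, extracellular, major globin chain b"
print(longest_globin_word(example))  # prints Hemoglobin
```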

Finally, collect these words into a dictionary and, for each word in it (including the null word), compute and output the median score of the sequences that produced that word.
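A minimal sketch of the final step, assuming the scan has already produced one (word, score) pair per sequence that passed the LLR threshold, with the null word stored as the empty string. The function name median_score_by_word and the input layout are illustrative choices, not part of the assignment.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_score_by_word(hits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group scores by word and return the median score for each word."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for word, score in hits:
        scores[word].append(score)
    # statistics.median averages the two middle values for even-sized groups.
    return {word: median(values) for word, values in scores.items()}
```

Called on made-up pairs such as [("Hemoglobin", 10.0), ("Hemoglobin", 30.0), ("", 2.5)], it returns one median per distinct word, with the null word as an ordinary key.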