An SVM alignment program for protein sequence alignment (14 Nov 2006)
---------------------------------------------------------------------
(Implemented by Chun-Nam Yu using the API functions in the SVM^struct program by Thorsten Joachims)


INSTALLATION
1. after unpacking the tar file, type 'make' under Linux to compile the program with gcc.
2. explanation of individual directories:
   - svm_light: the core svm solver
   - svm_struct: code for adapting svm for structural learning [reference 1]
   - hashtable: some hash table code used in the alignment program (thanks to Christopher Clark for providing the code)
   - examples: contains a small set of 10 protein alignment examples, complete with structural annotations

TRAINING
1. to test training after compilation, you can run 
   ./svm_align_learn -c 10 -e 0.01 --r test.target_annotations --s test.template_annotations test.examples test.model  

   Syntax:
   svm_align_learn [-c REGULARIZATION_PARAM] [-e PRECISION] --r LIST_OF_TARGET_ANNOTATIONS --s LIST_OF_TEMPLATE_ANNOTATIONS LIST_OF_ALIGNMENT_FILES OUTPUT_MODEL

2. explanation for some of the parameters
   -c REGULARIZATION_PARAM: regularization parameter C used in the SVM
   -e PRECISION: precision for the cutting plane approximation algorithm in structural SVM [reference 1] (default 0.01)
   --r LIST_OF_TARGET_ANNOTATIONS: specify the structural annotation files for the target sequences in the alignments (see the section on input file formats below, or refer to the file test.target_annotations for an example)
   --s LIST_OF_TEMPLATE_ANNOTATIONS: specify the structural annotation files for the template structures in the alignments (see section on input file format below; or refer to the file test.template_annotations for an example)
   LIST_OF_ALIGNMENT_FILES: specify the list of alignment files (see the section on input file format below)
   OUTPUT_MODEL: name of the output model file to write to   

3. choice of feature vectors
   There are four feature vectors available: SIMPLE, ANOVA2, TENSOR and WINDOW (default: ANOVA2). To change the feature vector, edit the feature vector section of svm_struct_api.c so that it includes one of the four feature vector files (SIMPLE.c, ANOVA2.c, TENSOR.c, WINDOW.c), then recompile. For details about these four feature vectors, please refer to [reference 4].
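
   The training invocation described above can also be assembled from a script, for example when trying several values of C. The following Python sketch only builds the argument list from the documented syntax. The function name build_learn_command and its keyword arguments are illustrative assumptions, not part of the distribution.

```python
import shlex


def build_learn_command(target_list, template_list, alignment_list, model_out,
                        c=10, eps=0.01, binary="./svm_align_learn"):
    """Return the argv list for one svm_align_learn run.

    The option order follows the Syntax line in the TRAINING section.
    The helper itself is hypothetical; only the flags come from the README.
    """
    return [
        binary,
        "-c", str(c),          # regularization parameter C
        "-e", str(eps),        # precision of the cutting plane approximation
        "--r", target_list,    # list of target annotation files
        "--s", template_list,  # list of template annotation files
        alignment_list,        # list of alignment files
        model_out,             # output model file
    ]


cmd = build_learn_command("test.target_annotations",
                          "test.template_annotations",
                          "test.examples", "test.model")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

   Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.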

CLASSIFICATION 
1. to test the alignment model after training, you can run
   ./svm_align_classify -m 1 --r test.target_annotations --s test.template_annotations test.examples test.model
   OR
   ./svm_align_classify -m 0 --r test.target_annotations --s test.template_annotations test.classification_examples test.model

  Syntax:
  svm_align_classify -m CLASSIFICATION_MODE --r LIST_OF_TARGET_ANNOTATIONS --s LIST_OF_TEMPLATE_ANNOTATIONS LIST_OF_PAIRS MODEL_FILE

2. explanation for some of the parameters
   -m CLASSIFICATION_MODE: there are two classification modes, with different input file formats. In mode 1 (evaluation mode) the test set is specified in the same format as the training set, and the error in terms of Q-loss is computed for each test example. In mode 0 (alignment only mode, which is the default) the correct labels (alignments) are not given, and the goal is to produce an alignment for each target-template sequence pair given its structural annotations. The input file in this mode has the total number of pairs to align on the first line, followed by colon(':')-separated names of target and template. For an example, see the file test.classification_examples.
   --r LIST_OF_TARGET_ANNOTATIONS: Same as in training. See the section TRAINING above.
   --s LIST_OF_TEMPLATE_ANNOTATIONS: Same as in training. See the section TRAINING above.
   LIST_OF_PAIRS: if the classification mode is 1 (evaluation) the input file format is the same as LIST_OF_ALIGNMENT_FILES in training. If the classification mode is 0 (alignment only) the input file consists of a list of pairs of target and template names (see test.classification_examples for an example)
   MODEL_FILE: the model file from training 
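
   As a rough illustration of the mode 0 input described above, the following Python sketch produces the text of a LIST_OF_PAIRS file: the number of pairs on the first line, then one target:template pair per line. Placing one pair per line is an assumption; check the result against test.classification_examples. The function name and the identifiers targetA, templateB, etc. are made up for the example.

```python
def format_pairs_file(pairs):
    """Return the text of a mode 0 pairs file.

    pairs: list of (target_name, template_name) tuples.
    Assumed layout: first line is the pair count, then one
    'target:template' entry per line.
    """
    lines = [str(len(pairs))]
    for target, template in pairs:
        # ':' is the field separator, so it cannot appear inside a name.
        if ":" in target or ":" in template:
            raise ValueError("names must not contain ':'")
        lines.append(f"{target}:{template}")
    return "\n".join(lines) + "\n"


# Hypothetical identifiers, for illustration only.
text = format_pairs_file([("targetA", "templateB"), ("targetC", "templateD")])
print(text, end="")
```

   Each name used here must also appear as the identifier in the corresponding target or template annotation file.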



INPUT FILE FORMATS
1. alignment files
   -- an alignment file is a single line with seven data items, separated by colons (':'). For an example, see the files in examples/alignments/
   1. PDBID or unique identifier for target sequence (must match corresponding identifier in target annotation file)
   2. PDBID or unique identifier for template sequence (must match corresponding identifier in template annotation file) 
   3. starting position of local alignment in the target sequence 
   4. starting position of local alignment in the template sequence
   5. length of local alignment
   6. local alignment sequence of target sequence
   7. local alignment sequence of template sequence
2. target annotation files
   -- first line contains the PDBID or unique identifier for the protein
   -- second line is the length of the protein sequence
   -- from the third line onward are the actual structural annotations for the protein sequence, divided into three columns
      --- first column is the amino acid, in single letter code
      --- second column is the predicted secondary structure by the program SABLE [reference 2] (range of values: H, E or C)
      --- third column is the predicted relative exposed surface area by the program SABLE (range of values: 0-9 integers)
3. template annotation files
   -- similar to the format used in target annotation files, only with different data in the columns
      --- first column is the amino acid, in single letter code
      --- second column is the secondary structure designated by DSSP [reference 3] (column 16 in a DSSP output file; replace ' ' with 'X', since DSSP uses a blank ' ' to represent a coil region or missing values)
      --- third column is the exposed surface area computed by DSSP (column 35 in a DSSP output file)
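
   The formats above can be read with a few lines of Python. This is a minimal sketch, not the parser used by the program: it assumes the three annotation columns are separated by whitespace (the README does not state the separator), and all names and sample data below are hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    target_id: str
    template_id: str
    target_start: int
    template_start: int
    length: int
    target_aligned: str
    template_aligned: str


def parse_alignment_line(line):
    """Split one alignment file line into its seven colon-separated items."""
    fields = line.strip().split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return Alignment(fields[0], fields[1], int(fields[2]), int(fields[3]),
                     int(fields[4]), fields[5], fields[6])


def parse_annotation(text):
    """Parse an annotation file: identifier, length, then 3-column rows.

    Assumption: columns are whitespace-separated. Values are kept as strings.
    """
    lines = text.splitlines()
    protein_id = lines[0].strip()
    length = int(lines[1].strip())
    rows = [tuple(l.split()) for l in lines[2:] if l.strip()]
    if len(rows) != length or any(len(r) != 3 for r in rows):
        raise ValueError("annotation rows do not match the declared length")
    return protein_id, rows


def dssp_ss_code(ch):
    """Map a DSSP secondary structure character to the template file code."""
    return "X" if ch == " " else ch


# Made-up sample data, for illustration only.
aln = parse_alignment_line("tgt1:tpl1:5:12:5:ACDEF:ACDEF")
pid, rows = parse_annotation("tgt1\n3\nA H 0\nC E 5\nD C 9\n")
print(aln.target_id, aln.length, pid, rows[0], dssp_ss_code(" "))
```

   A check like the one in parse_annotation is a cheap way to catch a truncated file or a blank DSSP field that was not replaced with 'X' before training.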
	
CONTACT
If you have any suggestions for the program or bugs to report, you can email Chun-Nam Yu at cnyu@cs.cornell.edu. Any feedback is very welcome.


REFERENCES
[1] I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun: Large margin methods for structured and interdependent output variables, Journal of Machine Learning Research (JMLR), 2005 Vol 6, pp1453-1484
[2] R. Adamczak, A. Porollo and J. Meller: Accurate prediction of solvent accessibility using neural networks-based regression, Proteins, 2004 Vol 56, pp753-767
[3] W. Kabsch and C. Sander: Dictionary of protein secondary structure: pattern recognition of hydrogen bond and geometrical features, Biopolymers, 1983 Vol 22, pp2577-2637
[4] C. N. Yu, T. Joachims, R. Elber and J. Pillardy: Support Vector Training of Protein Alignment Models, RECOMB, 2007
