Linear Optimization Of Protein Potentials (LOOPP)
by Jarek
Meller and Ron Elber
Jerusalem (Hebrew University) and Ithaca (Cornell University)
LOOPP is a program to enhance the design and use of PROTEIN FOLDING potentials. LOOPP can thread sequences through a given set of structures, sequences against sequences and structures against structures.
This page provides on-line help and a tutorial to "soften" the use of LOOPP. This code is relatively new and is still not user friendly. Nevertheless, we believe that it is quite useful already at this stage.
The prime objective of LOOPP is to develop protein folding potentials using Linear Programming. Various models of threading potentials (off-lattice, reduced representation) are implemented. The models include contact and continuous pairwise potentials (of Lennard-Jones type) and solvation-like Threading Onion Models, as introduced by us. What LOOPP does may be summarized as follows:
Novel folding potentials for efficient threading based on optimization with linear constraints. Ron Elber and Jaroslaw Meller Computer Science Dept. Cornell University The protein threading problem is the assignment of a sequence to a family fold, according to an energy value. One of the key difficulties in the threading problem is the use and design of appropriate energy functions, that recognize the correct or similar fold and do it efficiently. The threading problem is known to be NP hard [1], when standard pairwise potentials are used. In this abstract a different kind of potential is used to solve the problem in polynomial time. The proposed computational study relies on local energy functions in the spirit of the work of Eisenberg and his collaborators [2] and on linear optimization of the potential energy parameters [3]. To speed up conformational searches and evaluation of energy function proteins are frequently represented at the residue level (reduced representation). Two widely used classes of reduced folding potentials are based either on pairwise interactions [4-8] or on the local structural environment of the amino acid [2], respectively. We are using local energy functions of our own design. The local energy function has the advantage that the threading (with gap) can be performed in polynomial time, due to the use of dynamic programming. Given a functional form of the potential, the energies of all the decoy structures are required to be larger than the energies of the native conformation. If the potential is linear in its parameters, as is the case in numerous formulations, the parameters can be found efficiently with the LP approach. We trained a set of 594 representative protein sequences and structures [8] using gapless threading. A total of 3*10^7 inequalities are solved. We consider a class of `solvation like' potentials with energy that depends only on the identity of the site residue and the number of its neighbors in the first and second solvation shell. We call it TOM (Threading Onion Model). The model is local, and thus enables efficient threading search, and it is demonstrated to have recognition capacity comparable to that of pairwise potentials. The parameters of the potential can be improved in a systematic way by imposing more constraints on parameter space. Further developments in the functional form may follow. References: 1. Lathrop RH; Protein Eng 1994 Sep;7(9):1059-1068 2. Bowie JU, Luthy R and Eisenberg D; Science 253 (1991):164-170 3. Maiorov MN and Crippen GM; J Mol Biol 227 (1992):876-888 4. Miyazawa S and Jernigan RL; Macromolecules 18 (1985):534-552 5. Hinds DA and Levitt M; Proc Natl Acad Sci 89 (1992):2536-2540 6. Godzik A, Kolinski A and Skolnick J; Proteins 4 (1996):363-366 7. Betancourt MR and Thirumalai D; Protein Science 2 (1999):361-369 8. Tobi D, Shafran G, Linial N and Elber R; to be published |
As indicated in the introduction, our main goal is to develop various folding potentials (or scoring functions if you prefer) using the Linear Programming (LP) approach. These potentials are linear in their parameters and are trained by LP solvers in order to satisfy the requirement that the "energies" of the native structures are lower than alternative structures presented to the sequence.
In order to obtain the LP constraints gapless threading is performed (although explicitly generated decoys can be used as well) and the differences between native structures and the decoys (in terms of the functional model of the potential to be trained) are stored in files to be read by LP solvers. At present we use mostly BPMPD program by Cs. Meszaros to solve the linear inequalities. However, other LP solvers may be used as well. The optimized parameters define threading potential that recognize all the native structures (against all the possible decoys generated by gapless threading) within the training set.
Alternatively, the problem may be infeasible i.e. there are no potential parameters that would make all the native energies lower than those of the decoys (within a given training set). This is clear sign that the functional form of the potential (or the content of the training set) must be adjusted e.g. one should increase the number of parameters.
Such potentials may be used for fold recognition of proteins of unknown structures. The best matching structure is picked up from the training using again gapless threading or using optimal alignments with gaps obtained by Dynamic Programming protocol for TOM-like potentials.
Some of the potentials that we develop aim at "ab initio" folding rather than threading. For example, continuous Lennard-Jones potentials that do not impose predefined cutoff distance are useful for the further structural refinement.
Despite the NP-hardness, pairwise models are used in threading as well - either performing gapless threading or employing the so-called frozen environment approximation. The frozen environment approximation is not available in the code (yet).
Another possibility is to use this program to develop statistical potentials, based on the frequency of contacts in a set of representative structures. LOOPP is meant to provide an alternative to this approach, but if you use it for this purpose we will still be happy. Contact statistics may be generated for most of the potential models (they need to be normalized however).
One may also use this program to perform standard sequence alignments as well as structure vs structure alignments using dynamic programming. Reduced representations of the proteins (in the relevant formats) may be also generated. The definition of amino acid types (an alphabet) is very flexible and ranges from H/P two-letter alphabet to extended alphabets with user defined symbols. LOOPP can read CRD format used also by the molecular dynamics package MOIL. Thus, one can use MOIL to generate decoy structures for LP training of a folding potential.
The general assumption is that protein structures are represented in terms of reduced models with one point per amino acid residue. This may be position of alpha carbons or the center of side chain or whatever you like. Coordinates of point residues of proteins belonging to some data set (training set or representative folds set) are necessary for all the threading functions. Since the allocation of memory is based on the content of the file with residue identities (sequences) this information is also required in most of the functions.
In order to build the program you have to first uncompress and untar the distribution file. Type:
gunzip loopp.tar.gz
tar -xvf loopp.tar
The distribution file contains several subdirectories with the source code, documentation and examples. The source code is included in the subdirectory loopp/source. You may find a simple Makefile to compile it - type (in the source subdirectory):
make
You should adjust Makefile to your local configuration if necessary. When the executable is successfully built it is moved to the subdirectory loopp/exe. To run it type (assuming that you are in loopp directory):
exe/loopp -h
Option -h makes the program printing short help. It is probably too short, so you'd better read the remaining paragraphs of this tutorial.
To start using the program one needs basically three things:
a) file with sequences (with the default name
SEQ)
b) file with coordinates of residues in reduced
representation (XYZ)
c) at least tentative potential e.g. zeros for
all parameters (current.pot)
One may use the code as an interface to get the reduced representation having file with protein coordinates in the CRD (CHARMM and MOIL) format. In order to do so, use option -crd2xyz which generates coordinate and sequence files, given CRD file. There is also a script loopp/examples/red_repr/run_crd2ca-sc.sh which allows to do this automatically for a series of structures.
In the most common situation i.e. when your starting files are in PDB
format you have to rely on
your own interfaces (which I am sure you have). I prefer it this way,
since there are quite many
features in PDB files that one may take care of in a different manner.
Most of the options are read from the standard input, although the descriptions of the scoring function (threading potential) is defined in the file Model. It is used to specify parameters, as type of potential, alphabet, cutoff distance for non-continuous models.
Regarding the command line options,
first of all one may try option -h to get small help. Besides there
are some examples in dir examples
and in the next section of this tutorial. The list of options to be
used in the command line is the
following:
-s name (to specify name
of the sequence file)
-x name (to specify name of the coordinate file) -p name (to specify name of file with potential parameters) -m (to use matrix format for pairwise eij) -i #size (to get info and histogram of size) -e name (to search for plausible structures of seq in file name) -g (to allow gaps while threading with option -e) -d (to generate ineq for explicitly given decoys) -t #val (to specify energy threshold for writing to dcon/mps) -u (to make val a lower, rather than an upper, bound) -b #start #end (to specify range of threading - in terms of seq) -k #col #ncol (to specify format of XYZ and conversion to XYZ) -crd2xyz (to use the program as an interface for conversion) -pdb2xyz (not active - use external interface) -w name (to write differences in contacts between native and decoys) -mps name (to write the same in MPS format) -o name (to write output to file name) -m (to use potential given in matrix format) -rsc #val neg (to rescale potential by val or -val when using neg) -pvdw #p1 #p2 (to specify powers in the Lennard-Jones model) -a #seq1 #seq2 (to perform sequence alignment) -g #ini_gap_pen #gap_pen (to specify gap penalties while using -a) -l (to perform local rather than global alignment) -1let (to use sequences defined using one-lett. codes) -c #str1 #str2 (to perform structure vs structure alignment) -fps name (structural fingerprints - not active) -dbg (to debug the source code) -h (to get small help) |
As for the file Model, it allows to redefine the type of potential, the cutoff distances and the alphabet. If file Model is not present in the current working directory, the default setup i.e. standard pairwise potential eij, with the generic alphabet of 20 letters (the different amino acids) and 210 parameters, will be assumed.
At present the order of parameters in Model is fixed:
type n_par r_cut
ALA 10
...
where type may be eij (standard pairwise model), evdw (continuous pairwise model), ein (threading onion model (TOM) with first solvation shell only - e_i(n) functional form), einm (TOM with first and second solvation shell - e_i(n,m) functional form). After the first three parameters one may define an alphabet, which is different from generic one, giving explicitly symbols of letters to be used (the letters refer here to 3-symbol codes of amino acids).
In the case of TOM models the number of types per letter is also read - for pairwise potentials it is obsolete but at present it has to be given anyway (one may put ones for all letters for instance). If an amino acid symbol is not specified in Model, then a replacement should be given, according to the chemical type of the amino acid. For instance when alanine is not specified, HYD is the expected replacement, otherwise program stops. Similarly, if LYS is not specified, then one of the symbols: POL (for polar res.) or CHG (for charged res.) has to be given.
The list of non-standard symbols is the following (one-letter code in parenthesis, short mnemonic to the right):
GAP (-) - GAP
HYD (B) - hydrophoBic
POL (O) - pOlar
CHG (U) - plUs-minUs (charged - only positively
charged if CHN used as well)
CHN (X) - negative character (negatively charged)
CST (J) - sulphur bridge (Joint CYSs)
HST (Z) - attached ions with some Z e.g. Z=2
for Fe
USR (=) - user defined
The three last symbols (CST, HST, USR) do not imply any special treatment and there is no automatic conversion into it. These symbols are expected to be given explicitly in the sequence definitions of the file SEQ.
The generic alphabet and chemical types of residues are defined in functions included in the file alphabet.c - there is no external file that would allow changing the generic definition. The following residues are treated as hydrophobic ones in first place:
ALA, CYS, HIS, ILE, LEU, MET, PHE, PRO, TRP, TYR, VAL
So, if you want to change the character of TYR residues, you need to give it explicitly in the definition of an actual alphabet in the file Model.
Ex1. In order to use potential e_i(n) with 12 different types per residue
Model should contain:
ein 240 6.4 ALA 12 ARG 12 ASN 12 ASP 12 CYS 12 GLN 12 GLU 12 GLY 12 HIS 12 ILE 12 LEU 12 LYS 12 MET 12 PHE 12 PRO 12 SER 12 THR 12 TRP 12 TYR 12 VAL 12 |
Ex2. In order to use e_vdw(i,j) model in turn, with 5 letter alphabet
(HYD,POL,CHG,CYS,GLY),
Model should contain:
evdw 30 11.0 HYD 1 POL 1 CHG 1 CYS 1 GLY 1 |
The order of symbols in the specification of the alphabet is irrelevant. The ones after residue symbols are ignored, but they need to be given to make the program read the alphabet correctly.
The sequences and structures of proteins are read from files SEQ and XYZ and potential from current.pot, respectively (if not specified otherwise). The default format of XYZ file is (sc stands for side chain coordinate):
name num_of_res
x_ca_i y_ca_i z_ca_i
x_sc_i y_sc_i z_sc_i
...
Other possible formats are:
name num_of_res
x_sc_i y_sc_i z_sc_i
...
and
name num_of_res
x_ca_i y_ca_i z_ca_i x_sc_i
y_sc_i z_sc_i x_cb_i y_cb_i z_cb_i
...
See option -k to change the format and pick up right column (e.g. -k 1 2 means that the first column in the first format - composed of two 3-column blocks - is chosen).
The default format of SEQ file is:
name num_of_res
symbol_i
...
where symbol_i denotes identity of the subsequent residues in terms of an alphabet defined by the user. To change names of the files to be read use options -s and/or -x.
The default format of the potential file is a vector of numbers in the order to be used by the energy function. In the case of pairwise models the upper triangle of an interaction matrix should be put row by row into a potential file.
Ex3. Suppose that we employ two symbols only and our Model file contains:
eij 3 6.4
HYD 1 POL 1 |
Since we specified symbol POL after symbol HYD in the Model, this order
will be assumed throughout
the program. Thus, assuming that the interaction matrix takes the form:
HYD POL HYD -1.0 0.0 POL 0.0 0.0 |
our potential file should contain three numbers, as follows:
-1.0
0.0 0.0 |
One may also use potential file in the matrix form:
-1.0 0.0
0.0 0.0 |
Option -m is required to employ matrix form and it is used only for the pairwise potentials. In case of TOM-like potentials (ein, einm) only vector format is supported. The order of the symbols and the number of parameters per symbol are read in Model.
Ex4. For example, when using e_i(n,m) model with two symbol alphabet,
Model should contain:
einm 30 6.4
HYD 15 POL 15 |
and potential file should contain 30 numbers, first 15 corresponding to hydrophobic residues and then 15 corresponding to polar residues.
There is fixed coarse-graining in the current version of the TOM potential with two solvation shells (einm), which we found sufficient for protein family recognition. The number of neighbors is defined with a cutoff distance specified in Model. Each site has specific number of neighbors, such that it belongs to one of five categories:
1. with very small number of neighbors (up to
two),
2. small number of neighb. (three or four),
3. medium number (five or six),
4. large number (seven or eight),
5. very large number of neighbors (nine and
more).
Then, each neighbor may have different number of its own neighbors, such that it belongs to one of three categories:
1. with small number of neighbors (up to two),
2. medium (between three and six),
3. large number of neighbors (seven and more).
In this way each contact is characterized by two numbers n and m, defining the categories of the two sites involved in the contact. While threading we consider only the identity of site (i) that we are currently occupying. The result is the e_i(n,m) contribution for a contact involving site i of n neighbors with another site (of unspecified identity) of m neighbors.
Thus, the 30 parameters in the potential file may be logically divided in five consecutive blocks with three numbers each. In the potential file we would simply have 30 numbers, separated by at least one space character. Using option -i (info) makes it possible to check whether the order of parameters in the potential file is correct. With this option the program prints the potential that is actually used in a format that is easy to understand.
With -i the program prints some other useful information:
1. contact statistics, the contributions of all
the parameters to the energies of the native structures in
the data set used (whole data
set information),
2. contact statistics, information for particular
proteins (stored in file current.info),
3. histogram for energy differences (between
native structures and decoys) distribution.
To predict the structure of a new sequence the option -e (exam or external) is used. The sequence (or a number of sequences) of an unknown structure is expected to be specified in the file seq_to_examine (if not defined otherwise by -e name). Sequences to be checked are then threaded through structure from the data base (XYZ file) in order to pick up the energetically most favorable alignments. If option -g is used together with -e, then gaps are allowed and dynamic programming is used to get optimal alignments (instead of the whole bunch of gapless alignments without option -g)
In the learning phase, in turn, linear constraints are generated to allow optimizing a new potential by LP solver (not provided in LOOPP!). It is the default mode of operation and does not require special option. The threading is performed for sequences of known structures, comparing native energies with the energies of the decoy structures, to compute the linear inequalities. All the possible gapless alignments are generated. Each alignment generates one inequality.
Another possibility is to put explicitly generated decoys into XYZ and SEQ files and use option -d to generate inequalities with respect to native structure, which is assumed to be the first one in XYZ and SEQ files.
Energy differences between the native and decoy structures are printed to a file xscan.log whereas differences in contacts are put onto a file current.dcon (if -w is used) or current.mps (if -mps is used). It is actually better to talk to LP solvers without the intermediate step in which the MPS format file is generated by loopp and then read by solver. MPS format is outdated and it implies usage of large memory and disk resources, especially when the primal problem is generated. The best strategy is to generate the dual problem in a compressed format understandable for LP solver. Therefore option -w which generates just the coefficients of the constraint matrix in row fashion is more handy.
File xscan.log (use option -o to change its name) is also used to store "energies" of alignments when the options -a or -c are used. Some examples of how to deal with the above described options and modes are given in the next section. There are also ready to use input examples in the directory loopp/examples which should provide further hints.
Certainly, it would be nice to have more advanced system with a graphical interface that would allow to prepare input easier, but this has to wait. Well, science is more interesting than just programming ...
In the subdirectory loopp/examples/templates you may find a number of potentials from the literature or potentials developed by us, as well as some data sets: the Hinds-Levitt set of 246 representative structures and small set of 31 proteins used for independent tests by Tobi and Elber. In the examples described below we will make use of them to demonstrate various features of the program. Notice that data sets are stored only in the templates directory and symbolic links are used to access them from test directories.
Ex6a. Developing new potential
In our first example we will assume that we want to find a standard pairwise potential solving given set of proteins i.e. we want to get 210 parameters, such that the native energies would be always lower than the energies of decoy structures generated by means of the gapless threading of the native sequence through all the structures in the data set.
There are many pairwise potentials, that have been developed by now, and one may use one of them to start with (although it is not necessary - we may just assume that our initial potential is random). We picked up Hinds-Levitt potential, which is based on the frequency of contacts extracted from the data base that we include in templates.
If one performs (gapless) threading loop to compare energies of decoys with those of native structures, one finds (see next example) that there are three structures with energies higher than some decoys, according to Hinds-Levitt potential. Example in the directory learn shows how to use LOOPP to generate inequalities to be solved by an LP program in order to get a potential, that correctly identifies all the proteins in the set.
Assuming that the files XYZ and SEQ are in the current directory, the command (as used in shell script learn/run_ex.sh):
loopp -p hinds-levitt_eij.mat -m -t -w hl.dcon
will produce the file hl.dcon with the coefficients of the constraint matrix. It may be then read by an LP solver providing us new solution in the parameter space or a confirmation of the infeasiblity of the problem. Alternatively, one may use option -mps to generate LP constraints in the MPS format.
Running run_ex.sh in directory learn generates both hl.dcon and
hl.mps files, which are then compared
with their templates from directory output_tmpl. If everything is fine
the file out_diff should be empty
after completion of run_ex.sh (simply type run_ex.sh
to
run it).
Option -m in the above example means that we want to use the potential in a matrix form, whereas -t means that we want to generate only those inequalities that are still not satisfied, given current potential. One may specify threshold for energy differences e.g. -t 10 to get only those with energy differences (with respect to native structures) smaller than 10 (it is not always feasible to generate all the inequalities).
Ex6b. Testing new potential
Let us now assume that we already developed some potentials and we want to test their prediction capacity. Before we do a real test, however, trying to recognize an unknown structure, we may check our potentials with known structures, which were not included in the training set.
In directory test we provide a realistic example of this type, with two potentials developed using Hinds-Levitt data set: one by Hinds and Levitt themselves (employed already in the previous example) and one developed by us using LP protocol and LOOPP. The potential me_r10_eij.pot is defined in terms of a reduced alphabet (consisting of 10 symbols only - see file Model_r10_eij) and it provides a nice demonstration of this additional flexibility of loopp.
Command lines from the script run_ex.sh in this directory:
loopp -p hl_eij.mat -m -t -o hl.log
loopp -p me_r10_eij.pot -t -o me.log
generate two log files (hl.log and me.log) that first of all should
be compared again with templates in
the directory output_tmpl. Their content is worth a closer look. An
example of a line from a log file
is given below:
971 -2.250000 1ptx 1bbt1:66 129 |
First, the number of a decoy (or gapless alignment) is listed, then its energy with respect to native one. In the next two columns we learn that it referred to the alignment of sequence of the protein 1ptx (PDB codes are used for names of the proteins) against structure of the protein 1bbt1, starting from residue 66 and ending at residue 129 in the 1bbt1 structure. Apparently, the other alignments of 1ptx against 1bbt1 (e.g. 65 128) had energies higher than the alignment of the sequence of 1ptx against its own structure (i.e. native energy) and they were not printed as we used -t option with the default threshold equal to zero.
It is interesting to note that although the number of inequalities, which are not satisfied is larger in case of our me_r10_eij.pot potential, the number of proteins that were not recognized is smaller. It is equal to two, whereas in case of Hinds-Levitt potential (which employs full alphabet and 210 parameters instead of 55 parameters of the reduced model) there are three proteins having decoys better than native structures.
Ex6c. Searching for protein families using threading with gaps
In the next example we will employ the potential that we trained, using LP protocol, in order to identify a plausible structure of a protein by the means of the optimal alignments with gaps. We will use the potential in file me_ein_g.pot. Its name (except for the reference to authors) informs us that it is TOM-like model with the first solvation shell only (ein) and with gaps (g). Our set of representative structures will be still that of Hinds and Levitt.
Indeed, inspection of the file Model, used this time tells us that we shall use option ein and an extended alphabet with 21 symbols (the additional symbol is GAP). There are 12 parameters for each residue which correspond to a residue of a given identity having a number of neighbors: from 0 to 11 (giving rise to 12 parameters). Using option -i while performing one of the threading functions one may get the statistics of contacts in terms of the current model. In this case, if we type:
loop -p me_ein_g.pot -i -b 1 1
then we get:
Contact type statistics:
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE 0 0 0 0 0 0 0 0 0 0 710 460 636 876 29 462 838 951 190 166 703 530 515 673 89 435 695 860 199 287 834 512 443 522 147 365 521 842 241 442 924 397 318 321 203 257 323 752 240 590 806 214 176 220 212 156 196 529 172 675 431 99 94 110 142 96 106 237 78 613 168 35 44 48 87 42 40 99 34 344 34 6 13 10 24 5 9 25 17 109 4 2 3 4 4 2 1 8 3 22 1 0 0 0 0 0 0 1 2 3 0 0 0 0 0 0 0 0 0 0 |
Only the first ten amino acids are included here (the standard output contains information about all 21 residues). The number of residues having zero neighbors is set to zero, to enforce proper normalization of the energy, which is assumed to be zero when a residue is in contact with water molecules only (unfolded state is the reference configuration). Having no differences in terms of residues without neighbors, the LP protocol sets the value of the corresponding parameters to zero. In fact, there are many (mainly polar) residues having zero neighbors. Note, that the above contact statistics correlates well with the common hydrophobicity scales.
The same output, generated with option -i includes also information about potential that is used. Thus, we have 240 parameters for standard residues and additional 12 for residue called GAP (the last 12 numbers in the file me_ein_g.pot). The 240 parameters of this e_i(n) potential were trained using LP protocol (and certainly LOOPP) and they are solving the Hinds-Levitt training set i.e. all the native energies are lower than energies of the decoys obtained by gapless alignments against the structures in the set.
The parameters for the GAP residue are fixed arbitrarily, but they seem to be quite reasonable. We do not expect gaps in the hydrophobic core, where the number of neighbors is usually large. On the surface and in loop regions, where the number of neighbors is on average much smaller, penalties for gaps are expected to be smaller. Of course, we may train gap penalties as well, using the same protocol, but you need to wait for better potential until we get it published.
So, how do we use our potential for structure prediction?
Suppose, that we do not know the structure of myoglobin (which is not exactly true as you know, but suppose ...) - all we have is its sequence. File SEQ_exam in the directory exam contains sequence of myoglobin, defined using 3-letter codes (if you have 1-letter codes only see next examples to learn how to use loopp to get it converted into 3-letter codes). All we need is to type (as in the first command of run_ex.sh):
loopp -p me_ein_g.pot -e SEQ_exam -g -o
thread.log
As usual, we specify the potential to be used and we assume that the Model contains correct definition of the potential type and alphabet (which should be the case if you are using file Model from directory exam). We also specify that the name of the log file will be thread.log. The new options are -e and -g.
First one says that we want to perform threading loop for an external sequences, defined in file SEQ_exam, aligning them against structures from the (default) data set XYZ. You may certainly use option -x to specify the name of the data set to be used, but in this case we will used the default name and -x is not necessary. There is only one sequence in the file SEQ_exam (that of myoglobin) and as a result only this one will be threaded through structures of file XYZ (Hinds-Levitt set).
Option -g is used to tell the program that only optimal alignments with gaps should be computed, using dynamic programming. Without this option all the gapless alignments would be generated (see below) and the part of the potential referring to the GAP residue will not be used.
The log file thread.log contains now description of 246 alignments of
the myoglobin sequence against
structures in the Hinds-Levitt data set (the number of them is equal
to 246). Sorting this file
you may notice that there is a couple of good alignments (see file
thread.log_sorted):
1mba 1mba length= 146
score= -21.72 type= global(seq-stru)
1mba 1babB length= 146 score= -12.89 type= global(seq-stru) 1mba 8atcB length= 146 score= -10.77 type= global(seq-stru) 1mba 2hbg length= 147 score= -10.01 type= global(seq-stru) 1mba 1pbxB length= 146 score= -9.71 type= global(seq-stru) 1mba 2pkaB length= 152 score= -9.61 type= global(seq-stru) 1mba 1lh2 length= 153 score= -9.47 type= global(seq-stru) |
The best score (i.e. the most negative, as we actually use "energy" rather than scoring functions) is obtained while aligning sequence of protein 1mba (which means myoglobin in the PDB language) against its own structure. Well, yes - 1mba is included in the Hinds-Levitt set, so our potential had to learn to distinguish it from all the other structures in this set. But now it tells us also what are other good candidates for the structure of 1mba.
And guess what - only 2pka and 8atc among them are not globins - the other structures are either beta units of different hemoglobins (1babB, 1pbxB) or other closely related proteins (leghemoglobin 1lh2 and hemoglobin monomer 2hbg). The same happens if we try to align any of these globins - other globins are found among best alignments. Thus we may hope that even this simple potential has some capacity for protein families recognition.
In the sequence versus structure alignments gaps are allowed not only in the sequences, but also in the structures. The gap penalties for gaps in sequences are taken from the potential file, as GAP is treated as another residue. In fact it means a generalized insertion into a sequence.
As for the gaps in structures, they actually mean deletions in sequence - there is a residue in the sequence, which is aligned against non-existing site of the structure! Such gaps are penalized using the default value of gap penalties equal to 8.0. This is high penalty, given our e_i(n) potential, and gaps in structures tend to be avoided. As a result, alignments against structures shorter than the sequence usually give bad scores.
The alpha units of hemoglobins for instance are shorter than myoglobin and they do not show among the best alignments analyzed above. We may redefine the gap penalties for the structural gaps, using option -g, as specified below (see second command line in run_ex.sh):
loopp
-p me_ein_g.pot -e SEQ_exam -g 1 2 -o g_1_2.log
Now, the price for gaps in the structures
is equal to one in case of prefixes and suffixes or it is equal to
two in the middle of the structure.
The price for gaps in sequences is still specified in the potential file.
As a result sequences may adjust
to slightly shorter structures (allowing deletions, which are by the
way counted while computing the
length of alignments and that is why all of them have length greater
than 145) and now the best alignments
are (see thread-g_1_2.log_sorted):
1mba 1mba length= 146
score= -21.72 type= global(seq-stru)
1mba 1babB length= 146 score= -12.89 type= global(seq-stru) 1mba 2hbg length= 149 score= -11.70 type= global(seq-stru) 1mba 4sdhA length= 147 score= -11.41 type= global(seq-stru) 1mba 8atcB length= 146 score= -10.77 type= global(seq-stru) 1mba 1pbxB length= 147 score= -10.45 type= global(seq-stru) 1mba 1pbxA length= 146 score= -10.20 type= global(seq-stru) 1mba 1babA length= 146 score= -10.15 type= global(seq-stru) 1mba 1lh2 length= 153 score= -10.01 type= global(seq-stru) 1mba 2pkaB length= 152 score= -9.61 type= global(seq-stru) |
Alpha units try to compete now with beta units, but they are still slightly worse. The new structure on the list is 4sdhA, which is alpha unit of a dimeric hemoglobin.
If you want to see which residues are aligned to which sites add option -i to the command line and check thread.log again. Gaps in structures are denoted by zeros, instead of a site number.
We may perform one more test, to appreciate the relative success of the new potential, namely the same search for plausible structures of myoglobin but using just gapless threading. As in the script run_ex.sh we type:
loopp
-p me_ein_g.pot -e SEQ -b 66 66 -o gapless.log
Note, that option -g is missing. Besides we use file SEQ instead of SEQ_exam - we know already that myoglobin sequence is included in our data set, so we may use the standard SEQ file to define it. We need to know its number in the data set, however. Then we may use option -b to specify the range of threading loop, which includes myoglobin only, in this case. Type just:
loop
-p me_ein_g.pot -i -b 1 1
to extract order of proteins in the current data base (-b 1 1 to perform
the loop for only one protein and
to get it quickly). The results
of gapless threadings are presented
below (again, only the best scores):
6616 -21.716677 1mba
1mba:1 146
5930 -15.446918 1mba 1lap:217 362 14679 -15.386775 1mba 2pkaB:3 148 11259 -15.288340 1mba 1tie:14 159 8869 -15.200359 1mba 1pii:72 217 1835 -15.159266 1mba 1colA:17 162 20365 -14.963407 1mba 4rcrM:141 286 7718 -14.904730 1mba 1ofv:23 168 6146 -14.866542 1mba 1ldnA:94 239 7561 -14.838972 1mba 1nsbA:111 256 21344 -14.825983 1mba 5rubA:277 422 9409 -14.775121 1mba 1ppn:54 199 18607 -14.713107 1mba 3rubL:82 227 22181 -14.691473 1mba 6taa:302 447 1959 -14.642129 1mba 1cox:89 234 2585 -14.627654 1mba 1ezm:119 264 10459 -14.570587 1mba 1r093:56 201 12958 -14.547121 1mba 2cts:239 384 21045 -14.537817 1mba 5fbpA:163 308 17421 -14.520823 1mba 3icd:100 245 |
We see that alignment of 1mba sequence against 1mba structure is the
best and the native energy
-21.72 does not change with respect to the alignments score, as expected
(there are no gaps in the
alignment of the sequence against its own structure). The separation
of the globin
family is completely lost however and in fact it is difficult to draw
any conclusion from this kind of test
(except maybe that this potential is not good enough for gapless threading
recognition of protein
families).
One may use LOOPP to perform standard sequence alignments, using dynamic programming for that purpose. The types of alignments, which are implemented, include: global, local and overlap matches alignments with the linear gap penalty (i.e. there is no additional cost for opening a gap with respect to an extension of gapped region).
Alignment is performed for a pair of sequences, specified by their numbers in the sequence file to be used. If it is standard LOOPP sequence file, as the one containing sequences of the proteins included in the Hinds-Levitt set, we may use again option -i to learn about their order. Typing:
loopp -p blosum50.pot -b 1 1 -i
we will see that myglobin (1mba) has number 66, whereas leghemoglobin (1lh2) number 59, in the (default) SEQ file. Giving just the names of the sequences to be aligned will not work. Thus, in order to align 1lh2 vs 1mba, using standard blosum50 scoring matrix (with the signs inverted to get "energy") we type:
loopp -p blosum50.pot -a 59 66
obtaining the following result (see the text version of this tutorial
if your browser does funny things
while displaying alignments and other tables):
-SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPKLRD
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP-QNNPELQA VSSRIFTRLNEFVNNAANAG-KMS-AMLSQFAKEHVGFGVGSAQFENVR-SMFPGFVASVAAP
-PAGADAAWTKLFG-LIIDALKAA--GA
|
The second sequence (1mba) is the one on top. The gap penalties that are used take the default value of eight for both: prefix and suffix as well as middle gaps. The score of this alignment is printed to a log file (xscan.log if not specified otherwise by option -o) and it is equal in this case to -108.00.
Changing the value of the prefix/suffix gap penalty to zero, we obtain overlap match:
loopp -p blosum50.pot -a 59 66 -g 0
of the score -125.00, as follows (note that some residue in the beginning
and end of the chain are
truncated):
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPKLRDV
ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP-QNNPELQAH SSRIFTRLNEFVNNAANAG-KMS-AMLSQFAKEHVGFGVGSAQFENVR-SMFPGFVASVAAP-
PAGADAAWTKLFGLIIDALKAAGA
|
In order to obtain local alignment we need to add option -l:
loopp -p blosum50.pot -a 59 66 -l
which produces for us alignment of slightly better score (-132.00),
as we search for the best score in
the whole dynamic programming table. The local alignment is not
significantly shorter than the
global one, which indicates that they are closely related:
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPKLRDV
ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP-QNNPELQAH SSRIFTRLNEFVNNAANAG-KMS-AMLSQFAKEHVGFGVGSAQFENVR-SMFPGFVASVAAP-
PAGADAAWT
|
We may actually decrease the gap penalty, using option -g again:
loopp -p blosum50.pot -a 59 66 -l -g 0 5
Now the penalty for prefix/suffix gaps is zero (even if you set to
something different it will be set to zero
by option -l) and for the remaining gaps it is equal to 5. As a result
we get:
SLSAAEADLAGKSWAPVF-AN--KNANGLDFLVALFEKFPDSANFFADF-KGKSVADIKASPK
ALTESQAALVKSSWEE-FNANIPKHTHRF-FILVL-EIAPAAKDLFS-FLKGTSEVP-QNNPE LRDVSSRIFTRL-NEFVNNAANAG-KMS-AMLSQFAKEHVGFGVGSAQFENVR-SMFPGFVAS
VAAP-PAGADAAWTKLFG-LIIDALK
|
Gaps become more frequent now and the local alignment is longer as a result of decreased penalties (the score is equal now to -159.00). Another thing worth illustrating by means of an example is the use of sequences defined in terms of one-letter codes. One needs option -1let for that:
loopp -p blosum50.pot -s SEQ_align -a -1let
-i -o 1let.log
There are only two short sequences in the file SEQ_align and may print its content here:
2
PAWHEAE
hEAGAWGHEE
We need to specify first how many of them are to be read and then give
the definitions of subsequent
sequences in the subsequent lines (do not break the line in the middle
of the sequence). As may be
seen, the lower case letters are allowed. With the option -i to demonstrate
how the output changes
(see log file xscan.log):
length= 11
score= -1.00 type= global(seq-seq)
seq_1 seq_2 GAP HIS GAP GLU PRO ALA GAP GLY ALA ALA TRP TRP GAP GLY HIS HIS GLU GLU ALA GAP GLU GLU |
The above examples are gathered together in the script shell run_ex.sh. As previously, type its name to run them and check the file out_diff. The log files get names corresponding to the type of alignment, which is used - see directory align.
It is sometimes nice to have a possibility to search the entire data base for sequence similarities. In the present version LOOPP performs only one alignment at a time and therefore I use an external shell script to run a series of alignments. If you type
run_align.sh 59
where 59 is the number of 1lh2 in the Hinds-Levitt set, then you will
get all the alignments of
leghemoglobin sequence against other sequences in the file log_all.
Sorting its content one may notice,
that the good alignments refer again to globins only:
1lh2 1lh2 length=
153 score= -963.00 type= global(seq-seq)
1lh2 1mba length= 154 score= -108.00 type= global(seq-seq) 1lh2 2hbg length= 158 score= -76.00 type= global(seq-seq) 1lh2 1pbxA length= 155 score= -55.00 type= global(seq-seq) 1lh2 1pbxB length= 155 score= -37.00 type= global(seq-seq) 1babB 1lh2 length= 157 score= -29.00 type= global(seq-seq) 1ithA 1lh2 length= 157 score= -27.00 type= global(seq-seq) 1babA 1lh2 length= 155 score= -17.00 type= global(seq-seq) |
Interestingly enough one of them, namely 1ithA, was not found previously with the sequence versus structure alignments. In order to obtain local alignments one need to uncomment an alternatively command line in run_align.sh, which includes -l option.
One may certainly play with the gap penalties. The more advanced systems
(like BLAST) use actually
affine gap cost and this, although still arbitrary, is certainly better.
Hopefully, the next version of loopp
will contain this kind of alignment as well.
Ex6e. Training a potential using explicit decoys
Suppose now that you generated a number of decoy structures performing MD or Monte Carlo simulations. Thus we have the native structure and a collection of perturbed structures and we want to train a potential that would distinguish native structure with respect to all perturbed ones. First of all one needs to translate full atomic representations (if it was used in MD/MC run) into reduced representations. If you used MOIL, then LOOPP (together with the script run_crd2ca-sc.sh or run_ex.sh in the directory red_repr) may help you - otherwise you need to use your own interface.
If the decoy structures are already translated into reduced representation, then we may use option -d to generate LP constraints for them (in conjunction with option -w or -mps) or to scan all the inequalities to check the energies of alignments with respect to the native energy. There is one thing to keep in mind - the native structure is assumed to be the first one in the data base that we read.
Thus, if all the structures are in the file XYZ_1bpt and all the sequences (which are in this case identical) in the file SEQ_1bpt, we may type:
loopp -p evdw_1bpt.pot -s SEQ_1bpt -x XYZ_1bpt
-d -k 1 1
to get the energies of alignments of the native sequence against all the decoys (with respect to the native energy) in the file xscan.log.
To illustrate simultaneously some other features, we use here a continuous
pairwise model of the
potential (option evdw with the default Lennard-Jones powers 12 and
6) and moreover, we use alpha
carbons coordinates in the reduced representation. They are assumed
to be given in the first (and the
only) three column block of the XYZ_1bpt file and therefore we
need to use option -k 1 1 in order to
choose the right format of the coordinate file.
Ex6f. Performing structure versus structure alignments
Under development. There is a working version of structure vs structure alignments. It is implemented for both TOM models, however we provide a tentative potential for the structure comparison only for TOM2 model (with first and second contact shell) - see directory stru-stru. This potential is obtained from the TOM2 potential that we trained for large representative set of structures by averaging with respect to the identities of the sites: e(n,m)= <e_i(n,m)>_i (see file me_einm_av.pot). Option -c is used for structure alignments, but you may need to experiment a little bit with the gap penalties, using option - -g (see batch shells in the directory stru-stru). The current setup seems to work fine in case of global alignments when there is a closely related structure in the set of templates. Somewhat better version should be included in the next release of LOOPP.
This is certainly a code in statu nascendi, so I cannot give any warranty. Some of the dependencies are still not well defined and some of the situations are not tested e.g. onion (shell) models with varying number of parameters per residue type. The list of such unpleasant "features" (that I am aware of) is given in the next section.
The list of prefixed parameters is enclosed below (see loopp.h):
/* prefixed parameters to setup the defaults */ # define NTYPE 20 /* default number of letters in alphabet */ # define JUMP 1 /* step to move along scaffold while threading */ # define DELTA 4 /* chem dist excluding form contact counting */ # define MAX 3000 /* used to avoid infinite loop when junk */ # define MAXS 64 /* to set length of file names and other strings */ # define R_low 1.0 /* default cutoff distances */ # define R_high 6.4 # define RESC_VDW 3.0 /* to scale distances for vdw */ # define epsilon 1e-6 /* Ax > epsilon for LP constraints */ # define GAP_PNLT 8.0 /* default gap penalty for alignments */ |
Arrays used in the program to store data are allocated according to what is found in Model file (see below) and according to the content of SEQ file, which allows to define max_length, max _prot and other parameters used in function alloc_mem. It does not refer however to interfaces (crd2xyz, one2three), which allocate what they need for themselves (with the risk that wrong files to convert may succeed to kill the program). The allowed length of strings (and thus of the file names) is given by MAXS - if it is too short you need to adjust MAXS and recompile.
This is not 100% rule (one unfortunate exception is array seq storing sequences or in other words sites identities), but if one finds a name containing one capital letter, then it should represent an array. There are no multiple index arrays in the code - just vectors.
I use (in most cases) the pointer-offset notation for arrays. It makes reading the code somewhat difficult for people not used to it, but it makes it easier to change it into an incremented pointer later on (once we know we will not try to jump too much and incrementing pointer should more efficient than offset).
Another, related thing, is the lack of protection of the variables when they are used within functions. In many cases they are not supposed to be modified and they should be declared const to avoid accidental overwriting. Right now the code evolves too rapidly to do this, so you should be very careful.
As for the options, they are kept in the following structure (see the
definition file loopp.h):
struct keep_options { int eij,ein,einm,evdw; /* different models of the potential */ int wDc; /* wDc stands for write Difference in contacts vector */ int mps; /* to write Difference in contacts in MPS format */ int noth; /* noth means that there is no threshold for DE */ int up; /* th_ene is a lower bound if(up) */ int n_par; /* number of parameters in the Model */ int gapless; /* to define gapless threading for learn and decoy */ int gaps; /* to allow gaps in threading for exam */ int r_cont; /* to allow reading of existing f_cont file */ int info; /* to get general info with potential used, distributions etc */ int beg,end; /* to define which seq are to be thread */ int mat; /* to enforce matrix format of potential file */ int rep_pow, atr_pow; /* powers for vdw-like potentials */ int col,n_col; /* to choose C_alpha, C_sidech, C_beta 3-col blocks in XYZ */ /* to choose the mode of operation */ int learn; /* to make standard threading loop and get inequalities */ int exam; /* to pick up best plausible stru for an independent seq */ int decoy; /* to get inequalities for explicitly generated decoys */ int crd2xyz; /* to get reduced representation from CRD file */ int pdb2xyz; /* to get reduced representation from PDB file */ int fngrps; /* to use additional structural fingerprints */ int debug; /* to debug the code - get addt prints */ /* to allocate arrays */ int max_length; /* protein size limit (in terms of residues) */ int max_cont; /* number of contacts (per residue) limit */ int max_prot; /* protein number limit (size of the set of scaffolds) */ int max_histo; /* histograms size limit */ int n_type; /* to define how many letters are to be used */ /* for seq-seq, seq-stru and stru-stru alignment */ int align; /* flag to initiate seq-seq alignment */ int strucmp; /* flag to initiate stru-stru alignment */ int loc_a; /* to perform local alignment */ int onelet; /* one letter code for ama */ int gap_adr; /* address of the GAP res in the current alph */ float pref_pen; /* gap penalties: prefix/suffix */ float gap_pen; /* and regular */ float th_ene; /* threshold for energy differences to be printed */ float r_cut; /* contact distance cutoff */ float resc_pot; /* to rescale (e.g. by -1) the potential function */ } opt; |
The options eij, ein, einm, evdw, r_cont are treated in a special way and are specified in the Model file. All the others are to be specified in the command line (see examples). The default values of options are defined in function set_options. The default names of all the files are also specified there or in the function prepare_files. All the names are chosen such that they should be self-explanatory.
Files ContMap, ContMap_vdw_A, ContMap_vdw_B are used to store pointers to neighbors of each site in each structure belonging to a data set. They are recreated, if not found in the current directory. File cont_by_Type is recreated every time and it used to store protein representations in terms of contacts. This precomputed representation is then used to make threading loops efficiently. The issue of efficiency was nevertheless somewhat less important, so there is a lot of room for improvements.
Each option must be given independently, with the hyphen sign preceding it i.e. notation -ih cannot be used as a shortcut for -i -h
Floating numbers are used throughout the program and certainly the accuracy of numerical operations is limited. It may cause some small discrepancies in terms of energies while comparing new results with the templates provided in examples directory. It should concern, however, only the last digit - if you see more than that, something may be wrong.
The option -i (info), which produces more complete output, is not fully implemented and will not generate contact statistics with the continuous pairwise models of the Lennard-Jones type (option evdw).
The option -mps will not work with evdw either. The options -g (for the sequence vs structure alignment with gaps) and -c (for structure vs structure alignment) will not work with pairwise models (eij,evdw). Frozen environment approximation is under development and should be available in the next version.
The options -pdb2xyz and -fps are currently not active.
While rescaling the scoring (or potential function) using option -rsc one cannot give negative value directly, but rather use additional string to indicate that minus sign should be taken e.g. -rsc 1 neg to multiply the scoring function by -1. The same concerns option -t while specifying the selection threshold for the storing of inequalities.
The program may crash without control if one uses option -1let to allow one-letter codes in the definition of the sequences, but in fact the format of the file storing sequences is different from the default. The same concerns option -crd2xyz if there is something wrong with your CRD file.
The alignments are implemented assuming that we deal with an energy rather than a scoring function and will not work properly unless you use option -rsc to define properly rewards and penalties. Alignments assume also that the border between rewards and penalties is zero (as in the standard Smith- Waterman algorithm) and will not work properly with a scoring matrix that is e.g. all positive.
The TOM model with the second solvation shell (option einm) will work properly only when number of parameters per residue is equal to 15. This is a result of prefixed coarse-graining that may be changed only on the source code level at present.
The decision to rebuilt ContMap file (and its continuous counterparts in case of evdw) is based on the existence or not of these files in the current directory. The content is not checked until the threading loop is initiated and there is no information about successful completion of the contact maps during previous run. So, if the ContMap is for some reasons incomplete you may get funny effects. Remove file ContMap to get it rebuilt.
Even more dangerous situation may happen when you change for instance cutoff distance in the Model - program will work using previously generated ContMap and the results, although looking reasonably will be wrong. Remove file ContMap to get it rebuilt whenever you change cutoff distance.
Beware of the content of the file Model - if you use TOM-like definition for an option requiring pairwise model (e.g. sequence alignment with option -a) or pairwise model for something requiring TOM-like model (e.g. structure alignment), then you may get wrong results without warning.
I would be happy to hear from you about bugs and other problems.
Good luck!
Jarek Meller, 25.06.1999
P.S. I hope that you will not forget to ask Ron Elber or me about
proper references if you use this program
or potentials developed
by us. As for now, this electronic release
and the presentation
during International Conference on Optimization
in Computational
Chemistry and Molecular Biology at Princeton
(May, 99) are the
only references. We should get it published
soon (we hope) and
then we will update documentation to include
more appropriate
reference.
Jarek Meller http://www.phys.uni.torun.pl/~meller mailto:meller@phys.uni.torun.pl mailto:meller@cs.cornell.edu Ron Elber mailto:ron@cs.cornell.edu |