-------------------------
README File For Set-Ranking
-------------------------
Karthik Raman

Version 1
10/26/2011

http://www.cs.cornell.edu/~karthik/code/svm-dyn.html


-------------------------
INTRODUCTION
-------------------------

Set-Ranking is a Python program for generating one-level or two-level dynamic rankings, using a greedy algorithm or a nested greedy algorithm, respectively.
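
For readers unfamiliar with greedy ranking, the following is a minimal sketch of the general idea only. It is not taken from SetRankings.py: the function name greedy_ranking, the word-coverage utility, and the toy documents are all assumptions made for illustration, and the objective used by Set-Ranking itself is described in reference [1]. The sketch uses set literals, so it needs Python 2.7 or newer.

```python
def greedy_ranking(docs, k):
    """Pick up to k documents, each time taking the one that adds the most
    not-yet-covered words. The coverage utility is an assumed stand-in,
    not the objective used by Set-Ranking."""
    covered = set()
    ranking = []
    remaining = set(range(len(docs)))
    while remaining and len(ranking) < k:
        # Marginal gain = number of new words this document would add.
        # Ties are broken in favour of the lower document index.
        best = max(sorted(remaining), key=lambda i: len(docs[i] - covered))
        ranking.append(best)
        covered |= docs[best]
        remaining.remove(best)
    return ranking


# Hypothetical toy documents, each given as a set of word ids.
docs = [{1, 2}, {2, 3, 4}, {1, 5}, {3, 4}]
ranking = greedy_ranking(docs, 3)
```

A nested greedy procedure would, roughly speaking, repeat a selection step of this kind within each row of a two-level ranking; see [1] for the actual formulation.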

-------------------------
RUNNING
-------------------------


Set-Ranking requires Python version 2.4 or newer.
Run it with the following command:

python SetRankings.py InputFile PredictionFile

--

You can download the latest version of Python at http://www.python.org/download/

-------------------------
INPUT DATA FORMAT
-------------------------

The input file that Set-Ranking reads is an index file listing the path and filename of every data file.  Each data file should contain all the documents (aka examples) for a single query (aka set of examples).
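
As an illustration, an index file of this kind could be read as sketched below. This is not code from SetRankings.py: the function name read_index is hypothetical, the example file names are made up, and the sketch assumes one path per line with blank lines ignored, which the format description above does not spell out.

```python
def read_index(lines):
    """Return the data file paths found in an index file.

    Assumes one path per line; blank lines are skipped.
    """
    paths = []
    for line in lines:
        path = line.strip()
        if path:
            paths.append(path)
    return paths


# Example with in-memory lines; in practice an open file object can be
# passed instead. The file names here are made up for illustration.
example_index = ["data/query1.txt\n", "\n", "data/query2.txt\n"]
paths = read_index(example_index)
```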

Within a data file, each line contains the information for a single document.  Set-Ranking assumes all documents are represented using word frequency counts obeying the following format:

[label] [word_id]:[doc_word_freq] [word_id]:[doc_word_freq] ...

Labels are represented as binary strings (e.g., '10010'), where each digit corresponds to a subtopic for that query, and 1/0 indicates relevance/non-relevance.  All documents should be relevant to at least one subtopic.  Subtopics are allowed to change from query to query.

In addition to document word frequencies, Set-Ranking also uses title words.  A word is denoted as being in the title by prepending the word entry with 'T'.

Features are represented sparsely.  For each document, only non-zero word frequencies need to be stored in the data file.  For example, a document could be represented as:

01100 T1:0.25 2:0.25 5:0.5

Here, the document is relevant to subtopics 2 & 3 for this particular query.  Word 1 has frequency 0.25, word 2 has frequency 0.25, and word 5 has frequency 0.5.  Notice the 'T' designation in front of the entry for word 1.  This indicates that word 1 is also present in the title of the document.  Set-Ranking ignores the frequency of words in the title, and only considers whether a word appears in the title.
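
A line in this format could be parsed along the following lines. This is a sketch, not the parser used by SetRankings.py: the function name parse_document_line and its return values are hypothetical, and it assumes that the number after a 'T' entry is the word's document frequency, with the 'T' only marking title membership.

```python
def parse_document_line(line):
    """Split one data-file line into subtopics, word frequencies and title words."""
    tokens = line.split()
    label = tokens[0]
    # Subtopics are numbered from 1, left to right, as in the example above.
    subtopics = [i + 1 for i, bit in enumerate(label) if bit == "1"]
    freqs = {}
    title_words = set()
    for token in tokens[1:]:
        word, value = token.split(":")
        if word.startswith("T"):
            # 'T' marks a word that also appears in the title.
            word = word[1:]
            title_words.add(int(word))
        freqs[int(word)] = float(value)
    return subtopics, freqs, title_words


subtopics, freqs, title_words = parse_document_line("01100 T1:0.25 2:0.25 5:0.5")
```

For the example line, subtopics is [2, 3], freqs maps words 1, 2 and 5 to 0.25, 0.25 and 0.5, and title_words contains only word 1.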

Word ids must be consistent across all documents in a single query.  Word ids can change from query to query.

Some sample toy data is provided in the Sample-Data directory.  To run Set-Ranking on it, use 'allQueries.txt' as the input file.


-------------------------
PREDICTIONS FILE
-------------------------
The predictions for each query refer to documents by id, where a document's id is its position (array index) in the data file for that query.
For two-level rankings, all rankings from a row are placed together.




-------------------------
REFERENCES
-------------------------

[1] "Structured Learning of Two-Level Dynamic Rankings",
     by K. Raman, T. Joachims and P. Shivaswamy
     In Proceedings of CIKM, 2011.

**NOTE**
The learning code is separate and can be found in another directory.
