Large-Margin Training of Submodular Summarization Methods


Revision: 24

Last updated: 9 Aug 2011





This software package implements a supervised learning approach to training submodular scoring functions for extractive multi-document summarization, based on the structural SVM framework. The resulting large-margin method directly optimizes the ROUGE-1 F score. The method is based on the pairwise sentence model described in [1]. The short name “sfour” stands for four words starting with the letter S: structural, SVM, submodular, summarization.





The summarization method is implemented using the svm-struct framework developed by Thorsten Joachims, available at http://svmlight.joachims.org/svm_struct.html.


The contents of the archive are as follows:

-          binaries/ (precompiled binaries)

-          code/ (source code directory)

-          scripts/ (some shared internal scripts)

-          duc03/ (data directory for DUC ’03 dataset)

-          duc04/ (data directory for DUC ’04 dataset)

-          toyset/ (a small, ready-to-run toy example)


Download is available here: http://www.cs.cornell.edu/~rs/sfour/sfoursrc.tgz.





If you want to compile the source code yourself, download the archive and run make inside the code/ directory. This will produce two files, svm_sfour_learn and svm_sfour_classify.


Alternatively, you can use the provided binaries in the binaries/ directory.



Toy example


The archive includes a very small toy example in the toyset/ directory. This dataset is synthetic and is included only to provide a quick and easy way of running the code for the first time. It contains three documents (each one is actually a paper closely related to this work), each with a single manual summary (i.e. the paper’s abstract). Minimal preprocessing was done to eliminate most of the junk (a side product of format conversion) and retain only sensible words.


In a few easy steps you can train a model and then predict a summary using the provided data.

1.      Copy the binaries (precompiled from binaries/ or your own from running make in code/) into the toyset/exec/ directory.

2.      Train the model by running
    $ ./svm_sfour_learn -c 1 -e 0.01 -w 0 trainidx mdl
inside the toyset/exec/ directory. This uses the training examples listed in trainidx and saves the model as mdl, with the C value set to 1.

3.      Summarize the documents listed in testidx using the previously trained model mdl by running
    $ ./svm_sfour_classify testidx mdl out

4.      Performance is reported at the end of prediction in the line reading "Average loss on test set: 0.xxxx". This is the average loss between the predicted summary and the best possible summary sentences selected greedily using full knowledge.
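
If you want to collect the reported loss automatically (for example when comparing several C values), the line quoted in step 4 can be extracted from captured program output. The following is a minimal Python sketch; the helper name parse_average_loss is hypothetical, and it assumes the value is printed as a plain decimal number.

```python
import re

# Matches the summary line quoted in step 4. That the value is a plain
# decimal number is an assumption of this sketch.
LOSS_LINE = re.compile(r"Average loss on test set:\s*([0-9]*\.?[0-9]+)")


def parse_average_loss(classify_output):
    """Return the reported average loss as a float, or None if the line is absent."""
    match = LOSS_LINE.search(classify_output)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Illustrative text only; real output contains additional lines.
    sample = "some other output\nAverage loss on test set: 0.2500\n"
    print(parse_average_loss(sample))
```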


More details are given in the toyset/HOWTO.txt file.



DUC Datasets


The code was developed to work with the DUC '03 and '04 datasets from http://duc.nist.gov/. We provide scripts for converting them into the input format used by our code.


To use the code with those datasets and run the demo script you have to obtain the following files:

-          duc2003.breakSent.tar.gz (script for breaking articles into sentences, distributed with DUC '03 dataset)

-          duc03.results.data.tar.gz (the DUC '03 dataset)

-          duc04.results.data.tar.gz (the DUC '04 dataset)

-          ROUGE-1.5.5 evaluation scripts from http://berouge.com/

Then place the archives in the subdirectory corresponding to the target dataset, place the ROUGE software in the same directory as the unpacked archive with our code, and run the appropriate script to preprocess the data.



Input data format


The code expects certain data files in specific locations relative to the executable and in a predefined format. To see an example, simply convert one DUC dataset using the provided scripts.


The inputfile passed as an argument to the executable should list the document sets used as training examples or, in the case of prediction, as test examples. It has to contain a list of data files with the path ../data/svm[0-9]+ (one per line). If you're using data with headlines, then the files ../data/hdr_d[0-9]+t (with numbers matching the data files) contain one headline per line.
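
A quick way to catch path mistakes before training is to check every line of the inputfile against the pattern above. The following is a minimal Python sketch; the helper name check_index_lines is hypothetical, and skipping blank lines is an assumption of the sketch rather than documented behavior of the executables.

```python
import re

# Path pattern taken from the description above.
DATA_PATH = re.compile(r"^\.\./data/svm[0-9]+$")


def check_index_lines(lines):
    """Return (line number, text) pairs for lines that do not match the pattern.

    Line numbers are 1-based. Blank lines are skipped, which is an
    assumption of this sketch.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and not DATA_PATH.match(text):
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    example = ["../data/svm1\n", "../data/svm2\n", "data/svm3\n"]
    print(check_index_lines(example))
```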


Data files (../data/svm[0-9]+) have to be in the following format:


<class> 1:<length> 2:<articleNo> 3:<lineNo> <wordID>:<count> …


Each line is one input sentence. Class 0 represents a sentence from the dataset, and classes 1-4 represent sentences from the manual training summaries. The choice of wordID is arbitrary, except that the reserved numbers 0-10 must not be used. The length field should contain the length of the sentence in characters (used for calculating the remaining budget), articleNo is the index (starting from one, in an arbitrary order) of the article in the dataset from which the sentence comes, and lineNo is the line number of the sentence in the source article (each sentence counts as one line).
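
To make the format concrete, the following minimal Python sketch parses one such line into its fields. The names Sentence and parse_sentence_line are hypothetical, the example line is made up, and treating counts as integers is an assumption.

```python
from typing import Dict, NamedTuple


class Sentence(NamedTuple):
    label: int             # 0 = dataset sentence, 1-4 = manual summary sentence
    length: int            # sentence length in characters
    article_no: int        # index of the source article, starting from one
    line_no: int           # line number of the sentence in the source article
    words: Dict[int, int]  # wordID -> count


def parse_sentence_line(line):
    """Parse '<class> 1:<length> 2:<articleNo> 3:<lineNo> <wordID>:<count> ...'."""
    tokens = line.split()
    label = int(tokens[0])
    fields = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        fields[int(key)] = int(value)  # integer counts are an assumption
    # Keys 1-3 carry metadata; everything left over is a word count.
    length = fields.pop(1)
    article_no = fields.pop(2)
    line_no = fields.pop(3)
    return Sentence(label, length, article_no, line_no, fields)


if __name__ == "__main__":
    # Made-up example line, not taken from any dataset.
    print(parse_sentence_line("0 1:57 2:1 3:4 11:2 42:1"))
```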


Furthermore, a few additional files are required:

-          ../data/wmap.str with format "<wordID> <wordString>" providing the mapping between ID and corresponding word strings (one entry per line)

-          ../data/stops containing a list of stop-words (one per line)

-          ../data/CFs.str with format "<wordID> <wordFreq>" listing total word frequencies for the document set (one per line).
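
The two mapping files share the same two-column layout, so they can be read the same way. The following is a minimal Python sketch; the helper names are hypothetical, and it assumes that the columns are separated by whitespace and that frequencies are numeric.

```python
def load_word_map(lines):
    """Read '<wordID> <wordString>' entries into a dict of int -> str."""
    word_map = {}
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            word_map[int(parts[0])] = parts[1].strip()
    return word_map


def load_frequencies(lines):
    """Read '<wordID> <wordFreq>' entries into a dict of int -> float."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            freqs[int(parts[0])] = float(parts[1])
    return freqs


if __name__ == "__main__":
    # Made-up entries for illustration; real files live in ../data/.
    print(load_word_map(["11 summary\n", "12 sentence\n"]))
    print(load_frequencies(["11 7\n", "12 3\n"]))
```

With real data, pass an open file object, for example load_word_map(open("../data/wmap.str")).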



Running the code


The easiest way to immediately try our method is to look at the toy example described previously. Alternatively, you can run the binaries manually on your dataset (which should be located in ../data) using the following syntax:


$ svm_sfour_learn -c <C-value> -e 0.01 -w 0 inputfile modelfile

$ svm_sfour_classify inputfile modelfile outputfile


The outputfile will contain a list of selected sentence numbers (starting from zero) for each entry in inputfile. The argument “-w 0” is required during training because it selects the n-slack algorithm, which works best in this setting. The parameter “-e 0.01” sets the precision of the solution (the suggested value worked well in our experiments). Other options, inherited from the svm-struct package, are explained when the executables are run without any arguments.
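
When trying several C values, it can be convenient to build the two command lines programmatically. The following is a minimal Python sketch that only assembles the argument lists shown above; the function names and the default binary paths are hypothetical, and no flags beyond those documented here are used.

```python
def learn_command(c_value, inputfile, modelfile, binary="./svm_sfour_learn"):
    """Build the training command with the flags documented above."""
    return [binary, "-c", str(c_value), "-e", "0.01", "-w", "0",
            inputfile, modelfile]


def classify_command(inputfile, modelfile, outputfile,
                     binary="./svm_sfour_classify"):
    """Build the prediction command."""
    return [binary, inputfile, modelfile, outputfile]


if __name__ == "__main__":
    # File names follow the toy example; the C values are arbitrary.
    for c in (0.1, 1, 10):
        print(" ".join(learn_command(c, "trainidx", "mdl_c%s" % c)))
    print(" ".join(classify_command("testidx", "mdl_c1", "out")))
```

Each list can be passed to subprocess.run(cmd, check=True) from the directory containing the binaries.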





[1] Large-Margin Training of Submodular Summarization Methods