              Linear Expectation in Parse Forests
                         Mats Rooth
                         July 2002

Unpack the tar file with

   cat linexp.tar.gz | gzip -d | tar -xvf -

The linexp directory contains Java byte code files with the "class"
suffix. The file placed.fpf is a packed parse parse forest obtained
with Helmut Schmid's lopar, using the flags

   -heads -freq -quote -preorder.

The non-recursive context free grammar represented by the file has grammar
symbols consisting of (i) a symbol of the underlying grammar, (ii) a span,
and (iii) a head word. In addition, frequencies for parse forest symbols
and rules are represented; these have been computed with the inside-outside
algorithm.

When you create your own parse forests, it is important for the
correctness of the algorithm that you include the preorder flag when
running lopar; an ordered parse forest grammar is assumed by the
algorithm.

If you have Schmid's lopar-charge, you can look at 
the parse forest with

   cat placed.pf | lopar-charge.

bank.gram is the underlying syntactic grammar; it was derived from
Penn Treebank II by applying a freqency cutoff, and eliminating
certain rules which produce chain derivations.

Governors as defined in Schmid and Rooth (2001) are computed with

  cat placed.pf | java Governor bank.gram.

The underlying grammar is present as a command line argument because
it is used to recover headedness of rules. The parse forest file
includes rule indices, rather than representing headedness directly.

The above command should produce the output

0 they
        0.7631 placed:S:NP-SBJ
        0.20084 placed:S:NP-SBJ-1
        0.029381 placed:S:NP-SBJ-2
        0.0066856 placed:S:NP-SBJ-3
1 placed
2 books
        0.60089 placed:VP:NP
        0.3991 placed:VP:NP-PRD
3 on
        0.40669 books:NP:PP
        0.3991 books:NP-PRD:PP
        0.1942 placed:VP:PP
4 tables
        0.89369 on:PP:NP
        0.10631 on:PP:NP-LGS
5 .
        1.0000066 placed:S:-PER-.

On the left are words and string positions.  The indented lines after each
word give governor tuples and their frequencies.  Notice the spurious
ambiguities in the subject, due to the fact that the treebank grammar
includes various indices on the symbol NP-SBJ.

The governors are indexed by terminal positions rather than
terminal rules, so that if there are two part of speech
readings for "books" (e.g. NNS and VBZ), their governors
are summed into one vector. 

The file grmap defines a function from pairs of symbols to grammatical
relation names, such as sbj (subject) and obj (object). When the
map is included in the command line, pair grammatical relations are
replaced by named grammatical relations:

cat placed.pf | java Governor bank.gram grmap
Parse forest 1:
0 they
        1.0000066 placed:sbj
1 placed
2 books
        0.99999 placed:obj
3 on
        0.40669 books:NP:PP
        0.3991 books:NP-PRD:PP
        0.1942 placed:VP:PP
4 tables
        0.89369 on:PP:NP
        0.10631 on:PP:NP-LGS
5 .
        1.0000066 placed:S:-PER-

This produces more readable output, interpretable along the
lines of '"books" is almost certainly the head of an object of
"placed"'.  It also cleans up the problem with the indexed subjects.

A modular design allows other linear quantities to be averaged.  The
following computes the average depth of embedding at each string
position.

cat placed.pf | java Depth bank.gram
Parse forest 1:
0 they
        3.0000197999999996
1 placed
        2.9999831998679998
2 books
        4.805773199868
3 on
        4.805773199867999
4 tables
        5.805773199868
5 .
        2.0000132

The frequency-weighted average depth of embedding for "tables" is about
5.8.  The average depth of embedding for "placed" is nearly exactly 3,
becuase in all the high-frequency analyses, placed/VBD is dominated
by VP, which is dominated by S, which is dominated by the dummy
start symbol TOP.

To define a specialization of the algorithm, one defines:

(i) A linear space implementing the interface Line.java,
and a map from grammar rules and  symbols to elements of
the space. See Real.java and GrammarRealMap.java for an 
example.

(ii) A function m(r,i) from rules r and indices i of constituents
on the right hand side to scalars.  See PfDepthExpectation.java.

(iii) A function b(r,i) from rules and indices to elements of the
linear space. See See PfDepthExpectation.java.

(iv) A main function; see Depth.java. 


Let s be some symbol in an individual tree, with linear label x.
Where r is the rule expanding s in the tree, the linear label for the
i'th child of s is defined using the slope-intercept formula m(r,i)x +
b(r,i).  The algorithm computes a frequency-weighted average of this
quantity over the trees in the tree-language of the parse forest
grammar.  In the case of the depth computation, m and b are constant
unit functions: the depth of the child is 1*x + 1, where x is the
depth of the parent.  See Schmid and Rooth (2001) for an definition of
governor markup as a linear space.

Exercise 

  Modify m and b in PfDepthExpectation so that instead of depth of
embedding in the entire tree, depth of embedding in the maximal
projection is computed. Use the fact that the index of the head child
is 0.


