Observations

The more global noun phrases are deleted prematurely from the lists
        Whilst applying Strube’s algorithm, I noticed that the most common cause of failure to find the correct antecedent stemmed from the last step in the algorithm’s loop: having processed an utterance, the algorithm removes from the S-List all discourse entities that are not realized in that utterance.  As a result, an entity that is mentioned in one utterance but skipped for even a single sentence is removed; when it reappears, it is wrongly classed as hearer-new, ranked low, and likely to be deleted again unless it is promptly re-evoked.  This problem was more severe with noun phrases.  This makes sense, because pronouns usually refer to discourse entities close to them – that is, local antecedents – whereas noun phrases, due to their informative nature, can refer to long-distance antecedents – that is, global ones.  One conclusion we can draw from this first observation is that, in order for existing pronoun resolution algorithms to handle noun phrases, they must first be “scaled up” to handle global as well as local anaphora resolution.
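
    To make this failure mode concrete, below is a minimal sketch of the deletion step (the function name and data layout are mine; the real S-List also carries information-status classes and ranking, which are omitted here):

    def prune_s_list(s_list, realized):
        # After processing an utterance, keep only the entities that were
        # realized in it; everything else is dropped, however important
        # it was earlier in the discourse.
        return [entity for entity in s_list if entity in realized]

    s_list = ["Wagner", "the opera"]
    s_list = prune_s_list(s_list, {"the opera"})  # next utterance omits "Wagner"
    print(s_list)  # ['the opera'] -- "Wagner" is gone after only one utterance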

How well a particular algorithm performs depends on the genre of the text
        Although much data processing time went into the finer-grained task of differentiating types of pronouns and noun phrases, the corpora were too small and the resulting figures not meaningful.  Statistically, therefore, I concentrated on the coarser comparison between pronouns and noun phrases.  As noted above, an exception occurs in News2, where the error rate for noun phrase resolution is lower than that for pronouns.  I therefore considered how this text differed from the other two, and found that this article is much more focused.  This means that the discourse entities in the S-List are nearly always evoked in consecutive utterances.  As a result, the elements stay in the S-List across many utterances, increasing the algorithm’s chance of hitting the correct antecedent.  In fact, this observation also explains why the error rates associated with News2 are on the whole so much lower than those of the other two corpora under Strube’s algorithm.  This observation suggests that success rate may depend on finding a “suitable” algorithm–text pair – that is, one algorithm may perform well on one genre of text (of a particular writing style) and not on others.  Extending this idea, an algorithm that handles one type of noun phrase may not handle another type.  These hypotheses, of course, still have far too little evidence to support them.

The S-List algorithm is a combination of the Recency Constraint and Centering Theory
    One other observation about the S-List algorithm is that it is implicitly a combination of both the recency constraint and Centering Theory.  Where two discourse entities are in the same class but in different utterances, we prioritize the element from the more recently processed utterance.  This is analogous to the ranking criterion of the recency constraint – except that here we apply the recency constraint to the utterances themselves (globally), not to the discourse entities within an utterance (locally).  On the other hand, if two discourse entities are in the same class and in the same utterance, we prioritize the element that appears earlier in the utterance.  This is analogous to the ranking criterion of Centering Theory (where subjects >> objects >> others), because subjects tend to appear near the beginning of a sentence.  Individually, both Centering Theory and the recency constraint are intuitively appealing – that is, they “make sense” and are correct often enough.  Together, however, they give better performance, as shown by Strube’s (1996) experiments.  If the S-List were combined with yet another algorithm, the result might perform better still.  It is tempting to conclude that combining more algorithms always gives better performance, but I would need to conduct more experiments on larger corpora to support that conclusion.
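
    As a sketch of how the two criteria can coexist in one ranking (the tuple layout and field names are my own, not Strube’s), suppose each entity carries an information-status class rank, the index of the utterance it was last realized in, and its position within that utterance:

    def s_list_sort_key(entity):
        class_rank, utt_index, position = entity
        # 1. prefer the better information-status class;
        # 2. among equals, prefer the more recent utterance (the recency
        #    constraint, applied globally to utterances);
        # 3. within the same utterance, prefer the earlier position
        #    (Centering-style, since subjects tend to come first).
        return (class_rank, -utt_index, position)

    # (class_rank, utterance_index, position): same class, so the entity
    # from the more recent utterance 4 outranks the one from utterance 3.
    candidates = {"A": (0, 3, 5), "B": (0, 4, 2)}
    ranked = sorted(candidates, key=lambda name: s_list_sort_key(candidates[name]))
    print(ranked)  # ['B', 'A']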

The Recency Constraint fails mainly because subjects are placed at the beginning of a sentence
    Whilst applying Hobbs’s algorithm, I noticed that the most common cause of failure to find the correct antecedent stemmed from conventional sentence structure.  Sentences are usually structured so that the subject and the other discourse entities the writer most wants the reader to remember are placed at the beginning, to stress their importance; less important discourse entities are placed towards the end.  Since important entities are those the writer wants to stress, they are also the entities the writer will most likely continue to use over the next utterances.  This means that entities at the beginning of a sentence are likely to be the correct antecedents.  However, Hobbs’s algorithm chooses the most recently evoked entities, which tend to be located at the end of the previous sentence – that is, the algorithm chooses relatively unimportant entities.  This is one further reason why Hobbs’s algorithm performs rather poorly, for both pronouns and noun phrases.  Although Strube also applied the recency idea, his algorithm worked better: instead of focusing on discourse entities within an utterance, he applied the constraint to the utterances themselves, reinforcing the earlier point that we need to look more globally for accuracy.
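
    A toy illustration of the contrast (the entity list is stipulated for the example): given the previous sentence’s entities in surface order, a purely local recency preference selects the last-mentioned entity, while the convention above suggests the sentence-initial one:

    # Entities of the previous sentence, in surface (left-to-right) order.
    previous_sentence = ["the composer", "a letter", "his publisher"]

    recency_choice = previous_sentence[-1]   # 'his publisher': most recent,
                                             # but usually less important
    position_choice = previous_sentence[0]   # 'the composer': the subject,
                                             # the likelier antecedent
    print(recency_choice, "vs.", position_choice)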

The noun phrase “A and B” can evoke two different kinds of discourse entities: “A and B”, or “A” and “B” separately
    However, there are two problems that neither algorithm handles, causing relatively high error rates in both.  The first is the undetermined ranking criterion for noun phrases of the form “A and B”, where “A” and “B” represent any objects.  Two types of entities can be evoked from such a noun phrase:

(1) “A and B”
(2) “A” and “B” as separate entities.

    Neither algorithm specifies which should be ranked higher.  For my experiment, I assumed that these noun phrases are interpreted as the separate entities “A” and “B” unless the text explicitly indicated otherwise.  This eliminated the complication of having to rank “A and B” and “A” and “B” as separate entities.  However, I failed to see any pattern as to when one type of discourse entity should be preferred over the other.
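
    The assumption can be sketched as follows (the function name and the joint_reading flag are hypothetical, purely for illustration; detecting an explicit joint reading in real text is of course the hard part):

    def evoke_entities(noun_phrase, joint_reading=False):
        # Split "A and B" into separate entities unless the text
        # explicitly signals a joint reading.
        parts = [p.strip() for p in noun_phrase.split(" and ")]
        if joint_reading or len(parts) == 1:
            return [noun_phrase]      # one entity: "A and B"
        return parts                  # separate entities: "A", "B"

    print(evoke_entities("John and Mary"))                      # ['John', 'Mary']
    print(evoke_entities("John and Mary", joint_reading=True))  # ['John and Mary']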

A pronoun may refer to an entity not yet mentioned
    The second problem that both algorithms experienced stems from the fact that both update their discourse entity lists incrementally, word/phrase by word/phrase, from left to right.  This means that the algorithms are sensitive to the word order of a sentence.  For example,

12a. Until he composed Parsifal, no one knew Wagner.
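
    A sketch of what happens when 12a is processed strictly left to right (noun-phrase detection is stipulated for the example):

    tokens = ["Until", "he", "composed", "Parsifal,", "no", "one",
              "knew", "Wagner."]
    entities = []                          # discourse entities seen so far
    for token in tokens:
        if token == "he":
            # resolution is attempted here, but the list is still empty
            print("candidates for 'he':", entities)   # []
        elif token.rstrip(".,") in ("Parsifal", "Wagner"):
            entities.append(token.rstrip(".,"))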

    As the sketch shows, this causes a problem because we encounter the pronoun “he” before we encounter its antecedent “Wagner”.  Furthermore, sentences of this kind are usually used not when the center of the previous utterance is (in this case) Wagner, but when the center is some other discourse entity.  One obvious but somewhat complicated solution is to eliminate this type of sentence structure altogether: first, scan through the corpus for sentences with this kind of structure, then rearrange them so that they become “less complicated”.  For example, utterance 12a would become 13a:

13a. No one knew Wagner until he composed Parsifal.

    However, this would require a fair amount of undesirable preprocessing of the corpus.  Instead, I propose a solution which requires no preprocessing and maintains the incremental update property.