The RIPTIDES Project:


Rapidly Portable Translingual Information Extraction and Interactive Multidocument Summarization


Cornell University
CoGenTex, Inc.
University of Montreal



RIPTIDES Project Members

PI's

Asst. Professor Claire Cardie
cardie@cs.cornell.edu
Department of Computer Science
Cornell University
Ithaca, NY 14850
Phone: 607-255-9206

Dr. Michael White
mike@cogentex.com
CoGenTex, Inc.
840 Hanshaw Rd.
Ithaca, NY 14850
Phone: 607-266-0363 x25

    Project Overview

    In spite of recent progress in designing, building, and evaluating information extraction (IE) systems, relatively little research has addressed problems in the construction of translingual information extraction systems -- IE systems that extract domain-specific information from documents in a native language (e.g., Korean) and present the translated extraction results in English. In addition, very few methods exist for creating concise multi-document reports and summaries of the output of an information extraction system in any language. This post-processing capability becomes especially important in the context of translingual information extraction when analysts may need to quickly select for full translation only a small number of documents or passages.

    To address DARPA's current needs in translingual information extraction, CoGenTex Inc., Cornell University, and the University of Montreal are investigating an integrated approach to the rapid development of translingual information extraction and interactive summarization systems. The RIPTIDES system currently under development will ultimately support multilingual information extraction, translation of the extracted templates into English, and flexible, user-directed multidocument summarization of extraction results. There are two modes in which the final system can be employed. In the Training Scenario, the user interacts with the system to train and configure the system in a particular language and domain. In the Application Scenario, the user runs any of the available extraction systems and specifies a customized format in which the summarized results are to be reported. Each scenario is described below. For illustration purposes, we use Korean as the foreign language. In addition, we assume that the IE domain is the topic of meeting events where the elements to be extracted are participants, date, and location.



    Training Scenario


      
    Figure 1: RIPTIDES Training Scenario.

    Figure 1 illustrates the scenario in which an end-user trains the system to extract information for a particular language and IE domain. The primary goal of training is for RIPTIDES to learn a set of patterns that can be used to extract similar domain information from novel texts in the specified language of interest. The end-user can be either a bilingual domain expert or a team consisting of a domain expert and a language specialist. RIPTIDES assumes that a scenario template has been defined and that a set of pertinent documents has been made available to the system. The user initiates the training sequence; the user and the system then participate in an interactive learning loop. The numbered arrows in the diagram correspond to the following steps:

    1. Drawing on the training corpus, the user annotates a small number of documents with the phrases that should be extracted to particular template slots. For example, consider the following sentence, shown with a word-by-word gloss and an English translation:

    mikwuk-kwa  pwukhan-un        ...  27il      nyuyok-eyse  ...  hyepsang-ul             kacnuntako
    USA-and     NorthKorea-Topic  ...  the 27th  NewYork-Loc  ...  negotiation-Accusative  have
    USA and North Korea ... held negotiations ... in New York on the 27th ...

    Here the user would assign the Korean counterparts of `USA' and `North Korea' as participants, and `the 27th' and `New York' as a date and a location.

    2. RIPTIDES uses these annotations and the sentence in which they occur to create training examples for the IE pattern learner.

    3. At any point the user chooses, the learning component begins to induce patterns for the examples provided in the training set. The learned patterns are stored in the pattern base.

    4. RIPTIDES then applies the patterns throughout the training documents, collecting candidate matching instances.

    5. Performance of the IE patterns on the available training examples is measured and presented to the user upon request.

    6. RIPTIDES examines the candidate instances and selects the candidate example(s) (a clause, sentence, or document) that it considers the most informative in terms of its ability to improve the performance of the pattern learning algorithm.

    7. The user annotates the selected examples. In some cases, this will only require accepting correctly proposed extractions or rejecting incorrectly proposed extractions. Other times the user will have to annotate the text selections with the information to be extracted as in Step 1 above.

    This training loop continues until the desired performance level is reached. By selecting the most informative examples for annotation, RIPTIDES can more quickly acquire a good set of extraction patterns. The examples can also be leveraged by RIPTIDES to train other IE system components.
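
    The loop in steps 1-7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RIPTIDES implementation: the "pattern learner" simply maps a token's final particle to a slot, informativeness is approximated by counting particles not yet seen in any annotated sentence, and performance is measured on a small held-out set rather than on the training examples so that the toy score is not trivially perfect. All function names are hypothetical, and every sentence other than the seed example from Step 1 consists of invented placeholder tokens.

```python
# Toy sketch of the interactive training loop (steps 1-7).
# The learner, the selection heuristic, and all names are assumptions.

def cue(token):
    """Hypothetical pattern key: the text after the last hyphen (a particle),
    or the whole token when it contains no hyphen."""
    return token.rsplit("-", 1)[-1]


def learn_patterns(labeled):
    """Steps 2-3: induce cue -> slot patterns from the annotated sentences."""
    patterns = {}
    for slots in labeled.values():
        for token, slot in slots.items():
            patterns[cue(token)] = slot
    return patterns


def extract(patterns, sentence):
    """Step 4: propose token -> slot candidates for one sentence."""
    return {t: patterns[cue(t)] for t in sentence.split() if cue(t) in patterns}


def recall(patterns, gold):
    """Step 5: share of annotated fillers that the patterns recover."""
    total = hits = 0
    for sentence, slots in gold.items():
        proposed = extract(patterns, sentence)
        for token, slot in slots.items():
            total += 1
            hits += proposed.get(token) == slot
    return hits / total if total else 0.0


def most_informative(labeled, pool):
    """Step 6: pick the pool sentence with the most cues that do not occur
    in any annotated sentence (a simple novelty heuristic, assumed here)."""
    seen = {cue(t) for sentence in labeled for t in sentence.split()}
    return max(pool, key=lambda s: len({cue(t) for t in s.split()} - seen))


def train(seed, pool, ask_user, held_out, target=1.0):
    """Run the loop until the target score is reached or the pool is empty."""
    labeled = dict(seed)                       # step 1: initial annotations
    pool = list(pool)
    while True:
        patterns = learn_patterns(labeled)     # steps 2-3
        score = recall(patterns, held_out)     # step 5 (held-out: assumption)
        if score >= target or not pool:
            return patterns, score, len(labeled)
        chosen = most_informative(labeled, pool)            # steps 4 and 6
        pool.remove(chosen)
        # Step 7: the user accepts, rejects, or corrects the proposals.
        labeled[chosen] = ask_user(chosen, extract(patterns, chosen))


# Seed: the example sentence from Step 1 (elided words omitted).
seed = {
    "mikwuk-kwa pwukhan-un 27il nyuyok-eyse hyepsang-ul kacnuntako": {
        "mikwuk-kwa": "participant", "pwukhan-un": "participant",
        "27il": "date", "nyuyok-eyse": "location",
    }
}
# Invented placeholder sentences with a simulated user's annotations.
truth = {
    "A-kwa B-un C-eyse D-ul": {"A-kwa": "participant", "B-un": "participant",
                               "C-eyse": "location"},
    "E-nun F-ey G-ul": {"E-nun": "participant", "F-ey": "location"},
}
held_out = {"H-nun I-ey J-ul": {"H-nun": "participant", "I-ey": "location"}}

patterns, score, n_labeled = train(
    seed, list(truth), lambda sentence, proposed: truth[sentence], held_out)
print(score, n_labeled)  # prints "1.0 2"
```

    In this toy run the second placeholder sentence is selected first because it contains two unseen particles; annotating it alone is enough to reach the target, so the first placeholder sentence never needs to be annotated. That is the intended effect of choosing the most informative examples.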


    Application Scenario


      
    Figure 2: RIPTIDES Application Scenario. Language: Korean. Domain: Meetings.

    Figure 2 sketches one way in which an end-user, possibly an intelligence analyst, might use the RIPTIDES system:

    • In a first step (not shown), the user selects a set of Korean documents in which to search for information. The user then selects one or more scenario templates (extraction domains) to activate in the query. Available scenario templates might include troop movements, terrorist acts, meetings and negotiations, etc. Here the selected scenario template is of type meeting.

    • Next, the user optionally provides filters on the available template slots in order to restrict the search to the information considered relevant. In Figure 2, the specified values restrict the search to meetings located in South Korea with Masahiko Komura as a participant. The user can also specify which information should be reported when matches are found; here, the selected boxes under the Report column indicate that all information satisfying the query should be reported except for the meeting date.

    • Once the user submits the query for evaluation, the system searches the database of extracted events and generates a hypertext report summarizing the information that matches the query. Note that the query contains English keywords that are matched against Korean slot fillers, which have been automatically translated prior to their inclusion in the summary. In Figure 2, the generated hypertext response indicates two documents in the input set that matched the query. As the third sentence indicates, both documents report on the same event, with the second document providing a more specific location. Since the meeting date was not selected for reporting, it is not mentioned in the summary.

    • For further exploration and verification, perhaps working with a human translator, the user can drill down to the original Korean documents, where the information matching the query is highlighted. If a general-purpose MT system is available, an automatic translation is also provided.
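
    The query step described above can be illustrated with a small Python sketch. It assumes that extracted events are stored as simple records whose slot fillers have already been translated into English, and that a filter matches when its keyword occurs in the corresponding filler. The class name, function names, document identifiers, and all slot values other than `South Korea' and `Masahiko Komura' are invented for illustration; the actual RIPTIDES database and matching logic are not described on this page.

```python
from dataclasses import dataclass


@dataclass
class MeetingEvent:
    """Hypothetical record for one extracted (and translated) meeting event."""
    doc_id: str
    participants: list
    date: str
    location: str


def matches(event, filters):
    """True when every filter keyword occurs in the corresponding slot filler."""
    for slot, keyword in filters.items():
        value = getattr(event, slot)
        text = " ".join(value) if isinstance(value, list) else value
        if keyword.lower() not in text.lower():
            return False
    return True


def report(events, filters, report_slots):
    """Return one row per matching event, keeping only the slots to report."""
    rows = []
    for event in events:
        if matches(event, filters):
            row = {"doc_id": event.doc_id}
            row.update({slot: getattr(event, slot) for slot in report_slots})
            rows.append(row)
    return rows


# Invented placeholder data: two documents describe the same meeting, the
# second with a more specific location; a third document does not match.
events = [
    MeetingEvent("doc1", ["Masahiko Komura", "Official A"], "DATE-1", "South Korea"),
    MeetingEvent("doc2", ["Masahiko Komura", "Official A"], "DATE-1", "City X, South Korea"),
    MeetingEvent("doc3", ["Official B"], "DATE-2", "South Korea"),
]

rows = report(
    events,
    filters={"location": "South Korea", "participants": "Masahiko Komura"},
    report_slots=["participants", "location"],  # meeting date not selected
)
for row in rows:
    print(row)
```

    Only the first two records satisfy both filters, and because the date slot is not among the reported slots, it does not appear in the resulting rows. Recognizing that the two matching documents describe the same event is a separate step that this sketch does not attempt.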


    Advantages of the Approach

    The RIPTIDES system requires the integration of a number of system components:

    • The portable multilingual information extraction system contains components for learning and applying information extraction patterns in multiple languages for multiple domains.

    • The machine translation component translates the extracted foreign language phrases or sentences into corresponding English ones.

    • The natural language generation component generates well-organized, easy-to-read hypertext presentations by organizing and formatting the ordered (ranked) extracted information.

    • In the Application scenario, the user interface operates as a browser-based interface for entering queries and displaying the resulting presentations. In the Training scenario, the interface coordinates input from the IE pattern learner and the user to generate extraction patterns for a new language and domain.

    Each text-processing component must be designed to support a variety of foreign languages. Furthermore, it is imperative that these components be built quickly as text-processing capabilities in new domains and new languages become of interest. As a result, our research efforts focus on rapid portability: we are developing techniques and tools to support the automatic acquisition of all translingual information extraction and summarization system components. For this we propose to extend existing information extraction techniques for use in languages other than English and to develop new methods for incorporating translingual report generation and summarization. The underlying technology of our proposed effort is a novel combination of corpus-based techniques that together address both translingual and domain portability issues.

    In contrast to previous efforts to increase the portability of information extraction systems, our research emphasizes the development of powerful techniques for drastically reducing the amount of human-annotated data required to train an information extraction system in a new language or a new domain. The major innovations of the user-driven translingual summarizer include rapidly configurable and portable machine translation of extracted event instances; the use of supervised machine learning techniques for the identification of coreferent extracted events; and a robust and portable combination of bottom-up text planning and top-down structuring of a set of hyperlinked pages that support the data-driven groupings of extracted events matching the user query.

    We believe that our approach offers an innovative and promising solution to portability problems for translingual IE by dramatically reducing the amount of time and annotated data required to develop end-to-end IE systems for new languages. Furthermore, our approach to user-directed IE summarization ensures that end-user effectiveness will not be significantly compromised in the face of errors by the multilingual IE system or MT subsystem.


    Techniques for Rapid Portability

    As noted above, our research efforts focus on rapid portability: we are developing techniques and tools to support the automatic acquisition of all translingual information extraction and summarization system components. The underlying technology of our proposed effort is a novel combination of corpus-based methods that together address both translingual and domain portability issues.

    Our research centers on two related areas:

  • Machine learning for multilingual information extraction. We will extend existing machine learning methods in information extraction for use in multilingual information extraction tasks. In addition, we propose to develop new machine learning methods for training multilingual information extraction system components. The methods will employ unsupervised learning, weakly supervised learning, active learning and cotraining techniques as a complement to standard supervised learning algorithms. In contrast to previous efforts to increase the portability of information extraction systems, our research emphasizes the development of powerful techniques for drastically reducing the amount of human-annotated data required to train an information extraction system in a new language or a new domain.

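  • Translingual user-directed multidocument summarization. We propose to develop a novel, user-driven approach to enabling an end-user to access the results of translingual information extraction. Our translingual summarizer will respond to highly configurable user queries with hyperlinked reports containing informative and fluent English summaries including confidence ranking of extracted and translated information. It will allow the user to drill down to the original documents in the source language for further exploration and verification. The major innovations of the approach include rapidly configurable and portable machine translation of extracted event instances; the use of supervised machine learning techniques for the identification of coreferent extracted events; and a robust and portable combination of bottom-up text planning and top-down structuring of a set of hyperlinked pages that support the data-driven groupings of extracted events matching the user query.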

    Our final system will integrate the techniques and tools in a user interface that supports translingual information extraction (IE) and summarization in two domains and two to five foreign languages selected to test the applicability of our techniques across a variety of language families including Asian, European, and Eastern European languages.