Investigation of Sentence Level Text Reuse Algorithms
Author: Ron Hose.

This project analyzes the applicability of various algorithms to the problem of measuring text reuse at the sentence level. The motivation is to identify a high-recall, high-precision algorithm for finding Hebrew sentences in the Dead Sea Scrolls that are derived from biblical passages.

Often labeled the most significant archaeological discovery of the 20th century, the Dead Sea Scrolls have puzzled researchers ever since their discovery, which began in the late 1940s. Although their authorship is still controversial, the scrolls are believed to have been written by a Jewish sect around the time of the Second Temple. The scrolls offer insight into the political and religious life of the era, as well as the formation of the modern Bible [Schiffman, 1994]. Revealing the connections between the scrolls' writings and the biblical texts is an important key to setting the context for scholars seeking insight into the sect's beliefs and practices.

Derivations in the scrolls are variations on the biblical source, not exact copies. Finding them requires an inexact matching algorithm that can estimate the likelihood that a scroll sentence is a derivation of a Bible sentence. The approach taken in this project is to treat the scroll sentences as queries against the biblical text corpus, measuring similarity and ranking matches accordingly. The algorithms tested are n-grams, sentence alignment, simple TF-IDF, and Okapi.
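The query-and-rank approach can be illustrated with a minimal sketch of the n-gram variant. The code below is an assumption for illustration, not the implementation used in this project: the function names (word_ngrams, containment, rank_matches), the choice of word bigrams, and the containment score are hypothetical, and the English sentences are toy stand-ins for tokenized Hebrew text.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return a multiset (Counter) of the word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def containment(query_tokens, source_tokens, n=2):
    """Fraction of the query's n-grams that also occur in the source.

    Returns 0.0 when the query is too short to contain any n-gram.
    """
    query_grams = word_ngrams(query_tokens, n)
    source_grams = word_ngrams(source_tokens, n)
    total = sum(query_grams.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, source_grams[gram])
                 for gram, count in query_grams.items())
    return shared / total


def rank_matches(query, corpus, n=2, top_k=3):
    """Score every corpus sentence against the query; return the top_k
    (score, corpus_index) pairs, best score first, ties by index."""
    query_tokens = query.split()
    scored = [(containment(query_tokens, sentence.split(), n), index)
              for index, sentence in enumerate(corpus)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy data (hypothetical, English stand-ins for Hebrew sentences).
corpus = [
    "in the beginning god created the heaven and the earth",
    "and the earth was without form and void",
    "and god said let there be light",
]
query = "in the beginning god made the heaven and the earth"

for score, index in rank_matches(query, corpus):
    print(f"{score:.2f}  {corpus[index]}")
```

In this toy run, the query differs from the first corpus sentence by a single word, so most of its bigrams are shared and that sentence ranks first. A threshold on the score could then decide which ranked candidates are reported as likely derivations.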

This project follows the experimental framework of the Measuring Text Reuse (METER) project [Clough, 2002], which explores text reuse in journalistic material. The METER project measures reuse at the document level. In contrast, this project measures reuse at the sentence level, where the margin for error is smaller.

Previously, Bible/scroll reference matches have been found by manual exploration of the texts. While such work is precise, it is time-consuming and expensive, requires expert knowledge of the Bible and the scrolls, and is therefore limited by how much the scholar can recall. The most significant such index was constructed by Wise [Wise, 1996]. My work automates the process of discovering matches, offering significantly improved recall as well as high precision. The match candidates produced by the system can be reviewed by a scholar to achieve precision similar to that of the manual process.
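As a rough illustration of how recall and precision could be measured against a manually built index, the sketch below compares a set of system-proposed matches with a set of reference matches. The function name precision_recall and the sentence identifiers are hypothetical and invented for the example; they are not taken from the paper or from any existing index.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for predicted match pairs against a
    reference (gold) set of match pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up (scroll sentence, Bible sentence) identifier pairs.
gold_index = {("scroll_01", "bible_17"),
              ("scroll_02", "bible_03"),
              ("scroll_05", "bible_40")}
system_candidates = {("scroll_01", "bible_17"),
                     ("scroll_02", "bible_03"),
                     ("scroll_03", "bible_09"),
                     ("scroll_04", "bible_22")}

precision, recall = precision_recall(system_candidates, gold_index)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four proposed pairs appear in the reference set, and two of the three reference pairs are recovered. A scholar reviewing the candidates would discard the incorrect pairs, which raises precision without affecting recall.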
