CHAPTER 2: Insights from the Digital Humanities

2.3 Comparison with the New Method

2.3.2 Differences with the DH Projects

As well as the three similarities between the new method and the DH projects, there are also three significant differences, as follows:

1. The scope and resources;

2. The number of reference forms that are considered;

3. How the ‘most likely’ parallels are highlighted.

These differences are now explained in the following discussion.

2.3.2.1 Scope and Resources

Although the method used by this study follows a similar pattern of steps (see Section 2.3.1.1), it does not seek to create a new DH project. Most of the existing DH projects are collaborative ventures, usually in a university setting, that employed a team of computer scientists to write a program and a team of literary scholars to assess the results. As such, they were able to test and refine their models multiple times in order to achieve the best results. The present study differs from these projects in that it does not have those resources. Consequently, it does not write a new computer program; instead, the method is executed manually and tested only once for each set of source texts.

The new method is best understood as reusing/configuring one of these existing projects. As such, the method could have potentially used any of these projects, provided they contained the required databases of source texts. However, of the ten projects that were surveyed, only the eAQUA project had the same set of source texts, but this project is now discontinued. The TLG database that was used by eAQUA is still available, but the web-based interface to the database cannot be configured to search for a wide variety of reference forms.

Among the remaining DH projects, the Tesserae project appears to be the easiest to configure (i.e. the most ‘user-friendly’), but it currently lacks the source and target texts that are required by this study. Tesserae would also need to add an option to rank parallels based on the rarity of the matching word combinations, rather than just the rarity of the individual words (see Section 2.2.4). The String Resemblance Systems framework appears to do this ranking successfully (see Section 2.2.9), but it is only designed for use with Japanese Waka poetry, and so it is not reusable for this study. Therefore, this study chose to use the Accordance Bible software program as the underlying project, or search platform. This choice was based on two key reasons: firstly, it can be easily configured to detect the variety of reference forms that are needed for this study of potential references in the Pastoral Epistles (see Chapter 1); and secondly, Accordance already contains the two databases of source texts (i.e. the Septuagint and the Jewish Pseudepigrapha) that are used by the study. Recreating these databases, which would be required to use the Tesserae or Tracer programs, is beyond the scope of this study.

However, the most significant difference between the new method and the majority of the DH projects is not the databases, but the way that the parallels are scored/ranked in order to highlight the most-likely parallels (see Section 2.3.2.3, below). As such, many of these DH projects, including Tesserae and Tracer, could be easily modified to include this scoring system as one of their configurable options. It is hoped that the results of this present study, including the effectiveness of the method (see below), might provide the impetus to implement this relatively minor change in future versions of these programs.

2.3.2.2 Reference Forms

Since the method was originally developed for use with the Pastoral Epistles, it is designed to be able to detect all the parallels that are listed for these Epistles in the three baseline lists (i.e. the UBS5, NA28 and Evans). Some of these known/existing parallels have just one matching word, meaning that they would not even register as a single bi-gram (i.e. a two-word sequence). Consequently, many of the algorithms that are used by the DH projects would not detect them. Tracer can be configured to detect single words (by setting the length of the n-grams to one word), but without ranking these words based on their potential singularity, this configuration produces a vast number of false parallels (i.e. those that are not thematically coherent), which then need to be manually inspected.
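The point about bi-grams can be illustrated with a minimal sketch. It is not taken from Tracer or any other surveyed project; the function names and the English stand-in tokens are hypothetical, and real input would be lemmatised Greek text. It only shows why a parallel with one matching word produces no shared bi-gram, while the same parallel is found when the n-gram length is set to one.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(target, source, n):
    """Return the n-grams that occur in both token lists."""
    return ngrams(target, n) & ngrams(source, n)


# Hypothetical English stand-ins for two already-tokenised passages.
target = ["fight", "the", "good", "fight", "of", "faith"]
source = ["he", "kept", "faith", "with", "his", "fathers"]

bigram_hits = shared_ngrams(target, source, 2)   # no two-word sequence is shared
unigram_hits = shared_ngrams(target, source, 1)  # only the single shared word
```

With n set to one, every common word shared by two passages would also count as a hit, which is why unranked single-word matching produces so many false parallels.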

In order to accommodate reference forms with low verbal similarity, the study developed a rule-based system that is similar to the PHŒBUS project. However, PHŒBUS has a fixed window/segment size (such as five words, as was the case in the example given in Section 2.2.10) and a fixed number of words in that window that must match (in the same example, this was three matching words, or two holes). These settings reduce the number of rules that are required (i.e. just six in the example) but they also effectively limit the reference forms to just verbatim and non-verbatim clauses/phrases (or quotations and paraphrases).

In contrast to PHŒBUS, the method used by this study caters for different window/segment sizes (up to 14 words) and various numbers of matching words in the window (as few as one). This enables the method to detect a greater variety of reference forms, like single keywords and multiple keywords. The method also defines rules that span two windows/segments of the target text, which allows it to detect the structural parallel reference form. Including these additional reference forms meant that the method needed to define about 10,000 rules. The following chapter will outline how these rules were defined and explain how the method will check to see if a rule is true.
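The idea of a rule that pairs a window size with a required number of matching words can be sketched as follows. This is an illustrative simplification only: the `Rule` class, the `rule_is_true` function and the English stand-in tokens are hypothetical, the check ignores word order, and it does not cover rules that span two windows. The study's actual rule definitions are given in the following chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    window_size: int  # number of consecutive target words examined
    min_matches: int  # how many distinct window words must occur in the source


def rule_is_true(rule, target, source):
    """Return True if any window of the target text satisfies the rule."""
    source_words = set(source)
    last_start = max(len(target) - rule.window_size, 0)
    for start in range(last_start + 1):
        window = target[start:start + rule.window_size]
        matches = {word for word in window if word in source_words}
        if len(matches) >= rule.min_matches:
            return True
    return False


# Hypothetical English stand-ins for a target and a source passage.
target = ["guard", "the", "good", "deposit", "entrusted", "to", "you"]
source = ["the", "deposit", "was", "kept", "good", "and", "safe"]

fixed_rule = Rule(window_size=5, min_matches=3)    # fixed setting, as in the PHŒBUS example
keyword_rule = Rule(window_size=1, min_matches=1)  # single-keyword setting
strict_rule = Rule(window_size=5, min_matches=4)   # stricter than this pair can satisfy

results = [rule_is_true(r, target, source) for r in (fixed_rule, keyword_rule, strict_rule)]
```

Allowing both parameters to vary is what lets one framework describe quotations, paraphrases and single keywords, at the cost of a much larger rule set.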

2.3.2.3 How the ‘most-likely’ parallels are identified

Notably, none of the DH projects claim to detect all of the parallels that scholars have detected. This is because the task of detecting all the thematically coherent parallels actually requires the ability to understand the meaning of the texts, which is beyond the limits of modern computing. As a compromise, these projects look for instances of matching words (i.e. the ‘linking’ step) and then they highlight the matches that they think are ‘most-likely’ (i.e. the ‘scoring’ step).

The new method uses a combination of linking and scoring that appears to be unique. It links texts, or finds parallels, using a rule-based system that defines a large variety of reference forms and then scores the results, or highlights the most-likely parallels, based on the rarity of the word combinations. There are two projects that are somewhat similar: the PHŒBUS project links the texts using a rule-based system but then scores the results differently;278 and the String Resemblance Systems computational framework scores the parallels in a similar manner, but it can only link poems that are exactly 31 syllables in length.279
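The difference between scoring individual words and scoring word combinations can be shown with a small sketch. The inverse-frequency formula, the function names and the toy segments are all assumptions made for illustration; this section does not state the scoring formula that the study uses.

```python
def combination_frequency(words, segments):
    """Count the source segments that contain every word in the combination."""
    wanted = set(words)
    return sum(1 for segment in segments if wanted <= set(segment))


def rarity_score(words, segments):
    """Score a combination so that rarer co-occurrence gives a higher value."""
    frequency = combination_frequency(words, segments)
    return 1.0 / frequency if frequency else 0.0


# Hypothetical source segments (English stand-ins for tokenised source text).
segments = [
    ["lord", "god", "of", "israel"],
    ["the", "lord", "is", "king"],
    ["god", "is", "king", "for", "ever"],
    ["king", "of", "the", "ages"],
]

# "is", "king" and "of" are each found in two or three segments,
# but the pairs they form are not equally rare.
score_is_king = rarity_score(["is", "king"], segments)  # pair found in two segments
score_king_of = rarity_score(["king", "of"], segments)  # pair found in one segment
```

Under these assumptions the pair that co-occurs in only one segment receives the higher score, even though its individual words are about as common as those of the other pair.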

This unique combination proved to be effective for the Pastoral Epistles because the method was able to detect 94.9% of the interpretable parallels that are listed for the Septuagint, and 100% of the interpretable parallels that are listed for the Jewish Pseudepigrapha.280 Furthermore, many of these parallels have much lower verbal similarity than the instances of text reuse that were detected by the DH projects, including parallels that involve just one matching word (see Chapters 4–7).

In order to put these percentages in perspective, Coffee et al. detected only 37.6% of the interpretable parallels between book 1 of Lucan’s Civil War (BC) and Vergil’s Aeneid using the Tesserae program.281 When the PAIR program was used to search for parallels containing four or more words, it had a success rate of 73.8%.282 The only project that recorded a level of effectiveness similar to that of the new method was Lee’s theoretical models for detecting text-reuse between the Gospels of Mark and Luke. Here, the effectiveness of his models was as high as 97.2% for the commentator that the model was based on, but the model only detected an average of 85.7% of the parallels for other commentators.283 Therefore, while this study investigates a different research area from these ten DH projects, it does appear that the new method is relatively effective.
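The percentages quoted above all express the share of a baseline list of parallels that a method managed to detect. A minimal sketch of that calculation is given here; the function name and the labels are hypothetical placeholders, and the counts are toy values that do not correspond to any of the studies cited.

```python
def coverage_percent(detected, listed):
    """Return the percentage of baseline parallels that were detected."""
    if not listed:
        raise ValueError("the baseline list is empty")
    found = set(detected) & set(listed)
    return round(100.0 * len(found) / len(listed), 1)


# Toy labels only: eight baseline parallels, six of which were detected.
listed = {"A", "B", "C", "D", "E", "F", "G", "H"}
detected = {"A", "B", "C", "D", "E", "F", "X"}  # "X" is not in the baseline

coverage = coverage_percent(detected, listed)
```

Detections that are absent from the baseline, such as the placeholder "X", do not raise the percentage, because the measure counts only the listed parallels that were found.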

279 Takeda et al.

280 These percentages are calculated against the total number of parallels in the three combined baseline lists (i.e. UBS5, NA28 and Evans).

281 Coffee et al., 401.

282 Horton, Olsen, and Roe, ‘Something Borrowed’, 10.

283 Lee, ‘A Computational Model’, 478. None of the other studies that were surveyed contain the percentages of parallels found.

2.4 Summary

In the previous chapter, it was noted that this present study requires a method of detecting verbal parallels that can systematically and efficiently search through multiple intertextual frameworks and can detect a variety of reference forms and reference types. These requirements make the study comparable with a number of recent DH projects that have sought to introduce a level of automation into the detection of verbal parallels. Therefore, this present chapter surveyed ten of these DH projects and identified their similarities to, and differences from, the method used by this study.

This survey revealed some of the benefits of systematically searching for parallels, including the ability to detect ‘new’ interpretable parallels that have not been detected by the scholars, and also the ability to collect interesting metadata that can indicate potential areas of future research. Similar benefits will also be demonstrated by the method in Chapters 4–7.

The comparison with the DH projects highlighted that the new method has a unique combination of linking texts and scoring/ranking the results. This combination makes the method effective in its coverage of different forms of parallels, as well as efficient in the way that it highlights the most-likely parallels for manual analysis.

The following chapter will now describe the details of this new method. Rather than creating a new project, the method is designed to configure an existing search platform, such as one of these DH projects. As such, the terminology that is used to describe the steps involved in the method has been chosen to reflect the terminology used in many of these DH projects. In fact, the method could be used with one of these existing DH projects, provided that the project is modified to allow the parallels to be ranked based on the frequency/rarity of the word combinations. However, the Accordance program was preferred as the search platform because it is highly configurable and because it contains the required databases of source texts.

CHAPTER 3:

In document Echoes of Scripture and the Jewish Pseudepigrapha in the Pastoral Epistles: Including a Method of Identifying High-interest Parallels (Page 84-89)