Use of a Corpus - Functions of the Modern Conditional

Chapter 3: Functions of the Modern Conditional

4.2 Use of a Corpus

The research to be undertaken required a body of data which could be examined in the light of the hypotheses presented in the Introduction. Given the nature of the research question - the distributions of two conditional periphrases - this data had to consist of attestations of reflexes of the two periphrases, to which it would be possible to link contextual information on the area, date, and genre of the text in which the attestation occurred. The primary task was, then, to establish a source from which these attestations and the bibliographic detail to accompany them could be drawn.

Texts are the only sources of conditional attestations dating from the late mediaeval period, so it was necessary to define a corpus of texts that would serve both as a source of attestations and as a limit on their number: in effect, the corpus would form a sample rather than the population.268 There are two types of corpora. The first is the manuscript corpus, in which the text exists solely in printed form. Such corpora range in size from Dionisotti and Grayson’s short compilation Early Italian Texts or Castellani’s La prosa italiana delle origini: Testi toscani di carattere pratico, to large-scale, comprehensive collections of texts such as the Concordanze della lingua poetica italiana delle origini under the direction of Arco Silvio d’Avalle, which aims to comprise “tutta la poesia italiana trascritta in codici grosso modo anteriori alla soglia del 1300”.269 This type of corpus is only searchable manually, obliging the researcher to read the complete corpus to produce comprehensive data.

268 In statistical terms, the population refers to the complete set of data from which the sample is taken, which, in this case, would consist of every surviving attestation of the conditional, in published and unpublished texts. To compile a list of the population would be a task beyond the scope of a Ph.D. thesis, which is why a sample was the preferred option.

The second type of corpus is the electronic corpus, of which there are two main types. An internet corpus is a body of text or collection of texts accessible via a website which includes dedicated corpus software. These corpora are generally built and maintained by a research group or public body for either public or restricted access on a gratis or charged basis. A local corpus270 may be a collection of texts incorporating dedicated corpus software, usually produced by research groups or public bodies on CD for acquisition by individuals or libraries, for example, the Library of Latin Texts.271 Alternatively, a local corpus may be a collection of texts compiled from the internet or local sources requiring external corpus software, such as Wordsmith.272 This type of corpus is usually built by a researcher to respond to project-specific needs for particular texts or types of texts.

269 Carlo Dionisotti and Cecil Grayson, Early Italian Texts (Oxford: Blackwell, 1965).

Arrigo Castellani, La prosa italiana delle origini: Testi toscani di carattere pratico (Bologna: Pàtron Editore, 1982).

Arco Silvio d’Avalle, Concordanze della lingua poetica italiana delle origini (Milan: Ricciardi, 1992). While the original intention of the CLPIO project, under the auspices of the Accademia della Crusca, was to produce manuscript editions of the works in question, the advances in technology since its inception have led to a different aim, namely “non più la stampa dei volumi con le concordanze, ma la diffusione del corpus lemmatizzato su supporto elettronico, nella forma di un CD-ROM”. <http://www.accademiadellacrusca.it/progetti/progetto_singolo.php?id=2568&ctg_id=27> [accessed 3 September 2008].

270 ‘Local’ is to be understood as data held at the same location as the researcher on CD-ROM, hard disk drive or floppy disk, as opposed to ‘internet’, which refers to data stored elsewhere and accessed remotely by the researcher.

271 The Library of Latin Texts is produced by the Centre ‘Traditio Litterarum Occidentalium’ in Turnhout under the supervision of Paul Tombeur. This is a database of texts from the second to the fifteenth centuries. Updated at regular intervals by additional CDs, version 5 contains “texts from the beginning of Latin literature (Livius Andronicus, 240 BC) through to the texts of the Second Vatican Council (1962-1965). It covers all the works from the classical period, the most important patristic works, a very extensive corpus of Medieval Latin literature as well as works of recentior latinitas”. <http://www.brepols.net/publishers/cd-rom.htm#CLCLT> [accessed 2 September 2008].

An electronic corpus has a number of advantages over a manuscript corpus. Instead of the researcher being obliged to read the entire corpus, noting down each attestation of the feature or features under scrutiny, parameters are entered into the search fields of the corpus software, which then retrieves all the attestations of the feature corresponding to the search query. Provided the queries are correctly formulated, this eliminates researcher-introduced error. Additionally, when using a manuscript corpus, only a limited number of features can be searched for and noted on a single reading. An internet corpus allows multiple concurrent searches, by running the corpus in separate browser windows. While this is usually not possible in the case of local corpora, their use is still far faster than manual searching. This level of efficiency means that an electronic corpus enormously increases the amount of text that can be analysed. A large corpus may contain millions of words, yet still be searched in seconds for a specific feature, a task which would take months of work by hand. A corpus containing the entire works of Shakespeare, for example, can be searched electronically in about the time it would take to search a single sonnet manually. Further analysis of data produced by an electronic corpus search is also simplified, as the data can be printed and studied in its original layout, or copied into other programmes such as Word, Access or Excel for further editing, tagging and analysis.
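The retrieval step described above can be illustrated with a minimal sketch in Python. It is not the software of any corpus discussed in this chapter: the `Text` and `Attestation` records, the `search` function, the search pattern and the two sample sentences are all hypothetical, invented purely to show how a query returns each attestation together with the area, date and genre of its source text.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    """One corpus text with its bibliographic detail (hypothetical structure)."""
    title: str
    area: str
    date: int
    genre: str
    body: str


@dataclass
class Attestation:
    """A single hit, linked back to the metadata of its source text."""
    form: str
    context: str
    title: str
    area: str
    date: int
    genre: str


def search(corpus, pattern, window=15):
    """Return every match of `pattern`, with `window` characters of context."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for text in corpus:
        for match in regex.finditer(text.body):
            start = max(0, match.start() - window)
            end = min(len(text.body), match.end() + window)
            hits.append(Attestation(match.group(0), text.body[start:end],
                                    text.title, text.area, text.date, text.genre))
    return hits


# Invented toy data: these are NOT real attestations or real texts.
corpus = [
    Text("Toy text A", "Area 1", 1290, "prose", "egli sarebbe lieto di venire"),
    Text("Toy text B", "Area 2", 1310, "verse", "ben saria lieto di venire"),
]

# Illustrative query for two conditional forms of one verb.
hits = search(corpus, r"\bsar(?:ia|ebbe)\b")
for hit in hits:
    print(hit.form, hit.area, hit.date, hit.genre, "|", hit.context)
```

Because each hit carries its own metadata, the results can be exported to a spreadsheet or database for tagging and counting by area, date and genre.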

272 Wordsmith is a corpus analysis programme, written by Mike Scott of the University of Liverpool, and published by Oxford University Press. It was released in 1996 and is currently at version 5. <http://www.lexically.net/wordsmith/> [accessed 2 September 2008].

There is, however, a negative side to the use of an electronic corpus: a great deal depends on the entry of correct search parameters into the corpus search form. While poorly chosen parameters may have as little impact as requiring additional searches to retrieve the data, or the subsequent elimination of irrelevant results, there is also the danger that they may exclude relevant attestations. An electronic analysis of a corpus will find only the data that it has been programmed to retrieve; the chance of a serendipitous discovery of a related or contrasting piece of evidence is minimal in comparison to an analysis of a manuscript corpus, where the entire corpus will have been read in context by a researcher who will be aware of the wider implications in a way that a computer cannot be. Despite these potential drawbacks, it was decided that in light of the breadth and volume of data required to produce a detailed picture of conditional use, the advantages of an electronic corpus would outweigh the disadvantages. There are a limited number of on-line or electronic corpora, and the only corpus containing sufficient early Italo-Romance vernacular texts is the Opera del Vocabolario Italiano (OVI).
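The risk that badly chosen search parameters exclude relevant attestations can be made concrete with a small sketch in Python. The sample string, the spelling variants in it and both search patterns are invented for illustration and are not drawn from the OVI or any other corpus; the helper `find_forms` is likewise hypothetical.

```python
import re
import unicodedata


def find_forms(text, pattern):
    """Return all matches, normalising accents so that 'ì' compares reliably."""
    def nfc(s):
        return unicodedata.normalize("NFC", s)
    return re.findall(nfc(pattern), nfc(text))


# Invented sample with hypothetical spelling variants of one form.
sample = "saria lieto; sarìa lieto; seria lieto; sarebbe lieto"

narrow = r"\bsaria\b"          # one spelling only
broad = r"\bs[ae]r[iì]a\b"     # allows a/e in the stem and an accented i

narrow_hits = find_forms(sample, narrow)
broad_hits = find_forms(sample, broad)

# Forms the narrow query silently fails to retrieve.
missed = [form for form in broad_hits if form not in narrow_hits]
print(len(narrow_hits), len(broad_hits), missed)
```

The narrow query returns no error and no warning, only fewer results, which is why the omission would go unnoticed unless the researcher already anticipated the variant spellings.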

In document A diachronic study into the distributions of two Italo Romance synthetic conditional forms (Page 81-84)