Chapter 4 The Charles Dickens Complete Corpus
4.6 The construction of the DCC
The DCC is restricted to the texts of Dickens published in the National Edition of his works, covering thirty-eight of its forty volumes. The only indispensable criterion for constructing the DCC was therefore to have these works fully identified. The rationale for inclusion was to create a precise electronic version of the thirty-eight printed volumes, which required a number of considered steps to match the electronic texts with their printed versions.
One of the essential criteria of a corpus is that it represents a particular variety or genre; otherwise it would merely be referred to as a text archive, as differentiated by Leech (1991, cited in Baker 2006: 26) and Sinclair (2004).
Being a relatively small and specialised corpus, the DCC clearly represents the original language of the author, and is more likely to preserve oddities in the linguistic norms used by the author, where such curious usage may become lost in large corpora. The DCC is not intended to typify the English language of the 19th century, nor the literary language of that period in general; rather, it is representative of the language of Dickens himself. Meanwhile, the DCC can serve as an example of the Victorian-era novel and of the period in which these works are situated.
4.6.1 Selecting texts’ sources
Following the creation of an accurate list of Dickens’s works along with their respective contents, the full texts of the complete works were gathered from one primary source, namely Project Gutenberg. Other sources were used as supplementary pools for texts not included in Project Gutenberg’s electronic versions, such as the University of Adelaide Library, Australia, and the University of Toronto’s Robarts Library, both accessed via www.archive.org.
As the comparison between the two lists progressed, the texts collected from these sources were all compared against the printed versions in the Dickens Complete Works, National Edition, in terms of the title of each work and its table of contents. This comparison between the electronic texts and the National Edition was to ensure that the final corpus contained precisely the intended contents, with no duplication of either titles or texts. Where the main Project Gutenberg source lacked an electronic text, it was added manually after first being digitised and proofread.
The most prominent examples of texts absent from Project Gutenberg’s e-text database are the introductions that accompany some of the works in the new edition of Dickens’s Complete Works, as shown in the final itemised list (see Appendix 4.2). In completing this task, a precise comparison was conducted on two levels: the first compared the two lists that enumerate the complete works of Dickens, and the second compared the full e-texts sourced from Project Gutenberg with the printed National Edition, in order to prevent identical texts from recurring within the corpus and thereby inflating the word count.
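The two-level comparison described above can be sketched in code. The following is a minimal illustration only: the source does not state that the comparison was automated, and the function names, the title normalisation, and the use of a hash to spot identical texts are all assumptions introduced here.

```python
import hashlib
import re


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and whitespace (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def compare_title_lists(printed, electronic):
    """Level 1: return (titles missing from the e-texts, titles not in the printed list)."""
    printed_norm = {normalise_title(t): t for t in printed}
    electronic_norm = {normalise_title(t): t for t in electronic}
    missing = sorted(printed_norm[k] for k in printed_norm.keys() - electronic_norm.keys())
    extra = sorted(electronic_norm[k] for k in electronic_norm.keys() - printed_norm.keys())
    return missing, extra


def find_duplicate_texts(texts):
    """Level 2: group file names whose whitespace-normalised contents are identical."""
    by_digest = {}
    for name, content in texts.items():
        normalised = " ".join(content.split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

A check of this kind only catches exact repeats after whitespace normalisation; a text duplicated with small editorial differences would still need to be found by manual comparison against the printed volumes.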
The first step towards the construction of the DCC consisted of downloading the complete publicly available set of relevant documents from Project Gutenberg to a computer hard disk. As the available documents related to Dickens are limited in number, the downloading was carried out manually. Following a review, the accuracy of the documents was deemed sufficiently high for inclusion in the corpus. They were then reorganised and classified to match the printed National Edition, forming twenty-four files in total. These files were saved in a format compatible with standard corpus tools. Every effort was made to match the contents of these documents precisely with the printed National Edition of the Complete Works of Charles Dickens, with respect to the thirty-eight of its forty volumes considered here. The files are classified as follows:
Table 4.5 Contents of the DCC precisely matching the National Edition volumes
File no.  Volume              Work
01        I & II              Sketches By Boz, Illustrative Of Every-Day People
02        III & IV            The Pickwick Papers
03        V                   Oliver Twist
04        VI & VII            Nicholas Nickleby
05        VIII & IX           The Old Curiosity Shop
06        X & XI              Barnaby Rudge
07        XII                 American Notes For General Circulation and Pictures From Italy
09        XIV & XV            Martin Chuzzlewit
10        XVI                 Christmas Books
11        XVII                Hard Times and Other Stories
12        XVIII & XIX         Dombey and Son
13        XX & XXI            David Copperfield
14        XXII & XXIII        Bleak House
15        XXIV & XXV          Little Dorrit
16        XXVI & XXVII        Christmas Stories From ‘Household Words’ And ‘All the Year Round’
17        XXVIII              A Tale Of Two Cities
18        XXIX                Great Expectations
19        XXX                 The Uncommercial Traveller
20        XXXI & XXXII        Our Mutual Friend
21        XXXIII              The Mystery of Edwin Drood – And Master Humphrey’s Clock
22        XXXIV               Reprinted Pieces – Sunday under Three Heads
23        XXXV & XXXVI        Miscellaneous Papers – Plays and Poems
24        XXXVII & XXXVIII    Letters and Speeches
Following, where applicable, the principles introduced by Sinclair (2004), information about the texts (i.e. the edition, table of contents, publisher, and introductions by editors) was not included in the plain texts. The text files from Project Gutenberg customarily contain a description of the file. This pertains to the title of the work, the author, the illustrator, the copyright notice (which places virtually no restrictions on the work’s use), the release date, the eBook number, the date it was first posted, its language, and finally the character set encoding. There is also an indication of the original copy from which the work was transcribed, as well as the proofreader’s details. As such descriptions offer no apparent benefit to the study of Dickens’s discourse, they were removed from the text files. The TXT files are intended to represent solely the texts produced by Dickens himself, and any additional details added by editors or publishers have been omitted. This describes how the text files were named, which information was included in each file, and how these text files were stored, that is, the files’ format.
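The removal of the Project Gutenberg description can be illustrated with a short sketch. It is not a record of how the removal was actually carried out, which the text above does not specify. It assumes that each file marks the body of the work with lines beginning "*** START OF" and "*** END OF"; the exact wording of these markers varies between Project Gutenberg files, so the patterns below are an assumption and may need adjusting, and the function name is hypothetical.

```python
import re

# Assumed marker patterns; wording differs between Project Gutenberg releases.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    body_start = start.end() if start else 0
    # Ignore an end marker that would fall before the start of the body.
    body_end = end.start() if end and end.start() >= body_start else len(raw)
    return raw[body_start:body_end].strip()
```

Transcriber credits, tables of contents and editorial introductions that sit between the two markers are not touched by this sketch and would still have to be removed by hand.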
4.6.2 Storing and checking
The texts attributed to Dickens on the Project Gutenberg website that were available in TXT format comprised fifty-five files in total, spanning novels and extracts from various works of Dickens. The TXT format was selected because the majority of text-analysis programmes operate efficiently on texts in this format, which ensures compatibility with text-analysis tools. Each of the main works of Dickens was saved in a separate file, as Shattock (2000) suggests that ‘[i]t is always best to create files at the smallest ‘unit’, since it is easier to combine files in analysis’, and that being ‘stored as individual files rather than as a whole class will allow the most options for analysis’ (Shattock 2000: 32–3).
The convention for naming the files of Dickens’s work follows exactly the published edition of his complete works, that is, the National Edition, as shown in Table 4.5. This resulted in twenty-four files matching the titles and works, as detailed in the DCC’s description of contents in Appendix 4.2. This practice ensures that each file name relates to its content, the title of the work, and the volume number in the National Edition.
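A naming convention of this kind can be made concrete with a small helper. The exact pattern of the DCC file names is not given in the text, so the layout below (file number, volume numerals, title, joined by underscores) is a hypothetical illustration, as is the function name.

```python
import re


def dcc_file_name(file_no: int, volumes: str, title: str) -> str:
    """Build a file name from a Table 4.5 row (hypothetical naming pattern)."""
    # "III & IV" becomes "III-IV"; a single volume such as "V" is left as it is.
    vol_part = re.sub(r"\s*&\s*", "-", volumes.strip())
    # Replace runs of spaces and punctuation in the title with one underscore.
    title_part = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"{file_no:02d}_{vol_part}_{title_part}.txt"
```

For the second row of Table 4.5, `dcc_file_name(2, "III & IV", "The Pickwick Papers")` gives `02_III-IV_The_Pickwick_Papers.txt`, so files sort in National Edition order in an ordinary directory listing.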
To the best of my ability, Dickens’s texts were preserved as they appeared in the printed version of the National Edition, including non-standard spelling and any ungrammatical structures. These non-standard features may be of interest for some aspects of this study or for future studies conducted on this corpus. The final result was the compilation of Dickens’s Complete Corpus (DCC), which has a total of just over six million tokens (6,202,886).
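A token total such as the one reported above depends on how a token is defined, and the text does not say which tool or definition produced the figure. The sketch below therefore assumes a simple definition (a run of letters or digits, optionally joined by internal apostrophes or hyphens) and hypothetical function names; it should not be expected to reproduce 6,202,886 exactly.

```python
import re
from pathlib import Path

# Assumed token definition: alphanumeric runs, allowing internal ' ’ or - joins.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_tokens(text: str) -> int:
    """Count tokens in a string under the assumed definition."""
    return len(TOKEN_RE.findall(text))


def corpus_token_count(directory: str) -> int:
    """Sum token counts over every .txt file in a directory."""
    return sum(
        count_tokens(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.txt"))
    )
```

Under this definition "don't" and "well-known" each count as one token; a tool that splits on apostrophes or hyphens would report a higher total for the same files.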