Trusting the Edition: Preservation and Reliability of Digital Editions

Scholars seem not to trust digital editions to any great extent. They probably use them online, but then they look to printed version to cite them. Various contributions (Robinson, 2005; Deegan and Sutherland, 2009; Vanhoutte, 2009, to name a few) have presented their explanations of why this is the case, some of which will be examined below. However, it seems most likely that one of the main reasons for this scepticism is that digital editions look much less stable and durable than printed editions.

Printed scholarly editions have long shelf lives. Some texts may be edited every hundred years, and sometimes at even longer intervals. Editing texts is a monumental task, in particular when there is a large number of surviving witnesses, and can take years, if not decades, for a scholar (or a group of scholars) to achieve it. The large amount of work required has somehow encouraged the view that editions may be forever, and that once texts have been edited well they do not need to be edited again.159_{According to Eggert, the work}

of the critical editor, although with differences, is comparable to the conservators of artefacts such as paintings or buildings, and its purpose is ‘securing the past’ (Eggert, 2009). Taking a similar line is Price: ‘Editorial work is one way to engage in historical criticism and to help bring the past into the present so it may live in the future’ (Price, 2009b, § 10).

Digital editions, on the other hand, are subject to rapid obsolescence and potentially to constant updates. The technology is evolving at a spinning rate, as is the taste and the expectations of the users, and this is contrary to the long shelf life view that the scholarly community holds of scholarly editions. The problem is not an isolated one: with more and more data being born digital and more and more data becoming unreachable, the durability of digital objects is a general issue that society is trying to deal with. New professions, new strategies, reliable data formats and standards, good practices, all of these are being deployed across the planet in order to try to tackle the problem of longevity and the preservation of digital assets. Nevertheless, in spite of these efforts, it seems clear that digital curation and preservation is a very costly activity and a continuous one too, to the point that one has to ask if it is indeed realistic to think of a digital product able to match the durability of print products, when the latter could potentially last for centuries, and require little effort once stored on a library shelf.

159_{For instance Deegan and Sutherland (2009, pp. 60-61) present a short collection of overconfident} statements about the definitiveness of scholarly editions.

It is worth noting here, however, that the alleged durability of print products needs to be reassessed. When we say that books have lasted for thousands years, we tend to forget that only a very small portion of them have actually survived. The large majority of the codices produced in the Middle Age are lost, not to mention documents produced in the classical era, for which only fragments remains. For the early modern period, many editions of books have completely disappeared too, and many more are preserved by only a handful of exemplars, saved as they are by the large number in which they were produced. In this case we are reminded by analytic bibliographers that each copy of any given impression could be

potentially different from any other, and so many important versions may have been lost even if some exemplars have survived. In the modern era, acidic paper has also lead to more losses, in addition to the ‘natural’ loss due to fires, other disruptive agents and disinterest. The point is that not all books have survived, but only the ones that people were interested in or were capable of preserving; some books have simply survived by chance. Every

generation has exercised a selection, a choice of the items to be preserved, discarding the ones that have been judged of no interest for the future. Most textual scholars know the frustration of having to deal with glimpses of the stages of transmission of any given work and are aware of how educated guess-work must sometimes supply for deficiencies of traditions. In a sense, textual scholarship is the discipline that has been created in order to deal with the missing documents, reconstructing lost evidence and circumstances.

The other myth to be challenged is the idea that preserving printed books is not an expensive undertaking. Libraries are such established assets of our culture that we tend to take them for granted, and we remember that they are actually quite expensive services only when they are threatened. Indeed the current economic climate has lead to substantial cuts to library budgets and the closure of many of them, and this is because they are expensive. Deegan and Sutherland, quoting Gabriel Zaid, report that in ‘1989 the British Library estimated that, though the books it receives on legal deposit are free’, receiving, cataloguing, storing and in general managing them ‘cost fifty pound per copy. Plus one pound per copy a year’ (2009, p. 160). Even more problematic is space: many libraries are collapsing under the volume of printed books being produced yearly. To face shortage of space, some libraries have started to store books off site (such as, for instance, the British Library), or have built larger, mostly underground, buildings (such as the Bibliotheque Nationale de France), others have digitized and then destroyed some of their belongings (a practice started as a

consequence of microfilming: Tanselle, 1989b, p. 44). Preserving the past is no trivial matter, whatever the format of the artefact that needs preservation; it requires a policy and a vision, a will and a handsome amount of investment. Digital preservation is only one aspect of such an endeavour. It is not denied here that digital preservation is a problem – it is, and it is of the

most serious nature – however, it is important that we do not set our targets too high and that we are not building too high expectations over incorrect premises. It is very clear that we will not be able to preserve everything, no matter how hard we will try; it remains to be

ascertained that preserving everything is actually what we want and a good idea.

8.1 Data and Metadata: Standards, Formats and Preservation

It is then unrealistic to believe that we will be able to preserve all digital objects that we produce: we are now producing every few days as much data as we have produced since the beginning of history, a phenomenon known as ‘data deluge’ so it is not realistic to think that we will be able to preserve every byte we generate. However, there are data and complex digital objects that are more special than others and that we might want to preserve; digital scholarly editions fit this latter category. But how can digital objects be preserved long-term? According to Deegan (2006) a set of good practices can take you a long way. The use of appropriate standards for both data and metadata, as well as separation of content from interface are suggested as key factors for the preservation of data. Deegan calls for the responsibilities of editors in the preservation for their editions: by making the right choices during the preparation of the edition phase, they will make their edition preservable.160_The

reason why the use of standards may improve the durability of one’s edition lies in the fact that the more such standards get used, the more likely that there will be an interest in keeping it alive; and even if the body responsible for the maintenance of such standards will change them or cease to exist. It is assumed that, if there is a sufficient amount of data that complies to this standard, someone, somewhere will find a solution (or many solutions) as to how to migrate the data from one format to the other, as it will seem to be more economical to invest in such an activity than loosing the data or redoing everything from scratch. If a critical mass of data exists in one format, tools will be produced to migrate it. So choosing widely adopted standards is a good idea. However, it is also necessary to choose one’s standard well, that is, to choose standards that are extensively documented not only from a technical point of view (how they work), but also from a semantic point of view (what they mean). The TEI

Guidelines, for instance, represent such a standard; the same can be said for METS,

160_{Digital Preservation and Curation are quite lively disciplines which comprise a large number of} publications. At the moment of writing, the most respected one are Harvey (2012) and Brown (2013). I would like to thank here Gareth Knight and Richard Gartner for pointing me, as always, in the right direction.

PREMIS, MODS, and Dublin Core.161_{While the TEI is at once a format for data and for}

metadata, all the other formats are for metadata only. These have a crucial role in order to ensure a longer and safer life for digital objects, as described by Lavoie and Gartner (2013):

[Preservation metadata] facilitates the process of achieving the general goals of most digital preservation efforts: maintaining the availability, identity, persistence, renderability, understandability, and authenticity of digital objects over long periods of time […] preservation metadata establishes an informational frame of reference around a preserved digital object that remains attached to that object over time.

All the standards mentioned above are open, developed and maintained by public institutions or charitable organisations, a fact that in theory should be a further guarantee of durability. However, the recent ‘shutdown’ of the US government has seriously damaged this belief. On October 1, 2013, the United States federal government entered a shutdown after Congress failed to enact regular appropriations or a continuing resolution for the 2014 fiscal year. As a consequence of this, many non essential federal services were also closed, including the Library of Congress, which however, restored access to their website on the following 3rd_{of October. The consequence of the closure of the LoC’s servers was that an}

undetermined but certainly large number of web services became suddenly unreachable, resulting in the inaccessibility of thousands if not tens of thousands websites worldwide. The reason why this happened is due to one of the characteristics of XML, the language of choice for most of the LoC standards. In order to declare the specific XML vocabulary one is actually adopting, XML files normally include at the very beginning a reference to an URI, a link to a site where the format is documented and maintained; this mechanism is called

namespace. However, if the URI of the namespace is not reachable, all files that contain a

link to such URI when recalled by a web user will return an error instead of the content of the file. This was what happened to all web services that made use of any of the LoC metadata standard. The disruption only lasted for three days, however this was enough to demonstrate the fragility of the allegedly safest of standards and the global interdependency of digital objects.

161_{The first three standards mentioned are developed and maintained by the Library of Congress:} METS (Metadata Encoding and Transmission Standard): <http://www.loc.gov/standards/mets/>; PREMIS (Preservation Metadata): <http://www.loc.gov/standards/premis/>; MODS (Metadata Object Description Standard): <http://www.loc.gov/standards/mods/>. Dublin Core is otherwise an

When it comes to standards that are maintained by commercial companies, their adoption represent a bigger risk, as their support and maintenance will only be guaranteed as long as the company considers them commercially sustainable. In some cases, however, a proprietary format can become an open standard. This is the case, for instance, of one of the most used formats for textual storage, the PDF (Portable Document Format) developed by Adobe.162_{The format was developed in 1993 and its specifications were made available from}

the very beginning, but it remained a proprietary format until 2008, when it became an open standard, certified by ISO.163_{The same route has been taken by Microsoft Office formats,}

that have become ISO standards in 2012.164_{It is worth noticing how both formats have}

become XML-based on the occasion of their opening up as formal standards. XML, in fact, as been defined as the ‘acid-free paper of the digital age’ (Price, 2008, p. 442), because of the fact that it is text-based, human and machine readable without the need of any specific software, and platform independent, all facts that make it suitable for long-term storage.

8.2 Preserving Data, Preserving Interfaces

We have seen the argument according to which the use of an open standard helps with the preservation of a digital object if there is enough critical mass of data to make it viable to create a converter if the standard happens to be abandoned. This argument is germane to the one that says that if an artefact will be perceived to be valuable, it will be taken good care of. These are hardly reassuring arguments for editors who would like to be told that their editions will survive for sure, but these reassurances are hard to give. And even if weak, this type of reassurance comes at a price, and not a cheap one. The price is the interface. In the previous chapter both the scholarly relevance and the problems connected with the political

162_{http://www.adobe.com/uk/products/acrobat/adobepdf.html}

163_{ISO 32000-1:2008 - Document management — Portable document format — Part 1: PDF 1.7".} Iso.org. 2008-07-01

<http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51502> 164_{Office Open XML (also informally known as OOXML or OpenXML – not to be confused with} Open Office XML, the format adopted by Open Office: <http://www.openoffice.org/xml/>) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents and it is recognisable by the suffix ‘x’ and the end of the file extensions (such as, for instance, ‘myfile.docx’). The format was standardized by ISO and IEC (as ISO/IEC 29500): Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 1: Fundamentals and Markup Language Reference.

sustainability of interfaces was introduced, now is the moment to analyse the problem they pose for long-term preservation, namely their technical sustainability. Provided that the text of a digital edition is encoded following an international standard and equipped by a suitable set of metadata, then it should be relatively easy to preserve and, if necessary, to migrate; the same cannot be said for the interface.

Interfaces are the territory where the negotiation of meanings between the producers of the digital resource and the user takes place. However, they are complex and fragile objects. For instance, it should be remembered that the role of the ‘producer’ of a digital resource cannot be attributed to the editors alone, nor to their team of developers. The production of complex digital artefacts relies on several layers of software: the operating system, a set of libraries of utilities, the web server, the browser, and so on; all of these components are mutable, are controlled by different entities, are available in many versions, are competing against each other and present a set of reciprocal compatibilities and

prerequisites. Each digital edition is based on a custom, idiosyncratic selection of these components, and this makes the effort of maintaining more than one edition within the same system a very demanding task. The many editions developed at the Department of Digital Humanities at King’s College London can be taken as a case study. Established in 1991 with the name Centre for Computing in the Humanities, it has contributed to the creation of around ninety digital resources across its more than twenty years of existence,165_{all of them}

created at different stages of maturity of the digital technologies we are now accustomed to use, and using slightly different sets of software, distributed along a continuous line of development. The curation and maintenance of these resources are mammoth tasks, only made possible by modular development that allows the reduction of the number of variables of the software deployed. This effort does not come cheap, and it is made affordable almost entirely by the fact that all these resources have been developed within the same institution and therefore their architecture is more or less familiar.

But what about all the valuable digital resources that have been developed outside a specialized centre for Digital Humanities: how does one maintain them? In 2004 I was technical director of a project called OperaLiber, a digital collection of opera librettos (Pierazzo, 2006). The project was based at the University of Pisa and had been supported by the Italian research council (MIUR) on two consequent occasions, 2002 and 2005. The website has been developed by a company specializing in digital humanities developments,166

165_{See <http://www.kcl.ac.uk/artshums/depts/ddh/research/index.aspx> for list of project produced by} DDH.

using a set of solutions widely used within the DH community;167_{the librettos themselves}

had been encoded in XML-TEI and well documented. The problems started when a web host had to be found. After several unfruitful attempts, the University of Bergamo eventually accepted the task, but the IT team there was, and is, unfamiliar with the set of software deployed. The result is that the search and the browsing of the website does not work anymore, and nobody on site seems to know how to fix the problem; the resource as a consequence is unusable. This sad story is emblematic of the difficulties that are encountered if digital resources are to be kept alive outside the organization that produced them.

The problems encountered by OperaLiber also hint at a further issue that has been only cursorily mentioned above, namely the cost of maintaining the resources. The choice of hosting the resource in a university that does not provide the necessary support is due to the fact that it was the only institution that was able to do it for free. Keeping digital resources alive after they are finished is not trivial in terms of effort and cost. However, research funds only rarely include provisions for the long-term maintenance of the resources, leaving each producer of digital resources to develop their own financial model to support them. In some cases the maintenance is guaranteed by good will and by personal and institutional interest. In other cases, the resources are subscription based, kept behind a pay-wall. Such pay-walls

In document Digital Scholarly Editing: Theories, Models and Methods (Page 183-200)