Finding the Finding Aid: Shifting Archival Description from Documents to Data

Kelly Bolding

Advanced Archival Description December 9, 2014

Introduction: What Is A Finding Aid?

When I tell friends outside of the field that my job is to create finding aids, the next question is inevitably, “What’s that?” Most people outside of a library or academic research setting are not familiar with the term. If I said, instead, that I create groups of linked, dynamic web pages that allow researchers to search through descriptions of the various parts of a collection of archival records, I might get something closer to comprehension. Finding aids have been based on a paper document model for so long that their virtual equivalents were developed in their spitting image, resulting in HTML pages that essentially mirrored their paper forms but with hyperlinks. Most institutions have now adopted Encoded Archival Description (EAD) and have made at least some, if not most, of their collection descriptions available online, through either local or cross-institutional discovery portals. While this online accessibility is obviously a great improvement upon paper inventories found only in the reading room of the institution holding the records, EAD in this context does little with its structured data elements beyond using element tags to drive formatting. Several single-view or component-view finding aid delivery systems, such as those at Princeton University, the New York Public Library, and Brigham Young University, have begun a trend towards interfaces that break up long archival descriptions into their component parts for a better user experience and employ faceted searching that allows users to browse and limit archival descriptions based on their underlying EAD structure.

Despite these advances, archival description continues to pose problems for archivists and users alike. For one, many archivists struggle to develop the technical skills necessary to maintain highly structured EAD data and to convert idiosyncratic legacy data, which may not even conform to current data content standards, into valid EAD. Furthermore, once archivists do encode finding aids in EAD, many institutions lack the resources to construct the sophisticated web applications and discovery interfaces required to do much more with structured data than allow full-text searching of HTML documents. Despite all of these efforts to promote online accessibility and interoperability, users still struggle to navigate and understand finding aids in virtual form, as finding aids often fail to meet expectations created by other web content and do not always meet users’ information needs. Without sacrificing basic archival principles, archivists must rethink the way they create and display finding aids in an online environment.

In this paper, I will first draw on insights from published user studies on finding aids to explore the conflict between the subject-driven, information-seeking behaviors of users and the hierarchical, provenance-oriented model of traditional archival description, and to show how EAD can help archivists meet user needs without betraying archival principles. Next, I will delve into the value of EAD finding aids as structured data, arguing for a movement away from a document-centric model and towards a data-centric model of archival description. Finally, I will discuss several major critiques and shortcomings of EAD and suggest other structured data models that might eventually replace it.

User Experience of Finding Aids in a Digital World

In a 2009 OCLC report, Jennifer Schaffner asserts, “Now that we no longer control discovery, the metadata that we contribute is critical. In so many ways, the metadata is the interface.”¹ Her emphasis on the importance of rich, accurate metadata suggests that, in a world where researchers are expert Googlers, they expect searching for archival materials to be similar to searching for information on the open web. Furthermore, now that contemporary users most often arrive at online archival descriptions through a web search rather than through a library catalog, archivists have all the more reason to provide for seamless movement between external web search engines and finding aids in order to promote accessibility, especially for users who may be unfamiliar with traditional archival research methods. Without an archivist physically present to mediate remote users’ understanding of archival descriptions during the search and discovery process, we must rely much more heavily on the thoroughness and comprehensibility of our descriptive metadata. Archivists and archival institutions are therefore justified in devoting more time and resources to developing metadata standards and interfaces.

Following an outcry over the lack of usability studies on finding aids in the mid-2000s,² several trends can be extracted from the user studies that appeared in major archival journals over the following years. These include a lack of understanding of hierarchical description, confusion due to the inclusion of administrative information within description, an expectation of item-level description and a complete digital surrogate rather than an aggregate representation of collections, and a desire for subject access over access by provenance.³ Many of these user desires go against long-held traditions and principles of archival description, challenging archivists to adapt to the needs of users without sacrificing their duty to preserve the context, original order, and administrative history of records. While some might argue that rethinking archival principles and access tools in order to account for these needs indulges a flavor-of-the-month model of archival description, we should instead look to new technologies that allow us to have it both ways.

1 Jennifer Schaffner. The Metadata Is the Interface: Better Description for Better Discovery of Archives and Special Collections. Report produced by OCLC Research, 2009.

2 Lisa R. Coats. “Users of EAD Finding Aids: Who Are They and Are They Satisfied?” Journal of Archival Organization, Vol. 2, Issue 3 (2004): 25-39.

As Anne Gilliland-Swetland noted as early as 2001 regarding EAD, “Once the appropriate elements have been completed and their structural interrelationships… are established and documented, the content of those elements can be physically rendered in any number of ways, according to user needs or preferences. This makes it possible to generate a variety of alternate user views of the metadata held within the archival information system.”⁴ I think that the most powerful aspect of EAD lies in the quality that Gilliland-Swetland mentions here: its internal structure, unlike the fixity of a printed document, allows for the recombination of elements to support multiple discovery and retrieval strategies while preserving, though not necessarily always displaying, hierarchical structure and context. Unfortunately, in 2014, few discovery interfaces make use of this aspect of EAD, and those that do could likely do so to a greater extent.

3 J. Gordon Daines III & Corey L. Nimer, "Re-Imagining Archival Display: Creating User-friendly Finding Aids," Journal of Archival Organization, Vol. 9, Issue 1 (2011): 4-31.

4 Anne J. Gilliland-Swetland, “Popularizing the Finding Aid: Exploiting EAD to Enhance Online Discovery and Retrieval in Archival Information Systems by Diverse User Groups,” Encoded Archival Description on the Internet (Binghamton, N.Y.: The Haworth Information Press, 2001): 199-225.

Archivists at Brigham Young University developed a single-level finding aid display in 2010 to address the demonstrated need to create user-friendly “systems that provide access to individual items from archival collections,” and at the same time, “contextualize the description and facilitate browsing” through a contextual navigation pane.⁵ Princeton University’s finding aids website allows users to browse and view single-component descriptions that are linked back to higher levels of description through a left-side navigation tree. Users unhappy with this model can click a button at the top of the screen to display a typical HTML view of the finding aid in list form. Researchers who are accustomed to document-based finding aids can click another button to generate a printable PDF version. Archivists, digital humanists, programmers, and others crazy enough to want to view the raw EAD file can append the XML file extension to the end of the URL to view the source document. The key here is supporting flexibility of display to offer multiple avenues of entry for different types of users.

Daines and Nimer’s strongest critique of the traditional finding aid is its “dual inheritance as a management and access tool,” providing administrative information not directly relevant to users that only adds to their confusion when searching for specific content-based information.⁶ While I agree that some administrative information may not need to be displayed prominently in online finding aids, I disagree that finding aids cannot and should not serve this dual purpose. For them to do so effectively, however, we need to think of finding aids as data sets rather than as documents, an idea I will return to at greater length later on. I also disagree that these two types of metadata should be segregated on principle, when we have systems that allow data to live in one place, readily accessible but displayed in a given context only when it is required. Keeping Gilliland-Swetland’s insights in mind, we should remember that, unlike a text document that displays its entire content, EAD is capable of including information that is not displayed in public views but remains readily available to archivists who may require it for administrative or reference purposes. The @audience attribute, when set to a value of “internal,” is intended to allow for the inclusion of administrative content that a production stylesheet could recognize and hide from the public display unless accessed from a staff username.⁷ If we can take advantage of the ways in which EAD is different from a paper document, we can begin to address the conflict between traditional archival principles and contemporary user needs.

5 Daines and Nimer, 8.

6 Daines and Nimer, 6.
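
The behavior of the @audience attribute can be illustrated with a minimal sketch. It is written in Python rather than XSLT, and it stands in for, rather than reproduces, any production stylesheet. The sample record, the note it contains, and the function name public_view are all hypothetical, and the fragment assumes EAD 2002 without a namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD 2002 fragment (no namespace). The <processinfo> note is
# marked audience="internal", so it is meant for staff only.
EAD_SAMPLE = """
<archdesc level="collection">
  <did>
    <unittitle>Jane Doe Papers</unittitle>
    <unitdate normal="1900/1950">1900-1950</unitdate>
  </did>
  <scopecontent><p>Correspondence and diaries.</p></scopecontent>
  <processinfo audience="internal">
    <p>Box 3 awaiting conservation review.</p>
  </processinfo>
</archdesc>
"""


def public_view(element):
    """Remove, in place, every descendant marked audience="internal"."""
    for child in list(element):  # copy the list: we mutate while iterating
        if child.get("audience") == "internal":
            element.remove(child)
        else:
            public_view(child)
    return element


# The same source data yields two views: one for staff, one for the public.
staff_tree = ET.fromstring(EAD_SAMPLE.strip())
public_tree = public_view(ET.fromstring(EAD_SAMPLE.strip()))

print(ET.tostring(public_tree, encoding="unicode"))
```

The point of the sketch is that the administrative note never leaves the source file; only the view changes depending on who is looking.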

While it may seem surprising to historians and archivists, some users are simply not interested in using archival materials for traditional research purposes. Sometimes users need a photograph of a specific subject, and may even want that photograph to have a certain predominant color or quality; the photographer or creator is likely of no interest at all to this sort of user. While this is a somewhat extreme example, we should encourage uses of archives that may be more artistic or exploratory than purely academic. While it is archivists’ job to make sure context is preserved over time for users who want to use records for their evidential value, it is not our job to prevent other types of use. One issue that comes up again and again in user studies is that many researchers are much more interested in subject access to materials than they are in access by provenance, which is the traditional main entry point for archival resources. As Schaffner bluntly asserts in her compilation of decades’ worth of user studies, “For thirty years, people have reported that they want to discover archival materials using subject information.”⁸ While the inclusion of controlled access terms in EAD finding aids and software that enables full-text searching are obviously crucial to providing subject access, researchers are still fairly limited in that they can usually browse for subjects successfully only once they are already in a single finding aid that they first had to locate by provenance. By adopting a discovery interface that incorporates faceted searching, such as Blacklight,⁹ where users can limit their browsing and searching not only by provenance but also by physical format, time period, subject, or other category, archivists can provide much richer subject-based access to collections without compromising provenance, which can remain present through a contextual navigation pane on each single-level component that displays in the search results.

7 “EAD Attributes: General Attributes” Encoded Archival Description Tag Library, Version 2002. Accessed December 7, 2014. http://www.loc.gov/ead/tglib/att_gen.html
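
As a rough illustration of the idea behind faceted browsing, the following Python sketch counts and filters component-level records by subject while keeping each record's collection (its provenance) attached. It does not use or imitate Blacklight's actual interface; the records, field names, and function names are all invented for the example.

```python
from collections import Counter

# Hypothetical component-level records, as a discovery index might hold them
# after extraction from EAD. All titles and field names are illustrative.
COMPONENTS = [
    {"title": "Letters to Mary Smith", "collection": "Jane Doe Papers",
     "format": "Correspondence", "subjects": ["Suffrage", "Family life"]},
    {"title": "Rally photographs", "collection": "Jane Doe Papers",
     "format": "Photographs", "subjects": ["Suffrage"]},
    {"title": "Meeting minutes", "collection": "Example League Records",
     "format": "Minutes", "subjects": ["Suffrage", "Labor"]},
    {"title": "Family album", "collection": "John Roe Papers",
     "format": "Photographs", "subjects": ["Family life"]},
]


def values_of(record, field):
    """Return a field's values as a list, whether stored singly or as a list."""
    value = record[field]
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    counts = Counter()
    for record in records:
        counts.update(values_of(record, field))
    return counts


def limit(records, field, wanted):
    """Keep only the records whose field includes the wanted value."""
    return [r for r in records if wanted in values_of(r, field)]


# A user starts from a subject, not from a collection name...
hits = limit(COMPONENTS, "subjects", "Suffrage")
# ...and provenance is still available as context for every hit.
by_collection = facet_counts(hits, "collection")
```

Here a search that begins with the subject "Suffrage" cuts across three collections' worth of records, yet each result still knows which collection it belongs to, which is the contextual information a navigation pane would display.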

Finding Aids as Data

Just as bibliographic description and MARC 21 were forced to escape the card catalog model in order to come online, archival description and EAD must escape the document model if archivists hope to meet any of the user needs discussed above. Following from this reconceptualization, we need to stop thinking about finding aids as headings, paragraphs, and lists, and start thinking about them as collections of discrete units of metadata. Since the fixity of a document is no longer a constraint, we should take advantage of the capacity of EAD finding aids to hold complex data that can be displayed flexibly and in multiple formats in order to accommodate a variety of uses and users. It would be a shame to view the primary value of EAD simply as a way to publish finding aids online; there are much easier methods of creating an online document than encoding a body of text in a highly structured set of tags, running it through one or more stylesheets to display as rendered HTML online, and then running another program to make it full-text searchable. If that were EAD’s only offering, I would be the first to argue that encoding finding aids this way would be a waste of many small repositories’ time and resources. Fortunately, although it may not look that way from surveying the existing online finding aids of many repositories, EAD can do much more than this, as long as we encode data granularly and consistently and continue to develop programs that make use of its internal structure. In support of rich encoding practices, Anne Gilliland, writing on the broader topic of metadata, asserts, “the more highly structured an information object is, the more that structure can be exploited for searching, manipulation, and interrelating with other information objects.”¹⁰ In a nutshell, the more structure we provide for our data, the more functionality we can get out of it for our users.

8 Schaffner, 6.

9 Blacklight. Accessed December 7, 2014. http://projectblacklight.org/

One of the major critiques of EAD is its looseness as a DTD despite being structured data. Even when EAD is used in tandem with DACS and other content standards, so many elements and attributes are optional or variable that institutions often end up with starkly different EAD files, a result that poses problems both for interoperability and for repurposing and displaying data. Institutions often base data entry and encoding practices somewhat arbitrarily on quirks of their particular XSLT production stylesheet, rather than deciding how best to encode their data logically and then developing a functional stylesheet from there. Part of the problem is that it is very difficult to escape the document model. One example of this failure to leave the document model behind is institutions making decisions about EAD encoding practices based on desired formatting: for example, leaving commas at the end of the value of <unittitle>, or even worse for machine processing, leaving dates inside of <unittitle> so that they will display on the same line as the title. The better practice is to encode dates in a separate <unitdate> element, so that they can be searched as discrete units of data, and to write the stylesheet to display them as a text string following the title. If we adopt the web design model of semantic HTML, which separates formatting information from semantic information about data and its meaning, we can change what data displays online and how it looks simply by changing our XSLT stylesheet once for each update, rather than having to change hundreds or thousands of EAD files individually. If we want to apply “minimal” processing to the technical side of archival description, adopting this as a best practice would be a significant step towards efficiency in our encoding and data management practices.

10 Anne J. Gilliland, “Setting the Stage,” Introduction to Metadata 3.0. The Getty Research Institute, 2008. http://www.getty.edu/research/publications/electronic_publications/intrometadata/setting.html
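
The separation of data from display described above can be sketched briefly. The example below uses Python as a stand-in for an XSLT stylesheet; the sample <did>, the function name display_title, and the choice of separators are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical <did>: the title carries no trailing comma and no date;
# the date lives in its own element with a machine-readable normal attribute.
DID_SAMPLE = """
<did>
  <unittitle>Diary</unittitle>
  <unitdate normal="1912/1914">1912-1914</unitdate>
</did>
"""


def display_title(did, separator=", "):
    """Join title and date for display; punctuation is decided here, not in the data."""
    title = did.findtext("unittitle", default="").strip()
    date = did.findtext("unitdate", default="").strip()
    return separator.join(part for part in (title, date) if part)


did = ET.fromstring(DID_SAMPLE.strip())

# One change to the display rule restyles every finding aid at once...
default_line = display_title(did)
piped_line = display_title(did, separator=" | ")
# ...while the date stays available as a discrete, searchable value.
searchable_date = did.find("unitdate").get("normal")
```

Because the comma belongs to the display function rather than to the <unittitle> value, changing house style means editing one line of code instead of every encoded title.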

The other part of the problem, which is also related to efficiency, is the lack of technical skills in the archival community necessary to write and edit XSLT stylesheets and perform the other complex data querying and transformation tasks required for the ongoing management and maintenance of EAD data. It is understandable that many archivists are underprepared for these tasks, given that so many other skill sets are simultaneously required of them. Those who entered the field because they studied English literature or enjoy organizing old historical documents do not often moonlight as computer programmers; however, from the look of things, they need to start. For one, if archivists can be more technically self-sufficient, they can rely less on systems staff, who may have a lesser understanding of archival principles or researchers’ needs, to develop the software that is so much in demand in order to make use of EAD’s structure to promote better access to collections. We can start by not leaving the encoding of finding aids and their ongoing maintenance to one staff member at a repository, but rather involving everyone in the process of putting content online. Knowing which data elements to include in an archival description is important, but just as important is knowing where to put those elements in a structured data environment so that machines, and thus users, can best find them when searching online.

Admittedly, languages like XSLT and XQuery are time-consuming and challenging to learn; however, the time spent learning them is worthwhile, especially when using them to automate certain tasks eliminates many of the rote data entry and clean-up activities required to make archival descriptions optimally available online. As Yale University archivist Maureen Callahan argues convincingly in a blog post about humanists taking responsibility for more technical work, “At first, it may take a lot more time to do something with a script than to do it by hand. But once you’ve done it with a script, you’ve actually learned something. If you do it by hand, you’re stuck in the same learning mode that you were before.”¹¹ In exchange for the time spent developing a computational solution to a data problem, you have not only fixed the finding aid you were working on, but you have also gained a tool that you can reapply the next time you run into a similar problem, a practice that would seem very much in keeping with the impetus towards efficiency of MPLP (“More Product, Less Process”). These skills can be repurposed again and again. For example, say you learned XSLT to develop a stylesheet to display your institution’s finding aids online. The same language, put to a different purpose, can be used to reformat existing EAD files and elements in order to keep up to date with new data content and encoding standards, convert Excel spreadsheets from donors into EAD, and even “scrape” formatting from older finding aids in Word documents to create EAD without having to rely on extensive copy-pasting. Time saved later can be spent doing the complex and rewarding parts of descriptive work, rather than changing the formats of dates by hand or copy-pasting pieces of a legacy finding aid into an EAD template. As Callahan puts it, if we learn a little bit of code, we can “let the robots do the boring stuff.”

11 Maureen Callahan, September 30, 2014, “Computational Thinking and Archives,” Chaos → Order, Accessed December 7, 2014. https://icantiemyownshoes.wordpress.com/2014/09/30/computational-thinking-and-archives/
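
To give a sense of what letting “the robots do the boring stuff” might look like, here is a small sketch of one such clean-up task: adding machine-readable @normal values to simple year and year-range dates. It is written in Python rather than XSLT or XQuery, the sample data and the function name add_normal_dates are hypothetical, and it deliberately handles only the easiest cases, leaving anything ambiguous for a person to review.

```python
import re
import xml.etree.ElementTree as ET

# Matches "1920" or "1912-1914" (optionally with spaces around the hyphen).
YEAR_OR_RANGE = re.compile(r"^\s*(\d{4})\s*(?:-\s*(\d{4}))?\s*$")

# Hypothetical legacy data: dates were typed as display strings only.
LEGACY_SAMPLE = """
<dsc>
  <c><did><unittitle>Diary</unittitle><unitdate>1912-1914</unitdate></did></c>
  <c><did><unittitle>Ledger</unittitle><unitdate>1920</unitdate></did></c>
  <c><did><unittitle>Snapshots</unittitle><unitdate>circa 1930s</unitdate></did></c>
</dsc>
"""


def add_normal_dates(root):
    """Add a normal attribute to each <unitdate> that is a plain year or range.

    Returns the number of elements changed. Dates the pattern does not
    recognize (such as "circa 1930s") are left untouched for manual review.
    """
    changed = 0
    for unitdate in root.iter("unitdate"):
        if unitdate.get("normal"):
            continue  # already normalized; do not overwrite
        match = YEAR_OR_RANGE.match(unitdate.text or "")
        if not match:
            continue
        start, end = match.group(1), match.group(2)
        unitdate.set("normal", f"{start}/{end}" if end else start)
        changed += 1
    return changed


root = ET.fromstring(LEGACY_SAMPLE.strip())
changed = add_normal_dates(root)
normals = [u.get("normal") for u in root.iter("unitdate")]
```

Once written, the same function could be pointed at a whole directory of finding aids, which is precisely the kind of reuse Callahan describes.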

Additionally, working closely with archival metadata forces archivists to think deeply about the ways in which computers interpret the description they contribute to online access tools, and thus about the ways in which users directly interact with this metadata. As Gregory Wiedeman argues in his brief but informative overview of XQuery for archivists, “Through learning and making use of XQuery to process EAD finding aids and other XML data, archivists will better understand how that data may be used, and in turn, better understand how to store and manage it.”¹² Because learning these languages forces us to think about finding aids as data, knowing them puts archivists in a better position to develop innovative solutions to increase access to their collections. Furthermore, looking further down the road for the archival profession, the next major task for archivists is to wrangle the great bulk of born-digital primary source content that has been amassing while we have been struggling to adopt systems that will allow us to appraise, ingest, preserve, and provide access to content in digital formats. Dealing with born-digital archives will require all of our knowledge of archival and appraisal theory, along with a great deal of technical know-how and computer skills. Any skills that we learn in the course of creating good structured data finding aids to promote better online access will help us down the road when we are processing born-digital materials that can only be read in their native form on a computer and will require scripting to interpret, authenticate, preserve, and describe.

12 Gregory Wiedeman, “XQuery for Archivists: Understanding EAD Finding Aids as Data,” Practical Technology for Archives, Issue 3, November 2014. http://practicaltechnologyforarchives.org/issue3_wiedeman/

Is EAD Dead?

After arguing for the importance of good encoding practices and technical skills for archivists, it may seem strange to claim that EAD is a dying standard. To be clear, EAD is undoubtedly a step in the right direction; however, its theoretical model and looseness of structure will not ultimately stand up to the task at hand if archivists intend to fully address the results of user studies on archival finding aids. For one, despite the fact that EAD is a great improvement on unstructured data, it is notoriously loose for a data encoding standard. The EAD schema only assures that elements are nested properly and that encoding syntax is correct, while institutions are left on their own to validate their data content against DACS. Unless institutions write their own content validation schemas, there are very few controls on what information shows up in between tags. While EAD’s flexibility in allowing for a variety of displays and presentations is a good thing, its flexibility regarding what archivists can enter into many of its tags does not foster consistency, which is necessary for the flexible but accurate display of information. While many elements of description will always be free text, controlled access terms, normalized dates, identifiers, and fields with a set number of options could be much more closely controlled in a relational database context.
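
The gap between structural validity and content validity can be made concrete with a short sketch. The file below would pass a check that only looks at nesting and syntax, yet it breaks two local content rules. Both rules (no trailing comma in <unittitle>; every <unitdate> must carry a @normal value of the form YYYY or YYYY/YYYY) are hypothetical examples of the kind of rule an institution might write for itself, not requirements quoted from DACS or the EAD schema, and the function name content_problems is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical local rule: normal must look like "1920" or "1912/1914".
NORMAL_DATE = re.compile(r"^\d{4}(/\d{4})?$")

# Well-formed and properly nested, but the second component has content problems.
SAMPLE = """
<dsc>
  <c><did><unittitle>Diary</unittitle><unitdate normal="1912/1914">1912-1914</unitdate></did></c>
  <c><did><unittitle>Letters,</unittitle><unitdate>undated</unitdate></did></c>
</dsc>
"""


def content_problems(root):
    """Return human-readable messages for each <did> that breaks a local rule."""
    problems = []
    for number, did in enumerate(root.iter("did"), start=1):
        title = (did.findtext("unittitle") or "").strip()
        if not title:
            problems.append(f"did {number}: empty or missing unittitle")
        elif title.endswith(","):
            problems.append(f"did {number}: unittitle ends with a comma")
        unitdate = did.find("unitdate")
        if unitdate is None:
            problems.append(f"did {number}: missing unitdate")
        elif not NORMAL_DATE.match(unitdate.get("normal", "")):
            problems.append(f"did {number}: unitdate lacks a usable normal attribute")
    return problems


problems = content_problems(ET.fromstring(SAMPLE.strip()))
for message in problems:
    print(message)
```

A repository would need to decide its own rules, for example how undated material should be recorded, but even a short script like this catches inconsistencies that structural validation alone lets through.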

Pointing to another barrier to using EAD to broaden access for remote users, Kathleen Feeney’s study concluded in 1999 that online search results did not often lead to archival resources,¹³ and Chris Prom admitted in 2002 that “we do not know how useful EAD metadata actually is for complex search and retrieval” and recommended that EAD be blended with the Open Archives Initiative protocols in order to better facilitate interoperability.¹⁴ While, these days, finding aids from major repositories do come up in Google search results much of the time, institutions still struggle to have their supposedly more accessible single-level pages indexed, as search engines currently often index collection-level pages only, meaning that much of the lower-level content institutions are generating is not making it into search results. Migrating to another system that uses linked data and RDF may help solve this problem in the future.

13 Kathleen Feeney, “Retrieval of Archival Finding Aids Using World-Wide-Web Search Engines,” American Archivist 62 (Fall 1999): 206-228.

Furthermore, EAD alone will never be able to fully push archival description out of the realm of the document, because EAD uses XML in a document-centric way: it essentially takes a regular text document and marks it up with semantic tags to distinguish discrete data elements. The good news is that the structuring of text documents is the first step towards getting all of this archival descriptive data into a relational database or other data model that would be better suited to it. EAD was originally built on the assumption that traditional finding aids were sufficient, rather than challenging their document-based model from the start, and thus it is difficult to extend the standard to embody a model that differs greatly from traditional finding aids. While Archivists’ Toolkit and ArchivesSpace make use of the relational database model, making the creation of finding aids easier for archivists, they still rely on EAD as the canonical data form, and at this point in their development neither provides enough features for data controls, content validation, or sophisticated back-end querying and batch data manipulation to be safely relied on as much more than an EAD authoring tool.

14 Christopher Prom, “Does EAD Play Well with Other Metadata Standards? Searching and Retrieving EAD Using the OAI Protocols,” Journal of Archival Organization, Vol. 1, No. 3 (2002): 51-72.

Ideally, making changes to one or many finding aids should be more like updating a database than editing a document, with the benefit that data changed in one location is updated everywhere it is linked rather than creating conflicts, fostering greater consistency and accuracy. Additionally, less experienced processors and student workers could create and instantly publish archival descriptions without needing a great deal of technical expertise, while controls and restrictions on what information goes in which fields could prevent errors from making it into the system in the first place. There would also be less of a need for an external version control system, as the database could independently prevent conflicts. EAC-CPF also starts to make much more sense in a relational database context, with archival authority records appearing as just another “table” in a database that includes a parallel table of archival descriptions of collections. The two could be linked and simultaneously updated as needed across the database, without having to deal with two entirely separate data standards, as EAD and EAC-CPF currently are. There could also be room for another linked table of archival descriptions of functions, as Larry Weimer makes a case for in his discussion of EAC-CPF.¹⁵ With the right linked data elements in a well-designed database, archival descriptions could be exported in any number of formats, including HTML, XML, MARC 21, RDF, or linked open data.¹⁶ While EAD has helped to get archival description to this point, I hope that something more along these lines will be the next step.
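
A toy version of this relational arrangement can be sketched with Python's built-in sqlite3 module. The two-table design, the column names, and the list of permitted levels are assumptions invented for illustration; they do not describe the actual schema of Archivists’ Toolkit, ArchivesSpace, or any other system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces links only when asked
conn.executescript("""
CREATE TABLE agents (              -- stands in for authority records
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE descriptions (        -- stands in for archival descriptions
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    level      TEXT NOT NULL
               CHECK (level IN ('collection', 'series', 'file', 'item')),
    creator_id INTEGER NOT NULL REFERENCES agents(id)
);
""")

conn.execute("INSERT INTO agents (id, name) VALUES (1, 'Doe, Jane')")
conn.executemany(
    "INSERT INTO descriptions (title, level, creator_id) VALUES (?, ?, ?)",
    [("Jane Doe Papers", "collection", 1), ("Correspondence", "series", 1)],
)

# Correct the authority record once; every linked description reflects it.
conn.execute("UPDATE agents SET name = ? WHERE id = 1", ("Doe, Jane, 1880-1950",))
rows = conn.execute(
    "SELECT d.title, a.name FROM descriptions d "
    "JOIN agents a ON a.id = d.creator_id ORDER BY d.id"
).fetchall()


def try_insert(title, level, creator_id):
    """Attempt an insert; return False if the database's own controls reject it."""
    try:
        conn.execute(
            "INSERT INTO descriptions (title, level, creator_id) VALUES (?, ?, ?)",
            (title, level, creator_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch the creator's name is stored once, so a single correction reaches both descriptions, and the database itself refuses an unrecognized level or a link to a creator that does not exist, which is the kind of control at the point of data entry that the paragraph above envisions.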

15 Larry Weimer, "Pathways to Provenance: DACS and Creator Descriptions," in Respect for Authority: 33-48.

16 LiAM: Linked Archival Metadata. http://sites.tufts.edu/liam/

Bibliography

Coats, Lisa R. “Users of EAD Finding Aids: Who Are They and Are They Satisfied?” Journal of Archival Organization, Vol. 2, Issue 3 (2004): 25-39.

Daines III, J. Gordon, and Corey L. Nimer. “Re-Imagining Archival Display: Creating User-friendly Finding Aids.” Journal of Archival Organization, Vol. 9, Issue 1 (2011): 4-31.

Feeney, Kathleen. “Retrieval of Archival Finding Aids Using World-Wide-Web Search Engines.” American Archivist 62 (Fall 1999): 206-228.

Gilliland, Anne J. “Setting the Stage.” Introduction to Metadata 3.0. The Getty Research Institute, 2008. http://www.getty.edu/research/publications/electronic_publications/intrometadata/setting.html

Gilliland-Swetland, Anne J. “Popularizing the Finding Aid: Exploiting EAD to Enhance Online Discovery and Retrieval in Archival Information Systems by Diverse User Groups.” Encoded Archival Description on the Internet (Binghamton, N.Y.: The Haworth Information Press, 2001): 199-225.

Schaffner, Jennifer. The Metadata Is the Interface: Better Description for Better Discovery of Archives and Special Collections. Report produced by OCLC Research, 2009.

Weimer, Larry. “Pathways to Provenance: DACS and Creator Descriptions.” Respect for Authority: 33-48.

Wiedeman, Gregory. “XQuery for Archivists: Understanding EAD Finding Aids as Data.” Practical Technology for Archives, Issue 3, November 2014. http://practicaltechnologyforarchives.org/issue3_wiedeman/