Release of the MySQL based implementation of the CTS protocol
5 Unique Features
There are four unique features to discuss: the possibility to post-process the passage, the configuration parameter, the generated text inventory, and the possibility of multiple import methods. The following sections explain these features in detail, give examples of use cases, and explain how they fit into the specifications.
5.1 Passage Post Processing
According to (Blackwell and Smith, 2014), the passage "may (…) be further structured or formatted in whatever manner was selected by the editor of the particular edition or translation". This means that CTS does not restrict the content of the passage in any way as long as "The CTS implementation (…ensures…) that including the contents of the requested in the cts:passage element results in well-formed XML" (Blackwell and Smith, 2014)⁶. As long as it does not break the structure of the reply, the passage may be plain text or, for example, text that contains XML tags as literal text, or text with XML tags as meta information describing a part of the text.
The following examples help to illustrate the difference.
6 The cts:passage element is the XML element in the CTS reply that contains the text passage specified by the URN.
a) The tag <speaker> refers to a speaker and must be closed by </speaker>
b) <speaker>Hamlet </speaker>To be, or not to be(...)
While a) should clearly be seen as plain text describing the tag <speaker>, it is reasonable for an editor to prefer the structured output in example b).
If a) is changed to
A) The tag <speaker> refers to a speaker.
it becomes obvious that this probably breaks the structure of the CTS reply, because the <speaker> tag is never closed.
One solution would be to make sure that every document contains only valid XML. This means that you either restrict your text to valid XML or make sure that anything that could break the XML structure is escaped. This results in a lot of work for the editors, since they cannot simply escape the whole text but have to differentiate structural tags used by the CTS (like <chapter>) from meta tags that are part of the text (like <speaker>).
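The difference between the cases above can be checked mechanically. The following is a minimal sketch using only the Python standard library; the helper names `wrap_passage` and `is_well_formed` and the bare `passage` wrapper element are assumptions made for illustration and are not taken from the implementation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_passage(content: str) -> str:
    """Place passage content inside a (hypothetical) passage element."""
    return "<passage>" + content + "</passage>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


plain = "The tag <speaker> refers to a speaker."
structured = "<speaker>Hamlet</speaker> To be, or not to be"

# Unescaped plain text with an unclosed tag breaks the reply.
print(is_well_formed(wrap_passage(plain)))
# Escaping the whole passage keeps the reply well-formed.
print(is_well_formed(wrap_passage(escape(plain))))
# Properly closed structural markup is also well-formed.
print(is_well_formed(wrap_passage(structured)))
```

Escaping everything is safe but also turns the structural tags of case b) into literal text, which is exactly the distinction the editors would otherwise have to make by hand.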
The solution that I propose is to make it possible for the CTS to adapt the content of the passage to the needs of the individual text collection, or even to the needs of the individual viewer or editor. As long as the post-processing method that is used to modify the passage is not changed, the CTS still guarantees a persistent citation. One URN will always result in the same text passage, but the data is presented differently. The CTS does not change the textual content; only its representation (the view on the data) changes.
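A minimal sketch of this idea in Python: the stored passage stays the same, and a post-processing function chosen per request decides how it is presented. The function names and the view names `raw` and `plain` are assumptions for illustration, not the names used by the implementation.

```python
import xml.etree.ElementTree as ET

# The stored passage is never modified; only its presentation changes.
STORED_PASSAGE = "<speaker>Hamlet</speaker> To be, or not to be"


def view_raw(passage: str) -> str:
    """Return the passage with its markup, exactly as stored."""
    return passage


def view_plain(passage: str) -> str:
    """Return only the text content, dropping all XML tags."""
    root = ET.fromstring("<passage>" + passage + "</passage>")
    return "".join(root.itertext())


# Hypothetical registry of post-processing methods.
VIEWS = {"raw": view_raw, "plain": view_plain}


def get_passage(view: str = "raw") -> str:
    """Apply the selected view to the stored passage."""
    return VIEWS[view](STORED_PASSAGE)


print(get_passage("raw"))
print(get_passage("plain"))
```

Both calls refer to the same stored text, so the citation remains stable while the representation differs.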
On the server side, this is no different from the possibility to serve the text in "whatever manner was selected by the editor" (Blackwell and Smith, 2014). In general, this is the same as creating annotated editions of one document, which is already a common method in today's Digital Humanities as, for example, described in (Almas, 2013). Doing this at the CTS level merely automates the process.
On the client side, the benefit is having options. Imagine someone who wants to develop a universal reader for documents in EpiDoc format. It would be very useful to be able to connect to a CTS and request any text in this format without the need to rebuild all the documents and add additional EpiDoc editions. Another reader wants to look up some text, but the edition is heavily annotated, making it hard to read. A view without all the XML tags would probably be welcome.
7 http://folio.furman.edu/projects/citedocs/cts/#client-server-communication
To enable the client to control the format of the passage, the client must be able to specify which configuration should be used. This can be achieved with the configuration parameter, which I discuss in the next section.
5.2 Configuration Parameter
The configuration parameter was added to this implementation to give any client the possibility to adapt the output of the CTS in different ways. Its use is not described in the specifications, but a side note makes it clear that it does not violate them either. One valid example URL⁷ is http://myhost/mycts?configuration=default&request=GetCapabilities. Because this URL is valid, it is allowed to add additional parameters to the requests. It therefore does not contradict the specifications to use such a parameter to let the client configure the CTS, as long as the results are still valid against the specifications. In particular, the CTS must still make sure that the reply results in valid XML and that all of the required information is included.
Multiple parameters can be combined by joining them with "_". For example, the configuration ?configuration=div=true_stats=true combines the parameters div and stats.
The following parameters are currently supported. The default values for each parameter can be defined for every CTS instance. The configuration that the client provides overrides this default configuration.
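The combination rule and the override of instance defaults can be sketched as follows. This is an illustrative Python sketch, not the implementation's code: the function name `parse_configuration` and the default values in `DEFAULTS` are assumptions, and tokens without "=" (such as a named configuration like `default`) are simply skipped here.

```python
# Hypothetical instance-wide defaults; the real defaults are set per CTS instance.
DEFAULTS = {"div": "false", "stats": "false", "formatxml": "true"}


def parse_configuration(value: str, defaults: dict) -> dict:
    """Split a configuration value on "_" and override the defaults."""
    config = dict(defaults)  # copy, so the instance defaults stay untouched
    for part in value.split("_"):
        key, separator, setting = part.partition("=")
        if separator:  # ignore tokens that are not key=value pairs
            config[key] = setting
    return config


config = parse_configuration("div=true_stats=true", DEFAULTS)
print(config)
```

Parameters that the client does not mention keep their default value, which is why `formatxml` is still present in the result.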
Div / Epidoc
The parameters div and epidoc are useful if you want to see the structure of the text passage – for example to render it nicely. div uses a notation with numbered <div> elements and includes the type of the text units as a @type value.
<passage>
  <div1 n="5" type="book">
    <div2 n="1" type="line"> (TEXT) </div2>
  </div1>
</passage>
epidoc uses EpiDoc notation, a variation of TEI/XML.
<passage>
  <tei:TEI>
    <tei:text>
      <tei:body>
        <tei:div n="1" type="song">
          <tei:div n="1" type="stanza">
            <l n="1">(TEXT)</l>
            <l n="2">(TEXT)</l>
          </tei:div>
        </tei:div>
      </tei:body>
    </tei:text>
  </tei:TEI>
</passage>
epidoc is ignored if div is set to true.
Stats
stats does not yet serve a useful purpose, but it illustrates this implementation's flexibility nicely by adding some simple statistics as attributes of the numbered divs. This setting is ignored if div is set to false.
<div3 n="1" type="line" letters="24" tokens="4" avg_tokensize="6">
  (TEXT)
</div3>
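How such attributes could be computed is sketched below in Python. The text does not define the statistics precisely, so the definitions used here are assumptions: tokens are whitespace-separated strings, letters are alphabetic characters, and the average is letters divided by tokens. The function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET


def passage_stats(text: str) -> dict:
    """Count letters and whitespace-separated tokens (assumed definitions)."""
    tokens = text.split()
    letters = sum(1 for ch in text if ch.isalpha())
    average = round(letters / len(tokens), 2) if tokens else 0
    return {"letters": letters, "tokens": len(tokens), "avg_tokensize": average}


def numbered_div(level: int, n: str, unit_type: str, text: str) -> ET.Element:
    """Build a <divN> element that carries the statistics as attributes."""
    attributes = {"n": n, "type": unit_type}
    attributes.update({k: str(v) for k, v in passage_stats(text).items()})
    element = ET.Element(f"div{level}", attributes)
    element.text = text
    return element


div = numbered_div(3, "1", "line", "To be, or not to be")
print(ET.tostring(div, encoding="unicode"))
```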
Escapepassage
escapepassage specifies whether or not the XML content of the passage should be escaped. This is always true if URNs with subpassage notation are requested, to ensure the validity of the reply.
Seperatecontext
If seperatecontext is set to true, then the context that is specified for GetPassage or GetPassagePlus is returned in separate XML elements with the names context_prev and context_next. Otherwise the context is added to the passage and returned inside the passage element.
Formatxml
formatxml configures whether or not the reply should be formatted. Formatted XML is easier to read, but if you want to process it automatically, formatting may not be needed and may reduce the performance of the CTS without any benefit.
Smallinventory
smallinventory reduces the text inventory to a list of <edition> elements with their URNs. I noticed that dealing with lots of documents can result in large text inventories that are hard to parse if all the meta information is included. This meta information may be unnecessary if you only need a list of the documents' URNs.
8 See https://github.com/cite-architecture/ctsvalidator/blob/master/src/main/webapp/testsuites/3-19.xml
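The reduction can be illustrated with a short Python sketch. The exact output format of smallinventory is not shown in the text, so the flat `TextInventory` wrapper used here is an assumption, as is the function name `small_inventory`; the sample input follows the structure of the generated inventory described in section 5.3.

```python
import xml.etree.ElementTree as ET

FULL_INVENTORY = """<TextInventory>
  <textgroup urn="urn:cts:greekLit:tlg0003">
    <groupname>tlg0003</groupname>
    <edition urn="urn:cts:greekLit:tlg0003.tlg001.eng1:">
      <title>History of the Peloponnesian War</title>
      <author>Thucydides</author>
      <contentType>xml</contentType>
    </edition>
  </textgroup>
</TextInventory>"""


def small_inventory(inventory_xml: str) -> str:
    """Keep only the <edition> elements and their URNs."""
    source = ET.fromstring(inventory_xml)
    reduced = ET.Element("TextInventory")
    for edition in source.iter("edition"):
        # Copy the URN only; titles, authors and other metadata are dropped.
        ET.SubElement(reduced, "edition", {"urn": edition.get("urn", "")})
    return ET.tostring(reduced, encoding="unicode")


print(small_inventory(FULL_INVENTORY))
```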
Maxlevelexception
If you set maxlevelexception to true and then specify a level for GetValidReff that is higher than the number of levels that the document 'has left', the CTS returns CTS error 4. Otherwise it returns the URNs up to that level. For example, if your document has two levels, chapter and sentence, and you request GetValidReff with level=100, then the CTS returns error 4 if this is set to true. It returns all the URNs that belong to the given URN if this is set to false.
The validator requires the CTS to return error 4 if you request a level higher than the document provides⁸. However, there is no way of knowing in advance how a document is structured, and GetValidReff is the function that gives you this information. The requirement would therefore force a user to try out levels until they receive an error, which gets more complicated considering that the structure is not fixed for the complete document. Within one document, book 1 may have 3 levels (chapter, passage, sentence) while book 2 may be structured in 2 levels (stanza, line). This means that you can never know whether you can request another level until you have received an error. You can add this information as meta information in GetCapabilities, but CTS does not require this, and this solution would still make it problematic to work with documents containing different citation levels.
In my opinion it is more reasonable to ignore this error and make it optional for validation purposes.
This also fits with the specifications, which note that "The GetValidReff request identifies all valid values for one on-line version of a requested work, up to a specified level of the citation hierarchy" (Blackwell and Smith, 2014)⁹.
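The two behaviours can be sketched in Python as follows. The data structure (URNs grouped by citation depth), the exception class `CTSError` and the function signature are assumptions made for this illustration; only the error code 4 and the two outcomes are taken from the description above.

```python
class CTSError(Exception):
    """Hypothetical exception carrying a CTS error code."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"CTS error {code}: {message}")
        self.code = code


def get_valid_reff(urns_by_level: dict, level: int, maxlevelexception: bool) -> list:
    """Return all URNs up to `level`, or raise error 4 if the level is too high."""
    deepest = max(urns_by_level)
    if level > deepest and maxlevelexception:
        raise CTSError(4, f"level {level} exceeds the {deepest} available levels")
    result = []
    for depth in sorted(urns_by_level):
        if depth <= level:
            result.extend(urns_by_level[depth])
    return result


# A document with two citation levels (chapter and sentence).
DOCUMENT = {
    1: ["urn:cts:demo:a:1", "urn:cts:demo:a:2"],
    2: ["urn:cts:demo:a:1.1", "urn:cts:demo:a:1.2", "urn:cts:demo:a:2.1"],
}

print(get_valid_reff(DOCUMENT, 100, maxlevelexception=False))
try:
    get_valid_reff(DOCUMENT, 100, maxlevelexception=True)
except CTSError as error:
    print(error)
```

With the setting disabled, a client can request a deliberately high level to retrieve the complete reference structure in one call.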
5.3 Dynamically Generated Text Inventory
GetCapabilities returns a text inventory containing all URNs that belong to works or editions. This text inventory is manually edited and serves as an overview of the texts that are part of the CTS and as a guide for the CTS to know which XML tags of a document are part of the citation.
9 http://folio.furman.edu/projects/citedocs/cts/#cts-request-parameters
When working with a large number of documents, it might be problematic to require someone to read all the documents, create citation mappings, collect the meta information for each document, and store it in the inventory file.
While you still have to configure the citation mapping in this implementation, you do not need to do this for every document (you still can if you want). It can be configured in one line for all documents while setting up the CTS. This means that the text inventory is not required to import data, which reduces its purpose to the output of GetCapabilities. According to (Blackwell and Smith, 2014), the response of GetCapabilities is "a reply that defines a corpus of texts known to the server and, for texts that are available online, identifies their citation schemes". This information can be gathered in an automated process once the data is made available to the CTS.
This way a basic default text inventory is generated which contains all the referenceable editions without the need for manual editing. At the time of writing, the label and author of an edition and the information whether or not the edition can be parsed as valid XML are added as meta information. This result is generated with every new request.
The following example shows the content that is currently included in the text inventory.
<TextInventory>
  <textgroup urn="urn:cts:greekLit:tlg0003">
    <groupname>tlg0003</groupname>
    <edition urn="urn:cts:greekLit:tlg0003.tlg001.eng1:">
      <title>History of the Peloponnesian War</title>
      <author>Thucydides</author>
      <contentType>xml</contentType>
    </edition>
  </textgroup>
</TextInventory>
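Generating such an inventory from already imported editions could look like the following Python sketch. It is not the implementation's code: the record layout (a list of dictionaries), the function names and the label `text` for content that does not parse as XML are assumptions; the element names follow the example above.

```python
import xml.etree.ElementTree as ET


def content_type(content: str) -> str:
    """Report whether the edition content parses as XML."""
    try:
        ET.fromstring(content)
        return "xml"
    except ET.ParseError:
        return "text"  # assumed label for non-XML content


def generate_inventory(editions: list) -> ET.Element:
    """Build a TextInventory element from edition records."""
    inventory = ET.Element("TextInventory")
    groups = {}
    for edition in editions:
        # "urn:cts:greekLit:tlg0003.tlg001.eng1:" -> prefix and work component
        prefix, work = edition["urn"].rstrip(":").rsplit(":", 1)
        group_id = work.split(".")[0]
        group_urn = f"{prefix}:{group_id}"
        if group_urn not in groups:
            group = ET.SubElement(inventory, "textgroup", {"urn": group_urn})
            ET.SubElement(group, "groupname").text = group_id
            groups[group_urn] = group
        node = ET.SubElement(groups[group_urn], "edition", {"urn": edition["urn"]})
        ET.SubElement(node, "title").text = edition["title"]
        ET.SubElement(node, "author").text = edition["author"]
        ET.SubElement(node, "contentType").text = content_type(edition["content"])
    return inventory


records = [{
    "urn": "urn:cts:greekLit:tlg0003.tlg001.eng1:",
    "title": "History of the Peloponnesian War",
    "author": "Thucydides",
    "content": "<book><chapter n='1'>(TEXT)</chapter></book>",
}]
inventory = generate_inventory(records)
print(ET.tostring(inventory, encoding="unicode"))
```

Because everything is derived from the stored records, the inventory can be rebuilt on each request without a manually maintained file.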
The citation mapping, as it is used to specify which XML elements are used for citation in the CTS implementation based on an XML database, is not part of the generated inventory because, from my understanding, it is only useful for the data import. My argument is that once you reference texts with URNs, the citation mapping has only descriptive use and is better located in the specific text passage or in the reply of the CTS request GetLabel. If you refer to a passage with a URN like urn:cts:demo:a:1.2, it is not relevant whether the passage 1.2 refers to a sentence, a verse, or a line. Adding it to the text inventory can, however, increase the complexity of the XML document, making the file harder to process. Consider especially that, in theory, every text unit that is referenced by a URN can have its own citation mapping. Mapping one unit to a sentence does not mean that every text unit is a sentence. In the worst case, if citation mappings are included, the text inventory would have to contain one entry for every URN at the level of the text units in the complete text collection.
10 A cronjob collects the files that were changed since the last update via OAI-PMH, and timestamps as part of the URNs guarantee persistency.
By adding a file named inventory.xml, administrators can instead use an inventory that is manually edited. A reasonable workflow is to save the generated inventory as inventory.xml and edit it further to add information manually.
5.4 Multiple Import Methods
The implementation is divided into two parts: one part imports the data into the database and the other part reads the data from the database. This separation makes it possible to plug in new import scripts. At the time of writing, there are three supported ways to import data.
Local import is the default way that this system uses.
CTS cloning makes it possible to clone one CTS. Since it relies on the div configuration, it is currently only compatible with this implementation. In theory, this feature allows community-driven decentralized data backups.
The third method relies on a MyCore installation that was used in the project "A Library of a Billion Words" and therefore might require a specific setup. However, together with this setup and using timestamp-related queries in OAI-PMH, we created a self-updating CTS with support for versioning, and in this way a persistent CTS with editable content¹⁰.
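The separation between the import side and the read side can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation itself: an in-memory SQLite database stands in for MySQL so that the example is self-contained, and the table layout, the `Importer` interface and all function names are hypothetical.

```python
import sqlite3
from typing import Iterable, Optional, Protocol, Tuple


class Importer(Protocol):
    """Anything that yields (urn, content) pairs can act as an import method."""

    def records(self) -> Iterable[Tuple[str, str]]: ...


class LocalImporter:
    """Stand-in for a local import: passages come from an in-memory mapping."""

    def __init__(self, passages: dict) -> None:
        self._passages = passages

    def records(self) -> Iterable[Tuple[str, str]]:
        return list(self._passages.items())


def run_import(conn: sqlite3.Connection, importer: Importer) -> None:
    """Import side: write whatever the importer provides into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS passage (urn TEXT PRIMARY KEY, content TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO passage (urn, content) VALUES (?, ?)",
        importer.records(),
    )
    conn.commit()


def read_passage(conn: sqlite3.Connection, urn: str) -> Optional[str]:
    """Read side: knows only the database, not how the data was imported."""
    row = conn.execute(
        "SELECT content FROM passage WHERE urn = ?", (urn,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
run_import(conn, LocalImporter({"urn:cts:demo:a:1.2": "To be, or not to be"}))
print(read_passage(conn, "urn:cts:demo:a:1.2"))
```

A cloning or OAI-PMH based import would only need to provide another class with a `records` method; the read side stays unchanged.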