

4.1.2 The Text Encoding Initiative Guidelines

Another standard for representing texts in digital form is that of the Text Encoding Initiative (TEI), a consortium which develops a set of Guidelines specifying encoding methods for machine-readable texts. Like DocBook, it was originally rooted in SGML³ and is at present developed in XML. It is mainly oriented toward annotating and encoding literary documents in the humanities, social sciences and linguistics. According to the TEI website⁴, “the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation.”

The TEI differs from DocBook and takes another approach to annotating documents. One reason is that DocBook is designed specifically for documentation of computer hardware and software, such as manuals, whereas the TEI concentrates on literature. The TEI Guidelines are modular: they define a number of modules, each of which declares particular XML elements and attributes. Using this modular approach, one can construct a customised TEI schema from any combination of modules. However, some TEI modules are core modules and are mandatory in every customised schema. A main example of a customised TEI schema is TEI Lite⁵. Any customised TEI schema is built from TEI element tags, which can be used to annotate a document. Such an annotated document can later be validated using XML tools, as with DocBook.

Each document annotated using a TEI schema is either a single document, where the first element tag is <TEI>, or a collection of documents, where the first element tag is <teiCorpus>.

    <teiCorpus>
      <teiHeader>...
      <TEI>
        <teiHeader>...
        <text>...
      <teiCorpus>
        <teiHeader>...
        <TEI>
          <teiHeader>...
          <text>...
        <TEI>
          <teiHeader>...
          <text>...

Listing 4.2: The TEI example presenting the main document or a collection of documents.

For readability and brevity, we show only the opening tag of each XML element and use indentation to express nesting.

³ SGML stands for Standard Generalized Markup Language

⁴ http://www.tei-c.org/index.xml
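
The distinction between the two kinds of root element can be checked mechanically. The following is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree. The function name describe_root and the two skeletal sample documents are assumptions made for illustration; the samples are well-formed XML but far too bare to be valid TEI. The namespace URI is the one used by TEI P5 documents.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace of TEI P5 documents


def describe_root(xml_source: str) -> str:
    """Say whether the source is a single TEI document or a corpus (hypothetical helper)."""
    root = ET.fromstring(xml_source)
    if root.tag == f"{{{TEI_NS}}}TEI":
        return "single document"
    if root.tag == f"{{{TEI_NS}}}teiCorpus":
        # iter() walks the whole tree, so documents inside nested corpora are counted too
        count = sum(1 for _ in root.iter(f"{{{TEI_NS}}}TEI"))
        return f"corpus of {count} document(s)"
    return "not a TEI document"


# Skeletal samples, assumed for illustration only (empty header and text)
single = f'<TEI xmlns="{TEI_NS}"><teiHeader/><text/></TEI>'
corpus = (
    f'<teiCorpus xmlns="{TEI_NS}"><teiHeader/>'
    "<TEI><teiHeader/><text/></TEI>"
    "<TEI><teiHeader/><text/></TEI>"
    "</teiCorpus>"
)

print(describe_root(single))
print(describe_root(corpus))
```

This sketch only inspects the root element; checking a document against a customised TEI schema would require a schema-aware validator, which the Python standard library does not provide.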

The <teiHeader> is a mandatory tag which supplies the descriptive and declarative information about the text itself, its source, its encoding, and its revisions. It also provides an electronic analogue to the title page attached to a printed work.

The <text> element tag contains a single text of any kind, whether unitary or composite, for example a poem or a collection of essays. The default overall structure of any <text> is defined by the following elements, as discussed on the TEI website subpage⁶:

• <front> - (front matter) contains any pages found at the start of a document, before the main body, e.g., title page, dedication, preface, etc.,

• <body> - contains the whole body of a text without its front and back matter,

• <group> - groups together a sequence of distinct texts (or groups of such texts) which are treated as a single unit, e.g., the collected works of an author, a sequence of novels, etc.,

• <back> - (back matter) contains any appendixes.
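
As a small illustration of this default structure, the following Python sketch lists which of the four structural elements appear as direct children of a <text> element. The helper names local_name and text_parts and the sample fragment are assumptions made for illustration; the fragment is not taken from a real TEI document.

```python
import xml.etree.ElementTree as ET

STRUCTURAL_PARTS = ("front", "body", "group", "back")


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def text_parts(text_element: ET.Element) -> list[str]:
    """Return the structural parts directly inside <text>, in document order."""
    return [
        local_name(child.tag)
        for child in text_element
        if local_name(child.tag) in STRUCTURAL_PARTS
    ]


# Hypothetical sample: a text with front matter, a body and back matter
sample = ET.fromstring(
    "<text>"
    "<front><titlePage/></front>"
    "<body><p>...</p></body>"
    "<back><div type='appendix'/></back>"
    "</text>"
)
parts = text_parts(sample)
print(parts)
```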

The <body> of a document can be divided into a number of chunks of text, which form hierarchical textual divisions and subdivisions, such as chapters or sections. As mentioned above, these divisions and subdivisions vary depending on the style of the author writing the document. For instance, the major subdivision of a book is usually called a ‘chapter’, and that of a report a ‘part’ or ‘section’. Similarly, texts which are not organised as linear prose narratives, or not as narratives at all, are frequently subdivided in a similar way: a drama into ‘acts’ and ‘scenes’, a diary or a day book into ‘entries’, a newspaper into ‘issues’, etc.

Because of this variety, the TEI Guidelines propose that all textual divisions be encoded using the same named element tag, with the attribute type used to indicate the hierarchical level of the annotated element. Similarly to the DocBook sectioning elements, the TEI provides numbered (i.e., <div1>,...,<div7>) and unnumbered (i.e., <div>) division element tags. Apart from the division element tags, the TEI provides another tag for annotating paragraphs, named <p>. All of these elements share three attributes:

1. type - which indicates the conventional name for a category of this element; it also indicates the hierarchical level of the element, e.g., ’book’, ’part’, ’chapter’, ’section’,

2. xml:id - which specifies a unique identifier of that element within the whole document,

3. n - which specifies a short name or a number for the division.
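
The three attributes can be set programmatically when generating TEI markup. The sketch below, in Python with the standard library's xml.etree.ElementTree, builds one unnumbered division. The helper make_div is hypothetical, and the attribute values are borrowed from the kind of example shown in this section. Note that xml:id lives in the reserved XML namespace, so ElementTree expects it in '{namespace}id' form.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind the reserved xml: prefix


def make_div(div_type, n, xml_id=None):
    """Create an unnumbered <div> carrying type, n and (optionally) xml:id."""
    div = ET.Element("div", {"type": div_type, "n": n})
    if xml_id is not None:
        # ElementTree writes this attribute out as xml:id
        div.set(f"{{{XML_NS}}}id", xml_id)
    return div


chapter = make_div("chapter", "1", "L010100")
ET.SubElement(chapter, "head").text = "Of writing lives in general"

serialised = ET.tostring(chapter, encoding="unicode")
print(serialised)
```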

To illustrate the usage of the TEI Guidelines, we present a short example in Listing 4.3.

    <TEI>
      <teiHeader>...
      <div1 type="book" n="I" xml:id="L010000">
        <head>Book I
        <div2 type="chapter" n="1" xml:id="L010100">
          <head>Of writing lives in general,...
          <p>This chapter describes...
        <div2 type="chapter" n="2" xml:id="L010200">
          <head>DRa description, ...
          <p>This chapter describes...
          <div3 type="section" n="2.1" xml:id="L010201">
            <p>section...
        <trailer>The end of the first Book...
      <div type="book" n="II">
        <head>Book II
        <div type="chapter" n="1">
          <head>Of divisions in authors
          <p>...
        <div type="chapter" n="2">
          <head>...
          <p>...
          <div type="section" n="2.1">
            <p>...

Listing 4.3: The example presenting the usage of TEI Guidelines.

As in Listing 4.2, only the opening tag of each XML element is shown, and indentation expresses nesting.
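
The division hierarchy of Listing 4.3 can be recovered from the type and n attributes. The following Python sketch walks a fully closed fragment modelled on the first book of the listing and prints an indented outline. The function outline, the wrapping <body> element and the closing tags are assumptions added for illustration, and the sample deliberately omits namespaces to keep the tag matching simple.

```python
import re
import xml.etree.ElementTree as ET

DIV_TAG = re.compile(r"^div[1-7]?$")  # matches <div> and <div1> ... <div7>


def outline(element, depth=0):
    """Return one indented line per division found below `element`."""
    lines = []
    for child in element:
        if DIV_TAG.match(child.tag):
            label = f"{child.get('type')} {child.get('n')}"
            head = child.findtext("head", default="").strip()
            if head:
                label += f": {head}"
            lines.append("  " * depth + label)
            lines.extend(outline(child, depth + 1))
        else:
            # Look through non-division wrappers without increasing the depth
            lines.extend(outline(child, depth))
    return lines


# Hypothetical, fully closed fragment modelled on Listing 4.3
sample = """
<body>
  <div1 type="book" n="I">
    <head>Book I</head>
    <div2 type="chapter" n="1">
      <head>Of writing lives in general</head>
      <p>...</p>
    </div2>
    <div2 type="chapter" n="2">
      <head>DRa description</head>
      <div3 type="section" n="2.1"><p>...</p></div3>
    </div2>
  </div1>
</body>
"""
result = outline(ET.fromstring(sample))
print("\n".join(result))
```

Because the same function handles both numbered and unnumbered division tags, the outline depends only on nesting and on the type and n attributes, which is the uniform treatment the Guidelines aim for.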