
4.3 Qualitative element analysis

4.3.4 The “General.Language” element

This chapter presents how existing, automatically generated language tags in common document formats can be used for AMG purposes. The document code can contain tags that reflect the language of the document’s intellectual content. These tags can be used to populate the IEEE LOM’s “General.Language” element [74] and to execute AMG algorithms based on natural language in multi-lingual environments.

Purpose of efforts and the dataset

The IEEE LOM’s “General.Language” element [74] relates to the language used for a document’s intellectual content. It is not to be confused with the “Educational.Language” element, which reflects the primary spoken or written language used by the intended user group. AMG algorithms based on natural language can be used to generate keywords and descriptions and to perform classification, to name just a few examples. These efforts require knowledge of the language of the document’s intellectual content.

In related work, this knowledge is built in manually by applying a default language. In a multi-lingual user environment such as the case LMS, this assumption does not hold, as documents are published in a number of different natural languages. Current AMG algorithms based on natural language can hence not be executed within this user environment, since the natural language of the intellectual content is undetermined. The qualitative analysis of Chapter 4.2.1 showed that none of the document formats that were published in the case LMS contained embedded language metadata.

The document code can contain language tags that indicate the language of the document’s intellectual content. This allows for populating the “General.Language” element and the execution of AMG algorithms based on natural language. Language recognition is automatically performed by applications such as MS Word and MS PowerPoint on document text sections to enable spelling and grammar checks. These section-wise language descriptions are stored as language tags in the Word and PowerPoint documents that the applications create. Our research documented that language tags are discarded if the document is converted to PDF. This research therefore focuses on Word and PowerPoint documents. The language tags become accessible when Word and PowerPoint documents are losslessly converted to their native Open XML document format. This chapter presents how these data sources can be used to generate metadata.

One hundred documents were selected at random, resulting in 60 Word and 40 PowerPoint documents. These documents were losslessly converted to their native Open XML document format. The analysis was performed on the main document content of Word documents and on the first slide of PowerPoint documents.

Locating the language tags

An introduction to the content of Open XML documents from Word and PowerPoint documents is presented in Chapter 4.3.3 on page 138. In Word Open XML documents, content creator software includes language tags in the main document (“document.xml”) and in the document description file (a DTD-file) called “styles.xml”. PowerPoint Open XML documents do not contain this type of document description file. Instead, a whole range of files can contain language tags, such as the files for the template master, the header, the footer and each individual slide. These language tags look like the examples below:

Example 1: <w:lang w:val="en-US">

Example 2: <a:rPr lang="nb-NO">

Language tags can be located in multiple places in a single document, allowing sections to be tagged with different languages. Default language tags are assigned based on the language of the user interface in the content creator software used to create the document. This research analysed the “document.xml” and “styles.xml” files from Word Open XML documents, and the first slide of PowerPoint Open XML documents, retrieved from the “slide1.xml” file.
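The lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in this research: the function name `list_language_tags` is hypothetical, and the part paths assume the standard package layout of .docx and .pptx files. The sketch collects every language tag it finds, whether or not any text actually uses it.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace (the "w:" prefix in the examples above).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Package parts examined in this section (assumed standard paths in the zip).
WORD_PARTS = ("word/document.xml", "word/styles.xml")
SLIDE_PARTS = ("ppt/slides/slide1.xml",)


def list_language_tags(package):
    """Return {part name: sorted list of language tags found in that part}.

    `package` is a path or file-like object for a .docx or .pptx file.
    Every tag is collected, including tags that no text section uses.
    """
    found = {}
    with zipfile.ZipFile(package) as zf:
        names = set(zf.namelist())
        for part in WORD_PARTS + SLIDE_PARTS:
            if part not in names:
                continue
            root = ET.fromstring(zf.read(part))
            tags = set()
            for elem in root.iter():
                if elem.tag == f"{{{W_NS}}}lang":
                    # Word form: <w:lang w:val="en-US"/>
                    val = elem.get(f"{{{W_NS}}}val")
                    if val:
                        tags.add(val)
                elif elem.get("lang"):
                    # PowerPoint form: <a:rPr lang="nb-NO"/>
                    tags.add(elem.get("lang"))
            found[part] = sorted(tags)
    return found
```

Calling `list_language_tags("report.docx")` (a hypothetical file name) would return one entry per part that exists in the package.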

Case results: Word documents

The analysis revealed that the language tags located in the document body and the description file gave misleading impressions of the language of the document’s intellectual content. This is because there are language tags in the document that are not in practical use. All the analysed documents contained “en-US” (US English) language tags, even though only 7.5% of them actually used these tags. Extracting all available language tags and using them to specify the language of the document’s intellectual content would thus result in a low correctness rate. Extraction efforts need to focus on tags that are in practical use. The extraction effort showed that each text section was formatted with a single language tag. This allows the use of language-specific natural language AMG algorithms on individual sections formatted with a specific language tag. Both single-language and multi-lingual documents were found. As far as could be determined by this research, the language tags were placed correctly in accordance with the language of the document content¹⁶. The single-language documents contained language tags indicating that their intellectual content was in:

- Norwegian (“nb-NO”): 42 documents
- British English (“en-GB”): 8 documents
- US English (“en-US”): 3 documents
- Danish (“da-DK”): 1 document

There was a clear majority of documents in Norwegian, followed by British English, US English and Danish.
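The distinction between tags that are merely present and tags that are in practical use can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the procedure used in this research: the function `languages_in_use` and its `default_lang` parameter are hypothetical, and the sketch ignores inheritance from paragraph and character styles by falling back to a single default language for runs without their own tag.

```python
from collections import Counter
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def languages_in_use(document_xml, default_lang=None):
    """Count text characters per language tag in a Word document.xml string.

    Only runs (<w:r>) that carry visible text are counted, so a tag that
    formats nothing but empty runs or paragraph marks does not appear.
    """
    root = ET.fromstring(document_xml)
    counts = Counter()
    for run in root.iter(f"{W}r"):
        text = "".join(t.text or "" for t in run.iter(f"{W}t"))
        if not text.strip():
            continue  # tag present, but no text uses it
        lang_el = run.find(f"{W}rPr/{W}lang")
        lang = lang_el.get(f"{W}val") if lang_el is not None else None
        # Assumption: untagged runs use the document default language.
        lang = lang or default_lang
        if lang:
            counts[lang] += len(text)
    return counts
```

The character counts give a rough weighting, so that a tag attached to a handful of characters can be told apart from the tag that covers most of the text.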

Related research on AMG efforts that are based on natural language operates on the assumption that the document’s intellectual content is in a single natural language. In a multi-lingual publishing environment, documents with content in multiple natural languages can be present. The analysis revealed that six documents contained multi-lingual text sections. Distinguishing between content sections containing different natural languages enables extraction efforts based on the correct section language. Hence, each language section can be analysed separately. This approach avoids contamination of the dataset by content in languages other than the section language. The multi-lingual documents that were found were in:

- New Norwegian and English (“nn-NO” and “en-US”): 3 documents
- Norwegian and English (“nb-NO” and “en-US”): 2 documents
- English and German (“en-US” and “de-DE”): 1 document

16 The data results (from both the Word and PowerPoint analysis) included references to “US English,” “British English,” “Australian English,” “Greek” and “Brazilian Portuguese.” This thesis did not analyse whether the regional variants of English, Greek and Portuguese matched the language tags; a tag was accepted as long as it was representative of the main language.

Case results: PowerPoint documents

PowerPoint documents typically contain a limited number of complete sentences on which language recognition can be performed. Hence, less data is commonly available with which to determine the language used in the document. This can result in less accurate language tags than for Word documents. All but one document contained language style tags that were in use. The exception contained only photographs and no text. In this case, the absence of a specified language is regarded as the most correct outcome.

Single-language PowerPoint documents were found in Norwegian (17 documents), US English (5 documents) and British English (3 documents). One document used false language tags, where a few Norwegian keywords were included on the first slide of a US English slide show. This illustrates the difficulties posed by recognizing short language sections.

Thirty percent of the PowerPoint documents were correctly labelled as containing multi-lingual intellectual content. All of these documents were formatted correctly, and all contained extensive text sections in both the primary and the secondary intellectual language. This enabled the content creation software to correctly identify the intellectual language of each document section. These documents were in:

- Norwegian and US English (“nb-NO” and “en-US”): 9 documents
- Norwegian and Portuguese (“nb-NO” and “pt-BR”): 1 document
- Australian English and US English (“en-AU” and “en-US”): 1 document
- Greek and US English (“el-GR” and “en-US”): 1 document

- British English and Norwegian (“en-GB” and “nb-NO”): 1 document
- US English and Norwegian (“en-US” and “nb-NO”): 1 document
- US English and Swedish (“en-US” and “sv-SE”): 1 document

The primary intellectual language of these documents was correctly formatted with style tags. The language tags of the secondary intellectual language were used only to format number characters without any text. These language tags are not false, though this research regards these secondary language style tags as misleading. Language style tags from document sections containing only numbers and symbols (not text) could therefore be excluded from extraction.
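One way to leave out such misleading secondary tags is to accept a language tag only when the run it formats contains at least one letter. The Python sketch below illustrates that idea for a single slide. It is an assumption about how the filter could be written, not the method applied in this research, and the function name `slide_languages` is hypothetical.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace (the "a:" prefix used for text runs on slides).
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
A = f"{{{A_NS}}}"


def slide_languages(slide_xml):
    """Return the language tags of runs that contain at least one letter.

    Runs holding only digits, punctuation or whitespace are skipped, so a
    tag that formats nothing but numbers is not reported.
    """
    root = ET.fromstring(slide_xml)
    langs = set()
    for run in root.iter(f"{A}r"):
        props = run.find(f"{A}rPr")
        text_el = run.find(f"{A}t")
        text = (text_el.text or "") if text_el is not None else ""
        lang = props.get("lang") if props is not None else None
        if lang and any(ch.isalpha() for ch in text):
            langs.add(lang)
    return langs
```

`str.isalpha` is Unicode-aware, so Norwegian and Greek letters count as text in the same way as English letters.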

Summary

This analysis has confirmed that the case LMS is a multi-lingual publishing environment. Documents were observed with intellectual content in Norwegian (“Bokmål”), US English, British English, Australian English, New Norwegian, German, Greek and Danish. Other analyses undertaken as a part of this research found documents in Canadian English, Swedish, French and Spanish.

All the Word and PowerPoint documents with intellectual content used language style tags. This research has shown that the language of a document’s intellectual content can be determined correctly for nearly all Word and PowerPoint documents by extracting the language style tags that are in use. Using the document code and its language style tags enables identification and segmentation of documents into sections based on a specific natural language. This allows extraction efforts based on natural language to be executed in a multi-lingual user environment and for documents containing intellectual content in more than one natural language.
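The segmentation step mentioned above can be sketched as follows, assuming the runs of a document have already been extracted as (language tag, text) pairs in reading order. The function name `segment_by_language` and the input shape are illustrative assumptions, not part of this research.

```python
from itertools import groupby


def segment_by_language(runs):
    """Merge consecutive (lang, text) runs that share a language tag.

    `runs` is an iterable of (language tag, text) pairs in reading order.
    Returns a list of (language tag, section text) pairs, one per
    contiguous section in a single language.
    """
    segments = []
    for lang, group in groupby(runs, key=lambda run: run[0]):
        text = "".join(part for _, part in group)
        segments.append((lang, text))
    return segments
```

Each resulting section can then be passed to a natural language algorithm configured for that section's language, which keeps content in other languages out of the analysis.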

In document Using the structural content of documents to automatically generate quality metadata (Page 167-171)