3. Context and Research Design
4.3 Qualitative element analysis
4.3.2 The “Creator” element
The “Creator” element can provide important information about the origin of the document and can be regarded as an indicator of the quality of its intellectual content. This element should contain an entity with one or more creator names, or a group or organization name. A preferred person name consists of at least a given name and a surname. A person name can be formatted in a multitude of ways, e.g. with abbreviations, with or without middle names, and with the names in different sequences. Organization and group names can also be formatted in a multitude of ways.
Because creator entities are formatted differently, Boolean (exact-match) comparisons are not sufficient to determine whether candidate entities and entities from other data sources are identical. Manual evaluation was therefore needed in this analysis to determine whether entities in fact referred to the same creator(s).
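The following is a minimal sketch of why exact comparison fails and how differently formatted names could be flagged as candidates for manual evaluation. It is not the procedure used in this research; the function names and the matching rule (order-insensitive tokens, with an initial allowed to match a full name) are assumptions made for illustration.

```python
import unicodedata


def name_tokens(name: str) -> list[str]:
    """Lowercase a name, drop accents and punctuation, and split it into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return cleaned.split()


def may_be_same_creator(a: str, b: str) -> bool:
    """Hypothetical heuristic: every token of the shorter name must match a
    token of the longer name, in any order; an initial matches a full name."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return False
    shorter, longer = sorted((ta, tb), key=len)
    remaining = list(longer)
    for tok in shorter:
        match = next(
            (c for c in remaining
             if c == tok
             or (len(tok) == 1 and c.startswith(tok))
             or (len(c) == 1 and tok.startswith(c))),
            None,
        )
        if match is None:
            return False
        remaining.remove(match)
    return True


# Exact comparison treats these as different, the heuristic flags them as candidates.
print("Hansen, Lars P." == "Lars Peter Hansen")
print(may_be_same_creator("Hansen, Lars P.", "Lars Peter Hansen"))
```

A positive result from such a heuristic only indicates a candidate match. A partial entity such as “Lars” would match several full names, so a manual decision is still required.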
The dataset
This analysis is based on PDF, Word (DOC & DOT) and PowerPoint (PPT & PPS) document formats. These document formats represented 91% of all the published stand-alone documents from the LMS used in this research. They support the inclusion of embedded metadata and formatted sub-sections of the document and can contain visible creator information. All the documents also have a full person name of the person who published the document to the LMS. From the final dataset, 100 PDF documents, 100 Word documents and 100 PowerPoint documents were selected at random for analysis.
Presence of visible creator information
Visible data against which to verify element content were present only to a limited extent, which increased uncertainty and reduced the ability to draw conclusions regarding the embedded metadata and the extracted metadata. Only 9% of the Word documents contained such information, while 27% of the PDF and 44% of the PowerPoint documents contained visible creator information. The scarcity of visible creator information has two consequences:
1. It can be impossible to evaluate the correctness of candidate entities based on AMG efforts.
2. AMG efforts based on visible characteristics need to be extremely careful not to generate entities for documents without visible creator information.
Due to these issues, any AMG efforts based on visual characteristics would result in entities of very low semantic quality. This research continued by analysing available data sources.
Harvesting creator metadata
Word and PowerPoint documents can contain “Author” and “Last author” elements.
PDF documents can contain a general “Author” element and an Extensible Metadata Platform (XMP) section with “DC.Creator” and “XAP.Author” elements. Such elements have been harvested in related work [63]. An analysis of the dataset revealed issues with entities from the XMP section:
• Additional characters were included to indicate the start and end of brackets: “\(” and “\).”
• Different characters were extracted: “ ” (blank) instead of “-” (line).
• The Norwegian character “ø” was replaced by “.” (period).
All the “Creator” elements in the XMP section were present in the general element section as well. The general elements did not show these kinds of character errors.
Hence, the general and XMP elements, which should have been synonymous and should contain identical entities, do not contain identical entities. These errors could not be traced back to a “faulty” application: the content creation software applications were in common use and otherwise produced correct results. It is evident that the information placed in the XMP section of PDF documents is not reliable. As a result, this research focused subsequent efforts on the general elements.
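The comparison between the two element sections can be illustrated with a minimal sketch. It assumes that both values have already been read from the PDF as strings; the function names are hypothetical, and only the bracket issue is repaired, since a replaced “ø” or a lost “-” cannot be restored from the XMP value alone.

```python
def unescape_brackets(value: str) -> str:
    """Remove the backslashes placed in front of brackets, e.g. "\\(" -> "("."""
    return value.replace("\\(", "(").replace("\\)", ")")


def xmp_agrees_with_general(xmp_value: str, general_value: str) -> bool:
    """Compare an XMP entity with the general element after repairing brackets."""
    return unescape_brackets(xmp_value).strip() == general_value.strip()


# Bracket characters can be repaired, so the two entities agree.
print(xmp_agrees_with_general("Institute \\(HiO\\)", "Institute (HiO)"))
# A replaced character cannot be repaired, so the entities still differ.
print(xmp_agrees_with_general("Bj.rn Hansen", "Bjørn Hansen"))
```

Entities that still differ after this repair are the cases in which the general element is the safer data source.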
Table 17: Creator metadata from PDF, Word and PowerPoint documents
                                                PDF    Word   PowerPoint
Contain full author or organization name        11%    30%    38%
    Visibly verifiable correct                  4%     3%     6%
    Not visibly verifiable correct              7%     27%    32%
Contain partial author or organization name     61%    36%    34%
    Visibly verifiable correct                  9%     2%     13%
    Not visibly verifiable correct              52%    34%    21%
No results                                      20%    1%     7%
Verified false entities                         8%     33%    18%
Author or organization names were contained in the metadata of 72% of the PDF documents, although only 11% of the metadata elements could be visibly verified as correct. Eight percent of the documents contained false metadata, mainly commercial content for online conversion services. Sixty-six percent of the Word documents contained author or organization names in their metadata, but only 5% (!) of the metadata elements could be visibly verified as correct. A third of the entities could be verified as false, with values such as “standard user” and “test.” Seventy-two percent of the PowerPoint documents contained author or organization names in their metadata, although only 19% of the metadata elements could be visibly verified as correct. Eighteen percent of the entities were verified as false.
Figure 54: Verified correctness of embedded creator metadata
Figure 54 shows that 75% of PDF, 62% of Word and 60% of PowerPoint documents cannot be verified as either correct or false. Due to the lack of visible information against which to compare the embedded metadata, there is high uncertainty regarding the correctness of the harvested entities.
Table 18: Verifiable correct and false embedded creator elements
                    PDF    Word   PowerPoint
Verified correct    16%    5%     20%
Uncertain           74%    62%    60%
Verified false      10%    33%    19%
Extraction using visual characteristics
Extraction based on visual characteristics has been performed by a number of researchers [49, 56, 85, 91, 122]. The current research attempted to use AMG based on visual characteristics in order to generate “Creator” element entities for a selected dataset. Table 19 shows that using the first line of text or the content with the largest font very rarely generates correct “Creator” entities. These approaches are more commonly used to generate “Title” elements. If the title can be correctly identified, then the likelihood of generating correct “Creator” entities from the content located under it increases, although it is still low due to the limited number of documents with visible creator information. In all cases where visible creator information is not present, false entities are generated.
[Figure: bar chart of correctness rate (0% to 100%) for PDF, Word and PowerPoint documents; series: Verified false, Uncertain, Verified correct]
Table 19: Algorithms for generating "Creator" entities based on visual characteristics
              First line          Largest font        Located under the title
              Correct   False     Correct   False     Correct   False
PDF           3%        97%       1%        99%       12%       88%
Word          1%        99%       0%        100%      4%        96%
PowerPoint    0%        100%      0%        100%      20%       80%
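The three approaches in Table 19 can be sketched as follows. This is an illustrative reconstruction, not the implementation used in this research: the `TextLine` structure is hypothetical, the text lines are assumed to be available in reading order with their font sizes, and the title is assumed to be the line with the largest font.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextLine:
    text: str
    font_size: float


def first_line(lines: list[TextLine]) -> Optional[str]:
    """Candidate entity: the first line of text in the document."""
    return lines[0].text if lines else None


def largest_font(lines: list[TextLine]) -> Optional[str]:
    """Candidate entity: the first line set in the largest font."""
    return max(lines, key=lambda line: line.font_size).text if lines else None


def under_title(lines: list[TextLine]) -> Optional[str]:
    """Candidate entity: the line following the assumed title (largest font)."""
    if not lines:
        return None
    title_index = max(range(len(lines)), key=lambda i: lines[i].font_size)
    next_index = title_index + 1
    return lines[next_index].text if next_index < len(lines) else None


# Invented example of a first page or slide.
page = [
    TextLine("Course INF100", 12),
    TextLine("Introduction to Metadata", 28),
    TextLine("Lars Hansen", 14),
    TextLine("Autumn term", 12),
]
print(first_line(page), "|", largest_font(page), "|", under_title(page))
```

Each function returns a candidate whenever any text is available, regardless of whether that text is a creator name. This is why documents without visible creator information always yield false entities with these approaches.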
Extraction based on the document code
Word and PowerPoint documents can contain style tags that identify the formatting used for specific sections of the document. No documents contained the “Author” or “Creator” style tags. Later versions of the Adobe PDF document format also support the inclusion of style tags. This can allow retrieval of style-formatted content from the original documents after conversion to PDF [109]. Six PDF documents contained format tags, though these referred to other content (descriptions of images).
Half of the PowerPoint documents contained “Sub-title” style tags. Sixty-eight percent of all visible creator information was found within this section. These sections were visually formatted in a variety of ways and contained a range of different data, such as subtitles, dates, course descriptions and creator information, in a multitude of different orders. Creator information was included in 60% of the “Sub-title” sections. Eight percent of the “Sub-title” sections contained only creator information. Given the variety of content types and visual formatting, extraction from this section depends on identifying user and organization names among other text. This is a technology that has yet to be developed.
Table 20: Formatting information available from PDF, Word and PowerPoint documents
                                                Adobe PDF   MS Word   MS PowerPoint
Contain “Creator” or “Author” formatting        0%          0%        0%
Contained “Sub-title” formatting                0%          0%        50%
    Section included creator info. only         0%          0%        8%
    Section included creator info.              0%          0%        52%
    Section did not include creator info        0%          0%        40%
Using the LMS publisher data as data source for the “Creator” element
An alternative to harvesting or extracting creator metadata from stand-alone documents could be to harvest contextual publisher data from the LMS. Such an approach can generate valid entities for individual publishers. False entities would be generated for groups and organizations. The use of an external data source for creator information has been reported by Greenberg [60] and Jerkins et al. [82].
Because only a limited number of publishers (the course lecturers) are allowed access to the case LMS, validation can be performed even though limited user information is available from the stand-alone documents. This research compared user profile names in the LMS against the embedded metadata. Harvested entities that could be related to the course lecturers were counted as positive results. For example, the harvested entity “Lars” would register as a positive match if the document was published by a “Lars” and no other “Lars” could have made the publication. A match is considered positive if the publisher is included in the list of visible authors. This resulted in the correctness rates presented in Figure 55 and Table 21: 34% for PDF documents, 74% for Word documents and 55% for PowerPoint documents. This research also confirmed that the LMS publisher was not the document creator for 28% of the PDF, 7% of the Word and 35% of the PowerPoint documents.
Table 21: Verifiable publisher as document creator
                    PDF    Word   PowerPoint
Verified correct    34%    74%    55%
Uncertain           38%    19%    10%
Verified false      28%    7%     35%
Figure 55: Verifiable publisher is document creator
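The matching rule behind Table 21 and Figure 55 can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in this research: the function names are hypothetical, the list of lecturers with publishing access to the course is assumed to be known, and a harvested entity is taken to match a full name when all of its tokens occur in that name.

```python
def tokens(name: str) -> list[str]:
    """Split a name into lowercase tokens, ignoring commas."""
    return name.lower().replace(",", " ").split()


def entity_matches(entity: str, full_name: str) -> bool:
    """True if every token of the (possibly partial) entity occurs in the name."""
    entity_tokens = tokens(entity)
    return bool(entity_tokens) and set(entity_tokens) <= set(tokens(full_name))


def publisher_is_creator(entity: str, publisher: str, lecturers: list[str]) -> bool:
    """Positive match: the entity fits the publisher and no other lecturer
    who could have made the publication."""
    if not entity_matches(entity, publisher):
        return False
    others = [n for n in lecturers if n != publisher and entity_matches(entity, n)]
    return not others


# "Lars" is unambiguous in the first course, ambiguous in the second.
print(publisher_is_creator("Lars", "Lars Hansen", ["Lars Hansen", "Kari Olsen"]))
print(publisher_is_creator("Lars", "Lars Hansen", ["Lars Hansen", "Lars Berg"]))
```

The rule relies on the small number of possible publishers per course. With a larger or open group of publishers, a partial entity such as “Lars” could rarely be resolved in this way.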
This research also evaluated the correctness rate under a stricter requirement, in which a correct entity must contain the names of all document creators. The results are presented in Table 22. The correctness rates for PDF and Word documents do not change. This confirms that most Word documents are published by the document creator. Multi-creator Word documents are not commonly published. Rather, such documents are converted to PDF before being published. PowerPoint documents created by multiple persons are published in their original document format. Hence the correct and false rates for PowerPoint documents are affected by the different requirements for verification of correct results; see Table 21 and Table 22.
Table 22: Stricter verification of publisher, including multiple authors
                    PDF    Word   PowerPoint
Verified correct    34%    74%    47%
Uncertain           38%    19%    10%
Verified false      28%    7%     43%
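The difference between the two verification requirements can be made explicit with a minimal sketch. The function names are hypothetical, and the visible authors are assumed to have been identified manually as a list of full names.

```python
def lenient_correct(publisher: str, visible_authors: list[str]) -> bool:
    """Requirement behind Table 21: the publisher is one of the visible authors."""
    return publisher in visible_authors


def strict_correct(publisher: str, visible_authors: list[str]) -> bool:
    """Requirement behind Table 22: the entity must name all creators, so a
    single publisher is only correct for a single-author document."""
    return set(visible_authors) == {publisher}


# Invented example of a presentation with two visible authors.
authors = ["Lars Hansen", "Kari Olsen"]
print(lenient_correct("Lars Hansen", authors), strict_correct("Lars Hansen", authors))
```

A multi-creator document published by one of its authors therefore moves from verified correct to verified false under the stricter requirement, which is the shift seen for PowerPoint documents between Table 21 and Table 22.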
Summary
This chapter presented the generation of “Creator” element entities. This analysis has demonstrated the challenge of not having a validated correct data source against which to compare the embedded and extractable data results. As a result, there are large uncertainties as to whether the generated entities are correct or false. This is due to:
• Content creation software that generates entities of low or very low semantic quality.
• Extraction based on visual characteristics which generates high quantities of false entities due to the use of data sources that do not contain the desired content.
This research has found that there is a potential for generating creator metadata based on creator style tags present in the document code. This approach would only generate entities when the desired content is present. However, because document templates with such style tags are in practice not used by the document creators, this approach does not generate any entities for this dataset. The potential of this approach could therefore not be explored.
The results of these harvesting and extraction efforts contrast sharply with those from the system-controlled environment, where all documents were automatically given a valid creator element, as described in Chapter 4.1.2. Neither the consistency of the information nor the correctness of the specific types of data available from stand-alone documents is comparable to that of the documents in the system-controlled environment.