Development in Python of a tool to check the Final Degree/Master Project report

Bachelor’s Thesis

Author:

Alex Moyano Núñez

Supervisor:

Manuel Moreno Eguílaz

Call:

June 2021

Escola Tècnica Superior

Summary

Supervision of every single Final Degree/Master Project is a lengthy endeavor that requires constant error checking from both the author and the project supervisor - a task which can add up to many hours over the course of a project.

The main objective of this project is to develop a tool in Python that would help in this task, lightening the load for both the project supervisor and its author.

This objective has been met. The program is developed almost entirely in Python, with the exception of the graphical user interface, and it can find, highlight and offer solutions to many common errors and defects in TFE reports.

Table of Figures

Figure 1. Comparison between uncaptioned elements and captioned ones.
Figure 2. File structure of a .docx file.
Figure 3. View of an example document in Microsoft Word.
Figure 4. Snippet of document.xml corresponding to the heading in Fig. 3.
Figure 5. High-level program flowchart.
Figure 6. Python code example for retrieving image and caption pairs.
Figure 7. Python code for grammar checking the main paragraphs in a Docx file.
Figure 8. Sentence in English and its corresponding match object.
Figure 9. Accepted numbered format for the bibliography section.
Figure 10. File selection page.
Figure 11. File selection page with file selected.
Figure 12. Check selection page.
Figure 13. Final page: Download and Reset.
Figure 14. Error report summary.
Figure 15. Alert message for checks with execution errors.
Figure 16. Missing sections errors.
Figure 17. Citation errors.
Figure 18. Missing figure captions errors.
Figure 19. Example of a missing figure label error.
Figure 20. Examples of some missing references to figures errors.
Figure 21. Examples of some spelling and grammatical errors.

Table of Tables

Table 1. Accepted labels for captioned elements.
Table 2. Estimation of time required for each task.
Table 3. Estimation of time required for each task and subtask.
Table 4. Estimated hardware costs.
Table 5. Estimated software costs.
Table 6. Estimated human resources costs.
Table 7. Total direct costs.
Table 8. Total indirect costs.

1. Glossary

Boolean variable: Variables with only two possible values: True or False. Named thus due to the influence of Boolean algebra, in which all variables are either True or False.

Django: Django is a Python-based free and open-source web framework widely used all across the internet. It is one of the most popular Python-based web frameworks.

Flask: Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries and it does not provide abstraction layers common in more fully-featured frameworks like a database or form validation. It is well suited to small projects like this one since it does not require many dependencies and the code is lean.

LibreOffice: LibreOffice is a free and powerful office suite, and a successor to OpenOffice.org (commonly known as OpenOffice). It offers many of the same features as Microsoft Office, like a word processor, a spreadsheet program and more, and serves as an open-source alternative to commercially available office suites.

Namespace: XML namespaces provide a simple method for the program or user reading the XML file to associate each element with its defining schema without needing to define the schema every time the element appears in the document.

Natural Language Processing (NLP): Refers to the branch of computer science (and more specifically, the branch of artificial intelligence or AI) concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

Open XML Standard: Also known as Office Open XML, it is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents.

Parsing: In computer science, to parse is to separate a string of symbols (usually a program) into more easily processed components, which are analyzed for correct syntax and then attached to tags that define each component. In this specific project, parsing refers to separating an XML file into Python objects the program can manipulate.

Schema: An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. This word has a more general meaning beyond XML files, but this is the meaning relevant to this project.

TFE: Acronym of Final Study Thesis, using its initials in Spanish and Catalan: Trabajo de Fin de Estudios. Variations of this acronym appear throughout the document, such as TFG, which refers to Bachelor’s Thesis, and TFM, which refers to Master’s Thesis.

WSGI server: WSGI stands for Web Server Gateway Interface and is a convention for web servers to forward requests to web applications or frameworks written in Python. WSGI servers are used in production environments, where serving pages over HTTPS additionally requires security certificates.

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XPath functions: A series of functions, implemented by libraries that support XPath (a query language for XML that is also used by XSLT), that allow programs and scripts to address the elements of an XML file much like paths in the directory structure of an operating system. This way of accessing elements is called path-like, and the name XPath is derived from it.

XSLT: XSLT stands for Extensible Stylesheet Language Transformations and is a language used to transform XML documents into other XML documents, or other formats such as HTML for web pages. It is not relevant to this project beyond knowing it is implemented in the XML library the program uses to parse XML files and it provides the functionality of XPath functions.

2. Introduction

2.1. Project objectives

The main objective of this project is to develop a program or tool to find, highlight and offer solutions to errors/defects in TFE reports written in the Docx file format (Microsoft) [1]. The specific design objectives are as follows:

- The entire program, wherever possible, must be written in Python.

- The program must be able to finish its runs without fatal errors stopping it. This means that even if an error is encountered, the run must still continue.

- The runtime of the analysis of any TFE report must not exceed fifteen minutes.

- The program must have a graphical user interface to facilitate its use.

- The program must be able to run as a local instance, without any connection to the internet or other networks.

- The program must provide real utility to its users, with at least some feature other error correcting tools do not offer.

- The entire project must be open source, without using any closed source libraries.

- The program must be built with further improvements in mind, since other people may contribute new features as part of their own Bachelor’s thesis in the future.

2.2. Project scope

As with many other software engineering projects, some external libraries will be used when they already implement functionality that would otherwise have to be developed by the project author. These libraries will always be open source, with the intention the entire project will also maintain this designation.

Especially in regards to the analysis of spelling, grammatical and style errors, this project will limit itself to using an already existing natural language processing (NLP) library due to the complexity of the task.

3. Common errors in TFE reports

As the program is designed to find, highlight and offer suggestions on how to correct common errors/defects in TFE reports, it is best to start by introducing and describing those same common errors as a starting point.

Some of the errors described in this section apply to many reports and documents, like the spelling, grammar and style errors, while some others are more specific to TFE reports or even TFE reports that follow ETSEIB guidelines.

This list is not intended to be exhaustive and just includes those errors the project supervisor thought would be useful to check with an automated tool. The program has been designed modularly so that future contributors, if any, can add additional error checking functionalities.

3.1. Spelling, grammar and style errors

These errors are the most common defects in any TFG report. This is, in part, due to the sheer amount of text almost all reports have, which obviously raises the chances for a mistake to be made. In comparison, other errors may have a higher occurrence relative to the possible mistakes one could make but are overall less prevalent.

In the particular case that concerns the main intended use of the program, that of using it as a first filter tool for TFE reports written in English by ETSEIB students, the incidence may be even higher. For most students in ETSEIB, English is their third language and they usually do not have a native proficiency level.

Spelling errors consist of mistakes like adding double whitespaces between words or using the wrong letters. They usually come from typing mistakes that go unnoticed, but they can also come from not knowing the correct way to write a word. Microsoft Word, like other similar programs, already has a very robust spelling correction system integrated in its editor, but not every user uses it correctly and, even when they do, the check sometimes fails and errors slip through.

but the subject of the sentence needs to change. Nonetheless, this kind of technology has advanced by leaps and bounds in the last decade and their use is becoming more common, to the point that Microsoft Word itself has a (fairly limited) grammar check system integrated seamlessly with its normal spelling check.

The last kind of errors in this section, style errors, are a lot more nuanced than the other two categories, to the point that it is debatable whether some of them are errors at all rather than stylistic choices. An example could be using a comma instead of splitting a sentence into two, or using passive voice instead of active in certain documents. The problem that grammatical errors already had, that of needing to use natural language processing (NLP) algorithms to understand the intended meaning of a sentence before passing judgement on it, is even more apparent here. Not only are style errors more nuanced, but their very definition as errors can be questionable without considering factors outside the text, such as the intended reader, that are difficult to take into account with automated systems.

3.2. Missing sections

While the categories of errors described above can be applied to any kind of document, this particular one depends more on knowing the required sections that a specific document must have in its final form.

For the particular case of TFE reports done by ETSEIB students, the following sections have been considered as important to have in the final document:

- First page
- Index
- Glossary
- Introduction
- Budget
- Environmental Impact
- Conclusions
- Bibliography
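
As an illustration of how such a check could work, the following minimal sketch compares a list of heading texts against the required sections. It assumes the headings have already been extracted from the document as plain strings; the names REQUIRED_SECTIONS, normalize and find_missing_sections are hypothetical and not taken from the program's code. The first page is left out because a title page normally has no heading of its own.

```python
import re

# Section names taken from the list above (assumption: each one appears
# as a heading whose text matches the name, ignoring case and numbering).
REQUIRED_SECTIONS = [
    "Index", "Glossary", "Introduction", "Budget",
    "Environmental Impact", "Conclusions", "Bibliography",
]


def normalize(heading):
    """Strip leading numbering such as '2.' or '3.1.' and lowercase the rest."""
    return re.sub(r"^\s*(\d+\.)*\d*\s*", "", heading).strip().lower()


def find_missing_sections(headings):
    """Return the required sections that do not appear among the headings."""
    present = {normalize(heading) for heading in headings}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in present]
```

A real check would probably also need to accept alternative titles (for example "Table of Contents" instead of "Index"), which this sketch does not attempt.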

3.3. Missing captions on figures and tables

When a picture, illustration or table is added to a document where the rest of the information exists in text form, a caption explaining what this object is and its source must be added above or below the object to inform the reader. Figure 1 shows the difference between uncaptioned elements and captioned ones.

For this thesis, all figures and tables were made by the thesis’ author, so the sources won’t appear in the captions as their provenance is already established in this paragraph.

As a general rule, figures and illustrations usually have their captions just below them while tables vary, with their caption being above or below them, depending on the specific style rule the author is using.

3.4. Missing labels on captions

Even if a figure or table has captions, these captions themselves can contain mistakes. Apart from the possible mistakes inherent to any written text like spelling, grammar and style, these captions can have their labels missing.

A label in a caption is a small amount of text at the start of the caption that gives a name and a number to the object it is captioning, allowing for easy reference and identification. This label could be something like Figure X or Fig. X for figures and Table X for tables, where X is the identifying number of that element.
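
A label of this kind can be recognized with a regular expression. The sketch below is only an illustration that accepts the three forms mentioned above; the names LABEL_PATTERN and parse_label are hypothetical, and the set of labels accepted by the actual program may differ.

```python
import re

# Accepted forms follow the examples in the text: "Figure X", "Fig. X", "Table X".
LABEL_PATTERN = re.compile(r"^\s*(Figure|Fig\.|Table)\s+(\d+)\b")


def parse_label(caption):
    """Return (kind, number) if the caption starts with a label, else None."""
    match = LABEL_PATTERN.match(caption)
    if match is None:
        return None
    # "Fig." and "Figure" are treated as the same kind of element.
    kind = "Table" if match.group(1) == "Table" else "Figure"
    return kind, int(match.group(2))
```

A caption for which parse_label returns None would be reported as having a missing label.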

3.5. Wrong citation order in Bibliography

All citations in the document must have their sources listed in the Bibliography section, and they must be numbered and ordered according to their order of appearance in the document. A fairly common error is using an erroneous order on these sources, usually as a result of iterative changes on the document which can remove citations and sources over time, while some old sources keep their old numbers without being noticed.
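
The ordering rule can be expressed compactly in code. The following sketch assumes numeric citations written in square brackets, such as [1], as used in this document; the function names are hypothetical and the check is a simplification of what a complete tool would need.

```python
import re


def first_appearance_order(text):
    """Return the citation numbers in the order they first appear in the text."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", text):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def citation_order_errors(text):
    """Return (found, expected) pairs for citations that are numbered out of order."""
    errors = []
    for expected, found in enumerate(first_appearance_order(text), start=1):
        if found != expected:
            errors.append((found, expected))
    return errors
```

An empty result means that the first new citation is [1], the second is [2], and so on.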

3.6. Missing and/or disordered references

This error category is related to some of the others explained above, and consists of references to sources, tables or figures that are missing or appear in the wrong order. In regards to citations and sources, all cited sources must be ordered and numbered in the Bibliography section. But this also works vice versa: all sources present in that section must be referenced at least once in the text. As explained in the section above, the order of the references must match the order of the sources in the Bibliography section and if this is not the case, one of the two orders must be changed to match.

For figures and tables, their inclusion in the document must serve the function of illustrating and supporting the argument of the text they accompany, and this text must reference them directly and by label.

The order of the references in text must also match the order in which they appear in the document, and their label should also match this same order.
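
As a rough illustration of the first of these checks, the sketch below reports caption labels that are never mentioned in the body text. It assumes the labels have already been extracted as (kind, number) pairs and that the body text does not include the captions themselves; unreferenced_labels is a hypothetical name.

```python
import re


def unreferenced_labels(caption_labels, body_text):
    """Return the (kind, number) labels that the body text never mentions."""
    missing = []
    for kind, number in caption_labels:
        # Accept "Figure 3" and "Fig. 3" for figures, "Table 3" for tables.
        names = r"(?:Figure|Fig\.)" if kind == "Figure" else r"Table"
        # The trailing \b prevents "Figure 1" from matching inside "Figure 12".
        pattern = re.compile(names + r"\s+" + str(number) + r"\b")
        if not pattern.search(body_text):
            missing.append((kind, number))
    return missing
```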

4. State of the art

After introducing the errors most common in TFE reports, and before entering the discussion about the developed program itself, it is useful to first check what the existing solutions and applications offer in terms of error checking capabilities.

For any new program, whether it be the program developed as part of this project, a commercial venture or a non-profit work, its usefulness will be determined by whether it can offer some feature or aspect other choices cannot.

This feature could be some specific error other applications do not cover, differences in the performance of the checks, a better user experience or simply reasons like the price being lower or the application being open-source.

In this section, some of the existing error checking applications and solutions will be examined and a case will be made for why the program developed in this thesis is necessary for the specific use case presented.

4.1. Existing solutions and applications

Given that the program is being made to error check Docx files, the first relevant solution it must surpass in some manner is the integrated error checking functions present in Microsoft Word.

Microsoft has been continuously upgrading the error checking functionalities in Word since Microsoft Office was first released in 1990. When the technology improves enough that it constitutes a meaningful change in error checking capabilities, an upgrade is pushed to integrate the new technology into Word.

This process has matured the native error checking capabilities of Word from simple spelling checking to a full featured natural language processing suite capable of highlighting and correcting spelling, grammar and style errors with high accuracy.

But even with all of these upgrades, these integrated algorithms are neither as powerful nor as accurate as applications entirely dedicated to error checking.

The most commonly used dedicated error checking software is either Grammarly [2] or ProWritingAid [3]; the available user data is not public and it is difficult to make an accurate

These programs offer a very comprehensive suite of tools to help any author write any text without spelling, grammatical or style errors. Their prowess in detecting these kinds of errors is considerably higher than that of the integrated functions present in most text editors, though again it is difficult to quantify this increased performance versus the integrated functionalities. Both of these services, and other similar competitors, are commercial for-profit ventures that charge money for their services. They usually include some number of free checks, limited either in depth (not all checks are run), in length (a free user can only check a limited amount of text each day), or both.

This is done as a kind of demonstration and to show the usefulness of the application to potential users, but these free trials are intentionally limited so that they are only really useful for short texts – as their real business is checking entire documents, usually for companies which buy enterprise services so all their employees can integrate these checks into their normal workflow.

Apart from these commercial applications, open-source alternative projects also exist, like LanguageTool [4]. These programs are usually less polished and a bit less powerful than their non-free competitors, but they are free to use and the code can be incorporated inside other open-source projects without any issue.

LanguageTool in particular also offers an online service to check documents as a paid feature, but those checks can be run for free offline with a local instance instead.

4.2. Need for this project

While all of these applications and services offer really powerful error checking features for the first category of errors described in the previous section, the spelling, grammar and style errors, they do not consider any other kind of errors in their checks.

They must serve as a way of checking any kind of text, and those errors are the only ones universal enough that it makes sense to build a commercial enterprise around.

5. Internal structure of a Word document

Any error checking program generally follows three main steps: retrieving the information to analyze, filtering it according to some specific criteria to detect errors and then preparing and presenting the output with its findings.

As such, before describing the workings of the error checking program itself, it is important to first explain how Microsoft Word works and how it stores information in .docx files. A great deal of the information explained below will not apply to the older format of Word documents (.doc files), as Microsoft changed the internal data storage format of Office documents to the Open XML standard in Office 2007, hence the x at the end of modern Office file extensions (.docx, .xlsx, .pptx, etc.).

A .docx Word document is essentially a ZIP file container with multiple files inside (unlike the older .doc format, which is not ZIP-based). When a user opens a .docx file using Microsoft Word (or another word processor with support for Word files), the program decompresses the file container, reads the files inside and puts everything together in the form most users are familiar with: one or more pages with text, images, etc., all put together as a printable document where the printed version will look exactly as shown on the screen.

However, the document the user sees is not how the data inside a Word document is stored in memory or disk. As the process of putting everything inside the docx file together is done dynamically every time the text editor is started, if another program wants to access the information without using the text editor, it must either simulate the same steps the text editor uses to put everything together or access the information directly from the files inside the docx file.

Microsoft Office – which includes Word, among other programs – is not open source, so the task of reverse engineering which exact steps it uses to open docx documents in order to show the final result to the user is not an easy one. This is one of the reasons why opening a docx file in LibreOffice (an open-source alternative to Microsoft Office) can sometimes change how the document looks.

As mentioned above, Microsoft shifted to the Open XML standard for all their office products with the Office 2007 release. This standard has had four revisions since its initial release in December 2006, with the latest being released in 2016 and standardized in ISO/IEC 29500 [5].

Docx files follow this standard, and thus all files inside a docx file are either XML files or .rels

Though following the Open XML standard and recreating the file from its constituent parts is possible, in applications in which only the information inside a docx file is useful, without needing to preserve the exact form and looks for printing, it is much more efficient to simply extract the files inside the docx file and retrieve the information inside.

If any given .docx file is renamed to .zip and then extracted into a folder, the resultant file structure will look very similar to the one shown in Figure 2, though there may be some differences between documents due to embedded images, plots, tables, etc.
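
The same inspection can be done from Python without renaming the file, since the standard zipfile module can open a .docx container directly. The sketch below is only illustrative: list_docx_parts and build_toy_docx are hypothetical helper names, and the toy container it builds in memory is a stand-in for testing, not a valid Word document.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the names of the files stored inside a .docx container."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def build_toy_docx():
    """Create a tiny in-memory stand-in for a .docx (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer
```

With a real report, the function would be called with the path of the file instead of the in-memory buffer.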

As can be seen in Figure 2, all files contained inside the docx file are either XML files or .rels files. Most of them store information about the fonts, style, layout, settings, etc. but almost all the information any user inputs into a Word file is stored inside the document.xml file.

For the purposes of error checking the errors/defects described in the common errors in TFE reports section, just accessing the information stored in document.xml is enough, so there is no need to parse metadata from other files or combine information from multiple sources. The file document.xml is formatted according to the WordprocessingML standard [6] defined inside the more general Open XML standard. As described in ISO/IEC 29500:2016, a WordprocessingML document is organized around the concept of stories. A story is a region of content in a WordprocessingML document. Some examples of stories could be: main story, header, footer, frame, text box, etc.

Not all stories must be present in a valid WordprocessingML document. The simplest, valid WordprocessingML document only requires a single story — the main document story. In WordprocessingML, the main document story is represented by the main document part. As an example, the main document story of the simplest WordprocessingML document consists of the following XML elements:

• document: The root element for a WordprocessingML main document part, which defines the main document story.

• body: The container for the collection of block-level structures that comprise the main story.

• p: A paragraph.

• r: A run.

• t: A range of text.

Both document and body are universal for Open XML documents, and they act as containers to divide the XML file into sections for easier parsing. However, the latter three elements (p, r, and t) are more interesting for the intended use case of the error checking tool.

Paragraphs are the most basic unit of block-level content within a WordprocessingML document and as mentioned above, they are stored using the <p> element. A paragraph defines a distinct division of content that begins on a new line. A paragraph can contain three pieces of information: optional paragraph properties, inline content (typically runs), and a set of optional revision IDs used to compare the content of two documents.

Some examples of paragraph properties are alignment, border, hyphenation override, indentation, line spacing, shading and text direction.

A run, just like a paragraph, can also have its own properties. Some examples of run properties are bold, border, character style, color, font and font size.

Thus, the most common structure in document.xml often is a paragraph (i.e., a <p> element) that contains one or multiple runs, each containing at least one <t> element with the text of the run.
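
The nesting of paragraphs, runs and text can be illustrated with a hand-written XML fragment. This is a minimal sketch that uses xml.etree.ElementTree from the standard library rather than the library used by the program; SAMPLE is an invented fragment and paragraph_texts is a hypothetical helper. The second run carries a bold run property to show that properties sit next to the text, not inside it.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, bound to the conventional "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Invented two-paragraph document; the first paragraph has two runs.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Hello </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def paragraph_texts(document_xml):
    """Return the text of each top-level paragraph, joining the text of its runs."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.findall("w:body/w:p", NS):
        pieces = paragraph.findall("w:r/w:t", NS)
        texts.append("".join(piece.text or "" for piece in pieces))
    return texts
```

Joining the runs is necessary because a single visible sentence is often split into several runs whenever its formatting changes.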

There are many more possible elements inside document.xml, all defined in the standard, but the relevant ones for error checking TFE reports are the header, footer, table, drawing, and textbox elements, in addition to the basic ones described above.

Header and footer elements are similar to document and body in that they work as an internal divider inside document.xml and they usually contain one or more paragraphs inside with the text they display on the page.

The table and textbox elements are self-explanatory, and they also contain paragraphs and/or runs inside with text. Additionally, they also can carry additional elements like embedded images or style properties like border color, border thickness or background color.

Drawing elements are how the WordprocessingML standard defines any kind of image, both those imported and those created by the text editor program. As with the table and textbox elements, they can also have additional properties that define how they look in the final document.

To better visualize how these XML elements are connected to the document the user sees, Figure 3 shows a page in Microsoft Word with all these elements and how they look from inside the text editor program, while Figure 4 has a snippet of code from the corresponding document.xml highlighted and color coded to see how an element translates from the editor to the XML file.

Figure 3. View of an example document in Microsoft Word.

6. Program structure

As mentioned in a previous section, most error checking programs use a similar program structure when the specific functions are abstracted away. They usually have a way to extract information from a source, process it so its filtering functions can analyze it and finally they present the filtered information and/or conclusions taken from its analysis to the user.

The program developed as part of this TFG is no different in this regard and it also works along these lines. Figure 5 illustrates a high-level, abstracted view of how the program works.

Examined in more detail, the process for this program goes as follows:

1. The user uploads a Docx file using the graphical web interface, after which he also selects which error checking operations he wishes the program to run on the file.

2. The program runs the data extraction functions for the checks the user has selected.

3. The filtering module functions filter the provided information, execute the checks and store the results in memory for the output module.

4. The output module parses the results and creates an Error report Docx file which is stored on disk.

5. The user interface notifies the user the program is done and he proceeds to download the Error Report file.

6. The program can now be reset using the user interface to return to step 1.
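
The middle steps of this process, together with the design objective that a failing check must not stop the run, can be sketched as a small loop. All names below (run_checks, check_double_spaces, check_broken) are hypothetical and the document is reduced to a list of paragraph strings; the real modules are considerably more involved.

```python
def run_checks(document, checks):
    """Run every selected check; a failing check is recorded instead of being fatal."""
    results, failed = {}, []
    for name, check in checks.items():
        try:
            results[name] = check(document)
        except Exception as error:  # keep going so the remaining checks still run
            failed.append((name, str(error)))
    return results, failed


def check_double_spaces(document):
    """Toy check: indices of the paragraphs that contain a double space."""
    return [index for index, text in enumerate(document) if "  " in text]


def check_broken(document):
    """Toy check that always fails, to show that the run continues."""
    raise ValueError("simulated failure")
```

The list of failed checks could then be passed to the output stage so that the user is told which checks did not complete.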

As can be seen in the flowchart in Figure 5 and the explanation above, the program consists of four modules:

- Data extraction (called docmanip.py in the code).
- Filtering (called errorcheck.py in the code).
- Output preparation (called output.py in the code).
- Web interface + micro server.

interact with the user through a graphical interface, without the need to use console commands or a script. This not only opens its use to less technically minded users, but also speeds up use even amongst technically proficient users, especially when analyzing more than one document with different checks.

All the programming done as part of this TFG has been done in Python, though some external libraries were used through Python wrappers and some JavaScript code is used in the web interface.

6.1. Data extraction

As explained in the internal structure section above, Python programs and scripts in general must access the data inside a Docx document by uncompressing it and parsing the XML file inside.

Docx files follow the Open XML standard and making a comprehensive parser that considers all edge cases is a non-trivial task that could be the entire subject of a Final Master project (TFM) on its own.

Thankfully, Docx and Microsoft Word are popular enough that libraries to interact with docx documents from Python scripts already exist. The most popular and full featured of these libraries is called python-docx [7] and the latest version released (and the one used in this program) is 0.8.10.

The python-docx library contains functions, classes and methods that automate, simplify and abstract away much of the work needed to interact with docx files using bare python.

As an example, one can create a docx.Document object (docx is the import name of python-docx) using a Docx file as the input and then read its paragraphs attribute to get a list of all the paragraphs in the main story of the document, with all the XML sub elements inside each <p> element already parsed into Python-understandable attributes.

Python-docx also provides functions for writing Docx files, which will be used in the output module.

However, python-docx has not reached version 1.0 yet and that, in the eyes of its creators, means it is not feature complete. This library has a host of limitations when interacting with a Docx file, compared to doing so using a dedicated text editor like Microsoft Word, some of

There are six kinds of elements in a Docx relevant to the program developed in this TFG:

- Paragraphs
- Tables
- Headers
- Footers
- Images
- Text boxes

Python-docx provides support for the first four elements, but extracting information about images and text boxes requires a bespoke solution and new functions.

Thus, in the data extraction module, adapted python-docx functions are run to extract the first four elements from the Docx file provided, while another solution explained below is used to retrieve text boxes and their relationship to images and tables.

As mentioned above, the program needs to retrieve not only the information inside textboxes, but also the relationship of each text box with the closest image or table – since the only textboxes relevant to the errors the program is checking are those acting as captions.

As python-docx does not offer support for textboxes, the solution that has been implemented is to parse the document.xml file using the library lxml [8]. lxml is a third-party library rather than part of the Python standard library, but python-docx itself depends on it, so it does not require an additional install.

This manual parsing has a few peculiarities and edge cases to consider, which will be explained below, but the general flow of the parsing and retrieving process works like this:

- Unzip the Docx file using the zipfile [9] library (a standard library module used to open zip files).

- Load the document.xml file with lxml.etree.XML(), which parses the file into Python-understandable elements and provides functions to search the document.

- Use the XPath functions available on the parsed tree to search for an image or table, and then search the preceding and succeeding elements around that image or table to see if they are text boxes.

- If they are, both elements are logged together as image (or table) plus caption, while those images and tables that do not have any text box nearby are assumed to not have captions.

- Then this process is repeated until there are no more images and/or tables in the document.
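
The flow above can be approximated with the following sketch. It is not the program's code: it uses xml.etree.ElementTree from the standard library in place of lxml, handles only pictures (not tables), looks only at the block elements immediately before and after each picture, and simplifies the structure of real text boxes, which are usually nested more deeply. The helper names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a .docx container."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


def textbox_text(element):
    """Join all the text found inside text boxes of this element ('' if none)."""
    pieces = element.findall(".//w:txbxContent//w:t", NS)
    return "".join(piece.text or "" for piece in pieces)


def image_caption_pairs(document_xml):
    """Return (block index, caption or None) for every block holding a picture."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    blocks = list(body)
    pairs = []
    for index, block in enumerate(blocks):
        if block.find(".//pic:pic", NS) is None:
            continue  # this block does not contain a picture
        # Candidate captions: the block just before and the block just after.
        neighbours = blocks[max(index - 1, 0):index] + blocks[index + 1:index + 2]
        caption = next((textbox_text(n) for n in neighbours if textbox_text(n)), None)
        pairs.append((index, caption))
    return pairs
```

Pictures paired with None would be the ones reported as missing a caption.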

nodes in an XML document much like one would navigate a series of folders in a computer. To better illustrate this, an example of how XPath functions are used in this manner could be the python script shown in Figure 6, which is an adapted (for brevity and understandability) version of the one used in the real data extraction module.

The script shown in Figure 6 has many limitations that the actual code in the program does not, such as the inability to examine the drawing element preceding the picture, or retrieving only the first run in each text box (when there can be more than one run and they must be joined to retrieve the full text inside). It is nevertheless close enough to explain the methodology used to extract the information about picture (and table) caption pairs, or the lack of them, in the document. If the code is examined from the top, the first non-import element seen is the extraction of document.xml from the provided Docx file using a custom-made function, which stores the full XML file in memory. This XML file is then passed to the lxml.etree.XML function for parsing, as explained above.

Before proceeding to the for loop which will iterate over the document.xml file, two preliminary steps must be completed for the script to work. The first involves an element of XML files which has not been discussed in this work yet: the namespace.


The namespace exists as a way to shorten XML files and make them more understandable. It is basically a glossary that maps each abbreviated prefix used in the XML file to its long-form definition. It tells the functions that interact with the XML to, for example, substitute the long-form definition of the w prefix found in the namespace every time they see w:drawing or w:txbxContent. If the namespace is not provided, every search must explicitly state the long form of each element name instead of using the abbreviation.
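As a small illustration of this idea, the following sketch searches the same element twice, once with the abbreviated prefix and once with the long form. It uses the standard library module xml.etree.ElementTree rather than lxml, and the one-line XML document is a made-up sample, not an excerpt of a real document.xml.

```python
import xml.etree.ElementTree as ET

# Long-form definition of the "w" prefix in WordprocessingML.
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NAMESPACES = {"w": W_URI}

sample = (
    f'<w:document xmlns:w="{W_URI}"><w:body><w:p><w:r>'
    "<w:drawing/></w:r></w:p></w:body></w:document>"
)
root = ET.fromstring(sample)

# With the namespace map, the abbreviated form can be used in the path.
short_form = root.findall(".//w:drawing", NAMESPACES)

# Without it, the full URI must be written out in braces for each element.
long_form = root.findall(f".//{{{W_URI}}}drawing")
```

Both searches return the same single element, but the first one is far easier to read and to maintain.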

The second preliminary step is much simpler: the variable which will hold the picture caption pairs must be initialized before the for loop begins.

The main for loop iterates over every w:drawing element in the XML document, thanks to the use of the XPath functions (in this case tree.findall(path, namespace)). All pictures and textboxes in the WordprocessingML standard must be inside a w:drawing element, so this first filter will make the script much faster compared to iterating over every single element. At the start of every iteration, the picture_found Boolean variable is set to False. It will be set to True when a picture is found.

After that, XPath functions are used again to find all pictures inside the current w:drawing element, and if one is found, it is stored in a new temporary variable and the picture_found Boolean variable is set to True.

If a picture has been found, the next element after the drawing element that contains the picture is retrieved, and a process very similar to the one described above is repeated to find and retrieve a text box if one exists there.

First, a new Boolean variable is set to False to indicate a text box has not been found yet, and then the new element is filtered to find all textboxes inside. If one is found, the text inside is retrieved, joined to the picture element found before as a picture caption pair and then the Boolean variable for the text box search is set to True to end the search.

At the end of the main loop iteration, if a picture has been found, it will be added to the picture caption pairs holding variable, with the captioning text box added if one was found.

This information retrieval method also provides a way for later filtering functions to see if some pictures lack a captioning text box, since those pictures will be the only elements in the


6.2. Filtering

After the data extraction process, the information retrieved must pass through a filtering process where the errors will be found and possible ways to correct them will be generated. This step is where most of the processing and development time is concentrated in most error checking programs, and the one developed as part of this TFE is no different in this regard. The filtering module of the program, called errorcheck.py in the code, provides functions for finding, highlighting and suggesting a solution to the following common errors (which were explained in detail in a previous section):

- Spelling, grammar and style errors in English.

- Missing sections of the document.

- Missing captions on figures and tables.

- Missing labels in text box captions.

- Wrong citation order in the Bibliography section.

- Missing and/or disordered references to sources, tables and figures.

Apart from these error checking functions, this module also contains some auxiliary functions that help with different parts of the overall program even if they do not pertain to any specific error. One example is a function that parses the document and creates a hierarchical index, using paragraph properties to identify headings and normal text. This index is used to separate the results of other functions into a more human-understandable format for the error report.
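A possible shape for such an indexing helper is sketched below. It is a hypothetical simplification, not the function in errorcheck.py: it assumes each paragraph has already been reduced to a (style name, text) tuple and that headings use style names of the form "Heading N", as in the default English Word templates.

```python
import re

HEADING_STYLE = re.compile(r"Heading (\d+)")


def build_index(paragraphs):
    """Group paragraph positions under the heading that precedes them.

    paragraphs: iterable of (style_name, text) tuples in document order.
    Returns a list of dicts with the heading level, its title and the
    positions of the normal paragraphs that belong to it.
    """
    index = []
    current = None
    for position, (style_name, text) in enumerate(paragraphs):
        match = HEADING_STYLE.fullmatch(style_name)
        if match:
            current = {"level": int(match.group(1)), "title": text, "paragraphs": []}
            index.append(current)
        elif current is not None:
            current["paragraphs"].append(position)
    return index


document = [
    ("Heading 1", "Introduction"),
    ("Normal", "First paragraph."),
    ("Heading 2", "Scope"),
    ("Normal", "Second paragraph."),
    ("Normal", "Third paragraph."),
]
index = build_index(document)
```

With an index like this, an error found in paragraph 3 can be reported under the "Scope" heading instead of as a bare paragraph number.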

6.2.1. Spelling, grammar and style errors in English

As explained in detail in a previous section, these kinds of errors are the most common ones. They are really three different categories of errors that are put under the same heading in this work simply because the tool used to analyze them checks all of them at the same time. Using automated systems to examine natural language (that is, a language used by humans to communicate, as opposed to, for example, a computer language like Python) is currently a very active area of research. This field is called natural language processing (NLP), and it has progressed immensely in the last decade thanks, primarily, to the parallel development of artificial intelligence and deep neural networks.


This being the case, some already built NLP solutions were considered and tested for use in this program. The best performance and results were obtained using free trials of several commercially available solutions like Grammarly or ProWritingAid, but the amount of data bandwidth was limited in the free plans. This, along with the design criterion of building a self-contained program that could work without being connected to the internet and the need for the entire tool to be open-source, tipped the balance towards an open-source library.

Among the different open-source NLP libraries available in Python, the closest in features and performance to the commercial solutions is LanguageTool.

This library is a free and open-source grammar checker that was first developed in Python in 2003, but which migrated to Java when more collaborators joined the project. However, a Python wrapper library called language_tool_python [11] exists, which makes its functions available to Python programs and scripts without needing to interact directly with the Java library, even though the Java program is running underneath.

Using this open-source library, it is possible to run the entire NLP check locally, without any connection to the internet, and the performance is comparable to some commercial products. This NLP processing is the longest step in the entire program in terms of processing time, so any amount of efficiency gained here will reduce the critical time of the entire algorithm.

There are five possible sections of the document the user may want to check using the program, which are the following:

- Paragraphs, i.e., the main part of the document.

- Text inside tables.

- Text inside text boxes.

- Headers.

- Footers.

The text from all of those sources has already been extracted from the Docx file in the previous module, so the functions provided in this module work by using that data along with the LanguageTool NLP functions.

As such, the functions are quite straightforward, and an example could be the one used to check the spelling, grammar and style of the main paragraphs in the text, which is shown in Figure 7.


The imports and the doc and tool variables are in Figure 7 to provide context, since they are set in the user interface and not in the actual filtering module.

This function uses a docx.Document and a LanguageTool object as inputs and iterates over all the paragraphs in the Docx file, while passing all the text in the paragraphs through the LanguageTool filter to see if it finds any errors. If an error is found, LanguageTool.check() returns a list of matches, which are then stored in a dictionary along with the corresponding paragraph index in relation to its position on the document.
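The loop described above can be sketched as follows. This is a hedged illustration rather than the function in the program: check_paragraphs and FakeTool are hypothetical names, and the stand-in tool is used only so that the example runs without Java or a LanguageTool download. In real use, the texts would come from the paragraphs of a docx.Document and the tool would be a language_tool_python.LanguageTool object, whose check() method is called in the same way.

```python
def check_paragraphs(paragraph_texts, tool):
    """Map each paragraph index to the non-empty list of matches found in it."""
    errors = {}
    for index, text in enumerate(paragraph_texts):
        if not text.strip():
            continue  # skip empty paragraphs to save processing time
        matches = tool.check(text)
        if matches:
            errors[index] = matches
    return errors


class FakeTool:
    """Stand-in for a LanguageTool object: it only flags the typo 'teh'."""

    def check(self, text):
        return ["Possible spelling mistake: teh"] if "teh" in text else []


texts = ["A correct sentence.", "This is teh report.", ""]
errors = check_paragraphs(texts, FakeTool())
```

Keeping the paragraph index as the dictionary key is what later allows the report to point the user to the place in the document where each error was found.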

The most interesting part of this section of the module is the set of matches returned by the LanguageTool check function. Each match is a structured object which includes not only what error the function found, but also the position of the error in the text, possible replacement solutions and even a human-readable error message to forward to the user, among other things. As an example, Figure 8 shows a sentence in English with an error and the match object the LanguageTool check function returns.


6.2.2. Missing sections in the document

The functions in this section are fairly straightforward. They basically parse the paragraphs in the Docx file (obtained in the data extraction module using the python-docx functions) and look for the text in the headings. If the text corresponding to a particular section heading is not found, it is considered missing.

This, of course, means the program may fail to recognize some sections that are really in the document but that use non-standard naming. It may also signal an error when a given section, like for example, “Environmental Impact”, is missing, but the TFE report author may not have intended to put that section in the document in the first place.

Of course, as this program is intended to be used to highlight errors and act as a first filter, it is better to err on the sensitive side while still being aware of the limitations any automated software has, especially one developed by a single person.

The functions in this section look for the following names in the headings of sections:

- Index

- Glossary

- Introduction

- Budget

- Environmental Impact

- Conclusions

- Bibliography


While not all sections in this list are strictly needed in every TFE report, it was decided by the program designer to highlight if any section in the list was missing and pass the decision on whether or not each section is necessary to the document author, just in case.
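A minimal sketch of this kind of check is shown below. The function name is hypothetical, and the matching rule (a case-insensitive search of the section name inside each heading) is an assumption made for the example, since the exact comparison used in the program is not described here.

```python
REQUIRED_SECTIONS = [
    "Index",
    "Glossary",
    "Introduction",
    "Budget",
    "Environmental Impact",
    "Conclusions",
    "Bibliography",
]


def find_missing_sections(headings, required=REQUIRED_SECTIONS):
    """Return the required section names that appear in none of the headings."""
    normalized = [heading.casefold() for heading in headings]
    return [
        name
        for name in required
        if not any(name.casefold() in heading for heading in normalized)
    ]


headings = [
    "Index",
    "1. Introduction",
    "2. Design",
    "6. Environmental impact",
    "7. Conclusions",
    "Bibliography",
]
missing = find_missing_sections(headings)
```

A section titled "Costs" instead of "Budget" would still be reported as missing, which is the kind of false alarm discussed above.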

6.2.3. Missing captions on figures and tables

Due to how the data on figures, tables and text boxes is stored in document.xml, most of the error checking work for this particular error has already been done in the data extraction module when the information was retrieved.

The functions in the data extraction module that retrieve the data on picture caption and table caption pairs store the information as a list of lists, where each subsidiary list is made up of two elements: the picture (or table) index number by order of appearance in the document, and the text of the accompanying text box, if there is one, or a None element if there is none. Thus, the error checking functions only need to check if the second element is None to conclude whether that picture or table has a caption or not.
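Given that data structure, the check itself reduces to a few lines. The sketch below uses a hypothetical function name and made-up sample data in the list-of-lists format just described.

```python
def elements_without_caption(pairs):
    """Return the index numbers of the pictures or tables whose caption is None."""
    return [index for index, caption in pairs if caption is None]


pairs = [
    [1, "Figure 1. Flowchart"],
    [2, None],
    [3, "Fig. 3. Results"],
    [4, None],
]
missing = elements_without_caption(pairs)
```

The length of the returned list gives the number of uncaptioned elements that the error report can show to the user.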

This error checking process can fail in those cases where more than one picture shares a single caption or where the author uses an invisible table to help structure the document without intending it to be seen as a table.

In document.xml both these edge cases would require special handling, and they have not been implemented due to code complexity.

Detecting multiple pictures that share a single text box as a caption, without introducing unacceptable errors in the detection of normal pairs of one picture with one caption, would require the code to use basically all the .rels files in the Docx document in order to reproduce the exact layout the user sees and, in this way, evaluate whether the pictures are sharing a caption or not. This would have delayed the development beyond the required finishing dates, so this feature was discarded.

The other edge case of using invisible tables suffers from a similar problem, as the invisible border properties are not stored in document.xml but in a separate file and thus, the complexity of operating with multiple file sources is still there.


6.2.4. Missing labels in textbox captions

This error is also quite straightforward to filter, thanks to the previous work done to get all the captions and the text inside.

The function simply iterates through all the picture and table caption pairs that have a caption and checks if there is a label in the text.

For this specific purpose, the labels defined in the code are shown in Table 1, where the X stands for any number.

Captioned element | Accepted labels
Picture | Fig. X or Figure X
Table | Table X

Though some authors may use different labels for tables and pictures, the error report will show the entire text box if it detects a label error, so the author can evaluate if it is a real mistake or if the program missed it because the label is different but still correct, like using Image X instead of Figure X.
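The label check can be sketched with regular expressions built from the accepted labels in Table 1. The function name and the exact patterns are assumptions made for illustration; in particular, the patterns accept the label anywhere in the caption text, which may be more permissive than the actual program.

```python
import re

# Accepted labels from Table 1, where X stands for any number.
LABEL_PATTERNS = {
    "picture": re.compile(r"\b(?:Fig\.|Figure)\s*\d+"),
    "table": re.compile(r"\bTable\s*\d+"),
}


def captions_without_label(pairs, kind):
    """Return the [index, caption] pairs whose caption lacks an accepted label."""
    pattern = LABEL_PATTERNS[kind]
    return [
        [index, caption]
        for index, caption in pairs
        if caption is not None and not pattern.search(caption)
    ]


pairs = [
    [1, "Figure 1. Flowchart"],
    [2, None],
    [3, "Image 3. Test setup"],
    [4, "Fig. 4 Results"],
]
unlabeled = captions_without_label(pairs, "picture")
```

The entry without a caption is skipped here because it is already covered by the missing caption check, while the "Image 3" caption is returned in full so the author can decide whether it is a real mistake.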

6.2.5. Wrong citation order in the Bibliography section

As explained in the common errors section above, the sources in the Bibliography section can have a different order to the references in the text. Changing either one of these orders would be an acceptable solution, but to make the program more robust and less complicated, the functions implemented just check if the sources in the Bibliography section are ordered correctly according to their numbered prefixes, without checking the rest of the text. The order of the references in text will be examined by the functions in the next part of the filtering module. The functions in this part of the module assume a “Bibliography” section with this exact name exists and contains the sources for all citations in the document, numbered at the start of each source as shown in Figure 9. These functions only check if the internal Bibliography ordering is consistent with itself.
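A simple version of this internal consistency check is sketched below. Since Figure 9 is not reproduced here, the prefix format is an assumption: the sketch accepts a leading number optionally wrapped in brackets or parentheses, or followed by a period, and it assumes the numbering starts at 1 and increases by one with each source. The function name is hypothetical.

```python
import re

# Assumed prefix formats: "[1] ...", "(1) ...", "1. ..." or "1 ...".
PREFIX = re.compile(r"^\s*[\[(]?(\d+)[\]).]?")


def bibliography_order_errors(entries):
    """Return (expected_number, entry) for each entry with an unexpected prefix."""
    errors = []
    for expected, entry in enumerate(entries, start=1):
        match = PREFIX.match(entry)
        number = int(match.group(1)) if match else None
        if number != expected:
            errors.append((expected, entry))
    return errors


entries = ["[1] A. Author, First source", "[3] B. Author, Second source", "[2] C. Author, Third source"]
order_errors = bibliography_order_errors(entries)
```

An empty result means the Bibliography is consistent with itself and can be used as the reference order for the checks on the rest of the text.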

Table 1. Accepted labels for captioned elements.


6.2.6. Missing and/or disordered references in text to sources, tables and figures

The last part of the filtering module is dedicated to finding all the references to sources, tables and figures in the text, and seeing if any is missing or if they are in the wrong order.

These errors are intrinsically related to some previous ones the program already checks for, so a great deal of the work is already done.

Once the Bibliography section has been parsed and its source order is internally consistent, it is considered the correct order against which to check the order of appearance of references in the text. This means one function parses the entire text of the document and looks in the paragraphs for a particular marker. In this case, the marker decided by this TFG’s supervisor takes the form of “(X)”, where X is the number of the source in the Bibliography. Then, it checks the order of appearance of these markers against the order of the sources in the Bibliography and reports any mismatches.
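The search for “(X)” markers can be illustrated with the following sketch. The function name and return format are hypothetical, and the rule used to decide that the references are out of order (comparing the order of first appearance with ascending numerical order) is one possible reading of the description above, not necessarily the one implemented in the program.

```python
import re

CITATION = re.compile(r"\((\d+)\)")  # markers of the form "(X)"


def check_citations(paragraphs, source_count):
    """Return (missing sources, whether first appearances are out of order)."""
    first_seen = []
    for text in paragraphs:
        for number in map(int, CITATION.findall(text)):
            if number not in first_seen:
                first_seen.append(number)
    missing = [n for n in range(1, source_count + 1) if n not in first_seen]
    out_of_order = first_seen != sorted(first_seen)
    return missing, out_of_order


paragraphs = [
    "The format is described in (1) and extended in (3).",
    "Later work (2) confirms the results of (1).",
]
missing, out_of_order = check_citations(paragraphs, source_count=4)
```

In this sample, source 4 is never cited and source 3 is cited before source 2, so both kinds of error would be reported. Note that a marker like “(2)” can also appear in the text for other reasons, such as a numbered list, so a check of this kind can produce false alarms.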

For the references to Tables and Figures, the functions work in a similar fashion. Only Tables and Figures with caption text boxes are considered, and from those, only the text boxes with labels are accepted for this check. This is because a reference to a Table or Figure can only be unambiguous when it references the label, so without labels there is no certainty the reference points to any particular object.

Once again, the document is parsed to get the appearance order of both Tables and Figures and this order is checked against the one obtained by searching the paragraphs in the text for labels. If there is any mismatch, an error is logged.

6.3. Output

The output module is the last one in the core program, which also includes the data extraction and filtering modules, and contains all the functions needed to translate the results from previous steps into a human-readable error report.

The program has been designed to be independent of the provided user interface, and any Python script can call the functions in the program and use them without needing actual human input.


While the rest of the modules only contain functions, in the Output module almost every feature is tied up in the ErrorReport class. An ErrorReport object can be initialized with the path to the Docx file the user wants to analyze and all other operations can be done by calling different class methods on this ErrorReport object, including writing the report to disk as a Docx file. Concentrating all callable features into one object makes it easier to use for both outside scripts and the user interface, since all the internal variables of the program are handled by the ErrorReport class and the outside script does not need to keep track of the variables.

Multiple ways of showing the output of the error checking analysis to the user were considered in the design stage of the program, including adding the error report to the end of the examined Docx file, making the report a PDF file to improve compatibility and showing all the errors found in the user interface. In the end, the implemented solution was to generate a separate Docx file that contains only the error report.

This was considered a superior option because it offered numerous advantages over the other possible options. It did not affect automated indexing like it could have if the report was added to the end of the original TFG report. It also produced a Docx file, and as the program is intended to check errors in Docx files, it can be assumed all users have the capability to open and edit this format. And it produced a self-contained file with all errors inside, something the user interface report alternative did not offer.

An additional benefit is that by saving the report as a Docx file, no additional libraries are needed, since python-docx offers functions to make and edit this kind of file.

Apart from internal methods to call the functions of the other modules, the ErrorReport class also has methods to count the total number of errors, write a summary of the error report to put as the first page, and finally one method to translate every finding of the program into something a human can understand.
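The idea of concentrating the state in a single object can be illustrated with the sketch below. It is not the ErrorReport class of the program: the class name, method names and summary wording are all hypothetical, and the step that writes the Docx file with python-docx is left out so that the example runs with the standard library alone.

```python
class ErrorReportSketch:
    """Hypothetical, simplified stand-in for the ErrorReport class."""

    def __init__(self, docx_path):
        self.docx_path = docx_path
        self.results = {}        # check name -> list of errors found
        self.failed_checks = []  # (check name, reason) for checks that crashed

    def run_check(self, name, check_function, *args):
        """Run one check, logging an execution error instead of stopping."""
        try:
            self.results[name] = check_function(*args)
        except Exception as error:
            self.failed_checks.append((name, str(error)))

    def total_errors(self):
        return sum(len(found) for found in self.results.values())

    def summary_lines(self):
        """Build the text lines for the first page of the report."""
        lines = [
            f"Error report for {self.docx_path}",
            f"Total errors found: {self.total_errors()}",
        ]
        lines += [f"{name}: {len(found)} error(s)" for name, found in self.results.items()]
        lines += [
            f"Warning: check '{name}' could not be completed ({reason})"
            for name, reason in self.failed_checks
        ]
        return lines


report = ErrorReportSketch("report.docx")
report.run_check("Missing sections", lambda: ["Glossary", "Budget"])
report.run_check("Broken check", lambda: 1 / 0)  # simulated execution error
summary = report.summary_lines()
```

An outside script only has to keep a reference to the report object; the results of each check and any checks that failed are tracked internally.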

Additionally, even though the program is only designed with support for the English language, the methods in the Output module have been developed with extensibility in mind, and they could theoretically support any language that LanguageTool also accepts if more development time were invested in these changes.


6.4. User interface

The program itself is self-contained in the three modules already explained above, since they contain the methods and functions to perform the error checking process from a Docx file and they can be called by any external script, or even from the command line. But, considering the program offers a variety of different errors to check, and some users may want to check some specific errors but not others, it was thought that a graphical user interface could improve the usability of the program.

Python is not the best language to develop native applications, of which an executable (.exe) in Microsoft Windows is an example, and these programs are also limited to specific operating systems, so a decision was made to use a web interface along with a micro server written in Python as a graphical user interface.

Compared to native applications, web interfaces are considerably simpler to develop, and often provide much better results for a fraction of the effort, all while being operating system agnostic (which means the software can run on any operating system).

Python has a variety of popular and well supported libraries to develop servers, including Django, one of the most popular web frameworks in the world, but most of them are more suited to larger projects more centered in the web-server aspect. For this reason, the library selected for this work was Flask, a Python library dedicated to creating micro servers quickly and stably for small projects like this one.

The user interface module for this project consists of a single Flask server that serves some HTML files through any normal browser and uses the same methods web pages use (GET, POST, forms) to communicate information between the user and the server.

One thing to remark is that even though Flask supports the creation of a production server that can maintain indefinite uptime, its configuration would require more in-depth knowledge about security certificates from the program user than was deemed acceptable, so the intended use, as reflected in the instruction section down below, uses the development server mode provided by Flask.


This module is the only part of the program which contains non-Python code, as the HTML files displayed to the user required some JavaScript code to work correctly. Nonetheless, as the rest of the program is written in Python (as is the entirety of the core program), this author considers the program compliant with the design criterion of developing an error checking tool in Python.

Finally, Figure 10 shows the graphical user interface as seen in Google Chrome, though it should look very similar if not identical in any other modern browser that supports JavaScript.

Figure 10 shows the first page of the web interface the user sees after launching the development server according to the program instructions.

The user must select a Docx file to be examined, with the only limitation being it must weigh less than 100 MB in order to limit the processing time to reasonable lengths. Once the file is selected, the user will see a change in the interface, as can be seen in Figure 11 – and then


After clicking the Submit button, the user will be presented with the check selection page, shown in Figure 12. Any combination is valid, though different options may imply a longer processing time.

Figure 11. File selection page with file selected.


After selecting the options and clicking on the Start button, the program will start and the page will display some kind of loading sign – the appearance of which depends on the specific browser – on the tab.

When the program finishes the analysis, it writes the error report and automatically forwards the user to the final page of the interface, as shown in Figure 13.

If the user clicks on the Download Report button, the error report will be automatically downloaded to the user’s default download folder – which can be set in the browser configuration menu. If the Reset Program button is clicked, the user will be taken back to the starting page and the data stored in the local server will be deleted.


7. Analysis of an Error Report

In this section, an example of an Error Report produced by passing a TFG report through the program (with all possible checks) will be analyzed in detail to explain how to read and interpret the program results.

The TFG report used for this check was provided by my TFG supervisor for testing purposes and it is used here as an archetypical example. Some TFG reports may produce results different from the ones shown in this section, especially if they use different citation schemes.

The order of appearance of the results of each check is predetermined and it will always be the same as long as the check is selected and the program finds some error. If any of these conditions is not true for any particular check, the order will still be the same, just without that particular check.

The first page of the error report consists of an error summary, as shown in Figure 14, that explains how the error report is structured, how many errors each check has found and some of the caveats that may limit the usefulness of the provided data.


If any of the error checks encounters an execution error and the check cannot proceed, the program will log the error, proceed with the other selected checks and notify the user with a warning on this same first page of the error report, as can be seen in Figure 15.

After the summary page, the first errors shown are the missing section errors, as can be seen in Figure 16.

In this particular case, the glossary and budget sections could not be found. The error message already explains that these could be false alarms (for example, a section that exists under a non-standard name), but it could also be that this particular TFG report does not need these sections.

The next section contains the citation errors: both citations in the Bibliography or in the text being in the wrong order, and references to those sources being absent from the text. The specific error message can be seen in Figure 17.

Figure 16. Missing sections errors.


In this TFG report, the order of the numbered citations in the Bibliography section is correct, so the citation order error section is not present in the error report. However, the program has not found references in the text for some or all of the sources in the Bibliography so it will include the citation references missing from text section in the error report as seen in Figure 17.

After this, the next sections will involve errors in Figure and Table captions, in this order. First, the error report will show how many images it has found without a caption, as seen in Figure 18. Only a number is given because the internal document names for each picture in a Docx document are not human-readable, and thus it is very difficult to name a specific picture without a caption in a way that would be unequivocal to a reader. The number is sufficient to give the program user an idea of the magnitude of the error and get them to revisit the document with an eye to correcting these errors.

Next, the error report will list the missing labels on captions for figures, that is, those captions it has found that do not include Fig. or Figure in the text box. These kinds of errors, and those that come after them, start to occupy more and more page space in the error report, and showing just one or a few errors of each section is enough to understand how the error report works.

Figure 17. Citation errors.


An error, along with the heading for this error section and an explanation of what the program is checking, can be seen in Figure 19.

Each error in this section is identified by its subtitle – in this case, Figure caption 4, which denotes this error was found in the fourth figure with a caption in document order.

The last error subsection inside the Figure caption errors section contains the missing references to figures, some examples of which can be seen in Figure 20.

The errors for Table captions appear next in the error reports, but the structure is identical to the figure caption error section explained above so it is not necessary to explain it in detail. The last big section contains the spelling, grammatical and style errors found in the sections of the document selected by the user in the user interface.

First, the errors in the main text of the document will be shown, divided into subsections using the TFG report's own headings. A few of these errors can be seen in Figure 21.

Figure 19. Example of a missing figure label error.


The rest of the error report is made up of the spelling, grammatical and style errors found inside the tables, text boxes, headers and footers, if those checks were selected. The format will be the same as with the errors in the main text, so no figures or detailed explanations are needed.


8. Program instructions

In this section, instructions will be provided for both installing and using the program. The installation process only needs to be run once as long as no folders, files or Python itself changes. The program will work for any operating system capable of running Python 3.7 and Java 8 or above, but the instructions provided will only be for the most common operating systems: Microsoft Windows and Linux.

8.1. Installation

The process of installation is very similar in both operating systems, as the main prerequisite to run the program is Python 3.7. Versions above 3.7 should also work, since they try to maintain backward compatibility and neither the program nor its libraries use any functionality likely to be deprecated in the near future.

Some versions of Linux and even Microsoft Windows already come with Python preinstalled, but to be sure, it is best to go to the official Python webpage [12] and follow the steps described there to download and install the desired version for each operating system.

In order to run a local version of LanguageTool, Java 8 or higher is needed, and it can be downloaded from the official Java website [13]. Unless there is some warning about backward compatibility, it is always best to install the latest version for security reasons.

Once Python 3.7 and Java 8 (or newer versions) have been installed in the system, you can proceed to install the required dependencies. This can be done manually, but it is highly recommended to use the package and virtual environment tool pipenv to do this automatically, since it will create and maintain a virtual environment for this specific program and possibly prevent future dependency conflicts.

The required version of Python already includes the pip package manager, and thus pipenv can be downloaded and installed easily by entering the following command on a terminal (Linux) or CMD/PowerShell (Windows):

>> pip install pipenv


The instructions will proceed as if pipenv was installed, but manually installing the required dependencies is possible. They are written in the requirements.txt file in the main folder of the program.

Once pipenv is installed, the last part of the installation is the creation of the virtual environment with all the required dependencies. This step can be complex and tedious if done manually, but thankfully pipenv completely automates the process.

One needs to navigate to the folder where the Pipfile of the program resides and type the following command:

>> pipenv install

This will create a new, isolated virtual environment, where only the base python libraries and the program dependencies will exist. It will also automatically download all dependencies and their sub dependencies.

Once this is done, the installation process is finished.

8.2. Use

While the previous instructions only need to be followed once per computer (in normal circumstances), these instructions must be followed every time anyone wants to use the program.

First, open a terminal or CMD window in the main folder of the program (where the Pipfile is). Then, enter the following command:

>> pipenv shell

This will let you execute code in the created virtual environment.

After this, the correct command will depend on the specific operating system.

For Windows (CMD): >> set FLASK_APP=web

For Windows (PowerShell): >> $env:FLASK_APP = "web"

For Linux: >> export FLASK_APP=web

This will set the correct Flask environment variable so the development server the user interface runs on can start.

Finally, run the following command:


A message similar to the one shown in Figure 22 will appear, with a local IP address. This will be the local server address where you can access the program.

If this local address is accessed using a browser, like Google Chrome for example, the initial page of the program will show up.

Once the report has been generated and downloaded, click on the Reset Program button to clean the server for the next time it is used.

If you wish to stop using the program, press the key combination shown next to the local address in the terminal or CMD window (CTRL+C in the case of Windows) to shut down the Flask server safely. Finally, you can close the browser and the terminal or CMD window.
