6. Model Implementation
6.1 Data Preparation Implementation
6.1.1 Content renaming
Content renaming is the first stage of the data preparation pipeline. The goal of content renaming is to identify and rename rare or unseen semantically relevant lexical units32.
These units are typically VCs such as names, titles, locations and product types. The rationale for content renaming is that some relevant words will be observed during testing but not during model training, or vice versa, simply because they are rare. In the sentence “Andre Pitovsky will be addressing the Shandong community”, the likelihood of observing the expressions ‘Andre Pitovsky’ and ‘Shandong’ in the training or test set is low. Smoothing could be used to omit such rare words from the vocabulary of content words altogether, but because they are important, content renaming is applied instead, assigning a pseudo-word to each class of commonly occurring linguistic units.
31 ‘Domain based document pre-processing of data’ is left out because it is dependent on the document or domain. For the test cases used in this research, the pre-pre-processing is described in chapter 7.
32 Lexical units refer to grammatical word forms that make up the sentence. These include words
Figure 10: Illustration of the Data Preparation Pipeline
The renamed linguistic units include:
• Person Names – In this implementation, GATE’s (General Architecture for Text Engineering33) named entity transducer, a processing resource for identifying named entities including person names, is used. In figure 10 it forms part of the external processing resources seen in the architecture. In the implementation, GATE is generally good at identifying patterns such as first names e.g. ‘John’, [first name + last name] e.g. ‘John Smith’, and [first name + middle name + last name] e.g. ‘John Doe Smith’. However, it sometimes fails to identify other nondescript name patterns, such as names preceded by titles e.g. ‘President Thambo Mbeki’, or names that are not Germanic, Latin or Hebrew, like ‘Thambo Mbeki’.
To address this, a custom processing resource was implemented in GATE to identify such names using regular expressions. All person names identified are mapped to the pseudo-word ‘PERSONNAME’.
In addition to person names, all acronyms and abbreviations are identified and reverted to their abbreviated form. As part of the external processing resources seen in figure 10, an external gazetteer (a list) of acronyms is compiled to aid the identification of popular acronyms and abbreviations. Alphanumeric expressions, which could be product names e.g. ‘Audi A3’ or locations e.g. ‘M40’, are also identified. Since the gazetteer’s coverage cannot be exhaustive, regular expressions were developed to augment the identification of abbreviations, acronyms and alphanumeric entities34. Abbreviations and acronyms are mapped to the pseudo-word ‘ABBREVIATIONNAME’, while alphanumeric expressions are mapped to the pseudo-word ‘ALPHANUMERICNAME’.
• Dates, Numbers and Currency – Since numbers and currencies are important as quantifiers, it was vital to represent them. As with names, GATE’s processing resource was applied. However, GATE does not identify date expressions like ‘In the 20th century’, so a regular expression engine was implemented to identify these. The following pseudo-words are assigned:
o Dates - All date expressions are assigned the pseudo-word ‘FIGUREDATE’. These include expressions such as ‘20th century’, ‘12th of August’, ‘July’ and ‘12-11-2001’. For instance, in sample sentence 1 below, drawn from the research data, ‘2017’ is replaced with ‘FIGUREDATE’, as seen in sample sentence 2.
Sample sentence 1 – “Only the Conservative Party will deliver real change and real choice on Europe, with an in-out referendum by the end of 2017.”
Sample sentence 2 – “Only the Conservative Party will deliver real change and real choice on Europe, with an in-out referendum by the end of FIGUREDATE.”
o Currency, Numbers and Percentages – This category includes numbers expressed either as words or as Arabic numerals. Since numeric values can range from −∞ to +∞, generic pseudo-words are assigned to all values less than 0: ‘FIGURENEGATIVENUMBER’ for numeric expressions, ‘FIGURENEGATIVEPERCENT’ for percentages and ‘FIGURENEGATIVEMONEY’ for currencies. For values greater than 0, a naming convention is used which combines the expression ‘FIGURE’, the number of digits before any decimal point expressed in words, and the category ‘PERCENT’, ‘MONEY’ or ‘NUMBER’. Table 5 shows samples of these mappings.
34 Regular expressions - ‘^(?![0-9]*$)([A-Z0-9](\.)?){2,}$’ and ‘\b[A-Z\.?]{2,}\b’ - are
Table 5: Samples of Mappings from Linguistic Numerals to Pseudo-words
Linguistic Expression    Pseudo-word Expression
5000 pounds              FigureFourDigitMoney
10.45%                   FigureTwoDigitPercent
2 trillion dollars       FigureThirteenDigitMoney
75987                    FigureFiveDigitNumber
0.45                     FigureOneDigitNumber
Numeric and monetary entities of this sort were identified generically by GATE’s Semantic Tagger (GST). However, to make the distinctions seen in table 5, a second parser, called a post parser, was implemented. For each monetary, date or numeric entity identified by GST, the post parser checks whether it contains a decimal point, counts the digits before the decimal point and then assigns a pseudo-word to the entity.
• Locations – Locations, like names, are proper nouns. Locations were separated from names to capture the semantic reality that persons are different from locations. The GATE gazetteer was extended with a list of countries, cities and towns around the world. Locations identified in the sentence are mapped to the pseudo-word ‘LOCATIONNAME’.
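The digit-counting step attributed to the post parser can be illustrated with a minimal Python sketch. It is not the GATE implementation used in this research: the function name `figure_pseudo_word`, the scale-word table and the regular expression are all assumptions made for illustration, and the category ("Number", "Money" or "Percent") is assumed to be supplied by the upstream tagger. Numbers written out in words are not handled, and the casing of the returned pseudo-words follows Table 5.

```python
import re
from decimal import Decimal

# Digit counts spelled out, as used inside the pseudo-word names.
DIGIT_WORDS = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven",
               "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen",
               "Fourteen", "Fifteen"]

# Scale words and their power of ten (short scale; an assumption).
SCALES = {"thousand": 3, "million": 6, "billion": 9, "trillion": 12}

# Optional minus sign followed by a numeral with optional decimal part.
NUMBER_RE = re.compile(r"(?P<sign>[-−]?)\s*(?P<num>\d[\d,]*(?:\.\d+)?)")


def figure_pseudo_word(entity: str, category: str) -> str:
    """Map a numeric entity to a pseudo-word (hypothetical helper)."""
    match = NUMBER_RE.search(entity)
    if match is None:
        raise ValueError(f"no numeral found in {entity!r}")

    # All negative values share one generic pseudo-word per category.
    if match.group("sign"):
        return f"FIGURENEGATIVE{category.upper()}"

    value = Decimal(match.group("num").replace(",", ""))
    for word, power in SCALES.items():
        if re.search(rf"\b{word}\b", entity, re.IGNORECASE):
            value *= 10 ** power

    # Only the digits before the decimal point are counted.
    digits = len(str(int(value)))
    if digits >= len(DIGIT_WORDS):
        raise ValueError(f"{entity!r} has too many digits for this sketch")
    return f"Figure{DIGIT_WORDS[digits]}Digit{category}"
```

Under these assumptions, `figure_pseudo_word("10.45%", "Percent")` returns `FigureTwoDigitPercent`, `figure_pseudo_word("2 trillion dollars", "Money")` returns `FigureThirteenDigitMoney`, and `figure_pseudo_word("-3.5%", "Percent")` returns `FIGURENEGATIVEPERCENT`.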
The content renaming pipeline commences with the renaming of persons, abbreviations and acronyms. This is followed by the renaming of locations and finally of dates, numbers and currency. The reason for this ordering is that a person’s full name, or a substring of it, may also be the name of a country. For instance, the last name in the person name ‘Johnny England’ is also the name of a country, ‘England’. It is reasoned that such expressions are more likely to be names than countries, so names are converted before locations to avoid assigning the wrong pseudo-word to the entity. In conclusion, the job of this part of the pipeline is to map named entities to pseudo-words. The second part of the pipeline, shown as ‘Punctuation Elimination’ in figure 10, is discussed in the next section.
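The effect of the stage ordering can be seen in a small Python sketch. It stands in for the GATE pipeline rather than reproducing it: the toy gazetteers, the simplified regular expressions and all function names are assumptions made for illustration, and only four-digit years are treated as dates.

```python
import re

# Toy stand-ins for the GATE resources; every entry is an illustrative
# assumption, not a gazetteer used in the research.
PERSON_NAMES = ["Johnny England", "Andre Pitovsky"]
LOCATIONS = ["England", "Shandong"]
PSEUDO_WORDS = {"PERSONNAME", "ABBREVIATIONNAME", "ALPHANUMERICNAME",
                "LOCATIONNAME", "FIGUREDATE"}


def replace_phrases(text, phrases, pseudo_word):
    """Replace each listed phrase (whole words only) with a pseudo-word."""
    for phrase in sorted(phrases, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(phrase)}\b", pseudo_word, text)
    return text


def rename_persons(text):
    return replace_phrases(text, PERSON_NAMES, "PERSONNAME")


def rename_abbreviations(text):
    # Simplified patterns: letter-digit codes such as 'M40', then runs of
    # two or more capitals. Pseudo-words from earlier stages are skipped.
    text = re.sub(r"\b[A-Z]+[0-9]+\b", "ALPHANUMERICNAME", text)
    return re.sub(
        r"\b[A-Z]{2,}\b",
        lambda m: m.group(0) if m.group(0) in PSEUDO_WORDS else "ABBREVIATIONNAME",
        text,
    )


def rename_locations(text):
    return replace_phrases(text, LOCATIONS, "LOCATIONNAME")


def rename_dates(text):
    # Only four-digit years are handled in this sketch.
    return re.sub(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", "FIGUREDATE", text)


# Stage order follows the text: persons, abbreviations, locations, figures.
STAGES = [rename_persons, rename_abbreviations, rename_locations, rename_dates]


def rename_content(text, stages=STAGES):
    """Apply the renaming stages in order and return the renamed text."""
    for stage in stages:
        text = stage(text)
    return text
```

With the default order, `rename_content("Johnny England told the BBC he left England in 2017 via the M40.")` yields "PERSONNAME told the ABBREVIATIONNAME he left LOCATIONNAME in FIGUREDATE via the ALPHANUMERICNAME." If locations are renamed first, as in `rename_content(sentence, [rename_locations, rename_persons])`, the surname is consumed as a location and the result begins "Johnny LOCATIONNAME", which is the mislabelling the ordering is meant to prevent.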