
3.4 Automatic user requests elicitation

3.4.4 Single rule structures and flexibilities

For a single rule to work, a training text from the NLTK corpus must be chosen so that the tokenizer produces consistent patterns. In the first implementation of this framework, it is set as

train_text = state_union.raw("2005-GWBush.txt")

and it remains the same for both the debugging code and the actual elicitation code that talks to the database.

A typical user request linguistic rule is composed of two parts: a chunking grammar (with chinking if necessary) and a regular expression. For rule 78 in the example above, the chunking and chinking grammar can be expressed as follows.

Code Snippet 13 An example chunking and chinking grammar

chunkGram = r"""Chunk:{<MD><PRP|VB|NN><VBP><.*>*<NN|NNS|NNP|NNPS><.*>*}
                        }<\.|\?>+{"""

This grammar looks for a modal verb, followed by a personal pronoun, a base form verb, or a noun, then a non-third person singular present verb, then any number of tokens with any tag before and after a noun of any form. The chinking line ensures that the chunk ends before the end-of-sentence punctuation, such as a period, exclamation mark or question mark.

In this grammar, <PRP|VB|NN> stands for the word “I”, whether it was written as a capital “I” or a lowercase “i”. The alternatives are needed because this implementation converts the text of the sentence to lowercase.

The impact of .lower() is that a capital “I”, which is parsed as “PRP”, becomes a lowercase “i” that is parsed as “NN” and sometimes as “VB”.

The period and question mark in the chinking line of Code Snippet 13 have to be escaped with a backslash. Otherwise, a period matches any character except a newline, and a question mark acts as a quantifier matching 0 or 1 repetitions.
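The tag sequence described above can be illustrated without a tagger. The following is a minimal sketch only: it is a plain `re` stand-in over hand-tagged tokens, not `nltk.RegexpParser`, and the helper names (`TAG_PATTERN`, `strip_sentence_end`, `passes_chunk_test`) and the two sample sentences are hypothetical. The chink is simplified to dropping trailing tokens tagged ".", the Penn Treebank tag shared by periods, question marks and exclamation marks.

```python
import re

# Stand-in for the tag pattern of Code Snippet 13, applied to a string of
# "<TAG>" items. Assumption: illustration only, not nltk.RegexpParser.
TAG_PATTERN = re.compile(
    r"<MD>"                      # modal verb
    r"<(?:PRP|VB|NN)>"           # "i" tagged as pronoun, base form verb, or noun
    r"<VBP>"                     # non-third person singular present verb
    r"(?:<[^<>]+>)*"             # any tag, any number of times
    r"<(?:NN|NNS|NNP|NNPS)>"     # a noun of any form
    r"(?:<[^<>]+>)*"             # any tag, any number of times
)


def strip_sentence_end(tagged):
    """Simplified chink: drop trailing tokens tagged '.'."""
    tokens = list(tagged)
    while tokens and tokens[-1][1] == ".":
        tokens.pop()
    return tokens


def passes_chunk_test(tagged):
    """Return True if the tag sequence contains the rule 78 pattern."""
    tags = "".join("<%s>" % tag for _, tag in strip_sentence_end(tagged))
    return TAG_PATTERN.search(tags) is not None


# Hand-tagged, hypothetical examples (not produced by a tagger).
request = [("can", "MD"), ("i", "NN"), ("have", "VBP"), ("a", "DT"),
           ("dark", "JJ"), ("mode", "NN"), ("?", ".")]
praise = [("this", "DT"), ("app", "NN"), ("is", "VBZ"),
          ("great", "JJ"), (".", ".")]

print(passes_chunk_test(request))
print(passes_chunk_test(praise))
```

Only the first sentence contains the modal, “i”, VBP, noun sequence, so only it would go on to the regular expression test.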


Only texts that have passed the chunking (with chinking) grammar are then tested against the regular expression. For rule 78, the regular expression is shown below.

Code Snippet 14 An example regular expression test

regulars = re.findall(r'[c|C]an\s[i|I]\shave\s(.+)', str(string))

In this example, the code tests the text for either a “can” or a “Can” followed by a space, then either a lowercase “i” or a capital “I”, a space, a “have”, and a space. If the text passes this test, the text after that is returned as the resulting user request.

The resulting user request must not be empty, however. The group “(.+)” requires at least one character inside the pair of parentheses, and whatever it matches is the target text.
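The behaviour of this regular expression can be checked in isolation. The sketch below reuses the pattern from Code Snippet 14 unchanged; the wrapper `extract_request` and the sample sentences are hypothetical and only serve the illustration.

```python
import re

# Pattern copied from Code Snippet 14.
PATTERN = r'[c|C]an\s[i|I]\shave\s(.+)'


def extract_request(sentence):
    """Return the text captured by (.+), or None if the pattern does not match."""
    matches = re.findall(PATTERN, str(sentence))
    return matches[0] if matches else None


print(extract_request("can i have a dark mode"))   # a dark mode
print(extract_request("Can I have offline maps"))  # offline maps
print(extract_request("can i have "))              # None, because (.+) needs at least one character
```

One detail worth knowing: inside a character class, “|” is a literal character rather than an alternation operator, so “[c|C]” also accepts a vertical bar. Writing “[cC]” and “[iI]” would restrict the match to the two intended letters.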

Rule 78 is a typical rule in the implementation of this component. This implementation defines a function for each rule and returns the user request value. In the function, there are normally two tests, one for chunking (with chinking) grammar, and one for regular expression.

However, rule 77 above needs one more test, to ensure that texts passing the regular expression test do not begin with any of “how | nor | why | what | where | what more | what else | how often | not only | no longer”.

This can be achieved by the code in Code Snippet 15.

In this example taken from rule 77, the regular expression defines two pairs of parentheses, whose matches are available as request[0] and request[1] respectively. The first pair of parentheses defines all the terms that must not appear before a user request that matches the chunking grammar and the regular expression in this rule. The quantifier on this pair of parentheses is “*”, so the terms can appear zero or more times. The second pair of parentheses captures the actual target text for the user request. The logic below the regular expression definition prohibits the return of the target text if any term defined in request[0] does appear. This is an example of the flexibility that the current user request elicitation rule structure offers.


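Each user request linguistic rule is currently implemented as a function. Each function contains a “try” block and an “except” block. Readers should be aware that the “except” block in these functions must not consist of “pass” alone. The reason is that the code contains many rule functions with similar structures in parallel. In Python, “pass” is a statement that does nothing: in an “except” block it silently swallows the exception, and execution continues after the “try”/“except” statement. Any variable that the “try” block was supposed to assign then keeps whatever value it held before, so a result left over from an earlier rule or sentence can appear to be passed on to the next one.

This can create unexpected results if the developer is not aware of it. The “except” block should therefore assign or return an explicit value.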
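The effect can be reproduced with a few lines. This is a minimal sketch under assumed names (`elicit_with_pass`, `elicit_explicit`, `reviews`), not the code of this implementation: both functions apply a rule 78 style pattern to a list of sentences and differ only in their “except” block.

```python
import re

PATTERN = r'[cC]an\s[iI]\shave\s(.+)'


def elicit_with_pass(sentences):
    """'pass' swallows the error, so 'request' keeps its previous value."""
    results = []
    request = "None"
    for sentence in sentences:
        try:
            request = re.findall(PATTERN, sentence)[0]
        except IndexError:
            pass  # does nothing: the stale value of 'request' survives
        results.append(request)
    return results


def elicit_explicit(sentences):
    """The except block assigns an explicit value for non-matching sentences."""
    results = []
    for sentence in sentences:
        try:
            request = re.findall(PATTERN, sentence)[0]
        except IndexError:
            request = "None"
        results.append(request)
    return results


reviews = ["can i have a dark mode", "great app, works well"]
print(elicit_with_pass(reviews))  # ['a dark mode', 'a dark mode']
print(elicit_explicit(reviews))   # ['a dark mode', 'None']
```

With “pass”, the second review is wrongly reported as containing the request found in the first one.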

Code Snippet 15 An example of flexibility in current user request rule structure

regulars = re.findall(
    r'([h|H|n|w|W][o|h][w|r|y|a|e|t][t|r]?[e]?\s[o|m|e|l][f|o|l|n][t|r|s|l|n][e|y|g][n|e]?[r]?\s)*'
    r'[c|C|w|W][a|o][n|u][l]?[d]?\syou\s[g]?[u]?[y]?[s]?[\s]?[p]?[l]?[e]?[a]?[s]?[e]?[\s]?'
    r'[a|m|g|i|c|p|l][d|a|i|n|h|r|u|o][d|k|v|p|a|e|t|s|o|n][e|u|n|a|t|k|s]?[t|g|a|i]?[e|l|d]?[l|e]?[r]?'
    r'[\s]?[i]?[n]?[t]?[o]?[\s]?\s(.+)',
    str(string))
for request in regulars:
    if request[0] not in ['How often ', 'how often ', 'How ', 'how ', 'nor ',
                          'why ', 'Why ', 'what ', 'What ', 'where ', 'Where ',
                          'What more ', 'what more ', 'What else ', 'what else ',
                          'not only ', 'Not only ', 'no longer ', 'No longer ']:
        return request[1]
    else:
        return "None"
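The two-group mechanism of rule 77 can be seen more easily on a reduced pattern. The sketch below is a hypothetical simplification, not the rule itself: the prefix list is written as an explicit alternation, the verb list (“add”, “make”, “give”, “include”) is an assumption chosen for illustration, and `rule_77_sketch` is an invented name. When a pattern has two groups, `findall` returns one `(prefix, target)` tuple per match, which correspond to request[0] and request[1].

```python
import re

# Hypothetical, simplified stand-in for the rule 77 regular expression.
PATTERN = re.compile(
    r'\b((?:how often|what more|what else|not only|no longer'
    r'|how|nor|why|what|where)\s)?'                  # group 1: forbidden prefix
    r'(?:can|could|would)\syou\s(?:please\s)?'
    r'(?:add|make|give|include)\s'                   # assumed verb list
    r'(.+)',                                         # group 2: target text
    re.IGNORECASE,
)


def rule_77_sketch(sentence):
    """Return the target text unless a forbidden prefix was captured."""
    for prefix, target in PATTERN.findall(str(sentence)):
        if not prefix:
            return target
    return "None"


print(rule_77_sketch("could you please add a dark mode"))  # a dark mode
print(rule_77_sketch("Why would you add ads everywhere"))  # None
```

The second sentence matches the request pattern, but because the first group captured “Why ”, the target text is withheld.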

For the full code of all linguistic rules implemented in this submission, readers are referred to the list of code C.8.

For the POS tag list and the convention of chunking (chinking) grammar and regular expressions, readers are referred to Appendix C POS tag list and user request rule convention.

When writing reviews, users sometimes write words in upper case. Such words can be parsed differently, both in their POS tags and in their dependency trees. To eliminate this effect, all sentences are converted to lowercase at a later stage of this implementation, in both this component (Automatic user requests elicitation) and the next component (Topic-Opinion extraction). The performance of this framework is thereby improved, because false negatives (sentences that should match a rule but would otherwise be missed) are eliminated. This also improves the clarity of the ontology by reducing the number of different individuals.

The most obvious impact of “.lower()” is that the POS tag of capital “I” changes from “PRP” to “NN” and sometimes others. There could be other impacts, such as ('Android', 'NNP') becoming ('android', 'JJ').

The current implementation of the linguistic rules accommodates both normal-case and lowercase sentences. If researchers who take up this framework in the future decide to remove the “.lower()”, the two sets of linguistic rules in this component and the next one will still work with the expected performance. This is not recommended, however, because words deliberately written in upper case would then escape the rules.