Dictionary development - Data Preparation

Chapter 7 Identifying Crime Scripts Through Computer-aided Content Analysis

7.2 Data Preparation

7.2.1 Dictionary development

To reduce the flexibility and ambiguity of the text, a set of dictionaries was prepared and applied to the data. The starting point for building these dictionaries was a set of freely available natural language tools, including the R packages Aspell for spell checking and Snowball for stemming, and the English stopword list available with the tm package. The TerMine service, provided by NaCTeM, was used to generate a list of the most frequently occurring multi-word phrases in the data (Frantzi et al., 2000). The use of the WordNet lexical database to identify related words was explored, but this proved difficult to implement in the R environment (other than searching for one synonym at a time). While all of the tools used were revised and tailored to the data over a series of iterations, the identification of synonyms and related words was the task that required the most manual effort. To ensure anonymity, personal names and proper nouns relating to streets and other local locations were identified and removed.

As already suggested, the ‘off-the-shelf’ tools provided a starting point for data standardisation, but using them in their raw form did not resolve all of the issues in the data. For example, the spellchecker only identified and corrected a small proportion of errors in the MO descriptions.

Spelling errors were identified as a problem in Chapter 6, but the data preparation phase highlighted the true extent of this problem. Not only was the level of errors particularly high, but the incorrect spellings were in many instances so far removed from the intended term that they were not picked up by spell checkers. The initial spelling errors therefore had to be identified manually and included in the dictionary. Table 7-1 highlights just one example of this issue, showing that there were 36 different spellings of escape in the data; in addition, the alternative words and phrases for escape and their misspellings provided a further 36 terms. These 74 terms were all recoded to the token escape. Similarly, there were 127 variations of offender, including spelling variations of offender, suspect, aggressor and assailant (but not including descriptions where terms such as youth or gang were used in place of offender). There were also 48 terms used for victim (including IP, injured party, victim, complainant and spelling variations).

Table 7-1 Alternative words and incorrect spellings of ‘escape’ found in the data

Spellings of Escape in the Data | Alternative Words for Escape in the Data
andescaped | absconded
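
The recoding of variants to a single token can be illustrated with a minimal sketch. It is not the implementation used in this thesis (which was built in R); the `VARIANTS` mapping and the `recode_tokens` function are hypothetical names, and the mapping holds only a handful of the variants mentioned above rather than the full dictionaries.

```python
import re

# Hypothetical variant dictionary: each key is a variant seen in the text,
# each value is the standard token it is recoded to. Only a few of the
# variants named in the text are included here.
VARIANTS = {
    "andescaped": "escape",
    "absconded": "escape",
    "suspect": "offender",
    "aggressor": "offender",
    "assailant": "offender",
    "ip": "victim",
    "complainant": "victim",
}


def recode_tokens(text, variants=VARIANTS):
    """Lower-case the text, split it into word tokens and recode known variants."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    # Tokens with no dictionary entry are passed through unchanged.
    return [variants.get(tok, tok) for tok in tokens]


print(recode_tokens("SUSPECT PUSHED IP AND ABSCONDED"))
```

Because the lookup works on single tokens, multi-word variants such as injured party would need to be joined into one token before this step.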

The TerMine service generated a list of suggested multi-word phrases, each of which would be recoded as one token, but these needed checking by hand and refining by adding and removing terms. Phrases that were added included: under_the_influence; made_good_their_escape; stood_in_her_way and racist_insults. It was particularly important to identify multi-word phrases containing negating words, which reverse the meaning of a term or phrase. Phrases where the negating term was directly collocated with the connected term, such as no_injury and without_warning, were effectively picked up by TerMine. Other instances were more complex and needed to be identified manually, e.g. without_any_warning; not_put_in_any_fear_of_violence; no_threats_or_weapons (in the last example no_threats would be picked up automatically but no_weapons would not). These multi-word phrases were recoded as a single token, treated as one word, in which the negative element was clear; some examples of recoded phrases relating to no injury and without provocation are provided in Table 7-2. Negating terms may not always be as obvious as no, didn’t, did not etc. For example, in the MO description below, the combination of the words slow and speed completely changes the meaning of the word speed:

<<BETWEEN TIMES STATED IP WALKING IN A RESIDENTIAL AREA WHEN A VEHICLE STARTED FOLLOWING HIM AT A SLOW SPEED. THE VEHICLE PULLED OVER AND OFFENDER GOT OUT AND APPROACHED IP AND SAID WHAT HAVE YOU GOT FOR ME"".

HE PUSHED IP TO GROUND BEFORE TAKING HIS WALLET FROM HIS JEANS POCKET AND THEN MADE OFF IN VEHICLE DRIVEN BY OFFENDER 2 IN THE DIRECTION OF XXXX ROAD ROAD CAR>>

Table 7-2 Examples of multi-word phrases (including negating phrases) from the data

Raw Text | Changed to Token
No injury; No visible injuries; Not hurt; No visible injury; Not injured; Did not suffer any injuries | no_injury
Without provocation; Without any provocation; No provocation; Un provoked | no_provocation
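
A minimal sketch of this kind of phrase recoding is given below, using the raw phrases listed in Table 7-2. It is an illustration rather than the procedure actually applied: the `PHRASES` mapping and the `recode_phrases` function are hypothetical names, and it is assumed here that every raw phrase in the table maps to the token shown for its group.

```python
import re

# Raw phrases from Table 7-2 (lower-cased), each mapped to its single token.
PHRASES = {
    "no injury": "no_injury",
    "no visible injuries": "no_injury",
    "no visible injury": "no_injury",
    "not hurt": "no_injury",
    "not injured": "no_injury",
    "did not suffer any injuries": "no_injury",
    "without provocation": "no_provocation",
    "without any provocation": "no_provocation",
    "no provocation": "no_provocation",
    "un provoked": "no_provocation",
}


def recode_phrases(text, phrases=PHRASES):
    """Replace each known multi-word phrase with its single-token form."""
    text = text.lower()
    # Try longer phrases first so that a short phrase cannot pre-empt
    # a longer one that starts at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b")
    return pattern.sub(lambda m: phrases[m.group(0)], text)


print(recode_phrases("IP DID NOT SUFFER ANY INJURIES"))
print(recode_phrases("ATTACKED WITHOUT ANY PROVOCATION"))
```

Recoding phrases before single words matters here: once no_injury is a single token, later steps that treat words in isolation can no longer separate the negation from the term it negates.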

As noted in Chapter 4, the meaning of many words can only be understood in relation to their context. The methods used to analyse data in this thesis largely examine words in isolation from their context and can therefore introduce ambiguity. This is a particular problem with homographs, distinct words that share the same spelling but have different meanings (e.g. saw (verb) and saw (noun)). To reduce uncertainty about the meaning of terms, potentially ambiguous terms had to be identified and their meaning clarified. Again, existing lists of known homographs in the English language can provide a starting point, but much of this work had to be conducted manually by inspecting the database. KWIC (key word in context) lists, also known as concordance views, aided the process of identifying the meaning of ambiguous words. These lists display every occurrence of a word surrounded by its context, which facilitates the inspection of the different meanings that a word may have within the data. Figure 7-1 below shows a KWIC list for the word ‘escort’. The list highlights the two different meanings of the similarly spelled words escort and escorted. If stemming had occurred prior to disambiguation, all instances of escorted may have been recoded to escort and then to car, a misinterpretation of the original word. Disambiguation can be made more difficult by the use of slang and local variations; for example, the names of pubs or local landmarks taken out of context could lead to misinterpretation (imagine the impact that many crimes occurring at a Punch and Judy Inn would have on the perceived level of violence in the data if the issue was not recognised and corrected). In the current data, the location code (park, licensed premises, shop etc.) was consistently recorded, allowing a quick check for the names of licensed premises and the removal of related proper nouns.
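
The idea behind a KWIC list can be sketched in a few lines. This is an illustrative sketch only, not the software used to produce the list in this chapter; the function name `kwic`, its `width` parameter and the two example descriptions are invented for the illustration and are not taken from the data.

```python
import re


def kwic(descriptions, keyword, width=3):
    """Return (left context, matched word, right context) for every token
    that starts with the keyword, across all descriptions."""
    rows = []
    for text in descriptions:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok.startswith(keyword):
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    return rows


# Invented example descriptions (not from the thesis data).
examples = [
    "OFFENDER MADE OFF IN A FORD ESCORT",
    "IP WAS ESCORTED HOME BY A FRIEND",
]
for left, word, right in kwic(examples, "escort", width=2):
    print(f"{left:>12} | {word:<10} | {right}")
```

Matching on the start of the token keeps escort and escorted in the same list, so the two senses can be compared side by side before any stemming is applied.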

Some advantages in data preparation stemmed from the data’s status as a local grammar. This restricted the available meanings of many words that otherwise could have been ambiguous. Thus stalk always referred to the verb, not the noun, while both implement and object always referred to the noun. However, phone could relate to the noun, as in mobile phone, but was often used as a verb, as in phoned the police; the term also appeared in phone box. This issue was resolved through the identification of multi-word phrases relating to calling the police or calling 999, and of words and phrases relating to phone boxes.

Figure 7-1 A screenshot of a KWIC list showing all instances of escort in the data

One issue which arose in the standardisation of the text was the stemming of plural words where the distinction between singular and plural was important for the analysis. This issue was relevant when MO descriptions included either the number of offenders or the number of other actors in the situation. The problem is heightened by the fact that the standardisation process removes punctuation, so that ‘offender’s’ becomes ‘offenders’ and then ‘offender’. Further complications are added by the fact that this punctuation was often missing or misplaced in the raw data. Even if it had been possible to accurately retain the singular or plural in these cases, the number of offenders or other actors was often indicated by shorthand such as X2. Looking once more at the examples in Figure 7-1 above, we can see details of the number of offenders given in various ways, including:

• 3 U K OFFENDERS

• THIRD OFFENDER AN IC3

• OFFENDERS 2 WHITE MALES

• 2 UNKNOWN OFFENDERS

• X1 IC1 MALE AND X1 IC3 MALE

• OFFENDERS 1 AND 2
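
The loss of the singular, plural and possessive distinction described above can be shown with a small sketch. The functions below are hypothetical, and `naive_stem` is a deliberately crude stand-in for a real stemmer such as Snowball, used only to make the effect visible.

```python
import re


def strip_punctuation(token):
    """Lower-case the token and remove every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def naive_stem(token):
    """Toy stand-in for a stemmer: drop one trailing 's' from longer words.
    This is NOT the Snowball algorithm."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def standardise(token):
    """Apply punctuation removal followed by the toy stemmer."""
    return naive_stem(strip_punctuation(token))


for raw in ["OFFENDER", "OFFENDER'S", "OFFENDERS"]:
    print(raw, "->", standardise(raw))
```

All three forms end up as the same token, which is why the number of offenders cannot be recovered from the standardised text alone.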

Similar problems emerge when attempting to uncover the number of victims in a description, or the number of people who were in the company of a victim at the time of the offence. It was decided that the data processing would not attempt to ascertain the number of victims or offenders in each description. This represents a limitation of the analysis, as co-offenders are a key resource in scripts (the token group later proved significant in the analysis) and other actors have the potential to be crime preventers or crime promoters. This problem could have been addressed with the use of additional variables relating to the numbers of offenders and victims and their characteristics, but these data were not part of the original request made to Force A. Providing a way of ascertaining the number of actors in each scene would certainly improve any future research using similar techniques.

In document The Utility of Applying Textual Analysis to Descriptions of Offender Modus Operandi for the Prevention of High Volume Crime (Page 145-151)