Chapter 7 Identifying Crime Scripts Through Computer-aided Content Analysis
7.2 Data Preparation
7.2.1 Dictionary development
To accomplish the tasks of reducing flexibility and ambiguity, a set of dictionaries was prepared and applied to the data. The starting point for building these dictionaries was the use of freely available natural language tools, including the R packages Aspell for spell checking, Snowball for stemming and the English stopword list available with the tm package. The TerMine service, provided by NaCTeM, was used to generate a list of the most frequently occurring multi-word phrases in the data (Frantzi et al., 2000). The use of the WordNet lexical database to identify related words was explored but this
proved difficult to implement in the R environment (other than searching for one synonym at a time). While all of the tools used were revised and tailored to the data over a series of iterations, the identification of synonyms and related words was the task that required the most manual effort. To ensure anonymity, personal names and proper nouns relating to streets and other local locations were identified and removed.
As already suggested, the ‘off-the-shelf’ tools provided a starting point for data standardisation but using them in their raw form did not resolve all of the issues in the data. For example, the
spellchecker only identified and corrected a small proportion of errors in the MO descriptions.
Spelling errors were identified as a problem in Chapter 6, but the data preparation phase highlighted the true extent of this problem. Not only was the level of errors particularly high, but the incorrect spellings were in many instances so far removed from the intended term that they were not picked up by spell checkers. Therefore, initial spelling errors had to be identified manually and included in the dictionary. Table 7-1 highlights just one example of this issue, showing that there were 36 different spellings of escape in the data. In addition, the alternative words and phrases for escape and their misspellings provided a further 36 terms. These 74 terms were all recoded to the token escape. Similarly, there were 127 variations of offender, including spelling variations of offender, suspect, aggressor and assailant (but not including descriptions where terms such as youth or gang were used in place of offender). There were also 48 terms used for victim, including IP, injured party, victim, complainant and spelling variations.
Table 7-1 Alternative words and incorrect spellings of ‘escape’ found in the data
Spellings of Escape in the Data | Alternative Words for Escape in the Data
andescaped | absconded
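The recoding of variants to a single token can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the R workflow used in the thesis. The dictionary below is a hypothetical excerpt containing only variants quoted in the text, and the function name `recode_tokens` and the sample description are assumptions.

```python
import re

# Hypothetical excerpt of a variant dictionary: each raw variant maps to one
# token. Only variants quoted in the text are listed; the dictionaries built
# for the thesis were far larger.
VARIANT_TO_TOKEN = {
    "andescaped": "escape",
    "absconded": "escape",
    "suspect": "offender",
    "aggressor": "offender",
    "assailant": "offender",
    "ip": "victim",
    "complainant": "victim",
}

def recode_tokens(text, mapping=VARIANT_TO_TOKEN):
    """Lower-case the text, split it into words and recode known variants."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [mapping.get(word, word) for word in words]

# Invented sample description, not taken from the data.
print(recode_tokens("SUSPECT PUSHED IP ANDESCAPED"))
# ['offender', 'pushed', 'victim', 'escape']
```

A single-word lookup of this kind cannot handle variants such as injured party, which span more than one word and would need to be recoded as multi-word phrases first.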
The TerMine service generated a list of suggested multi-word phrases, each of which would be recoded as one token, but these needed checking by hand and refining by adding and removing terms. Phrases that were added included: under_the_influence; made_good_their_escape;
stood_in_her_way and racist_insults. It was particularly important to identify multi-word phrases containing negating words, which reverse the meaning of a term or phrase. Phrases where the negative term was directly collocated with the connected phrase, such as no_injury and without_warning, were effectively picked up by TerMine. Other instances were more complex and needed to be identified manually, e.g. without_any_warning; not_put_in_any_fear_of_violence; no_threats_or_weapons (in the last example no_threats would be picked up automatically but no_weapons would not). These multi-word phrases were recoded as a token which was treated as one word in which the negative element was clear. Some examples of recoding phrases relating to no injury and without provocation are provided in Table 7-2. Negating terms may not always be as obvious as no, didn’t, did not etc. For example, in the MO description below, the inclusion of the word slow completely changes the meaning of the word speed:
<<BETWEEN TIMES STATED IP WALKING IN A RESIDENTIAL AREA WHEN A VEHICLE STARTED FOLLOWING HIM AT A SLOW SPEED. THE VEHICLE PULLED OVER AND OFFENDER GOT OUT AND APPROACHED IP AND SAID WHAT HAVE YOU GOT FOR ME"".
HE PUSHED IP TO GROUND BEFORE TAKING HIS WALLET FROM HIS JEANS POCKET AND THEN MADE OFF IN VEHICLE DRIVEN BY OFFENDER 2 IN THE DIRECTION OF XXXX ROAD ROAD CAR>>
Table 7-2 Examples of multi-word phrases (including negating phrases) from the data
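Raw Text | Changed to Token
No injury | no_injury
No visible injuries | no_injury
No visible injury | no_injury
Not hurt | no_injury
Not injured | no_injury
Did not suffer any injuries | no_injury
Without provocation | no_provocation
Without any provocation | no_provocation
No provocation | no_provocation
Un provoked | no_provocation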
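The recoding of negating phrases shown in Table 7-2 can be sketched as follows. This is an illustrative Python sketch rather than the procedure used in the thesis. The function name `recode_phrases` and the sample sentence are assumptions, while the phrase list is taken from the table.

```python
import re

# Phrase-to-token mapping taken from Table 7-2.
PHRASE_TO_TOKEN = {
    "no injury": "no_injury",
    "no visible injuries": "no_injury",
    "no visible injury": "no_injury",
    "not hurt": "no_injury",
    "not injured": "no_injury",
    "did not suffer any injuries": "no_injury",
    "without provocation": "no_provocation",
    "without any provocation": "no_provocation",
    "no provocation": "no_provocation",
    "un provoked": "no_provocation",
}

def recode_phrases(text, mapping=PHRASE_TO_TOKEN):
    """Replace each known multi-word phrase with its single token."""
    # Normalise case and whitespace so phrases match regardless of layout.
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Try longer phrases first so a short phrase never pre-empts a longer one.
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Invented sample sentence, not taken from the data.
print(recode_phrases(
    "IP did not suffer any injuries and was attacked without any provocation"
))
# ip no_injury and was attacked no_provocation
```

Running phrase recoding before stopword removal and stemming keeps the negating word attached to the term it modifies.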
As noted in Chapter 4, the meaning of many words can only be understood in relation to their context. The methods used to analyse data in this thesis largely examine words in isolation from their context and can therefore introduce ambiguity. This is a particular problem with homographs, distinct words that share the same spelling but have different meanings (e.g. saw (verb) and saw (noun)). To reduce the uncertainty of the meaning of terms, potentially ambiguous terms had to be identified and their meaning clarified. Again, existing lists of known homographs in the English
language can provide a starting point, but much of this work had to be conducted manually by inspecting the database. Key word in context (KWIC) lists, also known as a concordance view, aided the process of identifying the meaning of ambiguous words. These lists display every occurrence of a word surrounded by its context. This facilitates the inspection of the different meanings that words may have within the data. Figure 7-1 below shows a KWIC list for the word ‘escort’. The list highlights the two different meanings for the similarly spelled words escort and escorted. If stemming had occurred prior to disambiguation, all instances of escorted may have been recoded to escort and then car, a misinterpretation of the original word. Disambiguation can be made more difficult by the use of slang and local variations. For example, the names of pubs or local landmarks taken out of context could lead to misinterpretation (imagine the impact that many crimes occurring at a Punch and Judy Inn would have on the perceived level of violence in the data if the issue was not recognised and corrected). In the current data, the location code (park, licensed premises, shop etc.) was consistently recorded, allowing a quick check for the names of licensed premises and the removal of related proper nouns.
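A KWIC list of the kind described above can be produced with very little code. The following is a minimal Python sketch, not the tool used to generate Figure 7-1. The function name `kwic`, the `width` parameter and both sample descriptions are assumptions made for illustration.

```python
def kwic(documents, keyword, width=3):
    """Return (doc_index, left_context, match, right_context) rows.

    A word matches when it starts with the keyword, so that inflected forms
    such as 'escorted' are listed alongside 'escort' for manual inspection.
    """
    rows = []
    for doc_index, doc in enumerate(documents):
        words = doc.lower().split()
        for i, word in enumerate(words):
            if word.startswith(keyword):
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                rows.append((doc_index, left, word, right))
    return rows

# Hypothetical MO fragments, invented for illustration only.
docs = [
    "OFFENDER MADE OFF IN A FORD ESCORT TOWARDS TOWN",
    "IP WAS ESCORTED FROM THE PREMISES BY STAFF",
]
for row in kwic(docs, "escort"):
    print(row)
# (0, 'in a ford', 'escort', 'towards town')
# (1, 'ip was', 'escorted', 'from the premises')
```

Listing the two forms side by side makes it clear that they need different tokens, which is why the inspection has to happen before stemming merges them.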
Some advantages in data preparation stemmed from the data’s status as local grammar. This restricted the available meanings of many words that otherwise could have been ambiguous. Thus stalk always referred to the verb, not the noun, while both implement and object always referred to the noun. However, phone could relate to the noun, as in mobile phone, but was often used as a verb, as in phoned the police. The term also appeared in phone box. This issue was resolved through the identification of multi-word phrases relating to calling the police or calling 999, and also the identification of words and phrases relating to phone boxes.
Figure 7-1 A screenshot of a KWIC list showing all instances of escort in the data
One issue which arose in the standardisation of the text was the stemming of plural words when the distinction between singular and plural words was important for the analysis. This issue was relevant when MO descriptions included either the number of offenders or the number of other actors in the situation. This problem is heightened by the fact that the standardisation process removes punctuation, so that ‘offender’s’ would become ‘offenders’ and then ‘offender’. Further complications are added by the fact that this punctuation was often missing or misplaced in the raw data. Even if it had been possible to accurately retain the singular or plural in these cases, the number of offenders or other actors was often indicated by shorthand such as X2. Looking at the examples in Figure 7-1 above once more, we can see details of the number of offenders given in various ways including:
• 3 U K OFFENDERS
• THIRD OFFENDER AN IC3
• OFFENDERS 2 WHITE MALES
• 2 UNKNOWN OFFENDERS
• X1 IC1 MALE AND X1 IC3 MALE
• OFFENDERS 1 AND 2
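The variety of formats in the list above can be illustrated with a minimal Python sketch. It is not part of the analysis in this thesis, which did not attempt to derive these counts. The two patterns and the function name `offender_count_hint` are assumptions chosen for illustration, and they deliberately cover only the X-shorthand and the "number before OFFENDERS" forms.

```python
import re

# Assumed pattern for shorthand such as X1 or X2.
X_SHORTHAND = re.compile(r"\bX(\d+)\b")
# Assumed pattern for a number followed, within two words, by OFFENDER(S).
COUNT_BEFORE = re.compile(r"\b(\d+)\s+(?:[A-Z]+\s+){0,2}OFFENDERS?\b")

def offender_count_hint(text):
    """Return a tentative offender count, or None when no pattern applies."""
    text = text.upper()
    x_counts = [int(n) for n in X_SHORTHAND.findall(text)]
    if x_counts:
        # Several X-shorthand groups are summed, e.g. X1 ... AND X1 gives 2.
        return sum(x_counts)
    match = COUNT_BEFORE.search(text)
    if match:
        return int(match.group(1))
    return None

examples = [
    "3 U K OFFENDERS",
    "THIRD OFFENDER AN IC3",
    "OFFENDERS 2 WHITE MALES",
    "2 UNKNOWN OFFENDERS",
    "X1 IC1 MALE AND X1 IC3 MALE",
    "OFFENDERS 1 AND 2",
]
for example in examples:
    print(example, "->", offender_count_hint(example))
```

Three of the six examples return None under these two patterns, because ordinals, counts placed after the noun and numbered offenders each need further rules.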
Similar problems emerge when attempting to uncover the number of victims in a description, or the number of people that were in the company of a victim at the time of the offence. It was decided that the data processing would not attempt to ascertain the number of victims or offenders in each description. This represents a limitation within the analysis, as co-offenders are a key resource in scripts (the token group later proved significant in the analysis) and other actors have the potential to be crime preventers or crime promoters. This problem could have been addressed with the use of additional variables relating to the numbers of offenders and victims and their characteristics, but this data was not part of the original request made to Force A. Providing a way of ascertaining the number of actors in each scene would certainly improve any future research using similar techniques.