Chapter 7 Identifying Crime Scripts Through Computer-aided Content Analysis

7.2 Data Preparation

7.2.2 Coding iterations

The development of data dictionaries was an iterative process, beginning with the standard tools and then refining the coding frames based on inspection of the data. Initial analysis also began to reveal anomalies that remained in the data. In response, the coding frames applied to the data were revised, i.e. stoplists and multi-word lists were extended and code lists revised. For example, in an early run of the analysis, the term ‘Belcher’ emerged as a significant term, dominating one of the clusters. This term was not known to me or my supervisors. On further investigation it became clear that a Belcher is a form of gold chain worn as jewellery, and the word was recoded accordingly. Other refinements included the revision of the stoplist to add words and phrases that related to administrative processes rather than the commission of crime. Examples of these terms include ‘between material times’, ‘at offence location’, ‘re-classified’, ‘xref’ and ‘cross reference’. A further issue was the frequency with which the names of particular police officers and staff members occurred in the data, particularly those of individuals involved with auditing the data. These proper nouns were identified and added to the stopword list.
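The three kinds of revision described above (joining multi-word phrases, recoding unfamiliar terms and extending the stoplist) can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not represent the software used in this research. The dictionary contents are hypothetical, drawn loosely from the examples in the text, and the target code `jewellery` for ‘Belcher’ is an assumption.

```python
import re

# Hypothetical multi-word list: phrases joined into single tokens before splitting.
MULTIWORD = {
    "between material times": "between_material_times",
    "at offence location": "at_offence_location",
    "cross reference": "cross_reference",
}

# Hypothetical recode list; the target code 'jewellery' is an assumption.
RECODE = {"belcher": "jewellery"}

# Hypothetical stoplist: administrative terms plus a few function words.
STOPLIST = {
    "between_material_times", "at_offence_location", "cross_reference",
    "re-classified", "xref", "the", "a", "from",
}


def tokenise(text):
    """Lower-case, join multi-word phrases, split, recode, then drop stopwords."""
    text = text.lower()
    # Longest phrases first, so a short phrase cannot break up a longer one.
    for phrase in sorted(MULTIWORD, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", MULTIWORD[phrase], text)
    tokens = re.findall(r"[a-z0-9_-]+", text)
    tokens = [RECODE.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOPLIST]


print(tokenise("Between material times suspect snatched Belcher from victim xref"))
```

Each iteration of refinement then amounts to editing the three dictionaries and re-running the same function over the corpus.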

One of the most challenging and time-consuming elements of data preparation (because it was the least automated) was the extension of tokenisation to identify and group together synonyms, hypernyms, hyponyms and other words and phrases that indicate the same or a similar class of object or action.

The identification of words with similar meaning is not an objective process. The inspection of data and coding frames will be guided by the understanding of the analysis task, and potentially by previous experience with the data or the real world problems that the data represents.

In this case the analyst (myself) had previous experience in analysing crime data, including some basic experience of inspecting crime MOs and a background knowledge of theory and research evidence relating to theft from the person. It is important to consider the extent to which this prior knowledge may have informed and facilitated the processes of data preparation and analysis, or may have biased them. If the data is prepared with prior knowledge there may be a risk of missing crimes that do not fit expectations. This is avoided to a degree by conducting exploratory analysis on data that has been minimally cleaned, and by ensuring that anomalies in the early iterations are not ignored. It is, therefore, important to acknowledge that this process is by no means totally automated and free of subjectivity.

In this research, insights gained from the literature on theft from the person and robbery of personal property were used to consider key elements of offence commission that might be described in the MO and then to explore the vocabulary being used to describe those elements. Data preparation was being conducted in parallel with the development of conceptual frameworks; the emerging script frameworks therefore provided a structure for considering elements of the description. This prompted the consideration of different scenes, such as target selection, approach, engagement and transfer, of what is known about the accomplishments the offender must complete within each scene, and then of the terminology that could be used to describe them. For example, the literature demonstrated that approach is a critical element of robbery of personal property and that one of the different routes for making an approach is to create an element of surprise. This prompted a search for terms in the data that might describe how surprise is generated, e.g. ‘jumped out’ or ‘hiding’. Similar exercises were conducted for other script scenes.

Different research questions require greater or lesser attention to the levels of lexical detail in the data. So, for example, in this analysis the coding frame did not make a distinction between severities of injury. Tokens relating to injury, such as ‘injury’, ‘head_injury’, ‘swelling’ and ‘bruises’, were all coded simply as ‘injury’. For the purposes of a different research question, these distinctions may be important and would need to be preserved. However, where this type of detail is important, the ability to accurately code the data to the desired level of detail is dependent on the amount of information that is available in each of the MO descriptions, and many of the descriptions in the dataset simply stated ‘injury’.
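The effect of a coding frame on the level of lexical detail can be sketched as follows. This is an illustrative Python example, not the implementation used in this research. The coarse frame uses the injury tokens named above; the variable and function names are hypothetical.

```python
# Coarse frame reflecting the text: all injury-related tokens share one code.
COARSE_FRAME = {
    "injury": ["injury", "head_injury", "swelling", "bruises"],
}


def build_lookup(frame):
    """Invert a {code: [tokens]} frame into a {token: code} lookup table."""
    return {token: code for code, tokens in frame.items() for token in tokens}


def apply_frame(tokens, frame):
    """Replace each token with its code; tokens outside the frame are left as-is."""
    lookup = build_lookup(frame)
    return [lookup.get(token, token) for token in tokens]


tokens = ["victim", "head_injury", "swelling"]
print(apply_frame(tokens, COARSE_FRAME))
```

A research question concerned with severity would use a frame that leaves tokens such as `head_injury` unmapped, so that the distinction survives into the analysis.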

In the current analysis distinctions were made between verbal interactions, verbal aggression and verbal abuse. It was anticipated that these distinctions would be helpful in understanding the engagement scene of offence commission. Verbal interactions were characterised by conversations between actors in a script (normally the offender and the victim). Where these verbal interactions were elevated to the level of shouting or the use of aggressive profanities, this was labelled as verbal aggression, and where it was clear that the language was directly abusive it was recoded as verbal abuse. As noted above, it was not always possible to accurately code the nature of the interaction because descriptions may only have included the word ‘verbal’. Consequently, the coded data contains imperfections and limitations. If this study had been interested in hate crime, it would have been necessary to ensure that, wherever the data permitted, distinctions could be made between different types of abuse, such as homophobic, racial or disablist.

The cycle of refinement and re-analysis can be repeated many times, although the marginal improvements to the results become smaller with each iteration. Overall, the coding frames underwent twelve iterations of refinement. The results of the analysis with the transformed corpus were generally consistent with those of the early iterations. The initial round of transformations is an essential stage of analysis (cluster analysis on completely raw text did not produce any meaningful groupings); however, smaller refinements, while they may have benefits for the presentation and communication of results, did not fundamentally influence the results of the analysis. In other words, there is a point where refinements prevent something strange from appearing on a presentational wordcloud diagram but do not change the overall interpretation of the cluster. There is, therefore, a careful balance to strike between producing a perfectly clean and unambiguous dataset and the time and effort that is invested to achieve this. However, where time has been invested in the creation of revised dictionaries, these can be used as an improved starting point for future analysis.

Although standardisation means that, for the purpose of analysis, words in the raw text are replaced with a narrower range of tokens, a permanent link is retained between the tokenised data and the raw data. This means that the raw form of the MO descriptions can be recalled at any time in order to assist with the verification and interpretation of the results.
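One way of retaining the link between tokenised and raw data is to store both forms against the same record identifier. The following Python sketch is an assumption about how this could be done, not a description of the tooling used here. The record structure, identifiers and example descriptions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MORecord:
    """Hypothetical record keeping the raw MO text alongside its tokens."""
    record_id: str
    raw: str
    tokens: tuple


def raw_text_for_token(records, token):
    """Return (record_id, raw text) for every record whose tokens include `token`."""
    return [(r.record_id, r.raw) for r in records if token in r.tokens]


# Invented example records for illustration only.
records = [
    MORecord("R1", "Suspect snatched Belcher from victim", ("suspect", "snatched", "jewellery", "victim")),
    MORecord("R2", "Suspect jumped out and grabbed phone", ("suspect", "jumped_out", "grabbed", "phone")),
]

print(raw_text_for_token(records, "jewellery"))
```

When a token appears prominently in a cluster, the original wording behind it can then be retrieved and checked.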

A final stage of data preparation was to remove ‘sparse terms’. This excluded from the analysis any tokens that occurred fewer than six times in the whole dataset. This helps to reduce the processing time involved in analysing the data and to simplify output tables and graphs, while losing little information.
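The removal of sparse terms can be sketched in Python as below. This is a minimal illustration rather than the procedure actually used. It assumes that ‘frequency’ means the total number of occurrences across the whole dataset (rather than the number of descriptions containing the token), and the function name and example data are hypothetical.

```python
from collections import Counter

MIN_FREQUENCY = 6  # tokens occurring fewer than six times are dropped


def remove_sparse_terms(documents, min_frequency=MIN_FREQUENCY):
    """Drop tokens whose total count across all documents is below the threshold."""
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_frequency]
            for doc in documents]


# Invented example: 'snatch' and 'phone' occur six times each, 'belcher' only once.
documents = [["snatch", "phone"] for _ in range(6)] + [["snatch_attempt", "belcher"]]
filtered = remove_sparse_terms(documents)
print(filtered[0], filtered[-1])
```

Descriptions made up entirely of sparse tokens are left empty by this step and would need handling before clustering.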

7.2.3 Summary

This chapter has described the process of pre-processing which prepared the data for subsequent analysis. Chapter 6 described characteristics of MO fields that present challenges for their analysis. The above discussion has shown how techniques available within content analysis and natural language processing can help to resolve these issues. However, some problems remain, including the identification of the number of actors in a scene.

The dictionary refinements reduced the number of distinct tokens from 22,250 to just 1,920. This striking simplification of the data helps to identify MOs that are similar but have been described using different terminology. The process of data preparation and dictionary development is clearly time consuming, which may seem to contradict the aim of using methods appropriate to volume crime data. However, the effort put into this stage produced the legacy of a set of dictionaries, and a developed procedure, which can provide a starting point and crucial time-saver for any future analysis.

In document The Utility of Applying Textual Analysis to Descriptions of Offender Modus Operandi for the Prevention of High Volume Crime (Page 151-155)