Chapter 5 ‘Seek and ye shall find’: formulaic clusters as evidence of authorship
5.3 Results
5.3.2 Attributing a questioned document: two candidate authors
Using the random case selection function in PASW Statistics, two authors were selected for the analysis: Rose and Mark. PASW Statistics was then used to select at random one of the ten texts produced by these two authors to act as the Questioned Document: the first text produced by Mark.
Selecting one of the documents as the Questioned Document leaves a 5-text to 4-text comparison and, although the majority of clusters occur in only three texts, this uneven comparison may skew the results. Whilst it can be argued that in a forensic investigation it is unlikely that exactly the same number of texts will be available for each candidate author, in an exploratory study such as this, limits must be established where possible. Therefore, the first part of the analysis will proceed with the 5-text to 4-text comparison, before reducing Rose’s texts by one to see how the results are affected by a 4-text to 4-text comparison.
The results of this analysis are presented in Table 5.4. The first column shows the formulaic clusters used by Rose and the third column lists the formulaic clusters identified in the four ‘Known Documents’ produced by Mark (i.e. those that occurred in at least three texts). The Questioned Document was then searched for each of Rose’s and Mark’s formulaic clusters, and those which were present are shown in the second column. It is important to point out that the items in the second column are only ‘candidate formulaic clusters’, since by definition a formulaic cluster would need to occur in three texts, whereas only one Questioned Document is available for analysis. This column therefore records the occurrence of clusters which have been claimed to be formulaic for another author (either Rose or Mark), and it is predicted that the Questioned Document should share more clusters with its author (Mark) than with the other candidate author (Rose). The fourth column is discussed further below.
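The procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions rather than the procedure used in the study: a cluster is treated as a contiguous sequence of n words, texts are tokenised on whitespace only, and the texts are invented. The function names and the parameters `n` and `min_texts` are hypothetical.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return the set of word n-grams in a text (upper-cased, whitespace-tokenised)."""
    words = text.upper().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def formulaic_clusters(known_texts, n=3, min_texts=3):
    """Clusters that occur in at least `min_texts` of an author's known texts."""
    text_counts = Counter()
    for text in known_texts:
        # A set is passed in, so each text contributes at most 1 per cluster.
        text_counts.update(ngrams(text, n))
    return {c for c, k in text_counts.items() if k >= min_texts}

def candidate_clusters(questioned_text, clusters, n=3):
    """An author's formulaic clusters that also appear in the questioned text."""
    return clusters & ngrams(questioned_text, n)

# Invented toy texts, not data from the study.
known = [
    "when i was young it was a long walk",
    "it was a shock when i was told",
    "when i was there it was a mess",
    "in the end we left",
]
qd = "it was a cold day in the end"

clusters = formulaic_clusters(known)
matches = candidate_clusters(qd, clusters)
```

In this toy example IN THE END occurs in the questioned text but in only one known text, so it does not count as a match. The tables also contain four-word clusters, so a fuller sketch would repeat the search for more than one value of `n`.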
Table 5-4 Formulaic clusters used by Rose, Mark and QD in comparison to all other authors
Formulaic clusters used by Rose | Clusters occurring in QD | Formulaic clusters used by Mark | Total authors using formulaic cluster
A COUPLE OF 11
A LOT OF 8
A WAY I 5
AND I WAS AND I WAS 16
AS I WAS 12
AT THE SAME TIME 5
BUT I KNEW 1
BY THE TIME BY THE TIME 9
I KNEW THAT 10
I REALLY FELT 1
I THINK THE 5
I WAS GLAD 2
I WAS GOING 10
I WAS SO I WAS SO 10
IN A WAY 8
IN A WAY I 4
IN THE END IN THE END 9
IN THE SAME 6
IT WAS A IT WAS A 18
LOOKING FORWARD TO 5
MADE ME FEEL 5
ME AND MY 6
ME IN A 1
THAT I WAS 17
THE SAME TIME 7
THE WHOLE THING 5
WAS GOING TO 9
WENT TO MY 1
WHEN I WAS 18
WHICH I WAS 2
As can be seen from Table 5.4, 24 formulaic clusters were identified in Rose’s texts, whilst only six were identified in Mark’s texts, and five candidate formulaic clusters were found in the Questioned Document. The first thing to notice is that Rose and Mark do not share any formulaic clusters. This adds some weight to the argument that there is inter-author variation in the use of formulaic clusters. Secondly, far fewer formulaic clusters were identified for Mark than for Rose. This is partly a result of the research design and is clearly related to the texts analysed (cf. below and Sections 5.3.3–5.3.4 for further testing on different texts). Referring back to Table 5.3, nine formulaic clusters were originally identified for Mark, based on five texts. Here, since one of Mark’s texts has been selected as the Questioned Document, only four ‘Known Documents’ were available for analysis, which explains why fewer formulaic clusters were identified than previously.
Given that only five clusters were identified in the Questioned Document and that four are formulaic for Rose and one is formulaic for Mark, it is unlikely that persuasive evidence can be found for authorship. However, the fact that they are formulaic clusters for an author only means that they are used frequently (at least once in three texts) for that author, not that they are used exclusively by that author. In other words, in line with Solan and Tiersma (2005: 156), the distinctiveness of a feature needs to be assessed in relation to other authors. This is shown in the fourth column in Table 5.4. With the benefit of 18 other authors with whom to compare the texts, it is possible to show how many of the 20 authors also used the identified formulaic clusters in their texts. Note, though, that the occurrence could be as low as once across all five texts produced by an individual author, so the claim is not necessarily that the cluster is also distinctive, or even formulaic, for them; rather, that it is also available in their lexical repertoire. A summary of the salient points is shown in Table 5.5.
Table 5-5 Relative significance of formulaic clusters in comparison to other authors
Formulaic cluster | Significance
AND I WAS | Used by 16 authors
BY THE TIME | Used by 9 authors
I WAS SO | Used by 10 authors
IN THE END | Used by 9 authors
IT WAS A | Used by 18 authors
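Counts of this kind could be obtained along the following lines. The sketch assumes, as stated above, that an author counts as a user of a cluster if it occurs at least once in any of their texts. The corpus layout (a mapping from author to texts), the function name and the texts themselves are hypothetical.

```python
def authors_using(cluster, corpus):
    """Number of authors with the cluster at least once in any of their texts."""
    needle = f" {cluster.upper()} "
    count = 0
    for texts in corpus.values():
        # Normalise whitespace and pad with spaces so that matches
        # respect word boundaries.
        if any(needle in f" {' '.join(t.upper().split())} " for t in texts):
            count += 1
    return count

# Invented toy corpus (author -> texts); not data from the study.
corpus = {
    "author_a": ["it was a good day", "we left early"],
    "author_b": ["in the end it was a relief"],
    "author_c": ["nothing to report"],
}

range_it_was_a = authors_using("it was a", corpus)
range_in_the_end = authors_using("in the end", corpus)
```

A single occurrence is enough for an author to be counted, which is why the resulting figure says nothing about whether the cluster is formulaic for that author.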
Viewed in this light, it can be seen that whilst the majority of the formulaic clusters isolated in the Questioned Document are shared with Rose rather than Mark, they do not seem to offer any discriminatory power, since every one of them is used by many other authors: at least 45% of the authors in each case, with AND I WAS and IT WAS A being used by 80% and 90% of the authors respectively. Therefore, no attribution is possible, nor is it possible to exclude either author as a potential author of the Questioned Document. It is important to acknowledge, though, that if an attribution had been based purely on the quantity of ‘matched’ formulaic clusters, the wrong attribution would have been made, with Rose looking like the more likely author.
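The proportions can be checked directly from the counts in Table 5.5 and the total of 20 authors (Rose, Mark and the 18 others). The variable names in this short sketch are illustrative only.

```python
TOTAL_AUTHORS = 20  # Rose, Mark and the 18 other authors

# Author counts as reported in Table 5.5.
author_counts = {
    "AND I WAS": 16,
    "BY THE TIME": 9,
    "I WAS SO": 10,
    "IN THE END": 9,
    "IT WAS A": 18,
}

# Percentage of all authors who use each cluster at least once.
share = {c: 100 * k / TOTAL_AUTHORS for c, k in author_counts.items()}

for cluster, pct in share.items():
    print(f"{cluster}: {pct:.0f}% of authors")
```

The shares range from 45% (BY THE TIME, IN THE END) to 90% (IT WAS A), so none of the five clusters is confined to a small group of authors.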
At this stage, it is necessary to consider the fact that five texts produced by Rose have been compared against four texts produced by Mark, and that the extra text available for analysis in Rose’s set of Known Documents may well have skewed the results. The point was made above that using fewer texts reduced the quantity of formulaic clusters identified for Mark. Therefore, reducing the number of Known Documents written by Rose should also affect the outcome of the qualitative analysis, and this forms the next part of testing the method. PASW Statistics was instructed to randomly select one text from Rose. Her second text was selected and removed from the pool of Known Documents, resulting in four texts by Rose, four by Mark and one Questioned Document. The formulaic cluster analysis based on these texts is presented as Table 5.6.
Table 5-6 Formulaic clusters used by Mark and Rose in comparison to QD (4 Known Documents each)
Formulaic clusters used by Rose | Clusters occurring in QD | Formulaic clusters used by Mark | Total authors using formulaic cluster
A COUPLE OF 11
AND I WAS AND I WAS 16
AT THE SAME TIME 5
BY THE TIME BY THE TIME 9
I REALLY FELT 1
I THINK THE 5
I WAS GLAD 2
I WAS GOING 10
IN THE END IN THE END 9
IN THE SAME 6
LOOKING FORWARD TO 5
ME AND MY 6
THAT I WAS 17
THE SAME TIME 7
THE WHOLE THING 5
WAS GOING TO 9
WENT TO MY 1
WHEN I WAS 18
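Which of the Questioned Document's clusters survive the reduction to four Known Documents each can be read off by comparing Table 5.4 with Table 5.6. The short sketch below does this with set operations; the cluster lists are copied from the two tables and the variable names are illustrative only.

```python
# Clusters found in the Questioned Document (Table 5.4, second column).
qd_clusters = {"AND I WAS", "BY THE TIME", "I WAS SO", "IN THE END", "IT WAS A"}

# All clusters listed for Rose or Mark in Table 5.6 (four Known Documents each).
table_5_6 = {
    "A COUPLE OF", "AND I WAS", "AT THE SAME TIME", "BY THE TIME",
    "I REALLY FELT", "I THINK THE", "I WAS GLAD", "I WAS GOING",
    "IN THE END", "IN THE SAME", "LOOKING FORWARD TO", "ME AND MY",
    "THAT I WAS", "THE SAME TIME", "THE WHOLE THING", "WAS GOING TO",
    "WENT TO MY", "WHEN I WAS",
}

retained = qd_clusters & table_5_6    # still matched against a known author
discounted = qd_clusters - table_5_6  # no longer formulaic for either author

print(sorted(retained))
print(sorted(discounted))
```

Three of the five clusters are retained (AND I WAS, BY THE TIME, IN THE END) and two are discounted (I WAS SO, IT WAS A).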
As predicted, the number of Rose’s formulaic clusters was reduced substantially, from 24 to 12, and as a consequence two of the clusters which occurred in the Questioned Document (I WAS SO and IT WAS A) are discounted. The result is that there are now only two of Rose’s formulaic clusters to place against the one for Mark. This neither clarifies nor otherwise strengthens or weakens the conclusions reached above; it simply reduces the data on which conclusions can be based. This reinforces the position of forensic linguists that more data (i.e. more and longer texts) enable stronger conclusions and, more importantly for this method, it appears that data sets should be similar in size to enable more valid comparisons. So far, formulaic clusters have been identified from sets of five and four texts, and no attribution was possible. It may be the case that formulaic clusters do still hold potential to be diagnostic of authorship, but that a larger set of candidate authors is required to make differences more apparent. The next investigation tests this assertion.