4.4.2 Handwriting individualisation

At the end of Chapter 2, the question was posed as to whether handwriting samples from an individual are more similar to one another than to samples from other writers. This was to be considered in a developmental context. One possibility is that, in younger children subjected to similar teaching regimes, pieces of handwriting from different children could be similar to one another, whereas in older children this might be less likely owing to the maturation of an individual style of handwriting. The primary aim of this preliminary study was to assess the extent of within writer similarity in the context of between writer similarities among participants in the three age groups. In order to do this, a method of analysis is required that can find patterns of

cluster analysis. In the cross-sectional study reported in Chapter 5, the analysis of the much larger data set from 144 participants will follow a different approach and will consider trends in feature use and variability and will attempt to find common ground across features by using principal components analysis. The data from the small number of participants in this preliminary study are too limited to follow these approaches in order to attain meaningful results.

Cluster analysis is the method of choice when looking for patterns of association or disassociation in complex data and for that reason is essentially an exploratory process. It clusters together entities that are more similar to each other and dissimilar to those other entities that may form another cluster. Further, using the associated discriminant function analysis, the relative contributions of the scored features to the clustering patterns can be determined. Since one of the tests for the scheme is to show within writer similarity (which can be shown by clustering of pieces from the same writer) and at the same time a degree of within writer difference (but not so different that the writings stop clustering), this method is a good way of assessing the categories in combination.

There is a variety of ways in which clusters can be formed. One general way is to use agglomerative methods which essentially work from the bottom up, looking at the data relating to each individual example and finding which two examples are the most similar, combining those two into a single entity (cluster), and then repeating the process until all of the original individuals have been drawn into clusters.
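The bottom-up procedure described above can be sketched in a few lines. This is a minimal illustration only, not the software used in this study: the function names (`agglomerate`, `single_link`), the use of the nearest-member rule as the default and the five two-feature samples are all assumptions made for the example.

```python
from itertools import combinations


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_link(cluster_a, cluster_b, data):
    """Distance between two clusters, taken as that of their closest members."""
    return min(sq_euclidean(data[i], data[j]) for i in cluster_a for j in cluster_b)


def agglomerate(data, cluster_distance=single_link):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of sample indices.
    """
    clusters = [(i,) for i in range(len(data))]  # every sample starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of current clusters that are most similar.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], data),
        )
        history.append((a, b, cluster_distance(a, b, data)))
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


# Hypothetical feature scores for five handwriting samples (illustrative only).
samples = [
    [1.0, 1.0],
    [1.0, 2.0],
    [5.0, 5.0],
    [5.0, 7.0],
    [9.0, 1.0],
]
merges = agglomerate(samples)
for a, b, distance in merges:
    print(a, b, distance)
```

The merge history is the information a dendrogram displays: each entry records which two clusters were joined and at what distance.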

A variety of clustering algorithms view the data in different ways. Some use the nearest elements and some the furthest elements as the basis for clustering, whereas others use the centroid values (determined as the average point in multi-dimensional space). To compare the outcomes from a number of clustering algorithms, a common measure of their outputs is needed. Two measures that could be used are (i) the capacity for the three pieces of handwriting from the same individual to be nearest neighbours in the resulting dendrogram of the cluster pattern (this would address a primary consideration of this preliminary study) and (ii) the extent to which participants from the same age group are adjacent to each other in the dendrograms, in other words the number of clusters formed by each age group (a secondary consideration of this preliminary study).
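One way in which these two measures might be computed from the left-to-right order of the leaves in a dendrogram is sketched below. It rests on an assumption: that a "cluster" of an age group can be read as an unbroken run of that group's pieces in the leaf order. The function names and the short label sequences are hypothetical and are not data from this study.

```python
def non_adjacent_pieces(leaf_writers):
    """Count pieces with no same-writer piece immediately beside them.

    `leaf_writers` gives the writer of each piece in dendrogram leaf order.
    """
    count = 0
    for i, writer in enumerate(leaf_writers):
        left = i > 0 and leaf_writers[i - 1] == writer
        right = i + 1 < len(leaf_writers) and leaf_writers[i + 1] == writer
        if not (left or right):
            count += 1
    return count


def runs_per_group(leaf_groups):
    """Count the unbroken runs that each group forms in dendrogram leaf order."""
    runs = {}
    previous = None
    for group in leaf_groups:
        if group != previous:  # a new run starts whenever the label changes
            runs[group] = runs.get(group, 0) + 1
        previous = group
    return runs


def mean_runs(leaf_groups):
    """Average number of runs per group (measure ii under the stated assumption)."""
    runs = runs_per_group(leaf_groups)
    return sum(runs.values()) / len(runs)


# Hypothetical leaf orders, for illustration only.
writers = ["A", "A", "A", "B", "B", "C", "B", "C", "C"]
groups = ["y", "y", "m", "y", "m", "m", "o", "o", "o"]

print(non_adjacent_pieces(writers))  # measure (i)
print(runs_per_group(groups))
print(mean_runs(groups))             # measure (ii)
```

Lower values on both measures indicate tighter grouping of pieces by writer and by age group respectively.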

Given the many options for calculating cluster relationships, five were chosen which reflected a range of approaches to the clustering process. These were (i) between groups linkage, (ii) centroid linkage, (iii) single linkage, (iv) complete linkage and (v) Ward’s linkage. In each instance, the distances were based on the squared Euclidean distance. The dendrograms for the five methods of clustering are found in Figures 4.1a-e. From these dendrograms it can be seen that the numbers of non-adjacent participants and the number of clusters for each age group are as shown in Table 4.3.
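The first four of these linkage rules differ only in how the distance between two clusters is derived from the squared Euclidean distances between their members. The sketch below shows that difference on two small, hypothetical clusters; the function names are illustrative. Ward’s linkage is omitted because it rests on a sum-of-squares criterion rather than a simple pairwise rule.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Average point of a cluster in multi-dimensional space."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def single_linkage(a, b):
    """Nearest members of the two clusters."""
    return min(sq_euclidean(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Furthest members of the two clusters."""
    return max(sq_euclidean(p, q) for p in a for q in b)


def between_groups_linkage(a, b):
    """Mean distance over all cross-cluster pairs (average linkage)."""
    return sum(sq_euclidean(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return sq_euclidean(centroid(a), centroid(b))


# Two hypothetical clusters of two-feature samples (illustrative only).
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [5.0, 2.0]]

for rule in (single_linkage, complete_linkage,
             between_groups_linkage, centroid_linkage):
    print(rule.__name__, rule(cluster_a, cluster_b))
```

The same pair of clusters is therefore "closer" or "further apart" depending on the rule chosen, which is why the resulting dendrograms can differ.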

Figures 4.1a-4.1e showing dendrograms obtained using different clustering algorithms (4.1a using between groups linkage; 4.1b using centroid linkage; 4.1c using single linkage; 4.1d using complete linkage; 4.1e using Ward’s linkage)

Table 4.3 showing participant clustering and age group clustering patterns

Cluster algorithm    Participant clustering¹    Age group clustering²
Between groups       3                          5.0
Centroid             6                          4.67
Single linkage       5                          4.33
Complete linkage     2                          3.67
Ward’s               3                          4.0

¹ measured by number of instances in which a piece of handwriting from a participant is not adjacent to another from the same participant

² measured by the average number of clusters capturing all fifteen pieces of handwriting in each age group

The first comment to make about the findings in Table 4.3 is that, even using a number of different methods for calculating cluster membership, the number of instances in which a piece of handwriting from a participant is not adjacent to one or other of the other two pieces of his or her handwriting is small. Related to this, there is a consistent tendency for the pieces of handwriting from each age group to cluster together. This underlying commonality of the results across different clustering algorithms suggests that the coding scheme is producing patterns of data that are reasonably robust: even when examined using different clustering methods, a reasonable similarity of linkage is still found between the pieces of handwriting from the same individuals and from individuals of similar age.

The second comment concerns which clustering algorithm is most appropriate to this research. On the face of it, of the five methods tried here, complete linkage gave the best result, with only two pieces of handwriting non-adjacent to other pieces from the same writer and with all fifteen pieces in each age group being, on average, captured in 3.67 clusters. The next best result was obtained using Ward’s method and the worst using the centroid method. However, examination of the dendrograms themselves shows that Ward’s method produces a clearer pattern of triplication of pieces from the same writer. The rescaled cluster distances are also larger in Ward’s method, which is indicative of more profound differences between clusters. Whilst there is no clear-cut reason to choose one algorithm over another in this instance, the clustering pattern obtained by using Ward’s method gives cleaner triplicates from the same writer and, since this is a primary objective of the coding scheme, Ward’s method was chosen for the remainder of this thesis. Ward’s method calculates the ‘distance’ between the multidimensional points of data using the squared Euclidean distance, evaluates the loss of information that results from each of the clustering steps and checks this loss for each possible clustering, only clustering in those instances which minimise the information loss.
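The ‘loss of information’ in Ward’s method is usually formulated as the increase in the within-cluster error sum of squares that a merge would cause. The sketch below takes that formulation as an assumption and computes the loss in two ways: directly from the definition, and through the equivalent shortcut based on the cluster sizes and the squared Euclidean distance between the centroids. The function names and the sample clusters are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Average point of a cluster in multi-dimensional space."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def error_sum_of_squares(points):
    """Sum of squared distances from each point to its cluster centroid."""
    centre = centroid(points)
    return sum(sq_euclidean(p, centre) for p in points)


def ward_cost(a, b):
    """Increase in total error sum of squares caused by merging a and b."""
    return (error_sum_of_squares(a + b)
            - error_sum_of_squares(a)
            - error_sum_of_squares(b))


def ward_cost_shortcut(a, b):
    """Same quantity from cluster sizes and the distance between centroids."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * sq_euclidean(centroid(a), centroid(b))


# Hypothetical clusters of two-feature samples (illustrative only).
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[4.0, 1.0]]

print(ward_cost(cluster_a, cluster_b))
print(ward_cost_shortcut(cluster_a, cluster_b))
```

At each step, the pair of clusters with the smallest such cost is the one merged, which is what tends to keep clusters compact and of broadly similar size.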

Further, whilst these various different methods have their proponents, Ward’s method is widely regarded as an efficient method and tends to produce clusters of approximately similar size which may be expected to more fairly reflect the spectrum of feature use to be found in a group of individuals (Everitt, 1980).

Cluster analysis also has the potential to illuminate a secondary aim of this analysis, namely, by use of discriminant function analysis, to examine the contribution that each feature makes to the discrimination process (clustering), taking into account the whole data set. The discriminant function analysis creates a series of weightings for each feature, indicating its relative contribution to the cluster pattern. These can be used in the second part of this analysis when determining which features to retain and which to discard.