Clustering Concepts from GRiST Mind Maps
6.3 Correspondence Analysis for Researching Text
Applying CA to discerning patterns in text is not a novel idea; indeed, it has been used to investigate literary style in various genres of written work. Bodies of text under scrutiny have ranged from the Christian Gospels to 19th Century fiction, in addition to various standard research texts.
The following review starts with an analysis of word usage across various genres of text, by means of CA. Then comes an example of superimposed graphs from sets I and J that aids human interpretation.
Further studies reveal the importance of positive and negative sides of axes, and of the way in which CA graphs are split into quadrants. Then comes a study of trigrams, and a further one that used WordNet alongside CA; those latter studies in particular inspire a novel approach to resolving ambiguity.
Analysing Word Usage across Genres of Text
The first example of applying CA to textual analysis offers relatively clear graphs, which serve as good illustrations. The domain of that study was the Lancaster-Oslo/Bergen Corpus (LOB), which comprises about one million words from literary works, newspapers, and academic texts; it further covers fifteen categories that range from religion to science fiction.
Results from CA revealed groups of words that corresponded to particular genres: authors were more or less likely to use certain words in any particular genre. In addition, CA identified groups of genres that employed similar words (Nishina, 2007). In terms of the sets I and J that form the basis of any CA, various genres constituted set I, while words of interest comprised set J.
The first step, then, was to count occurrences of relevant words in the LOB. That yielded a matrix of 100 columns by 15 rows, to reflect word frequencies within text categories. For simplicity, categories were denoted by the letters A to R. Table 6.1 shows the first few columns and rows of that matrix:
Table 6.1: Detail from a CA matrix of word frequencies by text category (Nishina, 2007)
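The counting step behind a matrix such as Table 6.1 might be sketched as follows. This is a minimal illustration only: the function name, the whitespace tokenisation and the toy sentences are all assumptions, and none of the data comes from the LOB or from Nishina (2007).

```python
from collections import Counter


def build_contingency(texts_by_category, words_of_interest):
    """Count each word of interest in each text category.

    Rows stand for categories (set I), columns for words (set J).
    """
    row_labels = sorted(texts_by_category)
    col_labels = list(words_of_interest)
    matrix = []
    for category in row_labels:
        counts = Counter()
        for text in texts_by_category[category]:
            # Naive tokenisation: lower-case, then split on whitespace.
            counts.update(text.lower().split())
        matrix.append([counts[word] for word in col_labels])
    return row_labels, col_labels, matrix


# Hypothetical, made-up texts; not taken from the LOB.
toy_texts = {
    "A": ["the minister said the government said no"],
    "J": ["the system used a different form"],
}
rows, cols, table = build_contingency(toy_texts, ["said", "government", "system"])
```

Each row of `table` then holds the frequencies for one category, in the column order given by `cols`.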
Table 6.1 shows, for example, 2,018 occurrences of the word ‘said’ in texts from category A, Press Reportage. CA on that matrix produced the graph in Figure 6.3, which plots text categories according to their word frequencies. Ellipses have been added to highlight particular categories, or groups of categories:
Figure 6.3: Plot of text categories based on word frequencies (Nishina, 2007).
Figure 6.3 shows points for categories J and A that appear outside the central cloud. Those categories were Learned and Scientific Writing and Press Reportage, respectively. In addition, a distinct group of points for categories K, L, N and P showed those genres to be quite similar. In fact, categories in that smaller cloud were all some form of fiction (K: general, L: mystery & detective, N: adventure & western, and P: romance & love story). In contrast, genres that congregate around the graph’s origin had less well defined patterns of word usage. All the same, the point for category R, Humour, is closer to the group of fiction genres than to other points. Humorous writing more resembled fiction than it did, say, academic journals in class J (Nishina, 2007).
Although CA identifies related genres in set I, words that characterised those groups do not appear in Figure 6.3. That information comes from Figure 6.4, which plots set J:
Figure 6.4: Plot of word frequencies based on text categories (Nishina, 2007).
The top-right quadrant of Figure 6.4 depicts the words ‘different’, ‘form’, ‘used’ and ‘system’ as slightly removed from the central cloud. In the bottom-right quadrant, the words ‘government’, ‘national’ and ‘year’ comprise a more distinct group. Those groups constitute words that are likely to appear together, though in what genres is not clear. Just as the graph for genres lacked information about specific words, the graph for those words lacks information about genres. Those graphs must be compared in order to find correspondences between words and genres (Nishina, 2007).
Corresponding quadrants of the graphs from Figures 6.3 and 6.4 reveal relationships between words and genres. Take, for example, the top-right quadrant in the graph of genres, which contains an isolated point for category J, Academic Journals. The corresponding quadrant in the graph of words holds the group containing ‘form’ and ‘system’. In a similar way, the bottom-right quadrants reveal a correspondence between category H, Government Documents and Industrial Reports, and the words ‘government’, ‘national’ and ‘year’. Comparing graphs indeed showed that particular genres, or groups of genres, were characterised by different vocabularies (Nishina, 2007).
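Why the two graphs can be compared quadrant by quadrant becomes clearer from a sketch of the computation. The following assumes NumPy and one common formulation of CA, namely a singular value decomposition of standardised residuals; the function name and the toy counts are hypothetical, and nothing here is taken from the software used in the reviewed studies.

```python
import numpy as np


def correspondence_analysis(counts):
    """Return row coordinates, column coordinates and principal inertias.

    Assumes every row and every column of `counts` has a non-zero total.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses (set I)
    c = P.sum(axis=0)                     # column masses (set J)
    expected = np.outer(r, c)             # values expected under independence
    S = (P - expected) / np.sqrt(expected)  # standardised residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: both sets are expressed on the same factors.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2


# Hypothetical 3 x 3 table of counts, for illustration only.
toy_counts = [[20, 5, 1],
              [4, 18, 3],
              [2, 6, 15]]
row_coords, col_coords, inertia = correspondence_analysis(toy_counts)
```

Column k of `row_coords` and column k of `col_coords` refer to the same factor, so the first two columns of each array give the x and y positions for the two graphs.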
Issues Arising from the Study of Genres by Nishina (2007)
An issue with that study concerns categories A and J, press reportage and academic texts respectively. Those points appeared near the extremes of separate quadrants in Figure 6.3, leading to the conclusion that categories J and A are totally different. The word ‘totally’ suggests that words from one category of text never appear in the other. It would be better to say that some correspondences between words and genres are stronger than others.
A further issue involves various labels on graphs, which reveal an important aspect of CA that received little attention. Headings and axes from Figures 6.3 and 6.4 bear percentages of any overall variation explained by resulting factors, reflecting their relative importance. The x-axis label, then, shows the first factor from CA to account for 59.04% of the total inertia. Factor two, on the y-axis, explained a further 15.10%. Summing those figures yields the total of 74.14% in the headings. Although both of those factors are important, the first was far more so. In addition, nearly a quarter of the inertia remains unexplained, which might point to further research.
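The arithmetic behind those labels can be checked with a short sketch. The two percentages are those quoted above; the helper `inertia_shares` and its example values are hypothetical, and simply show how such percentages would follow from a full list of principal inertias.

```python
def inertia_shares(principal_inertias):
    """Express each principal inertia as a percentage of the total inertia."""
    total = sum(principal_inertias)
    return [100.0 * value / total for value in principal_inertias]


# Percentages reported on the axes of Figures 6.3 and 6.4.
factor_shares = [59.04, 15.10]
explained = round(sum(factor_shares), 2)   # share shown in the headings
unexplained = round(100 - explained, 2)    # share left to the remaining factors

# Hypothetical principal inertias, purely to exercise the helper.
example_shares = inertia_shares([3.0, 1.0])
```

Here `explained` reproduces the 74.14% in the headings, while `unexplained` gives the 25.86% that the first two factors leave out.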
In addition, note that axis labels on the graphs from that study use the word ‘Dimension’. In fact, dimensions reflect the cardinality of set J in any analysis (Benzécri, 1992). Put another way, the space that CA creates has as many dimensions as the matrix has columns. The word ‘factor’ would be more accurate, as that is what CA actually projects on graphs (Murtagh et al., 2007). A further misconception arises in the criticism of CA for failing to consistently divide genres into specific groups (Nishina, 2007).
Rather than any fault in CA, though, it is the researchers’ input matrix that failed to capture the desired patterns; indeed, Murtagh (n.d.) shows that data encoding is an important part of CA.
A final issue with the study by Nishina (2007) concerns text categories A and J. Those points appeared near the extremes of separate quadrants in Figure 6.3, leading to the conclusion that academic texts in J and press reportage in A are totally different from the remaining types of text. The word ‘totally’, though, suggests that words from those categories never appear in other text genres. It would be better to say that certain correspondences between words and genres are stronger than others.
Having raised issues with that study of genre, attention now turns to approaches that addressed them. The first apparent improvement is in visualising CA results: rather than comparing graphs for sets I and J in isolation, superimposed graphs aid interpretation. That is permitted because CA maps both sets into the same space (Murtagh et al., 2008).
Superimposing Graphs from Sets I and J
A good example of superimposing graphs comes from a study of the synoptic gospels: Mark, Matthew, and Luke. CA analysed frequencies of occurrence for types of subordinate clause, such as comparative and conditional clauses, in Greek versions of those gospels. In addition to particular gospels and clauses, a third measure reflected variations in discourse. The narrative form, for example, does not report direct speech, whereas parables contain quotes from participants in a story (Unmans, 1998).
Because CA processes just two sets, discourse could not be addressed directly. Overcoming that constraint meant compressing three measures into a two-way contingency table that CA could use. To that end, gospels were split into sections; each section reflected a specific type of discourse, and contributed a row to the CA matrix. Columns in that matrix stood for types of subordinate clause, while individual cells held the frequency of a particular type of clause in a specific source (Unmans, 1998).
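Folding three measures into a two-way table can be illustrated with a small sketch. The observations below are invented for the purpose: only the row-label pattern (a gospel followed by a discourse letter, as in LkN) and the clause labels Pga, Pc1 and Pc2 echo the study, while the label `Cond`, the discourse letter `P` and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical observations: (gospel, discourse type, clause type).
observations = [
    ("Lk", "N", "Pga"),
    ("Lk", "N", "Pc1"),
    ("Mt", "N", "Pga"),
    ("Mt", "P", "Cond"),
    ("Mk", "N", "Pc2"),
    ("Mk", "N", "Pga"),
]


def flatten_to_two_way(observations):
    """Merge gospel and discourse into one row label, then count clauses."""
    counts = Counter(
        (gospel + discourse, clause) for gospel, discourse, clause in observations
    )
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table


rows, cols, table = flatten_to_two_way(observations)
```

Each combination of gospel and discourse becomes its own row, so the third measure survives in the row labels even though the table itself has only two dimensions.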
Figure 6.5 shows the resulting superimposed graphs, for the first two factors arising from CA. Points for the gospels appear in bold face, and end in a letter that reflects a specific type of discourse. Subordinate clauses are shown in normal type. The exact meaning of those labels is unimportant, here; rather, it is how overlaid graphs help to reveal correspondence between sets. Particular types of clause can be seen to congregate around groups of gospels. Some such clusters have been highlighted in ellipses:
Figure 6.5: Stylistic analysis of the synoptic gospels (Unmans, 1998)
Although little interpretation of Figure 6.5 was offered, it demonstrates the benefit of plotting CA graphs together. In addition to showing clear clusters of gospels and clauses, superimposed charts make obvious any correspondence between the two categories.
Take, for example, clusters that span the upper and lower left-hand quadrants. There, narrative sections from all three synoptic gospels form a well-defined cluster of points LkN, MtN and MkN. Clauses of type Pga, Pc1 and Pc2 cluster together in a similar way, respectively associating clauses having absolute participles with those having various conjunctive participles. Superimposed graphs make obvious the specific correspondence between those sections and clauses.
Although Figure 6.5 gives no percentages of inertia, the first two factors actually represented 78.1% of the total information content. Beyond subordinate clauses, further graphs plotted, for example, correspondence between gospels and certain Greek words. In all cases, attention was paid to the importance of factors arising from CA. Proportions of inertia attributable to the first two factors from those analyses ranged from 45.8% to 99.2%. The latter figure, in particular, showed that the first two factors of that CA space held nearly all of the essential information (Unmans, 1998).
Another aspect of note from that study of the gospels concerns the distances between points. While quadrants on a graph reflect major differences between sets, distances between points reflect the extent of any resemblance. Mutually close points reflect a high degree of correspondence, whereas points that are widely separated are very dissimilar. That applies to distances between row points and between column points; it is important to add that CA allows comparisons both between and within those sets of points. Although particular points may not be close in the first two dimensions, they might well be so in higher dimensions. For that reason, unequivocal clustering must sometimes account for additional factors, not just the two most important ones (Unmans, 1998).
Using CA to Analyse Trigrams
Rather than counting individual words, the CA matrix for the next study comprised frequencies for trigrams. Those trigrams came from essays by students of English as a Foreign Language (EFL), at five levels of education. Trigram components were assigned parts of speech (POS) by the TOSCA tagger; for example, the trigram ADJ-N-PREP represented an adjective, followed by a noun, then a preposition (Tono, 1999).
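Forming POS trigrams from a tagged sentence might look like the sketch below. It assumes the tags have already been assigned, so no tagger is called; the tag `ART`, the example sequence and the function name are hypothetical, with only ADJ, N and PREP taken from the example above.

```python
from collections import Counter


def pos_trigrams(tags):
    """Return every run of three consecutive POS tags, joined by hyphens."""
    return ["-".join(tags[i:i + 3]) for i in range(len(tags) - 2)]


# Hypothetical tag sequence for one short sentence.
tagged = ["ART", "ADJ", "N", "PREP", "ART", "N"]
trigram_counts = Counter(pos_trigrams(tagged))
```

Counting such trigrams separately for each level of education would give one row of frequencies per level, which is the shape of matrix that CA needs.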
The resulting correspondence between age and patterns of trigram usage appears in Figure 6.6. In particular, junior school children used mainly nouns, while university students accounted for most trigrams involving prepositions. Those two clusters have been circled in blue:
Figure 6.6: Plot of the relationship between trigrams and age (Tono, 1999)
Points for junior school children and for university students in Figure 6.6 cluster at the outskirts of diagonally opposed quadrants; between those extremes, the three remaining age groups were alike in employing largely verb-related trigrams (Tono, 1999).
Introducing WordNet into CA
The last study under review combined WordNet and CA, to investigate any effect on intelligibility of replacing words in sentences. The similarity of replacements to original words was assessed, in part, using WordNet’s measure of semantic distance, which reflected degrees of separation between words in the underlying semantic networks. For example, the distance between ‘car’ and ‘gasoline’ was smaller than that between ‘car’ and ‘bicycle’.
Students at various levels of ability in English had to substitute words in sentences. Human annotators judged the suitability of replacements, using categories ranging from ‘clear’ to ‘unintelligible’. Figure 6.7 shows the interaction between proficiency, and the semantic distance that gave clear substitutions:
Figure 6.7: CA graph showing semantic distance of acceptable word substitutions, by age (Izumi et al., 2007)
Figure 6.7 clearly separated students at level 9, in the top-right quadrant, from those having lower levels of proficiency. Very able students tolerated the greatest semantic distances, and appeared near the edge of the graph. Low ability students, on the other hand, showed a corresponding intolerance: substituted words had to be closely related to any original word. Remaining groups understood reasonably distant substitutions, and clustered around the centre of the graph.
In fact, WordNet was just one of several measures of semantic distance that were employed. Separate runs of CA were needed to assess the effect of any specific measure of distance. That was due to the need for simple contingency tables, which was seen as a clear limitation of CA (Izumi et al., 2007).
While demonstrating a notable combination of trigrams, WordNet and CA, that study raises a difficulty with interpreting graphs. Points for ability in English between levels 3 and 8 clustered around the origin, separating them from the lowest and highest levels. The problem arises of deciding just how many clusters are reflected, and which points belong to each cluster. If just three clusters were sought, the central points might appear as a single group. Four clusters would most likely split levels 7 and 8 into a separate cluster. Assessing the optimum number of clusters, then, poses a challenge to humans, and more so to machines.
Summary of CA for Analysing Text
Common threads among the applications of CA to researching text close this section. In general, the distribution across categories of one set was used to explain the spread in a second set, and vice versa. Graphs from CA largely plotted one factor on the x-axis against a second factor on the y-axis. The coarsest measure of correspondence was the split between positive and negative sides of a particular axis. The next level of detail arose from interactions between the two axes, giving four quadrants that reflected major sub-categories; further specific correspondences arose on inspecting the distances between points (Unmans, 1998; Nishina, 2007).
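Those three levels of detail (side of an axis, quadrant, and distance between points) are simple enough to express in code. The sketch below is illustrative only: the coordinates are invented rather than read from any figure, and the function names are hypothetical.

```python
import math


def quadrant(x, y):
    """Name the quadrant of a point from the signs of its two factor scores."""
    if x == 0 or y == 0:
        return "on-axis"
    vertical = "top" if y > 0 else "bottom"
    horizontal = "right" if x > 0 else "left"
    return f"{vertical}-{horizontal}"


def nearest(label, points):
    """Return the label of the other point closest to `label`."""
    others = (name for name in points if name != label)
    return min(others, key=lambda name: math.dist(points[label], points[name]))


# Hypothetical coordinates on the first two factors.
points = {
    "J": (1.2, 0.8),
    "form": (0.9, 0.6),
    "government": (0.7, -0.9),
}
quadrants = {name: quadrant(x, y) for name, (x, y) in points.items()}
closest_to_j = nearest("J", points)
```

With these made-up values, the genre J and the word ‘form’ share a quadrant and are mutually closest, which is the kind of correspondence that the reviewed studies read from graphs by eye.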
With the exception of Nishina (2007), the reviewed studies overlaid graphs to aid comparisons between sets I and J, which occupied comparable CA spaces. Another area of agreement lay in noting any contribution made by specific factors to any overall variation. The two factors normally shown in a CA plot, though, might account for just a proportion of that variation; a certain amount of variation will remain unexplained, should just a few dimensions be considered. That said, any first two factors generally accounted for an acceptably large percentage of overall variation.
Work on analysing trigrams showed the benefit of considering n-gram frequencies rather than counting individual words. Given the importance to this thesis of trigrams that contain prepositions, the study by Tono (1999) was of particular interest. Combined with WordNet, as in work by Izumi et al. (2007), CA offers a powerful tool for discerning patterns of preposition usage in GRiST mind maps. Rather than as an end in itself, though, such patterns will further aid in disambiguating important words from those mind maps, in turn enhancing the emerging information base.
A common aspect of the reviewed studies, though, was the degree of human intervention. Indeed, all of them ran CA manually, with humans inspecting any results. Maximising automation in processing GRiST mind maps, though, requires machines to perform all aspects of CA unaided. While building matrices and running CA are easy enough, reading graphs will be more of a problem, as will selecting optimum numbers of factors and clusters. Further, machines must collate row and column clusters, and interpret points towards the edges of graphs, whose axes must further be analysed. All of those functions are normally done by eye; what follows, then, is a proposed approach to automating CA for GRiST mind maps.