TABLE 3.1 Genres represented in OHPC
3.3 COMPUTATIONAL ISSUES
Ideally, the FEIs in a corpus would be identified automatically by machine, thus removing human error or partiality from the equation. There is, however, no evidence that this is possible given the current state of the art. It is also
difficult to see exactly how progress can be made. The problems arise because in so many cases FEIs are not predictable, not common, not fixed formally, and not fixed temporally (that is, they are often vogue items like slang). They are dynamic vocabulary items, whereas, at least at present, corpus processing requires givens and stability.
The size of the computer-held corpora used in linguistic research has increased over the last 30 years by orders of magnitude. Leech comments on how 'first-generation' corpora of up to 1 million words (such as the Brown and LOB corpora) have been succeeded by 'second-generation' corpora of 20 million or so words such as OHPC and the Birmingham Collection of English Text, and then by 'third-generation' corpora of hundreds of millions of words
(1991: 10). There is too much data for manual analysis, and preprocessing is required in order to make use of the information that such corpora contain. That is, routines are run over the data in order to identify certain features of
the component text. At the most basic level, the text is indexed so that tokens of the same word type may be retrieved as a set. Beyond this, the most
extensively developed and successful of the preprocessing or automatic
routines involve the tagging of words with part-of-speech labels, the parsing of the sequences so tagged, and the identification of recurrent collocations.
Research includes the tagging of the Brown corpus, reported in Greene and Rubin (1971), and of the LOB corpus, reported in Garside et al. (1987) and Garside (1987); taggers developed at AT&T Bell Labs, reported, for example, in Church (1988); taggers and parsers developed in Helsinki, reported, for example, in Voutilainen et al. (1992); research into collocations led by Sinclair and reported, for example, in Sinclair et al. (1970), Sinclair (1991), and Renouf and Sinclair (1991); and research into lexical statistics and the significances of collocations by Church and colleagues and reported in Church and Hanks (1990), Church et al. (1991), and Church et al. (1994). Such work has concentrated on formal aspects of corpora. Research into semantics--for example, the automatic disambiguation of homographic or polysemous
words--has largely proceeded by distinguishing word uses on the basis of
grammar or lexical collocation. However, routines are not yet robust or delicate enough to detect the more subtle polysemies recorded in dictionaries.
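The most basic routine mentioned above, indexing a text so that tokens of the same word type may be retrieved as a set, can be illustrated with a minimal sketch. It assumes a crude regular-expression tokenizer and treats lower-cased forms as types; the names build_index and concordance are purely illustrative and do not correspond to any routine actually run over OHPC.

```python
import re
from collections import defaultdict

def build_index(text):
    """Tokenize crudely and map each lower-cased word type to its token positions."""
    tokens = re.findall(r"[A-Za-z']+", text)
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return tokens, index

def concordance(tokens, index, word, width=3):
    """Return (left context, node word, right context) for every token of `word`."""
    lines = []
    for pos in index.get(word.lower(), []):
        left = tokens[max(0, pos - width):pos]
        right = tokens[pos + 1:pos + 1 + width]
        lines.append((" ".join(left), tokens[pos], " ".join(right)))
    return lines

# Invented sample text, used only to exercise the sketch.
sample = ("He took umbrage at the remark. "
          "She never takes umbrage, but umbrage was taken.")
tokens, index = build_index(sample)
umbrage_lines = concordance(tokens, index, "umbrage", width=2)
```

An index of this kind retrieves forms only: it says nothing about whether a given token belongs to a larger unit, which is why further preprocessing is needed.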
Lexicographers using monolingual corpora to analyse lexis in order to write conventional (non-AI) dictionaries still largely rely on data that has been processed only in terms of form, syntactic function, and lexical collocation. FEIs present a particular problem for preprocessing routines. Semantically and often syntactically they function as units rather than as arbitrary sequences, but they need not be contiguous or uninterrupted. They may be syntactically or lexically ill formed, breaking conventional grammatical rules or valency
patterns. Higher-level automatic routines, attempting to establish meaning or topic, may be thrown by the lexico-semantic incongruence of FEIs in their contexts.
One solution adopted is the use of a preprocessing routine to identify and tag FEIs. These routines largely rely on the establishment of a look-up list of FEIs and then pattern matching: the text processed is searched for matches of the predefined listed strings, and occurrences are flagged as units. Preprocessing routines thus
reflect the tradition of the notional or actual storage of FEIs as a separate part of the lexicon, as well as reflecting work on patterning and chunking in
language in general. A number of routines of different kinds have been developed. The tagging of the LOB corpus was assisted by a program
IDIOMTAG, described and discussed by Blackwell (1987) and McEnery (1992: 67), whereby holistic units such as at first sight and to and fro were tagged as such. This often involved the use of 'ditto tags', whereby subsequent elements in non-compositional units such as at last and time and again were linked to the first element and given a tag appropriate to the whole unit. Johansson and Hofland describe these units further (1989), and list 160 of them. The list of combinations includes foreign phrases such as ultra vires and compounds such as button stitch as well as FEIs, and it is clear that the list was only a tiny
programs are described and discussed by Wilensky and Arens (1980), Kunst and Blank (1982), Chin (1992), Stock et al. (1993: 237 f.), and Cignoni and Coffey (1995). Martin (1996: 88 f.) gives an overview of the problems presented in computational processing of FEIs, and recent European work includes the DECIDE project.
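The look-up-list approach with ditto tags described above can be sketched as follows. This is a minimal illustration only: the four listed units are taken from the examples in the text, but the tag label ADV, the use of a trailing double-quote mark for the ditto tag, and the function name tag_units are assumptions made for the sketch and are not claimed to reproduce IDIOMTAG or the LOB tagset.

```python
# Hypothetical look-up list: each fixed word sequence maps to one tag for the
# whole unit. The tag label is illustrative, not the actual LOB tagset.
LOOKUP = {
    ("at", "first", "sight"): "ADV",
    ("to", "and", "fro"): "ADV",
    ("at", "last"): "ADV",
    ("time", "and", "again"): "ADV",
}
MAX_LEN = max(len(key) for key in LOOKUP)

def tag_units(tokens):
    """Greedy longest-match tagging against the look-up list.

    Returns (token, tag) pairs. Tokens outside any listed unit get None.
    The first element of a unit gets the unit's tag; subsequent elements
    get the same tag with a ditto mark appended (an assumed convention).
    """
    lowered = [t.lower() for t in tokens]
    tagged = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate sequence first, down to two words.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(lowered[i:i + n])
            if key in LOOKUP:
                match = (n, LOOKUP[key])
                break
        if match is None:
            tagged.append((tokens[i], None))
            i += 1
        else:
            n, tag = match
            tagged.append((tokens[i], tag))
            for j in range(1, n):
                tagged.append((tokens[i + j], tag + '"'))
            i += n
    return tagged

result = tag_units("They walked to and fro at last".split())
```

Because matching is by exact string sequence, a routine of this kind flags every contiguous occurrence of a listed string, including any that are literal, and misses any token that is inflected, reordered, or interrupted.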
Preprocessing routines work best in cases where the FEIs contain unique constituents or unique sequences, as with to and fro or take umbrage, or are completely frozen 'big words'. They work less well where FEIs have
unpredictable transformational behaviour or where they are interrupted by non-canonical words. They work least well when the FEI has variant forms or is exploited. Stock et al. (1993) describe their program WEDNESDAY 2, which deals reasonably successfully with variant word order in FEIs as well as some other kinds of variation--although not exploitations. Gross (1993) describes a parsing model which improves the handling of variation in FEIs, particularly in the cases of sets such as blow one's top/cork/stack and of FEIs with
interpolated items. Pulman (1993) rejects the notions of the look-up list and the canonical form, preferring instead a notion of lexical indexing, where each component is marked with its special features. Breidt et al. (1996) describe IDAREX, which involves the use of local grammars in which the
morphological, syntactic, and sequencing properties for items are stated on an individual basis, in addition to some information concerning lexical variation (they do not specify how robust information concerning all this is to be acquired): the information is then accessed during the processing of text in order to identify FEIs.
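To make concrete what it means to state variation and interpolation on an item-by-item basis, the following sketch describes a single FEI, blow one's top/cork/stack, as a small record and matches it against tokenized text. It is not IDAREX notation, nor the models of Gross or Stock et al.: the record layout, the max_gap parameter, and the function find_variants are all hypothetical, and the inflected forms and possessives are simply listed by hand.

```python
# Hypothetical item-specific description: inflected verb forms, a possessive
# slot, the set of lexical variants, and how many interpolated words to allow.
ENTRY = {
    "verb": {"blow", "blows", "blew", "blown", "blowing"},
    "slot": {"my", "your", "his", "her", "its", "our", "their", "one's"},
    "nouns": {"top", "cork", "stack"},
    "max_gap": 1,
}

def find_variants(tokens, entry=ENTRY):
    """Return (verb_index, noun_index) spans that fit the entry's pattern.

    The pattern is verb ... possessive ... noun, in that order, with at most
    `max_gap` interpolated words before the possessive and before the noun.
    """
    words = [t.lower() for t in tokens]
    gap = entry["max_gap"]
    hits = []
    for i, word in enumerate(words):
        if word not in entry["verb"]:
            continue
        # Look for the possessive within the permitted window after the verb.
        for j in range(i + 1, min(len(words), i + 2 + gap)):
            if words[j] not in entry["slot"]:
                continue
            # Look for one of the variant nouns within the window after it.
            for k in range(j + 1, min(len(words), j + 2 + gap)):
                if words[k] in entry["nouns"]:
                    hits.append((i, k))
                    break
            break
    return hits

canonical = find_variants("She finally blew her top".split())
interrupted = find_variants("He blew his proverbial stack".split())
literal = find_variants("The wind blew the top off".split())
```

Even this small record shows where the difficulty lies: every value in it has to come from somewhere, and exploitations that depart from the listed variants would still go undetected.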
A look-up list is based in the first place on secondary sources, albeit modified by an examination of a corpus, and this approach to corpus analysis is
diametrically opposite to the kinds of empirical study being carried out on collocation. Ideally, any base look-up list of FEIs would be generated automatically or empirically, but this is hard to do. The use of secondary
sources in establishing a base look-up list is unsatisfactory and a compromise; however, it allows a reasonably powerful processing tool to be developed and can of course be adjusted as information becomes available. Comparisons may be made with other work in natural-language processing, for example, work carried out on the automatic detection of metaphor and metonymy in text (Fass 1991; Martin 1990). Such work typically involves some pretraining such as access to a list of valencies of the literal meanings and interpretation of nouns and so on in terms of superordinates or isa-structures. Work by Church et al. on lexical substitutability (1994) can lead to the automatic detection of sets of collocates, for example the range of verbs used with a noun or vice versa, but again it starts from a manually observed 'goodcase' pairing. Taggers themselves start from look-up tables of the word classes of particular words or sets of rules, and probabilities of polyfunctional words. If look-up lists are incorporated into routines, the size of the list becomes important. The list of ditto tags used for the LOB corpus contained relatively few items, all 'big words'. Smith, in discussing the preprocessing of idioms, suggests that there are 4-5,000 common idioms in everyday use (1991: 64), although his definition of 'idiom' is fairly loose. In discussions with the Hector
team at DEC/SRC concerning the value of fully tagging multi-word items, we decided that we should expect to find around 15,000 FEIs and phrasal verbs in a second-generation corpus of current British and American English: routines based on the results of the tagging would need to know at least that number of items in order to be effective. This figure would need to be increased for
substantially bigger corpora such as BOLE, probably to around 25,000, since many more randomly occurring low-frequency items would be observed. A comparatively late development in the course of the Hector research collaboration between Oxford University Press and Digital Equipment
Corporation was the introduction of 'z-tagging'. Tokens of individual types in OHPC were already being tagged in order to link tokens with the relevant sense in an electronic dictionary entry that synthesized the evidence found for the type in OHPC. In particular, each token of each FEI, phrasal verb, and
compound was given an individual sense tag (for each of its meanings if it was polysemous). With the introduction of z-tagging, the main tag for each
multi-word item was set at one of the elements: the first word in a compound, the verb in a phrasal verb, and a fixed lexical element in an idiom. All
subsidiary items were then given the same tag, suffixed with -z, thus binding all parts of all individual tokens of individual lexical items, regardless of
orthography. Had OHPC been fully sense-tagged, it would have given a complete record of the lexical items in the corpus, as opposed to the
orthographic words. This would have interesting potential. It would be possible to compute the proportion of words in text that form part of complex lexical items, both in the corpus as a whole and in individual texts or genres, thus measuring the density of such items in text. The tagging of subsidiary elements in compounds and phrasal verbs would enable lexicographers to analyse and describe more efficiently the morphology and semantics of such items. One of the aims of hand-tagging OHPC for sense and syntax was to use the tagged corpus as a training corpus, to facilitate the automatic analysis of an untagged corpus by recognition of recurrent patterns associated with individual meanings (see Glassman et al. 1992; compare Leech 1991: 18 and
passim). The detailed tagging of multi-word items would extend the kind of
automatic analysis possible. Finally, a fully sense-tagged corpus would make it possible to compute the relative proportions of the FEI tokens, including
variations, and their homographic non-idiomatic strings. The resulting data could be used to establish probabilistically the likelihood of further tokens of individual strings in other corpora being idiomatic or literal.
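The two computations just described, the density of multi-word items in text and the likelihood that a given string is idiomatic rather than literal, can be sketched under some stated assumptions. Only the -z suffix on subsidiary elements is taken from the account of z-tagging above; the tag strings themselves (bean.i1, give.pv2), the representation of a sentence as a list of (word, tag) pairs, and both function names are invented for illustration and do not reflect the actual Hector tag format.

```python
def multiword_density(sentences):
    """Proportion of orthographic words that belong to a multi-word item.

    Each sentence is a list of (word, tag) pairs. A subsidiary element carries
    its item's tag suffixed with -z. The main element is recognized, within
    the same sentence, as the token whose tag matches a z-tag minus the suffix
    (a simplifying assumption of this sketch).
    """
    total = 0
    in_items = 0
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        z_bases = {t[:-2] for t in tags if t and t.endswith("-z")}
        for tag in tags:
            total += 1
            if tag and (tag.endswith("-z") or tag in z_bases):
                in_items += 1
    return in_items / total if total else 0.0

def idiomatic_probability(idiomatic, literal):
    """Relative frequency of idiomatic tokens among all tokens of a string."""
    total = idiomatic + literal
    return idiomatic / total if total else None

# Invented tagged sentences: an idiom with its main tag on a fixed lexical
# element, and a phrasal verb with its main tag on the verb.
sentences = [
    [("She", None), ("spilled", "bean.i1-z"), ("the", "bean.i1-z"),
     ("beans", "bean.i1"), ("yesterday", None)],
    [("He", None), ("gave", "give.pv2"), ("up", "give.pv2-z"),
     ("smoking", None)],
]
density = multiword_density(sentences)  # 5 of the 9 words are in items
```

The same density function could be applied to the sentences of a single text or genre rather than a whole corpus, and the relative frequency returned by idiomatic_probability is only as reliable as the counts, which would need to come from a fully sense-tagged corpus.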
If preprocessing routines are to incorporate a look-up list, then much of their success clearly depends on the robustness and completeness of the list. The better the list, the more likely FEI tokens are to be detected. In particular, information concerning syntax, transformations, variations, and non-canonical insertions would, as Breidt et al. suggest, allow more sophisticated matching procedures to be developed. However, this information needs to be derived from robust sources such as large corpora, and not from intuition or from commercial dictionaries. Including data concerning the distribution of FEIs in a corpus would help in the identification of FEIs in other corpora, through
random basis, information to the effect that an FEI is rare has itself some predictive power.
There are several benefits of improved automatic routines for
recognizing FEIs in text. First, it would improve the accuracy of tagging and parsing routines in general. Secondly, it would become possible to investigate more robustly the distribution of FEIs, and of different kinds of FEI, in specific genres, varieties, or idiolects (see Biber and Finegan (1991) for a discussion of the methodology of using corpora in large-scale studies of variation, and Crystal (1991) on stylistic profiling). This would lead to a better understanding of the lexicon and could be unified with work on the recurrent collocations of text and of specific text-genres. As work developed, information gathered could be fed back into the look-up list in order to modify, augment, and
improve it. Thirdly, there are possible applications in machine translation (see Bar-Hillel (1955) and many writers since on the automatic translation of idioms). Finally, Meehan et al. (1993) describe other applications in
natural-language processing such as a software product, functioning like a spell-checker or grammar-checker, to monitor the use of FEIs in text being composed. It would report aberrant or marked uses, abnormally high (or even low) densities of idioms, Americanisms and Briticisms, and so on. An extension of the FEI list would incorporate stock formulae--Pawley and Syder's
'lexicalized sentence stems' (1983), Coulmas's 'routine formulae' (1979b), Nattinger and DeCarrico's 'lexical phrases' (1992)--in order to exploit the recurrence of such formulae in speech recognition work. For example, many telephone calls involve the eliciting and giving of information and operate round simple exchange structures, realized by prefabricated routines and
semi-institutionalized strings, or programmable data. The starting-point for all such work must be a detailed corpus-based, text-based description of FEIs, of the kind begun in my own study, and incorporating distributional, formal, semantic, and discoursal information.
4
Frequencies and FEIs
In setting out what I learned about the distribution of FEIs from investigating a corpus, I want to emphasize again the limitations of the study in order to
contextualize my findings. OHPC was an idiosyncratic corpus, but useful as a means to an end. My findings were intended to be benchmarking statistics; to provide some framework which could be used in further studies--for example, of different corpora, whether matched typologically or constructed from
different kinds of text. Some cross-corpus comparisons are given in the last sections of this chapter. It is my contention that the general tenor of the
distributions I observed is borne out by other corpora, and anomalies can be explained.