Chapter 3. Methodology
3.2 Corpus Linguistics
3.2.1 Corpus-based methods to approach knowledge construction
Some other scholars apply corpus-based methods to investigate knowledge through examinations of lexis use. The approach owes a great deal to the advent and development of computer technology in the field of Applied Linguistics. In contrast to manual compilation, computers have made it easier to store and retrieve large collections of language data in electronic form and to analyse them using “increasingly sophisticated, versatile and user-friendly software tools” (Altenberg and Granger 2002: 1). Using a corpus has advantages for the study of lexis since 1) “lexis lends itself perfectly to the form-based research (e.g. letters, lemmas, word spaces, punctuations, etc.) at which computers excel” (Altenberg and Granger 2002: 1); and 2) the distribution of lexis can be easily computed and observed from frequency counts of words (Hunston 2002), which forms the basis for more complex and sophisticated computation of linguistic relations. These two factors make the task of examining lexis use straightforward (Moon 2010) and the examination inductive rather than intuitive.
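The frequency counts mentioned above can be illustrated with a minimal sketch. It is not drawn from any of the studies cited. The tokenisation rule, the function name `word_frequencies` and the sample sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return raw frequency counts of lower-cased word tokens.

    Assumption: a token is any run of letters or apostrophes. Real corpus
    tools apply more careful tokenisation and often lemmatisation.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, used only to show the call.
sample = "The corpus stores words. The words are counted, and the counts are compared."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Counts of this kind are the input to the more complex computations discussed in the rest of this section, such as association measures for collocations.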
Although scholars using corpora to approach lexis do not usually address knowledge construction in an explicit way, their work has serious implications for identifying knowledge structures through patterned lexis use in discourse, which might be overlooked without the assistance of computer techniques. Patterns emerging at the lexico-grammatical level (see Halliday 1966; Sinclair 1991; Biber et al. 1998) have particular relevance to the restrictiveness of linguistic choices for meaning-making, a process which would reveal the scope and preference of knowledge construction through natural language use.
Knowledge structures representing schematized human experience are related to formulaic language use constituting patterned word associations. Linguistic research on word associations has focused on the existence of collocations, the combinational restrictions reflecting “the habitual or customary places” of words in company (Firth 1957: 12). Earlier work paid attention to the computation and attribution of collocational patterns based on marked word senses and uses (e.g. Biber 1993; Smadja 1993; Stubbs 1995; Williams 1998). Numerous empirical studies followed to broaden the investigatory scope by explicating the implications of words’ collocational behaviours in generic discourses (e.g. Xiao and McEnery 2006, language learning; McEnery 2006a, 2006b, bad language and moral panic; Baker et al. 2008, media discourse; Siyanova and Schmitt 2008, second language production and processing). More recent research, interestingly, has seen a trend towards reconsidering and improving the existing methodological procedures to address the multi-dimensionality of collocational phenomena (e.g. Evert 2010; Pecina 2010; Gries 2013; Brezina et al. 2015; Baker 2016).
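To make the idea of computing collocational patterns concrete, the sketch below counts the words that co-occur with a node word within a fixed window span and scores one pair with a mutual information (MI) measure. This is only one common formulation of MI, and the studies cited above use a variety of association measures. The function names, the default span of four tokens and the normalisation of the expected frequency are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens (an assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def mutual_information(tokens, node, collocate, span=4):
    """MI = log2(observed / expected) for one node-collocate pair.

    Assumption: the expected co-occurrence frequency is
    f(node) * f(collocate) * (2 * span) / N, one of several conventions.
    """
    freq = Counter(tokens)
    observed = collocates(tokens, node, span)[collocate]
    if observed == 0:
        return float("-inf")  # the pair never co-occurs within the span
    expected = freq[node] * freq[collocate] * (2 * span) / len(tokens)
    return math.log2(observed / expected)


# Hypothetical toy corpus, used only to show the calls.
toy = tokenize("strong tea and strong coffee but weak tea")
print(collocates(toy, "strong", span=1))
print(mutual_information(toy, "strong", "tea", span=1))
```

A positive score suggests that the pair co-occurs more often than chance would predict under this formulation. With a corpus as small as the toy example the values carry no linguistic weight.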
Closely related to meaning representation through collocations is the research on semantic prosody. Arising from Corpus Linguistics, the term semantic prosody was attributed to Sinclair (1991) and first introduced to the public by Louw (1993). In his search for extended units of meaning using concordance lines, Sinclair (1996) found that many words occur frequently in recurring sequences which reveal textual patterns of meaning-making. Stubbs (2002) referred to such observable semantic relations between a given word and its typical collocates as semantic prosody (p. 225), “a form of meaning which is established through the proximity of a consistent series of collocates” (Louw 2000: 57). Semantic prosody has been viewed as expressing speaker/writer attitude or evaluation (Louw 2000; Xiao and McEnery 2006; Bednarek 2008), which contributes to revealing how a collocational structure is to be interpreted functionally (Sinclair 1996). Such investigation may lead the researcher “close to the boundary of the lexical item” (Sinclair 1996: 34) to identify the basic units of meaning.
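Concordance lines of the kind referred to above are conventionally displayed in a key-word-in-context (KWIC) layout. The sketch below shows one simple way to produce such lines. The function name `concordance`, the tokenisation rule and the sample sentences are hypothetical and do not reproduce the procedure of any study cited here.

```python
import re


def concordance(text, node, width=3):
    """Return (left context, node, right context) triples for each hit.

    Assumption: context is measured in word tokens rather than characters,
    and matching ignores case.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, used only to show the call.
sample = "The storm will cause damage. Delays cause problems for everyone."
for left, node, right in concordance(sample, "cause", width=2):
    print(f"{left:>20}  {node}  {right}")
```

Reading down the right-hand context of many such lines is how an analyst might notice that a node word tends to keep company with, for instance, predominantly negative collocates.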
The attitudinal functions of collocations, however, are far from explicit and categorical. While Sinclair (2004) saw semantic prosody as an obligatory property, Partington (2004) regarded it as gradable by drawing a binary distinction between positive and negative attitudinal meanings. This concerns how knowledge represented by collocations is to be constructed across contexts (Whitsitt 2005); namely, to what extent the evaluative knowledge of a frequent collocation found in one context “carries over” to another (Hunston 2002: 141).
Another group of scholars has approached the relationship between lexis use and knowledge construction by conducting focused semantic analyses of language data. Many researchers have been working on knowledge-based Word Sense Disambiguation (WSD) using a broad range of corpus approaches (Mihalcea 2006). Examples include the eXtended WordNet (Mihalcea and Moldovan 2001), large collections of semantic preferences retrieved from SemCor (Agirre and Martinez 2001) and the BNC (McCarthy 2001), and large-scale topic signatures acquired from the BNC (Cuadros et al. 2005). Specifically, Cuadros and Rigau (2007) evaluated the relative quality of available knowledge resources on a WSD task to build a large and rich knowledge base for broad-coverage semantic processing. Hassan et al. (2007) introduced a system to identify lexical substitutions (McCarthy 2002) for words in a given context by combining knowledge sources. Other researchers evaluated the accuracy of knowledge-based approaches to semantic tagging of corpus data (e.g. Andreevskaia and Bergler 2007) and to semantic relations between lexis of a certain part-of-speech in English (e.g. Tribble and Fahlman 2007; Beamer et al. 2007).
In particular, the notion of frame as a knowledge structure (Fillmore 1982a, 1982b) has been introduced and applied to corpus-based lexis processing. The contribution of FrameNet data to practical lexicography and natural language processing (NLP) has been extensively discussed (see Atkins et al. 2003; Fillmore et al. 2003; Petruck et al. 2004). Building upon this seminal work, Ruppenhofer et al. (2006) provided a comprehensive introduction to the FrameNet Project concerning how texts in a corpus can be reasonably grouped at the semantic level by identifying frames and frame-to-frame relations based on systematic annotation. Baker et al. (2007) designed a task to recognize words and phrases that evoke semantic frames defined in the FrameNet Project and to explore the semantic dependency between them. Litkowski (2007) integrated and exploited FrameNet data focusing on text processing in a knowledge management system to explore the feasibility of a dictionary-based approach to the extraction of frames from a corpus.
As shown above, researchers using corpus-based methods have overwhelmingly focused on the methodological procedures to identify valid units of meaning at the lexico-grammatical level and to explain language phenomena related to the representation and construction of knowledge. In spite of the enhanced scope and reliability of Corpus Linguistics analysis (see Biber et al. 1998), the mechanism and process of knowledge construction through natural language use have not been sufficiently addressed in the field with respect to the following:
How corpus technical procedures (e.g. annotation/tagging, frequency, keywords, collocations) can benefit knowledge perspectives to approach natural language use rather than vice versa;
How words and word clusters identified as linguistic forms encode different lexical concepts to achieve textual coherence in expanded discourse rather than within limited window spans;
How frames evoked by lexical concepts contribute to characterizing collaborative spoken discourse where knowledge is constructed through interaction.
Section 3.3 and Section 3.4 examined the two approaches involved in the methodological synergy for data analysis: Corpus Linguistics and Interactional Linguistics. An overview of each approach was provided in each section, followed by a detailed discussion on the principles and techniques relevant to the research focus.