
6.3 A general architecture for the analysis of learner language

6.3.1 The linguistic analysis underlying domain-specific assessment

6.3.2.2 The CG-based solution

The CG-based solution is a piece of software consisting of several modules implemented in general-purpose programming languages and in the formalism known as Constraint Grammar (Karlsson, 1990; Karlsson et al., 1995). It is used to analyse and spell-check unrestricted Catalan text (Badia et al., 2001; Alsina et al., 2002; Badia et al., 2004).

This solution includes all the modules shown in Figure 8.5. The Tokeniser, the Morphological Analyser and the Spell Checker are implemented in Perl and C++, and include an algorithm based on minimum edit distance measures for generating correction proposals (Badia et al., 2004). The dictionary look-up process uses a word-form list of more than one million entries, generated with a two-way morphological processing module (Badia et al., 1997). CG-based grammars are used in the Morphological Disambiguator, the Context-Sensitive Spell Checker, and the Information Extraction module.
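The idea of ranking correction proposals by minimum edit distance can be sketched as follows. This is only an illustration in Python: the actual algorithm of Badia et al. (2004) is not described in this section, so plain Levenshtein distance is assumed here, and the names `edit_distance`, `propose_corrections` and the toy list `WORD_FORMS` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def propose_corrections(word, lexicon, max_distance=1):
    """Return lexicon entries within max_distance of word, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance]


# Toy stand-in for the word-form list (the real one has over a million entries).
WORD_FORMS = ["casa", "cosa", "verda", "verd", "casar"]

proposals = propose_corrections("cassa", WORD_FORMS)
print(proposals)  # ['casa']
```

A linear scan like this would be too slow for a list with a million entries; it is meant only to make the ranking criterion concrete.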

In contrast to the MPRO-KURD solution, the CG-based one does not use explicit attribute-value pairs. It shows only the values and leaves the attributes implicit; but, as noted in footnote 4 (p. 94), data representation and data structure are not necessarily related. Figure 6.6 shows the output of the Tokeniser and the Morphological Analyser for the sentence in (2).

(2) La  casa  és verda.
    The house is green.

Linguistic information can be added in any systematic order, except that the lemma has to come first in each reading of a word, as shown in the indented lines of Figure 6.6. For instance, the word La has two readings, a determiner reading and a pronoun reading, both feminine singular. In CG terminology, a word together with its associated readings is called a cohort. For instance, casa and its three readings form a cohort.

<p id="1">
<s id="1">
"<La>"
    "el" Det fem sg
    "lo" Pron person febl acus 3pers fem sg
"<casa>"
    "casa" Nom com fem sg N5-FS
    "casar" Verb MInd Pres 3pers sg
    "casar" Verb MImp Pres 2pers sg
"<és>"
    "ser" Verb MInd Pres 3pers sg
"<verda>"
    "verd" Adj qual fem sg
"<$.>"
</s>
</p>

Figure 6.6: Results of the tokenisation and morphological analysis process for the Catalan sentence La casa és verda.
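To make the notion of a cohort concrete, the output format of Figure 6.6 can be read into a small data structure. The sketch below is in Python and assumes a layout with one word form per line, written as "<form>", followed by one indented reading per line whose first quoted item is the lemma. The names `Reading`, `Cohort` and `parse_cohorts` are hypothetical and not part of the ALLES software.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    lemma: str
    tags: list


@dataclass
class Cohort:
    wordform: str
    readings: list = field(default_factory=list)


def parse_cohorts(text):
    """Group each word form with the readings listed under it."""
    cohorts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("<"):
            continue  # blank line or markup such as <s id="1">
        if line.startswith('"<') and line.endswith('>"'):
            cohorts.append(Cohort(line[2:-2]))  # new word form
        elif line.startswith('"') and cohorts:
            lemma, _, rest = line[1:].partition('"')  # the lemma comes first
            cohorts[-1].readings.append(Reading(lemma, rest.split()))
    return cohorts


SAMPLE = '''
<s id="1">
"<La>"
    "el" Det fem sg
    "lo" Pron person febl acus 3pers fem sg
"<casa>"
    "casa" Nom com fem sg N5-FS
    "casar" Verb MInd Pres 3pers sg
    "casar" Verb MImp Pres 2pers sg
"<és>"
    "ser" Verb MInd Pres 3pers sg
"<verda>"
    "verd" Adj qual fem sg
"<$.>"
</s>
'''

cohorts = parse_cohorts(SAMPLE)
for c in cohorts:
    print(c.wordform, [r.lemma for r in c.readings])
```

Because the attributes are implicit, each reading is kept as a flat list of values; recovering attribute-value pairs would need an additional mapping from values to attributes.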

At this point in the processing, the modules implemented in Constraint Grammar are used. Karlsson et al. (1995: p. 1) define Constraint Grammar as “a language-independent formalism for surface-oriented, morphology-based parsing of unrestricted text. [...] All relevant structure is assigned via [...] simple mappings from morphology to syntax. The constraints discard as many alternative readings as possible [...] with the proviso that no genuine ambiguities should be obliterated”.

As with KURD, what is crucial in this definition is that CG relies initially on morphological information to perform increasingly complex levels of automated analysis. As shown by the most recent versions of some of the products offered by Connexor Oy, the company that distributes a commercial licence of CG, CG-based grammars can be used for tasks as complex as functional dependency parsing or semantic role labelling. With such techniques, Connexor Oy can provide solutions for the identification of opinions (in several types of texts), the detection of fraud, or the extraction of specific knowledge from large collections of biomedical articles.

6.3.2.2.1 The CG rule formalism

Technically, the CG formalism is implemented as a set of finite-state cascades that are applied sequentially. The grammar writer does not decide the order in which individual rules are applied, but can group rules into blocks so that the blocks apply in a given order. The CG interpreter builds a cascade of finite-state automata that controls the accepted or active paths, that is, the sequences of states given an input. The system keeps applying a grammar to a text as long as there are words whose information is modified. After two consecutive iterations with no modifications, the algorithm stops.
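The stopping criterion just described can be sketched as a simple loop. This Python fragment is an illustration only: it does not reproduce the finite-state cascade of a real CG interpreter, it assumes that "two iterations with no modifications" means two full passes over the text, and the function names and the two toy rules are hypothetical. Cohorts are reduced to sets of part-of-speech tags for brevity.

```python
def apply_until_stable(cohorts, rules, quiet_passes_needed=2):
    """Re-apply all rules until several consecutive passes change nothing."""
    quiet = 0   # consecutive passes without any modification
    passes = 0
    while quiet < quiet_passes_needed:
        changed = False
        for rule in rules:
            for i in range(len(cohorts)):
                if rule(cohorts, i):  # a rule returns True if it modified cohort i
                    changed = True
        passes += 1
        quiet = 0 if changed else quiet + 1
    return passes


def drop_verb_after_det(cohorts, i):
    """Toy rule: discard Verb if the previous word is unambiguously Det."""
    prev = cohorts[i - 1] if i > 0 else set()
    if {"Nom", "Verb"} <= cohorts[i] and prev == {"Det"}:
        cohorts[i].discard("Verb")
        return True
    return False


def drop_pron_before_noun(cohorts, i):
    """Toy rule: discard Pron if the next word has a Nom reading."""
    nxt = cohorts[i + 1] if i + 1 < len(cohorts) else set()
    if {"Det", "Pron"} <= cohorts[i] and "Nom" in nxt:
        cohorts[i].discard("Pron")
        return True
    return False


# La casa és verda, reduced to part-of-speech sets.
sentence = [{"Det", "Pron"}, {"Nom", "Verb"}, {"Verb"}, {"Adj"}]
passes = apply_until_stable(sentence, [drop_verb_after_det, drop_pron_before_noun])
print(passes, sentence)
```

In this toy run the first rule cannot fire until the second one has disambiguated La, so a second pass is needed before the text stabilises; this is the kind of interaction that makes repeated application necessary.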

The basic structure of CG rules is shown in (3). The Target specifies the linguistic features that must be present in the linguistic object on which the rule acts. The Operator indicates the action to be performed on the Target if the context matches. Possible actions are Remove and Select (for disambiguation) and Add, Map or Replace (for information mapping). The Context defines the linguistic properties of the words surrounding the Target that must be matched for the rule to apply.

(3) Operator (Target) IF Context;

Context positions are indicated with positive integers (to the right of the target) or negative integers (to the left of the target); position 0 is the target itself. The CG formalism offers the grammar writer further functionalities, such as working with relative or absolute positions, or creating contexts in which one or more of the conditions of application can be defined over a range of positions. A functionality called “careful mode” allows grammar writers to restrict application conditions so that a rule only applies if the condition matches unambiguously.

The rules needed to disambiguate the words La casa in our sample sentence (2) are shown in Figures 6.7 and 6.8. The rule in Figure 6.7 removes the Pron(oun) reading of any cohort that meets the following conditions: the cohort itself has a feminine singular determiner reading (position 0); a sentence start stands one position to its left (-1); a cohort with a feminine singular noun reading stands one position to its right (1); and an unambiguous finite verb stands two positions to its right (the C in 2C stands for careful mode, see above).

REMOVE (Pron) IF (-1 SentenceStart) (0 DET + FS) (1 NOM + FS) (2C VFIN);

Figure 6.7: Disambiguation rule that applies to the word La to remove the pronoun reading in the analysis of the sentence La casa és verda.

Figure 6.8 shows the rule that applies to the word casa so that its noun reading is selected, with the consequence that its two verb readings are discarded. The context description in this rule also uses careful mode (-1C and 1C) and looks at positions on both the left-hand and the right-hand side of the target word. Since the finite verb és immediately follows casa in (2), the verb condition is stated for position 1.

SELECT (Nom) IF (-2 SentenceStart) (-1C DET + FS) (0 NOM + FS) (1C VFIN);

Figure 6.8: Disambiguation rule applying to the word casa to select the noun reading in the analysis of the sentence La casa és verda.
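The way such rules act on the sample sentence can be approximated in a few lines of Python. This is a minimal sketch under several assumptions, not the CG interpreter itself: the sentence start is modelled as a pseudo-cohort, the set names DET + FS, NOM + FS and VFIN are expanded by hand into the tags of Figure 6.6 (VFIN is approximated by the tag Verb, since all verb readings in the sample are finite), and the finite verb condition for casa is checked at position 1, where és stands in (2). The names `matches` and `apply_rule` are hypothetical.

```python
def matches(cohort, tags, careful=False):
    """Some reading has all tags; in careful mode every reading must."""
    hits = [tags <= reading for reading in cohort]
    return bool(hits) and (all(hits) if careful else any(hits))


def apply_rule(sentence, operator, target, context):
    """Apply one REMOVE or SELECT rule at every position; report any change."""
    changed = False
    for i, cohort in enumerate(sentence):
        if not matches(cohort, target):
            continue
        ok = all(
            0 <= i + offset < len(sentence)
            and matches(sentence[i + offset], tags, careful)
            for offset, careful, tags in context
        )
        if not ok:
            continue
        if operator == "REMOVE":
            kept = [r for r in cohort if not target <= r]
        else:  # SELECT
            kept = [r for r in cohort if target <= r]
        if kept and len(kept) < len(cohort):  # never delete the last reading
            sentence[i] = kept
            changed = True
    return changed


sentence = [
    [{"SentenceStart"}],                                   # pseudo-cohort
    [{"Det", "fem", "sg"},
     {"Pron", "person", "febl", "acus", "3pers", "fem", "sg"}],  # La
    [{"Nom", "com", "fem", "sg"},
     {"Verb", "MInd", "Pres", "3pers", "sg"},
     {"Verb", "MImp", "Pres", "2pers", "sg"}],             # casa
    [{"Verb", "MInd", "Pres", "3pers", "sg"}],             # és
    [{"Adj", "qual", "fem", "sg"}],                        # verda
]

DET_FS, NOM_FS, VFIN = {"Det", "fem", "sg"}, {"Nom", "fem", "sg"}, {"Verb"}
START = {"SentenceStart"}

# Context conditions are (offset, careful, tags) triples.
remove_pron = ("REMOVE", {"Pron"},
               [(-1, False, START), (0, False, DET_FS),
                (1, False, NOM_FS), (2, True, VFIN)])
select_nom = ("SELECT", {"Nom"},
              [(-2, False, START), (-1, True, DET_FS),
               (0, False, NOM_FS), (1, True, VFIN)])

early = apply_rule(sentence, *select_nom)    # La is still ambiguous here
first = apply_rule(sentence, *remove_pron)
second = apply_rule(sentence, *select_nom)
print(early, first, second)
```

The first call has no effect because the careful condition at position -1 fails while La still has two readings; once the pronoun reading has been removed, the same rule selects the noun reading of casa.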

Further details on the CG-based solution used in ALLES can be found in Badia et al. (2001), Alsina et al. (2002), and Badia et al. (2004).

In document Language Learning Tasks and Automatic Analysis of Learner Language: Connecting FLTL and NLP design of ICALL materials supporting use in real-life instruction (Page 139-142)