From Psycholinguistic Modelling of Interlanguage in Second Language Acquisition to a Computational Model

(1)

From Psycholinguistic Modelling of Interlanguage in Second

Language Acquisition to a Computational Model

Montse Maritxalar

A r a n t z a D i a z de

Ilarraza

Maite Oronoz

Dept. of Computer

Languages and Systems.

Languages

and Systems.

Languages

and Systems.

Univ. of the Basque Country Univ. of the Basque Country Univ. of the Basque Country

649 postakutxa 20080

Donostia. Spain

j ipmaanm~si, ehu. es j ipdisaa@si, ehu. es j iboranm©si, ehu. es

A b s t r a c t

The present article demostrates the implementation of a psycholinguistic model of second language learners' interlanguage in an Intelligent Computer Assisted Lan- guage Learning (ICALL) system for studying second language acquisition. We have focused our work on the common interlanguage structures of students at the same language level. The Interlanguage Level Model (ILM) is made up of these structures.

In this paper we explain the conceptual model of ILMs, we present the experimental method followed for collecting written materiM, we describe the output of the modelling tool, and finally some conclusions and future work are outlined.

1 I n t r o d u c t i o n

The main goal of this article is to show the implementation of a psycholinguistic model of second language learners' interlanguage in an Intelligent Com- puter Assisted Language Learning (ICALL) system for studying second language acquisition. The ICALL system, and the computational model for representing interlanguage, have been designed after previous work on psycholinguistics, artificial intelligence, and computational linguistics. This article will be focused on the computational model of interlanguage. Description of the ICALL system can be found in (Maritxalar et al., 1994).

The concepts transitional dialects (Corder, 1971) and approximate systems (Nemser, 1971) are pre- cursors to interlanguage (Selinker, 1972) (Selinker, 1992). Their aim was to define communicative and grammatical competence in second languages. All of them have these common features: a) a student's

discourse is i n d e p e n d e n t f r o m t h e n a t i v e l a n - g u a g e (L1) a n d t h e t a r g e t l a n g u a g e (L2) and it is the product of a s t r u c t u r e d l i n g u i s t i c s y s - t e m ; b) the linguistic system is v a r i a b l e during the learning process and it is v e r y s i m i l a r in s t u d e n t s o f t h e s a m e l a n g u a g e level, with the exception of some differences, results of a person's learning expe- rience. In our case, the above mentioned characteristics are the basis for modelling the interlanguage. For example, as the linguistic system is very simi- lar in students of the same level, we can infer that we will have an interlanguage model for each level. Therefore, in our ICALL system we find a module where the different Interlanguage Level Models (ILMs) of each language level are represented. Rep- resentation of the ILMs will be presented in this article.

Our work is based on corpus analysis. At this moment we have focused on the implementation of the morphological and morphosyntactic competence at word level. Work at word level is important in our case because Basque is an agglutinative language with rich morphosyntactic information within words. We have studied texts written by Spanish students of Basque. We are aware of the limits of the modelling only taking into account written material. Spoken material could also be collected for a better modelling. However, as the computational tools we have for studying Basque are only for writing studies, we have discarded the treatment of the spoken output of the learners.

In the next section we will explain the conceptual model of ILMs by means of the KADS Domain Description Language (DDL) proposed in Schreiber (Schreiber et al., 1993). Next, we will give an idea of the experimental method used in order to collect written material and the top-down methodology used in order to model interlanguage in support of the selected corpora. After that, we will describe some of the implemented tools for modelling ILMs.

Martxalar, Diaz de Ilarraza ~ Maite Oronoz 50 Computational Model of Interlanguage

Montse Maritxalar, Arantza Dfaz de llarraza and Make Oronoz (1997) From Psycholingulstic Modelling of I n t e r l a n g u a g e in Second Language Acquisition to a Computational Model. In T.M. Ellison (ed.) CoNLL97:

Computational Natural Language Learning, ACL pp 50-59.

(2)

Then, we will describe the o u t p u t of the modelling tool, in order to compare the information given by the computational tool and the conceptual model of ILMs proposed before. Finally, some conclusions and future work are presented.

2 A C o n c e p t u a l M o d e l for

Interlanguage Level Models

In this section we show the psycholinguistic model for Interlanguage Level Models (ILMs). The ILMs characterise different grammars of the interlanguage the ideal learner can have for the second language at each language level. It must be noted that, although some linguistic structures are particular to each student, others are common to all students at the same level. These common structures are those which will be represented in the ILMs.

We represent ILMs by means of two different types of sequences of models: consecutive sequences and

embodied sequences. For each level, two different kinds of knowledge are represented: the variable knowledge, the knowledge the student is supposed to be learning; and, the fixed knowledge, the knowledge the learner has already acquired. T h e knowledge the language learners have acquired at a concrete level includes the knowledge acquired at previous levels. However, the knowledge they are learning at each level follows a different structure: each model from each level is independent from other levels, although intersections between knowledge at different levels t a n occur at times. T h e set of different variable language structures at each language level consists of consecutive sequences of models, while the set of fixed ones consists of embodied sequences (see Fig. 1). When we say fixed knowledge or variable knowledge we refer to the language structures the learners have in their interlanguage, without making any dis- tinction between the representation of correct and "incorrect" structures. When we specify interlanguage all the language structures, right and "deviant", are represented in the same way. It must be noted t h a t while a language structure used by high level learners can be considered deviant, the same structure in the case of beginners could be seen as correct at their level. For example a deviation at 10th level like avoiding the word ote 'could' (i.e. nor da? 'who is?' instead of nor ore da? 'who could be?' ) is not considered a deviation in lower levels of the language.

Language structures of the interlanguage are context-dependent (Selinker, 1992). Language learners create discourse domains as contexts for interlanguage development.

c

L

lg

• e •

Fig 1. ~ t w ~ c t i o a d ~ l l.n te fla ngmlge ~ d d ef X from X * L

Such domains are constructed in connection with life experiences t h a t have importance for the learner, containing prototypical interlanguage forms associated with the content area by the learner. Interlan- guage m a y be developing in one domain while at the same time it m a y be stabilised, or possibly fossilised, in another (Selinker, 1992). So, language structures of the interlanguage include information about context where they appear. Some structures appear in the interlanguage in specific contexts, however, others appear in any context. T h a t is why we can see, in the representation below t h a t structures can be context-dependent or context-independent. A first approximation of the representation (using DDL) for the ILMs is as follows:

s t r u c t u r e Interlanguage_Level_Models;

p a r t s :

models:

s e t ( i n s t ance(Interlanguage._Level_Model));

s t r u c t u r e Interlanguage_LeveLModel;

p a r t s :

fixed_knowledge:

s e t ( i n s t a n c e ( i n t e r l a n g u a g e _ x n o d e l ) ) ; c a r d i n a l i t y :

r a i n 0 m a x HIGHEST_LEVEL;

variable_knowledge:

s e t ( i n s t a n c e ( i n t e r l a n g u a g e _ m o d e l ) ) ;

c a r d i n a l i t y :

r a i n 0 m a x HIGHEST._LEVEL;

[image:2.596.294.520.121.360.2]

(3)

s t r u c t u r e

interlanguage_model;

p a r t s :

structure_context_indep :

s e t ( i n s t a n c e

(interlanguage_structure_context_indep));

strueture_context_dep :

s e t ( i n s t a n c e

(interlanguage_structure_context_dep)); p r o p e r t i e s :

language_level:

i n t e g e r - r a n g e ( I ,

HIGHEST.LEVEL);

s t r u c t u r e

interlanguage_structure_context_indep ;

s u b t y p e - o f : interlanguage_structure,

s t r u c t u r e

interlanguage_structure_context_dep;

s u b t y p e - o f : interlanguage_structure,

s t r u c t u r e

interlanguage_structure;

s u b t y p e s :

interlanguage_strueture_context _indep, interlanguage_structure_context _dep; p a r t s :

phenomena:

set ( i n s t a n c e (linguistic_phenomenon));

conditions:

set ( i n s t a n c e (interlanguage_condition)); p r o p e r t i e s :

description:

s t r i n g ;

eontextualization:

b o o l e a n ;

deviation:

b o o l e a n ;

stabilization:

function(stabilization of the rules associated to each phenomenon); a x i o m s :

If contextualization= False then conditions=<>.

The same interlanguage structure can be deviation=True at the X language level and deviation=False at the Y language level.

The value of

stabilization

is {rarely, sometimes, usually, always}

The interlanguage structures we propose are composed of

linguistic phenomena

which occur under some

conditions.

The properties of the interlanguage structures are:

description

of the structure;

contex-

tualization,

that is, context dependent or context independent;

deviation,

which marks if the structure must be considered deviant at the related language level and, finally,

stabilization.

The stabilization property is a qualitative value

VARIABLE

K N O W L E D G E

/ \ Stabilization

\

FIXED

[image:3.596.316.548.103.284.2]

K N O W L E D G E

Fig 2. Interlanguage.

which represents the acquisition level of the structure for a particular language level. It is by means of this property that a language structure inside the student's interlanguage is considered fixed knowledge or variable knowledge. When the value of the stabilization property is

always

it means that the language structure has been assimilated by the learner and in the future it would be quite difficult making some changes. However, when the value is

rarely,

there is a high variability of the language structure, and, therefore, a good teacher (could be an ICALL system) should be able to help the learner to take away those structures in case they would not direct the student towards a target language.

In the first approximation of the representation for the ILMs we have represented interlanguage structures as a set of linguistic phenomena depending on interlanguage conditions. Now, we will explain those conditions, and, we will also describe the definition of linguistic rules and replacements which define the linguistic phenomena.

s t r u c t u r e

linguistic_phenomenon;

p a r t s :

rules:

s e t ( i n s t a n c e (linguistic_rule)) I set (tuple(replacement-rule); p r o p e r t i e s :

type:

f u n c t i o n ( t y p e of the rules or replacements associated to each phenomenon)

description:

s t r i n g ;

global:

b o o l e a n ;

lexical_entry:

set(instance(morpheme)); axioms:

If global=False then lexical_entry = < >

(4)

s t r u c t u r e interlanguage_condition;

p a r t s :

linguistic_context:

s e t (instance(linguistic_condition));

non_linguistic_context:

s e t (instance(non.linguistic_condition));

Which kind of conditions must be considered in the interlanguage? We said above that, inside the interlanguage, language structures are context- dependent or context-independent (Selinker, 1992). Referring to context-dependent structures it is very i m p o r t a n t to differentiate between two types of contexts: non-linguistic and linguistic.

Non-linguistic conditions are related to a discourse domain. Interlanguage varies depending on the domain. T h e linguistic structures that we activate when writing a story are different to those we activate when we write scientific-technical texts (the- matic_conditions). Activity types, such as fielding questions, writing a letter, translating a sentence, conversing in a group etc. are also non-linguistic conditions.

Interlanguage forms can vary from one activity type to another, even though the discourse domain is the same. The activity types can be related to the structure of the whole text (text_conditions); in addition there are also some non-linguistic conditions (e.g. length of the sentence) related to a particular language structure in the text (struc- ture_conditions). For example, sometimes students mark agreement between verb and complements in small simple sentences, and, at the same time they forget it in long sentences. In our case, structure conditions are studied by means of the corpus; text conditions, however, are detected by means of interviews with the teachers and learners.

c o n c e p t condition;

s u b t y p e s :

linguistic_condition, non_linguistic_condition; p r o p e r t i e s :

description: s t r i n g ;

c o n c e p t linguistic_condition;

s u b t y p e - o f : condition; p r o p e r t i e s :

word_level:

llst (part_of_speech, declension_case ... )

sentence_level: list (use_of_subordinates . . . )

c o n c e p t non_linguistic_condition;

s u b t y p e - o f : condition; s u b t y p e s :

textual_condition, thematic_condition;

c o n c e p t textual_condition

s u b t y p e - o f : non_linguistic_condition; p r o p e r t i e s :

text_conditions:

list (summary, formalJetter, translation . . . );

structure_conditions:

list(long_sentence_based, long_word_based, apl_place_lem, apl_place_mor . . . );

c o n c e p t thematic_condition

s u b t y p e - o f : non_linguistic_condition; p r o p e r t i e s :

type: (general, technical);

Linguistic conditions are studied by means of a corpus. In the corpus composed of texts written by high level and upper intermediate language students (350 texts collected between 1991 and 1995) we found linguistic influence in some language structures of the interlanguage. For instance, the presence of plurality of some components in the sentence can cause verbs to agree with such components, whether or not these must be in agreement with the verb (the phenomenon of plurality has also been observed in second language learners of French (Lessard et al., 1994)). Finally, we would like to claim that this way of putting linguistic structures in context could be applied similarly when modelling first languages.

After defining which kind of conditions must be taken into account when linguistiq phenomena are identified, we will see which type of linguistic rules can be found inside the phenomena.

It is usual t h a t language learners know only some linguistic rules corresponding to a particular linguistic phenomenon, and not all of them. In the case of language natives, however, they know, in most cases, all linguistic rules. T h a t is the reason for representing explicitly in the student's interlanguage the set of linguistic rules related to the linguistic phenomena of the language structure. In other cases, as in natives, it should not be necessary to make their linguistic rules explicit as the linguistic phenomena define, implicitly, the set of linguistic rules.

In the same way, we could say t h a t linguistic rules identify the corresponding linguistic phenomenon,

(5)

however, it is necessary to make it explicit, as a linguistic phenomenon could be detected in a student's interlanguage, but the linguistic rules would not be identified until after some interactions with the student. The student model is dynamic, so, the language structures are also dynamic. Therefore, in the first interactions with the student a linguistic phenomenon can be detected inside the student's interlanguage before eliciting the corresponding linguistic rules. Next we shall define the structure of the linguistic rules.

s t r u c t u r e linguistic_rule;

s u b t y p e s : morphological_rule, syntactic_rule; p a r t s :

implemented_by: set (rule. identifier);

cardinality:

r a i n 1;

conditions: s e t ( i n s t a n c e (interlanguage_condition)); p r o p e r t i e s :

type: {morph, syn};

description: s t r i n g ;

example: s t r i n g ;

stabilization:

{rarely, sometimes, usually, always};

In the experiments the teachers have identified three types of language phenomena in the students' interlanguage: first, simple interlanguage structures composed of a set of linguistic phenomena and linguistic rules; second, avoiding interlanguage structures defined as linguistic rules t h a t the student usually avoids; and, third, replacement language structures where the learner uses a structure too often (or rarely) instead of using other structures (e.g. when a person uses the conjunction and all time and rarely uses structures such as however, nevertheless, thus ... ). Consequently, we can say t h a t there are relationships between linguistic rules of the interlanguage. T h a t is why structures which represent linguistic phenomena also have a set of replacement tuples, which represent the replacement relations between linguistic rules.

3 Using Corpus Linguistics in order

to Model

Interlanguage

In this section we will explain first the experimental method used in order to collect written material, and second the top-down methodology used in order to model interlanguage in support of the selected corpora. We model Interlanguage Level Models by means of automatic tools which use the collected material as input. It must be noted t h a t some in-

formation of the interlanguage models, for example text conditions (see section 2), are detected semiautomatically with the help of the teachers and learners.

Before explaining modelling based on corpus analysis, we would like to make some comments about the criteria for defining the corpus: we collected written material from different language schools (IRALE 1, ILAZKI) and grouped this material depending on some features of the texts such as, 1) the kind of exercise proposed by the teacher (e.g. abstract, article about a subject, letter . . . ) and 2) the student who wrote the text. Those are students with a regular attendance in classes and with different characteristics and motivations for learning Basque (e.g. different learning rates, different knowledge about other languages, mother tongue . .. ).

We codified the texts of the corpora following a prefixed notation (e.g. ill0as) showing the language school (e.g. il, ILAZKI), the language level, the learner's code, and the type of exercise proposed (e.g. s, summary). The last feature is what we have called text condition in section 2. At the same time, a database for gathering the relevant information about the students' learning process was developed. We retrieved such information from interviews with the students and the teachers (Andueza et al., 1996). The corpus collected from 1990 to 1995 is made up of 350 texts. This corpus has been di- vided in subsets depending on the language level. At the moment we have defined three language levels of study t h a t we call low, intermediate, and high levels. Before designing and implementing the automatic tools, three different studies of corpora during 90/91 (i.e. 50 texts semiautomatically analysed), 93/94 (i.e. 20 texts semiautomatically analysed), and 94/95 (i.e. 100 texts semiautomatically analysed) were carried out. These studies were done, in the first case, by teachers who didn't know the students, and in the other two cases by teachers who knew the students. In the first two cases the work lasted two months. In the third case, however, texts were collected every week from September until June, and two teachers worked five hours per week on studying the corpora during the 94/95 academic year. The language learners had five hours of language classes per week, and they wrote one composition every week or every fortnight.

For modelling interlanguage at different language levels we use a top-down methodology, t h a t is, we start from the modelling of high levels and con- tinue to lower ones (see Fig 1). The reasons for

11RALE and ILAZKI: schools specialised in the teaching of Basque

(6)

a top-down methodology are t h a t most computational tools for Basque we have (lemmatiser, spelling checker-corrector, morphological d i s a m b i g u a t o r . . . ) can be easily adapted for high language levels; be- sides, usually computational tools for analysing written texts of high language levels are more robust than those of low levels and, finally, there usually is more written material at high levels than at low ones.

We have a u t o m a t i c a l l y analysed subsets of corpora in intermediate and high levels. Choosing a text as a unit of study, groups of sixteen texts have been deeply and automatically studied.

T h e steps we followed using the tools we have adapted in order to build the interlanguage model for each N language level were:

i. Design of the lexical database for the Nth language level.

2. Selection of the corpus (CORPUS-N) and subsets of CORPUS-N to be used in the next steps. This selection is based on the criteria, explained before, for collecting material.

3. Definition of the morphology and morphosyntax based on a subset of CORPUS-N.

4. Identification of the fixed knowledge and the variable knowledge, considering the contexts defined in section 2.

(a) Evaluation of the reliability of the model using other subsets of CORPUS-N. (b) Evaluation of the results by a language

teacher of N level.

For example, in studies of high language modelling, a teacher evaluated the results at word level, t h a t is, the type of rules detected and the contexts where they were applied. T h e evaluation was suc- cessful, even though in some cases the perception of the teacher was not the same as the results inferred from the automatic study of the corpora (e.g. in the opinion of the teacher the students are used to delet- ing the h letter more usually than adding it. This phenomenon has not been detected in the results of the corpus).

4

Implementation

We have adapted tools previously developed in our computational linguistic group during the last ten years. These tools are: the Lexical Database for Basque (EDBL) (Agirre et al., 1994), the morphological analyser based on the Two-Level Morphology (Agirre et al., 1992), the lemmatiser (Aldezabal et

al., 1994) and some parts of the Constraint G r a m - mar for Basque (Alegria et al., 1996) (Karlsson et hi., 1995).

We have two main reasons for adapting these tools:

. Some of the deviant linguistic structures used by second language learners are different to those native Basque speakers use. T h e context of application, i.e. structure conditions at word level of some rules, are not the same in both cases. Moreover, we need to add some new rules to have one rule for each linguistic structure, and we also need some of these rules in order to detect deviant linguistic phenomena, e.g. loan words from Spanish (see section 4.1).

. In the original tools, the context of application of the rules remained ambiguous. As we explained in section 2, the context of application is i m p o r t a n t to us for modelling the g r a m m a t - ical competence of the students, so we disambiguate such contexts by means of our adapted tools (see section 4.2). In the figure below we can see a scheme of the way in which we have used these adapted tools (Diaz et al., 1997):

A N A L Y S I S A N D

C O N T E X T D E T E C T I O N Learner saalyselt + P , ~ o c e s s

D I S A M B I G U A T I O N

La~ttet lemrmliser ÷

Language Lavel Based Disambiguato~ ÷

Context Based Disambiguator INTERLANGUAGE

[image:6.596.302.518.426.692.2]

1HODI~S

Fig 3. Modelling Process.

L E A R N E R ANALYSER = T h e adapted morphological analyser (detection of structural contexts).

P O S T P R O C E S S = Context detection tool (detection of structural contexts and linguistic contexts).

(7)

L E A R N E R L E M M A T I S E R = The adapted lemmatiser.

L A N G U A G E L E V E L BASED DISAMBIGUA- T O R = Disambiguator for each language level (Based on the number of rules in each interpretation).

C O N T E X T BASED D I S A M B I G U A T O R = Dis- ambiguator for each language level.

(Based on subsets of the Constraint G r a m m a r for Basque + disambiguation rules based on the context of application).

I N T E R L A N G U A G E M O D E L S = I n t e r l a n - g u a g e L e v e l M o d e l s ( I L M s )

4.1 R e d e s i g n i n g t h e a u t o m a t a

As we said above, the original morphological analyser was based on the Two-Level Morphology. The implementation of the morphophonological rules were made by the automata. In the original analyser we had 30 automata: 11 of them were used for analysing standard words, but their activation was never detected; the other 19 which represented a deviation were identified but the type of deviation remained unknown. Moreover, the context of application remained ambiguous.

In the adapted analyser we find 59 automata, which represent 59 different types of phenomena (these are codified as we can see in the table below). We have modified the a u t o m a t a of the morphological analyser in order to detect which rules have been applied and the contexts (structure conditions) where they have been activated. We have also made some changes in the module of the analyser which recog- nised the a u t o m a t a . The number of a u t o m a t a has increased due to the addition of new rules for de- tecting new deviations of language learners and to the division of some original a u t o m a t a in others that detect, in a more specific way, some morphologicM phenomena which are very interesting for the study of second language acquisition. An example will il- lustrate this fact:

Rules in the original analyser

h:0 = > R : = ( + : = ) + _

(Rule for standard phenomenon)

h : 0 = > I V : V - - ( 0 : * ) : ] _ V : V (Rule for deviant phenomenon)

T h e application of the rule for standard phenomenon is not detected, however, the competence error represented in the rule for deviant phenomenon is identified even though the context of application remains ambiguous]

Rules in the adapted analyser

h:0 = > R:= ( + : = ) + _

(LEDEH: Delete the H LEtter at the End of the root)

h:0 = > V:V _ V:V

i(LEDAH: Delete the H L E t t e r Anywhere in the word)

h:0 = > ( 0 : * ) : _ V:V

(LEDBH: Delete the H L E t t e r at the Beginning of the root)

All three rules of the interlanguage and the context of application are detected when they have been activated. We repeat the same a u t o m a t a three times and mark as negative in each a u t o m a t o n the states which correspond to the activation of the rule.]

4.2 R u l e s i n t h e i r Linguistic C o n t e x t

Experiments and interviews with experts lead us to see the need of identifying the linguistic context where a morphological rule (for s t a n d a r d or deviant phenomenon) is applied. We identify this context by adding to the adapted morphological analyser

(Learner_Analyser) some characteristics such as the place ( l e m m a / m o r p h e m e ) where the rule is applied, the length of the word and the type of the last letter (vowel/consonant) of the root (Postprocess).

We have two main aims in mind:

1. Disambiguate unlikely interpretations of a word (Context..Based_Disambiguator).

There are two ways to do this:

* Discarding interpretations in which a morphological rule has been applied in a part of the word ( l e m m a / m o r p h e m e ) where it never appear in real life examples. For example, the deviant word *analisis has two interpretations

(analisi/analisiz): the rule to add an s at the end of the l e m m a / the replacement of z by s in the morpheme. T h e second interpretation is not possible for high language level students, so we discard it at such level.

* Discarding interpretations in which a morphological rule is applied within an unusual part of speech. T h e rule t h a t detects the replacement of t by d is a good example of this. T h e rule is never used in verbs starting with d. After discarding all interpretations of the words, where the replacement rule has been applied and the part of speech is a verb, the n u m b e r of interpretations in the analysis of the word is reduced to a half.

(8)

2. Refine the model of the student's Interlanguage (Language_Level_Based_Disambiguator).

A word changes into another one quite different as a result of the application of an excessive number of rules who represent deviant phenomena. From the study of the corpora, we can de- termine the exact number of possible deviation rules for an interpretation that makes sense at each language level. At the moment, we have determined it for some levels (i.e. the highest) and we are working on the others.

5 The Output of the Modelling Tool

In this section we will show an example of an interlanguage structure in order to see the relationship between the output of the modelling tool and the description of interlanguage structures explained in the conceptual model.

In this example we will see a detected linguistic phenomenon in the corpus of high level learners. The description of the phenomenon is: "when learners want to construct relative clauses and the last letter

of the verb is t, for example dut (auxiliary verb for

composed verbs), when adding the suffix -.._n for con- structing relative clauses, the t is replaced by d and

the a letter is added. That is, dut + -n = dudan".

e.g. Ikusi dudan haurra atsegina da. 'The boy

who I have seen is nice.'

Ikusi duda_n haurra atsegina da.

I have seen who boy the nice is.

This example shows that Basque syntactic information is found inside the word. That is why in the modelling of the LC_I linguistic condition (see the example) the REL feature (relative clause) is at word level, and not at sentence level.

An example of the output of the modelling tool:

(interlanguage_structure IE_i (phenomena (LF_i)) (conditions (LC_i TC_3)) (description ?)

(contextualization True) (deviation False) (stabilization usually))

(linguistic_phenomenon LF_I (rules (LEAIA_i LERATDI_i)) (type (morph))

(description ?) (global False) (lexical_entry <>))

(linguistic_rule LEAIA_i (implemented_by (9))

(conditions (LC_I TC_I TC_3)) (type morph)

(description

"Add the A LEtter Inside the word" (example " dudan ")

(stabilization usually))

(linguistic_rule LERATDi_I (implemented.by (8))

(conditions ( LC_I TC_2 TC_3)) (type morph)

(description

"Replace T by D Anywhere in the word") (example "dudan")

(stabilization usually))

(textual_condition TC_i (text_conditions < > )

(structure_conditions (apl_place_mor)) (description

"rule applied in the morpheme"))

(textual_condition TC_3 (text_conditions < > )

(structure_conditions (end_word_t))

(description "the last letter of the word is t"))

(linguistic_condition LC_i (wordAevel (V REL)) (sentence_level < > )

(description "verb for relative clause"))

(textual_condition TC_2 (text_conditions <:>)

(structure_conditions (apl_placeAem)) (description "rule applied in the lemma"))

(9)

In the process of modelling, first, we identify the linguistic rules, second, we detect groups of linguistic rules which occur in the same context and define the linguistic phenomenon, and last, the interlanguage structure is identified.

If we compare this output with the conceptual model given in section 2, we can see how the information needed is reached automatically, except the description of the linguistic phenomenon and the interlanguage structure (see question marks in the example). Such information will be completed by the psycholinguist who will use the ICALL system. The information given by the psycholinguist will be reused in future modellings.

6 C o n c l u s i o n s a n d F u t u r e W o r k

The presented work proves that the implementation of psycholinguistic models of interlanguage is viable from the adaptation of the linguistic-computational tools we have for the automatic study of Basque. Modelling at word level is important in our case because Basque is an agglutinative language with rich morphosyntactic information within words.

The results obtained using the developed tools for language learning studies provide us statistical information about phenomena that teachers and psycholinguistics knew by intuition. Therefore, we can say that corpus analysis is a good technique for modelling ILMs.

In the near future we will develop new tools in order to model each student's knowledge. The detection of contexts will be improved in order to identify contexts related to the characteristics of the particular learners. Moreover we will work on the detection of interlanguage structures at sentence level . Field studies have been carried out in this sense (Mar- itxalar et al., 1993) (Andueza et al., 1996). At the moment, we are studying some aspects of the morphosyntax and syntax for L2, taking as a basis the results obtained in Andueza (Andueza et al., 1996), where in a final test the hypothesis obtained in the study carried out in the 94/95 academic year were contrasted with the students. In this work, we also analysed the reasons for using some language structures in addition to the detection of their context. We plan in the future to add to the system such knowledge about the diagnosis of structures.

Finally, we would like to remark that together with the experiments explained in the article, two environments are being prepared in the ICALL system: the Knowledge Acquisition Subsystem and the Learning Process Subsystem. The Knowledge Ac- quisition Subsystem helps language teachers to make hypotheses about, among others, the reasons stu-

dents have for using deviant language structures. The Learning Process Subsystem guides users in their learning process giving hints according to their language level.

7 A c k n o w l e d g e m e n t s

This work is being developed in the framework of a research project sponsored by the University of the Basque Country and the Government of the Province of Gipuzkoa. The authors would like to thank both Institutions.

R e f e r e n c e s

Eneko Agirre, Ifiaki Alegria, Xabier Arregi, Xa- bier Artola, Arantza Diaz de Ilarraza, Montse Maritxalar, Kepa Sarasola and Miriam Urkia. 1992. XUXEN: A spelling checker/corrector for Basque based on Two-Level Morphol- ogy. In

Proceedings off the Third Conference

ANLP(ACL),

pages 119-125, Trento, Italy.

Eneko Agirre, Xabier Arregi, Jose Mari Arriola, Xa- bier Artola and Jon Mikel Insausti. 1994. Eu- skararen Datu-Base Lexikala (EDBL)

Internal

Report. UPV/EHU/LSI/TR 8-94,

Computer Science Faculty: University of the Basque Country.

Izaskun Aldezabal, Ifiaki Alegria, Xabier Artola, Arantza Diaz de Ilarraza, Nerea Ezeiza, Koldo Gojenola, Itziar Aduriz and Miriam Urkia. 1994. EUSLEM: Un lematizador/etiquetador de textos en euskara. In

Proceedings of the X.

Conference SEPLN,

C6rdoba, Spain.

Ifiaki Alegria, Jose Mari Arriola, Xabier Artola, Arantza Diaz de Ilarraza, Koldo Gojenola, Montse Maritxalar and Itziar Aduriz. 1996. A Corpus-Based Morphological Disambiguation Tool for Basque. In

Proceedings of the XII.

Conference SEPLN,

Sevilla, Spain.

A. Andueza, Arantza Diaz de Ilarraza, Montse Mar- itxalar, Josune Martiarena and Ifiaki Pikabea. 1996. Hizkuntza baten Ikaskuntza Prozesuari buruzko landa lana. Sistema Informatiko adi- mendun baten oinarria. In

Internal Report.

UPV/EHU/LSI/TR 8-96,

S. Bull. 1994. Student Modelling for Second Lan- guage Acquisition. In

Computers 8J Education

23,

pages 13-20.

(10)

!

S. Corder. 1971. Idiosyncratic dialects and error analysis. In

International Review of Applied

Linguistics 9,

pages 147-159.

Arantza Diaz de Ilarraza, Montse Maritxalar and Maite Oronoz. 1997. Reusability of Language Technology in support of Corpus Studies in an CALL Environment. In

Language Teaching

and Language Technology Conference,

Gronin- gen, The Netherlands.

F. Karlsson, A. Voutilainen, J. Heikkil/i and A. Anttila. 1 9 9 5 . Constraint Grammar: a language-independent system for parsing un- restricted text. Mouton de Gruyter.

G. Lessard, D. Maher and I. Tomek. 1994. Mod- elling Second Language Learner Creativity. I n

Journal of Artificial Intelligence in Education

5(4).,

pages 455-480.

Montse Maritxalar and Arantza Dfaz de Ilarraza. 1993. Integration of Natural Language Tech- niques in the ICALL Systems Field: The treatment of incorrect knowledge. In

Internal Re-

port. UP V/EHU/LSI/TR 9-93,

Computer Sci- ence Faculty: University of the Basque Coun- try.

Montse Maritxalar and Arantza Dfaz de Ilarraza. 1994. An ICALL System for Studying the Learning Process. In

Computers in Applied

Linguistics Conference.

Iowa State University.

Montse Maritxalar and Arantza Dfaz de Ilar- raza. 1996a. Hizkuntza baten Ikaskuntza- Prozesuan zeharreko Tartehizkuntz Osaketa: Sistema Informatiko baten Diseinurako Azter- keta Psikolinguistikoa. In

Internal Report.

UPV/EHU/LSI/TR 7-96,

Montse Maritxalar and Arantza Dfaz de Ilarraza. 1996b. Modelizaci6n de la Competencia Gra- matical en la Interlengua basada en el An£1isis de Corpus. In

Proceedings of the XII. Confer-

ence SEPLN,

Sevilla, Spain.

W. Nemser. 1971. Approximate Systems of Foreign Language Learners. In

International Review of

Applied Linguistics 9,

pages 115-123.

G. Schreiber, B. Wielinga and J. Breuker. 1993. KADS: A Principled Approach to Knowledge- Based System Development. Academic Press.

L. Selinker. 1972. Interlanguage In

International

Review of Applied Linguistics 10.,

pages 209- 231.

L. Selinker. 1992. Rediscovering interlanguage. Lon- don:Longman.