Prospects for Computer Assisted Dialect Adaptation

(1)

Prospects for Computer-Assisted Dialect Adaptation

David J. W e b e r

I n s t i t u t o L i n g u i s t i c o d e V e r a n o C a s i l l a 2 4 9 2

L i m a , 1 0 0 P E R U

William C. M a n n

I n f o r m a t i o n S c i e n c e s I n s t i t u t e U n i v e r s i t y o f S o u t h e r n C a l i f o r n i a

M a r i n a del R e y , C a l i f o r n i a 9 0 2 9 1

This paper describes a project which has explored the feasibility of using a computer to perform a significant portion of the changes required to adapt text from one dialect to several others. This ongoing experiment has examined adaptation between various dialects of Quechua, finding that a computer program may be an important tool for adaptation. An experimental computer program was written and applied to text, and its output was field tested in five target dialects. Preliminary results indicate that preprocessing text with a computer may I) enable informants who are not bi-dialectical (in the source and target dialects) to produce adequate adaptations without much coaching from the linguist/translator; 2) improve the quality of the resulting text; and 3) reduce time and e f f o r t m b o t h in adaptation and in manuscript preparation.

1, Introduction

This paper describes experiments in c o m p u t e r - assisted adaptation of text from one dialect to several others. The principal purposes of the initial explora- tions reported here were:

1. To discover whether a c o m p u t e r program could convert text in a dialect unintelligible to an i n f o r m a n t into easily read (possibly errorful) text in the i n f o r m a n t ' s own dialect.

2. To discover what kinds of dialect difference i n f o r m a t i o n are needed to support an effective dialect-adapting c o m p u t e r program.

3. To discover classes of dialect changes not a c c o u n t e d for in a particular first-draft c o m p u t e r program, t h e r e b y to provide data for a detailed examination of whether each class of changes is suitable for performance by a c o m p u t e r program.

In pursuit of these goals, an experimental c o m p u t e r program was written and applied to text, and its output was field tested.

In this paper the nature of the language situation is discussed first, followed by a description of the computer program. Then, p r o c e d u r e s for c h e c k i n g the c o m p u t e r - a d a p t e d text are described, followed by a discussion of the results of this checking. Finally, conclusions are stated.

2. The Nature of the Language Situation

The practical difficulty of dialect adaptation is pri- marily determined by the language situation.

2.1 The General Nature of the Language(s)

This experiment was carried out in the subgroup of Q u e c h u a called "central" Q u e c h u a by P. L a n d e r m a n [1]. These languages/dialects have the following characteristics:

1. More of the structure of the language is in the m o r p h o l o g y than in the syntax.

2. Much of the discourse structure involves the manipulation of the so-called "topic" marker and the "evidential" suffixes (the reportative, the assertative . . . . ).

Copyright 1981 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the Journal reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee and/or specific permission.

(2)

David J. Weber and William C. Mann Prospects for Computer-Assisted Dialect Adaptation

.

4.

T h e s e of the

The c a t e g o r y of each morphological unit strongly constrains what suffix m a y i m m e - diately follow that unit. F o r example, the intransitive v e r b r o o t aywa- m a y n o t be followed by the case m a r k e r -man, which only follows nouns.

A significant a m o u n t of m o r p h o p h o n e m i c s is involved in getting f r o m sequences of m o r p h e m e s to the c o r r e c t s e q u e n c e of p h o n e m e s . T h e two such processes which are m o s t widely applied are first, morpho- phonemic lowering, w h e r e b y a high vowel

(i.e., e i t h e r / i / or / u / ) of certain suffixes b e c o m e s / a / w h e n one of a certain small class of suffixes follows a n d s e c o n d , foreshortening, w h e r e b y c e r t a i n suffixes, the final vowel of which is otherwise long, a p p e a r with short, final vowels w h e n followed directly b y certain other suffixes.

characteristics are m e n t i o n e d b e c a u s e the design p r o g r a m r e s p o n d s to them.

2.2 T h e N a t u r e of t h e D i a l e c t D i f f e r e n c e s

Six dialects were involved in this experiment. T h e source dialect ( a b b r e v i a t e d SD), i.e., the dialect f r o m which text was a d a p t e d , was H u a l l a g a (Hu~muco) Q u e c h u a ( a b b r e v i a t e d H g Q ) .

T h e target dialects ( a b b r e v i a t e d T D ) are the following:

P a n a o ( H u ~ n u c o ) Q u e c h u a ( P a Q )

D o s - d e - M a y o ( H u ~ n u c o ) Q u e c h u a ( D o Q ) L l a t a ( H u ~ n u c o ) Q u e c h u a (LIQ)

Y a n a h u a n c a (Pasco) Q u e c h u a ( Y a Q ) Junin (Junin) Q u e c h u a ( J u Q )

T h e differences b e t w e e n these dialects are not trivial, as should b e c o m e clear in the discussion below. E a c h of these would require s e p a r a t e reading m a t e r i - als, but it is e x p e c t e d that the material of one dialect of H u h n u c o could be a d a p t e d to the other dialects of that d e p a r t m e n t .

To what extent these dialects are mutually intelligible (pair-wise) is an o p e n question. To our knowledge, no f o r m a l tests have b e e n made. T h e first author has asked several H g Q s p e a k e r s if they can u n d e r s t a n d the P a Q speakers. A n s w e r s indicate that c o m m u n i c a t i o n is difficult. T h a t H g Q s p e a k e r s and J u Q s p e a k e r s c a n n o t c o m m u n i c a t e e f f e c t i v e l y has b e e n o b s e r v e d firsthand. A rough metric of difference is this: a d a p - tation b e t w e e n these dialects has required roughly 800 changes per 1000 words. W h a t e v e r the case, little or no significance is a t t a c h e d to w h e t h e r the dialects involved are mutually intelligible or not; it is s u f f i c i e n t ' t h a t s e p a r a t e m a t e r i a l s n e e d to be p r e p a r e d f o r the dialects, as established b y an overall c o n s i d e r a t i o n of the language.

T h e dialect d i f f e r e n c e s will n o w be discussed in g r e a t e r detail u n d e r three headings: the phonological differences, the lexical differences, and the g r a m m a t i - cal differences.

2.3 P h o n o l o g i c a l D i f f e r e n c e s

T a b l e 1 indicates v a r i o u s regular sound changes ( R S C ' s ) which h a v e a f f e c t e d these dialects.

T h e r e are also s o m e sound c h a n g e s which are not so .regular. F o r e x a m p l e , the c h a n g e b y which a V o w e l - S e m i v o w e l - V o w e l s e q u e n c e b e c o m e s a long v o w e l (e.g. c o m p a r e H g Q chaya- with D o Q cha:- ' a r r i v e ' and H g Q tiya- with D o Q ta:-) is not entirely regular. S o m e c h a n g e s a f f e c t v e r y f e w w o r d s ( f o r example, the change b y which / s / is lost intervocali- cally, e.g., H g Q wasi ' h o u s e ' but D o Q wayi, H g Q usa ' f l e e ' b u t D o Q uwa), a n d t h e s e are h a n d l e d in the p r o g r a m as t h o u g h t h e y w e r e simply lexical d i f f e r - ences.

2.4 L e x i c a l D i f f e r e n c e s

T h e SD and T D m a y use v e r y d i f f e r e n t r o o t s to express the s a m e concept. F o r e x a m p l e , to express ' t o r e c o v e r ( f r o m an illness), to get well', in H g Q o n e uses allchaka:-, in L1Q o n e uses aliya:-, a n d in J u Q o n e uses kachaka:-; ' t o g a t h e r ' is e x p r e s s e d b y shunta- in H g Q and b y qori- in LIQ; the p r e - a d j e c t i v e m e a n i n g ' v e r y ' (used as in ' v e r y big') is e x p r e s s e d b y sumaq in H g Q and b y sellama in L1Q; ' t o lie (tell a f a l s e h o o d ) ' is e x p r e s s e d b y llulla- in H g Q and b y kaski- in Y a Q and J u Q ... et cetera. S o m e t i m e s the SD and T D pos- sess the s a m e root, b u t its m e a n i n g s differ. Aru- m e a n s ' t o w o r k ' in H g Q but ' t o c o o k ' in dialects to the west ( H u a r a s , A n c a s h ) ; yacha- m e a n s ' t o k n o w (how t o ) ' in H g Q , w h e r e a s the c o g n a t e in J u Q (yatra-) m e a n s ' t o reside at'.

2.5 G r a m m a t i c a l D i f f e r e n c e s

T h e g r a m m a t i c a l d i f f e r e n c e s b e t w e e n dialects can be divided into two general types, the m o r p h o l o g i c a l d i f f e r e n c e s (i.e. those which o c c u r within a single word) and the syntactic differences (i.e. those which involve m o r e t h a n one word). O n l y the m o r p h o l o g i c a l differences will be discussed in this section. T h e y are of several sorts:

First, in the central Q u e c h u a dialects, one of the m o s t drastic m o r p h o l o g i c a l d i f f e r e n c e s involves h o w the plurality of the s u b j e c t or o b j e c t of a clause is m a r k e d in the v e r b of that clause. In L1Q there is only o n e verbal plural m a r k e r , -ya:. In H g Q there are five suffixes which indicate plurality within the verb. T h e r e are conditions on the o c c u r r e n c e of these, e.g., -rka o c c u r s only p r e c e d i n g the c o n t i n u a t i v e suffix -yka:. L1Q has only -ya:, H g Q has all except -ya:, and the o t h e r dialects h a v e only -:ri, -pa:ku, and -rka.

(3)

*~>¢

*ff>~

(ch >ts) (tr>ch)

P a Q - -

H g Q - +

D o Q + +

L1Q + +

YaQ + -

J u Q - -

*'~>s * h > l *fi>n * q > h (ts>s) (//>1) ( n n > n ) (q>h)

- -t- -t- -

+ + + -- -- -t- -I- -- -- -t- -I- --

P h o n e m i c Change O r t h o g r a p h i c F o r m

Table 1. Regular sound changes.

Second, there are differences in the properties of m o r p h e m e s . F o r example, in H g Q the suffix -:ri u n d e r g o e s m o r p h o p h o n e m i c lowering, but in J u Q it does not. In some dialects the durative suffix -ra: foreshortens while in others it does not.

Third, a suffix in one dialect m a y simply be absent in another. F o r example, LIQ has a verbal suffix -ski, a suffix which H g Q not only lacks, but for which there does not seem to be any corresponding suffix. H g Q has a suffix -paq which is used with future verbs, while YaQ has no such suffix.

F o u r t h , a single suffix in one dialect m a y corre- spond to two different suffixes in another dialect. F o r example, the relativizer -sha of H g Q c o r r e s p o n d s to both -sha and -nqa in JuQ, which has a temporal con- trast in the formation of relative clauses which is not present in HgQ. A similar case arises in the tense systems of these two dialects: JuQ has a distinction between a recent and a remote past tense which H g Q lacks.

Fifth, a suffix in one dialect m a y be the collapse of two suffixes in another. F o r example, in some dialects, the suffix -mi followed by the postposition ari has formed a single suffix -mari.

Sixth, suffixes m a y o c c u r in different orders in different dialects. F o r example, in H g Q -rqu precedes the object marker -ma: (e.g. maqa-rqu-ma:-nki ' y o u h i t

me (recent p a s t ) ' ; in H u a r a s ( A n c a s h ) Q u e c h u a , it follows the object marker: maqa-ma:-rqu-nki.

Seventh, the number of allomorphs of a suffix m a y differ. F o r example, in H g Q the second person pos- sessive suffix has allomorphs -yki, -ki, and -niki. JuQ has all of these forms as well as -y.

Finally, a suffix m a y have different forms (spellings) in the different dialects. F o r example, the suffix which expresses similarity is -naw in HgQ, -noq in LIQ and D o Q , and -nut in JuQ; the continuative suffix is -yka: in H g Q and -ya: in JuQ; the first person plural conditional form is -shwan in H g Q and -chwan in JuQ.

N o t all of the m o r p h o l o g i c a l differences b e t w e e n dialects have been m e n t i o n e d in this section. The

c o m p u t e r p r o g r a m handles all of the kinds mentioned, as well as others.

3. T h e N a t u r e of t h e C o m p u t e r P r o g r a m

A n early pencil-and-paper experiment in adapting from H g Q to the Q u e c h u a of Jacas Grande showed that, of a set of changes suggested by certain Jacas G r a n d e speakers, approximately 95 percent were phonological, lexical or morphological, and a significant p r o p o r t i o n of the residue were idiosyncratic rather than systematic. This suggested that making the program responsive to features b e y o n d word boundaries would not make the task of final c o r r e c t i o n of an a d a p t e d text significantly easier. Thus the decision was made to have the program treat one word at a time.

The program is designed for Quechua, but not for any particular dialects. In all of the experiments to date, the source dialect has been HgQ.t

The program has three distinct phases of operation: initialization, derivation of target-dialect root spellings, and text-processing, each to be discussed in the following sections.

3.1 I n i t i a l i z a t i o n

In initialization, linguistic i n f o r m a t i o n a b o u t the dialects involved is made available to the c o m p u t e r program. We will discuss the information provided for a source dialect, and then what is provided for a target dialect.

[image:3.612.112.504.74.169.2]

(4)

David J. W e b e r and W i l l i a m C. Mann Prospects for Computer-Assisted Dialect Adaptation

3.1.1 S o u r c e D i a l e c t Initialization

F o r the SD, two m a j o r lists and several small lists are p r o v i d e d to the p r o g r a m . T h e largest list is a dictionary of roots.

F o r each root, the dictionary entry ( D E ) contains the following information:

1. The spelling of the root, i.e. the string of c h a r a c t e r s b y which the r o o t is r e c o g - nized.

2. T h e name of the root:

a. If the r o o t is native to Q u e c h u a , the n a m e is the p r o t o - Q u e c h u a f o r m of that root.

b. If the r o o t is a Spanish loan, the n a m e is the Spanish ( p h o n e m i c ) f o r m of that root.

3. An indication of w h e t h e r the r o o t is the result of R S C ' s applied to the p r o t o - Q u e c h u a f o r m ( n e g a t i v e f o r Spanish loans, names, i d e o p h o n e s , etc.).

4. M o r p h o l o g i c a l c a t e g o r y (e.g. noun, intransitive verb, transitive verb, etc.).

5. An indication of w h e t h e r its final vowel is short or long.

6. An indication of w h e t h e r its final vowel has u n d e r g o n e the m o r p h o p h o n e m i c p r o c - ess of lowering ( f r o m / i / o r / u / to / a / ) .

N o t e that the dictionary will contain multiple D E s for the various allomorphs of the same r o o t (i.e., for the various f o r m s that the root m a y take). F o r e x a m - ple, a D E c o r r e s p o n d i n g to one of the allomorphs of y a y k u - ' t o e n t e r ' has (1) y a y k u as the spelling, (2) *yayku as the p r o t o - Q u e c h u a form; (3) it u n d e r g o e s the R S C ' s ; (4) it is an intransitive verb; (5) its final vowel is short; (6) it has not u n d e r g o n e m o r p h o p h o - nemic lowering. All of this i n f o r m a t i o n is exploited b y the p r o g r a m as described below.

T h e s e c o n d m a j o r list is the list of SD suffixes. E a c h suffix D E contains the following information:

1. T h e spelling of the suffix.

2. An a r b i t r a r y n a m e f o r the suffix, this n a m e to be c o m m o n to all the D E ' s c o r r e s p o n d i n g to various a l l o m o r p h s of t h a t suffix a n d used u n i f o r m l y f o r b o t h SD and T D suffixes.

3. T h e m o r p h o l o g i c a l c a t e g o r y of the unit (root followed by zero or m o r e suffixes) to which this suffix can be applied.

4. The morphological c a t e g o r y which results f r o m c o n c a t e n a t i n g this suffix to an a p p r o p r i a t e unit.

5. An indication of w h e t h e r its final vowel is short or long.

6. An indication of w h e t h e r it disallows an i m m e d i a t e l y preceding long vowel.

7. An indication of w h e t h e r its final vowel has u n d e r g o n e the m o r p h o p h o n e m i c process of lowering.

8. An indication of w h e t h e r it causes m o r - p h o p h o n e m i c lowering.

9. A n u m b e r which gives the " o r d e r class" of the suffix, used to constrain sequences of suffixes to occur in a m o n o t o n i c a l l y n o n - decreasing o r d e r of their order class n u m - bers.

F o r e x a m p l e , the suffix -ka: (1) with a r b i t r a r y n a m e / p a s s i v e / (2) applies to transitive verbs (3) with the result that the c o m b i n a t i o n is an intransitive v e r b (4); the final vowel is long (5) and the suffix does not disallow a preceding long vowel (6); the final v o w e l has not u n d e r g o n e m o r p h o p h o n e m i c lowering (7) and it does not cause such lowering (8); the order class is 600, placing it in the large class of " d e r i v a t i o n a l " suffixes (not o r d e r e d with r e s p e c t to o t h e r m e m b e r s of the class, b u t o r d e r e d with r e s p e c t to o t h e r " o r d e r classes" of suffix).

In addition to these t w o m a j o r lists, there are several small lists:

1. T h e allomorphs which c a n only occur in w o r d - f i n a l position following a short v o w - el; o n e such D E has as its spelling m and a r b i t r a r y n a m e / M I / ; there is a n o t h e r D E in the suffix dictionary which has mi as its spelling a n d the s a m e a r b i t r a r y n a m e . T h e s e t w o D E ' s c o r r e s p o n d to the t w o a l l o m o r p h s of the suffix -mi ' a s s e r t i v e ' .

2. T h e c a t e g o r i e s of m o r p h o l o g i c a l units which are allowable in w o r d - f i n a l position, i.e. the c a t e g o r i e s of c o m p l e t e words.

3. T h e table of o r t h o g r a p h i c changes, used to c o n v e r t the SD expression of vowel length to a single form, and to r e m o v e SD o r t h o - graphic r e f l e c t i o n s of low level p h o n e t i c processes.

Source dialect initialization is i n d e p e n d e n t of t a r g e t dialect i n f o r m a t i o n , and so is invariant with change of target dialect.

3.1.2 Target Dialect Initialization

F o r a given T D , a dictionary of roots, a dictionary of suffixes, and several o t h e r small lists, m u s t be p r o - vided the p r o g r a m .

T h e small lists which m u s t be m a d e available are:

1. T h e non-cognate roots list: c o r r e s p o n d i n g SD and T D roots w h e r e the SD r o o t and

(5)

2.

.

the T D root are not cognate. F o r e x a m - ple, the H g Q r o o t lloqshi- ' t o l e a v e ' corresponds to the Y a Q root yarqu-.

T h e plural suffix list: all verbal suffixes in the SD which indicate t h a t the v e r b so m a r k e d has a plural subject or object.

T h e drop list: suffixes to be deleted in adapting f r o m the SD to the TD. F o r example, H g Q -paq is a b s e n t in Y a Q a n d must be d r o p p e d in adapting f r o m H g Q to Y a Q , since there is n o t h i n g in Y a Q to which -paq corresponds. All of the SD pluralizers must be on this list.

3.2" T a r g e t D i a l e c t R o o t D e r i v a t i o n

T D roots are derived f r o m SD roots so that it is not n e c e s s a r y to enter a dictionary for the target dialect. This eliminates the arduous task of collecting, organiz- ing, and entering such a dictionary, and second allows o n e to begin a d a p t a t i o n on the a s s u m p t i o n t h a t the target dialect roots are related in some systematic w a y to the source dialect roots.

T D roots are derived f r o m the SD roots as follows: T h e D E of each SD root includes the p r o t o - Q u e c h u a f o r m of that root; the R S C ' s are applied to this form. F o r example, s u p p o s e that the SD r o o t is chaki ' d r y ' . In the D E for t h a t r o o t the p r o t o - Q u e c h u a f o r m is given as *~aki. N o w if the T D is D o Q , the change ~>~ f r o m the table of R S C ' s for that dialect is applied to the p r o t o - Q u e c h u a f o r m ; the result ( a f t e r o r t h o - graphic a d j u s t m e n t ) is tsaki. If, h o w e v e r , the T D were L1Q, the changes ~ > ~ followed by ~ > s would be applied to the r o o t with the resulting L1Q w o r d saki. (These changes could be m a d e as one change, ~ > s , but it is no less c o n v e n i e n t to handle it as two steps, which is clearly h o w the c h a n g e c a m e a b o u t historically). The process uses a simple substring substitution algorithm.

A f t e r the R S C ' s have applied, a T D D E is m a d e for that root b y substituting the derived spelling for the SD spelling.

T h r e e types of roots must not u n d e r g o the R S C ' s : (1) roots for which the T D root is simply not c o g n a t e with the SD root, (2) native Q u e c h u a words which do n o t u n d e r g o the sound changes, such as n a m e s (personal, place . . . . ) and i d e o p h o n e s , and (3) words b o r r o w e d f r o m Spanish.

3.3 T e x t P r o c e s s i n g

A f t e r initialization, a T D t e x t m a y be p r o d u c e d f r o m a SD text. T h e text is t r e a t e d w o r d by word. E a c h word passes t h r o u g h 1) an analysis process, and 2) a synthesis process, which are independent.

3.3.1 W o r d A n a l y s i s

W o r d Analysis ( W A ) t r a n s f o r m s a w o r d to a set of distinct readings, e a c h reading consisting of a sequence of m o r p h e m e n a m e s a n d an i n d i c a t o r m a r k i n g the p r e s e n c e or a b s e n c e of pluralization.

T h e first step in W A is rewriting v o w e l length (which in the practical o r t h o g r a p h y is w r i t t e n with double vowels, e.g. " a a " ) , as a vowel followed b y a colon. This is n e c e s s a r y b e c a u s e the vowel and its length m a y not be parts of the same m o r p h e m e . F o r example, aywaa 'I go' is c o m p o s e d of the m o r p h e m e s aywa- ' t o go' and -: 'first p e r s o n ' . T h e change m a k e s it possible to r e c o g n i z e all m o r p h e m e s a c c o r d i n g to their spellings.

W A seeks to d e c o m p o s e a w o r d into a sequence of m o r p h e m e s . It works f r o m left to right, first seeking a r o o t w h o s e spelling m a t c h e s the b e g i n n i n g of the word, and if successful, t h e n seeking a suffix whose spelling m a t c h e s s o m e i m m e d i a t e l y following sequence of characters, repeating this process until all the characters of the w o r d h a v e b e e n m a t c h e d b y the spellings of D E ' s . W A is organized in such a w a y that it c o n - siders all possible sequences of D E ' s (root followed b y a n y n u m b e r of suffixes) w h o s e s p e l l i n g s - - t a k e n t o - gether in s e q u e n c e - - m a t c h the characters of the word. T h e a l g o r i t h m is a c o m p u t a t i o n a l l y - s t r a i g h t f o r w a r d recursive exhaustive search.

H o w e v e r , this c h a r a c t e r - m a t c h i n g process is not a sufficient basis for arriving at c o r r e c t analyses; m a n y s e q u e n c e s of m o r p h e m e s w h o s e spellings ( t a k e n t o - gether) m a t c h the c h a r a c t e r s of the w o r d will be u n a c - ceptable, b e c a u s e the language i m p o s e s constraints on the allowable decompositions. To eliminate spurious analyses, the p r o g r a m ' s o n g o i n g analysis is c o r r e - s p o n d i n g l y c o n s t r a i n e d . F o r e x a m p l e , if c h a r a c t e r m a t c h i n g alone is used to a n a l y z e aywaykan, there are 56 d e c o m p o s i t i o n s into m o r p h e m e s . If, h o w e v e r , the c o n s t r a i n t s are i m p o s e d , t h e n only o n e of these de- c o m p o s i t i o n s is passed as legitimate.

(6)

David J. W e b e r and W i l l i a m C. M a n n Prospects for C o m p u t e r - A s s i s t e d Dialect A d a p t a t i o n

analysis. Such tests help reject b a d d e c o m p o s i t i o n s b e f o r e a great deal of c o m p u t a t i o n is spent on them.

T h e second t y p e of constraint is i n v o k e d w h e n the word is c o m p l e t e l y d e c o m p o s e d into m o r p h e m e s . F o r example, the word aywaykaran 'he was going' can be d e c o m p o s e d as aywa-yka-ra-n where -yka is t a k e n to be one of the allomorphs of the suffix / - Y K U / . This d e c o m p o s i t i o n passes all of the constraints which apply to a d j a c e n t m o r p h e m e s , but not the overall constraint that certain allomorphs (-yka as an a l l o m o r p h of / - Y K U / a m o n g t h e m ) are a p p r o p r i a t e only if o n e of a certain small class of suffixes (-chi, -mu, etc). follows s o m e w h e r e in that word. T h e correct analysis involves the same m o r p h e m e b o u n d a r i e s but with -yka as an a l l o m o r p h of the suffix / - Y K A : / ' c o n t i n u a t i v e ' : aywa-yka-ra-n ( g o - c o n t i n u a t i v e - p a s t - t h i r d : p e r s o n ) .

All m o r p h e m e d e c o m p o s i t i o n s of a w o r d which pass the tests are collected. A n y m o r p h e m e sequence containing a plural m o r p h e m e is m a r k e d as plural, and the plural m o r p h e m e is deleted. T h e resulting set of m a r k e d , d e p l u r a l i z e d readings is the final result of WA.

W h e n W A does not yield any reading, the SD w o r d passes u n c h a n g e d t h r o u g h the r e m a i n d e r of the p r o - gram and a p p e a r s in b r a c k e t s in the final text. If the w o r d fails to be analysed b e c a u s e its initial c h a r a c t e r s do not m a t c h a n y root in the dictionary, that w o r d is e n t e r e d on a list which is periodically reviewed to u p - grade the dictionary of roots.

3.3.2 W o r d Synthesis

W o r d Synthesis (WS) derives a T D w o r d for e a c h reading p r o d u c e d b y WA. W h a t passes f r o m W A to WS for each reading is a sequence of m o r p h e m e n a m e s and the plurality tag of e a c h sequence. F r o m this sequence are d r o p p e d any m o r p h e m e n a m e s which are listed in the d r o p list, a n d substitutions of n o n - c o g n a t e T D roots are made. F o r the remaining m o r p h e m e s , each n a m e is replaced by its entire T D D E (so that its properties are characteristic of the T D r a t h e r than the SD). The next step is repluralization.

Repluralization inserts a plural m o r p h e m e into the m o r p h e m e s e q u e n c e a c c o r d i n g to the pluralization scheme of the TD. R e p l u r a l i z a t i o n is dialect specific, and requires different m e t h o d s for different dialects. T h e m e t h o d appeals to the order of various classes of suffixes to find the a p p r o p r i a t e niche for the pluralizer.

The next step is to select for each m o r p h e m e the a l l o m o r p h which is c o m p a t i b l e with the o t h e r m o r - p h e m e s in the word. T h e allomorphs of a m o r p h e m e are r e p r e s e n t e d by D E ' s which share the n a m e of the m o r p h e m e . This set of D E ' s is r e d u c e d by successive application of constraints until a single D E remains. W h e n the choices h a v e b e e n m a d e f o r all the m o r -

p h e m e s of the sequence, the spellings of the chosen D E ' s are c o n c a t e n a t e d to f o r m the T D word.

T h e n all of the s e p a r a t e l y d e v e l o p e d renderings of a w o r d are collected, and a n y duplicates are eliminat- ed. Ordinarily, this results in a single word. If there are multiple r e n d e r i n g s , t h e y are s e p a r a t e d b y s l a s h e s ( / / ) . F o r e x a m p l e , H g Q kasha has t w o possible analyses: 1) the r o o t kasha ( ' t h o r n ' ) and 2) the r o o t ka ( ' b e ' ) followed b y the suffix -sha, (past p a r - ticiple). In going to LIQ, the first of these analyses will yield kasha; the s e c o n d will yield kashqa. Thus the resulting L1Q text will h a v e kasha//kashqa c o r r e - sponding to H g Q kasha.

T h e final step is to c o n v e r t the w o r d ( s ) to the T D o r t h o g r a p h y , r e p r e s e n t i n g length b y d o u b l e d v o w e l s and m a k i n g a n y o t h e r n e c e s s a r y changes.

3.4 T h e P r o g r a m ' s E f f e c t s o n S i n g l e W o r d s

This section roughly c h a r a c t e r i z e s h o w the p r o g r a m t r e a t s c e r t a i n actual words, illustrating h o w c e r t a i n dialect d i f f e r e n c e s are handled. T h e f o r m s given b e - low are only suggestive of h o w the p r o g r a m o p e r a t e s ; t h e y are not the actual f o r m of the w o r d in the c o m - puter, and possible alternative readings are not shown. ( C a p i t a l letters are u s e d f o r the a r b i t r a r y n a m e s of suffixes; the asterisk p r e c e d i n g roots indicates that the f o r m is a p r o t o - Q u e c h u a form). T h e pair of e x a m p l e s in T a b l e 2 illustrates the r a t h e r d r a m a t i c d i f f e r e n c e which arises in words containing pluralizers in a d a p t - ing f r o m H g Q to LIQ.

In (1), -rka is a pluralizer, w h e r e a s in ( 1 ' ) it is the a l l o m o r p h of - R K U which has u n d e r g o n e m o r p h o p h o - nemic lowering, in this case b e c a u s e it is followed b y -:RI. In the first e x a m p l e -rka is eventually d r o p p e d b e c a u s e it is a pluralizer; in the second it eventually b e c o m e s -rku b e c a u s e o n c e - : R I is d r o p p e d there is no longer any r e a s o n for the vowel to be lowered. N o t e t h a t in the s e c o n d e x a m p l e , the pluralizer which is inserted ( - Y A : ) occupies the s a m e position as the SD pluralizer ( - : R I ) , but in the first e x a m p l e the inserted pluralizer o c c u p i e s a d i f f e r e n t p o s i t i o n t h a n the SD pluralizer (in this case, -rka). F u r t h e r n o t e that the a l l o m o r p h selected f o r the pluralizer -YA: has length in the second e x a m p l e but not in the first; this is b e - cause in the s e c o n d e x a m p l e the following suffix ( - N A ) does not f o r e s h o r t e n , w h e r e a s in the first e x a m - ple the following suffix (-3) does. A similar case involves - Y K A : in the first example: in the SD the correct a l l o m o r p h is -yka b e c a u s e it i m m e d i a t e l y p r e c e d e s -n, which f o r e s h o r t e n s , w h e r e a s in the T D the c o r r e c t a l l o m o r p h is -yka: b e c a u s e it i m m e d i a t e l y p r e c e d e s - Y A : , which does not f o r e s h o r t e n .

F o u r tables of e x a m p l e s will n o w be given (with f e w e r stages) which illustrate cases in which a T D r o o t r e p l a c e s an SD r o o t which has d i f f e r e n t p r o p e r t i e s

(7)

David J. W e b e r and William C. Mann Prospects for Computer-Assisted Dialect Adaptation

Given:

SD orthographic form:

WA develops, in succession:

length converted: segmentation:

morphophonemic form: plurality handled:

WS develops, in succession:

re-pluralization: allomorph selection: TD orthographic form:

(1) aywarkaykan (1') aywarkaarinanpaq

(2) aywarkaykan (3) aywa-rka-yka-n (4) *aywa-RKA-YKA:-3 (5) (*aywa-YKA:-3) + P L

(2') aywarka:rinanpaq (3') aywa-rka-:ri-na-n-paq

(4') * aywa-RKU-:RI-NA-3 P-PAQ (5') (*aywa-RKU-NA-3P-PAQ) + P L

(6) *aywa-YKA:-YA:-3 (7) aywa-yka:-ya-n (8) aywaykaayan 'they are going'

(6') *aywa-RKU-YA:-NA-3P-PAQ (7') aywa-rku-ya:-na-n-paq

(8') aywarkuyaananpaq 'in order that they go'

Table 2. The program's effects on two plural words.

SD orthographic form: segmentation:

WA output: re-pluralization: allomorph selection: TD orthographic form:

warantin wara-ntin

(*wara-NTIN)-notPL *wara-NTIN

waray-nintin waraynintin

'day after tomorrow' (< wara(y) 'tomorrow')

Table 3. Consequences of root change.

morphophonemic form: re-pluralization: allomorph selection: TD orthographic form:

chayaykun chaya-yku-n

(*~aya-YKU-3)-notPL *~aya-YKU-3

~a-yku-n traykun

'he arrives (directly)'

chayamun chaya-mu-n

(*~aya-MU-3)-notPL *~aya-MU-3

~a:-mu-n traamun

'he arrives (here)'

Table 4. Selection of root allomorphs for length.

than the TD root. In the example of Table 3 the SD root ends with a vowel, whereas the TD root ends in a consonant. The suffix -ntin has an allomorph -nintin which occurs following consonants and long vowels, so this is the appropriate allomorph for the TD.

In the examples of Table 4 the SD root (chaya-) has but one allomorph, but in the TD (YaQ) the corresponding root has both long and short allomorphs. In the first example of this table the long form does not

appear because the root immediately precedes -yka, which foreshortens. In the second example the length occurs because -mu does not foreshorten.

(8)

David J. Weber and W i l l i a m C. Mann Prospects for Computer-Assisted Dialect Adaptation

Table 5.

lloqshimun lloqshi-mu-n

(*lloqshi-MU-3)-notPL *yarqu-MU-3

yarqa-mu-n yarqamun 'he leaves

(said from within)'

lloqshikun lloqshi-ku-n

(*lloqshi-KU-3)-notPL *yarqu-KU-3

yarqu-ku-n yarqukun 'he leaves

(said from without)'

Selection of root aUomorphs for lowering.

suwamannaw suwa-man-naw

(*suwa-MAN-NAW)-notPL *suwa-MAN-NAW

suwa-man-noq

suwamannoq

suwa-ma-n-naw

(*suwa-MA:-3-NAW)-notPL *suwa-MA:-3-NAW

suwa-ma-n-noq

Table 6. A single word form resulting from an ambiguous analysis.

suffix follows in the word which causes morphophonemic lowering.

In the example of Table 6 the SD word is ambiguous (between 'as though to a thief' and 'as though to steal me'). It is seen that this ambiguity produces only one TD word, a word which is ambiguous in the TD in the same way as it is ambiguous in the SD.

3.5 Program Design Problems: Ambiguity and the Control of Complexity

The problem of ambiguity strongly influenced the design and the allocation of development effort. We now feel that ambiguity control is one of the major considerations in designing an adaptor. The work of coping with ambiguity and complexity is presented below in a series of retrospective (non-chronological) design decisions.

Given our previous decision that the program should work on a word-by-word basis, one could (in principle only) design an adaptor which would work perfectly, without any internal linguistic knowledge except a table of word correspondences. The program would, for each given word, look up the adapted word. The difficulty with this for Quechua is the practical impossibility of providing the word table. Morphologi- cal productivity would yield a table of millions of words, simple in structure but of unmanageable size. Also, the process of building the word table would continue indefinitely, and much of the relevant, systematic knowledge could not be used at all.

This perception leads to the initial design decision:

Decision 1: The basic linguistic unit which the program manipulates should be the morpheme rather than the word.

Since we would like to be able to adapt from many SD's to many TD's with minimal effort of change, this leads to another decision on the particular morphemes to be used:

Decision 2: The program should analyze

text into proto-Quechua morphemes where appropriate, and should synthesize text from proto-Quechua morphemes by recapitulating actual historical processes.

Discovering words in text is trivial, but discovering distinct morphemes is not. The decision to use morphemes as the basic unit makes it necessary for the program to contain an analyzer to reduce individual words into sequences of morphemes. This is the major part of the Word Analyzer above.

One might hope that distinct morphemes would have distinct spellings, and that they could be discov- ered effectively by seeking concatenations of morpheme spellings which would yield the word. Unfortu- nately, as we have seen, spelling is a hopelessly inade- quate basis for morphological analysis. Use of spelling alone leads to truly amazing numbers of analyses, as was demonstrated above with aywaykan. This is so frequent that, for efficiency, the program must never even produce the set of spelling-based readings inter-

(9)

nally. Rather, the analysis must be c o n s t r a i n e d by knowledge other than spelling. This fact leads to another design decision:

Decision 3: The program must incorpo-

rate a strongly constraining m o r p h o l o g y in its analysis.

The m o r p h o l o g y used [2] makes i t possible to ex- ploit the characteristic of Q u e c h u a ( m e n t i o n e d in 2.1) that the c a t e g o r y of a morphological unit strongly constrains what suffix may follow the unit. On the basis of the m o r p h e m e categories alone, the n u m b e r of distinct analyses of a y w a y k a n drops from 56 to 2. The program applies these morphological constraints as the analysis of a word proceeds, so that an invalid analysis

can be rejected as soon as it is proposed if it contains an

invalid pair of adjacent morphemes, rather than delay- ing rejection until the word analysis is complete.

The program also uses several o t h e r kinds of cri- teria for rejecting analyses, including a set of pro- grammed tests which apply to whole words, used as soon as an analysis is complete. T h e y eliminate analyses containing defects such as i n c o r r e c t lowering, i m p r o p e r ordering, and the i m p r o p e r use of certain special m o r p h e m e s such as -paq 'future'.

These methods eliminate nearly all of the potential ambiguity (well over 99 percent of it), but still leave a significant residue. In the experiment, about 25 percent of the words were still ambiguous under all of these constraints applied jointly, and this ambiguity contributed about 1 excess reading for e v e r y 2 words. If this amount of ambiguity were to appear in the final text, it would be a significant impediment. F o r t u n a t e - ly, as we will see, most of it goes away.

In synthesizing words in the target dialect from their m o r p h e m e definitions, the many allomorphs and associated conditions must be treated properly. A T D m o r p h e m e may have multiple allomorphs in surface forms of the TD, only one of which is allowed by conditions in the remainder of the word. T h e r e is a great diversity of conditions. Composing TD words without r e f e r e n c e to these conditions would yield i n c o m p r e - hensible text. This leads to the final design decision:

Decision 4: The program must contain

linguistically sound methods for selecting correct allomorphs in context.

A b o u t one third of the words in our experimental text had possible allomorph variants when adapted to JuQ. Virtually all of this variation is subject to condi- tioning, which can be r e p r e s e n t e d computationally by systematic environmental filtering rules. Application of this elaborate collection of environmental filtering

rules causes essentially all allomorph variation to be resolved during synthesis.

The final ambiguity-resolution " m e t h o d , " the one which resolves a b o u t 93 p e r c e n t of the residue, is not really a m e t h o d at all. It turns out that when the various readings of an ambiguous SD word are synthesized in the TD, they seldom differ. The p r e d o m i n a n t case is that only one spelling (rendering) arises f r o m an ambiguous word, so that the final text has only a single word form even though the analysis and synthesis of that word were ambiguous. On a 5 7 6 - w o r d sample of text a d a p t e d to JuQ, t h e r e were 151 ambiguous words. F o r these, W A produced 850 readings (1.47 readings per word). But the "collapsing e f f e c t " re- duced the n u m b e r of readings to 587 (1.02 readings per word), so that only 11 words were indicated as ambiguous in the final text.

In checking and c o r r e c t i n g these texts, native speakers n e v e r had any difficulty in quickly selecting one of the alternative words as being correct, and agreed on all selections. The n u m b e r of alternatives in the resulting TD text did not present difficulties for the c h e c k e r / e d i t o r s . The final level of ambiguity pres- e n t e d no experimental difficulties of any kind.

Given the necessary complexity and diversity of the symbol processing required by the design decisions above, the programming language and related software support must provide:

1. Flexible facilities for manipulating charac- ter strings,

2. Easy representation of processes with elaborate functional decomposition,

3. Recursive processes,

4. Strong tools for defining, testing, revising and combining processes.

T h e s e r e q u i r e m e n t s ( a m o n g others) led to our choosing I N T E R L I S P (a dialect of LISP) as the programming language. LISP encourages functional decomposition of one's program; this has b e e n vital to keeping the whole enterprise comprehensible and con- trollable. This paper's description of the program does not reflect the strong f u n c t i o n a l d e c o m p o s i t i o n em- ployed. T h e r e are actually 111 functions in the cur- rent version of the program, and a similar n u m b e r of other (tool) functions were defined for managing dic- tionaries, printing statistics and o t h e r activities.

4. P r o c e d u r e s for C h e c k i n g t h e C o m p u t e r A d a p t e d T e x t

All during the m o n t h s - l o n g d e v e l o p m e n t of the program there was an understanding that its effective-

(10)

ness would have to be e x a m i n e d in s o m e sort of field test. Subsequent d e v e l o p m e n t and application would d e p e n d on the o u t c o m e s of the testing, i.e. on h o w r e a d a b l e and c o r r e c t a b l e the c o m p u t e r - p r o d u c e d text really is. In particular, we were p r e p a r e d to e m b e d the p r e s e n t p r o g r a m in a m o r e c o m p l e x p r o g r a m whose scope was sentences or larger units if that had p r o v e d necessary.

To test the p r o g r a m ' s a d e q u a c y , samples of various types of text were selected for trial a d a p t a t i o n , and all samples were a d a p t e d into five target dialects. In the s u m m e r of 1978 these texts were t a k e n b y the first author to the areas where the target dialects are spo- ken. T h e y were p r e s e n t e d to native speakers of the various dialects, and their r e a c t i o n s a n d suggestions for revision were noted. This section presents the testing, and Section 5 describes the results.

4.1 T h e N a t u r e of t h e T e x t s U s e d in t h e E x p e r i m e n t

T h e source texts used for this e x p e r i m e n t consisted in the following: two folktales " H o m b r e y O s o " ( a p p r o x i m a t e l y 150 w o r d s ) and " M o c o y Mishi" ( a p p r o x i m a t e l y 250 w o r d s ) , t w o p a s s a g e s f r o m the translation of the

Gospel of Mark

2 : 1 - 1 2 ( T h e H e a l i n g of the Paralytic) and 14:12-15:41 (The L a s t Supper t h r o u g h T h e D e a t h of J e s u s ) , a n d a short p e r s o n a l narrative a b o u t an accident.

4.2 P r o c e d u r e U s e d to C h e c k t h e T e x t s

T h e p r o c e d u r e s used to c h e c k these texts varied f r o m helper to helper as a function of his ability to read and to grasp what was e x p e c t e d of him b y w a y of editing or correcting. T h e m o s t k n o w l e d g e a b l e helper t o o k p e n in h a n d and p r o c e e d e d to c o r r e c t and edit with i n t e r v e n t i o n (by the first a u t h o r ) only on certain m a t t e r s of spelling (where the official o r t h o g r a p h y - - which was not k n o w n to him - - was in conflict with his intuitions b a s e d on the writing of Spanish). This helper fully g r a s p e d the task of editing: he would f r e q u e n t l y b a c k up (as he p r o c e e d e d t h r o u g h the texts) to read several sentences b e f o r e the one he was focusing on to see that the passage r e a d smoothly. H e m a d e a fair n u m b e r of c h a n g e s in w o r d o r d e r and s o m e that were clearly intended to bring the discourse structure of the text m o r e into line with his c o n c e p t i o n of w h a t it should be. (This raised an interesting p r o b - lem, one which has no o b v i o u s solution, n a m e l y h o w can one distinguish b e t w e e n those changes which are essential to the intelligibility and g o o d style of the T D text f r o m t h o s e which m a k e the text c o n f o r m to a p e r s o n a l style? A n d h o w would o n e train a c h e c k e r / e d i t o r to distinguish b e t w e e n these?).

T h e m o s t c o m m o n situation was t h a t the h e l p e r was illiterate (or t h a t he h a d r a t h e r minimal r e a d i n g skills in Spanish). In this case it was n e c e s s a r y to r e a d the text to him s e n t e n c e b y s e n t e n c e , p a u s i n g a n d soliciting c o r r e c t i o n s , suggestions, a n d i m p r e s s i o n s if these were not f o r t h c o m i n g , and m o n i t o r i n g his u n d e r - standing of the text. T h e m a i n d r a w b a c k of this m e - t h o d is t h a t it was difficult (usually nearly impossible) to get g o o d reactions as to the discourse f e a t u r e s of the text. ( A n d on a second reading, the helpers were unwilling to criticize the text which they had just cor- rected).

S o m e text was c h e c k e d m e r e l y b y h a n d i n g it to a p e r s o n (who in e a c h case was a t e a c h e r at the s e c o n d - ary or university level) with the instruction to c o r r e c t i t - - t o m a k e it g o o d text f o r his dialect. In these cases v e r y little was d o n e ; f a r f e w e r c h a n g e s w e r e m a d e t h a n in a n y case w h e r e a linguist w o r k e d with the c o r r e c t o r / e d i t o r .

D e s p i t e these and o t h e r d r a w b a c k s , it was felt that, b y and large, the checking and correcting did bring the T D texts e x t r e m e l y close to w h a t would be a d e q u a t e e v e n for Scripture translation. ( O f course, this is a subjective j u d g m e n t ) .

5. R e s u l t s of C h e c k i n g t h e C o m p u t e r A d a p t e d T e x t

C h e c k i n g the c o m p u t e r - a d a p t e d t e x t with native s p e a k e r s of these dialects has r e v e a l e d b o t h the general strength of the p r o g r a m an an interesting residue of disparities b e t w e e n c o m p u t e r - p r o d u c e d t e x t a n d text after s u b s e q u e n t corrective editing.

5.1 G e n e r a l R e s u l t s

T h e results of testing the c o m p u t e r - a d a p t e d texts in the areas where the various target dialects are s p o k e n has b e e n v e r y encouraging.

First, t h o u s a n d s of changes were c o r r e c t l y m a d e b y the p r o g r a m . Samples of the text a d a p t e d to L1Q were c o u n t e d to estimate the p r o g r a m ' s q u a n t i t a t i v e effects. It c h a n g e d a b o u t 760 m o r p h e m e s per 1000 words of text. T h e native helper suggested a b o u t 190 additional c h a n g e s p e r 1000 w o r d s of text, and these c h a n g e s include some which are editorial r a t h e r t h a n dialect- related. (This was an e x t r e m e ; all helpers for o t h e r dialects suggested far f e w e r c h a n g e s of the same m a - terial). So the p r o g r a m seems c a p a b l e of doing b e - t w e e n 80 p e r c e n t and 95 p e r c e n t of the n e c e s s a r y or advisable dialect-related changes. T h e s e figures are of less practical i m p o r t a n c e t h a n the qualitative fact that the p r o g r a m m a d e these texts fully c o m p r e h e n s i b l e to the native helpers.

(11)

O f the c o r r e c t i o n s which field testing b r o u g h t to light, a rough estimate suggests that well o v e r 90 percent could also have b e e n m a d e (and eventually will be m a d e ) b y the p r o g r a m b y simple modifications of the data lists m a d e available to it. (The bulk of these changes involve the spelling of Spanish loans, which in the e x p e r i m e n t a l tests were written as in Spanish but which will eventually be written according to Q u e c h u a o r t h o g r a p h y ) . This m e a n s that only a very small resi- due o f changes will not be made by the program, these changes being left for the c h e c k e r / e d i t o r .

T h e c o m p u t e r - a d a p t e d text was - - with very f e w e x c e p t i o n s - - c o m p l e t e l y intelligible to the c h e c k e r / e d i t o r s , and this would not h a v e b e e n the case for material which had not b e e n p r e - a d a p t e d .

Thus it seems safe to say that one c o n s e q u e n c e of c o m p u t e r a d a p t a t i o n is that the resultant text could be c h e c k e d by a p e r s o n w h o k n e w only the target dialect, his task being to m a k e the text sound m o r e natural in his dialect. O n the same basis, c o m p u t e r a d a p t a t i o n can m a k e an existing translation in a related language or dialect accessible to a native translator.

It has b e e n our feeling t h a t p r e - a d a p t i n g with a c o m p u t e r would enable the c h e c k e r / e d i t o r to consider m o r e significant changes b e c a u s e it relieved him of the ( o v e r w h e l m i n g n u m b e r of) rather superficial and systematic changes. This e x p e r i m e n t has not verified this (nor c o n t r a d i c t e d it) b e c a u s e those w h o helped were not trained in the task of c h e c k i n g / e d i t i n g , w h i c h - - i t is a s s u m e d - - w i l l be the case in the eventual applica- tions. (Training c h e c k e r / e d i t o r s , w h o will w o r k f r o m p r e - a d a p t e d text, seems a r e a s o n a b l e task. By con- trast, to train p e r s o n s to m a k e all the changes, working directly f r o m the SD text, would be a f o r m i d a b l e if not impossible task. But by first p r e a d a p t i n g the text, that which must be done manually by the c h e c k e r / e d i t o r , and c o n s e q u e n t l y w h a t he must be trained to do, is r e d u c e d to well within the feasible).

O n e implication of the results of this e x p e r i m e n t is that the range of dialects for which there should be a single p r i m a r y t r a n s l a t i o n has b e e n c o n s i d e r a b l y en- larged. Mutual intelligibility is no longer an a d e q u a t e basis on which to m a k e decisions a b o u t the allocation of linguistic or translation personnel. Mutual intelligibility m a y be very low for s o m e e x t r e m e l y trivial, superficial reason, and e v e n if it is low b e c a u s e of a great many, seemingly c o m p l e x factors, so long as t h e y are systematic, c o m p u t e r a d a p t a t i o n m a y m a k e otherwise unintelligible text quite intelligible to a s p e a k e r of the target dialect, intelligible enough that he can edit it without some other m e a n s of discovering its meaning.

5.2 D i s p a r i t i e s B e t w e e n C o m p u t e r - A d a p t e d a n d Final T D T e x t s

This section lists kinds of text changes which were d i s c o v e r e d to be n e c e s s a r y to m a k e the c o m p u t e r - a d a p t e d texts b e t t e r c o n f o r m to the T D ' s . O n e point s e e m s inevitable: there will always be a non-trivial residue o f changes which cannot be done by the computer. Thus, it will always be n e c e s s a r y that the T D texts be manually c o r r e c t e d or edited. So r a t h e r t h a n strive to m a k e the p r o g r a m do all the changing possible, it is wise to be c o n t e n t with a p r o g r a m which does a significant a m o u n t of the c h a n g e s - - t h o s e " l o w - level" changes which are s y s t e m a t i c - - a n d not unduly c o m p l i c a t e the p r o g r a m b y a t t e m p t i n g to m a k e it do an unrealistic amount. T h e p r o p o r t i o n of all of the desired changes which is accomplished b y the p r e s e n t p r o g r a m seems entirely satisfactory; no f u r t h e r devel- o p m e n t is needed to change the b a l a n c e b e t w e e n m a n - ual and p r o g r a m m e d changes.

It was gratifying to find that, a l t h o u g h s o m e of these residual kinds of changes (described below) were difficult or impossible to p r o g r a m , they caused little difficulty for the checkers in the field test. T h e raw c o m p u t e r - a d a p t e d text was as good as a c o n v e n t i o n a l r o u g h draft, a n d could h a v e b e e n used in the s a m e way.

1. C h a n g e s due to cultural differences

T h e Q u e c h u a culture is not c o m p l e t e l y h o m o g e n e - ous t h r o u g h o u t the A n d e s ; there are cultural d i f f e r - ences which give rise to certain p r o b l e m s in a d a p t i n g f r o m one dialect to another.

In H g Q , the e x p r e s s i o n s f o r the time of d a y are tied to the practice of chewing coca, e.g., chaqcha inti ( c o c a : c h e w i n g sun) m i d - m o r n i n g coca b r e a k (roughly at 10:30 A M ) and mallway inti or mallway oora, which refers to the time of the m i d - a f t e r n o o n c o c a break. In o t h e r a r e a s m J u n i n , for e x a m p l e - - w h e r e the chewing of c o c a plays a lesser part in the culture, time is not indicated b y r e f e r e n c e to c o c a chewing, p r e s u m a b l y b e c a u s e there are no " i n s t i t u t i o n a l i z e d " times f o r chewing. In these areas time is told either b y e x p r e s - sions b o r r o w e d f r o m Spanish, e.g., las tres '3 o ' c l o c k ' or b y r e f e r e n c e to the position of the sun, e.g., inti yarpunan oora (sun falling time) ' 5 : 0 0 - 6 : 0 0 P M ' .

2. D i f f e r e n t c o m p l e m e n t s

(12)

David J. W e b e r and W i l l i a m C. M a n n Prospects for Computer-Assisted Dialect Adaptation

3. Case

In H g Q forgiveness is expressed by the verb perduna- ( b o r r o w e d from the Spanish perdonar) and that of which one is forgiven is expressed by an accusative noun phrase; e.g., noqa qam-ta huchayki-ta perdunaami (I y o u - A C C y o u r : s i n - A C C I:forgive:you) 'I forgive you your sin.' In DoQ, it is expressed by an

ablative n o u n phrase: noqa qam-ta huchayki-pita perdunaami (I you y o u r : s i n - A B L A T I V E I:forgive:you)

'I forgive you from your sin. '3

4. Spanish loans

The process of a d a p t a t i o n must deal with differences in what elements of Spanish have b e e n bor- rowed and how they have been assimilated. So, adapting from HgQ, we have:

1. A Spanish loan changed to a native Que- chua word, e.g., H g Q alberti- becoming LIQ yaasi- 'to advise'.

2. A Q u e c h u a word becoming a Spanish loan, e.g., H g Q pachaku- b e c o m i n g L1Q posadaku- 'to take lodging'.

3. A Spanish loan becoming a different Span- ish loan, e.g., H g Q iwal b e c o m i n g P a Q pareehu ' t o g e t h e r (with)', H g Q sabli ( f r o m Spanish sable 'saber') b e c o m i n g YaQ macheti (from Spanish machete).

4. Spanish loans in different degrees of as- similation, e.g., HgQ ligi- (highly assimilated) becoming LIQ liyi- 'to read' from Spanish leer ; H g Q rigi- (highly assimilated) becoming D o Q kriyi- 'to believe' from Spanish creer ; H g Q iskirbi- b e c o m i n g D o Q iskribi- 'to r e a d ' f r o m Spanish escribir ; H g Q pakillaa b e c o m i n g D o Q Dyosolpaki 'thank you' from Spanish Dios se 1o pague 'May G o d r e p a y you'.

adequate merely to say fiyupa pununaptin 'because they were really sleepy'.

2. In HgQ, 'to greet' is expressed by adyusta qo-, literally, 'to give an adios'. In o t h e r dialects this is e x p r e s s e d simply b y the Spanish loan saluda- 'to greet'.

3. In H g Q , to say that s o m e o n e b e g a n to

sob, one says waqayman churakaran

(to:cry he:was:placed). In D o Q waqay- man chayarqan ( t o : c r y he:arrived).

4. T h e H g Q expression, tapriypa tupaypa, which means roughly 'end for end, head over heels' is u n k n o w n in e.g., JuQ.

5. In HgQ, roosters are said to sing (kanta-), whereas in YaQ, to cry (waqa-).

It was encouraging to note a case in which an idiomatic usage is preserved even though the verb root is changed. The case is H g Q wasi-ta hata-ra-chi- (house- A C C stand-RI-cause-) 'to raise a house' (literally, 'to cause a house to stand'). In the dialects to the east, the same idiom is used but with the local r e f l e x of *shaya- 'to stand'; thus in DoQ, 'to raise a house' is expressed wayita sharkatsi-.

6. Auxiliary constructions

H g Q has a construction which indicates that some action or e v e n t is imminent. Some dialects lack the imminent construction. F o r such dialects it was necessary to change this to the periphrastic future.

7. Suffixes present in one dialect but absent in a n o t h e r

In LIQ there is a fairly c o m m o n verbal suffix -ski which H g Q does not have. In adapting from H g Q to LIQ it was left to the c h e c k e r / e d i t o r to insert -ski where appropriate.

5. Idioms

Idiomatic expressions vary from dialect to dialect (although very little). F o r example,

1. H g Q nawinkuna fiyupa pununaptin

( t h e i r : e y e s really b e c a u s e : t h e y : d e s i r e : sleep) 'because their eyes really wanted to sleep' is an idiomatic way to say that they were extremely sleepy. This idiom is not used in LIQ, D o Q , and YaQ, where it is

2 A phasal verb is one which refers to the starting, stopping, continuing, et cetera of the action e x p r e s s e d by its c o m p l e m e n t , e.g., to start to eat, to stop talking, etc.

3 This difference may be due to H g Q having assimilated the loan more highly than DoQ; the ablative would be used in Spanish: Yo te perdono de tus pecados 'I forgive you f r o m your sins.'

8. Pluralization

The pluralization algorithm works perfectly, in that w h e n e v e r there is a pluralizer in the SD, the verb is a p p r o p r i a t e l y pluralized in the TD. H o w e v e r , this t r e a t m e n t is deficient for the following reason: in H g Q (the SD for the present e x p e r i m e n t ) only a b o u t 30 percent (a rough guess) of those verbs which have a plural subject or object are actually m a r k e d as being plural in the verb. In o t h e r dialects a much greater p e r c e n t a g e of the verbs which have plural subjects or objects are m a r k e d plural. F o r LIQ, this might be as high as 90 percent. In adapting from H g Q to LIQ, far too few pluralizers were inserted because there were too few pluralizers in the H g Q verbs to trigger the process which inserts them into the T D verbs.