Two Simple Prediction Algorithms to Facilitate Text Production


Lois Boggess
P.O. Drawer CS
Mississippi State University
Mississippi State, MS 39762

ABSTRACT

Several simple prediction schemes are presented for systems intended to facilitate text production for handicapped individuals. The schemes are based on single-subject language models, where the system is self-adapting to the past language use of the subject. Sentence position, the immediately preceding one or two words, and initial letters of the desired word are cues which may be used by the systems.

INTRODUCTION

For some years we have been investigating the use of a sizeable sample of a particular individual's language habits in predicting future language use for that individual. The research has taken two directions.

One of these, the HWYE (Hear What You Expect) system, builds a large language model of the past language history of the individual, with special emphasis on the most frequent words of that person, and the result is used in speech recognition. In studying the language model developed by the HWYE system, several simple predictive schemes were noted which are capable of anticipating, during the generation of a sentence, a small set of words from which the next desired word can be selected. The two schemes described here are used for text generation (not speech recognition) in a format that could be of use to a physically handicapped person; hence the schemes have no right context available. One of the schemes does use left context, and the other uses only sentence position as "context". Both are implemented on IBM-PC systems with minimal memory requirements.

MOTIVATION

One hundred English words account for 47 per cent of the Brown corpus (about one million words of American English text taken from a wide range of sources). It seems reasonable to suppose that a single individual might in fact require fewer words to account for a large proportion of generated text. From our work on the HWYE system it was known that 75 words accounted for half of all the text of Vanity Fair, a 300,000 word Victorian English novel by Thackeray (which incorporated a fairly involved syntax, much embedded quotation, and passages in dialect and in French) [English and Boggess, 1986]. We further found that 50 words accounted for half of all the verbiage in a 20,000 word set of sentences provided by an individual who collaborated with us. This latter corpus, called the Sherri data, is a set of texts provided by a speech-handicapped individual who uses a typewriter to


You said something about a magazine that <name1> had about computers that I might like to borrow.

I would some time.

I think we have to pick up the children while <name2> is in the hospital.

I want to visit her in the hospital.

But you have to lift me up to the window for me to see the baby.

Well, it's May first now. Help!

I thought it would not be so busy but it looks like it might be now.

Figure 1. Sample set of contiguous sentences in Sherri data

It seems reasonable to suppose that for conversational English, approximately 50 words may account for half of the verbiage of most English users. From the standpoint of human factors, an argument could be made that one should simply put the 50 words upon the screen with the alphabet and thus be assured that half of all the words desired by the user were instantly available, in known locations that the user would quickly become accustomed to. Constantly changing menus introduce an element of user fatigue [Gibler and Childress, 1982]. That argument may especially make sense as larger screens with more lines per screen and more characters per line become more common.
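The coverage figures quoted above (100 words for 47 per cent of the Brown corpus, 50 words for half of the Sherri data) can be measured on any tokenized text. The following is a minimal sketch, not the authors' code; the function names and the whitespace tokenization of the toy sample are assumptions made for illustration.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.5):
    """Return how many of the most frequent word types are needed
    to account for at least `target` of all word tokens."""
    if not tokens:
        return 0
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


def top_n_coverage(tokens, n=20):
    """Return the fraction of all tokens accounted for by the
    n most frequent word types (a constant n-word menu)."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(count for _, count in counts.most_common(n)) / len(tokens)


# Toy sample (hypothetical, not from the Sherri data).
sample = "i think i would like to go to the store if i can".split()
needed = words_needed_for_coverage(sample, target=0.5)
share = top_n_coverage(sample, n=2)
```

On a real corpus, `words_needed_for_coverage(tokens, 0.5)` corresponds to the "50 words account for half" style of measurement, and `top_n_coverage(tokens, 20)` to the success rate of a constant 20-word menu.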

If we limit ourselves to the top 20 most frequent words as a constant menu, only about 30 per cent of the user's verbiage is accounted for. However, it was observed, while working with the HWYE system, that if one looked at the top 20 words for any given sentence position, one did not see the same set of words occurring. Clearly the high frequency words (the set that comprise half of word use) are mildly sensitive to "context" even when "context" is so broadly defined as sentence position. Different subsets of the 50 member set of high frequency words appear in the set of 20 most frequent words for a given sentence position. Moreover, after processing approximately 2000 sentences from the user, it was still the case that some of the top 20 words for a given position were not members of the high frequency set at all. For example, the word "they", a member of the menu for the first sentence position (see Figure 2) and hence one of the 20 most frequent words to start a sentence, is not a member of the global high frequency set.

A preliminary analysis by English suggested that, whereas a constant "prediction" of the top 20 most frequent words would yield a success rate of 30 per cent, predicting the top 20 most frequent words per position in sentence would yield a success rate of 40 per cent.

"CONTEXT" AS SENTENCE POSITION

The simplest scheme, which has been built as a prototype on an IBM PC with two floppy disk drives, presents the user with the top 20 most frequent words that the user has employed at whatever position in a sentence is current. For example, Figure 2 shows the screen presented to the user at the beginning of production of a sentence. On the left is a list of the 20 words which that


[Screen image: a numbered left-hand menu of 20 words in alphabetical order (among them but, can, could, do, if, it, Lois, she, that, the, they, we, what, when, you), the letters of the alphabet, function items (among them SPELL, CAPITAL, PUNCTUATION, ENDING, SPECIAL, HARD-COPY, ERASE, QUIT), and a NEW SENTENCE: line at the bottom.]

Figure 2. Initial Screen

and functions is made by mouse, though the actual selection mechanism is separated from the bulk of the code so that replacement with another selection mechanism should be relatively easy to implement.) The sentence is built at the bottom of the screen. If the user selects a word from the menu at the left, it is placed in first position in the sentence, and a second menu, consisting of the 20 most frequent words that the user has used in second place in a sentence, appears in the left portion of the screen. After a second word has been produced and added to the sentence, a third menu, consisting of the 20 most frequent words for that user in third place in a sentence, is offered, and so on.
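The position-based menus just described amount to one frequency table per sentence position. The sketch below is an illustration under stated assumptions rather than the prototype's implementation: the class name `PositionMenu`, whitespace tokenization, and the alphabetical ordering of the displayed menu (suggested by the appearance of Figure 2) are all assumptions.

```python
from collections import Counter, defaultdict


class PositionMenu:
    """Keeps, for each sentence position, counts of the words the
    user has produced there, and offers the most frequent ones."""

    def __init__(self, menu_size=20):
        self.menu_size = menu_size
        self.by_position = defaultdict(Counter)  # position -> word counts

    def learn(self, sentence):
        """Record every word of a finished sentence at its position."""
        for position, word in enumerate(sentence.split()):
            self.by_position[position][word] += 1

    def menu(self, position):
        """Return up to menu_size of the most frequent words for this
        position, listed alphabetically (ordering is an assumption)."""
        counts = self.by_position.get(position, Counter())
        top = [word for word, _ in counts.most_common(self.menu_size)]
        return sorted(top, key=str.lower)


# Hypothetical usage with a two-word menu for brevity.
menus = PositionMenu(menu_size=2)
for text in ["I think so", "I would go", "we would like", "we think so"]:
    menus.learn(text)
first_menu = menus.menu(0)   # words most often used to start a sentence
third_menu = menus.menu(2)   # words most often used in third place
```

Because `learn` is called on each completed sentence, a word can move into a position's menu as its count grows, which is the kind of self-adaptation to the user that the scheme relies on.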

At any time the user may reject the left-hand menu by selecting a letter of the alphabet. Figure 3 shows the screen after the user has produced two words of a sentence and has begun to spell a third word by selecting the letter "a". At this point, the top 20 most frequently used words beginning with "a" have been offered at the left. If the desired word is not in the list, the user continues by selecting the second letter of the desired word (in this case, "n"). The left-hand menu becomes the 20 most frequently used words beginning with the pair of letters given so far. As is shown in Figure 4, there are times when fewer than 20 words of a given two-letter starting combination have been encountered from the user's past history, in which case this algorithm offers a shortened list.
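The letter-driven menus can be sketched as a filter over the user's overall word counts. This is a minimal illustration, not the file organization of Wei [1987]; the function name `prefix_menu`, the case-insensitive matching, and the alphabetical display order are assumptions.

```python
from collections import Counter


def prefix_menu(word_counts, prefix, menu_size=20):
    """Return up to menu_size of the user's most frequent words that
    begin with the letters spelled so far. The list is simply shorter
    when fewer matching words are known."""
    prefix = prefix.lower()
    matches = Counter(
        {word: count for word, count in word_counts.items()
         if word.lower().startswith(prefix)}
    )
    top = [word for word, _ in matches.most_common(menu_size)]
    return sorted(top, key=str.lower)


# Hypothetical counts, loosely modelled on the words visible in Figures 3 and 4.
counts = Counter({"and": 50, "any": 9, "Anita": 4, "able": 7, "but": 30})
after_a = prefix_menu(counts, "a", menu_size=3)    # one letter spelled
after_an = prefix_menu(counts, "an")               # two letters spelled
```

Each further letter narrows the candidate set, so the menu after two letters can hold lower-frequency words that did not fit on the one-letter menu.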

In the case illustrated, the desired word was on the list. If it were not, the user would have had to spell out the entire word, and it would have been entered into the sentence. In either case, the system subsequently returns to offering the menu of most-frequently-used words for the fourth position, and continues in similar fashion to the end of the sentence.

[Screen image: a numbered left-hand menu of the user's most frequent words beginning with "a" (among them able, about, after, afternoon, again, all, am, an, apple, April, are, around, as, ask, asked, at, aunt), the letters of the alphabet, and the sentence so far: "I have".]

Figure 3: User has selected "a"

[Figure 4 screen image: a shortened left-hand menu of 11 words beginning with "an" (among them animal, animals, Anita, anniversary, another, answer, answers, any, anyone, anything), the letters of the alphabet, and the sentence so far: "I have".]


The system keeps up with how often a word has been used and with how many times it has occurred in each position in a sentence, so that from time to time a word is promoted to one of the top 20 alphabetic or top 20 position-related sets of words. For details on the file organization scheme that allows this to be done in real time, see Wei [1987]. Details on the mouse-based implementation for IBM PC's are available in Chow [1986].

A SECOND ALGORITHM

An alternative predictive algorithm has been implemented which replaces the sentence-position-based first menu. It pays special attention to the 50 most frequently used words in the individual's vocabulary (the high-frequency words) and to the words most likely to follow them. By virtue of their frequency, these are precisely the words about which the most is known, with the greatest confidence, after a relatively small body of input such as a few thousand sentences.

For each of the 50 high-frequency words, a list is kept of the top 20 most frequent words to follow that word. Let us call these the first-order followers. For each of the first-order followers, there is a list of second-order followers: words known to have followed the two-word sequence consisting of the high-frequency word and its first-order follower.

For example, the word "I" is a high-frequency word. The first-order followers for "I" include the word "would". The second-order followers for "I would" include the word "like". (See Figure 5.) The second-order followers for "I would" also include many one-time-only followers, so the system maintains a threshold for the number of occurrences below which a word is not included in the list of second-order followers. The reasoning is that a word's having occurred only once in an environment that by definition occurs frequently may be taken as counter-evidence that the word should be predicted.

[Figure 5 diagram: a tree rooted at "I". Legible node labels include think, don't, hope, was, wish, like, will, have, want, wonder, got, I'll, the, we, it, it's, you, really.]

Figure 5. First- and second-followers for "I"

Rather than predict a word with low reliability, one of two alternatives is taken. If the first-order follower is itself a high-frequency word, then low-reliability second-order followers may be replaced with the first-order follower's own followers. ("Would" is a first-order follower of "I" and is itself a high-frequency word. There are relatively few reliable second-order followers to "would" in the left context of "I", so the list is augmented with first-order followers of "would" to round out a list of 20 words.) The other alternative, taken when the first-order follower is not a high-frequency word, is to fill out any short list of second-order words with the high-frequency words themselves.
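The follower lists and the two fallback rules described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name `FollowerPredictor`, the parameter names, and the default threshold of 2 are assumptions, and for brevity the sketch counts followers for every word rather than storing them only for the high-frequency words. It also pads any remaining menu slots with high-frequency words in both fallback cases.

```python
from collections import Counter, defaultdict


class FollowerPredictor:
    """First- and second-order follower menus with fallback."""

    def __init__(self, sentences, n_high=50, menu_size=20, threshold=2):
        self.menu_size = menu_size
        self.threshold = threshold            # minimum count for a second-order follower
        unigrams = Counter()
        self.first = defaultdict(Counter)     # w1 -> counts of next word
        self.second = defaultdict(Counter)    # (w1, w2) -> counts of next word
        for sentence in sentences:
            words = sentence.split()
            unigrams.update(words)
            for i in range(len(words) - 1):
                self.first[words[i]][words[i + 1]] += 1
            for i in range(len(words) - 2):
                self.second[(words[i], words[i + 1])][words[i + 2]] += 1
        self.high = [word for word, _ in unigrams.most_common(n_high)]
        self.high_set = set(self.high)

    def _fill(self, menu, candidates):
        """Append candidates not already present until the menu is full."""
        for word in candidates:
            if len(menu) >= self.menu_size:
                break
            if word not in menu:
                menu.append(word)
        return menu

    def menu_after(self, prev2, prev1):
        """Menu for the word following the sequence prev2 prev1."""
        menu = []
        if prev2 in self.high_set:
            # Reliable second-order followers: at or above the threshold.
            pairs = self.second.get((prev2, prev1), Counter())
            reliable = [w for w, c in pairs.most_common() if c >= self.threshold]
            self._fill(menu, reliable)
        if prev1 in self.high_set:
            # Augment with prev1's own first-order followers.
            followers = self.first.get(prev1, Counter())
            self._fill(menu, [w for w, _ in followers.most_common()])
        # Fill any remaining slots with the high-frequency words themselves.
        return self._fill(menu, self.high)


# Hypothetical toy history; the real system was trained on the user's own sentences.
history = ["I would like tea", "I would like coffee", "I would go",
           "you would like tea", "I think so"]
predictor = FollowerPredictor(history, n_high=3, menu_size=4, threshold=2)
menu = predictor.menu_after("I", "would")
default_menu = predictor.menu_after(None, "tea")
```

In the toy run, "go" follows "I would" only once and so is excluded as a second-order follower, but it re-enters the menu as a first-order follower of the high-frequency word "would".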

This algorithm is related to, but takes less memory and is less powerful than, a full-blown second-order Markov model. Each state in a second-order (trigram) Markov model is uniquely determined by the previous two inputs. For an input vocabulary of 2000 words, the number of mathematically possible states in a trigram Markov model is 4,000,000, with more than 8 billion arcs interconnecting the states. Fortunately, in the real world most of these mathematically possible states and arcs do not actually occur, but a trigram model for the real world possibilities is still quite large.
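The size estimate can be checked with a short calculation. The sketch assumes that each state is an ordered pair of vocabulary words and that each state has one outgoing arc per vocabulary word; the paper does not spell out how arcs were counted, so the arc figure here is an upper bound under that assumption, and the function name is hypothetical.

```python
def trigram_model_size(vocabulary_size):
    """Return (states, arcs) for a fully connected second-order model:
    one state per ordered word pair, one arc per state per next word."""
    states = vocabulary_size ** 2
    arcs = states * vocabulary_size
    return states, arcs


states, arcs = trigram_model_size(2000)
print(f"{states:,} states, {arcs:,} arcs")
```

For a 2000-word vocabulary this gives 4,000,000 states and arcs on the order of 8 billion, in line with the figures quoted in the text.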

We experimented with abstracting the input vocabulary by restricting it to the 50 highest-frequency words plus the


              Sherri data              Thackeray data
words     new states   new arcs     new states   new arcs

 1000        527         677           639         830
 2000        469         620           624         818
 3000        471         636           476         705
 4000        399         562           467         716
 5000        397         566           463         714
 6000        391         579           437         668
 7000        337         507           389         642
 8000        311         476           370         628
 9000        323         500           361         612
10000        285         486           384         629
11000        329         518           348         601
12000        278         448           331         588
13000        276         445           310         543
14000        240         408           291         530
15000        248         425           287         529
16000        244         420           290         533
17000        243         414           269         497
18000        259         446           234         468

Figure 6. Growth of abstracted fourth-order Markov models

new words of text, after 17000 words of input. This was true for both the Sherri data (conversational English) and the more formal Thackeray data. Moreover, the fourth-order Markov model for the abstracted Thackeray data continued to grow. After 100,000 words of input, with a model of approximately 22,000 states and approximately 45,000 arcs, the rate of growth was still more than 1,000 states and 3,000 arcs per 10,000 words of input.

For this particular implementation, however, neither a full-blown Markov model using total vocabulary nor an abstract model using the 50-word vocabulary seemed appropriate. On the one hand, models of the entire vocabulary confirmed that many multiple-word sequences did occur regularly. Nevertheless, for any but the simplest order Markov models (orders zero and one), the vast bulk of the networks were taken by word combinations that occurred only once. On the other hand, restricting the predictive mechanism to only the high-frequency words obviously left out some of the regularly occurring word combinations. Our first- and second-follower algorithm described above allows lower frequency words to be predicted when they occur regularly in combination with high-frequency words.

PREDICTIVE CAPABILITIES

The data used to test the predictive capabilities of the system were typescripts provided by the user, who was utilizing a manual typewriter; it follows that the results were not biased by the user's favoring sentence patterns that the system itself provided. The system had been given 1750 prior sentences produced by the user and the data collected were for the performance of the system over the next 97 sentences. The 1750 sentences were 14,669 words in length with a vocabulary of 1512 words. Twelve sentences of the 1750 were a single word in length (e.g. "yeah", "no" and "gesundheit") and 51 were of length 20 or greater. Average length of sentence for the initial body was 8.4 words per sentence. The first 200 sentences included transcriptions of oral sentences, which were much shorter on average, since the user is speech handicapped. If the first 200 sentences are omitted, the average sentence length is 8.6 for the following 1550 sentences.


Of the 884 words, 350 were presented on the first menu, 373 were presented on the second menu (after one letter had been spelled), 109 were presented on the third menu (after two letters had been spelled), 2 were presented on the fourth menu (after three letters had been spelled), 43 were spelled out in their entirety, and 7 were numbers in digital form, produced using the number screen of the system.

From the above, it is obvious that the device of predicting the 20 most frequent words by sentence position is successful 39.6 per cent of the time; 42.2 per cent of the time, the desired word is among the 20 most frequent words of a given initial letter but not in the 20 most frequent words by position; combining these two facts, we see that 81.8 per cent of the time, this simple prediction scheme presents the desired word on a first or second selection. The desired word is offered in the first, second, or third menu 94.1 per cent of the time, and most of the rest of the time (5.7 per cent of total), the desired word is unknown to the system and is "spelled out", where "spelling" includes producing numbers.
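The percentages in this paragraph follow directly from the word counts given for the 884-word test. The sketch below recomputes them; the function name `menu_breakdown` and the row layout are assumptions made for illustration, and individual values may differ from the published ones in the last digit because of rounding.

```python
def menu_breakdown(counts):
    """For each category, return (label, words, per cent of total,
    cumulative per cent), with percentages rounded to one decimal."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, words in counts.items():
        running += words
        rows.append((label, words,
                     round(100 * words / total, 1),
                     round(100 * running / total, 1)))
    return rows


# Counts reported for the sentence-position algorithm (97 sentences, 884 words).
first_algorithm = {
    "first menu": 350,
    "second menu": 373,
    "third menu": 109,
    "fourth menu": 2,
    "spelled": 43,
    "numbers": 7,
}
rows = menu_breakdown(first_algorithm)
for label, words, percent, cumulative in rows:
    print(f"{label:12s} {words:4d} {percent:5.1f}% {cumulative:6.1f}%")
```

The 5.7 per cent quoted for words unknown to the system is the sum of the "spelled" and "numbers" categories, 50 of 884 words.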

Although the fourth menu, consisting of words with a three-letter initial sequence, presently has a low success rate, it is precisely this category that we expect to see improve as more of the user's words become known to the system through spelling. That is, as time passes, we expect the user to have to resort to complete spelling less and less because the known vocabulary will include more and more of the actual vocabulary of the user. Many of the new words will be low frequency words that we would expect to find on the menu for three-letter combinations after they are known.

The second algorithm, using first- and second-followers of the high-frequency words, was run on 100 sentences, the shortest of which was "Help!" (94 of the 97 test sentences for the first algorithm were represented in the test set for the second.) There were 895 words in the sample, of which 448 were presented on the first menu, 280 were presented on the second (after one letter had been spelled out), 83 on the third (after two letters were spelled), 1 on the fourth, and 83 were spelled out in their entirety (this category included numbers).

Running the second test gave us a very quick appreciation for the value of adding new words to the system as they are encountered, since this implementation of the second algorithm did not. One especially striking example was a word beginning with "w-o" which had never been used before, but which occurred five times in the 100 test sentences and had to be spelled out each time. This was especially irritating since the "w-o" menu (third menu) had fewer than 20 entries and would have accommodated the new word. A comparison of the two columns of Figure 7 suggests that for the text held in common by the two tests, approximately 30 words had to be spelled out by the second algorithm, which were selected by menu in the first algorithm because it added new words to its data sets as they were encountered.

Sentence position algorithm
number sentences: 97     number of words: 884

                 words      %      total
first menu:       350     39.6%    39.6%
second menu:      373     42.2%    81.8%
third menu:       109     12.3%    94.1%
fourth menu:        2      0.2%    94.3%
spelled:           43      4.8%    99.2%
numbers:            7      0.8%   100%

Frequent word/left context algorithm
number sentences: 100    number of words: 895

                 words      %      total
first menu:       448     50%      50%
second menu:      280     31.3%    81.3%
third menu:        83      9.3%    90.6%
fourth menu:        1      0.1%    90.7%
"spelled":         83      9.3%   100%

Figure 7.

PROPOSED EXTENSIONS

We have several plans for the future, most of them involving the second algorithm. Our first task is to increase the number of sentences in the Sherri data to 3000 and determine how much (if at all) an enlarged base of experience improves the ability of the algorithm to predict the desired word on the first try. In its present form, the system is reliable in its predictions after several hundred sentences by the user have been processed. We intend to take something like the Brown corpus for American English and from it create a vanilla-flavored predictor as a start-up version for a new user, with facilities built in to have the user's own language patterns gradually outweigh the Brown corpus initialization as they are input.

Eventually the Brown corpus would have essentially no effect, or at least no effect overriding the user's individual use of language (it might serve as a basic dictionary for text vocabulary not yet seen from the user).
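The paper does not say how the user's patterns would come to outweigh the start-up corpus. One possible way, shown here purely as an illustration, is to treat the general corpus as a fixed amount of pseudo-count mass added to the user's raw counts, so that its influence shrinks as the user's counts grow. The function name `blended_menu` and the `prior_weight` parameter are hypothetical.

```python
from collections import Counter


def blended_menu(user_counts, prior_counts, menu_size=20, prior_weight=100.0):
    """Rank words by user count plus a fixed share of prior mass.
    The prior contributes prior_weight pseudo-counts in total, divided
    among words in proportion to their frequency in the general corpus."""
    prior_total = sum(prior_counts.values()) or 1
    scores = {}
    for word in set(user_counts) | set(prior_counts):
        prior_share = prior_weight * prior_counts.get(word, 0) / prior_total
        scores[word] = user_counts.get(word, 0) + prior_share
    # Highest score first; ties broken alphabetically for a stable menu.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:menu_size]


# Hypothetical counts standing in for a general corpus and one user's history.
prior = Counter({"the": 70, "of": 30})
new_user_menu = blended_menu(Counter(), prior)
experienced_menu = blended_menu(Counter({"Lois": 200, "the": 50}), prior)
```

With no user history the menu is the general corpus ranking; once the user has produced enough text, words such as a personal name can rise above the corpus words, while unseen corpus words remain available lower in the ranking, much like the "basic dictionary" role mentioned above.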

We intend to investigate what effect generating sentences while using the system has on our collaborator. To date, she has obligingly been willing to continue to use a typewriter to generate text, but she does own a personal computer and is able to use a mouse. Our own experience in entering her sentences on the system has made it clear that in many instances she would have expressed the same ideas more rapidly on the system with a slight change in wording. Since the preferred words and patterns are derived by the system from her own language history, they should feel normal and natural to her and could influence her to modify her intentions in generating a sentence. On the other hand, a different handicapped individual (a quadriplegic) has informed us that ease of mechanical production of a sentence has little or no effect on his choice of words, and that would appear to be the case for our collaborator while she uses the typewriter.

Finally, we wish to make use of the much larger amounts of memory available on personal computers by taking account of the followers for many of the moderate-frequency words. For example, in the sentence "would you be able..." the word "able" is not high frequency. Nevertheless, the system could easily deduce what following word to expect, since every known occurrence of "able" is followed by "to". As it happens, "to" is one of the top 20 most frequent words and hence fortuitously is on the default menu after the non-high-frequency word "able", but there are many other examples where the system is not so lucky. For instance, "pick" is usually followed by "up" in the Sherri data, but "pick" is low frequency and "up" is not on the default first menu. Similarly, "think" is a high-frequency word and has a well developed set of followers. "Thinks" and "thought" are not high-frequency and hence are followed by the default first menu. Yet virtually every follower for "thinks" and "thought" in the Sherri data happens to belong to the set of followers for "think". We believe that by storing information on moderate frequency words with strongly associated followers and on clusters of verb forms we may significantly improve the success of the first menu.

RELATED WORK

That a small number of words account for a large proportion of the total verbiage in conversation has been known for some time [Kucera and Francis, 1967].

The idea of using the first several letters typed by a handicapped individual to anticipate the next desired word has been used in numerous systems (e.g., [Gibler and Childress, 1982], [Pickering et al., 1984]). The Gibler and Childress system is typical in that it uses a few-thousand-word vocabulary drawn from the general public, plus a few hundred words specific to the user of the system. The user had to type the first two letters before the system provided a menu of words beginning with the letter pair. If the desired word was not on the menu, the user had to spell the word out. It was felt that one letter was not informative enough to warrant a menu. Furthermore, Gibler and Childress showed that increasing the system vocabulary degraded the performance of their system, and they recommended limitation of the vocabulary for human factors reasons.

By contrast, our system costs the user no more effort in terms of selecting the first two letters - if indeed they have needed to go that far; 80 per cent of the time, they haven't needed to provide two letters. Further, there is no question that for our system, allowing the vocabulary to grow is of benefit both to system performance and to user satisfaction.


persons conversant in the Bliss communications system. Communication with Bliss involves a high degree of interpretation by the "listener", and Galliers reports an impressive 75 per cent success rate in automating such interpretation. The Galliers system is single-subject, as ours is, and it does use past history to facilitate interpretation. It was, however, limited to a very small domain for the experiment described.

One statistic cited by this last paper was that the same text produced from the Bliss communication, had it been produced by typing into a word processing system, would have required three times as many key-press operations. Our own ratio of key-press operations to characters produced was 45 per cent for the sentence position algorithm. That is, on average it took 45 presses of a mouse button to produce 100 characters. Part of the reason for such a high ratio has to do with punctuation, capitalization, and special screens such as the number screen, which requires not only the same number of presses of the button as there are digits, for example, but additional presses of the button to summon the screen and quit the menu. But primarily the ratio seems to derive from the fact that many of the words in any text are short - "a", "to", "the", "of", "in", and "on" being examples from this very paragraph. If the first menu does not contain a desired two-letter word, one has to spell the first letter and then make a selection from the second menu - requiring two presses of a button. By contrast, Bliss users commonly use a telegraphic style of communication and omit function words altogether.
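The effect of short words on the press-to-character ratio can be made concrete with a small sketch. It rests on assumptions not stated in the text: that an unknown word costs one press per letter to spell out, and that each word contributes its letters plus one space to the character count. Punctuation, capitalization and special screens are ignored, and the function names and sample menus are hypothetical.

```python
def presses_for_word(word, first_menu, second_menus):
    """Button presses needed to produce one word under a two-menu scheme."""
    if word in first_menu:
        return 1  # a single selection from the first menu
    if word in second_menus.get(word[0], ()):
        return 2  # first letter, then a selection from the second menu
    # Assumption: a word on neither menu is spelled out, one press per letter.
    return len(word)


def press_ratio(words, first_menu, second_menus):
    """Ratio of button presses to characters produced for a list of words."""
    presses = sum(presses_for_word(w, first_menu, second_menus) for w in words)
    # Assumption: each word yields its letters plus one following space.
    characters = sum(len(w) + 1 for w in words)
    return presses / characters
```

Under these assumptions, a two-letter word such as "on" that is missing from the first menu costs two presses for three characters, a ratio of about 67 per cent for that word, whereas a seven-letter word found on the first menu costs one press for eight characters.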

CONCLUSION

In summary, evidence exists that for a system built around a single user's language, a prediction scheme that simply anticipated fifty or so words would on average be correct about half the time. Limiting such a system to only the top 20 most frequent words would give a success rate of about 30 per cent. However, not all of the high frequency words are distributed evenly by sentence position. A system that offers the top 20 most frequently occurring words for each position of a sentence was successful about 40 per cent of the time on the next 97 sentences. Allowing a user to reject the first set of words by giving the first letter of the desired word and offering the 20 most frequent words beginning with that letter resulted in success for the combined first and second menus 82 per cent of the time.
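The two-menu scheme summarized above can be sketched as follows. This is an illustrative reconstruction, not the code of the system evaluated in the paper: the function names, the whitespace tokenization, and the treatment of ties in frequency are assumptions. The first menu holds the most frequent words for the current sentence position, and the second menu holds the most frequent words beginning with the letter the user supplies.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count words by sentence position and by initial letter."""
    by_position = defaultdict(Counter)  # position in sentence -> word counts
    by_letter = defaultdict(Counter)    # initial letter -> word counts
    for sentence in sentences:
        for position, word in enumerate(sentence.lower().split()):
            by_position[position][word] += 1
            by_letter[word[0]][word] += 1
    return by_position, by_letter


def top(counter, size):
    """The `size` most frequent words in a Counter."""
    return [word for word, _ in counter.most_common(size)]


def predict(word, position, by_position, by_letter, size=20):
    """Return 1 if the first menu offers the word, 2 if the second menu
    does after the user gives its first letter, and None otherwise."""
    if word in top(by_position[position], size):
        return 1
    if word in top(by_letter[word[0]], size):
        return 2
    return None


def success_rate(sentences, by_position, by_letter, size=20):
    """Fraction of words in held-out sentences found on either menu."""
    hits = total = 0
    for sentence in sentences:
        for position, word in enumerate(sentence.lower().split()):
            total += 1
            if predict(word, position, by_position, by_letter, size) is not None:
                hits += 1
    return hits / total if total else 0.0
```

Training on earlier sentences and calling `success_rate` on later, unseen ones mirrors the kind of evaluation reported above; a word the system has never seen returns `None` from `predict` and counts as a miss.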

After a training body of 1750 sentences (14,669 words), with a vocabulary of 1512 words, it was still the case that about six per cent of the desired words were unknown to the system.

An alternative algorithm for the first offering of 20 words, based primarily on the right hand contexts of the high frequency words, is successful on the first guess 50 per cent of the time.

REFERENCES

Boggess, Lois and Thomas M. English, The HWYE speech recognition system: a user-specific model for expectation-based recognition, in Proceedings of the 25th Southeast Regional Conference of the ACM, Birmingham, 1987.

Chow, C. L. A mouse-driven menu-based text prosthesis for the speech handicapped, M.C.S. project report, Mississippi State University, 1986.

English, T. M. and Lois Boggess, A grammatical approach to reducing the statistical sparsity of language models in natural domains, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Tokyo, 1986.

Galliers, Julia, AI for special needs - an "intelligent" communication aid for Bliss users, Applied Artificial Intelligence, 1(1):77-86, 1987.

Gibler, D. C. and D. S. Childress, Language anticipation with a computer based scanning aid, Proceedings of the IEEE Computer Workshop on Computers to Aid the Handicapped, 1982.

Kucera, H. and W. N. Francis, Computational analysis of present-day American English. Brown University Press, 1967.

Pickering, J., J. L. Arnott, J. G. Wolff, and A. L. Swiffin, Prediction and adaptation in a communication aid for the disabled, Proceedings of the IFIP Conference on Human-Computer Interaction, London, 1984.

Figure

Figure 3: User has selected "a"
Figure 5. First- and second- followers for "I"
Figure 6. Growth of abstracted fourth-order Marker models
