TWO SIMPLE PREDICTION ALGORITHMS TO FACILITATE TEXT PRODUCTION
Lois Boggess
P.O. Drawer CS, Mississippi State University, Mississippi State, MS 39762
ABSTRACT
Several simple prediction schemes are presented for systems intended to facilitate text production for handicapped individuals. The schemes are based on single-subject language models, where the system is self-adapting to the past language use of the subject. Sentence position, the immediately preceding one or two words, and initial letters of the desired word are cues which may be used by the systems.
INTRODUCTION
For some years we have been investigating the use of a sizeable sample of a particular individual's language habits in predicting future language use for that individual. The research has taken two directions.
One of these, the HWYE (Hear What You Expect) system, builds a large language model of the past language history of the individual, with special emphasis on the most frequent words of that person, and the result is used in speech recognition. In studying the language model developed by the HWYE system, several simple predictive schemes were noted which are capable of anticipating, during the generation of a sentence, a small set of words from which the next desired word can be selected. The two schemes described here are used for text generation (not speech recognition) in a format that could be of use to a physically handicapped person; hence the schemes have no right context available. One of the schemes does use left context, and the other uses only sentence position as "context". Both are implemented on IBM-PC systems with minimal memory requirements.
MOTIVATION
One hundred English words account for 47 per cent of the Brown corpus (about one million words of American English text taken from a wide range of sources). It seems reasonable to suppose that a single individual might in fact require fewer words to account for a large proportion of generated text. From our work on the HWYE system it was known that 75 words accounted for half of all the text of Vanity Fair, a 300,000 word Victorian English novel by Thackeray (which incorporated a fairly involved syntax, much embedded quotation, and passages in dialect and in French) [English and Boggess, 1986]. We further found that 50 words accounted for half of all the verbiage in a 20,000 word set of sentences provided by an individual who collaborated with us. This latter corpus, called the Sherri data, is a set of texts provided by a speech-handicapped individual who uses a typewriter to
You said something about a magazine that <name1> had about computers that I might like to borrow.
I would some time.
I think we have to pick up the children while <name2> is in the hospital.
I want to visit her in the hospital.
But you have to lift me up to the window for me to see the baby.
Well, it's May first now. Help!
I thought it would not be so busy but it looks like it might be now.
Figure 1. Sample set of contiguous sentences in Sherri data
It seems reasonable to suppose that for conversational English, approximately 50 words may account for half of the verbiage of most English users. From the standpoint of human factors, an argument could be made that one should simply put the 50 words up on the screen with the alphabet and thus be assured that half of all the words desired by the user were instantly available, in known locations that the user would quickly become accustomed to. Constantly changing menus introduce an element of user fatigue [Gibler and Childress, 1982]. That argument may especially make sense as larger screens with more lines per screen and more characters per line become more common.
If we limit ourselves to the 20 most frequent words as a constant menu, only about 30 per cent of the user's verbiage is accounted for. However, it was observed, while working with the HWYE system, that if one looked at the top 20 words for any given sentence position, one did not see the same set of words occurring. Clearly the high frequency words (the set that comprises half of word use) are mildly sensitive to "context" even when "context" is so broadly defined as sentence position. Different subsets of the 50 member set of high frequency words appear in the set of 20 most frequent words for a given sentence position. Moreover, after processing approximately 2000 sentences from the user, it was still the case that some of the top 20 words for a given position were not members of the high frequency set at all. For example, the word "they", a member of the menu for the first sentence position (see Figure 2) and hence one of the 20 most frequent words to start a sentence, is not a member of the global high frequency set.
A preliminary analysis by English suggested that, whereas a constant "prediction" of the 20 most frequent words would yield a success rate of 30 per cent, predicting the 20 most frequent words per position in sentence would yield a success rate of 40 per cent.
"CONTEXT" AS SENTENCE POSITION
The simplest scheme, which has been built as a prototype on an IBM PC with two floppy disk drives, presents the user with the 20 most frequent words that the user has employed at whatever position in a sentence is current. For example, Figure 2 shows the screen presented to the user at the beginning of production of a sentence. On the left is a list of the 20 words which that
[Figure 2 screen. Left-hand menu: 1 but, 2 can, 3 could, 4 do, 5 he, 6 how, 7 I, 8 I'm, 9 if, 10 it, 11 it's, 12 Lois, 13 she, 14 that, 15 the, 16 they, 17 we, 18 what, 19 when, 20 you. Function keys: SPELL, CAPITAL, PUNCTUATION, HELP-MENUS, ENDING, NUMBER, SPECIAL, REVIEW, HARD-COPY, SAVE-SENT, NEW, ERASE, QUIT. Letter keys: the alphabet. Bottom line: "NEW SENTENCE:"]
Figure 2. Initial Screen
and functions is made by mouse, though the actual selection mechanism is separated from the bulk of the code so that replacement with another selection mechanism should be relatively easy to implement.) The sentence is built at the bottom of the screen. If the user selects a word from the menu at the left, it is placed in first position in the sentence, and a second menu, consisting of the 20 most frequent words that the user has used in second place in a sentence, appears in the left portion of the screen. After a second word has been produced and added to the sentence, a third menu, consisting of the 20 most frequent words for that user in third place in a sentence, is offered, and so on.
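The position-based menus just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's code: the class name `PositionMenus` and its methods are hypothetical, sentences are assumed to arrive as lists of words, and the alphabetical ordering of each menu is inferred from the screen figures rather than stated in the text.

```python
from collections import Counter, defaultdict

MENU_SIZE = 20  # menu length used in the paper


class PositionMenus:
    """Per-position word counts for a single user (hypothetical sketch)."""

    def __init__(self):
        # counts[i] maps each word to how often it was the i-th word of a sentence
        self.counts = defaultdict(Counter)

    def record_sentence(self, words):
        """Update the counts from one sentence given as a list of words."""
        for position, word in enumerate(words):
            self.counts[position][word] += 1

    def menu(self, position, size=MENU_SIZE):
        """Up to `size` most frequent words for this position, alphabetized."""
        top = [w for w, _ in self.counts[position].most_common(size)]
        return sorted(top, key=str.lower)


menus = PositionMenus()
for sentence in ["I want to visit her",
                 "I think we have to go",
                 "But you have to lift me"]:
    menus.record_sentence(sentence.split())

first_menu = menus.menu(0)    # words that have started a sentence
second_menu = menus.menu(1)   # words that have appeared in second place
```

Because the counts are updated sentence by sentence, a word can enter a position's top 20 as the user's history grows, which is the adaptive behavior the scheme relies on.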
At any time the user may reject the left-hand menu by selecting a letter of the alphabet. Figure 3 shows the screen after the user has produced two words of a sentence and has begun to spell a third word by selecting the letter "a". At this point, the 20 most frequently used words beginning with "a" have been offered at the left. If the desired word is not in the list, the user continues by selecting the second letter of the desired word (in this case, "n"). The left-hand menu becomes the 20 most frequently used words beginning with the pair of letters given so far. As is shown in Figure 4, there are times when fewer than 20 words of a given two-letter starting combination have been encountered from the user's past history, in which case this algorithm offers a shortened list.
In the case illustrated, the desired word was on the list. If it were not, the user would have had to spell out the entire word, and it would have been entered into the sentence. In either case, the system subsequently returns to offering the menu of most-frequently-used words for the fourth position, and continues in similar fashion to the end of the sentence.
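The letter-by-letter narrowing can be illustrated the same way. The sketch below assumes a single table of overall word counts for the user; the function name `prefix_menu`, the case-insensitive matching, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter

MENU_SIZE = 20  # menu length used in the paper


def prefix_menu(word_counts, prefix, size=MENU_SIZE):
    """Most frequent known words starting with `prefix`, alphabetized."""
    p = prefix.lower()
    matches = Counter({w: c for w, c in word_counts.items()
                       if w.lower().startswith(p)})
    top = [w for w, _ in matches.most_common(size)]
    return sorted(top, key=str.lower)


# Toy counts; a real system would use the user's accumulated history.
word_counts = Counter({"a": 40, "and": 35, "any": 9, "another": 4,
                       "answer": 3, "Anita": 2, "but": 20})

after_a = prefix_menu(word_counts, "a", size=3)   # three most frequent "a" words
after_an = prefix_menu(word_counts, "an")         # fewer than 20 known: short list
```

When fewer than `size` words share the prefix, the function simply returns the shorter list, mirroring the shortened menu described for two-letter combinations.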
[Figure 3 screen. Left-hand menu: 1 a, 2 able, 3 about, 4 after, 5 afternoon, 6 again, 7 all, 8 am, 9 an, 10 and, (entry 11 illegible), 12 apple, 13 April, 14 are, 15 around, 16 as, 17 ask, 18 asked, 19 at, 20 aunt. Letter keys: the alphabet. Bottom line: "NEW SENTENCE: I have"]
Figure 3. User has selected "a"
[Figure 4 screen. Left-hand menu of eleven words beginning with "an", including animal, animals, Anita, anniversary, another, answer, any, anyone, and anything. Letter keys: the alphabet. Bottom line: "NEW SENTENCE: I have"]
Figure 4.
The system keeps track of how often a word has been used and of how many times it has occurred in each position in a sentence, so that from time to time a word is promoted to one of the top 20 alphabetic or top 20 position-related sets of words. For details on the file organization scheme that allows this to be done in real time, see Wei [1987]. Details on the mouse-based implementation for IBM PCs are available in Chow [1986].
A SECOND ALGORITHM
An alternative predictive algorithm has been implemented which replaces the sentence-position-based first menu. It pays special attention to the 50 most frequently used words in the individual's vocabulary (the high-frequency words) and to the words most likely to follow them. By virtue of their frequency, these are precisely the words about which the most is known, with the greatest confidence, after a relatively small body of input such as a few thousand sentences.
For each of the 50 high-frequency words, a list is kept of the 20 most frequent words to follow that word. Let us call these the first-order followers. For each of the first-order followers, there is a list of second-order followers: words known to have followed the two-word sequence consisting of the high-frequency word and its first-order follower.
For example, the word "I" is a high-frequency word. The first-order followers for "I" include the word "would". The second-order followers for "I would" include the word "like". (See Figure 5.) The second-order followers for "I would" also include many one-time-only followers, so the system maintains a threshold for the number of occurrences below which a word is not included in the list of second-order followers. The reasoning is that a word's having occurred only once in an environment that by definition occurs frequently may be taken as counter-evidence that the word should be predicted.
[Figure 5 diagram. Legible words: think, don't, hope, was, wish, like, will, have, want, wonder, got, I'll, the, we, it, it's, you, really, want, have.]
Figure 5. First- and second-followers for "I"
Rather than predict a word with low reliability, one of two alternatives is taken. If the first-order follower is itself a high-frequency word, then low-reliability second-order followers may be replaced with the first-order follower's own followers. ("Would" is a first-order follower of "I" and is itself a high-frequency word. There are relatively few reliable second-order followers to "would" in the left context of "I", so the list is augmented with first-order followers of "would" to round out a list of 20 words.) The other alternative, taken when the first-order follower is not a high-frequency word, is to fill out any short list of second-order words with the high-frequency words themselves.
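The follower lists and the two fallback rules can be made concrete with a small sketch. It is an illustration under several assumptions, not the implemented system: the class `FollowerModel` is hypothetical, the threshold value of 2 is a guess (the paper does not state the cutoff), and for brevity second-order counts are kept for every follower of a high-frequency word rather than only for its top 20.

```python
from collections import Counter, defaultdict

MENU_SIZE = 20   # menu length used in the paper
THRESHOLD = 2    # hypothetical reliability cutoff; the paper gives no value


class FollowerModel:
    """First- and second-order followers of high-frequency words (sketch)."""

    def __init__(self, sentences, n_high=50):
        unigrams = Counter(w for s in sentences for w in s)
        self.high = [w for w, _ in unigrams.most_common(n_high)]
        self.high_set = set(self.high)
        self.first = defaultdict(Counter)    # w1 -> counts of the next word
        self.second = defaultdict(Counter)   # (w1, w2) -> counts of the next word
        for s in sentences:
            for i in range(len(s) - 1):
                if s[i] not in self.high_set:
                    continue  # only high-frequency words head a follower list
                self.first[s[i]][s[i + 1]] += 1
                if i + 2 < len(s):
                    self.second[(s[i], s[i + 1])][s[i + 2]] += 1

    @staticmethod
    def _fill(menu, candidates):
        """Append candidates not already present until the menu is full."""
        for w in candidates:
            if len(menu) >= MENU_SIZE:
                break
            if w not in menu:
                menu.append(w)
        return menu

    def menu(self, prev2, prev1):
        """Menu offered after the left context (prev2, prev1)."""
        seen = self.second.get((prev2, prev1), Counter())
        menu = [w for w, c in seen.most_common(MENU_SIZE) if c >= THRESHOLD]
        if prev1 in self.high_set:
            # Back off to prev1's own first-order followers.
            own = self.first.get(prev1, Counter())
            return self._fill(menu, [w for w, _ in own.most_common(MENU_SIZE)])
        # Otherwise pad with the high-frequency words themselves.
        return self._fill(menu, self.high)


corpus = [s.split() for s in ["I would like to go", "I would like a book",
                              "I would not go", "you would be able to go",
                              "I think I would go"]]
model = FollowerModel(corpus, n_high=3)        # tiny corpus, so only 3 "high" words
after_i_would = model.menu("I", "would")       # "like" is reliable; rest from "would"
after_would_be = model.menu("would", "be")     # no reliable followers; padded
```

In the toy run, "like" is the only second-order follower of "I would" that meets the threshold, so the remainder of the menu comes from the followers of "would"; after "would be" nothing is reliable and "be" is not high-frequency, so the menu is padded with the high-frequency words.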
This algorithm is related to, but takes less memory and is less powerful than, a full-blown second-order Markov model. Each state in a second-order (trigram) Markov model is uniquely determined by the previous two inputs. For an input vocabulary of 2000 words, the number of mathematically possible states in a trigram Markov model is 4,000,000, with more than 8 billion arcs interconnecting the states. Fortunately, in the real world most of these mathematically possible states and arcs do not actually occur, but a trigram model for the real world possibilities is still quite large.
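The size estimate follows from simple counting: one state per ordered pair of previous words, and from each state one possible arc per vocabulary word. A short check, with the function name `trigram_bounds` chosen only for illustration:

```python
def trigram_bounds(vocabulary_size):
    """Upper bounds on states and arcs of a second-order (trigram) Markov model."""
    states = vocabulary_size ** 2      # one state per ordered pair of previous words
    arcs = states * vocabulary_size    # each state may be followed by any word
    return states, arcs


states, arcs = trigram_bounds(2000)
print(f"{states:,} states, {arcs:,} arcs")  # 4,000,000 states, 8,000,000,000 arcs
```

For the 50-word high-frequency vocabulary the same bound gives 2,500 states and 125,000 arcs, which suggests why restricting attention to those words keeps the memory requirement small.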
We experimented with abstracting the input vocabulary by restricting it to the 50 highest-frequency words plus the
            Sherri data               Thackeray data
words       new states   new arcs     new states   new arcs
 1000          527          677          639          830
 2000          469          620          624          818
 3000          471          636          476          705
 4000          399          562          467          716
 5000          397          566          463          714
 6000          391          579          437          668
 7000          337          507          389          642
 8000          311          476          370          628
 9000          323          500          361          612
10000          285          486          384          629
11000          329          518          348          601
12000          278          448          331          588
13000          276          445          310          543
14000          240          408          291          530
15000          248          425          287          529
16000          244          420          290          533
17000          243          414          269          497
18000          259          446          234          468
Figure 6. Growth of abstracted fourth-order Markov models
new words of text, after 17000 words of input. This was true for both the Sherri data (conversational English) and the more formal Thackeray data. Moreover, the fourth-order Markov model for the abstracted Thackeray data continued to grow. After 100,000 words of input, with a model of approximately 22,000 states and approximately 45,000 arcs, the rate of growth was still more than 1,000 states and 3,000 arcs per 10,000 words of input.
For this particular implementation, however, neither a full-blown Markov model using total vocabulary nor an abstract model using the 50-word vocabulary seemed appropriate. On the one hand, models of the entire vocabulary confirmed that many multiple-word sequences did occur regularly. Nevertheless, for any but the simplest order Markov models (orders zero and one), the vast bulk of the networks were taken by word combinations that occurred only once. On the other hand, restricting the predictive mechanism to only the high-frequency words obviously left out some of the regularly occurring word combinations. Our first- and second-follower algorithm described above allows lower frequency words to be predicted when they occur regularly in combination with high-frequency words.
PREDICTIVE CAPABILITIES
The data used to test the predictive capabilities of the system were typescripts provided by the user, who was utilizing a manual typewriter; it follows that the results were not biased by the user's favoring sentence patterns that the system itself provided. The system had been given 1750 prior sentences produced by the user and the data collected were for the performance of the system over the next 97 sentences. The 1750 sentences were 14,669 words in length with a vocabulary of 1512 words. Twelve sentences of the 1750 were a single word in length (e.g. "yeah", "no" and "gesundheit") and 51 were of length 20 or greater. Average length of sentence for the initial body was 8.4 words per sentence. The first 200 sentences included transcriptions of oral sentences, which were much shorter on average, since the user is speech handicapped. If the first 200 sentences are omitted, the average sentence length is 8.6 for the following 1550 sentences.
Of the 884 words in the 97 test sentences, 350 were presented on the first menu, 373 were presented on the second menu (after one letter had been spelled), 109 were presented on the third menu (after two letters had been spelled), 2 were presented on the fourth menu (after three letters had been spelled), 43 were spelled out in their entirety, and 7 were numbers in digital form, produced using the number screen of the system.
From the above, it is obvious that the device of predicting the 20 most frequent words by sentence position is successful 39.6 per cent of the time; 42.2 per cent of the time, the desired word is among the 20 most frequent words of a given initial letter but not in the 20 most frequent words by position; combining these two facts, we see that 81.8 per cent of the time, this simple prediction scheme presents the desired word on a first or second selection. The desired word is offered in the first, second, or third menu 94.1 per cent of the time, and most of the rest of the time (5.7 per cent of total), the desired word is unknown to the system and is "spelled out", where "spelling" includes producing numbers.
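The percentages quoted above follow directly from the raw counts. The short script below recomputes them; the function name `menu_success` and the dictionary layout are illustrative choices, while the counts are those reported for the sentence-position test.

```python
def menu_success(counts):
    """Per-category and cumulative percentages, rounded to one decimal place."""
    total = sum(counts.values())
    cumulative = 0
    rows = []
    for label, n in counts.items():
        cumulative += n
        rows.append((label, n,
                     round(100 * n / total, 1),
                     round(100 * cumulative / total, 1)))
    return rows


sentence_position = {"first menu": 350, "second menu": 373, "third menu": 109,
                     "fourth menu": 2, "spelled": 43, "numbers": 7}

rows = menu_success(sentence_position)
for label, n, pct, cum in rows:
    print(f"{label:12s} {n:4d} {pct:5.1f}% {cum:6.1f}%")
```

The cumulative figures reproduce the 39.6, 81.8, and 94.1 per cent quoted in the text. The "spelled" row computes to 4.9 per cent, one tenth above the 4.8 per cent printed in the paper's summary table, although the cumulative 99.2 per cent agrees.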
Although the fourth menu, consisting of words with a three-letter initial sequence, presently has a low success rate, it is precisely this category that we expect to see improve as more of the user's words become known to the system through spelling. That is, as time passes, we expect the user to have to resort to complete spelling less and less because the known vocabulary will include more and more of the actual vocabulary of the user. Many of the new words will be low frequency words that we would expect to find on the menu for three-letter combinations after they are known.
The second algorithm, using first- and second-followers of the high-frequency words, was run on 100 sentences, the shortest of which was "Help!" (94 of the 97 test sentences for the first algorithm were represented in the test set for the second.) There were 895 words in the sample, of which 448 were presented on the first menu, 280 were presented on the second (after one letter had been spelled out), 83 on the third (after two letters were spelled), 1 on the fourth, and 83 were spelled out in their entirety (this category included numbers).
Running the second test gave us a very quick appreciation for the value of adding new words to the system as they are encountered, since this implementation of the second algorithm did not. One especially striking example was a word beginning with "w-o" which had never been used before, but which occurred five times in the 100 test sentences and had to be spelled out each time. This was especially irritating since the "w-o" menu (third menu) had fewer than 20 entries and would have accommodated the new word. A comparison of the two columns of Figure 7 suggests that for the text held in common by the two tests, approximately 30 words had to be spelled out by the second algorithm, which were selected by menu in the first algorithm because it added new words to its data sets as they were encountered.
Sentence position algorithm
number of sentences: 97     number of words: 884
                words      %      total
first menu:      350     39.6%    39.6%
second menu:     373     42.2%    81.8%
third menu:      109     12.3%    94.1%
fourth menu:       2      0.2%    94.3%
spelled:          43      4.8%    99.2%
numbers:           7      0.8%   100%

Frequent word/left context algorithm
number of sentences: 100    number of words: 895
                words      %      total
first menu:      448     50%      50%
second menu:     280     31.3%    81.3%
third menu:       83      9.3%    90.6%
fourth menu:       1      0.1%    90.7%
"spelled":        83      9.3%   100%
Figure 7.

PROPOSED EXTENSIONS
We have several plans for the future, most of them involving the second algorithm. Our first task is to increase the number of sentences in the Sherri data to 3000 and determine how much (if at all) an enlarged base of experience improves the ability of the algorithm to predict the desired word on the first try. In its present form, the system is reliable in its predictions after several hundred sentences by the user have been processed. We intend to take something like the Brown corpus for American English and from it create a vanilla-flavored predictor as a start-up version for a new user, with facilities built in to have the user's own language patterns gradually outweigh the Brown corpus initialization as they are input. Eventually the Brown corpus would have essentially no effect, or at least no effect overriding the user's individual use of language (it might serve as a basic dictionary for text vocabulary not yet seen from the user).
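One simple way such a start-up predictor might let the user's history outweigh the general corpus is to treat the corpus as a fixed budget of pseudo-counts added to the user's real counts. The sketch below is purely an assumption about how this could be done, since the paper proposes the idea without specifying a mechanism; `blended_menu`, `prior_mass`, and the toy frequencies are all hypothetical.

```python
from collections import Counter


def blended_menu(prior_freq, user_counts, prior_mass=100.0, size=20):
    """Rank words by user count plus a fixed pseudo-count budget from the prior.

    prior_freq:  word -> relative frequency in the general corpus (sums to about 1)
    user_counts: word -> number of times this user has produced the word
    prior_mass:  total pseudo-counts granted to the prior (hypothetical knob)
    """
    words = set(prior_freq) | set(user_counts)
    score = {w: user_counts.get(w, 0) + prior_mass * prior_freq.get(w, 0.0)
             for w in words}
    # Highest score first; ties broken alphabetically for a stable menu.
    return sorted(words, key=lambda w: (-score[w], w))[:size]


prior = {"the": 0.5, "of": 0.3, "and": 0.2}      # made-up relative frequencies
new_user = Counter()                             # no history yet
veteran = Counter({"I": 400, "Lois": 120, "the": 90})

start_up = blended_menu(prior, new_user, size=2)   # driven entirely by the prior
adapted = blended_menu(prior, veteran, size=2)     # user history now dominates
```

Because the prior contributes a constant total weight, its influence shrinks automatically as the user's counts grow, while words the user has never produced remain available at low rank.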
We intend to investigate what effect generating sentences while using the system has on our collaborator. To date, she has obligingly been willing to continue to use a typewriter to generate text, but she does own a personal computer and is able to use a mouse. Our own experience in entering her sentences on the system has made it clear that in many instances she would have expressed the same ideas more rapidly on the system with a slight change in wording. Since the preferred words and patterns are derived by the system from her own language history, they should feel normal and natural to her and could influence her to modify her intentions in generating a sentence. On the other hand, a different handicapped individual (a quadriplegic) has informed us that ease of mechanical production of a sentence has little or no effect on his choice of words, and that would appear to be the case for our collaborator while she uses the typewriter.
Finally, we wish to make use of the much larger amounts of memory available on personal computers by taking account of the followers for many of the moderate-frequency words. For example, in the sentence "would you be able..." the word "able" is not high frequency. Nevertheless, the system could easily deduce what following word to expect, since every known occurrence of "able" is followed by "to". As it happens, "to" is one of the 20 most frequent words and hence fortuitously is on the default menu after the non-high-frequency word "able", but there are many other examples where the system is not so lucky. For instance, "pick" is usually followed by "up" in the Sherri data, but "pick" is low frequency and "up" is not on the default first menu. Similarly, "think" is a high-frequency word and has a well developed set of followers. "Thinks" and "thought" are not high-frequency and hence are followed by the default first menu. Yet virtually every follower for "thinks" and "thought" in the Sherri data happens to belong to the set of followers for "think". We believe that by storing information on moderate frequency words with strongly associated followers and on clusters of verb forms we may significantly improve the success of the first menu.
RELATED WORK
That a small number of words accounts for a large proportion of the total verbiage in conversation has been known for some time [Kucera and Francis, 1967].
The idea of using the first several letters typed by a handicapped individual to anticipate the next desired word has been used in numerous systems (e.g., [Gibler and Childress, 1982], [Pickering et al., 1984]). The Gibler and Childress system is typical in that it uses a few-thousand-word vocabulary drawn from the general public, plus a few hundred words specific to the user of the system. The user had to type the first two letters before the system provided a menu of words beginning with that letter pair. If the desired word was not on the menu, the user had to spell the word out. It was felt that one letter was not informative enough to warrant a menu. Furthermore, Gibler and Childress showed that increasing the system vocabulary degraded the performance of their system, and they recommended limiting the vocabulary for human factors reasons.
By contrast, our system costs the user no more effort in terms of selecting the first two letters, if indeed they have needed to go that far; 80 per cent of the time, they haven't needed to provide two letters. Further, there is no question that for our system, allowing the vocabulary to grow is of benefit both to system performance and to user satisfaction.
Galliers [1987] describes an "intelligent" communication aid for persons conversant in the Bliss communications system. Communication with Bliss involves a high degree of interpretation by the "listener", and Galliers reports an impressive 75 per cent success rate in automating such interpretation. The Galliers system is single-subject, as ours is, and it does use past history to facilitate interpretation. It was, however, limited to a very small domain for the experiment described.
One statistic cited by this last paper was that the same text produced from the Bliss communication, had it been produced by typing into a word processing system, would have required three times as many key-press operations. Our own ratio of key-press operations to characters produced was 45 per cent for the sentence position algorithm. That is, on average it took 45 presses of a mouse button to produce 100 characters. Part of the reason for such a high ratio has to do with punctuation, capitalization, and special screens such as the number screen, which requires not only the same number of presses of the button as there are digits, for example, but additional presses of the button to summon the screen and quit the menu. But primarily the ratio seems to derive from the fact that many of the words in any text are short: "a", "to", "the", "of", "in", and "on" being examples from this very paragraph. If the first menu does not contain a desired two-letter word, one has to spell the first letter and then make a selection from the second menu, requiring two presses of a button. By contrast, Bliss users commonly use a telegraphic style of communication and omit function words altogether.
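The effect of short words on the press-to-character ratio can be seen in a simplified cost model. The sketch below is an illustration only: it ignores spaces, punctuation, capitalization and special screens, and it assumes that a word on neither menu costs one press per letter. The menus and the sample text are hypothetical.

```python
def presses_for_word(word, first_menu, second_menus):
    """Button presses under a simplified two-menu cost model.

    1 press  : the word is on the first menu
    2 presses: first letter, then a pick from that letter's menu
    otherwise: assumed spelled out at one press per letter (an assumption)
    """
    if word in first_menu:
        return 1
    if word in second_menus.get(word[0], ()):
        return 2
    return len(word)

def press_ratio(words, first_menu, second_menus):
    """Ratio of button presses to characters produced (letters only)."""
    presses = sum(presses_for_word(w, first_menu, second_menus) for w in words)
    characters = sum(len(w) for w in words)
    return presses / characters

# Hypothetical menus and text, chosen only to show the effect of short words.
first_menu = {"the", "to", "a"}
second_menus = {"o": {"of", "on"}, "s": {"system"}}
text = ["the", "system", "of", "on", "to"]
ratio = press_ratio(text, first_menu, second_menus)
print(round(ratio, 2))
```

Here "of" and "on" each cost two presses for two characters, whereas "system" costs two presses for six, so a text dominated by short function words keeps the ratio high even when every word is found on a menu.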
CONCLUSION
In summary, evidence exists that for a system built around a single user's language, a prediction scheme that simply anticipated fifty or so words would on average be correct about half the time. Limiting such a system to only the top 20 most frequent words would give a success rate of about 30 per cent. However, not all of the high frequency words are distributed evenly by sentence position. A system that offers the top 20 most frequently occurring words for each position of a sentence was successful about 40 per cent of the time on the next 97 sentences. Allowing a user to reject the first set of words by giving the first letter of the desired word and offering the 20 most frequent words beginning with that letter resulted in success for the combined first and second menus 82 per cent of the time.
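The two-menu scheme summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function names, the data structures and the training sentences are hypothetical, and only the menu size of 20 comes from the text.

```python
from collections import Counter, defaultdict

MENU_SIZE = 20  # menu size reported in the paper

def train(sentences):
    """Collect word counts by sentence position and by initial letter."""
    by_position = defaultdict(Counter)  # position in sentence -> word counts
    by_letter = defaultdict(Counter)    # initial letter -> word counts
    for sentence in sentences:
        for pos, word in enumerate(sentence.lower().split()):
            by_position[pos][word] += 1
            by_letter[word[0]][word] += 1
    return by_position, by_letter

def first_menu(by_position, pos, size=MENU_SIZE):
    """Most frequent words seen at this sentence position."""
    return [w for w, _ in by_position[pos].most_common(size)]

def second_menu(by_letter, letter, size=MENU_SIZE):
    """Most frequent words beginning with the given letter."""
    return [w for w, _ in by_letter[letter].most_common(size)]

def lookup(word, pos, by_position, by_letter):
    """Return which menu offers the word: 1, 2, or None if neither does."""
    if word in first_menu(by_position, pos):
        return 1
    if word in second_menu(by_letter, word[0]):
        return 2
    return None

# Hypothetical training sentences.
by_position, by_letter = train(["i think so", "i want to go", "you think so"])
print(lookup("i", 0, by_position, by_letter))
print(lookup("think", 0, by_position, by_letter))
print(lookup("zebra", 0, by_position, by_letter))
```

A word that returns `None` is unknown to the model and would have to be spelled out by the user.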
After a training body of 1750 sentences (14,669 words), with a vocabulary of 1512 words, it was still the case that about six per cent of the desired words were unknown to the system.
An alternative algorithm for the first offering of 20 words, based primarily on the right-hand contexts of the high frequency words, is successful on the first guess 50 per cent of the time.
REFERENCES
Boggess, Lois and Thomas M. English, The HWYE speech recognition system: a user-specific model for expectation-based recognition, in Proceedings of the 25th Southeast Regional Conference of the ACM, Birmingham, 1987.
Chow, C. L., A mouse-driven menu-based text prosthesis for the speech handicapped, M.C.S. project report, Mississippi State University, 1986.
English, T. M. and Lois Boggess, A grammatical approach to reducing the statistical sparsity of language models in natural domains, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Tokyo, 1986.
Galliers, Julia, AI for special needs - an "intelligent" communication aid for Bliss users, Applied Artificial Intelligence, 1(1):77-86, 1987.
Gibler, D. C. and D. S. Childress, Language anticipation with a computer based scanning aid, Proceedings of the IEEE Computer Workshop on Computers to Aid the Handicapped, 1982.
Kucera, H. and W. N. Francis, Computational analysis of present-day American English, Brown University Press, 1967.
Pickering, J., J. L. Arnott, J. G. Wolff, and A. L. Swiffin, Prediction and adaptation in a communication aid for the disabled, Proceedings of the IFIP Conference on Human-Computer Interaction, London, 1984.