PROBLEMS IN NATURAL-LANGUAGE INTERFACE TO DBMS WITH EXAMPLES FROM EUFID
Marjorie Templeton, John Burger
System Development Corporation, Santa Monica, California
ABSTRACT
For five years the End-User Friendly Interface to Data management (EUFID) project team at System Development Corporation worked on the design and implementation of a Natural-Language Interface (NLI) system that was to be independent of both the application and the database management system. In this paper we describe application, natural-language and database management problems involved in NLI development, with specific reference to the EUFID system as an example.
I INTRODUCTION
From 1976 to 1981 SDC was involved in the development of the End-User Friendly Interface to Data management (EUFID) system, a natural-language interface (NLI) that is designed to be independent of both the application and the underlying database management system (DBMS) [TEMP79, TEMP80, BURG80, BURG82]. The EUFID system permits users to communicate with database management systems in natural English rather than formal query languages. It is assumed that the application domain is well defined and bounded, that users share a common language to address the application, and that users may have little experience with computers or DBMSs but are competent in the application area.
At least three broad categories of issues had to be addressed during EUFID development, and it is apparent that they are common to any general natural-language interface to database management systems.
The first category involves the application: how to characterize the requirements of the human-machine dialogue and interaction, capture that information efficiently, formalize the information and incorporate that knowledge into a framework that can be used by the system. The major problems in this area are knowledge acquisition and representation. For many NLI systems, bringing up a new application requires extensive effort by system designers with cooperation from a representative set of end-users. Tools that could assist in automating this process are badly needed.

The second set of issues involves language processing techniques: how to assign constituent structure and interpretation to queries using robust and general methods that allow extension to additional lexical items, sentence types and semantic relationships. Some NLI systems distinguish the assignment of syntactic structure, or parsing, from the interpretation. Other systems, including EUFID, combine information about constituent and semantic structure into an integrated semantic grammar.
The third class involves database issues: how to actually perform the intent of the natural-language question by formulating the correct structured query and efficiently navigating through the database to retrieve the right answer. This involves a thorough understanding of the DBMS structure underlying the application, the operations and functions the query language supports, and the nature and volatility of the database.
Obviously issues in these three areas are related, and the knowledge needed to deal with them may be distributed throughout a natural-language interface system. The purpose of this paper is to show how such issues might be addressed in NLI development, with illustrations from EUFID.
II BACKGROUND
Over the past two decades a considerable amount of work has gone into the development of natural-language systems. Early developments were in the areas of text processing, syntactic parsing techniques, machine translation, and early attempts at English-language question answering systems. Several early question-answering experiments are reviewed by R. F. Simmons in [SIMM65]. Waltz has edited a collection of short papers on topics related to natural-language and artificial intelligence in a survey of NLI research [WALT77]. A survey of NLIs and evaluation of several systems with respect to their applicability to command and control environments can be found in [OSI79].
A. RELATED WORK
While few NLIs have reached the commercial marketplace, many systems have contributed to advancing the state of the art. Several representative systems and the problems they addressed are described in this section.
1. CONVERSE [KELL71] used formal syntactic analysis to generate surface- and deep-structure parsings together with formal semantic transformation rules to produce queries for a built-in relational DBMS. It was written in SDC LISP and ran on IBM 370 computers. Started in 1968, it was one of the first natural-language processors to be built for the purpose of querying a separate data management system.
2. LADDER [HEND77] was designed to access large distributed databases. It is implemented in INTERLISP, runs on a PDP-10, and can interface to different DBMSs with proper configuration. It uses a semantic grammar and, like EUFID and most NLIs, requires a different grammar to be defined for each application.
3. The Lunar Rocks system LSNLIS [WOOD72] was the first to use the Augmented Transition Network (ATN) grammar. Written in LISP, it transformed formally parsed questions into representations of the first-order predicate calculus for deductive processing against a built-in DBMS.
4. PHLIQA1 [SCHA77] uses a syntactic parser which runs as a separate pass from the semantic understanding passes. This system is mainly involved with problems of semantics and has three separate layers of semantic understanding. The layers are called "English Formal Language", "World Model Language", and "Data Base Language" and appear to correspond roughly to the "external", "conceptual", and "internal" views of data as described by C. J. Date [DATE77]. PHLIQA1 can interface to a variety of database structures and DBMSs.
5. The Programmed LANguage-based Enquiry System (PLANES) [WALT78] uses an ATN-based parser and a semantic case frame analysis to understand questions. Case frames are used to handle pronominal and elliptical reference and to generate responses to clarify partially interpreted questions.
6. REL [THOM69], initially written entirely in assembler code for an IBM 360, has been in continuous development since 1967. REL allows a user to make interactive extensions to the grammar and semantics of the system. It uses a formal grammar expressed as a set of general re-write rules with semantic transformations attached to each rule. Answers are obtained from a built-in database.
7. RENDEZVOUS [CODD74] addresses the problem of certainty regarding the machine's understanding of the user's question. It engages the user in dialogue to specify and disambiguate the question and will not route the formal query to the relational DBMS until the user is satisfied with the machine's interpretation.
8. ROBOT [HARR78] is one of the few NLI systems currently available on the commercial market. It is the basis for Cullinane's OnLine English [CULL80] and Artificial Intelligence Corporation's Intellect [EDP82]. It uses an extracted version of the database for lexical data to assist the ATN parser.
B. OVERVIEW OF EUFID
EUFID is a general purpose natural-language front-end for database management. The original design goals for EUFID were:
- to be application independent. This means that the program must be table driven. The tables contain the dictionary and semantic information and are loaded with application-specific data. It was desired that the tables could be constructed by someone other than the EUFID staff, so that users could build new applications on their own.
- to be database independent. This means that the organization of the data in the database must be representable in tables that drive the query generator.* A database reorganization that does not change the semantics of the application should be transparent to the user.
- to be DBMS independent. This means that it must be able to generate requests to different DBMSs in the DBMS's query language and that the interface of EUFID to a different DBMS should not require changes to the NLI modules. Transferring the same database with the same semantic content to another DBMS should be transparent to the natural-language users.
- to run on a mini-computer that might possibly be different from the computer with the DBMS.
- to have a fast response time, even when the question cannot be interpreted. This means it must be able quickly to recognize unanalyzable constructs.
- to handle nonstandard or poorly-formed (but, nevertheless, meaningful) questions.
- to be portable to various machines. This means that the system had to be written in a high level language; initially a customer required code to be written in FORTRAN, later we were able to use the "C" programming language.
- to support different views of the data for security purposes.

* We make a technical distinction between the words "question" and "query". A question is any string entered by the user to the EUFID analyzer, regardless of the terminating punctuation. This is consistent with the design since EUFID treats all input as a request for information. A query is a formal representation of a question in either the EUFID intermediate language IL, or in the formal query language of a DBMS.
The design which met these requirements is a modular system which uses an Intermediate Language (IL) as the output of the natural-language analysis system [BURG82]. This language represents, in many ways, the union of the capabilities of many "target" DBMS query languages.

The EUFID system consists of three major modules, not counting the DBMS (see Figure 1). The analyzer (parser) module is table driven. It is necessary only to properly build and load the tables to interface EUFID to a new application. Mapping a question from its dictionary (user) representation to DBMS representation is handled by mapping functions contained in a table and applied by a separate module, the "mapper". Each content (application dependent) word in the dictionary has one or more mapping functions defined for it. A final stage of the mapper is a query-language generator containing the syntax of IL. This stage writes a query in IL using the group/field names found by the mapper to represent the user's concepts and the structural relationships between them. This design satisfies the requirement of application independence.

Figure 1: EUFID Block Diagram (input: English question; output: response)
For each different DBMS used by a EUFID application, a "translator" module needs to be written to convert a query in IL to the equivalent in the DBMS query language. This design satisfies the requirement of DBMS independence.
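The division of labor among analyzer, mapper, and translator can be sketched roughly as follows. This is only an illustrative sketch: the function names, the stub logic, and the IL-like and target-language strings are all invented here, since the actual IL syntax and module interfaces are not given in this section. The point it tries to show is that the IL query is produced once and only the last step changes with the DBMS.

```python
from typing import Callable, Dict

def analyze(question: str) -> dict:
    """Hypothetical analyzer stub: returns a trivial 'tree' of words."""
    words = question.lower().rstrip("?.").split()
    return {"head": words[0], "modifiers": words[1:]}

def map_to_il(tree: dict) -> str:
    """Hypothetical mapper stub: writes an invented IL-like string."""
    return "GET " + tree["head"] + " WHERE " + " ".join(tree["modifiers"])

def to_quel(il: str) -> str:
    """Placeholder translator for INGRES (does not emit real QUEL)."""
    return "QUEL<" + il + ">"

def to_wwdms(il: str) -> str:
    """Placeholder translator for WWDMS (does not emit real WWDMS syntax)."""
    return "WWDMS<" + il + ">"

# One translator per DBMS; analyzer and mapper are untouched by this table.
TRANSLATORS: Dict[str, Callable[[str], str]] = {
    "INGRES": to_quel,
    "WWDMS": to_wwdms,
}

def answer(question: str, dbms: str) -> str:
    """Run the three stages; only the final stage depends on the DBMS."""
    il = map_to_il(analyze(question))
    return TRANSLATORS[dbms](il)
```

Under these assumptions, supporting another DBMS would amount to adding one entry to `TRANSLATORS`.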
The following subsections describe each of the modules of the EUFID system, and give our motivation for design.
1. Application Definitions
Bringing up a new application is a long and complex process. The database definition must be transmitted to EUFID. A large corpus of "typical" user questions must be collected from a representative set of users and from these the dictionary and mapping tables are designed. A "semantic graph" is defined for the application. This graph is implicitly realized in the dictionary where the nodes of the graph are the definitions of English content words and the connectivity of the graph is implied by the case-structure relationships defined for the nodes.
All dictionary and mapping-function data are then entered into computer files which are processed by the Application Definition Module (ADM) to produce the run-time tables. These final tables are complex structures of pointers, character strings, and index tables, designed to decrease access time to the information required by the analyzer and mapper modules.
The ADM typically needs to be run several times to "debug" the tables. EUFID interfaces to three applications currently exist. Building the tables for each new application took less time than for the previous one, but it still requires several staff-months to bring up a new application.
a. User-View Representation
All information on the user's view of the database is kept in the dictionary. The dictionary consists of two kinds of words and definitions. Function words, such as prepositions and conjunctions, are pre-stored in each application's dictionary and are used by the analyzer for direction on how to connect the semantic-graph nodes during analysis. Content words are application dependent. The definitions of content words are semantic-graph nodes. The connectivity of the graph is indicated by semantic case slots and pointers contained in the nodes. A form of semantic-case is used to indicate the attributes of an entity (e.g., adjectives, prepositional phrases, and other modifiers of a noun).
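As a rough illustration of a dictionary whose content-word definitions are semantic-graph nodes, consider the sketch below. The class name, the case names, and the sample entries are hypothetical and chosen only for illustration; the paper does not give the actual table layout. The `edges` helper shows how the graph connectivity is implicit in the case slots rather than stored separately.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ContentNode:
    """One definition of a content word (a semantic-graph node)."""
    concept: str
    # case slot name -> concepts allowed to fill that slot (assumed layout)
    cases: Dict[str, List[str]] = field(default_factory=dict)

# Function words are pre-stored and shared by every application.
FUNCTION_WORDS = {"of", "in", "to", "from", "and", "or"}

# Content words are application dependent; a word may have several definitions.
DICTIONARY: Dict[str, List[ContentNode]] = {
    "company": [ContentNode("COMPANY", {"location": ["NEIGHBORHOOD"]})],
    "neighborhood": [ContentNode("NEIGHBORHOOD")],
    "ship": [
        ContentNode("VESSEL"),
        ContentNode("SEND", {"agent": ["COMPANY"], "destination": ["COMPANY"]}),
    ],
}

def edges(dictionary: Dict[str, List[ContentNode]]) -> List[Tuple[str, str, str]]:
    """Derive the implicit graph as (head concept, case, filler concept) triples."""
    return sorted(
        (node.concept, case, filler)
        for definitions in dictionary.values()
        for node in definitions
        for case, fillers in node.cases.items()
        for filler in fillers
    )
```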
b. Mapping Functions
The list of mapping functions is derived from the dictionary. Every possible connection of every node has to be considered. Frequently, design considerations in the mapping-function list necessitate going back and modifying the content of the dictionary. This is an example of the overlap of the linguistic and database issues in assigning an interpretation to a question.
c. Database Representation
The structure of the data in the user's database is represented in two tables, called the CAN (for canonical) and REL (for relationships) tables. Taking advantage of the fact that any database can be represented in relational form, EUFID lists each database group as if it were a relation. Group-to-group linkage (represented in the REL table) is dealt with as if a join* were necessary to implement the link. For hierarchical and network DBMSs the join will not be needed: the link is "wired in" to the database structure. EUFID nevertheless assumes a join mainly in order to facilitate the writing of group-to-group links in IL, which is a relational language. The CAN table includes database-specific information for each field (attribute) of each group (relation), such as field name, containing group, name of the domain from which the attribute gets its values, and a pointer to a set of conversion functions for numeric values which can be used to convert from one unit of measure to another (e.g., feet to meters).
These data are used by the run-time modules which map and translate the tree-structured output of the analyzer to IL on the actual group/field names of the database, and then to the language of the DBMS. These modules are discussed in the next sections.
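The roles of the CAN and REL tables can be pictured with a small sketch. The record layouts, group names, and field names below are assumptions made for illustration (loosely modeled on the METRO description), not the actual EUFID table formats. The sketch treats every group as a relation and every group-to-group link as a join condition, as the text describes.

```python
from typing import List, NamedTuple, Optional

class CanEntry(NamedTuple):
    """One field (attribute) of one group (relation); layout is assumed."""
    field: str
    group: str
    domain: str

class RelEntry(NamedTuple):
    """One group-to-group link, written as if it were a join; layout is assumed."""
    group_a: str
    field_a: str
    group_b: str
    field_b: str

CAN: List[CanEntry] = [
    CanEntry("name", "company", "company_name"),
    CanEntry("hood", "company", "neighborhood_name"),
    CanEntry("owner", "warehouse", "company_name"),
]

REL: List[RelEntry] = [RelEntry("company", "name", "warehouse", "owner")]

def fields_of(group: str) -> List[str]:
    """List the field names the CAN table records for a group."""
    return [entry.field for entry in CAN if entry.group == group]

def join_condition(a: str, b: str) -> Optional[str]:
    """Return the join condition linking groups a and b, in either direction."""
    for r in REL:
        if (r.group_a, r.group_b) == (a, b):
            return f"{a}.{r.field_a} = {b}.{r.field_b}"
        if (r.group_a, r.group_b) == (b, a):
            return f"{a}.{r.field_b} = {b}.{r.field_a}"
    return None
```

For a hierarchical or network DBMS, a translator would be free to ignore the returned condition and follow the stored link instead.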
* The term "join" refers to a composite

2. The EUFID Analyzer
The current version of the EUFID analyzer employs a variant of the Cocke-Kasami-Younger algorithm for parsing its input. This classical nonpredictive bottom-up algorithm has been used in a family of "chart parsers" developed by Kay, Earley, and others [AHO72]. The main features of these parsers are: (1) They use arbitrary context-free grammars. There are no restrictions on rules which have left-recursion or other characteristics which sometimes cause difficulty. (2) They produce all possible parses of a given input string. The grammars they use may be ambiguous at either the nonterminal- or terminal-symbol levels. In natural-language processing, this allows for a precise representation of both the syntactic and lexical ambiguities which may be present in an input sentence. (3) They provide partial parses of the input. Each non-terminal symbol derives some input substring. Even if no such substring spans the entire sentence, i.e., no complete parse is achieved, analyses of various regions of the sentence are available. (4) They are conceptually straightforward and easy to implement. The speed and storage considerations which have kept such parsers from being widely used in compilers are less relevant in the analysis of short strings such as queries to a DBMS.
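Features (2) and (3) can be seen in a textbook Cocke-Kasami-Younger chart. The sketch below is a generic version of the algorithm, not the EUFID variant: it uses an invented toy grammar with syntactic symbols and binary rules only, and it counts derivations instead of building trees.

```python
from collections import defaultdict

# Toy grammar (assumed): (left child, right child) -> possible parent symbols.
BINARY = {
    ("V", "NP"): {"VP"},
    ("P", "NP"): {"PP"},
    ("NP", "PP"): {"NP"},
    ("VP", "PP"): {"VP"},
}
LEXICON = {
    "list": {"V"}, "companies": {"NP"}, "north": {"NP"},
    "warehouses": {"NP"}, "in": {"P"}, "with": {"P"},
}

def cky(words):
    """Fill a chart mapping (start, end) -> {symbol: number of derivations}."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, ()):  # unknown words add nothing
            chart[i, i + 1][symbol] += 1
    for span in range(2, n + 1):              # build longer spans from shorter ones
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # every split point
                for left, n_left in list(chart[i, k].items()):
                    for right, n_right in list(chart[k, j].items()):
                        for parent in BINARY.get((left, right), ()):
                            chart[i, j][parent] += n_left * n_right
    return chart
```

With this grammar, `cky("list companies in north with warehouses".split())` records every attachment of the two prepositional phrases in the cell spanning the whole input, while an input containing an unknown word still leaves analyses of the regions on either side of it in the chart.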
The grammar used by the EUFID parser is essentially semantic. The symbols of the grammar represent the concepts underlying lexical items, and the rules specify the ways in which these concepts can be combined. More specifically, the concepts are organized into a case system. Each rule states that a given pair of constituents can be linked if the conceptual head of one constituent fills a case on the conceptual head of the other. A degree of context sensitivity is achieved by attaching predicates to the rules. These predicates block application of the rules unless certain (usually syntactic) conditions hold true. The parser uses syntactic information only "on demand", that is, only when such information is necessary to resolve semantic ambiguities. This adds to its coverage and robustness, and makes it relatively insensitive to the phrasing variations which must be explicitly accounted for in many other systems.
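A case-filling rule guarded by a predicate might look like the following sketch. Everything here is hypothetical: the `Constituent` record, the case table, and the adjacency predicate are invented to illustrate the idea of linking two constituents when one head fills a case on the other, and of blocking the link unless a condition holds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Constituent:
    head: str                                  # concept at the head
    start: int                                 # first word position covered
    end: int                                   # one past the last word covered
    filled: Dict[str, "Constituent"] = field(default_factory=dict)

# Assumed case system: head concept -> {case name: concept that may fill it}.
CASES = {"SEND": {"agent": "COMPANY", "object": "GOODS"}}

def adjacent(a: Constituent, b: Constituent) -> bool:
    """Example predicate: the two constituents must touch in the input."""
    return a.end == b.start or b.end == a.start

def link(a: Constituent, b: Constituent,
         predicate: Callable[[Constituent, Constituent], bool] = adjacent
         ) -> Optional[Constituent]:
    """Fill an open case of a with b if b's head fits and the predicate allows it."""
    if not predicate(a, b):
        return None                            # the predicate blocks the rule
    for case, concept in CASES.get(a.head, {}).items():
        if concept == b.head and case not in a.filled:
            return Constituent(a.head, min(a.start, b.start),
                               max(a.end, b.end), {**a.filled, case: b})
    return None
```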
3. Mapping
The mapper module converts the output of the analyzer to input for the translator module. Analyzer output is a tree structure where the nodes are semantic-graph nodes corresponding to the content words in the user's question and obtained from the dictionary.
Input to the translator module is a string in the syntax of IL which contains the names of actual groups and fields in the database. The mapping algorithm, thus, has to make several levels of conversion simultaneously:
- it must convert a tree structure into a linear string of tokens,
- it must convert semantic-graph nodes into database group- and field-names, and
- it must convert the connectivity of the tree (representing concept-to-concept linkage in English) into the (frequently very different) group-to-field and group-to-group connections of the database.
The mapper makes use of a table of mapping functions. The table contains at least one mapping function for every content word in the dictionary. The analyzer's tree is traversed bottom up, applying mapping functions to each node on the way. A mapping function is context sensitive with respect to the nodes below it in the tree, that is, nodes that have already been mapped. A new tree is gradually formed and connected this way. Mapping functions may indicate that the map of a semantic-graph node is a database node (that is, a group or field name), or a pre-connected sub-tree of database nodes. The mapping function may also indicate removal of a database node or modification to the existing structure of the tree being constructed.
The new tree is created in terms of the database groups and fields and its structure reflects the connectivity of the database. A final stage of the mapper traverses this new tree and generates the IL statement of the query using a table of the syntax and keywords of IL and the database names from the tree.
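The bottom-up traversal and the final linearization can be sketched as below. The tree encoding, the mapping table, and the output notation are all invented for illustration; in particular, the string produced by `flatten` is not IL, whose syntax is not given here. The `map_located` function shows one way a mapping function could depend on the nodes already mapped beneath it.

```python
# An analyzer tree node is (concept, [child nodes]); a mapped node is
# (database name, [mapped child nodes]). Both encodings are assumptions.

def map_located(kids):
    """Context sensitive: pick the group according to what was mapped below."""
    group = "warehouse" if any(k[0] == "warehouse" for k in kids) else "company"
    return (group + ".hood", kids)

# Hypothetical table with one mapping function per content-word concept.
MAPPING_TABLE = {
    "WAREHOUSE": lambda kids: ("warehouse", kids),
    "COMPANY": lambda kids: ("company", kids),
    "LOCATED": map_located,
}

def map_tree(node, table):
    """Map children first, then apply this node's mapping function to them."""
    concept, children = node
    mapped_children = [map_tree(child, table) for child in children]
    return table[concept](mapped_children)

def flatten(tree):
    """Turn the mapped tree into a linear string (invented notation, not IL)."""
    name, kids = tree
    if not kids:
        return name
    return name + "(" + ", ".join(flatten(k) for k in kids) + ")"
```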
An alternative method of mapping that is now being investigated involves breaking the process into two basic parts. The first step would be to map the tree output of the analyzer to an IL query on what C. J. Date calls the "conceptual schema" of the database [DATE77]. A second step would take this IL input and re-arrange the schema connectivity (and names of groups and fields) from that of the conceptual schema to that of the actual target database, generating another IL query as input to the current translators.
4. Translating
The output of a translator is sent to the appropriate DBMS. In the EUFID system running at SDC, a QUEL query is submitted directly to INGRES running on the same PDP-11/70 as EUFID. For testing purposes, queries generated by the WWDMS translator were transmitted from a PDP-11/70 to a Honeywell H6000 with a WWDMS database.
5. Application Description
EUFID runs on three different application databases. The METRO application involves monitoring of shipping transactions between companies in a city called "Metropolis". There are ten companies located in any one of three neighborhoods. Each company rents warehouse space for shipping/receiving transactions, and has local offices which receive goods. The data is organized relationally using the INGRES database management system. That means that there are no navigational links stored in the records (called "relations") and there is no predefined "root" to the database structure. Access may be made from any relation to any other relation as long as there is a field in each of the two relations which has the same "domain" (set of values).
AIREP (ADP Incident REPorting) is a network database, implemented in WWDMS. It contains reports about hardware and software failures and resolution of the problems in a large computer system. Active problems are maintained in an active file and old, solved problems are moved to an historical file. If a problem is reported more than once, an abbreviated record is made for the additional report, called the "duplicate incident" record. This means that there are four basic types of report: active incidents, duplicate incidents, historical incidents, and historical duplicate incidents. In addition, there are records about sites, problems, and solutions.
The APPLICANT database is a relational database implemented in INGRES that contains information about job applicants and their backgrounds. The central entity is the "applicant", while other relations describe the applicant's specialties, education, previous employment, computer experience, and interviews.
Each database has different features that may present problems for a natural-language interface but which are typical of 'real-world' applications. METRO has relatively few entities but has complex relationships among them. APPLICANT has many updates and many different values, some coming from open-ended domains. AIREP has a network database structure and contains the same data structure in four different files.
III LEVEL OF SUCCESS
Most of the EUFID design goals were actually met. EUFID runs on a mini-computer, a DEC PDP 11/70. It is application, database, and DBMS independent. A typical question is analyzed, mapped and translated in five to fifteen seconds even with grammatically incorrect input.
The analyzer contains a good spelling corrector and a good morphology algorithm that strips inflectional endings so that all inflected forms of words need not be stored explicitly. A "synonym editor" permits the user to replace any word or string of words in the dictionary with another word or string, to accommodate personal jargon and expressability. A "Concept Graph Editor" allows a database administrator to modify tables and define user profiles so that different users may have limited views of the data for security purposes.
The analysis strategy, based on a semantic grammar, permits easy and natural paraphrase recognition, although there are linguistic constructs it cannot handle. These are discussed below.
An English word may have more than one definition without complicating the analysis strategy. For example, "ship" as a vessel and as a verb meaning "to send" can be defined in the same dictionary. Words used as database values, such as names, may also have multiple definitions, e.g., "New York" used as the name of both a city and a state.
The mapper, despite its many limitations, can correctly map almost all trees output by the analyzer. It is able to handle English conjunctions, mapping them appropriately to logical ANDs or ORs, and understanding that some "ands" may need to be interpreted as OR and vice versa under certain circumstances. It is able to generate calls on DBMS calculations (e.g., average) and user-defined functions (e.g., marine great-circle distance) if the user function exists and is supported by the DBMS.
The mapper can translate "user values" (e.g., "Russian") to database values (e.g., "USSR"), and convert one unit of measure (e.g., feet) to another (e.g., meters).
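The two translations just described can be illustrated with a minimal Python sketch. The lookup table, the field name "country", and both function names are hypothetical; the paper does not describe how EUFID's mapping tables are actually laid out.

```python
# Hypothetical user-value table keyed by (field, user word).
# EUFID's real table format is not given in the paper.
USER_TO_DB_VALUE = {
    ("country", "russian"): "USSR",
    ("country", "russia"): "USSR",
}

METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot


def to_database_value(field, user_word):
    """Return the stored value for a user's word, or the word itself if unmapped."""
    return USER_TO_DB_VALUE.get((field, user_word.lower()), user_word)


def feet_to_meters(feet):
    """Convert a user-supplied length in feet to the meters stored in the database."""
    return feet * METERS_PER_FOOT
```

Keying the table on the field as well as the word keeps a word such as "Russian" from being rewritten when it appears as a value of some unrelated field.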
EUFID can interface to very complex relational and CODASYL-type databases having difficult navigation and parallel structures. In the AIREP application a consistent WWDMS navigational methodology is used to access non-key records. The system can also map to the parallel, but not identical, structures for duplicate and historical incidents.
In the INGRES applications, EUFID is able to use and correctly map to "relationship relations" which relate two or more other relations. For example, the METRO relation "cw" contains a company name, a warehouse name, and a date. This represents the initial business contact. A user might ask, "When did Colonial start to do business with Superior?" or "When did business begin between Colonial and Superior?", either of which must join both the company ("c") and the warehouse ("w") relations to the "cw" relation.
The system control module keeps a journal of all user-system interaction together with internal module-to-module data such as the IL for the user's question and the generated DBMS query. The system also employs a very effective HELP module which, under certain circumstances, is context sensitive to the problem affecting the user.
IV PROBLEMS
This section describes problems associated with EUFID development that appear to be common to natural-language interfaces to database management systems. They are loosely classified into the major areas of application, language and database management issues, although there may be overlap. Criteria for evaluating whether an application is appropriate for a natural-language front-end are also described.
A. APPLICATION DEFINITION PROBLEMS
The primary issue in this area is concerned with problems of defining, creating, and bringing up the necessary data for a new application. The discussion points out the difficulties associated with systematic knowledge acquisition.
1. User Model
A single database may be used by different groups of users for different purposes. For example, some users of the APPLICANT database may wish to fill a specific job opening while others may collect statistics on types of applicants. The language used for these two functions can be quite different, and it is necessary to have extensive interaction with cooperative users in order to characterize the kinds of dialogues they will have with the system.
Not only must representative language protocols be collected, but desired responses must be understood. For example, to answer a question such as "What is the status of our forces in Europe?", the system must know whether "our" refers to U.S. or NATO or some other unit.
The importance of this interaction between potential users and system developers should not be underestimated, as it is the basis for defining much of the knowledge base needed by the system, and may also be the basis for eventual user acceptance or rejection of the NLI system.
2. Value Recognition
A "value" is a specific datum stored in the database, and is the smallest piece of data obtainable as the result of a database query. For example, in response to the question "What companies in North Hills shipped light freight to Superior?" the METRO DBMS returns two values: "Colonial" and "Supreme". Values can also be used in a query to qualify or select certain records for output, e.g., in the above question "North Hills" and "Superior" are values that must be represented in the query to the DBMS. As long as the alphanumeric values used in a particular database field are the same as words in the English questions, there are no difficult problems involved in recognizing values as selectors in a query.
There are three basic ways to recognize these value words in a question. They can be explicitly listed in the dictionary, recognized by a pattern or context, or found in the database itself.
If the value words are stored in the dictionary, they can be subject to spelling correction because the spelling corrector uses the dictionary to locate words which are a close match to unrecognized words in a question. This means, though, that all possible values and variant legitimate spellings of values for a concept must be put either into the dictionary or into the synonym list. This is reasonable for concepts which have a small and controlled set of values such as the names of the companies in METRO, but may become unwieldy for large sets of values.
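As an illustration of dictionary-based spelling correction, the sketch below uses Python's standard difflib module to propose dictionary words that closely match an unrecognized word. This is a stand-in technique, not EUFID's corrector, whose algorithm is not described here. The word list and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical dictionary: a few METRO value words and ordinary vocabulary.
DICTIONARY = ["colonial", "supreme", "superior", "company", "warehouse", "ship"]


def suggest(word, vocabulary=DICTIONARY, cutoff=0.8):
    """Return dictionary words close to `word`, best match first.

    A word already in the dictionary is returned unchanged. The cutoff is an
    assumed similarity threshold, not a value taken from EUFID.
    """
    lowered = word.lower()
    if lowered in vocabulary:
        return [lowered]
    return difflib.get_close_matches(lowered, vocabulary, n=3, cutoff=cutoff)
```

A value that is absent from the dictionary can never be proposed, which is why this approach requires every value and legitimate variant spelling to be listed.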
If a value can be recognized by a pattern, it is not necessary to itemize all instances in the dictionary. For example, a date may be entered as "yy/mm/dd" so that any input matching the pattern "nn/nn/nn" is recognized as a date. This is the approach used for dates and for names of applicants in the APPLICANT database, where names of people match the pattern "I.I.Lastname".
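Pattern-based recognition of this kind can be sketched with regular expressions. The two expressions below are one possible reading of the "nn/nn/nn" and "I.I.Lastname" patterns (two initials, each followed by a period, then a capitalized last name); the category labels and the function name are hypothetical.

```python
import re

# "nn/nn/nn": three two-digit groups separated by slashes.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{2}")
# "I.I.Lastname": assumed to mean two initials with periods, then a last name.
NAME_PATTERN = re.compile(r"[A-Z]\.[A-Z]\.[A-Z][A-Za-z'-]*")


def classify_token(token):
    """Return the value category a token matches by pattern, or None."""
    if DATE_PATTERN.fullmatch(token):
        return "date"
    if NAME_PATTERN.fullmatch(token):
        return "applicant_name"
    return None
```

Note that the date pattern checks only the shape of the input, so "99/99/99" would also be accepted, and a name written as "Dr. Smith" or with a single initial would not be recognized.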
In another approach, OnLine English [CULL80] and Intellect [HARR78, EDP82] (two variations of ROBOT) used the database to recognize values. This is a satisfactory solution if the database is small or if the small number of different values is stored in an index accessible to the NLI, and if the values in the database are suitable for use in English questions.
Each of these solutions has disadvantages. If values are stored in the dictionary there may be many different ways to spell each particular value. For example, the company name for "System Development Corporation" may also be given as "S.D.C.", "S D C", or "System Development Corp". While each different spelling could be entered as a synonym for the "correct" spelling in the database, this would result in an enormous proliferation of the dictionary entries and problems with concurrency control between the updates directed to the data management system and the updates to the dictionary. A creative solution might be to define rules for synonym generation and apply them to database updates.
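Rule-based synonym generation of the kind suggested above might look like the following sketch, which derives the variants named in the text from the stored company name. The rules, the abbreviation table, and the function name are assumptions for illustration; the paper proposes the idea but does not specify any rules.

```python
# Hypothetical abbreviation rules for common company-name words.
ABBREVIATIONS = {"Corporation": "Corp", "Company": "Co", "Incorporated": "Inc"}


def generate_synonyms(name):
    """Derive spelling variants of a stored name using simple rules."""
    words = name.split()
    initials = [w[0].upper() for w in words]
    variants = {
        "".join(i + "." for i in initials),  # e.g. S.D.C.
        " ".join(initials),                  # e.g. S D C
        "".join(initials),                   # e.g. SDC
    }
    # Replace long words by their abbreviations, e.g. Corporation -> Corp.
    abbreviated = " ".join(ABBREVIATIONS.get(w, w) for w in words)
    if abbreviated != name:
        variants.add(abbreviated)
    variants.discard(name)  # the stored spelling is not its own synonym
    return sorted(variants)
```

If such a function were run whenever a new company name is inserted into the database, the synonym list could be kept in step with the data without manual dictionary edits.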
A somewhat different example is from the APPLICANT application which has many open-ended domains, such as names of applicants and previous employers. In this case, the application designer may have to treat certain fields as "retrieve-only", meaning that the data can be asked for but not used as a selection criterion. A database with a large number of retrieve-only fields may be a poor candidate for an NLI.
Patterns can be used only if they can be enforced, and probably few values really fit the patterns nicely. Proper names are a poor choice for patterns because of variations such as middle initial or title such as "Dr." or "Jr.". Also, spelling correction cannot be performed unless the value is stored in the dictionary.
Finally, the solution of using the database itself to recognize values is unsatisfactory to a general NLI for anything other than trivial databases, unless an inverted index of values is easily accessible. There are the problems of spelling correction and synonyms for database values, the inefficiency involved in accessing the DBMS for every unrecognized word, and the difficulty of knowing which fields in the database to search.
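To make the role of an inverted index concrete, the sketch below builds a map from each stored value to the relations and fields in which it occurs, so that an unrecognized word can be resolved with one lookup instead of a search of every field. The in-memory table layout and the relation and field names are hypothetical.

```python
from collections import defaultdict


def build_value_index(tables):
    """Map each lower-cased value to the set of (relation, field) pairs holding it.

    `tables` is a dict of relation name -> list of row dicts (an assumed layout).
    """
    index = defaultdict(set)
    for relation, rows in tables.items():
        for row in rows:
            for field, value in row.items():
                index[str(value).lower()].add((relation, field))
    return index


def fields_for(index, word):
    """Return the (relation, field) pairs in which `word` occurs as a value."""
    return sorted(index.get(word.lower(), ()))
```

The index answers the question of which fields to search, but it does nothing for misspelled words or synonyms, and it must be updated whenever the database is.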
3. Semantic Variation By Value
Databases are generally designed with a minimum number of different record types. When there are entities which are similar, but possibly have a small number of attributes which are not shared, the entities will be stored in the same record type with null values for the attributes that do not apply. The user, in his questions, may view these similar entities as very different entities and talk about them differently.
We did not encounter the problem with METRO or AIREP. For example, in METRO, the user asks the same type of questions about the company named "Colonial" as about the company named "Supreme". In APPLICANT, however, each applicant has a set of "specialties" such as "computer programmer", "accounting clerk", or "gardener". These are all stored as values of the specialty field in the database. Unfortunately, in this case different specialties evoke completely different concepts to the end user. The user may ask questions such as, "What programmers know COBOL?", "Who can program in COBOL?", and "How many applicants with a specialty in computer programming applied in 1982?". Notice the new nouns and verbs that are introduced by this specialty name.
A value domain such as specialties should be handled with an ISA hierarchy. Each different type of specialty such as gardener or programmer could have a different concept that is a subset of the concept "specialty". Some questions could be asked about all specialties and others could be directed only to certain subconcepts. However, there is no ISA hierarchy in EUFID, and it would have been inefficient to treat each specialty and subspecialty as a separate concept since there are 30 specialties and 196 subspecialties. Therefore, we required the users to know the exact values, to know which values are for specialties and which are for subspecialties, and to ask questions using the values only as nouns. This is not "user friendly".
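A minimal sketch of what an ISA hierarchy over the specialty domain could provide is shown below. Since EUFID has no such hierarchy, everything here is hypothetical: the parent table, the verbs attached to each concept, and both function names.

```python
# Hypothetical child -> parent links; "specialty" is the root concept.
ISA = {
    "programmer": "specialty",
    "gardener": "specialty",
    "accounting clerk": "specialty",
}

# Hypothetical verbs usable with each concept. Subconcepts inherit from ancestors.
VERBS = {
    "specialty": {"apply"},
    "programmer": {"program", "know"},
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ISA.get(concept)
    return False


def verbs_for(concept):
    """Collect the verbs defined on a concept and on all of its ancestors."""
    verbs = set()
    while concept is not None:
        verbs |= VERBS.get(concept, set())
        concept = ISA.get(concept)
    return verbs
```

With this arrangement a question about "applying" is valid for every specialty, while "program" and "know" are accepted only for programmers, which is the distinction the flat specialty field cannot express.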
Even if it were possible to build a different concept for each different skill, there is an update problem. When a new value is added to a value domain where there are uniform semantics (as in adding a new company name in METRO), the new value is simply attached to the existing concept. When the new value has different semantics, the newly associated concepts, nouns, and verbs cannot be added automatically. If the NLI supports an ISA hierarchy, someone will need to categorize the new value and add a new node to the hierarchy or specify a position in the hierarchy.
4. Automation of Definition
A natural-language interface system will not be practical until a new application can be installed easily. "Easily" means that the end-user organization must be able to create and modify the driving tables for the application relatively quickly without the help of the NLI developer, and must be able to use the NLI without restructuring the database.
Each EUFID application required "handcrafted" tables that were built by the development staff. Each new application was done in less time than the previous one, but still required several staff-months to bring up. Clearly, the goal of facilitating the building of the tables by end users was not met. Computer-assisted tools for defining new applications are a prerequisite for practical NLIs.
B. LANGUAGE PROBLEMS
The basic approach to language analysis in EUFID involves a bottom-up parser using a semantic grammar. The symbols of the grammar are concepts underlying lexical items, and the rules of the grammar are based on a case framework. Essentially syntactic information is used only when needed to resolve ambiguity. The language features that this technique has to handle are common to any NLI, and some of the problem areas are described in the following sections.
1. Anaphora and Ellipsis
To support natural interaction it is desirable to allow the use of anaphoric reference and elliptical constructions across sentence sequences, such as "What applicants know Fortran and C?", "Which of them live in California?", "In Nevada?", "How many know Pascal?". One of the biggest problems is to define the scope of the reference in such cases. In the example, it is not clear whether the user wishes to retrieve the set of all applicants who know Pascal or only the subset who live in Nevada.
One solution is to provide commands that allow users to define subsets of the database to which to address questions. This removes the ambiguity and speeds up retrieval time on a large database. However, it moves the NLI interaction toward that of a structured query language, and forces the user to be aware of the level of subset being accessed. It is also difficult to implement because a subset may involve projections and joins to build a new relation containing the subset. The NLI must be able dynamically and temporarily to change the mapping tables to map to this new relation.
2. Intelligent Interaction
One of the EUFID design goals was to respond promptly either with an answer or with a message that the question could not be interpreted. The system handles spelling or typographical errors by interacting with the user to select the correct word. However, when all of the words are recognized but do not connect semantically, it is difficult to identify a single point in analysis which caused the failure.
It is in this area that the absence of a syntactic mechanism for determining well-formedness was most noticeable. There are times when a question has a proper syntactic structure, but contains semantic relationships unrecognizable to the application as in "What is the location of North Hills?". A response of "Location is not defined for North Hills in this application" should be derivable from the recognizable semantic failure. Similarly, it would be useful to have a framework for interpreting partial trees, as in the question "What companies does Mohawk ship to?" where Mohawk is not a recognized word within the application. An appropriate response might be "Companies ship to receiving offices and companies; Mohawk is neither a receiving office nor a company. The names of offices and companies are ...". Interpretation of partial analyses is not possible within the EUFID system; it either succeeds or fails completely.
3. Yes/No Questions
In normal NLI interaction users may wish to ask "yes/no" questions, yet no DBMS has the ability to answer "yes" or "no" explicitly. The EUFID mapper maps a yes/no question into a query which will retrieve some data, such as an "output identifier" or default name for a concept, if the answer is "yes" and no data if the answer is "no". However, the answer may be "no" for several reasons. For example, a "no" response to the question "Has John Smith been interviewed?" may mean that the database has knowledge about John Smith and about interviews and Smith is not listed as having had an interview*, or the database knows about John Smith and no data about interviews is available. A third possibility could be that the database has information about John Smith and his employment situation (already hired), and the response might include that information, as in "No, but he has already been hired".
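The difference between a bare empty result and an informative "no" can be sketched as follows. Python's sqlite3 module stands in for the DBMS purely for illustration (EUFID itself targeted INGRES and WWDMS), and the table names, column names, and sample rows are invented for the example.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory database with hypothetical APPLICANT-like tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE applicant (name TEXT PRIMARY KEY);
        CREATE TABLE interview (applicant TEXT, held_on TEXT);
        INSERT INTO applicant VALUES ('John Smith'), ('Mary Jones');
        INSERT INTO interview VALUES ('Mary Jones', '1982-03-15');
    """)
    return conn


def has_been_interviewed(conn, name):
    """Answer a yes/no question and say why when the answer is no."""
    known = conn.execute(
        "SELECT 1 FROM applicant WHERE name = ?", (name,)
    ).fetchone()
    if known is None:
        return "no: the database has no applicant named " + name
    interviewed = conn.execute(
        "SELECT 1 FROM interview WHERE applicant = ? LIMIT 1", (name,)
    ).fetchone()
    if interviewed is None:
        return "no: " + name + " is known but no interview is recorded"
    return "yes"
```

A single retrieval that returns no rows cannot separate these cases; the sketch needs one extra query per presupposition it wants to check.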
4. Conjunctions
The scope of conjunctions is a difficult problem for any parsing or analyzing algorithm. The natural-language use of "and" and "or" does not necessarily correspond to the logical meaning, as in the question "List the applicants who live in California and Arizona.". Multiple conjunctions in a single question can be ambiguous as in "Which minority and female applicants know Fortran and Cobol?". This could be interpreted with logical "and" or with logical "or" as in "Which applicants who are minority or female know either Fortran or Cobol?".
The EUFID mapper will change English "and" to logical "or" when the two phrases within the scope of the conjunction are values for the same field. In the example above, an applicant has only one state of residence.
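The rule stated above can be written down in a few lines. This is a sketch of the rule as described, not EUFID's mapper code; the representation of conditions as (field, value) pairs and the function name are assumptions.

```python
def logical_operator(conditions, english_conjunction):
    """Choose the logical operator for conjoined selection conditions.

    `conditions` is a list of (field, value) pairs within the conjunction's
    scope. English "and" becomes logical "or" when every condition is a value
    for the same field; otherwise the English conjunction is kept.
    """
    fields = {field for field, _ in conditions}
    if english_conjunction == "and" and len(conditions) > 1 and len(fields) == 1:
        return "or"
    return english_conjunction
```

For "applicants who live in California and Arizona" both conditions test the state field, so the result is "or". The rule is only safe for fields that hold one value per record; applied to "know Fortran and Cobol" it would also yield "or", which is one of the two readings discussed in the preceding paragraph but not necessarily the intended one.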
5. Negation
Negative requests may contain explicit negative words such as "not" and "never" or may contain implicit negatives such as "only", "except" and "other than" [OLNE78]. The interpretation of negatives can be very difficult. For example, "Which companies did not ship any perishable freight in 1976" could mean either "Which (of all the companies) shipped no perishable freight in 1976?" or "Which (of the companies that ship perishable freight) shipped none in 1976?". Moreover, if some companies were only receivers and never shippers it is uncertain whether they should be returned in the answer. It is also difficult to take a complement of a set of data using the many data management systems that do not support set operators between relations.
Questions which require a "yes" or "no" response are difficult to answer because often the "no" is due to a presupposition which is invalid. This is especially true with negation. For example, if the user asks, "Does every company in North Hills except Supreme use NH2?", the answer may be "no" because Supreme is not in North Hills.
The current implementation of EUFID does not allow explicit negation, although some negative concepts are handled such as "What companies ship to companies other than Colonial?". "Other than" is interpreted as the "!=" operator in exactly the same way that "greater than" is interpreted as ">".
* There is the important distinction between a "closed world" database in which the assumption is that the database covers the whole world (of the application) and an "open world" database in which it is understood that the database does not represent all there is to the real world of the application. In the open world database, which we encounter most of the time, a response of "not that this database knows of" might be more appropriate.
C. INTERPRETATION AND DATABASE ISSUES
Many questions make perfect sense semantically but are difficult to map into DBMS queries because of the database structure. The problems become worse when access is through an NLI because of increased expectations on the part of the user and because it may be difficult for a help system adequately to describe the problem to the user who is unaware of the database structure.
1. IL Limitations
The design of the IL is critical. It must be rich enough to support retrieval from all the underlying DBMSs. However, if it contains capabilities that do not exist in a specific DBMS, it is difficult to describe this deficiency to the user.
In APPLICANT, the user cannot get both the major and minor fields of study by asking "List applicants and field of study", because a limitation in the EUFID IL prevents making two joins between education and subject records. This problem was corrected in a subsequent version of IL with the addition of a "range" statement similar to that used by QUEL [STON76].
The current IL does not contain an "EXISTS" or "FAILS" operator which can test for the existence of a record. Such an operator is frequently used to test an interrecord link in a network or hierarchical DBMS. It is needed to