TEXT ANALYSIS LEARNING STRATEGIES
Pierre PLANTE
Université du Québec à Montréal

Abstract:
APLEC (l'APrenti-LECteur, or Learning Reader) is an extension of the Déredec software, a programming system devoted to the content analysis and linguistic treatment of texts.
APLEC automatically associates with any text descriptive grammar (TDG; grammaire descriptive de texte) a question/answer module in which the user asks questions in a given natural language and the answers are tracked down in the textual corpus under examination.
The user's TDGs can operate at various levels of analysis (morphological, syntactic, semantic, logical, ...). A distinctive characteristic of the Déredec is that it allows such multi-level treatment without interfaces: a single storage structure holds the information, and a single algorithmic structure serves both description and retrieval.
The user's TDG is first applied to the whole text and then to the question. Exploration models (provided by the user) then trigger the comparison of the question with the text in order to start tracking down the answer.
If, because of weaknesses in the grammar, APLEC cannot complete this work, it tracks down the relevant elements of the problem raised, communicates with the user, submits the results of its analysis, and waits for the user to propose a solution. If a solution is proposed, APLEC then 'learns' it, that is, generalizes it to the whole text so that APLEC no longer needs to intervene when similar problems arise.
Hence the intelligence of the analysis grows gradually as the interactions between the user and the text, via the automaton, proceed; it builds itself up by osmosis with the particular semantics of the text being questioned.
We do not claim to supply a finished theory of automatic learning; rather, we try to supply various formal strategies that function whatever the content of the Déredec grammars applied to a given text.
I will first describe the general characteristics of the Déredec programming system, and then turn to a few examples of the use of the question module named APLEC.
The Déredec offers its users two large classes of functions: (i) text descriptive functions, and (ii) text explorative functions.
The first functions have the form of finite-state automata. They are machines that scan the text sequence by sequence (e.g. sentence by sentence) and build tree structures over the elements of these sequences (e.g. words). The nodes of the trees are labelled with descriptive categories and may also be linked together by oriented relations. The leaves of the trees (the words of the sequences) can have complex semantic networks attached to them.
Example:
[Figure: tree structure built over a sequence; recoverable labels: CAT, SUBJECT, OBJECT, G, SG]
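To make the description above concrete, here is a minimal sketch of such a tree: nodes labelled with categories, oriented relations between nodes, and a semantic network attached to a leaf. It is not the Déredec's EXFAD notation; the class names `Node` and `Relation`, the categories "S", "NP", "V", the "is-a" link, and the direction of the relations are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """A tree node: leaves carry a word, inner nodes carry children."""
    category: str                                   # descriptive category (hypothetical labels)
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                      # set only on leaves
    network: Dict[str, Set[str]] = field(default_factory=dict)  # semantic network on a leaf

    def leaves(self) -> List["Node"]:
        """Return the leaves of the subtree in left-to-right order."""
        if not self.children:
            return [self]
        found: List["Node"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


@dataclass
class Relation:
    """An oriented, labelled relation between two nodes of a tree."""
    label: str
    source: Node
    target: Node


# Hypothetical description of the sequence "Mary offers book".
mary = Node("NP", word="Mary")
offers = Node("V", word="offers")
book = Node("NP", word="book", network={"is-a": {"object"}})
sentence = Node("S", children=[mary, offers, book])
relations = [Relation("SUBJECT", mary, offers), Relation("OBJECT", book, offers)]

words = [leaf.word for leaf in sentence.leaves()]
```

Keeping relations separate from the parent/child links reflects the text's distinction between the tree itself and the oriented relations that may additionally link its nodes.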
The user programs the Déredec automata so that they attach descriptive structures to every sequence of the text.
The Déredec remains indifferent to the nature of these structures, which can therefore belong to the syntactic, morphological, semantic, or logical level.
The Déredec automata scan the sequences in both directions. They are non-deterministic, but all degrees of determinism are programmable. These automata work preferentially bottom-up rather than top-down. Note also that their sensitivity to context can extend beyond the frame of the analysed sequence and spread over the whole of a corpus. Indeed, any decision taken in a Déredec automaton (be it a categorization, the composition of a phrase, the labelling of a relation between two nodes, the construction or development of a semantic network, ...) can depend on the result of an investigation carried out in the part of the corpus preceding or following the current sequence, or even in any other corpus. This type of enlarged contextual investigation makes it possible to define, in Déredec, properly textual recursive grammars in addition to sentential recursive grammars.
The tree structures (named EXFAD: expressions de forme admissible, or expressions of admissible form) are associated with the corpus sequences in an advanced interactive mode. Here the Déredec software can assist its users in many ways: it automatically tracks down and diagnoses programming errors; it allows the user to trace the behavior of an automaton; and it allows the automata's work to be interrupted in order to inspect the state of the description, to modify the grammar, etc.
The text descriptions (TD; descriptions de textes) produced by the automata are then analyzed by explorative functions whose arguments (called exploration models), given by the user, are pattern-matching structures with a simple writing syntax and a high discriminating power. It is through the explorative functions that the text descriptive grammars (TDG, i.e. the series of automata) are tied to content analysis objectives.
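As an illustration of what an exploration model might look like, the sketch below matches a small pattern against a tree and returns every subtree that fits. This is not the Déredec's model syntax, which the text does not show; the tuple encoding, the wildcard `ANY`, and the functions `matches` and `explore` are assumptions.

```python
ANY = "?"  # hypothetical wildcard: matches any category

# A node is (category, body): body is a word (str) for a leaf,
# or a list of child nodes for an inner node.
tree = ("S", [
    ("NP", [("N", "Mary")]),
    ("VP", [("V", "slips"),
            ("NP", [("DET", "the"), ("N", "book")])]),
])


def matches(node, pattern):
    """True if the node fits the pattern.

    A pattern is (category, body): body None puts no constraint on the
    node's content, a str must equal the leaf word, and a list holds
    sub-patterns that must each be matched by at least one child.
    """
    category, body = node
    wanted_category, wanted_body = pattern
    if wanted_category != ANY and wanted_category != category:
        return False
    if wanted_body is None:
        return True
    if isinstance(wanted_body, str):
        return body == wanted_body
    if isinstance(body, str):
        return False  # the pattern asks for children but the node is a leaf
    return all(any(matches(child, sub) for child in body) for sub in wanted_body)


def explore(node, pattern):
    """Collect, in pre-order, every subtree of node that fits the pattern."""
    found = [node] if matches(node, pattern) else []
    if not isinstance(node[1], str):
        for child in node[1]:
            found.extend(explore(child, pattern))
    return found


# Model: a VP containing some verb and an NP whose noun is "book".
model = ("VP", [("V", None), ("NP", [("N", "book")])])
hits = explore(tree, model)
nouns = [leaf[1] for leaf in explore(tree, ("N", None))]
```

Because the tree and the model share the same tuple shape, the result of one exploration can be fed to another, which loosely mirrors the text's point that inputs and outputs share one writing syntax.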
Thus programming sessions with Déredec usually take the form of a chain whose links are:
(1) the construction of automata;
(2) the obtaining of TDs by the application of these automata to the corpus;
(3) the elaboration of exploration models;
(4) the application of explorative functions to the TDs;
(5) the analysis of the results, possibly followed by re-explorations or by the construction of new descriptions.
Programming with Déredec essentially means programming the production of TDs, then programming their exploration, analyzing the results, and starting over at one stage or another until satisfactory results are obtained. The whole process is greatly facilitated by the fact that the admissible expressions, at the input as at the output, whether for descriptive functions or for explorative functions, have exactly the same writing syntax from a computational point of view.
One might think that this type of experimentation will be the lot of most users of the software. But the highly interested user will surely want to take advantage of the Déredec's ACSP procedures, i.e. its automatic context-sensitive programming procedures (procédures de programmation automatique sensibles au contexte: PASC).
These functions try in various ways to simulate the behavior of the programmer in the control box of the following diagram:
[Figure: control diagram; recoverable labels: Text, Descriptive function, Explorative function, results, Programming of, Control box]
The aim here is to take out of the programmer's hands as many as possible of the actual operations to be carried out, leaving the programmer only certain high-level decisions concerning the general planning of the experiments.
For example, some of the ACSPs deal with the chained reapplication of explorative functions to a TD: the results obtained at each exploration serve to modify the exploration model (or models) used, which the user gives in an initial version as a starting point. The user controls the ACSP by supplying certain keys and parameters that guide the whole of the operations.
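The chained reapplication just described can be pictured with a deliberately simple loop. The sketch below is an assumption-laden illustration, not an actual ACSP: the "model" is reduced to a set of words, the "TD" to tokenized sentences, and the modification rule (add words that co-occur often enough with the current model) is invented for the example. The parameters `min_count` and `max_rounds` stand in for the user-supplied keys and parameters.

```python
from collections import Counter


def explore(sentences, model):
    """Return the sentences that contain at least one word of the model."""
    return [s for s in sentences if model & set(s)]


def chained_exploration(sentences, seed_model, min_count=2, max_rounds=5):
    """Reapply explore(), enriching the model with its own results.

    min_count and max_rounds are hypothetical control parameters:
    a word joins the model once it appears in at least min_count
    retrieved sentences, and the chain stops at a fixed point or
    after max_rounds explorations.
    """
    model = set(seed_model)
    for _ in range(max_rounds):
        hits = explore(sentences, model)
        counts = Counter(word for s in hits for word in set(s))
        enriched = model | {w for w, n in counts.items() if n >= min_count}
        if enriched == model:
            break  # nothing new was learned: stop the chain
        model = enriched
    return model


sentences = [
    ["seals", "hunting"],
    ["hunting", "estuary"],
    ["seals", "hunting", "estuary"],
    ["bookshop", "money"],
]
final_model = chained_exploration(sentences, {"seals"})
```

Starting from the single word "seals", the first round adds "hunting", the second adds "estuary", and the third changes nothing, so the chain stops; the unrelated sentence never enters the model.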
Other ACSPs allow the progressive enrichment of a TD by constantly modifying a descriptive apparatus whose general structure is given at the start, here again accompanied by general parameters governing the iterative process.
It should be noted that the automatic programming procedures have, from a computational point of view, the same form or the same admissibility as the other Déredec functions or operations; it follows that the former can be composed with the latter. This last characteristic accounts for the 'context sensitive' qualification of the ACSPs: the automatic programming procedures are embedded in environments that can supply the parameters relevant to their execution.
The ACSP machine that is by far the most complex and, in a certain way, the most complete is named APLEC.
APLEC (APrenti-LECteur, the Learning Reader) automatically associates a question/answer module with any TDG submitted by the user. In this Q/A module the questions are formulated in the natural language of the text, and the answers are segments tracked down in the text.
When APLEC cannot track down an answer, it will:
(1) give a diagnosis of the difficulty;
(2) communicate with the user;
(3) ask for a solution;
(4) and, if a solution is proposed by the user, try to generalize the solution to the overall corpus.
All this is done in order to increase its subsequent retrieval power and to keep interruptions to a minimum.
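The four steps above can be sketched as a small retrieval loop. This is only an illustration under strong assumptions: sentences and questions are plain word lists rather than grammar-produced descriptions, an answer is any sentence containing every question word, and the only kind of solution is a word-to-word equivalence. The names `find_answer`, `ask`, and `propose_solution` are hypothetical.

```python
def normalize(word, equivalences):
    """Map a word to its learned equivalent (itself if none was learned)."""
    return equivalences.get(word, word)


def find_answer(question, corpus, equivalences):
    """Return the first sentence containing every question word, up to equivalences."""
    wanted = {normalize(w, equivalences) for w in question}
    for sentence in corpus:
        if wanted <= {normalize(w, equivalences) for w in sentence}:
            return sentence
    return None


def ask(question, corpus, equivalences, propose_solution):
    """Answer a question, interrupting the user only when retrieval fails."""
    answer = find_answer(question, corpus, equivalences)
    if answer is not None:
        return answer
    # (1) diagnosis: which question words occur nowhere in the corpus?
    known = {normalize(w, equivalences) for s in corpus for w in s}
    missing = [w for w in question if normalize(w, equivalences) not in known]
    # (2)-(3) communicate with the user and ask for a solution
    for word in missing:
        equivalent = propose_solution(word, sorted(known))
        if equivalent is not None:
            # (4) generalize: the equivalence now applies to the whole
            # corpus and to every later question
            equivalences[word] = normalize(equivalent, equivalences)
    return find_answer(question, corpus, equivalences)


corpus = [["celebrity", "protects", "wonder"]]
equivalences = {}
replies = {"Weber": "celebrity", "seals": "wonder"}  # stands in for the user
interruptions = []


def propose_solution(word, candidates):
    interruptions.append(word)
    return replies.get(word)


first = ask(["Weber", "protects", "seals"], corpus, equivalences, propose_solution)
second = ask(["Weber", "protects", "wonder"], corpus, equivalences, propose_solution)
```

The second question is answered without any interruption because the equivalences learned during the first one are kept, which is the sense in which interruptions are kept to a minimum.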
In addition to the grammar and to a few exploration models considered representative of the structures it labels, the user supplies APLEC with only very general parameters concerning the retrieval conditions and the learning conditions.
Here is a first example of a dialogue with APLEC:
USER: Hello!
APLEC: Hello! Would you like to give an opinion on the hunting of baby seals? I will look through my data for an opposed opinion.
USER: Brigitte Bardot's interests are less than honourable!
APLEC: It seems to me that we should praise this defender of nature rather than run him down by lending to his actions purely selfish motives.
USER: Nevertheless Bardot is a very decent lady!
APLEC: You are contradicting yourself; you were for it and now you are against it. Are you in favor of this hunting, yes or no?
USER: No!
APLEC: Brigitte Bardot has just made a fool of herself by her loony declarations on the hunting of baby seals. Give another opinion; I will now look for an agreeing argument.
USER: Weber acts to protect the cute little seals!
APLEC: Can you supply an equivalent expression for the expression "seals" among the expressions: "wonder", "earth", "house"?
USER: "Wonder"!
APLEC: Can you consider the expression "Weber" as an equivalent for "celebrity"?
USER: Yes!
APLEC: Other rich celebrities do nothing for the protection of wonders of this earth which is the house of all. Any other opinion?
USER: Brigitte Bardot comes to defend the nice hunters!
APLEC: I think I discern a contradiction in the evaluative process of your opinion. I cannot give an answer to it. Any other opinion?
USER: Thank you!
APLEC: Bye, bye!
In that example, the analyzed text is made up of some twenty letters sent to newspapers during the dispute surrounding the hunting of baby seals in the St-Laurent estuary. APLEC was fed a French surface grammar (PLANTE, 1980) and a so-called discursive evaluative grammar (grammaire évaluative du discours) (PANACCIO, 1979). The first grammar tracks down, for all French sentences, the topic and comment as well as various types of complements and determinatives. It also segments the sentence into its sentential components. The second grammar allows us to know whether an agent or a typical act of a discursive formation (here, e.g., the hunting of baby seals) is praiseworthy or condemnable, by studying the evaluative transfers that occur on the markers "praiseworthy" and "condemnable" in the sentences. These markers are at first attached manually to certain words of the corpus (e.g. "ridicule": "condemnable"). The second grammar is superimposed on the first. (Both grammars are more thoroughly expounded in PLANTE, 1980.)
When APLEC cannot track down an answer directly in the text (either because of a weakness of the grammar or because the text says too little about the question), APLEC intervenes: it gets in touch with the user, explains the difficulty, and waits for a solution. If a solution is proposed, it will probably facilitate the discovery of the precise answer to the question; moreover, if the user wishes, it is generalized to the whole of the text under exploration and thus helps with the future tracking down of answers to other questions.
Hence APLEC's performance, that is, its capacity to supply adequate answers to given questions, improves gradually as conversations accumulate. Little by little, conceptual networks are constructed, allowing a more and more refined and acute analysis because it is more and more relevant to the explored text. APLEC is, in a way, sold blank to its users, who little by little transform it into a more personal robot that improves as it adapts to particular textual data.
The solutions that are proposed during the conversations and later generalized can take different forms. In our example, the solution consists in supplying "word-to-word" equivalence relations ("Weber" :: "celebrity"). Yet APLEC can accept and generalize solution forms that are a lot more complex. Consider, for example, the following short text:
John and Mary are at the bookshop.
Mary would like to offer a book to John.
But she lacks the money.
Mary decides to slip the book under her coat.
The bookshop owner will lose a book but John will be happy.
USER: Did Mary steal?
APLEC: Make "offer" more explicit.
USER: Habitually,[1] we offer an object of value.
APLEC: Make "slip" more explicit.
USER: It is an action.
APLEC: Make "steal" more explicit.
USER: It is an action on an object of value.
APLEC: Mary decides to slip the book under her coat. Any other question?
Thus APLEC learns that "book" is an object of value (since it is "offered" in the text), and that "slip" and "steal" can now be taken one for the other, since "slip" has become an "action" on an object of "value"...
In this second learning strategy, the "word-to-word" equivalences are replaced by more or less complete pattern-matching semantic networks defined in the terms of the grammar under experiment and attached to the words of the question and to those of the text. These networks, proposed only during interruptions, first allow the pattern-matching procedures to be refined so as to facilitate the tracking down of the right answer, and secondly allow existing networks to be augmented, or new networks to be constructed, automatically. In the last example, after the first explicit reformulation, all the 'offered objects' in the text received the category "value"... Note that APLEC allows the learning procedures to be contextualized, i.e. the solutions proposed by the user to be generalized according to circumstances, with the help of various execution parameters that the user manipulates.
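The bookshop example can be replayed with a very reduced model of this second strategy. The sketch below is an assumption, not APLEC's formal language: the text description is flattened into (agent, verb, object) triples, and the three explicit reformulations are encoded through two hypothetical functions, `explain_object` and `explain_verb`.

```python
# Hypothetical text description: (agent, verb, object) triples that the
# grammar is assumed to have extracted from the short text.
events = [("Mary", "offer", "book"), ("Mary", "slip", "book")]

object_categories = {}  # noun -> set of categories learned so far
verb_definitions = {}   # verb -> (feature, category required of its object or None)


def explain_object(verb, category):
    """'We <verb> an object of <category>': generalized to every object
    of that verb found in the text."""
    for _, event_verb, obj in events:
        if event_verb == verb:
            object_categories.setdefault(obj, set()).add(category)


def explain_verb(verb, feature, object_category=None):
    """'<verb> is a <feature>', optionally 'on an object of <object_category>'."""
    verb_definitions[verb] = (feature, object_category)


def find(agent, verb):
    """Return an event of the text that literally or semantically fits the question."""
    definition = verb_definitions.get(verb)
    for event in events:
        event_agent, event_verb, obj = event
        if event_agent != agent:
            continue
        if event_verb == verb:
            return event  # literal match
        if definition is None or event_verb not in verb_definitions:
            continue      # nothing learned yet that could relate the two verbs
        feature, wanted = definition
        event_feature, _ = verb_definitions[event_verb]
        categories = object_categories.get(obj, set())
        if event_feature == feature and (wanted is None or wanted in categories):
            return event
    return None


before = find("Mary", "steal")               # no answer yet: APLEC would interrupt
explain_object("offer", "value")             # "Habitually, we offer an object of value."
explain_verb("slip", "action")               # "It is an action."
explain_verb("steal", "action", "value")     # "It is an action on an object of value."
after = find("Mary", "steal")
```

After the three explanations, "book" carries the category "value" because it is offered in the text, so the "slip" event satisfies the definition of "steal" and is returned as the answer.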
We believe that APLEC formally allows one to distinguish what, in a semiotic enterprise, is relevant to the semantics of a group of texts, to the semantics of a particular text, or rather to the semantics of a particular use of a given text. It also allows the semantic boundaries of a given TDG to be delimited, and thus makes the improvement of the latter easier.
These last considerations lead us to reflect on the problems related to the project of building a theory of natural language description and, in particular, on the problem of inserting semantics into that theory.
Natural languages are objects that evidently and strongly resist all known formalization techniques. We have to get used to the idea of an object whose rules can change according to the portion of it we are observing and according to the use we are making of it. We can always think of particular semantics, functional for a given world; many robots have shown that it is possible to simulate the functioning of a natural language for a limited semantic world. These experiments are interesting as illustrations, but they miss the essence of what characterizes the normal behavior of a speaker: the capacity to pass from one semantics to another while adapting, or even transforming when needed, the set of rules already constructed. What we need is software that constantly facilitates the construction of new sets of primitives, new axioms, and new rules of inference, these elements being considered as variables rather than constants.
We are still a long way from perfecting a system that would simulate such learning procedures, still far from a system in which the semantics would rest in large part on use rather than on internal construction principles. Meanwhile, local experiments in description will be valid only if they are accompanied by their conditions of empirical adequacy. From this point of view, the Déredec is intended as a formal framework for comparing different indices of functionality and of empirical adequacy of the descriptive rules. Moreover, for a given set of rules, it (and here I am thinking more particularly of APLEC) automatically tracks down the sequences or lexical items of the corpus whose description must be specifically enriched in order to raise the index of empirical adequacy. The rules themselves are not then automatically transformed or modified, which would constitute a very close simulation of human speaking behavior; but generalizations are nevertheless produced automatically so as to extend the scope of the solutions proposed at that moment by the user.
[1] The explanations (making an expression more explicit), i.e. the solutions to APLEC's problems, must be represented in a formal language, unlike the questions asked and the answers supplied, which are both given in natural language (French, English, ...). In this example, the explicit reformulations are given in English in order to make the presentation easier.
Bibliography:
PANACCIO, Claude. "Des phoques et des hommes, autopsie d'un débat idéologique." Philosophiques, VI, 1 (1979), 45-63.
PLANTE, Pierre. Le Déredec, logiciel pour le traitement linguistique et l'analyse de contenu des textes, Manuel de l'usager. Montréal, U.Q.A.M., Service de l'informatique, mars 1980.