http://wrap.warwick.ac.uk/
Original citation:
Dain, J. A. (1991) Syntax error handling in language translation systems. University of Warwick. Department of Computer Science. (Department of Computer Science Research Report). (Unpublished) CS-RR-188
Permanent WRAP url:
http://wrap.warwick.ac.uk/60877
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available.
Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.
A note on versions:
-Research
report
1BB
Department of Computer Science
University of Warwick
Coventry CY4
7/J,
SYNTAX ERROR HANDLING
IN
LANGUAGE TRANSLATION
SYSTEMS
JULIA
DAIN
(RR188)
CONTENTS
INTRODUCTION
1.
MODELS
FORSYNTAXERRORS
1.1 Minimum
Distance Errors1.2
Parser Defined Errors2.
LANGUAGES
FOR ERRORHANDLING
3.
ERROR REPORTING - TFIE USERINTERFACE
4.
ERRORDETECTION
5.
ERROR RECOVERY SCIIEMES5.1
Global Correction5.2
Regional Recovery - Alterations to the Parsing Stack and lnput5.3
l-ocat Recovery - Alterations to the Input Alone6.
CRITERTA FOR ERROR RECOVERY6.1
Performance6.2
Efficiency
6.3
Easeof
Use6.4
Goals7. EVALUATION
OF ERROR RECOVERY SCT{EMES7.1
Quantitative Performance Evaluation and Comparison7.2
Error Messages7.3
Automatic Generation7.4
Conclusions8.
FUTURE DIRECNONS
INTRODUCTION
Language
translation
systems such as compilers and interpretersfor
programming languageshave the task
of
translating input in a source language into equivalent output or actions in a targetlanguage.
In
addition to performing translation of correct input,language translation systems arealso usually expected to handle input that does not conform
with
the legal rules ordefinition
of
the source language. Such incorrect input is said to contain lexical, syntactic or semantic errors, according to whether the lexical, syntactic or semantic rules of the source language are violated.A
language translation system is conventionally decomposedinto
several tasks or phasesincluding lexical
analysis, syntax analysis,
semanticanalysis, code generation and
codeoptimisation.
Lexical
analysis apportions theinput
streaminto
those stringsof
input
symbols that make up the lexical tokensof
the source language. Syntax analysis checks that the sequenceof lexical tokens accords
with
the syntactic structure of the source language, usually described bya context-free gftunmar
(CFG).
Semantic analysis checksfor
non-context-free structure. Codegeneration and
optimisation perform
the desired translationinto
targetcode.
The phasesof
lexical,
syntactic and semantic analysis are usually expectedto
handlelexical,
syntactic and semantic errorsrespectively.
Thus a parserperforming
syntax analysis is usually expected to handle any stringof
lexical tokens, including strings that cannot be parsed according to the CFG used to describe the sourcelanguage.
The design of a syntax analyser includes design decisionson how to
proceed when theinput
is
syntacticallyincorrect:
theresulting
proceduresform
asyntax error handling scheme.
Good
syntrx
error handling enhances the valueof
a translation systemby
showing its usershow to provide
correctinput;
baderror
handling preventseffective
useof
a
system. Syntaxanalysis is
well
undentoodfor
correct input, to the extent that designers frequently use softwaretools
for
the
automatic construction
of
efficient syntax analysers.
But
the
analysis
of
syntactically incorrect input
remains lesswell
understood.This
surveyis
concernedwith
thedesign and construction
of
good syntax error handling schemes for language translation systems.Why
it
is important to have good error handling can be illusnatedwith
an example. Consider thefollowing
source input program in the C programming language:I
main
o
2
{ inti;
<'ll
4
prj-ntf("%d\n", i);
There
is
a semi-colon missingfrom
the endof
line
3.
When this programis
submitted to theportable C
compiler
widely
available under theUNIX
system Uohnson 19781, the diagnosticsissued are:
"file.c", line 4:
syntax
error
"fi1e.c",
Line 4:
i11ega1
character:1-34 (octal)
"fi1e.c",
line 4:
cannot
recover from earlier errors:
goodbye!
precisely - the user's attention is directed to the whole
of line
4.
The message about anillegal
character cannot be deciphered wittrout a character code
table.
The word"illegal"
should not beused, as
it
is inaccurate andintimidatory
[Molich
and Nielsen1990].
No
hints are given as to what might constitute acorrection. Finally
the compiler appears to give up altogether. The errorhandling
mechanisms are inadequate and prevent thecompiler
from
doing
a goodjob
on theinput, which
contains a common, simpleerror.
Giving
the same input to theGNU
C compiler[Stallrnan 1989] results
in
the error messages:"f
ife.
c"
: In
functi-on
main
"fj-l-e .c"
:4:
parse
error before 'printf
'
These messages
are an improvement
over
the previous example,
asthey
locate the
errorprecisely,
anddo not give
any misleadingor
inaccurateinformation.
They would
be furtherimproved
by giving
the user some suggestions on how to correct the enor.The issues
in
syntax error handling can be summarizndby thefollowing
questions: what is asyntax error?
how
will
it
be dealtwith? which
languageswill
be handled? and what are thetechnical issues
for
the parser? Thefirst
question, what is an error, can be answeredformally,
by giving a theoretical model which characterizes errors abstractly, or
informally,
from a practicalor
user-oriented
view which
characterizeserrors according
to plausibility or
frequency
of
occurrence.
The
second question, howwill
an error be dealtwith,
can be answered either byhalting
syntax analysis as soon as an error is detected, orby
attempting to continue analysis bysome means of
recovery.
Recovery must restart the parser and may include alteration or repair tothe
input.
It
may not be
feasible
to
produce
a theoretically optimal repair,
becauseof
considerations of efficiency, nor mayit
be feasible to determine the original intention of the user, becauseof
thenanre of
the system in which the recoverymethd
operates. The third question iswhether the languages to be handled are
arbinary
context-free languagesor
a restricted class,such as practical programming languages. Typical programming languages may
exhibit
special characteristics which have implications for errorhandling.
Studying such languages, and the useof
them, may leadin
different directions from a theoretical study: for example, to questions suchas what are the psychologically plausible errors or the most common in practice, which errors are
hard to handle, and what are the implications
for
programming languagedefinition.
The fourthquestion addresses the
technical
issuesfor
a parser:how
will
errors beidentified,
how
will
recovery be effected, what communication
with
the userwill
take place. Relevant factors includethe method
usedfor
parsing, the
natureof
accessto
the
sourceinput,
the extent
to
which
strategies are language-dependent, and the need for efficiency.
This paper surveys theory and practice
of
syntax error handling, addressing the particularissues
of
theoretical modelsfor
syntax errors, languagesto
be handled, error detection,eror
reporting,
andpractical
methodsfor
error
recovery.
Somefamiliarity is
assumedwith
theprinciples and practice of compiler design.
The
termserror handling,
error recovery, errorrepair
anderror correction
are usedwith
different
meaningsin
theliterature.
In
this paper thefollowing
definitions
will
be used.Error
Inndling,
or
anerror
handling scheme, is a general termfor
the method used to treat a syntaxerror
in
theinput to
a parser,including
reporting the error to theuser.
Error
recovery is errorhandling
in
which an attempt is made to continuepaning
the remaining input after detection of asyntax
error.
Error
repair describes those changes which an error recovery scheme may make tothe
input
andpossibly
the parse stackin
order to continueparsing. Error
correction is
errorhandling which alters the input to that which the user intended.
1.
MODELS FOR SYNTAX
ERRORS
Before discussing how to handle syntax errors,
it
is necessary to establishwith
some precisionwhat a
syntix
erroris.
Unfortunately,
aninformal
conceptof
a syntax error doesnot
alwayscorrespond
with
a
formal definition.
A
CFG
usedto
describe thecontext-free
syntaxof
aprogramming language defines precisely those strings which are syntactically colrect programs,
namely sentences
of
the language, but not what the syntax errors arefor
a string which is not asentence
of
the language.
This
sectionreviews
the modelsof
minimum
distance errors and parser defined errors.1.1
Minimum
Distance
Errors
Minimum or
Hamming
distanceis
frequently
used as aformal
model
for
syntaxelrors.
It
measures the shortest way to transform a syntactically incorrect programinto
a correct one andthus approximates
to
the prograrnmer's conceptof
syntaxelTors.
Given
afixed
setof
editoperations,
typically
the insertion, deletionor
replacementof
a single symbol, theminimum
distance between
two
strings is the minimum number of edit operations needed to transform one string into theother.
For example, the minimum distance between the nvo strings of parentheses"(0)"
and"0"
is two, obtained by two insertions.Additionally,
costs may be associatedwith
the edit operations to grve a least-cost sequence of edit operations needed to transform one string intothe other [Wagner
andFischer 19741. The
minimum
distance measureis
usedto
define
aminimum
distanceerror correction
for
a
string
anda
language:a
minimum
distance errorcorrection is a sentence
of
the language nearest to the string,in
the sense that there is no othersentence whose
minimum
distancefrom
the stringis
smaller. A
least-costenor
correction issimilarly
definedto
be a sentenceof
the language such that thereis
no other sentence whose least-cost sequence of edit operations isof
smallercost.
For syntax erTor correction, the symbolsto
be operated uponby
theedit
operations are the terminalsof
a CFG (thelexical
tokensof
a programmin g lan guage).The
minimum
distance error correctionfor
a string and a language is not uniquely defined.Consider the string
"(0"
and a language of balanced parentheses generated by the grammarwith
productions S
+ 0
|
(S).Two
minimum distance elror correctionsof
distance 1for
thissring
are
"0"
and"(0)".
This example is readily extended to that of common programming languages. For example, in Pascal [Cooper 1983], arithmetic expressions may be bnacketedwith
an arbitrarynumber
of
pairs
of
parentheses( ),
andin
C
[Kernighan
andRitchie
1978f,blocks may
bebracketed
with
an arbitrary number ofpain
of braces{
}.which the edit operations are
applied
The location of the lefunost errorin
an incorrect string canbe uniquely defined for a given minimum distance error correction as the leftrnost point at which
the string
differs
from
its
correction.
For the example string"(0"
and the correction"0",
theleftmost error is at the second
symbol.
For a stringwith
more than one correction we may wishto choose a correction
with
the longest possibleprefix
of
theoriginal
string.
Sofor
the string"(0"
and the corrections"0"
and"(0)",
the correctionwith
the longest possibleprefix
is"(0)".
The
lefrnost
error locationfor
a longestprefix
minimum distance error correction then uniquelydefines a point
in
an incorrect string which isthefrsr
minimurn distanceerror.
For a string and alongest
prefix
minimum
distance error correction atminimum
distance one, thereis only
onepoint
atwhich
the stringdiffers from
the correction and thisis
the (unique) minimum distanceerror.
For the string"(0"
the minimum distance error is after thethird
symbol, at the end of thestring.
1.2
Parser Defined
Errors
A
parserfor
a language may not detect an error in its inputuntil
some distance after the locationof
the lefnnost
error
accordingto
theminimum
distanceerror
model.
The
following is
anincorrect
Pascal programin
which
the plausible syntax erroris
the omissionof
the keyword'f
or'
atthe beginning
of line
3.
(In
the Pascal programs used as examples, declarationsof
variables are omitted; these omissions constitute semantic errors, not syntax errors.)
-1
program
example
(output)
;2
begj-n
3
n':
1
to
l-0
do
sum
::
sum
+
n4
end.
The
minimum
distance error coincideswith
the plausible error and occurs online
3
after thesymbol
'begin'
at thesymbol'n';
aminimum
distance error correctionfor
the program is toinsert the symbol
'f
or'
at thispoint.
But
parsers such as the practicalLL
orLR
parserswith
one
symbol
of
lookaheadwill
not
detect aneror
until
the symbol
'to'
is
met,bcause
thefragment
'n .:
L'
forms a legal Pascal statement. Peterson introduced the conceptof
parserdefined
errorsto
model this
classof
errors
[Peterson1972].
The parserdefined error
in
anincorrect string,
with
respect to a language, marks the point at which aprefix of
the string ceasesto be a
prefix
of the language. In the example above, the stringproglram
example
(output)
; begin n ':
1 is aprefx
of
a Pascal progftlm, but the stringprogram
example
(output;
; begin n ':
1
to
is
not.
The
parserdefined error
is
atthe symbol
'to'.
The
name "parserdefined
error"
is somewhat misleading, as the parser defined erroris
definedby
the language andits
prefixes,rather than by a parser
for
the language.It
must be bornein
mind that error handling is performed at the levelof lexical
tokens, theunits
or
symbolsfor
syntax analysis, and not at thelevel
of individual
sourcecharacten.
Thehuman who reads the source program makes use of much more information than is available to a
piuser
or
syntax error handling scheme,including
notonly
compositionof
tokensfrom
source characters,but
alsolayout
of
the progrnm, context-sensitiveinformation
such as types, andlogical
meaning.
So the modelof
minimum
distanceeror
correction, although theoreticallyuseful and convenient as a
formal
measureof
the numberof
errorsin
a string,is limited in
itsability
to model the user'sintention.
The model of parser defined errors is useful in the way thatit
models the errorsin
a stringfrom
thepoint
of
view
of
a parsing machine, but is even morelimited
in itsability
to model the human user's intention.2.
LANGUAGES FOR
ERROR HANDLING
This
sections surveyswork
on
programming language design relatedto
the occurrence andhandling
of
syntax errors, not the much wider questionsof
language design eitherin
general orrelated to other aspects such as programming languages
for
particularapplications.
There arefew
paperson
this
topic
in
comparisonwith work
on
the designof
error handling
schemes.There
is
somework on
the
designof
languageswith
respectto
the
kind
and frequencyof
occrurence
of
errors [Peterson 1972; Gannon and Horning
197 5;Wetherell
1977;Ripley
andDruseikis 19781.
A
few papers are concerned wittr various measuresof
sizeof
CFGs [Ginsburgand
Lynch
1976: Bertsch 19831.Finally
several authors consider waysof
making CFGs more suitablefor
usein
particular error handling schemes [Pai andKieburtz
1979,1980;Tai
1980a,1980b; Mauney and Fischer 19881.
Gannon and
Horning [1975]
discuss the designof
programming languagesfrom
thepoint
of
view of improving reliability
of programs. Both syntactic and semantic errors are consideredand
it
is
shown that language design decisions have a significant effect on progamreliability.
This work does not seem to have been followed up; current research
in
software engineeringfor
reliability
favours the useof
formal methods and development of programsfrom
specifications,via transformations.
Some properties which are desirable for error correction are undecidable
for
arbitrary CFGs;in
particular,
it
is
undecidablewhether minimum
distanceerrors
and parserdefined
errorscoincide
[Peterson 1972:
Wetherell 19771.
However, strings
for
common
programminglanguages can be
exhibited
in
which
theminimum
distanceerror
doesnot
coincidewith
theparser
defined
error.
(For
Pascal, seethe
exampleprogram
given
in in
Section
1.2.)
A
probabilistic approach can be used to give probabilistic measures
for
the occurrence of errors andthe distance
from
an error toits
detectionby
a parser[Wetherell 1977).
Measures can also be usedfor
theprobability
that an errorwill
be detected and that a repair is aprefix
of the language.Wetherell suggests that language designers should use such probabilistic analyses
in
an attemptto
make languages less susceptibleto
common errors.Ripley
andDruseikis
[978]
give
ananalysis
of
acollection
of
students' Pascal programswhich
shows the natureof
such commonerrors
for
Pascal.
They
find
that
themajority
(88Vo)of
syntaxerrors
aresimple, the
mostcommon being omission
of
a single token and substitution of an incorrecttoken,
and that elrorscauses great problems
to
all
methodsfor
errorrecovery.
They recommend avoidanceof
suchproblems
by
appropriate language design,for
exampleby
restricting
comments and stringconstants to a single line.
Ginsburg and
Lynch
11976l compare grarnmarsfor
size complexity,in
termsof
numbersof
gralnmar
symbols andproductions.
They
show thatonly
linear
improvementin
size can beobtained for the
CFGs.
Bertsch t19831 defines a size measure of efficiencyof
a CFG as the sum over all productionsof
the lengths of righthand sides, and gives optimal forms for expressions inPascal based on experiments measuring parsing
times.
Results in this area are of relevance whenconsidering the size
of
aparticular
CFG,which
determines the sizeof
its
parser and also thenumber
of
symbols or productions to be considered by an automatic error handling scheme.Some
error handling
schemesrequire the
useof
special symbols
in
the
language; acontext-free language
(CFL)
to be handled by such a scheme mustb
of
suitable design. Pai andKieburtz
tl979l
deftneJiducial
symbolsof
a languagewhich
are usedby their
error recoveryalgorithm
as tokens which constrain the stringfollowing
themin
a sententialform
in particularways. Tai
[1980]
discussespredictors, which
are substringsof
a sentence that determine the contentsof
the parse stack when the symbolsin
the substring are parsed, and shows how thesemay
be usedin
error
recovery.
Mauney
and Fischer[988]
refine
thesenotions to identify
particular
symbols, parsing
aheadto which
ensuresthat
a certain
classof
repairs can
be constructed.3.
ERROR REPORTING
-
THE
USER INTERFACE
Good
error reporting is important to
the userof
a language translationsystem. For
a noviceuser,
the error
messages may make thedifference
betweenbeing
ableto run
a program andobtain results, and being frustrated
with
a progmm thatwill
notcompile.
An
experienced usermay
not
considerthat error
messages are asimportant
as,for
example,the
quality of
codegenerated. But even the most knowledgeable user makes mistakes like keyboard slips from time
to
time.
Good error messages give theability
to correct these errorsreadily,
and hence make a system more pleasant andefficient to
use.
Error
reporting constitutes the user interfaceof
anerror handling
scheme. As
such,
it
is only
possible
to
consider certain
aspectsof
it
independently
from
any method used to handle errors.Most work on error handling recognizes the importance
of
the user interface and gives goodattention
to error
messages. However, Sipputl98|
in
his
survey notes thatit
is
curious thatnone
of
his
referencesis
solely
devotedto
syntax errorreporting.
Since then the papersof
Brown
11982,
19831are devoted
solely
to
analysesof
syntax
error reporting
by
Pascalcompilers. Other later papers surveyed here are concerned
with
the user interfacein
a contextwider
than thatof
compiling,
paying attention toall
computer system messages[Dwyer
1981;Shneiderman 1982;
Norman
1983;Molich
andNielsen
1990J.Horning
tl974l
makes classicrecommendations for both the content and the form of compiler messages to the user, including a
classification
of
responsesto
errors. Brown
11982, 1983]criticizes error
messagesfrom
theuser's
point
of
view,
describing the current "state-of-the-art" as"appalling".
Heidentifies
the"egotistic" compiler
as onewhich
presents informationin
its own tenns rather thanin
thoseof
the user, and he criticizes messages which say not what is wrong but what the compiler expectedto find.
Onthe
other handKantorowitz
andLaor
[1986]
consider that alist of
the expected symbols isreadily
understood by the programmer; oneof
their prime requirementsfor
an errorhandling system is the avoidance of inaccurate messages and the automatic generation
of
usefulmessages
which
are described as simple, honest andreliable.
(Personal tastesdiffer
in
theevaluation
of
messages as"good".)
Brown also requires the avoidanceof
inaccurate messages,and suggests
that the
developmentof
integrctedprogramming
environments and the useof
windowing
systemsshould
improve the quality
of
error
messages. However,
unless the designersof
such systems pay pafticular attention to error handling and the resulting messages,there
is
no
reasonto
suppose that improvement should come,simply
because awindowing
system
or
an integrated environment is being used. For example, thefollowing
source input inthe Pascal programming language lacks a pair of parentheses for the procedure parameter on line
5:
J
program
demo(output)
;2
var i: integer;
3
begin
4
for i::1to10do
5
writeln i
6
end.
When
submitted
to
MacPascal,an
integrated Pascalprogramming environment
for
AppleMacintosh
personal computers usingwindows
and graphics,the
diagnostic message, whichpops up
in
a separatewindow
from the input program, is"This
does
not
make
sense
as
a
statement"
In
theinput program
window,
thefifth
line
is
indicated as theoffending
statement, and thesymbol
'i'
is highlightedwithin
theline.
The useof
windows and graphics enables good visualidentification of
the parser defined error, but the contentof
the error message isstill
misleadingand
uninformative.
A
better message would suggest insertionof
the missing pair of parentheses.Dwyer [1981]
usesbehavioural theory
to
develop a user-friendly interface
in
which
a system's responseto
erroneousinput
consistsof
repeating theinput, indicating
the erroneoussymbols and
displaying the correct format
of
the
input.
This
model
is
particularized
for
acompiler which attempts to print the grammar production
it
expects to use.(Ihis
is unlikely to behelpful to
mostprogrammers.)
Shneiderman U9821 makes recommendations about computersystem messages
in
general including the development of quality control and guidelines, the useof acceptance tests, and the collection of user performance data. Norman [1983] also snesses the
importance
of
analysing users' performance, particularly the errors that are made, to helpin
thedesign
of
interfaces that improve performance and reduce the number and effectof errors.
He recommends that the state of the system and a list of options should be available to the user.The most recent research
is
still
making the same recommendations:two
of Molich
anderror
messages" and"Prevent
errors"[Molich
andNielsen
1990].
Good error
messages aredefensive, blaming the system rather than the user; precise,
giving
exactinformation
about theproblem;
andconstructive,
suggestinghow to
proceed.The
conclusionsfrom
an interestingexperiment in dialogue evaluation are that designers
of
dialogues areinsufficiently
aware of theimportance
of
design
to
prevent
or
tolerateerrors,
andthat
theprinciples
involved
are notcommon knowledge.
The reporting
of
errors
and
their
treatment can
be
studied as
a
separatetopic, with
recommendationsfor
improvements. But implementationof
those improvements is impossible to achieve without changing the parser or theeror
r@overy scheme.4.
ERROR DETECTION
A
parser detects an errorin
itsinput
whenit
has no legal move onits
nextinput
symbol.
Theproblem
of
error detection(with
respect to parser defined errors) can be said to be both definedand solved by parsers which possess the correct
prefa properfy.
Such a parser consumesonly
input that is a
prefix
of the language to be parsed;it
hals in
a configuration where the next inputsymbol
concatenatedwith
thepreviously
consumedinput
doesnot
form
aprefix
-
thatis,
acorrect-prefix
parser halts at the parser definederror.
Many parsers,including
all LL
andLR
parsers, possess the correct
prefix
property; operator precedence parsers do not possessit.
The correct prefix property is important for both error reporting and error recovery, becauseit
enables the parserto
output a diagnostic message andinvoke
an error recovery scheme at the pointof
error.
The diagnostic message locates the exact incorrect symbol according to the parser definederror
model.
The error rerovery scheme is invoked before consumptionof
the incorrect symbol, rather than several symbols later.Although
correct-prefix
parsers never consumeincorrect
input
symbols,
StrongLL(1),
SLR(I)
andLALR(I)
parsers may perform parsing actions beforehalting
[Fischer etal
1979; Graham etal19791.
StrongLL(l)
parsers may pop the top non-terminal on the parsing stack, inan expansion by the empty
production.
SLR(I)
andLALR(I)
parsen may replace a sequenceof
gnrmmar symbols
on
theparsing
stackby
a non-terminal
in
areduction move,
because themethod
of
constructionfor
the parsing tables uses less lookaheadinformation
than the methodfor
canonicalLR
tables. Additionally,
the tables ofLR
parsers may be compacted so thatfor
astate where the
only
non-enor
moveis
a reduce move, this reductionwill
be indicatedfor
allinput symbols; these are called default reductions.
The term immediate
error
detection properry is usedfor
parserswhich
halt as soon as thepi[ser
defined error is the next symbolof
the remaining input,without performing
any actions.This
property
is
strongerthan the correct
prefix propefty,
and hasthe
advantagefor
errorrecovery
of
leaving
the parsing
stackin
the configuration
in
which the error is
detected,providing
moreinformation
for
an error handling scheme. The canonicalLL
andLR
parserspossess the immediate error detection property, but are not usually
of
practicalinterest.
Ghezzitl975l
defines a subclassof
LL(l)
grarnmarswhich
have parserswith
the property. There arerelevant differences
between StrongLL(l)
parsers andLL(l)
parsers as constmctedby
thealgorithms
in
[Aho
andUllman 1972,pp.345,
350].
SuchLL(l)
parsers possess the immediateerror detection property, but have larger parse tables than Strong
LL(l)
parsers,which
do notpossess the immediate error detection
property
[Fischeret
al1979].
Both possess the correctprefix property.
Fischer et al give a parsing algorithmfor
non-nullable StrongLL(1)
grammarswhich
does perform immediate error detection. The algorithm is improved so thatit
parsesall
Strong
LL(l)
grammars fMauney and Fischer 1981].5.
ERROR RECOVERY
SCHEMES
Error handling
schemes rangefrom
thosewhich quit
as soon as an erroris
detected,to
thosewhich perform
full
error correction. Most error handling schemeslie
somewhere in between andattempt some
form of
error recovery, rather thanquitting,
looping or crashing on the one hand,or attempting conection on the
other.
To be strict, true error correction cannot be achieved for allinputs by any scheme which does not consult the user about his or her
intentions.
However, the terrncoffection
is occasionally usedwith
a less strict meaning,for
examplefor
a compiler that always produces a syntactically correct structue and the corresponding target code [Conway andWilcox
19731.Minimum
distanceeror
correction is the term used to describe schemes in whichthe user's intention is modelled by the minimum distance metric
[Aho
and Peterson1972;Lyon
1974;
Krawczyk
19801. The error handlingin
these schemes is integratedwith
parsing, ratherthan
invoked
as a separate module when an erroris detected.
Some schemes performlocally
minimum
distanceerror
correction, thatis
they produce an errorrepair which is
aminimum
distance error correction
for
the input over a bounded, localregion
[Anderson and Backhouse 1981; Mauney and Fischer 19821. Most schemes however do not perform error correction, butuse
someform
of
error repair to effect error
recovery and
get
the
parserback
on
track.Approcahes to error repair may involve alterations to the input alone, known as local recovery, or
alterations to both the input and the parse stack, known as regional recovery.
The
history
of
work
onsynto(
error handling datesfrom
the nineteen-sixties.During
thatdecade, production compilers were developed and their need
for
good quality error handling wasidentified
and,to
some extent,met.
The areaof
error recovery was pioneeredby
CORC, theCornell
Computing languagecompiler
[Conway andMaxwell
1963], its successorfor
a dialectof
PLI
[Conway andWilcox
1973], theDITRAN
compiler [Moulton andMuller
1967], and thePL360 compiler
[Wirth
1968].
These papers also supplied some impoftant ideasfor
later work.The aim
for
CORC andits
successor was to complete translationof
every program submitted,whether correct or
noL
The PL360 compiler provided a heuristic solution to the problem of errorrecovery
in
a precedence parser.Although
thequality
of error recoveryin
these compilers wasgood, the methods used were ad-hoc and hand-coded. The techniques used have subsequently
been developed
for
automation. Thefust
paper on automatic error handling, and the only one to be publishedin
the sixties, was due to Ironstl963l. It
gives a parsing algorithm which carries allpublication
of
many
papers.
The
next
sectionsof
this
paper attemptto
explain the
basictechniques
of
error recovery
schemes andto give
a comprehensive surveyof
recentwork.
(Bibliographies
of
lessrecent
work
may be
found
in
[Ciesinger
19791,[Sippu
1981]
andlMattem
19841.)The primary aim of an error recovery scheme is to restore the parser, which is in a state from
which
it
cannot proceed, to a statein
which parsingof
the input can be resumed. The strategymay
beto
useglobal
or
local information,
or
a combinationof
thetwo.
Actions
takenwill
usually include making alterations to the parsing configuration, that is to the input and possibly to
the parse
stack.
Error recovery
schemes can be classifiedinto
three areas, those performingglobal correction, those altering
only
the input (local recovery) and those altering the input andthe parse stack (regional
recovery).
A
schemethat performs
global
correction
usesfull
information
about the entireinput in
order to make itschoices.
A
scheme that alters the inputalone
is
in
effect making
an assumption about the stateof
the parsing stack, thatit
is
in
somesense correct and forms a basis
from
which parsing canrestart. Left
context information, aboutpreviously parsed input, may be used to decide on a suitable repair, but may not be changed.
A
disadvantage
of
this approach is that anenor
which is not detecteduntil
some symbols after theminimum
distanceerror
is
impossible
to correct.
A
schemethat
altersthe
stack does notnecessarily assume that the previously parsed
input
is correct, and has greaterflexibility in
itschoice
of
restartstate.
A
disadvantageof
this approach is that recovery actions that pop statesfrom
the stack simulate the undoingof
previous parsingactions. In
acompiler
these actionsfrequently include code generation and alterations to the symbol table, which are hard to undo in
practice.
5.1
Global
Correction
Schemes that perform global correction adopt an approach in which the entire input is taken into
account when handling errors,
with
the aim of producing a minimum distance or least-cost errorcorrection
for
an incorrectstring.
Global correction is effected by adapting a general context-freeparsing algorithm
for
use on the entire input, incorporating error nansformations eitherdirectly
into the parsing algorithm or by means
of
special productions added to theCFG.
This approachcontrasts
with
one in which an efficient parser analyzes the inputuntil
anillegal
symbol is met" atwhich point a separate error handling scheme is activated
The correction
of
errors
is
typically
modelled
by
considering three types
of
error
transformation
of
a string:
replacement,insertion
or
deletion
of
a single
symbol. Aho
andPeterson
U9721give
an algorithm based on thatof
Earley[970]
which parses anyinput
string according to a given CFG using the smallest possible number of transformations required to mapthe string into a
syntactically correct
string.
Error
productions
which
simulate the
abovetransformations are added
to
thegiven
CFG,yielding
an ambiguous grammar generatingall
sentences over the input alphabet. This gftLmmar is used to gurde the parse and count the number
of
error transformations made. The algorithm requires O(n3 ) time and O( n2 ) space, renderingit
impractical
for
usein
a productioncompiler.
Lyon
[1974]
also gives analgorithm
based onEarley's
algorithm which
computes theminimum
numberof
error ffansformations needed to transform an incorrect string into a sentence. The algorithm handles all context-free languages.A
similar
approach performs global recoveryby restricting
the setof
errors that can be handledto
thoseinvolving
single edit operations on alimited
setof
input
symbols, separatorsand parenthesis
stnrcture
symbols[Krawczyk 1980].
A
CFG
which
models
such errors isconstructed
by
adding to theoriginal
CFG productionswith
such symbols deleted or replaced.The resulting language must
lie within
the class of languages that can be handled by the chosenparsing
methd.
The error correctionis
then performedby
the parseritself
for
this
CFG. TheLR parsing method is extended to handle the ambiguity which commonly arises from the deletion
of parenthesis structure symbols from productions.
5.2
Regional Recovery
-
Atterations
to
the
Parsing Stack
and the Input
The central problem
for
an error recovery scheme which is notintegated with
the parser is thatof determining how to transform a parser configuration
in
which there is no legal move into onein which there is at least one legal move, so that parsing can continue. This section
will
considersolutions
to
this problem using
several
different
techniques,whose common element
ismodification
of
both the parsing stack and theinput.
The problemis
now oneof
determininghow much
of
the parse stack andinput
to alter and what the inserted stack andinput
symbolsshould
be. In
effect,
the error recovery method must aim toidentify
and replace aportion or
phrase
of
theoriginal
input extending toleft
and rightof
the symbol at the point of discoveryof
error; input
symbols
to
the
left
of
the
error
symbol
have already been parsed and are thus representedin
some way on the parse stack.52.1
Panic ModeOne
of
the most basic formsof
error recovery,panic
mode, has been extensively used inproduction compilers and
has
provided
a
basis
for
many
more
sophisticated
schemes[McKeeman et
al
1970]. In
panic mode, input symbols are deleteduntil
oneof
a setof
markersymbols, determined by the compiler-writer for the particular language, is
met.
The parse stackis
then popped
until
the
marker symbol can be
shifted.
Panic mode recovery
is
easy toimplement and
efficient in
operation, but often resultsin
the deletionof
several input and stacksymbols, leading to indifferent qualiry of crror handling.
More complex schemes than panic mode use various techniques to
identify
the phrase to bereplaced:
forward
moves, error productions, marker symbols, costs, and combinationsof
thesetechniques.
5.22
Forward
MovesThe technique
of modifying
the parse stack by performing an alternative reduction is known asphrase-level recovery, due to
Wirth
U9681 and formalized for precedence parsers and named byerror is supposed to occur, and to recover by replacing that phrase
with
a suitable non-terminal.Aforwardmove
is used to gain information about the upcominginput.
ln
the forward move, theerror recovery
schemetemporarily
passescontrol
back
to
the parser so
that
someof
theremaining input can be
parsed.
After
the forward move, the recovery
scheme uses theinformation
gainedfrom
the
parse ahead,to find
the
shortest sequenceof
stack and inputsymbols, bounded
by
the appropriate precedence relations, that can be replacedby
a uniquenon-terminal
which
gives
avalid
reduction
and subsequentlegal
parserconfiguration.
Thetechnique of phrase-level recovery
with
a forward move can also be usedin
conjuctionwith
anLR
parser [James 1972].A
baclcward rnove can be used before a forward move, to obtain further information aboutalready
parsedinput
and
the
stackconfiguration,
or to
alter the
stackby
making further
reductions [Graham and Rhodes
1975].
Grammar symbols are assigned costsfor
insertion anddeletion operations, so that a string
of
symbolswith
least cost can be chosen as a modification tothe
stack. An
alternative useof
a backward move makes the assumption that errors occur inclusters containing up to a
fixed
numberof
errors[Levy
1975]. The backward move is used tofind
theposition
of
the last special symbol met, called a beacon; theforward
moveis
used to generate legal continuations and to select one.Irinius'
methodof
phraseJevel recovery has been formalized and extendedfor
LR parsers, and used togetherwith
spelling correction and local correction as a basisfor
the generationof
automatic
error handling
in
the compiler-writing
systemt{LP78 [Sippu
1981; Sippu
andSoisalon-Soininen 1982, 19831.
Experiences gainedin
using the
HLP78
compiler
writing
system suggest that ad-hoc modiflrcations to CFGs give better error recovery and diagnostics than can be provided by using the system
with
the most natural CFGs [Raiha 1980; Raiha etal
1983].Experimental results show that the quality of phrase-level recovery depends heavily on the form
of the
productions.
For example,if
a listof
statements in Pascal is generated by the productionsstateme nt-s eque nce
+
stotement - seque nc e ; statern ent
I s tateme ntthen a semicolon missing
from
betweentwo
statements causes the deletionof
the wholeof
thesecond statement. Phrase-level recovery action chooses a non-terminal as a reduction goal and isolates a phrase in the input which is to be replaced by that goal; in this example the non-terminal statement-sequcnce is chosen as the reduction goal and the phrase is delimited at its righnnost end
by a semicolon or the end
symbol.
If
the less natural pnrductionsstateme nt- s e quc nc e -> stateme nt _list s emico lo n snteme
nt
I statementsemicolon
+
1are used instead, the non-terminal semicolon is chosen as the reduction goal and the phrase in the
input to
be replacedis
the empty phrase between thetwo
statements.In
this case phrase-levelrecovery is equivalent to local recovery, because the
form of
the productions has been carefullychosen to ensure insertion or replacement of a suitable terminal.
Another scheme
for LR
parsers based on the ideasof
Lrvy
and Graham and Rhodes restartsthe parser on a forward move and then attempts repair by insertion
of
a single symbol [Mickunasand
Modry
19781. Several forward moves may be considered, as the restart states areall
those parser states to which there is a transition after shifting thecurent
input symbol.52.3
Enor
ProductionsAn
error productionis
a production added to a CFGwhich
includes the useof
a new terminalerror in
the right-hand sideof
theproduction.
Theerror
terminalwill
be incorporated into thepaning
tables, possiblyby
a parser-generator, and used to guide errorrecovery.
The popularLALR
parser-generator yaccflohnson
1975] uses such an automatic error recovery scheme, dueto
Aho
and Johnson[1974).
The user adds error productionsof
theform A
+
error
or
A
+
...
error
...to
the CFGforming
theyacc
input
The error recovery scheme pops stack statesuntil
the top state can shift theerror
symbol.
The stack is then reduced by the appropriate errorproduction, and input is discarded
until
a legalshift
symbol ismet.
Three legal input symbolsmust be shifted before recovery is complete - any intervening
illegal
symbols are deleted. Thesame
errorrecovery
scheme is used for Bison, the GNU parser-generator[Donelly
and Stallman19881.
Poonen[1977]
describes asimilar
schemefor LR
parsers,which
also uses anerror
terminal in productions added to the gmmmar by the
compiler-writer.
When an error is detected,all
statesin
the
stackwhich can
shift
error
are inspectedto
determineall
their
legal shift
symbols. Input is deleted
until
one of these is met and the stack is poppeduntil
a state which canshift
this
symbol is on top. This method rnay delete more stack states, but fewer input symbols,than that of
Aho
and Johnson.5.2.4
Marker
SymbolsMarker symbols are distinguished terminals of the CFG used in error recovery, usually to
delimit
portions
of
the inputwithin
which alterations may be made.The Zurich Pascal compiler contains a top-down syntax analyzer implemented by the method
of
recursive
descent[Wirth
197l].
The method
usedfor
error recovery
is a
sophisticateddevelopment
of
panic mode recoveryemploying
setsof
marker symbolsknown
asfollow
setstwirth
1976,Amman 19781.
Atthepointof
detectionof error,thetop-downparserhasagoal
non-terminal on top
of
its stack, associatedwith
which is afollow
setof terminals.
Thefollow
set contains those terminals
which
are legal input symbolsfor
the current parser configuration,together
with
those terminals which, although not legal, should not beskipped.
Error recoveryconsists
of
skipping
input until
sucha symbol is met,
simultaneouslyrebuilding
thepatial
syntax
tree.
The recovery method is implemented by an error procedurewith
a formal p:rameterthat represents a set
of
symbols compatiblewith
the current syntaxtree.
An additional parirmetermay be used to represent a set
of
symbols compatiblewith
arebuilt
nee[Amman 1978].
Eachparser procedure
must
call
the error
procedurewith
the
appropriateparameters. The
errorrecovery scheme
in
theIBM
PascaWS compiler is also based on the useof
follow
sets [Spenkeet
al
19841. The
method
is
made
more
systematic and suitable
for
generating directly
t19801. For a
formal
descriptionof
follow
sets and algorithmsfor enor
recovery usingfollow
sets and error productions, seeStirling
[985].
These error productions are productions whichcontain
instancesof
anerror
message, rather than the error productionsof Aho
and Johnson[1974] which
contain instancesof
an errortoken.
Kantorowitz
andLaor
[986]
use a schemefor
LL(l)
parsers basedon
follow
setsin
which
ensuringthe
usefulnessof
error
messages generated isof primary importance. An
error message consistsof
alist of
legal symbols at thepoint
of
discoveryof
error and anindication
of
which input
symbols are skippedin
recovery.kwi
etal
U9781 describe a techniquefor
top down andLR
parserswhich
uses marker symbolssupplied
by
the
compiler-writer to delimit
the
start and endof
a phraseof
theinput,
and anon-terminal reduction goal
for
such a phrase.Pai and
Kieburtz
11979, 19801 also use markerorftducial
(trustworthy) symbols toidentify
a phrase
for
recoveryin
anLL(l)
parser.
The parser scansinput
until
such a symbolis
met,when
the
stackis
poppeduntil
a sententialsuffix
beginningwith
thefiducial
symbol can beaccepted.
Tai
[1980b]
andMauney
andFischer
[988]
discusssimilar
modelsfor
marker symbols.5.2.5
CostsCosts
of
edit operations can be used to determine a locally least-costrepair
[Graham and Rhodes19751.
Any
method can be used to construct a setof
possible repairs,followed
by computationof
the
costof
the repairswith
respectto
the actualinput. A
developmentof
this
techniqueemploys an additional measure, the
reliabiliry
value of an input symbol [Spenke etal
1984]. Thereliability
value measures the probability that the symbol was not placed in the input accidentally.Insertion
anddeletion
operations are simulatedby
popping the
stack anddiscarding input
symbols
until
the parser is restored to avalid
configuration. The choiceof
repair is based on acomparison
of
thetotal
costof
theedit
operationswith
thereliability of
the next actual inputsymbol to be accepted.
5.2.6
CombinationsCombinations
of
the
above techniques are used,particulary
in
those schemeswhich
can bedescribed as
two-level.
A
two-level
scheme hastwo
separate stagesfor
error
recovery, thesecond
of
which
is enteredonly
if
thefirst
stagefails
to produce an acceptable recovery action.Usually the idea is that a simple error may be corrected by a simple technique, but
if
the error iscomplex or cannot be repaired by the simple technique, then a more complicated technique must
be brought to bear on the
problem.
Graham, Haley and Joyll979l
give a nvo-level schemefor
LR parsers
which
uses several ideas : a forward move, to obtainright
context, and costsfor
editoperations, to guide the choice
of
a repair to the input, in thefirst
stage; and enor productionsfor
phrase-level recovery
in
the second stageif
no first-level repair succeeds. The scheme also usessemantic
information, in
hand-coded recovery procedures. The useof
semantic informationfor
syntax
eror
recovery has been formalized and extended [Corbett1985]. A
potential repair canbe tested
with
a forward movewith
executionof
semanticactions.
This approach yields furtherinformation to guide the choice of repair, but has the drawback that
it
requires the compiler to beable to undo the effects
of
semantic actions.Burke
andFisher
tl982l
give
atwo-level
schemefor LR
parserswhich
attempts localrecovery
of
a single edit operationfirst,
backing up on the parse stack,followed
if
necessary byphrase-level
recovery.
The notion of scope is used to control the region in which local recoveryis attempted, and to provide the goal
for
secondary recovery by insertionof
a string whichwill
close
scope. This
schemeis
extendedto
onewith
three levelsof
recovery[Burke
and Fisher19871.
Thefirst
level performs
simple recovery as before;the
secondlevel
performs scoperecovery
which
deletes and/or insertsinput
symbolsto
close an open scope; thethird
levelperforms phrase-level recovery
which
deletesinput
beforeor
after the symbol at thepoint
of
detection
of
error.
Boullier
and Jourdantl987l
give a two-level scheme for table-driven parsersand lexical analyzers. The
first
level attempts insertions, deletions and replacements at the pointof
detectionof
errorwith
the aimof
finding
a matchin
alist
of corrections.
The second stagedeletes input up to a key terminal or marker symbol, and pops the stack
until
that symbol can beaccepted.
A
similar two-level
schemefor
LR
parsers also attempts a single repair at the errorpoint, but
uses aforward
moveon
theinput to
determine whether therepair
succeeds [Dain19891. The second level chooses a non-terminal from the parse table as the reduction goal
for
a phrase-level recovery which pops the stack and deletes input.5.3
Local
Recovery
-
Alterations
to
the
Input
Alone
The designer
of
an error recovery scheme may choose tolimit
the possible transformationsof
parser
configurations
by allowing
modifications
to
the unexpendedinput
only.
Oneof
thereasons
for
this choice is thatmodifying
the stack may require changing semantic informationassociated
with
the parsedinput, which is
difficult
to
achievein
practice[Gries
l97I].
Theproblems
to
be
solved arefirstly,
determininghow
muchof
the unexpendedinput,
from
thesymbol at the point
of
discoveryof
error and possibly extending rightwards, should be altered,and secondly, determining an appropriate replacement string or
repair. Input
symbols to theleft
of the error symbol have already been parsed and are not eligible for edit operations; neither is the
crurent state
of
the parser to be altered. Severalof
the techniques used toidentify
symbols to bereplaced
in
schemes which alter both stack and input are also used here, namely forward moves,marker
symbols, andcosts.
The
useof
error
productionsis
inappropriate, asthis
techniqueinvolves replacing stack and input symbols by a suitable non-terminal.
5.3.1
Forward
MoveA
forward move on
theinput
can be usedto obtain information on how to
proceedwithout
altering the parse
stack. An
SLR parser can be extendedwith
extra states to handle errorswith
aforward move [Druseikis
and
Ripley 1976).
At
the
point
of
detection
of
error, a
specialnon-SLR parser is activated to parse the upcoming input as far as
possible.
Theinitial
stateof
this parser