EDITING BY COMPUTER

A thesis submitted for the degree of MASTER OF SCIENCE
at the
Australian National University, Canberra
June 1967
by
David MENDUS
CONTENTS

PREFACE
SUMMARY
STATEMENT

CHAPTER 1. INTRODUCTION
1.1 Introduction and definitions.
1.2 Error checks.
1.3 Automatic adjustment.
1.4 Some extensions and problems.
1.5 The logical edit.

CHAPTER 2. LOGICAL EDIT SPECIFICATION
2.1 Logical edit design.
2.2 The extent of the language.
2.3 Existing techniques.
2.4 Evaluation of existing methods.
2.5 The suggested approach.

CHAPTER 3. THE LANGUAGE
3.1 Introduction.
3.2 The structure of FRED.
3.3 Executable statements.
3.4 Declarative statements.
3.5 Examples.

CHAPTER 4. THE IMPLEMENTATION
4.1 Pre-processor form.
4.2 Compiling techniques.
4.3 The recogniser.
4.4 Processing routines.
4.5 Implementation limitations.

CHAPTER 5. USING FRED
5.1 Example 1
5.2 Example 2
5.3 Possible improvements.

APPENDIX A SYNTAX NOTATION
APPENDIX B JOB CONTROL LANGUAGE
REFERENCES
APPENDIX C A LISTING OF THE PREPROCESSOR
APPENDIX D THE EXAMPLES (attached)
PREFACE
I wish to thank Mr B.W. Smith for supervising this thesis, and for his many valuable criticisms and suggestions.
SUMMARY

Editing and correction of records by computer is very important when large volumes of data are being processed. As well as improving the quality of the results, a well designed edit can also simplify the processing of the data by detecting and correcting records which could cause the processing routine to malfunction.

Chapter 1 considers this task, and justifies its division into two sections. File handling and input/output is dependent on the machine and peripherals used, but requires no knowledge of the logical content of the record. On the other hand, the checking of the items and generation of error messages is less machine dependent, but needs a detailed knowledge of the logical content of the record. This latter task, called the "logical edit", is considered in this thesis.

The creation of a logical edit involves three phases: analysis, programming and testing. Commonly a different language is used for the analysis and the programming, and this can lead to errors and difficulty with the testing. Using decision tables the first two phases can be combined, as decision tables can be used as a programming language in conjunction with a suitable compiler. However, they are not ideal as a means of specifying a logical edit. In particular, they are cumbersome and time consuming. A need exists for a better language that can be used both for analysis and as a computer program. Chapter 2 considers the properties of such a language, "FRED".
The statements of FRED represent the most common logical tasks required in editing, and include such features as look-up tables, error flag setting, adjustments and so on, not normally contained as single statements in a programming language. To deal with less common tasks, arbitrary blocks of PL/I statements may be included in an edit.

FRED also has a structure to represent the branches of the edit in an intuitive way. This structure exists in two forms, one of which is used for the program itself, and hence appears on the program listings. This makes a program written in FRED particularly easy to read and debug.

FRED is written to be compatible with PL/I, and a FRED edit will generally form one procedure of a PL/I program to perform all editing tasks.

Chapter 4 describes the way in which a FRED pre-processor has been written and tested. The pre-processor is written in PL/I and converts an edit written in FRED into a logically equivalent PL/I procedure. Any PL/I statements read by the pre-processor are copied unchanged, so that a program may be written of mixed PL/I and FRED blocks, and the whole submitted first to the pre-processor, then to the PL/I compiler, to produce an edit.

Chapter 5 contains sample edits written in FRED, and mentions improvements suggested by the early use of FRED.
STATEMENT

The material reported in this thesis is the author's own work except where references to other work are given.
1.1 INTRODUCTION AND DEFINITIONS.

1.1.1 Editing is a process to check the accuracy of a set of items of information. These items are usually organised into logical records (frequently referred to as records).
A logical record is a collection of related items; for example, if a set of items is the information collected for an area in a population census, then those items relating to one dwelling may form a logical record.

The items may also be grouped because of the properties of the medium which stores them. For example, only a few items may be contained on one punched card or one length of paper tape. Such a group is called a physical record. In many applications there is a simple relation between logical and physical records.

A logical record is represented by a sequence of data characters on some medium together with an implied structure which enables these characters to be interpreted as the items of information. The data normally consist of alphanumeric characters which represent the information, and separators, which may also be alphanumeric, which do not represent information, but which, in conjunction with the implied structure, enable the items of information of the record to be determined. Thus, the character string "£142/10/4" contains three separators "£", "/" and "/", and these enable the characters "142104" to be interpreted as one hundred and forty-two pounds, ten shillings and fourpence.
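The thesis's own examples are written in FRED and PL/I; purely as an illustrative sketch, the interpretation of this record might look as follows in modern Python (the function name and return form are assumptions, not part of the original):

```python
def parse_record(chars: str) -> tuple[int, int, int]:
    """Interpret a string such as "£142/10/4" as three items
    (pounds, shillings, pence), using the separators "£" and "/"."""
    if not chars.startswith("£"):
        raise ValueError("missing '£' separator")
    pounds, shillings, pence = chars[1:].split("/")
    return int(pounds), int(shillings), int(pence)
```

Here the separators themselves carry no information; combined with the implied structure (pounds, then shillings, then pence) they let the data characters "142104" be resolved into items.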
One use of the edit is to improve the quality of the information and remove those errors which will impede later operations on the file. Thus, for example, in an edit attached to the job of computing pay, an attempt will be made to remove all errors by including redundant information such as checksums to allow the checking of all items. In a typical statistical application, on the other hand, such an edit would be impractical with the available resources, and the edit will be designed to remove all errors which would seriously affect the statistics and merely reduce other errors to an acceptable level. The edit may also be used to point out exceptional, but correct, records. For example, the import of a warship costing several millions of dollars disturbs the detailed import statistics so much that special mention may have to be made. Such events are easily highlighted at the editing stage.

A further use of the edit is to simplify later processing tasks. Difficult situations such as zero divisors can often be detected at the edit stage and appropriate action taken quite easily, thus simplifying the later processing runs. Missing and absurd information such as illegal code values can also be detected and corrected during the editing. Also at this stage codes and serial numbers can be inserted to be of use in the later tasks, such as a serial to be used as a subscript in later tabulation routines, and batch totals can be computed to check on the completeness of the data.
The errors arise in two ways. The first is due to incorrect, incomplete or inconsistent information at the earliest stage, for example, an error of observation or of recording. These errors will be referred to as "source errors". The second type of error is that caused by a mistake in the extraction, transcription or storage of the information; for example, an error caused by punching an incorrect character into paper tape, or by including some records twice in the file, or an error while reading the data from magnetic tape. These errors will be referred to as "transcription errors".
The value of editing is such that it is common to include redundant information just for the purpose of providing an editing check. Thus, the total of a group of items may be included in the record, or the rate at which duty is levied may be included in an excise record even though this should be determinable from the nature of the item, and a check digit may be applied to a code number.
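To illustrate the last point: one common check-digit scheme is a weighted modulus-11 digit (the text does not commit to any particular scheme, so the choice here is an assumption). A minimal sketch:

```python
def check_digit(code: int) -> int:
    """Weighted modulus-11 check digit: weight the digits 2, 3, 4, ...
    from the least significant end, and choose the digit (itself given
    weight 1) that makes the total weighted sum divisible by 11.
    A result of 10 is conventionally remapped, e.g. to 'X'."""
    total, weight = 0, 2
    while code > 0:
        total += (code % 10) * weight
        code //= 10
        weight += 1
    return (11 - total % 11) % 11

def verify(code: int, digit: int) -> bool:
    """A transcribed code number is accepted only if the appended
    check digit matches the one recomputed from its digits."""
    return check_digit(code) == digit
```

Such a digit catches the commonest transcription errors (a single wrong digit, or two adjacent digits interchanged).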
1.1.2 There are many advantages in using a computer for editing. The types of logical and arithmetic checks that are most useful can be performed quickly and cheaply on a computer. Further, the computer does not get bored by applying the same checks over and over again, and continues to operate with very high reliability.

The value of computer editing is increased if the edit is used to query unlikely events as possible errors and also to make corrections where possible to the errors found.
The unlikely events queried in this way can then be checked manually for two important conditions. First, reference to source documents or beyond may show that the record is in error; for example, this will usually be the case when hospital records report the birth of quintuplets. Second, the information may be correct, but so significant as to need special action, such as the case of the import of a warship mentioned earlier. Records queried in this way form a class which must usually be distinct from the class of records found to be in error, since many programs are written that do not allow further processing to be done on a file that contains errors, whereas, as noted above, many of these queried records will be correct and have to be processed in the usual way. The attention that can be given to these exceptional cases when singled out in this way is much greater than is possible when this type of record has to be searched for by hand from the whole set of information.
Making corrections by computer is a very valuable technique, though it is very difficult to program. Indeed, some corrections that can be recognised fairly easily manually seem too complex to program using existing techniques; an example is recognising that a punched card has been read upside down. However, many simple transcription and source errors or omissions can be properly corrected by program, particularly when the records contain numeric data. Corrections are particularly valuable in work of a statistical nature, where complete accuracy is not expected, and limited resources make it impractical to investigate all source errors. Here it is a common procedure to apply an "arbitrary adjustment" that would make the information an acceptable record with, at the worst, an insignificant effect on the accuracy of the final statistics. These adjustment rules can be programmed very effectively, and the ability of the computer to make complicated decisions very rapidly allows these adjustment routines to take account of many conditions. Further, the reliability and predictability of the computer allows the effect of such rules to be estimated with a high degree of accuracy, and steps can be taken to reduce bias in a way that is not possible with clerical amendment.
Using the computer for editing has a further important advantage in the very common case of the further processing also being done by computer. This is that the incidence of transcription errors between editing and processing is reduced to a minimum. If the processing occurs immediately after editing, then no transcription occurs at all, and the data will only be altered by a malfunction of the computer - the likelihood of which can truly be regarded as negligible.

If the processing occurs at another time, even on another computer, then the data can be stored on one of the magnetic mass-storage media such as magnetic tapes, discs or drums. The reliability of these devices is far greater than that of mechanical readers and writers involving, for example, paper tape or punched cards. The error rate will generally be less than one error in 100,000,000 characters, and since an error is generally detected by the reading device or by program, using such devices as checksums and parity bits, the chances of an undetected error will generally be even less.

There is, however, one severe problem in editing by computer. All types of error must be anticipated and included in the program before editing commences. This is naturally a much more difficult task than that of recognising errors as they occur.
A clerical editor, on the other hand, can retain information over a large period of time and access it by a rapid process of association. Thus, if the edit of a personnel file specifies no check on date of birth, and if due to a mechanical fault a large number of records are transcribed with dates of birth in, say, the seventeenth century, these will not be detected by a computer edit and may remain unnoticed until a report quotes the average age of employees as 342, with embarrassing consequences! Using clerical editing, however, there is a fair chance that such errors, unforeseen initially, may be discovered and corrected before much damage is done.

The original analysis of the information for computer editing needs to be very complete. Though this analysis can be tested by test runs on actual data, the results of which are closely examined, much work still needs to be done on this problem.
1.1.3 Once the computer edit has been written it can be used rapidly and frequently. Thus, the data can be edited in conveniently sized batches soon after these have been received. There are several advantages to such a scheme.

Any source errors that are detected can be queried soon after the information is collected. Memories will still be fresh, documents still accessible and respondents found easily. The quality of the corrections to these errors will then be much higher than if the query was made later.

Transcription errors which are detected may lead to repairs to the transcription devices to correct a recurring error, or further explanation to an operator or coder who is misinterpreting the instructions. In this way the build up of a large number of systematic transcription errors may be avoided.

Analysis of the errors recorded may lead to improvements to the editing itself, or may highlight
deficiencies in the information being provided. Thus the system may be improved at the earliest possible moment.

Because of these considerations, editing is usually carried out at an early stage of computer operations on the file, and is often associated with the transcription from an external medium such as punched cards, paper tape or documents for an optical reader, to a computer oriented medium such as magnetic tape. This procedure has the further advantage that the other computer runs can be written with the assumption that the data will contain no destructive errors such as zero divisors or out of range subscripts, since these can be excluded during the edit.

In some cases the edit will form a computer run on its own; in others it will merely be a part of a run to perform several steps. If the information arrives in batches over a period of time, but cannot be processed until all has been edited, then a batch will normally be edited as soon after its receipt as possible by a separate edit routine. If, on the other hand, the volume of data is small and the edits are not too complicated, it may be desirable to edit and process the data (or at least the acceptable section) at once. Another factor in this decision is whether or not the combined editing and processing programs can be conveniently fitted into computer memory directly or by a suitable overlay process.

Though there are many schemes for performing edits, the two represented by the attached flow charts are typical of a large number of edit runs.

The first flow chart represents an application where there is only a small volume of data, and the error rates will be low, for example a scientific computation.
The second flow chart represents a larger application where the edit is a separate computer program. The output of this program will be a file of records now arranged in the form best suited for later processing. Since the record size is frequently large, even error records are included on the file, and a second edit program is written to correct the file by altering records, removing records or inserting records. Records altered or inserted in this program will generally be edited with the same edits as the original records.

References (2) and (3) discuss the importance of editing and its relation to other computer operations.
[Flow chart (i): First edit run, repeated as required.]

[Flow chart (ii): Amendment run, repeated as required.]
1.1.4 There are two kinds of output from an edit. The first is the file of logical records in their edited form, and the second is a file of the messages about the records that were generated by the edit. Although the form, and even the number, of output files of the above types vary from application to application, the general terms "edited file" and "error listings" will be used for the two types.

In those programs where the editing and processing are combined, the term edited file can be applied to the one logical record currently being processed, but where the editing is separated from the processing then all edited records must be stored on an external file. Since this is created by, and used exclusively by, the computer it is usually written on one of the magnetic mass storage media, tape, discs or drums. This file will frequently be very different in format to the original data file.
The logical record will frequently have a more rigid format. It is common, especially when using paper tape as an input medium, to economise on characters by devices such as excluding leading zeros and non-significant blanks, and omitting sections of the record where possible. This is valuable in that it reduces significantly the number of characters to be punched, often the most expensive operation of the whole processing, but it does mean that the layout of the records will vary considerably on the file, and a special routine is needed to interpret the varying forms of structures. As this routine may be quite extensive it is desirable to use it once only, and after the first reading the file is kept in a rigid format which may include a large number of non-significant characters but which is easy to interpret.

The logical record may also contain items which were not present in the input: values that were implicit in the input form of the record and which are needed explicitly for processing. These values may not have been present in the source data or they may have been omitted during transcription. For example, the results for weekly turnover for a retail store may include the figures for "value of goods sold including sales tax" and "value of sales tax", from which "value of goods sold exclusive of sales tax" can be imputed for processing.
Other added items may be special items that are attached for sorting, classification or tabulation. The derivation of these extra items may be very complicated, and it is advisable to perform this on a once-and-for-all basis. The logical edit is a suitable time for such computation. An example might be to add a "size code" of 1 if the store's turnover is less than 10,000 annually, 2 if it is 10,000 but less than 50,000, or 3 if it exceeds 50,000. If figures were later required for medium sized stores only, it is easier to consider those shops of size code 2 only than to test the turnover of every record, particularly as each record may be edited once only, but processed on several different runs.
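The size-code rule above can be sketched directly (in modern Python rather than the FRED or PL/I of the thesis; the function name is an assumption):

```python
def size_code(annual_turnover: float) -> int:
    """Derive the "size code" described in the text, attached once
    at edit time so that later runs select on the code alone
    instead of re-testing the turnover of every record."""
    if annual_turnover < 10_000:
        return 1
    if annual_turnover < 50_000:
        return 2
    return 3
```

Deriving the code once, at the edit, pays off because a record is edited once but may be processed on several different runs.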
A unique identifier or serial number may be attached to each record on the edited file. This will enable amendments to be specified to a particular record without fear of ambiguity. A check digit may be attached to this number to reduce the chances of a misquoted or badly transcribed number causing an adjustment to the wrong record. There may also be a flag attached to each record specifying the condition of the record - whether good, queried or in error. This may be referred to so that bad records may be bypassed in later processing.
In very large jobs the edited file is sometimes split into two. One file will contain only those records that have passed the edit, or possibly only those batches for which all records have passed the edit. This file should contain a large proportion of the records and may be processed at once. The other file contains all bad or doubtful records. The main advantage is that to process the amendments it is necessary to read only the shorter file, not the whole set of data. The running time of the amendments program should be correspondingly reduced.
The error listings in some cases may contain details of every record, but more commonly list only those records that were found to be in error, were queried or had adjustments made to them. The important features are that the items of the record are clearly laid out in a way that facilitates comparison with the source information, that clear but concise messages give details of the errors found, and that the serial number is included ready for the amendment processing. In some cases it is possible to design the listing so that corrections may be made on the listing, which is then used as a source document for the data preparation of the amendment.
A separate error listing may be needed for those records so bad that they cannot be interpreted as items of information. These records are particularly common on paper tape. Since the items, which are the most important feature of the above listing, cannot be distinguished, a completely new layout is required to show the record as a string of characters. Again error messages will be given; these will generally refer to unintelligible separators. Since the record cannot be included on the edited file, no serial number will be allotted to it, unless the sequence of records is so important that a space must be reserved for the corrected version of this record. The layout of these listings is very dependent on the input medium.
1.2 ERROR CHECKS.
1.2.1 An a priori knowledge of the nature of the items of information and the relationships between them is needed to define the error checks. Nordbotten (5) classifies this knowledge into two general classes, theoretical knowledge and empirical knowledge.

Theoretical knowledge is the knowledge given by the definitions of the items and relations. Thus, if an item is defined to be numeric it cannot contain an alphabetic character, and value of sales multiplied by rate of sales tax must equal sales tax collected. If an item, or a relation, fails a check based on theoretical knowledge, then there must be an error. In most applications, the record cannot be accepted without some adjustment so that the record now satisfies the check. Theoretical checks are easier to define than empirical checks, and are fundamental to the editing process. However, the value of the edit will be much increased if empirical checks are added to the core of theoretical checks.

Empirical knowledge is knowledge about the items of information that is true as a matter of fact and not of necessity. This will be gathered from experience of previous records of this type or by estimation of the values that may occur from models, samples or similar edits. Thus, the yield of oats should not exceed 40 bushels per acre, or the temperature during an experiment should not exceed 120 degrees.
In general, it is not possible to say that an item or group of items which fail an empirical check are in error; it is only possible to say that it is unlikely that the items are correct. However, it may be desirable, though not rigorous, to say that an item failing an empirical check is in error.
The limit on such a check may be chosen to satisfy one of two criteria. Firstly, the probability that a good record is rejected as in error may be kept to some limit; or, secondly, the probability that a bad record may be accepted as correct can be kept to some limit. As there are more good records than errors, and as the good records are "better behaved", it is more convenient in practice to satisfy the first criterion, and this is usually done.
Empirical checks may be applied on a two-level basis, for example "if the age at first marriage exceeds 70 then query the record; if it exceeds 100 then the record is in error". Other extensions of the empirical checks are to reset the limits during editing either to refine the limits, or to follow trends, or to use historical or other external information particular to that record.

No further distinction is made between theoretical and empirical checks in this section.
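The two-level empirical check quoted above can be sketched as follows (an illustrative modern Python rendering; the result labels are assumptions):

```python
def age_at_first_marriage_check(age: int) -> str:
    """Two-level empirical check from the text: query the record
    beyond the outer limit (70), reject it beyond the extreme
    limit (100)."""
    if age > 100:
        return "error"
    if age > 70:
        return "query"
    return "accept"
```

The queried class stays distinct from the error class, so that queried records which turn out to be correct can still be processed in the usual way.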
1.2.2 The most common type of check is that in which one item is checked against values or rules written into the edit program. In its simplest form this is a validity check, confirming that the data characters of the item are of the permitted type. Such a check may be extended to take special action on encountering certain characters; for example, as well as checking that there are numeric characters only in an item, special action may be taken on reading "?", which is often used to denote a transcription error while converting data from one medium to another by computer.
Other checks that can be applied to one item of information alone use more knowledge of what the information represents. For example, the code for sex must be M or F, or the amount of income tax paid annually must exceed one dollar. Large look-up tables may be required to establish validity in some cases; for example, country of origin codes on overseas trade statistics number over 100. The item may even be split into sub-items and a relation between these examined, for example a number with a check digit.
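A single-item check of this kind, combining a fixed rule with a look-up table, might be sketched as follows (illustrative Python; the field names and the table fragment are assumptions, and the real country-of-origin table would run to over 100 entries):

```python
# Hypothetical fragment of a country-of-origin look-up table.
COUNTRY_CODES = {"AU", "NZ", "UK", "US"}

def item_errors(sex: str, country: str) -> list[str]:
    """Checks applied to items in isolation: a fixed rule for the
    sex code, a look-up table for the country code."""
    errors = []
    if sex not in ("M", "F"):
        errors.append("sex code must be M or F")
    if country not in COUNTRY_CODES:
        errors.append("country code not in look-up table")
    return errors
```

Each failed check contributes one error message, ready for the error listing.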
1.2.3 A second class of checks consists of those carried out by applying logical or arithmetic checks to two or more items within one record. These checks are often referred to as "cross-checks". They apply to the information represented by the items and usually assume that the data characters are valid. For this reason these checks commonly follow a series of validity checks, and are often omitted when one or more of the items involved has failed a previous validity check.
Logical relations of this type are widely applied to numeric, coded or alphabetic information. Very common are simple existence relations, if A then B; for example, if relation to head of household on a population census form is D for daughter, then sex must be F, female.
For arithmetic information, a large class of functional checks is available. In the most general form such a check requires that the value of some function f of the items being checked must lie between the values of two other functions a and b of items of the record; that is, a ≤ f ≤ b, where generally a and b are simple functions such as constants or values obtained from a look-up table on one item. The principal function f may however be quite complicated.
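The general form a ≤ f ≤ b can be written down directly; the following is a minimal illustrative sketch (modern Python, with names of my choosing, not the notation of the thesis):

```python
from typing import Callable, Mapping

Record = Mapping[str, float]

def functional_check(record: Record,
                     f: Callable[[Record], float],
                     a: Callable[[Record], float],
                     b: Callable[[Record], float]) -> bool:
    """General functional check: accept the record only if
    a(record) <= f(record) <= b(record)."""
    return a(record) <= f(record) <= b(record)

# A ratio check with constant limits, assumed purely for illustration:
ok = functional_check({"wheat": 120.0, "acres": 10.0},
                      f=lambda r: r["wheat"] / r["acres"],
                      a=lambda r: 5.0,
                      b=lambda r: 40.0)
```

Passing a and b as functions of the record, rather than constants, covers the cases where the limits come from a look-up table on another item.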
Some special cases of this check are very important. One is a check where f is the ratio of two items and a and b are constants or simple functions of a third item. For example, the ratio of the amount of wheat produced to the number of acres planted must lie between two limits, each dependent on the state in which the farm is situated.
Another important case is that where f is an algebraic sum of several items and a and b are either both zero or some small value, negative and positive respectively. For example, in a report on the composition of a chemical mixture: -a ≤ (sum of weights of component items - total weight) ≤ a, where the value a has been chosen to allow for any reasonable weighing inaccuracies, but is still small.
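The chemical-mixture check can be sketched as follows (illustrative Python; parameter names are assumptions):

```python
def weights_consistent(component_weights: list[float],
                       total_weight: float,
                       a: float) -> bool:
    """Algebraic-sum check from the text:
    -a <= (sum of component weights - total weight) <= a,
    where a is a small tolerance chosen to cover reasonable
    weighing inaccuracies."""
    discrepancy = sum(component_weights) - total_weight
    return -a <= discrepancy <= a
```

With a set to zero this degenerates to an exact checksum; a small positive a tolerates measurement error without letting gross errors through.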
1.2.4 Further checks are possible when more than one record is edited at a time. These checks are similar in structure to those described above but involve items from more than one record. Again, these checks usually assume the validity of the data characters has been checked previously, and may be omitted when items fail a validity check. There are three ways in which checks of this kind can be constructed.

Firstly, the file of records may have a hierarchical structure. The records in such files refer to information on two or more levels. For example, records at the first level might be a summary of a group of records at the second level. This may occur, for instance, in the records of a blood bank, where the first level record would contain details of the blood required by one hospital served by the bank. This would be followed by a set of second level records listing the blood required for each surgeon in the hospital. An obvious check is to confirm that the total quantity of blood of each type required by the surgeons is equal to the quantity of that type required by the hospital. There may even be third level records listing the blood required for each patient, a set of these records being grouped behind the record for the appropriate surgeon.
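The blood-bank check just described can be sketched as follows (illustrative Python; the record layouts, with blood types as keys and quantities as values, are assumptions):

```python
def blood_totals_agree(hospital: dict[str, int],
                       surgeons: list[dict[str, int]]) -> bool:
    """Hierarchical cross-check: for each blood type, the
    second-level (surgeon) requirements must sum to the
    first-level (hospital) requirement."""
    for blood_type, required in hospital.items():
        if sum(s.get(blood_type, 0) for s in surgeons) != required:
            return False
    return True
```

The same pattern extends one level down, checking each surgeon's record against the patient records grouped behind it.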
A second common type of hierarchical file is that where the high order records contain the information always present about one entity, and are followed by variable numbers of lower order records containing variable information. For example, a personnel file may contain a high level record for each employee containing personnel number, date of birth, present position and so on. Each would be followed by a set of lower level records giving details of previous positions, and a suitable check would query any record for which salary for a previous position exceeds present salary.

A practical problem with this kind of editing that is sometimes very severe is that it is not always possible to retain the high level record while all low level records relating to it are edited without destroying the file structure. If it is important not to write the high level record until all records of the group have been edited, special devices such as scratch files may be needed.
Secondly, a record may give information about an identifiable unit, and another record for this unit exists on some other, already edited, file. Once the identifying information has been edited the two files can be matched and edits applied between the records. Common examples of this type occur when a similar file exists for an earlier period of time; here items can be compared for "start and finish" matches, for example stock held at the end of the earlier period should equal stock held at the beginning of the later period.
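A "start and finish" match between two matched records might be sketched as follows (illustrative Python; the field names are assumptions):

```python
def start_finish_match(earlier: dict, later: dict) -> bool:
    """Match between records for consecutive periods: closing stock
    of the earlier period must equal opening stock of the later."""
    return earlier["closing_stock"] == later["opening_stock"]
```

The check presupposes that the identifying information of both records has already passed the edit, so the match is between the right pair of records.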
Thirdly, the file may be homogeneous, but checks can be made between each record and its neighbours. The nature of the information may suggest checks; for example, if the file contains records representing the same items over a period of time, many checks can be devised to allow only reasonable variations. Even in general cases, statistical checks can be applied to short strings of records, often 3, to report unlikely variations in the items (see Nordbotten (5)).
1.3 AUTOMATIC ADJUSTMENT.
1.3.1, The correction of erroneous items , or the insertion of
missing items by computer is a difficult, but worthwhile
operation. It is l Gss common than editing by computer, and
even in those appli c&tions where it is used, it is of ten done
on a trial basis and limited to a few items. However, its
use does lead to a very l arge saving in time and labour, and
hence cost, and it may l ead to a considerable improvGment in accuracy.
The alternative, correction by clerks, is sometimes regarded as exact, but in fact it is often a very unreliable process. Even in cases where it is possible for the adjuster to find the correct value, by reference to source documents or beyond, there can be no guarantee that this has been done, or if it has, that the amendment has been processed without transcription error. Where it is not possible to recreate the correct value, such as in statistical applications where an error exists in the source data, then it is common practice for the adjuster to create some new value by means of rules provided by the systems analyst or by using his experience. The former kind of adjustment will clearly be done better by a computer which will follow faithfully the instructions of the analyst, and which should make possible an elimination of bias. It is the experience of the writer that as many as 10% of corrections made clerically are incorrect after transcription, in the sense that the record when adjusted still fails the editing checks.
There are two severe problems in automatic adjustment by computer: deciding which item is incorrect, and choosing the value to which it should be adjusted.
A further difficulty arises from the fact that the checks and corrections have to be written before editing commences, and so the analysis has to be very thorough indeed to be effective in all cases. Once put into operation it is difficult to change any of the instructions, and if there is an error, in extreme cases, several hundred thousand records may be edited before it can be corrected. The procedure then is difficult, but the results so worthwhile that automatic adjustments should be included wherever possible. Most of the methods discussed below are particularly applicable to numeric data. Correcting alphabetic data such as names and addresses is much more difficult.
1.3.2 The difficulty of deciding which item is in error occurs only when functional or logical checks between two or more items fail. This kind of check unfortunately is a very common detector of source error. The preferable way of resolving the problem is to apply further checks to each of the items until the erroneous item can be determined, but this is not always possible.
In the case of editing a pay slip, for example, if the relation

net pay + net tax + deductions + superannuation - gross pay = 0

is not satisfied, then we could introduce other checks such as "does gross pay = annual salary / 26?" or "does the value for deductions equal the sum of the various deduction items?" If one of these checks fails and if the other value for the item satisfies the original equation then we can be confident not only that we have found the item in error, but that we can correct it.
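The pay-slip reasoning above can be expressed as a short procedure. This Python sketch is purely illustrative: the field names (`net_pay`, `tax`, `deductions`, `superannuation`, `gross_pay`, `annual_salary`, `deduction_items`) and the choice of exactly two subsidiary checks are assumptions for the example.

```python
def locate_error(rec):
    """Return the name of the item believed to be in error, or None.

    The primary relation checked is
        net_pay + tax + deductions + superannuation - gross_pay == 0.
    When it fails, each subsidiary check proposes a replacement value;
    if that value satisfies the primary relation, the item is blamed."""
    def relation_holds(r):
        return (r["net_pay"] + r["tax"] + r["deductions"]
                + r["superannuation"] - r["gross_pay"]) == 0

    if relation_holds(rec):
        return None
    # Subsidiary check 1: gross pay should equal annual salary / 26.
    candidate = dict(rec, gross_pay=rec["annual_salary"] // 26)
    if rec["gross_pay"] != candidate["gross_pay"] and relation_holds(candidate):
        return "gross_pay"
    # Subsidiary check 2: deductions should equal the sum of its items.
    candidate = dict(rec, deductions=sum(rec["deduction_items"]))
    if rec["deductions"] != candidate["deductions"] and relation_holds(candidate):
        return "deductions"
    return None  # error could not be localised by these checks
```

Because the proposed value also satisfies the original equation, the same routine both locates the error and supplies the correction.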
In other cases a check as to the reasonableness of each item may be of value. Thus, if in a chemical mixture, the sum of the component weights does not equal the total weight, then a check on each weight might reveal that one was of unlikely magnitude, and the error could be attributed to that item with some degree of confidence.
A method that is very effective when the record is of a suitable type is to evaluate all possible checks before making any adjustment, then making all necessary adjustments from the information then available about every relation of interest. After such an adjustment the record is frequently re-edited and possibly further adjustments are made until either the record is correct, or adjustment is halted after a specified number of iterations.
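The evaluate-then-adjust-then-re-edit cycle can be sketched as a small driver. This is a minimal illustration, assuming checks and adjusters are supplied as dictionaries of functions keyed by check name; nothing here is prescribed by the thesis beyond the iteration scheme itself.

```python
def edit_with_adjustment(record, checks, adjusters, max_iterations=5):
    """Evaluate every check before adjusting, apply all indicated
    adjustments, then re-edit; stop when the record passes or after a
    specified number of iterations.  Returns (record, passed)."""
    for _ in range(max_iterations):
        failed = [name for name, check in checks.items() if not check(record)]
        if not failed:
            return record, True
        for name in failed:
            adjusters[name](record)   # adjust the record in place
    return record, False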
Frequently however, there is not enough information in the record to allow items to be checked in several ways so that the erroneous one of a set can be determined. Usually the data preparation resources available do not allow sufficient redundant items to be included for this purpose. A common method in such cases is to evaluate a priori the reliability of the various items, and where an adjustment has to be made, to make it to the item considered least reliable. For example it is frequently found that values or prices will be more reliable than quantities, especially in respect of source errors, since it is always in the interests of one party or another to ensure the accuracy of such an item.
A further method sometimes may be justified on grounds of expediency. In this method the item which is least important, or else least significant, will be adjusted. No attempt is made to find the true value, and this approach is only adopted where absolute accuracy is either impossible or too expensive.
The technique used wherever possible by the author is a combination of the first and third methods above. This is to take a small group of items classed as most reliable and to edit these with all possible rigour. The validity of these items having been established (as it usually is), the scope of the edit is gradually extended by considering new items and editing these individually and against the group of items already edited. If necessary these new items are adjusted to agree with the previously edited items. In this way the whole record can often be edited so that at any failure it is obvious which item is in error, and how it should be corrected. If an error is detected in the starting group, then as these items are generally the most significant in the record, the record is in an unacceptable condition and must be corrected after careful reference to the source documents or beyond. In this case the edit of the rest of the record will be confined to signalling errors, and no corrections will be made.
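The progressive technique described above can be sketched as follows. The shape of the interface (a core check plus an ordered list of `(name, check, adjust)` stages) is an assumption made for the illustration; the thesis describes the strategy, not this particular packaging.

```python
def progressive_edit(record, core_check, stages):
    """Edit the most reliable group of items first; if it fails, only
    signal and make no adjustments.  Otherwise edit each later item
    against the items already accepted, adjusting it where necessary.
    Returns (record, messages)."""
    messages = []
    if not core_check(record):
        messages.append("core items in error: record flagged, no adjustments made")
        return record, messages
    for name, check, adjust in stages:
        if not check(record):
            adjust(record)
            messages.append(name + " adjusted to agree with previously edited items")
    return record, messages
```

Ordering the stages from most to least reliable makes each failure point at the newly added item, which is the property the text relies on.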
One of the first edits using this technique was tested against clerical correction. A batch of data was edited using the program, and the original form of those records that were automatically adjusted was shown to a clerk experienced in editing such data. He was asked to correct these records, and when he had done so, his corrections and the computer corrections were compared carefully with the source documents. It was found that the computer adjustments were of a significantly higher quality than those made by the clerk.
1.3.3 The value to which an item should be adjusted is in some cases obvious. Where some arithmetic equality or near-equality has to be satisfied, or where further relations suggest some value which satisfies the check, then the adjustment can be made in some confidence.
It often happens however that no such method is available. One possible approach is to use a knowledge of the likely forms of error to the item during extraction or transcription to find an acceptable value. For example, if a numeric value read from a punched card is found to contain an alphabetic character, then a knowledge of the punching machine's characteristics could suggest a value such as the numeral whose punching forms part of the code of the character found. Thus, if the cards were punched on an I.B.M. 026, an A would be adjusted to 1, B to 2, C to 3 and so on. Similarly, if a set of characters from paper tape is found to be in error, then the characters which have the same punching but with different case shift would be tried. On a less complicated level, if an item is found to be too large, then it might be worthwhile to truncate the leading digit and accept the reduced value.
If none of the above methods is suitable then the method of "safest choice" can be used. In this method the item is adjusted to a value so that either the maximum possible error of the adjustment is least, for example replacing an alphabetic character in a numeric item by 5, or else it is adjusted so that the effect on the output is least, for example if sex is not stated then set this to male.
Three very interesting refinements of the "safest choice" method are particularly applicable to statistical data, and the reference includes some discussion of the effect of these methods on the final statistics.
Each method depends upon the construction of a cross-classification on a group of items so that the differences between records belonging to different classes are great, while there is little variation between records in the same class. When an item is found to be in error, or missing, the record is classified by the other items and the value for the bad item is chosen from this classification in one of three ways.
In the "cold deck method", prior to editing, a set of values is stored in the computer, at least one for each classification. The bad item is adjusted to the value, or one of the values, stored for this item in the classification which agrees with the record in all the other items of the group. If there are several values, then one is selected by a systematic or random selection process.
The "hot deck method" is similar except that when a record is found to be valid it is used to update the deck by replacing the old values of its classification by its own values. The correction procedure will therefore adjust an item to the value of that item in a recent valid record of the same classification.
Suppose, for example, that the set of values contains one element only and that the items in the group were sex and hours worked last week, and the preset values for hours worked were 41 for males, 25 for females. If the cold deck method is used, whenever the figure for hours worked is unacceptable then it will be set to 41 if the sex is male and 25 if the sex is female. If the hot deck method is used, the stored values are continually updated. If a record is read and accepted where a female works 37 hours, then the stored value is changed to 37, and if the next record has sex female but no value for hours worked, this item will be set to 37.
It can be seen readily that the hot deck method has less tendency to concentrate records around a few sets of values, and is also better able to follow any trends that may be present in the data.
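A hot deck for a single item, seeded with cold deck values before editing starts, amounts to a small table of classification-to-value entries. The class below is a minimal sketch of that idea; the one-value-per-classification restriction matches the worked example in the text.

```python
class HotDeck:
    """Hot deck for one item: keeps, for each classification, the value
    from the most recent valid record, seeded with cold deck values."""

    def __init__(self, cold_values):
        self.values = dict(cold_values)     # classification -> value

    def update(self, classification, value):
        # Called whenever a record is found to be valid.
        self.values[classification] = value

    def impute(self, classification):
        # Called to adjust a bad or missing item.
        return self.values[classification]
```

Running the text's example through it: the deck starts at 41 for males and 25 for females; after a valid record of a female working 37 hours, the next female record with missing hours is set to 37.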
The third method, the "Monte Carlo method", is more complicated, but except where there are trends in the data which favour the hot deck method, it may be more acceptable statistically. In this method two passes are made over the data. On the first pass, no corrections are made, but the distribution of values in each of the classifications is built up by assuming that the accepted values are true values. Using these sets of values and a theoretical model of their distribution (usually the Normal distribution), the parameters of the distribution are computed. On the second pass, whenever a correction has to be made a random number generator is used to draw a value from the fitted distribution for the record's classification.
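The two passes can be sketched as a pair of functions: one fitting a Normal distribution per classification from the accepted values, one drawing an imputed value from it. This is an illustrative reading of the method only; the fitting by sample mean and standard deviation and the function names are assumptions.

```python
import random
from statistics import mean, stdev

def fit_distributions(records, item, classify):
    """First pass: collect the accepted values of `item` in each
    classification and fit a Normal (mean, standard deviation)."""
    groups = {}
    for rec in records:
        groups.setdefault(classify(rec), []).append(rec[item])
    return {c: (mean(v), stdev(v)) for c, v in groups.items() if len(v) > 1}

def impute(classification, params, rng=random):
    """Second pass: draw a replacement value from the fitted
    distribution of the record's classification."""
    mu, sigma = params[classification]
    return rng.gauss(mu, sigma)
```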
1.4 SOME EXTENSIONS AND PROBLEMS.
1.4.1 Most of the preceding ideas are especially relevant to numeric data, where a range of arithmetic checks is available for the edit, or to coded classifications, where sufficient logical relations exist to produce useful checks. The problem of writing a satisfactory edit for large alphabetic items such as names and addresses or descriptive items is much more difficult.
This is because of the very wide range of acceptable names, and the small number of rules in their formation. The rules are full of ambiguities, for example JOHN is acceptable both as a given name and as a surname, and in some cases they are deliberately broken, such as KOSY KAFE. The author, when studying a computer produced list of names and addresses of companies with statisticians experienced in the field, found that about 5% of all names could not be classed as right or wrong except by reference to source documents. There seem to be no rules available for such names as WESTERN GOLF FIELDS EXPLORATION CO. or THE WICK CLEANERS PTY. LTD.
However, in view of the large amounts of this sort of data that are processed, and the huge amount of work involved in checking these lists clerically, it seems desirable to edit as far as possible. For example, it is possible to restrict the characters used in names to alphabetic, hyphens and apostrophes, to detect such errors as SM2TH. Other checks can be determined by the nature of the data being processed. Thus, in a motor vehicle registration application, the number of manufacturers is small and any name given as such can be tested against each. A correction process can even be
devised by a search made down the list to find the correction; thus FLRD could be corrected to FORD, but FROD would still give trouble. Again, if names are being held, then it may also be convenient to hold the initials for addressing, say, and an edit can be written to check given names against initials.
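The search down the manufacturer list can be sketched as a single-substitution match, which reproduces the behaviour described above: a one-character misreading is corrected, a transposition is not. The routine is an illustration, not the thesis's program.

```python
def correct_name(name, known):
    """Search down the list of valid names for one differing from the
    given name by at most a single character substitution.  A
    transposition of two characters (two differences) is not matched."""
    for candidate in known:
        if len(candidate) == len(name):
            differences = sum(a != b for a, b in zip(candidate, name))
            if differences <= 1:
                return candidate
    return None   # no correction found; the name must be queried
```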
1.4.2 A similar lack of possible checks exists for a code number, or reference, which distinguishes a record from others on the file. This is a very important item and it may be used frequently for updating and retrieval purposes, but generally it has no logical relation to any other information in the record. Similar considerations apply to items such as account numbers which may not form part of the processing system, but which are vitally important.
To overcome this problem the technique of check digits has been devised (see Beckley (5)). Here an extra character or characters are added to the number. These characters carry no extra information and do not increase the scope of the number, but they are calculated from a relation between the other digits of the number. The method of derivation is carefully chosen so that the common types of error that can occur while transcribing the number, that is misreading one character or transposing two adjacent characters, will cause the relation to be broken. Thus each time that the code is read by the computer, the calculation is repeated and if the computed check digit agrees with that given then there is a very strong probability (often as high as .999) that the number has been correctly transcribed.
The most commonly used check digits are based on the "modulo 11" technique, where a product is formed of each digit and a weight, and the products are summed and divided by 11. The remainder is used as the check digit. Where the remainder is 10 either the digit X may be allocated, or else the code number may not be allocated. Sometimes alphabetic characters are used.

For example if the number is 1234, we form the sum

1 × 2³ + 2 × 2² + 3 × 2 + 4 = 26

and dividing by 11 we have remainder 4, so the number with check digit is 12344.
1.4.3 It has been mentioned before that it is often valuable to use the data currently being edited to adjust and improve the edit checks themselves. This technique is frequently applied to the upper and lower limits of a functional check where the function is a ratio.
This adjustment can serve two purposes. Firstly it can be used where there is little or no knowledge on which the limits can be set initially. At the start of editing the limits can be set very wide, and use made of this technique to improve the limits as the editing progresses. Secondly, where the data are not homogeneous, for example temperatures recorded which will vary seasonally, this adjustment can be used to follow the trend of the data and keep the edit effective.
The resetting of the limits is usually carried out at set intervals in the editing, not necessarily after each record. Thus, the limits may be reset after each batch of processing, or during the batch after every 1000 records. Normally the amount of data that is processed between resettings is too small for values derived from it to represent the values of the whole of the data. For this reason some historical aggregates are usually used too. That is, some totals will be kept of the records to date and these will be merged with the recent values before the calculations and resetting are performed.
The weights given to the recent records will be dependent on the nature of the data; if the purpose of the adjustment is to follow the trend of the data then more weight will be given to the recent values than if the data were homogeneous and the adjustment used to improve the limits.
A similar approach, with less statistical justification, is to count, during the editing run, the number of records submitted to this check, the number that were too high and the number that were too low. At the end, compute the ratio of those rejected as too high to the number submitted, and if the ratio is higher than a predetermined value, raise the upper limit by a preset amount. If the ratio is lower than a second predetermined value, lower the upper limit. Treat the lower limit similarly.
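The counting approach can be sketched directly. The two threshold ratios and the step size below are illustrative parameters only; the text leaves them to be predetermined by the analyst.

```python
def reset_limits(lower, upper, n, n_high, n_low,
                 raise_ratio=0.05, lower_ratio=0.005, step=1.0):
    """Adjust the limits of a functional check from counts accumulated
    during the run: n records submitted, n_high rejected as too high,
    n_low rejected as too low.  Each limit is widened when its rejection
    ratio exceeds raise_ratio and tightened when it falls below
    lower_ratio."""
    if n == 0:
        return lower, upper
    if n_high / n > raise_ratio:
        upper += step          # too many rejected as high: widen
    elif n_high / n < lower_ratio:
        upper -= step          # very few rejected: tighten
    if n_low / n > raise_ratio:
        lower -= step
    elif n_low / n < lower_ratio:
        lower += step
    return lower, upper
```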
1.4.4 It is desirable to produce some statistics on the editing of each batch of data. In its simplest form this can be merely a count of the number of records in error, the number queried and the number accepted. These figures give an instant, but rough, picture of the quality of the data, and may be of use in deciding that a batch of data is so bad that it cannot be accepted. Such a decision may enable the printing of the error listings to be cancelled, as these may be held on magnetic tape and printed only when the statistics as above have been examined. As printing is a relatively expensive operation, this may represent a considerable saving.
More useful is a count of the number of failures at each editing check. These give a very clear picture of the type of error occurring so that attention can be given to any error occurring frequently. This may be prevented by clearer instructions to those preparing the data, or by modification of the edit program if this is at fault. These figures can be made available very rapidly and give a better picture of trends in the results than the long process of clerical inspection of the complete error listings.
Particularly in those cases where adjustments are made on a statistical, or even arbitrary, basis, it may be useful to extend the statistics given with each run by aggregating the changes made to each item throughout the batch. These provide a control not upon the data quality but on the editing process, and may provide an estimate of any bias that is introduced into the file by these adjustments.
A further output of the editing run may be a list, or a summary of a key item of each record, set out so that the list can be scanned rapidly to detect omissions or duplications in the data. Sometimes both are produced, and the summary used to detect the existence of a discrepancy and the appropriate section of the list used to find the actual errors. Such a listing is often called a "balancing listing".
1.5 THE LOGICAL EDIT.
1.5.1 The editing process, as it is usually programmed, can be divided into four operations, which can be described in general terms as:
(1) The data characters are read from some input device and interpreted according to some predetermined structural rules.
(2) The record, as produced by (1), is subjected to editing checks and adjustments.
(3) Messages are produced to signal any errors, queries or adjustments.
(4) The edited record is stored in some form suitable for later processing.
The file update and correction phase follows a very similar pattern, step (1) now being extended to include the matching of the amendment information to the appropriate information on the file to be updated. The remarks made below will apply in general to both processes.
The purpose of this section is to define a subset of this whole process, called the logical edit, and to justify its consideration as a separate entity.
Operations (1) and (4) do not require any knowledge of the logical content of the record. All that is required is a knowledge of the physical structure of the record, and of the characteristics of the particular input/output device being used. The connection of these sections with the rest of the program is simple, usually being only a block of storage holding the record. The connection is complicated, however, if an error is detected in the input operation.
It may happen that the input record is impossible to interpret. This sometimes occurs in a free format medium such as paper tape. Under these circumstances, the record must be excluded from the edit, but some message must be given to enable the record or records to be identified and resubmitted for later processing. This message will usually consist only of the character string and some diagnostic, as not enough information will be available to write the record in accordance with the principles discussed earlier. These messages will thus be quite different from those written by operation (3) and may be regarded as distinct from them. They may either be written on a separate file or else on the same file, relying on the sophistication of the software to accept messages from two independent sources for the same file.
On the other hand, the input record may be interpretable, but may contain errors in some items that are detected at the input stage; for example a parity error may be signalled. In such cases, it is often desirable to note this fact during the editing of operation (2), possibly modifying the checks because of it and possibly commenting on the error in the normal listings of operation (3). For these purposes a flag or set of flags may be needed to communicate between (1) and (2).
It does seem, therefore, that the first and last operations can be separated from operation (2). Some communication between them is necessary, but this is not very involved. Further, such a separation seems logically justified because, in general terms, operations (1) and (4) require no knowledge of the logical content of the record but need a knowledge of the particular machine environment. On the other hand, operation (2) can be specified without regard to the machine environment, but is dependent on the logical content of the record. Furthermore,
there is some justification for regarding the problem of efficient specification and implementation of (1) and (4) as "solved". Even apart from the input/output facilities of most high level languages, which are usually adequate for this task unless there is a very large volume of data, most machines have available an input/output program to deal very rapidly with logical records in the very manner needed by the operations. In the few cases, such as paper tape, where the conventions of each installation differ to such an extent that little can be done on an inter-installation basis, it is quite easy to provide a package to read logical records according to the installation's own standards.
These operations will not be considered further in this paper.
Operation (3) above is also to some extent machine dependent, as again an output device is used requiring programs particular to the machine. However it cannot be isolated from operation (2) as clearly as can the others. This is because some knowledge of the logical content of the record is needed to provide efficient diagnostics. For example, "illegal character" is not as useful as "illegal character in account number".
A reasonable compromise is to require operation (2) to set up the diagnostic messages in an explicit form, and hand these over to a generally coded output routine to write them in the necessary form. Any editing statistics needed can be handled in the same way.
In accordance with the above, the author has written a generalised edit routine for the Control Data 3600. This routine must be used in conjunction with a program performing operation (2) for the particular application.
Control cards are read which specify the structure required for the input file, the edited file and the error listings. The routine then reads one logical record from the input file and interprets this according to the specified structure. Any errors that are noted at this stage are signalled by a set of flags.
The provided routine is given this record and must edit it, storing the edited record and any diagnostic messages to be written in arrays accessible to the generalised routine. The edited record is then written and the error listings written if necessary. The next input record is then read and so on.
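The division of labour between the generalised routine and the provided routine amounts to a driver loop around a user-supplied editing function. The sketch below is a modern illustration of that structure, not the 3600 routine itself; the four callables and their names are assumptions made for the example.

```python
def generalised_edit(read_record, provided_edit, write_record, write_errors):
    """Driver loop mirroring the division described in the text: the
    generalised routine reads and interprets each logical record, hands
    it (with its input-stage flags) to the provided routine, then
    writes the edited record and any diagnostic messages."""
    while True:
        record, flags = read_record()       # operation (1), flags signal input errors
        if record is None:                  # end of the input file
            break
        edited, messages = provided_edit(record, flags)   # operation (2)
        write_record(edited)                # operation (4)
        if messages:
            write_errors(messages)          # operation (3)
```

The provided routine never touches the input/output machinery; it only fills in the edited record and the message list, which is the interface the thesis describes.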
This routine has been used by several people for different applications and the division into provided and generalised routines seems quite satisfactory.
1.5.3 The rest of this paper will therefore consider that subset of the editing process which requires a knowledge of the logical content of the records. A generalised technique will be introduced to enable this section to be written with a minimum of time and effort, yet with the capability to write efficient programs and to simplify debugging as much as possible.
This subset will be referred to as the "logical edit", and is defined as follows:
"a logical edit is a procedure which accepts a record, converting the items where necessary, and evaluates its condition by applying checks on or between the items and external values, which may be part of the edit program or may be other records of some kind. The edit may adjust items if appropriate, and add new items, and it will generate the appropriate messages to report on the conditions found".