
EDITING BY COMPUTER

A thesis submitted for the degree of MASTER OF SCIENCE

at the

Australian National University, Canberra

June

1967

by

David MENDUS


CONTENTS

PREFACE
SUMMARY
STATEMENT

CHAPTER 1. INTRODUCTION
1.1 Introduction and definitions.
1.2 Error checks.
1.3 Automatic adjustment.
1.4 Some extensions and problems.
1.5 The logical edit.

CHAPTER 2. LOGICAL EDIT SPECIFICATION
2.1 Logical edit design.
2.2 The extent of the language.
2.3 Existing techniques.
2.4 Evaluation of existing methods.
2.5 The suggested approach.

CHAPTER 3. THE LANGUAGE
3.1 Introduction.
3.2 The structure of FRED.
3.3 Executable statements.
3.4 Declarative statements.
3.5 Examples.

CHAPTER 4. THE IMPLEMENTATION
4.1 Pre-processor form.
4.2 Compiling techniques.
4.3 The recogniser.
4.4 Processing routines.
4.5 Implementation limitations.

CHAPTER 5. USING FRED
5.1 Example 1.
5.2 Example 2.
5.3 Possible improvements.

APPENDIX A  SYNTAX NOTATION
APPENDIX B  JOB CONTROL LANGUAGE
REFERENCES
APPENDIX C  A LISTING OF THE PREPROCESSOR
APPENDIX D  THE EXAMPLES (attached)


PREFACE

I wish to thank Mr B.W. Smith for supervising this thesis, and for his many valuable criticisms and suggestions.

SUMMARY.

Editing and correction of records by computer is very important when large volumes of data are being processed. As well as improving the quality of the results, a well designed edit can also simplify the processing of the data by detecting and correcting records which could cause the processing routine to malfunction.

Chapter 1 considers this task, and justifies its division into two sections. File handling and input/output is dependent on the machine and peripherals used, but requires no knowledge of the logical content of the record. On the other hand, the checking of the items and generation of error messages is less machine dependent, but needs a detailed knowledge of the logical content of the record. This latter task, called the "logical edit", is considered in this thesis.

The creation of a logical edit involves three phases: analysis, programming and testing. Commonly a different language is used for the analysis and the programming, and this can lead to errors and difficulty with the testing. Using decision tables the first two phases can be combined, as decision tables can be used as a programming language in conjunction with a suitable compiler. However, they are not ideal as a means of specifying a logical edit. In particular, they are cumbersome and time consuming. A need exists for a better language that can be used both for analysis and as a computer program. Chapter 2 considers the properties of such a language, "FRED".


The statements of FRED represent the most common logical tasks required in editing, and include such features as look-up tables, error flag setting, adjustments and so on, not normally contained as single statements in a programming language. To deal with less common tasks, arbitrary blocks of PL/I statements may be included in an edit.

FRED also has a structure to represent the branches of the edit in an intuitive way. This structure exists in two forms, one of which is used for the program itself, and hence appears on the program listings. This makes a program written in FRED particularly easy to read and debug.

FRED is written to be compatible with PL/I, and a FRED edit will generally form one procedure of a PL/I

program to perform all editing tasks.

Chapter 4 describes the way in which a FRED pre-processor has been written and tested. The pre-processor

is written in PL/I and converts an edit written in FRED into a logically equivalent PL/I procedure. Any PL/I statements read by the pre-processor are copied unchanged, so that a program may be written of mixed PL/I and FRED blocks, and the whole submitted first to the pre-processor,

then to the PL/I compiler to produce an edit.

Chapter 5 contains sample edits written in FRED, and mentions improvements suggested by the early use of FRED.

STATEMENT

The material reported in this thesis is the author's own work except where references to other work are acknowledged.

CHAPTER 1. INTRODUCTION

1.1 INTRODUCTION AND DEFINITIONS.

1.1.1 Editing is a process to check the accuracy of a set of items of information. These items are usually organised into logical records (frequently referred to as records).

A logical record is a collection of related items; for example, if a set of items is the information collected for an area in a population census, then those items relating to one dwelling may form a logical record.

The items may also be grouped because of the properties of the medium which stores them. For example, only a few items may be contained on one punched card or one length of paper tape. Such a group is called a physical record. In many applications there is a simple relation between logical and physical records.

A logical record is represented by a sequence of data characters on some medium together with an implied structure which enables these characters to be interpreted as the items of information. The data normally consist of alphanumeric characters which represent the information, and separators, which may also be alphanumeric, which do not represent information, but which, in conjunction with the implied structure, enable the items of information of the record to be determined. Thus, the character string

"£142/10/4"

contains three separators "£", "/" and "/", and these enable the characters "142104" to be interpreted as one hundred and forty two pounds, ten shillings and fourpence.
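The interpretation of such a string can be made concrete with a small sketch. The thesis works in PL/I and FRED; the fragment below is Python, used purely for illustration, and the function and field names are assumptions rather than anything defined in this thesis.

# Illustrative sketch: recover the items of a record from its data characters
# by using the separators together with the implied structure.
def interpret_currency(record: str) -> dict:
    """Interpret a string such as '£142/10/4' as pounds, shillings and pence."""
    if not record.startswith("£"):
        raise ValueError("expected the '£' separator at the start of the record")
    pounds, shillings, pence = record[1:].split("/")   # the two '/' separators
    return {"pounds": int(pounds), "shillings": int(shillings), "pence": int(pence)}

print(interpret_currency("£142/10/4"))
# {'pounds': 142, 'shillings': 10, 'pence': 4}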


The purpose of the edit is to improve the quality of the information and remove those errors which will impede later operations on the file. Thus, for example, in an edit attached to the job of computing pay, an attempt will be made to remove all errors by including redundant information, such as checksums, to allow the checking of all items. In a typical statistical application, on the other hand, such an edit would be impractical with the available resources and the edit will be designed to remove all errors which would seriously affect the statistics and merely reduce other errors to an acceptable level. The edit may also be used to point out exceptional, but correct, records. For example, the import of a warship costing several millions of dollars disturbs the detailed import statistics so much that special mention may have to be made. Such events are easily highlighted at the editing stage.

A further use of the edit is to simplify later processing tasks. Difficult situations such as zero divisors can often be detected at the edit stage and appropriate action taken quite easily, thus simplifying the later processing runs. Missing and absurd information such as illegal code values can also be detected and corrected during the editing. Also at this stage codes and serial numbers can be inserted to be of use in the later tasks, such as a serial to be used as a subscript in later tabulation routines, and batch totals can be computed to check on the completeness of the data.

The errors arise in two ways. The first is due to incorrect, incomplete or inconsistent information at the earliest stage, for example, an error of observation or an incorrect entry on the original document. These will be referred to as "source errors". The second type of error is that caused by a mistake in the extraction, transcription or storage of the information. For example, an error caused by punching an incorrect character into paper tape, or by including some records twice in the file, or an error while reading the data from magnetic tape. These errors will be referred to as "transcription errors".

The value of editing is such that it is common to include redundant information just for the purpose of providing an editing check. Thus, the total of a group of items may be included in the record, or the rate at which duty is levied may be included in an excise record even though this should be determinable from the nature of the item, and a check digit may be applied to a code number.

1.1.2 There are many advantages in using a computer for editing. The types of logical and arithmetic checks that are most useful can be performed quickly and cheaply on a computer. Further, the computer does not get bored by applying the same checks over and over again, and continues to operate with very high reliability.

The value of computer editing is increased if the edit is used to query unlikely events as possible errors and also to make corrections where possible to the errors found.

The unlikely events queried in this way can then be checked manually for two important conditions. First, reference to source documents or beyond may show that the record is in error; for example, this will usually be the case when hospital records report the birth of quintuplets. Secondly, the information may be correct, but so significant as to need special action, such as the case of the import of a warship mentioned earlier. Records queried in this way form a class which must usually be distinct from the class of records found to be in error, since many programs are written that do not allow further processing to be done on a file that contains errors, whereas, as noted above, many of these queried records will be correct and have to be processed in the usual way. The attention that can be given to these exceptional cases when singled out in this way is much greater than is possible when this type of record has to be searched for by hand from the whole set of information.

Making corrections by computer is a very valuable technique, though it is very difficult to program. Indeed some corrections that can be recognised fairly easily manually seem too complex to program using existing techniques. An example is recognising that a punched card has been read upside down. However, many simple transcription and source errors or omissions can be properly corrected by program, particularly when the records contain numeric data. Corrections are particularly valuable in work of a statistical nature, where complete accuracy is not expected, and limited resources make it impractical to investigate all source errors. Here, it is a common procedure to apply an "arbitrary adjustment" that would make the information an acceptable record with, at the worst, an insignificant effect on the accuracy of the final statistics. These adjustment rules can be programmed very effectively, and the ability of the computer to make complicated decisions very rapidly allows these adjustment routines to take account of many factors.

Further, the reliability and predictability of the computer allows the effect of such rules to be estimated with a high degree of accuracy, and steps can be taken to reduce bias in a way that is not possible with clerical amendment.

Using the computer for editing has a further important advantage in the very common case of the further processing also being done by computer. This is that the incidence of transcription errors between editing and processing is reduced to a minimum. If the processing occurs immediately after editing, then no transcription occurs at all, and the data will only be altered by a malfunction of the computer - the likelihood of which can truly be regarded as negligible.

If the processing occurs at another time, even on another computer, then the data can be stored on one of the magnetic mass-storage media such as magnetic tapes, discs or drums. The reliability of these devices is far greater than that of mechanical readers and writers involving, for example, paper tape or punched cards. The error rate will generally be less than one error in 100,000,000 characters, and since an error is generally detected by the reading device or by program, using such devices as checksums and parity bits, the chances of an undetected error will generally be even less.

There is, however, one severe problem in editing by computer. All types of error must be anticipated and included in the program before editing commences. This is naturally a much more difficult task than that of recognising errors as they occur.

A human editor, on the other hand, can draw on information over a large period of time and access it by a rapid process of "association". Thus, if the edit of a personnel file specifies no check on date of birth and if, due to a mechanical fault, a large number of records are transcribed with dates of birth in, say, the seventeenth century, these will not be detected by a computer edit and may remain unnoticed until a report quotes the average age of employees as 342, with embarrassing consequences! Using clerical editing, however, there is a fair chance that such errors, unforeseen initially, may be discovered and corrected before much damage is done.

The original analysis of the information for computer editing needs to be very complete. Though this analysis can be tested by test runs on actual data, the results of which are closely examined, much work still needs to be done on this problem.

1.1.3 Once the computer edit has been written it can be used rapidly and frequently. Thus, the data can be edited in conveniently sized batches soon after these have been received. There are several advantages to such a scheme.

Any source errors that are detected can be queried

soon after the information is collected. Memories will still

be fresh, documents still accessible and respondents found

easily. The quality of the corrections to these errors will

then be much higher than if the query was made later.

Transcription errors which are detected may lead

to repairs to the transcription devices to correct a recurring

error, or further explanation to an operator or coder who is

misinterpreting the instructions. In this way the build up

of a large number of systematic transcription errors may be

avoided.

Analysis of the errors recorded may lead to improvements to the editing itself, or may highlight deficiencies in the information being provided. Thus the system may be improved at the earliest possible moment.

Because of these considerations, editing is usually carried out at an early stage of computer operations on the file, and is often associated with the transcription from an external medium, such as punched cards, paper tape or documents for an optical reader, to a computer oriented medium such as magnetic tape. This procedure has the further advantage that the other computer runs can be written with the assumption that the data will contain no destructive errors such as zero divisors or out of range subscripts, since these can be excluded during the edit.

In some cases the edit will form a computer run on its own, in others it will merely be a part of a run to perform several steps. If the information arrives in batches over a period of time, but cannot be processed until all has been edited, then a batch will normally be edited as soon after its receipt as possible by a separate edit routine. If, on the other hand, the volume of data is small and the edits are not too complicated, it may be desirable to edit and process the data (or at least the acceptable section) at once. Another factor in this decision is whether or not the combined editing and processing programs can be conveniently fitted into computer memory directly or by a suitable overlay process.

Though there are many schemes for performing edits, the two represented by the attached flow charts are typical of a large number of edit runs.

The first flow chart represents an application where there is only a small volume of data, and the error rates will be low, for example a scientific computation.

The second flow chart represents a larger application where the edit is a separate computer program. The output of this program will be a file of records now arranged in the form best suited for later processing. Since the record size is frequently large, even error records are included on the file and a second edit program is written to correct the file by altering records, removing records or inserting records. Records altered or inserted in this program will generally be edited with the same edits as the original records.

References (2) and (3) discuss the importance of editing and its relation to other computer operations.

[Flow charts (attached): (i) First edit run, repeated as required. (ii) Amendment run, repeated as required.]


1.1.4

There are two kinds of output from an edit. The first is the file of logical records in their edited form, and the second is a file of the messages about the records that were generated by the edit. Although the form, and even the number, of output files of the above types vary from application to application, the general terms "edited file" and "error listings" will be used for the two types.

In those programs where the editing and processing are combined, the term edited file can be applied to the one logical record currently being processed, but where the editing is separated from the processing then all edited records must be stored on an external file. Since this is created by, and used exclusively by, the computer it is usually written on one of the magnetic mass storage media: tape, discs or drums. This file will frequently be very different in format to the original data file.

The logical record will frequently have a more rigid format. It is common, especially when using paper tape as an input medium, to economise on characters by devices such as excluding leading zeros and non-significant blanks and omitting sections of the record where possible. This is valuable in that it reduces significantly the number of characters to be punched, often the most expensive operation of the whole processing, but it does mean that the layout of the records will vary considerably on the file and a special routine is needed to interpret the varying forms of structures. As this routine may be quite extensive it is desirable to use it once only, and after the first reading the file is kept in a rigid format which may include a large number of non-significant characters but which is easy to interpret.

The logical record may also contain items which were not present in the input record; for example, values that were implicit in the input form of the record and which are needed explicitly for processing. These values may not have been present in the source data or they may have been omitted during transcription. For example, the results for weekly turnover for a retail store may include the figures for "value of goods sold including sales tax" and "value of sales tax", from which "value of goods sold exclusive of sales tax" can be imputed for processing.

Other added items may be special items that are attached for sorting, classification or tabulation. The derivation of these extra items may be very complicated, and it is advisable to perform this on a once-and-for-all basis. The logical edit is a suitable time for such computation. An example might be to add a "size code" of 1 if the store's turnover is less than 10,000 annually, 2 if it is 10,000 but less than 50,000, or 3 if it exceeds 50,000. If figures were later required for medium sized stores only, it is easier to consider those shops of size code 2 only than to test the turnover of every record, particularly as each record may be edited once only, but processed on several different runs.
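A minimal sketch of deriving such a size code during the edit (Python, illustrative only; the thesis itself generates edits in PL/I). The band boundaries follow the example above, and the treatment of a turnover of exactly 50,000 is an assumption, since the text leaves that boundary open.

def size_code(annual_turnover):
    """Derive the 'size code' described above: 1 = small, 2 = medium, 3 = large.
    Assumption: a turnover of exactly 50,000 falls in band 2."""
    if annual_turnover < 10_000:
        return 1
    if annual_turnover <= 50_000:
        return 2
    return 3

# The derived item is attached to the edited record once, so later runs can
# select medium sized stores by testing size_code == 2 instead of the turnover.
record = {"store": "example", "turnover": 23_500}
record["size_code"] = size_code(record["turnover"])
print(record)   # {'store': 'example', 'turnover': 23500, 'size_code': 2}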

A unique identifier or serial number may be attached to each record on the edited file. This will enable amendments to be specified to a particular record without fear of ambiguity. A check digit may be attached to this number to reduce the chances of a misquoted or badly transcribed number causing an adjustment to the wrong record. There may also be a flag attached to each record specifying the condition of the record - whether good, queried or in error. This may be referred to so that bad records may be excluded from later processing.

In very large jobs the edited file is sometimes

split into two. One file will contain only those records that have passed the edit, or possibly only those batches

for which all records have passed the edit. This file

should contain a large proportion of the records and may be processed at once. The other file contains all bad or doubtful records. The main advantage is that to process the amendments it is necessary to read only the shorter file, not the whole set of data. The running time of the amendments program should be correspondingly reduced.

The error listings in some cases may contain details of every record, but more commonly list only those records that were found to be in error, were queried or had adjustments made to them. The important features are that the items of the record are clearly laid out in a way that facilitates comparison with the source information, that clear but concise messages give details of the errors found, and that the serial number is included ready for the amendment processing. In some cases it is possible to design the listing so that corrections may be made on the listing, which is then used as a source document for the data preparation of the amendment.

A separate error listing may be needed for those records so bad that they cannot be interpreted as items of information. These records are particularly common on paper tape. Since the items, which are the most important feature of the above listing, cannot be distinguished, a completely new layout is required to show the record as a string of characters. Again error messages will be given; these will generally refer to unintelligible separators. Since such a record cannot be written to the edited file, unless the sequence of records is so important that a space must be reserved for the corrected version of this record, no serial number will be allotted to it. Since the layout of these listings is so dependent on the medium and the installation, it is not considered further here.

1.2 ERROR CHECKS.

1.2.1 An a priori knowledge of the nature of the items of information and the relationships between them is needed to define the error checks. Nordbotten (5) classifies this knowledge into two general classes, theoretical knowledge and empirical knowledge.

Theoretical knowledge is the knowledge given by the definitions of the items and relations. Thus, if an item is defined to be numeric it cannot contain an alphabetic character, and value for sales tax multiplied by rate of sales tax must equal sales tax collected. If an item, or a relation, fails a check based on theoretical knowledge, then there must be an error. In most applications, the record cannot be accepted without some adjustment so that the record now satisfies the check. Theoretical checks are easier to define than empirical checks, and are fundamental to the editing process. However, the value of the edit will be much increased if empirical checks are added to the core of theoretical checks.

Empirical knowledge is knowledge about the items of information that is true as a matter of fact and not of necessity. This will be gathered from experience of previous records of this type or by estimation of the values that may occur from models, samples or similar edits. Thus, the yield of oats should not exceed 40 bushels per acre, or the temperature during an experiment should not exceed 120 degrees.

In general, it is not possible to say that an item or group of items which fail an empirical check are in error; it can only be said that it is unlikely that the items are correct. However, it may be desirable, though not rigorous, to say that an item failing an empirical check is in error.

The limit on such a check may be chosen to satisfy one of two criteria. Firstly, the probability that a good record is rejected as in error may be kept to some limit or, secondly, the probability that a bad record may be accepted as correct can be kept to some limit. As there are more good records than errors, and as the good records are "better behaved", it is more convenient in practice to satisfy the first criterion, and this is usually done.

Empirical checks may be applied on a two-level basis, for example "if the age at first marriage exceeds 70 then query the record; if it exceeds 100 then the record is in error". Other extensions of the empirical checks are to reset the limits during editing either to refine the limits, or to follow trends, or to use historical or other external information particular to that record.

No further distinction is made between theoretical and empirical checks in this section.

1.2.2 The most common type of check is that in which one item is checked against values or rules written into the edit program. In its simplest form this is the "validity check", which confirms that the characters of the item are of the permitted type. Such a check may be extended to take special action on encountering certain characters; for example, as well as checking that there are numeric characters only in an item, special action may be taken on reading "?", which is often used to denote a transcription error while converting data from one medium to another by computer.

Other checks that can be applied to one item of information alone use more knowledge of what the information represents. For example, the code for sex must be M or F, or the amount of income tax paid annually must exceed one dollar. Large look-up tables may be required to establish validity in some cases; for example, country of origin codes on overseas trade statistics number over 100. The item may even be split into sub-items and a relation between these examined, for example a number with a check digit.

1.2.3 A second class of checks consists of those carried out by applying logical or arithmetic checks to two or more items within one record. These checks are often referred to as "cross-checks". They apply to the information represented by the items and usually assume that the data characters are valid. For this reason these checks commonly follow a series of validity checks and are often omitted when one or more of the items involved has failed a previous validity check.

Logical relations of this type are widely applied to numeric, coded or alphabetic information. Very common are simple existence relations, if A then B; for example, if relation to head of household on a population census form is D for daughter, then sex must be F, female.

For arithmetic information, a large class of functional checks is available. In the most general form such a check requires that the value of some function f of the items being checked must lie between the values of two other functions a and b of items of the record; that is

a ≤ f ≤ b

where generally a and b are simple functions such as constants or values obtained from a look-up table on one item. The principal function f may however be quite complicated.

Some special cases of this check are very important. One is a check where f is the ratio of two items and a and b are constants or simple functions of a third item. For example, the ratio of the amount of wheat produced to the number of acres planted must lie between two limits, each dependent on the state in which the farm is situated.

Another important case is that where f is an algebraic sum of several items and a and b are either both zero or some small value, negative and positive respectively. For example, in a report on the composition of a chemical mixture:

-a ≤ (sum of weights of component items - total weight) ≤ a

where the value a has been chosen to allow for any reasonable weighing inaccuracies, but is still small.
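The general form of the functional check, and the two special cases just described, can be sketched as follows (Python, illustrative only; the particular limits, tolerances and function names are assumptions, not values from the thesis).

def functional_check(f_value, lower, upper):
    """General form: the record passes if lower <= f(items) <= upper."""
    return lower <= f_value <= upper

# Special case 1: a ratio check, e.g. wheat produced per acre planted, with
# limits that could be looked up by a third item such as the state.
def ratio_check(wheat_bushels, acres_planted, low=5.0, high=60.0):
    if acres_planted == 0:                     # the zero divisor an edit should catch
        return False
    return functional_check(wheat_bushels / acres_planted, low, high)

# Special case 2: an algebraic-sum check, e.g. component weights against the
# total weight, with a small tolerance a for reasonable weighing inaccuracies.
def sum_check(component_weights, total_weight, a=0.5):
    return functional_check(sum(component_weights) - total_weight, -a, a)

print(ratio_check(30_000, 1_000))           # True: 30 bushels per acre
print(sum_check([12.1, 7.9, 5.2], 25.0))    # True: discrepancy of 0.2 within tolerance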

1.2.4 Further checks are possible when more than one record is edited at a time. These checks are similar in structure to those described above but involve items from more than one record. Again, these checks usually assume the validity of the data characters has been checked previously, and may be omitted when items fail a validity check. There are three ways in which checks of this kind can be constructed.

Firstly, the file of records may have a hierarchical structure. The records in such files refer to information on two or more levels. For example, records at the first level might be a summary of a group of records at the second level. This may occur, for instance, in the records of a blood bank, where the first level record would contain details of the blood required by one hospital served by the bank. This would be followed by a set of second level records listing the blood required for each surgeon in the hospital. An obvious check is to confirm that the total quantity of blood of each type required by the surgeons is equal to the quantity of that type required by the hospital. There may even be third level records listing the blood required for each patient, a set of these records being grouped behind the record for the appropriate surgeon.
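A minimal sketch of the blood bank style of check (Python, illustrative only; the record layout and field names are assumptions): the quantities on the second level records are accumulated and compared with the quantity carried on the first level record.

# Hypothetical layout: a first level (hospital) record followed by its group of
# second level (surgeon) records, as in the example above.
hospital = {"level": 1, "hospital": "H1", "blood_type": "O", "quantity": 12}
surgeons = [
    {"level": 2, "surgeon": "S1", "blood_type": "O", "quantity": 7},
    {"level": 2, "surgeon": "S2", "blood_type": "O", "quantity": 5},
]

def group_check(first_level, second_level):
    """Confirm that the lower level quantities sum to the higher level quantity."""
    total = sum(r["quantity"] for r in second_level
                if r["blood_type"] == first_level["blood_type"])
    return total == first_level["quantity"]

print(group_check(hospital, surgeons))   # True: 7 + 5 == 12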

A second common type of hierarchical file is that where the high order records contain the information always present about one entity, and are followed by variable numbers of lower order records containing variable information. For example, a personnel file may contain a high level record for each employee containing personnel number, date of birth, present position and so on. Each would be followed by a lower level record for each previous position held, and an obvious check would be to query any record for which the salary for a previous position exceeds the present salary.

A practical problem with this kind of editing that is sometimes very severe is that it is not always possible to retain the high level record while all low level records relating to it are edited without destroying the file structure. If it is important not to write the high level record until all records of the group have been edited, special devices such as scratch files may be needed.

Secondly, a record may give information about an identifiable unit, and another record for this unit exists on some other, already edited, file. Once the identifying information has been edited the two files can be matched and edits applied between the records. Common examples of this type occur when a similar file exists for an earlier period of time; here items can be compared for "start and finish" matches, for example stock held at the end of the earlier period should equal stock held at the beginning of the later period.

Thirdly, the file may be homogeneous, but checks can be made between each record and its neighbours. The nature of the information may suggest checks, for example if the file contains records representing the same items over a period of time many checks can be devised to allow only reasonable variations. Even in general cases, statistical checks can be applied to short strings of records, often 3, to report unlikely variations in the items (see Nordbotten (5)).

1.3 AUTOMATIC ADJUSTMENT.

1.3.1 The correction of erroneous items, or the insertion of missing items by computer is a difficult, but worthwhile operation. It is less common than editing by computer, and even in those applications where it is used, it is often done on a trial basis and limited to a few items. However, its use does lead to a very large saving in time and labour, and hence cost, and it may lead to a considerable improvement in accuracy.

The alternative, correction by clerks, is sometimes regarded as exact, but in fact it is often a very unreliable process. Even in cases where it is possible for the adjuster to find the correct value, by reference to source documents or beyond, there can be no guarantee that this has been done, or if it has, that the amendment has been processed without transcription error. Where it is not possible to recreate the correct value, such as in statistical applications where an error exists in the source data, then it is common practice for the adjuster to create some new value by means of rules provided by the systems analyst or by using his experience. The former kind of adjustment will clearly be done better by a computer which will follow faithfully the instructions of the analyst, and which should make possible an elimination of bias. It is the experience of the writer that as many as 10% of corrections made clerically are incorrect after transcription, in the sense that the record when adjusted still fails the editing checks.

There are two severe problems in automatic adjustment by computer: deciding which item is incorrect, and choosing the value to which it should be adjusted. A further difficulty arises from the fact that the checks and corrections have to be written before editing commences, and so the analysis has to be very thorough indeed to be effective in all cases. Once put into operation it is difficult to change any of the instructions, and if there is an error, in extreme cases, several hundred thousand records may be edited before it can be corrected.

The procedure then is difficult, but the results so worthwhile that automatic adjustments should be included wherever possible. Most of the methods discussed below are particularly applicable to numeric data. Correcting alphabetic data such as names and addresses is much more difficult.

1.3.2 The difficulty of deciding which item is in error occurs only when functional or logical checks between two or more items fail. This kind of check unfortunately is a very common detector of source error. The preferable way of resolving the problem is to apply further checks to each of the items until the erroneous item can be determined, but this is not always possible.

In the case of editing a pay slip, for example, if the relation

net pay + net tax + deductions + superannuation - gross pay = 0

is not satisfied, then we could introduce other checks such as "does gross pay = annual salary / 26?" or "does the value for deductions equal the sum of the various deduction items?" If one of these checks fails, and if the other value for the item satisfies the original equation, then we can be confident not only that we have found the item in error, but that we have also found the value to which it should be adjusted.
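A minimal sketch of this resolution of a failed cross-check (Python, illustrative only; the field names, the division of annual salary by 26, and the acceptance rule are taken solely from the example above and are otherwise assumptions).

def balance_ok(r):
    """The original relation: net pay + net tax + deductions + superannuation - gross pay = 0."""
    return r["net_pay"] + r["net_tax"] + r["deductions"] + r["superannuation"] - r["gross_pay"] == 0

def locate_error(r):
    """Apply the secondary checks and report the item (and its implied value)
    when exactly one of them points at the culprit and that value restores the balance."""
    if balance_ok(r):
        return None
    suspects = []
    expected_gross = r["annual_salary"] / 26           # "does gross pay = annual salary / 26?"
    if r["gross_pay"] != expected_gross:
        suspects.append(("gross_pay", expected_gross))
    expected_deductions = sum(r["deduction_items"])    # sum of the various deduction items
    if r["deductions"] != expected_deductions:
        suspects.append(("deductions", expected_deductions))
    if len(suspects) == 1:
        item, value = suspects[0]
        if balance_ok(dict(r, **{item: value})):
            return item, value
    return None

record = {"net_pay": 300, "net_tax": 80, "deductions": 50, "superannuation": 20,
          "gross_pay": 400, "annual_salary": 11_700, "deduction_items": [30, 20]}
print(locate_error(record))   # ('gross_pay', 450.0): adjusting gross pay restores the balance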

In other cases a check as to the reasonableness of each item may be of value. Thus, if in a chemical mixture the sum of the component weights does not equal the total weight, then a check on each weight might reveal that one was of unlikely magnitude, and the error could be attributed to that item with some degree of confidence.

A method that is very effective when the record is of a suitable type is to evaluate all possible checks before making any adjustment, then making all necessary adjustments from the information then available about every relation of interest. After such an adjustment the record is frequently re-edited and possibly further adjustments are made until either the record is correct, or adjustment is halted after a specified number of iterations.

Frequently, however, there is not enough information in the record to allow items to be checked in several ways so that the erroneous one of a set can be determined. Usually the data preparation resources available do not allow sufficient redundant items to be included for this purpose. A common method in such cases is to evaluate a priori the reliability of the various items and, where an adjustment has to be made, to make it to the item considered least reliable. For example, it is frequently found that values or prices will be more reliable than quantities, especially in respect of source errors, since it is always in the interests of one party or another to ensure the accuracy of such an item.

A further method sometimes may be justified on grounds of expediency. In this method the item which is least important, or else least significant, will be adjusted. No attempt is made to determine which item is actually in error, and this method is only adopted where absolute accuracy is either impossible or too expensive.

The technique used wherever possible by the author is a combination of the first and third methods above. This is to take a small group of items classed as most reliable and to edit these with all possible rigour. The validity of these items having been established (as it usually is), the scope of the edit is gradually extended by considering new items and editing these individually and against the group of items already edited. If necessary these new items are adjusted to agree with the previously edited items. In this way the whole record can often be edited so that at any failure it is obvious which item is in error, and how it should be corrected. If an error is detected in the starting group, then as these items are generally the most significant in the record, the record is in an unacceptable condition and must be corrected after careful reference to the source documents or beyond. In this case the edit of the rest of the record will be confined to signalling errors, and no corrections will be made.

One of the first edits using this technique was tested against clerical correction. A batch of data was edited using the program, and the original form of those records that were automatically adjusted was shown to a clerk experienced in editing such data. He was asked to correct these records and, when he had done so, his corrections and the computer corrections were compared carefully with the source documents. It was found that the computer adjustments were of a significantly higher quality than the clerical corrections.

1.3.3 The value to which an item should be adjusted is in some cases obvious. Where some arithmetic equality or near-equality has to be satisfied, or where further relations suggest some value which satisfies the check, then the adjustment can be made with some confidence.

It often happens, however, that no such method is available. One possible approach is to use a knowledge of the likely forms of error to the item during extraction or transcription to find an acceptable value. For example, if a numeric value read from a punched card is found to contain an alphabetic character, then a knowledge of the punching machine's characteristics could suggest a value such as the numeral whose punching forms part of the code for the character found. Thus, if the cards were punched on an I.B.M. 026, an A would be adjusted to 1, B to 2, C to 3 and so on. Similarly, if a set of characters from paper tape is found to be in error, then the characters which have the same punching but with different case shift would be tried. On a less complicated level, if an item is found to be too large, then it might be worthwhile to truncate the leading digit and accept the reduced value.
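A minimal sketch of the punched card substitution just described (Python, illustrative only). Only the A to 1, B to 2, C to 3 pattern stated above is assumed; extending the table through the alphabet would need the actual I.B.M. 026 code chart.

# A letter found in a numeric item is replaced by the digit that forms part of
# its punching on the 026; only the letters named in the text are included.
PUNCH_SUBSTITUTION = {"A": "1", "B": "2", "C": "3"}

def adjust_numeric_item(item: str) -> str:
    """Replace the known mis-punched letters by their likely digits; leave any
    other character alone so a later check can still flag the item."""
    return "".join(PUNCH_SUBSTITUTION.get(ch, ch) for ch in item)

print(adjust_numeric_item("12A4"))   # '1214'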

If none of the above methods is suitable then the method of "safest choice" can be used. In this method the item is adjusted to a value so that either the maximum possible error of the adjustment is least, for example replacing an alphabetic character in a numeric item by 5, or else it is adjusted so that the effect on the output is least, for example if sex is not stated then set this to male.

Three very interesting refinements of the "safest choice" method have been described elsewhere; they are particularly applicable to statistical data, and the reference includes some discussion of the effect of these methods on the final statistics.

Each method depends upon the construction of a cross-classification on a group of items so that the differences between records belonging to different classes are great, while there is little variation between records in the same class. When an item is found to be in error, or missing, the record is classified by the other items and the value for the bad item is chosen from this classification in one of three ways.

In the "cold deck method", prior to editing, a set of values is stored in the computer, at least one for each classification. The bad item is adjusted to the value, or one of the values, stored for this item in the classification which agrees with the record in all the other items of the group. If there are several values, then one is selected by a systematic or random selection process.

The "hot deck method" is similar except that when a record is found to be valid it is used to update the deck by replacing the old values of its classification by its own values. The correction procedure will therefore adjust an item to the value of that item in a recent valid record of the same classification.

Suppose, for example, that the set of values contains one element only, that the items in the group were sex and hours worked last week, and that the preset values for hours worked were 41 for males and 25 for females. If the cold deck method is used, whenever the figure for hours worked is unacceptable it will be set to 41 if the sex is male and 25 if the sex is female. If the hot deck method is used, the stored values are continually updated. If a record is read and accepted where a female works 37 hours, then the stored value is changed to 37, and if the next record has sex female but no value for hours worked, this item will be set to 37.

It can be seen readily that the hot deck method has less tendency to concentrate records around a few sets of values, and is also better able to follow any trends that may be present in the data.
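A minimal sketch contrasting the cold deck and hot deck adjustments (Python, illustrative only; the field names, starting deck values and the acceptance rule for hours worked are assumptions following the example above).

# Deck keyed by the classification group (here just sex); preset values as in
# the example: 41 hours for males, 25 for females.
deck = {"M": 41, "F": 25}

def edit_hours(record, hot_deck=True):
    """If 'hours' is missing or unacceptable, impute it from the deck; with
    hot_deck=True an accepted record also updates the deck."""
    hours = record.get("hours")
    if hours is None or not 0 <= hours <= 100:       # assumed acceptance rule
        record["hours"] = deck[record["sex"]]         # cold or hot deck value
        record["adjusted"] = True
    elif hot_deck:
        deck[record["sex"]] = hours                   # hot deck: remember the value
    return record

print(edit_hours({"sex": "F", "hours": 37}))    # accepted; the deck now holds F: 37
print(edit_hours({"sex": "F", "hours": None}))  # imputed to 37 rather than the preset 25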

The third method, the "Monte Carlo method", is more complicated but, except where there are trends in the data which favour the hot deck method, it may be more acceptable statistically. In this method two passes are made over the data. On the first pass, no corrections are made, but the distribution of values in each of the classifications is built up by assuming that the accepted values are true values. Using these sets of values and a theoretical model of their distribution - usually the Normal distribution - the parameters of the distribution are computed. On the second pass, whenever a correction has to be made, a random number generator is used to select a value from the distribution computed for the appropriate classification.

1.4 SOME EXTENSIONS AND PROBLEMS.

1.4.1 Most of the preceding ideas are especially relevant to numeric data, where a range of arithmetic checks is available for the edit, or to coded classifications, where sufficient logical relations exist to produce useful checks. The problem of writing a satisfactory edit for large alphabetic items such as names and addresses or descriptive items is much more difficult.

This is because of the very wide range of acceptable names, and the small number of rules in their formation. The rules are full of ambiguities, for example JOHN is acceptable both as a given name and as a surname, and in some cases they are deliberately broken, such as KOSY KAFE. The author, when studying a computer produced list of names and addresses of companies with statisticians experienced in the field, found that about 5% of all names could not be classed as right or wrong except by reference to source documents. There seem to be no rules available for such names as WESTERN GOLF FIELDS EXPLORATION CO. or THE KWICK CLEANERS PTY. LTD.

However, in view of the large amounts of this sort of data that are processed, and the huge amount of work involved in checking these lists clerically, it seems desirable to edit as far as possible. For example, it is possible to restrict the characters used in names to alphabetic, hyphens and apostrophes, to detect such errors as SM2TH. Other checks can be determined by the nature of the data being processed. Thus, in a motor vehicle registration application, the number of manufacturers is small and any name given as such can be tested against each. A correction process can even be devised in which a search is made down the list to find the correction; thus FURD could be corrected to FORD, but FROD would still give trouble.

Again, if names are being held, then it may also be convenient to hold the initials for addressing, say, and an edit can be written to check given names against initials.

1.4.2 A similar lack of possible checks exists for a code number, or reference, which distinguishes a record from others on the file. This is a very important item and it may be used frequently for updating and retrieval purposes, but generally it has no logical relation to any other information in the record. Similar considerations apply to items such as account numbers which may not form part of the processing system, but which are vitally important.

To overcome this problem the techniques of check digits have been devised (see Beckley (5)). Here an extra character or characters are added to the number. These characters carry no extra information and do not increase the scope of the number, but they are calculated from a relation between the other digits of the number. The method of derivation is carefully chosen so that the common types of error that can occur while transcribing the number, that is misreading one character or transposing two adjacent characters, will cause the relation to be broken. Thus each time that the code is read by the computer, the calculation is repeated and if the computed check digit agrees with that given then there is a very strong probability (often as high as .999) that the number has been correctly transcribed.

The most commonly used check digits are based on the "modulo 11" technique, where a product is formed of each digit of the number and a weight, the products are summed, and the sum is divided by 11. The remainder is used as the check digit. Where the remainder is 10 either the digit X may be allocated, or else the code number may not be allocated. Sometimes alphabetic characters are used.

For example, if the number is 1234, we form the sum

1 × 2³ + 2 × 2² + 3 × 2 + 4 = 26

and dividing by 11 we have remainder 4, so the number with check digit is 12344.
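A minimal sketch of this modulo 11 calculation (Python). The weights 2³, 2², 2, 1 are taken from the example above and generalised as successive powers of 2 for longer numbers; real schemes vary in their weighting, so treat that generalisation as an assumption.

def modulo_11_check_digit(number: str) -> str:
    """Weighted sum of the digits as in the 1234 example; the remainder on
    division by 11 becomes the check digit ('X' where the remainder is 10)."""
    weights = [2 ** k for k in range(len(number) - 1, -1, -1)]
    total = sum(int(d) * w for d, w in zip(number, weights))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def verify(number_with_check: str) -> bool:
    """Recompute the check digit and compare it with the one supplied."""
    return modulo_11_check_digit(number_with_check[:-1]) == number_with_check[-1]

print(modulo_11_check_digit("1234"))   # '4'  (1*8 + 2*4 + 3*2 + 4 = 26; 26 mod 11 = 4)
print(verify("12344"))                 # True
print(verify("12434"))                 # False: a transposition of adjacent digits is caught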

1.4.3 It has been mentioned before that it is often valuable to use the data currently being edited to adjust and improve the edit checks themselves. This technique is frequently applied to the upper and lower limits of a functional check where the function is a ratio.

This adjustment can serve two purposes. Firstly, it can be used where there is little or no knowledge on which the limits can be set initially. At the start of editing the limits can be set very wide, and use made of this technique to improve the limits as the editing progresses. Secondly, where the data are not homogeneous, for example recorded temperatures which will vary seasonally, this adjustment can be used to follow the trend of the data and keep the edit effective.

The resetting of the limits is usually carried out at set intervals in the editing, not necessarily after each record. Thus, the limits may be reset after each batch of processing, or during the batch after every 1000 records. Normally the amount of data that is processed between resettings is too small for the values derived from it to represent the values of the whole of the data. For this reason some historical aggregates are usually used too. That is, some totals will be kept of the records to date and these will be merged with the recent values before the calculations and resetting are performed. The weights given to the recent records will be dependent on the nature of the data; if the purpose of the adjustment is to follow the trend of the data then more weight will be given to the recent values than if the data were homogeneous and the adjustment used to improve the limits.

A similar approach, with less statistical justification, is to count, during the editing run, the number of records submitted to this check, the number that were too high and the number that were too low. At the end, compute the ratio of those rejected as too high to the number submitted, and if the ratio is higher than a predetermined value, raise the upper limit by a preset amount. If the ratio is lower than a second predetermined value, then lower the upper limit. Treat the lower limit similarly.
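A minimal sketch of this counting approach for the upper limit (Python, illustrative only; the thresholds and the step size are assumptions). The lower limit would be treated symmetrically.

class AdjustableUpperLimit:
    """Count submissions and 'too high' rejections for one check, and move the
    upper limit by a preset step when the limits are reset."""

    def __init__(self, upper, step=1.0, raise_ratio=0.05, lower_ratio=0.01):
        self.upper = upper
        self.step = step
        self.raise_ratio = raise_ratio    # raise the limit if rejects exceed this share
        self.lower_ratio = lower_ratio    # lower it if rejects fall below this share
        self.submitted = 0
        self.too_high = 0

    def check(self, value):
        self.submitted += 1
        if value > self.upper:
            self.too_high += 1
            return False
        return True

    def reset_limit(self):
        """Called at set intervals, e.g. after each batch or every 1000 records."""
        if self.submitted == 0:
            return
        ratio = self.too_high / self.submitted
        if ratio > self.raise_ratio:
            self.upper += self.step
        elif ratio < self.lower_ratio:
            self.upper -= self.step
        self.submitted = self.too_high = 0

limit = AdjustableUpperLimit(upper=40.0)          # e.g. bushels of oats per acre
for yield_per_acre in [32, 38, 41, 44, 43, 39]:
    limit.check(yield_per_acre)
limit.reset_limit()
print(limit.upper)   # 41.0: half this batch was 'too high', so the limit is raised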

1.4.4 It is desirable to produce some statistics on the editing of each batch of data. In its simplest form this can be merely a count of the number of records in error, the number queried and the number accepted. These figures give an instant, but rough, picture of the quality of the data, and may be of use in deciding that a batch of data is so bad that it cannot be accepted. Such a decision may enable the printing of the error listings to be cancelled, as these may be held on magnetic tape and printed only when the statistics as above have been examined. As printing is a relatively expensive operation, this may represent a considerable saving.

More useful is a count of the number of failures at each editing check. These give a very clear picture of the type of error occurring, so that attention can be given to any error occurring frequently. This may be prevented by clearer instructions to those preparing the data, or by modification of the edit program if this is at fault. These figures can be made available very rapidly and give a better picture of trends in the results than the long process of clerical inspection of the complete error listings.

Particularly in those cases where adjustments are made on a statistical, or even arbitrary, basis, it may be useful to extend the statistics given with each run by aggregating the changes made to each item throughout the batch. These provide a control not upon the data quality but on the editing process, and may provide an estimate of any bias that is introduced into the file by these adjustments.

A further output of the editing run may be a list, or a summary, of a key item of each record, set out so that the list can be scanned rapidly to detect omissions or duplications in the data. Sometimes both are produced, and the summary used to detect the existence of a discrepancy and the appropriate section of the list used to find the actual errors. Such a listing is often called a "balancing listing".

1.5 THE LOGICAL EDIT.

1.5.1 The editing process, as it is usually programmed, can be divided into four operations, which can be described in general terms as:

(1). The data characters are read from some input device and interpreted according to some predetermined structural rules.

(2). The record, as produced by (1) is subjected to editing checks and adjustments.

(3). Messages are produced to signal any errors, queries or adjustments.

(4). The edited record is stored in some form suitable for later processing.

The file update and correction phase follows a very similar pattern, step (1) now being extended to include the matching of the amendment information to the appropriate information on the file to be updated. The remarks made below will apply in general to both processes.

The purpose of this section is to define a subset of this whole process, called the logical edit, and to

justify its consideration as a separate entity.

Operations (1) and (4) do not require any knowledge of the logical content of the record. All that is required is a knowledge of the physical structure of the record, and of the characteristics of the particular input/output device being used. The connection of these sections with the rest of the program is simple, usually being only a block of storage containing the record, but the connection is complicated if an error is detected in the input operation.

It may happen that the input record is impossible to interpret. This sometimes occurs in a free format medium such as paper tape. Under these circumstances, the record must be excluded from the edit, but some message must be given to enable the record or records to be identified and resubmitted for later processing. This message will usually consist only of the character string and some diagnostic, as not enough information will be available to write the record in accordance with the principles discussed earlier. These messages will thus be quite different from those written by operation (3) and may be regarded as distinct from them. They may either be written on a separate file or else on the same file, relying on the sophistication of the software to accept messages from two independent sources for the same file.

On the other hand, the input record may be interpretable, but may contain errors in some items that are detected at the input stage; for example a parity error may be signalled. In such cases, it is often desirable to note this fact during the editing of operation (2), possibly modifying the checks because of it and possibly commenting on the error in the normal listings of operation (3). For these purposes a flag or set of flags may be needed to communicate between (1) and (2).

It does seem, therefore, that the first and last operations can be separated from operation (2). Some communication between them is necessary, but this is not very involved. Further, such a separation seems logically justified because, in general terms, operations (1) and (4) require no knowledge of the logical content of the record but need a knowledge of the particular machine environment. On the other hand, operation (2) can be specified without regard to the machine environment, but is dependent on the logical content of the record. Furthermore, there is some justification for regarding the problem of efficient specification and implementation of (1) and (4) as "solved". Even apart from the input/output facilities of most high level languages, which are usually adequate for this task unless there is a very large volume of data, most machines have available an input/output program to deal very rapidly with logical records in the very manner needed by the operations. In the few cases, such as paper tape, where the conventions of each installation differ to such an extent that little can be done on an inter-installation basis, it is quite easy to provide a package to read logical records according to the installation's own standards.

These operations will not be considered further in this paper.

Operation (3) above is also to some extent machine dependent, as again an output device is used requiring programs particular to the machine. However, it cannot be isolated from operation (2) as clearly as can the others. This is because some knowledge of the logical content of the record is needed to provide efficient diagnostics. For example, "illegal character" is not as useful as "illegal character in account number".

A reasonable compromise is to require operation (2) to set up the diagnostic messages in an explicit form, and hand these over to a generally coded output routine to write them in the necessary form. Any editing statistics needed can be handled in the same way.

In accordance with the above, the author has written a generalised edit routine for the Control Data 3600. This routine must be used in conjunction with a program performing operation (2) for the particular application.

Control cards are read which specify the structure required for the input file, the edited file and error listings. The routine then reads one logical record from the input file and interprets this according to the specified structure. Any errors that are noted at this stage are signalled by a set of flags.

The provided routine is given this record and must edit it, storing the edited record and any diagnostic messages to be written in arrays accessible to the generalised routine. The edited record is then written and the error listings written if necessary. The next input record is then read and so on.

This routine has been used by several people for different applications and the division into provided and generalised routines seems quite satisfactory.

1.5.3 The rest of this paper will therefore consider that subset of the editing process which requires a knowledge of the logical content of the records. A generalised technique will be introduced to enable this section to be written with a minimum of time and effort, yet with the capability to write efficient programs and to simplify debugging as much as possible.

This subset will be referred to as the "logical edit" and is defined as follows:

"A logical edit is a procedure which accepts a record as a set of items, interpreting these further where necessary, and evaluates its condition by applying checks on or between the items and external values, which may be part of the edit program or may be other records of some kind. The edit may adjust items if appropriate, and add new items, and it will generate the appropriate messages to report on the conditions found."
