• No results found

10 Checking and Correcting

57 Anomaly searching

DISCREPANCY SEARCHING has similar uses to ANOMALY SEARCHING and they are often used together.

57 Anomaly searching

Example: Download a sample of several thousand deals from a commodity trading system and sort them into date order using the deal date as entered by the trader. You may  nd that some of those deals appear to have been made around hundred years ago. The date has been entered incorrectly and the anomaly is obvious.

ANOMALY SEARCHING is useful in COMPUTER SUPPORTED AUTHORIZATIONS and generally as a computerized check on data. It is also a primary weapon in a

RECOVERY/CLEANSING PROJECT.

™™™

Ideally errors would not arise but they do so throw computer power at the data to

 nd them.

Imagine that a process creates data records and these are stored in a computer system and also copied over to another computer system for further processing.

By searching for anomalies I mean looking in just one of these computer systems for records that look odd for some reason. By searching for discrepancies I mean comparing the data records between the two computer systems for differences that indicate something has gone wrong.

Of course we prefer to avoid having incorrect data arise at all, and prefer to prevent incorrect data ever getting into computer systems, but these things happen so the next line of defence is to look for anomalies.

This may be particularly relevant if edit checks at the front end are weak, out of date, or were weak when the data was entered, perhaps years ago.

Searching for anomalies can be done routinely or as part of a special project. It is normally done using computer queries followed by more in-depth investigation by humans. Over time, anomaly search that is part of a project can become a routine control through greater automation.

Searching for anomalies in large populations of data records makes it possible to use techniques not available in edit checking because they rely on seeing a large sample of data records, against which anomalies can stand out.

Typical anomalies found in practice include:

Empty  elds.

Numbers out of the reasonable range e.g. zero, much too large, negative, total

does not exactly agree to the sum of details.

Numbers vastly too large or too small, suggesting that the wrong units have

been used e.g. user is supposed to enter a money  gure in thousands of dollars but mistakenly enters it in dollars.

Values that are not one of the valid list of values for that  eld.

More  eld formats than should appear e.g. if the standard format for a customer

code is three letters followed by  ve digits then other formats, if found, are anomalous.

Dates that are obviously wrong e.g. from the wrong century.

Dates out of order e.g. an item sold before it was bought.

Wrong list of  elds on a data record –– possible with some kinds of database on

very old systems.

Flag values (i.e.  elds which have simple codes) that are inconsistent with other

 eld values or are in inconsistent combinations e.g. ‘‘number of children=2’’ but

‘‘parent(Y/N)=N’’.

Subsidiary record missing e.g. billing records for a telecom service that should

always include a record to bill for the line, but where that record is sometimes missing.

Subsidiary record present that should not be.

Distribution of digits in a large population is different from Benford’’s law.

3

Combinations of  eld values that are statistically unusual.

Field values that are statistically unlikely given the values of other  elds on the

record e.g. typically old items might have lower values so that an old item with a high value is unlikely. These can be shown on scatter plots.

Anomaly search can be done using implicit or explicit rules, and where explicit rules are used these can be identi ed by a person or created by software automatically.

Before thinking about explicit rules it is quite easy and productive simply to look at the data. Sort it on each numeric or date  eld and look at the top and bottom items, look at the number of formats used in each  eld, and generally get familiar with the records and  elds in a  le.

Information about what data should look like is often surprisingly hard to get, having been known only to the original programmers of a system, perhaps years ago, with no effective handover to the people who now support it.

As more knowledge is acquired it becomes possible to start formulating rules to check the data against, and putting these into software queries or just using them in spreadsheets.

If there are many data (hundreds of records at the very least) and you have the necessary skill it may be worthwhile trying data mining methods. Methods that 20 years ago would have seemed impossibly academic and dif cult are now available in software that runs fast and costs little (or, in the case of R,4 costs nothing).

One approach is to use some form of cluster analysis to break the population down into subpopulations on the basis of ‘‘similarity’’. Different algorithms have different capabilities. Some work only on number  elds, some only on text  elds, and a few can cope with both together. Clusters with few records in them are potentially anomalous.

Another approach is to use some variables on each record to predict the value of another variable and then search for records where the predicted and actual values are different or different by more than a certain amount. This could be done with neural network software, SVM software, or classi cation tree software to name just three well-known types. Again some techniques work only with text  elds, some with numbers, and a few with both.

If anomaly searches are to become routine controls operating automatically then it may help to use techniques that automatically update the search rules. A decision needs to be made about how often the rules will be revised.

In summary, searching for anomalies using software is an ef cient way to search for even small percentages of error in large data sets and  nds more than most people expect. Therefore:

Search for anomalies in data using software tools and human expertise, progressing from simple reviews of sorted data to searches using explicit rules.

™™™

ANOMALY SEARCHING has similar uses to DISCREPANCY SEARCHING and they are often used together.