A complex mixed token is a mixture of alphas and numerics that does not fit into the above classes, for example, 123A45 or ABC345TR.
A special token contains special characters that are not generally encountered in addresses. These include !, @, ~, %, and so on.
A null token is any word that is to be considered noise. These words may appear in the classification table and are given a type of zero. Similarly, actions can convert normal tokens into null tokens.
The standard address forms:
123 E Maine Av
3456 No Cherry Hill Road
123 South Elm Place
would be tokenized and classified as follows:
123 ^ Numeric
No D Direction
Cherry Hill ? Unknown words
Road T Street type
The pattern represented by this address can be coded as:
^ | D | ? | T
The vertical lines separate the operands of a pattern. All of the addresses above will match this pattern. The classification of D comes from the classification table. This has entries indicating that No, East, E, NW, and so on, are all given a class of D to indicate that they generally represent directions. Similarly, the classification of T is given to entries in the table representing street types (Road, St, Place, and so on).
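The classification step described above can be sketched in Python. This is only an illustration, not the product's implementation: the table contents, the function names classify and pattern_of, and the choice to merge consecutive unknown words while classifying are all assumptions made for the example.

```python
# Hypothetical classification table: upper-cased word -> class letter.
CLASSIFICATION = {
    "NO": "D", "E": "D", "EAST": "D", "SOUTH": "D", "NW": "D",
    "ROAD": "T", "ST": "T", "PLACE": "T", "AV": "T",
}

def classify(address):
    """Return (text, class) pairs; consecutive unknown words share one ? entry."""
    result = []
    for word in address.split():
        if word.isdigit():
            cls = "^"                                   # numeric
        else:
            cls = CLASSIFICATION.get(word.upper(), "?") # table class or unknown
        if cls == "?" and result and result[-1][1] == "?":
            # Extend the previous run of unknown words (assumption: merged here).
            result[-1] = (result[-1][0] + " " + word, "?")
        else:
            result.append((word, cls))
    return result

def pattern_of(address):
    """Render the classes of an address in the pattern notation."""
    return " | ".join(cls for _, cls in classify(address))

for addr in ("123 E Maine Av", "3456 No Cherry Hill Road", "123 South Elm Place"):
    print(addr, "->", pattern_of(addr))
```

With this hypothetical table, all three addresses produce the same class sequence, which is why a single pattern can cover them.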
Patterns and actions
The pattern file consists of a series of patterns and associated actions. After the input record is separated into tokens and classified, the patterns are executed in the order they appear in the pattern file. A pattern either does or does not match the input record. If it matches, then the actions associated with the pattern are executed. If it does not, the actions are skipped. In either case, processing continues with the next pattern in the file.
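The control flow just described can be sketched as a simple loop. The names run_patterns and pattern_sets are hypothetical, actions are modeled as plain Python callables, and a pattern is treated as matching only when it equals the whole class sequence of the record, which is a simplification of the real matching rules.

```python
def run_patterns(pattern_sets, classes, match_key):
    """Try every pattern in file order; run the actions of each one that matches."""
    for pattern, actions in pattern_sets:
        if pattern == classes:           # simplification: whole-record equality
            for action in actions:
                action(match_key)
        # Matched or not, processing continues with the next pattern.
    return match_key

# Two hypothetical pattern/action sets, in file order.
pattern_sets = [
    (["^", "D", "?", "T"], [lambda key: key.update(form="with direction")]),
    (["^", "?", "T"],      [lambda key: key.update(form="without direction")]),
]

key = run_patterns(pattern_sets, ["^", "?", "T"], {})
print(key)  # {'form': 'without direction'}
```

An action that exits the pattern program early is not modeled in this sketch.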
The pattern file is a standard ASCII file that can be created or updated using any standard text editor. It has the following general format:
There is one special section in the pattern table: the poststart actions. The poststart actions are those actions that should be executed after the pattern matching process is finished for the input record. An example of a postaction is computing Soundex codes for street names. The special section is optional. If omitted, then the header and trailer lines should also be omitted.
Other than the special section, the pattern file consists of sets of patterns and associated actions. The pattern requires one line. The actions are coded one action per line. The next pattern may start on the following line.
Blank lines may be used freely to increase readability. For example, it is suggested that blank lines or comments separate one pattern/action set from another.
Comments are indicated by a semicolon. All characters following a semicolon (;) are considered to be comments. An entire line may be a comment line by specifying a semicolon as the first nonblank character. For example:
;
; This is a standard address pattern
;
^ | ? | T ; 123 Main St
As an illustration of the pattern format, consider postactions of computing a Soundex code for the street name and processing patterns to handle addresses such as 123 N Main St and 123 Main St. The actions for the first pattern are:
COPY [1] {HN} ; Copy House number (123)
COPY_A [2] {PD} ; Copy direction (N)
COPY [3] {SN} ; Copy street name (Main)
COPY [4] {ST} ; Copy street type (St)
Notice that this example pattern file has a postsection that computes the Soundex of the street name (in match key field {SN}) and moves the result to the {XS} match key field.
The first pattern matches a numeric, followed by a direction, followed by one or more unknown words, followed by a street type (such as 123 N Main St). The associated actions do the following: copy operand [1] (the numeric) to the {HN} house number field of the match key; copy the standard abbreviation of operand [2] to the {PD} prefix direction field; copy the unknown word or words in operand [3] to the {SN} street name field; copy the standard abbreviation of operand [4] to the {ST} street type field; and exit the pattern program. A blank line indicates the end of the actions for the pattern.
The second pattern/action set is similar to the first, except that it handles cases like 123 Main St. If there is no match on the first pattern, then the next pattern in sequence is attempted.
Pattern format summary
This section discusses the basic format of patterns and how various elements of the standardization process can be referenced.
A pattern consists of one or more operands. Each operand is separated by a vertical line. For example, the pattern ^ | D | ? | T has four operands. These are referred to in actions as [1], [2], [3], and [4]. The unknown class (?) refers to one or more consecutive unknown alphabetic words. This simplifies pattern construction, since names like Main, Cherry Hill, and Martin Luther King will all match to a single ? operand.
Spaces may separate operands if desired. For example, ^|D|?|T is equivalent to ^ | D | ? | T.
Comments may follow a semicolon:
;
; Process standard addresses
;
^ | D | ? | T ; 123 N Main St
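The rules for operands, optional spaces, and comments can be illustrated with a small parser. This is a sketch only; parse_pattern is a hypothetical helper, and it assumes that a semicolon always starts a comment.

```python
def parse_pattern(line):
    """Split one pattern line into its operands, ignoring any comment."""
    code = line.partition(";")[0]        # drop everything after the semicolon
    if not code.strip():
        return []                        # blank line or comment-only line
    return [operand.strip() for operand in code.split("|")]

print(parse_pattern("^|D|?|T"))                          # ['^', 'D', '?', 'T']
print(parse_pattern("^ | D | ? | T ; 123 N Main St"))    # ['^', 'D', '?', 'T']
print(parse_pattern("; Process standard addresses"))     # []
```

Operand [1] of the pattern corresponds to index 0 of the returned list, operand [2] to index 1, and so on.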
Match key fields may be referred to by enclosing the two character match key name in braces. For example, {SN} refers to the street name field (SN) of the match key.
Pattern matching stops after the first match is encountered. For example, in an input record like 123 Main St & 456 Hill Place, the pattern ^|?|T matches to 123 Main St and not to 456 Hill Place.
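The first-match behavior can be sketched as a left-to-right scan over the classified tokens. The function name first_match is hypothetical, the tokens are supplied already classified with consecutive unknown words merged, and the class shown for the & token is a placeholder, since the class it would actually receive is not stated here.

```python
def first_match(tokens, pattern):
    """Return the texts of the leftmost run of tokens whose classes equal pattern."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if [cls for _, cls in window] == pattern:
            return [text for text, _ in window]   # stop at the first match
    return None

# 123 Main St & 456 Hill Place, pre-classified for the example.
tokens = [
    ("123", "^"), ("Main", "?"), ("St", "T"),
    ("&", "other"),                               # placeholder class (assumption)
    ("456", "^"), ("Hill", "?"), ("Place", "T"),
]

print(first_match(tokens, ["^", "?", "T"]))  # ['123', 'Main', 'St']
```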
The simplest patterns consist only of classification types. For example:
^ | D | ? | T
These are straightforward and do not require much further explanation. Remember that hyphens and slashes may be present in the patterns. For example:
123-45 matches to ^ | - | ^
123 1/2 matches to ^ | ^ | / | ^
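A minimal sketch of how hyphens and slashes can end up as separate operands follows. The regular expression and the helper names tokenize and classes are assumptions made for the illustration; the actual tokenizer may use different separator rules.

```python
import re

def tokenize(text):
    """Split on whitespace, keeping each hyphen or slash as its own token."""
    return re.findall(r"[^\s/-]+|[-/]", text)

def classes(text):
    """Render the class sequence of the text in the pattern notation."""
    out = []
    for token in tokenize(text):
        if token.isdigit():
            out.append("^")          # numeric
        elif token in ("-", "/"):
            out.append(token)        # hyphen and slash stand for themselves
        else:
            out.append("?")          # assumption: anything else is unknown
    return " | ".join(out)

print(classes("123-45"))    # ^ | - | ^
print(classes("123 1/2"))   # ^ | ^ | / | ^
```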
This section presents information on unconditional pattern matching. These patterns are not sensitive to individual values. They are the simplest to code and the most general. Conditions can be specified to cause patterns to match only under specified circumstances. Conditions are discussed in the next section, Conditional patterns.
Simple pattern classes
This section describes the simple pattern classes. Any of the simple classes may appear in a pattern specified in the Pattern Rule file. These classes are slightly different from the classes assigned to a string when the input record is read. This is because several forms can match to a single input token. In other words, this is a pattern-matching language. The simple pattern classes are:
A-Z: Classes supplied by user from classification table
^: Numeric
?: One or more consecutive unknown alpha words
+: A single alphabetic word
&: A single token of any class
>: Leading numeric
Notice that the null class (0) is not included in this list. The null class is used either in the classification table or in the RETYPE action to make a token null. However, since it never matches to anything, it would never be used in a pattern.
The classes A through Z correspond to classes coded in the classification table. For example, if APARTMENT is given the class of M in the classification table, then a simple pattern of M will match the token APARTMENT.
The class ^ represents a single number. For example, 123 will match. Also, 123.456 will match since periods are filtered out of the input stream. The number 1,230 is three tokens: the number 1, a comma, and the number 230.
The class ? matches to one or more consecutive alphabetic words. For example, MAIN, CHERRY HILL, and SATSUMA PLUM TREE HILL all match to a single ? operand, providing none of the aforementioned words are in the classification table for the process. This is useful for street names where a multiword street name should be treated identically to a single word street name.
A single alphabetic word can be matched with a + class. This would match the first word in each example. This is useful for separating the parts of an unknown string. For example, in a name like JOHN QUINCY JONES, the individual words can be copied to a match key with first name {FN}, middle name {MN}, and last name {LN}, as follows:
+ | + | +
COPY [1] {FN}
COPY [2] {MN}
COPY [3] {LN}
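The difference between the ? and + classes can be illustrated with a small matcher. This is a sketch under stated assumptions, not the product's algorithm: match_operands is a hypothetical name, tokens arrive individually classified with "?" marking an unknown word, + is assumed to match only unknown words, matching starts at the first token, ? is greedy with no backtracking, and the > class is not covered.

```python
def match_operands(pattern, tokens):
    """Match pattern operands against (text, cls) tokens from the first token on.

    Returns the text captured by each operand, or None if the pattern fails.
    """
    operands = []
    i = 0
    for op in pattern:
        if i >= len(tokens):
            return None
        text, cls = tokens[i]
        if op == "?":
            if cls != "?":
                return None
            j = i
            while j < len(tokens) and tokens[j][1] == "?":
                j += 1                                   # take the whole run
            operands.append(" ".join(t for t, _ in tokens[i:j]))
            i = j
        elif op == "+":
            if cls != "?":
                return None
            operands.append(text)                        # exactly one word
            i += 1
        elif op == "&":
            operands.append(text)                        # any single token
            i += 1
        else:
            if cls != op:                                # ^ or a table class
                return None
            operands.append(text)
            i += 1
    return operands

name = [("JOHN", "?"), ("QUINCY", "?"), ("JONES", "?")]
print(match_operands(["?"], name))                 # ['JOHN QUINCY JONES']
parts = match_operands(["+", "+", "+"], name)
print(dict(zip(("FN", "MN", "LN"), parts)))
# {'FN': 'JOHN', 'MN': 'QUINCY', 'LN': 'JONES'}
```

The final dict mirrors the three COPY actions: operand [1] goes to {FN}, operand [2] to {MN}, and operand [3] to {LN}.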