Getting Started with SAP Data Quality
(Part 2)
Data Quality, part of SAP Data Services, is a powerful tool for cleansing and matching customers,businesses, postal addresses, and much more. However, the wide range of options and capabilities can be intimidating for new users. In this post, we will discuss configuration of the address cleanse and data cleanse transforms.
Configuring the AddressCleanse Transform
The cleanse transform editor is very similar to that of the query transform, with the input schema in the upper left, the output schema in the upper right, and a set of tabs below. In a cleanse, however, we only have three tabs: Input, Options, and Output.
Input Tab
The Input tab contains a list of fields on which the work will be performed. Each input column you wish to cleanse must be assigned to one of these fields. For example, you might drag your 5-digit zip code column down to POSTCODE1 and your city column to LOCALITY1. While some of the mappings are clear, Data Quality’s terminology can often be confusing. For an initial, basic cleanse, we recommend using the following:
FIRM or NAME – Company name or Individual name
ADDRESS_LINE – Full street address with or without unit number
LOCALITY1 – City
REGION1 – State
POSTCODE1 or POSTCODE_FULL – 5-digit or 9-digit postal code
If your address data is not yet standardized to this degree, you may find the following generic or combined fields useful:
ADDRESS_LINE – As above, full street address with or without unit
LASTLINE – City, state, and zip code
MULTILINE1-12 – Any data
Because the MULTILINE fields can contain anything, even varying from record to record, Data Quality will simply make a best-effort parse. While the results of this parse are generally good, they can vary widely depending on the quality and consistency of the input data and should be carefully checked. Options Tab
The Options tab defines which operations Data Quality will perform on your data and how those operations are configured. Since the list of options is extensive and most can safely be left as defaults, we will focus on those most relevant to an initial cleanse.
The Reference Files section includes the locations of the files Data Quality uses to validate and complete postal addresses. If your reference files reside in the default installation folder, these values likely do not need to change. However, if you need to run multiple sets of files or verify their locations, you will need to use these settings.
Assignment Options contains the flags telling Data Quality which operations to run. For an initial cleanse, we recommend DPV (Delivery Point
Validation) and LACSLink (Locatable Address Conversion System). DPV is the foundation of postal verification, indicating whether an address is valid for delivery or the precise reason it failed. LACSLink supplements DPV by updating addresses when the physical address has not moved, but some part of the address assignment has changed. For example, when a rural route is
converted to a named street or when a direction is appended to a street name. Together, DPV and LACSLink meet the minimum USPS address processing requirements for CASS (Coding Accuracy Support System) certification.
The remaining fields in Assignment Options supplement address verification by tracking moves (NCOALink) and new addresses (EWS), determining whether an address is residential (RDI), appending business suite number (SuiteLink), tracking mail carrier delivery route (eLOT), and assigning geocodes (Geo Mode).
The Standardization Options section contains most of the settings affecting output format. By default, these fields are configured to produce output preferred by USPS, which is suitable for most business needs. You might need to alter them if you want to preserve certain parts of the input data within the standardized output that would otherwise be discarded or placed in separate fields. For example, you may want to retain the order of dual addresses to meet business requirements or to preserve place names due to customer preference. We recommend considering the following:
Standardize Unassigned Address – Whether or not to standardize and
reformat addresses for which Data Quality could not find a valid assignment. This is useful if you want consistent formatting even for incorrect addresses.
Use USPS Primary Name Abbreviation – Attempts to abbreviate the address line to 30 characters or fewer using USPS
standards. For addresses that must be shortened, this provides more meaningful output than a simple truncation.
Capitalization – USPS prefers all
uppercase, but businesses and customers often prefer mixed-case.
Preserve Dual Address Order – This must be set to “Yes” if the input order should be retained regardless of address assignment. Otherwise, the assigned address will be placed immediately before locality, giving it precedence in delivery. Only useful if you’re using dual addresses, e.g. PO Box and street.
Output Tab
The Output tab contains a list of all fields exposed as results of the cleansing, but, depending on your input and options, only some will contain data. Generally, you will want to look for fields with names
similar to the assigned input mappings, plus status and detail fields for the operations you selected in Assignment Options.
You will usually be able to compile a full, cleansed address with some combination of the following:
FIRM – Firm name.
ADDRESS_LINE – Full street and unit information. In most cases, this is identical to PRIMARY_SECONDARY_ADDRESS.
PRIMARY_ADDRESS – Street or PO Box info without secondary address, e.g. unit number. This can also be constructed using individual PRIMARY_* output fields.
SECONDARY_ADDRESS – Floor, unit number, etc.
LOCALITY1_OFFICIAL – Official city or town name.
COUNTY_NAME – County name.
REGION1 – State or territory.
POSTCODE_FULL – The 9-digit postal code. This can also be constructed using POSTCODE1 and POSTCODE2.
Unlike the fields above, status codes don’t contain any address information. Instead, they’re your windows into the cleansing operations themselves. Status codes are the best way to see detailed data on whether or not a record was successfully cleansed, why it failed, or which fields were changed. Many of the status codes are cryptic alphanumeric indicators, but you can find their translations in the Data Services technical manuals. For DPV and LACSLink, we recommend using these fields:
STATUS_CODE – Six-digit code indicating exactly which parts of the input address were changed.
UNDELIVERABLE_INDICATOR – True (T) or False (F) indicating whether the address is
deliverable per the USPS. Note, an address can be valid while still being undeliverable.
DPV_FOOTNOTE – A set of two-character codes either confirming that the address passed DPV verification or identifying the reasons it failed. The codes and their meanings are found in the Data Services technical manuals.
DPV_STATUS – A single-character code indicating the general result of DPV verification.
LACSLINK_INDICATOR – A single-character code indicating the general result of LACSLink verification.
LACSLINK_RETURN_CODE – A two-character code either confirming that the address passed DPV verification or identifying the reasons it failed.
In addition to a name, each field also has a class (FIELD_CLASS), address class (FIELD_ADDRCLASS), category (FIELD_CATEGORY), type, and content type.
FIELD_CLASS – Field Classes are vaguely defined in the documentation, but they generally represent different the levels of processing used to generate the field. For example, you can choose to output a field after being parsed, after being fully standardized and corrected, or both. BEST represents the highest level of processing and is usually a safe choice.
FIELD_ADDRCLASS – The Address Class specifies the type of address for the generated field. DELIVERY is a safe default and indicates the standardized address used for delivery assignment. DUAL is only used when both a street and postal address are input, and it indicates the address not used for delivery assignment. OFFICIAL is similar to DELIVERY, but always uses the USPS preferred option when alternates are available.
FIELD_CATEGORY – The Category defines the generated field as either individual, processed address components (COMPONENT), standardized versions of the input fields (STANDARDIZED), or members of a suggestion list (SUGGESTION). COMPONENT should be your default.
Type – Type is the data type of the generated field.
Content Type – Content Type describes the kind of data in the field.
Checking the box to the left of a field name automatically adds it to the output schema with default type and mapping. The default name of the output column includes the field class, address class, and
\
In addition to the generated fields, you can add fields directly from the input schema by dragging and dropping them into the output as you would in a Query transform. This allows you not only to output fields without sending them through the cleanse, but also to preserve the original versions of your input fields side-by-side with the cleansed results.
Configuring the DataCleanse Transform
The DataCleanse transform is very similar to AddressCleanse, but simpler in most respects. Again, the transform is organized with an input schema, output schema, and set of tabs.
Input Tab
The general behavior of the input tab is the same as described above. The main difference is that the fields are tailored to non-address customer and business data such as name, phone, email, date, etc. Also, you can include up to 6 different emails, dates, phones, and names in each record, whereas the AddressCleanse only accepts a single address. Assuming a single set of customer data per record, we recommend the following:
PERSON1_GIVEN_NAME1 – Given name (i.e. first name, forename)
PERSON1_GIVEN_NAME2 – Second given name (i.e. middle name)
PERSON1_FAMILY_NAME1 – Family name (i.e last name, surname)
FIRM_NAME1 – Business or organization name
FIRM_LOCATION1 – Location within a business or organization
DATE1 – Any date, often used for date of birth
EMAIL1 – Any email address
PHONE1 – Any phone number
SSN1 – U.S. Social Security number
If the name or firm is not yet parsed, you may want to use the general or composite fields:
NAME_LINE1 – Full name, including title, maturity, etc
FIRM_LINE1 – Firm name or location or both
NAME_OR_FIRM_LINE1 – Any combination of name and firm data
MULTILINE1 – Any data
Again, Data Quality will make a best effort to determine the type and content of these fields, but may make incorrect assumptions.
Options Tab
This list of options is much shorter and more intuitive than in the AddressCleanse. Most important is the Cleansing Package Name, the dropdown for which will pre-populate with the installed cleansing packages. It’s likely the only option available will be PERSON_FIRM, which is appropriate.
Standardization Options contains most of the non-date customizations. Here you define whether the middle and family names are used to assign gender, whether prenames (e.g. Mr, Mrs) should be assigned that were not
included in the input, and the capitalization to be applied to the output. You may also define the formats applied to phone and SSN. Two of these options are a bit more confusing, but important:
Name Order – This indicates the order in which the names will be delivered as input. If you can guarantee a first-last or last-first order for all records, setting this to the appropriate “_STRICT”
setting may greatly improve cleansing accuracy. If just a majority of names follow one order, you may get some benefit from a “_SUGGEST” setting instead.
Enable Presumptive Name Parsing – This determines whether the data in the NAME_LINE field will be parsed as a names even if the cleanse cannot make an assignment. If this option is set to “No,” unassigned input will be placed unparsed in the EXTRA field. If you are sure NAME_LINE will contain only names, this option should be set to “Yes.” (Note also the Firm version of this option, Enable Presumptive Firm Parsing.)
Finally, Date Options allow you to fine-tune date formats. This is particularly useful for compatibility with target systems expecting specific patterns. The cleanse can usually produce the necessary output, removing the need for additional reformatting downstream.
Output Tab
As with the input tab, the output tab is set up similarly to the AddressCleanse, with a list of fields and checkboxes. However, the fields are organized differently. Here, each is assigned to a parent component (PARENT_COMPONENT), a logical grouping of related data. For example, all of the data (first and last name, gender, title, etc) related to the first person in the input can be found under the PERSON1 component. Each firm, person, phone number, email address, and date has its own parent component. Also unlike the AddressCleanse, DataCleanse output fields do not have address classes or categories, only field classes (FIELD_CLASS). The PARSED class generates data that has been parsed into discrete components, but not further modified. The STANDARDIZED class modifies data to meet the
standardization options and should be your default.
The available output consists of high-level customer and firm fields as well as the individual elements contained with them. For example, you can choose to output the full date, the separate
month/day/year, or both. Start with a basic set of fields:
GIVEN_NAME1 – First name or forename
GIVEN_NAME2 – Middle or second given name
FAMILY_NAME1 – Last name or surname
FIRM – Business or organization name
FIRM_LOCATION – Location within business or organization
NORTH_AMERICAN_PHONE – Full 10-digit North American phone number.
Components: NORTH_AMERICAN_PHONE_AREA_CODE NORTH_AMERICAN_PHONE_PREFIX NORTH_AMERICAN_PHONE_LINE
DATE – Full date.
Components: DATE_DAY DATE_MONTH DATE_YEAR
EMAIL – Full email.
Components: EMAIL_USER
EMAIL_DOMAIN_ALL EMAIL_DOMAIN_TOP
SSN – Full 9-digit Social Security number. Components: SSN_AREA
SSN_GROUP SSN_SERIAL
The cleanse also predicts gender (GENDER and GENDER_ID) with varying degrees of confidence. The possible values indicate the gender itself, the confidence of the prediction, and whether single or multiple names were used. This may be useful, but isn’t guaranteed to be correct, so be careful using it on external-facing data.
Finally, the transform generates match standard fields (*_MATCH_STD1-6) for names, firms, and titles. These are standard aliases for data successfully located in the dictionary. For example, “Chris” might have standard aliases of “Christopher” and “Christine,” while “PhD” has “Ph.D.” These fields are intended for use in downstream match transforms.
Conclusion
Using these settings, you should be able to quickly configure a cleansing dataflow to produce useful output for addresses, customers, and businesses. In the next post, we will examine the output to
determine which data might need to be tweaked or augmented and how to best configure it as input for match transforms.
Bruce Labbate
Business Intelligence Consultant Decision First Technologies [email protected]
Bruce Labbate is a business intelligence consultant specializing in data warehousing, data quality, and ETL development. Bruce delivers customized SAP Data Services solutions for customers across all industries. With Decision First Technologies, Bruce utilizes Data Integrator, Data Quality, Information Design Tool, and a variety of database technologies, including SQL Server, DB2, Oracle, Netezza, and HANA.