Understanding Strategies for Data Cleanse Transform

How you configure your Data Cleanse transforms depends on the type of data you are cleansing.

Name, title, firm, and firm location data

You can standardize name data and generate discrete standardized fields for prename, first name, middle name, last name, maturity post name, and honorary post name. Which fields you generate depends on which fields you decide to evaluate when determining whether two records match.

You can generate up to six first name match standards and up to six middle name match standards. Although up to six of each can be generated, you can use at most three first name match standards and at most three middle name match standards when matching.

The Data Cleanse transform also parses up to six job titles per record, up to two firm names (such as IBM), and up to two firm locations (such as Engineering Dept.). This transform can also convert firm names to accepted acronyms, such as General Motors Corp. to GM.

Social Security number data

Data Cleanse parses US Social Security numbers (SSNs) that appear either by themselves or on an input line surrounded by other text. It outputs the individual components of a parsed Social Security number: the entire SSN, the area, the group, and the serial.

Data Cleanse parses Social Security numbers in two steps. First, it identifies a potential SSN by looking for any of three patterns.

Second, once a pattern is identified, Data Cleanse performs a validity check on the first five digits only. If the number fails validation, it is not output, because it is not considered a valid SSN as defined by the US government.
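The two-step approach above can be sketched in Python. This is only an illustration, not how Data Cleanse is implemented. The three candidate patterns (nine consecutive digits, or 3-2-4 digit groups delimited by spaces or dashes) are an assumption, since the patterns are not listed here. The validity rule is also a simplified assumption based on publicly known SSN numbering rules (area not 000, 666, or 900-999; group not 00). The names `ParsedSSN` and `parse_ssn` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedSSN(NamedTuple):
    ssn: str     # entire SSN, digits only (an assumed output format)
    area: str    # first three digits
    group: str   # middle two digits
    serial: str  # last four digits


# Step 1 (assumed patterns): nnnnnnnnn, nnn nn nnnn, or nnn-nn-nnnn.
# The backreference \2 forces the same delimiter between both groups.
_SSN_CANDIDATE = re.compile(r"(?<!\d)(\d{3})([ -]?)(\d{2})\2(\d{4})(?!\d)")


def _first_five_valid(area: str, group: str) -> bool:
    # Step 2 (simplified assumption): only the area and group are checked.
    return area not in ("000", "666") and area[0] != "9" and group != "00"


def parse_ssn(text: str) -> Optional[ParsedSSN]:
    """Return the first candidate in text that passes the check, else None."""
    for match in _SSN_CANDIDATE.finditer(text):
        area, _delimiter, group, serial = match.groups()
        if _first_five_valid(area, group):
            return ParsedSSN(area + group + serial, area, group, serial)
    return None
```

For example, `parse_ssn("SSN: 123-45-6789 on file")` yields the area `123`, the group `45`, and the serial `6789`, while `parse_ssn("000-12-3456")` yields nothing because the first five digits fail the check.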

E-mail data

When Data Cleanse parses input data that it recognizes as an e-mail address, it outputs the individual components of the parsed address: the e-mail user name, complete domain name, top domain, second domain, third domain, fourth domain, fifth domain, and host name. You can also verify that an e-mail address is properly formatted and flag the address as belonging to an Internet service provider (ISP).

Data Cleanse does not verify that the domain name is registered, that an e-mail server is active at that address, that the user name is registered on that e-mail server, or that the person named in the record can be reached at this e-mail address.

For example, with the input data, [email protected], Data Cleanse outputs each element in the following fields:

Output field Output value

Email [email protected]
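The component breakdown described above can be sketched as follows. This is a rough illustration under stated assumptions, not the Data Cleanse implementation: the function name `parse_email`, the dictionary keys, and the sample address are hypothetical; the format check is a deliberately loose regular expression; and treating the leftmost domain label as the host name is an assumption. As in the description above, nothing here checks whether the domain or mailbox actually exists.

```python
import re
from typing import Dict, Optional

# Loose format check (an assumption): user@label.label...tld
_EMAIL = re.compile(
    r"^([A-Za-z0-9._%+-]+)@((?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})$"
)

_DOMAIN_FIELDS = [
    "top_domain", "second_domain", "third_domain",
    "fourth_domain", "fifth_domain",
]


def parse_email(address: str) -> Optional[Dict[str, str]]:
    """Split a well-formed address into components, or return None."""
    match = _EMAIL.match(address.strip())
    if match is None:
        return None
    user, domain = match.groups()
    labels = domain.split(".")
    from_right = labels[::-1]  # top-level domain first
    fields = {"email": f"{user}@{domain}", "user": user, "domain_all": domain}
    for index, name in enumerate(_DOMAIN_FIELDS):
        # Domain levels that are not present are left empty.
        fields[name] = from_right[index] if index < len(from_right) else ""
    fields["host"] = labels[0]  # assumption: leftmost label is the host
    return fields
```

With the hypothetical input `jane@mail.example.co.uk`, the sketch returns `uk` as the top domain, `co` as the second domain, `example` as the third, `mail` as the fourth, an empty fifth domain, and `mail` as the host.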

Phone number data

Data Cleanse can parse both North American (US and Canada) and international phone numbers. When Data Cleanse parses a phone number, it outputs the individual components of the number into the appropriate fields.

Data Cleanse recognizes phone numbers by their pattern and (for non-US numbers) by their country code. For North American phone numbers, it looks for commonly used patterns such as (234) 567-8901, 234-567-8901, and 2345678901. It also lets you apply some reformatting on output, such as your choice of delimiters.
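As a minimal sketch of pattern-based recognition for the three North American formats mentioned above, the following Python uses a single regular expression and rebuilds the number with a caller-chosen delimiter. The function name `parse_na_phone` and the output keys are hypothetical, and the real transform's patterns and output fields may differ.

```python
import re
from typing import Dict, Optional

# Matches (234) 567-8901, 234-567-8901, and 2345678901 (plus dot or
# space delimiters, which is an assumption beyond the listed examples).
_NA_PHONE = re.compile(
    r"(?<!\d)(?:\((\d{3})\)\s*|(\d{3})[-. ]?)(\d{3})[-. ]?(\d{4})(?!\d)"
)


def parse_na_phone(text: str, delimiter: str = "-") -> Optional[Dict[str, str]]:
    """Find the first North American number in text and split it up."""
    match = _NA_PHONE.search(text)
    if match is None:
        return None
    area = match.group(1) or match.group(2)  # with or without parentheses
    prefix, line = match.group(3), match.group(4)
    return {
        "area_code": area,
        "prefix": prefix,
        "line": line,
        # Reformat on output using the caller's choice of delimiter.
        "formatted": delimiter.join((area, prefix, line)),
    }
```

Calling `parse_na_phone("Call 2345678901 now", delimiter=".")` gives a formatted value of `234.567.8901`.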

Data Cleanse searches for European and Pacific-Rim numbers by pattern. The patterns used are defined from the US and require that the country code appear at the beginning of the number. Note that Data Cleanse does not offer any options for reformatting international phone numbers, nor does it cross-compare the phone number with the address to check whether the country and city codes in the phone number match the address.

Date data

Data Cleanse can parse up to six dates from each record you define. It identifies the dates in the input, breaks them into day, month, and year components, and makes them available as output in either the original format (for example, DD-MMM-YY) or a user-selected standard format (for example, MM/DD/YYYY).
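The date handling described above can be illustrated with a short Python sketch that finds DD-MMM-YY dates, splits them into components, and emits both the original text and an MM/DD/YYYY form. The function name `parse_dates`, the output keys, and the two-digit-year pivot (00-49 read as 20xx, 50-99 as 19xx) are assumptions made for this example; only English month abbreviations are handled.

```python
import datetime
import re
from typing import Dict, List

_MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

_DATE = re.compile(r"(?<!\d)(\d{1,2})-([A-Za-z]{3})-(\d{2})(?!\d)")


def parse_dates(text: str, max_dates: int = 6) -> List[Dict[str, object]]:
    """Return up to max_dates valid DD-MMM-YY dates found in text."""
    results: List[Dict[str, object]] = []
    for match in _DATE.finditer(text):
        month = _MONTHS.get(match.group(2).lower())
        if month is None:
            continue  # three letters that are not a month abbreviation
        day, yy = int(match.group(1)), int(match.group(3))
        year = 2000 + yy if yy < 50 else 1900 + yy  # assumed pivot
        try:
            datetime.date(year, month, day)  # rejects dates like 31-Feb
        except ValueError:
            continue
        results.append({
            "original": match.group(0),
            "day": day,
            "month": month,
            "year": year,
            "standard": f"{month:02d}/{day:02d}/{year:04d}",
        })
        if len(results) == max_dates:
            break
    return results
```

The cap of six mirrors the limit stated above; a different limit can be passed through `max_dates`.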

International data

By default, Data Cleanse can identify international data presented in multiple formats.

There are also several ways that you can use Data Cleanse to identify and manipulate various forms of other international data, including prenames, greetings, and personal identification numbers.

• Customizing greetings and prenames per country

The default prenames and salutations found in the Data Cleanse greetings option group are commonly used in English-speaking nations. For countries where English is not the primary language, you can modify these options to reflect common prenames and salutations.

• Modifying the phone file for other countries

By default, Data Cleanse includes phone number patterns for many countries.

However, if you find that you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat) to enable Data Cleanse to detect phone number patterns that follow a different format. New phone number patterns can be added to the international phone file using regular expressions.

• Using personal ID numbers

With a default Data Quality installation, Data Cleanse can identify US Social Security numbers and separate them into discrete components. If your data includes personal identification numbers that differ from US SSNs, you can use User-Defined Pattern Matching to identify them. The number formats to be identified by User-Defined Pattern Matching are set up using regular expressions.
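The general idea of identifying personal ID numbers with regular expressions can be sketched in Python as below. This does not reproduce the User-Defined Pattern Matching rule syntax, which is not shown here. The rule names, the two number formats, and the function `find_user_patterns` are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical rules: each name maps to a regular expression whose
# capture groups become the discrete components of the number.
USER_PATTERNS = {
    # Nine digits in three groups of three, split by dashes or spaces.
    "ID_3X3": re.compile(r"(?<!\d)(\d{3})[- ](\d{3})[- ](\d{3})(?!\d)"),
    # Two uppercase letters followed by six digits.
    "ID_LETTERS_DIGITS": re.compile(r"\b([A-Z]{2})(\d{6})\b"),
}


def find_user_patterns(text: str) -> List[Tuple[str, str, Tuple[str, ...]]]:
    """Return (rule name, matched text, components) for every match."""
    hits = []
    for name, pattern in USER_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0), match.groups()))
    return hits
```

Adding support for another country's number format would then amount to adding one more named expression to the rule table.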

• Using cleansing packages

Cleansing packages enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Because cleansing packages are based on the standard Data Cleanse transform, you can use their sample transforms in your projects in the same way you would use Data Cleanse, and gain the advantage of the enhanced reference data and parsing rules.

In document BODS30 - SAP Data Services - Data Quality Management (Page 111-114)