Distribution Profiling - Profiling Data - Configuration Guide for SAP HANA Smart Data Integrati

6 Profiling Data

6.2 Distribution Profiling

Distribution profiling allows you to identify patterns, words, and values within fields.

You can perform distribution profiling on columns of data to get an understanding of the frequency of different values, words, and patterns.

Type of Distribution Profiling Description

Pattern profiling Examines string columns and normalizes the string by replac ing uppercase characters, lowercase characters and numeric characters with representative placeholders. This function keeps count of the unique computed normalized strings found for the input column.

● Uppercase characters are replaced with an X ● Lowercase characters are replaced with an x ● Numeric characters are replaced with 9

After the character replacement, the function keeps count of the patterns found for the input column.

Word profiling Examines the input string and extracts words based on blank space characters as delimiters and keeps a count of unique words found in the column

Field profiling Keeps count of the all the unique column values found in the input column

The results of all three profiling types are output to a single result set.

Distribution Profiling Interface

The syntax for calling the distribution profiling procedure is: CALL _SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION('SAMPLE_SERVICES','PROFILE',"SAMPLE_SERVICE S"."PROFILE_DIST_COLUMNS",?) Or: CALL _SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION(schema_name=>'SAMPLE_SERVICES',object_name= >'PROFILE',columns=>"SAMPLE_SERVICES"."PROFILE_DIST_COLUMNS",result=>?)

Calling the stored procedure requires passing in four parameters:

1. Schema of the object containing the data that is to be distribution profiled. For example, 'SAMPLE_SERVICES', in the following:

CALL

_SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION('SAMPLE_SERVICES','PROFILE',"SAMPLE_SERV ICES"."PROFILE_DIST_COLUMNS",?)

2. Object that contains the data to be profiled. For example, 'PROFILE', in the following: CALL

_SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION('SAMPLE_SERVICES','PROFILE',"SAMPLE_SERV ICES"."PROFILE_DIST_COLUMNS",?)

3. Schema and object combination that contains the columns and the distribution profiling options (Pattern, Column, and/or Word) to be used when profiling the data. For example,

"SAMPLE_SERVICES"."PROFILE_DIST_COLUMNS"', in the following: CALL

_SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION('SAMPLE_SERVICES','PROFILE',"SAMPLE_SERV ICES"."PROFILE_DIST_COLUMNS",?)

The object passed in for this parameter must match the format of the _SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION_COLUMNS table type:

Column Name Data Type (Length)

COLUMN_NAME NVARCHAR (256) PATTERN_PROFILE TINYINT

WORD_PROFILE TINYINT COLUMN_PROFILE TINYINT

The COLUMN_NAME column is to be populated with the column names of the source that you want to perform distribution profiling on. Values of 1 (enable profiling type) or 0 (disable profiling type) are to be specified for the PATTERN_PROFILE, WORD_PROFILE, and COLUMN_PROFILE columns to indicate the type(s) of profiling to be performed on the respective column. In the example below for the object passed into the stored procedure that contains the columns and corresponding profiling options, pattern profiling will occur for column FIRST_NAME, LAST_NAME, and DATE_OF_BIRTH, word profiling will occur for

FIRST_NAME and PHONE and column profiling will occur for FIRST_NAME and LAST_NAME:

COLUMN_NAME PATTERN_PROFILE WORD_PROFILE COLUMN_PROFILE

FIRST_NAME 1 1 1

LAST_NAME 1 0 1

PHONE 0 1 0

DATE_OF_BIRTH 1 0 0

Note

If this object is completely empty, all columns with supported data types will be profiled for all three distribution profiling types (Pattern, Column, and Word)

Note

Only columns with the following data types will be processed as part of distribution profiling: ○ STRING (VARCHAR, NVARCHAR, SHORTTEXT)

○ ALPHANUM (ALPHANUM) ○ FIXEDSTRING (CHAR, NCHAR)

4. The parameter to be used for the result set. For example, ?, in the following: CALL

_SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION('SAMPLE_SERVICES','PROFILE',"SAMPLE_SERV ICES"."PROFILE_DIST_COLUMNS",?)

The format of the profiled output will reflect the _SYS_TASK. PROFILE_FREQUENCY_DISTRIBUTION_RESULT table type:

Column Name Data Type/Length

COLUMN_NAME NVARCHAR (256) PATTERN_VALUE NVARCHAR (5000) PATTERN_COUNT BIGINT WORD_VALUE NVARCHAR (5000) WORD_COUNT BIGINT COLUMN_VALUE NVARCHAR (5000) COLUMN_COUNT BIGINT

Because distribution profiling is a built in stored procedure, output is only available as a result set and cannot be persisted to a table (even when the ‘WITH OVERVIEW’ syntax is present.)

Example

For example, we could perform distribution profiling on the following sample data, represented in these tables.

Table 125: SAMPLE_SERVICES.EMPLOYEE

ID FIRST_NAME LAST_NAME PHONE DATE_OF_BIRTH EMAIL OFFICE_LOCATION

1000 SARAH JONES 456-345-1234 07/27/79 [email protected] MINNEAPOLIS 2000 SARAH PARKER 608-742-5678 11/15/84 [email protected] NEW YORK 3000 JUAN DE LA ROSA 546-387-7754 01/31/90 [email protected] NEW YORK Table 126: SAMPLE_SERVICES.EMPLOYEE_COLUMNS

COLUMN_NAME PATTERN_PROFILE WORD_PROFILE COLUMN_PROFILE

ID 1 1 1

FIRST_NAME 1 1 1 LAST_NAME 1 1 1

COLUMN_NAME PATTERN_PROFILE WORD_PROFILE COLUMN_PROFILE PHONE 1 1 1 DATE_OF_BIRTH 1 1 1 EMAIL 1 1 1 OFFICE_LOCATION 1 1 1

Note

To disable pattern, word, or column profiling for a field, specify 0 for that profiling type column. In this example, all fields are enabled.

Below we show an example of the distribution profiling procedure being called within another stored procedure where the input columns to be profiled are being selected from a pre-existing view and where the output result set is being inserted into a physical table.

Note that for simplicity purposes of this sample, all columns of the table object are being profiled via a select statement of the SYS.TABLE_COLUMNS view and the three distribution profiling types are hard coded to 1. Column table “SAMPLE_SERVICES.PROFILE_DIST_OUT" has the same schema as the table type object _SYS_TASK. PROFILE_FREQUENCY_DISTRIBUTION_RESULT table type:

CREATE PROCEDURE "SAMPLE_SERVICES"."DISTRIBUTION_PROFILING" (IN in1 VARCHAR(50), IN in2 VARCHAR(50))

LANGUAGE SQLSCRIPT AS

BEGIN

columns = SELECT COLUMN_NAME, 1 as "PATTERN_PROFILE", 1 as "WORD_PROFILE", 1 as "COLUMN_PROFILE" FROM "SYS"."TABLE_COLUMNS" WHERE SCHEMA_NAME = :in1 and "TABLE_NAME" = :in2;

CALL _SYS_TASK.PROFILE_FREQUENCY_DISTRIBUTION (:in1,:in2, :columns, results); insert into SAMPLE_SERVICES.PROFILE_DIST_OUT select * from :results;

END;"

Then we can call the stored procedure, noted above:

call SAMPLE_SERVICES.DISTRIBUTION_PROFILING ('SAMPLE_SERVICES', 'EMPLOYEE') This process generates the following result set.

Table 127: Distribution profiling results

COL UMN_NAME

PATTERN_VALUE PAT

TERN_COUNT

WORD_VALUE WORD_COUNT COL

UMN_VALUE COL UMN_COUNT ID 9999 3 3000 1 3000 1 ID ? ? 2000 1 2000 1 ID ? ? 1000 1 1000 1

COL UMN_NAME

PATTERN_VALUE PAT

TERN_COUNT

WORD_VALUE WORD_COUNT COL

UMN_VALUE

COL UMN_COUNT

FIRST_NAME XXXXX 2 SARAH 2 SARAH 2 LAST_NAME XXXXXX 1 ROSA 1 JONES 1 LAST_NAME XXXXX 1 JONES 1 PARKER 1 LAST_NAME XX XX XXXX 1 DE 1 DE LA ROSA 1 LAST_NAME ? ? PARKER 1 ? ? LAST_NAME ? ? LA 1 ? ? PHONE 999-999-9999 3 456-345-1234 1 456-345-1234 1 PHONE ? ? 608-742-5678 1 608-742-5678 1 PHONE ? ? 546-387-7754 1 546-387-7754 1 DATE_OF_BIRTH 99/99/99 3 07/27/79 1 07/27/79 1 DATE_OF_BIRTH ? ? 01/31/90 1 01/31/90 1 DATE_OF_BIRTH ? ? 11/15/84 1 11/15/84 1 EMAIL [email protected] 1 sara.parker@abc tech.com 1 sara.parker@abc tech.com 1 EMAIL [email protected] 1 sara.jones@abc tech.com 1 sara.jones@abc tech.com 1 EMAIL [email protected] 1 juan.delar osa@abc- tech.com 1 juan.delar osa@abc- tech.com 1 OFFICE_LOCA TION

XXX XXXX 2 NEW 2 NEW YORK 2

OFFICE_LOCA TION XXXXXXXXXXXX 1 YORK 2 MINNEAPOLIS 1 OFFICE_LOCA TION ? ? MINNEAPOLIS 1 ? ?

In document Configuration Guide for SAP HANA Smart Data Integration and SAP HANA Smart Data Quality (Page 180-186)