5 Transforming Data
5.6 Data Mask
Protect the personally identifiable or sensitive information by covering all or a portion of the data.
Some examples of personal and sensitive data include credit card numbers, birth dates, tax identification numbers, salary information, medical identification numbers, bank account numbers, and so on. Use data masking to support security and privacy policies, and to protect your customer or employee data from possible theft or exploitation.
Placement in the flowgraph
Place the Data Mask node toward the end of your flowgraph to ensure that all fields that are to be masked have undergone processing by upstream nodes. If you place the Data Mask node before other nodes, the downstream nodes may not process the actual data but rather the masked data, and in some cases, the node won’t be able to process the fields at all if the Data Mask node replaced input data with blanks or a masking character such as “#”. There are four types of masking available, depending on the content type of the columns that you want to mask.
● Mask
● Numeric Variance ● Date Variance ● Pattern Variance
The following column and data types are supported for masking.
Column Type Data Type Rule Type
Character alphanum, nvarchar, shorttext, and var char
Column Type Data Type Rule Type
Date date, seconddate, and timestamp Date Numeric bigint, decimal, double, integer, real,
smalldecimal, smallint, and tinyint
Mask, Numeric Variance
Note
Do not mask columns that are used for the primary key. If the column you masking is designated as the primary key, it will loose it's primary key status.
To configure the Data Mask node:
1. Drag the Data Mask node onto the canvas. The columns available for masking are shown in the Masking tab. 2. In the Data Masking Rule column, click the wrench icon for the column that contains the data you want
masked.
3. Select the type of masking and configure the settings. See the description of options in the separate Mask, Date Variance, Numeric Variance, and Pattern Variance Type topics.
4. Click Apply, and then Back to view the entire flowgraph.
To edit or delete masking rules:
1. In the Masking tab, click the wrench icon next to the rule that you want to change or delete. 2. To change the rule, click Edit Rule. Make the appropriate changes, and then click Apply.
To delete the rule, click Remove Rule.
5.6.1 Change Default Data Mask Settings
Set the default format and language when the data is ambiguous, and set the seed value to ensure referential integrity.
Context
When the input data is vague or ambiguous, the Data Mask node will output the format and language you specify here. For example, if your Last_Updated column has the date 2016-04-12, depending on the date format for the country, the date could be April 12, 2016, or it could be December 4, 2016. Setting the default Date format to Year Day Month ensures that the output data will be referring to December 4, 2016.
When you want to maintain referential integrity, set the Seed option. This will still mask the data, but in a way that ensures consistent values each time the data is output. For example, let's say that you are masking the
Customer_ID value, and want to ensure each ID is randomized on output. You can use any combination of numbers and characters to create an identifiable value such as Region9_Cust. This value is not output; it just ensures that the output data is consistent each time the flowgraph is run. For example, let's say that you are running a Numeric Variance with a Fixed Number and have set the Variance option to 5.
Input data Valid output range
2550 2545-2555
3000 2995-3005
5500 4595-5505
After the first run, let's say the output data is:
Output data after initial processing
2552 3001 5505
With the seed value set, the subsequent processing keeps the same output for each record. Whereas without the seed value set, the output continues to be randomized.
Output after the second run with the seed value set Output after the second run without the seed value
2552 2554
3001 2998
5505 5497
Note
The seed value applies to all columns that are set with the Numeric, Date and Pattern variances. Therefore, if you are randomizing with these 3 types of variances on multiple columns, all of the output data is consistent from run to run.
Procedure
1. Open the Data Mask node. 2. Click Default Settings.
Results
Option Description
Date format Specifies the order in which month, day, and year elements appear in the input string.
The software uses this value only when the day, month, or year in the input string is ambiguous. Choose one of these formats:
● Day_Month_Year ● Month_Day_Year ● Year_Day_Month ● Year_Month_Day
For example, you can see how important the Default Date Format is when the date string is ambiguous. In English, when an input string is 2014/02/01, parsing can’t determine if “02” or “01” is the month, so it relies on the setting in Default Date For mat for clarification. If the user sets the Default Date Format to Year_Day_Month, the software parses the string as January 2, 2014. However, if the Default Date Format is Year_Month_Day, the software parses the string as February 1, 2014.
The Default Date Format may not be necessary in this next example. In English, when the input string is 2014/31/12, the software can parse the string to a date of December 31, 2014 even though the user set the Default Date Format to
Month_Day_Year.
Month format Specifies the format in which the randomized month is output when the software cannot determine the output month format based on the input alone:
● Full: Output the long version of the month name.
● Short: Output the abbreviated form of the month name, when an abbreviated form exists.
Note
This option applies only when the month is text (not a number).
For example, lets say that, in English, the software randomizes an input date of
2015/05/05 to a randomized output date of 2015/03/22. However, because “May” is ambiguous in determining if the output is full or short, the software relies on the Default Month Format setting to determine the output format for month. When this option is set to Full, the software knows to output “March” for the month. If this option is set to Short, the software knows to output “Mar”.
Option Description
Language Specifies the language that the software should use when determining the output of an ambiguous input month string. The default language is English.
Note
This option applies only when the month is text (not a number).
Example: The software cannot determine if the language of an input date like Abril 26, 2014 is in Spanish or Portuguese. Therefore it uses the Default Language value to determine the language to be used on output. The software then uses the Default Language value for the randomized output month name.
Note
The software does not verify that the user-defined Default Language corresponds to the language of the input month.
Seed An alpha and/or a numeric string. Set this option once when you want to maintain referential integrity each time you run the job. One seed value maintains referential integrity for the following variance types set up in the Data Mask node: Number Var iance, Date Variance, and Pattern Variance.
To retain the referential integrity for subsequent jobs using this job setup, use the same data. Do not make changes to the Data Mask node settings.
Example
Retain referential integrity using a seed value to keep the altered values the same when you run a job multiple times.
Date variance seed example: If you randomize the input value "June 10, 2016" by 5 days, the output will be a date between "June 5, 2016" and "June 15, 2016". If the output for the first run is "June 9, 2016", using the seed value will output the value "June 9, 2016" on all subsequent runs, so that you can be certain the data is
consistent. Not using the seed value might return a value of "June 11, 2016 on the next run, and "June 7, 2016" on the following run.
Numeric variance seed example: If you randomize the input value "500" with a fixed value of 5, the output will be a number between 495-505. If the output for the first run is "499", using the seed value will output the value "499" in all subsequent runs, so that you can be certain the data is consistent. Not using the seed value might return a value of "503" on the next run, and "498" on the following run.
5.6.2 Mask Type
Replace all or a portion of the data with another character.
For the column you specify, you can select to replace the beginning or ending characters, or to mask all of the data in the column.
Option Description
All Masks all of the data in the column with the specified masking character. Everything except first/last __ char
acters Masks a portion of the data.
First: Reveals the specified number of characters at the beginning of the value, and masks the rest of the characters. For example, setting this option so the first four char acters will show the actual data for the first four characters. The data starting with the fifth character is masked through the end of the value.
Last: Reveals the specified number of characters at the end of the value, and masks the rest of the characters. For example, setting this option so the last two characters will show the actual data for the last two characters. All of the characters before the last two characters are masked.
Masking character The character or number that replaces the masked data, for example, "#" or "*". When using a numeric variance, the character must be a number. You can also leave this blank to remove all masked characters.
Maintain format Enabling this option retains any special characters such as dashes, slashes or periods, spaces between characters and formatting in the output. For example, if you have a phone number that uses dashes, then the dashes are output.
Disabling this options replaces special characters and spaces with the designated masking character.
Mask all characters in email name Enabling this option masks all characters (even special characters such as a period or a slash mark) that appears to the left of the @ symbol. The format of the email domain portion of the address (for example, @sap.com) is retained; the @ and the period, for example, @xxx.xxx.
Example
If have a column for User_ID, and the value is "Smith7887", and you set the option to Everything except the first 4 characters, your output would be "Smitxxxxx".
If you set the option to Everything except the last 2 characters, your output would be "xxxxxxx87". If your masking character is left blank, your output would be "87".
Example
If you have a column for Phone1, and the value is "800-555-1234" and select Maintain format, your output would be "xxx-xxx-xxxx". Not selecting this option would output "xxxxxxxxxxxx".
Example
If you have a column for Email1, and the value is "[email protected]", and you enable Maintain format and Mask all characters in email name, then your output would be "[email protected]".
If you enable Maintain format and disable Mask all characters in email name, then your output would be "[email protected]".
If you disable both Maintain format and Mask all characters in email name, then your output would be "xxxxxxxxxxxxxxxxxx".
5.6.3 Date Variance Type
Use Date Variance to output randomized dates.Set the options to alter input dates based on a date variance type (set number of days, months, years, or within a set range).
When using the fixed days, months and years options, the application generates an internal calculation based on the variance number. For example, if you have an input value of "May 27, 2005" and select Fixed Months and a variance of three, then the calculated minimum and maximum dates are Feb 27, 2005 and Aug 27, 2005. If you have set a user-defined minimum date of Apr 1, 2005 (within the calculated minimum value), and a user-defined maximum date of Dec 31, 2005 (outside the calculated maximum value), then the new valid range is April 1, 2005 (the user-defined minimum) through Aug 27, 2005 (the internally calculated maximum value).
Option Description
Variance Type Define how you want to vary a date.
Fixed days: Varies the date by a fixed number of days that oc cur before or after the input date.
Fixed months: Varies the date by a fixed number of months that occur before or after the input month.
Fixed years: Varies the date by a fixed number of years that occur before or after the input year.
Note
Selecting Fixed days, Fixed months, and Fixed years re sults in an internal calculation based on the variance chosen. If you also select to define a Minimum and Maxi mum date range that falls outside of the internal calcula tion, it may result in a smaller variance range. In these cases, the node uses a combination of user-defined and in ternally calculated minimum and maximum dates. See the examples below.
Range: Varies the date within the user-defined minimum and maximum dates that you set. You must set the Minimum and Maximum date values.
Variance Required for the variance types (days, months, years). Deter mines the number of days, months, or years by which to ran domize the input. Enter a value greater than zero.
Option Description
Minimum date Required for range variance, and optional for fixed date var iance types. Enter a date, or select a value by clicking on the calendar icon. The minimum acceptable date value is Sep 14, 1752.
Note
An internal minimum date is calculated for each record. If the calculated minimum date is within the user-defined minimum set for this option, then the node bases the ran dom output on the calculated minimum.
Maximum date Required for range variance, and optional for fixed date var iance types. Enter a date, or select a value by clicking on the calendar icon. The maximum acceptable date value is Dec 31, 9999.
Note
An internal maximum date is calculated for each record. If the calculated maximum date is within the user-defined maximum set for this option, then the node bases the ran dom output on the calculated maximum.
Example
The following table shows several examples for one record in a database: May 27, 1995. The examples help to illustrate how the date is calculated internally, and how it is output when the user-defined minimum and maximum values are used together.
Variance Type Internally calcu lated date
Output value is within this range
Notes
Fixed Days Variance: 10
Minimum date: <not set> Maximum date: <not set>
Min date: May 17, 1995
Max date: Jun 6, 1995
Min date: May 17, 1995 Max date: Jun 6, 1995
The output uses the internally calculated dates, because the user-defined dates are not specified.
Fixed Days Variance: 100
Minimum date: Mar 1, 1995 Maximum date: Aug 31, 1995
Min date: Feb 9, 1995
Max date: Sep 4, 1995
Min date: Mar 1, 1995 Max date: Aug 31, 1995
The user-defined minimum and maximum dates are within the calculated minimum and maximum dates. Therefore, the user-defined dates are used.
Variance Type Internally calcu lated date
Output value is within this range
Notes
Fixed Months Variance: 6
Min date: Jan 1, 1995 Max date: Dec 31, 1995
Min date: Nov 27, 1994
Max date: Nov 27, 1995
Min date: Jan 1, 1995 Max date: Nov 27, 1995
The user-defined minimum date is within the calculated variance, and is the value used for output.
The maximum user-defined date is outside of the calculated variance. Therefore, the maxi mum calculated date is used.
Fixed Years Variance: 15 Min date: Jan 1, 1965 Max date: Dec 31, 2015
Min date: May 27, 1980
Max date: May 27, 2010
Min date: May 27, 1980
Max date: May 27, 2010
Both of the user-defined minimum and maxi mum dates are outside of the calculated var iance. Therefore the calculated date is used.
Range
Min date: Jan 1, 1965 Max date: Dec 31, 2015
n/a Min date: Jan 1, 1965 Max date: Dec 31, 2015
Because there is no variance for range, only the user-defined minimum and maximum val ues are used.
5.6.4 Numeric Variance Type
Use Numeric Variance to output randomized numbers.Set the options to alter numeric input data based on a numeric variance type (percentage, fixed number, or within a range).
When using the percentage and fixed number options, the application generates an internal calculation based on the variance number. For example, if you have an input value of "10,000" and select to vary by 25%, then the calculated minimum and maximum values are 7500-12500. If you have set a user-defined minimum value of 9500 (within the calculated minimum value), and a user-defined maximum value of 15000 (outside the calculated maximum value), then the new valid range is 7500 (the user-defined minimum) through 12500 (the internally calculated maximum value).
Option Description
Variance Type Define how you want to vary a number.
Percentage: Varies the data by a percentage that is within a calculated minimum and maximum range.
Fixed number: Varies the data by a fixed number that is within a calculated minimum and maximum range.
Note
Selecting percentage and fixed number results in an inter nal calculation based on the variance chosen. If you also select to define a Minimum and Maximum date range that falls outside of the internal calculation, it may result in a smaller variance range. In these cases, the node uses a combination of user-defined and internally calculated mini mum and maximum values. See the examples below. Range: Varies the data that is greater than or equal to the user-defined minimum value and less than or equal to the user-defined maximum values that you set. You must set the minimum and maximum values.
Variance Required for the variance types (percentage and fixed num ber). Determines the number by which to randomize the in put. Enter a value greater than zero.
Minimum value Required for range variance, and optional for percentage and fixed number variance types. Enter a value as a whole number or decimal. For best results, set a realistic minimum value.
Note
An internal minimum value is calculated for each record. If the calculated minimum value is within the user-defined minimum set for this option, then the node bases the ran dom output on the calculated minimum.
Maximum value Required for range variance, and optional for percentage and fixed number variance types. Enter a value as a whole number or decimal. For best results, set a realistic maximum value.