Providing Secure Representative Data Sets


Test Data Protection


World Headquarters
321 North Clark Street, Suite 740, Chicago, IL 60654
Telephone: 312-600-4422

Research & Development
349 Marshall Ave, Suite 302, St. Louis, MO 63119
Telephone: 314-499-8984

VelociData Inc.
www.velocidata.com


1. VelociData Enterprise Streaming Compute Appliance (ESCA)
2. Test Data
3. Creating a Representative Model Copy
3.1 Challenges in Creating the Model Copy
4. VelociData TDP
5. Format Preserving Masking
6. Deterministic Masking
7. TDP Use Cases
7.1 Use Case 1: Creating a Secure, HIPAA-Compliant Full Production Dataset from Microsoft SQL Server
7.2 Use Case 2: Secure Data for Insertion into an Azure Cloud
7.3 Use Case 3: Securing Test Data for Off-Shore Developers
7.4 Use Case 4: Creating Daily Datasets for Development, QA, and Test Integration
8. Summary


1. VelociData Enterprise Streaming Compute Appliance (ESCA)

The VelociData Enterprise Streaming Compute Appliance (ESCA) is the result of over two decades of development and the deployment of hundreds of systems in the most demanding IT environments. The system comprises a unique combination of components dedicated to high-performance processing of streaming and serial information.

Figure 1: The first enterprise streaming compute appliance. (Diagram labels include Cloud, Mainframe, Application Servers, Production Databases, Enterprise Data Warehouse, Hadoop HDFS, Batch Process Delegation, Sensitive Data Protection, Streaming Data Ingestion, and Streaming Data Masking, Encryption, Transformation & Distribution.)

This white paper focuses on using ESCA to protect sensitive data when it is used for testing software applications. To do this, the sensitive values must be rendered unusable while the data still retain their format (e.g., obfuscated telephone numbers will still be 10 ASCII digits), their volume (no specific subsetting is required), and their relations (fields will still join properly). These processes can be applied as the data moves from source to target, and representative model copies can be different for different targets without slowing the data down.

2. Test Data

Development, testing, and quality assurance groups need access to data to build and test applications. For better, more rapid development, that data needs to look and feel like real production data. In many organizations, that look and feel is achieved by copying production data over directly. This is acceptable for some data sets, but when the production environment holds PHI, PCI, or any other PII data, this exposes the company to unnecessary risk, including:

• Exposing sensitive data to a (drastically) broader set of users provides greater opportunity for breaches due to social engineering

• IT organizations need to manage and secure more user accounts, more data centers or network segments, and more copies of data at rest


As an alternative, organizations could offer anyone who doesn’t truly need the production data access to a “Model Copy” that holds the key characteristics of the production data, yet doesn’t carry any true personally identifying information. To offer this in an effective way, it’s important to differentiate between systems or users that need access to actual production data and those that can work with a representative model copy:

Table 1: Example Data Needs

Production Data | Characteristic Data (Model Copy)
Transactional Systems | Analytics
Billing Systems | Application Development
Fraud Detection Applications | Testing / QA
Reporting (user specific) | Reporting (general reports)
 | Proof of Concept / Evaluation Projects

The key characteristics of model copies of data are that they must be representative in data character, distribution, and volume, and they must be fast and easy to generate. When they can be generated quickly and easily, administrators can strictly limit access to raw production data while safely and easily providing representative data to a broad set of users. This provides several benefits, including:

• Less need for limiting user access, compensating controls, securing environments, etc.

• Less pressure to expose production data to different development groups (especially when the model copy very closely mirrors the production data)

• Faster, more productive development, QA, integration testing, etc.

3. Creating a Representative Model Copy

One of the best ways to generate a truly representative model copy is to perform a selective, deterministic, format-preserving masking operation on the raw production data to generate a derived output. This will ensure that test data will very closely mirror production for many different purposes.

Representative: The test data is derived table for table, row for row from the production data

Selective: Any sensitive fields (e.g., PHI) within those tables are masked using a NIST standard algorithm

Deterministic: All identical input values will map to the same masked output value such that correlations and joins can match on the same keys

Format-Preserving: Output records must maintain the same data format (text, phone numbers, social security numbers, dates, etc.)

When all of these conditions are met, testing environments can use the same database schemas, the same testing algorithms, run the same processing operations, and observe the same volumes and capacities that will be observed in the production environment.


Figure 2: Test Data Protection

3.1 Challenges in Creating the Model Copy

There are several concerns with the current solutions in the market that make creating a true model copy in an effective manner challenging:

1. Formatting or Schema Changes – Many masking solutions require changes to the format of the data elements when encrypting or masking the data

2. Lack of Deterministic Behavior – Many simple masking solutions perform pseudo-random operations on the data to mask it, breaking the ability to perform correlations / aggregations / etc.

3. Limited Performance – Most software vendors that provide format preserving encryption only transform a few hundred fields per second, which makes large data copies infeasible given typical time windows

4. Lack of Tool Integration – Many masking solutions are not integrated into data movement / data transformation components, requiring the users to create complex multi-product, multi-step jobs

5. Hard to Use Interfaces – Most solutions require complicated tools to access masking functionality

6. Discovery Challenges – Identifying PHI / PII elements is often a time-consuming chore

7. Insufficient Throughput – Inability to perform daily refreshes or offer production-sized volumes for stress and performance testing, which often results in data sub-setting vs. full model copies


4. VelociData TDP

VelociData offers a solution that can perform format-preserving masking while facilitating the data movement / data transformations required to move data between production and test / development environments. This solution includes:

Table 2: VelociData TDP

Feature | Description
Format Preserving Masking (static and dynamic) | Ability to de-identify data without changing its characteristics (permanent and reversible). Note that both static and dynamic operations are fully deterministic
Hashing (MD5, SHA-2) | Combine multiple input fields into a hashed surrogate key that can be used for tokenization
Field Redaction | Ability to remove / clear sensitive data elements that are not required for the model copy
Data Transformation | Ability to connect to a wide variety of data sources and to transform data formats in between (e.g., mainframe EBCDIC to ASCII)
Lookup / Replace | Ability to perform lookup-based replacements of sensitive terms with non-sensitive values
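Three of the features in Table 2 are simple enough to illustrate in a few lines. The following is a minimal sketch, not the appliance's implementation: the function names, the salt, and the sample record are all hypothetical, and SHA-256 is used as one member of the SHA-2 family.

```python
import hashlib

def surrogate_key(fields, salt: bytes) -> str:
    """Hash several input fields into one SHA-256 surrogate key (hex string)."""
    h = hashlib.sha256(salt)
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()

def redact(record: dict, fields_to_clear) -> dict:
    """Clear the listed fields; all other fields pass through unchanged."""
    return {k: ("" if k in fields_to_clear else v) for k, v in record.items()}

def lookup_replace(text: str, replacements: dict) -> str:
    """Replace each sensitive term with its non-sensitive stand-in."""
    for sensitive, safe in replacements.items():
        text = text.replace(sensitive, safe)
    return text

# Hypothetical sample data, for illustration only.
SALT = b"demo-salt-not-for-production"
record = {"first": "John", "last": "Doe", "notes": "Seen at Mercy Hospital"}

token = surrogate_key([record["first"], record["last"]], SALT)
cleared = redact(record, {"first", "last"})
notes = lookup_replace(record["notes"], {"Mercy Hospital": "Facility A"})
```

The same inputs and salt always produce the same `token`, which is what lets a hashed surrogate key stand in for the original fields in joins.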

5. Format Preserving Masking

VelociData offers a format preserving masking or format preserving encryption option that conforms to the NIST 800-38G standard. This solution can mask or encrypt data without changing the format of the fields. This means that a credit card number that is stored as 16 ASCII numeric digits can be deterministically masked into 16 ASCII numeric digits. A varchar “name” field in the database can be masked or encrypted into an equivalent number of alphabetic characters.
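To make the idea concrete, here is a minimal sketch of deterministic, irreversible, format-preserving masking. It is an assumption-laden illustration and not the NIST 800-38G algorithm or VelociData's implementation: it derives a keyed byte stream with HMAC-SHA256 and substitutes each character with one of the same class. The function name and key are hypothetical.

```python
import hashlib
import hmac
import string

def mask_preserving_format(value: str, key: bytes) -> str:
    """Irreversibly mask `value`, keeping each character's class and position."""
    # Keyed byte stream: HMAC-SHA256 over (counter || value). The same key and
    # value always give the same stream, so the masking is deterministic.
    stream = bytearray()
    counter = 0
    while len(stream) < len(value):
        msg = counter.to_bytes(4, "big") + value.encode("utf-8")
        stream += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1

    out = []
    for ch, b in zip(value, stream):
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            # Separators and any non-ASCII characters pass through unchanged,
            # so this sketch only protects ASCII letters and digits.
            out.append(ch)
    return "".join(out)

key = b"demo-key-not-for-production"  # hypothetical key
masked_card = mask_preserving_format("4111111111111111", key)
masked_name = mask_preserving_format("Mary-Ann", key)
```

`masked_card` is again 16 ASCII digits and `masked_name` keeps its length, case pattern, and hyphen. Unlike true format-preserving encryption, this sketch is not reversible, two different inputs can in principle collide, and the modulo step introduces a slight bias.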


Figure 3: Example Masking

This format preserving characteristic allows users to fully secure their data without needing to change the database schema of development or testing systems.

Below are the sets of field types currently supported or in development by VelociData:

Table 3: VelociData Masking Data Types

Value | Description
name | All alphabetic characters and hyphens
numeric | ASCII numeric digits: 0-9
alphabetic | Upper and lowercase characters: a-z and A-Z
alphabetic_uppercase | All upper case alphabetic characters: A-Z
alphabetic_lowercase | All lower case alphabetic characters: a-z
alphanumeric | All alphabetic characters and base 10 digits: a-z, A-Z, 0-9
alphanumeric_uppercase | All upper case alphabetic characters and base 10 digits: A-Z and 0-9
alphanumeric_lowercase | All lower case alphabetic characters and base 10 digits: a-z and 0-9
hex_uppercase | ASCII numeric digits 0-9 and letters A-F
hex_lowercase | ASCII numeric digits 0-9 and letters a-f
date | Dates in ASCII numbers, in the format YYYYMMDD
printable | All printable ASCII characters
everything | The full set of ASCII characters
mailing_address | In development: ability to mask addresses into valid USPS mailing address output


Also note that VelociData’s performance allows data to be masked or encrypted at 10 million fields per second, where competing solutions can handle hundreds or thousands of fields per second. Because many fields are encrypted in each record of your data set, this means the difference between trickling records through the system at dozens per second and moving data through at hundreds of thousands of records per second.

When production data sets contain millions or billions of records, this could mean the difference between being forced to mask only a small subset of your data and being able to mask the entire data set in a matter of minutes.

6. Deterministic Masking

Note that the deterministic nature of masking is critical in ensuring that data in the model copy are truly representative of your source data set. To clarify what that means, consider the diagram below:

Figure 4: Deterministic Masking

Notice in this case that “John” is masked to “id’hw” each time it is observed in the data, and notice that the patient’s SSN is masked to the same output value every time, even when looking at multiple different tables. This allows data sets to be joined and correlated, even when the join keys are being masked.
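The join-preserving property can be checked with a small sketch. The tables, key, and `mask_ssn` helper below are hypothetical, and the HMAC-based digit substitution is a stand-in for the appliance's masking rather than a description of it.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key

def mask_ssn(ssn: str, key: bytes = KEY) -> str:
    """Map an SSN to another string with the same layout, deterministically."""
    digest = hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).digest()
    # Replace digits with keyed pseudo-random digits; keep the dashes in place.
    # Unlike true format-preserving encryption, collisions are possible in principle.
    return "".join(
        str(digest[i % len(digest)] % 10) if ch.isdigit() else ch
        for i, ch in enumerate(ssn)
    )

def join_on_ssn(left, right):
    """Naive inner join of two lists of dicts on the 'ssn' field."""
    return [(l, r) for l in left for r in right if l["ssn"] == r["ssn"]]

# Two hypothetical tables that share SSN as the join key.
patients = [{"ssn": "123-45-6789", "name": "John"},
            {"ssn": "987-65-4321", "name": "Mary"}]
claims = [{"ssn": "123-45-6789", "amount": 120},
          {"ssn": "123-45-6789", "amount": 80},
          {"ssn": "987-65-4321", "amount": 45}]

# Mask each table independently, as if they moved through separate jobs.
masked_patients = [{**p, "ssn": mask_ssn(p["ssn"])} for p in patients]
masked_claims = [{**c, "ssn": mask_ssn(c["ssn"])} for c in claims]

original_matches = len(join_on_ssn(patients, claims))
masked_matches = len(join_on_ssn(masked_patients, masked_claims))
```

Because each SSN maps to the same masked value in both tables, the masked join should return the same number of matches as the original. A non-deterministic scramble would break this.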


nal information. In the rare circumstances where the original data need to be recovered, VelociData works with key management systems to enable reversible processing when required. These methods and modes can all be accommodated on data in flight passing through the network or on static data at rest headed for data stores, including data warehouses and HDFS.

Table 4: VelociData Data Masking Processing Types

Form of Obfuscation | Description
Redaction/removal | Removing original information in its entirety (no spaces or other characters left); in some instances a single character, e.g., “*”, may denote a point of redaction
Scrambling/shuffling | No fixed algorithm; information is replaced with a series of (pseudo-)random characters; non-deterministic
Replacement/substitution | A fixed character pattern (usually a single character) replaces sensitive information; e.g., phone # may become: (xxx) xxx-xxxx
Hashing | MD5 and the NIST-standard SHA families; deterministic with the same salt; non-reversible
Encryption | NIST standard (AES and derivatives); block-oriented; deterministic and reversible with the same key
Format-preserving Encryption | NIST standard under consideration; field-oriented; retains field character; deterministic; reversible or non-reversible is user-selectable

7. TDP Use Cases

VelociData offers an extremely valuable format-preserving data masking mode. This data security process conforms to the NIST 800-38G specification and allows users to encrypt (reversibly) or mask (irreversibly) data without changing its schema or field specifications (lengths and dictionaries are preserved). This enables downstream applications to run without any changes. Use cases include local targets, private and public clouds, and targets where data cross geographic, company, or regulatory boundaries. A data set containing 10 million records with ten sensitive fields in each record can be secured in seconds using VelociData rather than a day using conventional approaches.
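The "seconds rather than a day" comparison follows from the throughput figures quoted in Section 5. The short calculation below is a sketch of that arithmetic. It assumes 10 million fields per second for the appliance and takes 1,000 fields per second as one illustrative point within the "hundreds or thousands" range quoted for competing solutions, and it counts masking time only, not extraction or loading.

```python
def seconds_to_mask(records: int, sensitive_fields_per_record: int,
                    fields_per_second: float) -> float:
    """Time to mask every sensitive field at a given throughput."""
    return records * sensitive_fields_per_record / fields_per_second

RECORDS = 10_000_000          # from the example in the text
SENSITIVE_FIELDS = 10         # from the example in the text

appliance_seconds = seconds_to_mask(RECORDS, SENSITIVE_FIELDS, 10_000_000)
software_seconds = seconds_to_mask(RECORDS, SENSITIVE_FIELDS, 1_000)  # assumed rate

print(f"appliance: {appliance_seconds:.0f} s")
print(f"software:  {software_seconds / 3600:.1f} h")
```

Under these assumptions, 100 million fields take 10 seconds at the appliance rate and about 27.8 hours at 1,000 fields per second, which is consistent with the claim above.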


Figure 5: Schematic for Creating Secure Model Copies. (Diagram labels: Mainframe Data Sources holding sensitive data: DB2, IMS, VSAM; Non-Mainframe Data Sources holding sensitive data: RDBMS, Files, Log; Masked Data (Model Copy) targets: QA Database, POC / Test, CSV Files, Development Database, Application Test Environment; Data Center; Regulatory, Company or Geographic Boundary.)

7.1 Use Case 1: Creating a Secure, HIPAA-Compliant Full Production Dataset from Microsoft SQL Server

A large health benefits provider needs to create a model copy of a full production dataset for access by their developers. All 18 PHI data field types need to be de-identified for HIPAA/HITECH audit compliance. The production data is about 400 GB loaded into Microsoft SQL Server. Following the outline of Figure 5, a workflow is established that:

1. extracts data out of SQL Server;

2. secures the data through the VelociData appliance using format-preserving masking (to ensure data integrity and application usability); and

3. performs a bulk load of the model data into a development set of tables.
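The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions only: Python's built-in sqlite3 module with in-memory databases stands in for Microsoft SQL Server, the `mask_digits` helper stands in for the appliance, and the `members` table, its columns, and the key are hypothetical.

```python
import hashlib
import hmac
import sqlite3

KEY = b"demo-key-not-for-production"  # hypothetical key

def mask_digits(value: str) -> str:
    """Deterministically replace digits, keeping separators and length."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return "".join(
        str(digest[i % len(digest)] % 10) if ch.isdigit() else ch
        for i, ch in enumerate(value)
    )

def build_model_copy(prod: sqlite3.Connection, dev: sqlite3.Connection) -> int:
    # 1. Extract rows from the production source.
    rows = prod.execute("SELECT id, ssn, plan_name FROM members").fetchall()
    # 2. Mask only the sensitive column; other columns pass through unchanged.
    masked = [(id_, mask_digits(ssn), plan) for id_, ssn, plan in rows]
    # 3. Bulk load into a development table with the same schema.
    dev.execute(
        "CREATE TABLE IF NOT EXISTS members "
        "(id INTEGER PRIMARY KEY, ssn TEXT, plan_name TEXT)"
    )
    dev.executemany("INSERT INTO members VALUES (?, ?, ?)", masked)
    dev.commit()
    return len(masked)

# Hypothetical production data for the demonstration.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, ssn TEXT, plan_name TEXT)")
prod.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [(1, "123-45-6789", "gold"), (2, "987-65-4321", "silver")],
)
prod.commit()

dev = sqlite3.connect(":memory:")
loaded = build_model_copy(prod, dev)
dev_rows = dev.execute("SELECT id, ssn, plan_name FROM members ORDER BY id").fetchall()
```

The development table keeps the production schema and row count, so applications written against production can run unchanged against the model copy.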

As an example, one of the tables contains 1 million records, each comprising 34 fields. For HIPAA Final Rule compliance, 14 of the fields in each record need to be de-identified (totaling 14 million fields). The dataset included a number of different field types (names, SSNs, ...) requiring the following dictionaries:

• Names

• Numbers

• Dates

• Numerics

• hex_uppercase

• hex_lowercase

• alphanumerics

• alphanumeric_uppercase

• alphanumeric_lowercase

• printable characters


The overall processing time for this table, including all database queries, masking operations, and insertion into the resulting database, was just over one minute (65 seconds), with the longest-running process being the database insert.

7.2 Use Case 2: Secure Data for Insertion into an Azure Cloud

A retail company must de-identify PII data from records it needs to share with its business partners. This sensitive data contains names, addresses, phone numbers, and other personally identifying data. The retailer wants to put the data into a hosted environment but cannot let unprotected data leave its firewall. For this reason, it has chosen to use VelociData to de-identify the data in its datacenter before it leaves to enter the cloud.

The data contains a large volume of daily transactions. The business associates require the freshest data to address immediate results of campaigns, implementation changes for agile app development, and preparing model reports.

Figure 6: Schematic for Securing Data to a Cloud Datastore

As identified in Figure 6, data move through the VelociData appliance, which de-identifies the PII data found within the data flow. These records are then allowed to move to the cloud-based storage for access by business associates of the retailer. Since no sensitive data remain, there is no risk to the company or the individuals should unauthorized access be gained or a data breach occur.


7.3 Use Case 3: Securing Test Data for Off-Shore Developers

A major telco would like to move production data to India to leverage faster, round-the-clock development and lower costs. In order to remove audit deficiencies, they would like to generate a model copy of the data to send off-shore. While de-identification removes the risk of leaking sensitive customer and corporate data, the developers require access to a dataset that closely mirrors fresh production data in characteristics such as volume, distribution, and relation. The dataset represents 30 million records with 12 fields per record that need to be de-identified. VelociData can provide a fresh test dataset for the off-shore partners in a minute, where the alternate solution takes almost a week before the data are available in test; by then, the developers have a new application built to be tested!

7.4 Use Case 4: Creating Daily Datasets for Development, QA, and Test Integration

A large financial institution needs to provide model datasets with de-identified data to different parts of the development process. While all data needs to be fully de-identified for every user, not all data needs to go to all groups. As an example, Web Development may not need a field relating to fraud, but Test Integration may need it to complete processing.

The VelociData Test Data Protection solution has the ability to route different dataset builds to different end users. Leveraging routing is fast and efficient and provides the right data, in the right form, to the right individuals. Only the proper data arrive at the given locations, saving on storage and maintenance of terabytes of useless replicated data.
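Routing of this kind amounts to projecting each already-masked record onto the fields a given group needs. The sketch below illustrates the idea. The group names, field names, and sample records are hypothetical, and it does not describe how the appliance is actually configured.

```python
# Hypothetical routing table: which fields each group receives.
ROUTES = {
    "web_development": ["customer_id", "name"],
    "test_integration": ["customer_id", "name", "fraud_score"],
}

def route(records, routes):
    """Build one dataset per group containing only that group's fields."""
    return {
        group: [{field: rec[field] for field in fields} for rec in records]
        for group, fields in routes.items()
    }

# Records are assumed to be de-identified before routing.
masked_records = [
    {"customer_id": "83120457", "name": "Qdrt", "fraud_score": 0.12},
    {"customer_id": "20947731", "name": "Lwme", "fraud_score": 0.87},
]

builds = route(masked_records, ROUTES)
```

Here `builds["web_development"]` omits the fraud field entirely, while `builds["test_integration"]` retains it, so each group stores only what it uses.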

8. Summary

The VelociData appliance offers an easy-to-deploy, easy-to-use solution for test data protection. The system does not require any coding for integration and operation in the existing software and database environment. Rather, it operates as a simple network resource for automatically masking sensitive data at wire speed.

The appliance can communicate with all kinds of systems, including mainframes, commodity servers, and cloud services, and can work with relational data, flat files, logs, and XML data. It requires no additional software or hardware to operate.

The VelociData Test Data Protection solution reduces regulatory exposure and hacker risk, and it improves software testing speed and agility.


9. Let Us Help You

For reducing hacker risk and regulatory exposure in test data protection, VelociData offers the fastest time to safety.

If you are using custom coding or packaged software for test data protection, VelociData would like to show you how our unique appliance-based solution can significantly reduce your cost and increase the speed of your test data protection workflow.

If you are testing software with sensitive data unprotected, you are taking a huge risk and should consider adopting some remedy immediately, either VelociData’s or some other. We would like to show you how quickly you can make this problem go away. Please contact us at [email protected] to see what we can do for you.

Author: Ron Indeck

Ron Indeck is the President & CTO of VelociData and has over 25 years of industry and academic experience, most recently as a founder and CTO of Exegy. He was a professor at Washington University in St. Louis, where he was the Das Family Distinguished Professor and Director of the Center for Security Technologies. Among his distinguished professional affiliations, Dr. Indeck was also the President of the Institute of Electrical and Electronics Engineers (IEEE) Magnetics Society. Dr. Indeck has been named the Bar Association Inventor of the Year.
