• No results found

Synthetic Data Generation for Realistic Analytics Examples and Testing

N/A
N/A
Protected

Academic year: 2021

Share "Synthetic Data Generation for Realistic Analytics Examples and Testing"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

Synthetic Data Generation for

Realistic Analytics Examples and

Testing

Ronald J. Nowling

Red Hat, Inc.

[email protected]

http://rnowling.github.io/

(2)

Who Am I?

Software Engineer at Red Hat

Data Science Team, Emerging

Technologies

Evaluate open-source Big Data space

Ensure software works for Red Hat

customers

Promote data science internally through

consulting projects

(3)

Synthetic Data

No licensing, privacy, or intellectual

property concerns

Scalable: Laptops to Clusters!

More reliable than external data sets

Enable more realistic example

applications

Enable more comprehensive testing than

wordcount and TeraSort

(4)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

(5)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

5  

(6)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

(7)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

7  

(8)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

(9)

Data Transformation and

Summarization Pipeline

Transform Raw Text Raw Daily Page Views Parse Clean & Validate Raw Daily Page Views Raw Daily Page Views Transform Raw Text Transform

Raw Text Parse Parse Clean & Validate Clean & Validate Accounts Summarize Summarize Summarize Aggregate Daily Activity Cumulative Activity

9  

(10)

Synthetic Data

Sensitive Data

Real data on cluster for scalability testing and

validation

Synthetic data for local development and testing

Needed smaller data sets for checking

calculations

Total aggregation results requires re-running old

pipeline

Extra burden on operations team

(11)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

Validation with Synthetic Data

(12)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

(13)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

Validation with Synthetic Data

(14)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

(15)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

Validation with Synthetic Data

(16)

Validation

Script

Data

Generator

Expected

Cumulative

Activity

Accounts

Raw Daily

Page Views

Expected

Daily Activity

Transformation

and Summarization

Pipeline

Cumulative

Activity

Daily Activity

(17)

Issues Tackled

Error in account validation introduced

while refactoring code

Usage of the correct join types

Validation of date-time operations

Correct Output Formats

(18)

Apache BigTop

BigPetStore Blueprints

Problem domain: Transactions for a

fictional chain of pet stores

BigPetStore Data Generator simulates

customer purchasing behavior to

generate realistic transaction data

Blueprints for big data ecosystem

Hadoop: MapReduce / Pig / Hive / Mahout

Spark

(19)

BigPetStore

(20)

BigPetStore

(21)

BigPetStore

21  

Core (RDDs)

HCFS

(22)

BigPetStore

Spark SQL

Core (RDDs)

HCFS

(23)

BigPetStore

23  

Spark SQL

MLLib

Core (RDDs)

HCFS

(24)

Team Cluster

~10 nodes

(25)

Potential Issues

Infrastructure

Storage

Software Installation

Software Upgrades

Spark Configuration Tuning

User Management

(26)

Real Stories

Creating a new user

User Gluster permissions incorrect

Cluster upgrade

Spark upgrade didn’t take because of issue with

Ansible role configuration

Wiped out our spark.conf – master / mesos

settings wrong

Gluster moint points disappeared on reboot

(27)

k8petstore

Public IP

Proxy

Users

BPS Data

Generator

Redis

Master

Redis

Slave

Web

Application

Redis

Slave

Redis

Slave

BPS Data

Generator

BPS Data

Generator

27  

(28)

k8petstore

Public IP

Proxy

Users

BPS Data

Generator

Redis

Master

Redis

Slave

Web

Application

Redis

Slave

Redis

Slave

BPS Data

Generator

BPS Data

Generator

(29)

k8petstore

Public IP

Proxy

Users

BPS Data

Generator

Redis

Master

Redis

Slave

Web

Application

Redis

Slave

Redis

Slave

BPS Data

Generator

BPS Data

Generator

29  

(30)

k8petstore

Public IP

Proxy

Users

BPS Data

Generator

Redis

Master

Redis

Slave

Web

Application

Redis

Slave

Redis

Slave

BPS Data

Generator

BPS Data

Generator

(31)

k8petstore

(32)

Use Cases

Configuration

Scalability

(33)

k8petstore

OpenContrail networking solution demo

1

Kubernetes JuJu Charm documentation

example

2

Kubernetes v1.0 launch talk at OSCON

3

[1] -

https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/

[2] -

http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html

[3] -

http://www.oscon.com/open-source-2015/public/schedule/detail/45281

33  

(34)

APACHE BIGTOP

(35)

BigPetStore

(36)
(37)

BigTop Bazaar

(38)

Vision

Encourage synthetic data generation for

testing and realistic examples

Serve as a resource for the larger Apache

and open source communities

Emphasis on

Flexibility

Scalability

Realism

We look forward to collaborating and getting

(39)

Conclusion

Synthetic data generators and blueprints are

useful!

Case studies:

Data Processing Pipelines

Cluster Deployment

Kubernetes

BigPetStore and BigTop Data Generators

efforts in Apache BigTop

Open invitation to get involved and

collaborate

(40)

Resources

http://bigtop.apache.org/

http://github.com/apache/bigtop

http://rnowling.github.io/

(41)

QUESTIONS

References

Related documents

Bringing together various access control concepts such as separation of duty, workflow-related concepts, and context constraints is necessary in order to es- tablish a unified

Notice also that if the outside option of deploying general skills in other industries is higher than that of deploying industry specific skills, the outsiders will tend to be

The purpose of the project will be to analyze existing engagement plan, develop and field-test communication and engagement tools, and to offer recommendations for future

The simplest way to make sense of the diversity of investable opportunities within the impact investing marketplace is to sort the market by five basic categories: Asset

Name: Date: Mailing Address: Main Type of Business: 101218907 Saskatchewan Ltd.. 31 box 99, meskanaw contract

For example, if a student received a performance level score of 3 on the dimension of purpose/task, a score of 3 on the dimension of organization, a score of 2 on the dimension

Although 7 companies invested in solar energy, taking into account the higher number of solar power investors among the thirty companies included in the study,

But our results show that an inherently slow drive, such as the Zip Drive, connected through a SCSI port, offers dramatically better performance for many purposes than a faster