A Scalable Data Transformation Framework using the Hadoop Ecosystem

(1)

Raj Nair

Director–Data Platform

Kiru Pakkirisamy

(2)

AGENDA

• About Penton and Serendio Inc

• Data Processing at Penton

• PoC Use Case

• Functional Aspects of the Use Case

• Big Data Architecture, Design and Implementation

• Lessons Learned

• Conclusion

(3)

About Penton

• Professional information services company

• Provide actionable information to five core markets

– Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing

Success Stories

• EquipmentWatch.com: Prices, Specs, Costs, Rental

• Govalytics.com: Analytics around Gov’t capital spending down to county level

• SourceESB: Vertical Directory, electronic parts

• NextTrend.com: Identify new product trends in the natural products industry

(4)

About Serendio

Serendio provides Big Data Science

Solutions & Services for

Data-Driven Enterprises.

(5)

Data Processing at Penton

(6)

What got us thinking?

• Business units process data in silos

• Heavy ETL

– Hours to process, in some cases days

• Not even using all the data we want

• Not logging what we needed to

(7)

The Data Processing Pipeline

[Figure: Data Processing Pipeline. Labels: Assembly Line processing; New features, New Insights, New Products; Biz Value]

(8)

Penton examples

• Daily Inventory data, ingested throughout the day (tens of thousands of parts)

• Auction and survey data gathered daily

• Aviation Fleet data, varying frequency

[Figure: pipeline stages. Ingest, store -> Clean, validate -> Apply Business Rules, Map -> Analyze -> Report -> Distribute]

Slow Extract, Transform and Load = Frustration + missed business SLAs

Won’t scale for future

(9)

Current Design

• Survey data loaded as CSV files

• Data needs to be scrubbed/mapped

• All CSV rows loaded into one table

• Once scrubbed/mapped, data is loaded into main tables

• Not all rows are loaded, some may be used in the future

(10)

What were our options?

Adopt Hadoop Ecosystem (HBase)

- M/R: Ideal for Batch Processing

- Flexible for storage

- NoSQL: scale, usability and flexibility

Expand RDBMS options (Oracle, SQL Server)

- Expensive

- Complex

(11)

POC Use Case

(12)

Primary Use Case

• Daily model data – upload and map

– Ingest data, build buckets

– Map data (batch and interactive)

– Build Aggregates (dynamic)

Issue: Mapping time

(13)

Functional Aspects

(14)

Data Scrubbing

• Standardized names for fields/columns

• Example - Country

– Unites States of America -> USA

– United States -> USA
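The country example above can be sketched as a small alias lookup. This is only an illustration: the deck says scrubbing rules were managed in Drools, and the `COUNTRY_ALIASES` table and `scrub_country` helper below are hypothetical names, holding just the two mappings shown on the slide.

```python
# Hypothetical alias table; only the two mappings from the slide are included.
COUNTRY_ALIASES = {
    "unites states of america": "USA",
    "united states": "USA",
}


def scrub_country(raw: str) -> str:
    """Return the standardized country name, or the trimmed input if no rule applies."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    return COUNTRY_ALIASES.get(key, raw.strip())


if __name__ == "__main__":
    for value in ["Unites States of America", "  united   states ", " Canada "]:
        print(repr(value), "->", scrub_country(value))
```

Values with no matching rule pass through unchanged apart from trimming, so unknown inputs stay visible for later review.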

(15)

Data Mapping

• Converting Fields -> Ids

– Manufacturer - Caterpillar -> 25

– Model - Caterpillar/Front Loader -> 300

• Requires the use of lookup tables and partial/fuzzy string matching
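A minimal sketch of lookup-table mapping with a fuzzy fallback follows. It uses Python's standard `difflib` as a stand-in; the deck does not say which matching technique was used. The `map_to_id` helper, the table names and the 0.8 cutoff are assumptions, and the tables hold only the two ids given on the slide.

```python
import difflib

# Hypothetical lookup tables; the ids 25 and 300 are the ones from the slide.
MANUFACTURER_IDS = {"caterpillar": 25}
MODEL_IDS = {"caterpillar/front loader": 300}


def map_to_id(value, lookup, cutoff=0.8):
    """Map a field value to its id: exact match first, then closest fuzzy match."""
    key = value.strip().lower()
    if key in lookup:
        return lookup[key]
    # cutoff is an assumed similarity threshold in [0, 1].
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


if __name__ == "__main__":
    print(map_to_id("Caterpillar", MANUFACTURER_IDS))
    print(map_to_id("Caterpiller", MANUFACTURER_IDS))  # misspelled input
    print(map_to_id("Komatsu", MANUFACTURER_IDS))      # no entry in the table
```

Returning `None` for unmatched values leaves the decision to accept or reject a row to a later step, such as the interactive mapping the use case mentions.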

(16)

Data Exporting

(17)

Key Pain Points

• CSV data table continues to grow

• Large size of the table impacts operations on rows in a single file

(18)

Criteria for New Design

• Ability to store an individual file and manipulate it easily

– No join/relationships across CSV files

• Solution should have good integration with RDBMS

• Could possibly host the complete application in future

• Technology stack should possibly have advanced analytics capabilities

A NoSQL model would allow an individual file to be retrieved/addressed quickly and manipulated

(19)

Big Data Architecture

(20)

Solution Architecture

[Figure: solution architecture. Labels: Data Upload UI; REST API; CSV and Rule Management Endpoints; CSV Files; API Calls; Launch; MR Jobs; Survey; Drools; HBASE; HADOOP HDFS; Existing Business Applications; Current Oracle Schema; Master database of Products/Parts; Push Updates; Insert Accepted Data]

• Data manipulation APIs exposed through REST layer

• Use HBase as a store for CSV files

• Drools – for rule based data scrubbing

• Operations on individual files in UI through HBase Get/Put

• Operations on all/groups of files using MR jobs

(21)

HBase Schema Design

• One row per HBase row

• One file per HBase row

– One cell per column qualifier (simple; development started with this approach)

(22)

HBase Rowkey Design

• Row Key

– Composite

• Created Date (YYYYMMDD)

• User

• FileType

• GUID

• Salting for better region splitting
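The composite key with a salt prefix can be sketched as follows. The component order follows the slide; the `|` separator, the 16 salt buckets, the CRC32-based salt and the `make_row_key` helper are all assumptions made for illustration, as are the sample user and file type.

```python
import uuid
import zlib
from datetime import date

SALT_BUCKETS = 16  # assumed bucket count; the slides do not give one


def make_row_key(created, user, file_type, guid, buckets=SALT_BUCKETS):
    """Build salt|YYYYMMDD|user|filetype|guid, with the salt derived from the key body."""
    body = "|".join([created.strftime("%Y%m%d"), user, file_type, guid])
    # A deterministic salt lets a reader recompute the full key for a point lookup.
    salt = zlib.crc32(body.encode("utf-8")) % buckets
    return f"{salt:02d}|{body}"


if __name__ == "__main__":
    # Sample values are made up for the example.
    print(make_row_key(date(2013, 10, 1), "jdoe", "survey", str(uuid.uuid4())))
```

Because the key otherwise starts with the creation date, files uploaded on the same day would sort next to each other; the salt prefix spreads them across buckets, at the cost of fanning a date-range scan out over every bucket.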

(23)

HBase Column Family Design

• Column Family

– Data separated from Metadata into two or more column families

– One cf for mapping data (more later)

– One cf for analytics data (used by analytics coprocessors)
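One way to picture this layout is a nested dictionary standing in for a single HBase row, with one entry per column family. The family names (`data`, `meta`, `mapping`, `analytics`), the `line:column` qualifier scheme and the `file_to_row` helper are assumptions; the slides only say that data, metadata, mapping and analytics live in separate families.

```python
import csv
import io


def file_to_row(csv_text, uploaded_by):
    """Lay out one CSV file as one row: family -> {qualifier: value}."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    data = {}
    for line_no, record in enumerate(records):
        for column, value in zip(header, record):
            # Assumed qualifier scheme: zero-padded line number plus column name.
            data[f"{line_no:06d}:{column}"] = value
    return {
        "data": data,
        "meta": {"uploaded_by": uploaded_by, "line_count": str(len(records))},
        "mapping": {},    # filled in by the mapping step
        "analytics": {},  # filled in by analytics code
    }


if __name__ == "__main__":
    sample = "manufacturer,model\nCaterpillar,Front Loader\n"
    print(file_to_row(sample, "jdoe"))
```

Keeping the small metadata family apart from the bulky CSV cells means a listing screen can read file details without pulling the file contents.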

(24)

M/R Jobs

• Jobs

– Scrubbing

– Mapping

– Export

• Schedule

– Manually from UI

(25)

Sqoop Jobs

• One time

– FileDetailExport (current CSV)

– RuleImport (all current rules)

• Periodic

– Lookup Table Data import

• Manufacture

• Model

• State

• Country

• Currency

• Condition

• Participant

(26)

Application Integration - REST

• Hide HBase AP/Java APIs from rest of application

• Language independence for PHP front-end

• REST APIs for

– CSV Management

(27)

Lessons Learned

(28)

Performance Benefits

• Mapping

– 20000 csv files, 20 million records

– Time taken – 1/3rd of RDBMS processing

• Metrics

– < 10 secs vs (Oracle Materialized View)

• Upload a file

– < 10 secs

• Delete a file

(29)

HBase Tuning

• Heap Size for

– RegionServer

– MapReduce Tasks

• Table Compression

– SNAPPY for Column Family holding csv data

• Table data caching

(30)

Application Design Challenges

• Pagination – implemented using intermediate REST layer and scan.setStartRow.

• Translating SQL queries

– Used Scan/Filter and Java (especially on coprocessor)

– No secondary indexes - used FuzzyRowFilter

– Maybe something like Phoenix would have helped

• Some issues in mixed mode. Want to move to 0.96.0 for better/individual column family flushing but needed to 'port' coprocessors (to protobuf)
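The pagination approach mentioned above (an intermediate layer that restarts a scan from a start row) can be sketched with an in-memory sorted list standing in for the HBase table. The `scan_page` helper and the sample keys are hypothetical; the idea shown is reading one extra key per page and handing it back as the start row for the next request.

```python
import bisect


def scan_page(sorted_keys, start_row, page_size):
    """Return (page, next_start_row); next_start_row is None on the last page."""
    # Position of the first key >= start_row, like starting a scan at that row.
    begin = bisect.bisect_left(sorted_keys, start_row)
    window = sorted_keys[begin:begin + page_size + 1]  # read one extra key
    page = window[:page_size]
    next_start = window[page_size] if len(window) > page_size else None
    return page, next_start


if __name__ == "__main__":
    keys = sorted(f"file-{i:03d}" for i in range(7))
    cursor = ""
    while cursor is not None:
        page, cursor = scan_page(keys, cursor, 3)
        print(page)
```

The client only has to carry the returned start row between requests, so the server side keeps no scanner state per user.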

(31)

HBase Value Proposition

• Better response in UI for CSV file operations

– Operations within a file (map, import, reject etc) not dependent on the db size

• Relieve load on RDBMS - no more CSV data tables

• Scale out batch processing performance on the cheap (vs vertical RDBMS upgrade)

• Redundant store for CSV files

(32)

Roadmap

• Benchmark with 0.96

• Retire Coprocessors in favor of Phoenix (?)

• Lookup Data tables are small. Need to find a better alternative than HBase

• Design UI for a more Big Data appropriate model

– Search oriented paradigm, rather than exploratory/paginative

– Add REST endpoints to support such UI

(33)

Wrap-Up

(34)

Conclusion

• PoC demonstrated

– value of the Hadoop ecosystem

– Co-existence of Big data technologies with current solutions

– Adoption can significantly improve scale

(35)