So What’s the Big Deal?

NCOAUG Training Day – February 22, 2013

Presentation Agenda

• Introduction

• What is Big Data?

• So What is the Big Deal?

• Big Data Technologies

• Identifying Big Data Opportunities

• Conducting a Big Data Proof‐of‐Concept

• Big Data Case Study (if we have time)

• Q&A

• Links to More Information


Introduction

• RhinoSource, Inc.

– Oracle App/Tech Consulting and Managed Services

• Oracle E‐Business Suite

• Oracle Business Intelligence

• Oracle Database (performance, partitioning and replication)

• Application Development and Advanced PL/SQL Development

– Advanced Technology Consulting and Managed Services

• Big Data

• Mobile Applications

• Cloud Computing

– CIO‐Level Advisory Services

• IT Strategy, Planning and Project Management

• ERP/CRM Evaluation and Implementation


WHAT IS BIG DATA?


What Makes Up Big Data?

• Blog posts, user comments

• Emails and Messaging

• Web server logs

• Instrumentation of online stores

• Image and video uploads

• Process data, such as RFID

• Sensor device data

• External data sets

– Census data

– Weather data

– Geographical data

• “Shadow Data” (replicated copies and change journals)


The 3 V’s of Big Data

• Velocity

– Data generated at a faster rate than ever before

– Server logs, smartphones, sensor devices, RFID

– Challenge: Existing systems cannot process new data fast enough

• Variety

– Data more varied and complex

– Structured and unstructured

– Many formats: text, document, image, video

– Challenge: Existing databases do not handle varying data formats well

• Volume

– Orders of magnitude larger

– 2.5 Zettabytes of new data created in 2012

– 8 Zettabytes of new data projected to be created in 2015

• 3 Billion Internet users, 15 Billion connected devices

– Challenge: Existing databases do not cost‐effectively scale to Big Data sizes
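
As a quick check on the Volume figures above, here is a minimal Python sketch of the growth arithmetic. The function names are illustrative only; the inputs are the 2.5 ZB (2012) and 8 ZB (2015) figures quoted on this slide.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value reached after compounding start_value at a fixed annual rate."""
    return start_value * (1 + rate) ** years


# Figures quoted on the slide: 2.5 ZB created in 2012, 8 ZB projected for 2015.
implied_rate = cagr(2.5, 8.0, 3)
print(f"Implied annual growth 2012-2015: {implied_rate:.1%}")
print(f"2.5 ZB compounding at 40% for 3 years: {project(2.5, 0.40, 3):.2f} ZB")
```

The two quoted figures imply roughly 47% growth per year. A flat 40% rate applied to 2.5 ZB gives about 6.9 ZB after three years, so a growth rate measured over a different period would not reproduce these two points exactly.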


Big Data Growth Trend

(Chart: data volume in Zettabytes over time, 40% CAGR)


How Much is 1 Zettabyte?
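
One way to put the unit in perspective is to convert it using the standard decimal (SI) definitions. The short Python sketch below does only that; the constant and function names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21


def zettabytes_in(unit_bytes: int, zettabytes: int = 1) -> int:
    """How many of the given unit fit in the given number of zettabytes."""
    return zettabytes * ZETTABYTE // unit_bytes


print(f"1 ZB = {zettabytes_in(PETABYTE):,} PB")
print(f"1 ZB = {zettabytes_in(TERABYTE):,} TB")
print(f"1 ZB = {zettabytes_in(GIGABYTE):,} GB")
```

In other words, one zettabyte is a billion terabytes, or a million petabytes.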


New Data by End of 2015

17 ZB of New Data by End of 2015!


SO, WHAT’S THE BIG DEAL?


“Can Big Data Show Us The Way?”

• Scientific American, Dec ’11:

– "...the rise of 'big data' [is] a trend that is striking many scientists as being on a par with the invention of the telescope and microscope."

– "...many experts believe we are on the cusp of opening up new worlds of inquiry."


Big Data Advantages

• Better, more accurate predictions

• Deeper, richer insights into customers, business partners and operations

• Real‐time Big Data analytics enables faster decision‐making

• Creates competitive advantage

• Improves bottom line


Big Data Spending

• Companies have spent $4.3 billion on Big Data as of the end of 2012.

• Gartner predicts those initial investments will in turn trigger a domino effect of upgrades and new initiatives

– Valued at $34 billion for 2013, per Gartner.

– Over a 5‐year period, spend is estimated at $232 billion.


BIG DATA TECHNOLOGIES


A Brief History of Big Data

(Chart: Scale plotted against Time. Labels: RDBMS, Data Warehouse, RAC, Distributed Big Data Cluster)


How Do We Store Big Data?

• NoSQL databases store data records as key‐value pairs

– Or as triplets with a timestamp.

• Schema‐less or “schema‐optional”

– Values may be structured or unstructured (developer’s choice).

• Not relational

– No relationships between records

– No join support in a NoSQL database.

– Does not use SQL to store and retrieve records.

• Highly optimized for retrieval and appending operations.

– High performance writes.

– High performance retrieval by primary key.

– Little functionality beyond record storage and retrieval.

• Highly Scalable to huge amounts of data

– Millions or Billions of records

– Partition data across many distributed, inexpensive servers for cost‐effective scalability and availability

• Must trade off Availability against Consistency (CAP Theorem).
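
The storage model in the bullets above can be sketched in a few lines of Python. This is a toy illustration, not how any particular NoSQL product is implemented: the class names, the simple hash-modulo placement and the "last write wins" rule are all assumptions made for the example, and replication is left out.

```python
import hashlib
import time
from typing import Any, Dict, List, Optional, Tuple


class Node:
    """One inexpensive server holding a slice of the records (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # key -> (value, timestamp): the "triplet with a timestamp" idea
        self.records: Dict[str, Tuple[Any, float]] = {}


class KeyValueCluster:
    """Schema-less key-value store partitioned across several nodes."""

    def __init__(self, node_names: List[str]) -> None:
        self.nodes = [Node(name) for name in node_names]

    def _node_for(self, key: str) -> Node:
        # Assumed placement rule: hash the key, then take it modulo the node count.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key: str, value: Any, timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        node = self._node_for(key)
        current = node.records.get(key)
        # Assumed conflict rule: the write with the newest timestamp wins.
        if current is None or ts >= current[1]:
            node.records[key] = (value, ts)

    def get(self, key: str) -> Optional[Any]:
        # Retrieval by primary key only: no joins, no SQL.
        record = self._node_for(key).records.get(key)
        return None if record is None else record[0]


cluster = KeyValueCluster(["node-a", "node-b", "node-c"])
cluster.put("user:42", {"name": "Ada"}, timestamp=1.0)
cluster.put("user:42", {"name": "Ada L.", "city": "London"}, timestamp=2.0)
cluster.put("user:42", {"name": "stale"}, timestamp=1.5)  # older write, ignored
print(cluster.get("user:42"))
```

Note that the value can change shape from one write to the next, which is the schema-less property, and that every operation touches exactly one node, which is what makes this model easy to spread across many servers.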


Popular NoSQL Databases

• Key‐Value Stores

• Column‐Oriented Databases

• Graph Databases

• Document Databases


Why Not Relational for Big Data?

• Transforming and loading data into RDBMS requires extensive pre‐processing of data into a pre‐defined schema

– Doesn’t work well for semi‐structured and unstructured data

– Can take more time than is available before next batch must be loaded

• Joining multiple data sets at query time is an expensive operation

• RDBMS scaling must be done vertically to larger and more expensive servers and storage solutions

• RDBMS clustering requires expensive networking and shared storage infrastructures

– Fibre Channel, InfiniBand, SAN, NAS

• Challenging to distribute data across data centers

– Replication strategies are “add‐ons” and complex

• Strict Consistency requirement is enforced at the cost of write performance and availability (CAP Theorem)


Dr. Brewer’s CAP Theorem

“Pick 2”

• CP: BigTable, Hadoop/HBase, MongoDB, Oracle NoSQL, Redis

• AP: Cassandra, CouchDB, Dynamo, Riak, SimpleDB

• CA: RDBMS, Oracle RAC


Scalability Comparison

(Logarithmic Scale)

(Chart: Terabytes and Server Nodes for MongoDB, Oracle RAC, Cassandra and Hadoop, axis from 1 to 100,000)

• Data labels on the chart:

– 21 PB, 2000 Nodes at Facebook

– 300 TB, 400 Nodes at Digital Reasoning

– 71 TB, 48 Nodes at Amazon

– 10 TB, 100 Nodes at CraigsList


Scalability Comparison

(Linear Scale)

(Chart: Terabytes and Server Nodes for MongoDB, Oracle RAC, Cassandra and Hadoop, axis from 0 to 25,000)

• Data labels on the chart:

– 21 PB, 2000 Nodes at Facebook

– 300 TB, 400 Nodes at Digital Reasoning

– 71 TB, 48 Nodes at Amazon

– 10 TB, 100 Nodes at CraigsList


Feature Comparison

Cassandra

• Best of the NoSQLs for Cross‐Data Center Replication and High‐Availability

• Known to scale to 100’s of Terabytes (but theoretically to Petabytes)

• Tunable Consistency at operation‐level for writes and reads. Availability model (AP).

• Primary and Secondary Indexes

• Queries are Real‐Time (CQL, Thrift)

• No Join Support

• Masterless Peer‐to‐Peer Ring Architecture = No S.P.O.F.

• Provides most cost‐effective HA and scalability of the NoSQLs

• Written in Java

• Minimum of 3 nodes recommended.

• Easy to install and set up on commodity hardware.

Hadoop/HBase

• The current “Gold Standard” of the NoSQLs for Data Analysis

• Known to scale to Petabytes (1000’s of Terabytes)

• Consistency model (CP)

• Hadoop Queries are Batch (MapReduce). HBase provides real‐time queries similar to Cassandra.

• Joins are Possible

• Master‐Slave Architecture = S.P.O.F. (Name/JobTracker Node)

• Written in Java

• Minimum of 5 nodes recommended.

• More challenging installation and setup.

• Warm Standby and Shared Storage Required for High‐Availability Failover, so higher infrastructure costs.
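
Tunable consistency, listed for Cassandra above, is commonly explained with a replica-overlap rule: if a write is acknowledged by W replicas and a read consults R replicas out of N copies, the read is guaranteed to overlap the latest write when R + W > N. The Python sketch below illustrates only that arithmetic; the function names are made up for the example and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas for a given replication factor."""
    return replication_factor // 2 + 1


def read_overlaps_latest_write(
    replication_factor: int, write_replicas: int, read_replicas: int
) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


n = 3  # copies of each record, matching the 3-node minimum mentioned above
q = quorum(n)

# Quorum writes plus quorum reads: consistent, and still tolerates one node down.
print(read_overlaps_latest_write(n, q, q))

# Write to one replica, read from one replica: fastest, but reads may be stale.
print(read_overlaps_latest_write(n, 1, 1))
```

Choosing small W and R favors availability and latency (the AP side of the trade-off); choosing values that satisfy the overlap rule favors consistency at the cost of waiting for more replicas.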


Best of All Worlds

• DataStax Enterprise

– Cassandra Real‐Time Database

• Peer‐to‐Peer HA Architecture

• Cross‐Data Center Replication

• Real‐Time, Low‐Latency Queries

– Hadoop Analytics

• Map/Reduce, Hive, Pig (Joins)

– Solr Search

• Full‐Text Search

• Rich Document Handling (Word, PDF)
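
For readers new to the Map/Reduce model mentioned under Hadoop analytics, here is a minimal single-process Python sketch of its three phases (map, shuffle, reduce). It does not use Hadoop; the log-line format and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (url, 1) per log line; assumes lines look like 'METHOD /path'."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


sample_log = ["GET /home", "GET /cart", "POST /cart", "GET /home"]
hits = run_job(sample_log)
print(hits)
```

On a real cluster the map and reduce steps run in parallel on many nodes over very large files, which is why such queries are batch-oriented rather than real-time.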


Plus Cluster Management


Current Big Data Challenges

• Integrating Big Data with existing databases and BI/reporting systems.

– JDBC, ODBC

– Sqoop

• Security and Encryption

– DataStax Enterprise 3.0 (In Beta)

• Transparent Data Encryption

• Internal and External Authentication

• Data Auditing


IDENTIFYING BIG DATA OPPORTUNITIES


Big Data Use Cases

• Context for Interactions and Transactions

– Reward Points

– Warranty Policies

– Social media chatter

– Survey response feedback

– Website requests

• Connection with Outside Patterns

– Weather Data

– Demographic Data

– Geographical Data

– Government Compliance Data

• Improving Disaster and Outage Response Times by Spotting Trends

• Compliance Checks and Audits

• Competitive Insights into How Your Products and Services (and your competition’s) are used and perceived in the marketplace.

• Database Infrastructure Behind Mobile and Web Applications


Great Places to Look for Big Data Opportunities

• Server Logs

– Web server and app server logs

– Call center/phone system logs

• Product Data

– Performance data

– Sensor data

– Positional data

– Streamed live or captured in Log files / Data files

• Current RDBMS Archive‐Purge Strategies

– What data are you deleting every day/month/year?

– Financial Data, Operational Data, Customer Interactions


Implementing Big Data

• Identify "Game Changing" Big Data opportunities.

• Define a business case.

• Identify existing business and functional capabilities.

• Augment existing capabilities with 3rd‐party assistance.

• Conduct low‐cost Proof‐of‐Concept project to demonstrate feasibility.


Low Cost Proof‐of‐Concept

• Take advantage of a cloud platform like Amazon Web Services (AWS) and Amazon EC2.

– Run a multi‐node cluster for less than $25/day.

– Get started instantly. Have a cluster up‐and‐running in only a few hours.

– NoSQL technologies are well suited to the cloud deployment model.

– Amazon Machine Images (AMIs) that can be started in just a few minutes exist for most NoSQL products.

– You can make it as secure as you need it to be.
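
To sanity-check a daily budget like the one above before launching anything, a small Python sketch of the cost arithmetic can help. The hourly rate used here is a hypothetical placeholder, not an actual AWS price; substitute the current rate for the instance type you choose.

```python
def daily_cluster_cost(
    node_count: int, hourly_rate_usd: float, hours_per_day: float = 24.0
) -> float:
    """Cost of running node_count servers for hours_per_day at a flat hourly rate."""
    return node_count * hourly_rate_usd * hours_per_day


def max_nodes_within_budget(
    daily_budget_usd: float, hourly_rate_usd: float, hours_per_day: float = 24.0
) -> int:
    """Largest cluster size whose daily cost stays within the budget."""
    return int(daily_budget_usd // (hourly_rate_usd * hours_per_day))


HYPOTHETICAL_RATE = 0.25  # USD per node-hour; an assumption for illustration only

print(daily_cluster_cost(4, HYPOTHETICAL_RATE))
print(max_nodes_within_budget(25.0, HYPOTHETICAL_RATE))
```

Because billing is by running time, lowering `hours_per_day` (for example, running the cluster only during working hours) reduces the cost proportionally.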


Low Cost Proof‐of‐Concept

• Now that you have a cluster up‐and‐running:

– Load up some test data. (Check out Sqoop.)

– Get your HiveQL book in hand and start doing some analysis.

• Delete the servers once you are done. Only pay for the time the servers are running.

• You can always bring the cluster in‐house for production, but you might find out it’s more cost‐effective to leave it in the Cloud!


BIG DATA CASE STUDY

(If we have time)


Client Overview

• Mobile social networking startup

• Focused on families with kids

• Launching in Q1‐2013

• Currently in “Stealth Mode” pending launch the first week of March, 2013

• Big Data Use Case:

– Infrastructure behind mobile app


The Challenge

• Big Data application

– Semi‐structured and unstructured data

• Low latency (<100ms) for user experience

• 24 x 7 high availability

• Cloud deployment (Amazon AWS)

• Analytical capability required


The Solution

• DataStax Enterprise Big Data Database Cluster

– Cassandra database for low‐latency reads and writes

• Cluster architecture for high‐availability

• Tunable read and write consistency

– Integrated Hadoop workload support for analytics

– Integrated Solr workload support for search feature

– DataStax OpsCenter tool for cluster management

• Benefits

– High performance reads and writes = good customer experience

– Only single cluster required for Cassandra, Hadoop and Solr

– Commercial‐grade support

– Cost effective solution

– Fast deployment (30 days)
