
So What’s the Big Deal?


NCOAUG Training Day – February 22, 2013

Presentation Agenda

• Introduction

• What is Big Data?

• So What is the Big Deal?

• Big Data Technologies

• Identifying Big Data Opportunities

• Conducting a Big Data Proof‐of‐Concept

• Big Data Case Study (if we have time)

• Q&A

• Links to More Information


Introduction

• RhinoSource, Inc.

– Oracle App/Tech Consulting and Managed Services

• Oracle E‐Business Suite

• Oracle Business Intelligence

• Oracle Database (performance, partitioning and replication)

• Application Development and Advanced PL/SQL Development

– Advanced Technology Consulting and Managed Services

• Big Data

• Mobile Applications

• Cloud Computing

– CIO‐Level Advisory Services

• IT Strategy, Planning and Project Management

• ERP/CRM Evaluation and Implementation


WHAT IS BIG DATA?


What Makes Up Big Data?

• Blog posts, user comments

• Emails and Messaging

• Web server logs

• Instrumentation of online stores

• Image and video uploads

• Process data, such as RFID

• Sensor device data

• External data sets

– Census data

– Weather data

– Geographical data

• “Shadow Data” (replicated copies and change journals)


The 3 V’s of Big Data

• Velocity

– Data generated at a faster rate than ever before

– Server logs, smart phones, sensor devices, RFID

– Challenge: Existing systems cannot process new data fast enough

• Variety

– Data more varied and complex

– Structured and unstructured

– Many formats: text, document, image, video

– Challenge: Existing databases do not handle varying data formats well

• Volume

– Orders of magnitude larger

– 2.5 Zettabytes of new data created in 2012

– 8 Zettabytes of new data projected to be created in 2015

• 3 Billion Internet users, 15 Billion connected devices

– Challenge: Existing databases do not cost‐effectively scale to Big Data sizes


Big Data Growth Trend

[Chart: data volume in zettabytes over time, growing at a 40% CAGR]
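A compound annual growth rate can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 40% CAGR applies to the 2.5 zettabytes of new data quoted for 2012 and compounds once per year. The function names are hypothetical.

```python
def project(start_value, annual_rate, years):
    """Return the compounded values for year 0 through `years`."""
    return [start_value * (1 + annual_rate) ** n for n in range(years + 1)]


def implied_cagr(start_value, end_value, years):
    """Return the annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: 2.5 ZB of new data in 2012, compounding at 40% per year.
yearly_zb = project(2.5, 0.40, 3)  # 2012 through 2015
for year, zb in zip(range(2012, 2016), yearly_zb):
    print(f"{year}: {zb:.2f} ZB")
print(f"Cumulative 2012-2015: {sum(yearly_zb):.2f} ZB")

# Growth rate implied by going from 2.5 ZB (2012) to 8 ZB (2015).
print(f"Implied CAGR: {implied_cagr(2.5, 8.0, 3):.1%}")
```

Under these assumptions, a 40% rate gives a little under 7 ZB for 2015 and close to 18 ZB cumulatively for 2012 through 2015, while reaching 8 ZB in 2015 would imply a rate nearer 47%.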


How Much is 1 Zettabyte?
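As a rough sense of scale, a zettabyte can be converted into more familiar units. The sketch uses decimal (SI) prefixes, where 1 ZB = 10^21 bytes. The `convert` helper is hypothetical and is not taken from the original slide.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount between two decimal storage units."""
    return amount * UNITS[from_unit] / UNITS[to_unit]


# One zettabyte is one billion terabytes, or one million petabytes.
print(f"1 ZB = {convert(1, 'ZB', 'TB'):,.0f} TB")
print(f"1 ZB = {convert(1, 'ZB', 'PB'):,.0f} PB")
```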


New Data by End of 2015

17 ZB of new data by the end of 2015!


SO, WHAT’S THE BIG DEAL?


“Can Big Data Show Us The Way?”

• Scientific American, Dec ’11:

"...the rise of 'big data' [is] a trend that is striking many scientists as being on a par with the invention of the telescope and microscope."

"...many experts believe we are on the cusp of opening up new worlds of inquiry."


Big Data Advantages

• Better, more accurate predictions

• Deeper, richer insights into customers, business partners and operations

• Real‐time Big Data analytics enables faster decision‐making

• Creates competitive advantage

• Improves bottom line


Big Data Spending

• Companies have spent $4.3 billion on Big Data as of the end of 2012.

• Gartner predicts those initial investments will in turn trigger a domino effect of upgrades and new initiatives

– Valued at $34 billion for 2013, per Gartner.

– Over a 5‐year period, spend is estimated at $232 billion.


BIG DATA TECHNOLOGIES


A Brief History of Big Data

[Figure: scale plotted against time, progressing from RDBMS to Data Warehouse to RAC to distributed Big Data clusters]


How Do We Store Big Data?

• NoSQL databases store data records as key‐value pairs, or as triplets with a timestamp.

• Schema‐less or “schema‐optional”

– Values may be structured or unstructured (developer’s choice).

• Not relational

– No relationships between records.

– No join support in a NoSQL database.

– Does not use SQL to store and retrieve records.

• Highly optimized for retrieval and appending operations.

– High performance writes.

– High performance retrieval by primary key.

– Little functionality beyond record storage and retrieval.

• Highly scalable to huge amounts of data

– Millions or Billions of records

– Partition data across many distributed, inexpensive servers for cost‐effective scalability and availability

• Must trade off Availability against Consistency (CAP Theorem).
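The storage model described above can be sketched in a few lines. This is a toy, in-memory illustration rather than the design of any real NoSQL product. It assumes records are (key, value, timestamp) triplets, that keys are spread across nodes by hashing, and that the newest timestamp wins on conflicting writes. The class and method names are hypothetical.

```python
import hashlib
import time


class TinyKeyValueStore:
    """Toy key-value store that partitions keys across in-memory 'nodes'."""

    def __init__(self, node_count=3):
        # Each node is just a dict mapping key -> (value, timestamp).
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so records spread evenly across the nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        node = self._node_for(key)
        current = node.get(key)
        # Last write wins: keep the value with the newest timestamp.
        if current is None or ts >= current[1]:
            node[key] = (value, ts)

    def get(self, key):
        # Retrieval is by primary key only; there are no joins or scans.
        entry = self._node_for(key).get(key)
        return None if entry is None else entry[0]


store = TinyKeyValueStore()
store.put("user:1", {"name": "Ann"}, timestamp=2)
store.put("user:1", {"name": "Stale"}, timestamp=1)  # older write is ignored
store.put("photo:9", b"raw image bytes", timestamp=3)  # unstructured value
print(store.get("user:1"))
```

Values are opaque to the store, which is why structured and unstructured records can sit side by side, and why anything beyond get-by-key has to be built in the application.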


Popular NoSQL Databases

• Key‐Value Stores

• Column‐Oriented Databases

• Graph Databases

• Document Databases


Why Not Relational for Big Data?

• Transforming and loading data into RDBMS requires extensive pre‐processing of data into a pre‐defined schema

– Doesn’t work well for semi‐structured and unstructured data

– Can take more time than is available before next batch must be loaded

• Joining multiple data sets at query time is an expensive operation

• RDBMS scaling must be done vertically to larger and more expensive servers and storage solutions

• RDBMS clustering requires expensive networking and shared storage infrastructures

– Fibre Channel, InfiniBand, SAN, NAS

• Challenging to distribute data across data centers

– Replication strategies are “add‐ons” and complex

• Strict Consistency requirement is enforced at the cost of write performance and availability (CAP Theorem)


Dr. Brewer’s CAP Theorem

“Pick 2” of Consistency (C), Availability (A) and Partition tolerance (P):

• CP: BigTable, Hadoop/HBase, MongoDB, Oracle NoSQL, Redis

• AP: Cassandra, CouchDB, Dynamo, Riak, SimpleDB

• CA: RDBMS, Oracle RAC


Scalability Comparison

(Logarithmic Scale) 

[Chart: terabytes stored and server nodes for each platform]

• Hadoop: 21 PB, 2000 nodes at Facebook

• Cassandra: 300 TB, 400 nodes at Digital Reasoning

• Oracle RAC: 71 TB, 48 nodes at Amazon

• MongoDB: 10 TB, 100 nodes at CraigsList
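One way to read this comparison is as data density per server. The sketch below is only a rough illustration: it assumes the chart labels pair with the platforms in the order Hadoop, Cassandra, Oracle RAC, MongoDB, and it treats 1 PB as 1,000 TB.

```python
# (platform, terabytes, nodes, site) as read from the chart labels.
# Assumption: 1 PB is treated as 1,000 TB.
DEPLOYMENTS = [
    ("Hadoop", 21_000, 2000, "Facebook"),
    ("Cassandra", 300, 400, "Digital Reasoning"),
    ("Oracle RAC", 71, 48, "Amazon"),
    ("MongoDB", 10, 100, "CraigsList"),
]


def tb_per_node(terabytes, nodes):
    """Average terabytes held by each server node."""
    return terabytes / nodes


for platform, terabytes, nodes, site in DEPLOYMENTS:
    density = tb_per_node(terabytes, nodes)
    print(f"{platform:<11} {density:6.2f} TB/node ({site})")
```

Density alone says nothing about hardware size or workload, so it should be read alongside the node counts rather than as a ranking.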


Scalability Comparison

(Linear Scale)

[Chart: the same terabyte and server‐node figures for MongoDB, Oracle RAC, Cassandra and Hadoop, plotted on a linear scale from 0 to 25,000]


Feature Comparison

Cassandra

• Best of the NoSQLs for Cross‐Data Center Replication and High‐Availability

• Known to scale to 100’s of Terabytes (but theoretically to Petabytes)

• Tunable Consistency at operation‐level for writes and reads. Availability model (AP).

• Primary and Secondary Indexes

• Queries are Real‐Time (CQL, Thrift)

• No Join Support

• Masterless Peer‐to‐Peer Ring Architecture = No Single Point of Failure (S.P.O.F.)

• Provides most cost‐effective HA and scalability of the NoSQLs

• Written in Java

• Minimum of 3 nodes recommended.

• Easy to install and set up on commodity hardware.
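Tunable consistency is usually reasoned about with the quorum rule: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The sketch below illustrates that general rule only; it is not Cassandra's API. The function names and the replication factor of 3 are assumptions.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


N = 3  # assumed replication factor
Q = quorum(N)
print(f"Quorum for N={N}: {Q}")
print("Quorum writes + quorum reads:", reads_see_latest_write(N, Q, Q))
print("Single-replica writes + reads:", reads_see_latest_write(N, 1, 1))
```

Choosing smaller R and W per operation trades that overlap guarantee for lower latency and higher availability, which is the AP side of the trade‐off.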

Hadoop/HBase

• The current “Gold Standard” of the NoSQLs for Data Analysis

• Known to scale to Petabytes (1000’s of Terabytes)

• Consistency model (CP)

• Hadoop Queries are Batch (MapReduce). HBase provides real‐time queries similar to Cassandra.

• Joins are Possible

• Master‐Slave Architecture = S.P.O.F. (NameNode/JobTracker node)

• Written in Java

• Minimum of 5 nodes recommended.

• More challenging installation and setup.

• Warm Standby and Shared Storage Required for High‐Availability Failover, so higher infrastructure costs.
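The batch MapReduce model mentioned above can be illustrated without a cluster. The sketch below runs the map, shuffle and reduce phases in a single Python process on a word count. It does not use Hadoop itself, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big deal", "big data cluster"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network, which is why such jobs are batch rather than real‐time.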


Best of All Worlds

• DataStax Enterprise

– Cassandra: Real‐Time Database

• Peer‐to‐Peer HA Architecture

• Cross‐Data Center Replication

• Real‐Time, Low‐Latency Queries

– Hadoop: Analytics

• Map/Reduce, Hive, Pig (Joins)

– Solr: Search

• Full‐Text Search

• Rich Document Handling (Word, PDF)


Plus Cluster Management


Current Big Data Challenges

• Integrating Big Data with existing databases and BI/reporting systems.

– JDBC, ODBC

– Sqoop

• Security and Encryption

– DataStax Enterprise 3.0 (In Beta)

• Transparent Data Encryption

• Internal and External Authentication

• Data Auditing


IDENTIFYING BIG DATA OPPORTUNITIES


Big Data Use Cases

• Context for Interactions and Transactions

– Reward Points

– Warranty Policies

– Social media chatter

– Survey response feedback

– Website requests

• Connection with Outside Patterns

– Weather Data

– Demographic Data

– Geographical Data

– Government Compliance Data

• Improving Disaster and Outage Response Times by Spotting Trends

• Compliance Checks and Audits

• Competitive Insights into how your products and services (and your competition’s) are used and perceived in the marketplace.

• Database Infrastructure Behind Mobile and Web Applications


Great Places to Look for Big Data Opportunities

• Server Logs

– Web server and app server logs

– Call center/phone system logs

• Product Data

– Performance data

– Sensor data

– Positional data

– Streamed live or captured in Log files / Data files

• Current RDBMS Archive‐Purge Strategies

– What data are you deleting every day/month/year?

– Financial Data, Operational Data, Customer Interactions


Implementing Big Data

• Identify "Game Changing" Big Data opportunities.

• Define a business case.

• Identify existing business and functional capabilities.

• Augment existing capabilities with 3rd‐party assistance.

• Conduct a low‐cost Proof‐of‐Concept project to demonstrate feasibility.


Low Cost Proof‐of‐Concept

• Take advantage of a cloud platform like Amazon Web Services (AWS) and Amazon EC2.

– Run a multi‐node cluster for less than $25/day.

– Get started instantly. Have a cluster up‐and‐running in only a few hours.

– NoSQL technologies are perfectly suited for the cloud deployment model.

– Amazon Machine Images (AMIs) that can be started in just a few minutes exist for most NoSQL products.

– You can make it as secure as you need it to be.
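The daily cost of a proof‐of‐concept cluster is simple arithmetic: nodes × hourly rate × 24. The sketch below assumes a 3‐node cluster and a placeholder rate of $0.25 per node‐hour. Both inputs are hypothetical, so substitute the provider's current pricing before relying on the result.

```python
HOURS_PER_DAY = 24


def daily_cost(node_count, hourly_rate_usd):
    """Cost of running every node around the clock for one day."""
    return node_count * hourly_rate_usd * HOURS_PER_DAY


def max_hourly_rate(node_count, daily_budget_usd):
    """Highest per-node hourly rate that still fits the daily budget."""
    return daily_budget_usd / (node_count * HOURS_PER_DAY)


# Hypothetical inputs: a 3-node cluster at $0.25 per node-hour.
print(f"Daily cost: ${daily_cost(3, 0.25):.2f}")
print(f"Break-even rate for $25/day: ${max_hourly_rate(3, 25):.4f}/hour")
```

Storage and data transfer are billed separately on most cloud platforms and are not included in this estimate.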


Low Cost Proof‐of‐Concept

• Now that you have a cluster up‐and‐running:

– Load up some test data. (Check out Sqoop.)

– Get your HiveQL book in hand and start doing some analysis.

• Delete the servers once you are done. Only pay for the time the servers are running.

• You can always bring the cluster in‐house for production, but you might find out it’s more cost‐effective to leave it in the Cloud!


BIG DATA CASE STUDY

(If we have time)


Client Overview

• Mobile social networking startup

• Focused on families with kids

• Launching in Q1‐2013

• Currently in “Stealth Mode” pending launch in the first week of March 2013

• Big Data Use Case:

– Infrastructure behind mobile app 


The Challenge

• Big Data application

– Semi‐structured and unstructured data

• Low latency (<100ms) for user experience

• 24 x 7 high availability

• Cloud deployment (Amazon AWS)

• Analytical capability required


The Solution

• DataStax Enterprise Big Data Database Cluster

– Cassandra database for low‐latency reads and writes

• Cluster architecture for high‐availability

• Tunable read and write consistency

– Integrated Hadoop workload support for analytics

– Integrated Solr workload support for search feature

– DataStax OpsCenter tool for cluster management

• Benefits

– High performance reads and writes = good customer experience

– Only a single cluster required for Cassandra, Hadoop and Solr

– Commercial‐grade support

– Cost effective solution

– Fast deployment (30 days)


Technical Details

• Installed DataStax Enterprise 2.2.1 on Amazon AWS

– 3 x m1.large nodes

– Will double to 6 nodes later in the year

– Each node will hold ~800GB of data

• Implemented monitoring and alerts

– Cluster stats collected every 15 seconds

– Stats stored in db and graphed

– Amazon SNS for notifications (email and SMS)
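The alerting side of such monitoring can be sketched as a threshold check over the collected samples. This is an illustrative outline, not the implementation used on the project. It assumes one sample per 15‐second collection, fires after four consecutive breaches (about a minute), and takes the notifier as a plain callback standing in for a service such as Amazon SNS. All names and sample values are hypothetical.

```python
def breaches(samples, threshold, consecutive=4):
    """Return the indexes at which an alert should fire.

    An alert fires once when `consecutive` samples in a row exceed the
    threshold, and cannot fire again until a sample drops back below it.
    """
    fired_at = []
    run_length = 0
    for index, value in enumerate(samples):
        if value > threshold:
            run_length += 1
            if run_length == consecutive:
                fired_at.append(index)
        else:
            run_length = 0
    return fired_at


def check_and_notify(samples, threshold, notify, consecutive=4):
    """Call `notify(message)` once per alert; the caller supplies `notify`."""
    for index in breaches(samples, threshold, consecutive):
        notify(f"Threshold {threshold} exceeded for {consecutive} samples (sample {index})")


# Hypothetical read-latency samples in ms, one per 15-second collection.
latency_ms = [40, 120, 130, 125, 140, 150, 60, 110]
sent = []
check_and_notify(latency_ms, threshold=100, notify=sent.append)
print(sent)
```

Requiring several consecutive breaches keeps a single slow sample from paging anyone, at the cost of a short delay before the first notification.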


Amazon AWS and EC2


OpsCenter Cluster Management


Cluster “Ring View”


Performance Monitoring


Customized Dashboards


More Custom Dashboards 


Q&A


More Reading

• www.rhinosource.com/bigdata.html


Thank you!
