Big Data Introduction

(1)

Ralf Lange

(2)

(3)

(4)

(5)

What is Map Reduce

(6)

Basics Of Hadoop

[Diagram labels: In Memory; File 1 Piece 1, Piece 2, Piece 3; Name Node; Data Node; JAR; Map Reduce (shown four times)]

(7)

(8)

Programming Languages

Normal

Hadoop

HCatalog

PIG

(9)

Management

Thread 2 Thread 1 Process 2 Process 1

ZooKeeper

(10)

(11)

(12)

(13)

Oracle Big Data Solution

Oracle BI Foundation Suite

Oracle Real-Time Decisions

Endeca Information Discovery

Decide

Oracle Advanced Analytics

Oracle Database

Oracle Spatial & Graph

Acquire – Organize – Analyze

Oracle Big Data Connectors

Oracle Data Integrator

Stream

Oracle Event Processing

Apache Flume

Oracle GoldenGate

Oracle NoSQL Database

Cloudera Hadoop

Oracle R Distribution

Scalable key-value store

Scalable, low-cost data storage and processing engine

(14)

Massive detail data: Store more raw detail data for less cost, while keeping aggregates in the DB

Big batch jobs: Long-running batch jobs can run in Hadoop to make the most of the DB

Unifying data sources: Many data marts merged in Hadoop to provide unified views of data

Big Data ≠ Unstructured Data

(15)

(16)

(17)

(18)

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Framework for distributed processing

Large Data Sets

Clusters of Computers

Simple Computing Models

Highly Available Service

(19)

What to Pay Attention To



Distributed Storage

– HDFS



Parallel Processing Framework

– MapReduce



Higher-Level Languages

– Hive

– Pig

– Etc.

(20)

HDFS

The Distributed Filesystem



What is it?



Benefits



Limitations

The petabyte-scale distributed file system at the core of Hadoop.



Linearly-scalable on commodity hardware



An order of magnitude cheaper per TB



Designed around schema-on-read



Low security

(21)

Interacting with HDFS



NameNodes and DataNodes

– NameNodes hold the filesystem metadata: the organization of files into blocks and the log of edits

– DataNodes store data



Command-line access resembles UNIX filesystems

– ls (list)

– cat, tail (concatenate or tail file)

– cp, mv (copy or move within HDFS)

– get, put (copy between local file system and HDFS)

(22)

HDFS Mechanics

DataNode

Suppose we have a large file

And a set of DataNodes

(23)

HDFS Mechanics

DataNode

• The file will be broken up into blocks

• Blocks are stored in multiple locations

• Allows for parallelism and fault-tolerance
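The three bullets above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HDFS is implemented: the block size is tiny on purpose, the replication factor of 3 is an assumption for the sketch, the round-robin placement is a stand-in for the real placement policy, and the DataNode names (`dn1` to `dn4`) and helper functions are hypothetical.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (real HDFS blocks are many megabytes)
REPLICATION = 3   # copies kept of every block (assumed value for this sketch)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign every block to `replication` distinct DataNodes, round-robin.

    Round-robin is a simplification chosen for this sketch only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(any(node != failed_node for node in nodes) for nodes in placement.values())


data = b"a large file, in miniature"
datanodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names

blocks = split_into_blocks(data)
placement = place_blocks(len(blocks), datanodes)

for block_id, nodes in placement.items():
    print(block_id, blocks[block_id], nodes)
```

Because each block lives on several nodes, different blocks can be read in parallel from different machines, and losing any single DataNode leaves every block readable.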

(24)

MapReduce

The Parallel Processing Framework



What is it?



Benefits



Limitations

The parallel processing framework that dominates the Big Data landscape.



Provides data-local computation



Fault-tolerant



Scales just like HDFS



You are the optimizer



Quasi-functional model is counterintuitive

(25)

MapReduce Mechanics

Suppose 3 face cards are removed.

How do we find which suits are short using MapReduce?

(26)

MapReduce Mechanics

Map Phase:

Each TaskTracker has some data local to it.

Map tasks operate on this local data.

If face_card: emit(suit, card)

TaskTracker/DataNode

(27)

MapReduce Mechanics

Shuffle/Sort:

Intermediate data is shuffled and sorted for delivery to the reduce tasks

Sort

(28)

MapReduce Mechanics

Reduce Phase:

Reducers operate on local data to produce final result

emit(key, count(key))

TaskTracker
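The three phases of the card example can be simulated on a single machine. The sketch below is plain Python, not Hadoop code: the function names, the three-way split of the deck standing in for three TaskTrackers, and the choice of which three face cards are missing are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

SUITS = ["clubs", "diamonds", "hearts", "spades"]
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
FACE_RANKS = {"J", "Q", "K"}


def map_task(cards):
    """Map phase: if face_card, emit(suit, card)."""
    for rank, suit in cards:
        if rank in FACE_RANKS:
            yield suit, rank


def shuffle_sort(pairs):
    """Shuffle/sort: group the intermediate values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce phase: emit(key, count(key))."""
    return key, len(values)


# Build a deck and remove three face cards (which ones is an arbitrary choice).
deck = [(rank, suit) for suit, rank in product(SUITS, RANKS)]
removed = {("K", "hearts"), ("Q", "hearts"), ("J", "spades")}
deck = [card for card in deck if card not in removed]

# Pretend the deck is spread over three nodes, each mapping its local share.
partitions = [deck[i::3] for i in range(3)]
mapped = [pair for part in partitions for pair in map_task(part)]

counts = dict(reduce_task(key, values) for key, values in shuffle_sort(mapped).items())

# A full suit has 3 face cards; anything below that is short.
short_suits = {s: 3 - counts.get(s, 0) for s in SUITS if counts.get(s, 0) < 3}
print(short_suits)
```

Note that a suit with no face cards left would never be emitted by any map task, which is why the final step looks suits up with a default of 0 instead of iterating over the reducer output alone.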

(29)

Hive

A move toward declarative language



What is it?



Benefits



Limitations

A SQL-like language for Hadoop.



Abstracts MapReduce code



Schema-on-read via InputFormat and SerDe



Provides and preserves metadata



Not ideal for ad hoc work (slow)



Subset of SQL-92

(30)

Storing a Clickstream



Storing large amounts of

clickstream data is a

common use for HDFS



Individual clicks aren’t valuable by themselves



We’d like to write queries

over all clicks

(31)

Defining Tables Over HDFS



Hive allows us to define

tables over HDFS

directories



The syntax is simple SQL



SerDes allow Hive to

deserialize data
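The idea of laying a table definition over raw files and deserializing only at read time can be illustrated without Hive. The sketch below is plain Python under stated assumptions: the tab-separated layout, the column names, the sample clicks and the `deserialize` helper are all invented for illustration and do not reflect any real clickstream format or the Hive SerDe API.

```python
import csv
import io

# Raw clickstream lines as they might sit in a file: no schema is stored with them.
# The layout (timestamp, user, url, status) and the values are made up for this sketch.
RAW_CLICKS = (
    "2013-05-01T10:00:00\tu1\t/home\t200\n"
    "2013-05-01T10:00:05\tu2\t/cart\t200\n"
    "2013-05-01T10:01:00\tu1\t/cart\t500\n"
)

# The "table definition": column names and types, applied only when reading.
CLICKS_SCHEMA = [("ts", str), ("user_id", str), ("url", str), ("status", int)]


def deserialize(raw_text, schema, delimiter="\t"):
    """Play the role of a deserializer: turn raw lines into typed rows."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(deserialize(RAW_CLICKS, CLICKS_SCHEMA))

# A query over all clicks: number of clicks per URL.
clicks_per_url = {}
for row in rows:
    clicks_per_url[row["url"]] = clicks_per_url.get(row["url"], 0) + 1

print(clicks_per_url)
```

The raw data is never rewritten: a different schema or a different deserializer could be pointed at the same bytes later, which is the practical meaning of schema-on-read.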

(32)

How Does It Work

Anatomy of a Hive Query

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

How does Hive execute

this query?

(33)

Anatomy of a Hive Query

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

1. Hive optimizer builds a MapReduce Job

2. Projections and predicates become Map code

3. Aggregations become Reduce code

4. Job is submitted to MapReduce JobTracker

Map task: if face_card: emit(suit, card)

Shuffle

Reduce task: emit(suit, count(suit))
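To make the translation concrete, the sketch below runs the same query text against an in-memory SQLite table and compares it with a hand-written map and reduce. SQLite is used only as a stand-in so the example is runnable; it is not Hive and does not compile anything to MapReduce. The numbering of face values (ace as 1, jack, queen and king as 11 to 13) and the three removed cards are assumptions.

```python
import sqlite3
from collections import Counter

# Assumed encoding: face_value 1..13, with 11, 12, 13 for jack, queen, king.
cards = [(suit, value)
         for suit in ("clubs", "diamonds", "hearts", "spades")
         for value in range(1, 14)]
for missing in [("hearts", 13), ("hearts", 12), ("spades", 11)]:
    cards.remove(missing)

# Declarative version: the query from the slide, run on SQLite as a stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (suit TEXT, face_value INTEGER)")
conn.executemany("INSERT INTO cards VALUES (?, ?)", cards)
sql_result = dict(conn.execute(
    "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit"
))
conn.close()


def map_task(rows):
    """Predicate (WHERE face_value > 10) and projection (SELECT suit)."""
    for suit, face_value in rows:
        if face_value > 10:
            yield suit


def reduce_task(keys):
    """Aggregation (COUNT(*) ... GROUP BY suit)."""
    return dict(Counter(keys))


mr_result = reduce_task(map_task(cards))
print(sql_result == mr_result, mr_result)
```

The filter and column selection need only one row at a time, so they fit in the map step; the count needs all rows for a key together, so it has to wait until after the shuffle.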

(34)

Using Hadoop To Optimize IT

(35)

Big Data and Optimized Operations

• Big Data can handle a lot of heavy lifting

– It’s a complement to the database

• Big Data allows access to more detail data for less

(36)

Optimizing ETL, Saving SLAs

Mission Critical Reporting

Ad Hoc Analysis

Long-running batch transformation

Big Data Problem

Base Table

Copy/Move Base Table to Hadoop

Load to Oracle

Long-running batch transformation

(37)

Big Data Problem

Store More Details For Less

Base Table

Aggregation

Reporting Table

(38)

Using Hadoop To Build New Datasets

(39)

What Does a Big Data World Look Like?

Truck / Motor Manufacturer

Collections

Internal sensors

Miles Per Gallon, Driving techniques

Location information

Uses

Better tailored servicing plans

Better targeted marketing

Offer better finance deals or related options

More data for R&D

Sell on to partners

(40)

Big Data and Analytics



Big Data does not make analytics easier

– There is no magic bullet



Some things work better in a database



Big Data allows the collection of new datasets

(41)

No Magic Bullets



Food monitoring by RFID tags → Fridge monitors food usage and sell-by dates

Monitor the complete car → Better targeted marketing

There is a gap between

– The available dataset

– The value proposition

(42)

Some Things Work Better in RDBMS

• Time Series Analysis

• Spatial Analysis

• Linear and Nonlinear Modeling

• Interaction with SAS and R

• Clustering on massive data

• Fine-grained classification

• Dataset construction

• Deploying models on many subgroups

(43)

Collecting New Datasets

The Complete Car

Minute-by-minute MPH

GPS Readings

On-board Vehicle Diagnostics

How does the customer drive?

Where does the customer drive?

How do we maximize their value?

Trip (Location and Speed)

Vehicle Usage Report

Big Data Problem

(44)

More Granular Modeling

Testing Trip Dynamics

Analyst

New Model for Maintenance Alerts

Test and Summarize On All Engine Readings

Aggregated Test Results

Big Data Problem

(45)

Fitting Fat Tails

Modeling “outlying” customers

Analyst

Significant value may exist in the tails

Parallelized Locally-weighted Linear Regression

Model for all data

Big Data Problem
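One way to see why locally-weighted linear regression parallelizes is that, for a single query point, the fit depends only on a handful of weighted sums, and sums can be computed per partition and then added. The sketch below shows this for one input variable in plain Python. It is an illustration under assumptions, not the method used in the deck: the Gaussian kernel, the bandwidth `tau`, the function names and the toy data are all chosen here for the example.

```python
import math


def partial_sums(points, x0, tau):
    """Map step: weighted sums for one partition of (x, y) points around x0."""
    sw = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        w = math.exp(-((x - x0) ** 2) / (2 * tau ** 2))  # Gaussian kernel weight
        sw += w
        sx += w * x
        sy += w * y
        sxx += w * x * x
        sxy += w * x * y
    return (sw, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: partial sums from different partitions simply add up."""
    return tuple(p + q for p, q in zip(a, b))


def predict(partitions, x0, tau=1.0):
    """Fit a weighted straight line around x0 and evaluate it at x0."""
    total = (0.0, 0.0, 0.0, 0.0, 0.0)
    for part in partitions:
        total = combine(total, partial_sums(part, x0, tau))
    sw, sx, sy, sxx, sxy = total
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x0


# Toy data on an exact line y = 2x + 1, spread over three partitions.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
partitions = [points[i::3] for i in range(3)]

estimate = predict(partitions, x0=4.5)
print(estimate)
```

Because nearby points dominate each fit, customers far from the bulk of the data get a model shaped by their own neighbourhood instead of by the average customer, which is the appeal for modeling the tails.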

(46)

Q&A

(47)

(48)