
1 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

Insert Information Protection Policy Classification from Slide 8

Big Data Introduction

Ralf Lange


What is Map Reduce


Basics Of Hadoop

[Diagram labels: In Memory; File 1 split into Piece 1, Piece 2 and Piece 3; Name Node; four Data Nodes; JAR; Map Reduce (four instances)]


Programming Languages

Normal

Hadoop

HCatalog

Pig


Management

Thread 2 Thread 1 Process 2 Process 1

ZooKeeper


Copyright © 2013, Oracle and/or its affiliates. All rights reserved. 13

Oracle Big Data Solution

Stages: Stream, Acquire – Organize – Analyze, Decide

Components
– Oracle BI Foundation Suite
– Oracle Real-Time Decisions
– Endeca Information Discovery
– Oracle Advanced Analytics
– Oracle Database
– Oracle Spatial & Graph
– Oracle Big Data Connectors
– Oracle Data Integrator
– Oracle Event Processing
– Apache Flume
– Oracle GoldenGate
– Oracle NoSQL Database
– Cloudera Hadoop
– Oracle R Distribution

Callouts
– Scalable key-value store
– Scalable, low-cost data storage and processing engine


Big Data ≠ Unstructured Data

Massive detail data
– Store more raw detail data for less cost, while keeping aggregates in the DB

Big batch jobs
– Long-running batch jobs can run in Hadoop to make the most of the DB

Unifying data sources
– Many data marts merged in Hadoop to provide unified views of data


Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Framework for distributed processing

Large Data Sets

Clusters of Computers

Simple Computing Models

Highly Available Service


What to Pay Attention To

Distributed Storage

– HDFS

Parallel Processing Framework

– MapReduce

Higher-Level Languages

– Hive

– Pig

– Etc.


HDFS

The Distributed Filesystem

What is it?

Benefits

Limitations

The petabyte-scale distributed file system at the core of Hadoop.

Linearly-scalable on commodity hardware

An order of magnitude cheaper per TB

Designed around schema-on-read

Low security


Interacting with HDFS

NameNodes and DataNodes

– NameNodes hold the filesystem metadata (the edit log and the file-to-block organization)

– DataNodes store data

Command-line access resembles UNIX filesystems

– ls (list)

– cat, tail (concatenate or tail file)

– cp, mv (copy or move within HDFS)

– get, put (copy between local file system and HDFS)


HDFS Mechanics

Suppose we have a large file and a set of DataNodes.

[Diagram: six DataNodes]


The file will be broken up into blocks

Blocks are stored in multiple locations

Allows for parallelism and fault-tolerance
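
The block mechanics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block size, the replication factor of 3 and the round-robin placement are chosen for the example, and the function names are hypothetical. Real HDFS placement is rack-aware and is decided by the NameNode.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct DataNodes.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], failed: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(any(node != failed for node in nodes) for nodes in placement.values())
```

With six DataNodes and three replicas per block, losing any single node leaves every block readable, and different blocks can be read from different nodes at the same time.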


MapReduce

The Parallel Processing Framework

What is it?

Benefits

Limitations

The parallel processing framework that dominates the Big Data landscape.

Provides data-local computation

Fault-tolerant

Scales just like HDFS

You are the optimizer

Quasi-functional model is counterintuitive


MapReduce Mechanics

Suppose 3 face cards are removed from a deck.

How do we find which suits are short using MapReduce?


Map Phase:

Each TaskTracker has some data local to it.

Map tasks operate on this local data.

If face_card: emit(suit, card)

[Diagram: three nodes, each labeled TaskTracker/DataNode]


Shuffle/Sort:

Intermediate data is shuffled and sorted for delivery to the reduce tasks


Reduce Phase:

Reducers operate on local data to produce final result

emit(key, count(key))

[Diagram: four TaskTrackers]
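
The three phases can be simulated in a single Python process. This is a minimal sketch, not Hadoop code: the card encoding, the function names and the way the deck is split across nodes are all assumptions made for illustration.

```python
from collections import defaultdict

SUITS = ["clubs", "diamonds", "hearts", "spades"]
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
FACE_RANKS = {"J", "Q", "K"}


def map_task(local_cards):
    """Map: emit (suit, card) for each face card in this node's local data."""
    return [(suit, (rank, suit)) for rank, suit in local_cards if rank in FACE_RANKS]


def shuffle_sort(pairs):
    """Shuffle/Sort: group intermediate values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce: emit (key, count(key))."""
    return key, len(values)


def short_suits(splits, expected_per_suit=3):
    """Run map, shuffle and reduce, then report how many face cards each suit lacks."""
    mapped = [pair for split in splits for pair in map_task(split)]
    counts = dict(reduce_task(k, v) for k, v in shuffle_sort(mapped).items())
    return {
        suit: expected_per_suit - counts.get(suit, 0)
        for suit in SUITS
        if counts.get(suit, 0) < expected_per_suit
    }
```

For example, removing the king and queen of hearts and the jack of spades from a full deck, then dealing the remaining cards across three splits, yields `{"hearts": 2, "spades": 1}`.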


Hive

A move toward declarative language

What is it?

Benefits

Limitations

A SQL-like language for Hadoop.

Abstracts MapReduce code

Schema-on-read via InputFormat and SerDe

Provides and preserves metadata

Not ideal for ad hoc work (slow)

Subset of SQL-92


Storing a Clickstream

Storing large amounts of clickstream data is a common use for HDFS

Individual clicks aren’t valuable by themselves

We’d like to write queries over all clicks


Defining Tables Over HDFS

Hive allows us to define tables over HDFS directories

The syntax is simple SQL

SerDes allow Hive to deserialize data
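
The idea of schema-on-read can be illustrated with a small Python sketch: the raw lines stay untyped in storage, and column names and types are applied only when a row is read. This is not Hive's SerDe API. The clickstream columns (`user_id`, `url`, `ts`), the tab delimiter and the function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical table definition: (column name, parser applied at read time).
Schema = list[tuple[str, Callable[[str], object]]]

CLICKS_SCHEMA: Schema = [("user_id", int), ("url", str), ("ts", int)]


def deserialize(line: str, schema: Schema, delimiter: str = "\t") -> dict:
    """Turn one raw text line into a typed row, in the spirit of a SerDe."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: parse(raw) for (name, parse), raw in zip(schema, fields)}


def scan(lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Apply the schema lazily to every line, as a table scan over a directory would."""
    for line in lines:
        yield deserialize(line, schema)
```

Because the schema lives beside the data rather than inside it, the same raw lines could later be read with a different schema without rewriting the files.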


How Does It Work

Anatomy of a Hive Query

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

How does Hive execute this query?


1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker

Map task: if face_card: emit(suit, card)

Shuffle

Reduce task: emit(suit, count(suit))
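
To make the translation concrete, the sketch below runs the same query two ways and shows that the results agree. It is an illustration, not Hive: SQLite from Python's standard library stands in for the SQL engine, the sample rows are invented, and encoding face cards as `face_value` 11 to 13 is an assumption.

```python
import sqlite3
from collections import Counter

# Invented sample rows: (suit, face_value), with J, Q, K assumed to be 11, 12, 13.
CARDS = [
    ("hearts", 11), ("hearts", 12), ("hearts", 3),
    ("spades", 13), ("spades", 5), ("clubs", 10),
]


def run_sql(rows):
    """Declarative version: let a SQL engine plan and execute the query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cards (suit TEXT, face_value INTEGER)")
    con.executemany("INSERT INTO cards VALUES (?, ?)", rows)
    result = con.execute(
        "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit"
    ).fetchall()
    con.close()
    return dict(result)


def run_map_reduce(rows):
    """Hand-written version of the plan described in steps 2 and 3."""
    # Map: the predicate (WHERE) filters rows, the projection keeps only the key.
    mapped = [suit for suit, face_value in rows if face_value > 10]
    # Shuffle and Reduce: group identical keys and count them (GROUP BY, COUNT).
    return dict(Counter(mapped))
```

Both functions return `{"hearts": 2, "spades": 1}` for the sample rows, which is the sense in which the query "becomes" a Map step and a Reduce step.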


Using Hadoop To Optimize IT


Big Data and Optimized Operations

Big Data can handle a lot of heavy lifting

– It’s a complement to the database

Big Data allows access to more detail data for less


Optimizing ETL, Saving SLAs

[Diagram labels]

Mission Critical Reporting

Ad Hoc Analysis

Long-running batch transformation

Big Data Problem

Base Table

Copy/Move Base Table to Hadoop

Load to Oracle

Long-running batch transformation


Big Data Problem

Store More Details For Less

Base Table

Aggregation

Reporting Table


Using Hadoop To Build New Datasets


What Does a Big Data World Look Like?

Truck / Motor Manufacturer

Collections
– Internal sensors
– Miles Per Gallon, Driving techniques
– Location information

Uses
– Better tailored servicing plans
– Better targeted marketing
– Offer better finance deals or related options
– More data for R&D
– Sell on to partners


Big Data and Analytics

Big Data does not make analytics easier

There is no magic bullet

Some things work better in a database

Big Data allows the collection of new datasets


No Magic Bullets

Food monitoring by RFID tags → Fridge monitors food usage and sell-by dates

Monitor the complete car → Better targeted marketing

There is a gap between
– The available dataset
– The value proposition


Some Things Work Better in RDBMS

• Time Series Analysis
• Spatial Analysis
• Linear and Nonlinear Modeling
• Interaction with SAS and R
• Clustering on massive data
• Fine-grained classification
• Dataset construction
• Deploying models on many subgroups


Collecting New Datasets

The Complete Car

Minute-by-minute MPH

GPS Readings

On-board Vehicle Diagnostics

How does the customer drive?

Where does the customer drive?

How do we maximize their value?

Trip (Location and Speed)

Vehicle Usage Report

Big Data Problem


More Granular Modeling

Testing Trip Dynamics

Analyst

New Model for Maintenance Alerts

Test and Summarize On All Engine Readings

Aggregated Test Results

Big Data Problem


Fitting Fat Tails

Modeling “outlying” customers

Analyst

Significant value may exist in the tails

Parallelized Locally-weighted Linear Regression

Model for all data

Big Data Problem
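
For readers unfamiliar with locally-weighted linear regression, here is a minimal one-variable sketch in plain Python. It is an illustration only and not the method used in the slides: the Gaussian kernel, the bandwidth parameter `tau` and the function name are assumptions. Each query point gets its own weighted least-squares line, so predictions for different points are independent of one another and could be computed in parallel.

```python
import math


def lwlr_predict(x0: float, xs: list[float], ys: list[float], tau: float = 1.0) -> float:
    """Predict y at x0 from a line fitted with weights that decay away from x0.

    Assumes the xs are not all identical, otherwise the slope is undefined.
    """
    # Gaussian kernel: points close to x0 dominate the local fit.
    weights = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x0
```

A small `tau` lets the fit follow local behavior, including the tails where outlying customers sit, while a large `tau` approaches a single global regression line.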


Q&A
