
Greenplum Database. Getting Started with Big Data Analytics. Ofir Manor Pre Sales Technical Architect, EMC Greenplum


Academic year: 2021


(1)

Greenplum Database

Getting Started with Big Data Analytics

Ofir Manor

(2)

Agenda

• Introduction to Greenplum

• Greenplum Database Architecture

• Flexible Database Configuration

• Beyond SQL – Flexible Analytics

• Flexible Deployment

(3)


(4)

“Big Data Is Less About Size, And More About Freedom” ― Techcrunch

“Findings: ‘Big Data’ Is More Extreme Than Volume” ― Gartner

“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” ― IDC

“Total data: ‘bigger’ than big data” ― 451 Group

THE ERA OF BIG DATA

(5)

Industries Are Broadly Embracing Big Data

Retail
• CRM – Customer Scoring
• Store Siting and Layout
• Fraud Detection / Prevention
• Supply Chain Optimization

Advertising & Public Relations
• Demand Signaling
• Ad Targeting
• Sentiment Analysis
• Customer Acquisition

Financial Services
• Algorithmic Trading
• Risk Analysis
• Fraud Detection
• Portfolio Analysis

Media & Telecommunications
• Network Optimization
• Customer Scoring
• Churn Prevention
• Fraud Prevention

Manufacturing
• Product Research
• Engineering Analytics
• Process & Quality Analysis
• Distribution Optimization

Energy
• Smart Grid
• Exploration

Government
• Market Governance
• Counter-Terrorism
• Econometrics
• Health Informatics

Healthcare & Life Sciences
• Pharmaco-Genomics
• Bio-Informatics
• Pharmaceutical Research
• Clinical Outcomes Research

(6)
(7)
(8)
(9)
(10)

Extreme Performance for Analytics

Optimized for BI and analytics
• Deep integration with statistical packages
• High performance parallel implementations

Simple and automatic
• Just load and query like any database
• Tables are automatically distributed across nodes

Extremely scalable
• MPP shared-nothing architecture
• All nodes can scan and process in parallel
• Linear scalability by adding nodes
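The points above about automatic distribution and parallel scanning can be illustrated with a small sketch. This is a minimal Python illustration of the general shared-nothing idea, not Greenplum's implementation: the CRC32 hash, the segment count, and the names `distribute` and `parallel_sum` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def distribute(rows, key, num_segments):
    """Assign each row to a segment by hashing its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        seg_id = crc32(str(row[key]).encode()) % num_segments
        segments[seg_id].append(row)
    return segments


def parallel_sum(segments, column):
    """Each segment sums only its own rows; the partial sums are then combined."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda seg: sum(r[column] for r in seg), segments))
    return sum(partials)


# Hypothetical data: 1000 rows spread over 4 segments.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
segments = distribute(rows, "customer_id", num_segments=4)
total = parallel_sum(segments, "amount")
```

Because no segment needs another segment's rows to compute its partial sum, adding segments shrinks the work each one does, which is the intuition behind the scalability claim.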

(11)

A Mature Enterprise Platform

CLIENT ACCESS & TOOLS
• CLIENT ACCESS: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd PARTY TOOLS: BI Tools, ETL Tools, Data Mining, etc.
• ADMIN TOOLS: Greenplum Command Center, Greenplum Package Manager

PRODUCT FEATURES
• LOADING & EXT. ACCESS: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
• STORAGE & DATA ACCESS: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes – Btree, Bitmap, etc., External Table Support
• LANGUAGE SUPPORT: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)

GREENPLUM DATABASE ADAPTIVE SERVICES / CORE MPP ARCHITECTURE
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Shared-Nothing MPP
• Parallel Query Optimizer
• Polymorphic Data Storage™
• Parallel Dataflow Engine
• gNet™ Software Interconnect
• Scatter/Gather Streaming™ Data Loading
• Online System Expansion
• Workload Management

(12)

Extremely Scalable MPP Shared-Nothing Architecture

[Diagram: SQL Client, Master, High-Speed Interconnect, four Segments]

(13)

Linear Scalability

[Diagram: SQL Client, Master, High-Speed Interconnect, eight Segments]

• Each node has its own CPU and I/O resources
• Add nodes to scale
• Rebalance happens in

(14)

High Availability

[Diagram: two Masters, four Segments]

Master Server Data Protection
• Replicated transaction logs for server failure
• Optional RAID protection for drive failures
• Upon server failure
  • Standby server activated
  • Administrator alerted
• Orchestrated failover

Segment Server Data Protection
• Mirrored segments for server failures
• Optional RAID protection for drive failures
• Upon server failure
  • Mirrored segments take over with no loss of service
  • Fast online differential recovery
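The segment mirroring and differential recovery described above can be sketched as a toy model. This is only an illustration of the general primary/mirror pattern under simplifying assumptions (in-memory dictionaries, a single pair, one failure at a time); the class `SegmentPair` and its methods are hypothetical and do not reflect Greenplum internals.

```python
class SegmentPair:
    """Toy model of a primary segment and its mirror, each holding a full copy."""

    def __init__(self):
        self.copies = {"primary": {}, "mirror": {}}
        self.up = {"primary": True, "mirror": True}
        self.active = "primary"
        self.missed = []  # changes the failed copy has not received yet

    def write(self, key, value):
        # Apply the change to every copy that is currently up.
        for name, data in self.copies.items():
            if self.up[name]:
                data[key] = value
        # While one copy is down, remember what it missed.
        if not all(self.up.values()):
            self.missed.append((key, value))

    def read(self, key):
        return self.copies[self.active][key]

    def fail(self, name):
        self.up[name] = False
        if self.active == name:
            # The surviving copy takes over serving requests.
            self.active = "mirror" if name == "primary" else "primary"

    def recover(self, name):
        # Differential recovery: replay only the missed changes, not a full copy.
        for key, value in self.missed:
            self.copies[name][key] = value
        replayed = len(self.missed)
        self.missed.clear()
        self.up[name] = True
        return replayed


pair = SegmentPair()
pair.write("order-1", 100)
pair.fail("primary")              # mirror takes over
pair.write("order-2", 250)        # only the mirror sees this change
served = pair.read("order-1")     # still readable after the failure
replayed = pair.recover("primary")
```

The model replays one change during recovery rather than copying both rows, which is the sense in which recovery is "differential".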

(15)

Most Powerful Data Loading Capabilities

• Industry leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations “in-flight”
• Transparent interfaces to loading via support files, application, and services

Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks = >20 TB/h.

[Chart: SINGLE RACK COMPARISON of Greenplum, Oracle Exadata, Netezza, and Teradata]
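The linear-scaling arithmetic on this slide can be worked through in a few lines. The sketch below takes the slide's 10 TB per hour per rack as a lower bound and assumes perfectly linear scaling; the function names and the 100 TB data volume are hypothetical values chosen for illustration.

```python
def estimated_load_rate_tb_per_hour(racks, per_rack_tb_per_hour=10.0):
    """Linear-scaling estimate: total rate grows in proportion to rack count."""
    if racks <= 0:
        raise ValueError("racks must be positive")
    return racks * per_rack_tb_per_hour


def hours_to_load(data_tb, racks, per_rack_tb_per_hour=10.0):
    """Time to load a data set at the estimated aggregate rate."""
    return data_tb / estimated_load_rate_tb_per_hour(racks, per_rack_tb_per_hour)


two_rack_rate = estimated_load_rate_tb_per_hour(2)   # matches the slide's two-rack example
hours_for_100_tb = hours_to_load(100, racks=2)       # hypothetical 100 TB data set
```

Since the slide quotes "10+" TB per hour, the computed load time is an upper bound on the time under these assumptions.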

(16)

Polymorphic Table Storage™

• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
• Four table types: heap, row-oriented AO, column-oriented, external
• Block compression: Gzip (levels 1-9), QuickLZ

[Diagram: TABLE ‘CUSTOMER’ shown by month (Mar ‘11, Apr ‘11, May ‘11, Jun ‘11, Jul ‘11, Aug ‘11, Sept ‘11, Oct ‘11, Nov ‘11); row-oriented for HOT DATA, column-oriented for COLD DATA]
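The hot/cold idea above can be sketched in Python. This is an illustration only: the three-month "hot" window is an assumption (the slide does not say where the boundary lies), and `storage_for` and `to_columns` are hypothetical helpers, not Greenplum features.

```python
from datetime import date


def storage_for(partition_month, today, hot_months=3):
    """Pick a storage orientation from a partition's age in whole months."""
    age = (today.year - partition_month.year) * 12 + (today.month - partition_month.month)
    return "row-oriented" if age < hot_months else "column-oriented"


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Monthly partitions Mar 2011 through Nov 2011, as in the diagram.
today = date(2011, 11, 15)
layout = {m: storage_for(date(2011, m, 1), today) for m in range(3, 12)}
hot = [m for m, kind in layout.items() if kind == "row-oriented"]

# The same two rows in row form and in column form.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
columns = to_columns(rows)
```

Row form keeps each record together, which suits frequent single-record reads and updates; column form keeps each attribute together, which suits scans over a few columns and tends to compress well.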

(17)

In-Database Analytics

Bringing the power of parallelism to commonly-used modeling and analytics functions

In-database analytics
• SAS – HPA, Access, and Scoring Accelerator
• MADlib – An open-source library of advanced analytics functions

Analytics extensions supported, including PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.

(18)

SAS and Greenplum

A Strategic Partnership for High-Performance Computing

Access relational data-sets for agile analysis
• SAS/ACCESS provides fast, transparent and secure access to Greenplum data.

Leverage database scalability for rapid model deployment
• SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.

Build complex models at massive scales
• The SAS High-Performance Analytics Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce

(19)

MADlib

Scalable in-database analytics
• Data-parallel
• Mathematical Algorithms
• Statistical Algorithms
• Machine learning Algorithms
• Supports structured and unstructured data.

Delivered via open-source
• Accessibility
• Skill development
• Converge business, academic, and open-source communities
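To make "data-parallel" concrete, here is a minimal sketch of a one-variable least-squares fit in which each segment computes partial sums over its own rows and the partial results are merged before the final solve. It illustrates the general pattern only; it is not MADlib's code or API, and the function names and sample points are assumptions.

```python
from functools import reduce


def partial_sums(points):
    """Per-segment pass: accumulate the sums needed for a least-squares fit."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    return tuple(u + v for u, v in zip(a, b))


def fit(sums):
    """Solve for slope and intercept from the merged sums."""
    n, sx, sy, sxx, sxy = sums
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data on the line y = 2x + 1, split across two segments.
segment_data = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
merged = reduce(merge, (partial_sums(seg) for seg in segment_data))
slope, intercept = fit(merged)
```

Only five numbers per segment cross the network, regardless of how many rows each segment holds.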

(20)

MADlib In-Database Analytical Functions

Descriptive Statistics
• Quantile
• Profile
• CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
• FM (Flajolet-Martin) Sketch-based Estimator
• MFV (Most Frequent Values) Sketch-based Estimator
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart

Modeling
• Correlation Matrix
• Association Rule Mining
• K-Means Clustering
• Naïve Bayes Classification
• Linear Regression
• Logistic Regression
• Support Vector Machines
• SVD Matrix Factorisation
• Decision Trees/CART
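As an illustration of one technique named in the list above, the following is a generic Count-Min sketch in Python. It is a textbook-style sketch, not MADlib's implementation or interface; the width, depth, SHA-256-based hashing, and sample items are assumptions. The `merge` method shows why the structure suits a parallel database: sketches built independently on each segment can be added together.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def merge(self, other):
        """Add another sketch of the same shape, e.g. one built on another segment."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Two hypothetical segments each sketch their local data, then the results merge.
seg_a, seg_b = CountMinSketch(), CountMinSketch()
for item in ["alice", "alice", "alice", "bob"]:
    seg_a.add(item)
for item in ["alice", "alice"]:
    seg_b.add(item)
seg_a.merge(seg_b)
alice_estimate = seg_a.estimate("alice")
```

The estimate for an item is always at least its true count and at most the total number of items added; larger widths reduce the chance of overestimation.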

(21)

Greenplum Analytics Labs

• Packaged solutions that produce business value and actionable results
• Accelerate analytics capabilities on your data with your analysts
• Leverage the expertise of Greenplum’s Data Scientists
• Establish a strategic vision for

(22)

Greenplum Delivers Choice & Flexibility

Greenplum Data Computing Appliance
• Choose Greenplum Database and/or Hadoop modules in ¼ rack increments
• Scale up by adding your choice of additional modules
• Minimal time to value

Greenplum Software Solutions
• Greenplum Database, Hadoop, & Chorus on your x86 hardware
• Flexibility for any workload or environment
• Perpetual or

(23)

Seamless Infrastructure Integration

• EMC Data Domain: Efficient Backup & Restore
• EMC VMAX or VNX SAN Mirror: For Advanced Storage Management
• Isilon Scale Out Storage: For Big Data Staging
• EMC VMAX SRDF
• EMC Data Domain Replication

(24)

Simple To Manage

Greenplum Command Center
• Complete platform management and control

Greenplum Package Manager
• Automates install, uninstall, update, and query for analytics extensions
• Supports package migration during upgrade, segment recovery, expansion, and standby initialization

(25)
(26)

Powerful Partner Ecosystem

(27)

Thank you

[email protected]

Downloads, documentation, whitepapers, etc.: http://www.greenplum.com

A copy of this presentation will be available on the event’s web site.

Next Greenplum workshop in Hungary: 04 July, 2012

Register now at EMC Hungary or Avnet Hungary
