Scaling Hadoop for Multi-Core and Highly Threaded Systems

(1)

Scaling Hadoop for

Multi-Core and Highly

Threaded Systems

Jangwoo Kim, Zoran Radovic

Performance Architects

Architecture Technology Group Sun Microsystems Inc.

(2)

Project Overview

Hadoop Updates

CMT Hadoop Systems

Scaling Hadoop on CMT

Virtualization Technologies

Zones Logical Domains

Case Study: E-mail Discovery

Conclusions

(3)

Project Overview

• Chip Multi-Threading (CMT) processors and Hadoop are designed for maximum throughput

• Sun's JVM optimized for CMT

>  Java has been widely deployed by many customers on CMT

-  Hadoop is written with Java – an ideal throughput candidate

• Seemed like a great fit for Hadoop with the potential for a greatly reduced footprint

•  Related Work by Ning Sun and Lee Anne Simmons:

>  Blueprint: Using Logical Domains and CoolThreads

(4)

(5)

Hadoop Expands Beyond the Web...

  Some examples from the Summit

- Genetic Sequence Analysis - Parallel Data Mining in Telco - Natural language learning - Business Fraud Detection - Clinical Trials

- Retail Business Planning - ...

(6)

Map/Reduce Organization

Split 0 Split 1 Split 2 Split 3 Input HDFS map map map reduce sort/merge reduce sort/merge Output HDFS Part 0 Part 1 copy

(7)

Next-Gen Hadoop – Low Latency Focus

  Hadoop is traditionally optimized for throughput   World Record Sort source code changes

- http://developer.yahoo.net/blogs/hadoop/Yahoo2009.pdf

  “Winning a 60 Second Dash with a Yellow Elephant”

- Reducer Improvements (Shuffle); memory to memory merge - Fetch of multiple map outputs from the same node

  Reduces number of server connections

- Improved timeout behavior

- Better data corruption detection (CRC32 improvements) - Map output compression (45% of the original size)

- Improved and multi-threaded data partitioning - Lower latency with faster “heartbeat”

(8)

OpenSolaris 2009.06

  OpenSolaris Moves Into Enterprise

- UltraSPARC T1 and T2 Support, Sun4u

  5 Year Enterprise Support

  Datacenter-Ready Installation

  New and Modern Networking Stack

- http://opensolaris.org/os/project/crossbow/ - Multi-Core Optimized

- Easy Network Virtualizetion and Resource Control

  Powerful, Built-in and Free Virtualization Techology

- http://opensolaris.org/os/community/ldoms

(9)

42 GB/s read, 21 GB/s write

UltraSPARC

™

_{T2 Processor}

• 8 SPARC V9 cores @ 1.4Ghz

> 8 vertical threads per core > 2 execution pipelines per core > 1 instruction/cycle per pipeline > 1 FPU per core

> 1 SPU (crypto) per core

> 4MB 16-way 8-bank L2$

• 64 threads

• 2.5Ghz x8 PCI-Express interface

• 2 x 10Gb on-chip Ethernet

• Crypto processor per core

• _{Power: 84 watts (typical)} • http://www.opensparc.net

(10)

T5240

2U 2P US-T2 Plus Server

Blade 6000

10U US-T2 Blades

CMT Hadoop Systems

T5440

4U 2P US-T2 Plus Platform & Sun Storage J4400

(11)

* http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

(12)

all maps start

job completion job start

all maps finish

shuffle start

shuffling reducing mapping

All tasks start and finish simultaneously

all reduces

start all reduces finish

Ideal Performance Model

(13)

first map start job start

mapping

Launching many tasks can incur significant overhead

Performance Model with Serialized

Tasks

last map starts last map finishes

job completion first map

finishes

shuffling reducing last shuffle

(14)

Distributed Performance Data Collection

  Created a set of scripts to facilitate distributed execution for

performance data collection and analysis

- Based on traditional single-node system analysis tools

  mpstat, nicstat, iostat, vmstat, ...

- Varaiable sampling frequency to monitor hardware utilization - Pinpoint which resource is a bottleneck at any point

  CPU utilization, network, disk I/O

- Periods where no resource is fully utilized may indicate

poorly-tuned Hadoop configuration or other system issues

  Hadoop log processing to monitor Hadoop task timeline

- Examine startup rate, Hadoop phase overlap

(15)

(#map, # reduce)‏

30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)‏

T ime (mi n) ‏ <60% CPU utilization

Significant launching overhead limits scalability Mapping

Shuffling

Reducing

(16)

10-Node 150G Sort – Task Timeline

(17)

(18)

Hypervisor Logical Domain 0 Job Tracker Name Node … Logical Domain 1 Task Tracker Data Node Logical Domain 2 Task Tracker Data Node Logical Domain N Task Tracker Data Node

Intra-node Virtualization:

Logical Domains (LDOMs)‏

• Hardware-assisted Virtualization

• Single hypervisor

> OS-Level Isolation

(19)

Example LDOMs Configuration

• Single control domain

> Virtual disk server (vds)

> Virtual network switch (vsw)

> Virtual console concentrator (vcc)

• Multiple logical domains

•  ldm add-vcpu 8 ldom0 (cpu)‏

•  ldm add-memory 16G ldom0 (memory)‏

•  ldm add-vdisk vdisk0 control-vds ldom0 (disk)‏

•  ldm add-vnet vnet0 control-vsw ldom0 (network)‏

• Single control domain

  ldm bind ldom0 (bind)   ldm start ldom0 (boot) >  OS Install as usual

(20)

Intra-node Virtualization: Zones

(Containers)‏

• Software (OS) Virtualization

• Single operating system

> Application-Level Isolation

> No H/W threads and memory dedicated

Zone 0 Job Tracker Name Node … Zone 1 Task Tracker Data Node Zone 2 Task Tracker Data Node Zone N Task Tracker Data Node

(21)

Example Zones Configuration

• Create zones

  zonecfg –z zone0 –f zone0.config   zone0.config:

•  “create;

•  add net; set physical=interface; set address=IP; ..

•  add fs; set dir=mount_path ; set raw=partition; ..

•  ..”

• Zone administration

•  zoneadm –z zone0 boot (boot)‏

•  zoneadm list (list)‏

(22)

Example 4-LDOM Setup

• Evenly distributing H/W resources

(23)

(#map, # reduce, #virtual nodes)‏ T ime (mi n) ‏

~100% CPU utilization with 4 logical domains

Mapping

Shuffling

Reducing

30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)‏

Scaling Hadoop with Intra-node

(24)

CMT Hadoop systems scale nicely with larger datasets T ime (mi n) ‏

Large data sorting performance

(Sun Blade 6000: 10 nodes, 640 threads, 64GB RAM/node, 4 disks/node)‏

Data Size

Scaling Sorting Workload

(Without Virtualization)‏

(25)

E-mail Discovery Overview

• Preparing data for searching over large email corpus

• Five phases with different MapReduce profiles

1. PipelineMapReduce – Reads and parses 27GB of raw emails

2. DocumentSeqFileToMapFile – Prepares MapFile to retrieve data

3. PersonNormalization – Groups data into unique entities

4. Consumer – Creates indices

5. ThreadDetection – Conversation threads detected

• Output is a set of shards used in an E-mail discovery search application

(26)

(27)

CMT Hadoop systems scale for throughput applications T ime (mi n) ‏ 1 node 128 threads

Email processing performance

1 node 256 threads 10 nodes 640 threads 15 nodes 60 EC2 units

E-Discovery Results

(28)

High performance with smaller datacenter footprint Email processing performance normalized to a 40U rack

R el at ive p erf orma nce 20 nodes 128 threads / node 5 nodes 256 threads / node 40 nodes 64 threads / node 40 nodes 4 EC2 units / node 4.6X 2.0X 3.1X 1.0

(29)

MySQL Enterprise Solution

Enterprise software, services delivered as annual subscription

Subscription:

MySQL Enterprise

License (OEM):

Embedded Server Support

MySQL Cluster Carrier-Grade Training Consulting NRE Database Monitoring Support

Most up-to-date MySQL software Monthly rapid updates

Quarterly service packs Hot-fix program

Indemnification

Virtual database assistant

Global monitoring of all servers Web-based central console

Built-in advisors, expert advice Problem query detection/analysis

Online self-help MySQL Knowledge Base

24/7 problem resolution with priority escalation Consultative help

(30)

Conclusions

• Hadoop and Java scale well on CMT systems

• Startup cost dominates performance on highly threaded systems (256 threads per node)

• Virtualization techniques enable good scalability, high system utilization and better performance

> Parallelized startup

> Less external node-to-node Ethernet traffic

• Hadoop consolidation on CMT systems reduces datacenter footprint, power and cooling costs

(31)

Software Stack, Pointers to Download

• Sun CMT servers > http://www.sun.com/servers/coolthreads/overview/index.jsp • Hadoop 0.20.0 > http://hadoop.apache.org • JVM from Sun 1.6.0_13 > http://www.java.sun.com

• OpenSolaris for SPARC 2009.6

> http://www.opensolaris.org

• LDOMs 1.1

(32)

Try it Yourself

Learn More Free

• Try free for 60 days: Sun

Enterprise SPARC rack or blade systems and storage

• Test Hadoop on up to 128

threads

• 60 days to decide to buy

• Return and pay nothing – not

even shipping if you don't

sun.com/tryandbuy

• Using LDom and CoolThreads

Technology: Improving Scalability and Utilization

• Improving Database Scalability

on T5440 Blueprint

• Deploying Web 2.0 Applications

on Sun Servers and the

OpenSolaris Operating Systems

Tech Resources tab at

(33)