Scaling Hadoop for
Multi-Core and Highly
Threaded Systems
Jangwoo Kim, Zoran Radovic
Performance Architects
Architecture Technology Group Sun Microsystems Inc.
Project Overview
Hadoop Updates
CMT Hadoop Systems
Scaling Hadoop on CMT
Virtualization Technologies
Zones Logical DomainsCase Study: E-mail Discovery
Conclusions
Project Overview
• Chip Multi-Threading (CMT) processors and Hadoop are designed for maximum throughput
• Sun's JVM optimized for CMT
> Java has been widely deployed by many customers on CMT
- Hadoop is written with Java – an ideal throughput candidate
• Seemed like a great fit for Hadoop with the potential for a greatly reduced footprint
• Related Work by Ning Sun and Lee Anne Simmons:
> Blueprint: Using Logical Domains and CoolThreads
Hadoop Expands Beyond the Web...
Some examples from the Summit
- Genetic Sequence Analysis - Parallel Data Mining in Telco - Natural language learning - Business Fraud Detection - Clinical Trials
- Retail Business Planning - ...
Map/Reduce Organization
Split 0 Split 1 Split 2 Split 3 Input HDFS map map map reduce sort/merge reduce sort/merge Output HDFS Part 0 Part 1 copyNext-Gen Hadoop – Low Latency Focus
Hadoop is traditionally optimized for throughput World Record Sort source code changes
- http://developer.yahoo.net/blogs/hadoop/Yahoo2009.pdf
“Winning a 60 Second Dash with a Yellow Elephant”
- Reducer Improvements (Shuffle); memory to memory merge - Fetch of multiple map outputs from the same node
Reduces number of server connections
- Improved timeout behavior
- Better data corruption detection (CRC32 improvements) - Map output compression (45% of the original size)
- Improved and multi-threaded data partitioning - Lower latency with faster “heartbeat”
OpenSolaris 2009.06
OpenSolaris Moves Into Enterprise
- UltraSPARC T1 and T2 Support, Sun4u
5 Year Enterprise Support
Datacenter-Ready Installation
New and Modern Networking Stack
- http://opensolaris.org/os/project/crossbow/ - Multi-Core Optimized
- Easy Network Virtualizetion and Resource Control
Powerful, Built-in and Free Virtualization Techology
- http://opensolaris.org/os/community/ldoms
42 GB/s read, 21 GB/s write
UltraSPARC
™T2 Processor
• 8 SPARC V9 cores @ 1.4Ghz
> 8 vertical threads per core > 2 execution pipelines per core > 1 instruction/cycle per pipeline > 1 FPU per core
> 1 SPU (crypto) per core
> 4MB 16-way 8-bank L2$
• 64 threads
• 2.5Ghz x8 PCI-Express interface
• 2 x 10Gb on-chip Ethernet
• Crypto processor per core
• Power: 84 watts (typical) • http://www.opensparc.net
T5240
2U 2P US-T2 Plus Server
Blade 6000
10U US-T2 Blades
CMT Hadoop Systems
T5440
4U 2P US-T2 Plus Platform & Sun Storage J4400
* http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
all maps start
job completion job start
all maps finish
shuffle start
shuffling reducing mapping
All tasks start and finish simultaneously
all reduces
start all reduces finish
Ideal Performance Model
first map start job start
mapping
Launching many tasks can incur significant overhead
Performance Model with Serialized
Tasks
last map starts last map finishes
job completion first map
finishes
shuffling reducing last shuffle
Distributed Performance Data Collection
Created a set of scripts to facilitate distributed execution for
performance data collection and analysis
- Based on traditional single-node system analysis tools
mpstat, nicstat, iostat, vmstat, ...
- Varaiable sampling frequency to monitor hardware utilization - Pinpoint which resource is a bottleneck at any point
CPU utilization, network, disk I/O
- Periods where no resource is fully utilized may indicate
poorly-tuned Hadoop configuration or other system issues
Hadoop log processing to monitor Hadoop task timeline
- Examine startup rate, Hadoop phase overlap
(#map, # reduce)
30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)
T ime (mi n) <60% CPU utilization
Significant launching overhead limits scalability Mapping
Shuffling
Reducing
10-Node 150G Sort – Task Timeline
Hypervisor Logical Domain 0 Job Tracker Name Node … Logical Domain 1 Task Tracker Data Node Logical Domain 2 Task Tracker Data Node Logical Domain N Task Tracker Data Node
Intra-node Virtualization:
Logical Domains (LDOMs)
• Hardware-assisted Virtualization
• Single hypervisor
> OS-Level Isolation
Example LDOMs Configuration
• Single control domain
> Virtual disk server (vds)
> Virtual network switch (vsw)
> Virtual console concentrator (vcc)
• Multiple logical domains
• ldm add-vcpu 8 ldom0 (cpu)
• ldm add-memory 16G ldom0 (memory)
• ldm add-vdisk vdisk0 control-vds ldom0 (disk)
• ldm add-vnet vnet0 control-vsw ldom0 (network)
• Single control domain
ldm bind ldom0 (bind) ldm start ldom0 (boot) > OS Install as usual
Intra-node Virtualization: Zones
(Containers)
• Software (OS) Virtualization
• Single operating system
> Application-Level Isolation
> No H/W threads and memory dedicated
Zone 0 Job Tracker Name Node … Zone 1 Task Tracker Data Node Zone 2 Task Tracker Data Node Zone N Task Tracker Data Node
Example Zones Configuration
• Create zones
zonecfg –z zone0 –f zone0.config zone0.config:
• “create;
• add net; set physical=interface; set address=IP; ..
• add fs; set dir=mount_path ; set raw=partition; ..
• ..”
• Zone administration
• zoneadm –z zone0 boot (boot)
• zoneadm list (list)
Example 4-LDOM Setup
• Evenly distributing H/W resources
(#map, # reduce, #virtual nodes) T ime (mi n)
~100% CPU utilization with 4 logical domains
Mapping
Shuffling
Reducing
30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)
Scaling Hadoop with Intra-node
CMT Hadoop systems scale nicely with larger datasets T ime (mi n)
Large data sorting performance
(Sun Blade 6000: 10 nodes, 640 threads, 64GB RAM/node, 4 disks/node)
Data Size
Scaling Sorting Workload
(Without Virtualization)E-mail Discovery Overview
• Preparing data for searching over large email corpus
• Five phases with different MapReduce profiles
1. PipelineMapReduce – Reads and parses 27GB of raw emails
2. DocumentSeqFileToMapFile – Prepares MapFile to retrieve data
3. PersonNormalization – Groups data into unique entities
4. Consumer – Creates indices
5. ThreadDetection – Conversation threads detected
• Output is a set of shards used in an E-mail discovery search application
CMT Hadoop systems scale for throughput applications T ime (mi n) 1 node 128 threads
Email processing performance
1 node 256 threads 10 nodes 640 threads 15 nodes 60 EC2 units
E-Discovery Results
High performance with smaller datacenter footprint Email processing performance normalized to a 40U rack
R el at ive p erf orma nce 20 nodes 128 threads / node 5 nodes 256 threads / node 40 nodes 64 threads / node 40 nodes 4 EC2 units / node 4.6X 2.0X 3.1X 1.0
MySQL Enterprise Solution
Enterprise software, services delivered as annual subscription
Subscription:
MySQL Enterprise
License (OEM):
Embedded Server Support
MySQL Cluster Carrier-Grade Training Consulting NRE Database Monitoring Support
Most up-to-date MySQL software Monthly rapid updates
Quarterly service packs Hot-fix program
Indemnification
Virtual database assistant
Global monitoring of all servers Web-based central console
Built-in advisors, expert advice Problem query detection/analysis
Online self-help MySQL Knowledge Base
24/7 problem resolution with priority escalation Consultative help
Conclusions
• Hadoop and Java scale well on CMT systems
• Startup cost dominates performance on highly threaded systems (256 threads per node)
• Virtualization techniques enable good scalability, high system utilization and better performance
> Parallelized startup
> Less external node-to-node Ethernet traffic
• Hadoop consolidation on CMT systems reduces datacenter footprint, power and cooling costs
Software Stack, Pointers to Download
• Sun CMT servers > http://www.sun.com/servers/coolthreads/overview/index.jsp • Hadoop 0.20.0 > http://hadoop.apache.org • JVM from Sun 1.6.0_13 > http://www.java.sun.com• OpenSolaris for SPARC 2009.6
> http://www.opensolaris.org
• LDOMs 1.1
Try it Yourself
Learn More Free
• Try free for 60 days: Sun
Enterprise SPARC rack or blade systems and storage
• Test Hadoop on up to 128
threads
• 60 days to decide to buy
• Return and pay nothing – not
even shipping if you don't
sun.com/tryandbuy
• Using LDom and CoolThreads
Technology: Improving Scalability and Utilization
• Improving Database Scalability
on T5440 Blueprint
• Deploying Web 2.0 Applications
on Sun Servers and the
OpenSolaris Operating Systems
Tech Resources tab at
Scaling Hadoop for
Multi-Core and Highly
Threaded Systems
Jangwoo Kim ([email protected]) Zoran Radovic ([email protected]) Denis Sheahan ([email protected]) Joseph Gebis ([email protected])
This is an extended version of our Hadoop Summit '09 presentation, Santa Clara, CA, June 2009