FPGA-based MapReduce Framework for Machine Learning
Bo WANG1, Yi SHAN1, Jing YAN2, Yu WANG1, Ningyi XU2, Huangzhong YANG1
1Department of Electronic Engineering, Tsinghua University, Beijing, China
2Hardware Computing Group
Outline
• Motivation
• Proposed solution: FPGA+MapReduce
• Case study: RankBoost acceleration
• Summary
The Power Barrier …
Source: Shekhar Borkar, Intel
Challenges
• General purpose CPU architecture
– Memory wall
• CPUs are too fast; memory bandwidth is too slow
• Cache real estate
– Power wall
• Most power: non-arithmetic operations (out-of-order, prediction)
• Higher freq: higher leakage power
• Large cache
• Traditional parallel programming
Customized Domain Specific Computing for Machine Learning
• Primary goal of this project
– “Automatically” utilize the parallelism in machine learning algorithms with 100x performance/power efficiency
• A few facts
– We have sufficient computing power for most applications*
– Each user/enterprise needs high computation power for only selected tasks in its domain* (machine learning)
– Application-specific integrated circuits (ASIC) can lead to 10,000X+ better power performance efficiency, but are too expensive to design and manufacture*
– MapReduce is a successful programming framework for ML/DM
• Approach
– “Supercomputer in a box” with reconfigurable hardware – Field Programmable Gate Array (FPGA) and CPUs
The Big Picture
[Figure: system overview. Labels: Machine Learning Applications; MapReduce description of the Algorithm in C/C++; User Constraints; Programming; Architecture; Supercomputer in a box (x4); Mapper; Reducer; Scheduler; Data Manage; MEM; CPUs; FPGAs; Interconnection Network]
“Field-Programmable Gate Array” Defined
• “Field-programmable” semiconductor device
– Change functionality after deployment
• Create arbitrary logic with “gate arrays”
– Gate arrays: “islands” of reconfigurable logic in a “sea” of
reconfigurable interconnects.
Y = i_0 + i_1 + i_2 * i_3
[Figure: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects (Altera Stratix)]
– Implement desired functionality in hardware
• Example: X = 3*Y + 5*Z
• Hardware Description Languages (HDLs)
• C/C++ to HDL compilation tools: AutoPilot
– http://www.deepchip.com/items/0482-06.html
CPU runs the application, FPGA is the application.
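To make the X = 3*Y + 5*Z example concrete, here is a minimal software model of one way such an expression could be laid out as a fixed datapath: each constant multiplication becomes a shift plus an add, and the two partial results feed a final adder. This is a sketch under stated assumptions; the function name `shift_add_datapath` is hypothetical, and an actual HDL or C-to-HDL flow may choose a different structure (for example dedicated multiplier blocks).

```python
def shift_add_datapath(y: int, z: int) -> int:
    """Model X = 3*Y + 5*Z using only shifts and adders (illustrative only)."""
    three_y = (y << 1) + y  # 3*Y = 2*Y + Y: one shift, one adder
    five_z = (z << 2) + z   # 5*Z = 4*Z + Z: one shift, one adder
    # In hardware the two branches above would be evaluated side by side;
    # here they are simply computed one after the other.
    return three_y + five_z  # final adder


if __name__ == "__main__":
    print(shift_add_datapath(2, 3))
```

The point of the sketch is that the circuit contains exactly the operators the expression needs and nothing else, which is the sense in which the FPGA "is" the application.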
Why use FPGA?
• High flexibility
– Customized logic for application
– Match the application at the bit level
– Best utilize parallelism and locality in application
• High computation density
– Several Pentium cores
• High I/O bandwidth
– Up to 100s of Gbps
• High internal memory bandwidth
– Up to 10s of Tbps
• Customized memory hierarchy with no ‘cache miss’
• Track Moore’s Law
• Compared to ASIC
– Much lower design cost
• Compared to GPU
– Bit level flexibility
– Lower power
FPGA-based High Performance Computing
• 10X ~ 10,000X speedup reported
– Conferences: FCCM, FPGA, FPT, FPL, SC, ICS …
– Domains: scientific computing, machine learning,
data mining, graphics, financial computing, …
• Challenges
– Ad-hoc solutions
– Design productivity
Framework: MapReduce
[Figure: MapReduce over Web Request Logs. Labels: programmer: Functionality; MapReduce Runtime: Parallelization, Data Distribution, Fault Tolerance, Load Balance]
Two primitives:
Map (input)
    for each word in input
        emit (word, 1)
Reduce (key, values)
    int sum = 0;
    for each value in values
        sum += value;
    emit (key, sum)
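The two primitives can be exercised with a small single-process sketch. The helper names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for what a real MapReduce runtime does across many workers.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for each word in the input line."""
    for word in line.split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: sum all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(lines):
    """Toy runtime: group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "b c"]))
```

Only `map_words` and `reduce_counts` carry application logic; everything in `run_mapreduce` is the part the framework is meant to take off the programmer's hands.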
FPGA MapReduce (FPMR) Framework
[Figure: FPMR architecture, with numbered callouts 1 to 6. Labels: CPU; <key,value> Generator; PCIe / HyperTransport; FPGA; Processor Scheduler (enable, parameters); Mapper; Reducer; Merger; Data Controller; Local Memory; Global Memory; Intermediate <key,value>]
Major Building Blocks
• Processors (workers) with pre-defined interfaces
– Mapper and reducer
• On-chip scheduler
– Dynamic scheduling
• Monitor status
• Queues to record
• Data access infrastructure
– Interconnection network
• Message passing and shared memory
– Storage hierarchy
• Global memory, local memory, and register file
– Data controller
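The dynamic scheduling idea (monitor worker status, keep queues) can be illustrated with a small software simulation. This is only a sketch, not the on-chip scheduler from the slides: the function `dynamic_schedule` and the task costs are hypothetical, and a priority queue stands in for the status monitoring that tells the scheduler which worker becomes idle first.

```python
import heapq
from collections import deque


def dynamic_schedule(task_costs, num_workers):
    """Issue each pending task to whichever worker becomes idle first.

    Returns (makespan, assignment), where assignment[w] lists the task ids
    that ran on worker w.
    """
    pending = deque(enumerate(task_costs))        # queue of tasks not yet issued
    idle = [(0, w) for w in range(num_workers)]   # (time worker is free, worker id)
    heapq.heapify(idle)
    assignment = [[] for _ in range(num_workers)]
    makespan = 0
    while pending:
        task_id, cost = pending.popleft()
        free_at, worker = heapq.heappop(idle)     # earliest idle worker
        finish = free_at + cost
        assignment[worker].append(task_id)
        heapq.heappush(idle, (finish, worker))
        makespan = max(makespan, finish)
    return makespan, assignment


if __name__ == "__main__":
    print(dynamic_schedule([4, 2, 2, 2], 2))
```

Compared with a static split of tasks across workers, assigning work at run time keeps workers busy when task durations differ.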
Parallelism
• Task level/data level parallelism
– Among mappers/reducers
• Instruction level parallelism
Case study: RankBoost
• An extension of AdaBoost to ranking problems [Yoav Freund, 2003]
• Learn a ranking function by combining weak learners
– Weak learners are usually represented by decision stumps of features
– Slow with large numbers of features and training samples
– E.g. Web search engine
RankBoost: mapper and reducer
map (int key, pair value):
// key : feature index fi
// value : document bin_fi, document π
for each document d in value :
    hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + π(d)
EmitIntermediate (fi, hist_fi);

reduce (int key, array value) :
// key : feature index fi
// value : histograms hist_fi , fi = 1…Nf
for each histogram hist_fi
    for i = Nbin - 1 to 0
        integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)
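The mapper and reducer pseudocode above can be sketched as plain functions for a single feature. The names `rankboost_map` and `rankboost_reduce` are hypothetical, and the sketch assumes that the integral value past the last bin is zero, which the pseudocode leaves implicit.

```python
def rankboost_map(feature_index, bins, weights, num_bins):
    """Map: accumulate each document's weight pi(d) into its bin for one feature."""
    hist = [0.0] * num_bins
    for b, w in zip(bins, weights):
        hist[b] += w
    return feature_index, hist


def rankboost_reduce(feature_index, hist):
    """Reduce: integral(i) = hist(i) + integral(i+1), scanning from the last bin down."""
    num_bins = len(hist)
    integral = [0.0] * (num_bins + 1)  # integral[num_bins] = 0 (assumed boundary)
    for i in range(num_bins - 1, -1, -1):
        integral[i] = hist[i] + integral[i + 1]
    return feature_index, integral[:num_bins]


if __name__ == "__main__":
    # Four documents, three bins; bins[d] is the bin index of document d.
    fi, hist = rankboost_map(0, [0, 2, 2, 1], [0.5, 0.25, 0.25, 1.0], 3)
    print(rankboost_reduce(fi, hist))
```

Each feature's histogram is built independently of the others, which is what allows many mappers to work in parallel on different features.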
RankBoost on FPMR
• Map RankBoost on FPMR
– Decide <key, value>
– #mapper/#reducer
[Figure: FPMR configured for RankBoost. Labels: CPU; <bin(d),π(d)> Generator; PCI-E; FPGA; Processor Scheduler (enable, parameters); Mapper; Reducer; Merger; Data Controller; Local Memory; Global Memory bin(d); Global Memory π(d); Intermediate <fi,histfi(bin)>]
Mapper & Reducer Structure
[Figure: mapper and reducer datapaths. Labels: histf RAM (Dual Port); Shift Registers; Bin FIFO; Pi FIFO; MUX; Floating Point Adder; Floating Point Comparator (ageb); Maximum Register; Local Memory; Address Generator; Read Address; Write Address; DataIn; DataOut; constants 8'b0 and 32'b0]
Target Accelerator
• PCI Express x8 interface (Xilinx V5 LXT FPGA)
• Altera Stratix II FPGA
• DDR2 modules x2, 16 GB, 6.25 GB/s, SRAMs
Experimental results
#mapper  #reducer  WL / s  Total / s  Speedup (WL)  Speedup (Total)
1        1         320.9   321.96     0.33          0.33
2        1         160.5   161.52     0.65          0.65
4        1         80.22   81.293     1.30          1.30
8        1         40.11   41.181     2.60          2.56
16       1         20.06   21.125     5.20          4.99
32       1         10.09   11.159     10.33         9.44
52       1         6.228   7.297      16.74         14.44
64       1         5.107   6.176      20.42         17.06
128      1         2.616   3.685      39.87         28.59
146      1         2.242   3.311      46.52         31.82
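A short script can help read the table. It uses only the WL and Total times listed above and derives two quantities that the slides do not state explicitly, so treat them as observations rather than reported results: the gap between Total and WL time, and how closely WL time scales with the number of mappers. The names `fixed_overhead` and `scaling_efficiency` are hypothetical.

```python
# (mappers, WL time in s, Total time in s), copied from the results table.
RESULTS = [
    (1, 320.9, 321.96), (2, 160.5, 161.52), (4, 80.22, 81.293),
    (8, 40.11, 41.181), (16, 20.06, 21.125), (32, 10.09, 11.159),
    (52, 6.228, 7.297), (64, 5.107, 6.176), (128, 2.616, 3.685),
    (146, 2.242, 3.311),
]


def fixed_overhead(rows):
    """Total time minus WL time for each configuration, in seconds."""
    return [round(total - wl, 3) for _, wl, total in rows]


def scaling_efficiency(rows):
    """WL speedup relative to the 1-mapper row, divided by the mapper count."""
    base_wl = rows[0][1]
    return [round(base_wl / wl / mappers, 3) for mappers, wl, _ in rows]


if __name__ == "__main__":
    print("Total - WL (s):", fixed_overhead(RESULTS))
    print("WL scaling efficiency:", scaling_efficiency(RESULTS))
```

In this data the gap between Total and WL time stays at roughly one second for every configuration, while WL time shrinks almost in proportion to the mapper count. That would explain why the Total speedup (31.82 at 146 mappers) trails the WL speedup (46.52): the part that does not scale takes up a growing share of the run.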