FPGA-based MapReduce Framework for Machine Learning

(1)


Bo WANG¹, Yi SHAN¹, Jing YAN², Yu WANG¹, Ningyi XU², Huangzhong YANG¹

¹ Department of Electronic Engineering, Tsinghua University, Beijing, China

² Hardware Computing Group

(2)

Outline

• Motivation

• Proposed solution: FPGA+MapReduce

• Case study: RankBoost acceleration

• Summary

(3)

The Power Barrier …

[Figure: power-barrier chart; surviving label: "parallel". Source: Shekhar Borkar, Intel]

(4)

(5)

Challenges

• General purpose CPU architecture

– Memory wall

• CPUs are too fast; memory bandwidth is too slow

• Cache real estate

– Power wall

• Most power: non-arithmetic operations (out-of-order, prediction)

• Higher freq: higher leakage power

• Large cache

• Traditional parallel programming

(6)

Customized Domain Specific Computing for Machine Learning

• Primary goal of this project

– “Automatically” utilize the parallelism in machine learning algorithms with 100x performance/power efficiency

• A few facts

– We have sufficient computing power for most applications*

– Each user/enterprise needs high computation power for only selected tasks in its domain* (machine learning)

– Application-specific integrated circuits (ASIC) can lead to 10,000X+ better power performance efficiency, but are too expensive to design and manufacture*

– MapReduce is a successful programming framework for ML/DM

• Approach

– “Supercomputer in a box” built from reconfigurable hardware: Field Programmable Gate Arrays (FPGAs) and CPUs

(7)

The Big Picture

[Figure: system overview. Recoverable labels: Machine Learning Applications; MapReduce description of the Algorithm in C/C++; User Constraints; Programming; Architecture; Supercomputer in a box (x4); CPUs; FPGAs; Mapper; Reducer; Scheduler; Data Manage; MEM; Interconnection Network]

(8)

“Field-Programmable Gate Array” Defined

• “Field-programmable” semiconductor device

– Change functionality after deployment

• Create arbitrary logic with “gate arrays”

– Gate arrays: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects.

(9)

Y = i₀ + i₁ + i₂ * i₃

[Figure: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects (Altera Stratix)]

(10)

“Field-Programmable Gate Array” Defined

• “Field-programmable” semiconductor device

– Change functionality after deployment

• Create arbitrary logic with “gate arrays”

– Gate arrays: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects.

– Implement desired functionality in hardware

• Example: X = 3*Y + 5*Z

• Hardware Description Languages (HDLs)

• C/C++ to HDL compilation tools: AutoPilot

– http://www.deepchip.com/items/0482-06.html

CPU runs the application, FPGA is the application.
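
As a rough illustration of matching logic to the application at the bit level, the example X = 3*Y + 5*Z needs no general-purpose multiplier: both constants decompose into shifts and adds, which is one way such an expression could be mapped to logic. The Python sketch below only models that arithmetic. The function name `x_shift_add` and the 16-bit input width are assumptions made for illustration; they do not come from the slides.

```python
def x_shift_add(y: int, z: int, width: int = 16) -> int:
    """Model X = 3*Y + 5*Z using only shifts and adds on unsigned inputs."""
    mask = (1 << width) - 1
    y &= mask  # truncate inputs to the assumed bus width
    z &= mask
    three_y = (y << 1) + y   # 3*Y = 2*Y + Y
    five_z = (z << 2) + z    # 5*Z = 4*Z + Z
    return three_y + five_z  # the sum can need up to width + 3 bits


if __name__ == "__main__":
    print(x_shift_add(7, 2))  # 3*7 + 5*2
```

In hardware the two partial products could be computed side by side, whereas a CPU would issue them as sequential instructions.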

(11)

Why use FPGA?

• High flexibility

– Customized logic for application

– Match the application at the bit level

– Best utilize parallelism and locality in application

• High computation density

– Several Pentium cores

• High I/O bandwidth

– Up to 100s of Gbps

• High internal memory bandwidth

– Up to 10s of Tbps

• Customized memory hierarchy with no “cache miss”

• Track Moore’s Law

• Compared to ASIC

– Much lower design cost

• Compared to GPU

– Bit-level flexibility

– Lower power

(12)

FPGA-based High Performance Computing

• 10X ~ 10,000X speedup reported

– Conferences: FCCM, FPGA, FPT, FPL, SC, ICS …

– Domains: scientific computing, machine learning, data mining, graphics, financial computing, …

• Challenges

– Ad-hoc solutions

– Design productivity

(13)

Framework: MapReduce

[Figure: MapReduce overview. Recoverable labels: Web Request Logs, programmer, MapReduce Runtime, Functionality, Parallelization, Data Distribution, Fault Tolerance, Load Balance]

Two primitives:

Map (input)
    for each word in input
        emit (word, 1)

Reduce (key, values)
    int sum = 0;
    for each value in values
        sum += value;
    emit (key, sum)
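
The word-count primitives can be exercised with a small single-process sketch. This is only an illustration of the programming model: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for what a real MapReduce runtime does across machines.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit (word, 1) for each word in the input."""
    for word in text.split():
        yield (word, 1)


def reduce_counts(key, values):
    """Reduce: sum all values emitted for one key."""
    return (key, sum(values))


def run_mapreduce(inputs):
    """Toy runtime: run the mappers, group by key, then run the reducers."""
    groups = defaultdict(list)
    for text in inputs:  # each input could be handled by a separate mapper
        for key, value in map_words(text):
            groups[key].append(value)  # group intermediate values by key
    return dict(reduce_counts(k, v) for k, v in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "b c"]))
```

The programmer writes only the two small functions; everything inside `run_mapreduce` is what the runtime would take over.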

(14)

The Big Picture

[Figure: system overview, repeated from slide 7]

(15)

FPGA MapReduce (FPMR) Framework

[Figure: FPMR architecture with numbered steps 1 to 6. Recoverable labels: CPU, <key,value> Generator, PCIe / Hyper-Transport, FPGA, Processor Scheduler, Mapper, Reducer, Merger, Data Controller, Global Memory, Local Memory, Intermediate <key,value>, enable, parameters]

(16)

Major Building Blocks

• Processors (workers) with pre-defined interfaces

– Mapper and reducer

• On-chip scheduler

– Dynamic scheduling

• Monitor status

• Queues to record

• Data access infrastructure

– Interconnection network

• Message passing and shared memory

– Storage hierarchy

• Global memory, local memory, and register file

– Data controller
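
The slides say only that the on-chip scheduler monitors processor status and keeps queues. As a minimal software sketch of dynamic scheduling under that description, the Python below hands each pending task to whichever worker becomes idle first. The policy, the function name `schedule`, and the use of abstract task costs are all assumptions for illustration, not the FPMR scheduler's actual design.

```python
from collections import deque


def schedule(task_costs, n_workers):
    """Assign each queued task to the worker that becomes idle first.

    Returns (assignment, makespan), where assignment maps task id to worker id.
    """
    pending = deque(enumerate(task_costs))  # queue of (task id, cost)
    busy_until = [0] * n_workers            # monitored status of each worker
    assignment = {}
    while pending:
        task_id, cost = pending.popleft()
        # Pick the worker with the earliest idle time (ties go to the lowest id).
        worker = min(range(n_workers), key=lambda w: busy_until[w])
        assignment[task_id] = worker
        busy_until[worker] += cost
    return assignment, max(busy_until)


if __name__ == "__main__":
    print(schedule([4, 1, 1, 1, 1], n_workers=2))
```

With uneven task costs, this kind of dynamic assignment keeps short tasks from waiting behind a long one, which a fixed round-robin split would not.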

(17)

Parallelism

• Task level/data level parallelism

– Among mappers/reducers

• Instruction level parallelism

(18)

Case study: RankBoost

• An extension of AdaBoost to ranking problems [Yoav Freund, 2003]

• Learns a ranking function by combining weak learners

– Weak learners are usually represented by decision stumps on features

– Slow with a large number of features and training samples

– E.g. Web search engine
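
To make "decision stump" and "combining weak learners" concrete, the sketch below thresholds a single feature and scores a document as a weighted sum of stump outputs. It is an illustrative toy: the names `stump` and `ranking_score`, the 0/1 stump output, and the example weights are assumptions, and no training is shown.

```python
def stump(feature_index, threshold):
    """Weak learner: 1.0 if the chosen feature exceeds the threshold, else 0.0."""
    def h(doc):
        return 1.0 if doc[feature_index] > threshold else 0.0
    return h


def ranking_score(doc, learners):
    """Combined ranking function: weighted sum of the weak learners' outputs."""
    return sum(alpha * h(doc) for alpha, h in learners)


if __name__ == "__main__":
    # Two hypothetical stumps with hand-picked weights.
    learners = [(0.5, stump(0, 1.0)), (0.25, stump(1, 0.0))]
    docs = {"d1": [2.0, -1.0], "d2": [2.0, 1.0], "d3": [0.0, 0.0]}
    ranked = sorted(docs, key=lambda d: ranking_score(docs[d], learners), reverse=True)
    print(ranked)
```

Choosing the best stump in each boosting round means testing every feature against every candidate threshold, which is where the cost grows with the number of features and samples.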

(19)

(20)

RankBoost: mapper and reducer

map (int key, pair value):
    // key : feature index fi
    // value : document bin_fi, document π
    for each document d in value :
        hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + π(d)
    EmitIntermediate (fi, hist_fi);

reduce (int key, array value) :
    // key : feature index fi
    // value : histograms hist_fi , fi = 1…N_f
    for each histogram hist_fi
        // integral_fi(N_bin) is taken as 0
        for i = N_bin - 1 to 0
            integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)
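
The mapper and reducer above can be checked in software for a single feature. The sketch below is a plain Python rendering of the pseudocode, not the hardware implementation. The function names, the list-based data layout, and the assumption that bin indices are already integers in [0, N_bin) are illustrative choices.

```python
def map_feature(fi, bins, pi, n_bin):
    """Mapper: accumulate each document's weight pi(d) into its bin."""
    hist = [0.0] * n_bin
    for b, w in zip(bins, pi):
        hist[b] += w
    return fi, hist


def reduce_feature(fi, hist):
    """Reducer: integral histogram, i.e. a running sum from the last bin down."""
    n_bin = len(hist)
    integral = [0.0] * (n_bin + 1)  # extra zero slot so integral[i + 1] exists at i = n_bin - 1
    for i in range(n_bin - 1, -1, -1):
        integral[i] = hist[i] + integral[i + 1]
    return fi, integral[:n_bin]


if __name__ == "__main__":
    # Four hypothetical documents, three bins, for feature index 0.
    fi, hist = map_feature(0, bins=[0, 2, 2, 1], pi=[0.125, 0.25, 0.5, 0.125], n_bin=3)
    print(hist)
    print(reduce_feature(fi, hist)[1])
```

Each feature's histogram is independent of the others, so one mapper per feature (or per slice of documents) can run in parallel.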

(21)

RankBoost on FPMR

• Map RankBoost on FPMR

– Decide <key, value>

– #mapper/#reducer

[Figure: RankBoost mapped onto FPMR. Recoverable labels: CPU, <bin(d),π(d)> Generator, PCI-E, FPGA, Processor Scheduler, Mapper, Reducer, Merger, Data Controller, Global Memory holding bin(d), Global Memory holding π(d), Local Memory, Intermediate <fi,hist_fi(bin)>, enable, parameters]

(22)

Mapper & Reducer Structure

[Figure: mapper and reducer datapaths. Recoverable labels: Mapper, hist_f dual-port RAM, shift registers, Bin FIFO, Pi FIFO, MUXes, floating-point adder, floating-point comparator (ageb), maximum register, local memory, address generator, read/write address, DataIn, DataOut]

(23)

Target Accelerator

• PCI Express x8 interface (Xilinx V5 LXT FPGA)

• Altera Stratix II FPGA

• DDR2 modules x2, 16 GB, 6.25 GB/s, SRAMs

(24)

Experimental results

#mapper  #reducer  WL (s)   Total (s)  Speedup (WL)  Speedup (Total)
1        1         320.9    321.96     0.33          0.33
2        1         160.5    161.52     0.65          0.65
4        1         80.22    81.293     1.30          1.30
8        1         40.11    41.181     2.60          2.56
16       1         20.06    21.125     5.20          4.99
32       1         10.09    11.159     10.33         9.44
52       1         6.228    7.297      16.74         14.44
64       1         5.107    6.176      20.42         17.06
128      1         2.616    3.685      39.87         28.59
146      1         2.242    3.311      46.52         31.82



31.82X speedup with 146 parallel mappers
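
The table can be probed with a little arithmetic to see why the Total speedup (31.82X) trails the WL speedup (46.52X). The sketch below uses only the WL and Total times from the table. Reading the difference between them as a roughly fixed per-run overhead is an inference from the numbers, not something the slides state, and the function names are hypothetical.

```python
# (mappers, WL time in s, Total time in s), copied from the results table.
ROWS = [
    (1, 320.9, 321.96),
    (2, 160.5, 161.52),
    (4, 80.22, 81.293),
    (8, 40.11, 41.181),
    (16, 20.06, 21.125),
    (32, 10.09, 11.159),
    (52, 6.228, 7.297),
    (64, 5.107, 6.176),
    (128, 2.616, 3.685),
    (146, 2.242, 3.311),
]


def fixed_overhead(rows):
    """Total minus WL time for each configuration, in seconds."""
    return [round(total - wl, 3) for _, wl, total in rows]


def wl_scaling_efficiency(rows):
    """WL time of the first row divided by (n * WL time with n mappers)."""
    base_n, base_wl, _ = rows[0]
    return {n: (base_wl * base_n) / (wl * n) for n, wl, _ in rows}


if __name__ == "__main__":
    print(fixed_overhead(ROWS))
    print(wl_scaling_efficiency(ROWS)[146])
```

The difference stays close to one second in every row while the WL time shrinks almost in proportion to the mapper count, so at 146 mappers that non-WL portion is about a third of the total run time.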

(25)

Scalability

Mapper    1    2    4    8    16   32   52   64   128  146
ALUT      1%   2%   3%   5%   10%  19%  31%  38%  75%  86%
Register  1%   2%   4%   6%   11%  17%  32%  39%  81%  89%

[Figure: speedup versus number of mappers (x-axis: N mappers; y-axis: Speedup). Curves: WL with CDP, Total with CDP, WL w/o CDP, Total w/o CDP]