FPGA-based MapReduce Framework for Machine Learning


Academic year: 2021


(1)

Bo WANG¹, Yi SHAN¹, Jing YAN², Yu WANG¹, Ningyi XU², Huangzhong YANG¹

¹ Department of Electronic Engineering, Tsinghua University, Beijing, China

² Hardware Computing Group

(2)

Outline

• Motivation

• Proposed solution: FPGA+MapReduce

• Case study: RankBoost acceleration

• Summary

(3)

The Power Barrier …

Source: Shekhar Borkar, Intel

(4)
(5)

Challenges

• General purpose CPU architecture

– Memory wall

    • CPUs are fast; memory bandwidth is too low

    • Cache real estate

– Power wall

    • Most power goes to non-arithmetic operations (out-of-order execution, prediction)

    • Higher frequency: higher leakage power

    • Large cache

• Traditional parallel programming

(6)

Customized Domain Specific Computing for Machine Learning

• Primary goal of this project

– “Automatically” utilize the parallelism in machine learning algorithms with 100x performance/power efficiency

• A few facts

– We have sufficient computing power for most applications*

– Each user/enterprise needs high computation power for only selected tasks in its domain* (machine learning)

– Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power/performance efficiency, but are too expensive to design and manufacture*

– MapReduce is a successful programming framework for ML/DM

• Approach

– “Supercomputer in a box” with reconfigurable hardware – Field Programmable Gate Array (FPGA) and CPUs

(7)

The Big Picture

[Figure labels: Machine Learning Applications; MapReduce description of the algorithm in C/C++; User Constraints; Programming; Architecture; Mapper; Reducer; Scheduler; Data Manage; CPUs; FPGAs; MEM; Interconnection Network; Supercomputer in a box]

(8)

“Field-Programmable Gate Array” Defined

• “Field-programmable” semiconductor device

– Change functionality after deployment

• Create arbitrary logic with “gate arrays”

– Gate arrays: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects.

(9)

Y = i₀ + i₁ + i₂ * i₃

“Islands” of reconfigurable logic in a “sea” of reconfigurable interconnects (Altera Stratix)

(10)

“Field-Programmable Gate Array” Defined

• “Field-programmable” semiconductor device

– Change functionality after deployment

• Create arbitrary logic with “gate arrays”

– Gate arrays: “islands” of reconfigurable logic in a “sea” of reconfigurable interconnects.

– Implement desired functionality in hardware

    • Example: X = 3*Y + 5*Z

    • Hardware Description Languages (HDLs)

    • C/C++ to HDL compilation tools: AutoPilot

– http://www.deepchip.com/items/0482-06.html

CPU runs the application, FPGA is the application.

(11)

Why use FPGA?

• High flexibility

– Customized logic for the application

– Match the application at the bit level

– Best utilize parallelism and locality in the application

• High computation density

– Several Pentium cores

• High I/O bandwidth

– Up to 100s of Gbps

• High internal memory bandwidth

– Up to 10s of Tbps

• Customized memory hierarchy with no “cache miss”

• Track Moore’s Law

• Compared to ASIC

– Much lower design cost

• Compared to GPU

– Bit-level flexibility

– Lower power

(12)

FPGA-based High Performance Computing

• 10X ~ 10,000X speedup reported

– Conferences: FCCM, FPGA, FPT, FPL, SC, ICS …

– Domains: scientific computing, machine learning, data mining, graphics, financial computing, …

• Challenges

– Ad-hoc solutions

– Design productivity

(13)

Framework: MapReduce

[Figure labels: Web Request Logs; MapReduce; programmer; Functionality; MapReduce Runtime; Parallelization; Data Distribution; Fault Tolerance; Load Balance]

Two primitives:

Map (input)
    for each word in input
        emit (word, 1)

Reduce (key, values)
    int sum = 0;
    for each value in values
        sum += value;
    emit (key, sum)
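The two primitives above can be sketched as a small sequential Python program. This is only an illustration of the word-count example: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are hypothetical, and the grouping loop is a stand-in for the MapReduce runtime, not the FPGA framework described in these slides.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]


def reduce_counts(key, values):
    """Reduce: sum all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(inputs):
    """Sequential stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for text in inputs:
        for key, value in map_words(text):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())


counts = run_mapreduce(["a b a", "b c"])
```

The programmer writes only `map_words` and `reduce_counts`; everything in `run_mapreduce` is what a real runtime would parallelize and distribute.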

(14)

The Big Picture

[Figure labels: Machine Learning Applications; MapReduce description of the algorithm in C/C++; User Constraints; Mapper; Reducer; Scheduler; Data Manage; CPUs; FPGAs; MEM; Interconnection Network; Supercomputer in a box]

(15)

FPGA MapReduce (FPMR) Framework

[Figure: FPMR architecture. Labels: CPU; <key,value> Generator; PCIe / HyperTransport; FPGA; Global Memory; Data Controller; Processor Scheduler; Mapper; Reducer; Merger; Local Memory; Intermediate <key,value>; enable; parameters; numbered steps 1 to 6]

(16)

Major Building Blocks

• Processors (workers) with pre-defined interfaces

– Mapper and reducer

• On-chip scheduler

– Dynamic scheduling

    • Monitor status

    • Queues to record

• Data access infrastructure

– Interconnection network

• Message passing and shared memory

– Storage hierarchy

• Global memory, local memory, and register file

– Data controller
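As a rough software analogy for the dynamic scheduling bullet above, the sketch below hands each queued task to whichever worker becomes idle first. It is a simulation under assumed task costs, not the on-chip scheduler itself; `schedule`, `task_costs`, and `n_workers` are hypothetical names.

```python
import heapq
from collections import deque


def schedule(task_costs, n_workers):
    """Assign each queued task to whichever worker becomes idle first.

    Returns (assignment, makespan): assignment[i] is the worker that ran
    task i, and makespan is the time at which the last worker finishes.
    """
    pending = deque(enumerate(task_costs))        # queue of waiting tasks
    idle_at = [(0, w) for w in range(n_workers)]  # (time worker is free, id)
    heapq.heapify(idle_at)
    assignment = [None] * len(task_costs)
    makespan = 0
    while pending:
        task, cost = pending.popleft()
        free_time, worker = heapq.heappop(idle_at)  # next worker to go idle
        assignment[task] = worker
        finish = free_time + cost
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, worker))
    return assignment, makespan


assignment, makespan = schedule([4, 1, 1, 1, 1], n_workers=2)
```

In this example one long task occupies worker 0 while worker 1 picks up the four short tasks, so both workers finish at the same time instead of one sitting idle.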

(17)

Parallelism

• Task level/data level parallelism

– Among mappers/reducers

• Instruction level parallelism

(18)

Case study: RankBoost

• An extension of AdaBoost to ranking problems [Yoav Freund, 2003]

• Learn a ranking function by combining weak learners

– Weak learners are usually represented by decision stumps of features

– Slow with a large number of features and training samples

– E.g. Web search engine

(19)
(20)

RankBoost: mapper and reducer

map (int key, pair value):
    // key: feature index fi
    // value: document bin_fi, document π
    for each document d in value:
        hist(bin_fi(d)) = hist(bin_fi(d)) + π(d)
    EmitIntermediate (fi, hist_fi);

reduce (int key, array value):
    // key: feature index fi
    // value: histograms hist_fi, fi = 1…Nf
    for each histogram hist_fi
        for i = Nbin - 1 to 0
            integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)
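The mapper and reducer above can be sketched in Python for a single feature. This is a minimal software sketch, not the hardware design: the function names are hypothetical, and it assumes the integral starts from zero above the top bin (integral(Nbin) = 0), which the pseudocode leaves implicit.

```python
def rankboost_map(fi, bins, weights, n_bin):
    """Mapper: add the weight pi(d) of every document d to its bin."""
    hist = [0.0] * n_bin
    for b, pi in zip(bins, weights):
        hist[b] += pi
    return fi, hist


def rankboost_reduce(fi, hist):
    """Reducer: integral(i) = hist(i) + integral(i+1), from the top bin down."""
    n_bin = len(hist)
    integral = [0.0] * (n_bin + 1)  # assumption: integral(Nbin) = 0
    for i in range(n_bin - 1, -1, -1):
        integral[i] = hist[i] + integral[i + 1]
    return fi, integral[:n_bin]


# Four documents of feature 0, with their bin indices and weights.
fi, hist = rankboost_map(0, bins=[0, 2, 2, 3],
                         weights=[0.5, 0.25, 0.25, 1.0], n_bin=4)
_, integral = rankboost_reduce(fi, hist)
```

Each feature's histogram is independent of the others, which is what lets many mappers run in parallel, one feature (or slice of documents) each.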

(21)

RankBoost on FPMR

• Map RankBoost on FPMR

– Decide <key, value>

– #mapper/#reducer

[Figure: RankBoost mapped onto FPMR. Labels: CPU; <bin(d),π(d)> Generator; PCI-E; FPGA; Global Memory bin(d); Global Memory π(d); Data Controller; Processor Scheduler; Mapper; Reducer; Merger; Local Memory; Intermediate <fi,histfi(bin)>; enable; parameters]

(22)

Mapper & Reducer Structure

[Figure: mapper and reducer datapaths. Labels up to the “Mapper” caption: histf dual-port RAM; shift registers; Bin FIFO; Pi FIFO; floating-point adder; multiplexers (MUX) with constants 8'b0 and 32'b0; local memory; address generator; read/write address, DataIn, DataOut ports. Remaining labels: floating-point adder; floating-point comparator (ageb); maximum register; MUX with constant 32'b0; local memory; address generator; read/write address, DataIn, DataOut ports]

(23)

Target Accelerator

• PCI Express x8 interface (Xilinx V5 LXT FPGA)

• Altera Stratix II FPGA

• 2 DDR2 modules, 16 GB, 6.25 GB/s; SRAMs

(24)

Experimental results

#mapper  #reducer  WL (s)   Total (s)  Speedup (WL)  Speedup (Total)
1        1         320.9    321.96     0.33          0.33
2        1         160.5    161.52     0.65          0.65
4        1         80.22    81.293     1.30          1.30
8        1         40.11    41.181     2.60          2.56
16       1         20.06    21.125     5.20          4.99
32       1         10.09    11.159     10.33         9.44
52       1         6.228    7.297      16.74         14.44
64       1         5.107    6.176      20.42         17.06
128      1         2.616    3.685      39.87         28.59
146      1         2.242    3.311      46.52         31.82

31.82X speedup with 146 parallel mappers
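The short Python sketch below recomputes two quantities from the table to show why the total speedup lags the WL speedup as mappers are added. It assumes the Total speedup column is a fixed baseline time divided by the Total time, which the slide does not state; the variable names are illustrative.

```python
# (mappers, WL seconds, total seconds, total speedup), copied from the table.
rows = [
    (1, 320.9, 321.96, 0.33),
    (2, 160.5, 161.52, 0.65),
    (4, 80.22, 81.293, 1.30),
    (8, 40.11, 41.181, 2.56),
    (16, 20.06, 21.125, 4.99),
    (32, 10.09, 11.159, 9.44),
    (52, 6.228, 7.297, 14.44),
    (64, 5.107, 6.176, 17.06),
    (128, 2.616, 3.685, 28.59),
    (146, 2.242, 3.311, 31.82),
]

# Time spent outside the WL step, in seconds, for each configuration.
overheads = [round(total - wl, 3) for _, wl, total, _ in rows]

# Baseline time implied by each row, ASSUMING speedup = baseline / total.
baselines = [round(speedup * total, 1) for _, _, total, speedup in rows]

for (mappers, _, _, _), overhead, baseline in zip(rows, overheads, baselines):
    print(f"{mappers:>3} mappers: non-WL time {overhead} s, "
          f"implied baseline {baseline} s")
```

The non-WL time stays between roughly 1.0 and 1.1 s in every row while the WL time shrinks almost linearly with the mapper count, so this fixed part takes a growing share of the total and limits the overall speedup.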

(25)

Scalability

Mapper    1    2    4    8    16   32   52   64   128  146
ALUT      1%   2%   3%   5%   10%  19%  31%  38%  75%  86%
Register  1%   2%   4%   6%   11%  17%  32%  39%  81%  89%

[Figure: speedup versus number of mappers (N mappers). Series: WL with CDP; Total with CDP; WL w/o CDP; Total w/o CDP]

(26)

Design Productivity

• Manual design

– More than 3 months after the hardware circuit board was ready

• FPGA-based MapReduce

– Weeks

(27)

Summary

• Designed building blocks for MapReduce on FPGA

• Achieved results comparable to manual design

• Future work

– Use C2HDL compilers to further increase design productivity

– Build a runtime for multiple machines

(28)

Thanks!
