• No results found

Design of Runtime System for Task-based FPGA Programming

N/A
N/A
Protected

Academic year: 2021

Share "Design of Runtime System for Task-based FPGA Programming"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Design of Runtime System for

Task-based FPGA Programming

Programming Environment Research Team

Jinpil Lee

JLESC 2020

(2)

Task offloading to FPGA using task

Batch generation in runtime

(3)

⚫ Writing computing kernels for FPGAs

► OpenCL/OpenMP/OpenACC/DPC++/DSL

⚫ How to control the generated kernel? ► OpenCL event

► OpenACC kernel ► OpenMP target

generates FPGA kernels ► XcalableMP task

assumption

1. optimized libraries using FPGAs (e.g. BLAS for FPGA)

2. APIs are called by CPUs 3. inner-node parallelism

(inter-node work distribution is the future work)

Background

FPGA SGEMM SGEMM SGEMM SGEMM FPGA SGEMM SGEMM SGEMM SGEMM FPGA SGEMM SGEMM SGEMM FPGA SGEMM SGEMM SGEMM 1

(4)

⚫ In some OpenMP task program, tasks like yellow part will appear ► dominant computation time

► can be executed in parallel by OpenMP task ⚫ Host can offload these tasks asynchronously ► overlap between CPU and FPGA

Motivated Example

#pragma omp task depend …

{ …

calc_on_fpga(event); do {

#pragma omp taskyield

status = checkStatus(event); } while (status != done);

… }

#pragma xmp task depend async on dev(FPGA) … { calc(event); } #pragma xmp variant(FPGA:calc) { calc_on_fpga(event); }

(5)

⚫ Prepared independent queue for each FPGA instance ► parallel execution

► cl_command_queue… like CUDA stream object ⚫ Asynchronous task execution

► Argobots

Implementation

Hardware configuration

CPU Intel Xeon E5-2660 v4 x 2 Host DRAM DDR4-2400 16GB x 4 FPGA Board BittWare A10PL4 (Intel Arria10 GX1150) PCIe Gen3 x8 FPGA DRAM DDR4-2133 4GB x 2 Software configuration OS CentOS 7.3 x64 Host Compiler Intel C Compiler 18.1 Thread library Argobots 1.0b1

FPGA compiler Intel FPGA SDK for OpenCL 17.12.304

(6)

SIMD vs Instance

SIMD width num. of instances Fmax (MHz) DSPs (%) BRAM (%) S16 16 1 164.71(1.0) 69 60 S8 8 2 179.59(1.09) 69 69 S4 4 4 213.94(1.29) 70 90 Lo w er is Be tt er Hig he r is Be tt er 1 0.97 1 1 1 1.06 1 1.041.08 1 1.04 1.1

(7)

Multiple Instances with

⚫ Independent task queue ► like multi-core CPUs

► can use XMP/OMP task ⚫ Single task queue

► like GPUs

► internal hardware scheduler

► single task will use the whole device → requires parallelism inside a task

Parallelism on FPGAs

FPGA Core0 Core2 Core1 Core3

(8)

⚫ batch: gathering multiple calculation kernels into a single task ⚫ implementation depends on the target hardware

- CPU/GPU implementation exploits its parallelism - assign kernels onto computing cores

⚫ needs code modification

Batched Kernel API

void calc_dgemms(const int num_kernels, const int n, double **A, double **B,

double **C, double *DMONE, double *DONE) {

for (int i = 0; i < num_kernels; i++) {

cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,

&DMONE, A[i], n, B[i], n, &DONE, C[i], n);

cudaDeviceSynchronize(); }

}

BLAS API

void calc_dgemms_batched(const int num_kernels, const int n, double **A, double **B, double **C, double *DMONE, double *DONE) {

cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &DMONE, A, n, B, n, &DONE, C, n, num_kernels);

cudaDeviceSynchronize(); }

(9)

⚫ static vs dynamic ⚫ static

- no runtime overhead

- covers easy cases ⚫ dynamic

- simple

- runtime overhead - irregular patterns

Code Translation from OMP

for (int i = k+1; i < num_tiles; i++) { for (int j = k+1; j < i; j++) {

#pragma omp task depend(in:A[k][i],A[k][j]) depend(out:A[j][i])

dgemm(A[k][i], A[k][j], A[j][i], tile_size, tile_size); }

#pragma omp task depend(in:A[k][i]) depend(out:A[i][i]) dsyrk(A[k][i], A[i][i], tile_size, tile_size);

}

for (int i = k+1; i < num_tiles; i++) {

#pragma omp task depend(in:A[k][i],A[k][K+1:i]) ¥¥ depend(out:A[K+1:i][i])

dgemm_batch(A_ptr_array, ...);

#pragma omp task depend(in:A[k][i]) depend(out:A[i][i]) dsyrk(A[k][i], A[i][i], tile_size, tile_size);

(10)

void cholesky_decomposition(double* A[nt][nt], ...) { for (int k = 0; k < nt; k++) {

void *args[3] = {A[k][k], &tile_size, &tile_size}; void *out_data[1] = {A[k][k]};

task_create(0, potrf, 3, args, 0, NULL, 1, out_data); for (int i = k + 1; i < nt; i++) {

void *args[4] = {A[k][k], A[k][i], &tile_size, &tile_size}; void *in_data[1] = {A[k][k]};

void *out_data[1] = {A[k][i]};

task_create(0, trsm, 4, args, 1, in_data, 1, out_data); }

for (int i = k + 1; i < nt; i++) { for (int j = k + 1; j < i; j++) {

void *args[5] = {A[k][i], A[k][j], A[j][i], &tile_size, &tile_size}; void *in_data[2] = {A[k][i], A[k][j]};

void *out_data[1] = {A[j][i]};

task_create(1, gemm, 5, args, 2, in_data, 1, out_data);

}

void *args[4]

⚫ Front-end generates unique batch IDs - function name, argument pointer/value ⚫ 0: non-batch ⚫ task runtime - task_create() - task_waitall()

Translated Code

8

(11)

⚫ non-batch tasks use normal OpenMP runtime ⚫ batch task

- task with batch ID - put into batch queue

- merged when the same kind of task exists

Task Runtime

Task (single)

batch task

Task

(batch) Task(batch)

batch queue task queue non-batch task

Task

(12)

⚫ task scheduling algorithm uses the task queue

- (batch) task is not ready for task scheduling when in the batch queue ⚫ batch tasks are moved to the task queue

- when the task queue is empty (ready for scheduling) - delaying scheduling to gather more kernels into a batch

Task Runtime (cont'd)

Task (single)

batch task

Task

(batch) Task(batch)

batch queue task queue normal task

Task

(single) (single)Task Task

(13)

⚫ DGEMM Loop

simple、shows the ideal performance ⚫ Blocked Cholesky Decomposition

to see the runtime overhead ⚫ single thread, all serialized

Evaluation

Item Name

CPU Intel (R) Xeon (R) CPU E5-2680 8 cores (2 sockets), 2.70 GHz

Intel Compiler 18.0.3 GPU NVIDIA Tesla K20 CUDA 9.1 with cuBLAS

(14)

⚫ block (tile) size: 128, # Blocks: 4096 → 36% perf improvement ⚫ little improvement with small matrices

⚫ not consistent with the previous result ...

(15)

⚫ system: overall runtime overhead (batch creation, scheduling) ⚫ huge performance improvement & overhead with small matrices ⚫ trade off between batch efficiency and task overhead?

(16)

⚫ Work distribution among FPGA nodes ► data distribution

► scheduling tasks among nodes by dependency ⚫ Remote Procedure Call from Fugaku to ESSPER ► by XcalableMP?

References

Related documents

A supply chain model functioning under periodic review base-stock inventory system to assist the manufacturing managers at HP to administer material in their supply chains has

The purpose of the study is to determine the technical feasibility and cost of a lake probe mission both as an element of a future Titan flagship mission and as a standalone

The availability of synthetic flavors and fragrances has made possible a large variety of products, from inexpensive beverages to perfumed soap to used cars with applied “new

The credit card number is validated using luhn algorithm, if the result of validation is true then the number will be consider as valid, and check 3 will executed,

The Galaxy View shows high level data about the entire network; Small Multiple View is in the middle giving a reason- able amount of information on a user selected subset of

The aim of this study is therefore to analyze potential transformations of volunteering in Sweden, and the implications, by exploring how Swedish volunteer organizations have used

– Others: Limit number of roles at runtime (per session) or based on history or pre-requisite (e.g., user can only be assigned to the testing role if assigned to project role

Input from control phase available Through analysis engine Output for design phase available - (only arrival data) - : not supported by FileNet, should be done manually.. The