~ Greetings from WSU CAPPLab ~

(1)

“Multicore with SMT/GPGPU provides the ultimate

performance; at WSU CAPPLab, we can help!”

Dr. Abu Asaduzzaman,

Assistant Professor and Director

Wichita State University (WSU)

Computer Architecture & Parallel Programming Laboratory (CAPPLab)

Wichita, Kansas, USA

(2)

Outline

► ■ Introduction

 (Juggling)

■ (Ultimate) Performance

 Multicore Architectures, Simultaneous Multithreading (SMT)

 Multicore with SMT provides the ultimate performance – T/F?

■ CAPPLab

 Researchers, Resources

 Research Activities

 (Multicore with SMT plus GPGPU/CUDA Technology)

■ Discussion

(3)

Introduction

Presenters

■ Dr. Abu Asaduzzaman

 Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU

 Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab)

■ (Juggling)

 http://www.youtube.com/watch?v=PqBlA9kU8ZE

 http://www.youtube.com/watch?v=5AYevG1a8_g&feature=related

(4)

Performance

(Single-Core to) Multicore Architecture

■ History of Computing

 Word “computer” in 1613 (this is not the beginning)

 Von Neumann architecture (1945) – data/instructions memory

 Harvard architecture (1944) – data memory, instruction memory

■ Single-Core Processors

 In most modern processors: split CL1 (I1, D1), unified CL2, …

 Intel Pentium 4, AMD Athlon Classic, …

■ Popular Programming Languages

(5)

(Single-Core to) Multicore Architecture

Courtesy: Jernej Barbič, Carnegie Mellon University

 Input  Process/Store  Output

Multi-tasking

 Time sharing (Juggling!)

Cache not shown

(6)

Single-Core  “Core”

Performance

a single core

Courtesy: Jernej Barbič, Carnegie Mellon University

(7)

Major Steps to Execute an Instruction

[Figure: 68000 CPU and memory. The CPU contains data registers D7 to D0, address registers A7', A7 to A0, the PC, IR, SR, ALU, and decoder/control unit. Execution steps shown: 1: I.F., 2: I.D., (3) O.F. (16b/24b), 4: I.E., (5) W.B.]

(8)

Performance

Thread 1: Integer (INT) Operation

(Pipelining Technique)

[Figure: pipeline stages: 1: Instruction Fetch, 2: Instruction Decode, (3) Operand(s) Fetch, 4: Integer Operation (Arithmetic Logic Unit) or Floating Point Operation, (5) Result Write Back]

(9)

Performance

Thread 2: Floating Point (FP) Operation

(Pipelining Technique)

[Figure: pipeline stages: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) or Floating Point Operation, Result Write Back]

(10)

Performance

Threads 1 and 2: INT and FP Operations

(Pipelining Technique)

[Figure: pipeline stages (Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation in the Arithmetic Logic Unit or Floating Point Operation, Result Write Back), annotated with Thread 1: Integer Operation and Thread 2: Floating Point Operation]

(11)

Performance

Threads 1 and 3: Integer Operations

[Figure: one pipeline, annotated with Thread 1: Integer Operation and Thread 3: Integer Operation]

POSSIBLE?

(12)

Performance

Threads 1 and 3: Integer Operations

(Multicore)

[Figure: two pipelines, Core 1 and Core 2, annotated with Thread 1: Integer Operation and Thread 3: Integer Operation]

POSSIBLE?

(13)

Performance

Threads 1, 2, 3, and 4: INT & FP Operations

(Multicore)

[Figure: two pipelines (Core 2 labeled), annotated with Thread 1: Integer Operation and Thread 3: Integer Operation]

POSSIBLE?
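
The "POSSIBLE?" questions on these pipeline slides can be explored with a toy model. The sketch below is only an illustration under stated assumptions: each core is assumed to own exactly one integer (INT) unit and one floating-point (FP) unit, Threads 2 and 4 are assumed to be FP threads, and the function name `can_issue_together` is hypothetical.

```python
# Toy model (an assumption, not a real microarchitecture): each core owns
# one integer (INT) unit and one floating-point (FP) unit, and a unit can
# serve only one thread per cycle.
def can_issue_together(threads, cores=1):
    """Return True if all threads can execute in the same cycle."""
    free = {"INT": cores, "FP": cores}   # available units of each kind
    for kind in threads:
        if free[kind] == 0:
            return False                 # structural conflict on this unit
        free[kind] -= 1
    return True

print(can_issue_together(["INT", "FP"]))                        # Threads 1 and 2, one core
print(can_issue_together(["INT", "INT"]))                       # Threads 1 and 3, one core
print(can_issue_together(["INT", "INT"], cores=2))              # Threads 1 and 3, two cores
print(can_issue_together(["INT", "FP", "INT", "FP"], cores=2))  # Threads 1 to 4, two cores
```

In this model, two integer threads conflict on a single core but fit on two cores, which is the motivation for combining SMT with multicore.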

(14)

Performance

Simultaneous Multithreading (SMT)

■ Thread

 A running program (or code segment) is a process

 Process → processes / threads

■ Multithreading (e.g., Intel Pentium 4 Hyper-Threading)

 Multiple threads running on a single processor (time sharing)

■ Simultaneous Multithreading (SMT)

 Multiple threads running on a single processor at the same time

■ Generating/Managing Multiple Threads

(15)

Performance

Multicore Architecture

■ Single-Core Processors

■ Multiprocessor and Multicomputer Systems

 Multiple processors: shared/common memory, local memory

 Multiple processors: own private memory

■ Multicore Processors

 Multiple cores on a single chip

 Working together, sharing resources

■ Multicore Programming Language supports

(16)

Parallel/Concurrent Computing

Parallel Processing – It is not fun!

 Paying the lunch bill together

 Started with $30; spent $29 ($27 + $2)

 Where did $1 go? (Juggling!)

Friend   Before Eating   Total Bill   Return   Tip   After Paying   Total Spent
A        $10                                         $1             $9
B        $10             $25          $5       $2    $1             $9
C        $10                                         $1             $9
Total    $30                                   $2                   $27
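
The bookkeeping behind the lunch-bill puzzle can be checked in a few lines. This is a minimal sketch using only the amounts on the slide; the variable names are illustrative.

```python
paid_in = 3 * 10          # each of the three friends hands over $10
bill = 25                 # total bill
tip = 2                   # tip kept by the waiter
returned = 3 * 1          # each friend gets $1 back

net_spent = paid_in - returned            # 27: what the friends actually spent
assert net_spent == bill + tip            # the $27 already includes the $2 tip
assert net_spent + returned == paid_in    # 27 + 3 = 30, so nothing is missing
print(net_spent, net_spent + returned)
```

Adding the tip to the $27 counts it twice; the quantity to add back is the $3 returned, not the $2 tip. The same kind of bookkeeping error is easy to make when work is split among parallel workers.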

(17)

Ultimate Performance

Multicore with SMT – is it enough?

■Example: Matrix Multiplication

 [C] = [A] [B]

 2 x 2 Matrix

 8 (i.e., 2 * 2^2) multiplications

 4 (i.e., 1 * 2^2) additions

(18)

 3 x 3 Matrix; how many multiplications and additions?

 27 (i.e., 3 * 3^2, i.e., 3^3) multiplications

(19)

Ultimate Performance

Algorithm Design Techniques

■Example: Matrix Multiplication

 [C] = [A] [B]

 4 x 4 Matrix; how many multiplications and additions?

 64 (i.e., 4^3) multiplications

 N^3 multiplications
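
The operation counts on these slides can be confirmed by instrumenting the straightforward triple-loop algorithm. The sketch below is illustrative (the function name `matmul_count` is not from the slides); it counts N^3 multiplications and (N - 1) * N^2 additions for an N x N product.

```python
def matmul_count(a, b):
    """Multiply two N x N matrices and count scalar multiplications and additions."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            total = a[i][0] * b[0][j]        # first product needs no addition
            mults += 1
            for k in range(1, n):
                total += a[i][k] * b[k][j]   # one multiplication, one addition
                mults += 1
                adds += 1
            c[i][j] = total
    return c, mults, adds

for n in (2, 3, 4):
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    _, mults, adds = matmul_count(identity, identity)
    print(n, mults, adds)   # prints: 2 8 4, then 3 27 18, then 4 64 48
```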

(20)

Algorithm Design Techniques

■Example: Matrix Multiplication

 4 x 4 Matrix

 64 (i.e., 4^3) multiplications

 48 (i.e., 3 * 4^2) additions

 2 x 2 Matrix

 8 (i.e., 2^3) multiplications

 4 (i.e., 1 * 2^2) additions

 Are we reducing *s/+s?

 What is the message?

[Figure: matrices A and B partitioned into blocks such as A_1,1 and A_1,2]

(21)

Algorithm Design Techniques

■Example: Matrix Multiplication

 [C] = [A] [B]

 Say, we have unlimited 2 x 2 Matrix solvers with 8 MULT

 Then it takes “only” 2 * 8 MULT time units

 Do we have unlimited solvers/cores?
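
One way to picture the "2 x 2 solvers" idea is block matrix multiplication: a 4 x 4 product is split into 2 x 2 block products that do not depend on one another. The sketch below is a sequential illustration only; the helper names (`matmul`, `add`, `block`, `block_matmul_4x4`) are hypothetical, and no claim is made about the actual timing on any hardware.

```python
def matmul(a, b):
    """Plain N x N matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(m, r, c):
    """Return the 2 x 2 block of a 4 x 4 matrix at block row r, block column c."""
    return [row[2 * c:2 * c + 2] for row in m[2 * r:2 * r + 2]]

def block_matmul_4x4(a, b):
    c = [[0] * 4 for _ in range(4)]
    products = 0
    for r in range(2):
        for s in range(2):
            # C_rs = A_r0 * B_0s + A_r1 * B_1s: two independent 2 x 2 products
            p0 = matmul(block(a, r, 0), block(b, 0, s))
            p1 = matmul(block(a, r, 1), block(b, 1, s))
            products += 2
            blk = add(p0, p1)
            for i in range(2):
                for j in range(2):
                    c[2 * r + i][2 * s + j] = blk[i][j]
    return c, products

a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[(i + j) % 3 for j in range(4)] for i in range(4)]
c, products = block_matmul_4x4(a, b)
assert c == matmul(a, b)        # same result as the direct 4 x 4 product
print(products, products * 8)   # 8 block products, 64 scalar multiplications
```

The total number of multiplications is unchanged (8 block products of 8 multiplications each); what changes is that the eight block products could be handed to separate solvers or cores, if enough of them are available.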

(22)

Ultimate Performance

GPGPU/CUDA Technology

■ GPGPU

 General-Purpose computing on Graphics Processing Units (GPGPU, GPGP or less often GP²U).

 More for scientific usages.

■ GPU

 Graphics Processing Units.

 Mainly for multimedia usages.

■ CUDA

 CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA.

 Provides a GPGPU programming interface.

(23)

Ultimate Performance

GPGPU/CUDA Technology

■ (Looking back: PCI)

 PCI (Peripheral Component Interconnect)

■ CPU – PCI – Peripherals

■ CPU – PCI-E – Peripherals

 PCI Express (Peripheral Component Interconnect Express)

■ One CPU, Multiple GPUs

■ (Juggling)

[Figure: one CPU connected to multiple GPUs]

(24)

Ultimate Performance

GPGPU/CUDA Technology

■ The GPU (the chip itself) consists of a group of Streaming Multiprocessors (SMs)

■ Inside each SM:

 32 cores (sharing the same instruction)

 64 KB shared memory (shared among the 32 cores)

 32K 32-bit registers

 2 warp schedulers (to schedule instructions)

 4 special function units

(25)

Ultimate Performance

GPGPU/CUDA Technology

■ The host (CPU) executes a kernel on the GPU in 4 steps

(Step 1) CPU allocates memory on the GPU and copies data to it. CUDA API:

cudaMalloc(), cudaMemcpy()

(27)

(Step 2) CPU sends function parameters and instructions to GPU

CUDA API:

(28)

(Step 3) GPU executes instructions as scheduled in warps

(Step 4) Results need to be copied back to host memory (RAM) using cudaMemcpy()
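
The four host/device steps can be mimicked on the CPU to make the data flow concrete. The following is a pure-Python emulation, not CUDA code: no GPU or CUDA library is involved, and every name in it (`FakeDevice`, `malloc`, `memcpy_to_device`, `launch`, `memcpy_to_host`, `vector_add`) is hypothetical and only loosely mirrors the real API calls named on the slides.

```python
# CPU-only emulation; no GPU or CUDA library is involved. All names are hypothetical.
class FakeDevice:
    def __init__(self):
        self.memory = {}                        # stands in for GPU global memory

    def malloc(self, name, n):                  # Step 1a: allocate (cf. cudaMalloc)
        self.memory[name] = [0.0] * n

    def memcpy_to_device(self, name, host):     # Step 1b: copy in (cf. cudaMemcpy)
        self.memory[name][:] = host

    def launch(self, kernel, n, *names):        # Steps 2-3: send kernel + args, run
        buffers = [self.memory[x] for x in names]
        for tid in range(n):                    # a real GPU runs these in parallel
            kernel(tid, *buffers)

    def memcpy_to_host(self, name):             # Step 4: copy results back
        return list(self.memory[name])

def vector_add(tid, a, b, out):                 # the "kernel": one element per thread
    out[tid] = a[tid] + b[tid]

host_a, host_b = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
gpu = FakeDevice()
for name, data in (("a", host_a), ("b", host_b)):
    gpu.malloc(name, len(data))
    gpu.memcpy_to_device(name, data)
gpu.malloc("out", len(host_a))
gpu.launch(vector_add, len(host_a), "a", "b", "out")
result = gpu.memcpy_to_host("out")
print(result)   # [11.0, 22.0, 33.0]
```

The point of the emulation is the separation of address spaces: the kernel only ever touches device buffers, so input must be copied in before the launch and results copied out afterwards.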

(29)

Ultimate Performance

GPGPU/CUDA Technology

■ CUDA threads are grouped into blocks. This is to optimize the use of memory.

■ The instructions sent by the host to the GPU are called a kernel.
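
How threads grouped into blocks map onto data can be sketched without a GPU. The emulation below assumes a one-dimensional grid and uses the usual convention of computing a global index as block index times block size plus thread index; `launch_1d` and `scale` are hypothetical names, and the loops run sequentially where a GPU would run threads in parallel.

```python
def launch_1d(kernel, n, threads_per_block, *args):
    """Emulate a 1-D grid: run kernel once per (block, thread) pair."""
    num_blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, n, *args)
    return num_blocks

def scale(block_idx, block_dim, thread_idx, n, data, factor):
    i = block_idx * block_dim + thread_idx   # global index of this thread
    if i < n:                                # threads past the end do nothing
        data[i] *= factor

values = list(range(10))
blocks = launch_1d(scale, len(values), 4, values, 3)
print(blocks, values)   # 3 [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```

With 10 elements and 4 threads per block, three blocks are needed and the last two threads of the third block are idle, which is why the bounds check inside the kernel matters.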

(30)

Ultimate Performance

GPGPU/CUDA Technology

Each CUDA thread will execute on one core.

Depending on the memory requirements of a kernel, multiple blocks may execute on each SM.

Each kernel can be executed by only one device (unless the programmer intervenes).

Multiple kernels may be executed at one time.

(31)

Ultimate Performance

Case Study 1 (data independent computation without GPU/CUDA)

■ Matrix Multiplication

(32)

Ultimate Performance

Case Study 2 (data dependent computation without GPU/CUDA)

■ Heat Transfer on 2D Surface
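
To illustrate why heat transfer on a 2D surface is a data-dependent computation, here is a minimal CPU-only sketch. It assumes a standard four-neighbor averaging (Jacobi) update on a small grid with fixed boundary temperatures; the slides do not state which numerical method was used, so the method, grid size, and names (`jacobi_step`, `plate`) are assumptions.

```python
def jacobi_step(grid):
    """Return a new grid where each interior cell is the mean of its 4 neighbors."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]             # boundary cells stay fixed
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# 5 x 5 plate: top edge held at 100 degrees, everything else starts at 0.
plate = [[100.0] * 5] + [[0.0] * 5 for _ in range(4)]
for _ in range(200):
    plate = jacobi_step(plate)                 # step t+1 depends on all of step t
print(round(plate[2][2], 2))
```

Within one step, every cell can be updated independently, but each step needs the complete result of the previous one. That dependence between steps is what distinguishes this case from matrix multiplication.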

(34)

Ultimate Performance

Case Study 3 (data dependent computation with GPU/CUDA)

■ Fast Effective LSP Simulation

 Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, either partially or wholly, because of their high strength-to-weight ratio, stiffness, and suitability for large-scale manufacturing in any shape. However, the lack of lightning strike protection (LSP) for composite materials limits their use in many applications.

 We propose a fast and effective simulation model, using NVIDIA general-purpose graphics processing unit (GPGPU) and Compute Unified Device Architecture (CUDA) technology, targeted at LSP analysis of composite aircraft.

(36)

 In many cases, such as a lightning strike on a composite material, the charge distribution is not known in advance, and Poisson's equation can be used to solve the electrostatic problem. Applying the Laplacian operator to the electric potential function over a region of space where the charge density is not zero, Poisson's equation is:
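
In its standard electrostatic form (SI units), Poisson's equation can be written as follows. The symbols are assumed here, since the slide text does not name them: \varphi is the electric potential, \rho the charge density, and \varepsilon the permittivity of the medium.

```latex
\nabla^{2}\varphi = -\frac{\rho}{\varepsilon}
```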

(37)

 If the charge density is zero all over the region, Poisson's equation becomes Laplace's equation:
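
In standard notation, with \varphi assumed to denote the electric potential (the slide text does not name the symbol), Laplace's equation is:

```latex
\nabla^{2}\varphi = 0
```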

 A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, “Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials,” under preparation, IEEE SoutheastCon conference 2013, Jacksonville, Florida, April 4-7, 2013.

(38)

■ Simulation

 CPU Only

 CPU/GPU w/o shared memory

 CPU/GPU with shared memory

(39)

Ultimate Performance

Case Study 4 (data independent computation with GPU/CUDA)

■ Quantum Computing

 Ongoing …

 Expecting collaboration with Dr. Kumar, EECS, WSU

■ Other Areas

 Eco-Biological studies

 Medical studies

(40)

WSU CAPPLab

CAPPLab

■ Computer Architecture & Parallel Programming Laboratory (CAPPLab)

 Physical location: 245 Jabara Hall

 URL: http://www.cs.wichita.edu/~capplab/

 E-mail: [email protected]

 Tel: +1-316-WSU-3927

■ Key Objectives

 Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields.

 Educate advanced-level computer architecture and parallel

(42)

WSU CAPPLab

Researchers

■ Faculty Members

 Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU

■ Students

 Chok M. Yip, MS Student, EECS Dept.

 Nasrin Sultana, MS Student, EECS Dept.

 Zachary A. Vickers, BS Student, EECS Dept.

 Hin Yun Lee, MS in CS, EECS Dept.

■ Others

 Dr. Ramazan Asmatulu, Assoc. Prof., ME, WSU

 Dr. Preethika Kumar, Asst. Prof., EECS, WSU

(43)

WSU CAPPLab

Resources

■ Hardware

 1: CUDA Server – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8 GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6 GB GDDR5 memory

 2: CUDA PC – CPU: Xeon E5506, …

 3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC)

 2 CUDA-enabled Windows workstations/PCs

 1 CUDA-enabled laptop

 More …

■ Software

(44)

WSU CAPPLab

Past/Current Activities

■ WSU became “CUDA Teaching Center” for 2012-13

 Support from NVIDIA

 Teaching parallel programming

■ Workshop

 GPGPU/CUDA/C Summer 2012 (10 participants)

 GPGPU/CUDA/C Summer 2013

■ Collaborative Research

 Dr. Ramazan Asmatulu

 Dr. Preethika Kumar

 More…

(45)

WSU CAPPLab

Past/Current/Future Activities

■ Research Funding

 “M2SYS-WSU Biometric Cloud Computing Research Project”

 Teaching (Hardware/Financial) supports from NVIDIA

■ Research Funding

 MURPA, pending, ORA, WSU

 NSF TUES Type-1, pending, NSF

(46)