“Multicore with SMT/GPGPU provides the ultimate
performance; at WSU CAPPLab, we can help!”
~ Greetings from WSU CAPPLab ~
Dr. Abu Asaduzzaman
Assistant Professor and Director
Wichita State University (WSU)
Computer Architecture & Parallel Programming Laboratory (CAPPLab)
Wichita, Kansas, USA
Outline
■ Introduction
(Juggling)
■ (Ultimate) Performance
Multicore Architectures, Simultaneous Multithreading (SMT)
Multicore with SMT provides the ultimate performance – T/F?
■ CAPPLab
Researchers, Resources
Research Activities
(Multicore with SMT plus GPGPU/CUDA Technology)
■ Discussion
Introduction
Presenters
■ Dr. Abu Asaduzzaman
Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU
Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab)
■ (Juggling)
http://www.youtube.com/watch?v=PqBlA9kU8ZE
http://www.youtube.com/watch?v=5AYevG1a8_g&feature=related
Performance
(Single-Core to) Multicore Architecture
■ History of Computing
Word “computer” in 1613 (this is not the beginning)
Von Neumann architecture (1945) – one memory shared by data and instructions
Harvard architecture (1944) – separate data memory and instruction memory
■ Single-Core Processors
In most modern processors: split level-1 cache CL1 (I1, D1), unified level-2 cache CL2, …
Intel Pentium 4, AMD Athlon Classic, …
■ Popular Programming Languages
(Single-Core to) Multicore Architecture
Courtesy: Jernej Barbič, Carnegie Mellon University
[Figure: Input, Process/Store, Output. Cache not shown.]
Multi-tasking
Time sharing (Juggling!)
Performance
Single-Core “Core”
[Figure: a single core]
Courtesy: Jernej Barbič, Carnegie Mellon University
Major Steps to Execute an Instruction
[Figure: 68000 CPU and Memory. The CPU holds data registers D7…D0, address registers A7’, A7…A0, PC, ALU, Decoder / Control Unit, IR, and SR.]
Start
1: I.F. (Instruction Fetch)
2: I.D. (Instruction Decode)
(3) O.F. (Operand Fetch)
4: I.E. (Instruction Execution)
(5) W.B. (Write Back)
Performance
Thread 1: Integer (INT) Operation
(Pipelining Technique)
1: Instruction Fetch
2: Instruction Decode
(3) Operand(s) Fetch
4: Integer Operation (Arithmetic Logic Unit) or Floating Point Operation
(5) Result Write Back
Performance
Thread 2: Floating Point (FP) Operation
(Pipelining Technique)
[Figure: pipeline stages: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) / Floating Point Operation, Result Write Back]
Performance
Threads 1 and 2: INT and FP Operations
(Pipelining Technique)
[Figure: pipeline stages: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) / Floating Point Operation, Result Write Back]
Thread 1: Integer Operation
Thread 2: Floating Point Operation
Performance
Threads 1 and 3: Integer Operations
[Figure: pipeline stages: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) / Floating Point Operation, Result Write Back]
Thread 1: Integer Operation
Thread 3: Integer Operation
POSSIBLE?
Performance
Threads 1 and 3: Integer Operations
(Multicore)
[Figure: two cores (Core 1, Core 2), each with its own pipeline: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) / Floating Point Operation, Result Write Back]
Thread 1: Integer Operation
Thread 3: Integer Operation
POSSIBLE?
Performance
Threads 1, 2, 3, and 4: INT & FP Operations
(Multicore)
[Figure: two cores (Core 1, Core 2), each with its own pipeline: Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation (Arithmetic Logic Unit) / Floating Point Operation, Result Write Back]
Thread 1: Integer Operation
Thread 2: Floating Point Operation
Thread 3: Integer Operation
Thread 4: Floating Point Operation
POSSIBLE?
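The “POSSIBLE?” questions on these slides can be checked with a toy model. The sketch below assumes that each core has exactly one integer unit and one floating-point unit, and that every thread occupies one unit for one time slot. The function name `time_slots` and this cost model are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from math import ceil

def time_slots(thread_ops, cores):
    """Minimum number of time slots needed to run all threads.

    Assumption: each core has one INT unit and one FP unit, and each
    thread occupies one unit of its type for exactly one time slot.
    """
    per_type = Counter(thread_ops)  # e.g. {"INT": 2, "FP": 1}
    # Threads of the same type compete for `cores` units of that type.
    return max((ceil(count / cores) for count in per_type.values()), default=0)

# Threads 1 (INT) and 2 (FP) on one core use different units.
print(time_slots(["INT", "FP"], cores=1))    # prints 1
# Threads 1 and 3 (both INT) on one core must take turns.
print(time_slots(["INT", "INT"], cores=1))   # prints 2
# Threads 1 and 3 (both INT) on two cores run side by side.
print(time_slots(["INT", "INT"], cores=2))   # prints 1
```

Under these assumptions, two integer threads can only run at the same time when a second integer unit is available, which is what the second core provides.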
Performance
Simultaneous Multithreading (SMT)
■ Thread
A running program (or code segment) is a process
A process can be split into processes / threads
■ Multithreading (Intel Pentium 4 Hyper-Threading)
Multiple threads running on a single processor (time sharing)
■ Simultaneous Multithreading (SMT)
Multiple threads running on a single processor at the same time
■ Generating/Managing Multiple Threads
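Generating and managing threads can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration that assumes the workload splits into independent chunks; `parallel_sum` and the chunking scheme are hypothetical names chosen for the example. In CPython the threads share one interpreter lock, so the sketch shows thread management rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, num_threads):
    """Sum `values` by giving each worker thread one contiguous chunk."""
    if not values:
        return 0
    chunk = -(-len(values) // num_threads)  # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each chunk is summed in its own task; map keeps the chunk order.
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

print(parallel_sum(list(range(100)), num_threads=4))  # prints 4950
```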
Performance
Multicore Architecture
■ Single-Core Processors
■ Multiprocessor and Multicomputer Systems
Multiprocessor: multiple processors with shared/common memory and local memory
Multicomputer: multiple processors, each with its own private memory
■ Multicore Processors
Multiple cores on a single chip
Working together, sharing resources
■ Multicore Programming Language Support
Parallel/Concurrent Computing
Parallel Processing – It is not fun!
Paying the lunch bill together
Started with $30; spent $29 ($27 + $2)
Where did $1 go? (Juggling!)
Friend        Before Eating   Total Bill   Return   Tip   After Paying
A             $10                                         $1
B             $10             $25          $5       $2    $1
C             $10                                         $1
Total         $30                                   $2
Total Spent   $9 + $9 + $9 = $27
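The bookkeeping behind the puzzle can be written out directly. This small Python check uses the amounts from the table; the variable names are chosen for illustration.

```python
paid_initially = 3 * 10   # each friend hands over $10
bill = 25                 # total bill
tip = 2                   # tip kept from the $5 returned
returned = 3 * 1          # each friend gets $1 back

spent = paid_initially - returned   # what the friends actually spent

# The $27 already contains the tip: $25 bill + $2 tip.
assert spent == bill + tip
# Adding the tip to $27 counts it twice; the correct check adds the refund.
assert spent + returned == paid_initially

print(spent, bill + tip, spent + returned)  # prints 27 27 30
```

No dollar is missing: the $2 tip is part of the $27, so the sum that matters is $27 + $3, not $27 + $2.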
Performance
Ultimate Performance
Multicore with SMT – is it enough?
■ Example: Matrix Multiplication
[C] = [A] [B]
2 x 2 Matrix
8 (i.e., 2 * 2^2) multiplications
4 (i.e., 1 * 2^2) additions
Ultimate Performance
Multicore with SMT – is it enough?
■ Example: Matrix Multiplication
[C] = [A] [B]
3 x 3 Matrix; how many multiplications and additions?
27 (i.e., 3 * 3^2, i.e., 3^3) multiplications
18 (i.e., 2 * 3^2) additions
Ultimate Performance
Algorithm Design Techniques
■ Example: Matrix Multiplication
[C] = [A] [B]
4 x 4 Matrix; how many multiplications and additions?
64 (i.e., 4^3) multiplications
In general, N^3 multiplications for an N x N matrix
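The operation counts can be confirmed by instrumenting the textbook triple loop. The sketch below is a plain Python illustration; `multiply_and_count` is a hypothetical helper name, and the counting convention (N multiplications and N - 1 additions per output element) is the one the slides use.

```python
def multiply_and_count(a, b):
    """Naive N x N matrix product; returns (C, multiplications, additions)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            total = a[i][0] * b[0][j]      # first product, no addition yet
            mults += 1
            for k in range(1, n):
                total += a[i][k] * b[k][j]  # one multiply and one add
                mults += 1
                adds += 1
            c[i][j] = total
    return c, mults, adds

for n in (2, 3, 4):
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    _, mults, adds = multiply_and_count(identity, identity)
    print(n, mults, adds)   # prints "2 8 4", then "3 27 18", then "4 64 48"
```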
Algorithm Design Techniques
■ Example: Matrix Multiplication
4 x 4 Matrix
64 (i.e., 4^3) multiplications
48 (i.e., 3 * 4^2) additions
2 x 2 Matrix
8 (i.e., 2^3) multiplications
4 (i.e., 1 * 2^2) additions
Are we reducing *s/+s?
What is the message?
[Figure: matrices A and B partitioned into blocks (A1,1, A1,2, …)]
Algorithm Design Techniques
■ Example: Matrix Multiplication
[C] = [A] [B]
Say we have unlimited 2 x 2 matrix solvers, each taking 8 MULT time units
Then a 4 x 4 multiplication takes “only” 2 * 8 MULT time units
Do we have unlimited solvers/cores?
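The idea of building a 4 x 4 product from 2 x 2 solvers can be sketched as a block multiplication. This Python version runs the solvers one after another and only counts the calls; the helper names (`solve_2x2`, `block`, `block_multiply_4x4`) are assumptions made for the example, and a real parallel version would hand the solver calls to separate cores.

```python
def solve_2x2(a, b):
    """One '2 x 2 solver': 8 multiplications and 4 additions."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]

def block(m, r, c):
    """The 2 x 2 block at block-row r, block-column c of a 4 x 4 matrix."""
    return [row[2 * c:2 * c + 2] for row in m[2 * r:2 * r + 2]]

def add_2x2(x, y):
    return [[x[i][j] + y[i][j] for j in range(2)] for i in range(2)]

def block_multiply_4x4(a, b):
    """4 x 4 product from 2 x 2 block products; returns (C, solver_calls)."""
    c = [[0] * 4 for _ in range(4)]
    calls = 0
    for r in range(2):
        for s in range(2):
            # C[r][s] = A[r][0] * B[0][s] + A[r][1] * B[1][s]
            p0 = solve_2x2(block(a, r, 0), block(b, 0, s))
            p1 = solve_2x2(block(a, r, 1), block(b, 1, s))
            calls += 2
            blk = add_2x2(p0, p1)
            for i in range(2):
                for j in range(2):
                    c[2 * r + i][2 * s + j] = blk[i][j]
    return c, calls

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[int(i == j) for j in range(4)] for i in range(4)]
c, calls = block_multiply_4x4(a, identity)
print(c == a, calls)   # prints True 8
```

The 8 solver calls at 8 multiplications each still add up to 64 multiplications, so the blocks do not reduce the work; they only make it easier to spread across solvers. One reading of the 2 * 8 figure is that each of the four output blocks needs two solver calls, and the four blocks can be computed at the same time.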
Ultimate Performance
GPGPU/CUDA Technology
■ GPGPU
General-Purpose computing on Graphics Processing Units (GPGPU, GPGP or less often GP²U).
More for scientific usages.
■ GPU
Graphics Processing Units. Mainly for multimedia usages.
■ CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA.
Provides a GPGPU programming interface.
Ultimate Performance
GPGPU/CUDA Technology
■ (Looking back: PCI)
PCI (Peripheral Component Interconnect)
■ CPU – PCI – Peripherals
■ CPU – PCI-E – Peripherals
PCI Express (Peripheral Component Interconnect Express)
■ One CPU, Multiple GPUs
■ (Juggling)
[Figure: one CPU connected to multiple GPUs]
Ultimate Performance
GPGPU/CUDA Technology
■ The GPU (the chip itself) consists of a group of Streaming
Multiprocessors (SMs)
■ Inside each SM:
32 cores (sharing the same instruction)
64 KB shared memory (shared among the 32 cores)
32K 32-bit registers
2 warp schedulers (to schedule instructions)
4 special function units
Ultimate Performance
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in GPU in 4 steps
(Step 1) CPU allocates memory on the GPU and copies data to it
CUDA API:
cudaMalloc(), cudaMemcpy()
Ultimate Performance
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in GPU in 4 steps
(Step 2) CPU sends function parameters and instructions to GPU
CUDA API:
Ultimate Performance
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in GPU in 4 steps
(Step 3) GPU executes instructions as scheduled in warps
(Step 4) Results need to be copied back to host memory (RAM) using cudaMemcpy()
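The four steps can be mimicked on the CPU to make the data flow explicit. The sketch below is a pure-Python emulation, not CUDA code: `DeviceBuffer`, `copy_to_device`, `copy_to_host`, `launch`, and `add_one` are hypothetical stand-ins for device memory, memory copies, a kernel launch, and a kernel. Only the order of the steps is taken from the slides.

```python
class DeviceBuffer:
    """Hypothetical stand-in for memory allocated on the GPU."""
    def __init__(self, size):
        self.data = [0] * size        # plays the role of cudaMalloc()

def copy_to_device(dev, host):
    dev.data[:] = host                # plays the role of a host-to-device copy

def copy_to_host(dev):
    return list(dev.data)             # plays the role of a device-to-host copy

def launch(kernel, num_threads, *args):
    # A real GPU schedules these threads in warps; here they run one by one.
    for thread_id in range(num_threads):
        kernel(thread_id, *args)

def add_one(thread_id, buf):
    """Example kernel: each thread updates one element."""
    buf.data[thread_id] += 1

host_in = [10, 20, 30, 40]
dev = DeviceBuffer(len(host_in))      # Step 1: allocate on the "GPU" ...
copy_to_device(dev, host_in)          # ... and copy the input data over
launch(add_one, len(host_in), dev)    # Steps 2 and 3: send the kernel, execute it
host_out = copy_to_host(dev)          # Step 4: copy the results back
print(host_out)                       # prints [11, 21, 31, 41]
```

The host copy `host_in` is left unchanged, which mirrors the fact that host memory and device memory are separate and results only reach the host through the explicit copy in Step 4.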
Ultimate Performance
GPGPU/CUDA Technology
■ CUDA threads are grouped into blocks. This is to
optimize the use of memory.
■ The function (set of instructions) sent by the host to the GPU is called a kernel.
Ultimate Performance
GPGPU/CUDA Technology
Each CUDA thread will execute on one core.
Depending on the memory requirements of a
kernel, multiple blocks may execute on each SM.
Each kernel can only be executed by one device (unless the programmer intervenes).
Multiple kernels may be executed at one time.
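How threads in blocks map onto data can be illustrated with the usual CUDA global-index formula, blockIdx.x * blockDim.x + threadIdx.x. The sketch below emulates that indexing on the CPU in Python; `emulate_grid` and the block size used here are assumptions for illustration.

```python
def emulate_grid(n, threads_per_block):
    """For each of n data elements, find the (block, thread) pair that handles it."""
    num_blocks = (n + threads_per_block - 1) // threads_per_block  # round up
    owner = [None] * n
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            global_idx = block_idx * threads_per_block + thread_idx
            if global_idx < n:        # guard: the last block may be partly idle
                owner[global_idx] = (block_idx, thread_idx)
    return num_blocks, owner

num_blocks, owner = emulate_grid(n=10, threads_per_block=4)
print(num_blocks)   # prints 3
print(owner[9])     # prints (2, 1)
```

With 10 elements and 4 threads per block, three blocks are needed and the last two threads of the final block have nothing to do, which is why kernels typically include a bounds check like the one above.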
Ultimate Performance
Case Study 1 (data independent computation
without GPU/CUDA)
■ Matrix Multiplication
Ultimate Performance
Case Study 2 (data dependent computation
without GPU/CUDA)
■ Heat Transfer on 2D Surface
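A common way to model heat transfer on a 2D surface is Jacobi iteration, in which each interior cell is replaced by the average of its four neighbours. The slides do not say which scheme the case study uses, so this Python sketch is an assumed illustration; `heat_step`, the plate size, and the boundary temperatures are hypothetical. It shows why the computation is data dependent: every new value needs the old values of its neighbours.

```python
def heat_step(grid):
    """One Jacobi step: interior cells become the mean of their 4 neighbours."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]    # boundary cells stay fixed
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i - 1][j] + grid[i + 1][j] +
                         grid[i][j - 1] + grid[i][j + 1]) / 4.0
    return new

# 4 x 4 plate: top edge held at 100 degrees, everything else starts at 0.
plate = [[100.0] * 4] + [[0.0] * 4 for _ in range(3)]
for _ in range(50):
    plate = heat_step(plate)
print([round(v, 2) for v in plate[1]])
```

Each step reads only the previous grid and writes a new one, so all interior cells within a step can be updated independently, while successive steps must run in order.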
Ultimate Performance
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective LSP Simulation
Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, either partially or wholly, because of their high strength-to-weight ratio, stiffness, and suitability for large-scale manufacturing in any shape. However, the lack of lightning strike protection (LSP) for composite materials limits their use in many applications.
We propose a fast and effective simulation model, targeted at LSP analysis of composite aircraft, that uses NVIDIA general-purpose graphics processing unit (GPGPU) and Compute Unified Device Architecture (CUDA) technology.
Ultimate Performance
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective LSP Simulation
In many cases, such as lightning strikes on a composite material, the charge distribution is not known, and Poisson's equation can be used to solve the electrostatic problem. Applying the Laplacian operator to the electric potential function over a region of space where the charge density is not zero, Poisson's equation is:
$\nabla^2 V = -\rho / \varepsilon$ (V: electric potential, $\rho$: charge density, $\varepsilon$: permittivity)
Ultimate Performance
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective LSP Simulation
If the charge density is zero all over the region, Poisson's equation becomes Laplace's equation:
$\nabla^2 V = 0$ (V: electric potential)
A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, “Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials,” under preparation, IEEE SoutheastCon conference 2013, Jacksonville, Florida, April 4-7, 2013.
Ultimate Performance
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective LSP Simulation
■ Simulation
CPU Only
CPU/GPU without shared memory
CPU/GPU with shared memory
Ultimate Performance
Case Study 4 (data independent computation
with GPU/CUDA)
■ Quantum Computing
Ongoing …
Expecting collaboration with Dr. Kumar, EECS, WSU
■ Other Areas
Eco-Biological studies
Medical studies
Outline
■ Introduction
Juggling
■ (Ultimate) Performance
Multicore Architectures, Simultaneous Multithreading (SMT)
Multicore with SMT provides the ultimate performance – T/F?
■ CAPPLab
Researchers, Resources
Research Activities
(Multicore with SMT plus GPGPU/CUDA Technology)
■ Discussion
WSU CAPPLab
CAPPLab
■ Computer Architecture & Parallel Programming
Laboratory (CAPPLab)
Physical location: 245 Jabara Hall
URL: http://www.cs.wichita.edu/~capplab/
E-mail: [email protected]
Tel: +1-316-WSU-3927
■ Key Objectives
Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields.
Educate advanced-level computer architecture and parallel
WSU CAPPLab
Researchers
■ Faculty Members
Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU
■ Students
Chok M. Yip, MS Student, EECS Dept.
Nasrin Sultana, MS Student, EECS Dept.
Zachary A. Vickers, BS Student, EECS Dept.
Hin Yun Lee, MS in CS, EECS Dept.
■ Others
Dr. Ramazan Asmatulu, Assoc. Prof., ME, WSU
Dr. Preethika Kumar, Asst. Prof., EECS, WSU
WSU CAPPLab
Resources
■ Hardware
1: CUDA Server – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8 GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6 GB GDDR5 memory
2: CUDA PC – CPU: Xeon E5506, …
3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC)
2 CUDA-enabled Windows workstations/PCs
1 CUDA-enabled laptop
More …
■ Software
WSU CAPPLab
Past/Current Activities
■ WSU became “CUDA Teaching Center” for 2012-13
Support from NVIDIA
Teaching parallel programming
■ Workshop
GPGPU/CUDA/C Summer 2012 (10 participants)
GPGPU/CUDA/C Summer 2013
■ Collaborative Research
Dr. Ramazan Asmatulu
Dr. Preethika Kumar
More…
WSU CAPPLab
Past/Current/Future Activities
■ Research Funding
“M2SYS-WSU Biometric Cloud Computing Research Project”
Teaching (Hardware/Financial) support from NVIDIA
■ Research Funding
MURPA, pending, ORA, WSU
NSF TUES Type-1, pending, NSF