Multicore Processor and GPU
Jia Rao
Assistant Professor in CS http://cs.uccs.edu/~jrao/
Moore’s Law
• The number of transistors on integrated circuits doubles approximately every two years
• CPU performance doubles every two years
The End of Moore’s Law
• CPU performance doubles every two years?
[Figure: transistor count vs. performance]
Multicore Processors
• If wider data path, wider registers, bigger caches, deeper pipelines, and intelligent branch prediction can NOT double performance, what to do with the doubled transistors?
• Put more cores on a chip -> multicore processor
Multiprocessor Memory Types
• Shared memory
- there is one (large) common shared memory for all processors
• Distributed memory
- each processor has its own (small) local memory, and its content is not replicated anywhere else
A Multicore Processor Is a Special Kind of Multiprocessor
• All processors are on the same chip (CMP)
• MIMD (Multiple Instructions Multiple Data)
- Different cores execute different threads, operating on different parts of memory
• Shared memory multiprocessor
- All cores share the same memory
Multicore Architecture
[Figure: Processor-0 and Processor-1, Memory node-0 and Memory node-1, connected by a cross-socket interconnect]
The Cache Coherence Problem
[Figure: processors, each with its own cache, connected through an interconnect to memory and I/O]
• Caches replicate contents of memory locally
• Processors can have different values for the same location
• Reading at a shared address should return the last value written
Adapted from slides of Fatahalian@cmu
Coherency mechanisms
• Directory-based
- In a directory-based system, the sharing status of each block is tracked in a common directory that maintains coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
• Snooping
- Each cache monitors the address lines for accesses to memory locations that it has cached. In a write-invalidate protocol, when a cache observes a write to a location it holds a copy of, its cache controller invalidates its own copy of the snooped memory location.
The MESI Protocol
• All coherence-related activity is broadcast to all processors
• Every cache line is in one of four states
- Modified: cache line is present only in the current cache, is dirty, and has been modified from the value in memory
- Exclusive: cache line is present only in the current cache, and is clean
- Shared: cache line may be stored in other caches, and is clean
- Invalid: cache line is invalid
The MESI Protocol (cont’d)
• Processor events
- PrRd: read
- PrWr: write
• Bus transactions
- BusRd: read request from the bus without intent to modify
- BusRdX: read request from the bus with the intent to modify
- BusWB: write line out to memory
• Accessing a cache line in the I state causes a cache miss
• A write can only be performed if the cache line is in the E or M state. If it is in the S state, the processor broadcasts a request for
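The events and states listed on the two MESI slides can be put together as a small transition function. This is a minimal sketch of the textbook protocol as seen by one cache, not the logic of any particular processor. The names `mesi_next` and `others_have_copy` are illustrative assumptions, and the write-back (BusWB) is only noted in a comment rather than modeled.

```c
#include <assert.h>

/* The four MESI states of a cache line. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Events seen by one cache: two from its own processor, two snooped on the bus. */
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } mesi_event;

/*
 * Next state of a line in one cache (hypothetical helper).
 * others_have_copy is consulted only on a read miss (PR_RD in INVALID),
 * where it selects SHARED versus EXCLUSIVE.
 */
mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    assert((int)s >= 0 && (int)s <= 3);
    switch (e) {
    case PR_RD:
        if (s == INVALID)              /* miss: a BusRd is issued */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                      /* hit in M, E or S */
    case PR_WR:
        /* From S or I the cache must first broadcast its intent to modify. */
        return MODIFIED;
    case BUS_RD:
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;             /* from M the line is also written back */
        return s;
    case BUS_RDX:
        return INVALID;                /* another cache is about to write */
    }
    return s;
}
```

For example, a read miss with no other sharers moves a line from Invalid to Exclusive, and a later local write moves it to Modified without any further bus transaction.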
Case Study: Intel Nehalem
Cache hierarchy of Intel Core i7 (CMU 15-418, Spring 2012)
[Figure: four cores, each with an L1 data cache and an L2 cache, connected by a ring interconnect to a shared L3 cache (one bank per core)]
• L1: private per core, 32 KB
- 8-way set associative, write back
- 2 x 16B loads + 1 x 16B store per clock
- 4-6 cycle latency
- 10 outstanding misses
• L2: private per core, 256 KB
- 8-way set associative, write back
- 32B / clock, 12 cycle latency
- 16 outstanding misses
• L3: per chip, 8 MB (annotated as 8 MB - 12 MB on the slide), inclusive
- 16-way set associative
- 32B / clock per bank
- 26-31 cycle latency
• 64 byte cache line size
• Review: key terms
- cache line
- write back vs. write through policy
- inclusion
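The capacity, associativity, and line size quoted above determine how many sets each cache has. The sketch below works through that arithmetic. The function names are illustrative, and the simple modulo indexing is an assumption: real last-level caches may hash addresses across banks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of sets = capacity / (associativity * line size). */
size_t cache_num_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return capacity_bytes / (ways * line_bytes);
}

/* Set index of an address: drop the line-offset bits, then wrap by the set count. */
size_t cache_set_index(uint64_t addr, size_t num_sets, size_t line_bytes)
{
    assert(num_sets > 0 && line_bytes > 0);
    return (size_t)((addr / line_bytes) % num_sets);
}
```

With the figures on this slide, the 32 KB, 8-way L1 with 64-byte lines has 32768 / (8 * 64) = 64 sets, and the 256 KB, 8-way L2 has 512 sets.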
Case Study: Intel Nehalem (cont’)
Performance Analysis Guide, p. 58
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDI    Hitm in local L3 CACHE, RFO snoop       04  35  396  4000190000000000
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDS    Local L3 CACHE in F or S, load snoop    04  35  396  40001A0000000000
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPIWB     Hitm in local L3 CACHE, load snoop      04  35  396  40001D0000000000
UNC_ADDR_OPCODE_MATCH.REMOTE.NONE      none                                    02  35  396  0
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDI   Hitm in remote L3 CACHE, RFO            02  35  396  4000190000000000
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDS   Remote L3 CACHE in F or S, load         02  35  396  40001A0000000000
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPIWB    Hitm in remote L3 CACHE, load           02  35  396  40001D0000000000
These opcode uses can be seen from the dual socket QPI communications diagrams below. These predefined opcode match encodings can be used to monitor HITM
accesses in particular and serve as the only event that allows profiling the requesting code on the basis of the HITM transfers its requests generate.
[Figure: dual-socket QPI message flow, "RdData request after LLC Miss to Local Home (Clean Rsp)". Socket 2 owns the address. The DRd misses in the LLC and the request goes to the local home, which broadcasts SnpData snoops to all other caching agents and issues a speculative memory read. Socket 1 looks up its cache, misses, and replies RspI. The fill completes to Socket 2 and the line is allocated in E state (I -> E).]
RdData after LLC miss and load from local memory
Case Study: Intel Nehalem (cont’)
Performance Analysis Guide, p. 59
[Figure: dual-socket QPI message flow, "RdData request after LLC Miss to Remote Home (Clean Rsp)". Socket 1 owns the address. Steps (1)-(13): DRd, cache lookup and cache miss on Socket 2; RdData sent to the remote home; snoop to the LLC and speculative memory read on Socket 1; clean response (RspI indicates a clean snoop); DataC_E_cmp sends completion and data back to Socket 2, which allocates the line in E state (I -> E).]
RdData after LLC miss and load from remote memory
Case Study: Intel Nehalem (cont’)
Performance Analysis Guide, p. 60
[Figure: dual-socket QPI message flow, "RdData request after LLC Miss to Remote Home (Hitm Res)". Socket 1 owns the address. Steps (1)-(13): DRd, cache lookup and cache miss on Socket 2; RdData sent to the remote home; the snoop hits a modified line (Hitm, M -> I), which is written back to the home; DataC_E_cmp sends completion and data to Socket 2, which allocates the line in E state (I -> E). RspIWb is an NDR response, a hint to the home that the write-back data (WbIData) follows shortly.]
[Figure: dual-socket QPI message flow, "RdData request after LLC Miss to Local Home (Hitm Response)". Socket 2 owns the address. The DRd misses in the LLC and the local home broadcasts SnpData snoops to all other caching agents; the snoop hits a modified line on Socket 1 (Hitm, M -> I), which replies RspIWb followed by WbIData; the data is written back to the home and the line is allocated in E state (I -> E) on Socket 2.]
RdData after LLC miss, invalidate remote modified copy, and load from local memory
Performance Implications
• Scalability issues
- significant traffic when scaling to high core counts
- storage cost for tracking sharers
- increased latency of cache misses
• False sharing
- two threads write to different variables residing on the same cache line, incurring significant amounts of coherence traffic
False Sharing
// Version 1: allocate per-thread data contiguously.
// May lead to false sharing: myData[0], myData[1], myData[2] sit on the
// same cache line, so accesses to myData by different threads cause
// coherence traffic.
long myData[NUM_THREADS];

// Version 2: pad each element out to a full 64-byte cache line.
struct PerThreadData {
    long myData;
    char padding[64 - sizeof(long)];
};
struct PerThreadData myData[NUM_THREADS];
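To see why the padded layout avoids false sharing, one can compute which cache line each array element falls on. The sketch below is an illustration only: it assumes a 64-byte line (the size quoted for Nehalem earlier) and an array that starts on a line boundary, and the names `padded_t` and `line_of_element` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Unpadded: consecutive elements are only sizeof(long) bytes apart. */
typedef long unpadded_t;

/* Padded: each element occupies one full cache line. */
typedef struct {
    long value;
    char padding[CACHE_LINE - sizeof(long)];
} padded_t;

/* Cache line (counted from the array start) that holds element i. */
size_t line_of_element(size_t i, size_t elem_size)
{
    assert(elem_size > 0);
    return (i * elem_size) / CACHE_LINE;
}
```

With the unpadded type, elements 0 and 1 map to the same line, so two threads updating them keep invalidating each other's copy. With the padded type, element i maps to line i, at the cost of 64 bytes per thread.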
Shared Resource Contention
• Contention can occur in different shared resources
- LLC, memory controller, hardware prefetcher, cross-socket interconnect
Contention-Aware Scheduling on Multicore Systems · 8:3
Fig. 1. The performance degradation relative to running solo for two different schedules of SPEC CPU2006 applications on an Intel Xeon X3565 quad-core processor (two cores share an LLC).
contention, and prefetching hardware contention all combine in complex ways to create the performance degradation that threads experience when sharing the LLC.
Our goal is to investigate contention-aware scheduling techniques that are able to mitigate as much as possible the factors that cause performance degradation due to contention for shared resources. Such a scheduler would provide speedier as well as more stable execution times from run to run. Any contention aware scheduler must consist of two parts: a classification scheme for identifying which applications should and should not be scheduled together as well as the scheduling policy that assigns threads to cores given their classification. Since the classification scheme is crucial for an effective algorithm, we focused on the analysis of various classification schemes. We studied the following schemes: Stack Distance Competition (SDC) [Chandra et al. 2005], Animal Classes [Xie and Loh 2008], Solo Miss Rate [Knauerhase et al. 2008], and the Pain Metric. The best classification scheme was used to design a scheduling algorithm, which was prototyped at user level and tested on two very different systems with a variety of workloads.
Our methodology allowed us to identify the last-level cache miss rate, which is defined to include all requests issued by LLC to main memory including prefetching, as one of the most accurate predictors of the degree to which applications will suffer when co-scheduled. We used it to design and implement a new scheduling algorithm called Distributed Intensity (DI). We show experimentally on two different multicore systems that DI performs better than the default Linux scheduler, delivers much more stable execution times than the default scheduler, and performs within a few percentage points of the theoretical optimal. DI needs only the real miss rates of applications, which can be easily obtained online. As such we developed an online version of DI, DI Online (DIO), which dynamically reads miss counters online and schedules applications in real time. Our schedulers are implemented at user-level, and although they could be easily implemented inside the kernel, the user-level implementation was sufficient for evaluation of these algorithms’ key properties.
ACM Transactions on Computer Systems, Vol. 28, No. 4, Article 8, Pub. date: December 2010.
Contention is really harmful, leading to degraded and
Simultaneous Multithreading (SMT)
A technique complementary to multi-core: simultaneous multithreading
[Figure: processor pipeline block diagram (BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer, Floating Point, L1 D-Cache, D-TLB, uCode ROM, L2 Cache and Control, Bus). Source: Intel]
Adapted from slides of pfenning@cmu
• Problem: processor pipeline stall
- Waiting for the result of a long floating point/integer operation
- Waiting for data to arrive from memory
- Other execution units wait unused
• Solution: having two or more
Intel’s Hyperthreading
• Replicated: register state, return stack buffer, large page ITLB
• Partitioned: load buffer, store buffer, reorder buffer, small page ITLB
• Dynamically shared: reservation station, caches, data TLB, 2nd level TLB
• Unaware: execution units
[Figure: the same pipeline block diagram, with Thread-1 (floating point) and Thread-2 (integer op) occupying different execution units. Source: Intel]
Multiprocessor Scheduling
• Per-CPU scheduler
• Work migration to achieve load balancing
- Push migration: a processor kicks work over to another processor's ready queue
- Pull migration: a processor steals work from another processor's ready queue
[Figure: processors, each calling pick_next_task() on its own ready queue; "kick" marks push migration and "steal" marks pull migration]
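The per-CPU queues and pull migration described on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the Linux scheduler: tasks are plain integer IDs, queues are fixed-size arrays, and `ready_queue`, `rq_push`, and this simplified `pick_next_task` signature are hypothetical.

```c
#include <assert.h>

#define QCAP 16   /* assumed queue capacity for the sketch */

/* One ready queue per processor; tasks are plain integer IDs here. */
typedef struct {
    int tasks[QCAP];
    int len;
} ready_queue;

/* Append a task; returns 0 on success, -1 if the queue is full. */
int rq_push(ready_queue *q, int task)
{
    assert(q != 0);
    if (q->len == QCAP)
        return -1;
    q->tasks[q->len++] = task;
    return 0;
}

/* Index of the longest queue. */
static int busiest_queue(const ready_queue rq[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rq[i].len > rq[best].len)
            best = i;
    return best;
}

/*
 * Pick the next task for processor cpu. Local work is preferred; if the
 * local queue is empty, one task is pulled (stolen) from the tail of the
 * busiest queue. Returns -1 when every queue is empty.
 */
int pick_next_task(ready_queue rq[], int n, int cpu)
{
    assert(cpu >= 0 && cpu < n);
    ready_queue *q = &rq[cpu];
    if (q->len == 0) {
        int src = busiest_queue(rq, n);
        if (rq[src].len == 0)
            return -1;                        /* nothing to run anywhere */
        rq_push(q, rq[src].tasks[--rq[src].len]);
    }
    int task = q->tasks[0];                   /* take from the head */
    for (int i = 1; i < q->len; i++)
        q->tasks[i - 1] = q->tasks[i];
    q->len--;
    return task;
}
```

Push migration would be the mirror image: an overloaded processor calls `rq_push` on a less loaded processor's queue instead of waiting for its work to be stolen. A real implementation would also need locking around each queue, which this sketch omits.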
Limitations of Multicore Processor
• The move from single-core to multicore is primarily due to
- Memory wall
- ILP wall
- Power wall
• Multicore is still not performing well
- Lack of OS and application support for parallelization
- Limited scalability due to cache coherence and inter-processor synchronization
- Still hard to grow to high core counts due to the power wall
- Not all workloads require deep pipelines and branch predictors -> resource waste
GPU
• Recap: Multicore uses MIMD architectures
• GPU uses SIMD (single instruction multiple data) architectures to exploit data parallelism for
- matrix-oriented scientific computing
- media-oriented image/sound processing
• SIMD is more energy efficient than MIMD
- only needs to fetch one instruction per data operation
GPU vs. CPU
• GPU is designed for data parallel processing rather than data caching and flow control
GPU: Heterogeneous Computing
• Heterogeneous execution model
- CPU is the host, GPU is the device
• Develop a C-like programming language
- CUDA and OpenCL
• Unify all forms of GPU parallelism as threads
• Programming model is “Single Instruction Multiple Thread”
Threads and Blocks
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
An Example
• A = B * C
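The thread, block, and grid hierarchy can be illustrated by emulating it sequentially on the CPU. The sketch below assumes that A = B * C means an element-wise vector multiply with one thread per element. The function names are illustrative, and `global_index` mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>

/* Global element index of a thread inside a one-dimensional grid. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_dim > 0 && thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed to cover n elements, rounding up. */
int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/*
 * Sequential emulation of the grid: each inner-loop iteration stands for
 * one GPU thread handling one data element.
 */
void vec_mul_grid(double *a, const double *b, const double *c,
                  int n, int block_dim)
{
    int grid_dim = blocks_for(n, block_dim);
    for (int blk = 0; blk < grid_dim; blk++) {
        for (int t = 0; t < block_dim; t++) {
            int i = global_index(blk, block_dim, t);
            if (i < n)              /* the last block may be partly empty */
                a[i] = b[i] * c[i];
        }
    }
}
```

On a GPU the two loops disappear: the hardware creates the threads and schedules the blocks, and each thread runs only the loop body.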
GPU Architecture
[Figure: GPU built from multithreaded SIMD processors, fed by a thread block scheduler]
SIMD Multithreaded Processor
[Figure: one multithreaded SIMD processor; it processes 16 elements at a time]
Conditional Branching
• GPU branch hardware uses internal masks to handle different execution paths
for (i = 0; i < 64; i = i + 1)
    if (x[i] != 0)
        x[i] = x[i] - y[i];
    else
        x[i] = x[i] + y[i];
[Figure: lanes 0-5 shown for both paths; blue lanes (mask = 1) have x[i] != 0, red lanes (mask = 0) have x[i] == 0]
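Mask-based branching can be emulated on the CPU to show what the hardware does with the loop above. This is a sketch of the general idea, not of any specific GPU: all lanes evaluate the condition once, then the "then" path and the "else" path run one after the other, each committing results only in the lanes its mask enables. The name `masked_branch` and the 64-lane width are assumptions taken from the loop bound.

```c
#include <assert.h>

#define LANES 64   /* one lane per loop iteration of the slide's example */

/* Emulates: if (x[i] != 0) x[i] = x[i] - y[i]; else x[i] = x[i] + y[i]; */
void masked_branch(double x[LANES], const double y[LANES])
{
    int mask[LANES];
    assert(x != 0 && y != 0);

    /* Every lane evaluates the condition and records it in the mask. */
    for (int i = 0; i < LANES; i++)
        mask[i] = (x[i] != 0);

    /* Pass 1: "then" path; only lanes with mask = 1 commit a result. */
    for (int i = 0; i < LANES; i++)
        if (mask[i])
            x[i] = x[i] - y[i];

    /* Pass 2: "else" path with the mask inverted. */
    for (int i = 0; i < LANES; i++)
        if (!mask[i])
            x[i] = x[i] + y[i];
}
```

The mask is computed before either path runs, so a lane whose value becomes zero in pass 1 does not also execute pass 2. Both passes take time even if only one lane needs them, which is why divergent branches reduce SIMD efficiency.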
Coalesced Memory Access
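This slide compares two ways of assigning the 12 matrix elements 0..b to 4 threads, 3 elements each. The sketch below expresses both mappings as index functions, with hypothetical names: `t` is the thread number and `k` is the step (0, 1, or 2).

```c
#include <assert.h>

/* Non-coalesced: each thread owns a contiguous chunk (thread 1: 3, 4, 5). */
int index_blocked(int t, int k, int per_thread)
{
    assert(t >= 0 && k >= 0 && k < per_thread);
    return t * per_thread + k;
}

/* Coalesced: elements are interleaved across threads (thread 1: 1, 5, 9). */
int index_interleaved(int t, int k, int num_threads)
{
    assert(k >= 0 && t >= 0 && t < num_threads);
    return k * num_threads + t;
}
```

In the blocked mapping, neighbouring threads touch addresses 3 elements apart in the same step, so one step spreads over the whole array. In the interleaved mapping, neighbouring threads touch adjacent addresses, so the hardware can combine a step's accesses into fewer memory transactions.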
[Figure: original matrix with elements 0 1 2 3 4 5 6 7 8 9 a b, stored in memory in that order]
- non-coalesced: thread 0: 0, 1, 2; thread 1: 3, 4, 5; thread 2: 6, 7, 8; thread 3: 9, a, b
- coalesced: thread 0: 0, 4, 8; thread 1: 1, 5, 9; thread 2: 2, 6, a; thread 3: 3, 7, b
Irregularities
On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
Eddy Z. Zhang
Yunlian Jiang
Ziyu Guo
Kai Tian
Xipeng Shen
Computer Science Department
The College of William and Mary, Williamsburg, VA, USA
{eddy,jiang,guoziyu,ktian,xshen}@cs.wm.edu
Abstract
The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to achieve those gains through software approaches on modern GPUs.
This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, jeopardizing no basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - optimization, compilers
General Terms Performance, Experimentation
Keywords GPGPU, Thread divergence, Memory coalescing, Thread-data remapping, CPU-GPU pipelining, Data transformation
ASPLOS’11, March 5–11, 2011, Newport Beach, California, USA. Copyright © 2011 ACM 978-1-4503-0266-1/11/03. . . $10.00
[Figure 1 content, as far as recoverable from the extraction:]
(a) Irregular memory reference: P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 }; tid: 0 1 2 3 4 5 6 7; ... = A[P[tid]];
(b) Irregular control flow: B[ ]: 2 4 1 0 0 6 0 0; tid: 0 1 2 3 4 5 6 7; if (B[tid]) {...}
Figure 1. Examples of dynamic irregularities (warp size=4; segment size=4). Graph (a) shows that inferior mappings between threads and data locations cause more memory transactions than necessary; graph (b) shows that inferior mappings between threads and data values cause threads in the same warp to diverge on the condition.
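Both kinds of irregularity in Figure 1 can be counted directly. The sketch below is an illustration, not the paper's G-Streamline algorithm: it assumes one memory transaction per distinct segment touched by a warp, and it assumes the B values read 2 4 1 0 0 6 0 0, as the extracted figure text suggests. The function names are hypothetical.

```c
#include <assert.h>

/* Memory transactions: for each warp, count the distinct segments it touches. */
int count_transactions(const int idx[], int n, int warp_size, int seg_size)
{
    assert(warp_size > 0 && seg_size > 0 && n % warp_size == 0);
    int total = 0;
    for (int w = 0; w < n; w += warp_size) {
        for (int i = w; i < w + warp_size; i++) {
            int seen = 0;
            for (int j = w; j < i; j++)
                if (idx[j] / seg_size == idx[i] / seg_size)
                    seen = 1;       /* an earlier thread already hit this segment */
            if (!seen)
                total++;
        }
    }
    return total;
}

/* Divergent warps: warps whose threads disagree on the branch condition. */
int count_divergent_warps(const int cond[], int n, int warp_size)
{
    assert(warp_size > 0 && n % warp_size == 0);
    int divergent = 0;
    for (int w = 0; w < n; w += warp_size) {
        int taken = 0;
        for (int i = w; i < w + warp_size; i++)
            taken += (cond[i] != 0);
        if (taken != 0 && taken != warp_size)
            divergent++;
    }
    return divergent;
}
```

With warp size 4 and segment size 4, the mapping P = {0, 5, 1, 7, 4, 3, 6, 2} needs 4 transactions, while the identity mapping needs only 2. Under the assumed B values both warps diverge, whereas grouping the non-zero entries into one warp would leave none divergent.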
1. Introduction
Recent several years have seen a quick adoption of Graphic Processing Units (GPU) in general-purpose computing, thanks to their tremendous computing power, and favorable cost effectiveness and energy efficiency. These appealing properties come from the massively parallel architecture of GPU, which, unfortunately, entails a major weakness of GPU: the high sensitivity of their throughput to the presence of irregularities in an application.
The massive parallelism of GPU is embodied by the equipment of a number of streaming multiprocessors (SM), with each containing dozens of cores. Correspondingly, a typical application written in GPU programming models (e.g., CUDA [14] from NVIDIA) creates thousands of parallel threads running on GPU. Each thread has a unique ID, tid. These threads are organized into warps (see footnote 1). Threads in one warp are assigned to a single SM, and proceed in an SIMD (Single Instruction Multiple Data) fashion. As a result, hundreds of threads may be actively running on a GPU at the same time. Parallel execution of such a large number of threads may well exploit the tremendous computing power of GPU, but not for irregular computations.
Dynamic Irregularities in GPU Computing
Irregularities in an application may throttle GPU throughput by as much as an order of magnitude. There are two types of irregularities, one on data references, the other on control flows.
Before explaining irregular data references, we introduce the properties of GPU memory access. (Without noting, “memory” refers to GPU off-chip global memory.) In a modern GPU device (e.g., NVIDIA Tesla C1060, S1070, C2050, S2070), memory is composed of a large number of continuous segments. The size of
Footnote 1: This paper uses NVIDIA CUDA terminology.