Multicore Processor and GPU. Jia Rao Assistant Professor in CS

(1)

Multicore Processor and GPU

Jia Rao

Assistant Professor in CS http://cs.uccs.edu/~jrao/

(2)

Moore’s Law

• The number of transistors on integrated circuits doubles approximately every two years

• CPU performance doubles every two years

(3)

The End of Moore’s Law

• CPU performance doubles every two years?

[Figure: transistors vs. performance]

(4)

Multicore Processors

• If wider data paths, wider registers, bigger caches, deeper pipelines, and intelligent branch prediction can NOT double performance, what should be done with the doubled transistors?

• Put more cores on a chip -> multicore processor

(5)

Multiprocessor Memory Types

• Shared memory
- there is one (large) common shared memory for all processors

• Distributed memory
- each processor has its own (small) local memory, and its content is not replicated anywhere else

(6)

A Multicore Processor is a Special Kind of Multiprocessor

• All processors are on the same chip (CMP)

• MIMD (Multiple Instructions Multiple Data)
- Different cores execute different threads, operating on different parts of memory

• Shared memory multiprocessor
- All cores share the same memory

(7)

Multicore Architecture

[Figure: two-socket system. Processor-0 with Memory node-0 and Processor-1 with Memory node-1, connected by a cross-socket interconnect]

(8)

The Cache Coherence Problem

[Figure: four processors, each with its own cache, connected through an interconnect to memory and I/O]

• Replicate contents of memory in local caches

• Processors can have different values for the same location

• Reading at a shared address should return the last value written

Adapted from slides of Fatahalian@cmu

(9)

Coherency mechanisms

• Directory-based
- In a directory-based system, the data being shared is placed in a common directory that maintains coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.

• Snooping
- Individual caches monitor the address lines for accesses to memory locations that they have cached. In a write-invalidate protocol, when a cache observes a write to a location that it holds a copy of, its cache controller invalidates its own copy of the snooped memory location.

(10)

The MESI Protocol

• All coherence related activity is broadcast to all processors

• Every cache line has one of four states
- Modified: cache line is present only in the current cache, is dirty, and has been modified from the value in memory
- Exclusive: cache line is present only in the current cache, and is clean
- Shared: cache line may be stored in other caches, and is clean
- Invalid: cache line is invalid

(11)

The MESI Protocol (cont’)

• Processor events
- PrRd: read
- PrWr: write

• Bus transactions
- BusRd: read request from the bus without intent to modify
- BusRdX: read request from the bus with the intent to modify
- BusWB: write line out to memory

• Accessing a cache line in the I state will cause a cache miss

• A write can only be performed if the cache line is in the E or M state. If it is in the S state, the processor broadcasts a request for

(12)

Case Study: Intel Nehalem

(CMU 15-418, Spring 2012)

Cache hierarchy of Intel Core i7

[Figure: four cores, each with a private L1 data cache and L2 cache, connected by a ring interconnect to a shared L3 cache (one bank per core)]

L1 (private per core): 32 KB, 8-way set associative, write back; 2 x 16B loads + 1 x 16B store per clock; 4-6 cycle latency; 10 outstanding misses

L2 (private per core): 256 KB, 8-way set associative, write back; 32B / clock; 12 cycle latency; 16 outstanding misses

L3 (per chip): 8 MB, inclusive, 16-way set associative; 32B / clock per bank; 26-31 cycle latency

64 byte cache line size

Review: key terms
- cache line
- write back vs. write through policy
- inclusion

L3: per chip, 8MB - 12MB, inclusive, 16-way set associative, 26-31 cycle latency

L2: private per core, 256KB, 8-way set associative, write back, 12 cycle latency

L1: private per core, 32KB, 8-way set associative, write back, 4-6 cycle latency

(13)

Case Study: Intel Nehalem (cont’)

Performance Analysis Guide

UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDI: Hitm in local L3 CACHE, RFO snoop; 04 35 396 4000190000000000

UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDS: Local L3 CACHE in F or S, load snoop; 04 35 396 40001A0000000000

UNC_ADDR_OPCODE_MATCH.LOCAL.RSPIWB: Hitm in local L3 CACHE, load snoop; 04 35 396 40001D0000000000

UNC_ADDR_OPCODE_MATCH.REMOTE.NONE: none; 02 35 396 0

UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDI: Hitm in remote L3 CACHE, RFO; 02 35 396 4000190000000000

UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDS: Remote L3 CACHE in F or S, load; 02 35 396 40001A0000000000

UNC_ADDR_OPCODE_MATCH.REMOTE.RSPIWB: Hitm in remote L3 CACHE, load; 02 35 396 40001D0000000000

These opcode uses can be seen from the dual socket QPI communications diagrams below. These predefined opcode match encodings can be used to monitor HITM accesses in particular and serve as the only event that allows profiling the requesting code on the basis of the HITM transfers its requests generate.

[Figure: dual-socket QPI message diagram, "RdData request after LLC Miss to Local Home (Clean Rsp)". Each socket shows cores, LLC, GQ, QHL, IMC, and QPI in the uncore. Flow: DRd; cache lookup; cache miss; request sent to local home (socket 2 owns this address); snoops broadcast to all other caching agents (SnpData); snoop sent to the other LLC, cache lookup, cache miss, RspI; speculative memory read returns Data; DataC_E_Cmp, fill complete to Socket 2; allocate in E state (I -> E).]

RdData after LLC miss and load from local memory

(14)

Case Study: Intel Nehalem (cont’)

[Figure: dual-socket QPI message diagram, "RdData request after LLC Miss to Remote Home (Clean Rsp)". Labeled steps: DRd (1); Cache Lookup (2); Cache Miss (3), sending request to remote home (socket 1 owns this address); RdData (4); RdData (5); send snoop to LLC, SnpData (6); send request to CHL, RdData (6); Cache Lookup (7); speculative memory read (7); Clean Rsp (8); RspI (9), where RspI indicates a clean snoop; Data (9); DataC_E_cmp (10), send complete and data to Socket 2 to allocate in E state; DataC_E_cmp (11); DataC_E_cmp (12); allocate in E state, I -> E (13).]

RdData after LLC miss and load from remote memory

(15)

Case Study: Intel Nehalem (cont’)

[Figure: dual-socket QPI message diagram, "RdData request after LLC Miss to Remote Home (Hitm Res)". Labeled steps: DRd (1); Cache Lookup (2); Cache Miss (3), sending request to remote home (socket 1 owns this address); RdData (4); RdData (5); send snoop to LLC, SnpData (6); send request to CHL, RdData (6); Cache Lookup (7); speculative memory read (7); Hitm Rsp, M -> I, Data (8); RspIWb, WbIData (9); Data (9); DataC_E_cmp (10), send complete and data to Socket 2 to allocate in E state; DataC_E_cmp (11); DataC_E_cmp (12); allocate in E state, I -> E (13); WB. Data is written back to home. RspIWb is an NDR response, a hint to home that the writeback data (WbIData) follows shortly.]

[Figure: dual-socket QPI message diagram, "RdData request after LLC Miss to Local Home (Hitm Response)". Flow: DRd; cache lookup; cache miss; request sent to local home (socket 2 owns this address); snoops broadcast to all other caching agents (SnpData); snoop sent to the other LLC, cache lookup, Hitm Rsp, M -> I, Data; RspIWb and WbIData; speculative memory read returns Data; DataC_E_Cmp, send complete to Socket 2; allocate in E state (I -> E); WB. Data is written back to remote home. RspIWb is an NDR response, a hint to home that the writeback data (WbIData) follows shortly.]

RdData after LLC miss, invalidate remote modified copy, and load from local memory

(16)

Performance Implications

• Scalability issues
- significant traffic when scaling to high core counts
- storage cost for tracking sharers
- increased latency of cache misses

• False sharing
- two threads write to different variables residing on the same cache line, incurring significant amounts of coherence traffic

(17)

False Sharing

// allocate per-thread data (may lead to false sharing)
long myData[NUM_THREADS];

// allocate per-thread data, padded so each element fills one 64-byte cache line
struct PerThreadData {
    long myData;
    char padding[64 - sizeof(long)];
};
struct PerThreadData myData[NUM_THREADS];

[Figure: myData[0], myData[1], and myData[2] packed into the same cache line; accesses to myData by different threads cause coherence traffic]

(18)

Shared Resource Contention

• Contention can happen in different shared resources
- LLC, memory controller, hardware prefetcher, cross-socket interconnect

Contention-Aware Scheduling on Multicore Systems

Fig. 1. The performance degradation relative to running solo for two different schedules of SPEC CPU2006 applications on an Intel Xeon X3565 quad-core processor (two cores share an LLC).

contention, and prefetching hardware contention all combine in complex ways to create the performance degradation that threads experience when sharing the LLC.

Our goal is to investigate contention-aware scheduling techniques that are able to mitigate as much as possible the factors that cause performance degradation due to contention for shared resources. Such a scheduler would provide speedier as well as more stable execution times from run to run. Any contention aware scheduler must consist of two parts: a classification scheme for identifying which applications should and should not be scheduled together as well as the scheduling policy that assigns threads to cores given their classification. Since the classification scheme is crucial for an effective algorithm, we focused on the analysis of various classification schemes. We studied the following schemes: Stack Distance Competition (SDC) [Chandra et al. 2005], Animal Classes [Xie and Loh 2008], Solo Miss Rate [Knauerhase et al. 2008], and the Pain Metric. The best classification scheme was used to design a scheduling algorithm, which was prototyped at user level and tested on two very different systems with a variety of workloads.

Our methodology allowed us to identify the last-level cache miss rate, which is defined to include all requests issued by LLC to main memory including prefetching, as one of the most accurate predictors of the degree to which applications will suffer when co-scheduled. We used it to design and implement a new scheduling algorithm called Distributed Intensity (DI). We show experimentally on two different multicore systems that DI performs better than the default Linux scheduler, delivers much more stable execution times than the default scheduler, and performs within a few percentage points of the theoretical optimal. DI needs only the real miss rates of applications, which can be easily obtained online. As such we developed an online version of DI, DI Online (DIO), which dynamically reads miss counters online and schedules applications in real time. Our schedulers are implemented at user-level, and although they could be easily implemented inside the kernel, the user-level implementation was sufficient for evaluation of these algorithms’ key properties.

ACM Transactions on Computer Systems, Vol. 28, No. 4, Article 8, Pub. date: December 2010.

Contention is really harmful, leading to degraded and

(19)

Simultaneous Multithreading (SMT)

A technique complementary to multi-core

[Figure: processor pipeline with BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, and bus. Source: Intel]

Adapted from slides of pfenning@cmu

• Problem: processor pipeline stall
- Waiting for a long floating point/integer operation
- Waiting for data from memory
- Other execution units idle

• Solution: having two or more

(20)

Intel’s Hyperthreading

• Replicated: register state, return stack buffer, large page ITLB

• Partitioned: load buffer, store buffer, reorder buffer, small page ITLB

• Dynamically shared: reservation station, caches, data TLB, 2nd level TLB

• Unaware: execution units

[Figure: the same Intel pipeline diagram, shared by two threads. Thread-1: floating point; Thread-2: integer op]

(21)

Multiprocessor Scheduling

• Per-CPU scheduler

• Work migration to achieve load balancing
- Push migration: a busy processor moves a task to another processor's ready queue (kick)
- Pull migration: an idle processor takes a task from another processor's ready queue (steal)

[Figure: each processor calls pick_next_task() on its own ready queue; one panel shows a processor kicking another (push migration), the other shows a processor stealing from another's ready queue (pull migration)]

(22)

Limitations of Multicore Processors

• The move from single-core to multicore is primarily due to
- Memory wall
- ILP wall
- Power wall

• Multicore is still not performing well
- Lack of OS and application support for parallelization
- Limited scalability due to cache coherence and inter-processor synchronization
- Still hard to grow to high core counts due to the power wall
- Not all workloads require deep pipelines or branch predictors -> wasted resources

(23)

GPU

• Recap: Multicore uses MIMD architectures

• GPU uses SIMD (single instruction multiple data) architectures to exploit data parallelism for
- matrix-oriented scientific computing
- media-oriented image/sound processing

• SIMD is more energy efficient than MIMD
- only needs to fetch one instruction per data operation

(24)

GPU vs. CPU

• GPU is designed for data parallel processing rather than data caching and flow control

(25)

GPU: Heterogeneous Computing

• Heterogeneous execution model
- CPU is the host, GPU is the device

• Develop a C-like programming language
- CUDA and OpenCL

• Unify all forms of GPU parallelism as threads

• Programming model is “Single Instruction Multiple Thread”

(26)

Threads and Blocks

• A thread is associated with each data element

• Threads are organized into blocks

• Blocks are organized into a grid

• GPU hardware handles thread management, not applications or OS

(27)

An Example

• A = B * C

(28)

GPU Architecture

[Figure: GPU architecture, with a thread block scheduler dispatching thread blocks to multithreaded SIMD processors]

(29)

SIMD Multithreaded Processor

[Figure: a multithreaded SIMD processor]

Processes 16 elements at a time

(30)

Conditional Branching

• GPU branch hardware uses internal masks to handle different execution paths

for (i = 0; i < 64; i = i + 1)
    if (x[i] != 0)
        x[i] = x[i] - y[i];
    else
        x[i] = x[i] + y[i];

[Figure: lanes 0 to 5 shown for each of the two paths. Blue: mask=1 (x[i] != 0); red: mask=0 (x[i] == 0)]

(31)

Coalesced Memory Access

Original matrix, stored in memory as: 0 1 2 3 4 5 6 7 8 9 a b

Non-coalesced: thread 0: 0, 1, 2; thread 1: 3, 4, 5; thread 2: 6, 7, 8; thread 3: 9, a, b

Coalesced: thread 0: 0, 4, 8; thread 1: 1, 5, 9; thread 2: 2, 6, a; thread 3: 3, 7, b

(32)

Irregularities

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing

Eddy Z. Zhang

Yunlian Jiang

Ziyu Guo

Kai Tian

Xipeng Shen

Computer Science Department

The College of William and Mary, Williamsburg, VA, USA

{eddy,jiang,guoziyu,ktian,xshen_}@cs.wm.edu

Abstract

The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to achieve those gains through software approaches on modern GPUs.

This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, jeopardizing no basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - optimization, compilers

General Terms Performance, Experimentation

Keywords GPGPU, Thread divergence, Memory coalescing, Thread-data remapping, CPU-GPU pipelining, Data transformation


ASPLOS’11, March 5–11, 2011, Newport Beach, California, USA. Copyright 2011 ACM 978-1-4503-0266-1/11/03.

[Figure 1 panels. (a) Irregular memory reference: threads tid = 0..7 execute ... = A[P[tid]]; with P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 }. (b) Irregular control flow: threads tid = 0..7 execute if (B[tid]) {...} with B[ ] = { 2, 4, 1, 0, 0, 6, 0, 0 }.]

Figure 1. Examples of dynamic irregularities (warp size=4; segment size=4). Graph (a) shows that inferior mappings between threads and data locations cause more memory transactions than necessary; graph (b) shows that inferior mappings between threads and data values cause threads in the same warp diverge on the condition.

1. Introduction

Recent several years have seen a quick adoption of Graphic Processing Units (GPU) in general-purpose computing, thanks to their tremendous computing power, and favorable cost effectiveness and energy efficiency. These appealing properties come from the massively parallel architecture of GPU, which, unfortunately, entails a major weakness of GPU: the high sensitivity of their throughput to the presence of irregularities in an application.

The massive parallelism of GPU is embodied by the equipment of a number of streaming multiprocessors (SM), with each containing dozens of cores. Correspondingly, a typical application written in GPU programming models (e.g., CUDA [14] from NVIDIA) creates thousands of parallel threads running on GPU. Each thread has a unique ID, tid. These threads are organized into warps¹.

Threads in one warp are assigned to a single SM, and proceed in an SIMD (Single Instruction Multiple Data) fashion. As a result, hundreds of threads may be actively running on a GPU at the same time. Parallel execution of such a large number of threads may well exploit the tremendous computing power of GPU, but not for irregular computations.

Dynamic Irregularities in GPU Computing Irregularities in an application may throttle GPU throughput by as much as an order of magnitude. There are two types of irregularities, one on data references, the other on control flows.

Before explaining irregular data references, we introduce the properties of GPU memory access. (Without noting, “memory” refers to GPU off-chip global memory.) In a modern GPU device (e.g., NVIDIA Tesla C1060, S1070, C2050, S2070), memory is composed of a large number of continuous segments. The size of

¹ This paper uses NVIDIA CUDA terminology.
