• No results found

Whither Sequential?: Rethinking Parallel Execution for Future Multicore Processors

N/A
N/A
Protected

Academic year: 2021

Share "Whither Sequential?: Rethinking Parallel Execution for Future Multicore Processors"

Copied!
80
0
0

Loading.... (view fulltext now)

Full text

(1)

Whither Sequential?: Rethinking

Parallel Execution for Future

Multicore Processors

Guri Sohi

(with Matt Allen, Srinath Sridharan, Gagan Gupta, Hongil Yoon)

(2)

Outline

• Lessons from microprocessor evolution

• Multicore generation and its implications

• Sequential programming; parallel execution

– Dataflow-style parallel execution

– Managing perils of excessive parallelism

• Concluding thoughts

(3)

Beyond pipelining to ILP

• Late 1980s to mid 1990s

• Search for “post RISC” architecture

– More accurately, instruction processing model

• Desire to do more than one instruction per cycle—exploit ILP

• VLIW/EPIC

• Out-of-order (OOO) superscalar

(4)

VLIW/EPIC School

• Descendants of HPC computing experience (array processors)

• Search for independence (by compiler)

• Express independence in static program

• Take program/algorithm parallelism and mold it to given execution schedule for exploiting parallelism

• Strive for “efficiency”

– Static scheduling to “saturate” a resource

(5)

VLIW/EPIC School

• Creating effective parallel representations (statically) introduces several problems

– Predication

– Statically scheduling loads

– Exception handling

– Recovery code

• Lots of research addressing these problems

(6)

OOO Superscalar

• Not from HPC school

– Non-scientific influence important (e.g., branch prediction)

• No static search/representation for independence

– Arguably statically representing dependent operations (e.g., accumulator) would help

• Improvements to superscalar were in this direction

(7)

OOO Superscalar

• Create dynamic parallel execution from sequential static representation

– dynamic dependence information accurate

– execution schedule flexible

• Parallelism in application/algorithm required but program representation is sequential

• None of the problems associated with trying to create a parallel representation statically

(8)

VLIW/Superscalar

• Superscalar not as “efficient” as VLIW

– Has all this extra hardware for dynamic dataflow execution

– Hard to “saturate” a resource like VLIW

• But provides natural (sequential) interface for program generator

• Much more adaptable to run time uncertainties

– E.g., resources changing dynamically

(9)

Lessons from VLIW/Superscalar

• Wisdom from HPC is natural to apply, but is this a good idea?

• Hardware “efficiency” may not be the best notion of “efficiency”

• Parallel execution can be achieved even without a parallel representation

(10)

Lessons from VLIW/Superscalar

• Dataflow execution is much more flexible and adaptable than control flow execution

– E.g., new forms of speculation easily added

– E.g., resource architecture easily changed

• (Hardware) overheads for achieving dynamic dataflow execution worth the effort

(11)

On to Multicore

Parallel execution hardware is everywhere

11

(12)

What Are We Going to Do?

• Leverage knowledge from 4+ decade experience in parallel processing

– Mostly in scientific/HPC arena

– Create a program with parallelism expressed statically

• Teach students about parallel programming

– Eventually statically parallel programs will be pervasive and universal

• Have expert programmers write parallel libraries

• Etc, etc

(13)

Lots of Experience with Parallelism

• Models, languages, hardware

• Main issues: expressing, exposing, handling

• Examples: Actors, Shared memory, Message passing, Dataflow, Task Execution systems, Futures, Concurrent OO, Inspector-Executor, Speculative systems, Thread-level speculation, SIMD, Vectors, Jade, Linda, Cilk, Multilisp, Id, PVM, MPI, OpenMP, Charm, ………..

• Parallel Programming vs. Parallel Execution

(14)

Canonical Parallel Execution Model

A: Analyze program to identify independence in program

– independent portions executed in parallel

B: Create static representation of independence

synchronization to satisfy independence assumption

C: Dynamic parallel execution unwinds as per static representation

– potential consequences due to static assumptions

(15)

What We Desire

• Parallel execution

– Regardless of what the static program is.

• Achieve this without significant rethinking in common programming practices/styles

(16)

Control and Data-Driven

• Parallelism is due to operations on disjoint sets of data

• Static representation typically control-driven

– Most if not all practical programming languages

– Need to ensure that execution is on disjoint data

• Use synchronization to ensure

• But nature of data revealed dynamically

– Potentially obscures application parallelism

• Conflates parallelism with execution schedule

(17)

Control and Data-Driven

• Data-driven focuses on data dependence

– Naturally separates operations on disjoint data

– Can be easily derived from total (sequential) order

• Remember VLIW (control driven and parallel) and OOO superscalar (data-driven from sequential)

• My view: data-driven models much more powerful and practical than control-driven

– How to get such a model for future hardware?

(18)

Data-driven Parallel Execution

• Would like to achieve data-driven parallel execution

• Do we need new programming languages? No

• Do we need to learn to program in parallel? No

• Do we need to throw out everything we have and start from scratch? No

• How are we going to deal with hardware “uncertainties” in the future?

(19)

Hardware Going Forward

• Multiple general-purpose processing cores

– dynamically specialized

• Some “special-purpose” hardware

• GPUs, specialized units, accelerators, etc.

• Over-provisioning: pool of available (i.e., powered on) resources might change frequently

– Now called “dark silicon”

• Frequent errors

– E.g., due to near threshold operation

(20)

Meanwhile

• Software is everywhere and its presence is going to continue to increase

– Software not “shrink wrap” any more

• Programmer productivity is paramount; code

efficiencyless important?

• Modularity, abstraction, encapsulation,

information hiding, etc., become the norm

• Object-oriented programming principles (and languages) becoming ubiquitous

(21)

Going Forward

• Programmers are going to continue to express computation in familiar ways: sequential,

modular, object-oriented programs, with heavy use of abstraction and encapsulation

• We want them to use a parallel algorithm, but don’t necessarily want them to statically express parallelism

How are we going to make it work?

(22)

Abstraction, Sequential, Dataflow

• Then: abstraction is a friend of software

• Now: abstraction is going to help us use future hardware

View program as a sequential representation of abstractions of computations which are going to be executed on a (dynamically) heterogeneous pool of hardware resources in a dataflow manner

(23)

What?

• Data-driven parallel execution from statically sequential program

– Data-centric (dynamic) expression of dependence

• No expression of independence

– Determinate, race-free execution

– No locks, no memory models and no explicit

synchronization

– Easier to write, debug, and maintain

– Comparable or better performance than conventional parallel models without their drawbacks/pitfalls

(24)

 Sequential imperative programming language • Leverage modern programming principles

• Provision for worst-case scenarios

 Achieving dataflow parallel execution

(25)

Static Program object A .. while (cond) { function F ( ) } function F’ ( ) Dynamic Sequence .. evaluate cond (PS1) F1 ( ) evaluate cond (PS2) F2 ( ) evaluate cond (PS3) F3 ( ) ..

Program Decomposition

(26)

Static Program object A .. while (cond) { function F ( ) } function F’ ( )

Program Decomposition

Functions .. evaluate cond (PS1) F1 ( ) evaluate cond (PS2) F2 ( ) evaluate cond (PS3) F3 ( ) ..
(27)

Static Program object A .. while (cond) { function F ( ) } function F’ ( )

Program Decomposition

Program Segments .. evaluate cond (PS1) F1 ( ) evaluate cond (PS2) F2 ( ) evaluate cond (PS3) F3 ( ) ..
(28)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..  Collection of computations  Implicit order

Program Decomposition

(29)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor T ime
(30)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime
(31)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime F1
(32)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime F1 PS2
(33)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime F2 PS2 F1
(34)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime PS3 F2 PS2 F1
(35)

Dynamic Sequence .. PS1 F1 ( ) PS2 F2 ( ) PS3 F3 ( ) ..

Program Execution

Multicore Processor PS1 T ime F3 PS3 F2 PS2 F1
(36)

Dynamic Sequence .. F1 ( ) PS1 F2 ( ) PS2 F3 ( ) PS3 ..  Implicit order - Hindrance + Exploitable

Program Execution

(37)

object A .. while (cond) { function F ( ) } function F’ ( )

“Instruction”

Level of Abstraction

(38)

object A .. while (cond) { function F ( ..interface.. ) } function F’ ( )

“Operands”

Level of Abstraction

(39)

object A ..

while (cond) {

function F ({wr_set} {rd_set})

}

function F’ ( )

“Operands”

(40)

object A

..

while (cond) {

function F ({wr_set} {rd_set}) }

function F’ ( )

“Unit of Data”

(41)

 Our view of a program • Computations

• Level of abstraction

Classical dataflow machines

• Tokens: data dependence

• Resource management

(42)

 Proposed Model

• Level of abstraction

• Imperative programs can mutate data

 Token protocol

• Data dependence

 Program order

(43)

object A ..

while (cond) {

function F ({wr_set} {rd_set}) }

function F’ ( )

(44)

object A ..

while (cond) {

function F ({wr_set} {rd_set}) } function F’ ( ) F1 {B, C} {A} F2 {D} {A} F3 {A, E} {I} F4 {B} {D} F5 {B} {D} F6 {G} {H}

(45)

object A ..

while (cond) {

function F ({wr_set} {rd_set}) } function F’ ({?} {?}) F1 {B, C} {A} F2 {D} {A} F3 {A, E} {I} F4 {B} {D} F5 {B} {D} F6 {G} {H} F’ {?} {?}

(46)

Data Dependence

F1 {B, C} {A} F2 {D} {A} F3 {A, E} {I} F4 {B} {D} F5 {B} {D} F6 {G} {H} F’ {?} {?} Time F1 F2 F3 F4 F5 F6
(47)

 Runtime library (C++ currently)

 Implements token protocol

 Achieves parallel execution

(48)

Composing Programs

Sequential Program Object A .. while (cond) { function F: ( .. ) } Function F’: {?} {?}

Program in Proposed Model

Object A ..

while (cond) { }

(49)

Composing Programs

Sequential Program Object A .. while (cond) { function F: ( .. ) } Function F’: {?} {?}

Program in Proposed Model

Object A : Token

..

while (cond) { }

(50)

Composing Programs

Sequential Program Object A .. while (cond) { function F: ( .. ) } Function F’: {?} {?}

Program in Proposed Model

Object A : Token .. while (cond) { df_execute ( &F); } F’( .. );

(51)

Composing Programs

Sequential Program Object A .. while (cond) { function F: ( .. ) } Function F’: {?} {?}

Program in Proposed Model

Object A : Token ..

while (cond) {

df_execute (wr_set, rd_set, &F);

}

df_end ( );

(52)

Parallel Execution Mechanics

Multicore Processor Task-stealing Schedulers Deques Application Interface Runtime
(53)

 Ease of programming (qualitative)

 Performance & overheads

 Setup

• Microbenchmarks & benchmarks

• 3 stock multicore machines

(54)

bzip2

Sequential

.. ..

while (!EOF) {

block = new block_t (InFile, size);

compress (block)

file_write (OpFile, block); }

(55)

for (;;) { pthread_mutex_lock(fifo->mut); while (fifo->empty) { if (allDone == 1) { pthread_mutex_unlock(fifo->mut); return (NULL); } GetSystemTime(&systemtime); SystemTimeToFileTime(&systemtime, (FILETIME *)&filetime); waitTimer.tv_sec = filetime.QuadPart / 10000000; waitTimer.tv_nsec = filetime.QuadPart -((LONGLONG)waitTimer.tv_sec * 10000000) * 10; waitTimer.tv_sec++; pret = pthread_cond_timedwait(fifo->notEmpty, fifo->mut, &waitTimer);

FileData = queueDel(fifo, &inSize, &blockNum); pthread_mutex_unlock(fifo->mut);

pret = pthread_cond_signal(fifo->notFull); outSize = (int) ((inSize*1.01)+600);

pthread_mutex_lock(MemMutex); CompressedData = new char[outSize]; pthread_mutex_unlock(MemMutex) if (CompressedData == NULL) {

while ((currBlock < NumBlocks) || (allDone == 0)) { pthread_mutex_lock(OutMutex); if ((OutputBuffer.size() == 0) || (OutputBuffer[currBlock].bufSize < 1) || (OutputBuffer[currBlock].buf == NULL)) { pthread_mutex_unlock(OutMutex); usleep(50000); continue; } else pthread_mutex_unlock(OutMutex);

ret = write(hOutfile, OutputBuffer[currBlock].buf, OutputBuffer[currBlock].bufSize); CompressedSize += ret; pthread_mutex_lock(OutMutex); pthread_mutex_lock(MemMutex); if (OutputBuffer[currBlock].buf != NULL) { delete [] OutputBuffer[currBlock].buf; NumBufferedBlocks--; } pthread_mutex_unlock(MemMutex); pthread_mutex_unlock(OutMutex); currBlock++;

bzip2: Pthread

(56)
(57)

bzip2: Proposed Model

Sequential

.. ..

while (!EOF) {

block = new block_t (InFile, size);

compress (block)

file_write (OpFile, block); }

..

Proposed Model

..

while (!EOF) {

block = new block_t (InFile, size);

} ..

(58)

bzip2: Proposed Model

Sequential

.. ..

while (!EOF) {

block = new block_t (InFile, size);

compress (block)

file_write (OpFile, block); } .. Proposed Model .. opset.insert (OpFile); while (!EOF) {

block = new block_t (InFile, size);

blset.insert (block);

df_execute (blset, &compress); df_execute (opset, blset,

&file_write);

} ..

(59)

 Machines

• 4-core/8-thread Nehalem Core i7

• 16-core AMD Opteron 8350

• 32-core AMD Opteron 8356

 Input sizes

• Small, Medium, Large

Results

Benchmark barneshut barneshut-LG blackscholes blackscholes-LG bzip2 bzip2-LG dedup histogram reverse_index microbenchmarks
(60)

Performance Evaluation

60 0 2 4 6 8 10 12 14 16 18 8x Nehalem (PT) 8x Nehalem (DF) 32x Barcelona (PT) 32x Barcelona (DF)
(61)
(62)

How? Summary

• Serialize computations to same object

– Enforce dependence

• Do not look for/represent independence

– Falls out as an effect of enforcing dependence

• Updates to given state in same order as in sequential program

– Determinate

– No races

– If sequential correct; parallel execution is correct (same input)

(63)

Environment Going Forward

• Multicore processors everywhere

• Won’t know (dynamic) capabilities of hardware or operating environment when writing code

– Different machines may have different hardware and software resources

– Different programs may be running in a multiprogrammed environment

• How much parallelism should we have?

– Too much for dynamic situation can be problematic!

(64)

Parallelism and Resources

64

void parallel_memcpy(void* dest, void* src, size_t length) {

char* d = (char*) dest; char* s = (char*) src;

size_t chunks = length / NUM_THREADS;

parallel_for ( i=0; i < chunks; i++) { for(j=i*chunks; j<(i+1)*chunks;j++) { dest[j] = src[j]; } } }

(65)

Parallelism and Resources

• Same input data on three different machines

• Performance degrades as more threads added

65 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0 2 4 6 8 10 12 14 16 Sp e e d u p wrt se q u e n tial ex e cu tion Num_threads

(66)

Parallelism and Resources

• Point of degradation is different on different machines 66 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0 2 4 6 8 10 12 14 16 Sp e e d u p wrt se q u e n tial ex e cu tion Num_threads

(67)

Parallelism and Resources

• Different input data on the same machine (4X4-B)

• Performance degrades as more threads added

67 0 0.5 1 1.5 2 2.5 3 3.5 4 0 2 4 6 8 10 12 14 16 Sp e e d u p wrt se q u e n tial ex e cu tion Num_threads Small Trad Large Trad

(68)

Parallelism and Resources

• Point of degradation is different for different input sizes 68 0 0.5 1 1.5 2 2.5 3 3.5 4 0 2 4 6 8 10 12 14 16 Sp e e d u p wrt se q u e n tial ex e cu tion Num_threads Small Trad Large Trad

(69)

Why performance degradation?

• Contention for resources

– memory bandwidth

– software locks in user program

– lock on kernel data-structure

– size of on-chip caches

– interconnection bandwidth

– several other possibilities

(70)

Desired result

70 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0 2 4 6 8 10 12 14 16 Spe ed up wrt seq ue n tial ex ec utio n Num_threads

4X4-B Trad 4X8-B Trad 8X2-N Trad

4X4-B Prometheus 4X8-B Prometheus 8X2-N Prometheus

(71)

How to achieve it ?

• Expose parallelism into dynamic computation environment in a controlled manner

– Establish single point of control introducing computation tasks into environment

– Detect if parallelism might be excessive

• e.g., rate of increase in task execution time • e.g., rate of temperature increase

– Control exposure of parallelism by throttling or increasing the number of simultaneous dynamic worker threads (tasks)

(72)

72

Dedicated Environment

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Sec o n d s wr t sequ en tia l e xec u tio n Time TBB FDT PM PD 0 0.5 1 1.5 2 2.5 Jo u les wrt sequ en tia l e xec u tio n Energy TBB FDT PM PD
(73)

73

Multiprogrammed Environment

0 5000 10000 15000 20000 25000 RE His to gr am Str ea m Hash Joi n RE His to gr am Str ea m Hash Joi n RE His to gr am Str ea m Hash Joi n

Scenario 1 Scenario 2 Scenario 3

Joules

Energy

TBB PD

(74)

74

Multiprogrammed Environment

0 50 100 150 200 250 RE His to gr am Str ea m Ha sh Jo in RE His to gr am Str ea m Ha sh Jo in RE His to gr am Str ea m Ha sh Jo in

Scenario 1 Scenario 2 Scenario 3

Sec ond s Time TBB PD

(75)

75

PD variations

0 10 20 30 40 50 60 70 80 90

RE Histogram Stream Hash Join

Coun t Multiprogrammed Scenario1 Scenario2 Scenario3

(76)

Other Issues to Ponder

• Determinate vs. deterministic vs. non-deterministic

• Notion of a precise exception/interrupt

• Future-proofing software

– Unknown pathologies with locks

– What else is lurking out there, and how might it be managed when it occurs?

• Given parallelism, how much of a limiter is a sequential representation? Why have parallel?

(77)

Summary

• Future general-purpose computing world is going to be parallel everywhere

– But different from traditional parallel

• Will require new thinking about how to execute programs

• Want programmers to think parallel but program sequential

• Future will likely be sequential programs with dynamic dataflow parallel execution

(78)

Questions?

(79)

79

Benchmarks

Program Source Original

language Synchronization Description barnes-hut Lonestar C++ barrier N-body

simulation black-scholes PARSE

C C barrier

financial analysis bzip2 pbzip2 C mutex,

condition variables compression dedup PARSE C C mutex, condition variables enterprise storage histogram Phoenix C barrier image

analysis

(80)

80

Hardware configurations

μ-arch AMD Barcelona Intel Nehalem Processor Phenom 9850 Opteron 8350 Core i7 965 Xeon X5550 Sockets 1 4 1 2 Cores 4 4 4 4 Threads 1 1 2 2 Total contexts 4 16 8 16 Clock (GHz) 2.5 2.0 3.2 2.66 Memory (GB) 4 16 12 24

References

Related documents

ROBBINS, THE UNITED STATES TRUSTEE for Region 7 (&#34;UST&#34;), through the undersigned counsel, objects to the Application to Approve Employment and Retention of Miller

– Any analyzer that has been stored at a temperature range inside the allowed storage range but outside the allowed operating range must be stored at an ambient temperature within

First, the paper discusses the relations between music and trance and the role of music during rituals; second, it explores how ritual musics are involved in representations of

Another justification for permitting the determination of causation is to up- hold the purpose of appraisal. 109 The court in CIGNA explained that if the apprais- ers were

The framework currently supports two different kinds of constraints: Simple UIMA Ruta rules, which express spe- cific expectations concerning the relationship of annotations,

In the year to 31 January 2013, Group revenue fell 8% to £197.3 million as a result of declines in sales volumes in both our retail and wholesale businesses in UK/Europe.. Gross

In one part general bio data of students was noted while on the other part outcome variables such as socioeconomic status, parents’ education and their assistance in students’

7 Robust Hovering and Trajectory Tracking Control of the Quadro- tor Helicopter Using a Novel Disturbance Observer 55 7.1 Position Control Utilizing Acceleration Based