Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa
Department of Computer Science, University of Virginia
Motivation
• Chip-multiprocessors offer a large number of cores and ample resources
• The number of simultaneously executing applications is increasing
• Careful resource management is critical
• Thread mapping is a powerful technique for resource management
Challenges for Thread Mapping
• Multiple resources are affected
• Threads exhibit varied run-time characteristics
Goal of this Research
• Analyze why a particular thread mapping is better than another mapping:
  – What are the resources that cause the performance differences?
  – What are the thread characteristics that cause the resource utilization differences?
  – What is the relative importance of various resources?
Contributions
• In-depth performance analysis of various thread mappings using multi-threaded applications on real hardware
• Identification of the key hardware resources
• Determination of each mapping's impact on key-resource utilization
• A new metric, L2MP, to analyze the performance of the combined memory resources
Outline
• Motivation
• Challenges
• Contributions
• Overview – resources, metrics, mappings
• Analysis – prefetchers, processor cores
• Key findings for thread mapping
• Conclusion
Overview
• A comprehensive analysis considering various factors:
  – The application's performance
  – The application's characteristics
  – The hardware resources shared by applications
Resources and Metrics
• Resources
  – Memory resources: L1 I/D caches, I/D TLBs, L2 cache, prefetchers, memory interconnect
  – Processor resources: memory disambiguation units, branch predictors, processor cores
• Metrics
  – Cache misses, mis-predictions, memory latency (via hardware performance counters (HPCs))
  – Processor utilization (from the OS)
  – Execution cycles and execution time
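The slides list which raw HPC events are collected but not how derived metrics are formed. As a minimal illustration (the helper names and counter values below are assumptions, not taken from the paper), two standard metrics derived from such counters are the cache miss rate and cycles per instruction:

```python
# Illustrative helpers for HPC-derived metrics; function names and the
# sample counter values are hypothetical, not measured data from the paper.

def miss_rate(misses: int, accesses: int) -> float:
    """Cache miss rate: fraction of accesses that miss."""
    return misses / accesses if accesses else 0.0

def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: a coarse execution-efficiency metric."""
    return cycles / instructions

print(miss_rate(2_000, 100_000))      # 0.02
print(cpi(3_000_000, 1_500_000))      # 2.0
```

On Linux, raw event counts like these are typically collected with the perf subsystem.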
Thread Characteristics of Multi-threaded Applications
• Single-thread characteristics
  – Cache demand
  – Memory bandwidth demand
  – I/O frequency
  – Prefetcher effectiveness
  – Prefetcher excessiveness
• Multiple-thread characteristics
  – Sibling threads
  – Data and instruction sharing
Four Thread Mappings

Mapping | Core 0     | Core 1     | Core 2     | Core 3
        |     (share LLC0)        |     (share LLC1)
--------+------------+------------+------------+-----------
OSMap   | Any thread | Any thread | Any thread | Any thread
IsoMap  | a1, a1     | a1, a1     | a2, a2     | a2, a2
IntMap  | a1, a1     | a2, a2     | a1, a1     | a2, a2
SprMap  | a1, a2     | a1, a2     | a1, a2     | a1, a2
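The three fixed mappings place the four worker threads of each co-running application (a1, a2) on specific cores. A minimal sketch of those placements, with the function name and dictionary layout chosen for illustration (the paper does not give this code):

```python
# Hypothetical sketch of the three fixed mappings. Each application (a1, a2)
# has 4 worker threads; cores {0,1} share one LLC and cores {2,3} the other.

def assign_cores(mapping: str) -> dict:
    """Return, per application, the cores its 4 worker threads run on."""
    if mapping == "IsoMap":   # each application isolated on one LLC
        return {"a1": [0, 0, 1, 1], "a2": [2, 2, 3, 3]}
    if mapping == "IntMap":   # applications interleaved across both LLCs
        return {"a1": [0, 0, 2, 2], "a2": [1, 1, 3, 3]}
    if mapping == "SprMap":   # each application spread over all cores
        return {"a1": [0, 1, 2, 3], "a2": [0, 1, 2, 3]}
    raise ValueError("OSMap leaves placement to the OS scheduler")

print(assign_cores("IntMap"))
```

On Linux, such a placement could be enforced per thread with `os.sched_setaffinity(tid, {core})`.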
ISPASS 2012
[Figure: two dual-core chips; each core has a private L1 cache and TLB; cores 0–1 and cores 2–3 each share an L2 cache and hardware prefetchers; both chips connect to off-chip memory over the memory interconnect, with App 1 and App 2 threads placed on the cores]
Experimental Setup – Platform & Workloads
• Intel Core 2 Q9550 processor
• PARSEC benchmark suite 2.1
  – 9 benchmarks
• All possible pairs (36) of the 9 benchmarks
• 4 worker threads per benchmark
[Figure: experimental platform – four cores with private L1 caches and TLBs; cores 0–1 and cores 2–3 each share an L2 cache and hardware prefetchers, attached to the memory controller and memory]
Key Resources
• A key resource is one whose:
  – Utilization varies considerably across mappings
  – Utilization variation results in differences in the application's performance
• Identification techniques
  – Direct approach: use HPCs
  – Indirect approach: use the application's performance under different mappings
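The direct approach above can be sketched as a simple variability test: measure a resource's utilization under each mapping and flag it as key when the variation is considerable. The coefficient-of-variation formulation, the 0.2 threshold, and the sample values are assumptions for illustration, not taken from the paper:

```python
# Illustrative sketch of the "direct approach" to key-resource identification.
# Threshold and example counter values are hypothetical.
from statistics import mean, pstdev

def is_key_resource(utilization_by_mapping: dict, threshold: float = 0.2) -> bool:
    """Flag a resource as key if its utilization varies considerably
    across mappings (coefficient of variation above the threshold)."""
    values = list(utilization_by_mapping.values())
    cv = pstdev(values) / mean(values)
    return cv > threshold

l2_misses = {"IsoMap": 1.0e7, "IntMap": 3.5e7, "SprMap": 2.8e7}  # hypothetical
print(is_key_resource(l2_misses))
```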
Key Resources
• More important resources
  – Memory resources: L1 D-cache, L2 cache, hardware prefetchers, memory interconnect
• Less important resources
  – L1 I-cache, I/D TLBs, memory disambiguation unit
  – Processor resources: branch predictor
Experimental Results: streamcluster (with blackscholes)
Analysis – Hardware Prefetchers
• Case 1: Threads that share a high amount of data
  – Sharing the same cache improves performance
Key Findings for Hardware Prefetchers
• Case 2: Threads that have low or no data sharing but high prefetcher excessiveness
  – Sharing the same prefetchers improves performance
Key Findings for Hardware Prefetchers
• Case 3: Threads that have low data sharing and low prefetcher excessiveness
  – Fewer cache misses and prefetch operations improve performance
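The three cases above suggest a simple placement heuristic, sketched below. The boolean inputs and the returned placement labels are illustrative assumptions; the paper's findings are qualitative, not an algorithm:

```python
# Illustrative heuristic distilled from the three prefetcher cases above.
# Inputs and placement labels are hypothetical.

def prefetcher_placement(high_data_sharing: bool,
                         high_prefetcher_excessiveness: bool) -> str:
    if high_data_sharing:                 # Case 1
        return "share same cache"
    if high_prefetcher_excessiveness:     # Case 2
        return "share same prefetchers"
    return "minimize cache misses and prefetches"  # Case 3

print(prefetcher_placement(True, False))   # share same cache
```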
Analysis – Processor Cores
• Processor utilization
Key Findings for Processor Cores
• Case 1: Sibling threads have frequent synchronization
Key Findings for Processor Cores
• Case 2: Sibling threads have frequent I/O
Managing Multiple Resources Example
• L2 caches, prefetchers, and memory bandwidth are closely related resources
• A single metric to evaluate their aggregated performance impact
• L2MP: the L2-cache-misses-memory-latency product
  – L2MP = L2_cache_misses × Memory_latency
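A minimal sketch of the L2MP metric as defined above. The counter values below are hypothetical, chosen to show why a combined metric is useful: one mapping may trade more misses for lower per-miss latency, and L2MP captures the aggregate effect:

```python
# L2MP = L2 cache misses x memory latency, as defined on the slide.
# The counter values below are hypothetical, not measured data.

def l2mp(l2_cache_misses: int, memory_latency: float) -> float:
    return l2_cache_misses * memory_latency

iso = l2mp(1_000_000, 300.0)   # fewer misses, higher latency
spr = l2mp(1_500_000, 150.0)   # more misses, lower latency
print(iso, spr)                 # 300000000.0 225000000.0
```

Under these made-up numbers the second mapping has the lower L2MP despite more misses.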
Conclusion
• Thread mapping algorithms should:
  – Consider all the key resources together
  – Improve the utilization of the resources that provide the maximum benefit
  – Consider the co-running application's characteristics