(1)

Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources

Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa

Department of Computer Science, University of Virginia

(2)

Motivation

• Chip-multiprocessors offer a large number of cores and ample resources
• The number of simultaneously executing applications is increasing
• Careful resource management is critical
• Thread mapping is a powerful technique for resource management

(3)

Challenges for Thread Mapping

• Multiple resources are affected
• Threads demonstrate various run-time characteristics

(4)

Goal of this Research

• Analyze why a particular thread mapping is better than another mapping:
  • What are the resources that cause the performance differences?
  • What are the thread characteristics that cause the resource utilization differences?
  • What is the relative importance of the various resources?

(5)

Contributions

• In-depth performance analyses of various thread mappings using multi-threaded applications on real hardware
• Identify the key hardware resources
• Determine the impact on key resource utilization
• Introduce a new metric, L2MP, to analyze the performance of the combined memory resources

(6)

Outline

• Motivation
• Challenges
• Contributions
• Overview – resources, metrics, mappings
• Analysis – prefetchers, processor cores
• Key findings for thread mapping
• Conclusion

(7)

Overview

• A comprehensive analysis considering various factors:
  • Application performance
  • Application characteristics
  • Hardware resources shared by applications

(8)

Resources and Metrics

Resources
• Memory resources: L1 I/D caches, I/D TLBs, L2 cache, prefetchers, memory interconnect
• Processor resources: memory disambiguation units, branch predictors, processor core

Metrics
• Cache misses, mispredictions, memory latency (measured with hardware performance counters (HPCs))
• Processor utilization (from the OS)
• Execution cycles and execution time
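Most of the metrics above come from hardware performance counters; how the counters are read is not part of the slides. As one hypothetical illustration, on Linux a single counter (here, last-level cache misses) can be sampled around a region of interest with the perf_event_open interface:

```c
/* Hypothetical sketch: sampling one hardware performance counter
 * (last-level cache misses) around a region of interest using the
 * Linux perf_event_open interface. Not the authors' measurement code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* perf_event_open has no glibc wrapper, so invoke the raw syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: run the threads being analyzed ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```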

(9)

Thread Characteristics of Multi-threaded Applications

• Single-thread characteristics
  • Cache demand
  • Memory bandwidth demand
  • I/O frequency
  • Prefetcher effectiveness
  • Prefetcher excessiveness
• Multiple-thread characteristics
  • Sibling threads
  • Data and instruction sharing

(10)

Four Thread Mappings

Mapping | Core 0 (LLC0) | Core 1 (LLC0) | Core 2 (LLC1) | Core 3 (LLC1)
--------|---------------|---------------|---------------|--------------
OSMap   | any thread    | any thread    | any thread    | any thread
IsoMap  | a1, a1        | a1, a1        | a2, a2        | a2, a2
IntMap  | a1, a1        | a2, a2        | a1, a1        | a2, a2
SprMap  | a1, a2        | a1, a2        | a1, a2        | a1, a2

(a1 and a2 denote worker threads of application 1 and application 2; each core is assigned two threads.)


[Figure: Cores 0 and 1 each have a private L1 cache and TLB and share one L2 cache and its hardware prefetchers (LLC0); Cores 2 and 3 likewise share a second L2 cache and prefetchers (LLC1); both halves connect to off-chip memory through the memory interconnect. App 1 and App 2 run on this chip.]
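The mappings above amount to pinning each application's worker threads to specific cores. As a hypothetical illustration (not the authors' implementation), an IntMap-style placement of application a1's four threads onto cores 0 and 2 could be expressed on Linux as:

```c
/* Hypothetical sketch: pinning the four worker threads of application a1
 * the way IntMap does (two threads on core 0, two on core 2, so a1 spans
 * both shared L2 caches). Core numbers and layout are illustrative. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... application work ... */
    return NULL;
}

int main(void)
{
    int core_of_thread[4] = { 0, 0, 2, 2 };  /* IntMap placement for a1 */
    pthread_t tid[4];

    for (int i = 0; i < 4; i++) {
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_of_thread[i], &set);
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

(Build with -pthread.) The other mappings differ only in the placement table: {0, 0, 1, 1} keeps a1's threads behind a single shared L2 (IsoMap), {0, 1, 2, 3} spreads them across all cores (SprMap), and OSMap simply leaves the affinities unset.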

(11)

Experimental Setup – Platform & Workloads

• Intel Core 2 Q9550 processor
• PARSEC benchmark suite 2.1
  • 9 benchmarks
  • All possible pairs (36) of the 9 benchmarks
  • 4 worker threads per benchmark

[Figure: Q9550 chip layout – each core has a private L1 cache and TLB; Cores 0–1 and Cores 2–3 each share an L2 cache and hardware prefetchers; both halves connect to the memory controller and memory.]

(12)

Key Resources

• A key resource is one where:
  • Utilization of the resource varies considerably across mappings
  • The utilization variation results in differences in the application's performance
• Identification techniques
  • Direct approach: use HPCs
  • Indirect approach: use the application's performance under different mappings

(13)

Key Resources

• More important resources
  • Memory resources
    • L1 D-cache
    • L2 cache
    • Hardware prefetchers
    • Memory interconnect
• Less important resources
  • L1 I-cache
  • I/D TLB
  • Memory disambiguation unit
  • Processor resources
    • Branch predictor

(14)

Analysis – Hardware Prefetchers

• Experimental results: streamcluster (with blackscholes)

(15)

Key Findings for Hardware Prefetchers

• Case 1: Threads that share a high amount of data
  • Sharing the same cache improves performance

(16)

Key Findings for Hardware Prefetchers

• Case 2: Threads that have low or no data sharing but high prefetcher excessiveness
  • Sharing the same prefetchers improves performance

(17)

Key Findings for Hardware Prefetchers

• Case 3: Threads that have low data sharing and low prefetcher excessiveness
  • Fewer cache misses and fewer prefetch operations improve performance

(18)

Analysis – Processor Cores

• Processor utilization
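The slides note that processor utilization comes from the OS; the exact source is not specified. As a hypothetical sketch, overall utilization on Linux can be estimated from two samples of /proc/stat:

```c
/* Hypothetical sketch: estimating overall processor utilization from
 * /proc/stat on Linux. The paper only says utilization comes "from the
 * OS"; this particular source and interval are assumptions. */
#include <stdio.h>
#include <unistd.h>

/* Read the aggregate "cpu" line: user nice system idle iowait irq softirq */
static int read_cpu_times(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq) != 7) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *busy  = user + nice + sys + irq + softirq;
    *total = *busy + idle + iowait;
    return 0;
}

int main(void)
{
    unsigned long long busy0, total0, busy1, total1;
    if (read_cpu_times(&busy0, &total0)) return 1;
    sleep(1);                                   /* sample interval */
    if (read_cpu_times(&busy1, &total1)) return 1;
    printf("utilization: %.1f%%\n",
           100.0 * (double)(busy1 - busy0) / (double)(total1 - total0));
    return 0;
}
```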

(19)

Analysis – Processor Cores

(20)

Key Findings for Processor Cores

• Case 1: Sibling threads have frequent synchronization

(21)

Key Findings for Processor Cores

• Case 2: Sibling threads have frequent I/O

(22)

Managing Multiple Resources – Example

• L2 caches, prefetchers, and memory bandwidth are closely related resources
• Use a single metric to evaluate their aggregated performance impact
• L2MP: the L2-cache-misses-memory-latency product
  • L2MP = L2_cache_misses × memory_latency
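A minimal sketch of the L2MP computation and of comparing it across two candidate mappings; the counter values below are made up purely for illustration:

```c
/* Hypothetical sketch of the L2MP computation: the product of L2 cache
 * misses and average memory latency, both taken from hardware counters.
 * The counter values below are made up for illustration only. */
#include <stdio.h>
#include <stdint.h>

static double l2mp(uint64_t l2_cache_misses, double memory_latency_cycles)
{
    return (double)l2_cache_misses * memory_latency_cycles;
}

int main(void)
{
    double l2mp_intmap = l2mp(4200000ULL, 310.0);  /* mapping A (made up) */
    double l2mp_isomap = l2mp(6900000ULL, 280.0);  /* mapping B (made up) */

    /* Lower L2MP means less aggregate pressure on the L2 caches,
       prefetchers, and memory interconnect. */
    printf("IntMap L2MP = %.3e, IsoMap L2MP = %.3e\n",
           l2mp_intmap, l2mp_isomap);
    printf("prefer %s\n", l2mp_intmap < l2mp_isomap ? "IntMap" : "IsoMap");
    return 0;
}
```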

(23)

L2MP

(24)

Thread Mapping Algorithms

• Consider all the key resources together
• Improve the utilization of the resources that provide the maximum benefit
• Consider the co-running application's characteristics

(25)

• For memory-intensive applications
  • streamcluster, canneal, facesim, fluidanimate
  • Minimize the L2MP metric
• For I/O- or CPU-intensive applications
  • swaptions, blackscholes, vips, x264, bodytrack
  • Maximize processor utilization
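Combining the two rules, a hypothetical mapping-selection routine (the struct fields, classification, and numbers are illustrative assumptions, not the paper's algorithm) could look like:

```c
/* Hypothetical sketch of the selection rule above: minimize L2MP for
 * memory-intensive applications, maximize processor utilization for
 * I/O- or CPU-intensive ones. Structs and numbers are illustrative. */
#include <stdio.h>
#include <stddef.h>

enum app_class { MEMORY_INTENSIVE, CPU_OR_IO_INTENSIVE };

struct mapping_stats {
    const char *name;        /* e.g., "IsoMap", "IntMap", "SprMap"  */
    double l2mp;             /* L2 cache misses x memory latency    */
    double cpu_utilization;  /* fraction of core cycles kept busy   */
};

static const char *choose_mapping(enum app_class cls,
                                  const struct mapping_stats *m, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        int better = (cls == MEMORY_INTENSIVE)
                         ? (m[i].l2mp < m[best].l2mp)
                         : (m[i].cpu_utilization > m[best].cpu_utilization);
        if (better)
            best = i;
    }
    return m[best].name;
}

int main(void)
{
    struct mapping_stats candidates[] = {   /* made-up measurements */
        { "IsoMap", 1.9e9, 0.82 },
        { "IntMap", 1.3e9, 0.79 },
        { "SprMap", 1.6e9, 0.88 },
    };
    printf("memory-intensive  -> %s\n",
           choose_mapping(MEMORY_INTENSIVE, candidates, 3));
    printf("CPU/I/O-intensive -> %s\n",
           choose_mapping(CPU_OR_IO_INTENSIVE, candidates, 3));
    return 0;
}
```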

(26)

Conclusion

• Identified six key resources
• Analyzed how to map threads with particular characteristics to improve resource utilization
• Introduced a new metric, L2MP, for managing the key memory resources
• Determined the relative importance of the key resources

(27)

Related Work

• Shared-cache-aware thread mapping
  • Jiang et al., PACT 2008
  • Chandra et al., HPCA 2005
  • Xie et al., CMP-MSI 2008
  • Knauerhase et al., IEEE Micro 2008
• Cache-, prefetcher-, and FSB-aware thread mapping

(28)

Thank you & Questions?

