Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa
Department of Computer Science, University of Virginia
Motivation
• Chip-multiprocessors offer a large number of cores and ample resources
• The number of simultaneously executing applications is increasing
• Careful resource management is critical
• Thread mapping is a powerful technique for resource management
Challenges for Thread Mapping
• Multiple resources are affected
• Threads exhibit varied run-time characteristics
Goal of this Research
• Analyze why a particular thread mapping is better than another mapping:
  – What are the resources that cause the performance differences?
  – What are the thread characteristics that cause the resource utilization differences?
  – What is the relative importance of various resources?
Contributions
• In-depth performance analysis of various thread mappings using multi-threaded applications on real hardware
• Identification of the key hardware resources
• Determination of each mapping's impact on key-resource utilization
• A new metric, L2MP, to analyze the performance of the combined memory resources
Outline
• Motivation
• Challenges
• Contributions
• Overview – resources, metrics, mappings
• Analysis – prefetchers, processor cores
• Key findings for thread mapping
• Conclusion
Overview
• A comprehensive analysis considering various factors:
  – The application's performance
  – The application's characteristics
  – The hardware resources shared by applications
Resources and Metrics
• Resources
  – Memory resources: L1 I/D caches, I/D TLBs, L2 cache, prefetchers, memory interconnect
  – Processor resources: memory disambiguation units, branch predictors, processor cores
• Metrics
  – Cache misses, mis-predictions, memory latency (via hardware performance counters (HPCs))
  – Processor utilization (from the OS)
  – Execution cycles and execution time
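The slides list which raw HPC events are collected but not how derived metrics are formed. As a minimal illustration (the helper names and counter values below are assumptions, not taken from the paper), two standard metrics derived from such counters are the cache miss rate and cycles per instruction:

```python
# Illustrative helpers for HPC-derived metrics; function names and the
# sample counter values are hypothetical, not measured data from the paper.

def miss_rate(misses: int, accesses: int) -> float:
    """Cache miss rate: fraction of accesses that miss."""
    return misses / accesses if accesses else 0.0

def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: a coarse execution-efficiency metric."""
    return cycles / instructions

print(miss_rate(2_000, 100_000))      # 0.02
print(cpi(3_000_000, 1_500_000))      # 2.0
```

On Linux, raw event counts like these are typically collected with the perf subsystem.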
Thread Characteristics of Multi-threaded Applications
• Single-thread characteristics
  – Cache demand
  – Memory bandwidth demand
  – I/O frequency
  – Prefetcher effectiveness
  – Prefetcher excessiveness
• Multiple-thread characteristics
  – Sibling threads
  – Data and instruction sharing
Four Thread Mappings

Mapping | Core 0     | Core 1     | Core 2     | Core 3
        |     (share LLC0)        |     (share LLC1)
--------+------------+------------+------------+-----------
OSMap   | Any thread | Any thread | Any thread | Any thread
IsoMap  | a1, a1     | a1, a1     | a2, a2     | a2, a2
IntMap  | a1, a1     | a2, a2     | a1, a1     | a2, a2
SprMap  | a1, a2     | a1, a2     | a1, a2     | a1, a2
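The three fixed mappings place the four worker threads of each co-running application (a1, a2) on specific cores. A minimal sketch of those placements, with the function name and dictionary layout chosen for illustration (the paper does not give this code):

```python
# Hypothetical sketch of the three fixed mappings. Each application (a1, a2)
# has 4 worker threads; cores {0,1} share one LLC and cores {2,3} the other.

def assign_cores(mapping: str) -> dict:
    """Return, per application, the cores its 4 worker threads run on."""
    if mapping == "IsoMap":   # each application isolated on one LLC
        return {"a1": [0, 0, 1, 1], "a2": [2, 2, 3, 3]}
    if mapping == "IntMap":   # applications interleaved across both LLCs
        return {"a1": [0, 0, 2, 2], "a2": [1, 1, 3, 3]}
    if mapping == "SprMap":   # each application spread over all cores
        return {"a1": [0, 1, 2, 3], "a2": [0, 1, 2, 3]}
    raise ValueError("OSMap leaves placement to the OS scheduler")

print(assign_cores("IntMap"))
```

On Linux, such a placement could be enforced per thread with `os.sched_setaffinity(tid, {core})`.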
ISPASS 2012
[Figure: two dual-core chips; each core has a private L1 cache and TLB; cores 0–1 and cores 2–3 each share an L2 cache and hardware prefetchers; both chips connect to off-chip memory over the memory interconnect, with App 1 and App 2 threads placed on the cores]
Experimental Setup – Platform & Workloads
• Intel Core 2 Q9550 processor
• PARSEC benchmark suite 2.1
  – 9 benchmarks
• All possible pairs (36) of the 9 benchmarks
• 4 worker threads per benchmark
[Figure: experimental platform – four cores with private L1 caches and TLBs; cores 0–1 and cores 2–3 each share an L2 cache and hardware prefetchers, attached to the memory controller and memory]
Key Resources
• A key resource is one whose:
  – Utilization varies considerably across mappings
  – Utilization variation results in differences in the application's performance
• Identification techniques
  – Direct approach: use HPCs
  – Indirect approach: use the application's performance under different mappings
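The direct approach above can be sketched as a simple variability test: measure a resource's utilization under each mapping and flag it as key when the variation is considerable. The coefficient-of-variation formulation, the 0.2 threshold, and the sample values are assumptions for illustration, not taken from the paper:

```python
# Illustrative sketch of the "direct approach" to key-resource identification.
# Threshold and example counter values are hypothetical.
from statistics import mean, pstdev

def is_key_resource(utilization_by_mapping: dict, threshold: float = 0.2) -> bool:
    """Flag a resource as key if its utilization varies considerably
    across mappings (coefficient of variation above the threshold)."""
    values = list(utilization_by_mapping.values())
    cv = pstdev(values) / mean(values)
    return cv > threshold

l2_misses = {"IsoMap": 1.0e7, "IntMap": 3.5e7, "SprMap": 2.8e7}  # hypothetical
print(is_key_resource(l2_misses))
```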
Key Resources
• More important resources
  – Memory resources: L1 D-cache, L2 cache, hardware prefetchers, memory interconnect
• Less important resources
  – L1 I-cache, I/D TLBs, memory disambiguation unit
  – Processor resources: branch predictor
Experimental Results: streamcluster (with blackscholes)
Analysis – Hardware Prefetchers
• Case 1: Threads that share a high amount of data
  – Sharing the same cache improves performance
Key Findings for Hardware Prefetchers
• Case 2: Threads that have low or no data sharing but high prefetcher excessiveness
  – Sharing the same prefetchers improves performance
Key Findings for Hardware Prefetchers
• Case 3: Threads that have low data sharing and low prefetcher excessiveness
  – Fewer cache misses and prefetch operations improve performance
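The three cases above suggest a simple placement heuristic, sketched below. The boolean inputs and the returned placement labels are illustrative assumptions; the paper's findings are qualitative, not an algorithm:

```python
# Illustrative heuristic distilled from the three prefetcher cases above.
# Inputs and placement labels are hypothetical.

def prefetcher_placement(high_data_sharing: bool,
                         high_prefetcher_excessiveness: bool) -> str:
    if high_data_sharing:                 # Case 1
        return "share same cache"
    if high_prefetcher_excessiveness:     # Case 2
        return "share same prefetchers"
    return "minimize cache misses and prefetches"  # Case 3

print(prefetcher_placement(True, False))   # share same cache
```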
Analysis – Processor Cores
• Processor utilization
Key Findings for Processor Cores
• Case 1: Sibling threads have frequent synchronization
Key Findings for Processor Cores
• Case 2: Sibling threads have frequent I/O
Managing Multiple Resources Example
• L2 caches, prefetchers, and memory bandwidth are closely related resources
• A single metric to evaluate their aggregated performance impact
• L2MP: the L2-cache-misses-memory-latency product
  – L2MP = L2_cache_misses × Memory_latency
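A minimal sketch of the L2MP metric as defined above. The counter values below are hypothetical, chosen to show why a combined metric is useful: one mapping may trade more misses for lower per-miss latency, and L2MP captures the aggregate effect:

```python
# L2MP = L2 cache misses x memory latency, as defined on the slide.
# The counter values below are hypothetical, not measured data.

def l2mp(l2_cache_misses: int, memory_latency: float) -> float:
    return l2_cache_misses * memory_latency

iso = l2mp(1_000_000, 300.0)   # fewer misses, higher latency
spr = l2mp(1_500_000, 150.0)   # more misses, lower latency
print(iso, spr)                 # 300000000.0 225000000.0
```

Under these made-up numbers the second mapping has the lower L2MP despite more misses.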
Conclusion
• Thread mapping algorithms should:
  – Consider all the key resources together
  – Improve the utilization of the resources that provide the maximum benefit
  – Consider the co-running application's characteristics