Infrastructure and Methods for High level Architectural Exploration for Graphics.

(1)

AGARWAL, MONAM. Infrastructure and Methods for High level Architectural Exploration for Graphics. (Under the direction of Dr. Gregory Byrd)

The work done in this thesis is divided into two categories. The first is the development of a

feasible infrastructure for fast and reasonably accurate studies of the complex scenarios of

interactions within an SoC. This infrastructure includes a simulator that has an abstract core,

along with models of the last level cache and cache controller, the interconnection network,

and memory. It is built using the SystemC TLM2.0 framework and is aimed to be abstract

and accurate. It is flexible and easy for experimentation, and helps to capture the functional

and performance characteristics of a system at an architectural level. The model is developed

and correlated against a product level cycle-accurate simulator for various parameters. In this

work, this scalable model is used to specifically mimic and study the behavior of graphics

interacting with the system.

The second focus is the characterization of graphics traffic captured outside of the graphics

core at the interface of the core and the rest of the SoC. This characterization is based on

finding interesting patterns in the traces captured. It is a kind of offline infrastructure

developed, that gives useful insights and when used along with the developed high level

model, aids in the process of finding and experimenting with graphics related optimizations

involving the last level cache.

Hence, this work addresses the development of a feasible high level abstracted model

i.e. a simulator as well as offline processing methods that can be used to study architectural

optimizations with a special focus on the integrated GPU scenario. It is observed that the

(2)

(3)

(4)

by

Monam Agarwal

A Thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the degree of

Master of Science

Computer Engineering

Raleigh, North Carolina

2016

APPROVED BY:

_______________________________ _______________________________

Dr. Gregory Byrd Dr. Huiyang Zhou

Committee Chair

(5)

DEDICATION

(6)

BIOGRAPHY

Monam Agarwal was born in November 1990 in Bijnor, India. She received her Bachelor’s

degree from Delhi College of Engineering, Delhi University in 2012 in the field of

Electronics and Communication engineering.

After completing her under graduation, she held position of Systems Engineer in the

In-process Electrical Checks Department in BrahMos Aerospace Pvt. Ltd, a joint venture of the

DRDO, India from 2012 to 2014. Currently she is a graduate student in the Department of

Computer Engineering at North Carolina State University. Her fields of interest lie in

Computer architecture, Performance modeling and analysis of complex System-on-Chip

(7)

ACKNOWLEDGMENTS

I would begin by expressing my gratitude to my advisor, Dr. Gregory Byrd, Associate

Department Head, Department of Electrical and Computer Engineering, for helping and

advising me, and giving useful suggestions and insights throughout the course of the thesis.

The successful completion of this work would not have been possible without his

encouragement and belief in my capabilities.

I am extremely grateful to the PDG Architecture team at Intel, Bangalore, my mentor Mr.

Saurabh Tiwari, CPU Performance Architect, Intel and the entire team as a lot of work on the

thesis was carried out during the course of my internship with this team. I am thankful to my

mentor for his valuable time and extremely useful technical sessions and advice and technical

help at various points in the project, for ramping me up on very fundamental and useful

concepts and helping me improve my skills throughout during the course of the work.

I am thankful to my family, my parents and my sister for always giving me moral support

and encouragement to be able to accomplish and achieve goals in life and hence also to be

able to complete this work successfully.

I am also grateful to my friends, Mansi, Lakshay, Manish, Sidhant, Sabir, Srijit who have

been my support all through my graduate school and played an important part for me

keeping up with the difficult times I faced during the course of my graduate studies and

(8)

TABLE OF CONTENTS

1 Introduction ... 1

1.1 Integrated vs. Standalone Graphics ... 1

1.2 Simulation of Integrated Graphics ... 4

1.3 Contributions ... 6

2 SystemC TLM 2.0 Simulation Framework ... 8

2.1 Use of SystemC TLM2.0 ... 8

2.2 Timing Model ... 9

2.3 Non-Blocking Interface ... 10

2.4 Transaction Packets ... 10

2.5 Scalability ... 11

2.6 Lumped Delays ... 11

2.7 Tracking, Crediting and Arbitration ... 11

2.7.1 Static Crediting ... 14

2.8 Statistics Collection and Parameters ... 15

3 The High-Level Uncore model Architecture ... 16

3.1 Injector/Graphics Core ... 17

3.1.1 Requests Tracking Table (Core) ... 18

3.1.2 Address Match Logic ... 19

(9)

3.2 The Interconnect: Core Arbiter ... 20

3.2.1 Multi-Socket ... 20

3.2.2 Arbiter Logic ... 21

3.2.3 Signal and Data Queues ... 22

3.2.4 Hash Logic ... 22

3.2.5 Credits and Delays ... 22

3.3 Last Level Cache and Cache Controller ... 23

3.3.1 The Last Level Cache ... 23

3.3.2 Replacement Policy ... 24

3.3.3 Cache Controller ... 25

3.4 Cache Controller Arbiter Interconnect ... 26

3.5 Memory Controller ... 27

3.6 Flows and Protocols Modeled ... 28

3.7 Synthetic Scenario Generation Capability and Synthetic Validation Flows ... 28

3.8 Launching Workload studies, Statistics Rollup, Debug Infrastructure ... 31

4 Golden Reference Model vs. the HL_uncore ... 32

4.1 The Ring Interconnect ... 33

4.2 The Last Level Cache and Cache Controller ... 33

(10)

4.4 Abstraction Choices of Hl_uncore with respect to the Reference Model ... 35

4.4.1 Abstraction Choice 1: Modeling of Ring Interconnect ... 35

4.4.2 Abstraction Choice 2: Simplification of Cache Controller Flows ... 36

4.4.3 Abstraction Choice 3: Lumped Delays ... 36

5 Correlation Studies and Results ... 37

5.1 Workloads ... 37

5.2 SoC Configuration for Experiments ... 38

5.3 Simulation Speed Comparison ... 40

5.4 Code Base Comparison ... 41

5.5 Correlation on Cache Associativity Study ... 41

5.5.1 Description ... 41

5.5.2 Discussion on Trends and Correlation ... 42

5.6 Correlation on Cache Size Study ... 46

6 Studies using HL_uncore Model ... 48

6.1 Memory Latency ... 48

6.2 Buffer Sizing ... 49

6.3 Cache Hits As a function of Time ... 53

7 Trace Characterization ... 56

7.1 Type of Requests Distribution ... 57

(11)

7.1.2 Results & Inferences ... 57

7.2 Data Footprint in Cache ... 58

7.2.1 Methodology ... 58

7.2.2 Results and Inferences ... 59

7.3 Stream Detection ... 60

7.3.1 Methodology ... 60

7.3.2 Results and Inferences ... 61

7.4 Cache Policy Studies based on Trace Characterization Results ... 62

7.4.1 Unique Addresses Accessed ... 62

7.4.2 Reuse Number Profiling ... 63

7.4.3 Cache Replacement Policy Experiment with HL_uncore based on Reuse Number 66 7.4.4 Reuse Distance ... 69

8 Conclusion and Future Work ... 73

(12)

LIST OF TABLES

Table 3.1 Synthetic Scenario Generation Flows ... 29

Table 5.1 Base SoC Configuration Details ... 39

Table 5.2 Code Base Comparison ... 41

Table 5.3 Difference in Hit Rates for Associativity = 8, Baseline LRU Policy ... 45

Table 5.4 Difference in Hit rates for Cache Sizes 1, 2, 4 MB ... 47

Table 7.1 Stream detection results ... 61

(13)

LIST OF FIGURES

Figure 1.1 Integrated Graphics Block Level Diagram ... 2

Figure 1.2 External Graphics Block Diagram... 3

Figure 1.3 CPU GPU Interaction ... 4

Figure 1.4 Overview of the Work ... 7

Figure 2.1 TLM Timing Models and their Implementation Mechanisms [3]... 9

Figure 2.2 Types of Crediting Protocols ... 13

Figure 3.1 Workload Flow for Graphics Simulation ... 16

Figure 3.2 Hl_Uncore High level Diagram ... 17

Figure 3.3 Detailed Architecture of Core Arbiter ... 18

Figure 3.4 Detailed Architecture of Core Arbiter Interconnect ... 20

Figure 3.5 Arbiter between Injector cores and Cache Controllers ... 21

Figure 3.6 Detailed LLC & Cache Controller Architecture ... 23

Figure 3.7 Detailed Cache Controller Arbiter Architecture ... 27

Figure 4.1 The Real Uncore Architecture [7] ... 32

Figure 4.2 The High level Core-LLC Interaction [9] ... 34

Figure 5.1 Simulation Speed Comparison ... 40

Figure 5.2 Trend for Varying Associativity- Baseline LRU ... 43

Figure 5.3 Trend for Varying Associativity- Vulnerable Fill ... 44

Figure 5.4 Trend for Varying Cache Size ... 46

Figure 6.1 Performance Evaluation for varying latencies ... 49

(14)

Figure 6.3 Speed Up/Performance Improvement for RTT Size=64 vs 32 ... 50

Figure 6.4 RTT (Size 64 vs. 32) Occupancy vs. Simulation Cycles for angry birds ... 51

Figure 6.5 RTT (Size 64 vs. 32) Occupancy vs. Simulation Cycles for Sandra ... 52

Figure 6.6 RTT (Size 64 vs. 32) Occupancy vs. Simulation Cycles for boshcluster ... 53

Figure 6.7 Cache Hits as a function of Time, angry birds workload ... 53

Figure 6.8 Cache Hits as a function of Time, race stars workload ... 54

Figure 6.9 Cache Hits as a function of Time, dead trigger2 workload ... 54

Figure 7.1 Pattern of Workload Traces ... 56

Figure 7.2 Opcode Distribution ... 58

Figure 7.3 LLC Data Footprint ... 59

Figure 7.4 Stream Detector Implementation ... 60

Figure 7.5 Percentage of Unique Addresses in Different workload traces ... 62

Figure 7.6 Reuse Number: Sandra ... 63

Figure 7.7 Reuse Number: real racing ... 64

Figure 7.8 Reuse Number: race stars ... 64

Figure 7.9 Reuse Number: asphalt8 ... 64

Figure 7.10 Reuse Number: cut_the_rope2 ... 65

Figure 7.11 Reuse Number: dead trigger ... 65

Figure 7.12 Normalized Hit Rates for various LRU configurations, Cache size = 4 MB ... 67

Figure 7.13 Normalized Hit Rates for various LRU configurations, Cache size = 1 MB ... 68

Figure 7.14 Reuse Distance Implementation ... 69

(15)

Figure 7.16 Reuse Distance: dead trigger2 ... 71

Figure 7.17 Reuse Distance: Sandra ... 71

(16)

1 Introduction

For many years, trends in high performance computing, guided by Moore’s Law, have

progressively led to the development of complex heterogeneous systems. Designs have been

able to accommodate increased complexity and optimizations due to the availability of

increased number of transistors on a single chip. It has led to increased circuit level

parallelism and integration of subsystems on the same die.

The trend of the GPU being integrated with CPU on the same die is the result of this

phenomenon. Today, graphics forms an integral part of many computing systems. The

increasing complexity, resolution and realism of game animation require highly specialized

processing engines to obtain high performance for graphics rendering. The CPU solely

cannot be relied for providing such high end performance. Hence specialized GPU

accelerators form an integral part in many systems today. The CPU-GPU system can have

mainly two configurations. The first one is the Integrated Graphics scenario. Intel and AMD

design such integrated systems, with Intel’s Sandy Bridge and AMD’s APU being two of the

first integrated CPU-GPU designs in the market. The second is the standalone graphics card

technology such as that produced by NVidia.

1.1 Integrated vs. Standalone Graphics

In the case of integrated graphics as shown in Figure 1.1, the graphics accelerator is

integrated within the same die as the CPU. It shares the last level cache, interconnect and the

system agent, consisting of the memory controller and the IO. The Double Data Rate (DDR)

DRAM memory in the case of integrated graphics has broadly three divisions. First is the

temporary or stolen memory that is used exclusively by graphics for storing intermediate

values. Second is the CPU memory that belongs totally to the cores, and forms the largest

(17)

buffer for commands issued from the host (CPU cores) to the device (GPU) and outputs for

the Display.

Figure 1.1 Integrated Graphics Block Level Diagram

In contrast, in the case of standalone Graphics as shown in Figure 1.2, the CPU and the GPU

have separate DDRs. The host has a driver that connects through an IO bus such as PCIe.

The driver running on host programs the system-DMA to transfer data from host memory to

graphics card memory.

In a modern SoC, the cores are integrated with on-die last level caches, in addition to the

high level caches in the individual cores. To further reduce memory latency, the memory

controller is also brought onto the die. Hence the Graphics processor that is present on the die

alongside the CPU cores can take advantage of the on-die LLC for catering to its increased

bandwidth requirements, as GPUs are data intensive processing units.

The efficient utilization of on-die caching of integrated graphics traffic is a field of active

research. There are various existing cache policies that cater to the scenario of the CPU cores

(18)

optimization techniques is still open and being studied. Such studies requires fast and

efficient simulation infrastructure.

(19)

1.2 Simulation of Integrated Graphics

Conventionally, pre-silicon graphics simulation is done using a cycle-accurate simulator. The

game workloads are run on an older generation of silicon, over the underlying DX or

OpenGL API, where the driver underneath has instrumentation to capture the workload

traces[1]. These traces consist of two types of information: the command sequence and the

data describing the surfaces that refer to two-dimensional geometrical objects in context of

graphics, stored in memory. These trace files are fed to the graphics simulator.

Two kinds of scenarios can be simulated. In the first case, it is assumed that while the

graphics engine is active, the CPU cores are not in active mode. In other words, after doing

the relevant physics, the CPU cores are not in execution mode for any of the benchmarks

until the GPU executes a particular workload, as shown in Figure 1.3. The second case is the

mix of workloads relevant to both the CPU cores and the GPU fed at the same time, i.e. both

processing units are in active mode simultaneously. Both scenarios need thorough analysis to

derive a good performance from the SoC.

Figure 1.3 CPU GPU Interaction

Hence a robust simulation methodology and infrastructure needs to be developed even before

analyzing the above system for improvement and optimization. It is imperative that the

architectural design is validated and studied for optimizations and tradeoffs well within time

in the design flow. The duration spent on such studies should be minimal but at the same

time accurate to be able to reach well-informed design decisions to minimize issues in

post-silicon. Some of the issues revolving around Graphics simulation and its analysis are as

(20)

i) Stale workloads: The workload traces extracted are not updated, as they are obtained by running workloads on silicon that is at least a generation older than

the design for which the simulation is being carried out. Due to various software

and hardware architectural changes and driver optimizations

generation-over-generation, the traces collected are hence not very accurate. However, ideally it

should not make a very big difference in terms of performance.

ii) Simulation speed and Flexibility: The simulation models being cycle-accurate and modeled to the last level of detail in terms of microarchitecture are very slow.

It takes days for a particular job to complete and results to appear. Also, very

accurate and detailed models have less flexibility for lot of experimentation based

on architectural parameters for studying optimizations and other architectural

improvements.

iii) Large workloads: The workload traces that are obtained constitute gigabytes of data that is extremely hard to run in the compute farm that is distributed across the

globe, as these workloads need to be copied from one compute farm to another. It

further aggravates the problem of time-consuming simulation.

iv) Characterization of Graphics traffic to SoC: A lot of work has been done and is still being done for studying the 3D Graphics workloads and their

characterization based on their behavior inside the graphics engine. However,

characterization of traffic at the interface of the graphics and the SoC that has

filtered localities and very limited information is a field that needs to be explored

further.

v) SoC level Cache Optimizations for Graphics: As explained above, the optimization of caching for graphics at a SoC level is an area for exploration, in

(21)

1.3 Contributions

The work presented in this document focuses on three of the issues described above:

specifically, ii, iv and v.

First, a high level abstraction model is built using the SystemC TLM 2.0 framework for

simulation of a SoC at an abstract level, mainly used for architecture level analysis. The

model developed is fast, scalable, flexible to changes, easy to work with, and reasonably

accurate for carrying out a variety of function- and performance-based studies and analyses.

The second contribution is the characterization of the graphics traffic captured at the

interface of the SoC and the graphics core to understand the nature of the traffic at the SoC

and get an insight into the kind of caching optimizations that can be experimented with using

these results.

Third, using these two different kinds of infrastructure, studies have been carried out to

explore the avenues of optimizations in the caching policies for possibly improving the

performance of the system.

(22)

WHAT?

-Modeling of Abstract Uncore Model HL_UNCORE

-Graphics Traces Analysis (Using simple scripts)

-Alignment of HL_UNCORE & Reference Model

Simulate Cache Policies in HL_UNCORE

WHY?

-Studying Behavior of Graphics interaction in SoC (Cache hit rates, replacement policies, latencies) Fast Simulation

Framework + findings

Evaluation & Analysis

Tuning & Implementation

(23)

2 SystemC TLM 2.0 Simulation Framework

The Uncore is a term used to describe the section of a SoC that is not part of the processing

core/cores. The High level abstraction model of the uncore developed in this work is

developed in the SystemC Transactional Level Modeling (TLM) 2.0 framework. SystemC is

an open source library built on top of C++ that adds the capability of describing hardware at

various levels of detail. TLM is a high level of abstraction and helps in the modeling of

communication between various functional modules. In this case, TLM2.0 is used to describe

hardware at the SoC component level. Various sources [2],[3],[4],[5] have described the

modules and implementation mechanisms in SystemC TLM2.0 framework. The mechanisms

and constructs used in this work are described in the following sections.

2.1 Use of SystemC TLM2.0

The main reason for selecting SystemC TLM2.0 is its availability as an open source standard.

It is a single, unified, design and verification language that is used to express architectural

attributes and system level features in the form of C++ classes. SystemC enables modeling of

system architecture at a very highly abstracted level that is much faster than the pin- and

cycle-accurate RTL (Register Transfer Level) implementation. TLM 2.0 standard, integrated

with SystemC, gives the advantage of interoperability. TLM framework on top of core

SystemC modules allows easy exchange of these modules/IPs among different companies, or

within the same company among different teams. Apart from the basic communication

protocol, it provides the flexibility to customize the protocol of communication between

(24)

2.2 Timing Model

TLM2.0 provides two types of modeling methodologies with respect to timing: the Loosely

Timed (LT) and the Approximately Timed (AT) models, illustrated in Figure 2.1.

Figure 2.1 TLM Timing Models and their Implementation Mechanisms [3]

The model developed in this work is an approximately timed (AT) model presenting a high

level abstraction of a very detailed product level simulator. It is a cycle-count accurate model

and contains a notion of lumped timing that can be used for first-order performance analysis.

The model makes use of sockets, generic payloads, modified phases and a non-blocking

interface where each process is synchronized with the SystemC scheduler and runs in

(25)

2.3 Non-Blocking Interface

There are two types of processes in SystemC: methods and threads. A thread, once spawned,

executes until it is killed; however, it can be suspended by adding appropriate wait

statements. A method, on the other hand, cannot have wait statement but is sensitive to

particular events in the system, i.e. it executes only on the occurrence of mentioned events in

its sensitivity list.

The blocking transport methodology uses a thread for maintaining communication between

modules. After forwarding a request to another module, it waits for a specified amount of

time (the delay that tries to simulate the latency of issuing a request from one module to

another at a very highly abstracted level) and then returns to execution for proceeding with

the next requests. Such an implementation eliminates the need of buffering and crediting.

However, for modeling more complex and real scenarios, where there is a need to model

buffering of requests to imply a real sense of latency, a method is required. This is the basis

of the non-blocking transport implementation mechanism.

2.4 Transaction Packets

The data, command and acknowledgement content between modules is modeled as packets

that are communicated through ports and exports following a certain protocol. The

information packet is termed as a generic payload in TLM2.0 and can contain information of

the address, command and data, along with some other pre-defined fields. There is flexibility

of adding extensions, which means that additional information pertaining to the data packet

can be appended to the main information packet. This capability is used extensively in the

modeling of our system and enables detailed specification and easy modeling of complex

(26)

2.5 Scalability

Injector is a term used to define an abstracted core model that is used to inject inputs to the

system. The model developed for this thesis can scale from a single injector to multiple

injectors, although the focus is working with a single injector simulating a graphics core. The

model can be scaled to have multiple slices of the last level cache, and also have multiple

channels of memory. The connectivity for such scalability is achieved through the use of the

multi socket feature of TLM 2.0.

The multi-socket is used both ways, with the initiator to allow multiple targets and with the

target to ensure connectivity with many initiators. This feature provides ease and flexibility

in the building of a scalable system.

2.6 Lumped Delays

The model developed is an approximately timed model that uses lumped delays to have a

notion of time and latency. It means that it is a count accurate model and not a

cycle-accurate one, not defining the behavior and state after each and every cycle.

Having a notion of delays also helps in keeping the model deterministic. The functioning is

simple. To model the latency consumed by a transaction, it is pushed into a buffer and is

configured to stay in the buffer for the required latency, after which it is popped out of the

buffer. Such delay blocks are configurable and can be used for modeling different values of

delay as required by the experimentation.

2.7 Tracking, Crediting and Arbitration

Three concepts make an inherent part of the system modeling - the tracking of events

(transactions/requests), crediting flows between modules, and arbitration in the scenario of

(27)

The tracking of each transaction is essential in the system in various modules. The basic

interface protocol of TLM2.0 is customized to maintain system-specific phases.

Many STL containers from the C++ library are used to model the various architectural

components of the system. Vectors, lists and deques are used as per the requirement and

functionality of the particular structure to be modeled. For tracking structures in the injectors,

the cache controllers and the memory controller, vectors are used. The requests are buffered

up in the deque structures for implementing the FIFO functionality.

Simple arbitration is maintained for selecting among multiple injectors for injecting new

requests and for sending read and write requests to memory from the multiple cache

controllers. Arbitration modules with arbitration logic, appropriate delays, and storage

buffers thus form an integral part of the system ensuring scalability and efficient

communication of transactions between modules.

Another very important concept is that of crediting that is essential to maintain accuracy in

tracking, preventing overflow of buffers and loss of requests in the system unaccounted.

Crediting is maintained at the interface between any two modules, both in the forward and

(28)

INITIATOR TARGET

TARGET INITIATOR

REQUEST

REQUEST ACK/NACK

INIT_BUFFER TARG_BUFFER

CREDIT COUNTER Num_entries

(CREDIT--)

(CREDIT++)

Num_entries CREDITING PROTOCOLS

(A)

(B)

Figure 2.2 Types of Crediting Protocols

The protocol shown in Figure 2.2 (A)is a simple request and response acknowledge protocol,

in that the initiator keeps sending a request to the target irrespective of the target buffer size

(it has no knowledge of the target buffer size) and receipt of as many NACKs until it finally

receives an ACK. With the protocol depicted in (B), a new request is sent from the initiator

to the target only if the target has a buffer entry free to accommodate the incoming request.

The tracking of free entries is done through the credit counter in the initiator. It implies that

the initiator has to keep information about the maximum available buffer entries in the target

and keep track of the free entries at a particular instant in the system, decrementing the

counter when a new request is sent and incrementing it when the request is processed further

and the target buffer entry is freed. The request is buffered in the initiator buffer itself until

there are free entries available. It seems that maintaining additional logic for crediting in (B)

(29)

time, since the new requests from initiator are still being sent every cycle and NACKs are

being received every cycle until an ACK is received, it increases the interconnect traffic

significantly, and is responsible for a lot of bandwidth and power consumption. It also

increases simulation time because the number of events to be processed increases. Hence it is

a good choice to have a simulation model with crediting flows even if the actual hardware

uses the ACK-NACK protocol.

In this model, the credit counters are combined with delays in every module to avoid race

conditions on these counters. Static crediting is used in this model as described in the next

section.

2.7.1 Static Crediting

In the static crediting protocol, multiple sources in the initiator share a common buffer in the

destination. This buffer in the destination is divided statically into as many numbers of parts

as the sources and each source is allowed to use only those many entries in the destination

buffer. The downside of using such a system is that it may lead to wastage of destination

buffer space in the case if one of the sources do not use all the entries allocated to it, while

the other source needs more entries than allotted but is unable to use them. In such cases an

alternate mechanism for efficiently sharing the destination buffer among sources dynamically

(30)

2.8 Statistics Collection and Parameters

The model developed can be used for analysis for a wide variety of configurations. Various

knobs are provided for configuring parameters to study the effect of each on the behavior of

the system. For example, some of the knobs used are to vary the size of the LLC, its

associativity, the memory configuration, the number of injectors, or number of cache

controllers.

The model is developed with very detailed counting statistics at every level to account for

quantitative analysis of performance pertaining to cache hit and miss rates, the bandwidth

and latency of requests, and other metrics. Histogram statistics are also supported, which

(31)

3 The High-Level Uncore model Architecture

This section describes the details of the high level uncore architecture model developed in

this work. It is an approximately timed scalable model with a detailed cache, cache

controller, and interconnect. The injector core is a simplistic model that contains abstracted

architectural details and is used to inject traffic captured at the core to uncore interface as

shown in Figure 3.1.

Figure 3.1 Workload Flow for Graphics Simulation

It shows that the transactions that have been captured at the interface of the GPU simulator

and the SoC detailed cycle level simulator are injected through a developed high level

injector to the high level abstraction model HL_uncore developed in this work. The model is

configurable for a multiple number of injectors, cache blocks and memory channels.

Figure 3.2 shows the high level view of the model, consisting of various modules and the

interfaces between them. Various abstraction choices have been made for maintaining

modeling and simulation flexibility and simplicity. New and more enhanced features can be

readily added, and new modules can be introduced to take advantage of the modularity of the

(32)

3.1 Injector/Graphics Core

This injector represents Graphics IP in the system. It is modeled protocol-accurate but

without any micro-architecture details. We do model the queuing of requests in this injector

as it is important to capture the latency and throughput limitations of the real system.

The model of a single core consists of an injector that injects new requests to the last level

cache and memory based on the arbitration policy and availability of credits.

(33)

3.1.1 Requests Tracking Table (Core)

The architecture includes a requests tracking table that is a major structure keeping a log of

active transactions in the system and their current status. Each request injected into the

system finds an entry in this table, along with a master transaction created for the same,

based on the availability of credits in the arbiter connected to the core and the size of the

Requests Tracking table. This table has 32 entries and each entry has a unique ID. Each entry

has various fields, corresponding to a particular request, i.e. its transaction, the opcode,

address and data associated with it, and various flags indicating the current status of the

request.

(34)

3.1.2 Address Match Logic

An Address match logic is implemented in the injector that ensures that two requests to the

same memory address are not active simultaneously in the system. It is implemented as a

mechanism that iterates over the current log of requests in the Requests Tracking table. If a

match is found, the new request is stalled until the previous request to the same address is

completed. This makes sure that two requests to the same address are not outstanding from

the same injector as it is difficult to maintain ordering/coherence between them. Order should

be maintained strictly from the injector point of view potentially in write after write scenarios

etc. Hence the injector must issue second request to the same address after completion of the

first one.

3.1.3 Delays and Credits

Appropriate delay blocks are configured in the core to simulate lumped delays on the data

flow path (data received from memory as the result of a read request, or data sent to memory

as the result of write request) and the commands path. The delays have been configured with

approximate values of cycle delays on these paths to maintain a notion of approximate timing

and determinism in the various flows. Static credit counters for signals and data have been

(35)

3.2 The Interconnect: Core Arbiter

Interconnect between the core and the uncore as shown in Figure 3.4 has been modeled as an

arbiter block and is one of the major abstraction choices selected in the model. It is a

structure that is responsible for the coherent flow of requests between the cores and the cache

controllers

Figure 3.4 Detailed Architecture of Core Arbiter Interconnect

3.2.1 Multi-Socket

The arbiter is modeled to have a multi-socket interface; it is essentially a single connection

port, but has the provision for connecting to multiple input ports and multiple output ports. It

is an excellent feature in the TLM2.0 framework and has been used to model the arbiter to

add flexibility while scaling the model, i.e. extending the model to have multiple cores and

(36)

3.2.2 Arbiter Logic

The arbitration logic is hierarchical. It ensures that in one cycle, there is no clash of requests

from multiple injector cores to one particular cache controller. The arbiter logic for the

forward flow of data or requests is shown in Figure 3.5 and works as follows:

i) In every cycle, each cache controller ID is picked up, and for each particular ID,

the core queues are searched for any ready requests to be sent to that particular

cache controller

Figure 3.5 Arbiter between Injector cores and Cache Controllers

ii) A core counter is maintained that contains the ID of the core injector that is at the

highest priority in the round robin logic for that particular cycle. The head of the

core queue is searched. If it contains a request for that particular cache slice, the

request is forwarded. If it does not contain a request for that cache controller, the

counter is incremented and the next core queue is searched.

iii) Once a request is found for a particular cache controller, the next cache slice ID is

picked up and the core queues are searched in a round robin fashion, starting from

the core pointed by the current value of the core counter, for a ready request for

(37)

iv) The same arbiter logic is used to abstract the interconnect between the injector

and the cache controllers and the cache controllers and the memory

v) Resource contention is avoided by this policy, as only one out of many incoming

requests to the same destination is allowed in one cycle. Multiple requests can

progress to different destinations in a single cycle.

3.2.3 Signal and Data Queues

Separate queues are maintained for data packets (incoming requests) and for signals

(acknowledgement signals) both in the forward and in the backward paths. The separate

queues are used to model parallel paths for the flow of data and control signals in a single

cycle. Within the control signal flow as well, separate paths and storage buffers are

maintained for incoming control signals in the forward path and acknowledgment signals in

the backward path.

3.2.4 Hash Logic

The hashing logic is implemented in the arbiter to calculate the cache controller ID and the

slice ID within each cache. The implementation is based on a simplistic approach using the

incoming request address. Relevant high-entropy bits of the address are used to route the

requests to a particular cache block and slice. This logic is in place to ensure that all the Last

level cache slices are used optimally with no one cache slice is hot-spotted or under-used.

3.2.5 Credits and Delays

Static credits are maintained for all buffers in the arbiter to avoid overflow and resultant loss

of requests/data in the system. Configurable Delay blocks are configured on each path

(38)

3.3 Last Level Cache and Cache Controller

3.3.1 The Last Level Cache

The last level cache is a set-associative cache. Its capacity can be divided into blocks, each of

which has a separate cache controller block. Within each cache block, there is further

division in form of slices. Each slice has sets. Each set consists of the following fields: Tag,

Valid bit, MESI state, Snoop filtering bits corresponding to each request injector (core), and

the LRU counter for victimization and replacement. The cache is configurable for various

sizes and associativity, and it can be divided into the required number of blocks. The division

of the cache capacity into different blocks enables faster lookup and catering to more than

one request per cycle.

(39)

3.3.2 Replacement Policy

The Quad-age LRU policy [11] is implemented as the replacement policy for victim selection

and insertion of new lines into the cache. This policy is based on the fact that a particular

cache line in the cache can have 4 ages i.e. from 0 to 3; with 0 denoting the least recently

used (LRU) line and 3 being the most recently used (MRU) line. According to the age of the

line, as well as the type of request, the policy replaces a victim and installs a new line into the

cache.

3.3.2.1 Victim Selection

When a set is accessed and a line has to be replaced from the set, the first LRU counter with

zero value is searched from the left i.e. way-0, way-1 and so on. This line is selected as the

victim for eviction to make space for the incoming line.

3.3.2.2 Insertion

The insertion age of the cache line is fixed at 2. It ensures that once the line is installed in the

cache, it has the opportunity to be used for a few times and produce few hits before it itself

becomes the victim. For all the request types, i.e. the read request and read ownership

opcodes and the write request and write back request opcodes, the insertion age is fixed at 2.

3.3.2.3 LRU Update Logic

The update of the LRU counter is Quad-age based. A per-opcode age table is maintained that

gives the value of the LRU counter if there is a hit to that particular line. A line becomes less

vulnerable to eviction and saturates at the maximum value of 3 (indicating LRU) if it causes

many hits after its installation. To ensure that a zero is always found while searching from

(40)

present. However if zero is not found, then the minimum age among the ways in that set is

subtracted from all the ages in the set.

3.3.3 Cache Controller

3.3.3.1 Requests Tracking Table (RTT- Cache Controller)

This is the most important structure in the cache controller and has the cycle by cycle

updated status of each active request in the cache. The Requests Tracking table is similar to

the one that is present in the core injector and contains the data, address and status

corresponding to each active transaction. In this model, the RTT has 32 entries for 32 unique

transactions and gets updated after every action performed in the core, LLC or memory

unless that request is completed. It acts as a temporary buffer for dirty victimized cache line

data, unless the data is written back to the memory. Hence it implements the Fill before Spill

refill technique, i.e. the cache line can be filled with new data, instead of waiting for the

victimized cache line data to be written back to the memory first.

For the completion of cache hit requests and memory requests that are ready, there are

independent mechanisms that work every cycle to progress the first available hit and memory

request in the simulation.

3.3.3.2 Delays and Credits

Various lumped delays have been modeled in the data and control flow. A constant delay of

n cycles has been configured for the lookup delay of the cache; this is includes the latency of

indexing to the required set, matching the tags, and deciding a hit or a miss. In case of a miss,

latency of selection of a victim is also included in this delay. The lookup delay is

encountered every time there is a lookup to the cache, whether it’s a new request or a victim

(41)

signals are configured to one clock cycle, in both forward and backward paths. New requests

are accepted by the cache controller based on occupancy of the Requests Tracking table, or

else the request wait in the New Requests Queue that is searched every cycle for a ready

request that can be a valid Request Table entry. The credits indicating the occupancy of the

new requests queue are sent back to the arbiter every cycle.

3.3.3.3 Set-Match Reject Logic

The set match reject logic is implemented to ensure that no more than one request to a

particular set in a particular LLC block is active in the system. This check is implemented

while the requests are still in the new requests queue and are about to make an entry in the

Requests Tracking Table. If two requests work on the same set in the LLC simultaneously, it

can result in incorrect update of the LRU counters for that particular set, leading to the

selection of incorrect cache lines for victimization.

3.4 Cache Controller Arbiter Interconnect

This interconnect enables communication between the Last level cache blocks and the

memory controller, consisting of one or more memory channels. The interconnect consists of

multiple sockets as the input and output ports to ensure flexibility in scaling the architecture

model to increase the number of cache controller blocks or the number of memory channels.

It has a similar arbitration logic as the core arbiter interconnect. It consists of separate queues

for data and request queues and also separate read and write queues for each memory

channel. The logic takes care of the fact that not more than one request can be sent from

(42)

Figure 3.7 Detailed Cache Controller Arbiter Architecture

3.5 Memory Controller

An existing model of the memory controller has been integrated with the currently modeled

high level system of the uncore. The memory controller has configurable memory channels

and a memory address decode logic. The memory channels have a scheduler modeled; based

on the read mode or the write mode, the scheduler handles requests stored in the read and the

write queue. The reads are given priority over writes unless a threshold value of writes occur

in the write queue, after which the mode changes from read to write and write requests are

(43)

3.6 Flows and Protocols Modeled

The current model supports six kinds of request opcode flows:

i) Data Read Request

ii) Data Write Request

iii) Read for Ownership Request

iv) Write Back Request from upper level cache

v) Uncacheable Write Request

The MESI protocol is modeled to support the multi core scenarios. However the snooping

flow is simplistic, modeling only the backward invalidate flow.

Static crediting model has been implemented to take care of buffer occupancy at a particular

instant in time.

3.7 Synthetic Scenario Generation Capability and Synthetic Validation

Flows

The injection methodology used in this work gives flexibility for synthetic traffic generation

and can be used to inject real traffic traces extracted from running application workloads.

Synthetic traffic generation is achieved by providing various kinds of information such as the

beginning addresses of the requests, the type of request (i.e. the opcodes), and what kind of

traffic - streaming, random, etc. needs to be generated. A mixture of requests of various types

can also be specified, with a given percent distribution of each type.

The injection methodology uses the SystemC Verification library utilities to create

randomized requests. Such blocks of information are defined as tasks that are mapped to

nodes. One task can be mapped to one or more nodes. Nodes are in turn mapped to actual

(44)

task mapped to a single hardware component or a number of tasks with complex

dependencies mapped to various nodes.

The dependencies are defined in terms of a graph that specifies the initial nodes and the

triggering nodes and mapping of the tasks to the nodes. Such synthetic traffic generation is

very helpful in initial stages of the model development where specific flows need to be

checked for, studied and analyzed. Apart from the synthetically generated task scenarios,

entire workload traces can also be defined as tasks and mapped to nodes. This gives a great

deal of flexibility in defining the inputs to the model. In this work, this method of synthetic

generation of traces is used for various studies, validation of various data and control flows,

data collection and analysis. Five flows were implemented in the current model. These are

the Data Read Flow, the Data Write Flow, and Read for Ownership and Write Back flows

and the Uncacheable writes. Each flow requires individual validation for its correct

functionality before the model can be injected with real large workload traces. A few of the

generated test cases are shown in Table 3.1 .

Table 3.1 Synthetic Scenario Generation Flows

S.No. Test Case System

Configuration

Test description

1 Data Read Flow

(Validation of Read Hit and Miss Flows)

1 core, 2 cache controllers, 1 KB cache size, 4-way associativity

Injection type-Streaming, Opcode- Data Read,

Task 1- Data Reads, 0x000080

Task 2- Data Reads ,0x000080

Num of Requests: 1000

2 Data Write Flow

(Validation of Write Hit and Miss flows)

1 core, 2 cache controllers, 1 KB cache size, 4-way associativity

Injection type-Streaming, Opcode- Data Write,

(45)

0x000080

Task 2- Data Write ,0x000080

3 Mix of Opcodes (Read & Write)

(Validation of mixed requests and Arbitration logic using multiple cores for injection)

i) 1 core, 2 cache controllers, 1 KB cache size, 4-way associativity

ii) 2 cores, 2 cache controllers, 1 KB cache

size; 4-way

Injection type-Streaming, Opcode- Data Read, Data Write

Task 1- Data Read (50%) Data Write (50%),

Start Address : 0x000080

4 Dirty Victimization Flow 1 core, 2 cache

controllers, 1 KB cache size, 4-way associativity

Injection type-Streaming

Task 1- Data Read Address 0x000080 (100 requests)

Task 2- Write Back Address

0x000080 (100 requests)

Task 3- Data write Address

0x0000c0 (100 request)

(46)

3.8 Launching Workload studies, Statistics Rollup, Debug Infrastructure

The Hl_uncore model supports detailed debug infrastructure constituting of cycle by cycle

state of each module, the buffers in every module and the transaction status tracking tables.

The model has the provision to be debugged using a particular flow, following it to the level

of a single transaction, seeing its state at every cycle until it is active in the system.

The model also has basic support for debugging using an internal tool that uses visualization

techniques to track down every transaction, when a new transaction was created, when it was

killed/released and its entire life duration event by event.

Job files were created using python scripts for injecting graphics workloads in various

configurations and for data collection. The job files were then launched on Net batch for

parallel processing and fast completion of the jobs. This was very helpful as a wide variety of

studies could be done with many workloads and multiple configurations.

The model supports a lot of relevant knobs to configure the system according to the needs of

the studies undertaken. The knob infrastructure adds flexibility to the system; the source code

need not be touched every time there is a change in the configuration of the model. The

knobs can be modified accordingly.

The statistics collected from every run of the model with different configuration is collected

using scripts to collect only relevant statistics for that particular study. This selective rollup is

(47)

4 Golden Reference Model vs. the HL_uncore

The Golden reference model used in this work is a Product Simulator that mimics the real

architecture of the uncore[9]. It includes the Last level cache, cache controller, the

interconnect and the System agent with the IO, memory controller and other units. The

reference model is a very detailed, cycle accurate model of the uncore architecture and

implements all minute as well as major features of the product architecture. The actual

architectural model is shown in Figure 4.1.

(48)

4.1 The Ring Interconnect

The ring-based interconnect is used for communication between the cores, the last level

caches, the graphics core and the system agent (which contains the memory controller and

interfaces to many input and output devices). It is composed of four sub-rings, namely:

i) Request Ring

ii) 32-byte Data Ring

iii) Snoop request Ring

iv) Acknowledge Ring

The four separate rings provide parallel paths for communication. All the different agents

listed above are present on the ring as stops. The ring maintains coherency and order between

the multiple requests to different agents and helps optimize latency in the communication.

4.2 The Last Level Cache and Cache Controller

The last level cache is a high bandwidth cache that is shared among the processor cores and

the graphics. It is an inclusive multi-bank LLC with 64B cache lines and 16-way

associativity. The LLC size is scalable according to the number of processing cores, and both

cores and caches run at the same frequency.

The cache box or the cache controller is the interface block within the LLC slice that

(49)

Figure 4.2 The High level Core-LLC Interaction [9]

4.3 Coherency and Ordering

The ring does not maintain the order and coherency of various requests from different agents.

Ordering is only maintained by the respective sender of the request. On the ring, the

transactions flow in parallel and are out of order. MESI protocol is implemented for

maintaining coherency. Snoop filtering is used in the LLC, i.e., every cache line has Core

(50)

4.4 Abstraction Choices of Hl_uncore with respect to the Reference Model

The model developed in this work is a high level one that does not capture the architectural

details of the real system to the lowest level. Hence it is called an abstracted model. It is a

flexible and scalable model that can be enhanced. More features can be added to it as needed

for performing architecture studies. However, some features of the Golden Reference Model

have been abstracted to facilitate fast but accurate simulation of the scenario being studied:

execution by the graphics core only. The remainder of this section describes those

abstraction choices in detail.

4.4.1 Abstraction Choice 1: Modeling of Ring Interconnect

The ring interconnect in the real architecture is abstracted in the high-level model. It is a

reasonable abstraction choice for a one-core model where only the graphics core is simulated,

which is the focus of this work. Functionally, it is a reasonable choice for multicore as well,

as a simple arbitration scheme using a round robin policy is modeled for a fair distribution of

requests from different sources to different destinations. The subrings in the real architecture

are approximated by the separately maintained buffers for data, request and

acknowledgement signals.

In the scenario of standalone graphics simulation, the stress is on catering to the required

bandwidth because the bandwidth demands are enormous for a graphics processor. The

bandwidth requirements of the graphics core are met by the banking of the LLC into different

blocks and slices. However the interconnect model can be approximated as for the graphics

scenario, the latencies are not very crucial and the latency for individual requests brought in

by the transactions on the ring do not introduce serious problems in the model with respect to

(51)

However in terms of performance, the correlation between the two models, purely based on

the effect of the abstraction of ring, will be closer for a one-core system as compared to a

multi-core system.

4.4.2 Abstraction Choice 2: Simplification of Cache Controller Flows

Another abstraction choice that has been made is the simplification of the request flows that

are supported in the model. The flows are simple though detail-oriented and have enough

description to simulate a real flow with reasonable accuracy. However, the scenarios in a

particular flow that are rare but add a great deal of complexity have been omitted in the

implementation with a negligible impact on the accuracy of the model as explained by the

correlation data presented in the later sections.

4.4.3 Abstraction Choice 3: Lumped Delays

The developed model is not a cycle-accurate one, and does not give the details and status of

each and every modeled hardware component at the end of every cycle. However a

reasonably accurate notion of time and determinism is maintained in the model appropriate to

high level modeling, through the use of lumped delays. For example, in order to model the

lookup delay in the cache, a lumped delay is configured to n cycles to mimic the average case

in the real architecture. It doesn’t update the status of the system after each cycle, but gives

an updated state after the configured delay. Also the values of such lumped delays used in the

model are configurable and can be modified as per a particular requirement of system

(52)

5 Correlation Studies and Results

The section presents the results of the correlation studies performed for the developed

HL_uncore model with respect to the existing product level simulator. The functional

correlation has been done to establish confidence in the developed model so that it can be

used reliably for experimentation and the results can be directly indicative of the actual

trends that can be expected from the detailed simulation models. This correlation studies has

been undertaken based on varying the parameters of the LLC size and associativity.

The developed model has also been compared with the baseline reference based on the

simulation speed and code base to indicate the ease and flexibility of using this model over

the detailed one for analyzing first-level functional and performance trends.

The workloads and the configurations used for analysis are described in the next section.

5.1 Workloads

Fifteen workload traces were selected for different kinds of studies. These traces are captured

at the interface of the core and the uncore while running the game benchmarks on silicon. All

the simulations carried out in this work are trace-driven because it is faster as compared to

the execution based simulation. The traces are without timestamps and consist of the type of

request and the memory addresses only as shown in Figure 7.1. The accuracy could be

affected if the timestamps are not taken into account as the timestamps make the bandwidth

demand more realistic. However addition of timestamps would further increase the size of

these traces. At this level of architectural studies, this abstraction has therefore been found to

(53)

The workloads are-

i) Sandra

ii) Boshcluster

iii) Daimlerhulf

iv) Unigine-heaven

v) Unigine-valley

vi) Angry_birds

vii) Basemarkx

viii) Clashclans

ix) Cut_the_rope2

x) Dead_trigger2

xi) Race_stars

xii) T-rex

xiii) Manhattan

xiv) Real_racing

xv) Asphalt8

5.2 SoC Configuration for Experiments

The SoC configuration used for carrying out correlation studies as well as standalone

(54)

Table 5.1 Base SoC Configuration Details

Parameter Value

Number of Injectors 1

Number of Cache Controllers 2

LLC Size 1,2,4 MB

LLC Associativity 4,8,16 -ways

(55)

5.3 Simulation Speed Comparison

This section presents results of comparison of the simulation speed of the developed model

with respect to the baseline reference model. The Simulation speed up achieved by the

Hl_uncore model is plotted with respect to the baseline model as shown in Figure 5.1. The

maximum speed ups achieved are in the range of 18x for few workload traces like

basemarkx, boshcluster, manhattan etc. The average speedup achieved is around 9x.

Impressive simulation speed-up over the baseline reference simulator enables HL_uncore

simulator to be used for early architecture evaluation studies that require multiple iterations

of simulations to evaluate wide-variety of architecture options.

angry_bi

rds asphalt8

basemar kx

boshclus ter

clashcla ns

daimlerh ulf

manhatt an

race_sta rs

real_raci

ng sandra trex SPEED UP 3.49 8.52 17.87 17.73 7.06 4.52 17.07 9.13 8.97 0.17 8.86

0 2 4 6 8 10 12 14 16 18 20

SPEED UP

(56)

5.4 Code Base Comparison

The code base comparison as shown in Table 5.2 indicates that the Hl_uncore model has

much less code and fewer modules to be handled.

Table 5.2 Code Base Comparison

Parameter Value (Base Model/Hl_uncore)

Lines of Source Code ~6x

Number of Files/Modules ~3x

It has multiple advantages over a detailed product simulator-

It is apt for easy experimentation as compared to the baseline product simulator model

because it has greater flexibility to modify code and implement newer architecture proposals

with less effort.

Another advantage is that less code results in fewer bugs in the simulation model.

5.5 Correlation on Cache Associativity Study

In this section, the cache hit trend is studied by varying associativity, keeping cache size

constant and using two variations of the LRU policy.

5.5.1 Description

One is the baseline policy with the install age or LRU counter value equal to 2 and

(57)

particular cache line. The other policy is that of vulnerable fill in which the cache line install

age is 0, i.e. initially it is vulnerable to victimization, but the age counter increases if hits are

seen to that particular cache line.

5.5.2 Discussion on Trends and Correlation

The trend of associativity for the baseline Quad-age LRU policy (Figure 5.2) for both the

models shows that the cache hit rate increases, and conflict misses decrease with increasing

associativity. However the Vulnerable fill LRU policy (Figure 5.3) shows a reverse trend of

associativity, i.e. the cache hit rates decrease with the increasing associativity. In a vulnerable

fill policy, a cache line is installed in the cache at an age of 0, i.e. most susceptible to getting

victimized. The results from the graph show reversed trends of hit rates with associativity.

This is because in this policy, only a particular way keeps getting victimized while rest of the

ways in a particular set remains unused most of the times. Hence for increasing associativity,

the number of sets that remain unused, it being equivalent to wasted cache capacity and the

(58)

(59)

Figure 5.3 Trend for Varying Associativity- Vulnerable Fill

The plots of normalized cache hit rates with respect to varying associativity for different

LRU policies show matching trends between the developed model and the baseline model for

0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6

manhattan race_stars realracing t_rex sandra daimlerhulf angry_birds basemarkx clashclans cut_the_rope2 dead_trigger

Normalised Hit Rate-Reference Model

0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6 as so c_ 4 as so c_ 8 as so c_ 1 6

manhattan race_stars realracing t_rex sandra daimlerhulf angry_birds basemarkx clashclans cut_the_rope2 dead_trigger

(60)

various workloads. Two contrasting cache replacement policies have been studied as shown

in Figure 5.2, and Figure 5.3. They show the opposite trends of cache hit rates with

increasing associativity.

Table 5.3 shows the hit rate differences between the 2 models for different workloads shows

that the maximum misalignment between the hit rates is for the dead trigger trace at 3.8%

Table 5.3 Difference in Hit Rates for Associativity = 8, Baseline LRU Policy

Workload Hit Rate Difference (Hl_uncore vs Reference Model)

Manhattan -0.5 %

race stars -1.6 %

real racing -1.6 %

t_rex -2.1 %

Sandra 0.0 %

daimlerhulf -0.9 %

angry birds -0.7 %

asphalt8 -0.6 %

basemarkx -0.3 %

clashclans -1.4 %

cut_the_rope2 -3.3 %