
We have previously identified five performance problems within the Data Locality category, more specifically Poor Cache Locality, Poor TLB Locality, Unnecessary DRAM Memory Paging, NUMA memory shared between CPUs and Page faults. In this section, we are going to apply the observational model to identify the necessary information to be presented within our performance analysis tool.

Modern processors access memory through several levels of cache in order, so each subsequent fetch becomes slower as larger and more distant memory banks are accessed. A simple conceptual view of this is given by the steps below and by the arrow on the left of the diagram in Figure 6.1, which passes through the various components in the chip.

1. Processor attempts to retrieve the data from L1 cache.

2. If not found previously, the virtual-to-physical address translation is performed along with an attempt to retrieve data from L2 cache.

CHAPTER 6. ANALYSING DATA LOCALITY

4. If not found previously, processor attempts to retrieve data from RAM, with a potential involvement of the interconnect on multi-processor systems.

5. If not found previously, a page fault is raised and the appropriate memory page is loaded into main memory.
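The lookup order above can be sketched as a small model. This is an illustration only: the latency figures are hypothetical placeholders, the L3 cache level is an assumption inferred from the counters discussed in this section, and summing the latencies of every level tried is a simplification of how real hardware behaves.

```python
# Hypothetical latencies (in cycles), for illustration only; real values
# depend on the microarchitecture and are not taken from the text.
LEVELS = [
    ("L1 cache", 4),
    ("L2 cache", 12),
    ("L3 cache", 40),   # assumed level, see the note above
    ("RAM", 200),
]
PAGE_FAULT_COST = 1_000_000  # assumed cost of loading a page into main memory


def access_cost(resident_in):
    """Return (levels tried, total cycles) for data resident in `resident_in`.

    `resident_in` is one of the level names in LEVELS, or None when the page
    is not in main memory and a page fault has to be serviced.
    """
    tried = []
    cycles = 0
    for name, latency in LEVELS:
        tried.append(name)
        cycles += latency
        if name == resident_in:
            return tried, cycles
    # Every level missed: a page fault is raised and the page is loaded.
    tried.append("page fault")
    return tried, cycles + PAGE_FAULT_COST


if __name__ == "__main__":
    for location in ("L1 cache", "L3 cache", "RAM", None):
        print(location, access_cost(location))
```

Even with made-up numbers, the model shows why the deeper a request has to travel, the more levels it has already paid for by the time the data is found.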

Now that we have gone through the components and the process flow involved in the data locality problems, we need to elicit the measurable events and counters required to diagnose them effectively. Drawing on the model presented in Chapter 5, we pick the relevant observations, ideally those with high agreement levels, and discard highly controversial ones (i.e. observations with a high number of “irrelevant” annotations by experts and observations with contradictory annotations).

1. Cache Locality. Experts generally agree on whether each observation is an indication or a contraindication of this problem, with “high #L2 cache misses” being the most agreed indication and “low cache misses as measured with hardware counters” being the strongest contraindication.

2. TLB Locality. The observations “high #TLB miss to instruction ratio” and “low #TLB misses” have the most discriminatory potential, with all experts agreeing on them being respectively an indication and a contraindication of the TLB locality problem.

3. Sharing of data between CPUs on NUMA systems. The “high remote memory access” observation is the only observation that seems to have an inter-rater consensus.

4. Unnecessary DRAM Memory Paging. All of the observations have high levels of agreement among experts. The simple count of DRAM page changes has high discriminatory power: a high number of DRAM page changes is an indication of the problem and a low number is a contraindication.

5. Page faults. This particular problem seems to have only a single inter-rater con-


problem and relatively strong agreement on “very low #page faults” which is considered as a contraindication.
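As an illustration of how an observation such as “high #TLB miss to instruction ratio” could be evaluated from raw counts, consider the minimal sketch below. The function names and both thresholds are hypothetical and chosen only for the example; the text does not specify what counts as “high” or “low”.

```python
def tlb_miss_ratio(tlb_misses, instructions):
    """Return TLB misses per retired instruction (0.0 if nothing was retired)."""
    if instructions == 0:
        return 0.0
    return tlb_misses / instructions


def classify_tlb_locality(tlb_misses, instructions, high=0.01, low=0.0001):
    """Map raw counts onto the two TLB observations.

    `high` and `low` are hypothetical thresholds, not values from the text.
    """
    ratio = tlb_miss_ratio(tlb_misses, instructions)
    if ratio >= high:
        return "indication"        # "high #TLB miss to instruction ratio"
    if ratio <= low:
        return "contraindication"  # "low #TLB misses"
    return "inconclusive"


if __name__ == "__main__":
    print(classify_tlb_locality(tlb_misses=50_000, instructions=1_000_000))
    print(classify_tlb_locality(tlb_misses=10, instructions=1_000_000))
```

In practice the thresholds would have to be calibrated per workload and per processor; the sketch only shows the shape of the mapping from counters to observations.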

The usage of the model is rather straightforward: once we have identified the required observations, we need to identify the exact counters, events or metrics required to diagnose each problem successfully. Below we enumerate these raw items, along with their collection mechanisms. This is only part of the solution, as the information needs to be refined further in order to be useful for programmers.

1. Memory reads and writes. Modern hardware contains counters that allow us to measure the volume of memory reads and writes for a particular processor, in bytes.

2. L1, L2 and L3 cache misses. Similarly, there are hardware counters that allow us to measure cache misses at a particular level. A miss occurs when the data is not found at a given level and the processor needs to fetch it from a higher (and slower) level.

3. L1, L2 and L3 cache hits. We can also obtain the number of hits at each level of cache, that is, the number of accesses for which the data was found at that level.

4. TLB misses. Once again, we can retrieve the number of TLB misses from the hardware performance counters on most modern CPUs.

5. Minor Page Faults. Operating system events can be used to get information about minor and major page faults. A minor page fault occurs when the accessed page is already loaded in physical memory but is not yet mapped in the memory management unit for the faulting process, so the system merely needs to make an entry for the particular address.

6. Major Page Faults. Operating system events that are raised by hardware when a program attempts to access memory that is mapped in the virtual address space but not loaded in physical memory. This often requires the operating system to swap or load the page from disk, resulting in a considerable slowdown.

7. Processor Interconnect Traffic. In most modern CPU architectures, hardware


fic passing through the interconnect bus, such as Intel QuickPath Interconnect (QPI) or AMD HyperTransport (HT).
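Of the raw items listed above, the page fault counts are the easiest to collect without special privileges, since the operating system already accounts for them per process. The sketch below is one possible collection mechanism, assuming a Unix-like system where Python's standard `resource` module is available; the helper names `page_fault_counts` and `measure_page_faults` are hypothetical and introduced only for this example.

```python
import resource


def page_fault_counts():
    """Return (minor, major) page fault totals for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def measure_page_faults(func, *args, **kwargs):
    """Run `func` and return (result, minor faults, major faults) incurred."""
    minor_before, major_before = page_fault_counts()
    result = func(*args, **kwargs)
    minor_after, major_after = page_fault_counts()
    return result, minor_after - minor_before, major_after - major_before


def touch_memory(size_bytes, stride=4096):
    """Allocate a buffer and write one byte per assumed 4 KiB page."""
    buffer = bytearray(size_bytes)
    for offset in range(0, size_bytes, stride):
        buffer[offset] = 1
    return len(buffer)


if __name__ == "__main__":
    _, minor, major = measure_page_faults(touch_memory, 64 * 1024 * 1024)
    print(f"minor page faults: {minor}, major page faults: {major}")
```

The counts include faults caused by the interpreter itself, so they are best read as differences between comparable runs rather than as absolute values. The hardware counters in the list (cache, TLB and interconnect events) require a different mechanism and are not covered by this sketch.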