UNIT – II
MEMORY HIERARCHY DESIGN
Basic Memory Hierarchy: Programmers want unlimited amounts of fast memory. An economical solution is to organize memory as a hierarchy.
The memory hierarchy takes advantage of locality and of the cost-performance trade-offs of memory technologies. The principle of locality says that programs do not access all of their code or data uniformly; locality occurs in time (temporal locality) and in space (spatial locality). Locality, together with the guideline that smaller hardware can be made faster, leads to hierarchies built from memories of different speeds and sizes.
Fast memory is expensive: each level is more expensive per byte than the next lower, slower level. The goal is to provide a memory system whose cost per byte is almost as low as that of the cheapest level and whose speed is almost as high as that of the fastest level.
In a memory hierarchy, addresses of the larger, slower memory are mapped onto the smaller, faster memory. Because every address passes through this mapping, the hierarchy is also given the responsibility of checking addresses, so protection schemes for scrutinizing addresses are part of the memory hierarchy.
The importance of the memory hierarchy has increased with processor performance. The following diagram shows some of the levels in a memory hierarchy.
When a data word is not found in the cache, it must be fetched from memory and placed in the cache before processing continues. For efficiency, multiple words, called a block, are moved at a time. Each cache block includes a tag that identifies the memory address it came from.
A key design decision when moving blocks between memory and cache is “where a block can be placed in the cache”. The most common scheme is set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set.
The set is chosen by the following formula.
(Block address) MOD (Number of sets in cache)
If there are ‘n’ blocks in a set, the cache placement is called n-way set associative.
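The set-selection formula can be sketched in a few lines of Python. The function names and the 8-block cache used in the example are assumptions made for illustration; they do not come from the text.

```python
def sets_in_cache(num_blocks: int, ways: int) -> int:
    """Number of sets in a cache of `num_blocks` blocks with `ways` blocks per set."""
    return num_blocks // ways


def set_index(block_address: int, num_sets: int) -> int:
    """(Block address) MOD (Number of sets in cache)."""
    return block_address % num_sets


# Hypothetical 8-block cache, block address 12:
direct_mapped = set_index(12, sets_in_cache(8, 1))    # 8 sets of 1 block each
two_way = set_index(12, sets_in_cache(8, 2))          # 4 sets of 2 blocks each
fully_associative = set_index(12, sets_in_cache(8, 8))  # 1 set holding all 8 blocks
```

With one block per set the cache is direct mapped, and with a single set it is fully associative; both are special cases of the same formula.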
Another issue is reading and writing blocks that are cached. Reading cached data is easy because the copy in the cache and the copy in memory are identical. Writing is more difficult, because the two copies must be kept consistent. There are two main strategies. In the write-through method, a write updates both the cache and main memory. In the write-back method, a write updates only the cache, and the modified block is copied to main memory when it is about to be replaced. Both strategies can use a write buffer so that the cache does not have to wait for the write to memory to complete.
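The difference between the two write strategies can be illustrated with a toy model. This is only a sketch: the class TinyCache, its methods, and the use of a dictionary as main memory are assumptions made for illustration, and the write buffer is left out.

```python
class TinyCache:
    """Toy cache in front of a dict that stands in for main memory."""

    def __init__(self, memory: dict, write_back: bool):
        self.memory = memory
        self.write_back = write_back
        self.lines = {}      # address -> value currently held in the cache
        self.dirty = set()   # addresses modified in the cache but not yet in memory

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)        # write back: defer the memory update
        else:
            self.memory[addr] = value   # write through: update memory immediately

    def evict(self, addr: int) -> None:
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]   # copy the modified block to memory
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


memory_wt = {0x10: 1}
wt = TinyCache(memory_wt, write_back=False)
wt.write(0x10, 7)            # memory_wt[0x10] is updated at once

memory_wb = {0x10: 1}
wb = TinyCache(memory_wb, write_back=True)
wb.write(0x10, 7)            # memory_wb[0x10] still holds the old value
value_before_eviction = memory_wb[0x10]
wb.evict(0x10)               # only now does memory see the new value
```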
Cache Performance: One measure of the benefit of different cache organizations is the miss rate. The miss rate is the fraction of cache accesses that result in a miss, that is, the number of accesses that miss divided by the total number of accesses. To understand the causes of a high miss rate, and so to design better caches, the three Cs model classifies cache misses.
Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even with an infinite cache.
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved.
Conflict: If the block placement strategy is not fully associative, conflict misses occur because multiple memory blocks map to the same set and evict one another.
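One common way to apply the three Cs to a trace of block addresses is sketched below. It assumes LRU replacement, and it treats a miss as a capacity miss when a fully associative cache of the same size would also miss. The function names and the short traces are illustrative assumptions, not part of the text.

```python
from collections import OrderedDict


def lru_misses(trace, num_blocks, num_sets):
    """Return the trace positions that miss in an LRU cache.

    num_sets == 1 models a fully associative cache;
    num_sets == num_blocks models a direct-mapped cache.
    """
    ways = num_blocks // num_sets
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = set()
    for i, block in enumerate(trace):
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)        # mark as most recently used
        else:
            misses.add(i)
            if len(s) >= ways:
                s.popitem(last=False)   # evict the least recently used block
            s[block] = True
    return misses


def classify_misses(trace, num_blocks, num_sets):
    """Count compulsory, capacity and conflict misses for a block-address trace."""
    actual = lru_misses(trace, num_blocks, num_sets)
    fully_associative = lru_misses(trace, num_blocks, 1)
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for i, block in enumerate(trace):
        first_touch = block not in seen
        seen.add(block)
        if i not in actual:
            continue                      # a hit, nothing to classify
        if first_touch:
            counts["compulsory"] += 1     # would miss even in an infinite cache
        elif i in fully_associative:
            counts["capacity"] += 1       # misses even with full associativity
        else:
            counts["conflict"] += 1       # caused only by the placement restriction
    return counts


# Blocks 0 and 4 collide in a 4-block direct-mapped cache.
direct_mapped_counts = classify_misses([0, 4, 0, 4], num_blocks=4, num_sets=4)
# Three distinct blocks do not fit in a 2-block fully associative cache.
small_cache_counts = classify_misses([0, 1, 2, 0], num_blocks=2, num_sets=1)
```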
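Miss rate can be a misleading measure for several reasons. Hence some designers prefer to measure misses per instruction rather than misses per memory reference.
Misses per instruction = Miss rate X (Memory accesses / Instruction count) = Miss rate X Memory accesses per instruction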
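As a small worked example, misses per instruction is the miss rate multiplied by the average number of memory accesses per instruction. The figures used here (a 2% miss rate and 1.5 memory accesses per instruction) are assumed purely for illustration.

```python
def misses_per_instruction(miss_rate: float, accesses_per_instruction: float) -> float:
    """Miss rate X memory accesses per instruction."""
    return miss_rate * accesses_per_instruction


# Assumed figures: 2% miss rate, 1.5 memory accesses per instruction.
mpi = misses_per_instruction(0.02, 1.5)
per_thousand = mpi * 1000   # often quoted as misses per 1000 instructions
```

With these figures the result is about 0.03 misses per instruction, or roughly 30 misses per 1000 instructions.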
The problem with both misses per memory access and misses per instruction is that they do not factor in the cost of a miss. A better measure of cache performance is the average memory access time.
Average memory access time = Hit time + Miss Rate X Miss Penalty
Here the hit time is the time to hit in the cache and the miss penalty is the time to fetch the block from memory and place it in the cache. The average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. The following are six basic cache optimizations, each of which reduces the miss rate, the miss penalty, or the hit time.
1. Larger block size to reduce the miss rate: The simplest way to reduce the miss rate is to increase the block size, which takes advantage of spatial locality. Larger blocks reduce compulsory misses, but they also increase the miss penalty, and in a small cache they can increase capacity and conflict misses.
2. Bigger caches to reduce miss rate: The obvious way to reduce capacity misses is to increase cache capacity. The drawbacks are a potentially longer hit time and higher cost and power.
3. Higher associativity to reduce miss rate: Increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time.
4. Multilevel caches to reduce miss penalty: A designer faces a difficult choice between making the cache fast, to keep pace with the clock rate of the processor, and making it large, to reduce the number of accesses that go to main memory. Adding another level of cache between the original cache and memory resolves this choice. The first-level cache can be small enough to match the fast clock cycle time of the processor, and the second-level cache can be large enough to capture many accesses that would otherwise go to main memory. The average memory access time is then rewritten as follows.
Hit timeL1 + Miss rateL1 X (Hit timeL2 + Miss rateL2 X Miss penaltyL2)
5. Giving priority to read misses over writes to reduce miss penalty: A write buffer is a natural place to implement this optimization. The complication is that the write buffer might hold the most recently updated value of a location that is needed on a read miss. So on a read miss we check the contents of the write buffer and, if there is no conflict, let the read go to memory before the buffered writes, which reduces the miss penalty. Many processors give reads priority over writes.
6. Avoiding address translation during indexing of the cache to reduce hit time: Caches must cope with the translation of virtual addresses from the processor into physical addresses for memory.
Caches, translation lookaside buffers (TLBs), and virtual memory are closely related. A common optimization is to index the cache with the page offset, the part of the address that is identical in the virtual and physical address, so that the cache can be indexed while the virtual page number is being translated; the tag comparison then uses the physical address. The drawback of this virtually indexed, physically tagged optimization is that the page size limits the size of the cache: a direct-mapped cache can be no bigger than the page size, and larger caches need higher associativity.
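The average memory access time formulas used in this list, for one and for two levels of cache, can be sketched as follows. All numbers in the example (cycle counts and miss rates) are assumptions chosen for illustration, and the L2 miss rate is taken to be the local miss rate of the second-level cache.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = Hit time + Miss rate X Miss penalty."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    """With an L2 cache, the L1 miss penalty is itself the access time of L2."""
    miss_penalty_l1 = amat(hit_l2, miss_rate_l2, miss_penalty_l2)
    return amat(hit_l1, miss_rate_l1, miss_penalty_l1)


# Assumed figures, in clock cycles: L1 hit 1, L1 miss rate 4%,
# L2 hit 10, L2 local miss rate 50%, main memory access 200.
without_l2 = amat(1, 0.04, 200)
with_l2 = amat_two_level(1, 0.04, 10, 0.5, 200)
```

Under these assumptions the average access time is about 9 cycles without the second-level cache and about 5.4 cycles with it, which is why the second level is described as reducing the miss penalty.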
Advanced optimizations of cache performance:
The average memory access time formula gives three parameters that influence cache performance: hit time, miss rate, and miss penalty. Since most processors today are superscalar, cache bandwidth is added as a fourth factor. There are eleven (11) advanced cache optimization techniques in total, grouped into the following categories.
1. Techniques for reducing the hit time.
2. Techniques for increasing the cache bandwidth.
3. Techniques for reducing the miss penalty.
4. Techniques for reducing the miss rate.
5. Techniques for reducing miss penalty or miss rate by using parallelism.
Small and Simple caches to Reduce Hit time:
The most time-consuming part of a cache hit is using the index portion of the address to read the tag memory and then comparing the tag with the address. Because smaller hardware can be faster, keeping the cache small helps to reduce the hit time. It is also important to keep the L2 cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip.
The second suggestion is to keep the cache simple, for example by using direct mapping. One advantage of a direct-mapped cache is that the tag check can be overlapped with the transmission of the data, which effectively reduces the hit time. The pressure of matching the cache access time to a fast processor clock cycle therefore favors small and simple L1 caches. For lower-level caches, some designs strike a compromise by keeping the tags on chip and the data off chip, which gives a fast tag check together with the greater capacity of separate memory chips.
Modern processors contain a large amount of on-chip cache, and the total has grown from generation to generation.
The size of the L1 cache, however, has recently grown little or not at all; on AMD processors it stayed the same for three generations. The emphasis is on a fast clock rate, while L1 misses are served from the L2 cache so that an access to main memory is avoided.
One way to estimate the impact on hit time before building a chip is to use CAD tools. The CACTI program estimates the access time of alternative cache structures on CMOS microprocessors. For a given minimum feature size, it estimates how the hit time varies with cache size, associativity, and number of read/write ports.
Way Prediction to Reduce Hit Time:
If we use direct mapping to reduce the hit time, conflict misses become more likely. ‘Way prediction’ is a technique that reduces conflict misses with a set-associative cache while keeping the fast hit time of a direct-mapped cache. It applies when the cache is set associative.
In way prediction, extra bits are kept in the cache to predict the “way”, that is, the block within the set that the next cache access will use. The prediction sets the multiplexer early to select the desired block, so only a single tag comparison is performed in that clock cycle, in parallel with reading the cache data. If the predicted block does not match, the other blocks are checked in the next clock cycle and the predictor is updated. The following diagram shows way prediction.
To implement way prediction, block predictor bits are added to each block of the cache. These bits select which block to try on the next cache access. If the prediction is correct, the cache access latency is the fast hit time. If it is not, the cache tries the other block, changes the way predictor, and has a latency of one extra clock cycle. Simulation studies suggest that prediction accuracy is more than 85% for two-way set associative caches. Way prediction is a good match for speculative processors, which already have to undo actions after a wrong speculation.
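The behaviour described above can be sketched with a toy two-way set-associative cache. This is a simplification under stated assumptions: one predictor entry is kept per set rather than per block, a miss always fills the non-predicted way, and the class and method names are hypothetical.

```python
class WayPredictedCache:
    """Toy two-way set-associative cache with one predicted way per set."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]   # tags[set][way]
        self.predicted = [0] * num_sets                        # predicted way per set

    def access(self, block_address: int) -> str:
        index = block_address % self.num_sets
        tag = block_address // self.num_sets
        way = self.predicted[index]
        if self.tags[index][way] == tag:
            return "fast hit"                 # the single tag comparison matched
        other = 1 - way
        if self.tags[index][other] == tag:
            self.predicted[index] = other     # retrain the predictor
            return "slow hit"                 # found on the second check
        # Miss: fill the non-predicted way and predict it for the next access.
        self.tags[index][other] = tag
        self.predicted[index] = other
        return "miss"


cache = WayPredictedCache(num_sets=4)
# Block addresses 0 and 4 map to the same set and alternate between its two ways.
results = [cache.access(a) for a in [0, 0, 4, 0, 0, 4]]
```

A fast hit corresponds to the normal hit time and a slow hit to the extra clock cycle spent checking the other way. The alternating accesses to blocks 0 and 4 never evict each other, which a direct-mapped cache of the same number of sets could not achieve.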
Pipelined Cache Access to Increase Cache Bandwidth
The processor is much faster than main memory. As a result, wait states are introduced whenever instructions or data must be fetched from main memory, which slows down the system.
Cache memory was developed to reduce these wait states so that the full computational speed of the processor can be used. Processor performance depends heavily on how data and instructions are transferred to and from the processor: the less time the transfers take, the better the processor performs.
The Pipeline Burst Cache is a storage area for a processor that is designed to be read from or written to in a pipelined series of four data transfers. As the name 'pipelining' suggests, the later transfers begin before the first transfer has arrived at the processor. It was developed as an alternative to asynchronous cache and synchronous burst cache.
Cache bandwidth and hit time influence each other. Pipelining cache access lets a first-level hit take several clock cycles, which gives a fast clock cycle time and high bandwidth but slow hits. If the cache does not have enough read/write ports, queuing delay is added to the hit time as well. The extra pipeline stages also increase the penalty of mispredicted branches and add clock cycles between issuing a load and using its data.
Non blocking Caches to Increase Cache Bandwidth
Pipelined processors that allow out-of-order execution need not stall on a data cache miss. Non-blocking caches are designed on this principle. A non-blocking cache lets the processor continue executing instructions while a cache miss is outstanding, and the data cache continues to supply hits during the miss. This is known as the “hit under miss” optimization. It reduces the effective miss penalty. A more advanced technique overlaps several misses and is called “hit under multiple misses” or “miss under miss”.
The major difficulty in evaluating the performance of non-blocking caches is that a cache miss does not necessarily stall the processor.
In this case, it is difficult to judge the impact of any single miss, and hence difficult to calculate the average memory access time.
With non-blocking caches, the effective miss penalty is not the sum of the latencies of the individual misses but the non-overlapped time for which the processor is stalled. The benefit of a non-blocking cache depends on the miss penalty when multiple misses are outstanding, on the memory reference pattern, and on how many instructions the processor can execute with a miss outstanding. The following diagram shows the operation of a non-blocking cache.
NOTE: Out-of-order pipelined processors can hide much of the miss penalty of an L1 data cache miss that hits in the L2 cache, but they cannot hide a significant fraction of the penalty of an L2 cache miss.
Protection via Virtual Memory:
Security and privacy are two of the most difficult challenges for information technology today, and researchers are looking for new ways to make computing systems more secure. Protecting information is not the job of hardware alone; real security and privacy involve innovation in both computer architecture and systems software. Here we discuss the architectural support for protecting processes from each other via virtual memory.
Virtual memory gives each process the illusion of a memory that is not physically present in that form. Virtual memory has three main functions.
1. Address translation.
2. Process protection.
3. Information sharing.
In a multiprogramming environment, several programs running concurrently share one computer. This sharing demands protection among the processes. A running program together with the state needed to continue running it is called a process, and switching the processor from one process to another is known as a process switch or context switch. The hardware and the operating system together allow processes to share the hardware resources without interfering with each other.
Page-based virtual memory, including a translation lookaside buffer (TLB) that caches page table entries, is the primary mechanism that protects processes from each other. To support this protection, the hardware should at a minimum do the following.
1. Provide at least two modes of execution, so that a running process is either a user process or a system process. A system process is also known as a kernel process or a supervisor process.
2. Provide a portion of the processor state that a user process can read but not write. This state includes the user/supervisor mode bits, the exception enable/disable bits, and the memory protection information.
3. Provide a mechanism to switch the processor between user mode and supervisor mode. Switching from user mode to supervisor mode is done by a system call, a special instruction that transfers control to a dedicated location in supervisor code space. The program counter (PC) at the point of the system call is saved and the processor is placed in supervisor mode. The return to user mode works like a subroutine return that restores the previous mode.
4. Provide a mechanism to limit memory accesses, so that the memory state of a process is protected without having to swap the process to disk on a context switch.
There are several memory protection schemes; one of the most popular adds protection restrictions to each page of virtual memory. The restrictions are stored in each page table entry and state whether a user process can read the page, whether a user process can write to the page, and whether code can be executed from the page. In addition, a process can neither read nor write a page that is not in its page table.
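The per-page protection check can be sketched as follows. The PageTableEntry fields, the access_allowed function, and the example page table are hypothetical names and values chosen for illustration; real hardware encodes these restrictions as bits in the page table entry.

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    frame: int          # physical page frame number
    readable: bool      # may a user process read this page?
    writable: bool      # may a user process write to this page?
    executable: bool    # may code be executed from this page?


def access_allowed(page_table: dict, virtual_page: int, kind: str) -> bool:
    """Check a user access of kind 'read', 'write' or 'execute' against the page table."""
    entry = page_table.get(virtual_page)
    if entry is None:
        return False    # a page that is not in the page table cannot be accessed at all
    permissions = {
        "read": entry.readable,
        "write": entry.writable,
        "execute": entry.executable,
    }
    return permissions[kind]


# Hypothetical process with a code page (read/execute) and a data page (read/write).
page_table = {
    0: PageTableEntry(frame=5, readable=True, writable=False, executable=True),
    1: PageTableEntry(frame=9, readable=True, writable=True, executable=False),
}
```

In this sketch a write to page 0 or an instruction fetch from page 1 is refused, and any access to a page outside the table is refused regardless of its kind.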
In virtual memory systems, page tables keep track of memory pages. A page table resides in main memory and can be very large. Accessing memory through the page table in principle doubles the memory access time, because the processor needs one memory access to obtain the physical address from the page table and a second access to get the data. This increases the cost of every memory access.
To reduce this cost we rely again on the principle of locality. If memory accesses have locality, then the address translations for those accesses also have locality. The translations can therefore be kept in a special cache known as a translation lookaside buffer (TLB).
A TLB entry is like a cache entry: the tag holds a portion of the virtual address, and the data portion holds a physical page address, a protection field, a valid bit, and usually a use bit and a dirty bit. The operating system changes these bits by changing the value in the page table and then invalidating the corresponding TLB entry.
The TLB works as follows. On a virtual memory access, the CPU searches the TLB for the virtual page number of the page being accessed, an operation known as a TLB lookup. If an entry with a matching virtual page number is found, a TLB hit has occurred and the CPU uses the page table entry stored in that TLB entry to calculate the target physical address.
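The lookup can be sketched in Python as below. The 4096-byte page size, the use of plain dictionaries for the TLB and the page table, and the function name translate are assumptions made for illustration. The miss path, which reads the page table in memory and then fills the TLB, is the usual complement of the hit case described above.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(virtual_address: int, tlb: dict, page_table: dict):
    """Return (physical address, True on a TLB hit / False on a TLB miss)."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    if virtual_page in tlb:
        frame = tlb[virtual_page]            # TLB hit: no page table access needed
        hit = True
    else:
        frame = page_table[virtual_page]     # extra memory access; KeyError models a page fault
        tlb[virtual_page] = frame            # cache the translation for later accesses
        hit = False
    return frame * PAGE_SIZE + offset, hit


page_table = {2: 7}      # hypothetical mapping: virtual page 2 -> physical frame 7
tlb = {}
first = translate(2 * PAGE_SIZE + 5, tlb, page_table)
second = translate(2 * PAGE_SIZE + 9, tlb, page_table)
```

The first access misses in the TLB and pays for the page table access; the second access to the same page hits, which is the locality of translations that the TLB exploits.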
Protection via Virtual Machine:
A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. A virtual machine typically imitates a physical computing environment, but requests for CPU, memory, hard disk, network and other hardware resources are managed by a virtualization layer, which translates these requests to the underlying physical hardware. Virtual machines are created within a virtualization layer, such as a hypervisor or a virtualization platform that runs on top of a client or server operating system. This operating system is known as the host OS. The virtualization layer can be used to create many individual, isolated VM environments. Virtual machines have gained importance for the following reasons:
1. The increasing importance of isolation and security in modern computers.
2. The failures in security and reliability of standard operating systems.
3. The sharing of a single computer among many unrelated users.
4. The dramatic increases in raw speed of processors, which make the overhead of VMs more acceptable.
A system virtual machine presents a complete system-level environment at the level of the Instruction Set Architecture (ISA). Some VMs run an ISA that differs from that of the native hardware. A VM gives its users the illusion of having an entire computer to themselves, including a copy of the operating system. A single computer runs multiple VMs and can support a number of different operating systems. In a traditional system, a single OS owns all the hardware resources, but with VMs, multiple OSes share the hardware resources.
The software that supports VMs is called a virtual machine monitor (VMM) or hypervisor. The VMM is the heart of virtual machine technology. The underlying hardware platform is known as the host, and its resources are shared among the guest VMs. The VMM determines how to map virtual resources to physical resources. The VMM thus manages both the physical hardware and the guest software that runs on it.