The L1 and L2 cache hit ratios were compared for the CPU and the GPU. As expected, the CPU showed much higher cache hit ratios than the GPU, since the CPU relies more heavily on the memory hierarchy during execution to increase performance. The GPU, on the other hand, relies more on parallel execution to process larger workloads, executing the same set of instructions across multiple threads to increase throughput and performance. Since this paper used Multi2Sim, it was used as a reference point for the AMD SI GPU simulations performed during this thesis, and the cache hit ratios recorded in this thesis were compared against the results reported in this paper. The paper did not investigate caching policies or identifying the most recently used blocks in cache memory; it focuses entirely on improving GPU memory performance by proposing two methods: a shared L1 vector data cache and clustered work-group scheduling. Both methods were evaluated on different workloads, and the performance improvements were recorded.
slot is chosen when the required data has returned from lower memory levels. In either policy, if any of the required resources is unavailable, a reservation failure occurs and the memory pipeline stalls. The allocated MSHR is reserved until the data is fetched from the L2 cache or off-chip memory, while the miss-queue entry is released once the miss request is forwarded to the lower memory hierarchy. Since allocate-on-fill preserves the victim cache line longer in the cache before eviction and reserves fewer resources for an outstanding miss, it tends to enjoy more cache hits and fewer reservation failures, and in turn better performance, than allocate-on-miss. Although allocate-on-fill requires extra buffering and flow-control logic to fill data into the cache in order, the in-order execution model and the write-evict policy make the GPU L1 D-cache friendly to allocate-on-fill, as there is no dirty data to write to L2 when a victim cache line is evicted at fill time. Therefore, it is intriguing to investigate how well allocate-on-fill performs for GPGPU applications.
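The hit-rate advantage described above can be illustrated with a toy single-set model (a sketch under strong simplifying assumptions — one outstanding miss, one access per cycle, LRU order within the set — not the simulator used in the paper):

```python
def simulate(policy, trace, ways=2, miss_latency=4):
    """Count hits for one cache set under the two allocation policies.
    Under allocate-on-miss the victim line is evicted when the miss is
    issued; under allocate-on-fill it stays resident until the fill returns,
    so accesses to the victim during the miss latency still hit."""
    resident = []            # tags in the set, front = LRU
    pending = None           # (tag, fill_cycle) of the one outstanding miss
    hits = 0
    for cycle, tag in enumerate(trace):
        if pending and cycle >= pending[1]:        # fill returns this cycle
            if len(resident) == ways:
                resident.pop(0)                    # evict at fill time
            resident.append(pending[0])
            pending = None
        if tag in resident:                        # hit
            hits += 1
            resident.remove(tag)
            resident.append(tag)                   # refresh LRU position
        elif pending is None:                      # miss: issue the fetch
            if policy == "allocate-on-miss" and len(resident) == ways:
                resident.pop(0)                    # evict the victim now
            pending = (tag, cycle + miss_latency)
        # otherwise: miss while another miss is outstanding (stalled)
    return hits
```

On a cyclic trace that keeps the set full, allocate-on-fill collects more hits than allocate-on-miss because the victim line keeps servicing accesses while the miss is in flight.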
This section summarizes the proposed technique, including the compression algorithm that is newly applied in this paper.
5.1. Data compression schemes in the secondary cache
In our strategy, data in the secondary cache memory are compressed and the areas vacated by the compression are turned off by controlling gated-Vdd transistors, which leads to an effective reduction of leakage energy. We use compression thresholds of 1/4, 1/2 and 3/4. For example, when a block can be compressed to smaller than a fourth of its size, the compressed block data are stored in the L2 cache and the remaining three-fourths of the area is turned off. When a block cannot be compressed to smaller than three fourths, the original block is stored as it is. Although compression and decompression overheads are incurred when accessing the secondary cache, they are not significant since the secondary cache is accessed infrequently.
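The threshold selection described above can be sketched as a small helper (illustrative; the function name and the strict-inequality reading of "smaller than" are assumptions):

```python
def stored_fraction(compressed_size, block_size):
    """Fraction of an L2 block kept powered on, given the compression
    thresholds of 1/4, 1/2 and 3/4 described above; the remaining area
    of the block is gated off via the gated-Vdd transistors."""
    for frac in (0.25, 0.5, 0.75):
        if compressed_size < frac * block_size:
            return frac          # compressed data fits; rest is turned off
    return 1.0                   # incompressible: store the original block
```

For a 64-byte block, a 15-byte compressed result keeps only a quarter of the block powered, while a 60-byte result falls back to storing the block uncompressed.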
VII. RELATED WORK
There is a great deal of research on advanced loop transformations. However, most of this research focuses on, e.g., data lifetime reduction, optimizing loop addressing, and memory layout for general-purpose systems with cache-based memory hierarchies. Memory access coalescing was described in 1994 by Davidson and Jinturkar in . Although their work focuses on coalescing narrow memory references into wide ones on scalar processors, some of the issues of automatically coalescing data accesses still apply to GPUs. The possibilities and limitations of coalescing memory accesses at compile time are also discussed, as well as the options for changing the memory access pattern at run time, which still leads to a significant speed-up.
Graphics Processing Units (GPUs) often employ shared memory to provide efficient storage for threads within a computational block. This shared memory comprises multiple banks to improve performance by enabling concurrent accesses across the memory banks. Conflicts occur when multiple memory accesses attempt to access a particular bank simultaneously, resulting in serialized access and a concomitant performance reduction. Identifying and eliminating these memory bank access conflicts is therefore critical for achieving high performance on GPUs; however, even for common 1D and 2D access patterns, understanding the potential bank conflicts can prove difficult. Current GPUs support memory bank accesses with configurable bit-widths; optimizing these bit-widths can result in data layouts with fewer conflicts and better performance.
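As an illustration of how bank conflicts and configurable bit-widths interact, the following sketch counts the worst-case serialization for a set of simultaneous byte addresses (the bank count and widths are assumed example values, not tied to any specific GPU):

```python
from collections import Counter

def bank_conflicts(addresses, num_banks=32, bank_width_bytes=4):
    """Maximum number of simultaneous accesses landing on one bank
    (1 means conflict-free; N means N-way serialization)."""
    banks = Counter((addr // bank_width_bytes) % num_banks
                    for addr in addresses)
    return max(banks.values())

# Example 2D patterns over 4-byte words in a 32-wide row:
col = [i * 32 * 4 for i in range(32)]   # column walk: every address, same bank
row = [i * 4 for i in range(32)]        # row walk: one address per bank
```

With the assumed parameters, the column walk serializes 32 ways while the row walk is conflict-free; reconfiguring the bank width to 8 bytes spreads the column walk over two banks, halving the serialization.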
In this paper we analyze sequences of basic memory access operations, which are R for read and W for write operations. Our goal is to record a sequence of memory access operations, or memtrace, and analyze this sequence. The majority of modern desktop computers use x86-compatible architectures, which were introduced in order to implement pipelines and, as a result, increase execution speed. Modern x86-compatible CPUs translate opcodes into sequences of micro-operations (or uops) responsible for loading and storing data, interacting with arithmetic logic units, branching, and so on, with each uop executed on a specific port. Some authors have collected information about the number and types of micro-operations used by CPUs to execute certain opcodes, as in , where such information was collected for Intel architectures ranging from the Pentium to the Skylake architecture. For example, in the Sandy Bridge architecture, port p23 stands for a memory read or address calculation, and p4 for a memory write.
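Given the port assignments mentioned above, a memtrace can be sketched by mapping executed uop ports to R/W operations (illustrative only; since p23 also covers address calculations, treating every p23 uop as a read is a simplification):

```python
# Port-to-operation mapping taken from the Sandy Bridge description above.
SANDY_BRIDGE_MEM_PORTS = {"p23": "R", "p4": "W"}

def memtrace_from_uops(uop_ports):
    """Build an R/W memtrace string from a sequence of executed uop ports,
    ignoring non-memory ports (ALU, branch, ...)."""
    return "".join(SANDY_BRIDGE_MEM_PORTS[p] for p in uop_ports
                   if p in SANDY_BRIDGE_MEM_PORTS)
```

For instance, the port sequence p23, p0, p4, p23 yields the memtrace "RWR".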
In this scenario, the memory subsystem of GPUs performs poorly. In this paper, we look into the reasons behind this behavior and find that one of the main sources of performance loss in the memory subsystem is the management of L2 cache misses.
We find that conventional caches designed to address the memory patterns of CPU applications do not properly meet the requirements of GPGPU applications; instead, they seriously penalize performance, since they can significantly slow down the management of L2 cache requests during long bursts of requests. This means that improving the L2 cache management is a key design concern that must be tackled to improve system performance. This paper proposes a novel L2 cache design aimed at boosting memory-level parallelism by adding a Fetch and Replacement Cache (FRC) that provides additional cache lines to help unclog the memory subsystem. The FRC approach uses these extra resources to prioritize the fetching of incoming L2 cache requests and to delay the eviction of the blocks to be replaced. The proposal has been evaluated on an AMD-based GPU architecture, although the results would also apply to almost all current GPU architectures, as they implement a similar memory hierarchy.
The Representativeness Validation by Simulation (RVS) method relies on identifying the conflictive combinations (sets) of addresses, aC_i, such that if they are randomly mapped to the same cache set they lead to cache (set) mapping scenarios with a high impact on execution time. RVS also estimates (upper-bounds) the probability of occurrence of those scenarios and assesses whether the pWCET distribution derived with MBPTA truly upper-bounds their impact. The validation is performed in the miss-count domain rather than in the execution-time domain, and it is applied to each cache memory individually (i.e., instruction and data caches). RVS relies on the assumption that miss counts correlate strongly with execution time. This is usually the case, since cache misses have been shown to be one of the major contributors to programs' execution time. Nevertheless, we perform a quantitative assessment of this fact for our reference processor architecture (Section IV). RVS includes the following steps:
b Assistant Professor, College of Engineering, Munnar 685612, India
Cache memories serve as accelerators that improve the performance of modern microprocessors. Caches are vulnerable to soft errors because of technology scaling, so it is important to provide protection mechanisms against them. Tag comparison is critical in cache memories for maintaining data integrity and a high hit ratio. Error-correcting codes (ECC) are used to enhance the reliability of memory structures. The previous solution for cache access is to decode each cache way to detect and correct errors. In the proposed architecture, the ECC delay is moved off the critical path by directly comparing the retrieved tag with the incoming information, which is encoded as well, thus reducing circuit complexity. For efficient computation of the Hamming distance, a butterfly weight accumulator is proposed to further reduce latency and complexity. The proposed architecture checks whether the incoming data matches the stored data, and it reduces latency and hardware complexity compared with the most recent implementation.
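The direct-compare idea can be sketched as follows: instead of decoding the stored tag before comparison, the two codewords are compared and a match is declared when they differ in at most the number of bits the code can correct (a simplified model assuming a single-error-correcting code; it uses a plain popcount rather than the proposed butterfly weight accumulator):

```python
def hamming_distance(a, b):
    """Number of differing bits between two encoded tag words."""
    return bin(a ^ b).count("1")

def tags_match(stored_codeword, incoming_codeword, max_correctable=1):
    """Direct ECC-tag comparison: hit if the codewords differ by no more
    bits than the code can correct (SEC code assumed, so 1 by default)."""
    return hamming_distance(stored_codeword, incoming_codeword) <= max_correctable
```

A single flipped bit in the stored tag still produces a match, while a genuinely different tag (two or more differing bits here) is rejected.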
substitution matrix is not random anymore: multiple substitution scores can be loaded simultaneously when aligning the query with a database character. Furthermore, query sequence lookups are no longer required; only the current position within the query is needed to index into the profile. A query profile is generated once for every query sequence. Each query profile column stores values for 23 characters. The number of columns, and hence the memory requirement for a query profile, depends on the length of the query sequence. The GTX 275 GPU used for our implementation has 8 KB of texture cache per multiprocessor. This means that a query sequence with more than ⌊8 × 1024/23⌋ = 356 characters will result in increased cache misses, as described in . Tests were performed to quantify the texture cache miss rate, which was shown to be very small; for example, aligning an 8000-character query sequence resulted in a 0.009% miss rate. Using this query profile method resulted in a 17% performance improvement with Swiss-Prot .
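The 356-character bound above follows directly from the cache and profile sizes, assuming (as an illustration) one byte per stored score:

```python
TEXTURE_CACHE_BYTES = 8 * 1024   # texture cache per multiprocessor (GTX 275)
CHARS_PER_COLUMN = 23            # scores stored per query-profile column

# Longest query whose full profile fits in the texture cache:
max_cached_query_len = TEXTURE_CACHE_BYTES // CHARS_PER_COLUMN  # floor division
```

Queries longer than this value no longer fit their whole profile in the cache, which is why longer queries begin to incur (still very small) miss rates.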
As for the effect of the texture cache, at the cost of modifying the Map function code, GT never loses to G. Comparing SI to GT, except for MM-M, where the input cannot be completely staged, staging input is better than or equal to texture in most cases. While WC-M and SM-M show similar results, opposite results appear in II-M and KM-M. While the texture cache is designed to reduce the global memory bandwidth demand, its latency is still longer than that of shared memory. Therefore, when there are long, complex computation phases with conditional branches and a large variance in input, as in II-M, bandwidth is not a problem, and the short latency of shared memory makes SI much better. But in KM-M, featuring fixed-size input and almost equally long computation for every thread, the latency of texture fetches is well hidden and the GT mode wins, because the hardware cache incurs the least overhead compared with explicit staging. However, SI gradually catches up with GT when the thread block is larger, showing that the overhead of SI is eventually hidden with more warps. In addition, MM-M's GT mode shows superior performance over SI because in GT, row/column vectors can be cached with the hardware-managed replacement policy, while SI can only stage the row/column indices.
7 CONCLUSIONS AND OUTLOOK
Technological improvements in memory performance are mostly achieved by increasing the size of the data-transfer bursts between main memory and the CPU/GPU. While this feature can greatly improve the performance of algorithms that access large blocks of sequential data, it is neutral for algorithms requesting relatively small data blocks spread across distant random locations. In fact, we expect that, in terms of efficiency, pseudo-random memory access patterns like those shown by straightforward FM-index implementations will steadily lag behind sequential access patterns even in upcoming next-generation memory systems. In such a scenario, the performance cost is determined by the total number of blocks accessed, not by the amount of data accessed. Therefore, we must favour algorithmic variations that access similar amounts of data concentrated in fewer and bigger blocks, even at the expense of more computation. This is precisely what our k-step FM-indexing strategy does: it trades extra computation for reading fewer, bigger blocks.
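The trade-off can be stated with an idealized cost model in which cost is proportional to the number of blocks accessed, regardless of block size (the function and parameters are illustrative, not measurements):

```python
import math

def fm_search_cost(pattern_len, k, block_latency=1.0):
    """Idealized block-access cost of a k-step FM-index search: each step
    resolves k pattern characters with one access to a (k-times-bigger)
    block, so cost tracks the number of blocks touched, not bytes read."""
    steps = math.ceil(pattern_len / k)
    return steps * block_latency
```

Under this model a 2-step index halves the block accesses of the 1-step baseline for the same pattern, which is exactly the effect the strategy exploits.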
Nowadays, computers are designed to operate with different types of memory organized in a memory hierarchy. In such hierarchies, as the distance of a memory from the processor increases, so does its access time. Closest to the CPU is the cache memory. Cache memory is fast but quite small; it is used to store small amounts of data that have been accessed recently and are likely to be accessed again soon. Data is stored here in blocks, each containing a number of words. To keep track of which blocks are currently stored in the cache, and how they relate to the rest of the memory, the cache controller stores identifiers for the blocks currently held in the cache. These include the index, tag, valid and dirty bits associated with a whole block of data. To access an individual word inside a block, a block offset is used as an address into the block itself. Using these identifiers, the cache controller can respond to read and write requests issued by the CPU, by reading and writing data to specific blocks, or by fetching or writing out whole blocks to the larger, slower main memory. Figure 2 shows a block diagram of a simple memory hierarchy consisting of the CPU, the cache (including the cache controller and the small, fast memory used for data storage), the main memory controller and the main memory proper.
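The identifiers described above can be sketched as a simple address split (the block size and set count are assumed example values; a real controller derives these bit widths from the cache geometry):

```python
def split_address(addr, block_size=16, num_sets=256):
    """Split a byte address into (tag, index, block offset), as the cache
    controller does. block_size and num_sets are assumed powers of two."""
    offset = addr % block_size                 # word within the block
    index = (addr // block_size) % num_sets    # which cache set to check
    tag = addr // (block_size * num_sets)      # identifies the block there
    return tag, index, offset
```

For example, with 16-byte blocks and 256 sets, address 0x12345 splits into tag 0x12, index 0x34 and offset 0x5; on a lookup the controller reads set 0x34, compares the stored tag (and valid bit) against 0x12, and on a hit returns the word at offset 0x5.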
Figure 20.3: Early GPU pipeline.
Figure 20.4: Improved GPU pipeline.
Figure 20.4 shows how the GPU pipeline looks now. Unlike early GPUs, which used different hardware for different shaders, the new pipeline uses unified hardware that combines all shaders under one core with shared memory. In addition, NVIDIA introduced CUDA (Compute Unified Device Architecture), which added the ability to write general-purpose C code with some restrictions. This means that a programmer has access to thousands of cores which can be instructed to carry out similar operations in parallel.
To explore the memory performance issues in a limit study, we profile the execution of the NPB programs using the latency-above-threshold profiling mechanism of the Intel Nehalem microarchitecture. This hardware-based mechanism samples memory instructions with access latencies higher than a predefined threshold and provides detailed information about the data address used by each sampled instruction. Based on the sampled data addresses, it is straightforward to determine the number of times each page of the program's address space is accessed by each core of the system. To account for accesses to all levels of the memory hierarchy, we set the latency threshold to 3 cycles (accesses to the first-level cache on the Nehalem-based system have a minimum latency of 4 cycles). The profiling technique is portable to many different microarchitectures, because most recent Intel microarchitectures support latency-above-threshold profiling. Additionally, AMD's processors support Instruction-Based Sampling , a profiling mechanism very similar to Intel's.
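The per-page aggregation step can be sketched as follows (the sample format and 4 KiB page size are assumptions for illustration):

```python
from collections import Counter

PAGE_SIZE = 4096  # 4 KiB pages assumed

def page_access_counts(samples):
    """Aggregate sampled (core, data_address) pairs into per-(core, page)
    access counts, as described above."""
    counts = Counter()
    for core, addr in samples:
        counts[(core, addr // PAGE_SIZE)] += 1
    return counts
```

Two samples falling anywhere within the same 4 KiB page (and from the same core) are counted together, which is all the limit study needs.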
6 Conclusions and Future Work
This project has served as an introduction to FPGA-based heterogeneous computing and to the OpenCL framework as a tool to develop applications in such environments.
FPGAs are powerful devices that can significantly enhance the performance of algorithms and applications, but a non-negligible amount of knowledge and time is necessary to exploit their full potential. On the one hand, it is important to understand the FPGA architecture and how the basic operations performed by the source code are translated into a digital circuit design, in order to maximize resource exploitation. On the other hand, knowing the nature and characteristics of the algorithm is key to determining how the physical resources can best be organized and distributed among the different tasks.
Studies have shown that on-chip caches can consume about fifty percent of the total power in high-performance microprocessors.
In this paper, we propose a new cache technique, referred to as the early tag access (ETA) cache, to improve the energy efficiency of L1 data caches. In a physically tagged, virtually indexed cache, a portion of the physical address is stored in the tag arrays, while the translation between the virtual address and the physical address is performed by the TLB. By accessing the tag arrays and the TLB during the LSQ stage, the destination ways of most memory instructions can be determined before accessing the L1 data cache. As a result, only one way in the L1 data cache needs to be accessed for these instructions, thereby reducing the energy consumption considerably. Note that the physical addresses generated by the TLB at the LSQ stage may also be used for future cache accesses.
Fully Associative Mapping: This is a much more flexible mapping method, in which a main memory block can be placed into any cache block position. This means that there is no need for a block field. In this case, 14 tag bits are required to identify a memory block when it is resident in the cache, as indicated in Figure 5.8. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which to place the memory block, so the space in the cache can be used more efficiently. A new block that has to be brought into the cache has to replace (eject) an existing block only if the cache is full. In this case, we need an algorithm to select the block to be replaced. The commonly used algorithms are random, FIFO and LRU. Random replacement makes a random choice of the block to be removed. FIFO removes the oldest block, without considering the memory access patterns, so it is not very effective. On the other hand, the least recently used (LRU) technique considers the access patterns and removes the block that has not been referenced for the longest period.
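The three replacement algorithms can be sketched with a toy fully associative cache (illustrative; not tied to any particular hardware):

```python
import random
from collections import OrderedDict

class FullyAssociativeCache:
    """Toy fully associative cache illustrating random, FIFO and LRU
    replacement. Lines are kept in an OrderedDict whose front entry is
    the eviction candidate for FIFO (oldest) and LRU (least recent)."""
    def __init__(self, capacity, policy="lru"):
        self.capacity, self.policy = capacity, policy
        self.lines = OrderedDict()     # tag -> None

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting if full."""
        if tag in self.lines:
            if self.policy == "lru":
                self.lines.move_to_end(tag)   # refresh recency; FIFO ignores use
            return True
        if len(self.lines) == self.capacity:
            if self.policy == "random":
                victim = random.choice(list(self.lines))
            else:                              # front = oldest / least recent
                victim = next(iter(self.lines))
            del self.lines[victim]
        self.lines[tag] = None
        return False
```

On the sequence A, B, A, C with two lines, LRU evicts B (A was just reused) while FIFO evicts A (the oldest arrival) — exactly the difference the text describes.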
be to a CPU, because the large number of threads on a GPU can hide the latency of a memory access.
The final metric for this configuration was the simulation time: the number of cycles needed by the simulated program to complete execution. This is how the overall performance is measured. Figure 4.3a graphs the data for this metric for the general-purpose benchmarks. The two benchmarks that are easy to see in the graph look like the opposite of the L1 hit ratio: instead of a valley, there is a peak at a small L2 cache size and a sizable drop when the L2 is removed. This is because the simulation time has an inverse relationship with cache performance: the better the cache performs, the lower the simulation time will be, and vice versa. With the very small L2 cache, the performance of the L1 cache is impacted greatly, and this shows in the simulation times. Figure 4.3b shows the simulation times for the machine learning benchmarks. Many of these benchmarks appear to have a constant simulation time; the scale of the graph is dictated by the Kronecker benchmark, which has a much larger simulation time than the rest. While the simulation time is a good way to measure the overall performance of an application, it is hard to compare multiple applications because of the wide variance in times. In addition, averaging the simulation times will not have the desired effect, since the outliers will dominate and the trend will not be valid.
Here is some more detail. The experiments reported here were conducted on a BBN TC2000 (a.k.a. BBN Butterfly) shared-memory multiprocessor consisting of 128 processor/memory nodes. The shared memory consists of the totality of the memory modules at all of these nodes, with internode access via a multistage network. The local operating system structure allowed us to run programs on a dedicated-machine basis, i.e., with all other jobs suspended, except for certain interactive jobs running on a reserved set of four nodes. For each combination of parameters, at least 15 (and as many as 65) runs were conducted, with the timings graphed here being the averages of the values so obtained. All of the DTA experiments reported here used a group size of G = 16, and thus the numbers of processors used were multiples of 16. Since four of the 128 processors are unavailable, the maximum number of processors we used was 112. The graphs presented here plot program timings t against numbers of processors p.