Cache Memory Access Patterns in the GPU Architecture

studied and identified for various releases of CPUs [11, 12, 13]. Salvador Petit et al. [11] investigated the temporal locality of a multi-way set-associative cache by recording the percentage of cache hits along the multiple lines of each set of the CPU's cache. This helped analyze the power consumption and performance of the CPU for the different cache lines under the current caching policies. A new drowsy cache policy was proposed and shown to strike a good balance between performance and power consumption. The experiments were performed using the HotLeakage simulator and the SPEC2000 benchmark suite for CPUs. The drowsy cache policy was then compared against two existing caching policies, Most Recently Used On (MRO) and Two Most Recently Used On (TMRO), in terms of performance and power statistics. The study stepped through each Most Recently Used (MRU) block, or line, in the cache and recorded the power values and hit percentages for each line. Various CPU benchmarks were run to identify the temporal locality of the lines in each set of the CPU cache. The first MRU line showed a very high percentage of cache hits, between 85% and 95% depending on the benchmark being tested; the remaining 5% to 15% of hits were distributed among the other MRU lines. Overall, the results showed a very high percentage of hits on the Most Recently Used (MRU) block, the MRU0 line, recorded to be 92% on average. This paper serves as a good reference point for MRU cache access patterns on the CPU, and it motivated the search for the corresponding MRU cache access patterns of the GPU to see whether the GPU shows similar behavior.
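
The per-line hit distribution described above can be reproduced with a small counting simulation that keeps each set as a recency-ordered list and records the MRU position of every hit. The sketch below is illustrative only and assumes its own cache geometry and a synthetic address stream; it is not the instrumentation used in [11].

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal MRU-position profiler: each set is kept as a recency-ordered
// vector of tags (front = MRU0). hits[p] counts hits found at MRU position p.
struct MruProfiler {
    std::size_t numSets, ways, lineBytes;
    std::vector<std::vector<uint64_t>> sets;
    std::vector<uint64_t> hits;   // indexed by MRU position
    uint64_t misses = 0;

    MruProfiler(std::size_t s, std::size_t w, std::size_t b)
        : numSets(s), ways(w), lineBytes(b), sets(s), hits(w, 0) {}

    void access(uint64_t addr) {
        uint64_t block = addr / lineBytes;
        auto& set = sets[block % numSets];
        uint64_t tag = block / numSets;
        auto it = std::find(set.begin(), set.end(), tag);
        if (it != set.end()) {
            hits[it - set.begin()]++;               // hit at this MRU position
            set.erase(it);
            set.insert(set.begin(), tag);           // promote to MRU0
        } else {
            misses++;
            if (set.size() == ways) set.pop_back(); // evict LRU
            set.insert(set.begin(), tag);
        }
    }
};

int main() {
    MruProfiler cache(64, 4, 64);                   // assumed 16 KB, 4-way, 64 B lines
    for (uint64_t i = 0; i < 100000; ++i) {
        uint64_t line = (i % 512) * 64;             // synthetic stream
        cache.access(line);                         // first touch (may miss)
        cache.access(line);                         // immediate reuse lands on MRU0
    }
    for (std::size_t p = 0; p < cache.hits.size(); ++p)
        std::printf("MRU%zu hits: %llu\n", p, (unsigned long long)cache.hits[p]);
    std::printf("misses: %llu\n", (unsigned long long)cache.misses);
    return 0;
}
```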

BAG : Managing GPU as buffer cache in operating systems

Gdev [10] introduces an open-source kernel driver and user-level library to manage the GPU as a first-class computing resource, facilitating the sharing of GPU resources. GPUfs [17] is a system that exposes a POSIX-like file system API to GPU programs. In order to optimize GPU file access, GPUfs also maintains a buffer cache in GPU memory. Although both GPUfs and BAG manage GPU memory as a buffer cache, there are many differences between the designs of these two systems because of their distinct goals. GPUstore [18] is a general-purpose framework that is intended to accelerate computational tasks in storage systems. GPUstore provides an efficient mechanism for mapping memory pages between kernel and user space, and we hope to integrate it into BAG to further improve performance. The RAID 6 implementation on the GPU in GPUstore can also be used in our system for fault tolerance, since only high-end GPUs support ECC for GPU memory. Both GPUstore and BAG operate at the storage layer, but BAG focuses on how to expand memory capacity with the GPU's disaggregated RAM, instead of just exploiting its computational capability. Memory expansion: disaggregated memory [11], [12] has been proposed as an effective approach to scaling the local memory capacity of blade servers. The work in [12] also demonstrated the feasibility of enhancing disaggregated memory with content-based page sharing, which would be a good fit for GPUs. Transcendent memory (Tmem) [4] is a new approach to improving the utilization of physical memory, and a well-designed front-end API of Tmem can be used to implement various memory capacity optimizations, such as remote paging and page compression. We believe that developing a back-end for Tmem using GPUs is interesting future work. Flash has been identified as a promising way to expand the buffer cache [1], [6], [16]. All these previous studies in part motivated our work. More discussion of the related work can be found in the supplementary file.

Dual Access Cache Memory Management Recommendation Model Based on User Reviews

This section discusses related work on the various recommendation models traditionally available. Recommender systems (RS) have recently become a complement to conventional query-based services by offering proactive information discovery. Esparsa et al. [6] explored fragmented, noisy snippets that were used directly in recommendations, and validated whether Real-Time Web (RTW) services can serve as the basis for recommendation and how they perform compared with traditional systems. The relationship between web services and providers is described by a two-dimensional form called the user-item matrix, but the matrix model cannot reveal that relationship accurately. Cao et al. [7] presented a cube model to describe the relationship among consumers, services, and providers. The matrices that represent the cube model are a consumer-service QoS matrix, a binary matrix, and a consumer-provider matrix. Based on the status of the cube model, Standard Deviation (SD) and Inverse Consumer Frequency (ICF) based filtering approaches were applied to ensure effective recommendation. Zhang et al. [8] enhanced recommendation system performance by fusing virtual ratings derived from user reviews. They identified self-supervised sentiment classification models with high precision and recall under proximity evaluation. The unstructured and semi-structured review patterns in the vast number of reviews made tracking a difficult task. Daoud et al. [9] developed diverse recommendation methodologies to alleviate challenges such as the overload faced by online shoppers. They utilized a text mining approach to mine product opinions, features, and their semantic similarity with respect to the opinion sources. The reliable extraction of the product adopter
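
As a rough illustration of the Inverse Consumer Frequency idea mentioned above, the sketch below weights services by how few consumers have used them, by analogy with inverse document frequency. The formula and the tiny consumer-service matrix are assumptions for illustration; the exact definition used in [7] may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Inverse Consumer Frequency (ICF), by analogy with IDF in text retrieval:
// a service invoked by few consumers is more discriminative for filtering.
// icf(s) = log(totalConsumers / consumersWhoUsedService) is an assumed form.
double inverseConsumerFrequency(std::size_t totalConsumers,
                                std::size_t consumersOfService) {
    if (consumersOfService == 0) return 0.0;   // unseen service: no evidence
    return std::log(static_cast<double>(totalConsumers) / consumersOfService);
}

int main() {
    // Hypothetical binary consumer-service usage matrix (rows = consumers).
    std::vector<std::vector<int>> used = {
        {1, 0, 1},
        {1, 1, 0},
        {1, 0, 0},
    };
    std::size_t consumers = used.size(), services = used[0].size();
    for (std::size_t s = 0; s < services; ++s) {
        std::size_t count = 0;
        for (std::size_t c = 0; c < consumers; ++c) count += used[c][s];
        std::printf("service %zu: icf = %.3f\n", s,
                    inverseConsumerFrequency(consumers, count));
    }
    return 0;
}
```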

Adaptive memory-side last-level GPU caching

demand high bandwidth to shared data. Shared LLCs incur a performance bottleneck for workloads that frequently access data shared by multiple SMs. A shared memory-side LLC consists of multiple slices, each caching a specific memory partition, i.e., a specific address range of the entire memory space is served by a particular memory controller. As a result, a shared cache line appears in a single LLC slice, which leads to a severe performance bottleneck if multiple SMs concurrently access the same shared data. Requests from different SMs queue up in front of the LLC slice, which leads to long queuing delays, up to the point that these queuing delays can no longer be hidden, ultimately deteriorating overall application performance. The underlying performance bottleneck is a lack of LLC bandwidth to shared data. A potential solution to the bandwidth problem may be to replicate shared data across the different LLC slices to increase the LLC bandwidth to the shared data. A private LLC achieves exactly this, although it also leads to higher miss rates because of cache line replication. Because of the conflicting phenomena (higher LLC bandwidth versus increased miss rate), it is unclear whether a shared or private LLC organization is preferred for sharing-intensive workloads.
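
To make the slice-mapping argument concrete, the sketch below contrasts a shared organization, where a block address always maps to one slice, with a private organization, where each SM uses its own slice and shared blocks get replicated. The modulo-based slice function and the parameters are assumptions; real GPUs typically hash addresses across memory partitions.

```cpp
#include <cstdint>
#include <cstdio>

// Shared memory-side LLC: every physical block address maps to exactly one
// slice (here a simple modulo; real GPUs use a hash), so all SMs touching
// that block contend for the same slice's bandwidth.
uint32_t sharedSlice(uint64_t addr, uint32_t numSlices, uint32_t lineBytes) {
    return (addr / lineBytes) % numSlices;
}

// Private LLC organization: each SM looks only in its own slice, so a shared
// block may be replicated in up to numSlices slices (more bandwidth to it,
// but effective capacity shrinks and miss rates rise).
uint32_t privateSlice(uint32_t smId, uint32_t numSlices) {
    return smId % numSlices;
}

int main() {
    const uint64_t sharedBlock = 0x1F4000;   // hypothetical hot shared address
    const uint32_t slices = 8, lineBytes = 128;
    for (uint32_t sm = 0; sm < 4; ++sm) {
        std::printf("SM%u -> shared org: slice %u, private org: slice %u\n",
                    sm,
                    sharedSlice(sharedBlock, slices, lineBytes),  // same slice for all SMs
                    privateSlice(sm, slices));                    // one copy per SM's slice
    }
    return 0;
}
```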

Vertical Memory Optimization for High Performance Energy-efficient GPU

5.5.1.1 Performance. We first analyze the performance of different configurations employing various schedulers and/or combinations of schedulers and TEMP; only pure GPU workloads are evaluated. The results are normalized to CCWS without any optimization of the DRAM system, as shown in Figure 38. Applying TEMP on top of CCWS introduces a 5.7% geometric-mean (GM) speedup, while replacing CCWS with TBAS further raises the speedup to 10.3%. We also compare against OWL [19]. OWL targets cache performance through intelligent warp scheduling. It also tries to improve the bank-level parallelism (BLP) of memory accesses by prioritizing different-numbered thread blocks in consecutive SMs. In our evaluation, OWL achieves 93.6% of the performance of CCWS across the application set. We observe that CCWS has a higher cache hit rate than OWL, and the BLP improvement of OWL is limited because only the small subset of thread blocks that share pages is considered. As a result, TEMP on top of CCWS is 12.9% better than OWL. Figure 39 shows the local access ratio for each memory bank. Here, a local access denotes a memory access from the SM associated with the bank, while a remote access denotes an access from any other SM.
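
For reference, the geometric-mean (GM) speedup quoted above is the n-th root of the product of per-application speedups over the baseline. The sketch below shows the calculation on made-up execution times; the numbers are not taken from Figure 38.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Geometric-mean speedup over a baseline: per-benchmark speedups are
// multiplied and the n-th root taken, so no single benchmark dominates.
double geometricMeanSpeedup(const std::vector<double>& baselineTimes,
                            const std::vector<double>& configTimes) {
    double logSum = 0.0;
    for (std::size_t i = 0; i < baselineTimes.size(); ++i)
        logSum += std::log(baselineTimes[i] / configTimes[i]);  // per-app speedup
    return std::exp(logSum / baselineTimes.size());
}

int main() {
    // Illustrative execution times (ms); not the data behind Figure 38.
    std::vector<double> ccws     = {10.0, 22.0, 7.5, 31.0};
    std::vector<double> ccwsTemp = { 9.4, 20.8, 7.2, 29.5};
    std::printf("GM speedup of CCWS+TEMP over CCWS: %.3f\n",
                geometricMeanSpeedup(ccws, ccwsTemp));
    return 0;
}
```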

High Performance Cache Architecture Using Victim Cache

Cache memory is used to increase the data transfer rate. Generally, it is difficult to find the data requested by the microprocessor in cache memory, as the size of the cache is very small compared to RAM and main memory. When data requested by the microprocessor is not found in cache memory, a cache miss occurs. Since cache memory is on-chip memory, if the data requested by the microprocessor is found in the cache, the data transfer rate increases because the data is present close to the processor. If the requested data is not found in cache memory, the microprocessor receives it from RAM or main memory; this takes more time and directly reduces the data transfer rate. In other words, speed, or data transfer rate, increases by reducing cache misses. To track cache misses, we design a cache controller, which helps increase the data transfer rate between main memory and the processor. In the design of a cache, the parameters that enable the design of various cache architectures include the overall cache size, block size, and efficiency, along with the replacement policy employed by the cache controller. The proposed architecture takes advantage of a number of extra victim lines that can be associated with cache sets that experience more conflict misses for a given program. To reduce the miss delay in caches, the concept of a victim cache is proposed. To implement the proposed cache architecture, three modules are considered: a cache controller and two storage modules for the data store and the TAG store, as illustrated in the figure below. The cache controller handles memory access requests from the processor and issues control signals to control the flow of data within the cache [1].
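
The victim-cache idea referred to above keeps a small, fully associative buffer of recently evicted lines that is probed on an L1 miss before going to main memory. The following is a minimal sketch under assumed sizes and interfaces, not the exact architecture proposed in the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Minimal victim-cache sketch: on an L1 miss, a small fully associative
// buffer of recently evicted lines is probed before going to main memory.
class VictimCache {
    std::deque<uint64_t> lines;     // front = most recently evicted
    std::size_t capacity;
public:
    explicit VictimCache(std::size_t n) : capacity(n) {}

    // Returns true if the evicted-line buffer holds this block (a "swap hit").
    bool probe(uint64_t blockAddr) {
        for (auto it = lines.begin(); it != lines.end(); ++it) {
            if (*it == blockAddr) { lines.erase(it); return true; }  // move back to L1
        }
        return false;
    }

    // Called when L1 evicts a line due to a conflict miss.
    void insertEvicted(uint64_t blockAddr) {
        if (lines.size() == capacity) lines.pop_back();  // drop oldest victim
        lines.push_front(blockAddr);
    }
};

// Sketch of the controller's miss path (assumed interface, not the paper's design):
// if (!l1.lookup(block)) {
//     if (victim.probe(block))  refillFromVictim(block);   // short miss delay
//     else                      refillFromMemory(block);   // full miss delay
//     victim.insertEvicted(l1.evict(block));
// }
int main() {
    VictimCache victim(4);
    victim.insertEvicted(0x40);
    return victim.probe(0x40) ? 0 : 1;   // the evicted line is found again
}
```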

A Survey on Encode Compare and Decode Compare Architecture for Tag Matching in Cache Memory using Error Correcting Codes

In cache tag matching, current microprocessor caches are typically set-associative. A set-associative cache has a tag directory and a data array. The tag directory stores tag addresses, which indicate which part of memory is stored in the data array. When an access is made to the cache, the set-address portion of the full address is used to index into the tag directory, and a set of tags (the number of tags read depends on the associativity) is read.
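
The indexing step can be illustrated with a few lines of bit arithmetic that split an address into offset, set index, and tag. The cache geometry below is an assumption chosen only to make the example concrete.

```cpp
#include <cstdint>
#include <cstdio>

// Splitting an address into offset / set index / tag for a set-associative
// cache. The geometry below (64 B lines, 256 sets) is an assumption used
// only to make the arithmetic concrete.
struct CacheGeometry {
    uint32_t lineBytes = 64;    // block size
    uint32_t numSets   = 256;   // sets in the tag directory
};

struct AddressFields { uint64_t tag; uint32_t set; uint32_t offset; };

AddressFields split(uint64_t addr, const CacheGeometry& g) {
    AddressFields f;
    f.offset = addr % g.lineBytes;                 // byte within the line
    f.set    = (addr / g.lineBytes) % g.numSets;   // selects one row of tags
    f.tag    = addr / g.lineBytes / g.numSets;     // compared against all ways
    return f;
}

int main() {
    CacheGeometry g;
    AddressFields f = split(0xDEADBEEF, g);
    // All tags of the indexed set are read and compared (in parallel in
    // hardware) against f.tag; an ECC check on the tag would happen here too.
    std::printf("tag=0x%llx set=%u offset=%u\n",
                (unsigned long long)f.tag, f.set, f.offset);
    return 0;
}
```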

A Systemc Cache Simulator for a Multiprocessor Shared Memory System

The most common solution to the memory wall is to cache data, and caching requires locality of access or memory reuse, which may be achieved by compiler optimisations that can help to localise data (Jesshope, 2008). Computing scientists have also designed banked memory systems to provide high bandwidth to random memory locations (Hennessy and Patterson, 2007; Jesshope, 2008), but some access patterns still break the memory system (Jesshope, 2008). Processors that tolerate high-latency memory accesses have been designed, but this requires concurrency in instruction execution (Hennessy and Patterson, 2007; Jesshope, 2008). Caches are largely transparent to the programmer, but programmers must be aware of the cache while designing code to ensure regular access patterns (Hennessy and Patterson, 2007; Jesshope, 2008, 2009, 2011). Caching the right data is the most critical aspect of caching for maximizing system performance. More cache misses end up reducing performance instead of improving it: the cache consumes memory while data is not actually served from it but is re-fetched from the original source, and in the worst case this can even lead to system deadlocks. The development of a cache simulator requires a deeper understanding of how the memory hierarchy operates (Schintke, Simon, and Reinfield, 2012).
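
The counting core of such a simulator can be sketched in a few dozen lines. The toy model below is plain C++ rather than SystemC and uses an unbounded per-core cache with a simplistic write-invalidate rule; it only illustrates how hits, misses, and sharing interact, not the simulator described here.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Toy multiprocessor cache model: each core has a private set of cached block
// addresses (unbounded, so no capacity misses); a write by one core
// invalidates the block in every other core's cache (simplified snooping).
struct ToyCache {
    std::unordered_set<uint64_t> blocks;
    uint64_t hits = 0, misses = 0;
};

struct Machine {
    std::vector<ToyCache> cores;
    uint32_t lineBytes;
    Machine(unsigned n, uint32_t line) : cores(n), lineBytes(line) {}

    void read(unsigned core, uint64_t addr) {
        uint64_t b = addr / lineBytes;
        auto& c = cores[core];
        if (c.blocks.count(b)) c.hits++;
        else { c.misses++; c.blocks.insert(b); }      // fill from shared memory
    }

    void write(unsigned core, uint64_t addr) {
        uint64_t b = addr / lineBytes;
        for (unsigned i = 0; i < cores.size(); ++i)
            if (i != core) cores[i].blocks.erase(b);  // invalidate sharers
        read(core, addr);                             // then count it as an access
    }
};

int main() {
    Machine m(2, 64);
    for (int i = 0; i < 1000; ++i) {
        m.read(0, 0x1000);                   // core 0 keeps re-reading one line
        if (i % 10 == 0) m.write(1, 0x1000); // core 1 occasionally writes it
    }
    std::printf("core0: %llu hits, %llu misses\n",
                (unsigned long long)m.cores[0].hits,
                (unsigned long long)m.cores[0].misses);
    return 0;
}
```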

Exploration of GPU Cache Architectures Targeting Machine Learning Applications

Table 3.2 shows a comparison of some of the characteristics of the different generations of NVIDIA architectures. The number of SMs increases with the newer architectures after Kepler. Another difference is the number of CUDA cores per SM: Kepler has the most at 192, but newer generations drop this count, because fewer cores mean less overhead and newer technology still allows an increase in performance. The cache sizes have also varied across generations. Fermi, Kepler, and Volta all have a combined L1 and shared memory, which allows some configurability of how much to use per program. Fermi and Kepler have a total size of 64 KB, while Volta has twice that at 128 KB. Maxwell and Pascal both have a dedicated 24 KB L1; Maxwell has 96 KB of shared memory, while Pascal dropped this to 64 KB. Finally, the L2 cache size follows an ever-increasing trend: Fermi started with 768 KB, and each generation increased it, all the way to 6 MB with Volta. One conclusion that can be drawn from this table is that the cache, L2 in particular, consumes large amounts of hardware in each new generation. Although this work focuses on the Kepler architecture, the findings and conclusions should be applicable to newer architectures. One characteristic not mentioned in the table is the associativity of the caches. While it is known that the caches use a set-associative placement policy, the number of sets and the associativity are not disclosed. The only way to find this out is to benchmark the caches [10].
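
The standard way to benchmark undisclosed cache parameters is pointer chasing (P-chase): a chain of dependent loads whose average latency jumps once the working set exceeds a cache level. The sketch below is a CPU-side analog of that idea under assumed strides and sizes; the GPU microbenchmarks in [10] run the same kind of loop inside a kernel and time it on the device.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Pointer-chasing sketch: an array of indices forms a dependent load chain,
// so each access must wait for the previous one. Sweeping the array size
// (and stride) and watching the average latency jump reveals cache capacity,
// line size, and associativity.
double chaseLatencyNs(std::size_t elements, std::size_t strideElems) {
    std::vector<std::size_t> next(elements);
    for (std::size_t i = 0; i < elements; ++i)
        next[i] = (i + strideElems) % elements;        // fixed-stride chain

    volatile std::size_t idx = 0;
    const std::size_t iters = 1 << 22;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        idx = next[idx];                               // dependent loads
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    // Latency stays flat while the chain fits in a cache level, then jumps.
    for (std::size_t kb = 4; kb <= 4096; kb *= 2) {
        std::size_t elems = kb * 1024 / sizeof(std::size_t);
        std::printf("%5zu KB : %.2f ns/access\n", kb, chaseLatencyNs(elems, 8));
    }
    return 0;
}
```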

Fast, parallel implementation of particle filtering on the GPU architecture

Measurements were done on a PC with an Intel i5-660 CPU (3.33 GHz, 4 MB cache, 4 logical CPUs) and 4 GB of system memory running Ubuntu Linux 11.04 with kernel version 2.6.38-15 (amd64). We used an NVIDIA GeForce GTX 550 Ti GPU with 1 GB of GDDR memory, CUDA toolkit 4.1, and driver version 295.49. The following nvcc compiler options were used to drive the GPU binary code generation: -arch=sm_20;-use_fast_math. We also made some measurements with -arch=sm_13. The host C code was compiled with gcc 4.5; the compiler flag was -O2. GPU kernel running times were measured with the official profiler provided by the toolkit, and the global times were measured by the OS's own timer. The kernel time measurements include the particle filtering kernel of a single time step; the global times include all operations during the execution for all states (file I/O, memory allocations, computational operations, etc.).

Matching Memory Access Patterns and Data Placement for NUMA Systems

Several authors have noticed the problem of uniform data sharing. Thekkath et al. [16] show that clustering threads based on the level of data sharing introduces load imbalances and cancels the benefits of data locality. Tam et al. [14] schedule threads with a high degree of data sharing onto the same last-level cache. The authors admit that their method works only with programs without uniform data sharing. Verghese et al. describe an OS-level dynamic page migration scheme [18] that migrates thread-private pages but does not consider uniformly shared pages. Therefore, their results show a significant amount of remaining remote memory accesses. An approach similar to the work of Verghese et al. is described by Nikolopoulos et al. [12]. Remote memory accesses are not eliminated in this approach either. Marathe et al. [10] describe the profile-based memory placement methodology we use, but they do not discuss how uniform data sharing influences the effectiveness of their method. Tikir et al. [17] present a profile-based page placement scheme. Although successful in reducing the percentage of remote memory accesses for many benchmark programs, their method is not able to eliminate a significant portion of remote memory accesses for some programs, possibly due to uniform data sharing.

Understanding the ISA impact on GPU Architecture.

Figure 2 shows the block diagram of the GPU microarchitecture for a SIMT Core. The Fetch Unit, Decode Unit, Scoreboard, SIMT Stack, Issue Unit, Register Read, ALU (Arithmetic and Logic Unit), Memory, SFU (Special Function Unit), L1 cache, constant cache, texture cache, and shared memory are the different blocks of the SIMT Core (not all are shown in Figure 2). As mentioned previously, the Thread Block Scheduler dispatches a thread block to a SIMT Core, and threads are executed on the SIMT Core at warp granularity. The SIMT Core divides the thread block into multiple warps, and all of these warps can execute in parallel with each other. For example, if the number of threads in a thread block is 256, the SIMT Core will create 8 warps, given that a single warp contains 32 threads. Each thread in a thread block/warp is identified via a unique thread identifier. Once the threads are allocated and warp creation is complete, the SIMT Core starts execution. Fetch unit: this unit is responsible for sending instruction requests to the instruction cache. The Fetch Unit interacts with the SIMT Stack to find the program counter (PC) of the next instruction.
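
The warp-count arithmetic in the example above is just a ceiling division by the warp size, as the sketch below shows for the 256-thread case and for a block size that is not a multiple of 32.

```cpp
#include <cstdio>

// Warps created for a thread block: the block is split into groups of
// warpSize threads, rounding up when the block size is not a multiple of 32.
unsigned warpsPerBlock(unsigned threadsPerBlock, unsigned warpSize = 32) {
    return (threadsPerBlock + warpSize - 1) / warpSize;   // ceiling division
}

int main() {
    // The example from the text: a 256-thread block yields 8 warps.
    std::printf("256 threads -> %u warps\n", warpsPerBlock(256));
    // A non-multiple block size still occupies whole warps (lanes are masked).
    std::printf("200 threads -> %u warps\n", warpsPerBlock(200));
    return 0;
}
```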

GPU Memory Architecture Optimization.

In allocate-on-fill, a cache line slot is chosen only when the required data has returned from lower memory levels. In either policy, if any of the required resources is not available, a reservation failure occurs and the memory pipeline is stalled. The allocated MSHR is reserved until the data is fetched from the L2 cache or off-chip memory, while the miss-queue entry is released once the miss request is forwarded to the lower memory hierarchy. Since allocate-on-fill preserves the victim cache line longer in the cache before eviction and reserves fewer resources for an outstanding miss, it tends to enjoy more cache hits and fewer reservation failures, and in turn better performance, than allocate-on-miss. Although allocate-on-fill requires extra buffering and flow-control logic to fill data into the cache in order, the in-order execution model and the write-evict policy make the GPU L1 D-cache friendly to allocate-on-fill, as there is no dirty data to write to L2 when a victim cache line is evicted at fill time. Therefore, it is intriguing to investigate how well allocate-on-fill performs for GPGPU applications.
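
The resource-reservation difference between the two policies can be summarized with a small model: allocate-on-miss must reserve a line slot, an MSHR, and a miss-queue entry at miss time, whereas allocate-on-fill defers the line slot until the fill returns. The sketch below is a simplified illustration of those rules, not the actual hardware structures or simulator code.

```cpp
#include <cstdio>

// Simplified resource model for the two L1 miss-allocation policies,
// reflecting only the reservation rules described in the text.
struct L1MissResources {
    int freeLines, freeMshrs, freeMissQueue;

    // allocate-on-miss: line slot + MSHR + miss-queue entry are all reserved
    // at miss time; any shortage is a reservation failure (pipeline stall).
    bool reserveOnMiss() {
        if (freeLines < 1 || freeMshrs < 1 || freeMissQueue < 1) return false;
        --freeLines; --freeMshrs; --freeMissQueue;
        return true;
    }

    // allocate-on-fill: only the MSHR and miss-queue entry are reserved at
    // miss time; the victim line is chosen later, when the fill returns.
    bool reserveOnFill() {
        if (freeMshrs < 1 || freeMissQueue < 1) return false;
        --freeMshrs; --freeMissQueue;
        return true;
    }

    void missForwardedToL2() { ++freeMissQueue; }  // queue entry released early
    void fillReturned(bool onFill) {               // MSHR freed; line taken now
        ++freeMshrs;
        if (onFill) --freeLines;
    }
};

int main() {
    L1MissResources r{/*lines*/0, /*mshrs*/4, /*missQueue*/4};  // no free line slot
    std::printf("allocate-on-miss succeeds: %d\n", r.reserveOnMiss()); // 0: stalls
    std::printf("allocate-on-fill succeeds: %d\n", r.reserveOnFill()); // 1: proceeds
    return 0;
}
```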

Design of cache memory mapping techniques for low power processor

Cache systems are on-chip memory elements in which needed data can be stored. The miss rate of the cache memory can be determined by the controller. When the data required by the microprocessor is found, a cache hit is said to occur. The common purpose of storing data in the cache is to achieve faster data reading times, but energy consumption is one of the drawbacks of on-chip memory. As cache memory moves farther away from the CPU, both the access time and the size of the cache storage unit increase. Cache memory is an additional, fast memory unit placed between the processing unit and the physical memory. The most frequently used instructions and data, i.e., information that will need to be accessed again, are stored in the cache. The physical memory and external disk storage devices can be accessed faster through the internal registers and the cache, which are located near the CPU. A cache that is accessed faster can also be considered more power efficient. To address this challenge and to meet sustainable computing goals, several energy-efficient techniques have been proposed for the cache architecture. Static and dynamic power consumption together determine the total power consumption. If the required data is present, the processor reads from or writes to the cache immediately, which is much smaller and faster for reading and writing the required data. Three independent caches are used in modern desktop and server CPUs: an instruction cache to speed up instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) to speed up virtual-to-physical address translation. Using two levels of memory to reduce the average access time works in principle over the course of execution of a program. One obvious advantage of a logical cache is that cache access is faster than for a physical cache, because this
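
The benefit of the two-level arrangement is usually expressed as the average memory access time, AMAT = hit time + miss rate × miss penalty. The sketch below works through the formula with illustrative latencies and hit rates that are not taken from the paper.

```cpp
#include <cstdio>

// Average memory access time (AMAT) for a two-level hierarchy:
//   AMAT = hitTime + missRate * missPenalty
// The numbers below are illustrative assumptions only.
double amat(double hitTimeNs, double missRate, double missPenaltyNs) {
    return hitTimeNs + missRate * missPenaltyNs;
}

int main() {
    double cacheHit = 1.0;     // ns, on-chip cache access
    double memPenalty = 100.0; // ns, main-memory access on a miss
    // With a high hit rate the average access is far closer to the cache
    // latency than to the memory latency, which is the point of the cache.
    std::printf("AMAT at 95%% hits: %.1f ns\n", amat(cacheHit, 0.05, memPenalty));
    std::printf("AMAT at 80%% hits: %.1f ns\n", amat(cacheHit, 0.20, memPenalty));
    return 0;
}
```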

DOPA: GPU based protein alignment using database and memory access optimizations

substitution matrix is not random anymore: multiple substitution scores can be loaded simultaneously when aligning the query with a database character. Furthermore, query sequence lookups are not required anymore; only the current position within the query is needed to index into the profile. A query profile is generated once for every query sequence. Each query profile column stores values for 23 characters. The number of columns, and hence the memory requirement for a query profile, depends on the length of the query sequence. The GTX 275 GPU used for our implementation has 8 KB of texture cache per multiprocessor. This means that a query sequence having more than ⌊8 × 1024/23⌋ = 356 characters will result in increased cache misses, as described in [22]. Tests were performed to quantify the texture cache miss rate, which was shown to be very small. For example, aligning an 8000-character query sequence resulted in a 0.009% miss rate. Using this query profile method resulted in a 17% performance improvement with Swiss-Prot [24].
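
A query profile is essentially the substitution matrix re-indexed by query position, so that the alignment kernel reads scores sequentially. The sketch below builds such a profile with a placeholder scoring function and the 23-character alphabet mentioned above; the layout and scores are assumptions, not the DOPA implementation.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Query-profile construction: for every query position q and every possible
// database character c, precompute score(query[q], c). Alignment then indexes
// the profile by (q, c) instead of doing a random substitution-matrix lookup.
// The alphabet size of 23 follows the text; the scoring function is a
// stand-in, not BLOSUM62.
constexpr int kAlphabet = 23;

int substitutionScore(int a, int b) {
    return (a == b) ? 5 : -3;   // placeholder scoring, for illustration only
}

std::vector<int> buildQueryProfile(const std::vector<int>& query) {
    std::vector<int> profile(query.size() * kAlphabet);
    for (std::size_t q = 0; q < query.size(); ++q)
        for (int c = 0; c < kAlphabet; ++c)
            profile[q * kAlphabet + c] = substitutionScore(query[q], c);
    return profile;
}

int main() {
    std::vector<int> query = {0, 4, 7, 4, 19};           // encoded residues
    std::vector<int> profile = buildQueryProfile(query);
    // During alignment: score = profile[qPos * kAlphabet + dbChar];
    std::printf("profile size: %zu entries (%zu columns x %d rows)\n",
                profile.size(), query.size(), kAlphabet);
    return 0;
}
```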

Optimization of GPU Based Main Memory Hash Join

Since the UVA mechanism allows in-kernel access to CPU main memory, we measured the total time of the table R transfer plus the building kernel to evaluate the performance of pinned memory versus UVA. The results, listed in Figure 6, suggest that pinned memory is about 37% faster than UVA. Our results indicate that UVA does not deliver better performance than pinned memory when executing the building kernel, partly due to the random patterns of memory accesses. In fact, Negrut et al. [12] suggest that UVA is preferable when the memory accesses have a high degree of spatial and temporal coherence.

An Enhancement of Futures Runtime in Presence of Cache Memory Hierarchy

If there are no runnable threads in G, the runtime traverses the thread groups searching for a thread with a stealable continuation. The order in which the thread groups are traversed matters, and the next section deals with an optimal traversal strategy in the presence of a cache memory hierarchy. Let t be the thread being examined at a particular time instance. The runtime tries to remove the continuation C with the minimal depth from the task queue of t and, if successful, starts a fresh thread t in the thread group G that resumes the stolen continuation. The newly started thread is taken from a thread pool. The stolen continuation C was originally put into the task queue of t when t was assigned a future called from C. When the future called from C is completed, t will try to re-
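
The stealing step described here can be sketched as a traversal over thread groups that removes the minimal-depth continuation from a victim's task queue. The types and the traversal order below are assumptions for illustration, not the futures runtime itself.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Sketch of the stealing step: walk the thread groups in some traversal order
// and take, from a victim thread's task queue, the continuation with the
// minimal depth.
struct Continuation { int depth; /* plus captured state in a real runtime */ };

struct WorkerThread {
    std::deque<Continuation> taskQueue;   // continuations awaiting futures
};

struct ThreadGroup {
    std::vector<WorkerThread> threads;
};

std::optional<Continuation> stealMinDepth(std::vector<ThreadGroup>& groups) {
    for (ThreadGroup& g : groups) {                 // traversal order matters
        for (WorkerThread& t : g.threads) {
            if (t.taskQueue.empty()) continue;
            std::size_t best = 0;
            for (std::size_t i = 1; i < t.taskQueue.size(); ++i)
                if (t.taskQueue[i].depth < t.taskQueue[best].depth) best = i;
            Continuation c = t.taskQueue[best];     // minimal-depth continuation
            t.taskQueue.erase(t.taskQueue.begin() + best);
            return c;                               // resumed on a pool thread
        }
    }
    return std::nullopt;                            // nothing stealable
}

int main() {
    std::vector<ThreadGroup> groups(2);
    groups[1].threads.resize(1);
    groups[1].threads[0].taskQueue = {{3}, {1}, {2}};
    auto c = stealMinDepth(groups);
    return c && c->depth == 1 ? 0 : 1;              // stole the shallowest one
}
```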

Analyzing and Characterizing Space and Time Sharing of the Cache Memory

to 50,000 cycles. Much of the variation is the result of the OS service handler going through multiple execution paths, where different paths are chosen according to input parameters from the application program as well as the current state of the sys_read handler. For example, sys_read may take a shorter path if the data to be read already exists in a buffer. Otherwise, it may go through a different execution path to request a data transfer from the disk, or trigger page faults if a buffer needs to be allocated, etc. At this point, it may be tempting to further divide an OS service interval into multiple intervals so that each smaller interval has uniform behavior. However, we note that our OS service interval is already very fine-grain, so going to smaller intervals (hundreds to a few thousands of instructions) would introduce complexities in defining the boundaries between intervals and tracking them, as well as inaccuracy in characterizing an interval's performance, because its performance would depend much more on various processor pipeline and cache states, which can only be tracked through time-consuming processor and cache simulation models.

Effective Use of Cache Memory in Multi-Core Processor

ABSTRACT: Generally, the CPU performs fetch, decode, and execute operations. Most of today's multi-core processors feature shared caches. In a shared cache, the cache is split into blocks that are used at random by each core of the multi-core processor. Because of this, each core experiences severe delay and the processing speed is reduced; this happens due to the longer time taken for memory read operations compared to the CPU. The problem faced by such architectures is cache contention. So far, the time required for fetching is longer than that of the execution process, so the fetching time must be reduced. To address this problem, we have implemented a program that allows each core of the processor to use the cache memory at the same time. In our approach, we execute all jobs of each core through parallel processing. It is observed that the delay, power consumption, and memory usage are all reduced effectively when the cache memory is used in parallel by all cores.

Hardware support for Local Memory Transactions on GPU Architectures

In the K-Means (KM) benchmark, for each point belonging to a set of N points, we calculate its closest center from a set of K centers. Then, each center is updated based on the geometrical center of its closest points. This is done for a set number of iterations, or until the system converges. In our implementation, each work-item processes a point. Calculating the closest center is executed in parallel, as there are no dependencies. The transactional region comprises updating the values of the centers, as several work-items will try to modify the same center speculatively. For KM, we considered 256 three-dimensional random points (one per work-item within the work-group) and a single iteration. This is a memory-bound application that features read-modify-write operations on multiple memory addresses within a transaction. For this benchmark, we executed a range of experiments, modifying the number of centers in order to test GPU-LocalTM under scenarios with different conflict probabilities and to change the number of memory locations accessed. The experiment KM2 considers 2 centers and has a high probability of conflicts. The number of centers is doubled in each experiment up to KM256. This last scenario presents a small probability of conflict, but the number of addresses shared among the different transactions is high (producing a high false-positive rate in the Bloom filters).
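
The structure of the benchmark can be seen in a sequential sketch of one K-Means iteration: the closest-center search is embarrassingly parallel, while the accumulation into the shared centers is the part that runs as a transaction in GPU-LocalTM. The code below is plain C++ with the transactional region only marked by comments; it does not model the work-items or the transactional memory hardware.

```cpp
#include <cfloat>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Sequential sketch of one K-Means iteration, marking which part corresponds
// to the transactional region in the benchmark above. In the GPU version each
// point is handled by one work-item; here a plain loop stands in for that.
struct Pt { float x, y, z; };

int closestCenter(const Pt& p, const std::vector<Pt>& centers) {
    int best = 0; float bestD = FLT_MAX;
    for (int k = 0; k < (int)centers.size(); ++k) {
        float dx = p.x - centers[k].x, dy = p.y - centers[k].y, dz = p.z - centers[k].z;
        float d = dx * dx + dy * dy + dz * dz;
        if (d < bestD) { bestD = d; best = k; }
    }
    return best;                       // data-parallel, no dependencies
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(0.f, 1.f);
    std::vector<Pt> points(256), centers(4);
    for (auto& p : points)  p = {u(rng), u(rng), u(rng)};
    for (auto& c : centers) c = {u(rng), u(rng), u(rng)};

    std::vector<Pt> sum(centers.size(), {0, 0, 0});
    std::vector<int> count(centers.size(), 0);
    for (const Pt& p : points) {
        int k = closestCenter(p, centers);
        // --- transactional region: several work-items may update the same
        // center concurrently, so these read-modify-writes run speculatively ---
        sum[k].x += p.x; sum[k].y += p.y; sum[k].z += p.z;
        count[k]++;
        // --- end transactional region ---
    }
    for (std::size_t k = 0; k < centers.size(); ++k)
        if (count[k]) centers[k] = {sum[k].x / count[k], sum[k].y / count[k], sum[k].z / count[k]};
    std::printf("center 0 moved to (%.2f, %.2f, %.2f)\n",
                centers[0].x, centers[0].y, centers[0].z);
    return 0;
}
```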
