• No results found

Clearing the Shadow Tags

III.2 Previous Work

III.5.2 Clearing the Shadow Tags

Initially, we were concerned that not clearing the shadow tag directory between two configurations would pollute subsequent measurements. We examined the effect of copying the contents of the real tag directory to the shadow tag directory after each reconfiguration. We found that copying had little impact on performance (we observed only a 0.1% improvement by using a copying function). In addition, the cost of such a mechanism would be prohibitive both in terms of area and power.

III.6

Conclusion

In this paper we present a novel technique for dynamic parameterization of pre- fetching heuristics. By using this technique we observe an overall improvement in IPC by 24% over the static configuration and a 18% improvement over feedback- directed prefetching. However, the relatively large improvements seen on single benchmarks are equally important. In addition, we observed no regressions against the case of no prefetching, which makes the method robust and only one regression against the case of a static prefetcher.

In terms of cost, the L2 is increased in size by 1.6% for an overall gain of 24%. It might be possible to reduced this to about 0.06% if the methods by Dybdahl et al. and Qureshi et al. can be used with prefetching. The shadow tag directory can be used for other purposes as well, such as optimizing memory level parallelism and enhancing the cache replacement policy, thus amortizing this cost over several performance enhancing techniques.

Bibliography

[1] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general- purpose processor architectures. In MICRO 33: Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, pages 245–257, 2000. ISBN 1-58113-196-8. doi: http://doi.acm.org/10.1145/360128.360153.

[2] D. Burger and T. M. Austin. Simplescalar toolset 3.0b, 2003. http://www. simplescalar.com.

[3] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose, P. Cook, and D. Albonesi. An adaptive issue queue for reduced power at high performance. In Power- Aware Computer Systems: First International Workshop, PACS 2000, 2000.

[4] T.-F. Chen and J.-L. Baer. Effective hardware-based data prefetching for high- performance processors. Computers, IEEE Transactions on, 44:609–623, May 1995.

[5] A. Dhodapkar and J. Smith. Managing multi-configuration hardware via dy- namic working set analysis. In Proceedings. 29th Annual International Sym- posium on Computer Architecture, pages 233–244, 2002.

[6] A. S. Dhodapkar and J. E. Smith. Comparing program phase detection tech- niques. In Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on, 2002.

[7] H. Dybdahl, P. Stenstrom, and L. Natvig. An LRU-based replacement al- gorithm augmented with frequency of access in shared chip-multiprocessor caches. In MEDEA Workshop, PACT, 2006.

[8] H. Dybdahl, P. Stenstrom, and L. Natvig. A cache-partitioning aware replace- ment policy for chip multiprocessors. In Proceedings of IEEE International Conference on High Performance Computing, 2006.

[9] J. L. Hennesey and D. A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann Publishers, 2003. ISBN 1-55860- 724-2.

[10] A. KleinOsowski and D. J. Lilja. MinneSPEC: A new spec benchmark work- load for simulation-based computer architecture research. Computer Architec- ture Letters, 1, June 2002.

[11] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. Micro, IEEE, 25:90–97, Jan. 2005.

[12] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An adaptive data cache prefetcher. In Proceedings of the 13th International Conference on Par- allel Architecture and Compilation Techniques, pages 135–145, 2004.

[13] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.

[14] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA ’06: Proceedings of the 33rd annual international symposium on Computer Architecture, pages 167–178, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2608-X. doi: http://dx.doi.org/ 10.1109/ISCA.2006.5.

[15] T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In Proceedings. 30th Annual International Symposium on Computer Architecture, pages 336– 347, 2003.

[16] P. Shivakumar and N. Jouppi. Cacti 3.0: An integrated cache timing, power, and area model. Technical Report 2, Compaq Western Research Laboratory, August 2001.

[17] A. J. Smith. Cache memories. ACM Comput. Surv., 14(3):473–530, 1982. ISSN 0360-0300. doi: http://doi.acm.org/10.1145/356887.356892.

[18] SPEC. Spec 2000 benchmark suites, 2000. http://www.spec.org.

[19] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. Technical report, University of Texas at Austin, May 2006. TR-HPS-2006-006.

Low-Cost Open-Page

Prefetch Scheduling in Chip

Multiprocessors

Marius Grannæs, Magnus Jahre and Lasse Natvig

In XXVI IEEE International Conference on Computer Design

(ICCD), 2008

Abstract

The pressure on off-chip memory increases significantly as more cores compete for the same resources. A CMP deals with the memory wall by exploiting thread level parallelism (TLP), shifting the focus from reducing overall memory latency to memory throughput. This extends to the memory controller where the 3D structure of modern DRAM is exploited to increase throughput.

Traditionally, prefetching reduces latency by fetching data before it is needed. In this paper we explore how prefetching can be used to increase memory throughput. We present our own low-cost open-page prefetch scheduler that exploits the 3D structure of DRAM when issuing prefetches. We show that because of the complex structure of modern DRAM, prefetches can be made cheaper than ordinary reads, thus making prefetching beneficial even when prefetcher accuracy is low. As a result, prefetching with good coverage is more important than high accuracy. By exploiting this observation our low-cost open page scheme increases performance and QoS. Furthermore, we explore how prefetches should be scheduled in a state of the art memory controller by examining sequential, scheduled region, CZone/Delta Correlation and reference prediction table prefetchers.

IV.1

Introduction

Chip Multiprocessors have been introduced by virtually all makers of high perfor- mance processors. CMPs shifts the focus away from the traditional uniprocessor paradigm, where low latency and instruction-level parallelism (ILP) is important to a paradigm where throughput and thread-level parallelism (TLP) dominates.

This shift is reflected in the memory subsystem as well, where the memory con- trollers have traditionally been used to reduce system latency. However, as more cores are added to a chip, off-chip bandwidth are shared across cores, thus in- creasing the pressure on this resource and lowering locality in the memory access stream. Thus, memory controllers have been designed to optimize for maximum throughput, at the expense of increasing worst-case latency.

This increase in throughput has been made possible by exploiting the 3D structure of modern DRAM [4]. DRAM is organized in several banks. Each bank is organized as a matrix of rows and columns of DRAM cells as shown in figure IV.1. In a normal read operation, a bank and a row is first selected for activation. The charges from this row of capacitors are then amplified by sense-amplifiers in the DRAM module and stored in a large latch. Each such row is commonly referred to as a page. A page is normally about 1KB to 4KB large, whereas a cacheline is typically 64-256B large. Thus, a page will typically hold several consecutive cachelines. The portion of the page that was requested is then transferred over the data-bus. When the page is no longer needed, the memory controller instructs the DRAM module to write the latch contents back into the DRAM cells, preserving the contents of the page. This is referred to as closing the page.

Figure IV.1: The 3D structure of modern DRAM.

In terms of latency, opening and closing a page is expensive, while getting data out of the latches and over the data-bus is comparatively cheap. In addition, there is a minimum allowed time between opening and closing a page (the minimum activate- to-precharge latency). Thus a single read is slow, but reading the next cache block

is relatively cheap as the page is already open and the data is in the latch. This property is exploited by the First Ready, First Come, First Served (FR-FCFS) memory controller proposed by Rixner et al. [16]. This type of memory controller allows accesses that uses an already open page to be scheduled even if the request is not the oldest.

Traditionally, prefetching has been used to decrease latency for a single operation by speculatively bringing data into the cache before it is needed. In this paper we exploit the 3D structure of modern DRAM to demonstrate how prefetching can be used to increase off-chip bandwidth utilization. Because there is a lower cost associated with fetching data that resides in an open page, we prefetch this data, provided that our confidence that the data will be useful is high enough. In addition, we show that prefetching can be effective at relatively low accuracy, due to the low cost of piggybacking prefetches compared to single reads. Finally, we present our low cost open page prefetching scheduling heuristic which exploits this observation.

IV.2

Previous Work

IV.2.1

Prefetching

Previously, Wei-Fen et al. [8] have examined how prefetches can be scheduled in a uniprocessor context with Rambus DRAM. They used a dedicated prefetch queue with a LIFO insertion policy with a scheduled region prefetcher. In addition, Cantin et al. [2] exploited open pages to increase the performance of their stealth prefetcher.

There exists a multitude of different prefetching schemes. The simplest is the sequential prefetcher [18], which simply fetches the next block whenever a block is referenced. However, more complex types exists as well, such as the CZone/Delta Correlation (C/DC) prefetcher proposed by Nesbit et al. [12, 13]. C/DC divides memory into CZones and analyses patterns contained in the reference stream by using a Global History Buffer (GHB) to store recent misses to the cache. Lin et al. [9] introduced scheduled region prefetching (SRP) which issues prefetches to blocks spatially near the addresses of recent demand missed when the memory channel is idle. Other types, such as the Reference Prediction Table Prefetcher (RPT) proposed by Chen and Baer [3] examines the pattern generated by a load instruction with a state machine. Somogyi et al. proposed Spatial Memory Streaming (SMS) [19]. SMS uses code-correlation to predict spatial access patterns.