Chapter 1. Introduction

1.4. Background and related work

1.4.3. Configuration and data prefetching

As described throughout this chapter, one of the main issues in the scheduling problem for reconfigurable architectures lies in the reconfiguration and data transfers. Below, we summarize several works in this field.

⋆ Configuration prefetching for partial reconfiguration

Li and Hauck [LH02] proposed configuration prefetching techniques that reduce the reconfiguration overhead by overlapping configuration loading with computation. They investigated several techniques, including static configuration prefetching, dynamic configuration prefetching, and hybrid prefetching. Their work is based on the Relocation + Defragmentation (R+D) FPGA model [CCKH00], which further improves hardware utilization. Relocation allows the final placement of a configuration within the FPGA to be determined at runtime, while defragmentation provides a method to consolidate unused area within an FPGA at runtime without unloading useful configurations.

Static prefetching is a compiler-based approach that inserts prefetch instructions after performing control flow and data flow analysis based on profile information and data access patterns. Dynamic prefetching determines and dispatches prefetches at runtime, using more data access information to make accurate predictions. Hybrid prefetching combines the strong points of both approaches.

Static configuration prefetching starts by computing the potential penalties for a set of prefetches at each node of the control flow graph. The algorithm then determines the configurations that need to be prefetched at each instruction node, based on the penalties calculated in the previous phase. Prefetches are generated subject to the size of the chip. Later, the algorithm trims the redundant prefetches generated in the previous stage. Termination instructions are also inserted.

Dynamic configuration prefetching is based on a Markov model in which the occurrence of a future state depends on the immediately preceding state, and only on it. In order to find good candidates to prefetch, Markov prefetching updates the probability of each transition using the currently available access information. After the execution of a hardware task, the dynamic prefetching algorithm sorts the probabilities and selects the first candidate. Then, the algorithm issues prefetch requests for each candidate that is not currently on-chip.
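The Markov-based selection described above can be illustrated with a minimal sketch. It is not the implementation of [LH02]: the class name `MarkovPrefetcher`, the `fanout` parameter, and the use of raw transition counts as probability estimates are assumptions made only for illustration.

```python
from collections import defaultdict


class MarkovPrefetcher:
    """First-order Markov predictor over configuration identifiers (hypothetical sketch)."""

    def __init__(self, fanout=1):
        # counts[a][b] = number of times configuration b executed right after a.
        # Counts are proportional to the estimated transition probabilities.
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last = None      # most recently executed configuration
        self.fanout = fanout  # assumed knob: number of candidates to select

    def record(self, config):
        """Update the transition statistics after a hardware task finishes."""
        if self.last is not None:
            self.counts[self.last][config] += 1
        self.last = config

    def candidates(self):
        """Sort the successors of the last configuration by likelihood."""
        successors = self.counts.get(self.last, {})
        # Ties are broken by name so that the result is deterministic.
        ranked = sorted(successors, key=lambda c: (-successors[c], c))
        return ranked[: self.fanout]

    def prefetch_requests(self, on_chip):
        """Issue requests only for candidates that are not already loaded."""
        return [c for c in self.candidates() if c not in on_chip]


prefetcher = MarkovPrefetcher()
for executed in ["A", "B", "A", "B", "A", "C", "A"]:
    prefetcher.record(executed)

# After "A", configuration "B" has followed twice and "C" once.
requests = prefetcher.prefetch_requests(on_chip={"A"})
```

In this trace the predictor selects `B` as the successor of `A`; if `B` were already on-chip, no request would be issued.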

Hybrid configuration prefetching integrates static prefetching with dynamic prefetching to avoid mispredictions. The static prefetches are used to correct wrong predictions made by the dynamic prefetching, as can occur for transitions that jump out of a loop.

Our work also applies a prefetching approach, which covers both configurations and data. Configuration and data prefetching are initially guided by the static scheduler according to the application profiling, and are then adapted during execution by a very simple runtime monitor.

Qu et al. [QSN06] proposed a configuration model that enables configuration parallelism in order to reduce configuration latency. Their approach consists of dividing the configuration SRAM into sections in such a way that multiple sections can be accessed in parallel by multiple configuration controllers. Each configuration SRAM section is attached to programmable logic, forming a tile. The complete device consists of a number of continuously connected homogeneous tiles. A crossbar connects the configuration SRAMs of the tiles to a number of parallel configuration controllers. The authors defined a prefetch scheduling that loads tasks whenever tiles and configuration controllers are available, instead of waiting until tasks become ready. The tasks are modeled as directed acyclic graphs and scheduling is performed at design time. The tasks are scheduled by means of a priority function that takes into account the urgency of execution of a task, how much benefit a task can get if its configuration starts immediately, and how many additional configurations have to be delayed if the configuration of the task cannot start immediately. This work does not include a runtime scheduling stage. In addition, it is highly specialized for a device model with multiple configuration controllers.
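As a rough illustration of priority-driven prefetch scheduling, the sketch below ranks candidate tasks and starts as many configuration loads as the free tiles and free configuration controllers allow. The weighted sum, the weights, and all names (`Task`, `priority`, `pick_loads`) are assumptions for illustration; the actual priority function of [QSN06] is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    urgency: float  # how urgently the task must execute
    benefit: float  # gain if its configuration starts immediately
    delayed: int    # configurations delayed if this one cannot start now


def priority(task, w_urgency=1.0, w_benefit=1.0, w_delayed=1.0):
    """Assumed priority: a weighted sum of the three factors named in the text."""
    return (w_urgency * task.urgency
            + w_benefit * task.benefit
            + w_delayed * task.delayed)


def pick_loads(candidates, free_tiles, free_controllers):
    """Select the tasks whose configurations start loading now.

    A load needs one free tile and one free configuration controller,
    so the number of simultaneous loads is bounded by the scarcer resource.
    """
    slots = min(free_tiles, free_controllers)
    ranked = sorted(candidates, key=priority, reverse=True)
    return [task.name for task in ranked[:slots]]


tasks = [
    Task("A", urgency=1.0, benefit=1.0, delayed=0),  # priority 2.0
    Task("B", urgency=3.0, benefit=0.0, delayed=2),  # priority 5.0
    Task("C", urgency=0.0, benefit=2.0, delayed=1),  # priority 3.0
]
loads = pick_loads(tasks, free_tiles=3, free_controllers=2)
```

With three free tiles but only two free controllers, two loads start, and the two highest-priority tasks (`B`, then `C`) are chosen.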

Lee et al. [LYC02] presented a method for runtime reconfiguration scheduling in reconfigurable SoC design to hide the reconfiguration latency. Their work is based on a formal model of computation, hierarchical FSM with synchronous data flow (HFSM-SDF). The basic idea consists of knowing the exact order of required configurations during runtime. To obtain this order, the authors exploit the inherent property of HFSM-SDF that the execution order of SDF actors can be determined before the execution of a state transition of the top FSM. For each transition of the top FSM, they compute the exact order of configurations in a ready configuration queue by traversing the hierarchical FSM. Then, using the queue, a runtime reconfiguration scheduler launches configuration fetches as early as possible during the execution of the state transition of the top FSM. This work depends heavily on the model of computation. Moreover, the authors do not deal with the problem of supplying data for the early fetched configurations. Our work considers the data required by a prefetched task.

⋆ Data prefetching

Data prefetching is a technique that has been proposed for hiding the latency of main memory accesses. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, data prefetching must be timely, useful, and introduce little overhead. Prefetch strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. All approaches have their own design trade-offs. For a complete survey of data prefetch mechanisms in general-purpose systems, we refer to the work published by Vanderwiel and Lilja [VL00].

Chai et al. [CCE⁺05] proposed a flexible memory subsystem for stream computation which uses stream units to move stream data between memory and processors. The stream units prefetch and align data based on stream descriptors [CBK⁺06], a mechanism that allows programmers to indicate data movement explicitly by describing their memory access patterns. The memory subsystem builds upon configurable stream units that move data while computation is performed. The stream units are specialized DMA units that are optimized for stream data transfer.

They rely on a set of stream descriptors, which define the memory access pattern, to prefetch and align data in the order required by the computing platform. Stream units take advantage of available bandwidth by prefetching data before it is needed. Stream computation is very popular in the multimedia domain. However, there is a class of applications whose data accesses are not regular, and for which the streaming memory hierarchy may not be enough to supply data to the computing platform efficiently. Our target architecture also includes an on-chip memory for streaming applications. In this work, we have added a new on-chip memory in order to handle non-streaming data access patterns.
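To make the notion of a stream descriptor more concrete, the following sketch expands a descriptor into the sequence of addresses that a stream unit would prefetch. The field names (`start`, `stride`, `span`, `skip`, `count`) and their exact semantics are assumptions chosen for illustration and are not taken from [CBK⁺06].

```python
from dataclasses import dataclass


@dataclass
class StreamDescriptor:
    start: int   # address of the first element
    stride: int  # distance between consecutive elements within a group
    span: int    # number of elements per group
    skip: int    # distance from the last element of a group to the next group
    count: int   # total number of elements in the stream


def addresses(desc):
    """Expand a descriptor into the ordered list of element addresses."""
    result = []
    addr = desc.start
    for i in range(desc.count):
        result.append(addr)
        # After every `span` elements, jump to the start of the next group.
        if (i + 1) % desc.span == 0:
            addr += desc.skip
        else:
            addr += desc.stride
    return result


# A 4-element-wide block taken from rows that are 8 elements long.
block = StreamDescriptor(start=0, stride=1, span=4, skip=5, count=8)
order = addresses(block)  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Because the whole access order is known from the descriptor alone, the transfer can be issued ahead of the computation. An irregular pattern, such as a pointer-chasing traversal, cannot be captured by a compact description of this kind.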

Vuletić et al. [VPI04] presented an operating system (OS) module that monitors reconfigurable coprocessors, predicts their future memory accesses, and performs memory prefetching accordingly. Their goal is to hide memory-to-memory communication latency. Their OS-based prefetching is applied to the Virtual Memory Window (VMW), which enables the coprocessors to share the virtual memory address space with user applications. The technique presented in their work is essentially a dynamic software technique that uses hardware support to detect the memory access patterns of the coprocessors. During a VMW-based coprocessor execution, the OS sleeps for a significant amount of time. During this idle time, the VMW manager surveys the execution of the coprocessor and anticipates its future requests. During coprocessor operation, the Window Management Unit (WMU) informs the manager about the pages accessed by the coprocessor. Based on this information, the manager predicts future activities of the coprocessor and schedules prefetch-based loads of virtual memory pages. The WMU provides hardware support for the translation of the coprocessor virtual addresses and for accessing the window memory. The only inputs to the predictor are miss addresses and accessed page numbers. The predictor assumes that for each miss a new stream is detected, and it schedules a speculative prefetch for the page following the missing one. This technique can be effectively applied

In document Configuration and data scheduling techniques for executing dynamic applications onto multicontext reconfigurable systems (Page 46-50)