Replication - Facing the challenges - Improving prefetching mechanisms for tiled CMP platforms

6.5 Facing the challenges

6.5.1 Replication

The idea behind this strategy is to replicate global prefetching information in each tile. To explain this technique, we will describe the behavior of the prefetcher following this approach in each of the three phases introduced in Section 6.2. As in Section 6.2, Figure 6.9 shows four tiles that represent a subset of the entire tiled system. In this section, we have also tagged the arrows with a number that represents the behavior in each phase.

1. Analysis phase: When the shared memory is accessed, a new notification message is generated with the information about the memory access (the address, the core that generated the demand request, whether the access was a hit, a miss, or a useful prefetch, etc.) and broadcast to all the tiles in the system. The prefetcher information is replicated in each tile, which means that all the prefetchers read the notification generated by the accessed memory module and update their state, keeping separate information for the different cores. This isolation can be achieved statically (using separate tables) or dynamically (extending the access index with the core identifier).

2. Request generation phase: At the end of the analysis phase, several requests may be generated. Note that these requests may or may not be addressed to the tile in which they were generated. Since all the prefetching engines have the same access information, all of them generate the same requests. Each engine can therefore filter out the requests that are not addressed to its own tile before they are queued, because the engine in the target tile generates the same request. This filtering prevents extra traffic and redundant prefetches that would hit in the cache.

3. Evaluation phase: The prefetching profiling information will be gathered as usual in the shared memory piece by marking as a useful prefetch, for example, a prefetched cache line that is actually accessed by a demand request.
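The analysis and request generation phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prefetcher evaluated in this study: the block-interleaved address mapping (`home_tile`), the simple per-core stride predictor, the block size, and all class and function names are hypothetical. The sketch shows per-core isolation of the state and the filtering of requests that are not addressed to the local tile.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)


def home_tile(addr, num_tiles):
    """Assumed mapping: consecutive blocks are interleaved across tiles."""
    return (addr // BLOCK_SIZE) % num_tiles


class ReplicatedPrefetcher:
    """One replica per tile; every replica sees every access notification."""

    def __init__(self, tile_id, num_tiles, degree=2):
        self.tile_id = tile_id
        self.num_tiles = num_tiles
        self.degree = degree   # requests generated per notification
        self.last_block = {}   # state indexed by core id (dynamic isolation)
        self.queue = []        # requests kept for the local tile

    def notify(self, core_id, addr):
        """Analysis phase, then request generation with filtering."""
        block = addr // BLOCK_SIZE
        prev = self.last_block.get(core_id)
        self.last_block[core_id] = block
        if prev is None or block == prev:
            return  # no stride can be derived yet
        stride = block - prev
        for i in range(1, self.degree + 1):
            target = (block + i * stride) * BLOCK_SIZE
            # Every replica computes the same targets; each one keeps only
            # those that map to its own tile and drops the rest.
            if target >= 0 and home_tile(target, self.num_tiles) == self.tile_id:
                self.queue.append(target)


def broadcast(prefetchers, core_id, addr):
    """The accessed memory module notifies the replica in every tile."""
    for prefetcher in prefetchers:
        prefetcher.notify(core_id, addr)
```

With four tiles and core 0 accessing addresses 0, 64 and 128, the replica in tile 3 ends up holding the request for address 192 twice, while no other tile holds it at all. Duplicates of this kind inside a single tile are the ones that the prefetch queue is expected to filter.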

When this strategy is applied, the challenges are addressed in the following way:

• Pattern detection: The patterns are no longer distributed, because all the modules keep track of all the memory accesses in all the tiles. In this way, each prefetcher keeps a replica of the memory access information, so each replica is able to detect the patterns without any problems. Isolating the information at the core level avoids the distortion of the memory pattern that may be introduced when several cores or applications access the same cache.

Fig. 6.9 Phases of the Replication technique: (1) analysis, (2) request generation, and (3) evaluation.

• Prefetching queue filtering: The requests are queued to the same tile from which they are generated or they are filtered. For this reason, the same request cannot be located in two different tiles. If there is a replicated request in the same tile, the replicated request will be filtered when queued in the prefetch queue.

• Dynamic profiling: Due to the filtering performed in the request generation phase, a distributed memory piece will only receive prefetching requests from a single prefetching engine, that is, the one in the same tile. Therefore, all the prefetched blocks in that memory piece will belong to that prefetching engine, and all the profiling information about the prefetch activity on that piece of the memory will be correct and attributable to the prefetcher in that tile.
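The queue filtering and the profiling described above can be illustrated with a small sketch. It rests on assumptions that are not part of the original design: the class names, the FIFO policy, and the use of a Python set to model the single profiling bit per cache block are all hypothetical.

```python
class PrefetchQueue:
    """FIFO that drops a request that is already waiting in the queue."""

    def __init__(self):
        self._pending = []

    def push(self, addr):
        if addr in self._pending:
            return False  # replicated request inside the tile: filtered
        self._pending.append(addr)
        return True

    def pop(self):
        return self._pending.pop(0) if self._pending else None


class MemoryPiece:
    """Slice of the distributed memory in one tile (profiling only)."""

    def __init__(self):
        self.prefetch_bit = set()  # blocks whose profiling bit is set
        self.issued = 0            # prefetches brought into this piece
        self.useful = 0            # prefetched blocks hit by a demand request

    def fill_prefetch(self, addr):
        self.prefetch_bit.add(addr)
        self.issued += 1

    def demand_access(self, addr):
        # A demand access to a block whose bit is still set is counted as a
        # useful prefetch; the bit is cleared so that it is counted once.
        if addr in self.prefetch_bit:
            self.prefetch_bit.discard(addr)
            self.useful += 1
            return True
        return False
```

Because only the local engine can fill a given memory piece, the ratio `useful / issued` of that piece describes the accuracy of that one engine and no other.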

As has been shown, the replication strategy deals with all the challenges presented in this study. However, each of the techniques proposed in this study has some drawbacks compared to the baseline (which cannot deal with the challenges). Table 6.2 shows the main drawbacks of the techniques proposed for solving the challenges and compares them to the baseline. In the following points, we define these drawbacks and show how the replication is affected by them.

• The total prefetcher size depends on the implementation of each prefetch engine. As it is a highly variable value, we use a reference value (M) that represents the size of the prefetcher in one tile of the baseline. With replication, each replica of the prefetcher must have enough space to store information about the whole access pattern of all the cores. Therefore, the size of the data structures must be scaled appropriately. The maximum size required for a replica is the size of the prefetcher in a tile of the baseline multiplied by the number of tiles (M*N). The total size is then the size of a replica multiplied by the number of replicas.


If there is a replica in each tile, the total size will be the size of a replica multiplied by the number of tiles. However, the size of a replica can be optimized if the prefetcher uses dynamic tables instead of static ones. In that case, the total size of the prefetcher would be M*K*N instead of M*N*N, where K is a variable number between 1 and N. Nevertheless, the performance of the prefetcher will decrease with small values of K.

• The messages per access are the maximum number of notifications that a distributed memory module can send to one or several prefetchers per access. In the case of replication, for every cache access, a notification message is sent to every tile in the system (a broadcast message). Note that in the baseline, the communication between the cache module and the prefetcher only takes place inside the same tile. There are techniques that help to reduce the cost of sending these broadcast messages, and studies are being carried out into solutions to make this feasible using, for example, a graphene-enabled wireless broadcast [1].

• The prefetcher throughput ratio refers to the maximum number of requests that a prefetch module has to process per cycle. In replication, the notification messages must be processed by the prefetcher, and each prefetcher can be the target of one notification message per core and per cycle. For this reason, network contention is not the only problem, given that the prefetcher also needs to be available to service the notifications. In addition, although the requests that are not addressed to the prefetcher's own tile are filtered, they still need to be generated, and each memory access notification may trigger the generation of prefetching requests. Note that the GHB prefetcher (with the configuration used in this study) can generate up to 16 requests per miss in the distributed memory. This means that the throughput requirements of the prefetching engines will probably need to be quite high in order to achieve good efficiency.

• The NoC prefetcher requests are the maximum number of requests that a prefetcher module can inject into the network per cycle. With replication, however, the generated prefetch requests do not need to be sent into the network (as they are in the baseline), because each request is always resolved in the same tile.

• The total traffic increment refers to the total amount of traffic that can be injected into the network per cycle by all the tiles in the system, either through the notifications from the distributed memory module or through the requests injected by the prefetcher. When there is a replication, the messages generated per access significantly increase. However, there is a reduction in the number of requests that are generated by the prefetcher. For example, in a 64-tile chip, each access to the distributed memory would generate 64 new control messages. In the worst of cases, if all the tiles generated a memory access at the same time, there would be 4096 messages generated during the same cycle, which could easily congest the network if not properly managed, whereas in the baseline, in a 64-tile chip, if all the prefetchers generate a request at the same time there would be a peak of 64 messages in the same cycle.

• The extra bits per cache block refers to the extra bits needed in each cache block to store the profiling information. With replication, this is not a problem, because the storage requirements are the same as in the baseline.
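The storage and traffic figures discussed in the points above can be checked with a short calculation. The sketch below is an illustration only: the function names are hypothetical, and a broadcast is counted as one delivered message per tile, following the 64-tile example in the text.

```python
def total_prefetcher_size(m, n, k=None):
    """Total storage with one replica per tile.

    m: size of the prefetcher in one tile of the baseline
    n: number of tiles
    k: scaling factor of a replica; defaults to n (static tables)
    """
    k = n if k is None else k
    if not 1 <= k <= n:
        raise ValueError("K must lie between 1 and N")
    return m * k * n


def peak_messages_per_cycle(n, replicated):
    """Worst case in which every tile is active during the same cycle."""
    # Replication: each of the n accesses is broadcast to n tiles.
    # Baseline: each of the n prefetchers injects one unicast request.
    return n * n if replicated else n
```

For N = 64, `peak_messages_per_cycle(64, True)` gives 4096 and `peak_messages_per_cycle(64, False)` gives 64, which matches the example above. With M = 1, the storage goes from 4096 (static tables, K = N) down to 512 if K = 8 were sufficient.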

Resource                     Baseline   Replicated   Centralized   Distributed
Total prefetcher size        M*N        M*N*N        M*N           M*N
Pref throughput              1          N            N             N → 1
Extra bits per cache block   1          1            1             log2 N
Total traffic increment      N (uni)    N (br)       2N (uni)      N (uni) + N (br)
Messages per access          0          1 (br)       1 (uni)       1 (uni)
NoC pref requests            1 (uni)    0            1 (uni)       1 (br)

Table 6.2 Consumption of resources by the proposed techniques. N: Number of tiles, uni: Unicast message, br: Broadcast message, M: Size of a prefetcher in a tile from the baseline.
