Appendix A - Cache design strategies for efficient adaptive line placement

A.1. SBC additional experiments

A.1.1. Master-Slave SBC

We have implemented an extended version of the DSBC, called Master-Slave SBC, in which destination sets that become highly saturated can themselves be recursively associated without breaking their original association (unless the break condition is fulfilled). As a result, a given set can play the roles of source and destination in different associations at the same time. To support this, every set has two entries in the Association Table (see Section 2.4.1), one for each possible role within an association, which indicate its associated sets. Note that when a set plays the role of source set, it can only displace native lines. Results for the two-level baseline configuration are shown in Table A.1, and results for the three-level configuration in Table A.2. Simulations were performed under the same conditions as in Chapter 2.

Compared with the original DSBC, which obtained average IPC improvements of 3.5% and 5.25% in the two-level and three-level configurations, respectively, the additional gain is almost negligible.
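The two-role bookkeeping described above can be pictured with a minimal sketch. It assumes that each set holds at most one association per role, and every name in it (`ATEntry`, `AssociationTable`, `associate`, `dissociate`, `chain_from`) is hypothetical, chosen only for illustration. It does not model saturation counters or the break condition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ATEntry:
    # Role "source": index of the set this set displaces lines to.
    destination: Optional[int] = None
    # Role "destination": index of the set that displaces lines into this set.
    source: Optional[int] = None


class AssociationTable:
    """Hypothetical Association Table with one slot per role and per set."""

    def __init__(self, num_sets: int) -> None:
        self.entries: List[ATEntry] = [ATEntry() for _ in range(num_sets)]

    def associate(self, src: int, dst: int) -> bool:
        """Link src -> dst if both role slots are free; return True on success."""
        if src == dst:
            return False
        if self.entries[src].destination is not None:
            return False  # src already acts as a source
        if self.entries[dst].source is not None:
            return False  # dst already acts as a destination
        self.entries[src].destination = dst
        self.entries[dst].source = src
        return True

    def dissociate(self, src: int) -> None:
        """Break the association in which src plays the source role."""
        dst = self.entries[src].destination
        if dst is not None:
            self.entries[src].destination = None
            self.entries[dst].source = None

    def chain_from(self, src: int) -> List[int]:
        """Follow destination links starting at src, stopping on a cycle."""
        chain = [src]
        seen = {src}
        cur = self.entries[src].destination
        while cur is not None and cur not in seen:
            chain.append(cur)
            seen.add(cur)
            cur = self.entries[cur].destination
        return chain
```

In this sketch, associating set 0 with set 1 and then set 1 with set 2 leaves set 1 acting as destination of set 0 and as source for set 2 at the same time, which is the situation the Master-Slave extension allows.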

Table A.1: IPC improvement of the Master-Slave SBC over the two-level baseline configuration.

| bzip2 | milc | namd | gobmk | soplex | hmmer | sjeng | libquantum | omnetpp | astar | geomean |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.4% | 4.2% | 0.1% | 0.1% | 3.4% | 3.6% | 0.1% | 2.2% | 17% | 4% | 3.8% |
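As a worked check of the last column, the following sketch computes a geometric mean from the per-benchmark figures of Table A.1. It assumes, since the text does not say so, that the geomean is taken over per-benchmark IPC ratios (1 + improvement) rather than over the raw percentages; the function name `geomean_improvement` is hypothetical.

```python
import math
from typing import Sequence


def geomean_improvement(percentages: Sequence[float]) -> float:
    """Geometric mean of IPC ratios, returned as a percentage improvement.

    Assumption: each percentage p corresponds to an IPC ratio of 1 + p/100.
    """
    log_sum = sum(math.log1p(p / 100.0) for p in percentages)
    return math.expm1(log_sum / len(percentages)) * 100.0


# Per-benchmark values from Table A.1 (geomean column excluded).
table_a1 = [4.4, 4.2, 0.1, 0.1, 3.4, 3.6, 0.1, 2.2, 17.0, 4.0]
print(round(geomean_improvement(table_a1), 1))  # prints 3.8
```

Under that assumption the result agrees with the geomean reported in the table.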

Table A.2: IPC improvement of the Master-Slave SBC over the three-level baseline configuration.

| bzip2 | milc | namd | gobmk | soplex | hmmer | sjeng | libquantum | omnetpp | astar | geomean |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.4% | 4.5% | 3.5% | 4.4% | 5.4% | 4.4% | 4.1% | 4% | 16% | 4.9% | 5.5% |

A.1.2. DSBC with Extra Tags

We have experimented with another extended design of the DSBC in which each set has two extra tags in the Association Table (AT). These tags identify lines that have been displaced from the set to another one. The number of extra tags, two, was chosen according to the average number of displacements observed in the previous experiments (2.15, see Section 2.8.3), so that it suffices for many associations without increasing the cost of the design too much. Each entry also needs a counter, which indicates the number of lines displaced to the destination set. This extended design operates as follows:

When a new association is committed, the tag of the line that is displaced to the destination set is copied into one of the two extra tags of the AT (in the entry corresponding to the source set), while the other one remains invalid. Also, the counter of displaced lines is set to 1.

A subsequent displacement updates the other extra tag and increases the counter.

If a new displacement happens and no extra tag is free, only the counter is increased.

Every time a cache set is accessed, all tags, including the two extra tags in the AT, are checked. If the counter is equal to or greater than 3, or the requested line is found in the extra tags, a second search is needed.


If a displaced line in a destination set is evicted, the counter of the source set is decreased, and one of its tags is invalidated if it corresponded to the evicted line.
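The rules listed above can be sketched as a small state machine for the AT entry of a source set. This is a literal transcription of those rules under simplifying assumptions, not the simulated hardware: tags are plain integers, `None` marks an invalid extra tag, and the names `ExtraTagEntry`, `on_displacement`, `needs_second_search` and `on_displaced_eviction` are hypothetical.

```python
from typing import List, Optional

NUM_EXTRA_TAGS = 2  # value chosen in the text


class ExtraTagEntry:
    """Hypothetical AT entry of a source set: extra tags plus a counter."""

    def __init__(self) -> None:
        self.extra_tags: List[Optional[int]] = [None] * NUM_EXTRA_TAGS
        self.displaced = 0  # number of lines displaced to the destination set

    def on_displacement(self, tag: int) -> None:
        """Record a displaced line: fill a free extra tag if any, always count it."""
        for i, slot in enumerate(self.extra_tags):
            if slot is None:
                self.extra_tags[i] = tag
                break
        self.displaced += 1

    def needs_second_search(self, tag: int) -> bool:
        """A second search is needed if the counter is >= 3 or the tag is tracked."""
        return self.displaced >= 3 or tag in self.extra_tags

    def on_displaced_eviction(self, tag: int) -> None:
        """A displaced line left the destination set: decrement and untrack it."""
        if self.displaced > 0:
            self.displaced -= 1
        for i, slot in enumerate(self.extra_tags):
            if slot == tag:
                self.extra_tags[i] = None
                break
```

With one or two displaced lines, a miss on a tag that is not among the extra tags skips the second search, which is where the saving in secondary accesses would come from.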

The purpose of this design extension is to avoid unnecessary secondary accesses to the destination set after a miss in its source set. Tables A.3 and A.4 show the IPC improvement in the two-level and three-level configurations, respectively.

The improvement is slightly better than that of the Master-Slave approach, but the gain over the original DSBC is still quite small.

A.2. TABSBC additional experiments

A.2.1. TABSBC using the RRIP replacement policy

The Re-Reference Interval Prediction (RRIP) [28] technique achieves good performance benefits by modifying the traditional cache replacement policy. Victim lines are selected according to their recent behavior, using a 2-bit counter per line to indicate its degree of reuse. New lines are inserted with a reuse value of 2 or 3, depending on which option is performing better in the cache according to the set dueling mechanism. A line is selected for eviction only if its counter has a value of 3. If no such line exists, the counters of all lines in the current set are increased until one counter reaches that value. When a block is touched, its counter is set to zero (this is the Hit Priority approach, the one used in our experiments because it achieved the best results in [28]). Although a thread-aware variant has been proposed (TA-DRRIP, which uses set dueling to dynamically determine which option each application should apply in the presence of other applications), its efficiency decreases as the number of applications sharing an LLC increases, since a given core may evict a recently inserted line owned by a different core.

We have extended the original TABSBC design by using RRIP as the replacement policy instead of the traditional LRU one. This version of TABSBC behaves as follows:
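The victim selection and Hit Priority update described above can be sketched for a single cache set as follows. This is a minimal illustration, not the simulator used in the experiments: set dueling is not modeled (the insertion value is simply a parameter), filling invalid ways before evicting is an assumption, and the names `RRIPSet`, `find_victim` and `access` are hypothetical.

```python
from typing import List, Optional

RRPV_MAX = 3  # largest value of the 2-bit per-line counter


class RRIPSet:
    """One cache set managed with a 2-bit RRIP counter per line."""

    def __init__(self, ways: int) -> None:
        self.tags: List[Optional[int]] = [None] * ways
        self.rrpv: List[int] = [RRPV_MAX] * ways

    def find_victim(self) -> int:
        """Return the way to replace, aging the whole set if necessary."""
        # Assumption: invalid ways are filled before any valid line is evicted.
        for way, tag in enumerate(self.tags):
            if tag is None:
                return way
        while True:
            for way, value in enumerate(self.rrpv):
                if value == RRPV_MAX:
                    return way
            # No line has reached 3 yet: increase every counter and retry.
            self.rrpv = [value + 1 for value in self.rrpv]

    def access(self, tag: int, insert_rrpv: int = 2) -> bool:
        """Access a line; return True on a hit, False on a miss."""
        if tag in self.tags:
            # Hit Priority: a touched line gets its counter reset to zero.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        way = self.find_victim()
        self.tags[way] = tag
        self.rrpv[way] = insert_rrpv  # 2 or 3, chosen elsewhere (set dueling)
        return False
```

For example, in a two-way set that holds lines 1 and 2, both inserted with value 2, a hit on line 1 followed by a miss on line 3 ages the counters to 1 and 3 and evicts line 2.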

Table A.3: IPC improvement of DSBC with Extra Tags over the two-level baseline configuration.

| bzip2 | milc | namd | gobmk | soplex | hmmer | sjeng | libquantum | omnetpp | astar | geomean |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.1% | 5% | 0.2% | 0.2% | 3.8% | 4% | 0.5% | 2.3% | 15.5% | 5.2% | 4% |

Table A.4: IPC improvement of DSBC with Extra Tags over the three-level baseline configuration.

| bzip2 | milc | namd | gobmk | soplex | hmmer | sjeng | libquantum | omnetpp | astar | geomean |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.5% | 4.5% | 4% | 5% | 4.4% | 5% | 4% | 4.5% | 14.5% | 5.2% | 5.6% |

New lines are inserted with a degree of reuse equal to 2, if the set is applying the traditional MRU insertion, or 3, if the set is dealing with capacity problems.
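The insertion rule above amounts to a two-way choice on the state of the set. A minimal sketch, assuming the state of each set is already known from the rest of the TABSBC design (how it is determined is outside this appendix); `SetState` and `insertion_reuse_value` are hypothetical names.

```python
from enum import Enum


class SetState(Enum):
    MRU_INSERTION = "mru_insertion"          # set uses traditional MRU insertion
    CAPACITY_PROBLEMS = "capacity_problems"  # set is dealing with capacity problems


def insertion_reuse_value(state: SetState) -> int:
    """Degree of reuse assigned to a newly inserted line (2-bit RRIP counter)."""
    return 3 if state is SetState.CAPACITY_PROBLEMS else 2
```

Since a counter value of 3 marks a line as an eviction candidate, lines inserted into a set with capacity problems are the first to be considered for replacement unless they are reused.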

Tables A.5 and A.6 show the performance improvement, measured in terms of throughput, and the miss rate reduction, respectively, of this version running two cores. Tables A.7 and A.8 show the same results for four cores. Experiments were performed under the same conditions as in Chapter 5.

With two cores, this design achieves a slight improvement relative to TABSBC, which had obtained a 4% IPC improvement and a 12% miss rate reduction.

In the four-core experiments, the improvement relative to TABSBC, which had obtained a 5% performance improvement and a 15% miss rate reduction, is slightly larger than in the two-core ones.


Table A.5: Performance improvement of TABSBC with RRIP running two cores.

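| Workload | Performance improvement |
|---|---|
| MW1 | 4.7% |
| MW2 | 7.9% |
| MW3 | 9.6% |
| MW4 | 6.4% |
| MW5 | 1.3% |
| MW6 | 5.2% |
| MW7 | 2.8% |
| MW8 | 4.7% |
| MW9 | 1% |
| MW10 | 6.3% |
| MW11 | 1.6% |
| MW12 | 6.2% |
| MW13 | 1% |
| MW14 | 1.2% |
| MW15 | 4% |
| MW16 | 1.5% |
| geomean | 4.1% |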

Table A.6: Miss rate reduction of TABSBC with RRIP running two cores.

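| Workload | Miss rate reduction |
|---|---|
| MW1 | 24% |
| MW2 | 14% |
| MW3 | 17% |
| MW4 | 7% |
| MW5 | 5.1% |
| MW6 | 5% |
| MW7 | 31% |
| MW8 | 24% |
| MW9 | 1% |
| MW10 | 17% |
| MW11 | 2.6% |
| MW12 | 24% |
| MW13 | 5.8% |
| MW14 | 5.4% |
| MW15 | 8% |
| MW16 | 12.3% |
| geomean | 12.4% |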

Table A.7: Performance improvement of TABSBC with RRIP running four cores.

| Workload | Performance improvement |
|---|---|
| 401+444+445+456 | 1.4% |
| 401+445+456+471 | 3% |
| 401+433+450+462 | 7.5% |
| 433+471+473+482 | 14% |
| 401+444+458+471 | 3% |
| 444+458+462+471 | 5.4% |
| geomean | 5.6% |

Table A.8: Miss rate reduction of TABSBC with RRIP running four cores.

| Workload | Miss rate reduction |
|---|---|
| 401+444+445+456 | 7% |
| 401+445+456+471 | 20% |
| 401+433+450+462 | 21% |
| 433+471+473+482 | 22% |
| 401+444+458+471 | 17% |
| 444+458+462+471 | 19% |
| geomean | 17.5% |
