6.3.1 Coherence protocol
The proposed solution is agnostic to the cache-coherence protocol. The PTRT and PTAT entries are updated when there is a response to a coherence request for data, in the requesting and responding cores respectively. As long as we can monitor the coherence requests and responses issued by an a-core, the scheme is equally applicable to snooping and directory-based coherence. If the m-core is an attached coprocessor, the information for the PTRT and PTAT updates can be sent over a coprocessor interface. If the m-core is a general-purpose core, the update information can be sent to the m-core either through special messages on a general interconnect, or by having the m-core snoop the a-core requests on a snooping network. The protocol is also agnostic to the choice of cores (in-order or out-of-order), as it relies only on tracking coherence traffic between cores.
[Figure: two instruction streams in program order. Proc 1: Store A, then Load B. Proc 2: Store B, then Load A.]
Figure 6.6: Deadlock scenario with the TSO consistency model.
6.3.2 Memory consistency model
Similar to deterministic replay schemes [92], our protocol tracks coherence traffic to determine orderings for accesses to data and replays the same order on the metadata. Hence, it works well with sequential consistency. However, it is known that these schemes can be susceptible to deadlocks under the weaker consistency models used in many commercial architectures (e.g., x86 and SPARC) [92]. For instance, the SPARC Total Store Order (TSO) model allows loads to bypass unrelated stores and get their values from either memory or a write buffer. For the code in Figure 6.6, it is possible for both loads to be ordered at memory prior to their preceding stores. Note that instructions still commit in program order, but can be ordered at memory out of order. Thus, from the point of view of the memory model, we have ① → ③ and ② → ④, where → denotes a happens-before relation, ① and ② are the loads, and ③ and ④ are the stores. For deterministic replay systems, this code can cause a deadlock during replay, due to the cycle of dependences [92].
This is because schemes based on deterministic replay, such as RTR, merely log the coherence actions and try to replay them in the same order [92]. If the replayer follows the sequentially consistent memory ordering, it would try to issue ④ before ①, and ③ before ②. Combined with the recorded orderings, this would cause a deadlock due to a cycle of dependences. Mechanisms have been proposed that convert these dependences into artificial write-dependences to circumvent this problem. The hardware and software support required for this, however, is significant [92].
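The cycle can be checked mechanically. The following is a minimal sketch, not taken from any replay system: the node names follow the four operations of Figure 6.6, and `has_cycle` is a hypothetical helper that runs a depth-first search over happens-before edges.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:        # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

# Orderings recorded at memory under TSO: each load precedes the remote store.
memory_order = [("P1.load_B", "P2.store_B"), ("P2.load_A", "P1.store_A")]
# Orderings a sequentially consistent replayer adds: program order per processor.
program_order = [("P1.store_A", "P1.load_B"), ("P2.store_B", "P2.load_A")]

recorded_only = has_cycle(memory_order)
replay_with_sc = has_cycle(memory_order + program_order)
```

The recorded orderings alone are acyclic; the cycle appears only once the replayer also insists on program order between each store and the load that follows it.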
In our solution, this is not an issue with loads that are ordered before stores and get their values from memory. The Tag Value field in the PTAT provides version management of tag values, allowing for PTAT entries to be processed out of order (as in Section 6.2.4). Thus, the m-core servicing requests can process ③ and ④ even if they are ordered first at memory during replay. The subsequent loads (① and ②) get their correct tag values from the source m-core’s PTAT. Thus, a ① → ③ ordering is not imposed on the metadata.
Loads that return values from the a-core’s write buffers pose a more subtle problem. These loads are not observed by the interconnect and do not have entries in the PTAT, so the previous scheme does not work. Since the a-core commits and orders ① at memory before ④, there is already an entry for ① behind ④ in the IOT by the time ④ is ordered at memory. At this time, while allocating the PTRT entry for ④, we add a field with the ID of the youngest instruction in the IOT behind it (note that the IOT is populated when the instruction commits, in program order). This gives a list of loads that have committed behind ④ but have been ordered at memory before it. A TSO-compliant m-core can use this to order its metadata memory operations correctly. This argument can be extended to other consistency models that relax the write→read ordering, such as processor consistency on the x86.
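The bookkeeping for write-buffer loads can be illustrated with a small software model. This is only a sketch under stated assumptions: the `IotEntry` fields, the monotonically increasing instruction IDs, and the helper `loads_behind` are hypothetical and do not describe the actual hardware table layout.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class IotEntry:
    instr_id: int   # assumed to increase in commit (program) order
    kind: str       # "load" or "store"

def loads_behind(iot, store_id):
    """Return (youngest_id, load_ids) for entries committed behind store_id.

    youngest_id models the extra field recorded when the store's PTRT entry
    is allocated; load_ids are the loads that committed behind the store.
    """
    entries = list(iot)
    youngest_id = entries[-1].instr_id
    load_ids = [e.instr_id for e in entries
                if e.instr_id > store_id and e.kind == "load"]
    return youngest_id, load_ids

iot = deque()
iot.append(IotEntry(10, "store"))   # store commits, waits in the write buffer
iot.append(IotEntry(11, "load"))    # load commits behind it
iot.append(IotEntry(12, "load"))    # another load commits behind it

# The store is now ordered at memory: snapshot the youngest IOT entry.
youngest, bypassing = loads_behind(iot, store_id=10)
```

In this model, the m-core would treat every load between the store and the recorded youngest ID as having been ordered at memory before the store.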
6.3.3 Metadata length
Different dynamic analysis scenarios require different metadata lengths. The consistency protocol must be portable and able to accommodate the various lengths used.
Short metadata: The metadata is often much shorter than the actual data. Raksha, for example, associates a 4-bit tag with every 32-bit word of data [24]. Thus, the access to a single 4-byte word of metadata might stem from 8 different 4-byte words of the application. Since we track coherence events to enforce consistency, we enforce orderings at cache block granularity. Accesses to different data cache blocks result in accesses to different metadata words, and thus short tags do not cause correctness problems for our protocol. On the other hand, short tags can cause a performance problem. Since the metadata that correspond to multiple data cache blocks are packed in a single block, the m-cores can experience higher miss rates than the a-cores due to false sharing. This issue is explored further in Section 6.4.3.
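The packing arithmetic behind the false-sharing concern can be worked through in a few lines. The sketch below assumes a 64-byte cache block and a linear tag layout starting at `meta_base`; neither is stated in the text, and both function names are hypothetical.

```python
TAG_BITS = 4          # 4-bit tag per word, as in the Raksha example
WORD_BYTES = 4        # 32-bit data words
BLOCK_BYTES = 64      # assumption: the cache block size is not given in the text

def metadata_byte(data_addr, meta_base=0):
    """Byte address of the tag covering data_addr (assumed linear layout)."""
    word_index = data_addr // WORD_BYTES
    return meta_base + (word_index * TAG_BITS) // 8

def metadata_block(data_addr):
    """Index of the metadata cache block that holds the tag for data_addr."""
    return metadata_byte(data_addr) // BLOCK_BYTES

# Tags per metadata block, times bytes of data per tag, in units of data blocks.
tags_per_meta_block = BLOCK_BYTES * 8 // TAG_BITS
data_blocks_per_meta_block = tags_per_meta_block * WORD_BYTES // BLOCK_BYTES

# Data blocks 0 and 7 share one metadata block; data block 8 starts the next.
shared = metadata_block(0) == metadata_block(7 * BLOCK_BYTES)
next_block = metadata_block(8 * BLOCK_BYTES)
```

Under these assumptions, one metadata block covers eight data blocks, so writes by different a-cores to eight distinct data blocks can contend for a single block in the m-core caches.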
Long metadata: Some analyses require metadata that are longer than the actual data. For instance, the Lockset analysis used by LBA maintains a sorted list of lock addresses for each lock [13]. Thus, each data update corresponds to an update of multiple words of metadata. This creates the following problem: metadata may span multiple cache blocks (or even pages) leading to non-atomic transfers of metadata between m-core caches as the coherence system handles each block separately.
In the analysis architectures proposed thus far, long metadata are always handled in software using short routines with a few instructions [13]. This makes it expensive to handle the atomicity problem for long metadata using software locks. The analysis programmer can potentially avoid using a lock unless the metadata actually spans multiple cache blocks. Nevertheless, this makes the analysis code architecture-dependent and difficult to write. A better solution is to use Read-Copy-Update (RCU) for metadata. Any time an analysis routine needs to update long metadata, it creates a copy of the current value and updates the new version. The old metadata is then garbage-collected once its users relinquish hold over it. RCU eliminates the need for software locks in analysis code and the related issues (overhead, deadlocks, etc.). Only one change is needed in our hardware protocol to work with the RCU approach: instead of versioning the actual metadata values in the Tag Value field of PTAT entries, we pass a pointer to the active metadata copy. The hardware protocol itself has no other correctness issues.
If RCU is used, garbage collection of the old metadata can be performed by maintaining reference counts in software [59]. Reference counts for each version of metadata are incremented when processors enter the analysis routine, and are decremented when they exit. When no processor is actively using a version of metadata (its reference count reaches zero), it can be garbage collected by software.
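The copy, publish, and reclaim steps can be sketched in software as follows. This is a single-threaded illustration only: the class `RcuMetadata` and its methods are hypothetical names, and a real implementation would need atomic updates of the active pointer and of the reference counts.

```python
import copy

class RcuMetadata:
    """Hypothetical RCU-style wrapper around one long metadata object."""

    def __init__(self, value):
        self.versions = {0: value}   # version id -> metadata copy
        self.refcount = {0: 0}       # version id -> routines currently using it
        self.active = 0              # stands in for the pointer to the active copy
        self.next_id = 1

    def enter(self):
        """An analysis routine starts using the active version."""
        vid = self.active
        self.refcount[vid] += 1
        return vid

    def exit(self, vid):
        """The routine is done; reclaim the version if it is stale and unused."""
        self.refcount[vid] -= 1
        self._collect(vid)

    def update(self, mutate):
        """Copy the active version, modify the copy, then publish it."""
        new_value = copy.deepcopy(self.versions[self.active])
        mutate(new_value)
        old, vid = self.active, self.next_id
        self.versions[vid], self.refcount[vid] = new_value, 0
        self.active, self.next_id = vid, vid + 1
        self._collect(old)
        return vid

    def _collect(self, vid):
        # Only versions that are no longer active and have no users are freed.
        if vid != self.active and self.refcount[vid] == 0:
            del self.versions[vid]
            del self.refcount[vid]

meta = RcuMetadata(["lock_a", "lock_b"])
reader = meta.enter()                              # a routine holds version 0
meta.update(lambda locks: locks.remove("lock_b"))  # publishes version 1
still_alive = reader in meta.versions              # old copy kept for the reader
meta.exit(reader)
reclaimed = reader not in meta.versions            # freed once the count hits zero
```

The old copy survives the update because a routine still holds it, and it is reclaimed only when that routine exits.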
6.3.4 Analysis issues
In some cases, the analysis routine performs different operations on the metadata than those performed on the corresponding data. For example, an analysis might maintain a counter in the metadata that gets incremented every time a variable is accessed. This implies that a-core data reads may trigger m-core writes to the corresponding metadata. Our protocol for (data, metadata) consistency, however, relies on coherence activity. Thus, if an a-core read on shared data gets translated into a metadata write, it is not always clear which m-core should perform the write first. This could cause consistency issues due to metadata writes being performed out of order. In practice, this is not a major issue because the proposed analyses that convert a-core reads to m-core writes perform commutative operations on the metadata. Counter increments and lockset updates [13] are commutative operations, and thus the order in which the updates happen does not affect the final value.
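The role of commutativity can be illustrated by enumerating every order in which a set of metadata updates might be applied. In this sketch, `final_values` is a hypothetical helper, and modeling the lockset update as a set intersection is an assumption rather than a detail given in the text.

```python
from functools import reduce
from itertools import permutations

def final_values(initial, updates, apply):
    """Set of final metadata values reached over all orderings of the updates."""
    return {reduce(apply, order, initial) for order in permutations(updates)}

# Counter increments: every ordering yields the same count.
counter_results = final_values(0, [1, 1, 1], lambda count, inc: count + inc)

# Lockset refinement, modeled as intersection with the locks currently held.
lockset_results = final_values(
    frozenset({"a", "b", "c"}),
    [frozenset({"a", "b"}), frozenset({"b", "c"})],
    lambda lockset, held: lockset & held,
)

# A non-commutative update (plain overwrite): the result depends on the order.
overwrite_results = final_values("init", ["x", "y"], lambda old, new: new)
```

The first two cases produce a single final value regardless of order, while the overwrite case produces two, which is the situation where read ordering would have to be tracked.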
To support analyses where data reads lead to non-commutative metadata updates, our protocol must track read accesses to shared data in the PTAT and PTRT structures so that the order can be replayed for metadata operations. Hence, reads to shared data must now be visible to the coherence protocol, which is not the case for MESI or MOESI systems (multiple cores can have a copy of the same data in S state, and thus no coherence traffic occurs on reads). A solution would be similar to the scheme by Suh et al. [82], where the authors explain how to implement an MEI coherence scheme on top of MESI or MOESI coherence in order to gain visibility into reads for shared data. Note that the overhead of an MEI protocol would only be paid when such an analysis is actually performed.
Feature               Description
Processors            2 to 32 x86 cores, in-order, single issue
Simulator             TCC x86 simulator [34] + Wisconsin GEMS [58]
Coherence protocol    MESI Directory
Private split L1      64 KB, 4-way set assoc., 3-cycle acc. latency
Shared L2             32 MB, 4-way set assoc., 6-cycle acc. latency
Main Memory           160-cycle acc. latency
Default table sizes   20 (IOT), 10 (PTAT), 10 (PTRT) entries
Table 6.2: Simulation infrastructure and setup.
It is important to note that the evaluation presented in Section 6.4 assumes the worst-case scenario where all instructions (including those in the operating system) must be analyzed by the m-core. Developers might, however, choose to concentrate the analysis on a single application, in which case the hardware structures track only the instructions analyzed by the m-core. Similar to the decoupled DIFT architectures [42], system events such as context switches or interrupts do not require any special handling of the hardware structures.