
Chapter 5 NoC Simulation

5.4 Behavioral NoC Simulation

5.4.2 Static system call handler

The occurrence of a system call disrupts the control flow of an application thread and shifts control to an OS routine, as discussed in Section 5.3. In a full-system simulator, system calls are emulated with high fidelity because the simulator has access to a complete system stack. However, context switching on a system call is very expensive and can significantly slow down the simulation. Moreover, the system response time can be unpredictable, especially in cases that involve I/O operations⁹. Alternatively, we can consider a snapshot of those calls that represents the average behavior of the system. In this way, we eliminate these complexities and still retain a level of accuracy acceptable for NoC evaluations. However, similar to the discussion in Section 5.4.1, not all system calls impact the NoC. To this end, we first review the types of system interactions that can occur during the execution of a multi-threaded program:

• Non-blocking system calls: Non-blocking system calls are OS services in which the caller does not need to wait for a kernel notification to resume its operation. We assume the interaction between the application and the OS is very fast, and that these system calls return to user mode immediately¹⁰. Therefore, we neglect the occurrence of these calls in the application’s control flow.

• Private blocking system calls: A blocking call stops an application’s execution until the request is properly answered by the system. We consider a blocking system call private when it interrupts the execution of only a single thread. For example, we consider I/O system calls, such as a thread reading from a file, to be private. To account for private blocking system calls, we add the estimated waiting time¹¹ to the RIEC instruction that occurs right after the return from the call.

⁹ Many modern applications, such as client-server workloads, require multiple network communications over TCP sockets.

¹⁰ For example, in non-blocking I/O, a process can submit its I/O request to a kernel buffer and immediately resume its execution. The process is notified later, upon completion of the I/O task.

¹¹ We assume the waiting time corresponds to the moment that a thread goes into waiting mode until

• Shared blocking system calls: We consider a blocking system call shared when two or more threads are affected because they share a resource, which is typically a barrier or lock in a multi-threaded shared-memory application. A shared blocking call is commonly used for synchronization among threads in a multi-threaded program. Since it alters the control flow, we focus on developing techniques to emulate its behavior.
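The three rules above can be sketched as a single pass over one thread's recorded trace. This is only an illustration under stated assumptions: the `Riec`, `SysCall`, and `SyncPoint` records, their fields, and the function `apply_static_rules` are hypothetical names introduced here, and the waiting time is taken as a given estimate in cycles.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CallClass(Enum):
    NON_BLOCKING = auto()
    PRIVATE_BLOCKING = auto()
    SHARED_BLOCKING = auto()


@dataclass
class Riec:
    """Hypothetical stand-in for one RIEC entry of a reduced trace."""
    name: str
    delay: int = 0  # extra cycles to wait before this entry is issued


@dataclass
class SysCall:
    """Hypothetical record of a captured system call."""
    call_class: CallClass
    wait: int = 0      # estimated waiting time (private blocking only)
    sync_id: str = ""  # synchronization point (shared blocking only)


@dataclass
class SyncPoint:
    """Marker that replaces a shared blocking call in the trace."""
    sync_id: str


def apply_static_rules(trace):
    """Rewrite one thread's trace according to the three call classes."""
    out = []
    pending_wait = 0
    for item in trace:
        if isinstance(item, SysCall):
            if item.call_class is CallClass.NON_BLOCKING:
                continue  # neglected: assumed to return immediately
            if item.call_class is CallClass.PRIVATE_BLOCKING:
                pending_wait += item.wait  # charged to the next RIEC entry
            else:
                out.append(SyncPoint(item.sync_id))  # becomes a sync vertex
        else:
            out.append(Riec(item.name, item.delay + pending_wait))
            pending_wait = 0
    # Assumption of this sketch: a wait with no following RIEC entry is dropped.
    return out


trace = [
    Riec("r0"),
    SysCall(CallClass.NON_BLOCKING),
    SysCall(CallClass.PRIVATE_BLOCKING, wait=120),
    Riec("r1"),
    SysCall(CallClass.SHARED_BLOCKING, sync_id="V1"),
    Riec("r2"),
]
rewritten = apply_static_rules(trace)
```

In this example the non-blocking call disappears, the 120-cycle wait is attached to `r1`, and the shared blocking call is replaced by a `SyncPoint("V1")` marker.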

Next, we describe how we statically address the shared blocking system calls.

Implementation of shared blocking calls

We implemented a binary instrumentation tool [50] to record the shared blocking calls of a multi-threaded program. The calls are captured as they occur in a normal execution of the program on a host¹². Then, we build a directed acyclic graph (DAG) with the calls as vertices and the RIECs of threads as edges. That is, we replace the instruction for a shared blocking call with a corresponding vertex in the generated graph. These vertices act as synchronization points. When a thread has performed all the memory accesses listed in its RIEC and reaches this point, it blocks until all other threads entering this vertex arrive.
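The graph construction described above might look like the following minimal sketch. It assumes the recorded calls are already available as an ordered list of synchronization-point identifiers per thread; the `START` and `END` vertices, the function names, and the input layout are hypothetical choices made for illustration, not details of the instrumentation tool.

```python
from collections import defaultdict


def build_sync_graph(thread_syncs):
    """Build the synchronization graph from per-thread call sequences.

    thread_syncs maps a thread id to the ordered list of synchronization
    points it reached. Each edge (src, dst, tid) stands for the RIEC that
    thread `tid` executes between two vertices.
    """
    edges = []
    waiters = defaultdict(set)  # vertex -> threads that enter it
    for tid, syncs in thread_syncs.items():
        prev = "START"  # hypothetical source vertex
        for vertex in syncs:
            edges.append((prev, vertex, tid))
            waiters[vertex].add(tid)
            prev = vertex
        edges.append((prev, "END", tid))  # hypothetical sink vertex
    return edges, dict(waiters)


def is_acyclic(edges):
    """Check the DAG property with Kahn's topological-sort algorithm."""
    succ = defaultdict(set)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst, _ in edges:
        nodes.update((src, dst))
        if dst not in succ[src]:
            succ[src].add(dst)
            indeg[dst] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return visited == len(nodes)


# Assumed example: thread 1 does not take part in V2.
thread_syncs = {0: ["V1", "V2", "V3"], 1: ["V1", "V3"], 2: ["V1", "V2", "V3"]}
edges, waiters = build_sync_graph(thread_syncs)
```

The `waiters` map records which threads must arrive at a vertex before it releases them, which is the information a simulator needs to treat the vertex as a synchronization point.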

Fig. 5.6 illustrates the proposed method for a parallel program with three threads and three synchronization points. Each thread starts to execute the memory accesses in its reduced instruction trace (as discussed in Section 5.4.1) until it blocks on the vertex V1. If the thread is the last arrival, it will wake up the other blocked threads, and all of them can resume their normal execution¹³. A similar procedure occurs for the V2 and V3 synchronization points.

¹² Specifically, we capture the futex_wait(), futex_wake(), and futex_cmp_requeue() system calls in a Linux-based system.

Figure 5.6: Injecting synchronization points in reduced instruction traces of application threads.
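The blocking and wake-up behavior at V1, V2, and V3 can be illustrated with a small timing model. This is a sketch under assumptions, not the simulator itself: each trace segment is reduced to a single duration in cycles, the durations and the set of threads entering each vertex are invented for the example, and `simulate` is a hypothetical helper.

```python
def simulate(segments, waiters):
    """Replay per-thread segments with blocking synchronization vertices.

    segments: {tid: [(duration, vertex_or_None), ...]}; a thread runs for
              `duration` cycles and then blocks at `vertex` (None = finish).
    waiters:  {vertex: set of tids that enter it}.
    Returns ({vertex: release_time}, {tid: finish_time}).
    """
    clock = {tid: 0 for tid in segments}
    pos = {tid: 0 for tid in segments}
    arrived = {vertex: {} for vertex in waiters}
    release = {}
    runnable = list(segments)
    while runnable:
        tid = runnable.pop()
        if pos[tid] >= len(segments[tid]):
            continue  # nothing left to replay for this thread
        duration, vertex = segments[tid][pos[tid]]
        pos[tid] += 1
        clock[tid] += duration
        if vertex is None:
            continue  # thread finished
        arrived[vertex][tid] = clock[tid]  # thread blocks here
        if set(arrived[vertex]) == waiters[vertex]:
            # The last arrival wakes everyone; all resume at the same time.
            release[vertex] = max(arrived[vertex].values())
            for other in waiters[vertex]:
                clock[other] = release[vertex]
                runnable.append(other)
    return release, clock


# Assumed durations; thread 2 is assumed to skip V2.
segments = {
    0: [(10, "V1"), (5, "V2"), (7, "V3"), (2, None)],
    1: [(4, "V1"), (9, "V2"), (1, "V3"), (2, None)],
    2: [(6, "V1"), (3, "V3"), (2, None)],
}
waiters = {"V1": {0, 1, 2}, "V2": {0, 1}, "V3": {0, 1, 2}}
release, finish = simulate(segments, waiters)
```

With these numbers, thread 0 is the last arrival at V1 and V3 and thread 1 is the last arrival at V2, so each vertex releases its threads at the arrival time of the slowest participant.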

To see which of the three system call classes appears most frequently, we studied the SPLASH [76] and PARSEC [14] benchmark suites. For CMPs, it is common to evaluate only the parallel part of a benchmark, referred to as the region of interest (ROI). Therefore, we present results for both full and partial executions of the benchmarks. Table 5.2 shows the results for three of the benchmarks. Although all system call types occur in full runs of the benchmarks, shared blocking system calls are the dominant type in the parallel regions. We can also see that the region of interest (the parallel part) is considerably smaller in instruction count than the full application.

Table 5.2: Application behavior at various instruction blocks (16-core, shared MOESI).

Application   Block   Inst. count   Shared system calls
fft           full    3884535       59%
              ROI     917580        100%
radix         full    6595950       82%
              ROI     1677496       ∼100%
lu_ncb        full    21905661      78%
              ROI     3266720       ∼100%
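As a quick check of how much smaller the ROI is, the instruction counts in Table 5.2 can be turned into ROI shares. The snippet below only restates the table's numbers; the dictionary layout, the helper `roi_share`, and the spelling `lu_ncb` for the third benchmark are assumptions of this sketch.

```python
# (full-run instruction count, ROI instruction count) from Table 5.2
inst_counts = {
    "fft": (3884535, 917580),
    "radix": (6595950, 1677496),
    "lu_ncb": (21905661, 3266720),
}


def roi_share(full, roi):
    """Percentage of the full run's instructions that fall inside the ROI."""
    return 100.0 * roi / full


shares = {app: round(roi_share(full, roi), 1)
          for app, (full, roi) in inst_counts.items()}

for app, share in shares.items():
    print(f"{app}: ROI is {share}% of the full run")
```

For these three benchmarks the ROI accounts for roughly 15% to 25% of the instructions of the full run.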

Table 5.3: Comparison of simulation methods.

Simulation method     Approach                         Memory model   Level         Model gen. cost (per combination¹⁴)   Simulation time
full-system           Full emulation                   Yes            Application   [hours, weeks]                        [hours, weeks]
Synthetic             Probabilistic                    No             Network       [hours, weeks]                        [seconds, minutes]
Synthetic (Synfull)   Probabilistic                    Approx.        Network       [hours, weeks]                        [seconds, minutes]
Packet-trace          Packet-driven                    No             Network       [hours, weeks]                        [seconds, minutes]
Net-trace             Dependency-aware packet-driven   No             Network       [hours, weeks]                        [seconds, minutes]

In document Making the On-Chip World Smaller with Low-Latency On-Chip Networks (Page 152-157)