2.5 Shared Resource Modeling
Figure 2.25: Example illustrating the relationship between execution threads (red) and communication threads (yellow). The horizontal axis is time, marked t0 through t4.
These service times can be annotated via consume calls, much as computation complexity is annotated in the execution stack. Note that it is necessary to explicitly specify a scheduler/arbiter that will service the consume call. Unlike in the execution stack, we must be able to distinguish between various classes of communication resources, such as memory banks or busses. In our example, the first consume call within the “Access Memory” template results in bus usage from t1 to t2. The second consume call, representing the actual access to the memory bank, blocks the memory from t2 to t3. Finally, the data is returned to the processor via the third consume call, lasting from t3 to t4. It is important to note that there is no contention for either the bus or the memory in this example because we are illustrating only one read. In the presence of multiple simultaneous reads, multiple communication threads using the “Access Memory” template would be spawned, creating contention for the bus and shared memory.
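The chaining of the three consume calls can be pictured with a small stand-alone C sketch. It is not MESH code: the phase_t type, the access_finish_time function, and the phase durations are assumptions chosen only to illustrate how uncontended phases follow one another from t1 to t4.

```c
#include <stdio.h>

/* Hypothetical model of one uncontended "Access Memory" request.
 * None of these names come from the MESH API. */
typedef struct {
    const char *resource;   /* shared resource that services the phase */
    double duration;        /* assumed service time of the phase */
} phase_t;

/* Returns the completion time of the last phase, given the start time
 * of the first one. Without contention, each phase starts as soon as
 * the previous one ends. */
double access_finish_time(double start, const phase_t *phases, int n)
{
    double t = start;
    for (int i = 0; i < n; i++) {
        printf("%-6s busy from %.1f to %.1f\n",
               phases[i].resource, t, t + phases[i].duration);
        t += phases[i].duration;
    }
    return t;
}

/* Example request: bus, then memory bank, then bus again.
 * The durations are illustrative assumptions. */
static const phase_t access_memory[3] = {
    { "bus",    1.0 },  /* request travels to memory (t1 to t2) */
    { "memory", 2.0 },  /* memory bank access        (t2 to t3) */
    { "bus",    1.0 },  /* data returns to processor (t3 to t4) */
};
```

Calling access_finish_time(1.0, access_memory, 3) walks through the three phases in order and returns the time at which the data is back at the processor. With several simultaneous requests, the start of each phase would additionally depend on the arbiter of the resource involved.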
In the following section we will put these newly introduced model elements to use by incorporating a simple bus into our example.
2.5.2 Simple Bus-Based Example
Let us start with a simple system of two processors, much like the one in Figure 2.4, and connect the processors to a single shared memory bank via a bus (Figure 2.26). During the computation of the matrix product, each processor will have to access the shared memory via the bus in order to get the matrix data. We assume that program instructions reside locally with each processor.
Figure 2.26: Bus-based interconnect for our architecture. Resource 1 and Resource 2 connect to Shared Memory over a bus.
So, how does this system view fit into the split stack MESH view presented in the previous section? Figure 2.27 shows the amended view of the system. Note that all the worker threads can access a top layer behavior (i.e., a communication thread) named Bus Access. Whenever a bus access is required, this communication thread is instantiated and runs until the access is satisfied. Since a thread that gains exclusive access to the bus gains exclusive access to the memory as well, we will model these two as a single resource (see the hardware layer of the comm. stack). For the same reason, the bus arbiter will also serve as a de facto memory arbiter, deciding which thread receives access to both the bus and memory. Similarly, the Bus Access thread’s consume calls will model both the bus and memory delay at once. In a later example, we will split the bus and the memory into separate shared resources with their own arbiters in order to model more complex situations.
Figure 2.27: Modeling a shared bus and memory. Execution stack: software threads Boss and Worker 0 through Worker 3, the Round Robin Scheduler, and hardware resources Resource 1 and Resource 2. Comm. stack: the Bus Access thread, the Bus Arbiter, and the combined Bus/Memory hardware resource.
We modify matrix_mul3.c to create matrix_mul12.c, which implements our bus model. First, we create an arbiter for the bus/memory resource together with a communication thread that models the behavior of a bus access:
103 // create schedulers
104 default_sched = mesh_create_scheduler("default_sched",
105                                       mesh_scheduler_rr);
106
107 bus_arbiter = mesh_create_scheduler("bus_arbiter",
108                                     mesh_scheduler_rr);
109
110 // create communication threads
111 mesh_create_comm_thread("bus_access", bus_arbiter, bus_access);
In line 107 we create a scheduler named bus_arbiter using the familiar mesh_create_scheduler call. Just as custom scheduling strategies can be created for execution resources, a custom arbitration strategy can be created here, as described in Section 2.3.5. To keep this example simple, we will use the default round-robin behavior.
The mesh_create_comm_thread function behaves very similarly to the function for creating execution threads. Just like mesh_create_thread, mesh_create_comm_thread (details on pg. 88) takes arguments for the thread name, the default scheduler, and a function pointer to the thread behavior. What is missing is the optional void * argument that all execution threads can take at creation. As we will see later, we will instead have the opportunity to pass optional arguments when calling communication threads through the read/write interface.
As seen in the code above, we will create a bus resource along with the execution resources. Note that communication resource creation is identical to execution resource creation. The simulator does not distinguish between communication and execution resources; all resources have a certain “computation power” that allows them to be consumed and thus generate system timing. Whether the consumes come from execution or communication entities does not matter. Note that in line 128 we set bus_arbiter to be the bus resource’s scheduler.
The next issue to consider is how to perform the reading of memory from within the worker threads. For this purpose, we will consider only memory accesses of the matrix data, ignoring the memory accesses brought about by instruction loading, thread spawning, etc. Since all the matrix data accesses happen within the worker threads, let us return to the piece of code that actually performs the addition and multiplication operations:
35 // perform matrix multiplication for this row
47
48 // write result back to shared memory
49 mesh_comm_write("bus_access", NULL, NULL,
50                 BLOCK_RESOURCE);
The big change here is the addition of the mesh_comm_read API call, which grabs the data values from the A and B matrices in order to perform the calculation. The mesh_comm_read call (details on pg. 87), along with its counterpart mesh_comm_write, is the main method for execution threads to instantiate and use communication threads. Remember that in line 111 we created a communication thread called bus_access. Through the use of mesh_comm_read, the worker thread instantiates the communication thread bus_access, as shown by the arrows between threads in Figure 2.27. The mesh_comm_read argument BLOCK_RESOURCE means that the worker thread blocks the processor it is running on until the mesh_comm_read request is complete (i.e., until the bus access completes). Several other blocking options are available and will be discussed later in this tutorial. We will also postpone discussion of the other two mesh_comm_read arguments (set to NULL in this example), which allow data and parameters to be passed to the communication threads.
Finally, we need to describe what happens when control passes from the worker threads to the bus_access thread. This part of the simulation should model an access to the shared bus and memory, applying the appropriate delay. Let us look at the code for bus_access:
13 // bus_access comm thread
The first thing to notice is that every communication thread has a mesh_comm_thread_data structure passed to it as an input argument. This structure contains several useful pieces of information for the communication thread, some of which we will use in this example. One of the fields within the mesh_comm_thread_data structure is the func field, which specifies whether the thread was started as a result of a read or a write request. Depending on the type of request, we consume a different amount of resources (lines 19 through 24), ultimately resulting in different delays for reads and writes. In this example, writes take twice as long as reads, as seen by the difference in the consume call values. The mesh_consume_str_and_exit call (details on pg. 75) is a slightly different type of consume call; it notifies the simulator that it is the last consume call in the thread and that the thread can be terminated immediately. The time at which the bus_access thread terminates is very important: the worker thread that started the memory access blocks until the bus_access thread completes. Additionally, note that every communication thread is responsible for freeing its own mesh_comm_thread_data structure (line 26).
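The read/write distinction described above amounts to a dispatch on the request type. The following stand-alone C sketch shows that idea only; the bus_req_t enum and the bus_consume_amount function are hypothetical names, not the MESH func field values or consume calls, and the amounts simply mirror the 1-versus-2 ratio used in this example.

```c
/* Hypothetical request kinds. In MESH this information arrives through
 * the func field of the thread data; the constants here are assumptions. */
typedef enum { BUS_READ, BUS_WRITE } bus_req_t;

/* Returns the amount of bus/memory resource to consume for a request.
 * Writes are modeled as costing twice as much as reads. */
double bus_consume_amount(bus_req_t req)
{
    switch (req) {
    case BUS_WRITE:
        return 2.0;     /* write: twice the read amount */
    case BUS_READ:
    default:
        return 1.0;     /* read */
    }
}
```

Keeping the cost in one small function makes it easy to experiment with other read/write ratios without touching the rest of the communication thread.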
The above implementation of the bus_access thread is a very simple example of writing communication threads. Several later examples in this tutorial show other, more advanced features of communication threads. Now that we have set up a bus interconnect model, we run the matrix_mul12 model and receive the following output:
MESH Simulation Kernel - compiled on Sep 25 2004 13:20:37
row 0: 0
row 1: 0
row 2: 0
row 3: 0
MESH Final Simulation Time = 78.000000
Resource usage:
resource1: 32.000000, for sched: 0.000000
resource2: 46.000000, for sched: 0.000000
bus: 40.000000, for sched: 0.000000
Thread usage:
worker3: 12.000000, contended 0.000000
boss: 30.000000, contended 0.000000
worker0: 12.000000, contended 0.000000
worker1: 12.000000, contended 0.000000
worker2: 12.000000, contended 0.000000
The output above tells us that the total runtime of the system was 78.0 time units and that the bus resource was in use for 40.0 of them, a little over half the time. Let us look at the trace graph for more information:
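The utilization figures can be derived directly from the reported numbers by dividing each resource's usage by the final simulation time. A minimal C sketch of that arithmetic follows; the utilization function is a hypothetical helper, not part of the simulator.

```c
/* Fraction of the total simulated time that a resource spent busy.
 * Hypothetical helper for post-processing the reported usage values. */
double utilization(double busy_time, double total_time)
{
    /* Guard against an empty simulation run. */
    return total_time > 0.0 ? busy_time / total_time : 0.0;
}
```

With the values reported above, utilization(40.0, 78.0) gives roughly 0.51 for the bus, while resource1 (32.0) and resource2 (46.0) come out at roughly 0.41 and 0.59.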
Figure 2.28: System simulation including overhead due to memory accesses (matrix_mul12.c).
Figure 2.28 features two separate windows, showing the execution stack trace on the top and the communication stack trace on the bottom. The only resource in the communication stack is the bus resource. Running on it are the read and write varieties of the bus_access behavior, shown in two different colors.
At time t = 5.0 the worker0 thread attempts to run on resource1. However, before it performs its work and consumes computation from its resource, it must first get the appropriate data using mesh_comm_read calls. Therefore, at t = 5.0 and t = 6.0 the bus_access thread uses the bus to grab the necessary data. Remember that we set the consume value of a bus_access read to 1 and that the default computational power of the bus resource is 1 as well. Therefore, in the absence of contention (as is the case here), it takes two simulation time units for worker0 to get both pieces of data it needs. At t = 7.0 worker0 executes on resource1. Because the bus reads were of type BLOCK_RESOURCE, no other task could run on resource1 while the data was being fetched.
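The timing in this paragraph can be checked with a few lines of C. The sketch assumes that the service time of a consume call equals the consumed amount divided by the resource's computational power, which is consistent with the numbers above but is stated here as an assumption; service_time and ready_time are hypothetical helper names.

```c
/* Assumed timing rule: time = consumed amount / resource power. */
double service_time(double consume_amount, double power)
{
    return consume_amount / power;
}

/* Time at which a worker can start computing after issuing n_reads
 * back-to-back blocking reads on an uncontended bus. */
double ready_time(double request_time, int n_reads,
                  double read_consume, double bus_power)
{
    return request_time + n_reads * service_time(read_consume, bus_power);
}
```

For worker0, ready_time(5.0, 2, 1.0, 1.0) evaluates to 7.0: two reads of amount 1 on a bus of power 1, starting at t = 5.0.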
At t = 20.0 we have a case where both worker1 and worker2 (in dark and light green, respectively) attempt to run on separate processors and simultaneously attempt to access the bus. Remember that we set the bus_arbiter scheduler’s behavior to mesh_scheduler_rr, i.e., round robin. At t = 20.0, bus_arbiter arbitrarily allows worker2 to access the bus first. At t = 21.0, the round-robin algorithm gives access to worker1. At t = 22.0, worker2 gets a chance to fetch its second data value. Once this is done, worker2 is free to execute on resource2 at t = 23.0. worker1 is delayed one more cycle by its second bus access at t = 23.0 and gets to execute on resource1 at t = 24.0. This scenario illustrates how contention for a shared resource (a bus and memory in this case), as well as an arbitration strategy, can be modeled within the MESH simulation.
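The grant sequence between t = 20.0 and t = 24.0 can be reproduced with a small round-robin model. The C sketch below is not the MESH mesh_scheduler_rr implementation; rr_arbitrate and its parameters are hypothetical, and it assumes that every bus access takes exactly one time unit (a read of amount 1 on a bus of power 1) and that all requests arrive at the same time.

```c
#include <stdio.h>

/* Hypothetical round-robin bus arbiter model (not MESH code).
 * Requester i needs pending[i] unit-length bus accesses, all requested
 * at time `start`. One access is granted per time unit, rotating among
 * requesters that still have work, beginning with requester `first`.
 * On return, finish[i] is the time requester i completed its last
 * access, or `start` if it had none. n must be at least 1. */
void rr_arbitrate(double start, int first, int n,
                  int pending[], double finish[])
{
    double t = start;
    int cur = first;
    int remaining = 0;

    for (int i = 0; i < n; i++) {
        remaining += pending[i];
        finish[i] = start;
    }

    while (remaining > 0) {
        if (pending[cur] > 0) {
            printf("t = %.1f: bus granted to requester %d\n", t, cur);
            pending[cur]--;
            remaining--;
            t += 1.0;                 /* one unit-length bus access */
            if (pending[cur] == 0)
                finish[cur] = t;      /* requester may now compute */
        }
        cur = (cur + 1) % n;          /* rotate to the next requester */
    }
}
```

Taking requester 0 as worker1 and requester 1 as worker2, each with two pending reads, and calling rr_arbitrate(20.0, 1, 2, pending, finish) grants the bus in the order worker2, worker1, worker2, worker1. The finish times are then 23.0 for worker2 and 24.0 for worker1, matching the trace described above. Swapping in a different rotation rule is the kind of change a custom arbitration strategy would make.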