
4. SIMULATION PERFORMANCE EXPERIMENTS


Fig 4.5 Backdoor Memory Access (figure label: Direct Memory Interface)

4.5 Dual Core Platform

Based on the task partitioning explained in section 4.3, the application was split into two parts, each executed on a separate processor. The platform constructed to simulate such a system is illustrated in Figure 4.6. Each processor has its own program memory, which contains the executable in .elf format. In the current framework, the cores communicate with each other via shared memory. This shared memory is used to transfer the necessary information from core 1 to core 2: the image properties (parameters such as image size, number of components and sampling rate) and the blocks needed for IDCT computation. To maintain the correctness of the application, the two processors synchronize through a polling mechanism in which a semaphore in shared memory is constantly polled by both processors. Processor 1 reads the input image from memory. Each time processor 1 generates an 8x8 block after de-quantization, it places the block into the shared memory and sets the semaphore high. It then waits for the semaphore to return to a low value, which is done by processor 2. Processor 1 writes a block to the shared memory only when the semaphore is low. Processor 2, on the other hand, waits for the semaphore to go high. When it finds the semaphore high, it reads the block from the shared memory and resets the semaphore to low so that processor 1 can write the next block. Processor 2 then performs IDCT on the block. When a sufficient number of blocks has been obtained, color conversion and re-ordering are performed and the reconstructed data is stored in the output memory.
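The handshake described above can be sketched as a small, deterministic model. This is only an illustration under stated assumptions, not the code that runs on the MIPS cores: the names `run_handshake`, `steps_per_turn` and the dictionary standing in for shared memory are hypothetical, and each "step" stands for one poll or one block transfer.

```python
from collections import deque


def run_handshake(blocks, steps_per_turn=4):
    """Model two cores that poll one semaphore in shared memory.

    Each core runs `steps_per_turn` steps before the other core gets a
    turn (an assumed stand-in for a simulator time slice).
    """
    shared = {"semaphore": 0, "block": None}   # hypothetical shared-memory layout
    pending = deque(blocks)                     # blocks core 1 still has to send
    received = []                               # blocks core 2 has read
    idle_polls = {"core1": 0, "core2": 0}       # polls that found nothing to do

    def core1_step():
        if not pending:
            return                              # nothing left to produce
        if shared["semaphore"] == 0:            # write only while semaphore is low
            shared["block"] = pending.popleft()
            shared["semaphore"] = 1             # signal "block available"
        else:
            idle_polls["core1"] += 1

    def core2_step():
        if shared["semaphore"] == 1:            # read only while semaphore is high
            received.append(shared["block"])
            shared["semaphore"] = 0             # hand the buffer back to core 1
        else:
            idle_polls["core2"] += 1

    while len(received) < len(blocks):
        for _ in range(steps_per_turn):
            core1_step()
        for _ in range(steps_per_turn):
            core2_step()
    return received, idle_polls
```

In this model the blocks always arrive in order, and the number of wasted polls grows with the length of each core's turn, since only one block can change hands per turn.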

Fig 4.6 Dual Core Platform

Based on this description the next chapter presents the results of our experimentation.

[Figure 4.6 block labels: MIPS32_24K Core 1 (IA), MIPS32_24K Core 2 (IA), Bus 1, Bus 2, Bus 3, Program Memory (one per core), Input Memory, Output Memory, Shared Memory]

Chapter 5 RESULTS AND ANALYSIS

The image chosen for the experimentation is a 640 x 480 color JPEG image, shown in Figure 5.1. Performance was evaluated on a server with the following specification:

Processor : AMD Opteron Processor 252

CPU MHz : 2592.406

Cache Size : 1024 KB

RAM Size : 2 GB

Figure 5.1 Input Image

The server used for our experimentation is a dual-core machine. The results were not taken on a dedicated server, and other jobs on the server were not restricted. The results of the experiments carried out for single and dual core platforms in OVP and CoWare are described below.

5.1 Single Core Platform

During simulation in OVP, the processor executes IPQ (instructions per quantum) instructions and then the quantum advances.

$$\mathrm{IPQ} = \text{Nominal MIPS} \times \text{Quantum size}$$

Thus the IPQ is affected both by the Nominal MIPS rate of the processor and by the quantum size in the platform. We consider them one by one. The effect of changing the Nominal MIPS is shown in Figure 5.2.
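As a worked example of this relation, the sketch below computes the IPQ for the quantum sizes that appear in this chapter. It assumes that Nominal MIPS means millions of instructions per simulated second and that the quantum is given in seconds; the function name is hypothetical.

```python
def instructions_per_quantum(nominal_mips, quantum_seconds):
    """IPQ = Nominal MIPS x quantum size, with MIPS taken as 10^6 instr/s (assumption)."""
    return round(nominal_mips * 1_000_000 * quantum_seconds)


# Nominal MIPS = 100 with the quantum sizes used in this chapter
for quantum in (1e-3, 1e-4, 7e-6):
    print(quantum, instructions_per_quantum(100, quantum))
```

Under these assumptions, a 100 MIPS processor executes 100,000 instructions in a 1 ms quantum, 10,000 in a 0.1 ms quantum and 700 in a 7 microsecond quantum.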

Varying Nominal MIPS (Quantum size = 1 ms)

Analysis: Increasing the Nominal MIPS rate of the processor in a single core system allows the processor to execute more instructions per quantum before the quantum advances. This reduces the simulated time because fewer quanta are required to execute the same total number of instructions. The elapsed time (wall clock time) remains more or less the same because the same total number of instructions is executed every time. Changing the Nominal MIPS is thus much like changing the operating frequency of the processor, making it work faster. For the current experimentation the Nominal MIPS is fixed at 100.
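The reasoning above can be checked with a short calculation. The sketch assumes a fixed, purely hypothetical instruction count for the application (50 million) and the same unit conventions as before (Nominal MIPS in millions of instructions per second, quantum in seconds); it is not a model of the OVP scheduler itself.

```python
import math


def quanta_needed(total_instructions, nominal_mips, quantum_seconds):
    """Number of whole quanta needed to retire the given instruction count."""
    ipq = round(nominal_mips * 1_000_000 * quantum_seconds)
    return math.ceil(total_instructions / ipq)


def simulated_time(total_instructions, nominal_mips, quantum_seconds):
    """Simulated time is a whole number of quanta."""
    return quanta_needed(total_instructions, nominal_mips, quantum_seconds) * quantum_seconds


TOTAL = 50_000_000  # hypothetical instruction count, for illustration only
for mips in (100, 200, 500, 1000):
    print(mips, quanta_needed(TOTAL, mips, 1e-3), simulated_time(TOTAL, mips, 1e-3))
```

Doubling the Nominal MIPS halves the number of quanta and therefore the simulated time, while the number of instructions the host has to simulate stays at `TOTAL`, which is why the elapsed time is expected to stay roughly constant.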

Figure 5.2 Time variation with Nominal MIPS for single core system

Varying Quantum Size (Nominal MIPS =100)

The effect of changing the quantum value is illustrated in Figure 5.3. For OVP platforms, the quantum is changed in the platform. The simulation time is always an integral multiple of the quantum, so a suitable quantum value has to be selected.

However, the experimental results show that for a fixed Nominal MIPS the simulation time remains more or less the same as the quantum size varies. Too low a quantum value, on the other hand, places greater overhead on the simulator and increases the elapsed time.

The default quantum value in the proprietary experimental platform is 0.1 ms. We set the quantum in our OVP platforms to the same value.

Thus for the simulations, we choose:

Processor Nominal MIPS/ Frequency = 100

Quantum size = 0.1 millisecond

[Plot: "Variation with Nominal MIPS (Quantum = 1 msec)"; Simulation Time and Elapsed Time in seconds (0 to 2) versus Nominal MIPS (0 to 1000)]

Figure 5.3 Time variation with Quantum size for single core system

Based on these parameters, the simulation statistics for the platforms in 3 scenarios are shown in Table 5.1.

Table 5.1 Speed Comparison for single core system

Analysis: From the above table it is clear that OVP provides a simulation speed improvement by a factor of 3 to 4 over the virtual prototyping environment used. Also, when switching from the pure OVP environment to the OVP-TLM2.0 environment, there is no drop in simulation speed as long as all accesses take place using the DMI hint.

The table presents a range of simulated MIPS, which is the simulated instruction rate, i.e. the number of instructions simulated per wall clock second. The range is taken over multiple iterations because the simulations were not run on a dedicated server, so the elapsed time varies slightly with server load.

5.2 Dual Core Platform

[Plot: "Variation with quantum"; Simulation Time and Elapsed Time in seconds versus Quantum in seconds (0.000001 to 1)]

Currently the synchronization among the processors is maintained by polling the shared memory. Experiments were done to find the appropriate quantum value to be used for comparison and to study the effect of changing the processor Nominal MIPS. Figure 5.4 shows the variation with Nominal MIPS and Figure 5.5 shows the variation with quantum size.

Analysis: This is a dual core application with a strict data dependency between the cores, in which both processors continuously poll the memory to get the pixel block. A tight synchronization is maintained between them. As the Nominal MIPS increases, the elapsed time would be expected to stay the same, since the total number of instructions in the application remains the same. However, we find an increasing trend. This is because the number of idle instructions spent polling the semaphore increases with the MIPS rate, which causes the elapsed time to increase linearly. The total number of quanta required to complete the application is still the same, so the simulation time does not change. Based on this understanding, it is apt to choose a low Nominal MIPS value.
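The trend can be reproduced with a deliberately simplified model. It assumes that exactly one block changes hands per quantum, that the useful work per block fits inside one quantum, and that each core spends the rest of its quantum polling; the block count and work per block are hypothetical values, not measurements from these experiments.

```python
def polling_model(num_blocks, work_per_block, nominal_mips, quantum_seconds):
    """Toy model of polling overhead in the dual core platform.

    Returns (simulated_time, idle_instructions_per_core, executed_instructions).
    """
    ipq = round(nominal_mips * 1_000_000 * quantum_seconds)
    if work_per_block > ipq:
        raise ValueError("model assumes one block's work fits in a quantum")
    quanta = num_blocks                       # assumption: one hand-off per quantum
    simulated_time = quanta * quantum_seconds
    idle_per_core = num_blocks * (ipq - work_per_block)
    executed = 2 * num_blocks * ipq           # both cores run a full quantum each time
    return simulated_time, idle_per_core, executed


for mips in (100, 200, 400):
    print(mips, polling_model(4800, 2000, mips, 1e-4))
```

In this model the simulated time is independent of the Nominal MIPS, while the number of instructions the host must execute, and hence the elapsed time, grows linearly with it.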

Figure 5.4 Time variation with Nominal MIPS for dual core system

The timing variation with quantum value shown in figure 5.5 also yields some interesting results. The number of instructions executed per quantum is also controlled by the quantum size. The polling overhead can thus be reduced by limiting the number of idle wait instructions executed per quantum. Reducing the quantum value from 0.1 millisecond to 10 microseconds reduces both the simulation time and the elapsed time because the polling overhead is reduced.

Figure 5.5 Time variation with Quantum size for dual core system

It is therefore desirable to work with a low quantum value. However, too low a quantum value causes considerable switching overhead on the simulator: the more context switches there are, the more often the OVP simulator synchronizes with the SystemC simulator. We arrive at a quantum of 0.000007 seconds, at which the simulation time is close to the real time the application would take to execute without a large polling overhead.

Thus for the simulations, we choose:

Processor Nominal MIPS/ Frequency = 100

Quantum size = 7 microseconds

The simulation statistics are shown in Table 5.2.

Table 5.2 Speed Comparison for dual core system

From Table 5.2, we observe that although the speed improvements are not as high as for standalone single core systems, OVP simulations are still faster. Also, when working with OVP models in the TLM2.0 environment, a slight drop in simulation performance is observed. This drop generally falls in the range of 3-4%. The reduced speed improvements are due to the synchronization needed between the two processors. Every quantum, control switches between the processors, which causes a small simulation overhead. Higher speed improvements are still possible if we work at a higher quantum value. The polling overhead in that case can be worked around by implementing a pipelined JPEG decoder.

The next chapter describes the experiments carried out for hybrid simulation of OVP with SCML.

Chapter 6 EXPERIMENTATION FOR HYBRID SIMULATION WITH SCML

6.1 Proposed Wrapper for Hybrid Simulation of OVP, SystemC and SCML

Open SystemC Modeling Library (SCML) is a modeling methodology provided by CoWare for the creation of highly-reusable SystemC TLM peripherals. SCML helps separate TLM communication, storage, timing, and behavior within the peripheral model, making code more modular and more efficient to develop and test.

The proposed methodology for hybrid simulation of OVP processor models with Open SCML based models builds on the concept of wrappers for TLM models.

The introduction of the TLM2.0 compliant SystemC wrapper for the processor models enables them to access SCML based memories and peripherals. The openly available SCML modeling technology is not yet fully TLM2.0 compliant.

The original modeling technology supports binding of the SCML memories to either a PV port or an scml_post_port. Neither of these interfaces has a request and response structure generic enough to readily support model interoperability, which is what the TLM2.0 standard tries to achieve by putting all the transaction request/response parameters into one tlm_generic_payload. Since OVP models can successfully interact with SystemC models that have TLM2.0 based communication interfaces, an attempt has been made to put SCML memories in a wrapper that is TLM2.0 compliant.

The proposed wrapper to integrate SCML memories is shown in figure 6.1. The SCML memories have been encapsulated in a class derived from sc_module, and TLM2.0 target sockets have been added. The memory is bound to the other models in the platform through these sockets. To perform memory read/write operations, blocking transport callbacks are registered on these sockets. The callbacks directly call a memory access function, which reads or writes the data and triggers any other callbacks attached to memory reads and writes. The wrapper constructed around SCML memories uses TLM2.0 transport calls without a direct memory pointer. This prevents the processor from bypassing the socket, so every read/write request actually travels over the bus before reaching the memory. This causes a small simulation overhead. However, when SCML is used to model peripherals whose registers are accessed only rarely, a transport call to the SCML memory does not cause much of a drop in simulation speed, since only a small fraction of the total transactions actually goes over the bus. The experiments carried out on such integration and their important findings are reported in later sections.
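The structure of such a wrapper can be illustrated with a language-neutral sketch. The code below is plain Python, not SystemC or SCML: the method names `b_transport` and `get_direct_mem_ptr` only mirror the corresponding TLM-2.0 interface methods, the payload is a simple dictionary, and the classes `PlainMemory` and `TargetSocketWrapper` are hypothetical stand-ins for the SCML memory and its wrapper.

```python
class PlainMemory:
    """Stand-in for the wrapped memory (hypothetical, not the SCML API)."""

    def __init__(self, size):
        self._data = bytearray(size)
        self.write_callbacks = []      # behaviour triggered on memory writes

    def access(self, is_write, address, data=None, length=0):
        if is_write:
            self._data[address:address + len(data)] = data
            for callback in self.write_callbacks:
                callback(address, bytes(data))
            return None
        return bytes(self._data[address:address + length])


class TargetSocketWrapper:
    """Wrapper that exposes the memory only through blocking transport."""

    def __init__(self, memory):
        self.memory = memory
        self.transport_calls = 0       # every access is counted: none bypass the socket

    def b_transport(self, payload):
        self.transport_calls += 1
        if payload["command"] == "write":
            self.memory.access(True, payload["address"], payload["data"])
        else:
            payload["data"] = self.memory.access(
                False, payload["address"], length=payload["length"])
        payload["response"] = "ok"

    def get_direct_mem_ptr(self):
        return None                    # no direct memory pointer is offered
```

Because `get_direct_mem_ptr` never hands out a pointer, an initiator has no way to skip `b_transport`, which is the source of the overhead discussed above.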

Figure 6.1 Hybrid OVP/SCML simulation

With the development of TLM2.0 compliant SystemC wrappers, it has thus become possible to integrate OVP processor models with SCML based peripherals/memories, but accesses are still made by actual transport calls over the bus.

Backdoor mode accesses, which provide fast simulation performance, are currently not supported. Also, during the course of experimentation, some synchronization constraints came to light, which are discussed further below.

6.2 Initial Experimentation

To test the feasibility of such a hybrid simulation, a very basic single core system executing a simple application was constructed first. The data and stack memories in this system were replaced by SCML based memories, each with a TLM2.0 target socket attached on which blocking transport callbacks were registered. However, as already mentioned, no memory pointer can be obtained for SCML memories to enable direct memory access and fast simulation, so memories such as the program memory, whose pointer is passed to the processor to load the application executable, were still modeled as simple TLM2.0 memories. With this setup, simulations ran successfully, demonstrating that hybrid simulation of OVP and TLM2.0 compliant SCML models is possible.

6.3 Integrating SCML modeled SystemC TLM peripheral

To gain further insight into the possibility of hybrid simulation, we decided to incorporate an SCML modeled peripheral into the system rather than replacing the memories, which are pure slave models. The motivation behind this was:

1. To perform hybrid simulation with non-DMI based SCML models without causing much simulation overhead.

2. To look into the synchronization aspects of integrating OVP models with SCML based master models.

Replacing simple memories with SCML based ones would cause all memory read/write transport calls to actually go over the bus. This would bring down the simulation speed and defeat the purpose of prototyping for the high end S/W development use-case. However, when only some of the peripherals in the system work in non-DMI mode, the overhead is minimal, as there are only a limited number of accesses to the peripheral registers. Keeping this in mind, we changed the dual core system built in the second experiment of Chapter 4 to support interrupt driven inter-processor communication rather than polling of memory. For this purpose, a hardware based IPC block is modeled in SCML. The block diagram of the IPC is shown in figure 6.2. The IP has a number of interrupt and semaphore registers. Semaphores can be claimed and released through target ports to generate interrupts at the output ports of the model, which are connected to the interrupt inputs of the processors.

[IPC block diagram labels: Status 0 Register, Enable 0 Register, Status 1 Register, Enable 1 Register, Intreq 0, Intreq 1, semaphore registers SEM 0 to SEM 7]

When a processor releases a semaphore after claiming it, the bit in the status register whose position equals the number of the released semaphore register is set. If the corresponding bit in the enable register is set, an interrupt is generated on the interrupt line. The platform incorporating this SCML based model is shown in figure 6.3. The communication between the cores is described in figure 6.4. Core 1, after performing VLD, ZZ and de-quantization operations on an 8x8 pixel block, claims the semaphore in the Inter Processor Communication (IPC) block. It then places the block in the shared memory and releases the semaphore to generate an interrupt for core 2 on interrupt line 1. Core 1 now enters a wait state until it receives an interrupt from core 2. It can write the second block to the shared memory only after core 2 has taken the first one. Core 2, which was initially waiting for an interrupt, clears the interrupt line on receiving the interrupt from core 1. It then claims the semaphore and gains control over the shared memory. It reads the block from the shared memory and performs IDCT and the further operations. Core 2 then releases the semaphore in the IPC to generate an interrupt for core 1 on interrupt line 0 (shown in figure 6.4), notifying core 1 that it has finished with the current block and that core 1 can now write the second block to memory. In this way, both processors work together in tight synchronization to decode all the blocks and produce the final reconstructed image.
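The register behaviour described above can be captured in a small behavioural model. This is a sketch under assumptions, not the SCML implementation: it assumes eight semaphores and two interrupt lines, and, because the text does not say how a release selects the interrupt line, the target line is passed explicitly as the hypothetical `irq_line` argument. The class and method names are also hypothetical.

```python
class IpcBlock:
    """Behavioural sketch of the IPC block (semaphores plus status/enable registers)."""

    NUM_SEMAPHORES = 8   # assumption
    NUM_LINES = 2        # interrupt lines 0 and 1

    def __init__(self):
        self.owner = [None] * self.NUM_SEMAPHORES
        self.status = [0] * self.NUM_LINES   # one status register per line
        self.enable = [0] * self.NUM_LINES   # one enable register per line

    def claim(self, sem, core):
        """Claim a semaphore; returns False if another core already holds it."""
        if self.owner[sem] is None:
            self.owner[sem] = core
            return True
        return False

    def release(self, sem, core, irq_line):
        """Release a held semaphore and set its bit in the chosen status register."""
        if self.owner[sem] != core:
            return False
        self.owner[sem] = None
        self.status[irq_line] |= 1 << sem
        return True

    def intreq(self, line):
        """The line is asserted when any status bit has its enable bit set."""
        return bool(self.status[line] & self.enable[line])

    def clear(self, line, sem):
        """Clear a status bit, de-asserting the line if nothing else is pending."""
        self.status[line] &= ~(1 << sem)
```

A typical exchange in this model: core 1 calls `claim(0, "core1")`, writes its block, then calls `release(0, "core1", irq_line=1)`. With bit 0 of `enable[1]` set, `intreq(1)` becomes true for core 2, which calls `clear(1, 0)` and then claims the semaphore itself.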

Figure 6.3 Interrupt driven dual core platform


To make this application work on the platform, exception handlers to service processor interrupts are needed. The exception handling was studied and the application was modified to support handlers in MIPS assembly. This required careful programming of various co-processor registers of MIPS32_24K.

6.4 Important Observations

When working with such a system, some critical issues were observed:

1. Interrupts are raised in the above system by calling a processor interrupt routine via a SystemC signal. The SystemC and OVP simulators synchronize at the end of a quantum, so all signal writes take effect at the end of the quantum, when there is a call to wait().

2. The interrupt generated in the current quantum (Q1) for a processor will thus be sensed by the processor in the next quantum (Q2).

3. On sensing the interrupt, the interrupted processor saves the current processor state, clears the interrupt line, disables interrupts and branches to the ISR. Since all signal writes take place at the end of a quantum, the interrupt clear issued in Q2 becomes effective only in Q3.

Figure 6.4 Dual core inter processor communication flow


4. The interrupt line must be cleared immediately. This is needed because, after returning from the ISR, the processor re-enables interrupts. If the interrupt has not been cleared by that time, the processor enters the ISR again, which is undesirable. The interrupted processor thus keeps executing the ISR for the entire quantum.

Thus, for OVP based simulation, it is necessary to choose the quantum value such that the number of instructions executed per quantum is close to the number of instructions in the Interrupt Service Routine, so that the processor does not loop in the ISR. A higher quantum value delays interrupt assertion and de-assertion. Working with a small quantum value solves the problem but places greater overhead on the simulator, which affects performance.
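The looping behaviour can be illustrated with a small instruction-level model of one quantum. It assumes that the interrupt line is already asserted when the quantum starts, that the clear written by the ISR only takes effect at the quantum boundary, and that the ISR has a fixed length; the ISR length of 25 instructions used in the example is hypothetical.

```python
def count_isr_entries(ipq, isr_length):
    """Count how often the ISR is entered during one quantum of `ipq` instructions.

    Returns (entries, line_high_after_quantum).
    """
    line_high = True        # interrupt sensed at the start of this quantum
    clear_pending = False   # the ISR's clear is deferred to the quantum boundary
    in_isr = False
    remaining = 0
    entries = 0

    for _ in range(ipq):
        if not in_isr:
            if not line_high:
                continue                 # normal execution, no interrupt pending
            in_isr = True                # line still asserted: (re-)enter the ISR
            entries += 1
            remaining = isr_length
            clear_pending = True         # the clear is written, but not yet visible
        remaining -= 1                   # execute one ISR instruction
        if remaining == 0:
            in_isr = False               # return from ISR, interrupts re-enabled

    if clear_pending:
        line_high = False                # signal writes take effect at quantum end
    return entries, line_high


for ipq in (10, 25, 50, 10000):
    print(ipq, count_isr_entries(ipq, isr_length=25))
```

In this model the ISR runs once when the quantum holds no more instructions than the ISR, and is re-entered repeatedly once the quantum is longer, which matches the requirement stated above.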

For OVP, the working quantum range was found to be 100 to 220 ns.

A similar platform simulated in the proprietary modeling environment does not show this problem, because that environment allows the quantum to be changed dynamically based on external events from other masters. This is something the TLM2.0 compliant OVP wrappers currently lack. In the tested proprietary environment, explicit synchronization is maintained between the different models, which allows immediate interrupt assertion and de-assertion. A higher quantum value therefore does not cause the processor to loop in the ISR, and the results are correct.

The simulation statistics for the experimentation are reported in Table 6.1.

Modeling Environment | Quantum size | Simulation time | Elapsed time | Simulated

Table 6.1 Simulation statistics for IPC based system

The above table shows that the simulation speed achieved with the hybrid OVP environment is lower than that of the existing modeling solution when SCML based masters demanding synchronization are put into the system. Here, the simulations in OVP have been carried out at a very small quantum value, which results in the low speed.

Had OVP been able to provide explicit synchronization between the OVP and SystemC simulation kernels when desired, by dynamically changing the quantum value, the simulations incorporating SCML based masters could have been carried out with a higher quantum value, which would have given better performance.

6.5 Proposed Solutions

In order to achieve fast simulation speed in OVP/SystemC/SCML based environments consisting of a number of master and slave peripherals, the synchronization between the OVP simulator and the SystemC simulation kernel needs to be enhanced. This is necessary to realize the correct functionality of the application. Some proposed solutions to modify the TLM2.0 compliant wrapper for OVP

In document Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML (Page 31-49)