Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML

Academic year: 2021

Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML

A Feasibility Study

PRIYA AGRAWAL

VLSI DESIGN TOOLS & TECHNOLOGY

INDIAN INSTITUTE OF TECHNOLOGY, DELHI

Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML

A Feasibility Study

A thesis submitted in partial fulfilment of requirements for the degree of

MASTER OF TECHNOLOGY

in

VLSI DESIGN TOOLS & TECHNOLOGY

by

Priya Agrawal

2007JVL2170

Under the guidance of

Prof. Anshul Kumar

Mr. Desingh Devibalan B (NXP Semiconductors)

VLSI DESIGN TOOLS & TECHNOLOGY

Indian Institute of Technology, Delhi


CERTIFICATE

This is to certify that the thesis titled Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC and SCML – A Feasibility Study, being submitted by Priya Agrawal to the Indian Institute of Technology, Delhi for the award of the degree of Master of Technology in VLSI Design Tools and Technology, is a bona fide work carried out by her under our supervision and guidance. The research reports and the results presented in the thesis have not been submitted in part or in full to any other University or Institute for the award of any degree or diploma.

Dr. Anshul Kumar
Professor
Department of Computer Science & Engg.
Indian Institute of Technology

Desingh Devibalan B
Technical Leader
CTO & IC Design Cluster
NXP Semiconductors


ACKNOWLEDGEMENT

I would like to express my heartfelt thanks to Professor Anshul Kumar, Department of Computer Science and Engineering, IIT Delhi, my academic guide, for overall motivation, support and guidance during this project.

I would like to sincerely thank my supervisor Desingh Devibalan B for providing me such a challenging project to work on. His constant guidance and invaluable suggestions throughout the project and his critical approach to problems have led to the successful completion of this project.

Furthermore, I am also thankful to Duncan Graham, Lee Moore and Larry Lapides (Imperas), Raghunandan Balasubramaniam, Chandrashekhar and Mischa Jonker (NXP Semiconductors) for their support and encouragement during this project. It has been a very enlightening and enjoyable experience to work with them.

I wish to express my great thanks to my family members who supported me in all the endeavors I had during thesis work.

Finally, I owe many thanks to my colleagues and friends for making my stay in IIT Delhi and NXP Semiconductors, Bangalore memorable.

Priya Agrawal

M. Tech (VLSI design Tools and Technology) IIT Delhi


ABSTRACT

The increasing software development cost and effort and the decreasing turnaround time required for multiprocessor SoCs have made designers strive for fast virtual prototyping solutions capable of simulating the system at speeds of several hundred MIPS. Several fast prototyping solutions are offered by ESL vendors worldwide.

Open Virtual Platforms (OVP) enables simulation of embedded systems running real application code. This project aims at exploring this new technology and its interoperability with existing TLM based SystemC platforms. The present work addresses the details of the technology and the experiments done with it to check its simulation performance and the possibility of hybrid simulation with SCML. Experiments comparing the simulation speed of OVP with existing proprietary prototyping solutions, and experiments on hybrid simulation, reveal some important observations, which are also reported.


Table of Contents

1. INTRODUCTION
1.1 Overview
1.2 Motivation
1.3 Organization

2. BACKGROUND
2.1 Need of Prototyping
2.2 SystemC / Transaction Level Modeling
2.3 System Simulators

3. OPEN VIRTUAL PLATFORMS
3.1 Introduction
3.2 OVP APIs
3.2.1 Innovative CPU Manager (ICM)
3.2.2 Virtual Machine Interface (VMI)
3.2.3 Behavioral Hardware Modeling (BHM) and Peripheral Programming Model (PPM)
3.3 OVP models
3.4 OVPsim – The OVP Simulator
3.5 Additional Features of OVP
3.6 WHY is OVP fast?
3.7 Hybrid Simulation support for OVP
3.8 Approach to TLM2.0
3.9 Details of OVP inside TLM2.0

4. SIMULATION PERFORMANCE EXPERIMENTS
4.1 Application under test - JPEG Decoder
4.2 Application Task Graph Mapping for Dual Core Platform
4.3 Single Core Platform
4.5 Dual Core Platform

5. RESULTS AND ANALYSIS
5.1 Single Core Platform
5.2 Dual Core Platform

6. EXPERIMENTATION FOR HYBRID SIMULATION WITH SCML
6.1 Proposed Wrapper for Hybrid Simulation of OVP, SystemC and SCML
6.2 Initial Experimentation
6.3 Integrating SCML modeled SystemC TLM peripheral
6.4 Important Observations
6.5 Proposed Solutions

7. CONCLUSIONS
7.1 Summary
7.2 Future Scope

REFERENCES


List of Figures

Figure No.  Caption

3.1  OVP Interfaces
3.2  Wrapper for Processor Model
3.3  Processor Wrapper Implementation
4.1  JPEG Decoder
4.2  Dual Core System Architecture
4.3  Partitioning for JPEG Decoder
4.4  Single Core Platform
4.5  Backdoor Memory Access
4.6  Dual Core Platform
5.1  Input Image
5.2  Time variation with Nominal MIPs for single core platform
5.3  Time variation with Quantum size for single core platform
5.4  Time variation with Nominal MIPs for dual core platform
5.5  Time variation with Quantum size for dual core platform
6.1  Hybrid OVP/SCML Simulation
6.2  Inter Processor Communication Block
6.3  Interrupt Driven Dual Core System
6.4  Dual core inter processor communication flow


List of Tables

Table No.  Caption

4.1  System Load distribution for JPEG Decoder
5.1  Speed Comparison for Single Core System
5.2  Speed Comparison for Dual Core System
6.1  Simulation Statistics for IPC Based System


Chapter 1

INTRODUCTION

1.1 Overview

Designers of today’s embedded systems need to verify that the combination of hardware and software matches the expected functionality and performance. The turnaround time allowed for a design project is decreasing every year. To design and verify prototypes of large systems, fast simulation is a must. This project investigates the feasibility of adopting a virtual prototyping technology based on binary translation to improve the simulation speed of software verification. On March 7, 2008, Imperas announced the release of a virtual platform and modeling technology that enables simulation of embedded systems running real application code. This technology is called Open Virtual Platforms (OVP).

In this project, an attempt has been made to explore the technology provided by Imperas.

1.2 Motivation

Virtual platforms (VP) have been used for some years to develop, analyze, optimize and validate system-level hardware architectures. Today’s prototyping offerings are architected for single-core SoCs and do not scale to large numbers of embedded processors, specifically when it comes to simulation speed and debugging usability.

Imperas provides multi-processor (MP) virtual prototyping, simulation, and debugging. Virtual prototypes built with Imperas tools simulate efficiently at speeds of hundreds of MIPS on desktop PCs. They are completely instruction accurate and model the whole system. OVP and its APIs help foster model interoperability, which is vitally needed now in electronic system level (ESL) design. [1][2][3]

As the virtual platform solutions offered by Imperas seem quite promising, we set out to analyze the feasibility of stitching OVP processor models together with other peripherals modeled in SystemC/Open SCML. The key objective is to create a proof-of-concept platform to demonstrate this hybrid simulation framework for virtual platforms. We have also tried to benchmark and compare the simulation performance of OVP for single/multi-core platforms against the simulation framework provided by one of the industry-leading ESL vendors.

This hybrid simulation framework shall lead to new avenues in simulation of complex SoC Platforms built from various ESL/IP vendor supplied IPs in SystemC/SCML (eg. CoWare, ARM), testing the true inter-operability defined by the TLM2.0 standard. Thus it shall reduce the engineering effort in creating high speed multi-core virtual platforms for early software development to meet the tight time-to-market windows.

1.3 Organization

The work is organized as follows. Chapter 2 presents the necessary background and discusses some existing prototyping solutions. Chapter 3 highlights the details and components of the Open Virtual Platforms technology. The hybrid simulation support provided to integrate OVP models in a SystemC environment is also presented. Chapter 4 contains a detailed description of the platforms constructed in OVP and in the proprietary modeling environment, with the corresponding simulation statistics presented in Chapter 5. The proposed wrapper for hybrid simulation of OVP with SCML, together with the simulation experiments and observations for the same, is discussed in Chapter 6. Finally, Chapter 7 provides a summary of all the experimentation and suggestions for future exploration.


Chapter 2

BACKGROUND

2.1 Need of Prototyping

With the increasing complexity and integration in SoCs, software development costs are rising steeply. A simulation environment is necessary to simulate the system under design so that software developers can test the software and hardware developers can investigate design alternatives. Traditionally, techniques like FPGA prototyping and emulation have been proposed for software validation. [4] However, these solutions become available too late (only once the RTL is available) and significantly impact the design cycle. With software development determining project success or failure, modularity and fast prototyping have become important aspects of a simulation framework. The new SystemC and TLM based approaches to system level modeling help provide fast prototyping solutions.

2.2 SystemC / Transaction Level Modeling

SystemC supports modeling of complex hardware systems at different abstraction levels, with modeling of hierarchical components, as it is built on C++. The achievable simulation speed depends on the level of model abstraction, which also determines the platform’s accuracy. SystemC has always been intended to support actual embedded software development, but it has not possessed all the technology components necessary to fully enable it. Within the Transaction-Level Modeling (TLM) working group of OSCI, several different abstraction levels have been introduced which enable faster simulations. [5][6]

The transaction mechanism allows a process of an initiator module to call methods exported by a target module, thus allowing communication between TLM modules with very little synchronization code, thereby significantly reducing communication overhead in the modeling of SoCs. The draft of the TLM2.0 standard introduces new transaction abstractions so that platform components can communicate and be interoperable [7]. The use of the default tlm_generic_payload transaction type enables this.


The further improvements provided by TLM2.0 that result in faster simulation of models are as follows:

1. Direct Memory Interface (DMI): This allows direct backdoor access to memory and thus uninhibited instruction set simulator execution, as the transport call does not actually go over the bus, avoiding any bus conflicts.

2. Loosely Timed modeling: The model carries only minimal timing annotation. This involves a speed-accuracy tradeoff.

3. Temporal Decoupling: The models can have their own local clock, which synchronizes with the SystemC global clock only at adequate synchronization points. This allows simulation speed-up for multi-core platforms.
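The effect of temporal decoupling on synchronization overhead can be illustrated with a toy model. The following is only a minimal Python sketch, not SystemC code; the function name `run_decoupled` and its parameters are hypothetical and chosen for illustration. It counts how often a single core would have to synchronize with the global clock for a given quantum.

```python
def run_decoupled(total_instructions, ns_per_instruction, quantum_ns):
    """Count global synchronizations for one temporally decoupled core.

    The core accumulates a local time offset and synchronizes with the
    (simulated) global clock only once that offset reaches the quantum.
    All names here are illustrative assumptions, not a SystemC API.
    """
    global_time_ns = 0
    local_offset_ns = 0
    sync_count = 0
    for _ in range(total_instructions):
        local_offset_ns += ns_per_instruction   # run ahead of global time
        if local_offset_ns >= quantum_ns:       # quantum used up: synchronize
            global_time_ns += local_offset_ns
            local_offset_ns = 0
            sync_count += 1
    global_time_ns += local_offset_ns           # flush any remainder at the end
    return global_time_ns, sync_count


lock_step = run_decoupled(1000, 10, 10)      # synchronize after every instruction
decoupled = run_decoupled(1000, 10, 1000)    # synchronize once per 100 instructions
```

In this sketch both runs reach the same simulated time, but the larger quantum needs far fewer synchronizations, which is where the speed-up for multi-core platforms would come from.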

Much work has been done in embedded software generation from transaction level description. Some examples of this are discussed in the following section.

2.3 System Simulators

Full system simulation makes it possible to run the exact binary embedded software including the operating system on a totally simulated hardware platform. The simulation environments thus need to support full system simulation, and should use some hardware modeling techniques. Moreover, the simulations should be fast to enable early software development.

The most challenging part of enhancing simulation speed is simulating the processors. Processor simulation is achieved with Instruction Set Simulation (ISS) [8]. Instruction set simulators can be:

- Interpretive ISS
- Static compiled ISS
- Dynamic compiled ISS

In the past decade, dynamic translation technology has been adopted by many ISSs [9]. The binary target code to be executed is dynamically translated into an executable representation. There are typically two variants of dynamic translation technology:

1. The target code is translated directly into machine code for the simulation host.

2. The target code is translated into an intermediate representation that can be executed at high speed.


Dynamic translation introduces a compile time phase as part of the overall simulation time. But as the resulting code is re-used, the compilation time is amortized over time.
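The amortization argument can be made explicit with a simple model. The symbols below are introduced here for illustration and are not taken from the cited work: suppose a code block costs $t_c$ to translate once and $t_e$ for each execution of the translated code. Over $n$ executions, the average cost per execution is

$$\bar{t}(n) = \frac{t_c + n\,t_e}{n} = t_e + \frac{t_c}{n},$$

which approaches $t_e$ as $n$ grows. The translation overhead therefore matters only for code that is executed a few times.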

Based on dynamic translation, some simulators have been designed.

SimSoC demonstrates an integrated simulation framework relying only upon SystemC and transaction-level modeling [10]. Its ISS uses dynamic binary translation of the second variant described above. The resulting speeds are lower than those achieved using binary translation to host machine code. Moreover, the solution does not scale well to multi-core platforms, as it makes heavy use of the time-costly wait() call to simulate cores executing in parallel. If wait() is instead called only after a large number of instructions to avoid simulation overhead, the simulation is no longer faithful enough.

Virtual machines such as QEMU and GXEMUL [11] also emulate to a large extent the behavior of a particular hardware platform. QEMU uses dynamic translation of the first variant. Although QEMU and GXEMUL include many device models written in open-source C code, these models lack interoperability. Besides, QEMU enables simulation only of fixed, predefined single-processor systems.

Several providers of virtual platform technology have also come up with their own platform-driven electronic system-level (ESL) design solutions, which promise high simulation speed and accuracy. Virtio, Simics from Virtutech, Platform Architect from CoWare and DesignWare from Synopsys are a few examples. [12] The drawback is that all of them are proprietary modeling solutions.

Imperas on the other hand provides Open Virtual Platforms, the infrastructure technology which is open source and free, focused on multi-core platform development and high simulation speed for embedded software development.

In the following chapter we discuss the basic know-how of the OVP technology, its core components, significant features and the extensions that enable it to work in SystemC platforms with TLM2.0.


Chapter 3

OPEN VIRTUAL PLATFORMS

3.1 Introduction

Imperas announced the formation of the Open Virtual Platforms (OVP) alliance and seeded it with some of its technology to serve market requirements. This includes programming models, verification/debug/analysis tools, and simulation platforms. The interfaces provided by Imperas address the model interoperability problem. The key point is that Imperas has made its technology public.

OVP has three main components [13] –

1. The OVP APIs that enable C models to be written.

2. A collection of open source processor and peripheral models.

3. OVPsim, a simulator that executes these models.

3.2 OVP APIs

To model an embedded system, there are several main items to be modeled: platforms, processors, peripherals and the environment. The platform connects and configures the behavioral components. The processors fetch and execute object code instructions from the memories, and the peripherals model the components and environment that the operating system and application software interact with. OVP is thus made of four interfaces:

- Innovative CPU Manager
- Virtual Machine Interface
- Behavioral Hardware Modeling Interface
- Peripheral Programming Model Interface

The combination of these interfaces makes up the complete platform. The interaction between these interfaces is shown in figure 3.1.

3.2.1 Innovative CPU Manager (ICM)

The ICM is a C API used to create the platform netlist of the design/system for use with the OVPsim simulator. It allows instantiation of multiple processors, buses, memories and peripherals, which can then be connected together, and application program executables can be loaded into the simulated memories. [14]

Figure 3.1 OVP Interfaces

3.2.2 Virtual Machine Interface (VMI)

VMI is the C-based processor interface, allowing the processor model to communicate with the simulation kernel and the other components of the system. VMI is at the heart of the high performance execution provided by OVP. Processors in OVP use a code morphing approach, coupled with a just-in-time (JIT) compiler, to map the processor instructions onto those provided by the host machine. In between is a set of optimized opcodes into which the processor operations are mapped, and OVPsim provides fast and efficient interpretation or compilation into the native machine capabilities. Some of the capabilities of VMI are listed below [15]:

1. VMI allows a form of virtualization for capabilities such as file I/O. This allows direct execution on the host using the standard libraries provided.

2. Encapsulating existing ISS models within OVPsim, provided that they export some basic features (for example, the existing ISS model should be available as a shared object, provide an API to allow it to be run instruction-by-instruction or for a number of instructions, and provide an API allowing memory to be modeled externally) is possible through VMI.

3. VMI enables modeling of the mode dependent behavior (kernel/user mode) of an instruction. Using the VMI, OVPsim can implement arbitrary multiprocessor systems.

4. The VMI can be used for both RISC and CISC processors. Any instruction format can

5. VMI also allows modeling of L2 caches and other extensions around the processor.

3.2.3 Behavioral Hardware Modeling (BHM) and Peripheral Programming Model (PPM)

They are used to write behavioral models of hardware/software systems which are peripheral to the processors in the platform being developed. Each instance of a peripheral model runs on its own virtual machine with an address space large enough for the model. This processor and its memory are separate from any processors, memories and buses in the platform being simulated; they exist only to execute the code of the peripheral model. This processor is called a Peripheral Simulation Engine or PSE for short [16]. The difference between PPM and BHM is:

BHM – This API gives access to

- Behavioral modeling processes (threads)
- Simulated delays
- Events
- Diagnostic control and simulator message stream.

This API can support more general forms of communication and provides the piece that TLM is missing.

PPM – This API gives access to

- Connectivity of peripherals in platforms.
- Creation and control of
  o Ports and nets
  o Address spaces
  o Windows into memory address space
- Create behavior on memory region accesses
  o Install callbacks

Thus this API understands about buses and networks and is similar in terms of functionality with the OSCI TLM interface proposal.

The BHM/PPM has concepts similar to those of SystemC, but each instance of each model exists in its own private address space. It is normally straightforward to wrap existing C functions in a BHM/PPM peripheral model.

3.3 OVP models

OVP provides processor models such as ARM7, several MIPS processors, Tensilica and the OpenRISC OR1K. A number of standard embedded devices are also modeled to allow assembly of a complete platform, including various types of memories, traps, bridges, DMA engines, and UARTs, to name a few.

OVP processor models are instruction accurate. In the realm of ISS models, instruction accurate models are approximately timed, in that they claim, on most occasions, to execute each instruction using the correct number of clock cycles and to perform their I/O operations at roughly the right place within the instruction [17]. OVP processor models, however, are instruction accurate purely in the functional space and not in the behavioral space.

To make this point clear, functional models and behavioral models are strictly defined to be different. A functional model does not include timing, although it may include sequence. A behavioral model includes timing, although the level of detail of the timing is not defined. Both kinds of model can exist at any level of abstraction. Thus the ISS models generally used by prototyping environments (e.g. the PV abstraction level of TLM compliant SystemC) are behavioral models, whereas OVP models are functional models.

Instruction accuracy in terms of OVP means that the registers hold the correct values at the end of each instruction and that the right side effects of executing that instruction are created. The models progress one instruction at a time and know nothing about multi-execution pipelines, out-of-order execution or the like.

3.4 OVPsim – The OVP Simulator

OVPsim provides infrastructure for describing multicore platforms. The OVPsim simulator can simulate arbitrary multiprocessor shared memory configurations and heterogeneous multiprocessor platforms. OVPsim is a very fast simulator. Its performance depends on several factors (for example, the processor variants used in the platform and the exact nature of the application itself), but typically speeds of hundreds of millions of simulated instructions per second can be expected.

The simulation experiments conducted for similar platforms in OVP and in one of the industry-standard proprietary modeling environments support the claim of greater speed efficiency for OVP.

Since OVPsim platform models can be compiled as shared objects, they can be encapsulated in any simulation environment that is able to load shared objects. This includes C, C++, and SystemC simulation environments. The commercial Linux based Imperas simulator supports multiprocessor debugging, which is not provided in the free Windows based simulator, and provides even higher simulation performance.


3.5 Additional Features of OVP

1. Semi-hosting:

Semihosting is the ability to provide host functionality to the simulated processor or peripheral. The semihosting library has full access to the simulated processor's registers, stack and memory space. Far more complex scenarios can be envisaged, including, for example, using the host network interface or a host USB port to give the simulated platform connectivity to the outside world. The capabilities within the semihosting library interface provided by OVP can be used to model a huge range of system functions.

2. Mapping the processor address region to external memory:

The processor address space can be explicitly specified to contain separate RAMs and ROMs. It is also possible to specify that certain address ranges will be modeled by callback functions in the ICM platform itself, which is useful for modeling memory-mapped devices. This capability is exploited in the current work to make the OVP processor work with SystemC/TLM based models, thereby establishing a hybrid simulation framework.
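The idea of routing selected address ranges to callbacks can be sketched independently of the actual ICM API. The Python toy model below does not use any OVP function; the class `AddressSpace`, the method `map_callbacks` and the chosen addresses are all hypothetical names and values introduced for illustration. Accesses that fall inside a registered range are forwarded to the device callbacks, and everything else goes to ordinary memory.

```python
class AddressSpace:
    """Toy address space: plain RAM plus callback-modeled regions.

    Illustrative only; this is an assumption-based sketch and not the
    OVP ICM interface.
    """

    def __init__(self):
        self.ram = {}          # sparse backing store: address -> value
        self.regions = []      # (low, high, read_fn, write_fn)

    def map_callbacks(self, low, high, read_fn, write_fn):
        # Register an inclusive address range handled by callbacks.
        self.regions.append((low, high, read_fn, write_fn))

    def _lookup(self, address):
        for low, high, read_fn, write_fn in self.regions:
            if low <= address <= high:
                return low, read_fn, write_fn
        return None

    def read(self, address):
        hit = self._lookup(address)
        if hit is not None:
            low, read_fn, _ = hit
            return read_fn(address - low)      # device sees a local offset
        return self.ram.get(address, 0)

    def write(self, address, value):
        hit = self._lookup(address)
        if hit is not None:
            low, _, write_fn = hit
            write_fn(address - low, value)
        else:
            self.ram[address] = value


# A memory-mapped device modeled purely by callbacks (hypothetical layout).
device_regs = {}
space = AddressSpace()
space.map_callbacks(
    0x8000, 0x80FF,
    read_fn=lambda offset: device_regs.get(offset, 0),
    write_fn=lambda offset, value: device_regs.__setitem__(offset, value),
)
space.write(0x1000, 0xAB)   # ordinary RAM access
space.write(0x8004, 0x55)   # routed to the device callback
```

In a hybrid setup of the kind described above, the callbacks would be the place where an access is handed over to a SystemC/TLM model instead of a Python dictionary.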

3. Integration with other environments:

Normally, simulators tend to want to be masters that call into other models or simulators. This creates a conflict when two simulators need to be bolted together, because neither of them really wants to relinquish control. The Imperas OVP simulator is built as a slave and is thus callable from other environments such as SystemC. The reverse is, however, not true: OVPsim cannot call a SystemC model. This is quite natural, since calling SystemC would bring the entire simulator performance back down to that of the very thing it is trying to replace. On the other hand, substituting an OVP model for part of a SystemC based platform may bring about a large performance gain in relative terms. However, Amdahl’s Law tells us that the overall gain is limited by the slowest-running piece of the system, and thus even one slow SystemC model can make the entire system crawl along at the slow rate. Putting OVP models in a SystemC environment therefore requires careful scheduling.
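The Amdahl's Law argument can be made concrete with a small calculation. The function below is a minimal sketch with hypothetical names; the fractions used are illustrative values, not measurements from this work. If a fraction p of the original simulation time is spent in the part replaced by a faster model, and that part becomes s times faster, the overall speed-up is 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speed-up when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime that is sped up (0..1)
    local_speedup: factor by which that share becomes faster
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


# Illustrative numbers only: 80% of the time in the processor model,
# which is made 100x faster; the remaining 20% stays in slow models.
overall = amdahl_speedup(0.8, 100)

# Even an arbitrarily fast processor model cannot beat 1 / 0.2 = 5x here.
upper_bound_case = amdahl_speedup(0.8, 1e12)
```

With these assumed numbers the overall gain is a little under 5x despite the 100x local speed-up, which is the diminishing-returns effect referred to above.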

OVP models and subsystems can be encapsulated in SystemC platforms and harnessed using:

- sc_clock(), i.e. at the detailed instruction or clock level
- TLM 2.0, i.e. the new OSCI transaction level approach

Both cases of integration have been tested in this work. Since modeling in pure SystemC brings down the simulation speed and is not desirable for the software development use case, we emphasize integrating OVP models at the transaction level.

3.6 WHY is OVP fast?

The OVP technology from Imperas enables the creation of fast virtual platforms for software development. It incorporates several key technologies, as a result of which virtual platforms are able to run at execution speeds of several hundred MIPS. Some of these features are described below:

1. Just in Time Code Morphing:

Conventional processor models written in an HDL or similar modeling language might be implemented by a loop that is activated by a clock signal. On activation of the system clock, the model might fetch the next instruction, decode it, and call specific functions to execute the instruction. Certain optimizations however may be performed to speed up execution.

Although models written in this conventional style can be accurate and straightforward in structure, they are not fast. Processor models designed for the Imperas tool set instead use just-in-time (JIT) code morphing technology. [15] The technique is quite similar to dynamically compiled ISS. This works as follows:

1. As each new processor instruction is encountered during program execution, the instruction is translated (morphed) into equivalent native machine code. The exact translations to be made are specified by the processor modeler using the Imperas Virtual Machine Interface (VMI) API.

2. Contiguous sections of translated processor instructions are gathered into code blocks, which are held in a dictionary for the processor. Separate dictionaries are held for supervisor mode code fragments and user mode code fragments.

3. If a processor performs a jump to a simulated address that has already been translated to a code block held in the dictionary, there is no need to perform the translation again: the simulator simply re-executes the existing code block.

Imperas technology handles the generation of native machine code and the efficient management of code blocks and dictionaries to give extremely fast simulation. This is possible because, as simulation proceeds, run time (execution of translated code blocks) dominates morph time (JIT compilation). High performance processor models are created by doing as much work as possible at morph time and as little as possible at run time.

It may be possible that not all instructions map closely to the Just-In-Time code morphing opcode set. Those that don’t can be implemented using function calls from Just-In-Time morphed code at run-time.

Such a simulation method is capable of providing speed improvements if the application under test has a portion of code used repeatedly, which in general, all the real time applications do.
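The block dictionary mechanism described above can be sketched with a toy simulator. This is only an illustration in Python and does not reflect the VMI API or the actual OVPsim implementation: the three-instruction "ISA" (`add`, `loop`, `halt`), the function names and the example program are all assumptions made for the sketch. Each straight-line run of instructions is translated once into a host-level callable and cached by its start address; jumps back to an already translated address reuse the cached block.

```python
def translate_block(program, start):
    """'Morph' the instructions from `start` up to the next terminator.

    Toy ISA (hypothetical): ("add", n), ("loop", (limit, target)), ("halt", None).
    Work done here happens once per block, at morph time.
    """
    increments, pc = [], start
    while program[pc][0] == "add":
        increments.append(program[pc][1])
        pc += 1
    opcode, arg = program[pc]          # block terminator: "loop" or "halt"
    total = sum(increments)            # fold all adds into one host operation

    def block(acc):
        # Run-time work: one addition and one branch decision.
        acc += total
        if opcode == "halt":
            return acc, None
        if acc < arg[0]:
            return acc, arg[1]         # jump back to the loop target
        return acc, pc + 1             # fall through past the loop
    return block


def run(program, entry=0):
    dictionary = {}                    # start address -> translated code block
    translations = executions = 0
    acc, pc = 0, entry
    while pc is not None:
        if pc not in dictionary:       # translate only on first encounter
            dictionary[pc] = translate_block(program, pc)
            translations += 1
        acc, pc = dictionary[pc](acc)  # re-execute the cached block
        executions += 1
    return acc, translations, executions


program = [("add", 1), ("add", 2), ("loop", (30, 0)), ("add", 5), ("halt", None)]
result = run(program)
```

For this example program the loop body is translated once and then re-executed from the dictionary on every iteration, so two translations serve eleven block executions.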

2. Program Counter Modeling:

The simulator always knows the address of the current instruction. Instead of maintaining the program counter value each time in the processor model, it is fetched directly from the simulator when required. Thus the processor models do not explicitly model the register values that are infrequently referenced and can be created easily on demand. The same is the case very often for processor status registers. This makes processor models execute at a faster rate.

3. Simulation Performance Options:

ICM_ATTR_RELAXED_SCHED:

The standard multiprocessor scheduling algorithm built-in to the simulator normally simulates each processor for exactly the number of instructions implied by the processor MIPS rate and time slice before moving on to the next processor in that time slice. Using the instance attribute ICM_ATTR_RELAXED_SCHED indicates to the scheduling algorithm that a closely-approximate number of instructions can be used for that instance. This makes simulations much faster. This could be explained in detail as follows:

The exact number of instructions for which the processor needs to execute can be calculated as:

$$\text{No. of instructions} = \text{Nominal MIPS} \times 10^{6} \times \text{Time slice (in seconds)}$$

Consider an example of a single code block containing native code implementing four simulated arithmetic instructions and one simulated jump instruction, so five simulated instructions in total. Suppose that relaxed scheduling isn't enabled and the simulation is reaching the end of a time slice, with just three instructions left to perform in that time slice, and that the next block to run is the one described above, which actually contains five instructions. In this case, the simulator won't be able to use the code block as it stands, as that would result in execution of too many instructions in this time slice. It therefore has to discard that code block and generate a new one, containing only three instructions, so that the instruction count is exactly correct at the end of the time slice. This incurs significant overhead. In relaxed scheduling mode, the simulator won't execute the code block in this time slice, and won't discard it. This is much more efficient, but it means that not quite enough instructions have been executed in the time slice (e.g. 999997 instead of 1000000). The simulator will attempt to make up the difference in the next time slice (i.e. it will try to execute 1000003 instead of 1000000 instructions next time round) so errors do not build up over time.
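The difference between exact and relaxed scheduling can be sketched with a small model. The code below is a simplified illustration with hypothetical names and is not the OVPsim scheduler: it assumes all code blocks have the same size and that every time slice starts at a block boundary. In exact mode a shortened block has to be generated whenever the slice budget is not a whole number of blocks; in relaxed mode only whole blocks are run and the shortfall is carried into the next slice.

```python
def run_slices(block_size, slice_budget, num_slices, relaxed):
    """Return (instructions executed per slice, number of re-translations).

    Simplifying assumptions (not from the OVP documentation): uniform block
    size, and each slice begins at the start of a code block.
    """
    executed_per_slice = []
    retranslations = 0
    carry = 0                              # shortfall owed from earlier slices
    for _ in range(num_slices):
        target = slice_budget + carry
        whole_blocks, leftover = divmod(target, block_size)
        if relaxed:
            executed = whole_blocks * block_size
            carry = leftover               # make up the difference next time
        else:
            executed = target              # instruction count must be exact
            if leftover:
                retranslations += 1        # a shorter block must be generated
        executed_per_slice.append(executed)
    return executed_per_slice, retranslations


exact_mode = run_slices(block_size=5, slice_budget=13, num_slices=4, relaxed=False)
relaxed_mode = run_slices(block_size=5, slice_budget=13, num_slices=4, relaxed=True)
```

In this sketch the relaxed run never re-translates a block, and because the shortfall is carried forward, the total instruction count stays within one block of the exact total rather than drifting.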

ICM_ATTR_APPROX_TIMER:

Processor models often contain countdown timers that expire after a certain number of instructions, causing an exception. Once again, modeling these timers to an exact instruction imposes a significant simulation overhead. If a closely-approximate number will do (as is usually the case, as instruction countdown timers are themselves often approximations of cycle countdown timers) simulation is much faster when the countdown counter expires frequently. Using the instance attribute ICM_ATTR_APPROX_TIMER indicates to the scheduling algorithm that a closely-approximate number of instructions can be used for countdown timer expiry.

Besides the key features mentioned above, which enable fast system simulation, OVP can also be integrated with existing SystemC based platforms. To achieve this, a wrapper must be placed around the OVP models. The next section describes the methodology that enables hybrid virtual prototyping using OVP models.

3.7 Hybrid Simulation Support for OVP

SoC designs make intensive use of various IPs. Component reuse becomes a necessity to reduce the design challenge. This requires design methodologies for inter-IP communication and implementation. This flattening of the design process can best be managed through platform based design at the transaction level. TLM2.0 provides a new level of performance and interoperability. With TLM2.0 it is possible to enable models from different vendors to work together in a virtual platform. OVP provides a C++ interface to encapsulate OVP models in the SystemC environment. New developments have been made to make OVP models work in TLM2.0 compliant SystemC platforms. The availability of SystemC TLM2.0 technology for use with OVP CPU models allows the encapsulation of OVP models in existing TLM2.0 compliant SystemC platforms, thereby addressing the model interoperability issue and enabling fast solutions for successful deployment of virtual platforms by hybrid simulation of OVP and SystemC.

3.8 Approach to TLM2.0

To integrate existing OVP models, wrappers are written around the existing code to make it compatible with the OSCI TLM APIs. The conventional OVP APIs are written in C. To build a TLM2.0-compliant SystemC wrapper, several new classes are constructed in which the conventional C routines for the models are called. These classes wrap the binaries of the OVP processor, peripheral, memory and bus models, allowing them to be exported to a simulation environment other than OVP. Once exported to the SystemC environment, these models can be controlled through the SystemC interfaces.

Of the abstraction levels provided by TLM2.0, it is loosely timed modeling that gives higher performance. It allows processes to run ahead of simulation time (temporal decoupling) and uses a quantum keeper. The wrappers are built at this abstraction level so that the models run as fast as possible. Features such as the Direct Memory Interface (DMI) provide a direct pointer to memory in the target, bypassing the sockets used in transport calls and giving the faster simulation needed for the software development use case. The processor has the option to invalidate DMI, in which case the transport calls go over the bus. The wrappers support the TLM2.0 blocking transport interface with timing annotation.
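Temporal decoupling with a quantum keeper can be sketched as follows. This is a plain Python stand-in, not SystemC code: the method names inc, need_sync and sync are modeled on the TLM2.0 quantum keeper utility, while the Kernel class, the nanosecond time unit and the run_initiator helper are assumptions made for illustration.

```python
class Kernel:
    """Hypothetical stand-in for the simulation kernel's global time."""

    def __init__(self):
        self.now = 0     # global simulation time in ns
        self.syncs = 0   # number of synchronization points


class QuantumKeeper:
    """Tracks how far an initiator has run ahead of global time."""

    def __init__(self, global_quantum_ns):
        self.global_quantum = global_quantum_ns
        self.local_time = 0   # offset ahead of the kernel, in ns

    def inc(self, delay_ns):
        self.local_time += delay_ns

    def need_sync(self):
        return self.local_time >= self.global_quantum

    def sync(self, kernel):
        # Stands in for a wait(local_time) call in a SystemC thread.
        kernel.now += self.local_time
        kernel.syncs += 1
        self.local_time = 0


def run_initiator(num_accesses, access_delay_ns, quantum_ns):
    kernel = Kernel()
    keeper = QuantumKeeper(quantum_ns)
    for _ in range(num_accesses):
        keeper.inc(access_delay_ns)   # annotate the access delay locally
        if keeper.need_sync():        # yield only when the quantum is used up
            keeper.sync(kernel)
    return kernel, keeper


kernel, keeper = run_initiator(250, access_delay_ns=10, quantum_ns=1000)
print(kernel.syncs, kernel.now, keeper.local_time)
```

With a larger quantum the initiator synchronizes with the kernel less often, which is where the loosely timed style gains its speed.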

3.9 Details of OVP inside TLM2.0

The wrapper that places OVP processor models in the TLM2.0 environment is a generic wrapper that can be extended according to the processor in use. The wrapper allows each processor to run freely for a large number of instructions rather than advancing all processors in lock-step. [18]

The generic wrapper for the processor model is a class derived from SC_MODULE. The details of the wrapper are shown in figure 3.2, and the implementation methodology in figure 3.3. To enable encapsulation at the TLM level, very basic C++ wrappers are first built that put every instance of a processor, bus, etc. inside a separate class (Processor/Bus object in figure 3.2). These classes access the core OVP functionality of the respective model. The outer SC_MODULE then instantiates objects of these processor and bus classes (CPU object in figure 3.2). A specific processor, e.g. MIPS or ARM, can then be derived from this basic processor, providing a third layer of the wrapper. With this hierarchy of wrappers, the processor module shown in figure 3.3 has bus objects instantiated inside it. This allows the OVP processor address space to be mapped to a local OVP memory/peripheral (through the OVP bus) as well as to an external memory or peripheral with a TLM2.0 target socket. When the processor is connected to an internal OVP memory or peripheral, the connection is made directly from the OVP bus. To connect to an external memory/peripheral, a portion of the address space of the local OVP bus, which is directly connected to the OVP processor, is bridged to another bus (the TLM bus in the figure) on which read/write callbacks are registered. Initiator sockets are opened on the processor model. Any access to the TLM bus address space that is mapped to an external memory/peripheral triggers these read/write callback functions on the TLM bus. The callback functions then create the appropriate transaction request and forward the transport call with its generic payload over the initiator sockets.

Fig 3.2 Wrapper for Processor Model

This generic wrapper around CPU models is used in a processor-configuration-specific layer to create specific processor wrappers, such as those for ARM or MIPS, which are then instantiated in the SystemC platform.

On encountering an instruction that performs a load from or a store to a memory location on the bus, the processor calls a function in the wrapper code, which in turn issues the necessary blocking transactions on the bus.

Wrappers for the peripheral models are constructed in a similar fashion, using read/write callbacks registered on the bus connected to the peripheral model within an SC_MODULE. The TLM2.0 wrapper also provides a bus decoder with a configurable number of initiator and target sockets, which forwards a transaction arriving on a target port to the proper initiator port based on the bus address map.
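The address-map-based forwarding done by the bus decoder can be sketched in a few lines. This is a simplified Python model, not the wrapper's actual code: the Memory and BusDecoder classes, the dictionary used as a stand-in for the generic payload, and the response strings are all assumptions made for illustration.

```python
class Memory:
    """Hypothetical target: a byte array with a blocking transport method."""

    def __init__(self, size):
        self.data = bytearray(size)

    def b_transport(self, trans):
        addr, n = trans["address"], trans["length"]
        if trans["command"] == "write":
            self.data[addr:addr + n] = trans["data"]
        else:
            trans["data"] = bytes(self.data[addr:addr + n])
        trans["response"] = "ok"


class BusDecoder:
    """Forwards each transaction to the target that owns its address."""

    def __init__(self):
        self.regions = []   # list of (base, size, target)

    def map(self, base, size, target):
        self.regions.append((base, size, target))

    def b_transport(self, trans):
        addr = trans["address"]
        for base, size, target in self.regions:
            if base <= addr < base + size:
                trans["address"] = addr - base   # target sees a local offset
                target.b_transport(trans)
                trans["address"] = addr          # restore for the initiator
                return
        trans["response"] = "address_error"


bus = BusDecoder()
program_memory = Memory(0x100)
output_memory = Memory(0x100)
bus.map(0x0000, 0x100, program_memory)
bus.map(0x1000, 0x100, output_memory)

write = {"command": "write", "address": 0x1004, "length": 4,
         "data": b"\x01\x02\x03\x04"}
bus.b_transport(write)
print(write["response"], bytes(output_memory.data[4:8]))
```

An access that falls outside every mapped region is answered with an error response rather than being forwarded.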

[Figure 3.2 layers, outermost first: MIPS/ARM Processor Object (Layer 3); CPU Object (Layer 2); Processor Object and Bus Object (Layer 1); Processor Model and Bus Model (OVP)]

Fig 3.3 Processor Wrapper Implementation

The SystemC environment thus calls the OVP simulator through this wrapper. The two simulators must be properly synchronized for the models in the platform to work correctly. When the simulation starts, each processor runs from a SystemC thread. The thread executes IPQ (instructions per quantum) instructions on the processor without advancing SystemC time, where:

$$\mathrm{IPQ} = \text{Nominal MIPS} \times \text{Quantum size}$$

Here the nominal MIPS rate is taken in instructions per second and the quantum size in seconds. The function call asking the processor to simulate IPQ instructions is made to the OVP environment through the wrapper. When the allotted instructions have completed, the thread calls SystemC wait() to advance time. The OVP simulator synchronizes with the SystemC simulation kernel every time a quantum is over. Thus each processor executes a number of instructions at a time in a round-robin schedule.
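The quantum-based round-robin schedule can be sketched as below. This is an illustrative Python model, not the wrapper code: the function names are hypothetical, and the factor of one million assumes that the nominal MIPS rate is given in millions of instructions per second and the quantum in seconds.

```python
def instructions_per_quantum(nominal_mips, quantum_sec):
    # Assumption: nominal_mips is in millions of instructions per second.
    return round(nominal_mips * 1_000_000 * quantum_sec)


def simulate(total_instructions, nominal_mips, quantum_sec):
    """Run every processor for IPQ instructions per quantum, round-robin.

    total_instructions holds one instruction count per processor.
    Returns (IPQ, number of quanta, simulated time in seconds).
    """
    ipq = instructions_per_quantum(nominal_mips, quantum_sec)
    remaining = list(total_instructions)
    quanta = 0
    while any(r > 0 for r in remaining):
        for cpu in range(len(remaining)):        # each processor in turn
            remaining[cpu] = max(0, remaining[cpu] - ipq)
        quanta += 1                              # stands in for wait(quantum)
    return ipq, quanta, quanta * quantum_sec


ipq, quanta, sim_time = simulate([25_000, 10_000], nominal_mips=100,
                                 quantum_sec=0.0001)
print(ipq, quanta, sim_time)
```

In this model simulated time advances only at quantum boundaries, so the simulated time is always a whole number of quanta.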

Based on this background, a wrapper is prepared to enable OVP models to communicate with Open SCML based models. The details of the wrapper and the experiments done with it are presented in chapter 6. The following chapter presents the simulation performance experiments done with OVP.

[Figure 3.3 components: SC_MODULE containing OVP Processor, OVP Bus, Bus Bridge, TLM Bus and TLM2 Initiator Socket]

CHAPTER 4

SIMULATION PERFORMANCE EXPERIMENTS

Imperas claims that its solutions simulate platforms consisting of one or more processors running real-time applications at speeds of hundreds of MIPS, which is what today's embedded software development environments need. To validate this claim made for OVP, we compared the simulation statistics of similar platforms constructed using OVP and another modeling technology. The proprietary virtual prototyping solution provided by a leading ESL vendor was chosen for comparison against Open Virtual Platforms technology. Similar single and dual core platforms were constructed in the different environments and their simulation statistics compared.

4.1 Application under test - JPEG Decoder

To simulate the platforms, a suitable application must be chosen to execute on the processor. The application should place a high workload on the processor. A baseline JPEG decoder is chosen as the benchmark application for our simulation framework.

JPEG (Joint Photographic Experts Group) is a widely used image compression technique. It is used in image processing systems such as copiers, scanners and digital cameras. A JPEG decoder reconstructs image data from a stream of compressed image data by applying a series of transformations to it. The baseline method forms the basic coding method for all DCT-based JPEG decoders, which makes it an interesting decoding method; for that reason it was selected for implementation in this project. The JPEG decoder is a streaming multimedia application that has a degree of parallelism and consists of 5-6 tasks.

The JPEG decoding process is depicted in Figure 4.1. Before explaining the operations performed by the decoder, we look at the encoder. The JPEG encoder divides an image into blocks of 8 by 8 pixels which, when placed in the right order, form the original image. The encoder applies a number of operations to each of these blocks: a discrete cosine transform, quantization, zigzag scan and variable length encoding. The result of these operations is a compressed image.

Fig 4.1 JPEG Decoder

The decoder reverses the transformations applied by the encoder to the image data. It takes the compressed image data as input and applies the following operations in order:

1. Variable Length Decoding (VLD)
2. Zigzag scan (ZZ)
3. De-quantization (DQ)
4. Inverse Discrete Cosine Transform (IDCT)
5. Color Conversion
6. Reordering

The decoder then obtains the reconstructed bitmap image. The compressed image data forms a byte stream input for the decoder. This byte stream contains so called markers. A marker is a two-byte combination, which identifies a structural part of the compressed image data. The incoming bit stream is parsed to get header information and image data, based on the markers and various transformations are then applied. Details about different transformations and markers can be found in [19].
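Marker-based parsing of the header can be sketched as follows. This is a minimal illustration, not the decoder used in this work: it relies on the standard JPEG layout in which a marker is the byte 0xFF followed by a marker code, and each header segment carries a two-byte big-endian length that includes the length field itself. The function name and the small marker table are assumptions made for illustration.

```python
MARKER_NAMES = {
    0xD8: "SOI",    # start of image
    0xC0: "SOF0",   # baseline frame header
    0xC4: "DHT",    # Huffman table
    0xDB: "DQT",    # quantization table
    0xDA: "SOS",    # start of scan
    0xD9: "EOI",    # end of image
}


def scan_header_markers(stream):
    """Return (name, payload) for each header segment up to start of scan."""
    segments = []
    pos = 0
    while pos + 1 < len(stream):
        if stream[pos] != 0xFF:
            raise ValueError("expected marker prefix 0xFF at offset %d" % pos)
        code = stream[pos + 1]
        name = MARKER_NAMES.get(code, "0x%02X" % code)
        pos += 2
        if code == 0xD8:                 # SOI carries no payload
            segments.append((name, b""))
            continue
        # The length field counts its own two bytes.
        length = int.from_bytes(stream[pos:pos + 2], "big")
        segments.append((name, bytes(stream[pos + 2:pos + length])))
        pos += length
        if code == 0xDA:                 # entropy-coded data follows SOS
            break
    return segments


sample = bytes([0xFF, 0xD8,
                0xFF, 0xDB, 0x00, 0x04, 0xAA, 0xBB,
                0xFF, 0xC4, 0x00, 0x03, 0xCC,
                0xFF, 0xDA, 0x00, 0x02,
                0x12, 0x34])
for name, payload in scan_header_markers(sample):
    print(name, payload.hex())
```

The sample stream is synthetic and far shorter than a real image; it only exercises the segment walking.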

4.2 Application Task Graph Mapping for Dual Core Platform

Besides the single core platform, homogeneous multi-core platforms using the MIPS processor are also constructed, on which the same JPEG decoder application is tested. The study is limited to dual core systems but could be extended to more cores depending on the workload of the application. To execute the same application on two processors, the tasks must be partitioned between them so that each processor has an almost equal computation and communication load. [20]

[Figure 4.1 blocks: Compressed Image Data; JPEG decoder (VLD, ZZ, DQ, IDCT, Color Conversion, Re-order); Reconstructed Image]

As seen in Figure 4.1, the tasks in the decoder are performed one after the other, so the processor cores in the platform are also active one after the other. The architecture of the dual core system therefore looks as shown in Figure 4.2.

Fig 4.2 Dual Core System Architecture

To select a proper task partitioning, the application is studied carefully to find the match between the JPEG decoder and the dual core platform.

The compressed image data enters the twin-processor system over connection 1 and is fed to the VLD in the JPEG decoder. Therefore, the VLD must be placed on the first processor. As seen in Figure 4.1, re-ordering is connected to the output, which is connection 3 in the system, so it is mapped to core 2. To divide the ZZ, DQ, IDCT and color conversion over the two cores, the data consumption and production rates of the various parts of the system are examined. The VLD consumes data from the outside world and produces data in blocks. The zigzag scan, de-quantization and IDCT also consume and produce one block at a time. The color conversion and re-ordering require one or more (up to 10) blocks before they can run. The color conversion, however, produces data on a block-by-block basis and sends it to the re-ordering unit, which then produces the output data. This implies that the communication over connection 2 of the dual core system is always in blocks. Thus every division of the JPEG decoder over two cores requires the same data rate: the subdivision of the JPEG decoder does not influence the communication load of the system.

Still, for a proper partitioning of the decoder over the two cores, the computation load on the two cores must be more or less the same. This enables core 2 to start as soon as core 1 has produced one block. The measured system load for the various parts of the JPEG decoder is shown in Table 4.1. The table shows that partitioning just before or just after the IDCT function is the easiest to realize.

Partitioning just before the IDCT gives an almost 50-50 load sharing between the two cores. It also has the advantage that the Huffman decoding and de-quantization tables required by the VLD and DQ units respectively do not need to be shared by both processors. Based on this


task partitioning, the data flow among the two processors in the system is shown in Figure 4.3.

Part      Task in JPEG decoder    Computation load (% of total load)
CORE 1    VLD                     35
CORE 1    ZZ                      5
CORE 1    DQ                      10
CORE 2    IDCT                    20
CORE 2    Color Conversion        15
CORE 2    Re-ordering             15

Table 4.1 System Load distribution for JPEG Decoder
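The choice of cut point can be checked with a short calculation over the loads in Table 4.1. The sketch below only illustrates the balancing argument; the function name is hypothetical, and it considers computation load alone, since the communication load is the same for every cut.

```python
# Computation load per task, in percent of the total, from Table 4.1.
TASK_LOAD = [
    ("VLD", 35),
    ("ZZ", 5),
    ("DQ", 10),
    ("IDCT", 20),
    ("Color Conversion", 15),
    ("Re-ordering", 15),
]


def best_pipeline_cut(tasks):
    """Try every cut of the task chain and return the most balanced one.

    Returns (imbalance, cut index, load on core 1, load on core 2).
    """
    best = None
    for cut in range(1, len(tasks)):
        load1 = sum(load for _, load in tasks[:cut])
        load2 = sum(load for _, load in tasks[cut:])
        imbalance = abs(load1 - load2)
        if best is None or imbalance < best[0]:
            best = (imbalance, cut, load1, load2)
    return best


imbalance, cut, load1, load2 = best_pipeline_cut(TASK_LOAD)
print([name for name, _ in TASK_LOAD[:cut]], load1, load2)
```

For these loads the most balanced cut places VLD, ZZ and DQ on core 1 and the remaining tasks on core 2.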

Fig 4.3 Partitioning for JPEG Decoder

4.3 Single Core Platform

To study the simulation performance, platforms with the same configuration are constructed for the following cases:

• Pure OVP simulation framework
• ESL vendor supplied virtual prototyping framework
• Hybrid OVP + SystemC in OVP simulation framework
• Hybrid OVP + Open SCML in OVP simulation framework

[Figure 4.3 blocks: Input: JPEG Image; Proc 1 (VLD, ZZ); Image Properties and Blocks; Proc 2 (IDCT, Color Conv., Re-order); Output: BITMAP image]

For simplicity, dedicated peripherals are not added to the platform. To keep the comparison fair, the same variant of the processor model must be used in all cases; we have chosen the Instruction Accurate (IA) MIPS32_24Kc processor model. The details of the platform can be seen in Figure 4.4. The program memory shown in the figure stores the executable binary of the application, in the standard Executable and Linkable Format (.elf). The input and output image memories hold the compressed image data and the final reconstructed image data after decoding, respectively. Loading and storing of the image in memory is automated in the platform. Once the image is loaded, the processor executes the application, which reads the data from the input image memory. The quantization and Huffman tables needed for the various transformations during decoding are present in the image. As the image is read byte by byte from the memory, these tables are read, based on the markers found, and stored in storage local to the processor. They are then used to decode the image pixel data, and the final reconstructed image data is obtained in Sun raster image format. This image is stored in the output image memory.

The hybrid single core platform, in which the OVP processor model is wrapped in a TLM2.0-compliant SystemC wrapper, is also tested. In such platforms, the wrapped OVP processor interacts with simple TLM2.0 target memories. This is done by connecting the processor model to a SystemC-based bus decoder that has TLM2.0 target and initiator sockets. The bus decoder decodes the incoming address and, based on it, forwards the transaction to one of its initiator ports, which is connected to the TLM2.0 target socket of the memory.

Fig 4.4 Single Core Platform

[Figure 4.4 components: MIPS32_24Kc Processor Model (IA), Bus Decoder, Program Memory, Input image Memory, Output image Memory]

4.4 Backdoor Mode simulations

In all cases, simulations have been carried out in backdoor mode. This is a way of accessing a memory/peripheral in which the transaction request does not actually go over the bus. In the pure OVP environment, the processor accesses the memory through a pointer. The proprietary solution also provides an option for simulation in backdoor mode. In the hybrid simulation case, the Direct Memory Interface (DMI) support provided by TLM2.0 enables backdoor mode simulation. DMI provides a means by which an initiator can get direct access to an area of memory owned by a target, thereafter accessing that memory through a direct pointer rather than through the transport interface. This offers a potential increase in simulation speed for memory accesses between initiator and target models. Figure 4.5 shows how DMI works. Once established, DMI bypasses the normal path of multiple transport calls (blocking transport in the current framework) from the initiator through interconnect components to the target.
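The difference between a transport access and a backdoor access can be sketched as below. This is a simplified Python analogy, not TLM2.0 code: a memoryview plays the role of the direct pointer, the class and method names are hypothetical (get_direct_mem_ptr only borrows the name of the TLM2.0 call, not its signature), and the counter exists purely to show which path an access takes.

```python
class TargetMemory:
    """Hypothetical target that offers both transport and direct access."""

    def __init__(self, size):
        self.storage = bytearray(size)
        self.transport_calls = 0

    def b_transport(self, command, address, data=None, length=0):
        self.transport_calls += 1
        if command == "write":
            self.storage[address:address + len(data)] = data
            return None
        return bytes(self.storage[address:address + length])

    def get_direct_mem_ptr(self):
        # A memoryview shares the underlying buffer, like a raw pointer.
        return memoryview(self.storage)


class Initiator:
    def __init__(self, target, use_dmi):
        self.target = target
        self.dmi = target.get_direct_mem_ptr() if use_dmi else None

    def write(self, address, data):
        if self.dmi is not None:
            self.dmi[address:address + len(data)] = data    # backdoor path
        else:
            self.target.b_transport("write", address, data=data)

    def read(self, address, length):
        if self.dmi is not None:
            return bytes(self.dmi[address:address + length])
        return self.target.b_transport("read", address, length=length)

    def invalidate_dmi(self):
        self.dmi = None   # later accesses fall back to transport calls


memory = TargetMemory(16)
cpu = Initiator(memory, use_dmi=True)
cpu.write(4, b"\x01\x02")
print(cpu.read(4, 2), memory.transport_calls)
cpu.invalidate_dmi()
print(cpu.read(4, 2), memory.transport_calls)
```

While the direct pointer is valid no transport call is made at all; after invalidation every access is counted by the target again.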

[Figure 4.5 labels: OVP MIPS32_24K CPU, Wrapper, Memory, Direct Memory Interface]

Fig 4.5 Backdoor Memory Access

4.5 Dual Core Platform

Based on the task partitioning explained in section 4.2, the application was split into two parts, each executed on a separate processor. The platform constructed to simulate such a system is illustrated in Figure 4.6. Each processor has its own program memory, which contains the executable in .elf format. In the current framework, the cores communicate with each other via shared memory. This shared memory is used to transfer the necessary information between the processors: the image properties (parameters such as image size, number of components and sampling rate) and the blocks passed from core 1 to core 2 for IDCT computation. To maintain correctness of the application, the two processors synchronize through a polling mechanism in which a semaphore in shared memory is constantly polled by both processors. Processor 1 reads the input image from memory. Each time processor 1 generates an 8x8 block after de-quantization, it places the block in the shared memory and sets the semaphore high. It then waits for the semaphore to be set back to low, which is done by processor 2; processor 1 writes a block to the shared memory only when the semaphore is low. Processor 2, on the other hand, waits for the semaphore to go high. When it finds the semaphore high, it reads the block from the shared memory and resets the semaphore to low so that processor 1 can write the next block. Processor 2 then performs the IDCT on the block. When a sufficient number of blocks has been obtained, color conversion and re-ordering are performed and the reconstructed data is stored in the output memory.
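The semaphore handshake can be sketched with two Python generators scheduled in round-robin quanta. This is a toy model, not the application code: each generator step stands in for one unit of work, the function names and the shared dictionary are assumptions made for illustration, and the poll counter only shows how many steps are spent waiting on the semaphore.

```python
def producer(shared, blocks):
    """Processor 1: write a block only when the semaphore is low."""
    for block in blocks:
        while shared["sem"]:
            yield "poll"              # consumer has not cleared it yet
        shared["block"] = block
        shared["sem"] = True          # signal: a block is available
        yield "write"


def consumer(shared, count, out):
    """Processor 2: read a block only when the semaphore is high."""
    while len(out) < count:
        while not shared["sem"]:
            yield "poll"              # nothing to read yet
        out.append(shared["block"])
        shared["sem"] = False         # allow the next write
        yield "read"


def run(blocks, steps_per_quantum):
    shared = {"sem": False, "block": None}
    out = []
    active = [producer(shared, blocks), consumer(shared, len(blocks), out)]
    polls = 0
    while active:
        for proc in list(active):                 # round-robin over cores
            for _ in range(steps_per_quantum):    # one quantum of work
                try:
                    if next(proc) == "poll":
                        polls += 1
                except StopIteration:
                    active.remove(proc)
                    break
    return out, polls


print(run([1, 2, 3], steps_per_quantum=1))
print(run([1, 2, 3], steps_per_quantum=4))
```

In this toy model a longer quantum leaves each core more steps in which it can do nothing but poll, while the blocks still arrive in the correct order.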

Fig 4.6 Dual Core Platform

Based on this description the next chapter presents the results of our experimentation.

[Figure 4.6 components: MIPS32_24K Core 1 (IA), Bus 1, Program Memory, Input Memory, MIPS32_24K Core 2 (IA), Bus 2, Output Memory, Program Memory, Bus 3, Shared Memory]

Chapter 5

RESULTS AND ANALYSIS

The image chosen for the experimentation is a 640 x 480 color JPEG image, shown in Figure 5.1. Performance was evaluated on a server with the following specification:

Processor: AMD Opteron Processor 252
CPU MHz: 2592.406
Cache Size: 1024 KB
RAM Size: 2 GB

Figure 5.1 Input Image

The server used for our experimentation is a dual core machine. The results were not taken on a dedicated server, and other jobs on the server were not restricted. The results of the experiments carried out for single and dual core platforms in OVP and CoWare are described below.

5.1 Single Core Platform

During simulation in OVP, the processor executes IPQ (instructions per quantum) instructions and then the quantum advances:

$$\mathrm{IPQ} = \text{Nominal MIPS} \times \text{Quantum size}$$

Thus the IPQ is affected by both the nominal MIPS rate of the processor and the quantum size in the platform. We consider them one by one. The effect of changing the nominal MIPS is shown in Figure 5.2.

Varying Nominal MIPS (Quantum size = 1 ms)

Analysis: Increasing the nominal MIPS rate of the processor in a single core system allows the processor to execute more instructions per quantum before the quantum advances. This reduces the simulated time, because fewer quanta are needed to execute the same total number of instructions. The elapsed time (wall clock time) remains more or less the same, as the same total number of instructions is executed every time. Changing the nominal MIPS is thus much like changing the operating frequency of the processor, making it work faster. For the current experimentation the nominal MIPS is fixed at 100.

Figure 5.2 Time variation with Nominal MIPS for single core system

Varying Quantum Size (Nominal MIPS =100)

The effect of changing the quantum value is illustrated in Figure 5.3. For OVP platforms, the quantum is changed in the platform. The simulation time is always an integral multiple of the quantum, which requires selecting a proper quantum value. However, the experimental results show that for a fixed nominal MIPS, the simulation time remains more or less the same as the quantum size varies. Too low a quantum value, on the other hand, causes greater overhead on the simulator, increasing the elapsed time. The default quantum value of the proprietary experimental platform is 0.1 ms, and we set the quantum in our OVP platforms to the same value.

Thus for the simulations, we choose:

Processor Nominal MIPS/ Frequency = 100

Quantum size = 0.1 millisecond

[Figure 5.2 plot: Time (sec) versus Nominal MIPS, titled "Variation with Nominal MIPS (Quantum = 1 msec)"]

Figure 5.3 Time variation with Quantum size for single core system

Based on these parameters, the simulation statistics for the platforms in the three scenarios are shown in Table 5.1.

                                 OVP                         OVP + TLM2                  ESL vendor
Simulated instructions/cycles    181,956,786 instructions    181,956,786 instructions    206,505,533 cycles
Simulation time                  1.82 sec                    1.82 sec                    2.064 sec
Simulation speed                 380-400 MIPS                380-400 MIPS                93-101 MIPS
Speed-up                         3-4                         3-4                         1

Table 5.1 Speed Comparison for single core system

Analysis: The table above shows that OVP provides a simulation speed improvement by a factor of 3 to 4 over the virtual prototyping environment used for comparison. Also, when switching from the pure OVP environment to the OVP-TLM2.0 environment, there is no drop in simulation speed as long as all accesses take place using the DMI hint. The range of simulated MIPS presented here represents the simulated instruction rate, i.e. the number of instructions simulated per wall clock second. The range is taken over multiple iterations because the simulations are not carried out on a dedicated server, so the elapsed time varies slightly with server load.
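The relation between the quantities in Table 5.1 can be checked with a short calculation. The two helper functions below are hypothetical; they assume that the nominal MIPS rate fixes the simulated time and that the simulated MIPS figure is instructions per wall clock second, as defined above. The elapsed times they produce are derived values, not measurements reported in this work.

```python
def simulated_time_sec(instructions, nominal_mips):
    # Time seen by the simulated processor at its nominal MIPS rate.
    return instructions / (nominal_mips * 1_000_000)


def elapsed_time_sec(instructions, simulated_mips):
    # Wall clock time implied by a given simulated instruction rate.
    return instructions / (simulated_mips * 1_000_000)


INSTRUCTIONS = 181_956_786   # single core OVP run, Table 5.1

print(round(simulated_time_sec(INSTRUCTIONS, 100), 2))
print(round(elapsed_time_sec(INSTRUCTIONS, 400), 3),
      round(elapsed_time_sec(INSTRUCTIONS, 380), 3))
```

At a nominal rate of 100 MIPS the first function reproduces the 1.82 sec simulation time listed for the OVP platforms.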

5.2 Dual Core Platform

The dual core platform described in Chapter 4 was also tested in all three situations.

[Figure 5.3 plot: "Variation with quantum", Time (sec) versus Quantum (sec)]

Currently, synchronization between the processors is maintained by polling the shared memory. Experiments were done to find the appropriate quantum value to be used for comparison and to study the effect of changing the processor nominal MIPS. Figure 5.4 shows the variation with nominal MIPS and Figure 5.5 the variation with quantum size.

Analysis: This is a dual core application with a strict data dependency between the cores, in which both processors continuously poll the memory to get the pixel block. A tight synchronization is maintained between them. As the nominal MIPS increases, the elapsed time would be expected to stay constant, since the total number of instructions in the application should remain the same. However, we find an increasing trend. This is because the number of idle instructions spent polling the semaphore increases with the MIPS rate, which causes the elapsed time to increase linearly. The total number of quanta required to complete the application is still the same, so the simulation time does not change. It is therefore appropriate to choose a low nominal MIPS value.

Figure 5.4 Time variation with Nominal MIPS for dual core system

The timing variation with quantum value shown in figure 5.5 also yields some interesting results. The number of instructions executed per quantum is also controlled by the quantum size, so the polling overhead can be reduced by limiting the number of idle wait instructions executed per quantum. Reducing the quantum value from 0.1 millisecond to 10 microseconds reduces both the simulation time and the elapsed time, owing to the reduced polling overhead.

[Figure 5.4 plot: Time (seconds) versus Nominal MIPS]

[Figure 5.5 plot: Simulation Time and Elapsed Time, Time (sec) versus Quantum Duration (sec)]

Figure 5.5 Time variation with Quantum size for dual core system

It is therefore desirable to work with a low quantum value. However, too low a quantum value causes considerable switching overhead on the simulator: the greater the number of context switches, the more often the OVP simulator synchronizes with the SystemC simulator. We arrive at a quantum of 0.000007 seconds, where the simulation time is near the real time the application would take to execute without a large polling overhead.

Thus for the simulations, we choose:

Processor Nominal MIPS/ Frequency = 100

Quantum size = 7 microseconds

The simulation statistics are shown in Table 5.2.

                                 OVP                         OVP + TLM2                  ESL vendor
Simulated instructions/cycles    360,245,629 instructions    360,245,629 instructions    242,248,910 cycles
Simulation time                  1.79 sec                    1.84 sec                    2.42 sec
Simulation speed, CORE 1         130-150 MIPS                120-130 MIPS                58-60 MIPS
Simulation speed, CORE 2         130-150 MIPS                120-130 MIPS                58-60 MIPS
Speed-up                         2.2-2.5                     2-2.3                       1

Table 5.2 Speed Comparison for dual core system

From Table 5.2 we observe that, although the speed improvements are not as high as for the standalone single core systems, the OVP simulations are still faster. Also, when working with OVP models in the TLM2.0 environment, a slight drop in simulation performance is observed, generally in the range of 3-4%. The reduced speed improvements are due to the synchronization needed between the two processors. Every quantum, control switches between the processors, which causes a small simulation overhead. Higher speed improvements are still possible at a higher quantum value; the polling overhead in that case can be worked around by implementing a pipelined JPEG decoder.

The next chapter describes the experiments carried out for hybrid simulation of OVP with SCML.


Chapter 6

EXPERIMENTATION FOR HYBRID SIMULATION WITH SCML

6.1 Proposed Wrapper for Hybrid Simulation of OVP, SystemC and SCML

Open SystemC Modeling Library (SCML) is a modeling methodology provided by CoWare for the creation of highly-reusable SystemC TLM peripherals. SCML helps separate TLM communication, storage, timing, and behavior within the peripheral model, making code more modular and more efficient to develop and test.

The proposed methodology for hybrid simulation of OVP processor models with Open SCML based models builds on the concept of wrappers for TLM models. The TLM2.0-compliant SystemC wrapper for the processor models enables them to access SCML based memories and peripherals. The openly available SCML modeling technology is not yet fully TLM2.0 compliant. The original modeling technology supports binding of SCML memories to either a PV port or scml_post_port. Neither of these interfaces is generic enough in its request and response structure to readily support model interoperability, which is what the TLM2.0 standard aims to achieve by putting all transaction request/response parameters in one tlm_generic_payload. Since OVP models can successfully interact with SystemC models that have TLM2.0 based communication interfaces, an attempt has been made to put SCML memories in a TLM2.0-compliant wrapper.

The proposed wrapper to integrate SCML memories is shown in figure 6.1. The SCML memories are encapsulated in a class derived from sc_module, and TLM2.0 target sockets are added. The memory is bound to the other models in the platform through these sockets. To perform memory read/write operations, blocking transport callbacks are registered on these sockets; they directly call a memory access function, thereby reading data from or writing data to the memory and triggering any other callbacks registered on memory reads/writes. The wrapper constructed around SCML memories uses TLM2.0 transport calls without a direct memory pointer. This prevents the processor from bypassing the socket, so each read/write request actually travels over the bus before reaching the memory, which causes some simulation overhead. However, when SCML is used for modeling peripherals whose registers are accessed only occasionally, a transport call to SCML memory will not cause much drop in simulation speed, since only a small fraction of the total transactions actually goes over the bus. The experiments carried out on such integration and their important findings are reported in later sections.
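The structure of such a wrapper can be sketched in Python as below. This is not SCML or SystemC code: the WrappedMemory class, its word-indexed storage, the dictionary payload and the callback registration method are all assumptions made for illustration. The sketch only shows the two properties described above: every access arrives through the blocking transport call, and a write can trigger a further callback.

```python
class WrappedMemory:
    """Hypothetical target wrapper: transport-only access, no direct pointer."""

    def __init__(self, num_words):
        self._words = [0] * num_words
        self._on_write = {}        # word index -> callback
        self.transport_calls = 0

    def register_write_callback(self, index, callback):
        self._on_write[index] = callback

    def get_direct_mem_ptr(self):
        return None                # a direct memory pointer is never granted

    def b_transport(self, payload):
        self.transport_calls += 1
        index = payload["address"]
        if payload["command"] == "write":
            self._words[index] = payload["data"]
            callback = self._on_write.get(index)
            if callback is not None:
                callback(payload["data"])   # behavior attached to the write
        else:
            payload["data"] = self._words[index]
        payload["response"] = "ok"


seen = []
memory = WrappedMemory(8)
memory.register_write_callback(2, seen.append)

memory.b_transport({"command": "write", "address": 2, "data": 7})
read = {"command": "read", "address": 2}
memory.b_transport(read)
print(read["data"], seen, memory.transport_calls)
```

Because no direct pointer is handed out, the call counter grows with every access, which is the overhead that stays small only when the registers are touched rarely.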

Figure 6.1 Hybrid OVP/SCML simulation

With the development of TLM2.0-compliant SystemC wrappers, it has thus become possible to integrate OVP processor models with SCML based peripherals/memories, but the accesses are still made by actual transport calls over the bus. Backdoor mode accesses, which provide fast simulation performance, are currently not supported. Also, during the course of experimentation, some synchronization constraints came to light, which are discussed further below.

6.2 Initial Experimentation

To test the feasibility of such a hybrid simulation, a very basic single core system executing a simple application was first constructed. The data and stack memories in this system were replaced by SCML based memories, each with a TLM2.0 target socket on which blocking transport callbacks were registered. However, as already noted, a memory pointer cannot be obtained for SCML memories to enable direct memory access and fast simulation; therefore some memories, such as the program memory, whose pointer is passed to the processor to load the application executable, were still modeled as simple TLM2.0 memories. With this setup, successful simulations were carried out, demonstrating the possibility of hybrid simulation of OVP and TLM2.0-compliant SCML models.


6.3 Integrating SCML modeled SystemC TLM peripheral

To gain further insight into the possibility of hybrid simulation, we decided to incorporate an SCML modeled peripheral into the system rather than replacing the memories, which are pure slave models. The motivation behind this was:

1. To perform hybrid simulation with non-DMI based SCML models without causing much simulation overhead.

2. To look into the synchronization aspects of integrating OVP models with SCML based master models.

Replacing simple memories with SCML based ones would cause all memory read/write transport calls to actually go over the bus and would bring down the simulation speed, defeating the purpose of prototyping for the high-end software development use case. However, when working in non-DMI mode for only some of the peripherals in the system, the overhead is minimal, as there are only a limited number of accesses to the peripheral registers. Keeping this in mind, we changed the dual core system built in the second experiment of Chapter 4 to support interrupt-driven inter-processor communication rather than polling of memory. For this purpose, a hardware based IPC block is modeled in SCML. The block diagram of the IPC is shown in figure 6.2. The IP has a number of interrupt and semaphore registers. Semaphores can be claimed and released through the target ports to generate interrupts at the output ports of the model, which are connected to the interrupt inputs of the processors through sc_signal.

Figure 6.2 Inter-Processor Communication Block

[Figure 6.2 blocks: Target Port 0 and Target Port 1; Interrupt 0 Registers (Status 0 Register, Enable 0 Register) and Interrupt 1 Registers (Status 1 Register, Enable 1 Register); semaphores SEM 0, SEM 1, SEM 2, ...; outputs Intreq 0 and Intreq 1]
