MULTICORE SOLUTIONS - Design and evaluation of a VLIW processor for real-time systems

Multicore architectures for real-time systems usually combine some of the previously cited approaches, interconnected via a real-time network-on-chip. Networks-on-chip and multicore memory hierarchies are not the subject of this thesis, but two projects are relevant: Merasa (UNGERER et al., 2010) and T-CREST (SCHOEBERL et al., 2015).

3.7.1 The Merasa project

Ungerer et al. (2010) describe the Merasa architecture. The main focus of this project is to develop a multicore processor design for hard real-time embedded systems, together with analysis tools.

Merasa connects modified versions of CarCore (MISCHE et al., 2010) processors. The number of cores is configurable from one to eight. The cores are connected by a system bus to the memory interface and to a central, dynamically partitioned cache. This cache has multiple banks and is also connected to the memory interface; typically, each cache bank is allocated to a different core. The system bus is scheduled by a hybrid priority and time-slice (TDM, Time Division Multiplexing) algorithm.
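
As an illustration of why time-sliced arbitration is attractive for timing analysis, the sketch below models a plain round-robin TDM bus. It assumes one fixed-length slot per core and a request that can only start at the beginning of its core's slot. This is a simplification: the hybrid priority and time-slice scheme used by Merasa is not reproduced here, and the function names and parameters (`num_cores`, `slot_cycles`) are hypothetical.

```python
def tdm_owner(cycle, num_cores, slot_cycles):
    """Return the index of the core that owns the bus at `cycle`."""
    return (cycle // slot_cycles) % num_cores


def tdm_wait(cycle, core, num_cores, slot_cycles):
    """Cycles a request issued at `cycle` waits until the next slot start of `core`.

    Assumption: a transfer may only begin at the first cycle of the core's slot.
    """
    period = num_cores * slot_cycles      # length of one full TDM round
    slot_start = core * slot_cycles       # offset of this core's slot in the round
    return (slot_start - cycle % period) % period


def worst_case_wait(num_cores, slot_cycles):
    """Upper bound on the wait: the request just missed its own slot start."""
    return num_cores * slot_cycles - 1
```

Under these assumptions, the wait of a core depends only on the schedule and on the cycle at which the request is issued, never on the traffic generated by the other cores, which is what makes the bound usable in a WCET analysis.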

Unlike CarCore, each Merasa core executes only one hard real-time thread alongside several soft real-time threads, and there is no dynamic thread swapping. Soft real-time thread access to the cache is not fully integrated due to cache interference among hard real-time threads.


Ungerer et al. (2010) also present a quad-core FPGA prototype synthesized for an Altera Stratix II EP2S180F1020C3 device, running at around 25 MHz.

3.7.2 The T-CREST project

The T-CREST project considers design techniques at the multicore processor level, covering cores, memory, and network-on-chip (SCHOEBERL et al., 2015), as well as the analysis tools.

The platform is composed of Patmos (SCHOEBERL et al., 2011) processor cores interconnected by two networks-on-chip. One network-on-chip provides message passing between core nodes and the other provides access to the shared external memory. Patmos cores use caches, but there is no hardware support for cache coherence. Processor communication via shared memory is allowed, but coherence must be implemented in software. The interconnect network uses packet switching and source routing and supports asynchronous message passing across point-to-point virtual circuits. It is implemented using DMA-driven block transfers from the local processor scratchpad into the scratchpads of remote processor nodes. Network virtual circuits are implemented using TDM (time division multiplexing) of the resources.
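
Source routing can be illustrated with a small sketch. It assumes a 2D mesh with X-then-Y routing and hypothetical port names (`N`, `S`, `E`, `W`); the text above does not describe the actual topology or packet encoding of the T-CREST network. The sketch therefore only shows the general idea: the sender computes the whole path, and each router simply consumes the next hop.

```python
# Hypothetical output ports of a mesh router and the coordinate change they cause.
# Assumption: x grows to the east, y grows to the south.
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


def build_route(src, dst):
    """Computed once at the source node: full list of hops, X direction first."""
    (sx, sy), (dx, dy) = src, dst
    route = ["E" if dx > sx else "W"] * abs(dx - sx)
    route += ["S" if dy > sy else "N"] * abs(dy - sy)
    return route


def deliver(src, route):
    """Each router pops the head of the route and forwards the packet.

    No routing table is consulted on the way; returns the node finally reached.
    """
    x, y = src
    for hop in route:
        mx, my = MOVES[hop]
        x, y = x + mx, y + my
    return (x, y)
```

Because the path is fixed before the packet leaves the sender, the set of links a packet occupies is known statically, which fits well with a static TDM allocation of those links.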

Instead of using benchmarks to evaluate the system, the authors use industry test cases based on avionics and railway applications.

3.8 SUMMARY

In this chapter, we surveyed various related projects. Many of them focus on hardware multithreading, while others concentrate on the memory hierarchy. Table 9 summarizes the described projects.

Real-time hardware multithreading allows higher pipeline utilization and promotes deterministic thread execution. There are thread-interleaved approaches, such as the PRET architecture, and sophisticated hardware thread scheduling approaches, like Komodo and CarCore. Despite these advantages, hardware multithreading makes the system complex, and additional hardware elements like the register file must be replicated. Schoeberl (2009b) argues that the additional hardware could be employed for chip multiprocessing instead of chip multithreading.

Patmos and JOP use a unique cache system in which several specialized caches are proposed. It requires effort on the part of software development, but it can enhance the analyzability of the system. The compiler must explicitly select memory instructions to access the different caches, and two instructions are required for memory loads.

We can note that all related works have determinism as a main objective. Differently from them, our work concentrates on the investigation of a VLIW processor for hard real-time applications. The design space offers many choices, and it is not clear how they impact WCET performance. Among them, we highlight pipeline dependency resolution, full predication for a 4-issue VLIW processor, and the branch architecture; all of these features are considered in the WCET analysis tool.


Table 9 – Summary of related project objectives

Project     Main objective
Komodo      Java processor: feasibility of real-time hardware multithreading
JOP         Java processor: cache architecture (method, stack, ...)
MCGREP      Reconfigurable architecture: application-specific instructions
PRET        Precision timed machine: thread-interleaved pipeline
Patmos      Dual VLIW: memory architecture (method cache, stack cache, data cache, scratchpad)
CarCore     Heterogeneous pipeline: hardware multithreading with different hardware scheduling algorithm
Merasa      Multicore: CarCore processors interconnected by bus
T-CREST     Multicore: Patmos processors interconnected by dual network-on-chip
Our work    VLIW processor design space investigation: predication, branch architecture, pipeline dependencies resolution and their WCET analysis

4 THESIS RATIONALE AND DESIGN DECISIONS

The analysis of real-time systems is a challenging task and requires the combination of numerous techniques. Typically, the problem is divided into two main areas: scheduling analysis and temporal analysis of individual tasks.

Scheduling analysis is responsible for verifying the system feasibility as a scheduling problem: we want to verify whether a task set can execute properly while respecting the individual deadline of each task.

The formulation of the scheduling problem always depends on the temporal parameters of the system. Obtaining these parameters is the responsibility of the temporal analysis of individual tasks. Some of these parameters depend on what we want to control or sample with the real-time system, such as the activation frequency or task period (T). However, a key parameter is the Worst-Case Execution Time (WCET) of each task, denoted C. Every scheduling analysis depends on the WCET. The focus of this work is on the temporal analysis of individual tasks, more specifically, on researching deterministic hardware techniques that improve WCET analysis while also increasing processor performance.
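
To make the dependence of scheduling analysis on C and T concrete, the sketch below applies the classical response-time analysis for fixed-priority preemptive scheduling. It assumes rate-monotonic priorities, deadlines equal to periods, independent tasks and integer time units; the text does not fix a particular scheduling policy, so this choice and the function names are only illustrative.

```python
def response_time(tasks, i):
    """Worst-case response time of tasks[i], or None if it misses its deadline.

    `tasks` is a list of (C, T) pairs sorted by decreasing priority.
    Assumption: the deadline of each task equals its period T.
    """
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        # Interference from every higher-priority task released within r.
        interference = sum(((r + t_j - 1) // t_j) * c_j for c_j, t_j in tasks[:i])
        r_next = c_i + interference
        if r_next > t_i:
            return None          # deadline exceeded
        if r_next == r:
            return r             # fixed point reached
        r = r_next


def is_schedulable(tasks):
    """Rate-monotonic feasibility test for a list of (C, T) pairs."""
    ordered = sorted(tasks, key=lambda ct: ct[1])  # shorter period = higher priority
    return all(response_time(ordered, i) is not None for i in range(len(ordered)))
```

For example, the task set `[(1, 4), (2, 6), (3, 12)]` passes the test, while raising the WCET of the last task from 3 to 6 makes it fail. A pessimistic WCET estimate can therefore turn a feasible system into one that is rejected, which is why the precision of the WCET analysis matters.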

As noted briefly in the previous chapters, considerable research effort has been devoted in recent years to providing time-predictable architectures for real-time systems. Since general-purpose processors usually focus on average-case performance (SCHOEBERL et al., 2015), analyzing real-time systems and obtaining the WCET on those processors has become very complex. Real-time applications are becoming increasingly complex and require more processing power, with embedded systems providing more and more “desktop-like” functionality. However, using standard processor design choices to improve performance is usually not possible for real-time systems because it greatly jeopardizes determinism properties.

In this chapter, we describe the target problem and the thesis rationale. We present the target systems and their requirements, and discuss the design decisions for the proposed architecture, comparing them with related research.
