4.2 Implementation on a COTS Platform
4.2.1 Platform Description
A brief summary of the architectural features of the MPC5777M MCU is provided in Table 4.5, while a block diagram is reported in Figure 4.13. The SoC platform includes
1 http://erika.tuxfamily.org/drupal/
Table 4.5: Characteristics of Freescale MPC5777M SoC
Chip Name            MPC5777M (Matterhorn)
Manufacturer         Freescale
Architecture         Power-PC, 32-bit
CPU Unit             2x E200-Z710 + 1x E200-Z709 + 1x E200-Z425 (I/O)
CPU Frequency        Application Cores (300 MHz), I/O Core (200 MHz)
Processing Unit      CPUs, DMA, Interrupt Controller, NIC
Operational Modes    Parallel + Lockstep (on one applicative core)
ECC Protection       Cache, RAM, Flash Storage
Cache Hierarchy      L1 (Private Instructions + Data) + Local Memory
Local Memory (SPMs)  Instructions (16 KB) + Data (64 KB)
L1 Cache Size        Instructions (16 KB) + Data (4 KB)
SPM Size             80 KB
SRAM Size            404 KB
Flash Size           8 MB
Main Peripherals     Ethernet, FlexRay, CAN, I2C, SIUL
MEMU                 MEMU for SRAM, Peripheral RAM and Flash
two application cores, which we use to execute 3-phase real-time tasks, and an I/O core, which we use to run the OS and all device drivers. Each core has a dedicated scratchpad memory, whose size (80 KB) is smaller than, but comparable to, that of the SRAM main memory (404 KB). The platform includes a separate I/O bus, which allows the designer to route I/O traffic without directly interfering with CPU-originated memory requests. The idea of co-scheduling CPU activity and I/O traffic is not new, and specific solutions have been proposed in [41,129]. However, traffic transmitted over the dedicated I/O bus needs to be handled, pre-processed and scheduled before reaching the application cores; the I/O core performs these operations. Just like the application cores, the I/O core features a scratchpad memory, which is used to buffer I/O data before they are delivered to applications.
Typically, devices that support high-bandwidth operations are DMA-capable. Slower devices instead expose memory-mapped input/output buffers that can be read and written using generic platform DMA engines. Without loss of generality, we assume that I/O data transfers from/to the I/O core are performed by DMA engines and that data from I/O devices can be transferred directly into the I/O core’s scratchpad memory. In other words, I/O devices are not allowed to initiate asynchronous transfers directly towards main memory.
Figure 4.14: Block diagram of error handling circuitry.
This design choice allows us to perform co-scheduling of CPU and I/O activities to achieve higher system predictability.
Note that no MMU is available on this platform. Hence, there is no support for virtual memory. Instead, we use compiler techniques to generate position-independent code that can be loaded in either partition of an application scratchpad, as we discuss in Section 4.2.2. To allow the safety features of the SoC to be tested and verified, the chip also implements fake error injection mechanisms, which make it possible to observe the reaction to various faults. We use these fault injection mechanisms to evaluate our system.
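The effect of position-independent loading can be illustrated with a minimal sketch. It emulates in plain C what the compiler-generated code achieves: the task reaches its data only through a base pointer plus an offset, so the same image runs unchanged from either partition. The names `task_image_t`, `load_task` and `task_step`, the host array standing in for the scratchpad, and the split into two equal partitions are assumptions made for illustration; they are not taken from the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPM_DATA_SIZE  (64u * 1024u)        /* data SPM size, as in Table 4.5 */
#define PARTITION_SIZE (SPM_DATA_SIZE / 2u) /* assumption: two equal partitions */

/* Host-side stand-in for the application core's data scratchpad. */
static uint8_t spm[SPM_DATA_SIZE];

/* Hypothetical task image: a blob plus the offset of one 32-bit counter. */
typedef struct {
    const uint8_t *image;       /* task image stored in main memory */
    size_t         size;        /* image size in bytes */
    size_t         counter_off; /* offset of the counter inside the image */
} task_image_t;

/* Copy the image into partition 0 or 1 and return its base address,
 * or NULL if the partition index is invalid or the image does not fit. */
uint8_t *load_task(const task_image_t *t, unsigned partition)
{
    assert(t != NULL && t->image != NULL);
    if (partition > 1u || t->size > PARTITION_SIZE) {
        return NULL;
    }
    uint8_t *base = &spm[(size_t)partition * PARTITION_SIZE];
    memcpy(base, t->image, t->size);
    return base;
}

/* Task body: all accesses are base-relative, never absolute, so the
 * result does not depend on which partition the image was loaded into. */
uint32_t task_step(uint8_t *base, const task_image_t *t)
{
    uint32_t v;
    memcpy(&v, base + t->counter_off, sizeof v);
    v += 1u;
    memcpy(base + t->counter_off, &v, sizeof v);
    return v;
}
```

Loading the same image into both partitions yields two independent copies: stepping one does not change the counter of the other, which is the property needed when the next task is loaded into one partition while the current task executes from the other.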
In the considered MPC5777M platform, a memory error management unit (MEMU) is responsible for detecting and collecting faults in the different memory subsystems, such as SRAM, SPM and Flash. In particular, the platform implements Hsiao codes [73], which provide single-bit error correction and double-bit error detection (SEC-DED) and are popular in modern embedded platforms. The MEMU implements separate tables for reporting correctable and uncorrectable errors, with separate tables for each kind of memory. These tables contain the address of the fault that caused the error; moreover, a register inside the MEMU indicates whether the fault that occurred is correctable or uncorrectable. On the MPC5777M, the MEMU cannot send an interrupt to the CPU when a fault occurs. Instead, a separate FCCU (Fault Collection and Control Unit) module on the chip collects all the errors forwarded to it by the MEMU. The FCCU can be preprogrammed to take specific actions in response to a particular error. It is also responsible for generating an interrupt that notifies the processor of any error reported to it by the MEMU. Figure 4.14 shows how the different modules are connected to each other. To detect faults in the SPM, we registered an FCCU interrupt with the application cores. This interrupt is generated and sent to all cores when the MEMU reports an error in one of the memory subsystems to the FCCU.
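The reporting path described above (MEMU tables, then FCCU, then interrupt to the cores) can be sketched as a small software model. Everything in it is hypothetical: the types `memu_model_t` and `fccu_model_t`, the table depth, the status-bit layout and the function names are assumptions chosen for illustration and do not correspond to the actual MPC5777M register map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { MEM_SRAM, MEM_SPM, MEM_FLASH, MEM_KINDS } mem_kind_t;
typedef enum { ERR_CORRECTABLE, ERR_UNCORRECTABLE, ERR_CLASSES } err_class_t;

#define TABLE_DEPTH 4u /* hypothetical number of entries per table */

typedef struct { bool valid; uint32_t addr; } err_entry_t;

/* Model of the MEMU: one table per (memory kind, error class) pair,
 * plus a status word with one "entries pending" bit per table. */
typedef struct {
    err_entry_t table[MEM_KINDS][ERR_CLASSES][TABLE_DEPTH];
    uint32_t    status;
} memu_model_t;

/* Model of the FCCU: only the interrupt line towards the cores. */
typedef struct { bool irq_pending; } fccu_model_t;

typedef struct {
    unsigned corrected;
    unsigned uncorrectable;
    uint32_t last_addr;
} spm_fault_report_t;

static uint32_t status_bit(mem_kind_t k, err_class_t c)
{
    return 1u << ((unsigned)k * (unsigned)ERR_CLASSES + (unsigned)c);
}

void model_reset(memu_model_t *m, fccu_model_t *f)
{
    memset(m, 0, sizeof *m);
    f->irq_pending = false;
}

/* Hardware side: the MEMU logs the faulting address and forwards the
 * event to the FCCU, which raises the interrupt. Returns false if the
 * table was full and the address could not be logged. */
bool memu_report(memu_model_t *m, fccu_model_t *f,
                 mem_kind_t k, err_class_t c, uint32_t addr)
{
    assert(k < MEM_KINDS && c < ERR_CLASSES);
    bool logged = false;
    for (size_t i = 0; i < TABLE_DEPTH && !logged; i++) {
        if (!m->table[k][c][i].valid) {
            m->table[k][c][i].valid = true;
            m->table[k][c][i].addr  = addr;
            logged = true;
        }
    }
    m->status |= status_bit(k, c);
    f->irq_pending = true;
    return logged;
}

/* Software side: interrupt handler on an application core. It drains
 * only the SPM tables and leaves other memories' entries untouched. */
spm_fault_report_t fccu_isr(memu_model_t *m, fccu_model_t *f)
{
    spm_fault_report_t r = {0u, 0u, 0u};
    if (!f->irq_pending) {
        return r;
    }
    for (int c = 0; c < ERR_CLASSES; c++) {
        if ((m->status & status_bit(MEM_SPM, (err_class_t)c)) == 0u) {
            continue;
        }
        for (size_t i = 0; i < TABLE_DEPTH; i++) {
            err_entry_t *e = &m->table[MEM_SPM][c][i];
            if (!e->valid) {
                continue;
            }
            if (c == ERR_CORRECTABLE) { r.corrected++; } else { r.uncorrectable++; }
            r.last_addr = e->addr;
            e->valid = false;
        }
        m->status &= ~status_bit(MEM_SPM, (err_class_t)c);
    }
    f->irq_pending = (m->status != 0u); /* other memories may still be pending */
    return r;
}
```

The model keeps the two roles separate, as in the text: `memu_report` never notifies the core directly, and the handler learns about a fault only through the FCCU flag before consulting the MEMU tables for the address and the error class.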
As discussed in Section 3.4, the proposed recovery mechanisms are effective as long as no more than one bit error occurs in any of the memory subsystems (SRAM, SPM, Flash) every two periods of any task. It is estimated that the FIT of SRAM memories is about 0.001, i.e. on average one bit upset is observed every 10^11 hours of operation [151]. Flash memory, on the other hand, shows low SER susceptibility [146]. Since the period of a real-time task is typically tens or hundreds of milliseconds long, we deem this assumption to be satisfied in the vast majority of embedded systems.
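To give a sense of the margin, the following back-of-the-envelope estimate works through one example. It assumes, purely for illustration, that upsets arrive as a Poisson process, that the quoted interval of 10^11 hours is the mean time between upsets for the memory of interest, and that the task period is P = 100 ms; the symbols T_upset, lambda and N (the number of upsets within two periods) are introduced here and do not appear in the original analysis.

```latex
\begin{align*}
T_{\text{upset}} &\approx 10^{11}\ \text{h}, \qquad P = 100\ \text{ms}, \\
\lambda &= \frac{2P}{T_{\text{upset}}}
         = \frac{0.2\ \text{s} \,/\, 3600\ \text{s/h}}{10^{11}\ \text{h}}
         \approx 5.6 \times 10^{-16}, \\
\Pr[N \ge 2] &= 1 - e^{-\lambda}\,(1 + \lambda)
         \approx \frac{\lambda^{2}}{2}
         \approx 1.5 \times 10^{-31}.
\end{align*}
```

Under these assumptions, the probability of two or more upsets within the same two-period window is many orders of magnitude below the probability of a single upset, which is consistent with the claim that the single-error assumption holds in practice.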
Error correction techniques to recover from faults that affect OS memory require special attention and are not within the scope of this work. A promising approach in this context is system check-pointing, i.e. periodically saving the state of the OS and copying it into a more reliable region of memory. Further research is required to seamlessly integrate check-points into our SPM management scheme.