A. SOFTCORE DEVELOPMENT
1. XUM Softcore Processor Modifications
We verified the functionality of the XUM SOC softcore before performing
any modifications. This verification was conducted on an XUPV5-LX110T development board, the
same board used to build the original core [18], to eliminate any design variables. Later,
development was conducted on a Genesys 2 development board featuring a Kintex-7
325T FPGA. We found one minor issue involving the UART instantiated within the
system and communicated this back to the author of the core for verification. We also
noticed that memory accesses were not keeping up with the demand of the processor.
We modified the core to resolve these two issues and then applied techniques
to make it internally triple-modular redundant.
a. UART
The UART, which has a dual functionality of being able to either boot-load a new
program into memory or allow a user program to communicate [18], was found to stop
receiving data for user programs after sending its first transmission. We discovered that
during the write back of any character, BootSwEnabled transitioned from low to high,
and following that transition, the receive first in, first out (FIFO) module quickly
emptied. The issue is caused by a race condition. When BootSwEnabled goes high, the UART
switches to boot-load mode and empties the receive FIFO while searching for a specific string
of characters, causing the user program to miss all of its input data. This is shown in
Figure 8. Although we corrected the issue with the UART and used it extensively
in testing, we removed the entire module from the final design because NPSAT-1
communicates over its PC-104 bus. We did, however, keep the concept of using the communication
interface to temporarily control the instruction data bus for reprogramming the operating
system and adapted it into our PC-104 interface.
Figure 8. Waveform Demonstrating XUM UART Improperly Transitioning BootSwEnabled on Write
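The effect of the spurious mode switch can be illustrated with a small software model. This is only a behavioral sketch, not the XUM Verilog: the trigger string `BOOT_MAGIC`, the function names, and the queued input are all hypothetical, chosen to show how a boot-load search that drains the receive FIFO leaves nothing for the user program.

```python
from collections import deque

BOOT_MAGIC = "BOOT"  # hypothetical trigger string; the real value is not given in the text


def drain_for_boot_string(rx_fifo):
    """Boot-load mode: pop every queued character while looking for BOOT_MAGIC."""
    seen = ""
    while rx_fifo:
        seen += rx_fifo.popleft()
        if seen.endswith(BOOT_MAGIC):
            return True
    return False


def user_read(rx_fifo):
    """User-program mode: return whatever is still queued in the receive FIFO."""
    data = "".join(rx_fifo)
    rx_fifo.clear()
    return data


def run(boot_sw_glitches_on_write):
    """Model one user transmission followed by a user read of pending input."""
    rx_fifo = deque("hello")  # assumed input waiting for the user program
    # The user program writes one character back. In the faulty behavior this
    # is the moment BootSwEnabled transitions from low to high.
    boot_sw_enabled = boot_sw_glitches_on_write
    if boot_sw_enabled:
        drain_for_boot_string(rx_fifo)  # FIFO is emptied; no boot string is found
    return user_read(rx_fifo)
```

In this model, `run(True)` returns an empty string because the boot-load search consumed the pending input, while `run(False)` returns the queued characters intact.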
b. Adjusting the Memory Access Protocol
The XUM core utilizes a four-way handshake to retrieve instructions and data
from memory. In the original core, memory is instantiated using block RAM with
registered outputs. Examining both the code and waveforms produced when executing a
simple program on the processor, we noticed that even with the memory clocked at twice
the speed of the processor, 66 MHz and 33 MHz, respectively, the memory was unable to
keep pace with the processor and took three central processing unit (CPU) clocks to
retrieve one instruction as shown in Figure 9. The author himself notes that “instruction
memory fetches only once per handshake, the minimum theoretical CPI is greatly
increased from 1 to between 3 and 4” [18].
Figure 9. Waveform Capture of XUM Instruction Fetch Utilizing Four-way Handshake Protocol
To resolve this issue, we implemented an L1 cache onboard the FPGA and
replaced the handshake protocol with a simple flagging protocol. In the new protocol, we
assume that the processor makes a request to the L1 cache, either a read
or a write, every CPU cycle. Based on this assumption, we designed our cache to resolve a hit
within the same CPU clock cycle in which the request is received. Because the cache is able to respond
every CPU clock cycle, only a single valid line needs to run from memory to the
processor. If the valid line is deasserted, the processor stalls; otherwise, it accepts the data on
the data lines.
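The difference between the two protocols can be sketched as a simple cycle count. The three-clock fetch cost comes from the measurement described above; the hit pattern and the `miss_penalty` value are assumptions made only for illustration, since the text does not state the cache's hit rate or miss latency.

```python
def cycles_with_handshake(n_instructions, cycles_per_fetch=3):
    """Four-way handshake: every fetch costs a fixed number of CPU clocks."""
    return n_instructions * cycles_per_fetch


def cycles_with_valid_flag(hits, miss_penalty):
    """Flagging protocol: one request per CPU clock; stall while valid is low.

    hits is a sequence of booleans, one per request (True = L1 cache hit).
    miss_penalty is the assumed number of stall clocks per miss.
    """
    cycles = 0
    for hit in hits:
        if not hit:
            cycles += miss_penalty  # valid is low: the processor stalls
        cycles += 1                 # valid is high: data is accepted this clock
    return cycles
```

For 100 fetches, the handshake model costs 300 clocks (CPI of 3). The flag model costs 100 clocks if every request hits, and 200 clocks if 10 of the requests miss with an assumed 10-clock penalty.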
c. Application of the GTMR Architecture
Applying the GTMR architecture to XUM was a two-step
process. We first applied BTMR to the system as a whole. Then, we applied GTMR to the system module by
module. We also structured our code to resemble the BTMR hierarchy, with all voting,
both BTMR and GTMR, taking place at the level where the systems are triplicated, as
shown in Figure 10. The benefit of this structuring is that there is little disturbance to the
internal interconnect of the systems beneath the TMR hierarchical level, so any system
below the TMR hierarchical level can be treated as a single operable system. The
inputs and outputs of this single system can be mapped directly to the top-level ports, and
the system still functions. This allows us to test a single system for functionality and reduces
compilation times.
Figure 10. Schematic Drawing of the CFTP TMR Level
In applying BTMR, we must first identify all parts of the system that cannot be
triplicated. These are instantiated on the top level. In the case of the original core, this
was only the top-level pins; however, we also include the clock management tile (CMT)
on the top level and generate triplicated clock outputs from it. The reason for this is to
minimize clock skew and jitter, which can be induced by chaining together multiple CMTs.
The top module then routes the appropriate signals to the TMR level.
At the TMR level, the entire system was triplicated. Inputs from the top level are
distributed to each of the three systems, and outputs from each system are voted on
before being sent to the top level. This completes the BTMR implementation, and we
verified that the system was operable with three processors executing synchronously.
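The output voting at the TMR level can be sketched as a bitwise two-of-three majority function. This is a generic majority voter written for illustration; the function name is hypothetical, and the voter in the actual design is hardware logic whose exact form is not shown in the text.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)
```

For example, if two systems output `0b1011` and a third outputs `0b0011` because of an upset, `majority_vote(0b1011, 0b1011, 0b0011)` yields `0b1011`, so the single faulty copy is masked before the signal reaches the top level.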
To progress from our BTMR XUM system to a GTMR system, we first had to
locate every register inside the XUM softcore. These registers were then modified so
that their outputs were routed directly to the TMR level rather than to the connecting logic. After
passing through the GTMR voter, each signal was routed back to the connecting
logic for which it was originally destined. Furthermore, all registers were modified to
update every clock cycle. In the event the processor stalls or a register is not going to
be actively written to, such as those in the register file, a voted output is still written
to each copy. Since modifying any single module did not compromise the
system functionally, we were able to apply GTMR module by module and verify the
system between module updates so that any errors produced could be identified more readily.
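The register behavior described above, in which every copy is rewritten with a voted value on every clock, can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions: the class and method names are hypothetical, a single shared voter is shown for brevity, and the real design is register-transfer logic rather than software.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Three copies of one register whose outputs are voted before use."""

    def __init__(self, value=0):
        self.copies = [value, value, value]

    def voted(self):
        """Value seen by the connecting logic after the voter."""
        return majority_vote(*self.copies)

    def clock(self, write_enable=False, data=0):
        """One clock edge. Every copy is written on every edge: with new data
        when write_enable is set, otherwise with the current voted value."""
        next_value = data if write_enable else self.voted()
        self.copies = [next_value] * 3

    def upset(self, copy_index, bit):
        """Flip one bit in one copy to model a single-event upset."""
        self.copies[copy_index] ^= 1 << bit
```

With `reg = TmrRegister(0x2A)`, calling `reg.upset(1, 0)` corrupts one copy, yet `reg.voted()` still returns `0x2A`. A single call to `reg.clock()` with no write then restores all three copies, which is why updating every register on every cycle, even during a stall, keeps errors from accumulating.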
This standardized process of applying the BTMR and GTMR architectures was
applied to every submodule later instantiated within CFTP with the exception of the
following: the CMT, the internal configuration access port and frame error correction
code primitives, and the double-data rate type three (DDR3) IP softcore.