5.2.2 Intercellular Communication
Another function of the cortex is to implement spike transmission between presynaptic soma units and synapse units. This, again, can be regarded as a communication network. In contrast to the intracellular network, with its continuous flow of membrane update packets, spikes are sent intermittently and less often. Assuming one membrane potential update per simulated millisecond of neuron activity, a maximum spiking frequency of 200Hz means that spike packets are at least 5 times less frequent than dendritic update packets. Unlike intracellular communication, which is unicast (from one PE to another), intercellular communication needs to be multicast, as one presynaptic neuron can send spikes to many postsynaptic neurons. Additionally, spike packets do not need to convey any information other than their source identity and their timing (which is implied by the existence of each spike packet).
A shared media network such as a shared bus is not an option: the number of nodes and the number of spikes that must travel through the network at the same time are both quite high, and the long and variable latencies and the hardware cost of arbitration in shared media networks are not justified here [188, 294].
Using a switched media network requires routing, arbitration and switching functions. As with intracellular communication, following a bio-plausible approach means that evo-devo processes must be responsible for routing the spike packets. Alternatively, as in [275, 301, 72, 300, 378], the physical and logical network connectivity can be decoupled, but this effectively deprives the system of the role that evo-devo processes can play in resource management and optimisation of the neural microcircuit. Allowing the evo-devo system to control the physical routing of the neurites enables it to exploit all the physical, and even unwanted, properties of the cortex substrate to optimise the functionality of the neural microcircuit. It also removes the burden of routing and arbitration from the network nodes, which significantly reduces node latency and hardware cost. Moreover, [188] reported that offline scheduling of switches can yield up to a 63% performance increase over online scheduling.
With routing performed by the evo-devo processes (discussed in chapter 6), three main switching techniques are available: packet switching, cut-through switching, and circuit switching. These methods differ in throughput, bandwidth, hardware cost and latency. Unlike intracellular network packets, spike packets have more time for delivery, and the performance of the system is less sensitive to their latency. Spikes in biological brains can have a delay of up to 20ms, depending on axon length and other factors. Assuming a simulation resolution of one membrane update per 1ms of biological neuron activity, and an average of 50 clock cycles per update, spike packets in the cortex model need to be delivered within 50 to 1000 clock cycles, depending on the length of the modelled axon. However, spike packet latencies need to be reliable and accurate enough: for spikes to have the same 1ms resolution, their timing must be accurate to within ±25 clock cycles.
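The cycle budget quoted above is simple arithmetic, and a minimal sketch makes the dependence on the two assumed figures explicit. The constant and function names are illustrative only, and the 1ms lower bound on the delay is an assumption inferred from the 50-cycle minimum given in the text.

```python
# Figures taken from the text; all names here are illustrative, not part of the design.
CYCLES_PER_UPDATE = 50   # average clock cycles per membrane update (1 ms of biological time)
MAX_DELAY_MS = 20        # upper bound on the biological spike delay


def delivery_budget_cycles(delay_ms: int) -> int:
    """Clock cycles available to deliver a spike whose biological delay is delay_ms."""
    if not 1 <= delay_ms <= MAX_DELAY_MS:
        raise ValueError("delay outside the 1-20 ms range assumed in this sketch")
    return delay_ms * CYCLES_PER_UPDATE


def timing_tolerance_cycles() -> float:
    """Half of one update period: the +/- window that preserves 1 ms resolution."""
    return CYCLES_PER_UPDATE / 2


print(delivery_budget_cycles(1), delivery_budget_cycles(20), timing_tolerance_cycles())
```

Changing either constant rescales both the delivery window and the tolerance, which is why the budget tightens as the simulation is run further ahead of real time.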
Packet switching has a high latency compared to the other two, as it requires buffering the whole packet to decode the routing information before forwarding it to the next node. While packet switching works well in some large-scale applications such as [300], where real-time simulation is intended and very short packets are used, it is not a feasible solution for hyper-realtime simulations such as this. Even with 256 neurons, a spike packet needs to be at least 8 bits long, so it can pass through a maximum of 125 hops in 1000 clock cycles. With different packets blocking each other, and multicasting of packets to different destinations adding to the network traffic, this latency will fluctuate and be far from guaranteed.
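The 125-hop bound can be reproduced with a short sketch. It assumes bit-serial links (one bit per clock cycle), a packet that carries only the source neuron ID, and no contention. These assumptions are implied by the figures in the text rather than stated there, and the function names are hypothetical.

```python
def spike_packet_bits(num_neurons: int) -> int:
    """Minimum number of bits needed to encode a source neuron ID."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return max(1, (num_neurons - 1).bit_length())


def max_store_and_forward_hops(num_neurons: int, budget_cycles: int) -> int:
    """Upper bound on hops when every node buffers the whole packet before forwarding.

    Assumes one bit per clock cycle per link and an otherwise idle network.
    """
    return budget_cycles // spike_packet_bits(num_neurons)


print(spike_packet_bits(256), max_store_and_forward_hops(256, 1000))
```

Because the packet length grows with the logarithm of the neuron count, the bound shrinks as the network grows, before blocking or multicast traffic is even considered.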
Considering cut-through switching, as the whole spike packet is needed for decoding the routing information, each packet will consist of only one flit (flow unit [294]). In this case, therefore, cut-through switching is equivalent to packet switching in practice. Both methods also require buffers and logic for flow control and arbitration, which further add to the latency and hardware cost of each node. The hardware cost and unreliable latency of packet switching and cut-through switching render them infeasible for this intercellular communication network.

Considering circuit switching, apart from the initial latency for establishing a circuit, it has the best latency among the switching techniques. With routing and arbitration already carried out by evo-devo processes, circuit switching turns into configured switching [188]. Configured switching has the minimum hardware cost per node, as it does not need any buffers or logic for routing and arbitration. While configured switching provides a possible solution, it can waste the bandwidth of the links by preallocating them to circuits that are very rarely used (for sending a spike packet). Additionally, by dedicating a circuit to a single axon, there is no need to send the presynaptic neuron ID explicitly, and the spike packet size is reduced to only one bit. Taking this into account degrades the channel utilisation of configured switching even further, by a factor of log2(n), where n is the number of neurons.
Fortunately, it is possible to use time-multiplexing to utilise the bandwidth of the links more efficiently. In time-multiplexed switching, each switch in the network follows its own predefined schedule on a time-division basis. Based on the length of the repeating schedule (number of contexts, n), this creates n virtual channels on each physical link between two nodes. However, this technique requires the switching schedule for every cycle to be stored locally (or otherwise be available) in each node. The switching memory hardware cost for each port is of order O(n log2(m)), where m is the number of ports in each node.
Table 5.2: Summary of different design patterns for implementing communication in FPGAs, and their characteristics and trade-offs (adopted from [188]).
Characteristics                                 | Configured switching | Time-multiplexed switching | Packet switching | Circuit switching
Communication predictability                    | High                 | High                       | Low              | Low
Latency                                         | Lowest               | Low                        | Highest          | Moderate
Switching logic HW cost                         | Low                  | Low                        | High             | High
Switching memory HW cost                        | Lowest               | Low                        | Highest          | Modest
Comm. throughput-physical link bandwidth ratio  | Highest              | High                       | Low              | Lowest
Channel utilisation (application dependent)     | Depends on app.      |                            |                  | Low
Latency overhead (message length dependent)     | Lowest               | Low                        | Highest          | Depends
Kapre et al. have investigated the hardware costs, latency and trade-offs of time-multiplexed switching versus packet switching networks for FPGAs in [188]. Table 5.2 summarises their evaluation of four different design patterns for implementing communication in FPGAs: configured switching, time-multiplexed switching, packet switching, and circuit switching. Clearly, with the lowest hardware cost and latency, and the highest throughput, configured switching is the best option when the application needs to use a circuit all the time. This is the case for intracellular communication in short dendritic loops. However, if the application uses the network only sporadically, configured switching is not the best option. In that case, if communication predictability is important, time-multiplexed switching is the best option, as circuit and packet switching cannot provide that predictability. This is clearly the case for intercellular communication in a hyper-realtime neural microcircuit application. However, in realtime neural applications, such as [300] and [147], packet switching makes much more sense, particularly when packets are very small and the number of PEs is very large [188]. Circuit switching is only an option when the application needs to send very long messages sporadically and the latency overhead is negligible compared to the length of the message.
Utilising time-multiplexed switching for the intercellular communication network not only provides a way to use the routing resources of the FPGA efficiently but also, as will be explained in section 5.2.2, extends a 2D interconnection network, which is feasible on an FPGA, to a 3D virtual intercellular network, which is bio-plausible. Using such a time-multiplexed communication network can also increase the scalability of the system to multiple FPGAs [235]. If all the physical links are local, as a bio-plausible approach suggests, it will be possible to run the time-multiplexed network at a much higher clock frequency than the rest of the system and increase its bandwidth even further.
Topology
Before moving on to the reconfiguration and feedback functions of the cortex model, the topology of the intercellular and intracellular networks must be discussed, so that the next sections can focus on particular promising topologies. From a bio-plausibility point of view, although biological neurons and their projections are embedded in a 3-dimensional substrate, the fractal dimension of the connectivity of the neurons in C. elegans and the human brain is measured at around 4 [19]. This is indicative of a much higher dimensional topology in these nervous systems. However, the interactions underlying this long-range connectivity are still all local interactions with neighbouring elements in a 3-dimensional space. This 3D space is wrapped around in one dimension and has connections to the outside world at one of its edges, since a brain is modelled as a layered neural tube connected at the root to the rest of the body. This 3D space must provide enough resources and connectivity to support networks with small-world and scale-free characteristics.
In terms of feasibility, the topology must provide low and reliable communication latency and enough throughput with the minimum hardware cost. To achieve this, it is important to appreciate the PEs vs. interconnection trade-off and find the balance between the amount of hardware resources dedicated to computation versus communication. A modular and structured topology is preferred for its reduced and manageable design and testing complexity. Reliability, robustness and fault tolerance are other feasibility factors related to the topologies of these networks.
[Figure 5.1 panels: (a) mesh, (b) torus, (c) bidirectional torus, (d) binary tree using 3 × 3 routers, (e) fat-tree built from identical 4 × 4 routers, (f) ring, (g) bidirectional ring. A filled circle represents a node, comprising a router (R) and an IP core (IP).]
Figure 5.1: Common NOC topologies along with their router (R) and PE (IP) connections and ports (from [328]).
[294]. The inadequacy of the bus (star) topology has already been discussed in sections 5.2.1 and 5.2.2. A very straightforward topology (not shown in the figure) is a fully-connected graph, which usually uses a single central switch to connect all the nodes. Since the hardware cost of the switch is of order O(n^2), where n is the number of ports (equal to the number of PEs here), the total switching hardware cost of a fully-connected network is only justifiable for a small number of PEs. Moreover, such a topology does not exhibit the locality required for bio-plausibility.
Table 5.3 shows the characteristics and hardware-performance trade-offs of different common NOC topologies. Bisection bandwidth is a measure of the total performance of the network in terms of throughput. Maximum and average hop counts give the upper bound and the typical number of hops that a packet needs to travel, which are directly related to the total latency. Bisection bandwidth and hop count are the two main performance factors of a topology. Switching hardware cost is the total number of switches in the network multiplied by the hardware cost of each switch (number of ports squared). The number of links represents the hardware cost of the physical links, including the links between PEs and their corresponding switches.
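The two hardware-cost columns of Table 5.3 can be checked by building a topology explicitly and counting ports and links exactly as defined above. The following is a minimal sketch under stated assumptions: one PE per switch, one port per neighbouring switch plus one local port, and hypothetical helper names throughout.

```python
from typing import Dict, Hashable, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def ring(n: int) -> Adjacency:
    """Switch-to-switch adjacency of an n-node ring (n >= 3)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(k: int, wrap: bool) -> Adjacency:
    """k x k mesh (wrap=False) or torus (wrap=True, k >= 3)."""
    adj: Adjacency = {}
    for x in range(k):
        for y in range(k):
            nbrs = set()
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if wrap:
                    nbrs.add((nx % k, ny % k))
                elif 0 <= nx < k and 0 <= ny < k:
                    nbrs.add((nx, ny))
            adj[(x, y)] = nbrs
    return adj


def hardware_cost(adj: Adjacency) -> Tuple[int, int]:
    """Return (switching cost, number of links) for one PE per switch."""
    # Each switch has one port per neighbouring switch plus one local PE port.
    switching = sum((len(nbrs) + 1) ** 2 for nbrs in adj.values())
    # Every inter-switch link is seen from both ends; add one PE link per node.
    links = sum(len(nbrs) for nbrs in adj.values()) // 2 + len(adj)
    return switching, links


n, k = 16, 4
for name, adj in (("ring", ring(n)), ("mesh", grid(k, False)), ("torus", grid(k, True))):
    print(name, hardware_cost(adj))
```

For n = 16 this yields (144, 32) for the ring, (264, 40) for the mesh and (400, 48) for the torus. The ring and torus match 9n, 2n, 25n and 3n exactly. The exact mesh switching cost is 25n − 36√n + 8, which differs from the tabulated expression only by a constant that the order notation drops.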
Among these common topologies used in the NOC context, the bus and fully-connected topologies can be rejected straightaway, for very low bisection bandwidth and very high hardware cost respectively. Among the rest, the hyper-cube and fat-tree topologies offer very good performance, but at a hardware cost that grows rather rapidly with the number of PEs. They also need many long-range connections when embedded in the 2D substrate of an FPGA. Long-range wires in FPGAs are scarce and costly, and cause long delays that lead to a lower clock frequency, impacting the overall performance of the network. Ring, 2D mesh and 2D torus are the only topologies that can be implemented simply on a 2D silicon chip using only short, local, high-performance links. Therefore, the 2D mesh and torus are two of the most popular topologies in NOCs. The ring can be seen as a 1-dimensional version of the torus. Higher dimensional versions of the mesh and torus are also conceivable. However, they have the same problem of long-range links when mapped to a 2D FPGA. It is also possible to have a hybrid of ring and mesh or torus by dividing each link into segments and adding more nodes in between.
Table 5.3: Characteristics and hardware-performance trade-offs in different major NOC topologies, where n is the number of PEs [294]. Bisection bandwidth represents the total bandwidth of the network in units of links.
            | Performance                                        | Hardware cost                              |
Topology    | Bisection bandwidth | Max (ave.) hop count         | Switching HW cost    | Number of links     | Locality (2D mapping)
Bus (star)  | O(1)                | 1 (1)                        | O(n)                 | n                   | No
Ring        | O(2)                | n/2 (n/4)                    | O(9n)                | 2n                  | Yes
2D Mesh     | O(√n)               | 2√n − 2 (√n − 1)             | O(25n − 36√n)        | 3n − 2√n            | Yes
2D Torus    | O(2√n)              | √n/2 (√n/4)                  | O(25n)               | 3n                  | Yes
Hyper-cube  | O(n/2)              | √n − log2(n) (log2(n)/2)     | O(n(log2(n) + 1)^2)  | n log2(n)/2 + n     | No
Fat-tree    | O(n/2)              | O(> Mesh, < Ring)            | O(2kn log_{k/2}(n))  | O(n log_{k/2}(n))   | No
Fully-conn. | O(n^2)              | 1 (1)                        | O(n^2)               | n^2 + n             | No
This leads to a heterogeneous network with two types of switches (5-port and 3-port) and slightly increases the PE-interconnection ratio, which can be used to adjust the ratio for the best overall hardware cost and performance. It is also possible to do the reverse and increase the number of local links to 6 or 8, as in [275], which in practice decreases the PE-interconnection ratio. Schoeberl et al. have investigated different topologies for time-multiplexed NOCs on FPGAs in [328] and reported that, for networks above 16 nodes, only torus and fat-tree topologies have enough link capacity to enable a schedule period that is in the same range as the IO capacity of the IP cores. With respect to the local connectivity pattern of the FPGA CLBs, a 2D grid torus with 4-neighbourhood connectivity appears to be a simple and efficient option that can be extended to 6 or 8 neighbours, since each Virtex-5 CLB has 1-hop (low-latency) connectivity wires to all 8 neighbouring CLBs. Selection of the best neighbourhood connectivity and cell design is a separate subject that needs much further investigation through comprehensive simulations or analytical study (see [92] for example).
Although a 2D torus appears to be the best feasible topology for the intercellular and intracellular communication networks of the cortex model, it does not map perfectly onto the 3D substrate needed for bio-plausible neural microcircuits. Fortunately, time-multiplexing a 2D topology can create a virtual third dimension along the time axis, which allows a better mapping to a bio-plausible 3D substrate. This has already been proposed in the previous section for the intercellular communication network. However, due to the asynchrony of the soma units and the timing of their packets in dendritic loops, it is not possible to use time-multiplexing for the intracellular communication network and so extend the growing substrate of dendrites to three dimensions.
Figure 5.2 from [328] depicts the general circuitry for a 2D mesh or torus time-multiplexed switched network. Each switch is shown as a multiplexer receiving inputs from north (N), south (S), west (W), east (E), and the local PE (L). The scheduled switching data for selecting the input of each multiplexer come from a Schedule Table (ST) that is addressed sequentially by a time-slot counter, which can be local to each node or global. This counter generates the slot numbers from zero up to the length of the schedule period. The main hardware cost overhead of this method is the memory needed for the schedule tables. An m-port switch needs a total of m n log2(m − 1) bits of RAM, where n is the length of the schedule. If a global time-slot counter is used, log2(n) global signals also need to be connected to all switches. Otherwise, each switch or group of switches needs a local counter of complexity O(log2(n)).
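The behaviour of one such node can be sketched in software. This is an illustrative model only, not the circuit of [328]: the class and function names are hypothetical, the slot counter is assumed to be local, spikes are modelled as single bits, and outputs that are not driven in a slot are simply omitted from the schedule row, a simplification that the bit count below does not account for.

```python
import math
from typing import Dict, List

PORTS = ("N", "S", "W", "E", "L")  # north, south, west, east, local PE


class TdmSwitch:
    """One node: each output port is a multiplexer selected by a schedule table."""

    def __init__(self, schedule: List[Dict[str, str]]):
        # schedule[slot] maps an output port to the input port it forwards in that slot.
        for row in schedule:
            for out, src in row.items():
                if out not in PORTS or src not in PORTS or out == src:
                    raise ValueError(f"invalid schedule entry {out} <- {src}")
        self.schedule = schedule
        self.slot = 0  # local time-slot counter

    def step(self, inputs: Dict[str, int]) -> Dict[str, int]:
        """Forward this cycle's input bits for the current slot, then advance the counter."""
        row = self.schedule[self.slot]
        outputs = {out: inputs.get(src, 0) for out, src in row.items()}
        self.slot = (self.slot + 1) % len(self.schedule)
        return outputs


def schedule_table_bits(m: int, n: int) -> int:
    """Schedule RAM for an m-port switch (m >= 3) and schedule length n.

    Follows the m * n * log2(m - 1) expression in the text, rounding the
    selector width up to a whole number of bits.
    """
    return m * n * math.ceil(math.log2(m - 1))


# Slot 0: the local PE multicasts a spike east and north. Slot 1: deliver from the west.
switch = TdmSwitch([{"E": "L", "N": "L"}, {"L": "W"}])
print(switch.step({"L": 1}))
print(switch.step({"W": 1}))
print(schedule_table_bits(m=5, n=2))
```

Note that multicast comes at no extra cost in this scheme: several output multiplexers may select the same input in the same slot, which suits the one-to-many fan-out of a presynaptic neuron.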
5.2.3 Reconfiguration
To evaluate the fitness of each individual, evo-devo processes must be able to modify the parameters and connectivity of the neurons and synapses both for setting up the neural microcircuits and for successive modifications during development. This process is called reconfiguration of the cortex, although it may not necessarily entail using the reconfiguration feature of the FPGA.
Regarding the bio-plausibility of the cortex, reconfiguration must allow localised modifications of the parameters and connectivity of the neurons, neurites and synapses. The cortex model must also allow the density and location of the soma and synapse units to be controlled by the evo-devo processes.