listed in reference to the total number of available resources of same type
Resource Type      LUTs            Registers       BRAM            DSP
Utilization        273k (63.0%)    199k (23%)      553 (37.6%)     214 (5.9%)
Per Functional Unit
One EXTOLL Link    66.8k (15.4%)   57.2k (6.6%)    30.50 (2.1%)    47 (1.3%)
EXTOLL MUX         3.8k (0.9%)     2.3k (0.3%)     30 (2%)         0 (0%)
HTL/NTL            24.8k (5.7%)    16.2k (1.9%)    42 (2.9%)       12 (0.3%)
CR Logic           87.2k (20.1%)   43.8k (5.1%)    404 (27.5%)     102 (2.8%)
HMC Layer          21.6k (5%)      19k (2.2%)      15.5 (1.1%)     2 (0.1%)
to 48 ranks at a time. A further increase in the number of ranks would increase Block RAM usage in the specified device region and significantly increase routing congestion in this area. Routing congestion also becomes a major factor when operating frequencies are increased, as the implementation tools start to replicate logic in order to reduce trace lengths and fan-out. The modules that suffered most from routing congestion are the EXTOLL links (fmax = 200 MHz) and the CR logic (fmax = 230 MHz). The final utilization report can be found in Table 4.5.
4.7 NAM Software
Even the best hardware is useless without software that can actually use it. This section describes the software components that were developed or modified to make use of the NAM. There are three main components: a NAM-aware network setup and management tool that seamlessly integrates the new device with EXTOLL ASICs; a user-level Application Programming Interface (API) that provides access to the NAM and implements the CR features; and a service that acts as the central instance for managing NAMs and their allocations system-wide.
4.7.1 EMP Extension
The EMP is a software component that is integrated with the EXTOLL software stack. It is used to initially assign NIC identifiers and to set up routing to and between EXTOLL devices in a network. EMP must be rerun whenever a system has been powered down or even a single node has been replaced. In its original form, EMP does not support NAMs, as it expects every connected device to provide routing tables and to be able to route between two links via the EXTOLL Crossbar (XBAR). The NAM is an endpoint for any traffic and does not provide a routing table, because routing from one link to the other is not supported. Hence, the network must be configured so that only packets that actually target a specific NAM are sent to it. An additional hardware device type was added to the EMP, which can now route to and from NAMs but will not attempt to route through them. Currently, only fixed, deterministic routing is supported, with exactly one path from one node to another.
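The endpoint-only treatment of NAMs can be illustrated with a small routing sketch. It is not the EMP implementation: the type `topo_t`, the helpers `topo_init`, `add_link`, and `build_routes`, and the use of a breadth-first search are all assumptions made for illustration. The search never expands a NAM, so a NAM can be a destination but never a transit hop, and the fixed neighbour order yields exactly one deterministic route per destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEV  16     /* hypothetical upper bound on devices */
#define NO_ROUTE (-1)

typedef struct {
    int  n;                        /* number of devices in the network */
    bool is_nam[MAX_DEV];          /* true: endpoint only, never a transit hop */
    bool link[MAX_DEV][MAX_DEV];   /* symmetric adjacency matrix */
} topo_t;

void topo_init(topo_t *t, int n)
{
    assert(n > 0 && n <= MAX_DEV);
    memset(t, 0, sizeof *t);
    t->n = n;
}

void add_link(topo_t *t, int a, int b)
{
    assert(a >= 0 && a < t->n && b >= 0 && b < t->n);
    t->link[a][b] = t->link[b][a] = true;
}

/* Fill next_hop[d] with the neighbour of src on a shortest path to d,
 * or NO_ROUTE if d is unreachable (or d == src). Neighbours are visited
 * in ascending index order, so the result is deterministic. */
void build_routes(const topo_t *t, int src, int next_hop[MAX_DEV])
{
    int  queue[MAX_DEV], head = 0, tail = 0;
    bool seen[MAX_DEV] = { false };

    for (int i = 0; i < MAX_DEV; i++)
        next_hop[i] = NO_ROUTE;
    seen[src] = true;
    queue[tail++] = src;

    while (head < tail) {
        int u = queue[head++];
        /* A NAM terminates traffic: reach it, but do not route through it. */
        if (t->is_nam[u] && u != src)
            continue;
        for (int v = 0; v < t->n; v++) {
            if (!t->link[u][v] || seen[v])
                continue;
            seen[v] = true;
            next_hop[v] = (u == src) ? v : next_hop[u];
            queue[tail++] = v;
        }
    }
}
```

In a network where a NAM sits on the shortest path between two nodes, this sketch still selects the longer path through regular nodes, which matches the requirement that only packets targeting the NAM are sent to it.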
4.7.2 The libNAM Library
The libNAM library operates on top of the existing EXTOLL RMA API. The function calls provided by libNAM closely resemble those of libRMA, so existing user applications can be modified without much effort. Listing 4.1 shows a code example that writes to and reads from the NAM. In the initial bring-up phase of the NAM hardware-software interaction, many of the features required to protect the NAM from incorrect usage (e.g. a violation of the 16-byte granularity or unsupported commands) were implemented in hardware. These protection features were gradually shifted into software, reducing hardware and associated implementation complexity.
Reading and writing are realized with send and receive buffers organized in a ring structure. The EXTOLL/NAM notification mechanism is used to manage the buffer space, i.e. to free up locations when data has been transmitted (PUT) or received (GET). The number and size of the elements a buffer can hold are configurable, and the number of elements also limits the number of outstanding transactions.
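The relation between ring elements and outstanding transactions can be sketched as follows. This is a minimal illustration rather than libNAM code: `ring_t`, `ring_acquire`, `ring_release`, and the two size constants are hypothetical, and the sketch assumes that notifications arrive in the order in which the transactions were issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4      /* hypothetical: number of buffer elements */
#define SLOT_BYTES 4096   /* hypothetical: size of one element in bytes */

typedef struct {
    unsigned char data[RING_SLOTS][SLOT_BYTES];
    size_t head;        /* next slot to hand out */
    size_t tail;        /* oldest slot still in flight */
    size_t in_flight;   /* number of outstanding transactions */
} ring_t;

void ring_init(ring_t *r)
{
    r->head = r->tail = r->in_flight = 0;
}

/* Claim a slot for a new PUT or GET. Returns NULL when every slot is in
 * flight, i.e. when the limit of outstanding transactions is reached. */
unsigned char *ring_acquire(ring_t *r)
{
    if (r->in_flight == RING_SLOTS)
        return NULL;
    unsigned char *slot = r->data[r->head];
    r->head = (r->head + 1) % RING_SLOTS;
    r->in_flight++;
    return slot;
}

/* Called when a notification reports that the oldest transaction has
 * completed; frees its slot. Returns false if nothing was in flight. */
bool ring_release(ring_t *r)
{
    if (r->in_flight == 0)
        return false;
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->in_flight--;
    return true;
}
```

A caller that receives NULL from `ring_acquire` would have to wait for a notification and call `ring_release` before issuing the next transfer.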
Currently, data is sent and received on only one of the four available EXTOLL Virtual Channels (VCs). The measurements in Chapter 5 will show whether and how strongly this affects performance. An implementation that uses all VCs would require libNAM to use dedicated buffers, one per VC, to properly handle GET responses that might return out of order.
#include <stdio.h>
/* The libNAM header that declares the nam_* calls must be included as well. */

int main(int argc, char **argv)
{
    nam_allocation_t *my_alloc;
    char hello[] = "Hello NAM!";
    char transferred[13];

    // Allocate NAM for Read/Write
    my_alloc = nam_malloc(sizeof(hello));

    // PUT and GET data
    nam_put_sync(hello, 0, sizeof(hello), my_alloc);
    nam_get_sync(transferred, 0, sizeof(transferred), my_alloc);
    printf("Transferred from NAM: <%s>\n", transferred);

    // Release Allocation
    nam_free(my_alloc);
    return 0;
}
Listing 4.1 libNAM PUT/GET usage example
In subsequent libNAM implementation stages, an MPI-based layer was added to allow a NAM allocation to be shared between processes. This layer also makes it possible to coordinate checkpoint and restart processes for the NAM CR use case. As a system may contain multiple NAMs, libNAM forms sets of the nodes participating in a CR process and assigns each set to one of the NAMs. This assignment is currently implemented in a pseudo-random fashion that balances the number of nodes among sets.
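One way to obtain a pseudo-random yet balanced assignment is sketched below. The source does not describe the actual libNAM algorithm, so everything here is an assumption for illustration: the function `assign_sets`, the helper `lcg_next`, and the choice of a round-robin fill followed by a Fisher-Yates shuffle. The round-robin fill guarantees that set sizes differ by at most one node, and the shuffle randomizes which nodes end up in which set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator so the sketch is reproducible.
 * The slight modulo bias below is acceptable for an illustration. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* nam_of[i] receives the index of the NAM (and thus the set) for node i. */
void assign_sets(int *nam_of, size_t n_nodes, size_t n_nams, uint32_t seed)
{
    assert(n_nams > 0);

    /* Balanced starting point: node i goes to NAM i mod n_nams. */
    for (size_t i = 0; i < n_nodes; i++)
        nam_of[i] = (int)(i % n_nams);

    /* Fisher-Yates shuffle of the labels keeps the per-NAM counts intact. */
    for (size_t i = n_nodes; i > 1; i--) {
        size_t j = lcg_next(&seed) % i;
        int tmp = nam_of[i - 1];
        nam_of[i - 1] = nam_of[j];
        nam_of[j] = tmp;
    }
}
```

Because the shuffle ignores the network topology, two nodes that are far apart can land in the same set.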
Unfortunately, assigning nodes to NAMs without additional information about routing has obvious drawbacks. Figure 4.23 depicts possible set assignments for an example network with eight nodes and two NAMs. There are good mappings with potentially low routing congestion and short distances, but also bad mappings that require more network hops and in which only one NAM link is used. As routes are static, the system behavior in response to NAM placement and set configuration is predictable. It is therefore essential to take the network topology and routing scheme into account when assigning sets. This task can be left either to the user, who must provide an appropriate mapping scheme, or to libNAM, which could use the information provided by EMP to form sets optimally.
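A topology-aware assignment could look like the following sketch, which is not part of libNAM. It assumes a hypothetical hop-count table `hops` (for example derived from the routing information held by EMP) and a hypothetical function `assign_nearest` that greedily places each node on its nearest NAM while capping the set size so that the sets stay balanced. A greedy pass does not guarantee an optimal mapping; it only illustrates how distance information can enter the decision.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NAMS 8   /* hypothetical upper bound on NAMs in a system */

/* hops[i * n_nams + k] is the hop count from node i to NAM k.
 * nam_of[i] receives the NAM chosen for node i. */
void assign_nearest(const int *hops, size_t n_nodes, size_t n_nams,
                    int *nam_of)
{
    assert(n_nams > 0 && n_nams <= MAX_NAMS);

    size_t cap = (n_nodes + n_nams - 1) / n_nams;  /* max nodes per set */
    size_t load[MAX_NAMS] = { 0 };

    for (size_t i = 0; i < n_nodes; i++) {
        int best = -1;
        for (size_t k = 0; k < n_nams; k++) {
            if (load[k] >= cap)
                continue;                          /* set already full */
            if (best < 0 ||
                hops[i * n_nams + k] < hops[i * n_nams + (size_t)best])
                best = (int)k;
        }
        /* cap * n_nams >= n_nodes, so a free set always exists here. */
        nam_of[i] = best;
        load[best]++;
    }
}
```

With a hop table in which four nodes are close to the first NAM and four are close to the second, the sketch assigns each node to its nearer NAM.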
It is also possible that the job scheduler selects a node combination that inevitably leads to a similar condition. Figure 4.24 shows two possible node combinations for a
(a) Optimal mapping. The logically nearest nodes are assigned. Distances are small and all NAM links are utilized.
(b) Good mapping. Larger distances and higher risk of routing congestion. All NAM links are utilized.
(c) Bad mapping. Increased number of hops leads to routing congestion. Only one link per NAM due to static routes.
Fig. 4.23 NAM-XOR set mapping examples (nodes N0 to N7, NAM 0 and NAM 1)
(a) Optimal scheduling. Both NAMs will be accessed through both links.
(b) Suboptimal scheduling. Both NAMs can only be accessed through one link.
Fig. 4.24 Impact of node scheduling on NAM accessibility
routes and the best possible XOR set assignment. The figure shows that NAM checkpointing performance can be significantly affected simply by scheduling the 'wrong' nodes. The impact of suboptimal mapping on performance will be evaluated in Chapter 5.
For CR, libNAM is also responsible for padding data chunks with zeros up to the next 16-byte boundary, since unpadded chunks would violate the NAM access granularity.
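The padding step amounts to rounding the chunk length up to the next multiple of 16 and zero-filling the remainder. The helpers below, `padded_len16` and `pad_chunk16`, are hypothetical names used for illustration and are not part of the libNAM API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAM_GRANULARITY 16u   /* NAM access granularity in bytes */

/* Round len up to the next multiple of 16 (multiples stay unchanged). */
size_t padded_len16(size_t len)
{
    return (len + NAM_GRANULARITY - 1) / NAM_GRANULARITY * NAM_GRANULARITY;
}

/* Copy len bytes from src into dst and zero-fill up to the padded length.
 * dst must provide room for padded_len16(len) bytes.
 * Returns the padded length that would be transferred. */
size_t pad_chunk16(unsigned char *dst, const void *src, size_t len)
{
    size_t padded = padded_len16(len);
    memcpy(dst, src, len);
    memset(dst + len, 0, padded - len);
    return padded;
}
```

For the 11-byte string of Listing 4.1, `padded_len16` yields 16, so five zero bytes would be appended before the transfer.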
The NAM address space of 2 GB per NAM can be allocated as a single contiguous memory region or as multiple contiguous regions. Allocations are granted, managed, and released by a dedicated NAM manager.
4.7.3 NAM Manager
Before a user application can access a NAM it must obtain an allocation. These allocations are managed by the NAM manager. It is implemented as a system service
4.8 NAM Summary