NAM Software - Accelerating Checkpoint/Restart Application Performance in Large-Scale Systems w

listed in reference to the total number of available resources of same type

Resource Type        LUTs             Registers       BRAM            DSP

Utilization          273k (63.0%)     199k (23%)      553 (37.6%)     214 (5.9%)

Per Functional Unit

One EXTOLL Link      66.8k (15.4%)    57.2k (6.6%)    30.50 (2.1%)    47 (1.3%)
EXTOLL MUX           3.8k (0.9%)      2.3k (0.3%)     30 (2%)         0 (0%)
HTL/NTL              24.8k (5.7%)     16.2k (1.9%)    42 (2.9%)       12 (0.3%)
CR Logic             87.2k (20.1%)    43.8k (5.1%)    404 (27.5%)     102 (2.8%)
HMC Layer            21.6k (5%)       19k (2.2%)      15.5 (1.1%)     2 (0.1%)

to 48 ranks at a time. A further increase in the number of ranks would increase Block RAM usage in the specified device region and significantly increase routing congestion in this area. Routing congestion also becomes a major factor when operating frequencies are increased, as the implementation tools start to replicate logic in order to reduce trace lengths and fan-out. The modules that suffered most from routing congestion are the EXTOLL links (fmax = 200 MHz) and the CR logic (fmax = 230 MHz). The final utilization report can be found in Table 4.5.

4.7 NAM Software

Even the best hardware is useless without software that can actually use it. This section describes the software components that were developed or modified to make use of the NAM. There are three main components in this scope: a NAM-aware network setup and management tool that seamlessly integrates this new device with EXTOLL ASICs; a user-level Application Programming Interface (API) that provides access to the NAM and implements the CR features; and a service that acts as a central instance to handle and manage NAMs and their allocations system-wide.


4.7.1 EMP Extension

The EMP is a software component that is integrated with the EXTOLL software stack. It is used to initially assign NIC identifiers and to set up routing to and between EXTOLL devices in a network. EMP must be run whenever a system has been powered down or even a single node has been replaced. In its original form, EMP does not support NAMs because it expects every connected device to provide routing tables and to be able to route from one link to another via the EXTOLL Crossbar (XBAR). The NAM is an endpoint for all traffic and does not provide a routing table, as routing from one link to another is not supported. Hence, the network must be configured such that only packets that actually target a specific NAM are sent to it. An additional hardware device type was therefore added to the EMP, which can now route to and from NAMs but will not attempt to route through them. Currently, only fixed, deterministic routing is supported, with exactly one path from one node to another.
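The endpoint-only constraint described above can be illustrated with a small sketch. It is not EMP code: the adjacency matrix, the `is_nam` flags, `MAX_DEV`, and the function name `hops_no_transit` are all assumptions made for illustration. The sketch runs a breadth-first search over the device graph and never forwards through a device marked as a NAM, so a NAM can be the source or the destination of a path but never an intermediate hop.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEV 16  /* hypothetical upper bound on devices in the sketch */

/* Shortest hop count from src to dst in an undirected device graph,
 * never routing through an endpoint-only device (a NAM).
 * Returns -1 if dst cannot be reached under that constraint. */
static int hops_no_transit(int n, bool adj[MAX_DEV][MAX_DEV],
                           const bool is_nam[MAX_DEV], int src, int dst)
{
    int dist[MAX_DEV];
    int queue[MAX_DEV];
    int qh = 0, qt = 0;

    assert(n > 0 && n <= MAX_DEV);
    for (int i = 0; i < n; i++)
        dist[i] = -1;

    dist[src] = 0;
    queue[qt++] = src;

    while (qh < qt) {
        int u = queue[qh++];
        if (u == dst)
            return dist[u];
        /* A NAM is a pure endpoint: it may start a path, but it never
         * forwards packets from one of its links to another. */
        if (is_nam[u] && u != src)
            continue;
        for (int v = 0; v < n; v++) {
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[qt++] = v;  /* each device is enqueued at most once */
            }
        }
    }
    return -1;
}
```

In a topology where the only two-hop connection between two nodes leads through a NAM, the function reports the longer path over regular nodes, or -1 if no such path exists.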

4.7.2 The libNAM Library

The libNAM library operates on top of the existing EXTOLL RMA API. The function calls provided by libNAM are very similar to those of libRMA, so that existing user applications can be modified without much effort. Listing 4.1 shows a code example that writes to and reads from the NAM. In the initial bring-up phase of the NAM hardware-software interaction, many of the features required to protect the NAM from incorrect usage (e.g. a violation of the 16-byte granularity or unsupported commands) were implemented in hardware. These protection features were gradually shifted into software, reducing hardware and the associated implementation complexity.

Reading and writing are realized with send and receive buffers organized in a ring structure. The EXTOLL/NAM notification mechanism is used to manage the buffer space, i.e. to free up locations when data has been transmitted (PUT) or received (GET). The number and size of the elements a buffer can hold are configurable and at the same time set the limit for outstanding transactions.
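The relationship between ring size and outstanding transactions can be sketched as follows. This is a minimal illustration, not libNAM code: the type `slot_ring_t`, the functions `ring_init`, `ring_acquire`, and `ring_release`, and the constant `RING_SLOTS` are hypothetical names, and the sketch assumes that completion notifications arrive in issue order, which is plausible only while a single virtual channel is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4  /* hypothetical value; the text says this is configurable */

typedef struct {
    size_t head;       /* next slot to hand out */
    size_t tail;       /* oldest slot still in flight */
    size_t in_flight;  /* number of outstanding transactions */
} slot_ring_t;

static void ring_init(slot_ring_t *r)
{
    r->head = 0;
    r->tail = 0;
    r->in_flight = 0;
}

/* Reserve a buffer slot for a new PUT or GET.
 * Returns false when every slot is in flight, i.e. the limit of
 * outstanding transactions has been reached. */
static bool ring_acquire(slot_ring_t *r, size_t *slot)
{
    if (r->in_flight == RING_SLOTS)
        return false;
    *slot = r->head;
    r->head = (r->head + 1) % RING_SLOTS;
    r->in_flight++;
    return true;
}

/* Called when a completion notification arrives: frees the oldest slot.
 * Returns false if nothing was in flight. */
static bool ring_release(slot_ring_t *r)
{
    if (r->in_flight == 0)
        return false;
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->in_flight--;
    return true;
}
```

A caller that receives `false` from `ring_acquire` would have to wait for a notification and call `ring_release` before issuing the next transfer.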

Currently, data is sent and received on only one of the four available EXTOLL Virtual Channels (VCs). The measurements in Chapter 5 will have to show whether and how strongly this affects performance. A possible implementation that uses all VCs would require libNAM to use dedicated buffers, one per VC, to properly handle GET responses that might return out of order.


    int main(int argc, char **argv)
    {
        nam_allocation_t *my_alloc;
        char hello[] = "Hello NAM!";
        char transferred[13];

        //Allocate NAM for Read/Write
        my_alloc = nam_malloc(sizeof(hello));

        //PUT and GET data
        nam_put_sync(hello, 0, sizeof(hello), my_alloc);
        nam_get_sync(transferred, 0, sizeof(transferred), my_alloc);
        printf("Transferred from NAM: <%s>\n", transferred);

        //Release Allocation
        nam_free(my_alloc);
        return 0;
    }

Listing 4.1 libNAM PUT/GET usage example

In subsequent libNAM implementation stages, an MPI-based layer was added to allow a NAM allocation to be shared between processes. This layer also coordinates the checkpoint and restart processes for the NAM CR use case. As there may be multiple NAMs in a system, libNAM forms sets of nodes participating in a CR process and assigns each set to one of the NAMs. This assignment is currently implemented in a pseudo-random fashion that balances the number of nodes among the sets.
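One way to obtain a pseudo-random yet balanced assignment is sketched below. The source does not describe the actual libNAM algorithm, so this is only an assumed approach: nodes are first dealt to NAMs round-robin, which balances the set sizes, and the resulting labels are then permuted with a Fisher-Yates shuffle. The function name `assign_sets` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assign n_nodes nodes to n_nams NAMs so that set sizes differ by at
 * most one, in a pseudo-random order determined by seed.
 * nam_of[i] receives the NAM index for node i. */
static void assign_sets(int *nam_of, int n_nodes, int n_nams, unsigned seed)
{
    assert(n_nodes >= 0 && n_nams > 0);

    /* Round-robin labels guarantee balanced set sizes. */
    for (int i = 0; i < n_nodes; i++)
        nam_of[i] = i % n_nams;

    /* Fisher-Yates shuffle of the labels. rand() % (i + 1) has a slight
     * modulo bias, which is acceptable for an illustration. */
    srand(seed);
    for (int i = n_nodes - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = nam_of[i];
        nam_of[i] = nam_of[j];
        nam_of[j] = tmp;
    }
}
```

Because the shuffle only permutes the labels, the balance established by the round-robin step is preserved regardless of the seed. The sketch ignores network topology entirely.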

Unfortunately, assigning nodes to NAMs without additional information about routing has obvious drawbacks. Figure 4.23 depicts several possible set assignments for an example network with eight nodes and two NAMs. There are good mappings with potentially low routing congestion and short distances, but also bad mappings that require more network hops and in which only one link per NAM is used. As routes are static, the system behavior in response to NAM placement and set configuration is predictable. It is therefore essential to assign sets with the network topology and routing scheme in mind. This task can be left either to the user, who must provide an appropriate mapping scheme, or to libNAM, which could use the information provided by EMP to form sets optimally.

It is also possible that the job scheduler selects a node combination that inevitably leads to a similar condition. Figure 4.24 shows two possible node combinations for a [...] routes and the best possible XOR set assignment. The figure shows that NAM checkpointing performance can be significantly affected simply by scheduling the 'wrong' nodes. The impact of suboptimal mapping on performance will be evaluated in Chapter 5.

[Figure: three panels, each showing nodes N0 to N7 and NAM 0, NAM 1]

(a) Optimal mapping. The logically nearest nodes are assigned. Distances are small and all NAM links are utilized

(b) Good mapping. Larger distances and higher risk of routing congestion. All NAM links are utilized

(c) Bad mapping. Increased number of hops leads to routing congestion. Only one link per NAM due to static routes

Fig. 4.23 NAM-XOR set mapping examples

[Figure: two panels, each showing nodes N0 to N7 and NAM 0, NAM 1]

(a) Optimal scheduling. Both NAMs will be accessed through both links

(b) Suboptimal scheduling. Both NAMs can only be accessed through one link

Fig. 4.24 Impact of node scheduling on NAM accessibility

For CR, libNAM is also responsible for padding data chunks with zeros up to the next 16-byte boundary, since unpadded chunks would otherwise violate the NAM access granularity.
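The padding step can be sketched as follows. This is an illustration only: the helper names `pad16_len` and `pad16_chunk` and the constant `NAM_GRANULARITY` are hypothetical and not part of the libNAM API described in the source. The only value taken from the text is the 16-byte granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAM_GRANULARITY 16u  /* access granularity stated in the text */

/* Round len up to the next multiple of 16 bytes. */
static size_t pad16_len(size_t len)
{
    return (len + NAM_GRANULARITY - 1u) & ~(size_t)(NAM_GRANULARITY - 1u);
}

/* Copy len bytes from src into dst and zero-fill the tail up to the
 * padded length. dst must provide at least pad16_len(len) bytes.
 * Returns the padded length, i.e. the size to transfer. */
static size_t pad16_chunk(void *dst, const void *src, size_t len)
{
    size_t padded = pad16_len(len);

    assert(dst != NULL && (src != NULL || len == 0));
    if (len > 0)
        memcpy(dst, src, len);
    memset((unsigned char *)dst + len, 0, padded - len);
    return padded;
}
```

For example, the 11-byte string from Listing 4.1 would be transferred as one 16-byte chunk whose last five bytes are zero.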

The NAM address space of 2 GB per NAM can be allocated as a single contiguous memory region or as multiple contiguous regions. Allocations are granted, managed, and released by a dedicated NAM manager.

4.7.3 NAM Manager

Before a user application can access a NAM, it must obtain an allocation. These allocations are managed by the NAM manager. It is implemented as a system service

4.8 NAM Summary

In document Accelerating Checkpoint/Restart Application Performance in Large-Scale Systems with Network Attached Memory (Page 119-123)