The mapping of the LNET protocol semantics onto the Extoll technology comprises two components: the data transmission protocols themselves and a message matching mechanism on both the initiating and the target side. The following sections provide the specification of the data transmission protocols followed by the description of how incoming messages are matched with their corresponding LNET messages.
6.5.1 Data Transmission Protocols
EXLND distinguishes two types of transmission protocols: immediate sends and bulk data transfers. The transmission protocol is chosen depending on both the message type and the size of the payload.
6.5.1.1 Immediate Send
Similar to the eager protocol of EXT-Eth presented in section 5.4.1.2, the immediate send splits the payload of an LNET message into 120-byte fragments and sends the data directly through multiple VELO sends. The immediate send is used for messages of type ACK, which typically have no payload attached and fit into one VELO message. It can also be used for messages of type PUT or GET with small payloads. A user-tunable threshold, similar to EXT-Eth, defines the upper limit for the immediate send path. On the receiving side, the VELO message containing the last fragment of the payload notifies EXLND that all data has been received; EXLND then re-assembles the payload and passes it to the upper layers.
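The fragmentation and re-assembly described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EXLND's kernel code: the function names and the `is_last` flag are hypothetical (the text does not say how the last fragment is marked), and in-order delivery of the fragments is assumed.

```python
VELO_FRAGMENT_SIZE = 120  # bytes per fragment, as stated in the text


def fragment_payload(payload: bytes, fragment_size: int = VELO_FRAGMENT_SIZE):
    """Split a payload into (is_last, chunk) pairs, one pair per VELO send.

    The is_last flag is a hypothetical marker for the final fragment.
    """
    if not payload:
        # A message without payload (e.g. an ACK) still uses one VELO message.
        return [(True, b"")]
    chunks = [payload[i:i + fragment_size]
              for i in range(0, len(payload), fragment_size)]
    return [(i == len(chunks) - 1, chunk) for i, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Receiver side: append fragments until the one flagged as last arrives."""
    buf = bytearray()
    for is_last, chunk in fragments:
        buf += chunk
        if is_last:
            return bytes(buf)
    raise ValueError("last fragment never arrived")
```

A 250-byte payload, for example, is split into two full fragments of 120 bytes and a final fragment of 10 bytes.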
6 Efficient Lustre Networking Protocol Support
Figure 6.13: Overview of LNetPut() communication sequence over Extoll. (Diagram labels: LNET and EXLND on the initiator and on the target; LNetPut(), RMA GET, responder notification, completer notification, return.)
Figure 6.14: Overview of LNetGet() communication sequence over Extoll. (Diagram labels: LNET and EXLND on the initiator and on the target; LNetGet(), completer notification, requester notification, return.)
6.5.1.2 Bulk Data Transfer
In general, a send path can be divided into two steps: the setup and registration of the data buffers, and the transmission itself. One of the main design goals of EXLND is to minimize the number of LND-internal messages that need to be exchanged in order to perform an LNetPut() or LNetGet(). This is achieved by internally swapping the RMA operations performed by EXLND. In the case of a PUT message, EXLND performs a rendezvous GET in order to read the data from the initiating node and write it directly to the target node. In the case of a GET message, EXLND performs a rendezvous PUT, which writes the payload directly to the data sink on the initiating node. In this context, rendezvous means that the RMA operation is advertised through a VELO message from the initiating node.
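The protocol selection described so far can be summarized as a small decision function. The sketch below is illustrative only: the threshold value, the inclusive comparison, and all names are assumptions, since the text only states that the threshold is user-tunable. For a GET message, `payload_size` stands for the size of the requested data.

```python
from enum import Enum


class MsgType(Enum):
    ACK = "ACK"
    PUT = "PUT"
    GET = "GET"


# Hypothetical default; the actual tunable and its value are not given here.
IMMEDIATE_THRESHOLD = 4096


def choose_protocol(msg_type: MsgType, payload_size: int,
                    threshold: int = IMMEDIATE_THRESHOLD) -> str:
    """Pick the transmission protocol from message type and payload size."""
    if msg_type is MsgType.ACK or payload_size <= threshold:
        return "immediate"        # fragments sent through VELO
    if msg_type is MsgType.PUT:
        return "rendezvous_get"   # target reads the payload via RMA GET
    return "rendezvous_put"       # target writes the data via RMA PUT
```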
Rendezvous GET Protocol
As described in section 6.2.2, an LNetPut() triggers the transmission of a PUT message, while the ACK message is optional. The PUT message has a memory descriptor attached, which describes the data to be sent over the network. Typically, a client needs to send a PUT request to the destination node. Once the target has allocated the PUT sink, the target informs the initiator of the availability and the PUT operation can be performed. For Extoll, EXLND maps the PUT source buffer to an NLA. If the payload is provided in the form of a page list, EXLND only needs to make sure that the buffer is pinned and can then directly utilize the kernel API to map the buffer to RMA-addressable memory. For a buffer of type kvec, EXLND needs to translate the buffer into an array of pinned pages before it registers the memory.
A PUT message is transmitted through the rendezvous GET protocol. This means that the payload of the LNET message on the initiating node acts as a GET source buffer for the target side. First, the GET source address and LNET header are sent to the target node by encapsulating the information into a VELO message. On the target side, lnet_parse() calls EXLND’s receive function, which in turn initiates the RMA GET operation. The beauty of this approach is that there is no further communication required between the participating peers. Upon completion of the GET operation, the RMA unit generates two notifications, one on the initiating and one on the target node. On the initiating side, an RMA responder notification notifies EXLND that the send operation has succeeded, which results in the finalizing of the LNET message. On the target side, an RMA completer notification informs EXLND of the successful data transmission and the LNET message can be finalized. Figure 6.13 outlines the complete PUT sequence.
Rendezvous PUT Protocol
When the send function is called by LNetGet(), EXLND receives a message of type GET from LNET. GET messages have no data attached, but encode the information about the GET sink buffer. EXLND maps this buffer to an NLA, and the rendezvous PUT protocol is used to transfer the requested data from the target node into this buffer. This means that when the target parses the incoming header of the GET message, EXLND’s receive function receives a REPLY message, which provides the data source information for the transmission. The source buffer is also mapped to an NLA and the receive function triggers an RMA PUT operation to the initiating node.
Once the transmission has successfully completed, two RMA notifications are written. On the initiating side, a completer notification informs EXLND that the data has been received. On the target node, a requester notification informs EXLND that the PUT operation has been performed. On both sides, the notification is used to find the corresponding LNET message. The NLAs are de-registered and the LNET messages can be finalized.
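The two rendezvous protocols can be contrasted in a toy simulation. The sketch below is not EXLND code: the `Node` class, the dictionary standing in for NLA-registered memory, and the function names are all hypothetical. It only shows in which direction the data moves and which notification type each side observes; the VELO advertisement from the initiator is implied by the function call.

```python
class Node:
    """Stand-in for a peer: registered buffers plus received notifications."""

    def __init__(self):
        self.memory = {}         # NLA -> bytearray (hypothetical model)
        self.notifications = []  # notification types written by the RMA unit


def rendezvous_get(initiator, target, src_nla, sink_nla):
    """LNetPut(): the target reads the payload from the initiator (RMA GET)."""
    data = initiator.memory[src_nla]
    target.memory[sink_nla][:len(data)] = data
    initiator.notifications.append("responder")  # source buffer was read
    target.notifications.append("completer")     # data arrived in the sink


def rendezvous_put(initiator, target, sink_nla, src_nla):
    """LNetGet(): the target writes the data to the initiator (RMA PUT)."""
    data = target.memory[src_nla]
    initiator.memory[sink_nla][:len(data)] = data
    initiator.notifications.append("completer")  # data arrived in the sink
    target.notifications.append("requester")     # PUT has been issued
```

In both cases a single advertisement from the initiator is enough, and each side learns about completion from its local notification rather than from an additional message.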
6.5.2 Message Matching and Descriptor Queues
Upon the completion of a bulk data transmission, LNET has to be notified about the completed transfer by finalizing the corresponding LNET message through lnet_finalize(). As previously described, EXLND utilizes Extoll’s RMA notification mechanism. These notifications can be associated with different EXLND events on the initiating and target nodes, and used to implement a message matching mechanism. In accordance with Figures 6.13 and 6.14, the following EXLND events can be distinguished:
LNetPut Done – Initiator: The payload of the PUT message has been sent and the initiating node can de-register the associated software NLA. Since the data transfer has been performed through an RMA GET from the target node, a responder notification is generated on the initiating node.
LNetPut Done – Target: The payload of the PUT message has been received and the target node can de-register the associated software NLA. Since the data transfer has been performed through an RMA GET from the target node, a completer notification is generated.
LNetGet Done – Initiator: The data has been received from the target node and the associated software NLA can be de-registered. Since the data transfer has been performed through an RMA PUT from the target node, a completer notification is generated on the initiating node of the GET message.
LNetGet Done – Target: The data has been written to the initiating node of the GET message and the target node can de-register the associated software NLA. Since the data transfer has been performed through an RMA PUT from the target node, a requester notification is generated.
Figure 6.15: The movement of list elements representing unfinished transfers. (Diagram labels: pool of unused LNET message descriptors; unmatched LNET message descriptors; transition "Sending LNET header via VELO or starting an RMA transfer"; transition "Upon reception of matching RMA notification".)
To match incoming RMA notifications with pending LNET messages, EXLND defines three descriptor queues, each corresponding to one of the three notification types. Each peer (clients and servers) implements these queues.
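The relation between the four EXLND events and the three descriptor queues can be written down as a lookup table. The names below are hypothetical and only restate the event list above; they are not taken from the EXLND sources.

```python
# (operation, role) -> RMA notification type that signals completion
EXPECTED_NOTIFICATION = {
    ("LNetPut", "initiator"): "responder",
    ("LNetPut", "target"): "completer",
    ("LNetGet", "initiator"): "completer",
    ("LNetGet", "target"): "requester",
}


def queue_for(operation: str, role: str) -> str:
    """Return the descriptor queue a pending LNET message belongs to."""
    return EXPECTED_NOTIFICATION[(operation, role)]
```

Note that the completer queue serves two events, the target side of an LNetPut() and the initiating side of an LNetGet(), which is why three queues suffice for four events.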
Recall from section 5.4.2.1 that RMA notifications contain a field with the start address (physical or NLA) of the last read or write performed by the RMA unit. Extoll divides RMA transfers larger than the RMA MTU into multiple frames of the size of the RMA MTU. This fragmentation is performed by the requester, which means that the requester knows the original start address of a read or write operation. The other two RMA subunits, responder and completer, only know where the read or write of the last frame has started, which means that the corresponding notification contains the start address of the last frame. This address needs to be calculated on each node before initiating the bulk data transmission. Depending on the expected EXLND event, this happens either in EXLND’s send or receive function.
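As a rough illustration of this calculation, the following sketch computes the address a node would expect to see in a notification. It is not the equation from section 5.4.2.2, which is not reproduced here. It assumes that frames are cut at multiples of the MTU counted from the start address, and that a requester notification carries the original start address, which the text implies but does not state outright. The function name and the MTU value in the example are hypothetical.

```python
def expected_notification_address(start: int, length: int, mtu: int,
                                  kind: str) -> int:
    """Address expected in an RMA notification for a transfer of `length` bytes.

    Assumption: the transfer is split into frames of `mtu` bytes beginning at
    `start`; only the final frame may be shorter.
    """
    if length <= 0 or mtu <= 0:
        raise ValueError("length and mtu must be positive")
    if kind == "requester":
        # The requester performs the fragmentation and knows the original start.
        return start
    if kind in ("responder", "completer"):
        # These subunits only see the start address of the last frame.
        last_frame_index = (length - 1) // mtu
        return start + last_frame_index * mtu
    raise ValueError(f"unknown notification type: {kind}")
```

With a hypothetical MTU of 4096 bytes, a 10000-byte transfer starting at 0x1000 consists of three frames, so a completer notification would be expected to carry 0x1000 + 2 * 4096 = 0x3000.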
Since EXLND knows all possible event types and their expected RMA notifications, EXLND adds the pending LNET message to the corresponding descriptor queue and connects it with the address that is expected to be seen in the RMA notification.
    # add to /etc/modprobe.d/lustre.conf
    options lnet networks="ex(ex0)"
Listing 6.4: Lustre configuration for EXLND.
For the address calculation, the equation presented in section 5.4.2.2 can be reused. Afterwards, EXLND initiates the data transfer by sending the header of the LNET message through VELO or performing an RMA operation. When EXLND receives a new RMA notification, the corresponding descriptor queue is searched. If the address can be matched, EXLND has found the corresponding LNET message and finalizes it through lnet_finalize().
The descriptor queues are realized as doubly linked lists. At module startup time, EXLND initializes a pool of empty message descriptors. When a new transfer is started, EXLND’s send and receive functions search for an empty entry, initialize its values, and put it into the correct descriptor queue, depending on the expected RMA notification. After an LNET message has been freed, the corresponding list item is moved back into the pool of free descriptors. This is done to avoid allocating and freeing memory for every send and receive, which would be comparatively time-consuming. Figure 6.15 illustrates the movement of list elements.
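The interplay of the descriptor pool, the three queues, and the address-based matching can be modeled as follows. This is a user-space sketch under assumptions: Python deques stand in for the kernel's doubly linked lists, all class and method names are hypothetical, and raising an error on pool exhaustion is an arbitrary choice, since the text does not describe that case.

```python
from collections import deque

NOTIFICATION_KINDS = ("requester", "responder", "completer")


class Descriptor:
    """Pre-allocated list element tracking one unfinished transfer."""
    __slots__ = ("expected_addr", "lnet_msg")

    def __init__(self):
        self.expected_addr = None
        self.lnet_msg = None


class DescriptorTable:
    def __init__(self, pool_size: int):
        # Allocated once at startup; never freed per message.
        self.free = deque(Descriptor() for _ in range(pool_size))
        self.pending = {kind: deque() for kind in NOTIFICATION_KINDS}

    def track(self, kind, expected_addr, lnet_msg):
        """Called from send/receive before the transfer is started."""
        if not self.free:
            raise RuntimeError("descriptor pool exhausted")  # assumed policy
        desc = self.free.popleft()
        desc.expected_addr, desc.lnet_msg = expected_addr, lnet_msg
        self.pending[kind].append(desc)

    def on_notification(self, kind, addr):
        """Match a notification by address; return the LNET message or None."""
        queue = self.pending[kind]
        for desc in queue:
            if desc.expected_addr == addr:
                break
        else:
            return None
        queue.remove(desc)
        msg = desc.lnet_msg
        desc.expected_addr = desc.lnet_msg = None
        self.free.append(desc)  # recycle instead of freeing memory
        return msg              # the caller would now finalize this message
```

Keeping one queue per notification type means that the same address can be pending in two queues at once without ambiguity, because a lookup only searches the queue that corresponds to the incoming notification.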