Chapter 6: Open MPI v1.0
6.2 The TCP BTL Component
The components of the Byte Transfer Layer (BTL) allow Open MPI to abstract from the underlying communication protocol and hardware. The interface functions of a BTL component are invoked by components of other frameworks, namely the BTL Management Layer (BML), the Point-to-point Management Layer (PML), or the collectives framework (COLL), for low-level transmission of opaque data. BTL components therefore need no particular knowledge of MPI semantics.
6.2.1 The Lifecycle of the TCP BTL Component
During the MPI_INIT() call the MCA initializes the component frameworks. These in turn search for all available components, which may have been linked statically into the MPI library or may be shared libraries residing in one of the configured component directories. The MCA identifies loadable components by a strict file naming scheme that indicates both the framework a component belongs to and the component's name. Next, the descriptor of the component, an object derived from mca_base_component_t, is examined and its mca_open_component() function is called. The open function of the TCP BTL component registers several run-time parameters with the MCA framework, which hands their values to the component if they were configured; otherwise default values are used. In the next step the TCP component is initialized. It creates one module for each of the system's network interfaces, except for those excluded by configuration, and a single listening socket that is later used to accept connections from remote processes. Afterwards it announces the addresses of the local interfaces and the port bound by the listening socket to the remote processes by means of the module exchange. Upon success the initialization function returns a list of the created modules. Upon failure a component can return a NULL list to indicate that its services are not available, e.g. when the necessary hardware is not present in the system.
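The initialization step described above can be sketched as follows. This is a minimal illustrative model in Python, not Open MPI's C code: the names TcpModule and component_init, the interface table, and the exclusion set are all assumptions made for the example. The listening socket and the address announcement are left out.

```python
from dataclasses import dataclass


@dataclass
class TcpModule:
    """Hypothetical stand-in for one TCP BTL module (one per interface)."""
    interface: str
    address: str


def component_init(interfaces, excluded):
    """Create one module per interface that is not excluded by configuration.

    Returns None (modelling the NULL list) when no module could be created,
    which signals that the component's services are not available.
    """
    modules = [
        TcpModule(name, address)
        for name, address in interfaces.items()
        if name not in excluded
    ]
    return modules or None


# Example interface table; names and addresses are made up for illustration.
interfaces = {"lo": "127.0.0.1", "eth0": "192.168.1.10", "eth1": "10.0.0.10"}
modules = component_init(interfaces, excluded={"lo"})
unavailable = component_init({"lo": "127.0.0.1"}, excluded={"lo"})
```

Here modules holds one entry each for eth0 and eth1, while unavailable is None because the only interface was excluded.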
The Point-to-point Management Layer (PML) maintains a global registry in which each component can store information it wants to share with remote components of the same type, e.g. the information needed to connect to the local process. Between component initialization and the subsequent selection phase, the stored values are exchanged with all other processes by the out-of-band module exchange.
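The effect of the module exchange can be illustrated with a small model. The sketch below assumes a simple in-memory representation: each process holds a local registry keyed by component name, and the exchange gives every process the values published by all processes. The function name exchange, the key "btl.tcp", and the address tuples are hypothetical; the real exchange runs over the out-of-band channel.

```python
def exchange(local_registries):
    """Model the module exchange.

    local_registries maps a process rank to that process's local registry,
    which in turn maps a component name to the value it published.
    The result maps each rank to a merged view:
    component name -> {publishing rank -> published value}.
    """
    merged = {}
    for rank, registry in local_registries.items():
        for component, value in registry.items():
            merged.setdefault(component, {})[rank] = value
    # After the exchange, every process sees the same merged view.
    return {rank: merged for rank in local_registries}


# Two processes publish the (address, port) pairs of their TCP component.
local_registries = {
    0: {"btl.tcp": [("192.168.1.10", 5000)]},
    1: {"btl.tcp": [("192.168.1.11", 5001)]},
}
views = exchange(local_registries)
```

After the call, process 0 can look up views[0]["btl.tcp"][1] to learn how to reach process 1, and vice versa.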
After the components have been initialized, the MCA selects one or more of them to be used. For BTL components this happens when remote processes are added: the mca_btl_base_add_proc_fn_t() function of each module is called with a list of remote process identifiers. It is the task of this function to decide whether the module can reach each remote process. The TCP modules do this by comparing the subnet bits of the local and remote addresses. If no remote address in the same subnet as the local interface is found, the first available remote address is used; since TCP/IP supports routing, this address might be reachable as well. The function returns a bit mask to the MCA in which the bits corresponding to all reachable processes are set. Based on an internal priority ranking of the BTL components and the discovered reachability of remote processes, the MCA then decides which components should be used for data transmission to each remote process.
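The reachability check can be sketched with Python's standard ipaddress module. This is an illustrative model under stated assumptions, not the component's actual code: the function names pick_remote_address and add_procs are hypothetical, the local interface is given in address/prefix notation, and a process without any exported address is treated as unreachable.

```python
import ipaddress


def pick_remote_address(local_interface, remote_addresses):
    """Prefer a remote address in the local subnet, else the first one."""
    if not remote_addresses:
        return None
    network = ipaddress.ip_interface(local_interface).network
    for address in remote_addresses:
        if ipaddress.ip_address(address) in network:
            return address
    # No address shares the subnet; routing may still make this one reachable.
    return remote_addresses[0]


def add_procs(local_interface, procs):
    """Return the chosen address per process and a reachability bit mask.

    procs is a list with one list of exported addresses per remote process.
    Bit i of the mask is set when process i is considered reachable.
    """
    chosen = []
    mask = 0
    for index, addresses in enumerate(procs):
        address = pick_remote_address(local_interface, addresses)
        chosen.append(address)
        if address is not None:
            mask |= 1 << index
    return chosen, mask


procs = [["10.0.0.5", "192.168.1.20"], ["172.16.0.9"], []]
chosen, mask = add_procs("192.168.1.10/24", procs)
```

For the first process the address in the local /24 subnet is chosen, for the second the only available address is used as a fallback, and the third process has no address and its bit stays cleared.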
When the MPI process shuts down, the close function of each component is called so that the component can clean up and release all allocated resources.
6.2.2 Managing Remote Processes
The TCP BTL component maintains a hashtable of descriptors of all known remote processes, which is shared among all instantiated modules. The network addresses exported by a remote process and obtained through the module exchange are stored in an array together with each process descriptor.
Upon component selection, i.e. when the TCP modules check the reachability of the remote addresses, they create additional endpoint descriptors, each containing a reference to the remote process and the address by which it is reachable. The endpoints are returned to the upper layers of Open MPI, which later pass them as parameters to the data transmission functions to identify the destination.
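The relationship between the shared process table and the per-module endpoints might be modelled as below. All class and method names (ProcDescriptor, Endpoint, TcpComponent, TcpModule, lookup_or_create, add_proc) are assumptions chosen for this sketch, and a Python dict stands in for the hash table.

```python
from dataclasses import dataclass


@dataclass
class ProcDescriptor:
    """Descriptor of a remote process with its exported addresses."""
    name: str
    addresses: list


@dataclass
class Endpoint:
    """Pairs a remote process with the address a module reaches it by."""
    proc: ProcDescriptor
    address: str


class TcpComponent:
    def __init__(self):
        self.procs = {}  # process table shared by all modules

    def lookup_or_create(self, name, addresses):
        proc = self.procs.get(name)
        if proc is None:
            proc = self.procs[name] = ProcDescriptor(name, list(addresses))
        return proc


class TcpModule:
    def __init__(self, component, interface):
        self.component = component
        self.interface = interface

    def add_proc(self, name, addresses, reachable_address):
        """Create the endpoint that is handed to the upper layers."""
        proc = self.component.lookup_or_create(name, addresses)
        return Endpoint(proc, reachable_address)


component = TcpComponent()
eth0 = TcpModule(component, "eth0")
eth1 = TcpModule(component, "eth1")
addresses = ["192.168.1.20", "10.0.0.20"]
endpoint0 = eth0.add_proc("rank-1", addresses, "192.168.1.20")
endpoint1 = eth1.add_proc("rank-1", addresses, "10.0.0.20")
```

Both endpoints refer to the same process descriptor but carry different addresses, one per module.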
6.2.3 Data Transmission
The TCP BTL component connects lazily to minimize the number of file descriptors needed. At startup only a single socket is created, listening on all available interfaces, and connections to remote processes are not established until the first data transmission is requested.
Data transmission is handled almost entirely by the TCP endpoints. Upon the first transmission request the necessary network socket is created and associated with the endpoint, and the connection to the remote process is initiated. The TCP BTL component uses non-blocking sockets. To be notified of socket events, an event object is registered with the event library of Open MPI. It contains the socket's descriptor and pointers to the endpoint's receive and send handler functions, which are invoked whenever an associated event occurs.
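The combination of lazy connecting, non-blocking sockets, and event-driven handlers can be sketched with Python's standard selectors module, which plays the role of the event library here. This is a simplified model under several assumptions: the Endpoint class, its handler names, and the poll helper are hypothetical, and the connection step is replaced by a factory that returns one end of a local socket pair so that the example runs without a network.

```python
import selectors
import socket


class Endpoint:
    """Creates its socket lazily, on the first transmission request."""

    def __init__(self, selector, connect):
        self.selector = selector
        self.connect = connect   # hypothetical factory returning a socket
        self.sock = None
        self.pending = []        # data queued for the send handler
        self.received = []       # data collected by the receive handler

    def send(self, data):
        if self.sock is None:
            self.sock = self.connect()
            self.sock.setblocking(False)
            events = selectors.EVENT_READ | selectors.EVENT_WRITE
            # Register the endpoint so its handlers can be found on events.
            self.selector.register(self.sock, events, data=self)
        self.pending.append(data)

    def on_writable(self):
        if self.pending:
            data = self.pending.pop(0)
            sent = self.sock.send(data)
            if sent < len(data):
                self.pending.insert(0, data[sent:])  # retry the remainder

    def on_readable(self):
        data = self.sock.recv(4096)
        if data:
            self.received.append(data)


def poll(selector, timeout=1.0):
    """Run one pass of the event loop and dispatch to the handlers."""
    for key, mask in selector.select(timeout):
        endpoint = key.data
        if mask & selectors.EVENT_WRITE:
            endpoint.on_writable()
        if mask & selectors.EVENT_READ:
            endpoint.on_readable()


peers = []


def connect():
    local, remote = socket.socketpair()  # stands in for a TCP connection
    peers.append(remote)
    return local


selector = selectors.DefaultSelector()
endpoint = Endpoint(selector, connect)
socket_before_send = endpoint.sock       # still None: nothing created yet
endpoint.send(b"hello")
poll(selector)
first_message = peers[0].recv(5)
peers[0].sendall(b"pong")
poll(selector)
```

A real endpoint would additionally track the connection state, since a non-blocking connect completes asynchronously; the sketch skips this by starting from an already connected socket.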
Once the connection is established, data can be transmitted. The transmission service of a BTL can be used by several other components, so the target component needs to be addressed as well, and the BTL needs to know how the different components are to be notified about received data. For that purpose each component that wants to use the basic transmission services registers with the BTL, providing a special tag value and its data reception callback function. The BTL in turn maintains a list of all registered callbacks. When data is to be transmitted, the upper layers of Open MPI pass the endpoint identifying the target process and the tag of the target component to the transmission function of the BTL. This function prepends a private header containing the data length and the provided tag to the data before transmission. The receiving BTL inspects the tag and invokes the registered callback to hand the data to the target component.
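The tag-based dispatch can be illustrated as follows. The sketch assumes a header layout of a one-byte tag followed by a four-byte length in network byte order; this layout, the class name Btl, and its methods are assumptions for illustration and do not reproduce the private header actually used by the TCP BTL.

```python
import struct

# Hypothetical header: 1-byte tag, 4-byte payload length, network byte order.
HEADER = struct.Struct("!BI")


class Btl:
    def __init__(self):
        self.callbacks = {}  # tag -> reception callback of a component

    def register(self, tag, callback):
        """Called by a component that wants to receive data for a tag."""
        self.callbacks[tag] = callback

    def frame(self, tag, payload):
        """Sender side: prepend the header to the opaque payload."""
        return HEADER.pack(tag, len(payload)) + payload

    def deliver(self, wire_bytes):
        """Receiver side: inspect the tag and invoke the callback."""
        tag, length = HEADER.unpack_from(wire_bytes)
        payload = wire_bytes[HEADER.size:HEADER.size + length]
        self.callbacks[tag](payload)


btl = Btl()
received = []
btl.register(7, received.append)  # tag 7 is an arbitrary example value
wire = btl.frame(7, b"hello")
btl.deliver(wire)
```

The receiving side never interprets the payload itself; it only uses the tag to find the component that should get the data.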