
NAME

       fi_intro - libfabric introduction

OVERVIEW

       This introduction is part of the libfabric programmer’s guide.  See fi_guide(7).  This section provides
       insight into the motivation for the libfabric design and the underlying networking features that are
       exposed through the API.

Review of Sockets Communication

       The sockets API is a widely used networking API.  This guide assumes that a reader has a  working  knowl‐
       edge of programming to sockets.  It makes reference to socket-based communications throughout in an ef‐
       fort to help explain libfabric concepts and how they relate to or differ from the socket API.  To be clear,
       the intent of this guide is not to criticize the socket API, but to use sockets as a starting point in
       order  to  explain  certain network features or limitations.  The following sections provide a high-level
       overview of socket semantics for reference.

   Connected (TCP) Communication
       The most widely used type of socket is SOCK_STREAM.  This sort of socket usually runs over TCP/IP, and as
       a result is often referred to as a `TCP' socket.  TCP sockets are connection-oriented, requiring  an  ex‐
       plicit connection setup before data transfers can occur.  A single TCP socket can only transfer data to a
       single peer socket.  Communicating with multiple peers requires the use of one socket per peer.

       Applications  using  TCP sockets are typically labeled as either a client or server.  Server applications
       listen for connection requests, and accept them when they occur.  Clients, on the  other  hand,  initiate
       connections to the server.  In socket API terms, a server calls listen(), and the client calls connect().
       After  a  connection  has  been established, data transfers between a client and server are similar.  The
       following code segments highlight the general flow for a sample client and server.   Error  handling  and
       some subtleties of the socket API are omitted for brevity.

              /* Example server code flow to initiate listen */
              struct addrinfo *ai, hints;
              int listen_fd;

              memset(&hints, 0, sizeof hints);
              hints.ai_socktype = SOCK_STREAM;
              hints.ai_flags = AI_PASSIVE;
              getaddrinfo(NULL, "7471", &hints, &ai);

              listen_fd = socket(ai->ai_family, SOCK_STREAM, 0);
              bind(listen_fd, ai->ai_addr, ai->ai_addrlen);
              freeaddrinfo(ai);

              fcntl(listen_fd, F_SETFL, O_NONBLOCK);
              listen(listen_fd, 128);

       In  this example, the server will listen for connection requests on port 7471 across all addresses in the
       system.  The call to getaddrinfo() is used to form the local socket address.  The node parameter  is  set
       to NULL, which results in a wildcard IP address being returned.  The port is hard-coded to 7471.  The
       AI_PASSIVE flag signifies that the address will be used by the listening side of  the  connection.   That
       is, the address information should be relative to the local node.

       This example will work with both IPv4 and IPv6.  The getaddrinfo() call abstracts the address format away
       from  the  server, improving its portability.  Using the data returned by getaddrinfo(), the server allo‐
       cates a socket of type SOCK_STREAM, and binds the socket to port 7471.

       In practice, most enterprise-level applications make use of non-blocking sockets.  This is needed  for  a
       single  application thread to manage multiple socket connections.  The fcntl() command sets the listening
       socket to non-blocking mode.  This will affect how the server processes connection  requests  (shown  be‐
       low).   Finally,  the server starts listening for connection requests by calling listen.  Until listen is
       called, connection requests that arrive at the server will be rejected by the operating system.

              /* Example client code flow to start connection */
              struct addrinfo *ai, hints;
              int client_fd;

              memset(&hints, 0, sizeof hints);
              hints.ai_socktype = SOCK_STREAM;
              getaddrinfo("10.31.20.04", "7471", &hints, &ai);

              client_fd = socket(ai->ai_family, SOCK_STREAM, 0);
              fcntl(client_fd, F_SETFL, O_NONBLOCK);

              connect(client_fd, ai->ai_addr, ai->ai_addrlen);
              freeaddrinfo(ai);

       Similar to the server, the client makes use of getaddrinfo().  Since the AI_PASSIVE flag  is  not  speci‐
       fied, the given address is treated as that of the destination.  The client expects to reach the server at
       IP  address  10.31.20.04,  port  7471.  For this example the address is hard-coded into the client.  More
       typically, the address will be given to the client via the command line, through a configuration file, or
       from a service.  Often the port number will be well-known, and the client will find the server  by  name,
       with DNS (the domain name system) providing the name-to-address resolution.  Fortunately, the getaddrinfo
       call can be used to convert host names into IP addresses.

       Whether the client is given the server’s network address directly or a name which must be translated into
       the network address, the mechanism used to provide this information to the client varies widely.  A  sim‐
       ple mechanism that is commonly used is for users to provide the server’s address using a command line op‐
       tion.  The problem of telling applications where their peers are located increases significantly for appli‐
       cations  that  communicate with hundreds to millions of peer processes, often requiring a separate, dedi‐
       cated application to solve.  For a typical client-server socket application, this is not an issue, so  we
       will defer more discussion until later.

       Using the getaddrinfo() results, the client opens a socket, configures it for non-blocking mode, and ini‐
       tiates  the connection request.  At this point, the network stack has sent a request to the server to es‐
       tablish the connection.  Because the socket has been set to non-blocking, the connect  call  will  return
       immediately and not wait for the connection to be established.  As a result, any attempt to send data at
       this point will likely fail.

              /* Example server code flow to accept a connection */
              struct pollfd fds;
              int server_fd;

              fds.fd = listen_fd;
              fds.events = POLLIN;

              poll(&fds, 1, -1);

              server_fd = accept(listen_fd, NULL, 0);
              fcntl(server_fd, F_SETFL, O_NONBLOCK);

       Applications that use non-blocking sockets use select(), poll(), or an equivalent such as epoll() to  re‐
       ceive notification of when a socket is ready to send or receive data.  In this case, the server wishes to
       know  when  the  listening socket has a connection request to process.  It adds the listening socket to a
       poll set, then waits until a connection request arrives (i.e. POLLIN is true).  The  poll()  call  blocks
       until POLLIN is set on the socket.  POLLIN indicates that the socket has data to accept.  Since this is a
       listening  socket, the data is a connection request.  The server accepts the request by calling accept().
       That returns a new socket to the server, which is ready for data transfers.  Finally, the server sets the
       new socket to non-blocking mode.

              /* Example client code flow to establish a connection */
              struct pollfd fds;
              int err;
              socklen_t len;

              fds.fd = client_fd;
              fds.events = POLLOUT;

              poll(&fds, 1, -1);

              len = sizeof err;
              getsockopt(client_fd, SOL_SOCKET, SO_ERROR, &err, &len);

       The client is notified that its connection request has completed when its connecting socket is `ready  to
       send data' (i.e. POLLOUT is true).  The poll() call blocks until POLLOUT is set on the socket, indicating
       the  connection  attempt is done.  Note that the connection request may have completed with an error, and
       the client still needs to check if the connection attempt was successful.  That is not  conveyed  to  the
       application  by  the poll() call.  The getsockopt() call is used to retrieve the result of the connection
       attempt.  If err in this example is set to 0, then the connection attempt succeeded.  The socket  is  now
       ready to send and receive data.

       After  a  connection  has been established, the process of sending or receiving data is the same for both
       the client and server.  The examples below differ only by the name of the socket variable used by the client
       or server application.

              /* Example of client sending data to server */
              struct pollfd fds;
              size_t offset, size, ret;
              char buf[4096];

              fds.fd = client_fd;
              fds.events = POLLOUT;

              size = sizeof(buf);
              for (offset = 0; offset < size; ) {
                  poll(&fds, 1, -1);

                  ret = send(client_fd, buf + offset, size - offset, 0);
                  offset += ret;
              }

       Network  communication  involves buffering of data at both the sending and receiving sides of the connec‐
       tion.  TCP uses a credit-based scheme to manage flow control to ensure that there is sufficient buffer
       space  at the receive side of a connection to accept incoming data.  This flow control is hidden from the
       application by the socket API.  As a result, stream based sockets may not transfer all the data that  the
       application requests to send as part of a single operation.

       In  this  example, the client maintains an offset into the buffer that it wishes to send.  As data is ac‐
       cepted by the network, the offset increases.  The client then waits until the network is ready to  accept
       more data before attempting another transfer.  The poll() operation supports this.  When the client sock‐
       et  is  ready  for data, it sets POLLOUT to true.  This indicates that send will transfer some additional
       amount of data.  The client issues a send() request for the remaining amount of buffer that it wishes  to
       transfer.   If  send()  transfers  less data than requested, the client updates the offset, waits for the
       network to become ready, then tries again.

              /* Example of server receiving data from client */
              struct pollfd fds;
              size_t offset, size, ret;
              char buf[4096];

              fds.fd = server_fd;
              fds.events = POLLIN;

              size = sizeof(buf);
              for (offset = 0; offset < size; ) {
                  poll(&fds, 1, -1);

                  ret = recv(server_fd, buf + offset, size - offset, 0);
                  offset += ret;
              }

       The flow for receiving data is similar to that used to send it.  Because of the streaming nature  of  the
       socket, there is no guarantee that the receiver will obtain all of the available data as part of a single
       call.   The  server  instead must wait until the socket is ready to receive data (POLLIN), before calling
       recv() to obtain what data is available.  In this example, the server knows to expect exactly 4 KB of
       data  from  the client.  More generally, a client and server will exchange communication protocol headers
       at the start of all messages, and the header will include the size of the message.

       It is worth noting that the previous two examples are written so that  they  are  simple  to  understand.
       They are poorly constructed when considering performance.  In both cases, the application always precedes
       a data transfer call (send or recv) with poll().  The impact is that even if the network is ready to transfer
       data or has data queued for receiving, the application will always experience the latency and  processing
       overhead of poll().  A better approach is to call send() or recv() prior to entering the for() loops, and
       only enter the loops if needed.
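
       A minimal sketch of that optimization for the send side follows.  It reuses the declarations from the
       earlier client send example (with ret widened to ssize_t), and, as before, omits error handling; a real
       implementation would also check errno for EAGAIN before deciding to wait.

              /* Send first; poll only when the socket refuses more data */
              ssize_t ret;

              size = sizeof(buf);
              for (offset = 0; offset < size; ) {
                  ret = send(client_fd, buf + offset, size - offset, 0);
                  if (ret > 0) {
                      offset += ret;
                      continue;
                  }

                  /* Socket buffer full (or an error); wait until the socket is writable */
                  poll(&fds, 1, -1);
              }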

   Connection-less (UDP) Communication
       As  mentioned,  TCP  sockets  are connection-oriented.  They may be used to communicate between exactly 2
       processes.  For parallel applications that need to communicate with thousands of peer processes, the over‐
       head  of  managing  this many simultaneous sockets can be significant, to the point where the application
       performance may decrease as more processes are added.

       To support communicating with a large number of peers, or for applications that do not need the  overhead
       of reliable communication, the socket API offers another commonly used socket type, SOCK_DGRAM.  Datagrams are
       unreliable,  connectionless  messages.  The most common type of SOCK_DGRAM socket runs over UDP/IP.  As a
       result, datagram sockets are often referred to as UDP sockets.

       UDP sockets use the same socket API as that described above for TCP sockets; however,  the  communication
       behavior  differs.   First,  an application using UDP sockets does not need to connect to a peer prior to
       sending it a message.  The destination address is specified as part of the send operation.  A second  ma‐
       jor  difference  is  that  the  message  is  not guaranteed to arrive at the peer.  Network congestion in
       switches, routers, or the remote NIC can discard the message, and no attempt will be made to  resend  the
       message.   The  sender will not be notified that the message either arrived or was dropped.  Another dif‐
       ference between TCP and UDP sockets is the maximum size of the transfer that  is  allowed.   UDP  sockets
       limit  messages to at most 64k, though in practice, applications use a much smaller size, usually aligned
       to the network MTU size (for example, 1500 bytes).

       Most uses of UDP sockets replace the send() and recv() calls with sendto() and recvfrom().

              /* Example send to peer at given IP address and UDP port */
              struct addrinfo *ai, hints;

              memset(&hints, 0, sizeof hints);
              hints.ai_socktype = SOCK_DGRAM;
              getaddrinfo("10.31.20.04", "7471", &hints, &ai);

              ret = sendto(client_fd, buf, size, 0, ai->ai_addr, ai->ai_addrlen);

       In the above example, we use getaddrinfo() to convert the given IP address and UDP port number into a
       sockaddr.   That  is  passed  into  the sendto() call in order to specify the destination of the message.
       Note the similarities between this flow and the TCP socket flow.  The recvfrom() call allows  us  to  re‐
       ceive  the address of the sender of a message.  Note that unlike streaming sockets, the entire message is
       accepted by the network on success.  All contents of the buf parameter, specified by the size  parameter,
       have been queued by the network layer.

       Although  not  shown, the application could call poll() or an equivalent prior to calling sendto() to en‐
       sure that the socket is ready to accept new data.   Similarly,  poll()  may  be  used  prior  to  calling
       recvfrom() to check if there is data ready to be read from the socket.

              /* Example receive a message from a peer */
              struct sockaddr_in addr;
              socklen_t addrlen;

              addrlen = sizeof(addr);
              ret = recvfrom(client_fd, buf, size, 0, (struct sockaddr *) &addr, &addrlen);

       This  example  will receive any incoming message from any peer.  The address of the peer will be provided
       in the addr parameter.  In this case, we only provide enough space to record an IPv4 address (limited by
       our use of struct sockaddr_in).  Supporting an IPv6 address would simply require passing in a larger  ad‐
       dress buffer (mapped to struct sockaddr_in6 for example).
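
       For example, a sketch that can record either address family simply substitutes struct sockaddr_storage,
       which is defined to be large enough to hold any supported socket address:

              /* Example receive that can record an IPv4 or IPv6 peer address */
              struct sockaddr_storage addr;
              socklen_t addrlen;

              addrlen = sizeof(addr);
              ret = recvfrom(client_fd, buf, size, 0, (struct sockaddr *) &addr, &addrlen);

              /* addr.ss_family reports whether the peer used AF_INET or AF_INET6 */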

   Advantages
       The  socket  API  has  two significant advantages.  First, it is available on a wide variety of operating
       systems and platforms, and works over the vast majority of available networking hardware.   It  can  even
       work  for  communication between processes on the same system without any network hardware.  It is easily
       the de facto networking API.  This by itself makes it appealing to use.

       The second key advantage is that it is relatively easy to program to.  The importance of this should  not
       be  overlooked.   Networking  APIs  that offer access to higher performing features, but are difficult to
       program to correctly or well, often result in lower application performance.  This is not  unlike  coding
       an  application  in a higher-level language such as C or C++, versus assembly.  Although writing directly
       to assembly language offers the promise of being better performing, for the vast majority of  developers,
       their applications will perform better if written in C or C++, and using an optimized compiler.  Applica‐
       tions  should  have  a  clear need for high-performance networking before selecting an alternative API to
       sockets.

   Disadvantages
       When considering the problems with the socket API as it pertains to high-performance networking, we limit
       our discussion to the two most common sockets types: streaming (TCP) and datagram (UDP).

       Most applications require that network data be sent reliably.  This invariably means using a  connection-
       oriented TCP socket.  TCP sockets transfer data as a stream of bytes.  However, many applications operate
       on messages.  The result is that applications often insert headers that are simply used to convert appli‐
       cation messages to / from a byte stream.  These headers consume additional network bandwidth and process‐
       ing.  The streaming nature of the interface also results in the application using loops as shown  in  the
       examples  above to send and receive larger messages.  The complexity of those loops can be significant if
       the application is managing sockets to hundreds or thousands of peers.

       Another issue highlighted by the above examples deals with the asynchronous nature  of  network  traffic.
       When  using  a  reliable transport, it is not enough to place an application’s data onto the network.  If
       the network is busy, it could drop the packet, or the data could become corrupted during a transfer.  The
       data must be kept until it has been acknowledged by the peer, so that it can be resent  if  needed.   The
       socket  API  is  defined such that the application owns the contents of its memory buffers after a socket
       call returns.

       As an example, if we examine the socket send() call, once send() returns the application is free to modi‐
       fy its buffer.  The network implementation has a couple of options.  One option is for the send  call  to
       place  the  data  directly onto the network.  The call must then block before returning to the user until
       the peer acknowledges that it received the data, at which point send() can return.  The  obvious  problem
       with  this  approach is that the application is blocked in the send() call until the network stack at the
       peer can process the data and generate an acknowledgment.  This can be a significant amount of time where
       the application is blocked and unable to process other work, such as responding to  messages  from  other
       clients.  Such an approach is not feasible.

       A  better option is for the send() call to copy the application’s data into an internal buffer.  The data
       transfer is then issued out of that buffer, which allows retrying the operation in  case  of  a  failure.
       The  send() call in this case is not blocked, but all data that passes through the network will result in
       a memory copy to a local buffer, even in the absence of any errors.

       Allowing immediate re-use of a data buffer is a feature of the socket API that keeps it simple  and  easy
       to  program  to.   However, such a feature can potentially have a negative impact on network performance.
       For network or memory limited applications, or even applications concerned about  power  consumption,  an
       alternative API may be attractive.

       A  slightly  more  hidden problem occurs in the socket APIs designed for UDP sockets.  This problem is an
       inefficiency in the implementation as a result of the API being designed for ease of use.  In or‐
       der for the application to send data to a peer, it needs to provide the IP address and UDP port number of
       the  peer.  That involves passing a sockaddr structure to the sendto() and recvfrom() calls.  However, IP
       addresses are higher-level network layer addresses.  In order to transfer data between systems, low-lev‐
       el link layer addresses are needed, for example Ethernet addresses.  The network layer must  map  IP  ad‐
       dresses  to Ethernet addresses on every send operation.  When scaled to thousands of peers, that overhead
       on every send call can be significant.

       Finally, because the socket API is often considered in conjunction with TCP and UDP protocols, it is  in‐
       tentionally  detached  from the underlying network hardware implementation, including NICs, switches, and
       routers.  Access to available network features is therefore constrained by what the API can support.

       It is worth noting here, that some operating systems support enhanced APIs that may be used  to  interact
       with TCP and UDP sockets.  For example, Linux supports an interface known as io_uring, and Windows has an
       asynchronous  socket  API.  Those APIs can help alleviate some of the problems described above.  However,
       an application will still be restricted by the features that the TCP and UDP protocols provide.

High-Performance Networking

       By analyzing the socket API in the context of high-performance networking, we can start to see some  fea‐
       tures that are desirable for a network API.

   Avoiding Memory Copies
       The socket API implementation usually results in data copies occurring at both the sender and the receiv‐
       er.  This is a trade-off made to keep the interface easy to use while still providing reliability.  Ideal‐
       ly, all memory copies would be avoided when transferring data over the network.  There are techniques and
       APIs  that  can be used to avoid memory copies, but in practice, the cost of avoiding a copy can often be
       more than the copy itself, in particular for small transfers (measured  in  bytes,  versus  kilobytes  or
       more).

       To  avoid  a  memory copy at the sender, we need to place the application data directly onto the network.
       If we also want to avoid blocking the sending application, we need some way for the network layer to com‐
       municate with the application when the buffer is safe to re-use.  This would allow the original buffer to
       be re-used in case the data needs to be re-transmitted.  This leads us to crafting  a  network  interface
       that behaves asynchronously.  The application will need to issue a request, then receive some sort of no‐
       tification when the request has completed.

       Avoiding a memory copy at the receiver is more challenging.  When data arrives from the network, it needs
       to  land  into an available memory buffer, or it will be dropped, resulting in the sender re-transmitting
       the data.  If we use socket recv() semantics, the only way to avoid a copy at the  receiver  is  for  the
       recv()  to be called before the send().  Recv() would then need to block until the data has arrived.  Not
       only does this block the receiver, it is impractical to use outside of an application with a  simple  re‐
       quest-reply protocol.

       Instead, what is needed is a way for the receiving application to provide one or more buffers to the net‐
       work for received data to land.  The network then needs to notify the application when data is available.
       This sort of mechanism works well if the receiver does not care where in its memory space the data is lo‐
       cated; it only needs to be able to process the incoming message.

       As an alternative, it is possible to reverse this flow, and have the network layer hand its buffer to the
       application.   The  application  would  then be responsible for returning the buffer to the network layer
       when it is done with its processing.  While this approach can avoid memory copies, it suffers from a  few
       drawbacks.  First, the network layer does not know what size of messages to expect, which can lead to in‐
       efficient  memory  use.  Second, many would consider this a more difficult programming model to use.  And
       finally, the network buffers would need to be mapped into the application process’  memory  space,  which
       negatively impacts performance.

       In addition to processing messages, some applications want to receive data and store it in a specific lo‐
       cation  in  memory.  For example, a database may want to merge received data records into an existing ta‐
       ble.  In such cases, even if data arriving from the network goes directly into an  application’s  receive
       buffers,  it  may still need to be copied into its final location.  It would be ideal if the network sup‐
       ported placing data that arrives from the network into a specific memory buffer, with the  buffer  deter‐
       mined based on the contents of the data.

   Network Buffers
       Based  on  the problems described above, we can start to see that avoiding memory copies depends upon the
       ownership of the memory buffers used for network traffic.  With  socket  based  transports,  the  network
       buffers  are  owned and managed by the networking stack.  This is usually handled by the operating system
       kernel.  However, this results in the data `bouncing' between the application  buffers  and  the  network
       buffers.   By putting the application in control of managing the network buffers, we can avoid this over‐
       head.  The cost for doing so is additional complexity in the application.

       Note that even though we want the application to own the network buffers, we would still  like  to  avoid
       the situation where the application implements a complex network protocol.  The trade-off is that the app
       provides  the  data  buffers  to the network stack, but the network stack continues to handle things like
       flow control, reliability, and segmentation and reassembly.

   Resource Management
       We define resource management to mean properly allocating network resources in order to avoid overrunning
       data buffers or queues.  Flow control is a common aspect of resource  management.   Without  proper  flow
       control,  a sender can overrun a slow or busy receiver.  This can result in dropped packets, re-transmis‐
       sions, and increased network congestion.  Significant research and development has gone into implementing
       flow control algorithms.  Because of its complexity, it is not something that  an  application  developer
       should  need to deal with.  That said, there are some applications where flow control simply falls out of
       the network protocol.  For example, a request-reply protocol naturally has flow control built in.

       For our purposes, we expand the definition of resource management beyond flow control.  Flow control typ‐
       ically only deals with available network buffering at a peer.  We also want to be concerned about  having
       available space in outbound data transfer queues.  That is, as we issue commands to the local NIC to send
       data, we must ensure that those commands can be queued at the NIC.  When we consider reliability, this means tracking
       outstanding requests until they have been acknowledged.  Resource management will need to ensure that  we
       do not overflow that request queue.

       Additionally, supporting asynchronous operations (described in detail below) will introduce potential new
       queues.  Those queues also must not overflow.
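
       As an illustration only, the sketch below shows one simple way such a limit can be enforced: a credit
       counter sized to the depth of the transmit queue.  The queue depth and function names are hypothetical,
       not part of any particular API.

              /* Hypothetical credit counter guarding a transmit queue of fixed depth */
              #define TX_QUEUE_DEPTH 128

              static int tx_credits = TX_QUEUE_DEPTH;

              /* Called before posting a send command; returns 0 if the command may be queued */
              int try_acquire_tx_credit(void)
              {
                  if (tx_credits == 0)
                      return -1;      /* queue full; the caller must retry later */
                  tx_credits--;
                  return 0;
              }

              /* Called when a completion indicates that a previously posted send has finished */
              void release_tx_credit(void)
              {
                  tx_credits++;
              }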

   Asynchronous Operations
       Arguably,  the  key  feature  of achieving high-performance is supporting asynchronous operations, or the
       ability to overlap different communications, as well as communication with computation.  The socket API supports
       asynchronous transfers with its non-blocking mode.  However, because the API itself operates synchronous‐
       ly, the result is additional data copies.  For an API to be asynchronous, an application needs to be able
       to  submit  work,  then later receive some sort of notification that the work is done.  In order to avoid
       extra memory copies, the application must agree not to modify its data buffers until the  operation  com‐
       pletes.

       There  are two main ways to notify an application that it is safe to re-use its data buffers.  One mecha‐
       nism is for the network layer to invoke some sort of callback or send a signal to  the  application  that
       the  request  is done.  Some asynchronous APIs use this mechanism.  The drawback of this approach is that
       signals interrupt an application’s processing.  This can negatively impact the CPU caches, plus  requires
       interrupt processing.  Additionally, it is often difficult to develop an application that can handle pro‐
       cessing a signal that can occur at any time.

       An alternative mechanism for supporting asynchronous operations is to write events into some sort of com‐
       pletion queue when an operation completes.  This provides a way to indicate to an application when a data
       transfer  has  completed,  plus  gives the application control over when and how to process completed re‐
       quests.  For example, it can process requests in batches to improve code locality and performance.

   Interrupts and Signals
       Interrupts are a natural extension to supporting asynchronous operations.  However, when dealing with  an
       asynchronous  API,  they  can  negatively impact performance.  Interrupts, even when directed to a kernel
       agent, can interfere with application processing.

       If an application has an asynchronous interface with  completed  operations  written  into  a  completion
       queue,  it  is often sufficient for the application to simply check the queue for events.  As long as the
       application has other work to perform, there is no need for it to block.  This alleviates  the  need  for
       interrupt  generation.   A NIC merely needs to write an entry into the completion queue and update a tail
       pointer to signal that a request is done.

       If we follow this argument, then it can be beneficial to give the application control  over  when  inter‐
       rupts should occur and when to write events to some sort of wait object.  By having the application noti‐
       fy  the  network  layer  that it will wait until a completion occurs, we can better manage the number and
       type of interrupts that are generated.

   Event Queues
       As outlined above, there are performance advantages to having an API that reports completions or provides
       other types of notification using an event queue.  A very simple type of event queue merely  tracks  com‐
       pleted operations.  As data is received or a send completes, an entry is written into the event queue.
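
       A hedged sketch of such a queue is shown below.  The entry layout and helper function are hypothetical;
       the key idea is a ring of completion entries that the network layer (or NIC) fills in and the application
       drains by polling, consistent with the completion queue behavior described above.

              /* Hypothetical completion entry and polling helper over a ring buffer */
              struct completion_entry {
                  void     *op_context;   /* identifies the request that completed */
                  uint32_t  flags;        /* e.g. distinguishes send from receive completions */
                  uint32_t  len;          /* number of bytes transferred */
              };

              struct completion_queue {
                  struct completion_entry entries[256];
                  volatile uint32_t tail; /* advanced by the NIC after writing an entry */
                  uint32_t head;          /* advanced by the application as it consumes entries */
              };

              /* Returns the next completed entry, or NULL if no completions are pending */
              struct completion_entry *cq_read(struct completion_queue *cq)
              {
                  if (cq->head == cq->tail)
                      return NULL;
                  return &cq->entries[cq->head++ % 256];
              }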

   Direct Hardware Access
       When  discussing the network layer, most software implementations refer to kernel modules responsible for
       implementing the necessary transport and network protocols.  However, if we want network latency  to  ap‐
       proach the sub-microsecond range, then we need to remove as much software between the application and its
       access to the hardware as possible.  One way to do this is for the application to have direct  access  to
       the  network interface controller’s command queues.  Similarly, the NIC requires direct access to the ap‐
       plication’s data buffers and control structures, such as the above mentioned completion queues.

       Note that when we speak about an application having direct access to network hardware, we’re referring to
       the application process.  Naturally, an application developer is highly unlikely to code for  a  specific
       hardware  NIC.   That  work would be left to some sort of network library specifically targeting the NIC.
       The actual network layer, which implements the network transport, could be part of the network library or
       offloaded onto the NIC’s hardware or firmware.

   Kernel Bypass
       Kernel bypass is a feature that allows the application to avoid calling into the kernel for data transfer
       operations.  This is possible when it has direct access to the NIC hardware.  Complete kernel  bypass  is
       impractical because of security concerns and resource management constraints.  However, it is possible to
       avoid kernel calls for what are called `fast-path' operations, such as send or receive.

       For  security  and  stability  reasons, operating system kernels cannot rely on data that comes from user
       space applications.  As a result, even a simple kernel call often requires acquiring and releasing locks,
       coupled with data verification checks.  If we can limit the effects of a poorly written or malicious  ap‐
       plication  to  its own process space, we can avoid the overhead that comes with kernel validation without
       impacting system stability.

   Direct Data Placement
       Direct data placement means avoiding data copies when sending and receiving data, plus  placing  received
       data  into the correct memory buffer where needed.  On a broader scale, it is part of having direct hard‐
       ware access, with the application and NIC communicating directly with shared memory buffers and queues.

       Direct data placement is often thought of in terms of RDMA - remote direct memory access.  RD‐
       MA is a technique that allows reading and writing memory that belongs to a peer process that  is  running
       on  a  node across the network.  Advanced RDMA hardware is capable of accessing the target memory buffers
       without involving the execution of the peer process.  RDMA relies on offloading the network transport on‐
       to the NIC in order to avoid interrupting the target process.

       The main advantages of supporting direct data placement are avoiding memory copies and minimizing process‐
       ing overhead.

Designing Interfaces for Performance

       We want to design a network interface that can meet the requirements outlined above.  Moreover,  we  also
       want to take into account the performance of the interface itself.  It is often not obvious how an inter‐
       face  can  adversely  affect performance, versus performance being a result of the underlying implementa‐
       tion.  The following sections describe how interface choices can impact performance.  Of course, when  we
       begin  defining  the  actual APIs that an application will use, we will need to trade off raw performance
       for ease of use where it makes sense.

       When considering performance goals for an API, we need to take into account the  target  application  use
       cases.  For the purposes of this discussion, we want to consider applications that communicate with thou‐
       sands  to  millions of peer processes.  Data transfers will include millions of small messages per second
       per peer, and large transfers that may be up to gigabytes of data.  At such extreme  scales,  even  small
       optimizations are measurable, in terms of both performance and power.  If we have a million peers sending
       a million messages per second, eliminating even a single instruction from the code path quickly multi‐
       plies to saving billions of instructions per second from the overall execution, when viewing  the  opera‐
       tion of the entire application.

       We  once  again  refer to the socket API as part of this discussion in order to illustrate how an API can
       affect performance.

              /* Notable socket function prototypes */
              /* "control" functions */
              int socket(int domain, int type, int protocol);
              int bind(int socket, const struct sockaddr *addr, socklen_t addrlen);
              int listen(int socket, int backlog);
              int accept(int socket, struct sockaddr *addr, socklen_t *addrlen);
              int connect(int socket, const struct sockaddr *addr, socklen_t addrlen);
              int shutdown(int socket, int how);
              int close(int socket);

              /* "fast path" data operations - send only (receive calls not shown) */
              ssize_t send(int socket, const void *buf, size_t len, int flags);
              ssize_t sendto(int socket, const void *buf, size_t len, int flags,
                  const struct sockaddr *dest_addr, socklen_t addrlen);
              ssize_t sendmsg(int socket, const struct msghdr *msg, int flags);
              ssize_t write(int socket, const void *buf, size_t count);
              ssize_t writev(int socket, const struct iovec *iov, int iovcnt);

              /* "indirect" data operations */
              int poll(struct pollfd *fds, nfds_t nfds, int timeout);
              int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

       Examining this list, there are a couple of features to note.  First, there are multiple calls that can be
       used to send data, as well as multiple calls that can be used to wait for a non-blocking socket to become
       ready.  This will be discussed in more detail further on.  Second, the operations have  been  split  into
       different  groups (terminology is ours).  Control operations are those functions that an application sel‐
       dom invokes during execution.  They often only occur as part of initialization.

       Data operations, on the other hand, may be called hundreds to millions of times during  an  application’s
       lifetime.   They  deal directly or indirectly with transferring or receiving data over the network.  Data
       operations can be split into two groups.  Fast path calls interact with the network stack to  immediately
       send or receive data.  In order to achieve high bandwidth and low latency, those operations need to be as
       fast as possible.  Non-fast path operations that still deal with data transfers are those calls that,
       while still frequently called by the application, are not as performance critical.  For example, the  se‐
       lect()  and  poll()  calls are used to block an application thread until a socket becomes ready.  Because
       those calls suspend the thread execution, performance is a lesser concern.  (Performance of those  opera‐
       tions is still of concern, but the cost of executing the operating system scheduler often swamps any
       but the most substantial performance gains.)

   Call Setup Costs
       The amount of work that an application needs to perform before issuing a data transfer operation can  af‐
       fect performance, especially message rates.  Obviously, the more parameters an application must push onto
       the stack to call a function, the higher its instruction count.  However, replacing stack variables with a
       single data structure does not help to reduce the setup costs.

       Suppose that an application wishes to send a single data buffer of a given size to a peer.  If we examine
       the  socket API, the best fit for such an operation is the write() call.  That call takes only those val‐
       ues which are necessary to perform the data transfer.  The send() call is a close second, and send() is a
       more natural function name for network  communication,  but  send()  requires  one  extra  argument  over
       write().   Other  functions are even worse in terms of setup costs.  The sendmsg() function, for example,
       requires that the application format a data structure, the address of which  is  passed  into  the  call.
       This requires significantly more instructions from the application if done for every data transfer.

       Even though all other send functions can be replaced by sendmsg(), it is useful to have multiple ways for
       the application to issue send requests.  Not only are the other calls easier to read and use (which lowers
       software maintenance costs), but they can also improve performance.
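
       The difference in setup cost is easy to see by issuing the same single-buffer transfer both ways.  The
       sketch below reuses the client_fd, buf, and size variables from the earlier examples.

              /* Same transfer, minimal setup with write() ... */
              ssize_t ret;

              ret = write(client_fd, buf, size);

              /* ... versus the additional structure formatting required by sendmsg() */
              struct iovec iov;
              struct msghdr msg;

              iov.iov_base = buf;
              iov.iov_len = size;

              memset(&msg, 0, sizeof msg);
              msg.msg_iov = &iov;
              msg.msg_iovlen = 1;

              ret = sendmsg(client_fd, &msg, 0);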

   Branches and Loops
       When  designing  an  API,  developers  rarely consider how the API impacts the underlying implementation.
       However, the selection of API parameters can require that the underlying implementation add  branches  or
       use control loops.  Consider the difference between the write() and writev() calls.  The latter passes in
       an array of I/O vectors, which may be processed using a loop such as this:

              /* Sample implementation for processing an array */
              for (i = 0; i < iovcnt; i++) {
                  ...
              }

       In  order  to  process  the iovec array, the natural software construct would be to use a loop to iterate
       over the entries.  Loops result in additional processing.  Typically, a loop requires initializing a loop
       control variable (e.g. i = 0), adds ALU operations (e.g. i++), and a comparison (e.g. i < iovcnt).   This
       overhead is necessary to handle an arbitrary number of iovec entries.  If the common case is that the ap‐
       plication wants to send a single data buffer, write() is a better option.

       In  addition  to  control  loops, an API can result in the implementation needing branches.  Branches can
       change the execution flow of a program, impacting processor pipelining techniques.  Processor branch
       prediction  helps  alleviate  this issue.  However, while branch prediction can be correct nearly 100% of
       the time while running a micro-benchmark, such as a network bandwidth or latency test, with more  realis‐
       tic network traffic, the impact can become measurable.

       We  can  easily  see  how an API can introduce branches into the code flow if we examine the send() call.
       Send() takes an extra flags parameter over the write() call.  This allows the application to  modify  the
       behavior  of send().  From the viewpoint of implementing send(), the flags parameter must be checked.  In
       the best case, this adds one additional check (flags are non-zero).  In the worst case, every valid  flag
       may need a separate check, resulting in potentially dozens of checks.
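
       A hypothetical fragment of a send() style implementation shows how those branches accumulate:

              /* Hypothetical flag handling inside a send() implementation */
              if (flags) {                      /* best case: a single non-zero test */
                  if (flags & MSG_DONTWAIT) {
                      /* force this call to be non-blocking */
                  }
                  if (flags & MSG_OOB) {
                      /* handle out-of-band data */
                  }
                  if (flags & MSG_MORE) {
                      /* hold off transmitting; the caller has more data to send */
                  }
                  /* ... a separate check for every other supported flag ... */
              }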

       Overall,  the  sockets API is well designed considering these performance implications.  It provides com‐
       plex calls where they are needed, with simpler functions available that can avoid some  of  the  overhead
       inherent in other calls.

   Command Formatting
       The  ultimate  objective  of invoking a network function is to transfer or receive data from the network.
       In this section, we’re dropping to the very bottom of the software stack to the component responsible for
       directly accessing the hardware.  This is usually referred to as the network driver, and its  implementa‐
       tion is often tied to a specific piece of hardware, or a series of NICs by a single hardware vendor.

       In  order  to  signal  a NIC that it should read a memory buffer and copy that data onto the network, the
       software driver usually needs to write some sort of command to the NIC.  To limit hardware complexity and
       cost, a NIC may only support a couple of command formats.  This differs from the software interfaces that
       we’ve been discussing, where we can have different APIs of varying complexity in order  to  reduce  over‐
       head.   There can be significant costs associated with formatting the command and posting it to the hard‐
       ware.

       With a standard NIC, the command is formatted by a kernel driver.  That driver sits at the bottom of  the
       network  stack servicing requests from multiple applications.  It must typically format each command only
       after a request has passed through the network stack.

       With devices that are directly accessible by a single application, there are opportunities  to  use  pre-
       formatted  command  structures.  The more of the command that can be initialized prior to the application
       submitting a network request, the more streamlined the process, and the better the performance.

       As an example, a NIC needs to have the destination address as part of a send operation.  If  an  applica‐
       tion  is  sending to a single peer, that information can be cached and be part of a pre-formatted network
       header.  This is only possible if the NIC driver knows that  the  destination  will  not  change  between
       sends.   The  closer  that the driver can be to the application, the greater the chance for optimization.
       An optimal approach is for the driver to be part of a library that executes entirely within the  applica‐
       tion process space.
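
       As a sketch, a driver library living in the application’s address space might keep a partially filled
       command per destination.  The layout below is hypothetical; real command formats are hardware specific.

              /* Hypothetical pre-formatted send command, reused across transfers to one peer */
              struct nic_send_cmd {
                  uint8_t   dest_mac[6];   /* cached: does not change between sends */
                  uint16_t  dest_queue;    /* cached: target queue on the remote NIC */
                  uint64_t  buf_addr;      /* filled in per transfer */
                  uint32_t  buf_len;       /* filled in per transfer */
              };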

   Memory Footprint
       Memory  footprint concerns are most notable among high-performance computing (HPC) applications that com‐
       municate with thousands of peers.  Excessive memory consumption impacts application scalability, limiting
       the number of peers that can operate in parallel to solve problems.  There is often a  trade-off  between
       minimizing  the  memory  footprint needed for network communication, application performance, and ease of
       use of the network interface.

       As we discussed with the socket API semantics, part of the ease of using sockets comes from  the  network
       layer copying the user’s buffer into an internal buffer belonging to the network stack.  The amount of
       internal  buffering  that’s made available to the application directly correlates with the bandwidth that
       an application can achieve.  In general, larger internal buffering increases network performance, with  a
       cost  of increasing the memory footprint consumed by the application.  This memory footprint exists inde‐
       pendent of the amount of memory allocated directly by the application.  Eliminating network buffering not
       only helps with performance, but also scalability, by reducing the memory footprint needed to support the
       application.

       While network memory buffering increases as an application scales, it can often be configured to a  fixed
       size.   The  amount  of buffering needed is dependent on the number of active communication streams being
       used at any one time.  That number is often significantly lower than the total number of  peers  that  an
       application  may  need to communicate with.  The amount of memory required to address the peers, however,
       usually has a linear relationship with the total number of peers.

       With the socket API, each peer is identified using a struct sockaddr.  If we consider a UDP based  socket
       application using IPv4 addresses, a peer is identified by the following address.

              /* IPv4 socket address - with typedefs removed */
              struct sockaddr_in {
                  uint16_t sin_family; /* AF_INET */
                  uint16_t sin_port;
                  struct {             /* struct in_addr */
                      uint32_t s_addr;
                  } sin_addr;
              };

       In total, the application requires 8 bytes of addressing for each peer.  If the app communicates with a
       million peers, that explodes to roughly 8 MB of memory space that is consumed just to  maintain  the  ad‐
       dress list.  If IPv6 addressing is needed, then the requirement increases by a factor of 4.

       Luckily,  there  are  some tricks that can be used to help reduce the addressing memory footprint, though
       doing so will introduce more instructions into the code path to access the network stack.  For instance, we
       can  notice that all addresses in the above example have the same sin_family value (AF_INET).  There’s no
       need to store that for each address.  This potentially shrinks each address from 8 bytes to 6.   (We  may
       be left with unaligned data, but that’s a trade-off for reducing the memory consumption.)  Depending on
       how the addresses are assigned, further reduction may be possible.  For example, if the application  uses
       the  same  set of port addresses at each node, then we can eliminate storing the port, and instead calcu‐
       late it from some base value.  This type of trick can be applied to the IP portion of the address if  the
       app is lucky enough to run across sequential IP addresses.
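
       A sketch of such a compacted entry is shown below.  It assumes every peer is AF_INET, so the family field
       is dropped entirely, and it accepts unaligned members (the packed attribute is a GCC/Clang extension):

              /* Hypothetical compacted IPv4 peer entry - 6 bytes instead of 8 */
              struct peer_addr {
                  uint16_t port;       /* could also be derived from a base port */
                  uint32_t ip_addr;
              } __attribute__((packed));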

       The  main issue with this sort of address reduction is that it is difficult to achieve.  It requires that
       each application check for and handle address compression, exposing the  application  to  the  addressing
       format used by the networking stack.  It should be kept in mind that TCP/IP and UDP/IP addresses are log‐
       ical  addresses,  not  physical.  When running over Ethernet, the addresses that appear at the link layer
       are MAC addresses, not IP addresses.  The IP to MAC address association is managed by the  network  soft‐
       ware.  We would like to provide addressing that is simple for an application to use, but at the same time
       can provide a minimal memory footprint.

   Communication Resources
       We  need  to  take a brief detour in the discussion in order to delve deeper into the network problem and
       solution space.  Instead of continuing to think of a socket as a single entity, with both  send  and  re‐
       ceive  capabilities,  we  want  to consider its components separately.  A network socket can be viewed as
       three basic constructs: a transport level address, a send or transmit queue, and a  receive  queue.   Be‐
       cause  our  discussion  will begin to pivot away from pure socket semantics, we will refer to our network
       `socket' as an endpoint.

       In order to reduce an application’s memory footprint, we need to consider features that fall  outside  of
       the  socket  API.  So far, much of the discussion has been around sending data to a peer.  We now want to
       focus on the best mechanisms for receiving data.

       With sockets, when an app has data to receive (indicated, for  example,  by  a  POLLIN  event),  we  call
       recv().  The network stack copies the received data into our buffer and returns.  If we want to avoid the
       data copy on the receive side, we need a way for the application to post its buffers to the network stack
       before data arrives.

       Arguably, a natural way of extending the socket API to support this feature  is  to  have  each  call  to
       recv() simply post the buffer to the network layer.  As data is received, the receive buffers are removed
       in  the order that they were posted.  Data is copied into the posted buffer and returned to the user.  It
       should be noted that the size of the posted receive buffer may be larger (or smaller) than the amount of
       data  received.   If the available buffer space is larger, hypothetically, the network layer could wait a
       short amount of time to see if more data arrives.  If nothing more arrives, the  receive  completes  with
       the buffer returned to the application.

       This  raises  an issue regarding how to handle buffering on the receive side.  So far, with sockets we’ve
       mostly considered a streaming protocol.  However, many applications deal with messages which end up being
       layered over the data stream.  If they send an 8 KB message, they want the receiver to receive  an  8  KB
       message.  Message boundaries need to be maintained.

       If  an  application sends and receives a fixed sized message, buffer allocation becomes trivial.  The app
       can post X number of buffers each of an optimal size.  However, if there is a wide mix in message  sizes,
       difficulties arise.  It is not uncommon for an app to have 80% of its messages be a couple hundred
       bytes or less, but 80% of the total data that it sends to be in large transfers that are, say, a megabyte
       or more.  Pre-posting receive buffers in such a situation is challenging.

       A common technique used to handle this situation is to implement one application-level protocol
       for  smaller  messages, and use a separate protocol for transfers that are larger than some given thresh‐
       old.  This would allow an application to post a number of smaller buffers, say 4 KB each, to receive data.
       For transfers that are larger than 4 KB, a different communication protocol is used, possibly over a dif‐
       ferent socket or endpoint.
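
       A minimal sketch of that dispatch decision follows.  The threshold and the helper functions are
       hypothetical placeholders for the two application-level protocols described above.

              /* Hypothetical split between a small-message path and a large-transfer path */
              #define EAGER_THRESHOLD 4096    /* matches the size of the pre-posted buffers */

              int send_eager(int peer, const void *buf, size_t len);       /* hypothetical small-message path */
              int send_rendezvous(int peer, const void *buf, size_t len);  /* hypothetical large-transfer path */

              int send_message(int peer, const void *buf, size_t len)
              {
                  if (len <= EAGER_THRESHOLD)
                      return send_eager(peer, buf, len);       /* fits into a posted receive buffer */

                  return send_rendezvous(peer, buf, len);      /* negotiate the larger transfer */
              }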

   Shared Receive Queues
       If  an  application  pre-posts  receive  buffers to a network queue, it needs to balance the size of each
       buffer posted, the number of buffers that are posted to each queue, and the number of queues that are  in
       use.  With a socket-like approach, each socket would maintain an independent receive queue where data is
       placed.  If an application is using 1000 endpoints and posts 100 buffers, each 4 KB, that results in  400
       MB  of  memory space being consumed to receive data.  (We can start to realize that by eliminating memory
       copies, one of the trade-offs is increased memory consumption.)  While 400 MB seems like a lot of memory,
       there is less than half a megabyte allocated to a single receive queue.  At  today’s  networking  speeds,
       that amount of space can be consumed within milliseconds.  The result is that if only a few endpoints are
       in  use,  the application will experience long delays where flow control will kick in and back the trans‐
       fers off.

       There are a couple of observations that we can make here.  The first is that in  order  to  achieve  high
       scalability,  we  need to move away from a connection-oriented protocol, such as streaming sockets.  Sec‐
       ondly, we need to reduce the number of receive queues that an application uses.

       A shared receive queue is a network queue that can receive data for many  different  endpoints  at  once.
       With  shared  receive  queues,  we no longer associate a receive queue with a specific transport address.
       Instead network data will target a specific endpoint address.  As data arrives, the endpoint will  remove
       an  entry  from the shared receive queue, place the data into the application’s posted buffer, and return
       it to the user.  Shared receive queues can greatly reduce the amount of buffer space needed by an  appli‐
       cation.  In the previous example, if a shared receive queue were used, the app could post 10 times the
       number of buffers (1000 total), yet still consume 100 times less memory (4 MB total).  This is  far  more
       scalable.   The  drawback  is that the application must now be aware of receive queues and shared receive
       queues, rather than considering the network only at the level of a socket.

   Multi-Receive Buffers
       Shared receive queues greatly improve application scalability; however, they still result in some ineffi‐
       ciencies  as  defined  so  far.  We’ve only considered the case of posting a series of fixed sized memory
       buffers to the receive queue.  As mentioned, determining the size of each buffer is challenging.   Trans‐
       fers larger than the fixed size require using some other protocol in order to complete.  If transfers are
       typically much smaller than the fixed size, then the extra buffer space goes unused.

       Again referring to our example, if the application posts 1000 buffers, then it can only receive 1000 mes‐
       sages  before the queue is emptied.  At data rates measured in millions of messages per second, this will
       introduce stalls in the data stream.  An obvious solution is to increase the number  of  buffers  posted.
       The problem is dealing with variable-sized messages, including some which are only a couple hundred bytes
       in  length.   For example, if the average message size in our case is 256 bytes or less, then even though
       we’ve allocated 4 MB of buffer space, we only make use of 6% of that space.  The rest is wasted in  order
       to handle messages which may only occasionally be up to 4 KB.

       A second optimization that we can make is to fill up each posted receive buffer as messages arrive.  So,
       instead of a 4 KB buffer being removed from use as soon as a single 256-byte message arrives, it can
       instead receive up to sixteen 256-byte messages.  We refer to such a feature as `multi-receive' buffers.

       With multi-receive buffers, instead of posting a bunch of smaller buffers, we instead post a single larg‐
       er  buffer, say the entire 4 MB, at once.  As data is received, it is placed into the posted buffer.  Un‐
       like TCP streams, we still maintain message boundaries.  The advantages here are twofold.   Not  only  is
       memory used more efficiently, allowing us to receive a greater number of small messages at once as well
       as larger messages overall, but we also reduce the number of function calls that the application must
       make to maintain its supply of available receive buffers.

       When combined with shared receive queues, multi-receive buffers help support optimal receive side buffer‐
       ing and processing.  The main drawback to supporting multi-receive buffers is that the application will
       not necessarily know up front how many messages may be associated with a single posted memory buffer.
       This is rarely a problem for applications.
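
       As a rough illustration, the fragment below posts a single large buffer flagged for multi-receive use,
       following the fi_msg structure and fi_recvmsg() call described in fi_msg(3).  The endpoint (or shared
       receive context) `ep', the 4 MB region `big_buf', and the completion context `recv_ctx' are assumed to
       have been set up earlier; error handling is omitted.

              /* Sketch: post one large multi-receive buffer instead of many small ones */
              struct iovec iov = {
                  .iov_base = big_buf,
                  .iov_len  = 4 * 1024 * 1024,
              };
              struct fi_msg msg = {
                  .msg_iov   = &iov,
                  .iov_count = 1,
                  .addr      = FI_ADDR_UNSPEC,
                  .context   = &recv_ctx,
              };

              /* FI_MULTI_RECV lets the provider pack many incoming messages,
               * with their boundaries preserved, into this single buffer.
               */
              fi_recvmsg(ep, &msg, FI_MULTI_RECV);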

   Optimal Hardware Allocation
       As part of scalability considerations, we not only need to consider the processing and  memory  resources
       of the host system, but also the allocation and use of the NIC hardware.  We’ve referred to network
       endpoints as a combination of transport addressing, transmit queues, and receive queues.  The latter two
       queues  are  often  implemented as hardware command queues.  Command queues are used to signal the NIC to
       perform some sort of work.  A transmit queue indicates that the NIC should  transfer  data.   A  transmit
       command  often  contains  information  such  as  the address of the buffer to transmit, the length of the
       buffer, and destination addressing data.  The actual format and data contents vary based on the  hardware
       implementation.

       NICs  have  limited  resources.   Only the most scalable, high-performance applications likely need to be
       concerned with utilizing NIC hardware optimally.  However, such applications are an important and specif‐
       ic focus of libfabric.  Managing NIC resources is often handled by a resource manager application,  which
       is responsible for allocating systems to competing applications, among other activities.

       Supporting applications that wish to make optimal use of hardware requires that hardware related abstrac‐
       tions  be  exposed  to the application.  Such abstractions cannot require a specific hardware implementa‐
       tion, and care must be taken to ensure that the resulting API is still usable  by  developers  unfamiliar
       with  dealing with such low level details.  Exposing concepts such as shared receive queues is an example
       of giving an application more control over how hardware resources are used.

   Sharing Command Queues
       By exposing the transmit and receive queues to the application, we give an application that makes use of
       multiple endpoints the ability to determine how those queues are shared.  We talked about
       the benefits of sharing a receive queue among endpoints.  The benefits of sharing transmit queues are not
       as obvious.

       An  application  that  uses  more addressable endpoints than there are transmit queues will need to share
       transmit queues among the endpoints.  By controlling which endpoint uses which transmit queue, the appli‐
       cation can prioritize traffic.  A transmit queue can also be configured to optimize for a  specific  type
       of data transfer, such as large transfers only.

       From  the  perspective  of a software API, sharing transmit or receive queues implies exposing those con‐
       structs to the application, and allowing them to be associated with different endpoint addresses.
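
       As a minimal sketch, sharing a transmit queue in libfabric terms might use the fi_stx_context() call from
       fi_endpoint(3), roughly as follows.  The domain and the endpoints `ep_a' and `ep_b' are illustrative
       names assumed to have been created earlier; error handling and attribute selection are omitted.

              /* Sketch: two endpoints sharing one transmit context */
              struct fid_stx *stx;

              fi_stx_context(domain, NULL, &stx, NULL);   /* attributes omitted for brevity */

              /* both endpoints now submit transmit commands through the same queue */
              fi_ep_bind(ep_a, &stx->fid, 0);
              fi_ep_bind(ep_b, &stx->fid, 0);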

   Multiple Queues
       The opposite of a shared command queue is an endpoint that has multiple queues.  An application that can
       take  advantage of multiple transmit or receive queues can increase parallel handling of messages without
       synchronization constraints.  Being able to use multiple command queues through a single endpoint has ad‐
       vantages over using multiple endpoints.  Multiple endpoints require separate addresses,  which  increases
       memory use.  A single endpoint with multiple queues can continue to expose a single address, while taking
       full advantage of available NIC resources.
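
       One way libfabric exposes this idea is through scalable endpoints.  The sketch below, loosely based on
       the fi_scalable_ep(), fi_tx_context(), and fi_rx_context() calls in fi_endpoint(3), creates a single
       addressable endpoint backed by several command queues.  The domain and fi_info structure come from
       earlier initialization; attribute selection, completion queue bindings, and error handling are omitted.

              /* Sketch: one endpoint address backed by multiple command queues */
              struct fid_ep *sep, *tx[4], *rx[4];

              fi_scalable_ep(domain, info, &sep, NULL);

              for (int i = 0; i < 4; i++) {
                  fi_tx_context(sep, i, NULL, &tx[i], NULL);  /* e.g. per-thread transmit queue */
                  fi_rx_context(sep, i, NULL, &rx[i], NULL);  /* e.g. per-thread receive queue */
              }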

   Progress Model Considerations
       One  aspect  of the sockets programming interface that developers often don’t consider is the location of
       the protocol implementation.  This is usually managed by the operating system kernel.  The network  stack
       is  responsible  for handling flow control messages, timing out transfers, re-transmitting unacknowledged
       transfers, processing received data, and sending acknowledgments.  This processing requires that the net‐
       work stack consume CPU cycles.  Portions of that processing can be done within the context of the  appli‐
       cation thread, but much must be handled by kernel threads dedicated to network processing.

       When network processing moves directly into the application process, we need to be concerned with how
       network communication makes forward progress.  For example, how and when are acknowledgments  sent?   How
       are  timeouts and message re-transmissions handled?  The progress model defines this behavior, and it de‐
       pends on how much of the network processing has been offloaded onto the NIC.

       More generally, progress is the ability of the underlying network implementation to  complete  processing
       of an asynchronous request.  In many cases, the processing of an asynchronous request requires the use of
       the host processor.  For performance reasons, it may be undesirable for the provider to allocate a thread
       for this purpose, which will compete with the application thread(s).  We can avoid thread context switch‐
       es  if  the  application  thread can be used to make forward progress on requests – check for acknowledg‐
       ments, retry timed out operations, etc.  Doing so requires that the application  periodically  call  into
       the network stack.
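
       For example, with a provider that relies on the application to drive progress, the application might
       periodically poll a completion queue, roughly as sketched below using fi_cq_read() from fi_cq(3).  The
       completion queue `cq' is assumed to have been created and bound earlier, and handle_completion() is a
       hypothetical application routine.

              /* Sketch: the application thread drives progress by polling.
               * Each call into fi_cq_read() gives the provider a chance to process
               * acknowledgments, retransmissions, and received data.
               */
              struct fi_cq_entry entry;
              ssize_t ret;

              for (;;) {
                  ret = fi_cq_read(cq, &entry, 1);
                  if (ret > 0)
                      handle_completion(entry.op_context);  /* hypothetical handler */
                  else if (ret == -FI_EAGAIN)
                      continue;    /* nothing completed yet; keep polling */
                  else
                      break;       /* an error occurred; report and recover */
              }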

   Ordering
       Network  ordering  is  a complex subject.  With TCP sockets, data is sent and received in the same order.
       Buffers are re-usable by the application immediately upon returning from a function call.  As  a  result,
       ordering  is  simple  to  understand and use.  UDP sockets complicate things slightly.  With UDP sockets,
       messages may be received in a different order than they were sent.  In practice, this often doesn’t
       occur, particularly if the application only communicates over a local area network, such as Ethernet.

       With  our evolving network API, there are situations where exposing different order semantics can improve
       performance.  These details will be discussed further below.

   Messages
       UDP sockets allow messages to arrive out of order because each message is routed from the sender  to  the
       receiver independently.  This allows packets to take different network paths, to avoid congestion or take
       advantage  of multiple network links for improved bandwidth.  We would like to take advantage of the same
       features in those cases where the application doesn’t care in which order messages arrive.

       Unlike UDP sockets, however, our definition of message ordering is more subtle.  UDP messages are small,
       MTU-sized packets.  In our case, messages may be gigabytes in size.  We define message ordering to
       indicate whether the start of each message is processed in order or out of order.  This is related to,
       but separate from, the order in which the message payload is received.

       An  example  will  help clarify this distinction.  Suppose that an application has posted two messages to
       its receive queue.  The first receive points to a 4 KB buffer.  The second receive  points  to  a  64  KB
       buffer.   The sender will transmit a 4 KB message followed by a 64 KB message.  If messages are processed
       in order, then the 4 KB send will match with the 4 KB receive, and the 64 KB send will match with the 64
       KB receive.  However, if messages can be processed out of order, then the sends  and  receives  can  mis‐
       match, resulting in the 64 KB send being truncated.

       In this example, we’re not concerned with what order the data is received in.  The 64 KB send could be
       broken into 64 1-KB transfers that take different routes to the destination.  So, bytes 2k-3k could be
       received before bytes 1k-2k.  Message ordering is not concerned with ordering within a message, only
       between messages.  With ordered messages, the messages themselves need to be processed in order.

       The more relaxed the message ordering, the more optimizations the network stack can apply when
       transferring the data.  However, the application must be aware of message ordering semantics, and be
       able to select the desired semantic for its needs.  For the purposes of this section, the term messages
       refers to transport level operations, which include RDMA and similar operations (some of which have not
       yet been discussed).

   Data
       Data  ordering  refers to the receiving and placement of data both within and between messages.  Data or‐
       dering is most important to messages that can update the same target memory buffer.  For example, imagine
       an application that writes a series of database records directly into a peer memory location.   Data  or‐
       dering,  combined with message ordering, ensures that the data from the second write updates memory after
       the first write completes.  The result is that the memory location will contain the  records  carried  in
       the second write.
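
       To make the write-after-write case concrete, the fragment below issues two RMA writes to the same remote
       location; with message and data ordering guaranteed, the target memory holds the second record once both
       operations complete.  The fi_write() call is described in fi_rma(3); the endpoint, destination address,
       remote offset, key, and record buffers are assumed to have been obtained earlier, and error handling is
       omitted.

              /* Sketch: two ordered RMA writes targeting the same remote buffer */
              fi_write(ep, record1, record_len, NULL, peer, remote_addr, rkey, NULL);
              fi_write(ep, record2, record_len, NULL, peer, remote_addr, rkey, NULL);

              /* with write-after-write ordering, the remote memory ends up
               * containing record2 after both operations have completed
               */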

       Enforcing data ordering between messages requires that the messages themselves be ordered.  Data ordering
       can also apply within a single message, though this level of ordering is usually less important to appli‐
       cations.   Intra-message data ordering indicates that the data for a single message is received in order.
       Some applications use this feature to `spin' reading the last byte of a receive buffer.   Once  the  byte
       changes,  the  application knows that the operation has completed and all earlier data has been received.
       (Note that while such behavior is interesting for benchmark purposes, using such a feature in this way is
       strongly discouraged.  It is not portable between networks or platforms.)

   Completions
       Completion ordering refers to the sequence that asynchronous operations report their  completion  to  the
       application.  Typically, unreliable data transfers will naturally complete in the order that they are
       submitted to a transmit queue.  Each operation is transmitted to the network, with the completion
       occurring immediately after.  For reliable data transfers, an operation cannot complete until it has
       been acknowledged by the peer.  Since ack packets can be lost or possibly take different paths through
       the network, operations can be marked as completed out of order.  Out of order acks are more likely if
       messages can be processed out of order.

       Asynchronous interfaces require that the application track its outstanding requests.  Handling out of
       order completions can increase application complexity, but it does allow for optimizing network  utiliza‐
       tion.

Libfabric Architecture

       Libfabric  is  well architected to support the previously discussed features.  For further information on
       the libfabric architecture, see the next programmer’s guide section: fi_arch(7).

AUTHORS

       OpenFabrics.

Libfabric Programmer’s Manual                      2024-12-10                                        fi_intro(7)