prov/lnx: LINKx (lnx) provider #10437

Open. Wants to merge 5 commits into base: main.
2 changes: 2 additions & 0 deletions Makefile.am
@@ -205,6 +205,7 @@ src_libfabric_la_SOURCES = \
include/uthash.h \
include/ofi_prov.h \
include/ofi_profile.h \
include/ofi_lnx.h \
include/rdma/providers/fi_log.h \
include/rdma/providers/fi_prov.h \
src/fabric.c \
@@ -484,6 +485,7 @@ include prov/sm2/Makefile.include
include prov/tcp/Makefile.include
include prov/ucx/Makefile.include
include prov/lpp/Makefile.include
include prov/lnx/Makefile.include
include prov/hook/Makefile.include
include prov/hook/perf/Makefile.include
include prov/hook/trace/Makefile.include
1 change: 1 addition & 0 deletions configure.ac
@@ -1026,6 +1026,7 @@ FI_PROVIDER_SETUP([hook_debug])
FI_PROVIDER_SETUP([hook_hmem])
FI_PROVIDER_SETUP([dmabuf_peer_mem])
FI_PROVIDER_SETUP([opx])
FI_PROVIDER_SETUP([lnx])
FI_PROVIDER_FINI
dnl Configure the .pc file
FI_PROVIDER_SETUP_PC
1 change: 1 addition & 0 deletions include/ofi.h
@@ -297,6 +297,7 @@ enum ofi_prov_type {
OFI_PROV_UTIL,
OFI_PROV_HOOK,
OFI_PROV_OFFLOAD,
OFI_PROV_LNX,
};

/* Restrict to size of struct fi_provider::context (struct fi_context) */
59 changes: 59 additions & 0 deletions include/ofi_lnx.h
@@ -0,0 +1,59 @@
/*
* Copyright (c) 2022 ORNL. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
* General Public License (GPL); Version 2, available from the file
* COPYING in the main directory of this source tree, or the
* BSD license below:
*
* Redistribution and use in source and binary forms, with or
* without modification, are permitted provided that the following
* conditions are met:
*
* - Redistributions of source code must retain the above
* copyright notice, this list of conditions and the following
* disclaimer.
*
* - Redistributions in binary form must reproduce the above
* copyright notice, this list of conditions and the following
* disclaimer in the documentation and/or other materials
* provided with the distribution.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/

#ifndef OFI_LNX_H
#define OFI_LNX_H

/* ofi_create_link()
* prov_list (IN): list of providers to link
* fabric (OUT): lnx fabric which abstracts the bond
* caps (IN): bond capabilities requested
* context (IN): user context to store.
*
* The LNX provider is not inserted directly on the list
* of core providers. In that sense, it's a special provider
* that only gets returned on a call of fi_link(), if that
* function determines that there are multiple providers to link.
*
* ofi_create_link() binds the core provider endpoints and returns
* the LNX fabric which abstracts away these provider endpoints.
*/
int ofi_create_link(struct fi_info *prov_list, struct fid_fabric **fabric,
uint64_t caps, void *context);

/*
* ofi_link_fini()
* Uninitialize and clean up all the core providers
*/
void ofi_link_fini(void);

#endif /* OFI_LNX_H */
2 changes: 2 additions & 0 deletions include/ofi_mr.h
@@ -255,6 +255,8 @@ int ofi_mr_map_init(const struct fi_provider *in_prov, int mode,
struct ofi_mr_map *map);
void ofi_mr_map_close(struct ofi_mr_map *map);

struct fi_mr_attr *
ofi_dup_mr_attr(const struct fi_mr_attr *attr, uint64_t flags);
int ofi_mr_map_insert(struct ofi_mr_map *map,
const struct fi_mr_attr *attr,
uint64_t *key, void *context,
11 changes: 11 additions & 0 deletions include/ofi_prov.h
@@ -211,6 +211,17 @@ MRAIL_INI ;
# define MRAIL_INIT NULL
#endif

#if (HAVE_LNX) && (HAVE_LNX_DL)
# define LNX_INI FI_EXT_INI
# define LNX_INIT NULL
#elif (HAVE_LNX)
# define LNX_INI INI_SIG(fi_lnx_ini)
# define LNX_INIT fi_lnx_ini()
LNX_INI ;
#else
# define LNX_INIT NULL
#endif

#if (HAVE_PERF) && (HAVE_PERF_DL)
# define HOOK_PERF_INI FI_EXT_INI
# define HOOK_PERF_INIT NULL
15 changes: 14 additions & 1 deletion include/ofi_util.h
@@ -1172,9 +1172,11 @@ void ofi_fabric_remove(struct util_fabric *fabric);
* Utility Providers
*/

#define OFI_NAME_DELIM ';'
#define OFI_NAME_LNX_DELIM ':'
#define OFI_NAME_DELIM ';'
#define OFI_UTIL_PREFIX "ofi_"
#define OFI_OFFLOAD_PREFIX "off_"
#define OFI_LNX "lnx"

static inline int ofi_has_util_prefix(const char *str)
{
@@ -1186,6 +1188,16 @@ static inline int ofi_has_offload_prefix(const char *str)
return !strncasecmp(str, OFI_OFFLOAD_PREFIX, strlen(OFI_OFFLOAD_PREFIX));
}

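/* Returns nonzero if str begins with "lnx" (case-insensitive). */
static inline int ofi_is_lnx(const char *str)
{
return !strncasecmp(str, OFI_LNX, strlen(OFI_LNX));
}

/* Returns 1 if "lnx" appears anywhere in str (case-insensitive), else 0. */
static inline int ofi_is_linked(const char *str)
{
return (strcasestr(str, OFI_LNX)) ? 1 : 0;
}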

int ofi_get_core_info(uint32_t version, const char *node, const char *service,
uint64_t flags, const struct util_prov *util_prov,
const struct fi_info *util_hints,
@@ -1201,6 +1213,7 @@ int ofi_get_core_info_fabric(const struct fi_provider *prov,
struct fi_info **core_info);


char *ofi_strdup_link_append(const char *head, const char *tail);
char *ofi_strdup_append(const char *head, const char *tail);
// char *ofi_strdup_head(const char *str);
// char *ofi_strdup_tail(const char *str);
3 changes: 3 additions & 0 deletions include/rdma/fabric.h
@@ -339,6 +339,7 @@ enum {
FI_PROTO_SM2,
FI_PROTO_CXI_RNR,
FI_PROTO_LPP,
FI_PROTO_LNX,
};

enum {
@@ -622,6 +623,8 @@ int fi_fabric2(struct fi_info *info, struct fid_fabric **fabric,
uint64_t flags, void *context);
int fi_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric,
void *context);
int fi_link(struct fi_info *prov_list, struct fid_fabric **fabric,
uint64_t caps, void *context);
int fi_open(uint32_t version, const char *name, void *attr, size_t attr_len,
uint64_t flags, struct fid **fid, void *context);

1 change: 1 addition & 0 deletions include/rdma/fi_domain.h
@@ -167,6 +167,7 @@ struct fi_mr_attr {
size_t auth_key_size;
uint8_t *auth_key;
enum fi_hmem_iface iface;
fi_addr_t addr;
union {
uint64_t reserved;
int cuda;
2 changes: 1 addition & 1 deletion include/rdma/fi_errno.h
@@ -114,7 +114,7 @@ extern "C" {
//#define FI_EADV EADV /* Advertise error */
//#define FI_ESRMNT ESRMNT /* Srmount error */
//#define FI_ECOMM ECOMM /* Communication error on send */
//#define FI_EPROTO EPROTO /* Protocol error */
#define FI_EPROTO EPROTO /* Protocol error */
//#define FI_EMULTIHOP EMULTIHOP /* Multihop attempted */
//#define FI_EDOTDOT EDOTDOT /* RFS specific error */
//#define FI_EBADMSG EBADMSG /* Not a data message */
1 change: 1 addition & 0 deletions include/rdma/providers/fi_peer.h
@@ -169,6 +169,7 @@ struct fi_peer_rx_entry {
uint64_t tag;
uint64_t cq_data;
uint64_t flags;
uint64_t ignore;
void *context;
size_t count;
void **desc;
1 change: 0 additions & 1 deletion include/rdma/providers/fi_prov.h
@@ -73,7 +73,6 @@ struct fi_provider {
void (*cleanup)(void);
};


/*
* Defines a configuration parameter for use with libfabric.
*/
1 change: 1 addition & 0 deletions libfabric.map.in
@@ -59,6 +59,7 @@ FABRIC_1.7 {
fi_getinfo;
fi_freeinfo;
fi_dupinfo;
fi_link;
} FABRIC_1.6;

FABRIC_1.8 {
156 changes: 156 additions & 0 deletions man/fi_lnx.7.md
@@ -0,0 +1,156 @@
---
layout: page
title: fi_lnx(7)
tagline: Libfabric Programmer's Manual
---
{% include JB/setup %}

# NAME

fi_lnx \- The LINKx (LNX) Provider

# OVERVIEW

The LNX provider is designed to link two or more providers, allowing
applications to seamlessly use multiple providers or NICs. This provider uses
the libfabric peer infrastructure to aid in the use of the underlying providers.
This version of the provider currently supports linking the libfabric
shared memory provider for intra-node traffic and another provider for
inter-node traffic. Future releases of the provider will allow linking any
number of providers and provide the users with the ability to influence
the way the providers are utilized for traffic load.
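
The intra-node versus inter-node split described above can be sketched as a
small selection function. This is only an illustration: `struct peer`,
`enum link_prov`, and `select_provider()` are hypothetical names, and comparing
hostnames is an assumed way of deciding whether a destination is local. The
patch may determine locality differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical description of a destination; not taken from the patch. */
struct peer {
    const char *hostname;
};

enum link_prov {
    LINK_PROV_SHM,   /* shared memory provider, intra-node traffic */
    LINK_PROV_INTER  /* the other linked provider, inter-node traffic */
};

/*
 * Pick the provider a send would be forwarded to.
 * Assumption: a destination is "local" when its hostname equals ours.
 * disable_shm mirrors the documented FI_LNX_DISABLE_SHM switch.
 */
static enum link_prov select_provider(const char *local_host,
                                      const struct peer *dest,
                                      int disable_shm)
{
    assert(local_host && dest && dest->hostname);

    if (!disable_shm && strcmp(local_host, dest->hostname) == 0)
        return LINK_PROV_SHM;
    return LINK_PROV_INTER;
}
```

With SHM disabled, every destination falls through to the inter-node provider,
which matches the debugging use described for FI_LNX_DISABLE_SHM.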

# SUPPORTED FEATURES

This release contains an initial implementation of the LNX provider that
offers the following support:

*Endpoint types*
: The provider supports only endpoint type *FI_EP_RDM*.

*Endpoint capabilities*
: LNX is a passthrough layer on the send path. On the receive path, LNX
utilizes the peer infrastructure to create shared receive queues (SRQ).
Receive requests are placed on the SRQ instead of on the core provider
receive queue. When the provider receives a message, it queries the SRQ for
a match. If one is found, the receive request is completed; otherwise the
message is placed on the LNX shared unexpected queue (SUQ). Further receive
requests query the SUQ for matches.
The first release of the provider only supports tagged and RMA operations.
Other message types will be supported in future releases.

*Modes*
: The provider does not require the use of any mode bits.

*Progress*
: LNX utilizes the peer infrastructure to provide a shared completion
queue. Each linked provider still needs to handle its own progress.
Completion events will however be placed on the shared completion queue,
which is passed to the application for access.

*Address Format*
: LNX wraps the linked providers' addresses in one common binary blob.
It does not alter the linked providers' address formats. It wraps
them into an LNX structure which is then flattened and returned to the
application. This blob is passed between nodes. The LNX provider
is able to parse the flattened format and operate on the different links.
This assumes that nodes in the same group are all using the same version of
the provider with the exact same links. For example, one node cannot link
SHM+CXI while another links SHM+RXM.

*Message Operations*
: LNX is designed to intercept message operations such as fi_tsenddata
and, based on specific criteria, forward the operation to the appropriate
provider. For the first release, LNX only supports linking the SHM
provider for intra-node traffic and another provider (e.g., CXI) for
inter-node traffic. The LNX send operation looks at the destination and,
based on whether the destination is local or remote, selects the provider
to forward the operation to. The receive case is described under
*Endpoint capabilities* above.

*Using the Provider*
: To use the provider, the user needs to set the FI_LNX_PROV_LINKS
environment variable to the linked providers in the following format:
shm+<prov>. This allows LNX to report the selectable links back to the
application in the fi_getinfo() call. Since there can be multiple domains
per provider, LNX reports one link for each possible combination. For
example, if there are two CXI interfaces on the machine,
LNX will report back shm+cxi0 and shm+cxi1. The application can then
select the link it wishes to use based on its own criteria.
The application typically uses the PCI information in the fi_info
structure to select the interface to use. A common selection criterion is
the interface nearest the core the process is bound to. To make
this determination, the application requires the PCI information about the
interface. For this reason, LNX forwards the PCI information for the
inter-node provider in the link to the application.
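
The link reporting described under *Using the Provider* can be pictured with a
short sketch that composes one link name per inter-node domain. The helper
`build_link_names()` and the `LINK_NAME_MAX` limit are hypothetical and only
illustrate the shm+cxi0 / shm+cxi1 naming from the example above; they are not
the provider's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define LINK_NAME_MAX 64  /* hypothetical limit for this sketch */

/*
 * Compose one link name per inter-node domain, e.g. "shm" and "cxi0"
 * become "shm+cxi0". Returns the number of names written, or -1 if a
 * name does not fit in LINK_NAME_MAX bytes.
 */
static int build_link_names(const char *intra, const char *const *domains,
                            size_t ndomains, char out[][LINK_NAME_MAX])
{
    size_t i;

    assert(intra && domains && out);

    for (i = 0; i < ndomains; i++) {
        int n = snprintf(out[i], LINK_NAME_MAX, "%s+%s", intra, domains[i]);

        if (n < 0 || n >= LINK_NAME_MAX)
            return -1;
    }
    return (int)ndomains;
}
```

An application would then choose among the reported links, for instance by
comparing the PCI information attached to each one.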

# LIMITATIONS AND FUTURE WORK

*Hardware Support*
: LNX doesn't support hardware offload, e.g., hardware tag matching. This is
an inherent limitation when using the peer infrastructure. Because
linked providers need to query a shared receive queue when
a message is received, any hardware offload which requires sending the
receive buffers to the hardware directly will not work with the shared
receive queue. The shared receive queue provides two advantages: 1) reduced
memory usage, and 2) coordination of the receive operations. The second is
needed when receiving from FI_ADDR_UNSPEC. In this case, both providers which
are part of the link can race to gain access to the receive buffer. It is
a future effort to determine a way to use hardware tag matching and other
hardware offload capabilities with LNX.

*Limited Linking*
: This release of the provider supports linking the SHM provider for intra-node
operations and another provider which supports the FI_PEER capability for
inter-node operations. It is a future effort to expand to linking
arbitrary sets of providers.

*Memory Registration*
: As part of the memory registration operation, varying hardware can perform
hardware-specific steps such as memory pinning. Because the
memory registration APIs do not specify the source or destination
addresses, it is not possible for LNX to determine which provider to
forward the memory registration to. LNX, therefore, registers the memory
with all linked providers. This might not be efficient and might have
unforeseen side effects. A better method is needed to support memory
registration.

*Operation Types*
: This release of LNX supports tagged and RMA operations only. Future
releases will expand the support to other operation types.

*Multi-Rail*
: Future design effort is being planned to support utilizing multiple interfaces
for traffic simultaneously. This can be over homogeneous interfaces or over
heterogeneous interfaces.
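
The software matching that stands in for hardware tag matching can be sketched
as two ordered queues. This is a simplified model under stated assumptions: the
types and functions below are hypothetical, `ADDR_UNSPEC` is a stand-in for
FI_ADDR_UNSPEC, and the tag rule (bits set in `ignore` are not compared) follows
the usual libfabric tagged-receive convention rather than code from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ADDR_UNSPEC UINT64_MAX  /* stand-in for FI_ADDR_UNSPEC */
#define QUEUE_MAX   16

struct entry {
    uint64_t addr;    /* source address; ADDR_UNSPEC matches any source */
    uint64_t tag;
    uint64_t ignore;  /* tag bits to skip; meaningful on receives only */
};

struct queue {
    struct entry items[QUEUE_MAX];
    size_t count;
};

static int queue_push(struct queue *q, struct entry e)
{
    if (q->count == QUEUE_MAX)
        return -1;
    q->items[q->count++] = e;
    return 0;
}

/* Remove one entry while keeping the remaining entries in order. */
static void queue_remove(struct queue *q, size_t idx)
{
    assert(idx < q->count);
    memmove(&q->items[idx], &q->items[idx + 1],
            (q->count - idx - 1) * sizeof(q->items[0]));
    q->count--;
}

static int recv_matches(const struct entry *rx, uint64_t src, uint64_t tag)
{
    return (rx->addr == ADDR_UNSPEC || rx->addr == src) &&
           ((rx->tag ^ tag) & ~rx->ignore) == 0;
}

/* A linked provider received a message: 1 = matched a posted receive,
 * 0 = stored on the unexpected queue, -1 = unexpected queue full. */
static int on_message(struct queue *srq, struct queue *suq,
                      uint64_t src, uint64_t tag)
{
    struct entry msg = { .addr = src, .tag = tag, .ignore = 0 };
    size_t i;

    for (i = 0; i < srq->count; i++) {
        if (recv_matches(&srq->items[i], src, tag)) {
            queue_remove(srq, i);
            return 1;
        }
    }
    return queue_push(suq, msg);
}

/* The application posted a receive: 1 = completed from the unexpected
 * queue, 0 = queued on the shared receive queue, -1 = queue full. */
static int post_recv(struct queue *srq, struct queue *suq,
                     uint64_t addr, uint64_t tag, uint64_t ignore)
{
    struct entry rx = { .addr = addr, .tag = tag, .ignore = ignore };
    size_t i;

    for (i = 0; i < suq->count; i++) {
        if (recv_matches(&rx, suq->items[i].addr, suq->items[i].tag)) {
            queue_remove(suq, i);
            return 1;
        }
    }
    return queue_push(srq, rx);
}
```

Because both linked providers consult the same queues, a receive posted with
an unspecified source is consumed by whichever provider matches it first. A
real implementation would also need locking, which this sketch omits.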

# RUNTIME PARAMETERS

The *LNX* provider checks for the following environment variables:

*FI_LNX_PROV_LINKS*
: This environment variable is used to specify which providers to link. It
must be set in order for the LNX provider to return a list of fi_info
blocks in the fi_getinfo() call. The format which must be used is:
<prov1>+<prov2>+... As mentioned earlier, LNX currently supports linking
only two providers, the first of which is SHM, followed by one other
provider for inter-node operations.

*FI_LNX_DISABLE_SHM*
: By default this environment variable is set to 0. However, the user can
set it to 1, in which case the SHM provider will not be used. This can be
useful for debugging and performance analysis. The SHM provider will
naturally be used for all intra-node operations. Therefore, to test SHM in
isolation with LNX, the processes can be limited to the same node only.

*FI_LNX_SRQ_SUPPORT*
: Shared Receive Queues are an integral part of the peer infrastructure, but
they have the limitation of not using hardware offload, such as tag
matching. SRQ is needed to support the FI_ADDR_UNSPEC case. If the application
is sure this will never be the case, then it can turn off SRQ support by
setting this environment variable to 0. It is 1 by default.
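
As an illustration of how these three variables fit together, the sketch below
reads them with plain `getenv()` and applies the documented defaults. The
`struct lnx_params` container and the helper names are hypothetical, and the
provider itself presumably goes through libfabric's parameter interface rather
than calling `getenv()` directly. The `+` separator follows the format given
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINKED    8   /* hypothetical limits for this sketch */
#define PROV_NAME_MAX 32

struct lnx_params {
    char provs[MAX_LINKED][PROV_NAME_MAX];
    size_t nprovs;
    int disable_shm;  /* FI_LNX_DISABLE_SHM, default 0 */
    int srq_support;  /* FI_LNX_SRQ_SUPPORT, default 1 */
};

/* Read a 0/1 style variable, falling back to def when unset or empty. */
static int env_flag(const char *name, int def)
{
    const char *val = getenv(name);

    if (!val || !*val)
        return def;
    return atoi(val) != 0;
}

/* Split "<prov1>+<prov2>+..." into names; -1 on empty or oversized input. */
static int parse_links(const char *links, struct lnx_params *p)
{
    const char *start = links;

    p->nprovs = 0;
    while (*start) {
        size_t len = strcspn(start, "+");

        if (len == 0 || len >= PROV_NAME_MAX || p->nprovs == MAX_LINKED)
            return -1;
        memcpy(p->provs[p->nprovs], start, len);
        p->provs[p->nprovs][len] = '\0';
        p->nprovs++;
        start += len;
        if (*start == '+') {
            start++;
            if (!*start)
                return -1;  /* trailing '+' with no provider name */
        }
    }
    return p->nprovs ? 0 : -1;
}

static int load_params(struct lnx_params *p)
{
    const char *links = getenv("FI_LNX_PROV_LINKS");

    assert(p != NULL);
    p->nprovs = 0;
    p->disable_shm = env_flag("FI_LNX_DISABLE_SHM", 0);
    p->srq_support = env_flag("FI_LNX_SRQ_SUPPORT", 1);
    if (!links)
        return -1;  /* without it, no lnx fi_info entries are reported */
    return parse_links(links, p);
}
```

For example, running with `FI_LNX_PROV_LINKS=shm+cxi` would yield two provider
names, `shm` and `cxi`, with SHM enabled and SRQ support on unless the other
two variables say otherwise.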

# SEE ALSO

[`fabric`(7)](fabric.7.html),
[`fi_provider`(7)](fi_provider.7.html),
[`fi_getinfo`(3)](fi_getinfo.3.html)