RDMA_READ errors without UCX_MEMTYPE_CACHE=n #7575

Closed · pentschev opened this issue Oct 22, 2021 · 10 comments · Fixed by #7791

@pentschev (Contributor)

Describe the bug

After the whole-alloc and memtype_cache changes from #7128, we should no longer need to set UCX_MEMTYPE_CACHE=n. In most cases that works fine, but for some Dask workflows we see RDMA_READ errors such as the ones below when UCX_MEMTYPE_CACHE is not set.

[1634937588.651064] [dgx13:4074 :0]     ib_mlx5_log.c:170  UCX  ERROR Local protection on mlx5_0:1/IB (synd 0x4 vend 0x33 hw_synd 0/157)
[1634937588.651064] [dgx13:4074 :0]     ib_mlx5_log.c:170  UCX  ERROR RC QP 0x1cd4a wqe[49]: RDMA_READ s-- [rva 0x7f8e37495908 rkey 0xa1909] [va 0x7f06cc06d200 len 100028216 lkey 0x3efdf] [rqpn 0xf8b2 dlid=47 sl=0 port=1 src_path_bits=0]
[1634937588.931040] [dgx13:4085 :0]     ib_mlx5_log.c:170  UCX  ERROR Local protection on mlx5_3:1/IB (synd 0x4 vend 0x33 hw_synd 0/157)
[1634937588.931040] [dgx13:4085 :0]     ib_mlx5_log.c:170  UCX  ERROR RC QP 0xf8bc wqe[32]: RDMA_READ s-- [rva 0x7f999b5e1000 rkey 0x16aa9a] [va 0x7fde77fcbc00 len 99986752 lkey 0xbf4e4] [rqpn 0xf88b dlid=46 sl=0 port=1 src_path_bits=0]

Steps to Reproduce

  • The simplest way to reproduce is to run the Dask cuDF benchmark using UCX-Py:
    • UCX_MAX_RNDV_RAILS=1 UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda python dask_cuda/benchmarks/local_cudf_merge.py -d 0,1,2,3,4,5,6,7 --runs 5 -c 100_000_000 -p ucx --interface ib0
  • UCX master 5d34a0f

Setup and versions

  • DGX-1 with 8 x NVIDIA V100
  • Linux dgx13 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • MOFED 5.3-1.0.5.0
  • NVIDIA driver: 460.39
  • CUDA 11.2
  • Both nv_peer_mem and gdrcopy modules loaded
pentschev added the Bug label on Oct 22, 2021
@pentschev (Contributor, Author)

@yosefe @Akshay-Venkatesh both of you have pointed out to me that setting UCX_MEMTYPE_CACHE=n shouldn't be needed anymore. Am I perhaps missing something obvious here?

@Akshay-Venkatesh (Contributor)

@pentschev UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda is not explicitly required. Do lengths 100028216 and 99986752 look right? Do the errors lead to failures?

Also, did you already fix the UCP config-modify usage? The memtype cache is part of the global options now.

@yosefe what does local protection error mean here?

@pentschev (Contributor, Author)

> UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda is not explicitly required.

That's right, it's not really necessary, but it was needed when I reverted #7128 for testing; that's why I had it set.

> Do lengths 100028216 and 99986752 look right? Do the errors lead to failures?

Yes, the lengths are normal. The errors do lead to failures, sorry for being unclear about that: endpoints raise Input/output errors and the benchmark eventually hangs.

> Also, did you already fix the UCP config-modify usage? The memtype cache is part of the global options now.

I only removed the environment variable, and that caused the errors to appear. However, if I add UCX_MEMTYPE_CACHE=n back, it works. Are we supposed to still have it set as a global option?

@yosefe (Contributor) commented Oct 28, 2021

@pentschev can you please run the failing case with "UCX_LOG_LEVEL=req UCX_MEM_LOG_LEVEL=trace" and upload the output (hopefully it will not get too big)?
It seems like some memory hooks were missed.

yosefe added this to the v12.0 milestone on Oct 28, 2021
@pentschev (Contributor, Author)

@yosefe please see the attached log. I think it's not too big; I used 3 Dask workers, as I couldn't reproduce with only 2.

local_protection_err.log

pentschev added a commit to pentschev/ucx-py that referenced this issue Nov 4, 2021
UCX_MEMTYPE_CACHE=n is still required even for UCX 1.12 (see
openucx/ucx#7575), so we reenable it.
Reordering the import is also required to prevent UCS warnings when
Cython code is imported.
@pentschev (Contributor, Author)

I confirmed with @yosefe offline that leaving UCX_MEMTYPE_CACHE=n set for the 1.12 release is OK; we only need to reorder the variable setting (sketched below), which is being addressed in rapidsai/ucx-py#807. This issue can then be moved to the 1.13 milestone.
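
A minimal sketch of the ordering constraint (not the actual rapidsai/ucx-py#807 patch; the layout is illustrative): UCX parses UCX_* variables from the environment when the native library is loaded and initialized, so the variable has to be set before the Cython extension (and thus libucp/libucs) is imported.

import os

# Must be set before `import ucp`; once the native library has been loaded,
# changing the environment no longer affects UCX and may trigger UCS warnings.
os.environ.setdefault("UCX_MEMTYPE_CACHE", "n")

import ucp  # noqa: E402  (import intentionally placed after the env setup)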

pentschev added a commit to rapidsai/ucx-py that referenced this issue Nov 4, 2021
UCX_MEMTYPE_CACHE=n is still required even for UCX 1.12 (see
openucx/ucx#7575), so we reenable it.
Reordering the import is also required to prevent UCS warnings when
Cython code is imported.
@Akshay-Venkatesh (Contributor) commented Jan 4, 2022

@pentschev Assuming that the bug here gets fixed at some point in UCX (and potentially in lower layers as well), what is the plan for ucx-py to use the whole-allocation registration feature on DGX machines vs. T4 GPUs? On DGX, I see that UCX_CUDA_COPY_MAX_REG_RATIO (0.1 by default) has to be bumped up to 0.9-1.0 to avoid repeated registrations with RMM sub-allocated memory. On T4, a high ratio of 0.9-1.0 will most likely result in BAR1 exhaustion.

One way this could be handled in UCX is to override the user-specified ratio when T4 GPUs are detected and UCX_CUDA_COPY_REG_WHOLE_ALLOC is set to auto: if a T4 is detected, we'd internally set max_reg_ratio to 0.01 or some other low value. If there is a use case where whole-alloc registration with a high ratio has to be used on T4 as well, the user would have to set UCX_CUDA_COPY_REG_WHOLE_ALLOC to on. This way ucx-py could default to UCX_CUDA_COPY_MAX_REG_RATIO=0.9/1.0, leave UCX_CUDA_COPY_REG_WHOLE_ALLOC (auto by default) unchanged, and wouldn't have to explicitly detect T4 GPUs. Does that sound reasonable?

cc @yosefe

Edit: Looking at the latest code, additional changes are not needed. Setting UCX_CUDA_COPY_MAX_REG_RATIO=0.9 should have the desired effect of registering the whole RMM allocation on DGX and only the user-exposed region in the T4 case.

static size_t
uct_cuda_base_get_total_device_mem(CUdevice cuda_device)
{
   ...
    if (!total_bytes[cuda_device]) {
        cu_err = cuDeviceTotalMem(&total_bytes[cuda_device], cuda_device);
        if (cu_err != CUDA_SUCCESS) {
            cuGetErrorString(cu_err, &cu_err_str);
            ucs_error("cuDeviceTotalMem error: %s", cu_err_str);
            goto err;
        }   

        cu_err = cuDeviceGetName(dev_name, sizeof(dev_name), cuda_device);
        if (cu_err != CUDA_SUCCESS) {
            cuGetErrorString(cu_err, &cu_err_str);
            ucs_error("cuDeviceGetName error: %s", cu_err_str);
            goto err;
        }

        if (!strncmp(dev_name, "T4", 2)) {
            total_bytes[cuda_device] = 1; /* should ensure that whole alloc
                                             registration is not used for t4 */
        }
    }
    ...
}

static ucs_status_t
uct_cuda_base_query_attributes(uct_cuda_copy_md_t *md, const void *address,
                               size_t length, ucs_memory_info_t *mem_info)
{
    ...
    if (md->config.alloc_whole_reg == UCS_CONFIG_AUTO) {
        total_bytes = uct_cuda_base_get_total_device_mem(cuda_device); /* total_bytes = 1 if auto */
        if (alloc_length > (total_bytes * md->config.max_reg_ratio)) {
            goto out_default_range; /* always taken if T4 and alloc_length > 0 */
        }
    } else {
        ucs_assert(md->config.alloc_whole_reg == UCS_CONFIG_ON);
    }

    mem_info->base_address = (void*)base_address;
    mem_info->alloc_length = alloc_length;
    return UCS_OK;

out_default_range:
    mem_info->base_address = (void*)address;
    mem_info->alloc_length = length;
    return UCS_OK;
}
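
For reference, under REG_WHOLE_ALLOC=auto the decision above boils down to a simple threshold; a rough Python restatement of the C excerpt (names are illustrative, not UCX API):

def registers_whole_allocation(alloc_length, total_device_mem,
                               max_reg_ratio=0.1, whole_alloc="auto"):
    # Illustrative restatement of the C excerpt above, not an actual UCX API.
    # In "auto" mode the whole allocation is registered only when it does not
    # exceed max_reg_ratio * total device memory; since total_device_mem is
    # forced to 1 for T4, the check effectively never passes there.
    if whole_alloc == "auto":
        return alloc_length <= total_device_mem * max_reg_ratio
    return whole_alloc == "on"

This is also why an RMM pool sized close to the full device memory needs the ratio bumped toward 1.0: with the default 0.1, the allocation length exceeds the threshold and UCX falls back to registering only the requested range.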

@pentschev (Contributor, Author)

@Akshay-Venkatesh thanks for the ping. I worked yesterday on new defaults for UCX_CUDA_COPY_MAX_REG_RATIO in rapidsai/ucx-py#824. In summary, that change sets UCX_CUDA_COPY_MAX_REG_RATIO=1.0 if BAR1 >= total GPU memory and otherwise leaves it at the UCX default (currently 0.1), roughly as sketched below; in the future we could set a smarter ratio if that's useful. Do you think the UCX-Py change is unnecessary given the code you pasted above?
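
A simplified sketch of that heuristic (not the actual rapidsai/ucx-py#824 code; it assumes pynvml is installed and only inspects device 0):

import os
import pynvml

def set_cuda_copy_max_reg_ratio():
    # Sketch of the heuristic described above: allow whole-allocation
    # registration only when BAR1 is large enough to map all of GPU memory.
    if "UCX_CUDA_COPY_MAX_REG_RATIO" in os.environ:
        return  # respect an explicit user setting
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assume device 0 is representative
        bar1_total = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle).bar1Total
        mem_total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()
    if bar1_total >= mem_total:
        os.environ["UCX_CUDA_COPY_MAX_REG_RATIO"] = "1.0"
    # otherwise leave the UCX default (currently 0.1) in place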

@Akshay-Venkatesh (Contributor)

@pentschev Thanks for the pointer. rapidsai/ucx-py#824 could be simpler and avoid the NVML BAR1-info calls given the code above, but IMO it doesn't hurt, assuming the NVML query is a one-time cost and given that the logic in UCX may change in the future.

Also, we had logic similar to rapidsai/ucx-py#824 in the past, but we removed it in favor of not having an NVML dependency. As of 1.12 we do depend on NVML, so it's probably better to generalize this kind of detection by replacing the device-name check with an NVML BAR1-info query instead.

@pentschev (Contributor, Author)

In our case we may generally assume that a user importing UCX-Py for GPU usage will have pynvml installed, so that's less of a concern for us, and the query only runs at import time anyway. Plus, querying the BAR1 size keeps us future-proof to some extent, as we would prefer not to special-case GPUs by their characteristics (e.g., name).

yosefe modified the milestones: v12.0, v1.12.1 on Jan 24, 2022