RDMA_READ errors without UCX_MEMTYPE_CACHE=n #7575

Closed · pentschev opened this issue Oct 22, 2021 · 10 comments · Fixed by #7791

@pentschev (Contributor)

Describe the bug

After the whole-alloc and memtype_cache changes from #7128, we should no longer need to set UCX_MEMTYPE_CACHE=n. In most cases that works fine, but for some Dask workflows we see RDMA_READ errors such as the ones below when UCX_MEMTYPE_CACHE is not set.

[1634937588.651064] [dgx13:4074 :0]     ib_mlx5_log.c:170  UCX  ERROR Local protection on mlx5_0:1/IB (synd 0x4 vend 0x33 hw_synd 0/157)
[1634937588.651064] [dgx13:4074 :0]     ib_mlx5_log.c:170  UCX  ERROR RC QP 0x1cd4a wqe[49]: RDMA_READ s-- [rva 0x7f8e37495908 rkey 0xa1909] [va 0x7f06cc06d200 len 100028216 lkey 0x3efdf] [rqpn 0xf8b2 dlid=47 sl=0 port=1 src_path_bits=0]
[1634937588.931040] [dgx13:4085 :0]     ib_mlx5_log.c:170  UCX  ERROR Local protection on mlx5_3:1/IB (synd 0x4 vend 0x33 hw_synd 0/157)
[1634937588.931040] [dgx13:4085 :0]     ib_mlx5_log.c:170  UCX  ERROR RC QP 0xf8bc wqe[32]: RDMA_READ s-- [rva 0x7f999b5e1000 rkey 0x16aa9a] [va 0x7fde77fcbc00 len 99986752 lkey 0xbf4e4] [rqpn 0xf88b dlid=46 sl=0 port=1 src_path_bits=0]

Steps to Reproduce

  • The simplest way to reproduce is to run the Dask cuDF benchmark using UCX-Py:
    • UCX_MAX_RNDV_RAILS=1 UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda python dask_cuda/benchmarks/local_cudf_merge.py -d 0,1,2,3,4,5,6,7 --runs 5 -c 100_000_000 -p ucx --interface ib0
  • UCX master 5d34a0f

Setup and versions

  • DGX-1 with 8 x NVIDIA V100
  • Linux dgx13 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • MOFED 5.3-1.0.5.0
  • NVIDIA driver: 460.39
  • CUDA 11.2
  • Both nv_peer_mem and gdrcopy modules loaded
pentschev added the Bug label on Oct 22, 2021
@pentschev (Contributor, Author)

@yosefe @Akshay-Venkatesh both of you have pointed out to me that setting UCX_MEMTYPE_CACHE=n shouldn't be needed anymore. Am I perhaps missing something obvious here?

@Akshay-Venkatesh (Contributor)

@pentschev UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda is not explicitly required. Do lengths 100028216 and 99986752 look right? Do the errors lead to failures?

Also, did you already fix the UCP config-modify usage? The memtype cache is part of the global options now.

@yosefe what does local protection error mean here?

@pentschev (Contributor, Author)

> UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda is not explicitly required.

That's right, it's not really necessary, but it was needed when I reverted #7128 for testing; that's why I had it set.

> Do lengths 100028216 and 99986752 look right? Do the errors lead to failures?

Yes, the lengths are normal. The errors do lead to failures, sorry for being unclear about that: endpoints raise Input/output errors and the benchmark eventually hangs.

> Also, did you already fix the UCP config-modify usage? The memtype cache is part of the global options now.

I only removed the environment variable, and that caused the errors to appear. However, if I add UCX_MEMTYPE_CACHE=n back, it works. Are we supposed to still have it set as a global option?

@yosefe (Contributor) commented Oct 28, 2021

@pentschev can you please run the failing case with "UCX_LOG_LEVEL=req UCX_MEM_LOG_LEVEL=trace" and upload the output (hopefully it will not get too big)?
It seems like some memory hooks were missed.

yosefe added this to the v12.0 milestone on Oct 28, 2021
@pentschev (Contributor, Author)

@yosefe please see the attached log. I think it's not too big; I used 3 Dask workers, as I couldn't reproduce with only 2.

local_protection_err.log

pentschev added a commit to pentschev/ucx-py that referenced this issue Nov 4, 2021
UCX_MEMTYPE_CACHE=n is still required even for UCX 1.12 (see
openucx/ucx#7575), so we reenable it.
Reordering the import is also required to prevent UCS warnings when
Cython code is imported.
@pentschev (Contributor, Author)

I confirmed with @yosefe offline that leaving UCX_MEMTYPE_CACHE=n set for the 1.12 release is OK; we only need to reorder the variable setting (sketched below), which is being addressed in rapidsai/ucx-py#807. This issue can then be moved to the 1.13 milestone.
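
A minimal sketch of the ordering constraint (not the actual rapidsai/ucx-py#807 patch; the layout is illustrative): UCX parses UCX_* variables from the environment when the native library is loaded and initialized, so the variable has to be set before the Cython extension (and thus libucp/libucs) is imported.

import os

# Must be set before `import ucp`; once the native library has been loaded,
# changing the environment no longer affects UCX and may trigger UCS warnings.
os.environ.setdefault("UCX_MEMTYPE_CACHE", "n")

import ucp  # noqa: E402  (import intentionally placed after the env setup)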

pentschev added a commit to rapidsai/ucx-py that referenced this issue Nov 4, 2021
UCX_MEMTYPE_CACHE=n is still required even for UCX 1.12 (see
openucx/ucx#7575), so we reenable it.
Reordering the import is also required to prevent UCS warnings when
Cython code is imported.
@Akshay-Venkatesh (Contributor) commented Jan 4, 2022

@pentschev Assuming that the bug here gets fixed at some point in UCX (and potentially in lower layers as well), what is the plan for ucx-py to use the whole-allocation registration feature on DGX machines vs. T4 GPUs? On DGX, I see that UCX_CUDA_COPY_MAX_REG_RATIO (0.1 by default) has to be bumped up to 0.9-1.0 to avoid repeated registrations with RMM sub-allocated memory. On T4, a high ratio of 0.9-1.0 will most likely result in BAR1 exhaustion.

One way this could be handled in UCX is to override the user-specified ratio when T4 GPUs are detected and UCX_CUDA_COPY_REG_WHOLE_ALLOC is set to auto: if a T4 is detected, we'd internally set max_reg_ratio to 0.01 or some other low value. If there is a use case where whole-alloc registration with a high ratio has to be used on T4 as well, the user would have to set UCX_CUDA_COPY_REG_WHOLE_ALLOC to on. This way ucx-py could default to UCX_CUDA_COPY_MAX_REG_RATIO=0.9/1.0, leave UCX_CUDA_COPY_REG_WHOLE_ALLOC (auto by default) unchanged, and wouldn't have to explicitly detect T4 GPUs. Does that sound reasonable?

cc @yosefe

Edit: Looking at the latest code, additional changes are not needed. Setting UCX_CUDA_COPY_MAX_REG_RATIO=0.9 should have the desired effect of registering the whole RMM allocation on DGX and only the user-exposed region in the T4 case.

static size_t
uct_cuda_base_get_total_device_mem(CUdevice cuda_device)
{
   ...
    if (!total_bytes[cuda_device]) {
        cu_err = cuDeviceTotalMem(&total_bytes[cuda_device], cuda_device);
        if (cu_err != CUDA_SUCCESS) {
            cuGetErrorString(cu_err, &cu_err_str);
            ucs_error("cuDeviceTotalMem error: %s", cu_err_str);
            goto err;
        }   

        cu_err = cuDeviceGetName(dev_name, sizeof(dev_name), cuda_device);
        if (cu_err != CUDA_SUCCESS) {
            cuGetErrorString(cu_err, &cu_err_str);
            ucs_error("cuDeviceGetName error: %s", cu_err_str);
            goto err;
        }

        if (!strncmp(dev_name, "T4", 2)) {
            total_bytes[cuda_device] = 1; /* should ensure that whole alloc
                                             registration is not used for t4 */
        }
    }
    ...
}

static ucs_status_t
uct_cuda_base_query_attributes(uct_cuda_copy_md_t *md, const void *address,
                               size_t length, ucs_memory_info_t *mem_info)
{
    ...
    if (md->config.alloc_whole_reg == UCS_CONFIG_AUTO) {
        total_bytes = uct_cuda_base_get_total_device_mem(cuda_device); /* total_bytes = 1 if auto */
        if (alloc_length > (total_bytes * md->config.max_reg_ratio)) {
            goto out_default_range; /* always taken if T4 and alloc_length > 0 */
        }
    } else {
        ucs_assert(md->config.alloc_whole_reg == UCS_CONFIG_ON);
    }

    mem_info->base_address = (void*)base_address;
    mem_info->alloc_length = alloc_length;
    return UCS_OK;

out_default_range:
    mem_info->base_address = (void*)address;
    mem_info->alloc_length = length;
    return UCS_OK;
}
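
For reference, under REG_WHOLE_ALLOC=auto the decision above boils down to a simple threshold; a rough Python restatement of the C excerpt (names are illustrative, not UCX API):

def registers_whole_allocation(alloc_length, total_device_mem,
                               max_reg_ratio=0.1, whole_alloc="auto"):
    # Illustrative restatement of the C excerpt above, not an actual UCX API.
    # In "auto" mode the whole allocation is registered only when it does not
    # exceed max_reg_ratio * total device memory; since total_device_mem is
    # forced to 1 for T4, the check effectively never passes there.
    if whole_alloc == "auto":
        return alloc_length <= total_device_mem * max_reg_ratio
    return whole_alloc == "on"

This is also why an RMM pool sized close to the full device memory needs the ratio bumped toward 1.0: with the default 0.1, the allocation length exceeds the threshold and UCX falls back to registering only the requested range.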

@pentschev (Contributor, Author)

@Akshay-Venkatesh thanks for the ping. I worked yesterday on new defaults for UCX_CUDA_COPY_MAX_REG_RATIO in rapidsai/ucx-py#824. In summary, that change sets UCX_CUDA_COPY_MAX_REG_RATIO=1.0 if BAR1 >= total GPU memory and otherwise leaves it at the UCX default (currently 0.1), roughly as sketched below; in the future we could set a smarter ratio if that's useful. Do you think the UCX-Py change is unnecessary given the code you pasted above?
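
A simplified sketch of that heuristic (not the actual rapidsai/ucx-py#824 code; it assumes pynvml is installed and only inspects device 0):

import os
import pynvml

def set_cuda_copy_max_reg_ratio():
    # Sketch of the heuristic described above: allow whole-allocation
    # registration only when BAR1 is large enough to map all of GPU memory.
    if "UCX_CUDA_COPY_MAX_REG_RATIO" in os.environ:
        return  # respect an explicit user setting
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assume device 0 is representative
        bar1_total = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle).bar1Total
        mem_total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()
    if bar1_total >= mem_total:
        os.environ["UCX_CUDA_COPY_MAX_REG_RATIO"] = "1.0"
    # otherwise leave the UCX default (currently 0.1) in place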

@Akshay-Venkatesh (Contributor)

@pentschev Thanks for the pointer. rapidsai/ucx-py#824 could be simpler and avoid the NVML BAR1-info calls given the code above, but IMO it doesn't hurt, assuming the NVML query is a one-time cost and given that the logic in UCX may change in the future.

Also, we had logic similar to rapidsai/ucx-py#824 in the past, but we removed it in favor of not having an NVML dependency. As of 1.12 we do depend on NVML, so it's probably better to generalize this kind of detection by replacing the device-name check with an NVML BAR1-info query instead.

@pentschev (Contributor, Author)

In our case we may generally assume that a user importing UCX-Py for GPU usage will have pynvml installed, so that's less of a concern for us, and the query only runs at import time anyway. Plus, querying the BAR1 size keeps us future-proof to some extent, as we would prefer not to special-case GPUs by their characteristics (e.g., name).

yosefe modified the milestones: v12.0, v1.12.1 on Jan 24, 2022