RDMA_READ errors without UCX_MEMTYPE_CACHE=n #7575
Comments
@yosefe @Akshay-Venkatesh both of you have pointed out to me that setting …
@pentschev UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda is not explicitly required. Do lengths 100028216 and 99986752 look right? Do the errors lead to failures? Also, did you already fix UCP config modify usage? The memtype cache is part of the global options now. @yosefe what does local protection error mean here?
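(For reference, "UCP config modify usage" refers to UCX's programmatic configuration API. A minimal sketch of disabling the memtype cache through that API rather than through the environment could look like the following; whether ucp_config_modify can reach the global MEMTYPE_CACHE option in a given UCX version is an assumption here, and is exactly what the question above is about.)

```c
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_config_t *config;
    ucp_params_t params = { .field_mask = UCP_PARAM_FIELD_FEATURES,
                            .features   = UCP_FEATURE_RMA };
    ucp_context_h context;
    ucs_status_t status;

    status = ucp_config_read(NULL, NULL, &config);
    if (status != UCS_OK) {
        fprintf(stderr, "ucp_config_read failed: %s\n",
                ucs_status_string(status));
        return 1;
    }

    /* Intended to be equivalent to exporting UCX_MEMTYPE_CACHE=n before
     * launch; assumes the global option is reachable via this call. */
    status = ucp_config_modify(config, "MEMTYPE_CACHE", "n");
    if (status != UCS_OK) {
        fprintf(stderr, "ucp_config_modify failed: %s\n",
                ucs_status_string(status));
    }

    status = ucp_init(&params, config, &context);
    ucp_config_release(config);
    if (status != UCS_OK) {
        return 1;
    }

    ucp_cleanup(context);
    return 0;
}
```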
That's right, it's not strictly necessary, but it was needed when I reverted #7128 for testing; that's why I had it set.
Yes, the lengths are normal. The errors lead to failures, sorry for being unclear on that. Endpoints raise …
I only removed the environment variable, and that caused errors to appear. However, if I add …
@pentschev can you please run the failing case with "UCX_LOG_LEVEL=req UCX_MEM_LOG_LEVEL=trace" and upload the output (hopefully it will not get too big)?
@yosefe please see the attached log. I think it's not too big; I have 3 Dask workers, I couldn't reproduce with only 2.
UCX_MEMTYPE_CACHE=n is still required even for UCX 1.12 (see openucx/ucx#7575), so we reenable it. Reordering the import is also required to prevent UCS warnings when Cython code is imported.
I confirmed with @yosefe offline that leaving …
@pentschev Assuming that the bug here gets fixed at some point in UCX (and potentially in lower layers as well), what is the plan for ucx-py to use the whole-allocation registration feature for DGX machines vs T4 GPUs? For DGX, I see that the default value of UCX_CUDA_COPY_MAX_REG_RATIO (0.1 by default) has to be bumped up to between 0.9-1.0 to avoid repeated registrations with RMM sub-allocated memory. For T4, a high ratio of 0.9-1.0 will most likely result in BAR1 exhaustion.

One way this can be handled in UCX is to override the user-specified ratio if T4 GPUs are detected when UCX_CUDA_COPY_REG_WHOLE_ALLOC is set to auto: if a T4 is detected, we'd set max_reg_ratio to 0.01 or some other low value internally. If there is a use case where whole-alloc registration with a high registration ratio has to be used on T4 as well, the user would have to set UCX_CUDA_COPY_REG_WHOLE_ALLOC to on. This way ucx-py could default to setting UCX_CUDA_COPY_MAX_REG_RATIO=0.9/1.0, leave UCX_CUDA_COPY_REG_WHOLE_ALLOC (auto by default) unchanged, and wouldn't have to explicitly detect T4 GPUs. Does that sound reasonable? cc @yosefe

Edit: Looking at the latest code, additional changes are not needed. Setting UCX_CUDA_COPY_MAX_REG_RATIO=0.9 should have the desired effect of registering the whole RMM allocation on DGX and only the user-exposed region in the case of T4:

```c
static size_t
uct_cuda_base_get_total_device_mem(CUdevice cuda_device)
{
...
if (!total_bytes[cuda_device]) {
cu_err = cuDeviceTotalMem(&total_bytes[cuda_device], cuda_device);
if (cu_err != CUDA_SUCCESS) {
cuGetErrorString(cu_err, &cu_err_str);
ucs_error("cuDeviceTotalMem error: %s", cu_err_str);
goto err;
}
cu_err = cuDeviceGetName(dev_name, sizeof(dev_name), cuda_device);
if (cu_err != CUDA_SUCCESS) {
cuGetErrorString(cu_err, &cu_err_str);
ucs_error("cuDeviceGetName error: %s", cu_err_str);
goto err;
}
if (!strncmp(dev_name, "T4", 2)) {
total_bytes[cuda_device] = 1; /* should ensure that whole alloc
registration is not used for t4 */
}
}
...
}
static ucs_status_t
uct_cuda_base_query_attributes(uct_cuda_copy_md_t *md, const void *address,
size_t length, ucs_memory_info_t *mem_info)
{
...
if (md->config.alloc_whole_reg == UCS_CONFIG_AUTO) {
        total_bytes = uct_cuda_base_get_total_device_mem(cuda_device); /* total_bytes == 1 if T4 (see above) */
if (alloc_length > (total_bytes * md->config.max_reg_ratio)) {
goto out_default_range; /* always taken if T4 and alloc_length > 0 */
}
} else {
ucs_assert(md->config.alloc_whole_reg == UCS_CONFIG_ON);
}
mem_info->base_address = (void*)base_address;
mem_info->alloc_length = alloc_length;
return UCS_OK;
out_default_range:
mem_info->base_address = (void*)address;
mem_info->alloc_length = length;
return UCS_OK;
}
```
@Akshay-Venkatesh thanks for the ping. I worked yesterday on new defaults for …
@pentschev Thanks for the pointer. rapidsai/ucx-py#824 could be simpler and avoid nvml barinfo calls given the code above, but IMO it doesn't hurt, assuming the nvml query is one-time and given that some of this logic in UCX may change in the future. Also, we had logic similar to rapidsai/ucx-py#824 in the past, but we removed it in favor of not having an nvml dependency. As of 1.12 we do depend on nvml, so it's probably better to generalize this kind of detection by replacing it with an nvml barinfo query instead.
In our case, we may generally assume that a user importing UCX-Py for GPU usage will have pynvml installed, so that's less of a concern for us, and it only runs at import time anyway. Plus, querying the BAR1 size allows us to be future-proof to some extent, as we would prefer not to special-case GPUs by characteristics such as their name.
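(For reference, a minimal sketch of what such a BAR1 query could look like through the NVML C API; the device index, the printed summary, and any threshold policy are illustrative assumptions rather than UCX or UCX-Py code. The pynvml-based equivalent is what the discussion above refers to in rapidsai/ucx-py#824.)

```c
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlBAR1Memory_t bar1;
    nvmlDevice_t device;
    nvmlReturn_t ret;

    ret = nvmlInit_v2();
    if (ret != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(ret));
        return 1;
    }

    /* Query device 0 for illustration; real code would iterate devices. */
    ret = nvmlDeviceGetHandleByIndex_v2(0, &device);
    if (ret == NVML_SUCCESS) {
        ret = nvmlDeviceGetBAR1MemoryInfo(device, &bar1);
    }
    if (ret == NVML_SUCCESS) {
        /* A small BAR1 aperture (e.g. T4's 256 MiB) suggests whole-alloc
         * registration should be avoided or capped at a low ratio. */
        printf("BAR1 total: %llu MiB, free: %llu MiB\n",
               bar1.bar1Total >> 20, bar1.bar1Free >> 20);
    } else {
        fprintf(stderr, "NVML query failed: %s\n", nvmlErrorString(ret));
    }

    nvmlShutdown();
    return 0;
}
```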
Describe the bug
After the whole-alloc and memtype_cache changes from #7128, we should not need to set UCX_MEMTYPE_CACHE=n anymore. In most cases that works fine, but for some Dask workflows we see RDMA_READ errors such as the ones below when we don't set any UCX_MEMTYPE_CACHE value.

Steps to Reproduce
```sh
UCX_MAX_RNDV_RAILS=1 UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda python dask_cuda/benchmarks/local_cudf_merge.py -d 0,1,2,3,4,5,6,7 --runs 5 -c 100_000_000 -p ucx --interface ib0
```
Setup and versions
```
Linux dgx13 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
nv_peer_mem and gdrcopy modules loaded