
Kernel's NUMA balancing vs. OpenMPI's affinity #11357

Open
vitduck opened this issue Jan 30, 2023 · 4 comments

Comments

@vitduck

vitduck commented Jan 30, 2023

Hi,

I'm trying to figure out the interaction between automatic NUMA balancing and OpenMPI's process affinity implementation.
I would like to deduce the potential outcomes before carrying out actual benchmarks.

As I understand:

  1. The kernel attempts to maximize data locality by periodically profiling memory access patterns and migrating threads/pages.
  2. By default, Open MPI binds processes either to core (np <= 2) or to socket (np > 2). Processes can also be mapped to various levels of the hardware hierarchy, such as socket, numa, l1cache, l2cache and l3cache.

Considering Zen 3, where the 8 cores in a CCX share an L3 cache, --map-by l3cache can ensure maximal data locality between threads per the first-touch policy. In such a case:

  • (1) becomes redundant at best and OS noise at worst.
  • (1) only makes sense when (2) is disabled with --bind-to none.

A heterogeneous system adds another layer of complication.

  • Best performance is achieved when CPU-GPU affinity is respected.
  • UCX also has affinity detection to make sure processes stay close to the HCA.

For the benchmark, the trend should be as follows:
a. no auto-NUMA, --bind-to none as base line
b. auto-NUMA, --bind-to none
c. no auto-NUMA, OpenMPI's affinity
d. auto-NUMA, OpenMPI's affinity

Depending on the workload, (a) and (b) should show some performance variation across repeated runs. (c) should be slightly better than (d) due to the absence of profiling overhead from the kernel.
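
For each case, the kernel-side setting can be verified by reading /proc/sys/kernel/numa_balancing (assuming a kernel built with automatic NUMA balancing support). A minimal check, for example:

#include <stdio.h>

/* Read /proc/sys/kernel/numa_balancing: 0 = automatic NUMA balancing disabled,
 * non-zero = enabled; -1 here means the knob is not present on this kernel. */
int numa_balancing_enabled(void) {
    int value = -1;
    FILE *fp = fopen("/proc/sys/kernel/numa_balancing", "r");
    if (fp != NULL) {
        if (fscanf(fp, "%d", &value) != 1)
            value = -1;
        fclose(fp);
    }
    return value;
}

int main(void) {
    printf("kernel.numa_balancing = %d\n", numa_balancing_enabled());
    return 0;
}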

I would appreciate it if you could give some comments and share your insights on this matter.

Thanks.

@rhc54
Contributor

rhc54 commented Jan 30, 2023

I'd suggest first just finding out if the app moves at all during execution when mpirun binds it somewhere. If it does, then that could be an issue, as it is overriding what the user asked it to do and expects to see. It would be disturbing, for example, if UCX were "invisibly" changing the binding on the user.

So my suggestion is to write an app that detects its affinity and caches it right after it starts, then does some communication and computation while periodically detecting its affinity and comparing that to its initial one. If something changes, print out the new affinity and when it appeared. You can do this for all four of your cases.

If nothing changes, then you can reduce your benchmark to just cases a and b, plus one case with mpirun setting the affinity. If something is changing, then there is a problem that needs to be addressed by the community.

@vitduck
Author

vitduck commented Feb 1, 2023

Thanks for your comments.

There is not much information discussing this feature of the kernel. The only two references I could find advise against using auto NUMA balancing:

Automatic NUMA balancing by operating system can cause overhead and we therefore don't recommend it for MPI applications.

An HPC application such as QE goes through different workloads during its course of operation, so I can understand that what the kernel decides for one workload will not be suitable for another.

Some recent Linux distributions enable automatic NUMA balancing (or “AutoNUMA”) by default. In some instances, operations performed by automatic NUMA balancing may degrade the performance of applications running on NVIDIA GPUs. For optimal performance, users should manually tune the NUMA characteristics of their application.

Here NVIDIA advises tuning process placement directly with numactl, as seen in the HPL/HPCG wrapper scripts provided in their NGC containers:

mpirun --bind-to none ... numactl --cpunodebind=0 xhpl

I gather that there is nothing to prevent the kernel from overruling OpenMPI's affinity setting. Hence the advice above.
It is not straightforward to me how to monitor affinity for HPC applications. Could you give some pointers?

For now, I can write a toy model that allocates a large data set and monitors page migration from /proc/vmstat.
The overhead, if it exists, should correlate with the data size. That is a little easier to measure.
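
A rough sketch of what I have in mind (the 4 GiB buffer size is arbitrary, and the /proc/vmstat counters are system-wide, so the node has to be otherwise idle for the delta to be meaningful):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return one counter (e.g. "numa_pages_migrated") from /proc/vmstat, or -1 if absent. */
long vmstat_counter(const char *name) {
    FILE *fp = fopen("/proc/vmstat", "r");
    char key[128];
    long value, result = -1;

    if (fp == NULL)
        return -1;
    while (fscanf(fp, "%127s %ld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            result = value;
            break;
        }
    }
    fclose(fp);
    return result;
}

int main(void) {
    size_t nbytes = 4UL * 1024 * 1024 * 1024;   /* 4 GiB working set */
    long before = vmstat_counter("numa_pages_migrated");
    char *buf = malloc(nbytes);

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    /* Touch every page repeatedly so the kernel has something to profile and migrate. */
    for (int iter = 0; iter < 100; iter++) {
        for (size_t i = 0; i < nbytes; i += 4096)
            buf[i]++;
    }
    printf("numa_pages_migrated delta: %ld\n",
           vmstat_counter("numa_pages_migrated") - before);
    free(buf);
    return 0;
}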

@rhc54
Contributor

rhc54 commented Feb 1, 2023

I would advise against generalizing that NVIDIA advice - that cmd line locks all of your processes to the first NUMA node on each machine. I'm not sure why anyone would want to do that on a machine that has more than one NUMA, so I'm guessing that their container is configured somehow to ensure that makes sense.

You usually achieve better results with mpirun --map-by socket --bind-to socket ... as this load balances across the sockets. If your app isn't multi-threaded, then replacing --bind-to socket with --bind-to core can perform a little better.

I gather that there is nothing to prevent the kernel from overruling OpenMPI's affinity setting.

It would be a rather odd kernel that did so. All OMPI does is tell the kernel "schedule this process to execute only on the indicated CPU(s)". The kernel is supposed to honor that request. Even autonuma respects it:

Manual NUMA tuning of applications will override automatic NUMA balancing,
disabling periodic unmapping of memory, NUMA faults, migration, and automatic
NUMA placement of those applications. 

Which is why you would want to ensure that you let mpirun bind the procs.
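
For what it's worth, that request is essentially a sched_setaffinity() call at the OS level (Open MPI goes through hwloc, but on Linux it amounts to the same thing). A minimal illustration, pinning the calling process to CPUs 0-7:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 0; cpu < 8; cpu++)       /* illustrative mask: CPUs 0-7 */
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    /* From here on, the scheduler may only run this process on CPUs 0-7,
     * and automatic NUMA balancing has to work within that constraint. */
    return 0;
}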

Note that there are times when you do want to let the kernel take over. We discussed this recently on another issue (see #11345). Outside of those circumstances, it is usually better to tell the kernel where the process should run.

It is not straightforward to me how to monitor affinity for HPC applications. Could you give some pointers ?

Here is a simple code snippet you could periodically run:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <stdbool.h>

/* Print which CPUs the calling process is currently allowed to run on. */
void print_affinity() {
    cpu_set_t mask;
    long nproc, i;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    nproc = sysconf(_SC_NPROCESSORS_ONLN);
    printf("sched_getaffinity = ");
    for (i = 0; i < nproc; i++) {
        printf("%d ", CPU_ISSET(i, &mask));
    }
    printf("\n");
}

Instead of printing it out, save the initial "mask" you get and then periodically run the code to check the new returned value against the one you obtained at the beginning of execution. If they match, then you haven't moved. So something like this:

/* original_mask is assumed to be a global cpu_set_t saved at program start. */
void check_affinity() {
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    /* cpu_set_t cannot be compared with !=; use the CPU_EQUAL() macro. */
    if (!CPU_EQUAL(&mask, &original_mask)) {
        printf("IT MOVED\n");
    }
}

You can print out the original vs current if you like.
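
Tying the two pieces together, a self-contained harness might look something like this (the sleep() is just a stand-in for your communication/computation):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t original_mask, current_mask;
    int step;

    /* Cache the affinity we start with. */
    if (sched_getaffinity(0, sizeof(cpu_set_t), &original_mask) == -1) {
        perror("sched_getaffinity");
        return 1;
    }
    for (step = 0; step < 60; step++) {
        sleep(1);    /* stand-in for communication/computation */
        if (sched_getaffinity(0, sizeof(cpu_set_t), &current_mask) == -1) {
            perror("sched_getaffinity");
            return 1;
        }
        if (!CPU_EQUAL(&current_mask, &original_mask))
            printf("affinity changed at step %d\n", step);
    }
    return 0;
}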

Obviously, that is not something you want to do in an actual application - strictly a research tool to see what is happening.

@vitduck
Author

vitduck commented Feb 3, 2023

I would advise against generalizing that NVIDIA advice - that cmd line locks all of your processes to the first NUMA node on each machine. I'm not sure why anyone would want to do that on a machine that has more than one NUMA, so I'm guessing that their container is configured somehow to ensure that makes sense.

Pardon me for not being clear regarding the NGC container. The aforementioned command is just an example using 1 GPU.
The provided wrapper script requires affinity settings from the user via command-line options, such as

--gpu-affinity
--cpu-affinity
--mem-affinity

which translate to CUDA_VISIBLE_DEVICES, --cpunodebind, and --membind, respectively.

Among the benchmarks, the performance of HPL is the most sensitive to the NUMA affinity setting. The HGX variant has a very unusual one, where the 8 GPUs are mapped to NUMA domains 7, 7, 5, 5, 3, 3, 1, 1 on AMD EPYC. I believe Open MPI supports this via --cpu-list, but I had errors with v4.0.x in the past. It seems to be fixed with v5.0 (#6540). I guess NVIDIA had to go with numactl to achieve the same effect.

You usually achieve better results with mpirun --map-by socket --bind-to socket ... as this load balances across the sockets. If your app isn't multi-threaded, then replacing --bind-to socket with --bind-to core can perform a little better.

Yes, besides HPL, other HPC applications such as VASP and QE do show some performance gain when binding to socket, in the case where there are multiple GPUs per socket.

It would be a rather odd kernel that did so. All OMPI does is tell the kernel "schedule this process to execute only on the indicated CPU(s)". The kernel is supposed to honor that request. Even autonuma respects it:

That was an oversight on my part. Thanks for pointing it out. It makes sense for the kernel to respect the user's request.

Thanks for the sample code. We will experiment according to your suggestions.
