Create a custom launcher for systems which have hyper threading enabled #74
To give a bit more context: on Vega, the autodetected CPU info counts hardware threads as CPUs, since the nodes have hyperthreading enabled (two hardware threads per physical core). As a result, our tests launch one task per hardware thread. This works, but is potentially quite inefficient: HPC tasks often don't benefit from hyperthreading, and in fact the extra overhead slows them down.
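As a sketch of the difference (the flags and application name here are illustrative, assuming a node layout like Vega's: 2 sockets x 64 cores x 2 hwthreads = 256 "CPUs"):

```bash
# What the test suite effectively does now: one task per hardware thread
srun --ntasks-per-node=256 ./mpi_app

# What would avoid hyperthreading: one task per physical core only
srun --ntasks-per-node=128 --hint=nomultithread ./mpi_app
```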
I'm now trying to run manually with fewer tasks to see if this gets better. On Snellius I have a reference run, and on Vega I manually tried an equivalent launch with fewer tasks, expecting this would give me the same performance as on Snellius. It did not.
I'm not 100% sure why. I guess it might have to do with the fact that the 64 threads (and a few more, since TF also has some extra threads) are free to roam across both hyperthreads of each core: in theory, the OS could momentarily schedule two threads on the same physical core, where they would be in each other's way. I've been trying interactively to see if I could launch these processes with a binding that excludes one hyperthread per core, but have not been able to find the correct options. I expected the binding options I tried to select half of the available hyperthreads per socket, but instead, rank 0 reports a binding to all 128 hyperthreads on the socket.
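For reference, the option families involved look roughly like this (illustrative only; these are not the exact invocations from my attempts, and ./tf_run is a placeholder):

```bash
# srun: one task per socket, each limited to one hwthread per core
srun --ntasks-per-node=2 --cpus-per-task=64 --hint=nomultithread ./tf_run

# Open MPI mpirun: one rank per socket, each bound to 64 processing elements
mpirun --map-by ppr:1:socket:PE=64 --bind-to core ./tf_run
```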
The only thing I've found to work is the somewhat confusing numactl invocation with a relative core set. Here, +0-63 means "bind to cores 0-63, relative to the core set this process has access to". Since we bind each process to one socket, each process has access to the 128 hyperthreads on that socket, and numactl then binds each process to the first 64 of those 128 hyperthreads. That means each process is effectively bound to only one of each pair of hyperthreads, i.e. one hardware thread per physical core; see the sketch below.
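Roughly, the working launch looks like this (the exact srun flags may differ; only the numactl part and its relative +0-63 notation are confirmed above, and ./tf_run is a placeholder):

```bash
# Bind each task to one socket (128 hwthreads), then let numactl restrict
# it to the first 64 hwthreads of its own cpuset ('+' = relative to the
# current cpuset), i.e. one hwthread per physical core:
srun --ntasks-per-node=2 --cpu-bind=sockets \
    numactl --physcpubind=+0-63 ./tf_run
```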
Disappointingly, performance is essentially the same as before. Unexpected, but it might simply mean there is some other factor that causes the TF performance on Snellius to be better (memory bandwidth?). In any case, maybe TensorFlow is an application that actually benefits a bit from hyperthreading, since it tends to launch a few more threads than you specify. I think we should do a similar analysis as above for GROMACS to get a picture of a pure MPI run. I'm guessing the increased overhead of 256 instead of 128 MPI processes per node is really not great for performance...
Btw, the relation with a custom launcher is as follows: even if we don't want to use the hyperthreading cores, we'll still need SLURM to allocate them. That means that for a pure MPI program, we'd want the allocation to cover all hardware threads, but the launch to start only one task per physical core. One thing we overlooked is the scenario of hybrid runs, such as TensorFlow: there, we'd want one task per socket, with the threads pinned to the physical cores of that socket (see the sketch below). Ugh, it would be complicated. Maybe, for now, we just accept the inefficiency, and let the tests run one thread per hardware thread, instead of trying to avoid that...? Let's see how bad things are with GROMACS, and if this would be acceptable...
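To make that concrete, a sketch of what such an allocation-vs-launch split could look like (assumed 128-core/256-hwthread node; the application names are placeholders):

```bash
#!/bin/bash
# SLURM allocates the whole node, i.e. all 256 hardware threads:
#SBATCH --nodes=1
#SBATCH --exclusive

# Pure MPI: launch only one rank per physical core (128, not 256):
srun --ntasks-per-node=128 --hint=nomultithread ./mpi_app

# Hybrid (e.g. TensorFlow): one task per socket, 64 threads each,
# pinned to the physical cores of that socket:
srun --ntasks-per-node=2 --cpus-per-task=64 --hint=nomultithread ./hybrid_app
```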
Now with GROMACS:
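Along these lines (a sketch: gmx_mpi and the input file name are assumptions, not the exact command used):

```bash
# One rank per hardware thread (256/node), one OpenMP thread per rank:
srun --ntasks-per-node=256 gmx_mpi mdrun -s bench.tpr -ntomp 1

# One rank per physical core (128/node):
srun --ntasks-per-node=128 --hint=nomultithread gmx_mpi mdrun -s bench.tpr -ntomp 1
```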
Results in:
Interesting, so the run that uses hyperthreads as cores is still faster. Trying the same thing with a larger benchmark case from here
Interesting, the relative difference is even bigger. Maybe we shouldn't worry too much about one-task-per-hwthread being started after all... I propose we don't put effort into this for now: let's see how it goes with other tests. If we see bad performance compared to non-hyperthreading systems, we can reevaluate and perform comparisons similar to what I have done here.
Always good to keep in mind: this is a test suite, not a benchmark suite. We're not interested in achieving the very best performance on every system. The most important part is that the performance is reproducible and shows us when an installation was not done correctly. E.g. if an architecture optimization was incorrectly taken into account, we'd want to know.
https://github.com/casparvl/test-suite/blob/hyperthreading/eessi/testsuite/hooks.py