
Create a custom launcher for systems which have hyper threading enabled #74

Closed
satishskamath opened this issue Jul 27, 2023 · 4 comments


@satishskamath
Collaborator

satishskamath commented Jul 27, 2023

https://github.com/casparvl/test-suite/blob/hyperthreading/eessi/testsuite/hooks.py

@casparvl
Collaborator

To give a bit more context:

On Vega, the autodetected CPU info is

  "num_cpus": 256,
  "num_cpus_per_core": 2,
  "num_cpus_per_socket": 128,
  "num_sockets": 2

As a result, our assign_one_task_per_compute_unit(self, COMPUTE_UNIT[CPU_SOCKET]) launches 2 processes, with 128 threads each.

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=128
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p cpu
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
export SLURM_EXPORT_ENV=ALL
export OMPI_MCA_pml=ucx
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=64:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64
export SLURM_CPU_BIND=verbose
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 128 --inter-op-parallelism 1

This works, but is potentially quite inefficient: HPC tasks often don't benefit from hyperthreading, and in fact, the extra overhead slows them down.

[ RUN      ] TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 %device_type=cpu /f3217366 @vega:cpu+default
[       OK ] (1/1) TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 %device_type=cpu /f3217366 @vega:cpu+default
P: perf: 44079.05044789011 img/s (r:0, l:None, u:None)

I'm now trying to run manually with fewer threads per task to see if this gets better. On Snellius, with the exact same CPU model (AMD EPYC 7H12, Zen 2), but without hyperthreading, we get:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=64
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
module load TensorFlow/2.11.0-foss-2022a
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=64:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64
export SLURM_CPU_BIND=verbose
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 64 --inter-op-parallelism 1

and

[ RUN      ] TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a %device_type=cpu /5b5c1cb6 @snellius:thin+default
[       OK ] (1/1) TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a %device_type=cpu /5b5c1cb6 @snellius:thin+default
P: perf: 59192.41262205311 img/s (r:0, l:None, u:None)

On Vega, I manually tried

$ cat rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=128
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p cpu
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
export SLURM_EXPORT_ENV=ALL
export OMPI_MCA_pml=ucx
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=64:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64
export SLURM_CPU_BIND=verbose
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 64 --inter-op-parallelism 1

expecting this would give me the same performance as on Snellius. It did not:

Performance: 42531.12996799633 img/s
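
(That is still roughly 28% below the ~59,000 img/s measured on Snellius.)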

I'm not 100% sure why. I guess it might have to do with the fact that the 64 threads (plus a few extra, since TF also spawns some additional threads) are free to roam across both hyperthreads of each core. In theory, the OS could momentarily schedule two threads onto the same physical core, where they would get in each other's way.
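
One way to check this while a run is active would be to look at which hardware thread each thread of the Python processes was last scheduled on (a diagnostic sketch, to be run on the compute node during the job):

# PSR is the hardware thread each thread last ran on. If two threads of the
# same rank report SMT siblings (e.g. N and N+128, if the second hardware
# thread of each core is numbered in the upper half), they share a physical core.
ps -eLo pid,tid,psr,comm | grep python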

I've been trying interactively to launch the latter processes with a binding that excludes one hyperthread per core, but I have not been able to find the correct options. I expected --cpu-list to be able to do this, but it doesn't seem to be respected (a known issue in OpenMPI v4.x). Similarly, when specifying a rankfile and using --use-hwthread-cpus, I would have expected

$ cat myrankfile
rank 0=+n0 slot=0:0-63
rank 1=+n0 slot=1:0-63

to select half of the available hyperthreads per socket, but instead, rank 0 reports the following binding (to all 128 hyperthreads on the socket):

[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]

The only thing I've found that works is the somewhat confusing:

mpirun -np 2 --report-bindings numactl --physcpubind=+0-63 python tf_test.py --device cpu --intra-op-parallelism 64 --inter-op-parallelism 1

+0-63 means "bind to cores 0-63, relative to the core set this process has access to". Since we set

export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64

each process has access to the 128 hyperthreads of one socket. numactl then binds each process to the first 64 of those 128 hyperthreads, i.e. to one hyperthread of each physical core. Disappointingly, performance is essentially the same as before:

Performance: 42239.21797731072 img/s

Unexpected, but it might simply mean that something else causes the TF performance on Snellius to be better (memory bandwidth?).
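
As a sanity check of the relative (+) CPU numbering of numactl, the effective CPU set per rank can be printed under the same mapping policy (a sketch, assuming OMPI_MCA_rmaps_base_mapping_policy=node:PE=64 is still exported):

# With node:PE=64 each rank is confined to one socket's 128 hardware threads;
# the +0-63 binding should then leave each rank with only 64 of them.
mpirun -np 2 numactl --physcpubind=+0-63 grep Cpus_allowed_list /proc/self/status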

In any case, maybe TensorFlow is an application that actually benefits a bit from hyperthreading: it tends to launch a few more threads than you specify as intra-op threads (e.g. asking for 64 threads, I get about 10 extra or so), and those extra threads may be what makes the spare hyperthreads useful.

I think we should do a similar analysis as above for GROMACS to get a picture of a pure MPI run. I'm guessing the increased overhead of 256 instead of 128 MPI processes per node is really not great for performance...

@casparvl
Collaborator

Btw, the relation with a custom launcher is as follows: even if we don't want to use the hyperthreads, we still need SLURM to allocate them. That means that for a pure MPI program, we'd want e.g. --ntasks-per-node=256 on Vega, but then want the parallel launcher to launch only 128 tasks. The idea was that we could do this with a custom launcher.

One thing we overlooked is the scenario of hybrid runs, such as TensorFlow. There, we'd want --ntasks-per-node=2 --cpus-per-task=128, but the command line argument that specifies the number of threads per task (intra-op parallelism) should be divided by 2. Note that in this case we would not want the launcher to cut the specified number of tasks per node by 2, as we really want 2 tasks per node in this case, not 1.
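
For the pure-MPI case, a minimal sketch of such a wrapper could look like the following (purely illustrative; the wrapper name is hypothetical and the SMT factor of 2 is hard-coded):

#!/bin/bash
# mpirun-nosmt: SLURM allocates all hardware threads (e.g. --ntasks-per-node=256
# on Vega), but we launch only one rank per physical core, i.e. half the tasks.
SMT_FACTOR=2
NP=$(( SLURM_NTASKS / SMT_FACTOR ))
exec mpirun -np "${NP}" "$@"

The hybrid case would need the opposite behaviour: keep the number of tasks and divide the per-task thread count instead, which is exactly why a single generic launcher gets awkward.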

Ugh, it would be complicated. Maybe, for now, we just accept the inefficiency, and let the tests run one thread per hardware thread, instead of trying to avoid that...? Let's see how bad things are with GROMACS, and if this would be acceptable...

@casparvl
Collaborator

Now with GROMACS:

[eucasparvl@vglogin0007 rfm_gromacs]$ cat rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=256
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p cpu
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
export SLURM_EXPORT_ENV=ALL
export OMPI_MCA_pml=ucx
module load GROMACS/2020.4-foss-2020a-Python-3.8.2
export OMP_NUM_THREADS=1
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/Crambin/benchmark.tpr
mpirun -np 128 gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1
[eucasparvl@vglogin0007 rfm_gromacs]$ cat rfm_job2.sh
#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=256
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job2.out
#SBATCH --error=rfm_job2.err
#SBATCH --time=0:30:0
#SBATCH -p cpu
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
export SLURM_EXPORT_ENV=ALL
export OMPI_MCA_pml=ucx
module load GROMACS/2020.4-foss-2020a-Python-3.8.2
export OMP_NUM_THREADS=1
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/Crambin/benchmark.tpr
mpirun -np 256 gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1

Results in:

[eucasparvl@vglogin0007 rfm_gromacs]$ cat  rfm_job.err | grep Performance
Performance:      323.741        0.074
[eucasparvl@vglogin0007 rfm_gromacs]$ cat  rfm_job2.err | grep Performance
Performance:      356.430        0.067

Interesting, so the run that treats each hardware thread as a core (256 ranks) is still faster (the first Performance column is ns/day).

Trying the same thing with a larger benchmark case from here

[eucasparvl@vglogin0007 rfm_gromacs]$ cat  rfm_job.err | grep Performance
Performance:       14.187        1.692
[eucasparvl@vglogin0007 rfm_gromacs]$ cat  rfm_job2.err | grep Performance
Performance:       17.664        1.359

Interesting, the relative difference is even bigger.
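
(Roughly: 356.430 / 323.741 ≈ 1.10, i.e. ~10% faster with 256 ranks for the small Crambin case, versus 17.664 / 14.187 ≈ 1.25, i.e. ~25% faster for the larger case.)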

Maybe we shouldn't worry too much about one task per hardware thread being started after all... I propose we don't put effort into this for now: let's see how it goes with other tests. If we see bad performance compared to non-hyperthreading systems, we can reevaluate and do comparisons similar to the ones I did here.

@casparvl
Collaborator

Always good to keep in mind: this is a test suite, not a benchmark suite. We're not interested in achieving the very best performance on every system. The most important part is that the performance is reproducible and shows us when an installation was not done correctly. E.g. if the target architecture was not taken into account correctly when optimizing an installation, we'd want to know.
