
P vs E cores in Open MPI #11345

Open
ggouaillardet opened this issue Jan 26, 2023 · 27 comments
Comments

@ggouaillardet
Contributor

I just saw this question on Stack Overflow:

https://stackoverflow.com/questions/75240988/openmpi-and-ubuntu-22-04-support-for-using-all-e-and-p-cores-on-12th-gen-intel-c

TL;DR: on a system with 8 P cores (2 threads each) and 8 E cores (1 thread each), is there an (ideally user-friendly) way to tell Open MPI to use only the P cores?

@bgoglin what kind of support is provided by hwloc with respect to P vs E cores?

@rhc54
Contributor

rhc54 commented Jan 26, 2023

I remember raising this while at Intel - IIRC, the answer was "nobody should be using these processors for MPI". They are not really designed for that purpose. The best we could devise was to use the "pe-list" option to select only the p-cores, as the e-cores are pretty useless for this application. It's a workaround, but probably your best answer if you insist on using such processors for HPC.

My guess is that someone is just trying to run code on a laptop for test purposes - in which case, restricting to the p-cores is probably just fine.

@ggouaillardet
Contributor Author

I am fine with using only the P cores for Open MPI.

I do not have access to such a processor, and I do not know how hwloc presents it to Open MPI. Is it seen as an 8 (P) core system, or as a 12 (8P + 4E) core system?

@ggouaillardet
Contributor Author

FWIW, I asked the user to run mpirun --display-map -np 1 true to check whether Open MPI sees the E cores.

@rhc54
Contributor

rhc54 commented Jan 26, 2023

I honestly don't know how it is presented. I couldn't get the processor team to have any interest in hwloc support back then. The processor was designed in partnership with Microsoft specifically for Windows (which MS custom optimized for it), and MS had no interest in hwloc support.

I'm guessing hwloc should still be able to read something on it anyway. If they have hwloc on that box, then just have them run lstopo and provide the output from it - that's all we get anyway.

@ggouaillardet
Contributor Author

$ mpirun --display-map -n 1 true

 Data for JOB [3978,1] offset 0 Total slots allocated 12

========================   JOB MAP ========================

Data for node: xxxxxxxx Num slots: 12   Max slots: 0    Num procs: 1
   Process OMPI jobid: [3978,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..]
============================================================= 

There is something fishy here: according to the description, it should be 16 cores (8+8, not the 8+4 I wrote earlier) and 24 threads (8*2+8), but Open MPI does not report this.

I am now clarifying this, and I guess I'll then have to wait for @bgoglin's insights.

@bgoglin
Contributor

bgoglin commented Jan 26, 2023

Hello. hwloc reports different "cpukinds" (a cpuset + some info). We don't tell you explicitly which one is P or E (sometimes there are already 3 kinds on ARM), but kinds are reported in an order that goes from power-efficient cores to power-hungry cores. This is in hwloc/cpukinds.h since hwloc 2.4. You likely want to call hwloc_cpukinds_get_nr(topo, 0) to get the number of kinds, and then call hwloc_cpukinds_get_info(topo, nr-1, cpuset, NULL, NULL, NULL, 0) to get your (pre-allocated) cpuset filled with the list of power-hungry cores. This should work on Windows, Mac and Linux, on ARM, Intel AlderLake and M1, although the way we detect heterogeneity is completely different in all these cases.
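For anyone landing here, a minimal standalone sketch of that enumeration, assuming hwloc >= 2.4 and linking with -lhwloc (this is only an illustration of the cpukinds API described above, not what Open MPI itself does):

#include <stdio.h>
#include <hwloc.h>
#include <hwloc/cpukinds.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* number of CPU kinds; typically 0 on machines where no heterogeneity is reported */
    int nr = hwloc_cpukinds_get_nr(topo, 0);
    printf("%d cpu kind(s) reported\n", nr);

    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    for (int i = 0; i < nr; i++) {
        int efficiency = -1;
        /* fill cpuset with the PUs of kind i; efficiency stays -1 when unknown */
        hwloc_cpukinds_get_info(topo, i, cpuset, &efficiency, NULL, NULL, 0);
        char pus[256];
        hwloc_bitmap_list_snprintf(pus, sizeof(pus), cpuset);
        printf("kind %d: efficiency %d, PUs %s\n", i, efficiency, pus);
    }

    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}

On a machine where efficiencies are properly exposed, the last kind printed should correspond to the P-core hardware threads.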

@ggouaillardet
Contributor Author

Thanks @bgoglin, I will experiment on an M1 (since that is all I have) to see how I can "hide" the E cores from Open MPI.

@ggouaillardet
Contributor Author

@bgoglin just to be clear, does hwloc guarantee that the highest cpukind (i.e. hwloc_cpukinds_get_nr(...) - 1) is for the power-hungry (e.g. P) cores?

@bgoglin
Contributor

bgoglin commented Jan 26, 2023

 * If hwloc fails to rank any kind, for instance because the operating
 * system does not expose efficiencies and core frequencies,
 * all kinds will have an unknown efficiency (\c -1),
 * and they are not indexed/ordered in any specific way.

So when you call get_info(), pass an "int efficiency" in hwloc_cpukinds_get_info(topo, nr-1, cpuset, &efficiency, NULL, NULL, 0) and check whether you get -1 in there.
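So a defensive sketch of that check, plus one way a process could restrict itself to the highest-ranked kind; the hwloc_set_cpubind call is only an illustration of what one might do with the resulting cpuset, not how Open MPI binds processes:

#include <hwloc.h>
#include <hwloc/cpukinds.h>

/* Return the cpuset of the most power-hungry kind, or NULL if kinds are
   absent or unranked. Caller frees the bitmap. Assumes topo is loaded. */
static hwloc_bitmap_t performance_cpuset(hwloc_topology_t topo)
{
    int nr = hwloc_cpukinds_get_nr(topo, 0);
    if (nr <= 1)
        return NULL;                        /* homogeneous or unknown machine */

    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    int efficiency = -1;
    if (hwloc_cpukinds_get_info(topo, nr - 1, cpuset, &efficiency, NULL, NULL, 0) < 0
        || efficiency == -1) {              /* kinds exist but could not be ranked */
        hwloc_bitmap_free(cpuset);
        return NULL;
    }
    return cpuset;
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_bitmap_t pcores = performance_cpuset(topo);
    if (pcores) {
        /* illustration only: bind the current process to the "performance" PUs */
        hwloc_set_cpubind(topo, pcores, HWLOC_CPUBIND_PROCESS);
        hwloc_bitmap_free(pcores);
    }

    hwloc_topology_destroy(topo);
    return 0;
}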

@rhc54
Contributor

rhc54 commented Jan 26, 2023

There is something fishy here: according to the description, it should be 16 cores (8+8, not the 8+4 I wrote earlier) and 24 threads (8*2+8), but Open MPI does not report this.

You cannot trust those dots, @ggouaillardet - the print statement isn't that accurate (it's actually rather dumb, to be honest).

how I can "hide" the E cores from Open MPI.

I already told you - you just have to list the PEs you want to use. It would take a significant change to PRRTE (or ORTE for an older version) to try and do this internally. I doubt it would be worth the effort - like I said, these chips are not intended for HPC, and won't run well in that environment.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 26, 2023
only use the performance cpus when true (default is false)
requires hwloc >= 2.4

Refs. open-mpi#11345

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@liuzheng-arctic

Thanks for helping me post my question here. I didn't intend to do a real HPC job on this laptop, but I want to take advantage of the multiple cores to speed up some data processing (40k+ satellite data files and 200k+ model output files, less than 100M each). The processing is pretty repetitive and is perfect for lazy parallelization. The issue is that Open MPI does not recognize the cores correctly, so I am not sure how it does the scheduling. Open MPI complains when I set -np to more than 12. I don't want to have more threads on a single core, especially an e-core.

It would be great if I could use all 16 cores. If not, having some control over which cores to use would be ideal: for example, using p-cores for faster processing, or e-cores for thermal reasons.

@rhc54
Contributor

rhc54 commented Jan 26, 2023

Yeah, I kinda figured that was the situation. You have a few simple options:

  • First, add --use-hwthread-cpus to the mpirun cmd line. This will level the playing field between the processor types.

  • By default, you'll bind each proc to a single thread, which means you can run up to 24 procs (the 8 p-cores have 16 threads, plus there are 8 single-thread e-cores). If you need 2 threads/proc, then tell mpirun to bind 2 cpus/rank: --map-by hwthread:pe=2. This will bind one proc to each p-core, and one proc to each pair of e-cores. That will limit you to 12 procs, though.

  • If you want to run up to 16 procs, with uneven numbers of threads, then you could try this: mpirun --map-by hwthread:pe=2 -np 8 myapp : --map-by hwthread -np 8 myapp. The first context should use both threads of each p-core, while the second context should use the single thread on each e-core (since the p-cores are all used up). Note that your performance won't be great, as the procs will significantly differ in their behavior.

  • If you want more than 12 procs, you could just tell us not to bind at all: --map-by hwthread:oversubscribe --bind-to none. You lose a touch of performance due to not binding, but then you aren't going after great performance here anyway, and this lets the OS schedule the thread usage. Much simpler, and the OS knows better than anything we can provide how to use the different cores. A combined invocation is sketched just after this list.
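Putting the first and last options together for this laptop, a hedged example (it assumes Open MPI 4.x option spellings and a placeholder program name myapp; with 24 hardware threads visible, -np 16 does not actually oversubscribe, so the qualifier is only a safety net):

$ mpirun --use-hwthread-cpus --map-by hwthread:oversubscribe --bind-to none -np 16 myapp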

@liuzheng-arctic

Thanks for the reply, although I am not sure I follow. What confuses me is that Open MPI (and/or Ubuntu 22.04) can only see 12 cores (12x2 threads = 24), although there are actually 16 cores (8x2 + 8x1 = 24 threads). If it gets the total number of cores wrong, it may mess up the scheduling to the cores too (missing four cores).

If Open MPI can only see 12 cores, I assume mpirun -np 12 myapp should use the 12 cores with one process per core.

What if I want to use all 16 cores, with one process on each? Open MPI complains if I use mpirun -n 16 directly because it only sees 12 cores. If I use mpirun -n 16 --map-by hwthread:oversubscribe --bind-to none myapp, will this be one process per core? I am worried that the OS or Open MPI will only use the 12 cores it can see and put multiple threads on some of them, p-cores or e-cores.

Another question is: why does it only see 12 cores in the first place?

@rhc54
Contributor

rhc54 commented Jan 27, 2023

You are overthinking things 😄

If you simply run mpirun -n 100 --oversubscribe you will launch 100 processes, none of them bound to any particular core. The OS will schedule as many of them at a time as it can fit onto CPUs, cycling time slices across all the procs in some kind of load-balanced manner. It will do this in a way that balances thermal load while providing best possible use of the cpu cycles.

You shouldn't care what hyperthread gets used for any given time slice by whatever process is being executed during that time slice. The OS will manage all that for you. This is what the OS does really well.

Trying to do any better than that is a waste of your energy. It doesn't matter what mpirun "sees" or doesn't "see". Its sole purpose is to start N procs, and then get out of the way and let the OS do its thing. Asking mpirun to try and optimize placement and binding on this kind of processor will only yield worse results.

@liuzheng-arctic

Thanks, @rhc54. I was worried that the OS was confused too, because Ubuntu 22.04 (5.15.79.1-microsoft-standard-WSL2) also sees only 12 cores (24 threads), although the host Windows 11 recognizes the CPU correctly.
lscpu returns the following:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12800HX
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
BogoMIPS: 4608.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 576 KiB (12 instances)
L1i: 384 KiB (12 instances)
L2: 15 MiB (12 instances)
L3: 25 MiB (1 instance)
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; Enhanced IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected

@rhc54
Contributor

rhc54 commented Jan 27, 2023

Understood. The problem is that we cannot do any better than your OS is doing. No matter what options you pass to mpirun, I'm limited to what the OS thinks is present.

What you are seeing is the difference between Windows (being optimized to work with this architecture) and Ubuntu (which isn't). There is nothing anyone can do about that, I'm afraid - unless someone at Ubuntu wants to optimize the OS for this architecture, which I very much doubt.

Your only other option would be to switch to Microsoft's MPI, which operates under Windows. I don't know their licensing structure and it has been a long time since I heard from that team (so this product might not even still exist) - but if you can get a copy, that would support this chip.

Otherwise, the best you can do is like I said - just run it oversubscribed (with however many procs you think can run effectively - probably an experiment) and let the OS do the best it can.

@ggouaillardet
Contributor Author

Are you running native Linux?
Or are you running Linux in a virtual machine (or WSL)?

If the latter, that could explain why hwloc believes this is a 12-core, 24-hyperthread system:

Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full

@liuzheng-arctic

liuzheng-arctic commented Jan 27, 2023

@ggouaillardet I am using Ubuntu 22.04 in WSL2. The kernel version is 5.15.79.1-microsoft-standard-WSL2.
I checked again using a Ubuntu 22.04 USB boot drive, and it indeed sees all 16 cores. I thought WSL only limited the amount of RAM, not the number of cores.

@bgoglin
Contributor

bgoglin commented Jan 27, 2023

Last time I saw hwloc running on WSL on Windows, Windows/Linux was reporting correct information in sysfs, hence hwloc did too. But I never tried on a hybrid machine. What's wrong above is lscpu, either because Windows/Linux reports something wrong, or because lscpu isn't hybrid-aware yet: it sees 24 threads in the socket and 2 threads in the first core, and decides that means 24/2=12 cores. Running lstopo would clarify this, or at least cat /sys/devices/system/cpu/cpu*/topology/thread_siblings

@rhc54
Contributor

rhc54 commented Jan 27, 2023

I'm not sure I agree with the assertion that lscpu is doing something "wrong". WSL isn't "limiting" the number of cores - it is simply logically grouping the available hyperthreads into two-HT "cores" - i.e., you have 12 "cores", each with 2 HTs. Native Ubuntu is logically grouping them into 8 "cores" each with 2 HTs, and 8 "cores" each with 1 HT. It all just depends on how the OS intends to manage/schedule the HTs. Neither is "correct" or "wrong" - they are just grouped differently.

If you have hyperthreading enabled (which you kinda have to do with this processor), it really won't matter as the kernel scheduling will be done at the HT level - so how they are grouped is irrelevant. What matters is if and how the kernel is scheduling the p-cores differently from the e-cores.

IIRC, Windows was customized to put compute-heavy process threads on the p-cores, and lighter operations on the e-cores. So as your job continued executing, it would migrate the more intense operations to the p-cores (e.g., your computational threads) and the less intense ones to the e-cores (e.g., threads performing disk IO, progress threads that tend to utilize little time, system daemons).

I'm not sure how Ubuntu is optimized - probably not as well customized, so it may just treat everything as equal and schedule a thread for execution on whatever hyperthread is available. Or it may do something similar to what Windows is doing.

Point being: the processor was designed with the expectation that the OS would migrate process threads to the "proper" HT for the kind of operations it is performing. In this architecture, the worst thing you can do is to try and preempt that migration. Best to just let the OS do its job. You just need to add the "oversubscribe" qualifier to the --map-by directive so that mpirun won't error out if you launch more procs than there are "cores" (or HTs if you pass the --use-hwthread-cpus option).

@liuzheng-arctic

@bgoglin I think you are right that it might not be due to WSL limiting the number of available cores. If WSL limited the number of cores, it shouldn't see 24 threads. But lscpu returns the correct number of cores with the Ubuntu 22.04 USB boot drive. So something else is wrong, and it affects the number of cores available to Open MPI under WSL.

@ggouaillardet
Contributor Author

The number of threads (24) is correct, so WSL is not limiting anything.
But the topology might be altered (IIRC I saw that with KVM or VirtualBox): it shows 12x2 instead of 8x2+8.

Here is the info I requested on SO:

$ lstopo-no-graphics --version
$ lstopo-no-graphics --cpukinds
$ lstopo-no-graphics --no-io --of xml

@rhc54
Contributor

rhc54 commented Jan 27, 2023

it affects the number of cores available to OpenMPI under WSL

We seem to be spending a lot of time chasing ghosts on this thread, so I'll just crawl back under my rock. There is no limitation being set here. OMPI sees the same number of HTs on each system you have tried. mpirun just needs to be told to consider HTs as independent cpus so it can set oversubscription correctly. You don't want to bind your procs - you need to let the OS manage them for you. That is how the processor was designed to be used.

<me pulling the rock over my head>

@liuzheng-arctic

liuzheng-arctic commented Jan 27, 2023

lstopo_output.log
@bgoglin
$ lstopo-no-graphics --version returns:

lstopo-no-graphics 2.7.0
$ lstopo-no-graphics --cpukinds returns nothing.
The output of $ lstopo-no-graphics --no-io --of xml is attached in the log file.

@liuzheng-arctic

@rhc54 Thanks a lot for the explanations! I think I am more at ease using Open MPI on this machine now.

@ggouaillardet
Contributor Author

Thanks, I confirm hwloc sees a single socket with 12 cores and 2 hyperthreads per core, so I guess WSL does not "pass through" the actual processor topology.

So I am afraid there is no trivial way to use the 8P + 8E cores (e.g. ignore the second hyperthread on the P cores).
Bottom line: mpirun --use-hwthread-cpus --bind-to none and let the OS (Linux via WSL) schedule the MPI tasks.

@liuzheng-arctic

@ggouaillardet Thanks a ton! This helps a lot.
