Multiple multicore executions overlap #1248

Closed · Nortamo opened this issue Mar 7, 2022 · 3 comments

Labels: BLOCKER (Critical to be fixed in the next possible release), bug, Target 3.x

Comments

Nortamo commented Mar 7, 2022

Background information

Simultaneous executions that each request multiple cores receive overlapping CPU bindings.
I am wondering whether this is expected or a bug.

  • prrte version v2.0.2
  • pmix version v4.1.2

Please describe the system on which you are running

  • Operating system/version: RHEL 7.9
  • Computer hardware: Bull Sequana XH2000, dual-socket AMD EPYC 7402
  • Network type: InfiniBand HDR200

Details of the problem

When launching multiple executions with prun, their CPU bindings overlap.

shell$ prun -np 1 --map-by node:pe=4 ./a.out

A single run is placed on cores [0-3]. If another job is launched with the same settings while the first one is still running, it is placed on cores [1-4] rather than on a non-overlapping range. Otherwise the affinity mapping works perfectly.

Full reproduction:

prun -np 1 --map-by node:pe=4 --display map-devel ./a.out &
prun -np 1 --map-by node:pe=4 --display map-devel ./a.out &

which results in:

process on node1 with cores Cpus_allowed_list: 0-3,128-131
process on node1 with cores Cpus_allowed_list: 1-4,129-132
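
The ./a.out used here is not shown in the report. For anyone trying to reproduce the overlap, any long-running binary will do; a minimal, hypothetical stand-in (not part of the original report) is a script that prints the core list the launcher bound it to and then stays alive long enough for the second prun job to start:

#!/bin/sh
# affinity_test.sh - hypothetical stand-in for ./a.out
# Print which cores this process was bound to (taken from /proc/self/status),
# then sleep so a second prun job can be launched while this one is running.
echo "process on $(hostname) with cores $(grep Cpus_allowed_list /proc/self/status)"
sleep 120

Launching that script twice with the prun commands above and comparing the two Cpus_allowed_list lines shows the overlapping ranges.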

Is this expected behavior, or am I missing some option or configuration that would make the placements non-overlapping? Based on the --display map-devel output, requesting multiple cpus per process does not seem to affect the number of slots used.

The same behavior is observed both when running locally and when using the Slurm (21.08) integration (prrte is started within a job allocation, and srun launches prte on each node).
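
For reference, a rough sketch of the Slurm-based reproduction, assuming the usual persistent-DVM workflow (the exact prte/prun option names can differ between PRRTE releases, so treat them as assumptions rather than a verified recipe):

# inside an existing Slurm allocation, e.g. salloc --nodes=1 --exclusive
prte --daemonize          # start the DVM; daemons are launched via srun
prun -np 1 --map-by node:pe=4 --display map-devel ./a.out &
prun -np 1 --map-by node:pe=4 --display map-devel ./a.out &
wait
pterm                     # shut the DVM down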


rhc54 commented Mar 7, 2022

Hmmm...that is clearly a bug. Will investigate.


rhc54 commented Mar 9, 2022

Found the bug - still working on it.

rhc54 added the bug, BLOCKER (Critical to be fixed in the next possible release), and Target 3.x labels on Apr 24, 2022

rhc54 commented Jul 15, 2022

Sorry for the lengthy delay - retirement has its privileges :-)

This is fixed in #1383

rhc54 closed this as completed on Jul 15, 2022