Keeping job info in the dstor #144

Closed · artpol84 opened this issue Sep 8, 2016 · 21 comments

@artpol84
Contributor

artpol84 commented Sep 8, 2016

According to the recent investigation in #129 (comment), job info is not going into the dstore.

We need to make sure it sits there.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Any progress on this? As far as LANL is concerned this bug is a blocker on Open MPI 2.1.0.

On KNL with 272 ranks per node, the wasted space is ~272 * nodes * 0xaa0! I can't scale to even 1/8th of the machine without hitting an OOM.
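
For reference, a rough back-of-the-envelope check of that estimate (a sketch only; it assumes the 0xaa0 in the expression above is the per-rank duplicated job-info size in bytes):

```c
/* Rough arithmetic only: evaluate the expression 272 * nodes * 0xaa0 from the
 * comment above for a few node counts (0xaa0 = 2720 bytes, assumed per rank). */
#include <stdio.h>

int main(void)
{
    const unsigned long per_rank = 0xaa0;   /* 2720 bytes */
    const unsigned long ranks    = 272;     /* ranks per KNL node */
    for (unsigned long nodes = 128; nodes <= 1024; nodes *= 2) {
        double mib = (double)(ranks * nodes * per_rank) / (1024.0 * 1024.0);
        printf("%5lu nodes: ~%.0f MiB wasted (272 * nodes * 0xaa0)\n",
               nodes, mib);
    }
    return 0;
}
```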

@jjhursey
Member

I sent @karasevb an email this morning asking for an update.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@jjhursey Thanks Josh! Hopefully this gets fixed soon. With Open MPI master I currently see a net increase in node memory usage with the dstore enabled. Will test again once the fix is ready.

@karasevb
Contributor

@hjelmn @jjhursey I'm working on it. I hope to finish in a couple of days.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@karasevb Once this is complete, it might be worth looking at compressing strings stored in the dstore if they go over a certain length. The pmix.lcpus key on KNL looks like this:

0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271

That would compress very nicely even just using libz's deflate function.
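
A minimal, self-contained sketch of that idea (not PMIx code; it just builds a stand-in for the repetitive pmix.lcpus value and runs it through zlib's compress2(); link with -lz):

```c
/* Sketch: measure how much a highly repetitive dstore value shrinks under
 * zlib. The value below is a stand-in for the pmix.lcpus string shown above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Build "0-271:0-271:...:0-271" with 272 repetitions. */
    const size_t reps = 272;
    char *src = malloc(reps * 6 + 1);
    if (NULL == src) return 1;
    src[0] = '\0';
    for (size_t i = 0; i < reps; i++) {
        strcat(src, i ? ":0-271" : "0-271");
    }

    uLong  src_len = (uLong)strlen(src);
    uLongf dst_len = compressBound(src_len);
    Bytef *dst = malloc(dst_len);
    if (NULL == dst) { free(src); return 1; }

    if (Z_OK == compress2(dst, &dst_len, (const Bytef *)src, src_len,
                          Z_BEST_SPEED)) {
        printf("original %lu bytes -> compressed %lu bytes\n",
               (unsigned long)src_len, (unsigned long)dst_len);
    }
    free(dst);
    free(src);
    return 0;
}
```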

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Hmm, I see you just pack the data. We could kill two birds with one stone (storage space AND network usage) by compressing the string in the buffer ops.

@rhc54
Contributor

rhc54 commented Oct 28, 2016

I think a regex might be more appropriate and actually use less space - in this case, the regex generator we already have would have made it as N:0-271, where N is the number of replications.

I'd need to look in ORTE at how that is generated as that value doesn't look right to me. The local cpus should only be the local ranks on this node.
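
An illustrative sketch of the range-collapsing idea (this is not ORTE's actual regex generator, just a minimal stand-in that detects the uniform case and emits the N:0-271 form described above):

```c
/* Sketch: collapse a value made of identical colon-separated ranges, e.g.
 * "0-271:0-271:...:0-271", into "N:0-271", where N is the repetition count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *collapse_repeated_ranges(const char *value)
{
    char *copy = strdup(value);
    if (NULL == copy) return NULL;

    char *saveptr = NULL;
    char *first = strtok_r(copy, ":", &saveptr);
    if (NULL == first) {
        free(copy);
        return strdup(value);
    }

    size_t count = 1;
    char *tok;
    while (NULL != (tok = strtok_r(NULL, ":", &saveptr))) {
        if (0 != strcmp(tok, first)) {   /* not uniform: keep the original */
            free(copy);
            return strdup(value);
        }
        ++count;
    }

    size_t len = strlen(first) + 32;
    char *out = malloc(len);
    if (NULL != out) {
        snprintf(out, len, "%zu:%s", count, first);
    }
    free(copy);
    return out;
}

int main(void)
{
    char *r = collapse_repeated_ranges("0-271:0-271:0-271:0-271");
    if (NULL != r) {
        printf("%s\n", r);   /* prints "4:0-271" */
        free(r);
    }
    return 0;
}
```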

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@rhc54 Yeah, that would work as well :)

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

BTW, lpeers probably needs to be fixed as well:

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271

@rhc54
Contributor

rhc54 commented Oct 28, 2016

yeah, no surprise at that

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Compression would still be helpful for strings that can't be fixed. It was trivial to add to buffer_ops.

@rhc54
Contributor

rhc54 commented Oct 28, 2016

Agreed - my concern is only that we look at launch time as well as footprint, as the two often are tradeoffs. Also, we need to be a little careful about what users expect to be handed, and how it is accessed - e.g., we may need to add a flag to indicate "this data has been compressed" so we uncompress it before handing it back.
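
A minimal sketch of that "compressed" flag idea (hypothetical names, not the PMIx buffer_ops API): compress on pack only above a size threshold, record a flag, and transparently inflate on unpack before handing the value back:

```c
/* Sketch of threshold-based compression with an explicit "compressed" flag. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define COMPRESS_THRESHOLD 512   /* assumption: only bother above this size */

typedef struct {
    uint8_t  compressed;   /* 1 if payload holds deflated data */
    uint32_t raw_len;      /* original length, needed for uncompress() */
    uint32_t len;          /* stored payload length */
    uint8_t *payload;
} packed_blob_t;

static int pack_blob(packed_blob_t *blob, const uint8_t *data, uint32_t len)
{
    blob->raw_len = len;
    if (len > COMPRESS_THRESHOLD) {
        uLongf clen = compressBound(len);
        uint8_t *c = malloc(clen);
        if (NULL != c &&
            Z_OK == compress((Bytef *)c, &clen, (const Bytef *)data, len) &&
            clen < len) {
            blob->compressed = 1;
            blob->len = (uint32_t)clen;
            blob->payload = c;
            return 0;
        }
        free(c);
    }
    /* Fall back to storing the raw bytes. */
    blob->compressed = 0;
    blob->len = len;
    blob->payload = malloc(len);
    if (NULL == blob->payload) return -1;
    memcpy(blob->payload, data, len);
    return 0;
}

static uint8_t *unpack_blob(const packed_blob_t *blob, uint32_t *len_out)
{
    uint8_t *out = malloc(blob->raw_len);
    if (NULL == out) return NULL;
    if (blob->compressed) {
        uLongf rlen = blob->raw_len;
        if (Z_OK != uncompress((Bytef *)out, &rlen,
                               (const Bytef *)blob->payload, blob->len)) {
            free(out);
            return NULL;
        }
        *len_out = (uint32_t)rlen;
    } else {
        memcpy(out, blob->payload, blob->len);
        *len_out = blob->len;
    }
    return out;
}
```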

@artpol84
Contributor Author

I think compression is an orthogonal solution here; let's not mix the two. Hopefully we will have this part ready for testing next week.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@artpol84 Agreed. Just throwing it out there as we need to get the memory footprint down as much as possible.

@jjhursey
Member

Maybe we can open another issue to track the compression of values? Then we can continue the conversation/development there.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@jjhursey Sure. Will open that now.

@jjhursey
Member

@karasevb @artpol84 Any update on this issue?

@karasevb
Contributor

@jjhursey Final preparations for the PR are underway; it will be presented today.

@karasevb
Contributor

@jjhursey Sorry, I still need to fix some problems; it will take some more time.

@kawashima-fj
Contributor

I re-evaluated the memory footprint as a follow-up to #129.

(c) Before "keeping job info in the dstor"
Open MPI master open-mpi/ompi@277c319 (26 Aug.) + PMIx 2.0a embedded in OMPI
(Same as (c) of #129)

[graph: memory footprint per node, (c) Aug. 26 build]

(d) After "keeping job info in the dstor"
Open MPI master open-mpi/ompi@b2e36f0 (2 Dec.) + PMIx 2.0a embedded in OMPI

[graph: memory footprint per node, (d) Dec. 2 build]

The environment and conditions of the evaluation are the same as in #129. The graphs show the memory footprint per node (orted + 16 clients + shared memory).

The memory footprint of the PMIx client processes (between the red line and the blue line in the graphs) is greatly improved. Thank you for your great work!

@jladd-mlnx

@karasevb Well done!!
