Keeping job info in the dstor #144

Closed · artpol84 opened this issue Sep 8, 2016 · 21 comments

@artpol84
Contributor

artpol84 commented Sep 8, 2016

According to the recent investigation in #129 (comment), job info is not going into the dstore.

We need to make sure it sits there.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Any progress on this? As far as LANL is concerned this bug is a blocker on Open MPI 2.1.0.

On KNL with 272 ranks per node, the wasted space is ~272 * nodes * 0xaa0! I can't scale to even 1/8th of the machine without hitting an OOM.
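
For reference, a rough back-of-the-envelope check of that estimate (a sketch only; it assumes the 0xaa0 in the expression above is the per-rank duplicated job-info size in bytes):

```c
/* Rough arithmetic only: evaluate the expression 272 * nodes * 0xaa0 from the
 * comment above for a few node counts (0xaa0 = 2720 bytes, assumed per rank). */
#include <stdio.h>

int main(void)
{
    const unsigned long per_rank = 0xaa0;   /* 2720 bytes */
    const unsigned long ranks    = 272;     /* ranks per KNL node */
    for (unsigned long nodes = 128; nodes <= 1024; nodes *= 2) {
        double mib = (double)(ranks * nodes * per_rank) / (1024.0 * 1024.0);
        printf("%5lu nodes: ~%.0f MiB wasted (272 * nodes * 0xaa0)\n",
               nodes, mib);
    }
    return 0;
}
```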

@jjhursey
Member

I sent @karasevb an email this morning asking for an update.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@jjhursey Thanks Josh! Hopefully this gets fixed soon. With Open MPI master I currently see a net increase in node memory usage with the dstore enabled. Will test again once the fix is ready.

@karasevb
Contributor

@hjelmn @jjhursey I'm working on it. I hope to finish in a couple of days.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@karasevb Once this is complete, it might be worth looking at compressing strings stored in the dstore if they go over a certain length. The pmix.lcpus key on KNL looks like this:

0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271:0-271

That would compress very nicely even just using libz's deflate function.
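
A minimal, self-contained sketch of that idea (not PMIx code; it just builds a stand-in for the repetitive pmix.lcpus value and runs it through zlib's compress2(); link with -lz):

```c
/* Sketch: measure how much a highly repetitive dstore value shrinks under
 * zlib. The value below is a stand-in for the pmix.lcpus string shown above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Build "0-271:0-271:...:0-271" with 272 repetitions. */
    const size_t reps = 272;
    char *src = malloc(reps * 6 + 1);
    if (NULL == src) return 1;
    src[0] = '\0';
    for (size_t i = 0; i < reps; i++) {
        strcat(src, i ? ":0-271" : "0-271");
    }

    uLong  src_len = (uLong)strlen(src);
    uLongf dst_len = compressBound(src_len);
    Bytef *dst = malloc(dst_len);
    if (NULL == dst) { free(src); return 1; }

    if (Z_OK == compress2(dst, &dst_len, (const Bytef *)src, src_len,
                          Z_BEST_SPEED)) {
        printf("original %lu bytes -> compressed %lu bytes\n",
               (unsigned long)src_len, (unsigned long)dst_len);
    }
    free(dst);
    free(src);
    return 0;
}
```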

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Hmm, I see you just pack the data. We could kill two birds with one stone (storage space AND network usage) by compressing the string in the buffer ops.

@rhc54
Contributor

rhc54 commented Oct 28, 2016

I think a regex might be more appropriate and actually use less space - in this case, the regex generator we already have would have made it as N:0-271, where N is the number of replications.

I'd need to look in ORTE at how that is generated as that value doesn't look right to me. The local cpus should only be the local ranks on this node.
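
An illustrative sketch of the range-collapsing idea (this is not ORTE's actual regex generator, just a minimal stand-in that detects the uniform case and emits the N:0-271 form described above):

```c
/* Sketch: collapse a value made of identical colon-separated ranges, e.g.
 * "0-271:0-271:...:0-271", into "N:0-271", where N is the repetition count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *collapse_repeated_ranges(const char *value)
{
    char *copy = strdup(value);
    if (NULL == copy) return NULL;

    char *saveptr = NULL;
    char *first = strtok_r(copy, ":", &saveptr);
    if (NULL == first) {
        free(copy);
        return strdup(value);
    }

    size_t count = 1;
    char *tok;
    while (NULL != (tok = strtok_r(NULL, ":", &saveptr))) {
        if (0 != strcmp(tok, first)) {   /* not uniform: keep the original */
            free(copy);
            return strdup(value);
        }
        ++count;
    }

    size_t len = strlen(first) + 32;
    char *out = malloc(len);
    if (NULL != out) {
        snprintf(out, len, "%zu:%s", count, first);
    }
    free(copy);
    return out;
}

int main(void)
{
    char *r = collapse_repeated_ranges("0-271:0-271:0-271:0-271");
    if (NULL != r) {
        printf("%s\n", r);   /* prints "4:0-271" */
        free(r);
    }
    return 0;
}
```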

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@rhc54 Yeah, that would work as well :)

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

BTW, lpeers probably needs to be fixed as well:

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271

@rhc54
Contributor

rhc54 commented Oct 28, 2016

yeah, no surprise at that

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

Compression would still be helpful for strings that can't be fixed. It was trivial to add to buffer_ops.

@rhc54
Contributor

rhc54 commented Oct 28, 2016

Agreed - my concern is only that we look at launch time as well as footprint, as the two often are tradeoffs. Also, we need to be a little careful about what users expect to be handed, and how it is accessed - e.g., we may need to add a flag to indicate "this data has been compressed" so we uncompress it before handing it back.
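
A minimal sketch of that "compressed" flag idea (hypothetical names, not the PMIx buffer_ops API): compress on pack only above a size threshold, record a flag, and transparently inflate on unpack before handing the value back:

```c
/* Sketch of threshold-based compression with an explicit "compressed" flag. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define COMPRESS_THRESHOLD 512   /* assumption: only bother above this size */

typedef struct {
    uint8_t  compressed;   /* 1 if payload holds deflated data */
    uint32_t raw_len;      /* original length, needed for uncompress() */
    uint32_t len;          /* stored payload length */
    uint8_t *payload;
} packed_blob_t;

static int pack_blob(packed_blob_t *blob, const uint8_t *data, uint32_t len)
{
    blob->raw_len = len;
    if (len > COMPRESS_THRESHOLD) {
        uLongf clen = compressBound(len);
        uint8_t *c = malloc(clen);
        if (NULL != c &&
            Z_OK == compress((Bytef *)c, &clen, (const Bytef *)data, len) &&
            clen < len) {
            blob->compressed = 1;
            blob->len = (uint32_t)clen;
            blob->payload = c;
            return 0;
        }
        free(c);
    }
    /* Fall back to storing the raw bytes. */
    blob->compressed = 0;
    blob->len = len;
    blob->payload = malloc(len);
    if (NULL == blob->payload) return -1;
    memcpy(blob->payload, data, len);
    return 0;
}

static uint8_t *unpack_blob(const packed_blob_t *blob, uint32_t *len_out)
{
    uint8_t *out = malloc(blob->raw_len);
    if (NULL == out) return NULL;
    if (blob->compressed) {
        uLongf rlen = blob->raw_len;
        if (Z_OK != uncompress((Bytef *)out, &rlen,
                               (const Bytef *)blob->payload, blob->len)) {
            free(out);
            return NULL;
        }
        *len_out = (uint32_t)rlen;
    } else {
        memcpy(out, blob->payload, blob->len);
        *len_out = blob->len;
    }
    return out;
}
```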

@artpol84
Contributor Author

I think compression is an orthogonal solution here; let's not mix the two. Hopefully we will have this part ready for testing next week.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@artpol84 Agreed. Just throwing it out there as we need to get the memory footprint down as much as possible.

@jjhursey
Member

Maybe we can open another issue to track the compression of values? Then we can continue the conversation/development there.

@hjelmn
Contributor

hjelmn commented Oct 28, 2016

@jjhursey Sure. Will open that now.

@jjhursey
Member

@karasevb @artpol84 Any update on this issue?

@karasevb
Contributor

@jjhursey Final preparations for the PR are underway; it will be presented today.

@karasevb
Contributor

@jjhursey Sorry, I still need to fix some problems; it will take some more time.

@kawashima-fj
Contributor

I re-evaluated the memory footprint as a follow-up to #129.

(c) Before "keeping job info in the dstor"
Open MPI master open-mpi/ompi@277c319 (26 Aug.) + PMIx 2.0a embedded in OMPI
(Same as (c) of #129)

[graph: memory footprint per node, (c) Aug. 26 build]

(d) After "keeping job info in the dstor"
Open MPI master open-mpi/ompi@b2e36f0 (2 Dec.) + PMIx 2.0a embedded in OMPI

[graph: memory footprint per node, (d) Dec. 2 build]

The environment and conditions of the evaluation are the same as in #129. The graphs show the memory footprint per node (orted + 16 clients + shared memory).

The memory footprint of the PMIx client processes (between the red line and the blue line in the graphs) is greatly improved. Thank you for your great work!

@jladd-mlnx

@karasevb Well done!!
