Obtaining asynchronous keys #1586

Closed
rhc54 opened this issue Jan 9, 2020 · 40 comments


rhc54 commented Jan 9, 2020

Ref #1427
Ref open-mpi/ompi#6982

A user is trying to use PMIx in a way we hadn't anticipated, but should probably be supported. Here is a summary of the problem:

  • procs do a global exchange of some basic set of key-value pairs

  • some process (call it proc-A) requests from a target process (proc-B) a specific key-value pair that was not included in that global exchange.

  • The host daemon (call it host-1) does the correct thing and queries the remote host (host-2) of proc-B for the information.

  • host-2 calls PMIx_server_dmodex_request to request info for proc-B. Note, however, that the API doesn't allow specification of the key that is being sought - it only requests ALL data posted by proc-B prior to the request

  • the PMIx server library sees that proc-B has posted data and returns that blob

  • host-2 returns the blob to host-1, which delivers it to its PMIx server library. Note that the blob at this point only includes the same data that was in the global exchange because proc-B hasn't posted the new data yet! The server library notifies the client library in proc-A, which checks for the requested key and returns NOT_FOUND

  • proc-B then posts a new key, which happens to be the one proc-A is looking for - but it is too late.

What the user would like to have happen is for host-2 to wait until proc-B posts the desired key, and then respond to the direct modex request. There are several possible solutions that immediately came to mind. I will post each as a separate comment below so people can "emoji-vote" or directly comment on them separately.

Just as a reminder:

  • Hooray or Rocket: I support this so strongly that I want to be an advocate for it
  • Heart: I think this is an ideal solution
  • Thumbs up: I’d be happy with this solution
  • Confused: I’d rather we not do this, but I can tolerate it
  • Thumbs down: I’d be actively unhappy, and may even consider other technologies instead
  • Eyes: You must be kidding, let's not do this one

rhc54 commented Jan 9, 2020

Solution 1: Don't support this behavior

Would require clarification in the Standard, but no further work in the library


rhc54 commented Jan 9, 2020

Solution 2: do the full blob exchange. If the PMIx client fails to find the requested piece of info in that response, then have it follow up with a "one piece of data" request. When host-2 receives that request, it knows to pursue it with a PMIx_Get call instead of PMIx_server_dmodex_request.

Pretty easy to implement, but would result in slow behavior. Still, the Standard doesn't guarantee anything about speed-of-response!

No changes required to the Standard. Would require changes to OpenPMIx and PRRTE.


rhc54 commented Jan 9, 2020

Solution 3: flag that a proc has added/modified its posted data since the last time it was exchanged in a "fence" operation. If we receive a dmodex request and the flag isn't set, then "hold" the request until some new data has been posted. Note that we won't know how much new data is coming, and so we will have to respond as soon as we get the next PMIx_Commit. Thus, if someone posts their data every time with a PMIx_Put/PMIx_Commit operation, we would respond after the first such posting and may not include the data they wanted.

No changes required to the Standard. Only requires changes to OpenPMIx. Doesn't fully solve the user's problem as it doesn't resolve the issue of multiple async postings by the same target process.


rhc54 commented Jan 9, 2020

Solution 4: Modify/replace the PMIx_server_dmodex_request API to specify the desired key, with the PMIx server library returning the entire blob once the requested key has been posted.

Requires modification to the Standard, plus changes to OpenPMIx. Still inefficient for the case of multiple async postings by the same target unless those postings are combined into one PMIx_Commit operation.


rhc54 commented Jan 9, 2020

Solution 5: A modified form of Solution 3. If host-2 receives a dmodex request after performing a fence that includes data exchange, first check for existence of the requested key using PMIx_Get. If found, then return all data posted by the target proc since the last fence, just in case it includes other things the requestor might want. This would resolve the "multi-request" case but requires different behavior on the part of the host.

Might require modification to the Standard, at least as an "advice to RMs" section. Still inefficient for the case of multiple async postings by the same target unless those postings are combined into one PMIx_Commit operation as we won't know more requests are coming, but I'm not sure there is a solution for that problem - might merit an "advice to users" in the Standard.
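
To make the host-side flow concrete, here is a rough sketch of what Solution 5 might look like in the host daemon. It assumes the host may call PMIx_Get against its own server library before falling back to PMIx_server_dmodex_request; the helpers forward_blob_to_requestor() and hold_request() are hypothetical placeholders for however the host answers or parks the incoming request, and error handling is omitted:

#include <pmix.h>
#include <pmix_server.h>

/* Hypothetical helpers - stand-ins for the host's own messaging layer */
extern void forward_blob_to_requestor(pmix_status_t status,
                                      char *data, size_t sz, void *req);
extern void hold_request(const pmix_proc_t *target,
                         const char *key, void *req);

static void dmodex_response(pmix_status_t status,
                            char *data, size_t sz, void *cbdata)
{
    /* hand the returned blob back to the requesting host */
    forward_blob_to_requestor(status, data, sz, cbdata);
}

static void handle_remote_request(const pmix_proc_t *target,
                                  const char *key, void *req)
{
    pmix_value_t *val = NULL;

    /* has the target already posted the requested key? */
    if (PMIX_SUCCESS == PMIx_Get(target, key, NULL, 0, &val)) {
        PMIX_VALUE_RELEASE(val);
        /* yes - return everything it has posted since the last fence,
         * in case the requestor wants more than this one value */
        PMIx_server_dmodex_request(target, dmodex_response, req);
    } else {
        /* not posted yet - park the request and recheck later */
        hold_request(target, key, req);
    }
}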


rhc54 commented Jan 9, 2020

Solution 6: Require there be a PMIx_Fence operation before data can be retrieved. The Fence doesn't have to include data - it just needs to be there to ensure that all data was posted by the procs.

Requires clarification in the Standard. May not solve this user's specific use-case as they don't necessarily know when procs should participate in a Fence (not everyone is posting async data). Wouldn't require any changes to OpenPMIx or PRRTE.


jjhursey commented Jan 9, 2020

Let's see if I understand the problem. The pseudocode looks like this, right?

peer = (rank + 1)%2

PMIx_Init
PMIx_Get(wildcard) -- job data

PMIx_Put(key-1)
PMIx_Commit()

PMIx_Fence(DATA_COLLECT) // "Fence 1"

PMIx_Put(key-2)
PMIx_Commit()

PMIx_Fence() // "Fence 2" -- if this is not called then get below fails

PMIx_Get(peer, key-2)

PMIx_Finalize()

Is the problem that, without "Fence 2", the PMIx_Get(peer, key-2) may return unknown?

I think that is expected behavior, right? The PMIx_Commit is a (node) local operation, and the PMIx_Fence is a synchronizing collective. So there is a race here between rank 0 and rank 1 that would require either the fence or some other external synchronization to occur to guarantee that the other process has written the data before trying to access it.
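
For reference, a compilable version of that pseudocode against the PMIx client API might look roughly like the sketch below (error checking trimmed; PMIX_JOB_SIZE is used only as an example of job-level data, and the key names and values are arbitrary):

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer, wildcard;
    pmix_value_t val, *result = NULL;
    pmix_info_t collect;
    bool flag = true;

    PMIx_Init(&myproc, NULL, 0);

    /* job-level data is stored against the wildcard rank */
    PMIX_PROC_LOAD(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &result)) {
        PMIX_VALUE_RELEASE(result);
    }

    /* post key-1 and push it to the local server */
    PMIX_VALUE_LOAD(&val, "value-1", PMIX_STRING);
    PMIx_Put(PMIX_GLOBAL, "key-1", &val);
    PMIx_Commit();

    /* "Fence 1": collective with data collection */
    PMIX_INFO_LOAD(&collect, PMIX_COLLECT_DATA, &flag, PMIX_BOOL);
    PMIx_Fence(&wildcard, 1, &collect, 1);

    /* post key-2 after the fence */
    PMIX_VALUE_LOAD(&val, "value-2", PMIX_STRING);
    PMIx_Put(PMIX_GLOBAL, "key-2", &val);
    PMIx_Commit();

    /* "Fence 2": without this, the Get below races with the peer's Put */
    PMIx_Fence(&wildcard, 1, NULL, 0);

    PMIX_PROC_LOAD(&peer, myproc.nspace, (myproc.rank + 1) % 2);
    if (PMIX_SUCCESS == PMIx_Get(&peer, "key-2", NULL, 0, &result)) {
        printf("%u: got key-2\n", myproc.rank);
        PMIX_VALUE_RELEASE(result);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}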


rhc54 commented Jan 9, 2020

Yeah, that is basically correct. The problem here is that the procs don't know if/when they should participate in some global fence - they don't know if someone else has published new info (arbitrary ranks might or might not do so at any time). The result is "expected" by us, but the direct modex operation doesn't explicitly require there be a fence before using it. As I said above, one solution is to make that painfully clear - but that then prohibits this use-case.

I'll post another possible solution to get around the problem that might help as well.


rhc54 commented Jan 9, 2020

Solution 7: Don't use PMIx_Put/PMIx_Get for this use-case. Use PMIx_Publish/PMIx_Lookup instead as this is designed for non-scalable async data exchange.

Requires no change to the Standard, OpenPMIx, or PRRTE. Might want to add some "advice to user" text to the Standard about this usage. Might also need/want to optimize the data server in PRRTE if this gets exercised at any scale.
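
For illustration, a rough sketch of that path for the endpoint-address case follows. The "myapp-ep-<rank>" key naming is an arbitrary application-level convention (nothing PMIx mandates), and I'm assuming PMIX_WAIT takes an integer count of values to wait for, with 0 meaning "all of the requested keys":

#include <stdio.h>
#include <pmix.h>

/* Publish this rank's endpoint address, then look up a peer's. */
static pmix_status_t exchange_endpoint(const pmix_proc_t *me,
                                       uint32_t peer_rank,
                                       const char *my_addr,
                                       char *peer_addr, size_t len)
{
    pmix_info_t pub, wait;
    pmix_pdata_t pdata;
    char key[PMIX_MAX_KEYLEN + 1];
    int wait_all = 0;   /* assumed: 0 = wait for all requested keys */
    pmix_status_t rc;

    /* publish my endpoint under a rank-specific key */
    snprintf(key, sizeof(key), "myapp-ep-%u", me->rank);
    PMIX_INFO_LOAD(&pub, key, my_addr, PMIX_STRING);
    if (PMIX_SUCCESS != (rc = PMIx_Publish(&pub, 1))) {
        return rc;
    }

    /* look up the peer's endpoint, asking the server to wait for it */
    PMIX_PDATA_CONSTRUCT(&pdata);
    snprintf(pdata.key, sizeof(pdata.key), "myapp-ep-%u", peer_rank);
    PMIX_INFO_LOAD(&wait, PMIX_WAIT, &wait_all, PMIX_INT);
    rc = PMIx_Lookup(&pdata, 1, &wait, 1);
    if (PMIX_SUCCESS == rc && PMIX_STRING == pdata.value.type) {
        snprintf(peer_addr, len, "%s", pdata.value.data.string);
    }
    PMIX_PDATA_DESTRUCT(&pdata);
    return rc;
}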


jjhursey commented Jan 9, 2020

The process that posted key-2 could throw an event to the other processes to refresh their copy:

put()
commit()
post_event()

Then the remote process could do the get(), triggering a dmodex that would pick up the new value without the need for a fence. Essentially using the event mechanism to notify and loosely synchronize procs.
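
As a rough sketch of that pattern (the event code chosen via PMIX_EXTERNAL_ERR_BASE and the handler contents are my own assumptions for illustration, not something this thread prescribes):

#include <pmix.h>

/* Application-defined event code; anything in the external/user range works.
 * The specific offset here is an arbitrary choice for illustration. */
#define MYAPP_KEY_POSTED (PMIX_EXTERNAL_ERR_BASE - 1)

/* Runs in the receiving process when the notification arrives.  A later
 * PMIx_Get(source, "key-2", ...) should then trigger a dmodex that picks
 * up the freshly committed value. */
static void key_posted_hdlr(size_t evhdlr_registration_id,
                            pmix_status_t status,
                            const pmix_proc_t *source,
                            pmix_info_t info[], size_t ninfo,
                            pmix_info_t results[], size_t nresults,
                            pmix_event_notification_cbfunc_fn_t cbfunc,
                            void *cbdata)
{
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

/* Receiving side: register once, e.g. right after PMIx_Init */
static void register_for_key_event(void)
{
    pmix_status_t code = MYAPP_KEY_POSTED;
    PMIx_Register_event_handler(&code, 1, NULL, 0,
                                key_posted_hdlr, NULL, NULL);
}

/* Posting side: after PMIx_Put/PMIx_Commit of key-2, tell the peers */
static void announce_key_posted(const pmix_proc_t *me)
{
    PMIx_Notify_event(MYAPP_KEY_POSTED, me, PMIX_RANGE_NAMESPACE,
                      NULL, 0, NULL, NULL);
}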


rhc54 commented Jan 10, 2020

@angainor You are welcome to join the discussion!

angainor commented

@rhc54 @jjhursey Thanks for looking into this! Just to clarify, in this particular (trivial) scenario I want to publish rank address information. The reason I don't want to use the second fence with data collection is memory consumption. I am working with a communication library for grid-based PDE solvers, and as you know, in such applications the rank neighborhood is usually local and small. I guess this is a typical scenario and the reason OpenMPI / PMIx support the no-modex setup.

Notice that the source of the problems for me was that I was trying to set my custom keys after calling MPI_Init, which internally called a Fence+collect. This broke my workflow, since I expected the no-modex variant to work for my keys, but because of what MPI_Init did internally it seems I have no way of getting those keys without calling another Fence+collect. Now I see that my original workaround to set the keys before MPI_Init also did not work, since a Fence+collect was called anyway, and PMIx did an allgather on my keys. Hence at least part of the problem is integration of independent software components that use PMIx.

Going back to this particular use case, the data exchange is done in the setup phase, so I am not so much concerned about performance. I will look at the Publish / Lookup solution, but here I could also use solution 6 and call a Fence without a data collect. In general @rhc54 is right in that asynchronously posting new / dynamically changed keys without requiring a Fence would be useful. One reason for this would be transparent task migration, but here our requirements are not yet defined, so I don't have a clear case. Not sure if it makes any sense from your perspective, but to support dynamically posted / updated keys, would it make sense to specify key 'hints' and mark them as direct-modex (i.e., do not include them in the data collect, or do not cache them and always go to the source)? This would be explicit and would insulate separate software components from each other. So MPI_Init would not cause an allgather of my keys.

A final thought: I'm sure I did not understand that part, but you wrote that this is a race between the ranks posting and getting yet unposted data. To me this would mean that it should work if I wait long enough, e.g., call PMIx_Get until I succeed:

PMIx_Get(wildcard) -- job data

PMIx_Put(key-1)
PMIx_Commit()

PMIx_Fence(DATA_COLLECT) // "Fence 1"

PMIx_Put(key-2)
PMIx_Commit()

do {
  rc = PMIx_Get(peer, key-2);
} while(PMIX_SUCCESS != rc)

This doesn't work, but maybe this would be an acceptable solution from your perspective? From the user side, if I am expecting the peer to post a key at some point, and I don't know when that is, then I can simply call PMIx_Get until it succeeds. Or I can call it occasionally, and go on with my life until I manage to obtain the key.

angainor commented

@rhc54 I have looked at openpmix/prrte#297 and PMIx_Publish / Lookup. This would of course work, and I guess the lookup semantics look similar to what I wrote about in my PMIx_Get thoughts:

Thus, the caller is responsible for ensuring that data is published prior to executing a lookup, using PMIX_WAIT to instruct the server to wait for the data to be published, or for retrying until the requested data is found.

This sounds great. The only pitfall, as I understand from your comment, is that this approach will not be scalable? I guess that is because the lookup is a global operation, which in contrast to PMIx_Get does not take the peer id as argument, so you have to search for the key globally. Is my understanding correct? Do you think this will be a problem for large runs? Is the published data kept centrally in one place somewhere (memory usage considerations)? Also, is the data obtained by lookup cached by the local PMIx server?


rhc54 commented Jan 10, 2020

The only pitfall, as I understand from your comment, is that this approach will not be scalable? I guess that is because the lookup is a global operation, which in contrast to PMIx_Get does not take the peer id as argument, so you have to search for the key globally. Is my understanding correct? Do you think this will be a problem for large runs? Is the published data kept centrally in one place somewhere (memory usage considerations)? Also, is the data obtained by lookup cached by the local PMIx server?

Whether you use lookup or get, the scenario you describe is inherently non-scalable as it involves a direct host-to-host operation for every data request. The fence-based get operation is scalable solely because we collect the information from every process and make it locally available, and thus the individual get operations are purely local. This allows us to use more efficient collective algorithms to move the data.

Lookup can ask for the key to come from a particular source, so the search isn't the scaling issue. The scaling problem is again that the lookup involves a remote operation for every request. Storage of published data is an implementation issue and may differ across systems. At the moment, PRRTE uses a central key-value store (hosted on the prte master process), but we are hoping to look at extending options to things like a distributed hash table so the data is more reliable (in case of failures in prte) and to reduce the memory footprint on the prte master.

One way of managing memory footprint is to add the PMIX_PERSIST_FIRST_READ attribute when publishing the data. This will cause the key-value datastore to delete the data once it has been accessed and avoids yet another remote operation to remove it (as you would otherwise have to specifically "unpublish" every post). Here are the supported persistence attributes in case one of them is more appropriate to your case:

#define PMIX_PERSIST_INDEF          0   // retain until specifically deleted
#define PMIX_PERSIST_FIRST_READ     1   // delete upon first access
#define PMIX_PERSIST_PROC           2   // retain until publishing process terminates
#define PMIX_PERSIST_APP            3   // retain until application terminates
#define PMIX_PERSIST_SESSION        4   // retain until session/allocation terminates
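
For example, a publish call that attaches the "first read" persistence might look something like this (a rough sketch; the key and value are placeholders):

#include <pmix.h>

/* Publish a value that the datastore drops after its first read,
 * so no explicit PMIx_Unpublish is needed later. */
static pmix_status_t publish_once(const char *key, const char *value)
{
    pmix_info_t info[2];
    pmix_persistence_t persist = PMIX_PERSIST_FIRST_READ;

    PMIX_INFO_LOAD(&info[0], key, value, PMIX_STRING);
    PMIX_INFO_LOAD(&info[1], PMIX_PERSISTENCE, &persist, PMIX_PERSIST);
    return PMIx_Publish(info, 2);
}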

Caching of lookup data is again implementation specific. We don't currently cache it, but it wouldn't be hard to do. Main problem would be ensuring that data obtained via lookup didn't get included in modex requests, so a little more bookkeeping is required. We didn't cache it because there wasn't any clear need to do so. If you are expecting multiple procs on a given node to "lookup" the same piece of information, then we probably need to implement something.

angainor commented

I hope you don't mind my lengthy posts :) Trying to understand this better, so please do correct me if I'm confusing something.

Whether you use lookup or get, the scenario you describe is inherently non-scalable as it involves a direct host-to-host operation for every data request.

I am also thinking about memory scalability, hence the interest in no-modex. I know this is forward-looking, but let's say I run 128 'ranks' per compute node, each posts an endpoint address of 300 bytes, and I have 1e5 compute nodes. Then after a Fence+collect, each node would replicate about 4GB of address data. This is a lot for most architectures.

As I understand, no-modex is useful to address the memory scalability for applications with a sparse communication pattern, as the entire database is not replicated on each host. With direct modex only the keys that have been accessed are cached on each node. Assuming a local and (mostly) static communication pattern, in practice each rank would execute a constant and small number of direct host-to-host operations with the actual neighbors. Since this is done once and cached, it does not affect the parallel scalability of the application. Plus, the memory footprint is constant instead of scaling with the job size. Plus, if the communication pattern changes (e.g., due to node failure / rank migration / anything), thanks to PMIx my application will seamlessly obtain the new endpoint address - even if this will require a one-time direct host-to-host communication.

Hence, I wanted to use PMIx_Put / Get without a fence, but MPI_Init called it for me :)

At the moment, PRRTE uses a central key-value store (hosted on the prte master process), but we are hoping to look at extending options to things like a distributed hash table so the data is more reliable (in case of failures in prte) and to reduce the memory footprint on the prte master.

So from the memory usage perspective on the prte master this would be equivalent to collecting all data from all hosts/ranks, while the remaining nodes would have a smaller memory footprint. Is that right? I cannot really un-publish the data after first access, because each key is accessed multiple times, once for each of a rank's neighbors.

Regarding caching, in my use-case I can cache the keys in the application, and I don't require PMIx to do this. In this sense lookup works fine for me.


rhc54 commented Jan 10, 2020

Loop @jaidayal into the discussion


rhc54 commented Jan 10, 2020

@angainor Well, of course, the goal of PMIx is to eliminate the endpoint exchange altogether 😄 . There are plans in progress for ways to handle the data size you point out, as that will be a recurring issue, but those will be implementation specific (of course).

Direct modex was designed as an alternative solution to the problem, but it relies on a Fence to ensure that data is available and is scalable only in the case of sparsely-connected apps. What we do to try and alleviate the scaling problem is have each dmodex request return all the data published by the target process (in case you want to get more than just one value), and provide an attribute by which you can request that the first dmodex operation return all data from every peer process on the target node (for the case where procs on a node all talk to their corresponding peers on another node). This at least reduces the number of dmodex operations flowing across the system.

The use-case in your example code, however, did something a little different in that you did the fence after the initial posting of data, but then you asynchronously posted additional data and tried to retrieve it. Our optimization now gets in the way as the host on the target node doesn't have an easy way of retrieving just that one piece of data. As outlined in some of my solutions, we can work around that problem - but it creates a non-optimal code path as we are now requesting and returning just one piece of data every time. If lots of procs execute that code path, there will be a storm of communication across the cluster. In your follow-up message, you indicate that perhaps we have misunderstood your example code - it sounds like the initial wireup fence is fine, and your later "get" requests were really focused on obtaining updates due to proc migration.

We handle migration issues a little differently. When a proc migrates (for whatever reason), there is an event generated that includes the proc's identity, where it went, new endpoint, etc - basically, all the data that is typically provided at job start for that proc is included in the event. Procs that need to know that info can simply register a handler to receive it. No dmodex is required. Might make sense for the PMIx server to also register for it and update its "cached" info for the proc - have to put that on the "to-do" list.

Does that resolve the use-case of interest?

angainor commented

In your follow-up message, you indicate that perhaps we have misunderstood your example code - it sounds like the initial wireup fence is fine, and your later "get" requests were really focused on obtaining updates due to proc migration.

I'm sorry I brought migration into this discussion - please ignore it. Your initial understanding was correct - I think ;) We've had a previous discussion on async modex, and in #1424 (comment) you wrote that a Fence is essentially an allgather. So I assumed that if I avoid the fence, I avoid the memory footprint problem - because there is no allgather.

My application uses MPI, but it also uses other communication backends, which create their own endpoint address information. I have a setup step very similar to that of OpenMPI, in which all ranks create the endpoints and post their address information for later retrieval by communication neighbors - whoever needs it. Since the communication pattern is sparse, I wanted to avoid the Fence and instead rely on the method you described in #1424 (comment). Are you saying now that I can use the Fence+collect and (now or sometime in the future) this will not cause the memory footprint to explode? If that is the case, then I am happy with that.


rhc54 commented Jan 10, 2020

I see - a few things then.

  1. Do not use Fence+collect - use Fence without data collection. There is an attribute to stipulate that you want a dataless barrier (a minimal sketch follows this list). It is still effectively an "allgather" operation - just with zero data, which is really what a "barrier" operation is anyway. Helps to reduce overhead.
  2. If you can share the info on the other communication backends, we would be happy to approach those groups and get PMIx "instant on" support added to them. FWIW, we are working on adding "instant on" directly to libfabric, so if your other backends use libfabric for interacting with the fabric, then they will automatically pick this support up in the not-too-distant future.
  3. We are definitely working on the memory footprint issue as it needs to be solved for the exascale machines. I can't say exactly when the support will be released, but hopefully in the next year or so.
  4. Remember, in the case where "instant on" is supported, you don't execute Fence at all - not even dataless. There is no need to "sync" the procs as all data is available at time of first execution.
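
The minimal sketch mentioned in point 1, assuming PMIX_COLLECT_DATA set to false is the attribute in question:

#include <pmix.h>

/* Dataless barrier: synchronize all procs in my namespace without
 * gathering anyone's posted key-value data. */
static pmix_status_t dataless_barrier(const pmix_proc_t *me)
{
    pmix_proc_t all;
    pmix_info_t directive;
    bool collect = false;

    PMIX_PROC_LOAD(&all, me->nspace, PMIX_RANK_WILDCARD);
    PMIX_INFO_LOAD(&directive, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    return PMIx_Fence(&all, 1, &directive, 1);
}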

HTH

angainor commented

@rhc54 For now we have a UCX backend implementation; we also plan to implement a libfabric backend.

Can you point me to some doc regarding the "instant on" feature? I would be interested in the details. Just as food for thought: our communication library assumes that the user can use MPI in the same program. This means that we need to create different endpoints than the MPI library, e.g., MPI runs with libfabric and we run with UCX, or we both run with the same backend, but still we must use different endpoints. Also, we could use one endpoint per thread, while MPI (for now) uses one endpoint per rank.

I'm not sure you want to handle all such complexity, but I would be interested to hear your opinion on this.

Thanks!


rhc54 commented Jan 13, 2020

You are correct that we don't get down into the details of which library is using which endpoint. All we do is provide a means by which you can request how many endpoints (from each fabric and/or fabric plane) shall be assigned to each process - what you do with them is up to you. You don't even have to use them all, though keep in mind that most fabrics do have limits on the number of endpoints on a node.

You can find out more about "instant on" in the following:

This is a journal article that describes PMIx in general, including a fairly detailed section on "instant on":
https://www.sciencedirect.com/science/article/pii/S0167819118302424?via%3Dihub

Here is a presentation given as a tutorial on working with PMIx. It goes over the nitty-gritty details of the launch procedure starting on page 44. There are not a lot of words as I tend to speak more to the pictures, but hopefully enough is there to get the idea across:

https://pmix.org/wp-content/uploads/2019/06/PMIxTutorial-June2019-pub.pdf

Don't hesitate to ask questions.

jjhursey commented

Going back in the thread a bit. I'm curious why this didn't work (I wouldn't call it optimal, but it should be functional):

PMIx_Get(wildcard) -- job data

PMIx_Put(key-1)
PMIx_Commit()

PMIx_Fence(DATA_COLLECT) // "Fence 1"

PMIx_Put(key-2)
PMIx_Commit()

do {
  rc = PMIx_Get(peer, key-2);
} while(PMIX_SUCCESS != rc)

@rhc54 Is it because Fence 1 was a data collection fence (full modex), so the get will not trigger the dmodex requests? If we replaced Fence 1 with a direct modex would the get-loop continue to issue dmodex requests to the same target for the same key until it resolved or would it only issue it once?


rhc54 commented Jan 13, 2020

It should work, but would be awfully inefficient as every call to PMIx_Get will generate a dmodex request that will return the entire peer's blob at that moment, which the recipient will search thru and return NOT_FOUND. Not something I think we would want people using.

I'm not sure why it failed here - it is a bug, but it might be in ORTE (i.e., we might be tracking that we already did a dmodex for that peer and not do it again).

angainor commented

It should work, but would be awfully inefficient as every call to PMIx_Get will generate a dmodex request that will return the entire peer's blob at that moment, which the recipient will search thru and return NOT_FOUND.

I would not mind the inefficiency if that helped me to reduce the memory overhead. It seems to me that in practice the while loop would only execute PMIx_Get once: in a real application some time will pass between the PMIx_Commit and PMIx_Get. So there would be in total two transfers of a particular peer's data: once in the Fence, and once in PMIx_Get. Not so bad?

I'm not sure why it failed here - it is a bug, but it might be in ORTE (i.e., we might be tracking that we already did a dmodex for that peer and not do it again).

It looks like it to me, too.


rhc54 commented Jan 13, 2020

I would not mind the inefficiency if that helped me to reduce the memory overhead. It seems to me that in practice the while loop would only execute PMIx_Get once: in a real application some time will pass between the PMIx_Commit and PMIx_Get. So there would be in total two transfers of a particular peer's data: once in the Fence, and once in PMIx_Get. Not so bad?

Only if you plan to go to sleep waiting for the job to start 😄

Think about it - at the scale you are targeting, do you really want every process to execute multiple host-to-host data exchanges with multiple processes just to wire up??? The network routers will be lit up like Christmas trees. Your assumption of only one time thru the loop strikes me as very implementation specific - and overly optimistic.

We should just provide a scalable solution rather than trying to devise non-scalable ways of working around the problem.

angainor commented

Think about it - at the scale you are targeting, do you really want every process to execute multiple host-to-host data exchanges with multiple processes just to wire up???

I guess not :) I do of course see that "instant on" is a solution, while the drafted loop is a work-around.


rhc54 commented Jan 13, 2020

😄 I think the original solution that currently has a bug in it should work - we just need to get that bug fixed. It won't be ideal, but much better than having to loop, especially as "instant on" still isn't available.


rhc54 commented Jan 23, 2020

Starting to dig into this further. One note (probably something stated earlier that I had forgotten): the problem only appears when the "fence" operation after posting key-1 and before requesting key-2 collects data. If the fence does not collect data, then the subsequent dmodex for key-2 works fine. So it appears that we flag somewhere that a fence has occurred, and then disable dmodex from that point forward.


rhc54 commented Feb 1, 2020

@angainor I have fixed this via the following commits:
#1611
#1614
#1609
#1607
openpmix/prrte#322
openpmix/prrte#321
openpmix/prrte#318

Here is the bottom line from those changes. Any time you request data for a key, we will check the client's local cache for it. If not found, we will (unless you include the "optional" attribute) send a request to our local PMIx server library for the key. The server library will first check its cache for it and return the value if found. If not found, the server will (unless you include the "immediate" attribute) request that the library's host retrieve the value from wherever the referenced proc is executing. This usually means that the host will send a message to its counterpart on the node where the proc is executing.

When the remote host receives the message, it first uses PMIx_Get to determine if the desired key is available. If it isn't, then the host will put the request in a "holding" location, periodically rechecking to see if the key has become available. When the key is found, the host will return a complete copy of all data posted by that proc - so hopefully any subsequent requests can be resolved without the remote exchange.

If you don't specify a timeout (via the PMIX_TIMEOUT attribute), then it is up to the host environment to decide how long to "hold" the request before returning "not found" to you. Note that in the case of PRRTE, this defaults to a 2-minute timeout, but host environments are not required to support any default value. They may return "not found" immediately, or they may never return by default. So if you don't want to "hang", you need to ensure that the key will be posted, or else specify PMIX_TIMEOUT and mark it as "required" via the PMIX_INFO_REQUIRED macro. This way, you'll get a "not supported" error if the host environment doesn't support a timeout, rather than just hanging if the key isn't found.

Note that this also holds true for keys posted by other local procs. If you request such a key, the PMIx server library first checks for its availability. If it isn't available, we will "hold" the request and periodically recheck to see if the key has arrived. In this case, I used a default timeout of only 2sec. The timeout mechanism is having a problem at the moment, so for now it will just return "not found" rather than waiting - but I'm hoping that gets resolved soon.

You can also request that the client's local cache be "refreshed" using the new PMIX_GET_REFRESH_CACHE attribute. If that is given, then we will ask the host to retrieve a complete copy of all keys posted by the referenced proc, regardless of whether or not we already have a value for that key. This will still wait for the key to be present on the remote host. Note that it doesn't know if the value has been updated - it will only detect the presence of the key and return the data once found. Timeout rules apply here as well.

Note that "refresh cache" doesn't do anything if the referenced proc is on the same node as the requestor. We assume that the host keeps the local data store current. So if the key is present, then we simply return the value - or else we hold it until the key is found or we timeout.

I illustrate these behaviors using PRRTE and a modified form of your reproducer:
https://github.com/openpmix/prrte/blob/master/test/double-get.c

I believe this will resolve the issues you uncovered. Sorry it took a while - it was a rather tricky problem to resolve. Please give it a try (you'll need to use PMIx and PRRTE master branches) and close this issue if it works for you.

I don't know if/when other environments will upgrade to support these features. OpenMPI will soon be moving to use PRRTE as its embedded RTE (to be included in the upcoming v5.0 release), so you'll be able to do it there. And of course, you can always run PRRTE inside an allocation under any environment.


rhc54 commented Feb 2, 2020

FYI: I have fixed the local proc timeout problem mentioned above: #1615


angainor commented Feb 2, 2020

@rhc54 Thanks a lot for your effort! I will try to digest all this information this week and test things.


rhc54 commented Feb 2, 2020

@angainor Hmmm...give me a day to look at this - something may have gotten into PRRTE that is causing a problem.


rhc54 commented Feb 5, 2020

My bad - forgot to update this. All fixed and ready to go.


angainor commented Feb 6, 2020

@rhc54 Thanks! I am quite busy this week, but I will test things sometime after the weekend.

angainor commented

@rhc54 I tested your new double-get test with the newest pmix/master and prrte/master. I think something might still not be right with the timeout, but I might misunderstand your explanations above. Your test works as expected when I use --refresh - then I get the second key correctly. But with the other flags (--wait and --timeout) the test always fails. I changed the test a little (sleep on both ranks in line 166, see openpmix/prrte#338), but it does not help.

Also, I added another test, which simply calls a single PMIx_Put followed by a PMIx_Get with no fence (standard no-modex). This used to work, but now it also fails in all cases, unless I specify --refresh.


rhc54 commented Feb 10, 2020

Odd - all four variations work fine for me (this is without your PR):

$ prun -n 2 --map-by node ./a.out 
Execute fence
999686146:1 PMIx_Put on test-key-1
PMIx initialized
Execute fence
999686146:0 PMIx_Put on test-key-1
999686146:1 PMIx_Put on test-key-3
999686146:0 PMIx_Put on test-key-2
999686146:1 PMIx_get test-key-2 returned 256 bytes
1: obtained data "SECOND TIME rank 0"
999686146:0 PMIx_get test-key-3 returned 256 bytes
0: obtained data "SECOND TIME rank 1"
PMIx finalized
$


[mpiuser@rhc-node01 test]$ prun -n 2 --map-by node ./a.out --refresh
PMIx initialized
Execute fence
999686148:0 PMIx_Put on test-key-1
Execute fence
999686148:1 PMIx_Put on test-key-1
999686148:0 PMIx_Put on test-key-2
999686148:1 PMIx_Put on test-key-3
999686148:0 PMIx_get test-key-3 returned 256 bytes
0: obtained data "SECOND TIME rank 1"
PMIx finalized
999686148:1 PMIx_get test-key-2 returned 256 bytes
1: obtained data "SECOND TIME rank 0"
$


[mpiuser@rhc-node01 test]$ prun -n 2 --map-by node ./a.out --wait   
PMIx initialized
Execute fence
Execute fence
999686150:1 PMIx_Put on test-key-1
999686150:0 PMIx_Put on test-key-1
999686150:0 PMIx_Put on test-key-2
999686150:1 PMIx_Put on test-key-3
999686150:0 PMIx_get test-key-3 returned 256 bytes
0: obtained data "SECOND TIME rank 1"
999686150:1 PMIx_get test-key-2 returned 256 bytes
1: obtained data "SECOND TIME rank 0"
PMIx finalized
$


[mpiuser@rhc-node01 test]$ prun -n 2 --map-by node ./a.out --timeout
Execute fence
PMIx initialized
999686152:1 PMIx_Put on test-key-1
Execute fence
999686152:0 PMIx_Put on test-key-1
999686152:0 PMIx_Put on test-key-2
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  DMDX: prted/pmix/pmix_server_fence.c:420

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
Mon Feb 10 15:56:45 2020 ERROR: double-get.c:69  Client ns 999686152 rank 0: PMIx_Get on rank 1 test-key-3: NOT-FOUND

[mpiuser@rhc-node01 test]$ 


rhc54 commented Feb 10, 2020

Same for your "get-nofence" test - it works fine:

$ prun -n 2 --map-by node ./get-nofence
PMIx initialized
999686154:0 PMIx_Put on test-key-1
999686154:1 PMIx_Put on test-key-1
999686154:0 PMIx_get test-key-1 returned 256 bytes
0: obtained data "FIRST TIME rank 1"
999686154:1 PMIx_get test-key-1 returned 256 bytes
1: obtained data "FIRST TIME rank 0"
PMIx finalized
$

angainor commented

@rhc54 Sorry, I had env problems and used an old pmix library on the compute nodes. The LD_LIBRARY_PATH environment variable was not exported for the ranks when I used prun. I ran as follows:

$ gcc double-get.c -o pmixtest -L${PMIX_DIR}/lib -lpmix
$ ldd pmixtest
	linux-vdso.so.1 =>  (0x00007fff5f0f6000)
	libpmix.so.0 => /cluster/projects/nn9999k/marcink/software/pmix/master/lib/libpmix.so.0 (0x00002b83c5b3c000)
[...]
$ prun -np 2 ./pmixtest
./pmixtest: error while loading shared libraries: libpmix.so.0: cannot open shared object file: No such file or directory
./pmixtest: error while loading shared libraries: libpmix.so.0: cannot open shared object file: No such file or directory
$ prun -x LD_LIBRARY_PATH -np 2 ./pmixtest
./pmixtest: error while loading shared libraries: libpmix.so.0: cannot open shared object file: No such file or directory
./pmixtest: error while loading shared libraries: libpmix.so.0: cannot open shared object file: No such file or directory

I had to add -rpath to the compile line. Now it works!

$ gcc double-get.c -o pmixtest -Wl,-rpath=${PMIX_DIR}/lib/ -L${PMIX_DIR}/lib -lpmix
$ prun -np 2 ./pmixtest
PMIx initialized
3749117988:0 PMIx_Put on test-key-1
Execute fence
3749117988:1 PMIx_Put on test-key-1
Execute fence
3749117988:1 PMIx_Put on test-key-3
3749117988:0 PMIx_Put on test-key-2
3749117988:1 PMIx_get test-key-2 returned 256 bytes
1: obtained data "SECOND TIME rank 0"
3749117988:0 PMIx_get test-key-3 returned 256 bytes
0: obtained data "SECOND TIME rank 1"
PMIx finalized

angainor commented

@rhc54 yes, it seems all test cases run through now :) Thanks! Do you have any idea what was wrong with the -x LD_LIBRARY_PATH scenario?


rhc54 commented Feb 10, 2020

Well, there are a couple of things wrong with that approach. First, -x would only change the environment for the application proc, not PRRTE itself, so that wouldn't fully solve the problem as it involved behavior change on both sides of the execution (PRRTE and the client lib). Second, -x is actually only an OMPI option, not a PRRTE one. Thus, we ignore it unless you tell us the personality is OMPI - i.e., add --personality ompi on the prun cmd line.

I'll close this now that things seem to be resolved. Please reopen it if the problem reappears!
