Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opal/accelerator: Add IPC APIs #10959

Merged
merged 5 commits into from
Nov 16, 2023
Merged

opal/accelerator: Add IPC APIs #10959

merged 5 commits into from
Nov 16, 2023

Conversation

wckzhang
Copy link
Contributor

Signed-off-by: William Zhang wilzhang@amazon.com

@wckzhang
Copy link
Contributor Author

A start for adding generic IPC functionality.

*
*/
typedef int (*opal_accelerator_base_module_get_ipc_handle_fn_t)(
int dev_id, void *dev_ptr, void **handle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a void* instead of a class object like the rest of the interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New revision encapsulates handle and size, therefore we don't need the API to get handle size.

However, I think it's necessary to keep device id and base address - didn't find a simplistic way to hide them, maybe in IPC cache if we decide to implement that.

@wenduwan
Copy link
Contributor

wenduwan commented Nov 3, 2023

@edgargabriel Reminded me of this PR. I shall pick it up soon.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edgargabriel
Copy link
Member

Since @bwbarrett did the initial review of this pr, it might be good to get his review also in

@bwbarrett
Copy link
Member

Since @bwbarrett did the initial review of this pr, it might be good to get his review also in

I'm going to be under water with things to do the next couple of weeks, so don't wait for my review.

@wenduwan wenduwan requested a review from lrbison November 6, 2023 17:37
@wenduwan
Copy link
Contributor

wenduwan commented Nov 6, 2023

@lrbison Since Brian is down, could you give it a review?

Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty short PR. I am not tracking the IPC functions well enough to know if this is sufficient for all use cases. However I don't see anything wrong with it and it shouldn't break existing code so I see no problem in approving it and getting it merged either.

@edgargabriel
Copy link
Member

@wenduwan as I mentioned in the meeting, I think meanwhile that we are missing some functions. Assuming that the goal is to be able to remove the remaining hard-coded calls to cuda and replace them with accelerator framework API function calls, here are the functions that are still missing. Not all of them are IPC related, so not sure whether they would absolutely have to be part of this PR:

IpcOpenEventHandle
IpcGetEventHandle
StreamWaitEvent

EventDestroy

In addition, looking over the current Accelerator API, I noticed that we have an interface to create a stream, but not to destroy a stream ( similarly on how we could create an event, but not destroy one, that function is however already in the list above).

wenduwan and others added 5 commits November 8, 2023 02:36
Introduce the destroy stream operation to properly release resources.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Introduce the synchronize stream operation.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Introduce the operation to make stream wait on an event.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Introduce the destroy_event operation to properly release resources.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Signed-off-by: William Zhang <wilzhang@amazon.com>
Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan
Copy link
Contributor

wenduwan commented Nov 8, 2023

@edgargabriel Thanks for the reminder - I took a second look at CUDA and ZE API docs and added a couple more that I think are important. Please let me know if that makes sense to you.

I do not have extensive experience with accelerators so I'm open to suggestions to remove less common APIs.

Also I want to mention that I didn't see a CloseIpcEventHandle API - should I add one?

@edgargabriel
Copy link
Member

edgargabriel commented Nov 8, 2023

Also I want to mention that I didn't see a CloseIpcEventHandle API - should I add one?

@wenduwan good point. As far as I can see, neither hip nor cuda have a closeIpcEventHandle function, and they don't even mention in the documentation what to do with the handle. Does ze have a function for that purpose? I will also check with our hip team what is the recommended action.

Since I just read through the documentation of the ipc functionality, I am also double checking one more aspect, namely the fact that the events for IPC have to be created with certain flags set ([cuda/hip]EventDisableTiming and [cuda/hip]EventInterprocess). I think we might need another API function for that (e.g. create_event_for_ipc() or similar). I will come back to you on that.

@edgargabriel
Copy link
Member

@wenduwan I would suggest to go ahead and merge the PR as is, i.e. without the closeIpcEventHandle. There is no hip or cuda function along those lines, and because of that its also not used anywhere in the code. If it turns out later that we still need one for another accelerator, we can still add the API in, we do have a bit of time before the 5.1 release

@wenduwan
Copy link
Contributor

wenduwan commented Nov 9, 2023

@edgargabriel Cool thanks for double checking.

@hppritcha Since you chimed in on this during the Tuesday call I want to see if you have other comments?

@wenduwan
Copy link
Contributor

@lrbison @hppritcha Can we merge this? I will open backport to 5.0

@edgargabriel
Copy link
Member

@wenduwan why would you backport this to 5.0? I think this is meant for 5.1, its new functionality?

@wenduwan
Copy link
Contributor

why would you backport this to 5.0? I think this is meant for 5.1, its new functionality?

I see. Cool then we can leave it on main.

@hppritcha hppritcha merged commit 554a8eb into open-mpi:main Nov 16, 2023
10 checks passed
wenduwan added a commit to wenduwan/ompi that referenced this pull request Nov 17, 2023
This change implements stub APIs introduced in
open-mpi#10959

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
edgargabriel added a commit that referenced this pull request Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants