Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCXX: C++ implementation #913

Open
pentschev opened this issue Jan 12, 2023 · 21 comments
Open

UCXX: C++ implementation #913

pentschev opened this issue Jan 12, 2023 · 21 comments

Comments

@pentschev
Copy link
Member

For a while I have been working and discussing internally a rewrite of UCX-Py, splitting the hardcore logic into a C++ backend, with a (much) thinner Cython layer, and a Python async layer, including also some new Python optimizations. This new library is currently named UCXX and has now been merged into the UCX-Py repository under a separate branch with a ucxx/ subdirectory.

We have not yet decided on a permanent place for this library to exist, since its intent is to replace the current UCX-Py implementation, some of the options we have are:

  1. Starting a fresh repository -- it was so far being developed like that but on a private repository;
  2. Living here under a ucxx/ subdirectory permanently;
  3. Eventually removing UCX-Py files and moving ucxx/ into the root directory of this repository;
  4. Some other solution that has not been raise so far?

I hope the few instructions available following the link above should suffice to get people started building/testing the new implementation.

At this time we would like to invite people to try it out and review the code. The new implementation is definitely not complete nor bug-free, so any and all help is welcome.

A non-exhaustive, and in no particular order, list of features/issues that make this implementation incomplete is below:

  • Missing RMA operations;
  • Missing AM API;
  • Hiding implementation in some form of pImpl or similar idiom to reduce ABI breaking changes.

I appreciate everyone's help, and pinging some of the usual suspects who may be interested: @quasiben @wence- @madsbk @MattBBaker

@jakirkham
Copy link
Member

Would some/all of this make sense to add to UCX itself?

Noticing that is not called out on this list above (unless I've missed it). So also wondering if there is a reason not to

@pentschev
Copy link
Member Author

Would some/all of this make sense to add to UCX itself?

Yes, this is part of the plan, it has been for at least 2-3 years, but partly my fault that it did not happen yet. Eventually we hope all of UCX-Py/UCXX to move into the upstream UCX repo.

@quasiben
Copy link
Member

cc @kkraus14 @MattBBaker

@jakirkham
Copy link
Member

Not at all. This is impressive work! Not surprised it took some time to get here

Ok so this is envisioned as a staging location before upstreaming everything? That makes sense

I'd lightly suggest keeping things in one repo as you have done (just from an ergonomic perspective), but would defer to the judgement of yourself and others here

@kkraus14
Copy link

kkraus14 commented Jan 12, 2023

Took a very quick initial glance and this looks very exciting. Awesome work @pentschev! :partykirby:

One thing that stuck out to me thus far is that Python enablement has leaked into the C++ library, albeit with a compile-time option to avoid it. That being said, this is going to become annoying when it comes to packaging and deployment.

For example, if I'm a pure C++ library wanting to use this I almost certainly don't want to take on a Python dependency so I'd want a build of UCXX that isn't dependent on the Python pieces. Then say someone wants to use something else in the same environment that is dependent on the Python bits. In theory, we could use typical conda shenanigans of having different package variants with build strings and something like track_features to downweigh variants appropriately to make it work, but that's non-trivial extra burden. Also, we'd need to matrix the CI config to build and test both Python enabled and not enabled builds.

It also looks like there's pieces of the code that change the ABI depending on the Python compile-time flag that would probably need to be reworked. For example:

I wonder if it would be possible to restructure the code to have the Python enablement in C++ be a separate library that depends on the "core" C++ library.

@pentschev
Copy link
Member Author

I'd lightly suggest keeping things in one repo as you have done (just from an ergonomic perspective), but would defer to the judgement of yourself and others here

@jakirkham the reason I'm not really fond of this is the duplication that I had to resort to and the mess with the commit history this has caused. As a personal preference, I would make this a new repo, but I also understand there are reasons why people would prefer to do as you say.

@pentschev
Copy link
Member Author

Took a very quick initial glance and this looks very exciting. Awesome work @pentschev! :partykirby:

One thing that stuck out to me thus far is that Python enablement has leaked into the C++ library, albeit with a compile-time option to avoid it. That being said, this is going to become annoying when it comes to packaging and deployment.

For example, if I'm a pure C++ library wanting to use this I almost certainly don't want to take on a Python dependency so I'd want a build of UCXX that isn't dependent on the Python pieces. Then say someone wants to use something else in the same environment that is dependent on the Python bits. In theory, we could use typical conda shenanigans of having different package variants with build strings and something like track_features to downweigh variants appropriately to make it work, but that's non-trivial extra burden. Also, we'd need to matrix the CI config to build and test both Python enabled and not enabled builds.

It also looks like there's pieces of the code that change the ABI depending on the Python compile-time flag that would probably need to be reworked. For example:

I wonder if it would be possible to restructure the code to have the Python enablement in C++ be a separate library that depends on the "core" C++ library.

@kkraus14 thank you for taking the time to look and comment here. All your comments are valid and sensible to me. The potential for ABI breakage is something I'm not particularly happy with in the current implementation, there are so many things to improve on that front as you noted above.

The way I was thinking of resolving both the Python ABI breakage and the C++ dependency on the C++ was to follow more or less the concept that UCX is applying, dlopen optional dependencies. I'm not currently bound to that idea, so I'm happy to discuss either dlopen or instead make the Python C/C++ code depend on the C++ library as you suggested. I guess packaging issues could make us lean to one solution or the other, admittedly I'm not very well-versed in the packaging problems so I would definitely appreciate comments on why would one solution be better than the other.

@wence-
Copy link
Contributor

wence- commented Jan 13, 2023

The way I was thinking of resolving both the Python ABI breakage and the C++ dependency on the C++ was to follow more or less the concept that UCX is applying, dlopen optional dependencies. I'm not currently bound to that idea, so I'm happy to discuss either dlopen or instead make the Python C/C++ code depend on the C++ library as you suggested. I guess packaging issues could make us lean to one solution or the other, admittedly I'm not very well-versed in the packaging problems so I would definitely appreciate comments on why would one solution be better than the other.

To make the structs ABI compatible with/without python, one option would just be to store all the python-related data in an opaque struct and define the (say) worker struct as:

class Worker : public Component {
    private:
        ...
        void *_data{nullptr};
}

And then when python is enabled you have (syntax probably wrong, etc...):

struct pydata {
   std::mutex _mutex{};
   std::queue<...> pool{};
   std::shared_ptr<...> notifier{};
}

...
pydata = struct pydata {}; // needs to be heap allocated
worker = Worker{..., _data = reinterpret_cast<void*>(pydata)};

That might also generically be useful if third-party callers of the C++ implementation need somewhere to hang their application data.

@MattBBaker
Copy link
Contributor

I overall like what I'm seeing. Though I'm a bit lost on the request lifecycle.

Would it be possible to move the delayed submission parts to a Python helper? I remember the UCF presentation and about how deferring operations for a bit before dropping the GIL was a big win, but it seems to add a lot of mental overhead to the core C++ library.

@pentschev
Copy link
Member Author

Would it be possible to move the delayed submission parts to a Python helper? I remember the UCF presentation and about how deferring operations for a bit before dropping the GIL was a big win, but it seems to add a lot of mental overhead to the core C++ library.

Delayed submission is not an exclusive Python feature, you can use it the same way for C++. Effectively it moves any transfer calls, e.g. ucp_tag_send_nbx, from the application thread to the worker thread (if the worker is setup to use a worker thread) and thus remove any overheads from the application thread. Admittedly, I don't know yet if that can improve C++ performance, but I certainly don't want to make it a Python-only feature, at least not before we are able to determine if that's useful for C++. I can see how that can make things confusing, perhaps what could be done is to write a ucxx::RequestDelayed class that would be a specialization of the ucxx::Request, I will have to think a bit about how to best achieve this to try reducing code duplication for the different transfer types (AM/stream/tag).

@MattBBaker
Copy link
Contributor

Okay, my main concern is that I want to make a model where there is a worker thread that manages outstanding requests (where ucp_tag_send_nbx returns a request pointer) and the application thread calls ucp_tag_send_nbx itself and handles immediate completions then and there so that fast path is handled in the application thread.
Also, what is the general release plan? I'd like to integrate this code in a project

@pentschev
Copy link
Member Author

Okay, my main concern is that I want to make a model where there is a worker thread that manages outstanding requests (where ucp_tag_send_nbx returns a request pointer) and the application thread calls ucp_tag_send_nbx itself and handles immediate completions then and there so that fast path is handled in the application thread.

This is no problem, if you set enableDelayedSubmission=false when creating the worker you should get the behavior you want, which is in fact the default. You are still free to use any of the progress modes (either blocking, polling, wait or thread).

Also, what is the general release plan? I'd like to integrate this code in a project

I guess this is mostly a matter of getting people "sign off" on this. Once we are somewhat confident that major issues have been addressed, we could release. The original intent was to have this ready for RAPIDS 23.02 (release expected for February 9th), but it's a lot of code and I don't know if everyone will have enough time to do that. In any case we could still release as an experimental piece of code I guess, not sure if any packaging would be available (such as conda-forge) by then though. Do you have in mind what is the minimum necessary for this to be useful for you? Do you need packages or just a permanent repo structure you can build from?

cc @quasiben for the release discussion above.

@jakirkham
Copy link
Member

Just a note on packaging, this probably makes more sense as a rapidsai/rapidsai-nightly package for now. Handling in conda-forge may make more sense once upstreamed

@kkraus14
Copy link

Do you have in mind what is the minimum necessary for this to be useful for you? Do you need packages or just a permanent repo structure you can build from?

A conda package would be ideal, but we could always take the code and produce our own internal package for now until it's ready to be published in the rapids channels.

@pentschev
Copy link
Member Author

@kkraus14 @MattBBaker @jakirkham I did some changes to split the shared libraries and dlopen the Python bits in an attempt to resolve the issues you raised. Could you take another look and let me know if going this direction would resolve the issues you raised? The more relevant changes for this can be viewed in https://github.com/rapidsai/ucx-py/compare/b3a3709..66f4b72 , but there were other fixes in the ucxx branch if you care to look at them as well. One comment I foresee is the fact that I used the mangled symbol in dlsym which isn't ideal, but it's a bit time consuming to get the C++ namespaces/complex types properly done for the dlsym and I didn't want to spend the time on it before asking for feedback, whether this is going in the right direction first.

@kkraus14
Copy link

kkraus14 commented Mar 1, 2023

@pentschev I took a quick look and other than the dlopen / dlsym shenanigans in using a mangled C++ name it looked reasonable enough to me.

Only thought that crossed my mind is that if you could use the Worker class as a base class and then create a PythonWorker class. This class could then be in something like the libucxx_python shared library and that way you wouldn't need to check for Python availability or do any special handling from the general libucxx C++ code. I suspect that would then allow eliminating all usage of the #if UCXX_ENABLE_PYTHON macro and the associated runtime checks for if Python is available.

@pentschev
Copy link
Member Author

@kkraus14 this is now done in https://github.com/rapidsai/ucx-py/compare/99e911b..2b563ea . We now have a ucxx::python::Worker class that inherits from ucxx::Worker, allowing us to remove Python dependencies from libucxx.so entirely and keep them only in libucxx_python.so (that links to libucxx.so). With that, we can also avoid the dlsym calls, therefore I believe this should be much closer to a good shipping state. Relevant changes are in https://github.com/rapidsai/ucx-py/compare/99e911b..2b563ea, would you mind taking a look and letting us know if there's anything else that could be a problem for you, and just in general if you see anything that could be a problem for shipping binaries?

@kkraus14
Copy link

kkraus14 commented Mar 7, 2023

Took a quick pass and this looks much cleaner from my perspective. There's still a few dangling things that I think could be cleaned up:

  • A lot of functions take a const bool enablePythonFuture and then eventually map down to a call that checks the conditional and doesn't pass future related info if it's false.
    • Is this future Python specific anymore? If not, I think this could be generalized which could then possibly be used for other future based programming models as opposed to just Python async. If it is, then I think it would be better to restructure to have Python specific overloads in the Python library if possible.
    • Maybe an opportunity to make things a bit cleaner by using something like a std::optional<Future>?
  • No longer need the isPythonAvailable() function (or the include/ucxx/utils/python.h / src/ucxx/utils/python.cpp files in general) since all Python usage is isolated to a separate shared library.
  • I think it would be cleaner to move the Python specific constructors (https://github.dev/rapidsai/ucx-py/blob/ucxx/ucxx/cpp/include/ucxx/constructors.h#L86-L99) to a different header, otherwise you can't ship the C++ library completely separately from the Python one.
  • A nice to have, but definitely not strictly needed, would be to split the Python library cmake into its own subdirectory so it can be built and installed separately from the C++ library. This would require treating the ucxx::ucxx as a dependency that possibly needs to be found, but it would make building and installing the components separately a slightly nicer developer experience.

@pentschev
Copy link
Member Author

In general I agree with your comments @kkraus14 , the plan is that we make UCXX public for 23.04 (this time for real) in experimental status, so I want to prioritize addressing those that are potentially a blocker for distribution and for your usage as well. In that sense, I think most of those comments would be in the nice-to-have category for 23.04, and then potentially addressed in 23.06. I'm responding to the individual items below, but please let me know if any of them is a must-have for 23.04.

Took a quick pass and this looks much cleaner from my perspective. There's still a few dangling things that I think could be cleaned up:

  • A lot of functions take a const bool enablePythonFuture and then eventually map down to a call that checks the conditional and doesn't pass future related info if it's false.

    • Is this future Python specific anymore? If not, I think this could be generalized which could then possibly be used for other future based programming models as opposed to just Python async. If it is, then I think it would be better to restructure to have Python specific overloads in the Python library if possible.

Right now yes, but the plan is to add C++ future support as well (thus I changed it from enablePythonFuture to enableFuture in the ucxx::Worker class, but probably the same should be done elsewhere), in which case the interface would hopefully remain the same for both C++ and Python.

  • Maybe an opportunity to make things a bit cleaner by using something like a std::optional<Future>?

I am not sure this is possible to be done the way things are currently implemented. Passing the enable{Python,}Future boolean roughly means "get a future from the pool, make it available as part of the ucxx::Request and notify it when appropriate for user consumption", rather than passing a user-created future, in which case the std::optional<Future> would indeed make sense. Maybe there's room to improve that a bit, but perhaps not in the user-facing API where we don't want (at least I don't think we do at the moment) the user to be responsible for managing the future's lifetime.

  • No longer need the isPythonAvailable() function (or the include/ucxx/utils/python.h / src/ucxx/utils/python.cpp files in general) since all Python usage is isolated to a separate shared library.

Indeed this is not currently necessary, I decided to leave it for now though because it may still be useful to check from C++ whether the Python library is available. That should anyway not have any Python dependencies and harmless in that sense.

This is definitely an oversight by my part, thanks for bringing it up, I'll address it shortly.

  • A nice to have, but definitely not strictly needed, would be to split the Python library cmake into its own subdirectory so it can be built and installed separately from the C++ library. This would require treating the ucxx::ucxx as a dependency that possibly needs to be found, but it would make building and installing the components separately a slightly nicer developer experience.

Agreed, this would be a nice separation too and we should work on it as well.

@pentschev
Copy link
Member Author

We have made the decision to have the UCXX codebase in a new repo which is publicly available now: https://github.com/rapidsai/ucxx .

@kkraus14 I think from the C++/Python separation ask we have resolved all issues you've raised, including moving the Python dependency as a CMake component with its own CMakeLists.txt, please feel free to check it out and let us know what you think.

Also feel free to file issues/submit PRs in the new repo!

@jakirkham
Copy link
Member

Should we close this issue in favor of new focused issues/PRs on the new repo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants