
Add a host-pinned memory resource that can be used as upstream for pool_memory_resource. #1392

Merged — 50 commits into rapidsai:branch-24.02 on Jan 18, 2024

Conversation

@harrism (Member) commented Nov 28, 2023

Description

Depends on #1417

Adds a new host_pinned_memory_resource that implements the new cuda::mr::memory_resource and cuda::mr::async_memory_resource concepts, which makes it usable as an upstream MR for rmm::mr::device_memory_resource.

Also tests a pool made with this new MR as the upstream.

Note that the tests explicitly set the initial and maximum pool sizes as using the defaults does not currently work. See #1388 .
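The RMM types themselves aren't shown here, but the setup being tested — an explicitly sized pool layered over an upstream resource — can be sketched with a standard-library analogy. In this sketch std::pmr stands in for RMM; the names and sizes are illustrative, not RMM's API:

```cpp
#include <cstddef>
#include <memory_resource>

// Build a pool with explicit options over an upstream resource and do a
// round-trip allocation; returns true on success. Analogous to constructing
// rmm::mr::pool_memory_resource with explicit initial/maximum sizes instead
// of relying on defaults.
bool pool_round_trip()
{
  // Upstream resource standing in for pinned_host_memory_resource.
  std::pmr::memory_resource* upstream = std::pmr::new_delete_resource();

  // Explicit pool options rather than the defaults (illustrative values).
  std::pmr::pool_options opts{};
  opts.max_blocks_per_chunk        = 64;
  opts.largest_required_pool_block = 1 << 20;  // 1 MiB

  std::pmr::unsynchronized_pool_resource pool{opts, upstream};

  void* ptr = pool.allocate(256, alignof(std::max_align_t));
  bool ok   = (ptr != nullptr);
  pool.deallocate(ptr, 256, alignof(std::max_align_t));
  return ok;
}
```

In the actual tests, pool_memory_resource<pinned_host_memory_resource> plays the role of the pool, with the initial and maximum sizes passed explicitly at construction.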

Closes #618

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@harrism harrism added feature request New feature or request non-breaking Non-breaking change cpp Pertains to C++ code labels Nov 28, 2023
@harrism harrism self-assigned this Nov 28, 2023
@harrism harrism requested a review from a team as a code owner November 28, 2023 03:41
@harrism harrism requested a review from miscco November 28, 2023 03:51
@miscco (Contributor) left a comment

I have some minor suggestions.

I am a bit dismayed about the amount of documentation boilerplate.

Maybe we could work with defaulted arguments rather than redefining the function each time?

@harrism (Member, Author) commented Dec 6, 2023

> Maybe we could work with defaulted arguments rather than redefining the function each time?

I am able to consolidate the non-async functions into a single allocate and deallocate (eliminating two functions). But for the async versions, we have existing calls to allocate_async(bytes, stream) that will fail if we just have allocate_async(bytes, alignment=default_alignment, stream=default_stream). So I can't consolidate these yet.

Also, suppose we did consolidate these. What should we use for default_alignment? A device memory resource should use the default CUDA memory alignment (256 bytes). But a host memory resource should probably use alignof(std::max_align_t). And a user who wants to not care about alignment will have to decide what to pass for alignment because they need it in order to use an explicit stream.

Actually, this default alignment problem applies to the non-async allocate/deallocate functions too. Because when we convert something like pool_memory_resource to the cuda::memory_resource concept what should it use if we need to provide default alignments? The underlying memory for a pool could be device or host memory. I think the pool needs to be able to figure out what its default alignment should be...

Thoughts?

 - Consolidate allocate/deallocate functions using default alignment argument.
 - Add missing includes.
@miscco (Contributor) commented Dec 6, 2023

> Actually, this default alignment problem applies to the non-async allocate/deallocate functions too.

We could define a free helper function that returns 256 on device and alignof(std::max_align_t) on host. I am not sure whether this is something that we really want.
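A minimal sketch of what such a free helper might look like — the enum and names here are hypothetical, not RMM's API; 256 mirrors the alignment guaranteed by the cudaMalloc family, and alignof(std::max_align_t) the host default:

```cpp
#include <cstddef>

// Hypothetical kind tag: whether a resource allocates host or device memory.
enum class memory_kind { host, device };

// Pick a default alignment based on the kind of memory allocated.
// 256 matches the cudaMalloc-family guarantee; alignof(std::max_align_t)
// is the safe host default.
constexpr std::size_t default_alignment(memory_kind kind)
{
  return kind == memory_kind::device ? std::size_t{256}
                                     : alignof(std::max_align_t);
}

static_assert(default_alignment(memory_kind::device) == 256);
```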

@wence- (Contributor) commented Dec 6, 2023

> Actually, this default alignment problem applies to the non-async allocate/deallocate functions too. Because when we convert something like pool_memory_resource to the cuda::memory_resource concept what should it use if we need to provide default alignments? The underlying memory for a pool could be device or host memory. I think the pool needs to be able to figure out what its default alignment should be...

What is a minimum safe value (given we don't know the type of the pointer we're allocating for)? On host, I believe the answer is alignof(std::max_align_t); on device, I think it's alignof(double4) == 16 bytes (as per https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types-alignment-requirements-in-device-code). This is smaller than the 256-byte alignment guaranteed by allocations from the cudaMalloc family, so I am not sure if that is pertinent.

Should memory resources that perform concrete allocations advertise their default alignment as a property and then wrapping resources can query that?

@miscco (Contributor) commented Dec 6, 2023

> Should memory resources that perform concrete allocations advertise their default alignment as a property and then wrapping resources can query that?

We can make this a property. The one potential design issue that our memory_resource has is that it pushes those properties into APIs. We cannot query it for a property that we did not specify in the template arguments.

@harrism (Member, Author) commented Dec 6, 2023

> We can make this a property. The one potential design issue that our memory_resource has is that it pushes those properties into APIs. We cannot query it for a property that we did not specify in the template arguments.

Can you explain a bit more? Maybe an example so I can understand what you mean?

@miscco (Contributor) commented Dec 7, 2023

> Can you explain a bit more? Maybe an example so I can understand what you mean?

The issue is that properties are awesome if you have them around. But resource_ref is type erased, so we essentially throw away all properties that are not stored in the resource_ref.

It's the difference between:

template <class T>
  requires cuda::mr::resource_with<T, cuda::mr::device_accessible>
void* special_allocate(T& memory_resource, size_t size);

void* special_allocate(cuda::mr::resource_ref<cuda::mr::device_accessible>& memory_resource, size_t size);

The latter is a streamlined implementation that reduces binary size considerably and generally simplifies the interfaces a lot. However, there is no T there anymore. In the former function I can query T with cuda::mr::has_property<T, some_property> and get a result. In the latter it is impossible, because resource_ref itself does not hold that property unless we tell it so explicitly.

It's definitely a wart.
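The wart can be illustrated in self-contained form with a plain C++17 detection idiom standing in for cuda::mr::has_property (none of these types are the real cuda::mr ones): a template still sees the concrete type and can query its properties, while a type-erased handle only exposes whatever was baked into its interface.

```cpp
#include <type_traits>

// Toy stand-ins for memory resources: one advertises "device accessible"
// via a member trait, the other does not. Names are illustrative.
struct device_resource { static constexpr bool device_accessible = true; };
struct host_resource   { static constexpr bool device_accessible = false; };

// Detection idiom, playing the role of cuda::mr::has_property.
template <class T, class = void>
struct has_device_property : std::false_type {};
template <class T>
struct has_device_property<T, std::void_t<decltype(T::device_accessible)>>
  : std::bool_constant<T::device_accessible> {};

// Templated entry point: T is still visible here, so any of its properties
// can be queried, even ones the caller never mentioned.
template <class T>
constexpr bool can_query(T const&) { return has_device_property<T>::value; }

// Type-erased entry point: once behind an abstract handle, a property that
// was not baked into the handle's interface is unrecoverable.
struct erased_resource_ref { /* no property information survives here */ };
```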

@harrism (Member, Author) commented Dec 9, 2023

Also the former requires C++20, right?

@miscco (Contributor) commented Dec 11, 2023

> Also the former requires C++20, right?

Yes, but you can also just write:

template <class T, cuda::std::enable_if_t<cuda::mr::resource_with<T, cuda::mr::device_accessible>, int> = 0>
void* special_allocate(T& memory_resource, size_t size);

@abellina left a comment

This LGTM. Since the alignment parameters are not being used, it obviously means we will need to align on our own. Our pinned code looks to be using std::max_align_t right now, just FYI.

@harrism (Member, Author) commented Jan 16, 2024

@abellina you are right we should probably fix this to actually align. I think cudaHostAlloc leaves alignment up to the caller.

@harrism (Member, Author) commented Jan 17, 2024

> @abellina you are right we should probably fix this to actually align. I think cudaHostAlloc leaves alignment up to the caller.

Done.
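The usual technique for aligning on the caller's side is to over-allocate by the requested alignment and round the pointer up, e.g. with std::align. A self-contained sketch of the approach (operator new stands in for cudaHostAlloc here; this illustrates the technique, not RMM's actual implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Round the pointer up inside an over-allocated block to satisfy an
// alignment the underlying allocator does not guarantee.
void* align_within(void* raw, std::size_t bytes, std::size_t alignment)
{
  std::size_t space = bytes + alignment;  // slack added at allocation time
  void* ptr         = raw;
  // std::align advances ptr to the first suitably aligned address within
  // [ptr, ptr + space) that leaves room for `bytes`, or returns nullptr.
  return std::align(alignment, bytes, ptr, space);
}

// Allocate with slack, align, check, and free; returns true on success.
bool aligned_round_trip(std::size_t bytes, std::size_t alignment)
{
  void* raw     = ::operator new(bytes + alignment);  // stand-in for cudaHostAlloc
  void* aligned = align_within(raw, bytes, alignment);
  bool ok       = aligned != nullptr &&
                  reinterpret_cast<std::uintptr_t>(aligned) % alignment == 0;
  ::operator delete(raw);  // free the original pointer, not the aligned one
  return ok;
}
```

Note that deallocation must pass the original (unaligned) pointer back to the allocator, so the offset has to be recoverable, e.g. by storing it just before the aligned address.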

@harrism (Member, Author) commented Jan 17, 2024

/ok to test

@harrism (Member, Author) commented Jan 18, 2024

/ok to test

@wence- wence- removed the request for review from a team January 18, 2024 09:27
@wence- (Contributor) left a comment

Minor documentation nits, otherwise looks great!

Review thread on a docstring:

* @return Whether the input a power of two with non-negative exponent
* @return True if the input a power of two with non-negative exponent, false otherwise.

Contributor:

Suggested change:

- * @return True if the input a power of two with non-negative exponent, false otherwise.
+ * @return True if the input is a power of two with non-negative exponent, false otherwise.

Contributor:

Double-nit (no need to act on it): non-negative integer exponent (all integers can be expressed as powers of two if we admit real exponents).

@harrism (Member, Author) commented Jan 18, 2024

/ok to test

@abellina commented
For me this still looks good. I was able to replace our pinned pool with a pool_memory_resource<pinned_host_memory_resource> and use the PTDS stream for all allocations. Perf overall is unchanged from our Java-based one, except for one of the NDS queries (query 83), where this shows a small improvement. I would like to confirm why that is, but my guess is that it has to do with fragmentation in the old pool.

@bdice (Contributor) left a comment

Two questions, then I'm happy to approve.

Comment on lines 170 to 171
* @briefreturn{true if the specified resource is the same type as this resource, otherwise
* false.}
Contributor:

This docstring implies it's possible to compare with another type of resource and get false, but the implementation doesn't allow that. Do we need to update the implementation or the docstrings?

Contributor:

Oh yeah, I had that thought. Is there a blanket "false" implementation in the base class somehow?

Member Author:

I think this is how comparison works with cuda::mr. Basically if you try to compare with another type of resource, compilation will fail. Note that refactoring to cuda::mr will necessitate changing the semantics RMM currently (mostly) has for MR equality comparison. #1402

Member Author:

Note also that pinned_host_memory_resource is NOT a device_memory_resource. It simply implements the cuda::mr::memory_resource and cuda::mr::async_memory_resource concepts.

Member Author:

(also note there is no base class)

Member Author:

I changed the docstring so it doesn't say that false can be returned. Note that we should probably followup with more explicit tests of this MR and future MRs like it. Right now, though, our test machinery for MRs assumes they are all device_memory_resource, so while I can pass a pool_memory_resource<pinned_host_memory_resource> to all the MR tests, I can't pass just pinned_host_memory_resource currently. (It does get tested as the upstream in the former case though, including its operator==).

Contributor:

Okay. If there's no base class, I've just lost track of how the class hierarchy works. I don't have any further comments here but I'll need to refresh myself on how things are supposed to work someday.

@harrism (Member, Author) commented Jan 18, 2024

/ok to test

@bdice (Contributor) commented Jan 18, 2024

@harrism asked me to merge once I approve, so I'll do that.

@bdice (Contributor) commented Jan 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 12f8de3 into rapidsai:branch-24.02 Jan 18, 2024
47 checks passed
Labels
cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Consider creating a pinned host memory resource that can leverage all of the allocation strategies
6 participants