NEWS: News update for v1.12.0 rc1 #7741
@Akshay-Venkatesh, can you please add the CUDA-related features/bug fixes added in UCX 1.12? (if something is missing)
NEWS
Outdated
#### Core
* Added initial support for Go language bindings
* Added memory invalidation on error detection
* Added threshold for ep connection matching in UCP
ep-> endpoint
NEWS
Outdated
* Added memory invalidation on error detection
* Added threshold for ep connection matching in UCP
* Added new objects to VFS (md, component, log_level, etc)
* Added config variable to specify what loadable modules are needed
config -> configuration
NEWS
Outdated
#### UCP
* Added API for querying UCP library attributes
* Added new sockaddr private data format
* Enabled rendezvous and tag sync for all cases with error handling
all cases -> all protocols ?
No, previously RNDV and sync were disabled if the user connected to a worker address with error handling enabled. This restriction is now removed.
Enabled rendezvous and tag sync protocols when error handling is enabled on the endpoint
NEWS
Outdated
* Added usage of mpool set for unexpected eager message to reduce memory consumption
* Added client_id to ucp_worker_create() and ucp_conn_request_query() APIs
* Added support for modifying UCT and UCS configs by ucp_config_modify() API
* Added address versioning to correctly preserve wire compatibility
This is the big one. Probably should be moved to the top. Also, we should explicitly specify from what version we are wire protocol backward compatible.
NEWS
Outdated
* Added address versioning to correctly preserve wire compatibility
* Optimized unpacked rkeys memory consumption
* Added request flag to influence latency vs. bandwidth protocol
* Added ucp_worker_address_query() API
Can you please group all API changes together ?
NEWS
Outdated
#### CUDA
* Added option to set cuda_copy bandwidth
* Added profiling of CUDA runtime function calls
* Added stub for memory invalidation support
I'm not sure if we need to list "stub" functions - seems like does not contribute to anything (you have few assurances here).
this is important change, because it allows these transports to be selected for RNDV protocol.
Remove?
I would rephrase this: Added stub for memory invalidation support in order to enable CUDA transport selection for rendezvous protocols
wdyt: Added stub for memory invalidation support to enable CUDA-IPC transport selection for rendezvous protocols in case of error handling enabled?
👍
Actually, I would make it a bit shorter: Added stub for memory invalidation support to enable CUDA-IPC transport selection for rendezvous protocols
i would remove this altogether, it's part of memory invalidation err flows fix
NEWS
Outdated
* Added process placement option for ucx_info
* Extended parameters correctness check in ucx_perftest
#### CI
* Replaced gtest 1.7 with gtest 1.10
Updated gtest 1.7 to 1.10
NEWS
Outdated
### Bugfixes
#### Core
* Fixed simultaneous ep close with ucp_hello_world
ep -> endpoint
NEWS
Outdated
* Suppressed EHOSTUNREACH error in TCP sockcm
* Restricted connecting loop-back to other devices in TCP
#### RDMA CORE (IB, ROCE, etc.)
* Added pkey_index initialization when creating RC QP with DEVX
s/Added/Fixed
NEWS
Outdated
* Fixes in UCP, UCT, UCS, FAQ and README documentation
#### Tests
* Fixed memory leak in io_demo
* More fixes in io_demo
Remove; it sounds duplicated. Maybe replace both lines with "Multiple fixes"
@shamisp, fixed
@brminich I'm copying this from Yossi's talk this week. These were the main additions:
There were some recent bug-fixes but they're part of master and not v1.12.x
@petro-rudenko can you pls check that I didn't miss any news for Java
JUCX:
seems a few comments from the prev review were missed
@shamisp can you pls take a look?
NEWS
Outdated
#### UCP
* Added API for querying UCP library attributes
* Added address versioning to correctly preserve wire compatibility since v1.11.0
since -> starting from the version v1.11.0
NEWS
Outdated
### Features:
#### Core
* Added beta-level support for Go language bindings
* Added new objects to VFS (md, component, log_level, etc)
etc -> etc.
NEWS
Outdated
* Added support for user-defined alignment in Active Messages
* Added support for offload tag sync in new protocols
* Updated ucp_atomic_post() to use NBX flow
##### API
Please move APIs section before UCP. These are the most visible updates.
but these are UCP API changes, so the section has a nested level (extra #)
or you mean move it right after #### UCP?
Yes.
moved to the top, but removed the ##### API line, because:
- We do not have such a section for other parts (UCT, UCS), so this is consistent with them: API changes are simply listed at the top
- Otherwise we would need to introduce some other caption to separate API changes from plain features
* Improved accuracy of the topology distance estimation
* Added thread-safe put to ptr_map
* Added prints of leaked callbacks from the callback queue
* Added new ptr_array API for bulk allocation
Please move API changes to the top of UCS section
moved
NEWS
Outdated
* Added ucs_ffs32()
* Removed a diagnostic message when fuse thread is stopped
* Added ucs_vsnprintf_safe() which always adds '\0'
* Added API for a per-process aggregate-sum statistics report
Please move API changes to the top of UCS section
moved
NEWS
Outdated
* Added support for setting worker id and querying it from the connection request
* Added support to bind on a free port in UcpListener
#### Packaging
* Added cmake support
Can you please elaborate on this one, we still use auto tools.
afair #7096
Added cmake config files for better integration with external cmake based projects
updated
bot:pipe:retest
* Added selection of CUDA-IPC capabilities based on NVLINK topology
(to prefer writes vs reads for specific platforms using NVML)
* Added option to set cuda_copy bandwidth
* Added profiling of CUDA runtime function calls
Can you add the following in CUDA section to reflect #7772
"Added option to limit GPUDirectRDMA size in rendezvous protocol"
added
@tonycurtis, can you please take a look?
NEWS
Outdated
* Added API for querying UCP library attributes
* Added client_id to ucp_worker_create() and ucp_conn_request_query() APIs
* Added ucp_worker_address_query() API
* Updated ucp_ep_query() API with getting local and remote addresses
for getting?
NEWS
Outdated
* Added client_id to ucp_worker_create() and ucp_conn_request_query() APIs
* Added ucp_worker_address_query() API
* Updated ucp_ep_query() API with getting local and remote addresses
* Added address versioning to correctly preserve wire compatibility starting from the version v1.11.0
remove "the", and the "v" in front of the number seems redundant.
NEWS
Outdated
* Added memory limit support to memtrack
#### CUDA
* Added global memtype cache to allow UCT transports to query memory attributes
* Auto-register cuda whole allocations to avoid repeated registration costs
cuda -> CUDA
NEWS
Outdated
* Added capability to select CUDA stream based on source and destination memory type
(required for device memory based pipelining)
* Added selection of CUDA-IPC capabilities based on NVLINK topology
(to prefer writes vs reads for specific platforms using NVML)
vs. (period on end)
NEWS
Outdated
### Bugfixes:
* Fixes in Cuda memory hooks
Cuda -> CUDA
NEWS
Outdated
@@ -68,7 +238,7 @@
#### RDMA CORE (IB, ROCE, etc.)
* Added report of QP info in case of completion with error
* Refactored of FC send operations
ah, not added by this issue, but looks like a missing word here
Probably I was wrong. Would be good to merge it to not block rc1.
You and @tonycurtis have to decide. In the past we have tried to push NEWS on the first RC but sometimes those were delayed.
On Dec 9, 2021, at 5:31 PM, Pavel Shamis (Pasha) ***@***.***> wrote:
@shamisp <https://github.com/shamisp>, @yosefe <https://github.com/yosefe> let's keep it open to be able to update if something else gets into v1.12
makes sense. You also want to update Authors files (in separate PR)
Probably I was wrong. Would be good to merge it to not block rc1. @tonycurtis <https://github.com/tonycurtis>, your comments applied. If you do not have any new comments I'd squash the changes
You and @tonycurtis <https://github.com/tonycurtis> have to decide. In the past we have tried to push NEWS on the first RC but sometimes those were delayed.
I could convince myself either way. Keep a running update of NEWS; or accumulate a separate PR until the release.
I’d go for the latter, if pushed. I think people understand an RC is not a release, so the meta-documentation can wait for a final update.
Tony
Imo, it is better to have the actual NEWS for the rc release, because users may/will want to try it and check its content
@brminich agree. If you happen to have it ready to go before the RC, I don't see a good reason not to include it. I have seen a few Linux distros using our RC.
@yosefe, @tonycurtis, so I'm going to merge it unless you have some objections
On Dec 10, 2021, at 9:20 AM, Mikhail Brinskiy ***@***.***> wrote:
@yosefe <https://github.com/yosefe>, @tonycurtis <https://github.com/tonycurtis>, so I'm going to merge it unless you have some objections
No, please go ahead.
Tony
What
Why ?
Release preparation
How ?
Comparing v1.11.x and v1.12.x branches
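One way the branch comparison could be turned into draft NEWS sections is sketched below. This is only an illustration, not the procedure used for this PR: the helper name `group_subjects`, the assumption that commit subjects start with a `COMPONENT/AREA:` tag, and the sample subject lines are all hypothetical. The `git log` command in the comment is a standard way to list commits reachable from one branch but not the other.

```python
import re
from collections import defaultdict

# Hypothetical helper: bucket commit subject lines by their leading
# component tag (the "UCP" in "UCP/CORE: ..."). The tag convention is
# an assumption made for this sketch.
_PREFIX = re.compile(r"^([A-Za-z0-9_]+)(?:/[^:]*)?:\s*(.+)$")


def group_subjects(subjects):
    """Return {component: [subject text, ...]} for the given subject lines."""
    groups = defaultdict(list)
    for subject in subjects:
        text = subject.strip()
        match = _PREFIX.match(text)
        if match:
            groups[match.group(1).upper()].append(match.group(2))
        else:
            # Subjects without a recognizable tag are kept for manual sorting.
            groups["OTHER"].append(text)
    return dict(groups)


# Subject lines could be collected with, for example:
#   git log --no-merges --format=%s v1.11.x..v1.12.x
# The lines below are made-up samples, not real commits.
sample = [
    "UCP/CORE: Add client_id to worker params",
    "UCT/CUDA: Add option to set cuda_copy bandwidth",
    "Fix typo in README",
]
drafts = group_subjects(sample)
for component, items in drafts.items():
    print(f"#### {component}")
    for item in items:
        print(f"* {item}")
```

The grouped output still needs manual editing (wording, feature vs. bugfix split), which is what most of the review comments above are about.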