ib/mlx5 BlueFlame issue for MT applications #2464
Comments
@jladd-mlnx, FYI
Ha! No, I was basing on the Mar 23rd snapshot. Looks like it was almost simultaneous. I will check this one.
@xinzhao3 Can you try the recent UCX master?
@artpol84 Sure. Doing now.
Fixed through #2338
@yosefe @xinzhao3 @jladd-mlnx
@artpol84 that looks great! Thanks!
@yosefe
@xinzhao3 and I were testing UCX with our in-house multi-threaded benchmark. Multiple options were considered:
We observed lost packets in case 2.
The issue was observed only for small packets. According to ibdump, the sender was sending corrupted packets even though the send buffers were not corrupted. Here is the message size info:
Problem:
While doing our measurements we faced an issue with mode 2 (one context and multiple workers). We observed hangs for some message sizes:
Eventually the following debug helped to pinpoint the issue:
According to the PRM, if threads are using the same BF register, there should be some sort of locking.
Manually disabling BF in favor of dbreg helps, but it also sounds dangerous.
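To illustrate the kind of locking the PRM asks for, here is a minimal sketch. It is a simulation only, not UCX or mlx5 code: `shared_bf_t`, `bf_post` and `run_shared_bf_demo` are hypothetical names, the 64-byte buffer size is an assumption, and an ordinary array stands in for the memory-mapped BF register. Several threads copy a multi-word descriptor into one shared buffer, and a mutex keeps each copy from interleaving with another thread's copy.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define BF_WORDS         8      /* assumed: 64-byte buffer as 8 x 64-bit words */
#define NUM_THREADS      4
#define POSTS_PER_THREAD 10000

/* Hypothetical stand-in for one BF register shared by several workers. */
typedef struct {
    pthread_mutex_t   lock;
    volatile uint64_t reg[BF_WORDS]; /* plain memory, not real MMIO */
    unsigned long     posts;         /* completed writes */
    unsigned long     torn;          /* writes that saw another thread's words */
} shared_bf_t;

typedef struct {
    shared_bf_t *bf;
    uint64_t     tag;                /* per-thread pattern written to every word */
} worker_arg_t;

/* Copy one multi-word descriptor into the shared buffer under the lock. */
static void bf_post(shared_bf_t *bf, uint64_t tag)
{
    int i;

    pthread_mutex_lock(&bf->lock);
    for (i = 0; i < BF_WORDS; i++) {
        bf->reg[i] = tag;
    }
    /* Read back: with the lock held, every word must still carry our tag. */
    for (i = 0; i < BF_WORDS; i++) {
        if (bf->reg[i] != tag) {
            bf->torn++;
            break;
        }
    }
    bf->posts++;
    pthread_mutex_unlock(&bf->lock);
}

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    int i;

    for (i = 0; i < POSTS_PER_THREAD; i++) {
        bf_post(arg->bf, arg->tag);
    }
    return NULL;
}

/* Runs the workers; returns the torn-write count and stores the post count. */
unsigned long run_shared_bf_demo(unsigned long *posts_out)
{
    shared_bf_t  bf;
    pthread_t    threads[NUM_THREADS];
    worker_arg_t args[NUM_THREADS];
    int          i, started = 0;

    assert(posts_out != NULL);
    pthread_mutex_init(&bf.lock, NULL);
    bf.posts = 0;
    bf.torn  = 0;
    for (i = 0; i < BF_WORDS; i++) {
        bf.reg[i] = 0;
    }

    for (i = 0; i < NUM_THREADS; i++) {
        args[i].bf  = &bf;
        args[i].tag = (uint64_t)(i + 1);
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) {
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_mutex_destroy(&bf.lock);

    *posts_out = bf.posts;
    return bf.torn;
}
```

With the lock in place the torn count stays at zero. Without the lock/unlock pair in `bf_post`, words from different threads could mix in the buffer, which would fit a corrupted packet on the wire with an intact send buffer. That reading is our guess, not something confirmed here.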