ib/mlx5 BlueFlame issue for MT applications #2464

Closed
artpol84 opened this issue Mar 29, 2018 · 8 comments

@artpol84
Contributor

artpol84 commented Mar 29, 2018

@yosefe
@xinzhao3 and I were testing UCX with our in-house multi-threaded benchmark. We considered several configurations:

  1. Shared worker: all threads are using the same worker
  2. Shared context: each thread is using its own worker, but all workers belong to the same context
  3. Separate contexts.

We observed lost packets in case 2 (one context, multiple workers).

The issue showed up only for small messages. According to ibdump, the sender was transmitting corrupted packets even though the send buffers themselves were not corrupted.

Problem:
During our measurements in mode 2 we observed hangs for some message sizes:

  • 1B-34B - OK (1 WQEBB in BlueFlame)
  • 35B-226B - FAILURES (2+ WQEBB in BlueFlame)
  • 227B+ - OK
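
The size boundaries above line up with 64-byte WQEBBs if we assume a fixed per-message overhead of 30 bytes. That overhead is an assumption inferred from the 34B and 226B boundaries, not a value taken from the UCX source. A rough sketch of the arithmetic:

```python
WQEBB_SIZE = 64        # bytes per WQE basic block on mlx5
ASSUMED_OVERHEAD = 30  # assumption: inferred from the 34B/226B boundaries above


def wqebb_count(payload_len, overhead=ASSUMED_OVERHEAD):
    """Number of 64-byte blocks needed for payload_len bytes plus overhead."""
    total = payload_len + overhead
    return (total + WQEBB_SIZE - 1) // WQEBB_SIZE  # round up


for size in (1, 34, 35, 226, 227):
    print(size, wqebb_count(size))
```

Under this assumption 34B still fits in one block, 35B through 226B need two to four blocks, and 227B is the first size that would need more than 256 bytes. That would be consistent with larger messages no longer going through BlueFlame, though we have not confirmed it.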

Eventually the following debug output helped pinpoint the issue (note that several threads report the same db_ptr):

tid=17,	db_ptr=0x7fe2b2d1f800
tid=6,	db_ptr=0x7fe2b2d1f800
tid=18,	db_ptr=0x7fe2b2d1fa00
tid=7,	db_ptr=0x7fe2b2d1fa00
tid=15,	db_ptr=0x7fe2b2d20800
tid=4,	db_ptr=0x7fe2b2d20800
tid=26,	db_ptr=0x7fe2b2d20900
tid=16,	db_ptr=0x7fe2b2d20a00
tid=5,	db_ptr=0x7fe2b2d20a00
tid=27,	db_ptr=0x7fe2b2d20b00
tid=13,	db_ptr=0x7fe2b2d21800
tid=2,	db_ptr=0x7fe2b2d21800
tid=24,	db_ptr=0x7fe2b2d21800
tid=14,	db_ptr=0x7fe2b2d21a00
tid=25,	db_ptr=0x7fe2b2d21a00
tid=3,	db_ptr=0x7fe2b2d21a00
tid=25,	db_ptr=0x7fe2b2d21b00
tid=0,	db_ptr=0x7fe2b2d22800
tid=11,	db_ptr=0x7fe2b2d22800
tid=22,	db_ptr=0x7fe2b2d22800
tid=0,	db_ptr=0x7fe2b2d22900
tid=11,	db_ptr=0x7fe2b2d22900
tid=1,	db_ptr=0x7fe2b2d22a00
tid=12,	db_ptr=0x7fe2b2d22a00
tid=23,	db_ptr=0x7fe2b2d22a00
tid=1,	db_ptr=0x7fe2b2d22b00
tid=20,	db_ptr=0x7fe2b2d33800
tid=9,	db_ptr=0x7fe2b2d33800
tid=10,	db_ptr=0x7fe2b2d33a00
tid=21,	db_ptr=0x7fe2b2d33a00
tid=19,	db_ptr=0x7fe2b2d34a00
tid=8,	db_ptr=0x7fe2b2d34a00
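
To spot the sharing without reading the dump by eye, a small helper along these lines can group thread ids by doorbell pointer. It assumes the `tid=N, db_ptr=0x...` line format shown above; the function name is made up for illustration.

```python
from collections import defaultdict


def shared_doorbells(log_text):
    """Return {db_ptr: sorted tids} for every pointer used by more than one tid."""
    tids_by_ptr = defaultdict(set)
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("tid="):
            continue  # skip anything that is not a debug line
        tid_part, ptr_part = line.split(",")
        tid = int(tid_part.split("=")[1])
        ptr = ptr_part.strip().split("=")[1]
        tids_by_ptr[ptr].add(tid)
    return {p: sorted(t) for p, t in tids_by_ptr.items() if len(t) > 1}


# Three lines taken from the dump above
sample = "\n".join([
    "tid=17,\tdb_ptr=0x7fe2b2d1f800",
    "tid=6,\tdb_ptr=0x7fe2b2d1f800",
    "tid=26,\tdb_ptr=0x7fe2b2d20900",
])
print(shared_doorbells(sample))  # {'0x7fe2b2d1f800': [6, 17]}
```

An empty result means every thread got its own doorbell pointer.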

According to the PRM, if threads use the same BF register, access to it must be serialized with some sort of locking.
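
To illustrate why unserialized sharing would hurt only multi-block WQEs, here is a toy model. It is not UCX or rdma-core code, and the class and method names are invented: a "register" is just a buffer filled one 64-byte chunk at a time, and real hardware behaviour is more involved.

```python
import threading

CHUNK = 64  # one WQEBB per write step


class ToyBFRegister:
    """Toy stand-in for a BlueFlame register, filled chunk by chunk."""

    def __init__(self, size=256):
        self.buf = bytearray(size)
        self.lock = threading.Lock()

    def write_chunk(self, index, chunk):
        self.buf[index * CHUNK:(index + 1) * CHUNK] = chunk

    def post(self, wqe):
        """Serialized post: hold the lock for the whole multi-chunk copy."""
        with self.lock:
            for off in range(0, len(wqe), CHUNK):
                self.write_chunk(off // CHUNK, wqe[off:off + CHUNK])
            return bytes(self.buf[:len(wqe)])


reg = ToyBFRegister()
wqe_a = b"A" * 128  # two-chunk WQE from "thread A"
wqe_b = b"B" * 128  # two-chunk WQE from "thread B"

# Unserialized schedule, replayed by hand: the chunk writes interleave.
reg.write_chunk(0, wqe_a[:CHUNK])
reg.write_chunk(0, wqe_b[:CHUNK])
reg.write_chunk(1, wqe_b[CHUNK:])
reg.write_chunk(1, wqe_a[CHUNK:])
mixed = bytes(reg.buf[:128])  # first half from B, second half from A

print(mixed not in (wqe_a, wqe_b))  # True: neither thread's WQE survived
print(reg.post(wqe_a) == wqe_a)     # True: the locked path stays intact
```

In this model a single-chunk WQE is written in one step and cannot be torn, which would match the 1 WQEBB case being OK.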

Manually disabling BF in favor of dbreg helps, but that also sounds dangerous.

@artpol84
Contributor Author

@jladd-mlnx, FYI

@yosefe
Contributor

yosefe commented Mar 29, 2018

@artpol84 did your ucx version include #2338?

@artpol84
Contributor Author

Ha! No, I was working from the Mar 23rd snapshot:
https://github.com/artpol84/ucx/commits/ucx_mt_base

Looks like it landed almost simultaneously. I will check this one.

@artpol84
Contributor Author

@xinzhao3 Can you try the recent UCX master?

@xinzhao3
Contributor

@artpol84 Sure. Doing now.

@artpol84
Contributor Author

Fixed through #2338

@artpol84
Contributor Author

@yosefe @xinzhao3 @jladd-mlnx
I double-checked, and the doorbell pointers are now different for each QP. All looks good.

tid db_ptr
26 0x7f5e3a01d800
27 0x7f5e3a01da00
24 0x7f5e3a5ba800
25 0x7f5e3a5baa00
22 0x7f5e3ab57800
23 0x7f5e3ab57a00
20 0x7f5e3ed94800
21 0x7f5e3ed94a00
18 0x7f5e3f1a6900
19 0x7f5e3f1a6a00
16 0x7f5e3f1a9900
17 0x7f5e3f1a9b00
14 0x7f5e475e2800
15 0x7f5e475e2b00
12 0x7f5e475e5800
13 0x7f5e475e5b00
10 0x7f5e4c003900
11 0x7f5e4c003a00
8 0x7f5e4ebc3900
9 0x7f5e4ebc3b00
6 0x7f5e5476c900
7 0x7f5e5476cb00
4 0x7f5e5476f800
5 0x7f5e5476fb00
2 0x7f5e54772900
3 0x7f5e54772b00
0 0x7f5e54829900
1 0x7f5e54829b00

@xinzhao3
Contributor

@artpol84 that looks great! Thanks!
