
Out-of-sequence (OOS) messages #2067

Closed
bosilca opened this issue Sep 8, 2016 · 31 comments

Comments

@bosilca
Member

bosilca commented Sep 8, 2016

Short version: Out-of-sequence messages exist even in a single-link scenario.

Cause: Intermediate buffering at different layers of the software stack allows messages to be delivered out of sequence.

Target: Most of the BTLs, with a particular emphasis on vader and IB.

Long version: This issue was raised during the discussion about the performance degradation seen between 1.8 and what will eventually become 3.x. While we identified the builtin atomics as the main culprit, it turns out that enabling multi-threading raised a set of additional issues, not necessarily visible outside this particular usage.

Having multiple threads inject messages into the PML in the context of a single communicator leads to a significant number of out-of-sequence messages. The reason is that the per-peer sequence number is taken very early in the software stack (an optimization that makes sense for single-threaded scenarios). Thus, between the moment when a thread acquires the sequence number and the moment when its message is pushed into the network, there are many opportunities for another thread to bypass it and reach the network first. From the receiver's perspective this is seen as an out-of-sequence message, and it will be kept in linear structures and copied multiple times before it becomes in-sequence and can be delivered to the matching logic. There are multiple ways to mitigate this, but that discussion is outside the scope of this particular issue.
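To make the window concrete, here is a minimal sketch of the multi-threaded send path described above; every name in it (peer_t, build_fragment, push_to_btl) is a hypothetical stand-in, not an actual ob1/BTL symbol:

```c
/* Hypothetical sketch of the window described above -- not the actual ob1/BTL
 * code.  Two threads send to the same peer on the same communicator. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { atomic_ushort send_sequence; } peer_t;

/* Stand-ins for "build a fragment" and "hand it to the network". */
static void build_fragment(const void *buf, size_t len, uint16_t seq) { (void) buf; (void) len; (void) seq; }
static void push_to_btl(uint16_t seq) { (void) seq; }

void sketch_isend(peer_t *peer, const void *buf, size_t len)
{
    /* (1) The per-peer sequence number is taken very early in the send path. */
    uint16_t seq = atomic_fetch_add(&peer->send_sequence, 1);

    /* (2) The fragment only reaches the BTL later.  Between (1) and (2) another
     * thread can take seq+1 and hit the network first; the receiver then sees
     * seq+1 before seq and must buffer it as out-of-sequence. */
    build_fragment(buf, len, seq);
    push_to_btl(seq);
}
```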

More worrisome is the fact that we observe out-of-sequence messages using a single link and supposedly ordered BTLs, even when each thread is using a unique communicator. Logically, in this case no out-of-sequence messages should be seen. At this point we assume that the immediate-send optimization, lacking a proper implementation in the BTLs, is allowing messages to bypass other messages waiting in the PML/BTL queues.

@jsquyres
Member

jsquyres commented Sep 8, 2016

I marked this as a blocker bug -- just so that we don't miss it when talking about v2.1.0. I'm not 100% sure that it's actually a blocker, but it does seem like an important performance issue with THREAD_MULTIPLE scenarios containing lots of sending threads. We can figure out whether this is a blocker for v2.1.0 over time.

@jsquyres
Member

jsquyres commented Sep 8, 2016

One scenario that @bosilca cited to me earlier today is:

  1. App calls MPI_ISEND
  2. Get MPI sequence number X
  3. openib BTL tries to do an inline send, and fails (e.g., out of resources). So the send goes onto a queue to send later
  4. Control returns to the app
  5. App calls MPI_ISEND again
  6. Get MPI sequence number (X+1)
  7. openib BTL tries to do an inline send -- without checking the "I failed to inline send so I queued it up" queue -- and succeeds
  8. Later, the "I failed to inline send..." queue is actually progressed and message X is sent

In this scenario, message X is received after message (X+1).
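For illustration, here is a minimal sketch (hypothetical names, not the actual openib code) of the kind of guard that keeps step 7 from overtaking the fragment queued in step 3:

```c
/* Hypothetical sketch -- not the actual openib BTL code -- of a guard that
 * keeps the inline fast path from overtaking fragments queued earlier. */
#include <stdbool.h>
#include <stddef.h>

typedef struct frag { struct frag *next; } frag_t;
typedef struct { frag_t *pending_head; frag_t *pending_tail; } endpoint_t;

/* Stand-in for the resource-limited inline send; returns false when it fails. */
static bool try_inline_send(endpoint_t *ep, frag_t *frag) { (void) ep; (void) frag; return false; }

static void enqueue_pending(endpoint_t *ep, frag_t *frag)
{
    frag->next = NULL;
    if (ep->pending_tail) ep->pending_tail->next = frag;
    else                  ep->pending_head = frag;
    ep->pending_tail = frag;
}

void ordered_send(endpoint_t *ep, frag_t *frag)
{
    /* Attempt the inline path only when nothing is already queued; otherwise
     * this fragment (seq X+1) would overtake the queued one (seq X). */
    if (NULL == ep->pending_head && try_inline_send(ep, frag)) {
        return;                    /* sent immediately, still in order */
    }
    enqueue_pending(ep, frag);     /* falls in line behind any older fragments */
}
```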

@bosilca Are similar scenarios happening in other BTLs?

@bosilca
Member Author

bosilca commented Sep 9, 2016

Yes, @thananon has confirmed that a small number of OOS messages exist with vader.

Going back to your example, this is indeed how things unfold in the single-threaded scenario. In the multi-threaded case it doesn't have to be MPI_Isend. The OOS check only kicks in for the matching fragments, but if you push small messages from multiple threads we hit this issue for every send below the eager limit, and for up to the number of sending threads for larger messages.

@jladd-mlnx
Member

jladd-mlnx commented Sep 9, 2016

@bosilca @thananon I would like to summarize my understanding of the issue. Please correct me where needed.

  1. OOS messages are believed to be the root cause of the observed degradation in message rates between 1.8.8 and master/2.x.
  2. OOS does not happen in a single-threaded scenario in the 1.8 series, but can, and does, occur even in a single-threaded scenario in master/2.x.
  3. This is the result of various thread safety/thread multiple optimizations added to 2.x.

Can this affect other PMLs? @yosefe @gpaulsen please be advised.

@bosilca
Member Author

bosilca commented Sep 13, 2016

@jladd-mlnx no, no, and no. I clearly stated in my original post that this has nothing to do with the performance degradation between 1.8 and 2.x, for which we have already identified the root cause. Let me try to be even clearer on the technical details.

In the single-threaded case, in an injection-rate scenario, some of the messages are delivered out of order. This small number of OOS messages has little impact on performance (and let me stress this again: in the single-threaded case). That being said, in a normal run there is absolutely no reason to have OOS messages, so this might be an indicator of some subtle issue in our communication layers. We identified this issue on master, but we have no plan to look further into the other releases.

Multi-threaded scenarios are something new, and we cannot run any test to completion in 1.8, so this part of the discussion is moot. However, we have highlighted the fact that in a multi-threaded scenario the number of OOS messages is extremely high, and they are certainly responsible for a significant part of the measured performance degradation. At this point it is difficult to quantify this impact, but we are working on it. Again, our efforts are entirely focused on master, and we have no plan to pursue this topic outside this scope.

@thananon
Member

thananon commented Sep 19, 2016

Below is the data I collected with vader. OOS has a huge impact on the injection rate.

Results from Artem's benchmark

Vader on arc00 with window size = 256 and message size = 64 bytes.
Each thread posts a window of 256 requests in a ping-ping/pong-pong pattern (all sends, then all receives), with the thread id as the tag, for 100 iterations. The numbers shown are aggregated from both processes. The results were obtained with @davideberius's version of PAPI integrated into OMPI.
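Roughly, each thread's inner loop looks like the sketch below. This is a reconstruction from the description above, not Artem's actual benchmark code; in the "Same" rows all threads share one communicator, in the "Separated" rows each thread uses its own duplicate.

```c
/* Sketch of the per-thread windowed ping-pong loop (reconstruction, not the
 * real benchmark).  Assumes MPI was initialized with MPI_THREAD_MULTIPLE. */
#include <mpi.h>

#define WINDOW   256
#define MSG_SIZE 64
#define ITERS    100

void thread_body(MPI_Comm comm, int peer, int tid)
{
    char buf[WINDOW][MSG_SIZE];
    MPI_Request reqs[WINDOW];

    for (int iter = 0; iter < ITERS; iter++) {
        /* Post a window of non-blocking sends, tagged with the thread id ... */
        for (int i = 0; i < WINDOW; i++) {
            MPI_Isend(buf[i], MSG_SIZE, MPI_CHAR, peer, tid, comm, &reqs[i]);
        }
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

        /* ... then receive the peer's window of the same size. */
        for (int i = 0; i < WINDOW; i++) {
            MPI_Irecv(buf[i], MSG_SIZE, MPI_CHAR, peer, tid, comm, &reqs[i]);
        }
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
    }
}
```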

2 Threads

| Communicator | Message rate | Unexpected | OOS |
|--------------|--------------|------------|-----|
| Same         | 700437.32    | 39         | 321 |
| Separated    | 821592.27    | 13         | 48  |

4 Threads

| Communicator | Message rate | Unexpected | OOS  |
|--------------|--------------|------------|------|
| Same         | 627965.81    | 329        | 8173 |
| Separated    | 1087428.83   | 529        | 130  |

8 Threads

| Communicator | Message rate | Unexpected | OOS    |
|--------------|--------------|------------|--------|
| Same         | 260344.8     | 1377       | 301402 |
| Separated    | 1060360      | 1314       | 267    |

@thananon
Member

thananon commented Sep 26, 2016

Additional information after some tests in the single-threaded case, posting 500 non-blocking messages:

| BTL    | OOS |
|--------|-----|
| SM     | 0   |
| TCP    | 0   |
| Vader  | 63  |
| OpenIB | 39  |

@hppritcha
Member

Is this really a blocker for 2.1.0? Is anyone working on a fix? It would need to come soon or else we'll need to reset the milestone.

@thananon
Member

thananon commented Oct 3, 2016

If we want 2.1.0 to be efficient with multiple threads, we might need it: as you can see, the performance is affected a lot. I will keep reporting and let you guys decide on the milestone.

Right now the data is pointing towards the individual BTLs. I'm investigating further. The solution might be as easy as disabling inline send. Will report back.

@hjelmn
Member

hjelmn commented Oct 3, 2016

@thananon I expect some OOS messages from vader when switching protocols. It might be worth setting the fast box limit to 1 and measuring after the transition. The count should then be 0, and if it isn't I need to fix something.

@hjelmn
Member

hjelmn commented Oct 3, 2016

Hmm, I do see a case I am not properly checking that can lead to OOS messages under heavy load. Working on a patch for you to test.

@hjelmn
Member

hjelmn commented Oct 3, 2016

@thananon See if this makes a difference for vader:

diff --git a/opal/mca/btl/vader/btl_vader_module.c b/opal/mca/btl/vader/btl_vader_module.c
index f54b407..f2d8f66 100644
--- a/opal/mca/btl/vader/btl_vader_module.c
+++ b/opal/mca/btl/vader/btl_vader_module.c
@@ -497,7 +497,8 @@ static struct mca_btl_base_descriptor_t *vader_prepare_src (struct mca_btl_base_
 #endif

             /* inline send */
-            if (OPAL_LIKELY(MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags)) {
+            if (OPAL_LIKELY((MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags) &&
+                            (MCA_BTL_NO_ORDER == order || 0 == opal_list_get_size (&endpoint->pending_frags)))) {
                 /* try to reserve a fast box for this transfer only if the
                  * fragment does not belong to the caller */
                 fbox = mca_btl_vader_reserve_fbox (endpoint, total_size);

@hjelmn
Member

hjelmn commented Oct 4, 2016

I tested my patch on a Mac and it improved the 1-byte message injection rate by ~50%. There is a problem with larger messages (1k-16k); I am working on a fix for that and hope to have it ready sometime tomorrow. It doesn't address the multi-threaded OOS problem, but it will help somewhat.

@hppritcha
Member

Removing the blocker label for this per discussion at the 10/4/16 devel con call.

@thananon
Member

thananon commented Oct 4, 2016

Turning off inline send with these MCA parameters doesn't reduce the number of OOS messages on either vader or IB (in both THREAD_SINGLE and THREAD_MULTIPLE):

-mca btl_vader_max_inline_send 0
-mca btl_openib_max_inline_send 0
-mca btl_openib_max_inline_data 0

As per @hjelmn's suggestion, I will wait until the patch is finished and then try again. I ran @artpol84's benchmark with a message size of 64 bytes for all tests.

@hjelmn
Member

hjelmn commented Oct 4, 2016

I have the "problem" fixed. The vader inline send is meaningless without xpmem, so I wouldn't expect turning it off to have an effect. I should really condition it on xpmem support.

For IB it is likely due to coalescing. Set btl_openib_use_message_coalescing to 0.

@hjelmn
Member

hjelmn commented Oct 4, 2016

I will open a PR tomorrow for the vader fix.

@thananon
Member

@hjelmn With btl_openib_use_message_coalescing set to 0 I still see some OOS messages, but the number is lower. I haven't run the test enough times to say that with statistical confidence; this is just an eyeball observation.

@hjelmn
Member

hjelmn commented Oct 10, 2016

Ok, the remainder might be due to the transition to eager rdma. Try setting btl_openib_use_eager_rdma to 0.

@thananon
Member

thananon commented Oct 10, 2016

With -mca btl_openib_use_eager_rdma 0 alone, the OOS message count is now 0.

So we found the issue for IB. As you might expect, the performance drops if we set that flag, but at least we know where to look now. :)

@larrystevenwise

Thanks @hjelmn and @thananon, I'll have a look at the eager logic in openib...

@larrystevenwise

Hey @thananon, how can I reproduce the single threaded OOS issue using openib? Specifically, is there some statistic or counter that shows the OOS count? Or do you have a patch? Thanks in advance!

@larrystevenwise

Looking at mca_btl_openib_endpoint_credit_acquire(), I see that if the parameter queue_frag is true, a fragment is queued when resources aren't available, even though OPAL_ERR_OUT_OF_RESOURCE is returned. And I see that mca_btl_openib_endpoint_post_send() calls mca_btl_openib_endpoint_credit_acquire() with queue_frag == true.

Perhaps this is where things get out of order?

@larrystevenwise

larrystevenwise commented Oct 11, 2016

mca_btl_openib_endpoint_send_eager_rdma() calls mca_btl_openib_endpoint_send() which can queue the frag. Then the frag is processed for resend in progress_no_credits_pending_frags() called by btl_openib_handle_incoming() and btl_openib_handle_incoming_completion().

So maybe if mca_btl_openib_endpoint_send_eager_rdma() is called and queues the frag, and then it is called again before progress_no_credits_pending_frags() runs, and this second call does not queue because resources are now available, we get a message sent out of sequence...

@bosilca
Member Author

bosilca commented Oct 11, 2016

@larrystevenwise why isn't mca_btl_openib_endpoint_send_eager_rdma checking whether there are pending messages and sending them before the current fragment? The issue is that OOS affects the injection rate, especially on the receiver side, where we must match the messages in FIFO order.

@thananon
Member

@larrystevenwise I used an internal version of PAPI (another project here) to obtain the numbers. It is not in a releasable state and is a little too messy to share.

The place where we add the counter is pml_ob1_recvfrag.c, line 745, tagged wrong_seq. You can add one line there yourself.
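Schematically, the addition looks like the snippet below; the surrounding check is paraphrased from memory, so the exact field names and line number may differ in your tree:

```c
/* Schematic only: the check below paraphrases the code around the "wrong_seq"
 * spot in pml_ob1_recvfrag.c and may not match your tree exactly. */
static uint32_t oos_seen = 0;          /* quick debug counter, not thread-safe */

    if (OPAL_UNLIKELY((uint16_t) hdr->hdr_seq != (uint16_t) proc->expected_sequence)) {
        oos_seen++;                    /* <-- the one added line */
        /* existing code: stash the fragment until it becomes in-sequence */
    }
```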

@larrystevenwise

Thank you @thananon!

@gpaulsen
Member

gpaulsen commented Dec 2, 2016

Are there directions on how to try running with the integrated PAPI? How integrated is it? I'm interested in learning more.

@bosilca
Member Author

bosilca commented Dec 2, 2016

We are using PAPI capabilities that are not in any release branch that I know of. It will be difficult to share all the code necessary for this with you. But we might be able to expose the counters differently, directly in the PML module. I'll get back to you.

@gpaulsen
Member

gpaulsen commented Dec 2, 2016

Thanks.

@jsquyres
Member

Per Jan 2017 F2F discussion: the only single-threaded issue that still needs to remain open is #2161 (OOS issues in openib).

In the multi-threaded case, fixing the performance will be... challenging. And will remain challenging. 😄 The MPI-3.0 spec is (very) likely to include an info key that allows relaxing ordering of matching, which is the user-level workaround for this issue.
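For reference, once such a key exists the workaround looks roughly like the sketch below; the key name shown is the assertion that was eventually standardized as mpi_assert_allow_overtaking in MPI 4.0 and was not final at the time of this discussion:

```c
/* Sketch of the user-level workaround: create a communicator on which the
 * library is allowed to relax FIFO matching.  The key name is the assertion
 * later standardized in MPI 4.0; it was not final when this was written. */
#include <mpi.h>

MPI_Comm make_relaxed_comm(MPI_Comm base)
{
    MPI_Info info;
    MPI_Comm relaxed;

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");
    MPI_Comm_dup_with_info(base, info, &relaxed);
    MPI_Info_free(&info);
    return relaxed;
}
```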
