This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Evict inactive peers from SyncingEngine #13829

Merged
merged 7 commits into master from sync-evict-idle-peers on Apr 21, 2023

Conversation

@altonen (Contributor) commented on Apr 5, 2023

If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing SyncingEngine to free up a slot for some other peer that hopefully is more active.

This needs to be done because the node may falsely believe it has open connections to peers: the inbound substream can be closed without any notification, and a closed outbound substream is noticed only when the node attempts to write to it, which may not happen if the node has nothing to send.

This is a change we probably want to test somewhere before releasing, in case it has adverse effects not observable in my n = 1 dataset.
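
For illustration, a minimal sketch of the bookkeeping this kind of eviction needs might look like the following; the `PeerActivity` struct, its field names, and the `peers_to_evict` helper are assumptions made for this sketch, not the actual code in the PR:

```rust
use std::time::{Duration, Instant};

/// How long both halves of the block-announce substream may stay idle
/// before the peer is considered for eviction (2 minutes per the PR description).
const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(2 * 60);

/// Hypothetical per-peer bookkeeping: when we last received a block
/// announcement from the peer and when we last sent one to it.
struct PeerActivity {
    last_notification_received: Instant,
    last_notification_sent: Instant,
}

impl PeerActivity {
    /// A peer is idle only if *both* directions have been quiet for the whole
    /// threshold; activity in either direction keeps the peer alive.
    fn is_idle(&self, now: Instant) -> bool {
        now.duration_since(self.last_notification_received) > INACTIVITY_EVICT_THRESHOLD
            && now.duration_since(self.last_notification_sent) > INACTIVITY_EVICT_THRESHOLD
    }
}

/// Collect the peers that should be evicted on a periodic check.
fn peers_to_evict<'a>(
    peers: impl Iterator<Item = (&'a str, &'a PeerActivity)>,
    now: Instant,
) -> Vec<&'a str> {
    peers.filter(|(_, activity)| activity.is_idle(now)).map(|(id, _)| id).collect()
}
```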

From logs:

DEBUG tokio-runtime-worker sync: evict peer 12D3KooWCYdHoP38Jymy3UtVM2mVYcgLNAuMWMNzwsB16kEYQsTA since it has been idling for too long    
DEBUG tokio-runtime-worker sync: evict peer 12D3KooWHsvEicXjWWraktbZ4MQBizuyADQtuEGr3NbDvtm5rFA5 since it has been idling for too long    
DEBUG tokio-runtime-worker sync: 12D3KooWCYdHoP38Jymy3UtVM2mVYcgLNAuMWMNzwsB16kEYQsTA disconnected    
DEBUG tokio-runtime-worker sync: 12D3KooWHsvEicXjWWraktbZ4MQBizuyADQtuEGr3NbDvtm5rFA5 disconnected    

Best block and libp2p peers after 30 minutes of running:

[charts: best_block and peers]

@altonen added the A0-please_review, B0-silent, C1-low, D3-trivial 🧸 and T0-node labels on Apr 5, 2023
@altonen requested a review from a team on April 5, 2023 14:14
client/network/sync/src/engine.rs: two review threads (outdated, resolved)
@bkchr (Member) left a comment


We need to change the logic a little bit. When the chain is stalled, we should not start evicting all peers. So we should check whether there have been any block announcements at all within INACTIVITY_EVICT_THRESHOLD, and if there have not, we should not evict all peers. It could happen that we are only connected to peers that have disconnected us. So we should only evict some of the peers, and if the new peers start sending us block announcements, we can also start evicting the old inactive peers.
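
To make the suggestion concrete, a rough sketch of the kind of capped eviction being proposed might look like this; the cap value, the map type, and the `peers_to_evict_capped` helper are made up for illustration and are not code from this PR:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(2 * 60);
/// Made-up cap on how many idle peers may be dropped per check.
const MAX_EVICTIONS_PER_CHECK: usize = 4;

/// If every connected peer looks idle, the chain has probably stalled on our
/// side, so rotate out only a few peers instead of evicting all of them.
fn peers_to_evict_capped(last_activity: &HashMap<String, Instant>, now: Instant) -> Vec<String> {
    let mut idle: Vec<String> = last_activity
        .iter()
        .filter(|(_, last)| now.duration_since(**last) > INACTIVITY_EVICT_THRESHOLD)
        .map(|(peer, _)| peer.clone())
        .collect();

    if idle.len() == last_activity.len() {
        idle.truncate(MAX_EVICTIONS_PER_CHECK);
    }
    idle
}
```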

@altonen (Contributor, Author) commented on Apr 6, 2023

> We need to change the logic a little bit. When the chain is stalled, we should not start evicting all peers. So we should check whether there have been any block announcements at all within INACTIVITY_EVICT_THRESHOLD, and if there have not, we should not evict all peers. It could happen that we are only connected to peers that have disconnected us. So we should only evict some of the peers, and if the new peers start sending us block announcements, we can also start evicting the old inactive peers.

I'm not sure I agree. I'd argue that if we are observing a stalled chain, it is because of the peers we're connected to right now: their inactivity is the direct cause of the chain being stalled for us, and in that case the best course of action is to evict all of the idle ones. When SyncingEngine does that, Peerset notices that the number of actual connections (both inbound and outbound) has fallen below the desired limit and starts establishing/accepting new connections. So the outcome is not that we see the best block stop progressing, kick the inactive peers, drop to 0 peers and stay stuck; rather, SyncingEngine gets a set of fresh connections, some of which might actually be able to provide us with new block announcements.

If even after this process peers keep getting swapped out because they're rejecting us, then it doesn't really matter anymore: it means that all nodes reachable from this node are full and, no matter what is done, the chain will look as though it's stalled. Keeping some of the idle peers as backup doesn't help with that. The only thing I can think of that would help is increasing connectivity with, e.g., hole punching. You could also argue that the chain is kept up to date with some delay, since on connection we'd resync from the new peer using the block advertised in BlockAnnouncesHandshake, but that's just an extra feature.

Also note that the process described above happens constantly during syncing, so idle peers are being swapped out for new peers that may be active. This would ensure that by the time we reach the tip, we are connected to nodes that have given us new block announcements within the last 2 minutes.

Furthermore, always evicting all idle peers, irrespective of whether the chain is stalled and without a special case for preserving observed idle peers, has another benefit: since we're constantly monitoring the connections and removing idling peers, which are then put back into the NotConnected state in Peerset, Peerset will later attempt to reconnect to such a peer, at which point it might actually have a slot for us and be able to give us block announcements. In other words, peer state across the network is dynamic: a peer that rejects us now because it's full might have a free slot by the time we reconnect, since its state updates independently of us. If we instead keep it around as a backup peer during a stalled chain, we never reconnect and so never claim that potentially free slot.

All that said, there is one downside to this change that I can think of: we already have issues with too many full nodes on the network. After this change, each node will be "scanning" the network for free full-node slots, so where a node previously kept a peer in its peer table even though that peer had actually rejected it, it will now evict that peer and look for one with a free slot. The good thing is that the chain should be kept up to date, albeit with a delay, because of new blocks discovered in the handshake message; the downside is that if full-node slots were already under pressure, this change will ensure they stay occupied.
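
For illustration, the eviction step being argued for boils down to roughly the following pattern; the `NetworkHandle` trait, its method signatures, and the reputation value are stand-ins assumed for this sketch, not the real Substrate API:

```rust
/// Stand-in for the network service the syncing engine talks to; the real
/// interface is richer, this only models the two calls the argument above
/// relies on.
trait NetworkHandle {
    /// Lower the peer's reputation so the peerset prefers other candidates.
    fn report_peer(&self, peer: &str, reputation_change: i32);
    /// Close the block-announce substreams. The peerset then sees its
    /// connection count drop below the desired limit and starts
    /// establishing/accepting replacement connections.
    fn disconnect_peer(&self, peer: &str);
}

fn evict_idle_peers<N: NetworkHandle>(network: &N, idle_peers: &[String]) {
    for peer in idle_peers {
        // Matches the debug line visible in the logs above.
        log::debug!(target: "sync", "evict peer {} since it has been idling for too long", peer);
        network.report_peer(peer, -100); // made-up penalty value
        network.disconnect_peer(peer);
        // The evicted peer returns to a "not connected" state, so a later
        // reconnection attempt can still claim a slot on it if one opens up.
    }
}
```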

@altonen (Contributor, Author) commented on Apr 10, 2023

Btw, today I managed to reproduce the sync stalling issue by running a validator long enough. At some point it starts idling and both the best and the finalized block are stuck. The reason is that it's not receiving any block announcements, because the inbound substream is closed and there's no way to detect that right now. I think it is also linked to the time of day: when I've run these tests later in the evening/night, everything works fine, probably because there are fewer people running nodes.

Applying the inactivity fix and decreasing the threshold to 30 seconds works better than 2 minutes. It kicks out all 40 peers a few times, but then it finds some peers that are able to provide blocks.

Interestingly, it looks like providing --in-peers-light 0 fixes the sync issue for nodes both with and without the eviction patch applied, and the node is kept up to date for hours. I will keep testing to see if it produces consistent results or if it was just luck. If it turns out to be a working fix, these sync issues are probably related to paritytech/polkadot-sdk#512, and fixing how handshakes work could be the fix for the issues people have been reporting.

client/network/sync/src/engine.rs: review thread (outdated, resolved)
@bkchr (Member) commented on Apr 13, 2023

> You could also argue that the chain is kept up to date with some delay, since on connection we'd resync from the new peer using the block advertised in BlockAnnouncesHandshake, but that's just an extra feature.

Yeah, that is a good point! That should be enough to work even when the chain stalls. It may take a little more time to get everyone syncing again, but that should be negligible.

@vstakhov (Contributor) commented on Apr 19, 2023

After burning in this PR on Versi, we found that it really helped to reduce peer count spikes and the inbound/outbound libp2p error rates:

[screenshot: Versi peer count and libp2p error rate dashboards, 2023-04-19 12:26]

@altonen (Contributor, Author) commented on Apr 19, 2023

Thanks for running the burn-in. I'll do one more test in my local environment and I think we can merge this after that.

@altonen (Contributor, Author) commented on Apr 21, 2023

bot rebase

@paritytech-processbot

Rebased

@altonen (Contributor, Author) commented on Apr 21, 2023

bot merge

@paritytech-processbot bot merged commit e44b43e into master on Apr 21, 2023
@paritytech-processbot bot deleted the sync-evict-idle-peers branch on April 21, 2023 08:22
jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Apr 24, 2023
jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Apr 24, 2023
nathanwhit pushed a commit to nathanwhit/substrate that referenced this pull request Jul 19, 2023
* Evict inactive peers from `SyncingEngine`

If both halves of the block announce notification stream have been
inactive for 2 minutes, report the peer and disconnect it, allowing
`SyncingEngine` to free up a slot for some other peer that hopefully
is more active.

This needs to be done because the node may falsely believe it has open
connections to peers because the inbound substream can be closed without
any notification and closed outbound substream is noticed only when node
attempts to write to it which may not happen if the node has nothing to
send.

* zzz

* wip

* Evict peers only when timeout expires

* Use `debug!()`

---------

Co-authored-by: parity-processbot <>
Labels
A0-please_review: Pull request needs code review.
B0-silent: Changes should not be mentioned in any release notes.
C1-low: PR touches the given topic and has a low impact on builders.
D3-trivial 🧸: PR contains trivial changes in a runtime directory that do not require an audit.
T0-node: This PR/Issue is related to the topic “node”.
Projects
Status: Done
4 participants