Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate grandpa resource usage #830

Open
eskimor opened this issue May 5, 2022 · 8 comments
Open

Investigate grandpa resource usage #830

eskimor opened this issue May 5, 2022 · 8 comments

Comments

@eskimor
Copy link
Member

eskimor commented May 5, 2022

As can be seen here (data from Kusama), from all the tasks grandpa is apart from node and networking the most CPU intensive:

grandpa-usage-kusama

Talking to @andresilva , we figured that we might be able to significantly reduce this by accepting a slightly slower finality, by simply doubling the gossip_duration configuration parameter. That's a very easy change and the smaller nodes on Kusama used to have quite high CPU usage, so it seemed worth trying out. We doubled that value from 1000 to 2000 ms and deployed that to all Versi nodes:

https://github.com/paritytech/polkadot/pull/5448/files
https://gitlab.parity.io/parity/infrastructure/parity-testnet/-/merge_requests/313/diffs

This was tried on May 4th at around 7:30 PM UTC+2, another upgrade (relaxing escalation of approval distribution) got applied at around May 5th 0:00 UTC+2. On May 5th at around 10:00 UTC+2 master got deployed again.

What we found, poll times of grandpa dropped by around 37%:

grandpa-polling-times

Interestingly also a few subsystems had reduced poll times with that change:

poll-time-collation-generation

gossip-support-poll-duration

default-poll-duration

Parachain block times did not really seem to be affected (or rather fluctuate too much to see a clear pattern):

parachain-blocktimes

As expected finality got a bit slower:

finality-lag

Number of active heads stayed quite stable:

active-heads

Network traffic wise, there was no overall concrete pattern visible.

Conclusion and path forward

By doubling the gossip_duration indeed some load can be reduced, we were not able to find any negative effects apart of a bit slower finality.

If 1 block slower finality seems acceptable, this is an easy way to reduce load a bit. It is likely that the savings will be larger on larger networks. Versi was around 560 nodes at the time of the test. So trying this out on a larger test network, might be a worthwhile experiment.

@andresilva
Copy link
Contributor

It's a bit hard to predict how much impact it will have in the node resources when this change gets deployed on a different network, but since the worst thing that can happen is slightly increased finality times I think it would be worthwhile to test this on Kusama. Since it has 1000 validators it might have a slightly bigger impact than what was observed in Versi.

@rphmeier
Copy link
Contributor

rphmeier commented May 5, 2022

Out of curiosity, what is the heaviest part of GRANDPA? Can we optimize GRANDPA lower-level to achieve similar results without increasing the finality time?

@andresilva
Copy link
Contributor

andresilva commented May 5, 2022

I think it boils down to the amount of data we need to spread every round, which with the current gossip_duration should be about 4 seconds. We only gossip to 4 random peers which is usually enough to spread the votes throughout the network without a lot of churn.

One thing that we could do in the future is changing GRANDPA to only start a new round when there's something new to vote on (maybe with a timeout to start a new round anyway even if there's nothing new). This would make it so that GRANDPA is "synced" with approval voting, avoiding rounds were we finalize the same thing.

#551 is also useful since the API we're currently using doesn't handle backpressure.

@eskimor
Copy link
Member Author

eskimor commented Oct 30, 2022

Looking at this again, it seems that CPU load of grandpa indeed improved. I can see CPU load of grandpa at around 10%-15% now, e.g. here:
Screenshot from 2022-10-30 13-09-14

Anyhow, improving performance on Kusama and not on Polkadot is longer term not a good idea. I will remove the change for now, if we decide we want this we should enable it on both networks later on.

@burdges
Copy link

burdges commented Oct 30, 2022

What about grandpa's gossip topology? I think four random peers does not sound wonderful, either for latency, deduplicaiton, etc. What if we used the grid from approvals?

@h4x3rotab
Copy link

The finality speed affects downstream parachains. Phala runs external offchain workers that follows the finality of parachain blocks, which follows the upstream relay chain. It has a built-in GRANDPA light client. It verifies the relay chain finality, and then checks if the parachain header matches the header in the relay chain. It always has the latency of relay chain finality + parachain finality. If the relay chain has one more block delay, our offchain worker will move slower. If the original delay is around 2 blocks, then 1 block is 50% slower, which is non-trivial.

It's nice to reduce the burden of running a relay chain node, but not ideal with the cost of increasing the downstream application latency. It would be better if we can optimize Grandpa itself, or only make adjustment on non-validator full nodes (not sure if it's relevant to this issue though).

@burdges
Copy link

burdges commented Feb 15, 2023

An 8% reduction in CPU usage sounds worth doing. It'll likely mean more parachain cores and thus more parachains.

Did we see much finality lag on Versi @eskimor ?

Do we know how much grandpa time is signature verification v signature creation (minimal) v networking v business logic? We've partial work towards Rabin-Williams signatures, which theoretically give 10x faster verification with only mild key rotation, but also 256 bytes vs 64 bytes for Ed25519. We could explore single layer XMSS signatures, just blazingly fast, but require aggressive key rotation and 2kb of bandwidth. We would not otoh benefit from these if we've networking limitations, which afaik should not be the case, but libp2p maybe slow.

@eskimor
Copy link
Member Author

eskimor commented Feb 15, 2023

Thanks @h4x3rotab ! This is really useful feedback! @burdges yes, but we have some bigger fish to fry for now :-)

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
* Update substrate to polkadot-v0.9.28

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

* Fix PruningParams fields

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

* Some trivials

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

* Ignore tests calling EthereumRuntimeRPCApi_current_transaction_statuses

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

* ignore tests

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

* Fix fmt

Signed-off-by: koushiro <koushiro.cqx@gmail.com>

Signed-off-by: koushiro <koushiro.cqx@gmail.com>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024
)

* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
bkchr pushed a commit that referenced this issue Apr 10, 2024
* transaction_payment_without_multiplier -> transaction_payment

* tests

* fmt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants
@andresilva @burdges @h4x3rotab @eskimor @rphmeier and others