Hi guys, we have a production private QBFT PoA network running Besu 24.3.3 that halted this week. The network architecture is:
4 RPC Bonsai Full nodes
1 Archive Forest Full node
4 Validator Bonsai Full nodes
4 additional RPC Bonsai Full nodes that act as “standby validators”.
The nodes are distributed across different AZs and regions: each validator is in its own AZ, two of them in one region and the other two in another. The same layout applies to the “standby validators” and the RPC nodes.
This week the network simply halted. The validators, which were running at INFO log level, stopped producing blocks with no error message, and even stopped logging anything.
To recover quickly, we manually restarted the 4 validators, bringing the network back online. Validator logs are attached.
The network had been running for about 4 months without any issues.
Something curious: some RPC nodes came back and some did not. We restarted the 2 main RPC nodes that applications connect to, and they started syncing blocks, but the other 2 recovered on their own, without intervention.
There are still “standby validators” that we didn’t restart and that remain halted. We changed their log level to DEBUG and TRACE but didn’t get much useful information from it. Both logs are attached as well.
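Since the frozen standby validators still have their RPC ports open, a simple liveness probe can show whether they answer JSON-RPC at all and how far behind they are. A minimal sketch (the node URLs are hypothetical placeholders, not our real endpoints):

```python
import json
import urllib.request

def rpc_payload(method: str, params=None) -> bytes:
    """Build a JSON-RPC 2.0 request body."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params or [],
        "id": 1,
    }).encode()

def block_number(url: str) -> int:
    """Return the node's latest block height (raises if the node is unreachable
    or, as with our frozen validators, never answers)."""
    req = urllib.request.Request(
        url,
        data=rpc_payload("eth_blockNumber"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return int(json.loads(resp.read())["result"], 16)

if __name__ == "__main__":
    # Placeholder hostnames -- substitute your standby validators' RPC endpoints.
    for url in ["http://standby-validator-1:8545", "http://standby-validator-2:8545"]:
        try:
            print(url, block_number(url))
        except Exception as exc:
            print(url, "unreachable:", exc)
```

Comparing the returned heights across nodes shows which ones are stuck at the halt height and which ones kept syncing.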
One last important piece of information: near the time the network froze, we were deploying an application contract to the network. The smart contract is Solidity 0.8.24, deployed with Hardhat.
This contract had been deployed before on one of our development/QA networks without any issues, and after the crash and restart we were able to deploy it to the network again. Still, it is the major event that happened near the crash time, besides the “business as usual” load on other contracts. The applications were running as usual throughout this deployment.
Our guess is that the deployment may have triggered a bug, or that it could simply be a bug in 24.3.3. We are already planning to upgrade to 24.9.1, but we wanted to share this case with the community because it could be a bug, and we’re afraid it could be something related to Bonsai and QBFT.
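For context on why the whole chain stops when the validators freeze: QBFT only finalizes a block once a quorum of validators agrees on it. A quick sketch of the standard BFT arithmetic (the classic 3f+1 formulation, not taken from Besu's source):

```python
def max_faulty(n: int) -> int:
    """Maximum number of Byzantine (or simply silent) validators that a
    BFT network of n validators can tolerate: f = floor((n - 1) / 3)."""
    return (n - 1) // 3

def quorum(n: int) -> int:
    """Votes needed to finalize a block in the classic 3f+1 formulation."""
    return 2 * max_faulty(n) + 1

# With 4 validators, the network tolerates 1 silent validator;
# losing 2 or more already makes a quorum impossible.
print(max_faulty(4), quorum(4))  # → 1 3
```

So with our 4-validator setup, once more than one validator stopped participating, block production had to stop network-wide.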
Thank you
Steps to Reproduce
We’re unable to reproduce the issue in our test/QA environments. We have a QA network that has been running for about the same length of time with no issues.
Versions
Software Version: Besu 24.3.3
Docker image: official hyperledger/besu:24.3.3 (ARM image)
Hi @jframe, we have DEBUG enabled. I ran debug_getBadBlocks against a still-failed node (validator 7, which is still frozen and generating no logs, but with its RPC port open) and against one active validator, validator-1. Validator 7 didn’t return anything, but validator 1 returned lots of info. Both outputs are attached. validator-7-debug-getbadblocks.txt validator-1-debug-getbadblocks.txt
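For anyone wanting to run the same check, this is the request body for Besu’s debug_getBadBlocks method (host and port are placeholders, and the DEBUG RPC API group must be enabled on the node):

```python
import json

def bad_blocks_request() -> str:
    """JSON-RPC 2.0 body for Besu's debug_getBadBlocks (takes no parameters)."""
    return json.dumps({"jsonrpc": "2.0", "method": "debug_getBadBlocks",
                       "params": [], "id": 1})

if __name__ == "__main__":
    # POST this body to the node, e.g.:
    #   curl -X POST -H 'Content-Type: application/json' \
    #        --data '{"jsonrpc":"2.0","method":"debug_getBadBlocks","params":[],"id":1}' \
    #        http://<node-host>:8545
    print(bad_blocks_request())
```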
Smart contract information
Solidity 0.8.24, deployed with Hardhat.
Additional information
Requests: CPU: 500m, memory: 3Gi
Limits: CPU: 4 cores, memory: 8Gi
besu-validator-7-0_besu-validator-7.log