
[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker? #4228

Closed
gitmachtl opened this issue Jul 24, 2022 · 81 comments · Fixed by IntersectMBO/cardano-ledger#2936
Labels
bug (Something isn't working) · comp: ledger · era: babbage · tag: 1.35.2 · type: blocker (Major issue, the tagged version cannot be released) · user type: external (Created by an external user) · Vasil

Comments

@gitmachtl (Contributor) commented Jul 24, 2022

Strange Issue

So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!

This is happening on mainnet right now with 1.35.* nodes. There are many occasions with double and triple height battles where newer v7 1.35.* nodes pick up the wrong parent block and try to build on another v7 block, so v6 1.34.* nodes win those all the time.

I personally lost 10 height battles against v6 nodes within a 1-2 day window. That was 100% of all the height battles I had.

It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If two nodes are scheduled for the next block and one of them is a v7 node, it picks up the hash of the losing block from the previous v7 node and builds on it. Of course it loses against the other v6 node, which builds on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks ⚠️

This is a Vasil HF blocker IMO, because it would lead to SPOs only upgrading to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out beforehand. QA team, please start an investigation on this asap, thanks! 🙏
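A minimal sketch of how one could scan per-height block data for this "built on the losing block" pattern; the `blocks` list and its field names below are hypothetical, not an actual pooltool export.

```python
# Hypothetical sketch: flag cases where a block was built on the *losing*
# side of an earlier height battle. The `blocks` list and its field names
# are made up, not an actual pooltool export.
blocks = [
    # {"hash": "...", "parent": "...", "height": 7530726,
    #  "won": False, "node_version": "1.35"},
]

by_height = {}
for b in blocks:
    by_height.setdefault(b["height"], []).append(b)

for height in sorted(by_height):
    contenders = by_height[height]
    if len(contenders) < 2:
        continue                      # no height battle at this height
    losers = {b["hash"] for b in contenders if not b["won"]}
    # Did anyone at the next height build on one of the losing blocks?
    for child in by_height.get(height + 1, []):
        if child["parent"] in losers:
            print(f"height {height + 1}: a {child['node_version']} node "
                  f"built on orphaned parent {child['parent'][:16]}...")
```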

[pooltool.io screenshots of the height battles]

Here is a nightmare one, v7 built on top of another v7 node (green path):
[pooltool.io screenshots of the chained height battles]

@gitmachtl added the 'bug (Something isn't working)' label Jul 24, 2022
@gitmachtl changed the title from '[BUG] - Urgent - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker?' to '[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker?' Jul 24, 2022
@papacarp

Height battles are on the rise. I'll also work on analyzing chained height battles, but wanted to drop this data here as part of it.

[pooltool.io chart showing the recent rise in height battles]

@njd42 commented Jul 24, 2022

@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.

Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred even though the superseded block had been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.
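One rough way to check this from a node's JSON log, as a sketch only; it assumes the JSON-lines format shown later in this thread, and the log path and block hash prefix are placeholders.

```python
import json

LOG_FILE = "node.json"            # placeholder: path to the node's JSON log
SUPERSEDED_HASH = "4e4e385b"      # placeholder: prefix of the losing block's hash

# Was the superseded block ever adopted by this node before it forged?
found = False
with open(LOG_FILE) as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        data = entry.get("data", {})
        if (data.get("kind") == "TraceAddBlockEvent.AddedToCurrentChain"
                and data.get("newtip", "").startswith(SUPERSEDED_HASH)):
            print(entry.get("at"), "superseded block was received and adopted")
            found = True
            break

if not found:
    print("no AddedToCurrentChain event for that hash; it may never have arrived")
```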

@njd42 commented Jul 24, 2022

@gitmachtl - can I ask, are all these 'v7' systems deployed using the same scripts? Just checking whether there could be a common configuration issue underlying this:
Note: [pooltool.io screenshot] - this block has only one reporting node.

@gitmachtl (Contributor, Author) commented Jul 24, 2022

@njd42 As far as I can tell - and I have spoken to a bunch of those SPOs - they are all running completely different systems: different deploy scripts, different locations, different kinds of hardware (VM, dedicated server, ...).

The number of nodes reporting covers only the ones that are actively reporting to pooltool, but the height battles can be confirmed by others too.

@rene84 commented Jul 24, 2022

For what it's worth: we are running our own compiled node in Docker (own Dockerfile) on AWS ECS and migrated to 1.35.1 (too) early.

Haven't done the amount of investigation that @gitmachtl has, so I can only share sparse findings now, but what I could see is that our BP would mint a block omitting the preceding block (thus getting dropped by the next leader) even though there were sometimes 10+ slots in between. Our relays never received that preceding block (I checked the logs), so naturally the BP also didn't. Will keep an eye on this issue and report when I've done more checks.

@renesecur

Example: block 7530726

Block 7530725 arrived at our BP:

{
    "app": [],
    "at": "2022-07-22T12:25:00.17Z",
    "data": {
        "chainLengthDelta": 1,
        "kind": "TraceAddBlockEvent.AddedToCurrentChain",
        "newtip": "4e4e385bae46bc9edb06c5203a70ccf64dabc64724aad4f4256f1f99cbaaf80d@66926409"
    },
    "env": "1.35.1:c7545",
    "host": "ip-172-2",
    "loc": null,
    "msg": "",
    "ns": [
        "cardano.node.ChainDB"
    ],
    "pid": "82",
    "sev": "Notice",
    "thread": "12135"
}

Then our block was minted and reached our relays (three, each on 1.35.1, one with p2p enabled). Then, 18 slots later, the block from 4f3410f074e7363091a1cc1c21b128ae423d1cce897bd19478e534bb arrived with hash 3150a0d4502f1d0995c4feae3fcb93dd505671cf8dc84bbf0b761e2ee64d70dc, and our relays switched fork.

[screenshot of the relay switching to the other fork]

The number of incoming connections on our relay is healthy and our block propagation seems otherwise fine.

@andreabravetti

@rene84 What if you have two relays, one on 1.34.x and one on 1.35.x? Does your BP receive that preceding block?

@njd42 commented Jul 25, 2022

I think that the connectivity partition is a much more likely root cause than an issue with chain selection given the evidence at hand. However, such a partition is highly unlikely to occur through pure random chance, which suggests some sort of trigger.

Could you scan through your logs looking for, say, ErrorPolicySuspendPeer messages? What we are looking for is a rash of those occurring at around the same time, causing the 1.34.x and 1.35.x nodes to drop connections with each other.
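A rough sketch of that scan, assuming JSON-lines node logs with an "at" timestamp field (as in the log excerpt earlier in the thread); the path and matching are placeholders to adjust for your own setup.

```python
import json
from collections import Counter

suspends_per_minute = Counter()

with open("relay.json") as fh:               # placeholder log path
    for line in fh:
        if "ErrorPolicySuspendPeer" not in line:
            continue
        try:
            at = json.loads(line)["at"]      # e.g. "2022-07-22T12:25:00.17Z"
        except (json.JSONDecodeError, KeyError):
            continue
        suspends_per_minute[at[:16]] += 1    # bucket by YYYY-MM-DDTHH:MM

# A "rash" shows up as a handful of minutes with unusually high counts.
for minute, count in suspends_per_minute.most_common(10):
    print(minute, count)
```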

@coot (Contributor) commented Jul 25, 2022

Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes?

@nemo83 commented Jul 25, 2022

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to #4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 and maybe start generating height battles?

@gitmachtl (Contributor, Author)

Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes?

As far as I know, the SPOs were running 1.35.1 nodes in non-p2p mode on mainnet. It is really strange that 1.35.* nodes keep sticking with the block of another 1.35.* node and do not pick up a resolved 1.34.* block. 1.35.2 was not used, I guess; it's just too fresh out of the box.

@gitmachtl (Contributor, Author)

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to #4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 and maybe start generating height battles?

Could be, but the same relays/BPs later on are producing normal blocks.

@renesecur

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

[charts of ErrorPolicySuspendPeer occurrences on the three relays, times in UTC]

@gitmachtl (Contributor, Author)

I am not sure if we're mixing up two different issues here. I would love to hear @coot's opinion on the height battle issues on mainnet; it's really outstanding and not just a glitch.

@coot (Contributor) commented Jul 25, 2022

We haven't seen such long height battles before; it certainly deserves attention.

@gitmachtl (Contributor, Author)

The strange thing is that in those double height battles, the 1.35.* node only builds on the lost block of the 1.35.* node before it. Normally it should of course pick up the block from the previous winner. And the fact that it then happened again is really bad - it creates mini forks all the time. I was on a full 1.35.1 setup on mainnet at the time, relays and BPs. I reverted back to a full 1.34.1 setup and have had no issues like that since then. I know that this is hard to simulate, because on the testnet we are now in the Babbage era and all nodes are 1.35.*; on mainnet we have the mixed situation. But SPOs are now aware of this, and it could lead to SPOs only updating at the very last moment to stay clear of such bugs. That's not a good way to go into this HF, I would say.

@coot (Contributor) commented Jul 25, 2022

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous and Error, as here (we ought to change it to something better - Error is terrible for grepping).
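For p2p logs, the same kind of scan just swaps the strings being matched; a rough sketch (the exact field layout of the p2p traces may differ, and the path is a placeholder).

```python
# Rough filter for p2p-style demotion traces that carry an Error;
# adjust the path and matching to your own log layout.
with open("relay-p2p.json") as fh:
    for line in fh:
        if "TraceDemoteAsynchronous" in line and "Error" in line:
            print(line.rstrip())
```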

@rene84 commented Jul 25, 2022

What is consistent with what I see in the logs is that some blocks just never make it, even after 10 or more seconds. I've seen that both incoming and outgoing. It seems more likely to be a propagation issue in the network, e.g. 1.34.x and 1.35.x not talking, rather than a bug in the fork selection mechanism in the node itself.

@coot (Contributor) commented Jul 25, 2022

Can I ask for logs from a block producer which was connected to a 1.35.x node, from the time when it lost a slot battle? It would help us validate our hypothesis. If you don't want to share them publicly here, you can send them to my IOHK email address (marcin.szamotulski@iohk.io) or share them through Discord.

@rene84 commented Jul 25, 2022

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

@coot (Contributor) commented Jul 25, 2022

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

I am not sure if this will help, but please send them. It's more interesting to see a 1.34.1 BP connected to a 1.35.1 relay.

@gitmachtl (Contributor, Author)

Here is another example:

[pooltool.io screenshot of another chained height battle]

@gitmachtl (Contributor, Author) commented Jul 25, 2022

And another one. In this example I know for sure that PILOT is running 1.35.0 on all mainnet nodes in normal mode (not p2p).
The same goes for NETA1: 1.35.0 on all mainnet nodes (not p2p).

[pooltool.io screenshot of a height battle involving PILOT and NETA1]

@renesecur

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous and Error, as here (we ought to change it to something better - Error is terrible for grepping).

I searched for TraceDemoteAsynchronous but did not get any results. I'm pretty confident we haven't enabled the correct trace for that log message. I do see regular PeerSelectionCounters and the number of hotPeers changing from time to time, but no Error log messages for the entire day of the 22nd of July (UTC).

@coot (Contributor) commented Jul 26, 2022

@renesecur you need to enable TracePeerSelection and TracePeerSelectionActions.
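A small sketch of flipping those two tracers on, assuming the node uses a JSON config file with boolean trace flags (the path is a placeholder; back the file up and restart the node afterwards).

```python
import json

CONFIG = "mainnet-config.json"    # placeholder: path to the node's config file

with open(CONFIG) as fh:
    cfg = json.load(fh)

# Turn on the peer-selection tracers mentioned above, then restart the node.
cfg["TracePeerSelection"] = True
cfg["TracePeerSelectionActions"] = True

with open(CONFIG, "w") as fh:
    json.dump(cfg, fh, indent=2)
```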

@coot (Contributor) commented Jul 26, 2022

We would like to get logs from a 1.34.1 relay connected to a 1.35.x BP node which was part of any of the slot battles.

@coot (Contributor) commented Jul 26, 2022

One more favour to ask: those of you who are running a 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

@JaredCorduan (Contributor)

Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

@reqlez commented Jul 26, 2022

Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

I have a day job as well... sorry... I wish I could go 100% Cardano ;-) This is clearly a "just in case" rather than an "emergency" downgrade, and I've been running it for a month now, but I will get to it eventually.

@nemo83 commented Jul 26, 2022

What about Shelley? I know I'm stating the obvious, but 1.35.x is showing problems in Shelley.

The bug that I've found does not affect Shelley. What makes you think Shelley has problems? I'm not aware of a network still running Shelley...

lol, sorry - Alonzo! I meant Alonzo :disappear:

@karknu (Contributor) commented Jul 26, 2022

@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.
Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred even though the superseded block had been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.

I watched a block get 50 reports, and then, like 10 seconds later, a new block came in and orphaned the first block. It seems like the 1.34 nodes never got the 1.35 block, so we only got reports from the 1.35 nodes. Then, once a new block came in, the 1.34 nodes moved the chain forward.

I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.

That should be easy to test by configuring a 1.34 relay to only have a single 1.35 node as its upstream peer.
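A sketch of that test setup: a legacy (non-p2p) topology file for the 1.34 relay with exactly one upstream peer, the 1.35.x relay, written here via a small script. The address, port, and file name are placeholders.

```python
import json

# Legacy (non-p2p) topology with exactly one upstream peer: the 1.35.x relay.
# Address, port and output path are placeholders.
topology = {
    "Producers": [
        {"addr": "relay-135x.example.org", "port": 3001, "valency": 1}
    ]
}

with open("topology-test.json", "w") as fh:
    json.dump(topology, fh, indent=2)
```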

JaredCorduan pushed a commit to IntersectMBO/cardano-ledger that referenced this issue Jul 27, 2022
Additionally, a new golden test for the alonzo fee calculation has been
added, using the block from:
IntersectMBO/cardano-node#4228 (comment)
JaredCorduan pushed a commit to IntersectMBO/cardano-ledger that referenced this issue Jul 28, 2022
Additionally, a new golden test for the alonzo fee calculation has been
added, using the block from:
IntersectMBO/cardano-node#4228 (comment)
@JaredCorduan (Contributor)

I did write "resolves #2936" in the notes of the ledger PR that I think fixes the bug, but I did not realize that those magic words would close issues cross-repo. I think it's best to wait for more testing before this node issue is closed. Sorry.

@JaredCorduan reopened this Jul 28, 2022
JaredCorduan pushed a commit that referenced this issue Jul 28, 2022
Jimbo4350 pushed a commit that referenced this issue Jul 29, 2022
This resolves node #4228

update plutus to the tip of release/1.0.0

This delay SECP256k1 builtins

Updated changelogs and cabal files with 1.35.3

Added changelogs and cabal files updates to version 1.35.3
@JaredCorduan (Contributor)

So what we have is a soft fork where the "conservative nodes" (i.e. 1.34) are winning the longest chain.

I want to clarify that what I said here ☝️ is incorrect. This was actually an accidental hard fork. Blocks produced by nodes with the bug would potentially not be valid according to a 1.34 node. If the bug had instead been that a higher fee was demanded, then it would have been a soft fork, since 1.34 nodes would still validate all the blocks from nodes with the bug.
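To make that distinction concrete, a toy illustration with made-up numbers (this is not the actual ledger fee rule, just the acceptance logic Jared describes).

```python
# Toy model of the distinction (made-up numbers, not the real ledger rule):
# an old (1.34-style) node accepts a block only if its fee meets what the
# old rules require.
def old_node_accepts(block_fee: int, required_fee_old_rules: int) -> bool:
    return block_fee >= required_fee_old_rules

REQUIRED_OLD = 100  # fee the old rules demand (made up)

# Accidental hard fork: the bug made new nodes compute too LOW a fee, so
# their blocks can carry fees the old rules reject -> chain split.
print(old_node_accepts(block_fee=90, required_fee_old_rules=REQUIRED_OLD))   # False

# Hypothetical soft fork: had the bug demanded a HIGHER fee instead, old
# nodes would still have accepted those blocks -> no split.
print(old_node_accepts(block_fee=120, required_fee_old_rules=REQUIRED_OLD))  # True
```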

Jimbo4350 pushed a commit that referenced this issue Aug 3, 2022
A number of fixes have been added to 1.35.3, have added them to the changelogs.

update ledger to the tip of release/1.0.0

This resolves node #4228

update plutus to the tip of release/1.0.0

This delay SECP256k1 builtins

Updated changelogs and cabal files with 1.35.3

Added changelogs and cabal files updates to version 1.35.3

Bump block header protocol version

Alonzo will now broadcast 7.2 (even though we are actually moving to
7.0, this is to distinguish from other versions of the node that are
broadcasting major version 7). Babbage will now broadcast 7.0, the
version that we are actually moving to.
@dorin100 added the 'type: blocker (Major issue, the tagged version cannot be released)', 'user type: external (Created by an external user)', 'comp: ledger', and 'era: babbage' labels Oct 21, 2022
@dorin100

@gitmachtl is it ok to close this issue now?

@gitmachtl (Contributor, Author)

Yes, this was resolved with 1.35.3. Thanks.
