
[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker? #4228

Closed
gitmachtl opened this issue Jul 24, 2022 · 81 comments · Fixed by IntersectMBO/cardano-ledger#2936
Labels
bug (Something isn't working) · comp: ledger · era: babbage · tag: 1.35.2 · type: blocker (Major issue, the tagged version cannot be released) · user type: external (Created by an external user) · Vasil

Comments

@gitmachtl (Contributor) commented Jul 24, 2022

Strange Issue

So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!

This is happening on mainnet right now with 1.35.* nodes. There are many occasions with double and triple height battles where newer v7 1.35.* nodes pick up the wrong parent block and try to build on another v7 block, so v6 1.34.* nodes win those all the time.

I personally lost 10 height battles against v6 nodes within a 1-2 day window. That was 100% of all the height battles I had.

It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If two nodes are scheduled for the next block and one of them is a v7 node, it picks up the hash of the losing block from the previous v7 node and builds on it. Of course it loses against the other v6 node, which builds on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks ⚠️

This is a Vasil HF blocker IMO, because it would lead to SPOs only upgrading to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out beforehand. QA team, please start an investigation on this asap, thanks! 🙏
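A minimal sketch of how one could scan per-height block data for this "built on the losing block" pattern; the `blocks` list and its field names below are hypothetical, not an actual pooltool export.

```python
# Hypothetical sketch: flag cases where a block was built on the *losing*
# side of an earlier height battle. The `blocks` list and its field names
# are made up, not an actual pooltool export.
blocks = [
    # {"hash": "...", "parent": "...", "height": 7530726,
    #  "won": False, "node_version": "1.35"},
]

by_height = {}
for b in blocks:
    by_height.setdefault(b["height"], []).append(b)

for height in sorted(by_height):
    contenders = by_height[height]
    if len(contenders) < 2:
        continue                      # no height battle at this height
    losers = {b["hash"] for b in contenders if not b["won"]}
    # Did anyone at the next height build on one of the losing blocks?
    for child in by_height.get(height + 1, []):
        if child["parent"] in losers:
            print(f"height {height + 1}: a {child['node_version']} node "
                  f"built on orphaned parent {child['parent'][:16]}...")
```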

[pooltool.io screenshots of the height battles]

Here is a nightmare one, v7 built on top of another v7 node (green path):
[pooltool.io screenshots of the chained height battles]

@gitmachtl added the 'bug (Something isn't working)' label Jul 24, 2022
@gitmachtl changed the title from '[BUG] - Urgent - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker?' to '[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker?' Jul 24, 2022
@papacarp

Height battles are on the rise. I'll also work on analyzing chained height battles, but wanted to drop this data here as part of it.

[pooltool.io chart showing the recent rise in height battles]

@njd42 commented Jul 24, 2022

@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.

Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred even though the superseded block had been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.
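One rough way to check this from a node's JSON log, as a sketch only; it assumes the JSON-lines format shown later in this thread, and the log path and block hash prefix are placeholders.

```python
import json

LOG_FILE = "node.json"            # placeholder: path to the node's JSON log
SUPERSEDED_HASH = "4e4e385b"      # placeholder: prefix of the losing block's hash

# Was the superseded block ever adopted by this node before it forged?
found = False
with open(LOG_FILE) as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        data = entry.get("data", {})
        if (data.get("kind") == "TraceAddBlockEvent.AddedToCurrentChain"
                and data.get("newtip", "").startswith(SUPERSEDED_HASH)):
            print(entry.get("at"), "superseded block was received and adopted")
            found = True
            break

if not found:
    print("no AddedToCurrentChain event for that hash; it may never have arrived")
```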

@njd42 commented Jul 24, 2022

@gitmachtl - can I ask, are all these 'v7' systems deployed using the same scripts? Just checking whether there could be a common configuration issue underlying this:
Note: [pooltool.io screenshot] - this block has only one reporting node.

@gitmachtl (Contributor, Author) commented Jul 24, 2022

@njd42 As far as I can tell - and I have spoken to a bunch of those SPOs - they are all running completely different systems: different deploy scripts, different locations, different kinds of hardware (VM, dedicated server, ...).

The number of nodes reporting covers only the ones that are actively reporting to pooltool, but the height battles can be confirmed by others too.

@rene84 commented Jul 24, 2022

For what it's worth: we are running our own compiled node in Docker (own Dockerfile) on AWS ECS and migrated to 1.35.1 (too) early.

Haven't done the amount of investigation that @gitmachtl has, so I can only share sparse findings now, but what I could see is that our BP would mint a block omitting the preceding block (thus getting dropped by the next leader) even though there were sometimes 10+ slots in between. Our relays never received that preceding block (I checked the logs), so naturally the BP also didn't. Will keep an eye on this issue and report when I've done more checks.

@renesecur

Example: block 7530726

Block 7530725 arrived at our BP:

{
    "app": [],
    "at": "2022-07-22T12:25:00.17Z",
    "data": {
        "chainLengthDelta": 1,
        "kind": "TraceAddBlockEvent.AddedToCurrentChain",
        "newtip": "4e4e385bae46bc9edb06c5203a70ccf64dabc64724aad4f4256f1f99cbaaf80d@66926409"
    },
    "env": "1.35.1:c7545",
    "host": "ip-172-2",
    "loc": null,
    "msg": "",
    "ns": [
        "cardano.node.ChainDB"
    ],
    "pid": "82",
    "sev": "Notice",
    "thread": "12135"
}

Then our block was minted and reached our relays (three, each on 1.35.1, one with p2p enabled). Then, 18 slots later, the block from 4f3410f074e7363091a1cc1c21b128ae423d1cce897bd19478e534bb arrived with hash 3150a0d4502f1d0995c4feae3fcb93dd505671cf8dc84bbf0b761e2ee64d70dc, and our relays switched fork.

[screenshot of the relay switching to the other fork]

The number of incoming connections on our relay is healthy and our block propagation seems otherwise fine.

@andreabravetti

@rene84 What if you have two relays, one on 1.34.x and one on 1.35.x? Does your BP receive that preceding block?

@njd42 commented Jul 25, 2022

I think that the connectivity partition is a much more likely root cause than an issue with chain selection given the evidence at hand. However, such a partition is highly unlikely to occur through pure random chance, which suggests some sort of trigger.

Could you scan through your logs looking for, say, ErrorPolicySuspendPeer messages? What we are looking for is a rash of those occurring at around the same time, causing the 1.34.x and 1.35.x nodes to drop connections with each other.
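A rough sketch of that scan, assuming JSON-lines node logs with an "at" timestamp field (as in the log excerpt earlier in the thread); the path and matching are placeholders to adjust for your own setup.

```python
import json
from collections import Counter

suspends_per_minute = Counter()

with open("relay.json") as fh:               # placeholder log path
    for line in fh:
        if "ErrorPolicySuspendPeer" not in line:
            continue
        try:
            at = json.loads(line)["at"]      # e.g. "2022-07-22T12:25:00.17Z"
        except (json.JSONDecodeError, KeyError):
            continue
        suspends_per_minute[at[:16]] += 1    # bucket by YYYY-MM-DDTHH:MM

# A "rash" shows up as a handful of minutes with unusually high counts.
for minute, count in suspends_per_minute.most_common(10):
    print(minute, count)
```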

@coot (Contributor) commented Jul 25, 2022

Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes?

@nemo83 commented Jul 25, 2022

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to #4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 and maybe start generating height battles?

@gitmachtl (Contributor, Author)

Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes?

As far as I know, the SPOs were running 1.35.1 nodes in non-p2p mode on mainnet. It is really strange that 1.35.* nodes keep sticking with the block of another 1.35.* node and do not pick up a resolved 1.34.* block. 1.35.2 was not used, I guess; it's just too fresh out of the box.

@gitmachtl (Contributor, Author)

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to #4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 and maybe start generating height battles?

Could be, but the same relays/BPs later on are producing normal blocks.

@renesecur

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

[charts of ErrorPolicySuspendPeer occurrences on the three relays, times in UTC]

@gitmachtl (Contributor, Author)

I am not sure if we're mixing up two different issues here. I would love to hear @coot's opinion on the height battle issues on mainnet; it's really outstanding and not just a glitch.

@coot (Contributor) commented Jul 25, 2022

We haven't seen such long height battles before; it certainly deserves attention.

@gitmachtl (Contributor, Author)

The strange thing is that in those double height battles, the 1.35.* node only builds on the lost block of the 1.35.* node before it. Normally it should of course pick up the block from the previous winner. And the fact that it then happened again is really bad - it creates mini forks all the time. I was on a full 1.35.1 setup on mainnet at the time, relays and BPs. I reverted back to a full 1.34.1 setup and have had no issues like that since then. I know that this is hard to simulate, because on the testnet we are now in the Babbage era and all nodes are 1.35.*; on mainnet we have the mixed situation. But SPOs are now aware of this, and it could lead to SPOs only updating at the very last moment to stay clear of such bugs. That's not a good way to go into this HF, I would say.

@coot (Contributor) commented Jul 25, 2022

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous and Error, as here (we ought to change it to something better - Error is terrible for grepping).
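For p2p logs, the same kind of scan just swaps the strings being matched; a rough sketch (the exact field layout of the p2p traces may differ, and the path is a placeholder).

```python
# Rough filter for p2p-style demotion traces that carry an Error;
# adjust the path and matching to your own log layout.
with open("relay-p2p.json") as fh:
    for line in fh:
        if "TraceDemoteAsynchronous" in line and "Error" in line:
            print(line.rstrip())
```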

@rene84 commented Jul 25, 2022

What is consistent with what I see in the logs is that some blocks just never make it, even after 10 or more seconds. I've seen that both incoming and outgoing. It seems more likely to be a propagation issue in the network, e.g. 1.34.x and 1.35.x not talking, rather than a bug in the fork selection mechanism in the node itself.

@coot (Contributor) commented Jul 25, 2022

Can I ask for logs from a block producer which was connected to a 1.35.x node, from the time when it lost a slot battle? It would help us validate our hypothesis. If you don't want to share them publicly here, you can send them to my IOHK email address (marcin.szamotulski@iohk.io) or share them through Discord.

@rene84 commented Jul 25, 2022

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

@coot (Contributor) commented Jul 25, 2022

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

I am not sure if this will help, but please send them. It's more interesting to see a 1.34.1 BP connected to a 1.35.1 relay.

@gitmachtl (Contributor, Author)

Here is another example:

[pooltool.io screenshot of another chained height battle]

@gitmachtl (Contributor, Author) commented Jul 25, 2022

And another one. In this example I know for sure that PILOT is running 1.35.0 on all mainnet nodes in normal mode (not p2p).
The same goes for NETA1: 1.35.0 on all mainnet nodes (not p2p).

[pooltool.io screenshot of a height battle involving PILOT and NETA1]

@renesecur

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was on p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous and Error, as here (we ought to change it to something better - Error is terrible for grepping).

I searched for TraceDemoteAsynchronous but did not get any results. I'm pretty confident we haven't enabled the correct trace for that log message. I do see regular PeerSelectionCounters and the number of hotPeers changing from time to time, but no Error log messages for the entire day of the 22nd of July (UTC).

@coot (Contributor) commented Jul 26, 2022

@renesecur you need to enable TracePeerSelection and TracePeerSelectionActions.
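A small sketch of flipping those two tracers on, assuming the node uses a JSON config file with boolean trace flags (the path is a placeholder; back the file up and restart the node afterwards).

```python
import json

CONFIG = "mainnet-config.json"    # placeholder: path to the node's config file

with open(CONFIG) as fh:
    cfg = json.load(fh)

# Turn on the peer-selection tracers mentioned above, then restart the node.
cfg["TracePeerSelection"] = True
cfg["TracePeerSelectionActions"] = True

with open(CONFIG, "w") as fh:
    json.dump(cfg, fh, indent=2)
```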

@coot (Contributor) commented Jul 26, 2022

We would like to get logs from a 1.34.1 relay connected to a 1.35.x BP node which was part of any of the slot battles.

@coot (Contributor) commented Jul 26, 2022

One more favour to ask: those of you who are running a 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

@JaredCorduan (Contributor)

Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

@reqlez commented Jul 26, 2022

Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

I have a day job as well... sorry... I wish I could go 100% Cardano ;-) This is clearly a "just in case" rather than an "emergency" downgrade, and I've been running it for a month now, but I will get to it eventually.

@nemo83 commented Jul 26, 2022

What about Shelley? I know I'm stating the obvious, but 1.35.x is showing problems in Shelley.

The bug that I've found does not affect Shelley. What makes you think Shelley has problems? I'm not aware of a network still running Shelley...

lol, sorry - Alonzo! I meant Alonzo :disappear:

@karknu (Contributor) commented Jul 26, 2022

@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.
Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred even though the superseded block had been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.

I watched a block get 50 reports, and then, like 10 seconds later, a new block came in and orphaned the first block. It seems like the 1.34 nodes never got the 1.35 block, so we only got reports from the 1.35 nodes. Then, once a new block came in, the 1.34 nodes moved the chain forward.

I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.

That should be easy to test by configuring a 1.34 relay to only have a single 1.35 node as its upstream peer.
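A sketch of that test setup: a legacy (non-p2p) topology file for the 1.34 relay with exactly one upstream peer, the 1.35.x relay, written here via a small script. The address, port, and file name are placeholders.

```python
import json

# Legacy (non-p2p) topology with exactly one upstream peer: the 1.35.x relay.
# Address, port and output path are placeholders.
topology = {
    "Producers": [
        {"addr": "relay-135x.example.org", "port": 3001, "valency": 1}
    ]
}

with open("topology-test.json", "w") as fh:
    json.dump(topology, fh, indent=2)
```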

JaredCorduan pushed a commit to IntersectMBO/cardano-ledger that referenced this issue Jul 27, 2022
Additionally, a new golden test for the alonzo fee calculation has been
added, using the block from:
IntersectMBO/cardano-node#4228 (comment)
JaredCorduan pushed a commit to IntersectMBO/cardano-ledger that referenced this issue Jul 28, 2022
Additionally, a new golden test for the alonzo fee calculation has been
added, using the block from:
IntersectMBO/cardano-node#4228 (comment)
@JaredCorduan (Contributor)

I did write "resolves #2936" in the notes of the ledger PR that I think fixes the bug, but I did not realize that those magic words would close issues cross-repo. I think it's best to wait for more testing before this node issue is closed. Sorry.

@JaredCorduan reopened this Jul 28, 2022
JaredCorduan pushed a commit that referenced this issue Jul 28, 2022
Jimbo4350 pushed a commit that referenced this issue Jul 29, 2022
This resolves node #4228

update plutus to the tip of release/1.0.0

This delay SECP256k1 builtins

Updated changelogs and cabal files with 1.35.3

Added changelogs and cabal files updates to version 1.35.3
@JaredCorduan (Contributor)

So what we have is a soft fork where the "conservative nodes" (i.e. 1.34) are winning the longest chain.

I want to clarify that what I said here ☝️ is incorrect. This was actually an accidental hard fork. Blocks produced by nodes with the bug would potentially not be valid according to a 1.34 node. If the bug had instead been that a higher fee was demanded, then it would have been a soft fork, since 1.34 nodes would still validate all the blocks from nodes with the bug.
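To make that distinction concrete, a toy illustration with made-up numbers (this is not the actual ledger fee rule, just the acceptance logic Jared describes).

```python
# Toy model of the distinction (made-up numbers, not the real ledger rule):
# an old (1.34-style) node accepts a block only if its fee meets what the
# old rules require.
def old_node_accepts(block_fee: int, required_fee_old_rules: int) -> bool:
    return block_fee >= required_fee_old_rules

REQUIRED_OLD = 100  # fee the old rules demand (made up)

# Accidental hard fork: the bug made new nodes compute too LOW a fee, so
# their blocks can carry fees the old rules reject -> chain split.
print(old_node_accepts(block_fee=90, required_fee_old_rules=REQUIRED_OLD))   # False

# Hypothetical soft fork: had the bug demanded a HIGHER fee instead, old
# nodes would still have accepted those blocks -> no split.
print(old_node_accepts(block_fee=120, required_fee_old_rules=REQUIRED_OLD))  # True
```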

Jimbo4350 pushed a commit that referenced this issue Aug 3, 2022
A number of fixes have been added to 1.35.3, have added them to the changelogs.

update ledger to the tip of release/1.0.0

This resolves node #4228

update plutus to the tip of release/1.0.0

This delay SECP256k1 builtins

Updated changelogs and cabal files with 1.35.3

Added changelogs and cabal files updates to version 1.35.3

Bump block header protocol version

Alonzo will now broadcast 7.2 (even though we are actually moving to
7.0, this is to distinguish from other versions of the node that are
broadcasting major version 7). Babbage will now broadcast 7.0, the
version that we are actually moving to.
@dorin100 added the 'type: blocker (Major issue, the tagged version cannot be released)', 'user type: external (Created by an external user)', 'comp: ledger', and 'era: babbage' labels Oct 21, 2022
@dorin100

@gitmachtl is it ok to close this issue now?

@gitmachtl (Contributor, Author)

Yes, this was resolved with 1.35.3. Thanks.
