
Consensus should favor expected slot height to ward off delay attack #2913

Straightpool opened this issue Jan 30, 2021 · 43 comments

@Straightpool
Straightpool commented Jan 30, 2021

Internal/External
External

Summary
When a pool produces or propagates a block so late that it collides with the block of the next slot leader, only the VRF value is evaluated to determine the winning block. On its own this is the correct strategy for deciding randomly between competitive slots. With the current logic, however, a delayed block can cause the block of the next slot leader, which was produced and propagated on time, to be lost because of the misconfiguration of the prior slot leader. From the viewpoint of the on-time pool this can be seen as a form of attack.

Similarly, a later slot leader could produce his block several seconds early and collide with the previous block; if his VRF value were lower, he could attack the previous slot leader, as his early block would make it on chain and the on-time block of the prior slot leader would be lost. We do not see this type of attack yet, as it would require a conscious effort; right now this is most likely happening without malice, purely out of misconfiguration.

Steps to reproduce
Steps to reproduce the behavior:

  1. Wait for a situation where two slots are only a few seconds "x" apart
  2. Delay production of the first block by "x" seconds on the first slot leader
  3. Produce the second block on the second slot leader on time
  4. Wait for a case where the block of the first slot leader has the lower VRF value
  5. Observe that the block of the first slot leader makes it on chain while the block of the second slot leader is lost (had both blocks been on time, both would have made it on chain)

Expected behavior
The consensus protocol should evaluate the slot of the blocks and favor the group of blocks which is expected in the current time frame.
By "expected" I refer to the exact block slot height: the algorithm can calculate precisely which slot number a block at this exact moment in time should have.
Only if there is more than one block in that group of "on-time" blocks should the lower VRF decide the winner. The block of a pool which produced its block on time and propagated it swiftly should not be attackable by a prior slot leader who delays his blocks, accidentally or on purpose, or by a following slot leader who produces his block several seconds early by modifying the system time on purpose, a tactic we saw on the ITN to win competitive slots.
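To make this concrete, here is a minimal Haskell sketch (purely illustrative, not the ouroboros-consensus code; the names `Candidate`, `onTime` and `pickWinner` as well as the 5-slot tolerance are assumptions for the example): among candidates of equal chain length, blocks whose slot is consistent with the current time are preferred, and only within that on-time group does the lower VRF decide.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Slot = Word
type Vrf  = Integer   -- stand-in for the real leader VRF output

data Candidate = Candidate { slot :: Slot, vrf :: Vrf } deriving Show

-- A block counts as "on time" if its slot is not in the future and is within
-- a small tolerance (here: 5 slots) of the slot implied by the current time.
onTime :: Slot -> Candidate -> Bool
onTime currentSlot c = slot c <= currentSlot && currentSlot - slot c <= 5

-- Pick the winner among same-height candidates: on-time candidates beat late
-- ones; within the remaining group the lowest VRF wins, as today.
pickWinner :: Slot -> [Candidate] -> Candidate
pickWinner currentSlot cs =
  let punctual = filter (onTime currentSlot) cs
      pool     = if null punctual then cs else punctual
  in  minimumBy (comparing vrf) pool

main :: IO ()
main = print (pickWinner 100 [Candidate 80 1, Candidate 99 7])
-- Under the current rule the slot-80 block would win on VRF (1 < 7);
-- under the proposed rule the on-time slot-99 block wins.
```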

System info (please complete the following information):

  • OS: Ubuntu
  • Version 20.04 LTS
  • Node version: cardano-node 1.25.1 - linux-x86_64 - ghc-8.10
    git rev 9a7331cce5e8bc0ea9c6bfa1c28773f4c5a7000f

Screenshots and attachments
(screenshot: 2021-01-30 17 57 49)
See epoch 244: https://pooltool.io/pool/000006d97fd0415d2dafdbb8b782717a3d3ff32f865792b8df7ddd00/orphans

This is the propagation delay of the slot leader before my block:
(screenshot: 2021-01-30 17 59 21)
See propagation delays of the pool before my block here:
https://pooltool.io/pool/59d12b7a426724961607014aacea1e584f3ebc1196948f42a10893bc/blocks

This is the hash of the winning late block which made it on chain:
ca40eed5fd46f76fbf64e17a98808f098363a83dfe8c100046947505baa1e406

My block made it into the orphan list on pooltool, hash:
97abb258f15995688bdacdc75a054883b22471451026f409a967028ec7b30316

This is a log excerpt from my block producer; the block which should have been the parent of my block arrived a full 4 seconds late:

{"at":"2021-01-28T07:16:47.00Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"97abb258f15995688bdacdc75a054883b22471451026f409a967028ec7b30316@20251916"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}
{"at":"2021-01-28T07:16:48.04Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"ca40eed5fd46f76fbf64e17a98808f098363a83dfe8c100046947505baa1e406@20251913"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}

This is the 2nd time I have observed this; the last time was on December 21st, same pattern, different slot leader:

Block producer log.

{"at":"2020-12-20T03:07:09.01Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"78f0c4a29a9c2b9a628584066f05ba3285f6b7eaac3bc270e353f52a0fa94a8c@16867338"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobat","sev":"Notice","thread":"49"}
{"at":"2020-12-20T03:07:10.64Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"2c237fded6c534200814d991deccc3c99f0a1bae01e603e743d6d5926e8a4519@16867333"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}

78f0c4a29a9c2b9a628584066f05ba3285f6b7eaac3bc270e353f52a0fa94a8c was my block which was orphaned

2c237fded6c534200814d991deccc3c99f0a1bae01e603e743d6d5926e8a4519 was the hash of the block before mine (5 slots before) arriving 6 seconds late.

Mike downloaded the json of one of the blocks of the pool before mine and noticed a delay of about 10 seconds back then:

{"height": 5100112, "slot": 16870897, "theoretical": 1608437188000, "tiptiming": [10547, 10416, 10440, 10509, 10350, 10099, 10432, 10428, 10333, 10378, 10427, 10548, 10219, 10111, 10362, 10293, 10350, 10281, 10296, 10410, 10461, 10419, 10484, 10343, 10350, 10485, 10347, 10330, 10530, 10592, 10327, 10290, 10373, 10332, 10192, 10288, 10390, 10375, 10392, 10301, 10369, 10457, 10350, 10439, 10354, 10493, 10323, 10503, 10407, 10337, 10343, 10398, 10442, 10359, 10367, 10325, 10334, 10305, 10499, 10369, 10346, 10231, 10369, 10311, 10317, 10420, 10505, 10303, 10240, 10310, 10560, 10350, 10360, 11098, 10410, 10310, 10310, 10280, 10320, 10563, 10370, 10330, 10280, 10120, 10400, 10310, 10350, 10310, 10340, 10490, 10460, 10380, 10540, 10410, 10340, -1608437188000, 10330, 10290, 10340, 10370, 10420, 10310, 10260, 10320, 10380, 10440, 10380, 10370, 10350, 10420, 10270, 10517, 10560, 10360, 10110, 10410, 10380, 10300, 10420, 10440, 10390, 10640, 10580, 10580, 10550, 10280, 10740, 10400, 10580, 10380, 10380, 10420, 10380, 10400, 10320, 10370, 10360, 10450, 10300, 10500, 10340, 10410, 10320, 10300, 10550, 10360, 10410, 10320, 10350, 10400, 10350, 10240, 10630, 10370, 10457, 10350, 10330, 10340, 10530, 10280, 10320, 10737, 10310, 10300, 11560, 10479, 10360, 10290, 10430, 10380, 10280, 10360, 10330, 10410, 10310, 10380, 10320, 10320, 11710, 10320, 10310, 10340, 25580, 10450, 10400, 10320, 10440, 11766, 10390, 10310, 12846, 10320, 10320, 12740, 12500, 12952, 13053, 18000, 20610, 20610, 24800], "histogram": "[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

@yihuang

yihuang commented Jan 31, 2021

If we prefer the later slot leader, couldn't the later slot leader attack by ignoring the previous block?

I thought a node only switches to a longer chain? Does it also choose between same-length forks randomly?

@Straightpool
Author

Straightpool commented Jan 31, 2021

If we prefer the later slot leader, couldn't the later slot leader attack by ignoring the previous block?

I thought a node only switches to a longer chain? Does it also choose between same-length forks randomly?

I modified the ticket a bit in an attempt to make this more clear.

A pool can never invalidate the prior leader's block by delaying its own block, as the parent hash will not match and the late block will not be accepted on chain. A later slot leader could, however, produce his block several seconds early and collide with the previous block; if his VRF value were lower, he could attack the previous slot leader, as his early block would make it on chain and the on-time block of the prior slot leader would be lost.

If there are multiple chains with the same length, then the winning chain is chosen randomly by evaluating the VRF value. This is the status quo, which is sufficient for most cases but not in this attack scenario, where first the precisely expected slot number should be taken into account to select the set of viable blocks, and only then, if multiple blocks remain (competitive blocks at the same slot height), should the VRF value be used to decide the winning block.

@yihuang

yihuang commented Jan 31, 2021

A pool can never invalidate the prior leader's block by delaying its own block, as the parent hash will not match and the late block will not be accepted on chain.

Isn't the pool able to set the parent hash to the block before the last one, ignoring the last block and treating its slot as empty?

@Straightpool
Author

Isn't the pool able to set the parent hash to the block before the last one, ignoring the last block and treating its slot as empty?

A hacked block producer could do that, but as we are in a decentralized network, a single malicious actor will not be able to propagate his block.

What you describe is a whole other issue in any case. My issue is about timing of the block only.

@yihuang

yihuang commented Jan 31, 2021

A hacked block producer could do that, but as we are in a decentralized network, a single malicious actor will not be able to propagate his block.

As long as the next slot leader picks the malicious actor's block, there's a high probability that it'll make it into the final chain.
Although I think the reward system is designed in a way that there's no incentive for one to do that.

I think the underlying issue is the same, chain selection rules regarding same length forks.

@dmitrystas

dmitrystas commented Jan 31, 2021

Absolutely the same situation, the block 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061 and the block 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 at slot 20551063.

07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e was propagated almost 3 seconds after creation, even later than 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 was created in its slot, but 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e won the slot(?) battle because its VRF value was lower.

[vmi48486:cardano.node.ChainDB:Notice:266] [2021-01-31 18:22:21.41 UTC] Chain extended, new tip: 979e01bc772f03a4f3a87c8599b2b0b32f290d28b6493a6701b52fe7b44fcc35 at slot 20551050
[vmi48486:cardano.node.ChainDB:Notice:266] [2021-01-31 18:22:34.03 UTC] Chain extended, new tip: 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 at slot 20551063
[vmi48486:cardano.node.ChainDB:Info:266] [2021-01-31 18:22:34.59 UTC] Block fits onto some fork: 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061
[vmi48486:cardano.node.ChainDB:Info:266] [2021-01-31 18:22:34.59 UTC] Valid candidate 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061
[vmi48486:cardano.node.ChainDB:Notice:266] [2021-01-31 18:22:34.59 UTC] Switched to a fork, new tip: 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061
[vmi48486:cardano.node.ChainDB:Notice:266] [2021-01-31 18:22:56.14 UTC] Chain extended, new tip: f2f7c9fcf417e07dfb0269c69714727bb199062f45d95481172e03dec97f19eb at slot 20551085

@Straightpool
Author

[..] block 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061 and the block 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 at slot 20551063. [...] but 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e won the slot(?) battle because its VRF value was lower

Yes, same issue. As it is not the same slot, this is an "unnecessary" "block height battle" by definition. Thanks for chiming in, Dmitry!

@yihuang

yihuang commented Feb 1, 2021

https://github.com/input-output-hk/ouroboros-network/pull/2195/files#diff-3220464c354f6bdc1f4c510bef701cc9d1d6a3f0bb0e1b188c9d01b8ae456716R224
This is the PR with the related logic.
I guess adding a rule like this could fix the issue:

- By the slot number of the chain tip, with higher values preferred;

@Straightpool
Author

Straightpool commented Feb 1, 2021

By the slot number of the chain tip, with higher values preferred

Good find, yes this is the place to add the fix most likely.

Adding the simple logic you quoted would fix the original issue caused by misconfiguration of the prior slot leader.

However, this simple fix could also be dangerous, as it opens up two potential loopholes:

  1. A malicious actor could set his clock backward a couple of seconds. As his slot number would be higher than that of the previous slot leader, his block would win the height battle. This type of attack does not make a lot of sense, so I see higher risk in the second scenario:

  2. A malicious actor could hack the node code to increase the slot number in the block metadata by, say, 3 slots, which would guarantee winning competitive slots.

The safer fix would be something like:

- By the delta between the current time and the expected time of the chain tip's slot, with lower values preferred;
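A rough Haskell sketch of that delta rule (illustrative only; the fixed one-second slot length, the placeholder system start and the function names are assumptions made for the example, not the node's real time handling):

```haskell
import Data.Time.Clock.POSIX (POSIXTime)

type Slot = Word

slotLength :: POSIXTime
slotLength = 1            -- assume one second per slot, as on mainnet Shelley

systemStart :: POSIXTime
systemStart = 0           -- placeholder; the real chain start time would go here

-- Distance between the current time and the nominal start time of the tip's
-- slot. An on-time tip has a small delta, a delayed tip a large one.
-- (Blocks claiming a future slot are rejected separately, so in practice the
-- delta is non-negative.)
tipDelta :: POSIXTime -> Slot -> POSIXTime
tipDelta now tipSlot = now - (systemStart + fromIntegral tipSlot * slotLength)

-- Between two same-length chain tips, does the first one win under the
-- "lower delta preferred" rule?
preferFirst :: POSIXTime -> Slot -> Slot -> Bool
preferFirst now a b = tipDelta now a < tipDelta now b

main :: IO ()
main = print (preferFirst now 90 99)
  where now = systemStart + 100   -- pretend the wall clock is at the start of slot 100
-- False: the slot-90 tip is 10s behind its slot while the slot-99 tip is only
-- 1s behind, so the on-time slot-99 tip would be preferred.
```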

@yihuang

yihuang commented Feb 1, 2021

A malicious actor could set his clock backward a couple of seconds. As his slot number would be higher than that of the previous slot leader, his block would win the height battle. This type of attack does not make a lot of sense, so I see higher risk in the second scenario:

If the malicious actor wants to skip the previous block, can't it simply set the parent hash to the block before that one?
Setting the clock backward might risk generating blocks in the future, which might get rejected I guess; I'm not sure about the details though.

@Straightpool
Author

If the malicious actor wants to skip the previous block, can't it simply set the parent hash to the block before that one?

From my understanding this attack will not work as the nodes which have received the prior block will reject a new block with the wrong parent hash.

Setting the clock backward might risk generating blocks in the future, which might get rejected I guess; I'm not sure about the details though.

From my understanding it would not be rejected today, as the slot number is not evaluated against the clock, so the protocol has no idea the block is from the future.

@yihuang

yihuang commented Feb 1, 2021

If the malicious actor wants to skip the previous block, can't it simply set the parent hash to the block before that one?

From my understanding this attack will not work as the nodes which have received the prior block will reject a new block with the wrong parent hash.

The next slot leader who received both blocks will have two branches of the same length at hand, a similar situation to your original case. And with the change we just proposed, it'll select the malicious one, which has the bigger slot number.

@Straightpool
Author

If the malicious actor wants to skip the previous block, can't it simply set the parent hash to the block before that one?

From my understanding this attack will not work as the nodes which have received the prior block will reject a new block with the wrong parent hash.

The next slot leader who received both blocks will have two branches of the same length at hand, a similar situation to your original case. And with the change we just proposed, it'll select the malicious one, which has the bigger slot number.

No, I still believe your case is not the same case. The next slot leader will reject the malicious block as it is not built upon the latest block the next slot leader already added to its chain tip. The valid block is built upon the correct parent hash and is thus accepted. The malicious block is invalid.

@pabstma

pabstma commented Mar 27, 2021

Hi,

I too have already lost several blocks due to the behavior described above, so I would like to push this issue again. SPOs with well-run servers are penalized for the poor propagation times of other SPOs. It is noticeable that it is often the same pools, and with these pools you can often see that they run only one relay for several producers (which explains the bad propagation of their blocks).

Some excerpts of the logs to reconstruct the problem:

{"at":"2021-03-12T11:13:39.00Z","env":"1.25.1:9a733","ns":["cardano.node.Forge"],"data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":23981328}},"app":[],"msg":"","pid":"20005","loc":null,"host":"v2202011","sev":"Info","thread":"72"}
{"at":"2021-03-12T11:13:39.00Z","env":"1.25.1:9a733","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"4db7f940bca8c4ae822c3b641792d8bf95105304bf3166e92b72faa2eed7c6fa@23981328"},"app":[],"msg":"","pid":"20005","loc":null,"host":"v2202011","sev":"Notice","thread":"62"}

Our node successfully produced the block for slot 23981328 at the shown time, and I can also confirm that it was propagated to the network successfully and in a timely manner (e.g. by looking at the orphans tab on pooltool.io for epoch 253 and observing the corresponding block).

{"at":"2021-03-12T11:13:39.37Z","env":"1.25.1:9a733","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.TrySwitchToAFork","block":{"hash":"b3afa68cfaea6bbb33748cfffd0b6e1ebb7856809241a4f017081c0bd8decc06","kind":"Point","slot":23981308}},"app":[],"msg":"","pid":"20005","loc":null,"host":"v2202011","sev":"Info","thread":"62"}
{"at":"2021-03-12T11:13:39.37Z","env":"1.25.1:9a733","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddBlockValidation.ValidCandidate","block":"b3afa68cfaea6bbb33748cfffd0b6e1ebb7856809241a4f017081c0bd8decc06@23981308"},"app":[],"msg":"","pid":"20005","loc":null,"host":"v2202011","sev":"Info","thread":"62"}
{"at":"2021-03-12T11:13:39.37Z","env":"1.25.1:9a733","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"b3afa68cfaea6bbb33748cfffd0b6e1ebb7856809241a4f017081c0bd8decc06@23981308"},"app":[],"msg":"","pid":"20005","loc":null,"host":"v2202011","sev":"Notice","thread":"62"}

Shortly after (as can be seen from the timestamps) the node received a block for slot 23981308 and then decided to switch to that chain, as this block was for a slot before our block. There was about a 20-second difference between those two blocks, but the propagation of the block before ours was so slow that it reached us after our block was already produced, thus invalidating ours.

I think that this behavior is not in the spirit of a fair and performant network, and therefore a solution for this should be found.

@Straightpool
Author

Straightpool commented Mar 28, 2021

I would love to see this issue at least get acknowledged. Today I lost my fourth block to this issue since I started tracking it on December 21st, 2020.

This time, my ghosted block and the delayed prior slot leader block were just one slot apart (edge case I know, same underlying issue):
(screenshot)

Prop delays on the prior slot leader are only slightly worse, but obviously enough to trigger the issue when there is a mere one-second difference.
(screenshot)

@disassembler
Contributor

Hi, this is currently in the hands of IOG researchers. As has been stated for a number of issues (such as the pledge curve), major changes are not made without the due diligence of the research team analyzing solutions, developers implementing them, and QA/security audits signing off that the issue is resolved. Do not expect this to be fixed anytime soon, but it is acknowledged as an issue and is being looked into by IOG.

@Straightpool
Author

As I know a number of fellow SPOs are following this issue, I am adding the following info as I was not clear on this myself:

"the research pipeline is tracked very differently from the development pipeline. Researchers work on grants, proposals, etc... not GH/JIRA issues. Once it's in development sprint, [...] the issue [is] updated."

So this will take a while. Cheers to IOG for tackling this issue! Looking greatly forward to the time the fix for this painful issue is part of a development sprint!

@Straightpool
Author

This epoch I got a new record for this delay attack: 21 full seconds!

My block hash a099f0c653e443c14c8a2abd77d4e6a886975d3469988d75f0bc00266fa72eb4 at slot 26356865 vs.
block hash 57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8 at slot 26356844,
a full 21 seconds earlier!

There was no prop delay recording for the slow pool in this specific case:
(screenshot)

But when you look at the last 3 block propagations of the slow pool, you see a clear trend; the last block did not even record propagations as it was off the charts:

(screenshot)

This is the log on my block producer:

My block is produced on time (log time is UTC, pooltool screenshots use CEST which is UTC+2)

{"thread":"64","loc":null,"data":{"newtip":"a099f0c653e443c14c8a2abd77d4e6a886975d3469988d75f0bc00266fa72eb4@26356865","kind":"TraceAddBlockEvent.AddedToCurrentC hain"},"sev":"Notice","env":"1.26.1:62f38","msg":"","app":[],"host":"ub20cn","pid":"422995","ns":["cardano.node.ChainDB"],"at":"2021-04-08T23:05:56.03Z"}
{"thread":"74","loc":null,"data":{"val":{"kind":"TraceAdoptedBlock","slot":26356865,"blockSize":4284,"blockHash":"a099f0c653e443c14c8a2abd77d4e6a886975d3469988d7 5f0bc00266fa72eb4"},"credentials":"Cardano"},"sev":"Info","env":"1.26.1:62f38","msg":"","app":[],"host":"ub20cn","pid":"422995","ns":["cardano.node.Forge"],"at": "2021-04-08T23:05:56.03Z"}

Block 57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8 comes 21 seconds late and apparently has a lower VRF, as the BP switches forks, dropping its own block:

{"thread":"11354","loc":null,"data":{"kind":"ChainSyncClientEvent.TraceDownloadedHeader","block":{"kind":"BlockPoint","headerHash":"57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8","slot":26356844}},sev":"Info","env":"1.26.1:62f38","msg":"","app":[],"host":"ub20cn","pid":"422995","ns":["cardano.node.ChainSyncClient"],"at":"2021-04-08T23:05:57.12Z"}
{"thread":"64","loc":null,"data":{"kind":"TraceAddBlockEvent.TrySwitchToAFork","block":{"kind":"Point","hash":"57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8","slot":26356844}},"sev":"Info","env":"126.1:62f38","msg":"","app":[],"host":"ub20cn","pid":"422995","ns":["cardano.node.ChainDB"],"at":"2021-04-08T23:05:57.17Z"}
{"thread":"64","loc":null,"data":{"kind":"TraceAddBlockEvent.AddBlockValidation.ValidCandidate","block":"57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8@26356844"},"sev":"Info","env":"1.26.1:62f38","sg":"","app":[],"host":"ub20cn","pid":"422995","ns":["cardano.node.ChainDB"],"at":"2021-04-08T23:05:57.18Z"}
{"thread":"64","loc":null,"data":{"newtip":"57a951dd269a3d290fee0b3237c04a3cc897779b266c728292606584a076b1b8@26356844","kind":"TraceAddBlockEvent.SwitchedToAFork"},"sev":"Notice","env":"1.26.1:62f38","msg":"","app":[,"host":"ub20cn","pid":"422995","ns":["cardano.node.ChainDB"],"at":"2021-04-08T23:05:57.18Z"}

I also lost another block this epoch 258, with the blocks just one second apart. Here my block was supposed to arrive earlier, but as we have learned, within one second all bets are off. My ghosted block was in slot 308464 with hash f3ea0861e2fe81afaf446bed15a71e292fe621e0c3203170f3215573d120366d, losing against block hash 5866c12b53db68e6e704ae4b25634f95a2cbc114436fb4a41c6572290cf64f58 in slot 308465, one second later.

@Straightpool
Author

IOG researchers apparently started to work on this issue as the latest node release 1.26.1 has the following point in the release notes: "Add a tracer for the delay between when a block should have been forged and when we're ready to adopt it. (#2995)"

@dcoutts
Contributor

dcoutts commented Apr 9, 2021

IOG researchers apparently started to work on this issue as the latest node release 1.26.1 has the following point in the release notes: "Add a tracer for the delay between when a block should have been forged and when we're ready to adopt it. (#2995)"

That's unrelated. That's just to help us with system level benchmarks.

I have however discussed this idea with the research team (some months ago) and from an initial casual review they think it sounds fine. The essential reason it's (probably) fine is that either choice is fine from the point of view of security. It is just a matter of incentives and preferable behaviour. My own view is that it is desirable to use VRF-based deterministic resolution only for blocks within Delta slots of each other, and otherwise to pick based on first arrival. Our current Praos parameter choices put Delta at 5 slots, so 5 seconds.

Note that this of course requires a hard fork, and the Allegra release is the top priority, so if we go forward with this, it will have to wait for a later hard fork (and proper research scrutiny).

As for the question of whether there is an attack or an advantage by forging blocks early, remember that nodes do not adopt, and thus do not propagate, blocks from the future.

@dcoutts
Contributor

dcoutts commented Apr 9, 2021

For IOHK folks who have access to the JIRA ticket, it lives here: https://jira.iohk.io/browse/CAD-2545

From the ticket:

What change is proposed?

The suggestion is that for two chains of equal length, we compare the VRF value of the last block and prefer the lower one – but only for blocks within Delta slots. For blocks older than Delta we simply compare on length and so we can now end up with the case of chains that are equally preferable, in which case we stick with our current chain.

The current ordering rule is the lexicographic combination of the following, in order:

  1. compare chain length
  2. when (same slot number) then (compare if we produced the block ourselves) else equal
  3. when (same block issuer) then (compare operational certificate issue number) else equal
  4. compare (descending) leader VRF value

The suggestion is to change the last one:
4. when (both blocks' slots are within Delta of now) then (compare (descending) leader VRF value) else equal

So that should mean we resolve deterministically on the VRF only when both alternatives are seen within Delta. When one or both are seen outside of Delta then we do not resolve on VRF, and so we can consider chains equally good, in which case we pick the one we adopted first (since we don't switch on equally preferred chains).
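As a purely illustrative Haskell sketch of that modified step 4 (not the actual ouroboros-consensus code; the types and the `vrfTiebreak` name are made up for the example, and Delta is taken as the 5 slots mentioned above):

```haskell
type Slot = Word
type Vrf  = Integer   -- stand-in for the real leader VRF output

delta :: Slot
delta = 5   -- current Praos parameter choice: Delta = 5 slots

-- Step 4 of the ordering, as proposed: fall back to the VRF comparison only
-- when both candidate tips are within Delta slots of the current slot;
-- otherwise treat the chains as equally preferable, so a node keeps the chain
-- it already adopted. GT means the first chain is preferred.
vrfTiebreak :: Slot -> (Slot, Vrf) -> (Slot, Vrf) -> Ordering
vrfTiebreak now (slotA, vrfA) (slotB, vrfB)
  | withinDelta slotA && withinDelta slotB = compare vrfB vrfA  -- descending: lower VRF preferred
  | otherwise                              = EQ
  where withinDelta s = s <= now && now - s <= delta

main :: IO ()
main = do
  print (vrfTiebreak 100 (99, 7) (98, 3))  -- LT: both on time, the lower-VRF chain wins
  print (vrfTiebreak 100 (99, 7) (60, 3))  -- EQ: one tip is far outside Delta, keep the current chain
```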

Comparing on VRF is not something we ever needed to do for security. So the intuition for why this would be ok is that this would just limit the circumstances in which we use the (unnecessary) VRF comparison.

The purpose of the VRF comparison was to reduce the incentive for all block producers to be very close to each other in network distance (time). The ITN had a very strong incentive for this and we observed that many SPOs moved their servers to be near each other. This works against the goal of geographic decentralisation.

@dmitrystas

Today I lost a block in a height battle with a 40-second time difference.

Logs from my relay (the other SPOs I asked have similar logs)

[vmi55625:cardano.node.ChainDB:Notice:100] [2021-10-06 04:15:10.17 UTC] Chain extended, new tip: 098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23 at slot 41927419
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.AddBlock"),("slot",Number 4.1927419e7),("block",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("point",Object (fromList [("headerHash",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("kind",String "BlockPoint"),("slot",Number 4.1927419e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:167823] [2021-10-06 04:15:10.17 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.AddBlock"),("slot",Number 4.1927419e7),("block",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:167823] [2021-10-06 04:15:10.17 UTC] fromList [("point",Object (fromList [("headerHash",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("kind",String "BlockPoint"),("slot",Number 4.1927419e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]
[vmi55625:cardano.node.ChainDB:Info:100] [2021-10-06 04:15:11.49 UTC] Block fits onto some fork: 78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6 at slot 41927380
[vmi55625:cardano.node.ChainDB:Info:100] [2021-10-06 04:15:11.50 UTC] Valid candidate 78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6 at slot 41927380
[vmi55625:cardano.node.ChainDB:Notice:100] [2021-10-06 04:15:11.50 UTC] Switched to a fork, new tip: 78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6 at slot 41927380
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:167823] [2021-10-06 04:15:11.50 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.RollBack"),("slot",Number 4.192738e7),("block",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:11.51 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.RollBack"),("slot",Number 4.192738e7),("block",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:11.52 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerRead.AddBlock"),("slot",Number 4.192738e7),("block",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:11.52 UTC] fromList [("point",Object (fromList [("headerHash",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("kind",String "BlockPoint"),("slot",Number 4.192738e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:167823] [2021-10-06 04:15:11.55 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerRead.AddBlock"),("slot",Number 4.192738e7),("block",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("blockNo",Number 6334732.0)]
[vmi55625:cardano.node.ChainSyncHeaderServer:Info:167823] [2021-10-06 04:15:11.55 UTC] fromList [("point",Object (fromList [("headerHash",String "78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6"),("kind",String "BlockPoint"),("slot",Number 4.192738e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]

098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23 (mine) was created and propagated at the right time; 78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6 was created at 2021-10-06 04:14:31 UTC, but for some reason (propagation delay?) it got to other nodes 40 seconds later and after my block, at 2021-10-06 04:15:11.49 UTC.

@rphair

rphair commented Oct 6, 2021

[vmi55625:cardano.node.ChainDB:Notice:100] [2021-10-06 04:15:10.17 UTC] Chain extended, new tip: 098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23 at slot 41927419

[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.AddBlock"),("slot",Number 4.1927419e7),("block",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("blockNo",Number 6334732.0)]

[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("point",Object (fromList [("headerHash",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("kind",String "BlockPoint"),("slot",Number 4.1927419e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]

The slot numbers appearing in scientific notation — especially with their text not matching the slot numbers in the "new tip" messages — defeats some of the easier methods of tracing these problems (like searching the logfiles with a text editor) or automating log file checks. Does anyone at IOG think anyone will ever look at this issue, after 9 months without a response (related to the OP in a diagnostic sense)? IntersectMBO/cardano-node#2272

@hodlonaut

[vmi55625:cardano.node.ChainDB:Notice:100] [2021-10-06 04:15:10.17 UTC] Chain extended, new tip: 098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23 at slot 41927419

[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("kind",String "ChainSyncServerEvent.TraceChainSyncServerReadBlocked.AddBlock"),("slot",Number 4.1927419e7),("block",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("blockNo",Number 6334732.0)]

[vmi55625:cardano.node.ChainSyncHeaderServer:Info:168223] [2021-10-06 04:15:10.17 UTC] fromList [("point",Object (fromList [("headerHash",String "098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23"),("kind",String "BlockPoint"),("slot",Number 4.1927419e7)])),("kind",String "ChainSyncServerEvent.TraceChainSyncRollForward")]

The slot numbers appearing in scientific notation — especially with their text not matching the slot numbers in the "new tip" messages — defeats some of the easier methods of tracing these problems (like searching the logfiles with a text editor) or automating log file checks. Does anyone at IOG think anyone will ever look at this issue, after 9 months without a response (related to the OP in a diagnostic sense)? input-output-hk/cardano-node#2272

Please stay on topic in this issue thread...

@hodlonaut

For IOHK folks who have access to the JIRA ticket, it lives here: https://jira.iohk.io/browse/CAD-2545

From the ticket:

What change is proposed?

The suggestion is that for two chains of equal length, we compare the VRF value of the last block and prefer the lower one – but only for blocks within Delta slots. For blocks older than Delta we simply compare on length and so we can now end up with the case of chains that are equally preferable, in which case we stick with our current chain.

The current ordering rule is the lexicographic combination of the following, in order:

  1. compare chain length
  2. when (same slot number) then (compare if we produced the block ourselves) else equal
  3. when (same block issuer) then (compare operational certificate issue number) else equal
  4. compare (descending) leader VRF value

The suggestion is to change the last one: 4. when (both blocks' slots are within Delta of now) then (compare (descending) leader VRF value) else equal

So that should mean we resolve deterministically on the VRF only when both alternatives are seen within Delta. When one or both are seen outside of Delta then we do not resolve on VRF, and so we can consider chains equally good, in which case we pick the one we adopted first (since we don't switch on equally preferred chains).

Comparing on VRF is not something we ever needed to do for security. So the intuition for why this would be ok is that this would just limit the circumstances in which we use the (unnecessary) VRF comparison.

The purpose of the VRF comparison was to reduce the incentive for all block producers to be very close to each other in network distance (time). The ITN had a very strong incentive for this and we observed that many SPOs moved their servers to be near each other. This works against the goal of geographic decentralisation.

Hi Duncan

Perhaps I misunderstood something, but..

For the scenario where pool A makes a block at 00:00:00 that takes 19.5s to propagate to most of the network, and pool B makes a block at 00:00:19 that takes 0.6s to propagate to most of the network, and pool C has a leader slot at 00:00:25...

The block from pool A would reach pool C first and the local chain would get extended; the block from pool B would reach C next (its chain length would be the same as that of pool A's block). According to your proposal, since pool A's block slot would be outside of Delta and pool B's block slot would be inside Delta, you're suggesting the current chain would be preferred. Wouldn't this result in pool B's block being discarded?

Would appreciate a clarification.

@74d4

74d4 commented Jan 9, 2022

I too have lost rewards due to this problem.

There is an incentive for a malicious stake pool operator with enough pledge to do the following:

  1. Split up his pledge across many block producers.
  2. When one of his block producers is due to mint, temporarily firewall it so no relays can pull the new block.
  3. Monitor block production on the rest of the Cardano network and wait until another block is produced by someone else.
  4. Immediately on seeing this next block, reconnect network so malicious block can be pulled by relays.
  5. Because his malicious block producer has relatively smaller pledge, his blocks are more likely to win on VRF comparison.
  6. Causing invalidation of blocks by other pools

The benefit comes from the fact that this malicious group will receive proportionally more rewards by causing the invalidation of many blocks produced by larger pools. However many blocks his group produces in total, he can cause the same number of blocks by good operators to be invalidated.

The proposed change above is of some benefit. But it will just limit the window for this malicious behaviour to 5 seconds.

@74d4

74d4 commented Feb 8, 2022

VRF comparison in the decision tree does benefit smaller pools and this is a good thing for increasing decentralisation even if its effect is small.

@brouwerQ
Contributor

brouwerQ commented Mar 7, 2022

The benefit comes from the fact that this malicious group will receive proportionally more rewards by causing the invalidation of many blocks produced by larger pools. However many blocks his group produces in total, he can cause the same number of blocks by good operators to be invalidated.

Why would someone intentionally cause forks and try to win them, while still having a chance to lose them, instead of just basing their block on the previous one? 🤔 How would they get more rewards from this?

@74d4

74d4 commented Mar 28, 2022

How would they get more rewards from this?

By causing other pools' blocks to become "ghosted", fewer valid blocks in total are adopted on chain. Therefore the blocks created by the malicious group earn a higher proportion of the rewards than they should.

@74d4

74d4 commented Mar 28, 2022

Sorry, I was trying to figure out how to edit the post. See my previous post for the complete answer (now edited).

@brouwerQ
Contributor

brouwerQ commented Mar 28, 2022

How would they get more rewards from this?

By causing other pools' blocks to become "ghosted", fewer valid blocks in total are adopted on chain. Therefore the blocks created by the malicious group earn a higher proportion of the rewards than they should.

That's not how it works. If fewer blocks are adopted, fewer rewards are distributed. If e.g. 20,520 blocks were adopted in a certain epoch, only 95% of the rewards coming from the reserve for that epoch are distributed (20,520/21,600). The rest of the rewards stay in the reserve. Your statement does hold true for tx fees, but those are only a tiny fraction of the rewards. It also holds true if more than 21,600 blocks are made in an epoch, but that hasn't happened for a very long time.

See https://hydra.iohk.io/build/13099669/download/1/delegation_design_spec.pdf section 5.4.
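A tiny Haskell illustration of that point (deliberately simplified; it ignores the decentralisation parameter and the other refinements in section 5.4 of the spec, and `distributedFraction` is just a name made up for the example):

```haskell
-- Fraction of the epoch's reserve-funded rewards that gets paid out, as a
-- function of how many blocks were actually adopted vs. the expected number,
-- capped at 1 (simplified model of the performance factor).
distributedFraction :: Int -> Int -> Double
distributedFraction adopted expected =
  min 1 (fromIntegral adopted / fromIntegral expected)

main :: IO ()
main = print (distributedFraction 20520 21600)  -- 0.95
```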

@74d4

74d4 commented Mar 29, 2022

If fewer blocks are adopted, fewer rewards are distributed.

Ahhh. I didn't know that. Thanks for the link.

Still, everything is relative. If a multi-farm of small pools were to act maliciously as I described, it could make many larger pools earn fewer rewards, and thereby its own rewards would be comparatively better. The malicious group would then rank relatively higher on the pool yield metrics leaderboard.

@74d4

74d4 commented Apr 2, 2022

I guess the model change to "Input Endorsers" will make this problem obsolete?

@brouwerQ
Contributor

@dcoutts Is this change still being considered? It has been more than two years now...

@dnadales
Member

@dcoutts Is this change still being considered? It has been more than two years now...

We'll work on integrating this PR into Consensus and we want to roll it out at the next hard-fork boundary.

@brouwerQ
Contributor

brouwerQ commented Jul 2, 2024

If I understand the linked PRs above correctly, this behavior will be used once we enter the Conway era?

@amesgen
Member

amesgen commented Jul 12, 2024

If I understand the linked PRs above correctly, this behavior will be used once we enter the Conway era?

Yes, exactly. Concretely, starting with Conway, the chain order is the lexicographic combination of the following, in order:

  1. compare chain length, preferring larger values
  2. when (same slot and same issuer) compare operational certificate issue number, preferring larger values
  3. when (slots differ by at most 5) compare VRF, preferring smaller values

(emphasis is on what is new in Conway)

The practical effect here is that if a block B should have been able to extend a block A (which should be the case when their slots differ by more than 5), but B actually has the same block number as A, B can no longer win/lose against A via the VRF tiebreaker. Rather, the block that arrived first will stay selected.

On the other hand, slot battles/height battles due to nearby elections are still resolved by the VRF tiebreaker as before, which is the most common case.
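For concreteness, a small Haskell sketch of that chain order (illustrative only; the `Tip` record and its field names are invented for the example and the block number is used as a stand-in for chain length, so this is not the ouroboros-consensus implementation):

```haskell
type Slot    = Word
type BlockNo = Word
type Vrf     = Integer
type Issuer  = String
type OpCert  = Word

data Tip = Tip
  { tipBlockNo :: BlockNo
  , tipSlot    :: Slot
  , tipIssuer  :: Issuer
  , tipOpCert  :: OpCert
  , tipVrf     :: Vrf
  }

-- Lexicographic combination of the three rules; GT means the first chain is
-- preferred. Ordering's Semigroup instance gives the lexicographic behaviour.
conwayOrder :: Tip -> Tip -> Ordering
conwayOrder a b =
     compare (tipBlockNo a) (tipBlockNo b)                       -- 1. longer chain preferred
  <> (if tipSlot a == tipSlot b && tipIssuer a == tipIssuer b    -- 2. same slot and issuer:
        then compare (tipOpCert a) (tipOpCert b)                 --    higher opcert issue number preferred
        else EQ)
  <> (if slotDistance (tipSlot a) (tipSlot b) <= 5               -- 3. slots within 5 of each other:
        then compare (tipVrf b) (tipVrf a)                       --    lower VRF preferred
        else EQ)
  where slotDistance x y = max x y - min x y

main :: IO ()
main = do
  let a = Tip 500 1000 "poolA" 1 3
      b = Tip 500 1004 "poolB" 1 7
  print (conwayOrder a b)  -- GT: slots differ by 4, so the lower-VRF tip a is preferred
  let c = Tip 500 1040 "poolC" 1 7
  print (conwayOrder a c)  -- EQ: slots differ by 40, the VRF does not apply, keep the current chain
```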

Note that one effect of this change is that we no longer have the property that after a period of silence in the network (due to the leader schedule), all honest nodes are guaranteed to select the same chain. This is expected, and not a problem for the Praos security guarantees; further blocks will cause the network to converge on a chain.1

Footnotes

  1. If you are interested: the IOG researchers call chain selection rules that have the pre-Conway property consistent; they analyze both consistent and non-consistent rules in this paper.

@brouwerQ
Contributor

3. when (slots differ by at most 5) compare VRF, preferring smaller values

Why isn't the slot in which the block arrived compared to its own slot number instead and that difference compared to 5 seconds?

@amesgen
Member

amesgen commented Jul 12, 2024

  1. when (slots differ by at most 5) compare VRF, preferring smaller values

Why isn't the slot in which the block arrived compared to its own slot number instead and that difference compared to 5 seconds?

Can you clarify how exactly two blocks of equal height should be compared here, i.e. when should the VRF tiebreaker (not) be applied? Maybe something like: only when both blocks arrive at most 5 seconds after their slot?

@brouwerQ
Contributor

Maybe sth like: only when both blocks arrive at most 5 seconds after their slot?

Yes, I thought the solution would be something like this.

@amesgen
Member

amesgen commented Jul 16, 2024

Note

TL;DR: Both restricting the VRF tiebreaker based on slot distance and based on whether the blocks arrived on time improve upon the status quo in certain (but not necessarily the same) situations. Future work might result in using a combined tiebreaker that has the benefits of both variants.


Let's consider both variants of the VRF tiebreaker with concrete scenarios where they do (not) improve upon the status quo. Here, "improve" refers to avoiding scenarios where, in a height battle, misconfigured/under-resourced pools might still win even though they clearly misbehaved (for example by not extending a block they should have extended), causing a well-operated pool to lose a block.

Restricting the VRF tiebreaker based on slot distance

  1. when (slots differ by at most 5) compare VRF, preferring smaller values

The underlying idea here is that only blocks in nearby slots should have to use the VRF tiebreaker, as they plausibly might be competitors instead of one extending the other.

This is the tiebreaker described in #2913 (comment) and #2913 (comment), and will be used starting with Conway.

A concrete example1 where this would have helped is the following battle2:

(screenshot: Scenario where restricting the VRF tiebreaker based on slot distance would have helped)

Here, FIALA somehow didn't extend CAG's block even though there were 57 slots in-between, but still won the battle due to the VRF tiebreaker.

Starting with the new tiebreaker in Conway, the VRF wouldn't have applied here, so nodes in the network would not have switched to FIALA's block; P2P would have forged on top of CAG's block, causing it to win, and FIALA's block would have been orphaned.

Note that both blocks were delivered on time (FIALA's block was a bit slow with 3s, but still within margin).

Restricting the VRF tiebreaker based on whether the blocks arrived on time

For the purpose of this section, suppose that a block arrives "on time" if it arrived at a node within 5s after the onset of its slot. The idea mentioned above (#2913 (comment)) would be to use this as follows for a modified VRF tiebreaker, in order to further disincentivize pools that propagate their blocks late:

  1. when (both blocks arrived on time) compare VRF, preferring smaller values

A concrete example1 where this would have helped is the following battle2:

(screenshot: Scenario where restricting the VRF tiebreaker based on arrival times would have helped)

Here, UTA won the battle due to its better (lower) VRF, even though its block had a very long propagation time and therefore arrived at pools in the network after BANDA's block. If we had used the tiebreaker just above, UTA's block wouldn't have had the benefit of the VRF tiebreaker, causing BANDA to instead win the battle.

Note that the other tiebreaker, which restricts the VRF comparison based on slot distance, wouldn't have helped here, as the slots differ by only 2.

Also note that evaluating this tiebreaker is now inherently a per-node property (in contrast to restricting the VRF comparison based on slot distance); this might complicate reasoning/thinking about the dynamics of the chain order, but it isn't a hard blocker necessarily.

Combining both tiebreakers

A natural idea would be to combine both variants above:

  1. when (slots differ by at most 5 AND both blocks arrived on time) compare VRF, preferring smaller values

This would have helped in both scenarios above.

We might consider implementing something like this in the future (it doesn't necessarily have to happen at a hard fork boundary), but we don't have any immediate plans.
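A minimal Haskell sketch of the two restriction variants discussed above and their combination (illustrative only; the names are invented, a 5-slot bound is assumed for both conditions, and "arrival slot" stands for the per-node observation of when the block was first seen):

```haskell
type Slot = Word

-- Variant 1: the blocks' slots are close enough to plausibly be competitors.
closeSlots :: Slot -> Slot -> Bool
closeSlots a b = max a b - min a b <= 5

-- Variant 2: a block arrived "on time" at this node, i.e. within 5 slots
-- (seconds) of the onset of its slot. This is inherently per-node.
arrivedOnTime :: Slot -> Slot -> Bool
arrivedOnTime blockSlot arrivalSlot =
  arrivalSlot >= blockSlot && arrivalSlot - blockSlot <= 5

-- Combined restriction: apply the VRF tiebreaker only when the slots are
-- nearby AND both blocks arrived on time at this node.
useVrfTiebreak :: (Slot, Slot) -> (Slot, Slot) -> Bool
useVrfTiebreak (slotA, arrA) (slotB, arrB) =
  closeSlots slotA slotB && arrivedOnTime slotA arrA && arrivedOnTime slotB arrB

main :: IO ()
main = do
  print (useVrfTiebreak (100, 101) (102, 103))  -- True: nearby slots, both on time
  print (useVrfTiebreak (100, 120) (102, 103))  -- False: the first block arrived 20 slots late
```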

Footnotes

  1. Thanks to @gufmar for providing these examples.

  2. There is no intention to single out or blame any particular pool here; this is a purely illustrative example.

@brouwerQ
Contributor

brouwerQ commented Jul 16, 2024

Yes, I do see the difference here, and that being on time alone isn't sufficient.
After reading the suggestion in #2913 (comment) again, it seems that it was already a combination of both solutions!

  1. when (both blocks' slots are within Delta of now) then (compare (descending) leader VRF value) else equal

This requirement expects them to be on time, and also, if they differ by more than 5s, they can't both be in the range [now - 5s, now]. So it would be a solution to both problems, which was apparently envisioned from the beginning.

So this issue should remain open imho, because the change that will happen is only a partial solution to the initial problem cited in this issue...
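A quick Haskell check of that implication (illustrative; `withinDeltaOfNow` and the brute-force range are made up for the sketch): if both slots lie in [now - 5, now] at evaluation time, then each block is at most 5 slots behind now and the two slots differ by at most 5.

```haskell
type Slot = Word

-- The original proposal's condition: the block's slot lies in [now - 5, now].
withinDeltaOfNow :: Slot -> Slot -> Bool
withinDeltaOfNow now s = s <= now && now - s <= 5

-- If both blocks satisfy the condition, then both are at most 5 slots behind
-- now and their slots are within 5 of each other.
impliesBoth :: Slot -> Slot -> Slot -> Bool
impliesBoth now a b =
  not (withinDeltaOfNow now a && withinDeltaOfNow now b)
    || (now - a <= 5 && now - b <= 5 && max a b - min a b <= 5)

main :: IO ()
main = print (and [impliesBoth now a b | now <- [10 .. 40], a <- [0 .. 40], b <- [0 .. 40]])
-- True: the window condition implies both sub-conditions on this range.
```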

@brouwerQ
Contributor

Hmmm, now I also see that this case is even more subtle, because if e.g. the pool had a prop time of 4s, this would still be tolerated by the rule above, and a height battle could also occur with blocks 1, 2, 3 or 4 slots apart. But I still think that bad propagation of more than 5s should be punished here...
