Consensus should favor expected slot height to ward off delay attack #2913
If we prefer the later slot leader, can't the later slot leader attack by ignoring the previous block? I thought nodes only switch to a longer chain? Do they also choose randomly between same-length forks?
I modified the ticket a bit in an attempt to make this clearer. A pool can never invalidate the prior leader's block by delaying his own block, as the parent hash would not match, so the late block would not be accepted on chain. A later slot leader could however produce his block multiple seconds early and collide with the previous block; if his vrf value was lower, he could attack the previous slot leader, as his early block would make it on chain and the on-time block of the prior slot leader would be lost. If there are multiple chains of the same length, the winning chain is chosen randomly by evaluating the vrf value. This is the status quo, which is sufficient for most cases but not in this attack scenario, where first the precisely expected slot# should be taken into account to select a set of viable blocks, and only then should the vrf value be used to decide on the winning block if multiple blocks remain (competitive blocks at the same slot height).
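To illustrate the proposed decision order, here is a minimal, hypothetical Python sketch (the function name and tuple shapes are assumptions for illustration, not the actual ouroboros-network code): among equal-length candidate chains, blocks at the precisely expected slot are preferred first, and the lower vrf value only breaks ties within that on-time group.

```python
def pick_winner(candidates, expected_slot):
    """Hypothetical sketch of the proposed tiebreak.

    candidates: list of (slot, vrf_value) tips of equal-length chains.
    First restrict to blocks whose slot matches the slot expected
    'right now'; only then let the lower vrf value decide.
    """
    on_time = [c for c in candidates if c[0] == expected_slot]
    pool = on_time if on_time else candidates
    # Status quo: lower vrf wins outright; here it only breaks ties
    # among the on-time blocks (if any exist).
    return min(pool, key=lambda c: c[1])
```

For example, a delayed block at slot 100 with a lower vrf (0.1) would no longer beat an on-time block at the expected slot 103 with vrf 0.5, whereas the status quo (plain `min` on vrf) would pick the delayed block.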
Isn't the pool able to set the parent hash to the one before the last block, ignoring the last block and treating its slot as empty?
A hacked block producer could do that, but as we are in a decentralized network, a single malicious actor will not be able to propagate his block. What you describe is a whole other issue in any case. My issue is about the timing of the block only.
As long as the next slot leader picks the malicious actor's block, there's a high probability that it'll make it into the final chain. I think the underlying issue is the same: chain selection rules regarding same-length forks.
Absolutely the same situation: the block 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e at slot 20551061 and the block 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 at slot 20551063. 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e was propagated almost 3 seconds after its creation, even later than 05edad11b536a25efdc312ae546c93457015fc2750cf88271932067d87cfeb92 was created in its slot, but 07fca3a9479c79cd7e434561218b17b9ecd990e62dca444b25afd962656a0e2e won the slot(?) battle because its vrf value was lower.
Yes, same issue. As it is not the same slot, this is an "unnecessary" block height battle by definition. Thanks for chiming in Dimitry!
https://github.com/input-output-hk/ouroboros-network/pull/2195/files#diff-3220464c354f6bdc1f4c510bef701cc9d1d6a3f0bb0e1b188c9d01b8ae456716R224
Good find, yes this is the place to add the fix most likely. Adding the simple logic you quoted would fix the original issue due to misconfiguration by the prior slot leader. However, this simple fix could also be dangerous, as it opens up 2 potential loopholes:
The safer fix would be something like:
If the malicious actor wants to skip the previous block, can't it simply set the parent hash to the one before that block?
From my understanding this attack will not work as the nodes which have received the prior block will reject a new block with the wrong parent hash.
It would not be rejected now, from my understanding, as the slot number is not evaluated, so the protocol has no idea the block is from the future.
The next slot leader who received both blocks will have two branches of the same length at hand, a similar situation to your original case. And with the change we just proposed, it'll select the malicious one, which has the bigger slot number.
No, I still believe your case is not the same case. The next slot leader will reject the malicious block as it is not built upon the latest block the next slot leader has already added to its chain tip. The valid block is built upon the correct parent hash and is thus accepted. The malicious block is invalid.
Hi, I too have already lost several blocks due to the behavior described above, so I would like to push this issue again. SPOs with reasonably run servers are penalized for the poor propagation times of other SPOs. It is noticeable that it is often the same pools, and these pools often run only one relay for several producers (which explains the bad propagation of their blocks). Some excerpts of the logs to reconstruct the problem:
Our node successfully produced the block for slot 23981328 at the shown time and I can also confirm that it was propagated to the network successfully and in a timely manner (e.g. by looking at the orphans tab in pooltool.io for epoch 253 and observe the corresponding block).
Shortly after (as can be seen by the timestamps) the node received a block for slot 23981308 and then decided to switch to that chain, as this block was for a slot before ours. There were about 20 seconds between those two blocks, but the propagation of the block before ours was so slow that it reached us after our block was already produced, thus invalidating it. I think this behavior is not in the spirit of a fair and performant network, and therefore a solution should be found.
Hi, this is currently in the hands of IOG researchers. As has been stated on a number of issues (such as the pledge curve), major changes are not made without the due diligence of the research team analyzing solutions, developers implementing them, and QA/security audits signing off that the issue is resolved. Do not expect this to be fixed anytime soon, but it is acknowledged as an issue and is being looked into by IOG.
As I know a number of fellow SPOs are following this issue, I am adding the following info as I was not clear on this myself: "the research pipeline is tracked very differently from the development pipeline. Researchers work on grants, proposals, etc... not GH/JIRA issues. Once it's in development sprint, [...] the issue [is] updated." So this will take a while. Cheers to IOG for tackling this issue! Looking greatly forward to the time the fix for this painful issue is part of a development sprint!
IOG researchers apparently started to work on this issue as the latest node release 1.26.1 has the following point in the release notes: "Add a tracer for the delay between when a block should have been forged and when we're ready to adopt it. (#2995)" |
That's unrelated. That's just to help us with system level benchmarks. I have however discussed this idea with the research team (some months ago) and from an initial casual review they think it sounds fine. The essential reason it's (probably) fine is that either choice is fine from the point of view of security. It is just a matter of incentives and preferable behaviour. My own view is that it is desirable to use VRF-based deterministic resolution only for blocks within Delta slots of each other, and otherwise to pick based on first arrival. Our current Praos parameter choices put Delta at 5 slots, so 5 seconds. Note that this of course requires a hard fork, and the Allegra release is the top priority, so if we go forward with this, it will have to wait for a later hard fork (and proper research scrutiny). As for the question of whether there is an attack or an advantage by forging blocks early, remember that nodes do not adopt, and thus do not propagate, blocks from the future.
For IOHK folks who have access to the JIRA ticket, it lives here: https://jira.iohk.io/browse/CAD-2545 From the ticket: What change is proposed? The suggestion is that for two chains of equal length, we compare the VRF value of the last block and prefer the lower one – but only for blocks within Delta slots. For blocks older than Delta we simply compare on length, and so we can now end up with the case of chains that are equally preferable, in which case we stick with our current chain. The current ordering rule is the lexicographic combination of the following, in order:
The suggestion is to change the last one: So that should mean we resolve deterministically on the VRF only when both alternatives are seen within Delta. When one or both are seen outside of Delta then we do not resolve on VRF, and so we can consider chains equally good, in which case we pick the one we adopted first (since we don't switch on equally preferred chains). Comparing on VRF is not something we ever needed to do for security. So the intuition for why this would be ok is that this would just limit the circumstances in which we use the (unnecessary) VRF comparison. The purpose of the VRF comparison was to reduce the incentive for all block producers to be very close to each other in network distance (time). The ITN had a very strong incentive for this and we observed that many SPOs moved their servers to be near each other. This works against the goal of geographic decentralisation. |
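The suggested rule above can be sketched in a few lines. This is illustrative Python, not the Haskell consensus code; the `(length, slot, vrf)` tuples and the precomputed `both_seen_within_delta` flag are assumptions made to keep the example self-contained.

```python
DELTA = 5  # current Praos parameter choice: Delta = 5 slots = 5 seconds

def prefer_candidate(current, candidate, both_seen_within_delta):
    """Return True if we should switch from `current` to `candidate`.

    current / candidate: (length, slot, vrf) of each chain's tip.
    Sketch of the suggested rule: compare on length first; resolve
    equal lengths via the VRF only when both blocks were seen within
    Delta, otherwise treat the chains as equally good and stick with
    the chain adopted first.
    """
    if candidate[0] != current[0]:
        return candidate[0] > current[0]   # longer chain always wins
    if both_seen_within_delta:
        return candidate[2] < current[2]   # lower VRF wins the tie
    return False                           # equally preferred: keep current
```

Note how this only *limits* the circumstances in which the VRF comparison fires; a strictly longer chain still wins unconditionally, as before.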
Today I lost a block in a height battle with 40 seconds time difference. Logs from my relay (the other SPOs I asked have similar)
098926060426ebcdc1c3b1f2363ea8ea8a4116eb763be8c2e1737d64320a1c23 (mine) was created and propagated at the right time; 78a8f0094fac3e3f366acdd4071f44b11c6901a2c43fcaf5085dfa1ec53766d6 was created 2021-10-06 04:14:31 UTC, but for some reason (propagation delay?) it reached other nodes 40 seconds later and after my block, at 2021-10-06 04:15:11.49 UTC.
The slot numbers appearing in scientific notation — especially with their text not matching the slot numbers in the "new tip" messages — defeats some of the easier methods of tracing these problems (like searching the logfiles with a text editor) or automating log file checks. Does anyone at IOG think anyone will ever look at this issue, after 9 months without a response (related to the OP in a diagnostic sense)? IntersectMBO/cardano-node#2272 |
Please stay on topic in this issue thread... |
Hi Duncan Perhaps I misunderstood something, but.. For the scenario where pool A makes a block at 00:00:00 that takes 19.5s to propagate to most of the network, and pool B makes a block at 00:00:19 that takes 0.6s to propagate to most of the network, and pool C has a leader slot at 00:00:25... Block from pool A would reach pool C first, the local chain would get extended, and block from pool B would reach C next (its chain length would be the same as that of pool A's block). According to your proposal, since pool A's block slot would be outside of delta, and pool B's block slot would be inside delta, you're suggesting the current chain would be preferred - wouldn't this result in pool B's block being discarded? Would appreciate a clarification.
I too have lost rewards due to this problem. There is an incentive for a malicious stake pool operator with enough pledge to do the following:
Benefit is achieved because this malicious group will receive proportionally more rewards by causing the invalidation of many blocks produced by larger pools. For however many blocks his group produces in total, he can cause the same number of blocks by good operators to be invalidated. The proposed change above is of some benefit, but it will just limit the window for this malicious behaviour to 5 seconds.
VRF comparison in the decision tree does benefit smaller pools, and this is a good thing for increasing decentralisation even if its effect is small.
Why would someone intentionally cause forks and try to win them, while still having a chance to lose them, instead of just building their block on the previous one? 🤔 How will they get more rewards from this?
By causing other pools' blocks to become "ghosted", this results in fewer total valid blocks adopted on chain. Therefore the blocks created by the malicious group yield a higher proportion of the rewards than they should have.
Sorry, I was trying to figure out how to edit the post. See my previous post for the complete answer (now edited).
That's not how it works. If fewer blocks are adopted, fewer rewards are distributed. If e.g. 20,520 blocks were adopted for a certain epoch, only 95% of the rewards coming from the reserve for that epoch are distributed (20,520/21,600). The rest of the rewards stay in the reserve. Your statement does hold true for tx fees, though, but those are only a tiny fraction of the rewards. It also holds true if more than 21,600 blocks are made in an epoch, but that hasn't happened for a very long time. See https://hydra.iohk.io/build/13099669/download/1/delegation_design_spec.pdf section 5.4.
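The 95% figure follows directly from the ratio of adopted to expected blocks. A small sketch of that arithmetic (the constant and function name are assumptions for illustration; the actual rule is in the delegation design spec, section 5.4):

```python
EXPECTED_BLOCKS_PER_EPOCH = 21_600  # expected number of blocks per epoch

def distributed_fraction(blocks_adopted):
    # If fewer blocks are adopted than expected, only that fraction of
    # the epoch's reserve rewards is paid out; the rest stays in the
    # reserve. More blocks than expected cannot raise it above 100%.
    return min(blocks_adopted, EXPECTED_BLOCKS_PER_EPOCH) / EXPECTED_BLOCKS_PER_EPOCH
```

So ghosting blocks shrinks the total payout rather than redistributing it; the attacker's gain is only *relative* standing, as the subsequent comments discuss.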
Ahhh, I didn't know that. Thanks for the link. Still, everything is relative. If a multi-farm of small pools were to act maliciously as I described, it could make many larger pools earn fewer rewards, and thereby its own rewards are comparatively better. The malicious group would then rank relatively higher on the pool yield metrics leaderboard.
I guess the model change to "Input Endorsers" will make this problem obsolete?
@dcoutts Is this change still being considered? It has been more than two years now...
If I understand the linked PRs above correctly, this behavior will be used once we enter the Conway era?
Yes, exactly. Concretely, starting with Conway, the chain order is the lexicographic combination of the following, in order:
(Emphasis is on what is new in Conway.) The practical effect here is that if a block [...] On the other hand, slot battles/height battles due to nearby elections are still resolved by the VRF tiebreaker as before, which is the most common case. Note that one effect of this change is that we no longer have the property that after a period of silence in the network (due to the leader schedule), all honest nodes are guaranteed to select the same chain. This is expected, and not a problem for the Praos security guarantees; further blocks will cause the network to converge on a chain.
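A hypothetical sketch of the Conway-era rule as described here, using simplified `(length, slot, vrf)` tuples (the real implementation lives in the Haskell consensus code and differs in detail): the VRF tiebreaker is consulted only when the two tips' slots are close; otherwise equal-length chains are equally preferable and the node keeps the chain it adopted first.

```python
MAX_VRF_SLOT_DISTANCE = 5  # assumed Delta: VRF tiebreak only applies
                           # when the competing slots are this close

def conway_prefer(current, candidate):
    """Return True if a node should switch to `candidate`.

    current / candidate: (length, slot, vrf) of each chain's tip.
    Longer chains still win outright; for equal lengths, the VRF
    tiebreaker applies only when the tips' slots are within
    MAX_VRF_SLOT_DISTANCE of each other.
    """
    if candidate[0] != current[0]:
        return candidate[0] > current[0]
    if abs(candidate[1] - current[1]) <= MAX_VRF_SLOT_DISTANCE:
        return candidate[2] < current[2]
    return False  # equally preferable: keep the chain adopted first
```

For instance, in a height battle between tips 57 slots apart (like the FIALA/CAG example discussed further down), the candidate's lower VRF no longer matters and nodes keep their current chain.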
Why isn't the slot in which the block arrived compared to its own slot number instead, with that difference compared to 5 seconds?
Can you clarify how exactly two blocks of equal height should be compared here, i.e. when should the VRF tiebreaker (not) be applied? Maybe something like: only when both blocks arrive at most 5 seconds after their slot?
Yes, I thought the solution would be something like this.
Note TL;DR: Both restricting the VRF tiebreaker based on slot distance and based on whether the blocks arrived on time improve upon the status quo in certain (but not necessarily the same) situations. Future work might result in a combined tiebreaker that has the benefits of both variants. Let's consider both variants of the VRF tiebreaker with concrete scenarios where they do (not) improve upon the status quo. Here, "improve" refers to avoiding scenarios where, in a height battle, misconfigured/underresourced pools might still win even though they clearly misbehaved (for example by not extending a block they should have extended), causing a well-operated pool to lose a block.
Restricting the VRF tiebreaker based on slot distance
The underlying idea here is that only blocks in nearby slots should have to use the VRF tiebreaker, as they plausibly might be competitors instead of one extending the other. This is the tiebreaker described in #2913 (comment) and #2913 (comment), and it will be used starting with Conway. A concrete example where this would have helped is the following battle: Here, FIALA somehow didn't extend CAG's block even though there were 57 slots in-between, but still won the battle due to the VRF tiebreaker. With the new tiebreaker in Conway, the VRF wouldn't have applied here; nodes in the network would not have switched to FIALA's block, resulting in P2P forging on top of CAG's block, causing it to win, and FIALA's block would have been orphaned. Note that both blocks were delivered on time (FIALA's block was a bit slow with 3s, but still within margin).
Restricting the VRF tiebreaker based on whether the blocks arrived on time
For the purpose of this section, suppose that a block arrives "on time" if it arrived at a node within 5s after the onset of its slot. The idea mentioned above (#2913 (comment)) would be to use this for a modified VRF tiebreaker, in order to further disincentivize pools that propagate their blocks late:
A concrete example where this would have helped is the following battle: Here, UTA won the battle due to its better (lower) VRF, even though its block had a very long propagation time and therefore arrived at pools in the network after BANDA's block. If we had used the tiebreaker just above, UTA's block wouldn't have had the benefit of the VRF tiebreaker, causing BANDA to win the battle instead. Note that the other tiebreaker, which restricts the VRF comparison based on slot distance, wouldn't have helped here, as the slots differ by only 2. Also note that evaluating this tiebreaker is now inherently a per-node property (in contrast to restricting the VRF comparison based on slot distance); this might complicate reasoning about the dynamics of the chain order, but it isn't necessarily a hard blocker.
Combining both tiebreakers
A natural idea would be to combine both variants above:
This would have helped in both scenarios above. We might consider implementing something like this in the future (it doesn't necessarily have to happen at a hard fork boundary), but we don't have any immediate plans.
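One plausible reading of the combined variant, as an illustrative Python sketch (all names, tuple shapes, and thresholds are assumptions, not a committed design): the VRF tiebreak fires only when the slots are close *and* both blocks arrived on time; in every other equal-length case the node keeps the chain it adopted first.

```python
DELTA_SLOTS = 5        # assumed slot-distance bound for the VRF tiebreak
ON_TIME_SECONDS = 5.0  # assumed arrival deadline after the slot's onset

def combined_prefer(current, candidate):
    """Return True if this node should switch to `candidate`.

    current / candidate: (length, slot, vrf, arrival_delay_seconds),
    where arrival_delay_seconds is how long after its slot's onset the
    block reached *this* node (hence a per-node quantity).
    """
    if candidate[0] != current[0]:
        return candidate[0] > current[0]
    close = abs(candidate[1] - current[1]) <= DELTA_SLOTS
    on_time = (candidate[3] <= ON_TIME_SECONDS
               and current[3] <= ON_TIME_SECONDS)
    if close and on_time:
        return candidate[2] < current[2]
    return False  # equally preferable: keep the chain adopted first
```

Under this sketch, a late-arriving candidate (like UTA's block) or a far-away slot (like FIALA's) loses the benefit of its lower VRF, matching both scenarios above.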
Yes, I see indeed the difference here and that being on time alone isn't sufficient.
This requirement expects them to be on time, and if they differ by more than 5s they can't both be in the range [now - 5s, now]. So it would be a solution to both problems, which was apparently envisioned from the beginning. This issue should therefore remain open imho, because the change that will happen is only a partial solution to the initial problem cited in this issue...
Hmmm, now I also see that this case is even more subtle: if e.g. the pool had a prop time of 4s, this would still be tolerated by the rule above, and a height battle would also occur with blocks within 1, 2, 3 or 4 slots. But I still think that bad propagation of more than 5s should be punished here...
Internal/External
External
Summary
When a pool produces or propagates a block late, so that the block collides with the block of the next slot leader, only the vrf value is evaluated to determine the winning block. On its own that is the correct strategy for deciding randomly between competitive slots. But due to the current logic, in the case of delayed blocks it does happen that the block of the next slot leader, which was properly produced and propagated on time, is lost due to the misconfiguration of the prior slot leader. This can be seen as a form of attack from the viewpoint of the on-time pool.
Similarly, a later slot leader could produce his block multiple seconds early and collide with the previous block. If his vrf value was lower, he could attack the previous slot leader, as his early block would make it on chain and the on-time block of the prior slot leader would be lost. We do not see this type of attack yet, as it would require a conscious effort; right now this behaviour most likely stems from misconfiguration rather than malice.
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
The consensus protocol should evaluate the slot of the blocks and favor the block group which is expected in the current time frame.
By "expected" I refer to the exact block slot height. The algorithm can calculate precisely which slot# a block at this exact moment in time should have.
If there is more than one block in that group of "on-time" blocks, only then should the lower vrf decide the winner. The block of a pool which produced its block on time and propagated it swiftly should not be attackable by a prior slot leader who delays his blocks accidentally or on purpose, nor by a following slot leader who produces his block multiple seconds early by modifying the system time on purpose, a tactic we saw on the ITN to win competitive slots.
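The "expected slot" above can be computed from the wall clock alone. A sketch under assumed genesis parameters (the placeholder era start time and the helper's name are hypothetical; on a real network these values come from the genesis configuration):

```python
import time

# Hypothetical parameters for illustration; NOT the real mainnet values.
ERA_START_POSIX = 1_596_000_000  # placeholder: wall-clock time of slot 0
SLOT_LENGTH = 1                  # seconds per slot (Shelley era uses 1s)

def expected_slot(now_posix=None):
    """The slot number a block produced 'right now' should carry."""
    if now_posix is None:
        now_posix = time.time()
    return int((now_posix - ERA_START_POSIX) // SLOT_LENGTH)
```

A node can compare a received block's slot against this value to decide whether the block is "on time" for the current moment.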
System info (please complete the following information):
git rev 9a7331cce5e8bc0ea9c6bfa1c28773f4c5a7000f
Screenshots and attachments
See epoch 244: https://pooltool.io/pool/000006d97fd0415d2dafdbb8b782717a3d3ff32f865792b8df7ddd00/orphans
This is the propagation delay of the slot leader before my block:
See propagation delays of the pool before my block here:
https://pooltool.io/pool/59d12b7a426724961607014aacea1e584f3ebc1196948f42a10893bc/blocks
This is the hash of the winning late block which made it on chain:
ca40eed5fd46f76fbf64e17a98808f098363a83dfe8c100046947505baa1e406
My block made it into the orphan list on pooltool, hash:
97abb258f15995688bdacdc75a054883b22471451026f409a967028ec7b30316
This is a log excerpt from my block producer; the block which should have been the parent for my block arrived a full 4 seconds late:
{"at":"2021-01-28T07:16:47.00Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"97abb258f15995688bdacdc75a054883b22471451026f409a967028ec7b30316@20251916"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}
{"at":"2021-01-28T07:16:48.04Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"ca40eed5fd46f76fbf64e17a98808f098363a83dfe8c100046947505baa1e406@20251913"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}
This is the 2nd time I have observed this, last time was on December 21st, same pattern different slot leader:
Block producer log.
{"at":"2020-12-20T03:07:09.01Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"78f0c4a29a9c2b9a628584066f05ba3285f6b7eaac3bc270e353f52a0fa94a8c@16867338"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobat","sev":"Notice","thread":"49"}
{"at":"2020-12-20T03:07:10.64Z","env":"1.24.2:400d1","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"2c237fded6c534200814d991deccc3c99f0a1bae01e603e743d6d5926e8a4519@16867333"},"app":[],"msg":"","pid":"582044","loc":null,"host":"foobar","sev":"Notice","thread":"49"}
78f0c4a29a9c2b9a628584066f05ba3285f6b7eaac3bc270e353f52a0fa94a8c was my block which was orphaned
2c237fded6c534200814d991deccc3c99f0a1bae01e603e743d6d5926e8a4519 was the hash of the block before mine (5 slots before) arriving 6 seconds late.
Mike downloaded the json of one of the blocks of the pool before mine and noticed a delay of about 10 seconds back then:
{"height": 5100112, "slot": 16870897, "theoretical": 1608437188000, "tiptiming": [10547, 10416, 10440, 10509, 10350, 10099, 10432, 10428, 10333, 10378, 10427, 10548, 10219, 10111, 10362, 10293, 10350, 10281, 10296, 10410, 10461, 10419, 10484, 10343, 10350, 10485, 10347, 10330, 10530, 10592, 10327, 10290, 10373, 10332, 10192, 10288, 10390, 10375, 10392, 10301, 10369, 10457, 10350, 10439, 10354, 10493, 10323, 10503, 10407, 10337, 10343, 10398, 10442, 10359, 10367, 10325, 10334, 10305, 10499, 10369, 10346, 10231, 10369, 10311, 10317, 10420, 10505, 10303, 10240, 10310, 10560, 10350, 10360, 11098, 10410, 10310, 10310, 10280, 10320, 10563, 10370, 10330, 10280, 10120, 10400, 10310, 10350, 10310, 10340, 10490, 10460, 10380, 10540, 10410, 10340, -1608437188000, 10330, 10290, 10340, 10370, 10420, 10310, 10260, 10320, 10380, 10440, 10380, 10370, 10350, 10420, 10270, 10517, 10560, 10360, 10110, 10410, 10380, 10300, 10420, 10440, 10390, 10640, 10580, 10580, 10550, 10280, 10740, 10400, 10580, 10380, 10380, 10420, 10380, 10400, 10320, 10370, 10360, 10450, 10300, 10500, 10340, 10410, 10320, 10300, 10550, 10360, 10410, 10320, 10350, 10400, 10350, 10240, 10630, 10370, 10457, 10350, 10330, 10340, 10530, 10280, 10320, 10737, 10310, 10300, 11560, 10479, 10360, 10290, 10430, 10380, 10280, 10360, 10330, 10410, 10310, 10380, 10320, 10320, 11710, 10320, 10310, 10340, 25580, 10450, 10400, 10320, 10440, 11766, 10390, 10310, 12846, 10320, 10320, 12740, 12500, 12952, 13053, 18000, 20610, 20610, 24800], "histogram": "[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,