# RFC-0047: Assignment of availability chunks to validators

| | |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Start Date** | 03 November 2023 |
| **Description** | An evenly-distributing indirection layer between availability chunks and validators. |
| **Authors** | Alin Dima |

## Summary

Propose a way of permuting the availability chunk indices assigned to validators for a given core and relay
chain block, in the context of
[recovering available data from systematic chunks](https://github.com/paritytech/polkadot-sdk/issues/598), with the
purpose of fairly distributing network bandwidth usage.

## Motivation

Currently, the `ValidatorIndex` is always identical to the `ChunkIndex`. Since the validator array is only shuffled once
per session, naively using the `ValidatorIndex` as the `ChunkIndex` would place unreasonable stress on the first N/3
validators for an entire session, when favouring availability recovery from systematic chunks.

Therefore, the relay chain node needs a deterministic way of evenly distributing the first ~(N_VALIDATORS / 3)
systematic availability chunks to different validators, based on the relay chain block and core.
The main purpose is to ensure fair distribution of network bandwidth usage for availability recovery in general, and
for systematic chunk holders in particular.

## Stakeholders

Relay chain node core developers.

## Explanation

### Systematic erasure codes

An erasure coding algorithm is considered systematic if it preserves the original unencoded data as part of the
resulting code.
[The implementation of the erasure coding algorithm used for polkadot's availability data](https://github.com/paritytech/reed-solomon-novelpoly) is systematic.
Roughly speaking, the first N_VALIDATORS/3 chunks of data can be cheaply concatenated to retrieve the original data,
without running the resource-intensive and time-consuming reconstruction algorithm.

Here's the concatenation procedure of systematic chunks for polkadot's erasure coding algorithm
(minus error handling, for brevity):
```rust
pub fn reconstruct_from_systematic<T: Decode>(
    n_validators: usize,
    chunks: Vec<&[u8]>,
) -> T {
    let mut threshold = (n_validators - 1) / 3;
    if !is_power_of_two(threshold) {
        threshold = next_lower_power_of_2(threshold);
    }

    let shard_len = chunks.iter().next().unwrap().len();

    let mut systematic_bytes = Vec::with_capacity(shard_len * threshold);

    for i in (0..shard_len).step_by(2) {
        for chunk in chunks.iter().take(threshold) {
            systematic_bytes.push(chunk[i]);
            systematic_bytes.push(chunk[i + 1]);
        }
    }

    Decode::decode(&mut &systematic_bytes[..]).unwrap()
}
```

In a nutshell, it performs a column-wise concatenation with 2-byte chunks.
The output may be zero-padded at the end (the encoded data carries no record of the original input length), so scale
decoding must be aware of the expected length in bytes and ignore trailing zeros.
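
To make the interleaving concrete, here's a small, self-contained example (with made-up 4-byte chunks, not real
erasure-coded data) of the column-wise, 2-byte concatenation:

```rust
fn concat_systematic(chunks: &[&[u8]]) -> Vec<u8> {
    let shard_len = chunks[0].len();
    let mut out = Vec::with_capacity(shard_len * chunks.len());
    // Column-wise walk: take 2 bytes from each chunk in turn.
    for i in (0..shard_len).step_by(2) {
        for chunk in chunks {
            out.push(chunk[i]);
            out.push(chunk[i + 1]);
        }
    }
    out
}

fn main() {
    // Three toy "systematic chunks" of 4 bytes each.
    let chunks: Vec<&[u8]> = vec![&[1, 2, 7, 8], &[3, 4, 9, 10], &[5, 6, 11, 12]];
    // Reassembles the original byte order: 1 through 12.
    assert_eq!(concat_systematic(&chunks), (1..=12).collect::<Vec<u8>>());
}
```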

### Availability recovery at present

According to the [polkadot protocol spec](https://spec.polkadot.network/chapter-anv#sect-candidate-recovery):

> A validator should request chunks by picking peers randomly and must recover at least `f+1` chunks, where
> `n=3f+k` and `k in {1,2,3}`.

For parity's polkadot node implementation, the process has been further optimised. At the moment, it works differently
depending on the estimated size of the available data:

(a) for small PoVs (up to 128 KiB), sequentially try requesting the unencoded data from the backing group, in a random
order. If this fails, fall back to option (b).

(b) for large PoVs (over 128 KiB), launch N parallel requests for the erasure-coded chunks (currently, N has an upper
limit of 50), until enough chunks have been recovered. Validators are tried in a random order. Then, reconstruct the
original data.

Both options require that, after reconstruction, validators re-encode the data and re-create the erasure chunks trie
in order to check the erasure root.
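
A rough sketch of this strategy selection, in the same pseudocode style as the rest of this document (the helper
functions and types here are illustrative, not the actual subsystem API):

```rust
const SMALL_POV_LIMIT: usize = 128 * 1024; // 128 KiB
const MAX_PARALLEL_REQUESTS: usize = 50;

fn recover(pov_size_estimate: usize) -> Result<AvailableData, Error> {
    if pov_size_estimate <= SMALL_POV_LIMIT {
        // (a) Try the backers one by one, in random order.
        for backer in backing_group_shuffled() {
            if let Ok(data) = request_full_data(backer) {
                // Re-encode and check the erasure root before accepting.
                return verify_erasure_root(data);
            }
        }
    }
    // (b) Request chunks from up to 50 validators at a time, in random
    // order, until f+1 chunks are available, then reconstruct.
    let chunks = fetch_chunks_in_parallel(MAX_PARALLEL_REQUESTS, recovery_threshold())?;
    verify_erasure_root(reconstruct(chunks)?)
}
```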

### Availability recovery from systematic chunks

As part of the effort of
[increasing polkadot's resource efficiency, scalability and performance](https://github.com/paritytech/roadmap/issues/26),
work is under way to modify the Availability Recovery protocol by leveraging systematic chunks. See
[this comment](https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099) for preliminary
performance results.

In this scheme, the relay chain node will first attempt to retrieve the ~N/3 systematic chunks from the validators that
should hold them, before falling back to recovering from regular chunks, as before.

A re-encoding step is still needed for verifying the erasure root, so the erasure coding overhead cannot be
eliminated entirely.
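
In pseudocode (again with illustrative helpers and types), the new flow looks roughly like this:

```rust
fn recover_available_data(params: RecoveryParams) -> Result<AvailableData, Error> {
    // First, request each of the first ~N/3 chunk indices from the validator
    // assigned to it by the mapping described in the next section.
    match fetch_systematic_chunks(params.systematic_chunk_holders()) {
        Ok(chunks) => {
            let data = reconstruct_from_systematic(params.n_validators, chunks)?;
            // Re-encoding is still needed in order to verify the erasure root.
            verify_erasure_root(&params, data)
        }
        // Too many systematic chunks missing: fall back to regular recovery
        // from any f+1 chunks, as before.
        Err(_) => recover_from_regular_chunks(params),
    }
}
```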

### Chunk assignment function

#### Properties

The function that decides the chunk index for a validator should be parameterized by at least
`(validator_index, relay_parent, core_index)`
and have the following properties:
1. deterministic
1. relatively quick to compute and resource-efficient
1. when considering the other params besides `validator_index` as fixed, the function should describe a permutation
of the chunk indices
1. considering `relay_parent` as a fixed argument, the validators that map to the first N/3 chunk indices should
have as little overlap as possible for different paras scheduled on that relay parent.

In other words, we want a uniformly distributed, deterministic mapping from `ValidatorIndex` to `ChunkIndex` per block
per core.

It's desirable not to embed this function in the runtime, for performance and complexity reasons.
However, this means that the function needs to be kept very simple, with minimal or no external dependencies.
Any change to this function could result in parachains being stalled and needs to be coordinated via a runtime upgrade
or governance call.

#### Proposed function

Pseudocode:

```rust
pub fn get_chunk_index(
    n_validators: u32,
    validator_index: ValidatorIndex,
    block_number: BlockNumber,
    core_index: CoreIndex
) -> ChunkIndex {
    let threshold = systematic_threshold(n_validators); // Roughly n_validators/3
    let core_start_pos = abs(core_index - block_number) * threshold;

    (core_start_pos + validator_index) % n_validators
}
```
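
A quick way to sanity-check property 3 above (for fixed `n_validators`, `block_number` and `core_index`, the function
describes a permutation of the chunk indices) is to evaluate the function over all validator indices and verify there
are no duplicates. A minimal runnable sketch, with the index types reduced to `u32` and a simplified
`n_validators / 3` threshold:

```rust
use std::collections::HashSet;

// Simplified stand-in for the pseudocode above.
fn get_chunk_index(n_validators: u32, validator_index: u32, block_number: u32, core_index: u32) -> u32 {
    let threshold = n_validators / 3;
    let core_start_pos = (core_index as i64 - block_number as i64).unsigned_abs() as u32 * threshold;
    (core_start_pos + validator_index) % n_validators
}

fn main() {
    let (n_validators, block_number, core_index) = (300u32, 42u32, 7u32);
    let assigned: HashSet<u32> = (0..n_validators)
        .map(|v| get_chunk_index(n_validators, v, block_number, core_index))
        .collect();
    // Adding a fixed offset modulo n_validators is a bijection, so every
    // chunk index must appear exactly once.
    assert_eq!(assigned.len(), n_validators as usize);
}
```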

### Network protocol

The request-response `/polkadot/req_chunk` protocol will be bumped to a new version (from v1 to v2).
For v1, the request and response payloads are:
```rust
/// Request an availability chunk.
pub struct ChunkFetchingRequest {
    /// Hash of candidate we want a chunk for.
    pub candidate_hash: CandidateHash,
    /// The index of the chunk to fetch.
    pub index: ValidatorIndex,
}

/// Receive a requested erasure chunk.
pub enum ChunkFetchingResponse {
    /// The requested chunk data.
    Chunk(ChunkResponse),
    /// Node was not in possession of the requested chunk.
    NoSuchChunk,
}

/// This omits the chunk's index because it is already known by
/// the requester and by not transmitting it, we ensure the requester is going to use his index
/// value for validating the response, thus making sure he got what he requested.
pub struct ChunkResponse {
    /// The erasure-encoded chunk of data belonging to the candidate block.
    pub chunk: Vec<u8>,
    /// Proof for this chunk's branch in the Merkle tree.
    pub proof: Proof,
}
```

Version 2 will add an `index` field to `ChunkResponse`:

```rust
#[derive(Debug, Clone, Encode, Decode)]
pub struct ChunkResponse {
    /// The erasure-encoded chunk of data belonging to the candidate block.
    pub chunk: Vec<u8>,
    /// Proof for this chunk's branch in the Merkle tree.
    pub proof: Proof,
    /// Chunk index.
    pub index: ChunkIndex
}
```

An important thing to note is that, in version 1, the `ValidatorIndex` value is always equal to the `ChunkIndex`.
Until the feature is enabled, this will also hold for version 2. However, after the feature is enabled, this will
generally not be true.

The requester will send the request to the validator with index `V`. The responder will map the `V` validator index to
the `C` chunk index and respond with the `C`-th chunk.
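
On the responder side, the v2 handler would conceptually look like the following sketch. It assumes that the
availability store keeps the chunk index alongside each chunk, and that `av_store_fetch` and the contextual values
(`n_validators`, `block_number`, `core_index`) are illustrative names for what the handler has access to:

```rust
fn handle_chunk_request(req: ChunkFetchingRequest) -> ChunkFetchingResponse {
    // Map the requesting validator's index to the chunk index assigned to it
    // for this candidate's relay block and core.
    let chunk_index = get_chunk_index(n_validators, req.index, block_number, core_index);

    match av_store_fetch(req.candidate_hash, chunk_index) {
        Some(chunk) => ChunkFetchingResponse::Chunk(ChunkResponse {
            chunk: chunk.data,
            proof: chunk.proof,
            index: chunk_index, // v2: echo the chunk index back to the requester
        }),
        None => ChunkFetchingResponse::NoSuchChunk,
    }
}
```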

The protocol implementation MAY check the returned `ChunkIndex` against the expected mapping to ensure that
it received the right chunk.
In practice, this is desirable during availability-distribution and systematic chunk recovery. However, regular
recovery may not check this index, which is particularly useful when participating in disputes, where easy access to
the validator->chunk mapping is not available. See [Appendix A](#appendix-a) for more details.
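
For the paths that do know the mapping, the check is a thin layer on top of the existing proof verification. A sketch
(with an illustrative `verify_merkle_proof` helper and error variant):

```rust
fn validate_chunk_response(expected_index: ChunkIndex, resp: &ChunkResponse) -> Result<(), Error> {
    // Availability-distribution and systematic recovery know exactly which
    // chunk index they asked for; dispute-initiated recovery may skip this.
    if resp.index != expected_index {
        return Err(Error::UnexpectedChunkIndex);
    }
    verify_merkle_proof(&resp.chunk, &resp.proof, resp.index)
}
```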


### Upgrade path

#### Step 1: Enabling new network protocol
In the beginning, both `/polkadot/req_chunk/1` and `/polkadot/req_chunk/2` will be supported, until all validators and
collators have upgraded to use the new version. V1 will be considered deprecated. Note that parachains may take a long
time to upgrade their collators, so this transition period could be lengthy.

Once all nodes are upgraded, a new release will be cut that removes the v1 protocol. Only once all nodes have upgraded
to this version will step 2 commence.
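
During the transition window, a requester can prefer v2 and degrade gracefully. A sketch of the intended behaviour
(in practice the fallback mechanics would be handled by the request-response protocol negotiation; the helper and
error names here are illustrative):

```rust
fn request_chunk(peer: PeerId, req: ChunkFetchingRequest) -> Result<ChunkFetchingResponse, Error> {
    match send_request(peer, "/polkadot/req_chunk/2", &req) {
        // Peer hasn't upgraded yet: retry on the deprecated v1 protocol, where
        // the returned chunk's index is implicitly equal to the validator index.
        Err(Error::UnsupportedProtocol) => send_request(peer, "/polkadot/req_chunk/1", &req),
        res => res,
    }
}
```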

#### Step 2: Enabling the new validator->chunk mapping
Considering that the Validator->Chunk mapping is critical to para consensus, the change needs to be enacted atomically
via governance, only after all validators have upgraded the node to a version that is aware of this mapping.
It needs to be explicitly stated that after the runtime upgrade and governance enactment, validators that run older
client versions that don't support this mapping will not be able to participate in parachain consensus.

Additionally, an error will be logged when starting a validator with an older version, after the runtime was upgraded
and the feature enabled.

On the other hand, collators will not be required to upgrade for this step, as regular chunk recovery will work as
before, provided that they have already upgraded to version 2 of the networking protocol in step 1. However, they are
encouraged to upgrade in order to take advantage of the faster systematic recovery. Note that collators only need
availability recovery on the unhappy path (e.g. when a malicious collator withholds data), so it is not
performance-critical for them.

## Drawbacks

- Getting access to the `core_index` that used to be occupied by a candidate in some parts of the dispute protocol is
very complicated (see [Appendix A](#appendix-a)). This RFC assumes that availability-recovery processes initiated during
disputes will only use regular recovery, as before. This is acceptable since disputes are rare occurrences in practice,
and this is something that can be optimised later, if need be.
- It's a breaking change that requires all validators and collators to upgrade their node version.

## Testing, Security, and Privacy

Extensive testing will be conducted, both automated and manual.
This proposal doesn't affect security or privacy.

## Performance, Ergonomics, and Compatibility

### Performance

This is a necessary data availability optimisation, as reed-solomon erasure coding has proven to be a top consumer of
CPU time in polkadot as we scale up the parachain block size and the number of availability cores.

With this optimisation, preliminary performance results show that CPU time used for reed-solomon coding can be halved
and total PoV recovery time decreased by 80% for large PoVs. See more
[here](https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099).

### Ergonomics

Not applicable.

### Compatibility

This is a breaking change. See [upgrade path](#upgrade-path) section above.
All validators need to have upgraded their node versions before the feature will be enabled via a runtime upgrade and
governance call.

## Prior Art and References

See comments on the [tracking issue](https://github.com/paritytech/polkadot-sdk/issues/598) and the
[in-progress PR](https://github.com/paritytech/polkadot-sdk/pull/1644).

## Unresolved Questions

- Is there a better upgrade path that would preserve backwards compatibility?

## Future Directions and Related Material

This enables future optimisations for the performance of availability recovery, such as retrieving batched systematic
chunks from backers/approval-checkers.

## Appendix A

This appendix details the intricacies of getting access to the core index of a candidate in parity's polkadot node.

Here, `core_index` refers to the index of the core that a candidate was occupying while it was pending availability
(from backing to inclusion).

Availability-recovery can currently be triggered by the following phases in the polkadot protocol:
1. During the approval voting process.
1. By other collators of the same parachain.
1. During disputes.

Getting the right core index for a candidate can be troublesome. Here's a breakdown of how different parts of the
node implementation can get access to it:

1. The approval-voting process for a candidate begins after observing that the candidate was included. Therefore, the
node has easy access to the block where the candidate got included (and also the core that it occupied).
1. The `pov_recovery` task of the collators starts availability recovery in response to noticing a candidate getting
backed, which enables easy access to the core index the candidate started occupying.
1. Disputes may be initiated on a number of occasions:

3.a. is initiated by a validator as a result of finding an invalid candidate while participating in the
approval-voting protocol. In this case, availability-recovery is not needed, since the validator has already issued
their vote.

3.b. is initiated by a validator noticing dispute votes recorded on-chain. In this case, we can safely
assume that the backing event for that candidate has been recorded and kept in memory.

3.c. is initiated as a result of getting a dispute statement from another validator. It is possible that the dispute
is happening on a fork that was not yet imported by this validator, so the subsystem may not have seen this candidate
being backed.

A naive attempt at solving 3.c would be to add a new version of the disputes request-response networking protocol.
Blindly passing the core index in the network payload would not work, since there is no way of validating that
the reported `core_index` was indeed the one occupied by the candidate at the respective relay parent.

Another attempt could be to include in the message the relay block hash where the candidate was included.
This information would be used to query the runtime API and retrieve the core index that the candidate was
occupying. However, since that block is part of an unimported fork, the validator cannot call a runtime API on it.