
runtime: remove ttl #5461

Open · wants to merge 176 commits into master
Conversation

@alindima (Contributor) commented Aug 26, 2024

Resolves #4776

This will enable proper core-sharing between paras, even if one of them is not producing blocks.

TODO:

  • duplicate first entry in the claim queue if the queue used to be empty
  • don't back anything if at the end of the block there'll be a session change
  • write migration for removing the availability core storage
  • update and write unit tests
  • prdoc
  • add zombienet test for synchronous backing
  • add zombienet test for core-sharing paras where one of them is not producing any blocks

Important note:
The ttl and max_availability_timeouts fields of the HostConfiguration are not removed in this PR, due to #64.
Adding the storage-version-check workaround for every use of the active HostConfiguration would be insane, as it is read in almost every runtime API.

So even though the ttl and max_availability_timeouts fields will now be unused, they will remain part of the host configuration.

These will be removed in a separate PR once #64 is fixed.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
@command-bot (bot) commented Oct 2, 2024

@alindima Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=rococo --target_dir=polkadot --pallet=polkadot_runtime_parachains::paras_inherent has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7484427. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7484427/artifacts/download.

Base automatically changed from sandreim/runtime_v2_descriptor_support to master October 7, 2024 07:22
@alindima (Contributor, Author) commented Oct 7, 2024

Oh, this got pretty messed up since it was based on #5423, which was merged. I'll try to fix it, but I will probably need to force-push.

Later edit: managed to fix it

@eskimor (Member) left a comment

Left a few nits. Let's strive for well-documented and encapsulated pallets with a nice API that serves a purpose and sets clear, easy-to-understand boundaries and expectations in both directions.

Other than that, I think I found one issue with regard to virtual cores.

fn report_processed(para_id: ParaId, core_index: CoreIndex) {
	// Reporting processed assignments is only important for on-demand.
	// Doing the call below is a no-op if the assignment was a `Bulk` one.
	on_demand::Pallet::<T>::report_processed(para_id, core_index);

@eskimor (Member): Ok, sorry. Now I have looked at the code as well. Indeed this is a hack, of course. The assignment provider should know about on-demand/bulk. Calling into on-demand for non-on-demand assignments is not nice, but I see you added the no-op note to the docs. Better yet would be to state this as a proper invariant, ideally with an accompanying test checking that this is indeed a no-op.
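For illustration only, a minimal sketch of the kind of invariant test suggested above. The test-helper names (`Test`, `new_test_ext`, `MockGenesisConfig`) are placeholders for whatever the crate's actual test mock provides; only the `report_processed` call itself is taken from the quoted code. The no-op property is checked by comparing storage roots:

	// Hypothetical test: `report_processed` for a para assigned via bulk coretime
	// should leave storage untouched, since there is no on-demand order to credit.
	#[test]
	fn report_processed_is_noop_for_bulk_assignments() {
		new_test_ext(MockGenesisConfig::default()).execute_with(|| {
			let para_id = ParaId::from(1000);
			let core_index = CoreIndex(0);

			// Snapshot the full storage root before the call.
			let before = sp_io::storage::root(sp_runtime::StateVersion::V1);

			// `para_id` has no on-demand order queued, so this must change nothing.
			on_demand::Pallet::<Test>::report_processed(para_id, core_index);

			let after = sp_io::storage::root(sp_runtime::StateVersion::V1);
			assert_eq!(before, after, "report_processed must be a no-op for bulk assignments");
		});
	}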

@eskimor (Member): On the other hand, this is only an intermediary hack, right? With the proper fix coming, we should no longer need this hack anyway.

Resolved (outdated) review threads on: polkadot/runtime/parachains/src/scheduler.rs (x2), polkadot/runtime/parachains/src/session_info.rs
@eskimor (Member): Reading the code, I don't get why we need this hack. Anyhow, I will not block on this as it is only interim.

Resolved (outdated) review thread on: polkadot/runtime/parachains/src/paras_inherent/mod.rs
// In `Enter` context (invoked during execution) no more candidates should be
// filtered, because they have already been filtered during `ProvideInherent`
// context. Abort in such cases.
if context == ProcessInherentDataContext::Enter {

@eskimor (Member): These checks should be top-level; they should be a wrapper around process_inherent. For debugging we can leave logs for when something was filtered and for what reason.

@eskimor (Member): The reasoning is simple: if this is handled once at the top level, you cannot mess it up or forget it in some place.

@alindima (Contributor, Author): Done. Have a look.

@eskimor (Member) left a comment

Thanks Alin! Good work. I am having trouble getting confident that this is working as expected (hence all my comments about architecture; it is quite hard to reason about this code). Other than that, are we lacking tests? E.g. a test that core sharing now works as expected even if one chain is not producing blocks, or that the runtime API now always returns the claim queue as expected?

@@ -686,18 +680,7 @@ pub mod pallet {
		Self::set_coretime_cores_unchecked(new)
	}

	/// Set the max number of times a claim may timeout on a core before it is abandoned
	#[pallet::call_index(7)]

@eskimor (Member): Shall we also already mark the fields as deprecated, pointing to the removal ticket? E.g. here?
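For illustration, one way such a deprecation marker could look. This is only a sketch with a stand-in struct, not the actual HostConfiguration definition; the two field names are taken from this PR, everything else is placeholder wording:

	// Sketch only: the real `HostConfiguration` has many more fields and derives.
	pub struct UnusedTtlFields<BlockNumber> {
		/// Deprecated: no longer read by the runtime since the claim-queue rework.
		/// Kept to avoid a storage format change; to be removed in a follow-up PR.
		#[deprecated = "unused; scheduled for removal in a follow-up"]
		pub ttl: BlockNumber,
		/// Deprecated: see `ttl`.
		#[deprecated = "unused; scheduled for removal in a follow-up"]
		pub max_availability_timeouts: u32,
	}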

@alindima (Contributor, Author): Done.


Self::process_inherent_data(data, ProcessInherentDataContext::Enter)
	.map(|(_processed, post_info)| post_info)
Self::process_inherent_data(data, ProcessInherentDataContext::Enter).and_then(

@eskimor (Member): That's not entirely what I meant. I mean we should have a generic test here: essentially, with Enter the processed data should be the same as the incoming one. This concerns not only backed_candidates.

@eskimor (Member): The point being: we should be able to have a top-level invariant that the data passed in for Enter is already good, hence it should not be changed by calling process_inherent_data. It should not even be necessary to pass in the ProcessInherentDataContext. Instead we just let it filter, but if it actually filters anything on Enter, we error out.
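A self-contained toy sketch of the invariant described above, with simplified stand-in types rather than the real paras_inherent ones: the sanitizer always runs, and the execution entry point errors out if it changed anything.

	#[derive(Clone, PartialEq, Debug)]
	struct InherentData {
		backed_candidates: Vec<u32>,
		bitfields: Vec<bool>,
	}

	#[derive(Debug, PartialEq)]
	enum Error {
		InherentDataFilteredDuringExecution,
	}

	// Stand-in sanitizer: drop "invalid" candidates (here: odd ids).
	fn process_inherent_data(mut data: InherentData) -> InherentData {
		data.backed_candidates.retain(|c| c % 2 == 0);
		data
	}

	// Block-execution entry point: filtering is only legitimate while building
	// the inherent, so any change observed here means the inherent was bad.
	fn enter(data: InherentData) -> Result<InherentData, Error> {
		let processed = process_inherent_data(data.clone());
		if processed != data {
			return Err(Error::InherentDataFilteredDuringExecution);
		}
		Ok(processed)
	}

	fn main() {
		let good = InherentData { backed_candidates: vec![2, 4], bitfields: vec![true] };
		assert!(enter(good).is_ok());

		let bad = InherentData { backed_candidates: vec![1, 2], bitfields: vec![true] };
		assert_eq!(enter(bad), Err(Error::InherentDataFilteredDuringExecution));
	}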

@alindima (Contributor, Author): Good point. Done!

@@ -345,7 +372,7 @@ impl<T: Config> Pallet<T> {
	log::debug!(target: LOG_TARGET, "Time weight before filter: {}, candidates + bitfields: {}, disputes: {}", weight_before_filtering.ref_time(), candidates_weight.ref_time() + bitfields_weight.ref_time(), disputes_weight.ref_time());

	let current_session = shared::CurrentSessionIndex::<T>::get();
	let expected_bits = scheduler::AvailabilityCores::<T>::get().len();
	let expected_bits = scheduler::Pallet::<T>::num_validator_groups();

@eskimor (Member): Bitfields are for cores, so what we are actually querying here is the number of cores, not backing groups. That some of these cores will never see an assignment is something only the scheduler needs to know.

@alindima (Contributor, Author): Renamed this.

let mut eligible: BTreeMap<ParaId, BTreeSet<CoreIndex>> = BTreeMap::new();
let mut total_eligible_cores = 0;

for (core_idx, para_id) in Self::eligible_paras(&occupied_cores) {

@eskimor (Member): Why not make this a function of inclusion? It already knows the occupied cores, and I think it also just makes sense.

@alindima (Contributor, Author) commented Oct 11, 2024: We also need the occupied cores here when calling advance_claim_queue; this is why I chose to move it here.

.enumerate()
.map(|(i, core)| match core {
	CoreOccupied::Paras(entry) => {
(0..n_cores)

@eskimor (Member): Introducing something similar to ClaimQueueIterator in inclusion would make this code a bit more straightforward. (Provide an iterator that iterates over all cores, including unoccupied ones.)

@eskimor (Member): (Could also help avoid looking up PendingAvailability twice.)

@alindima (Contributor, Author): I reworked this to not query PendingAvailability twice.
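For illustration, a self-contained sketch of the "iterate over all cores, occupied or not" idea suggested above. The types are simplified stand-ins (the real code would use CoreIndex, ParaId and the inclusion pallet's pending-availability map), so this only shows the shape of such an iterator, not the actual implementation.

	use std::collections::BTreeMap;

	#[derive(Debug, PartialEq)]
	enum CoreState {
		Occupied(u32), // para id currently pending availability on this core
		Free,
	}

	// Yields every core index from 0 to `n_cores` together with its state, so
	// callers never have to special-case cores without a pending candidate.
	fn all_cores(n_cores: u32, pending: &BTreeMap<u32, u32>) -> impl Iterator<Item = (u32, CoreState)> + '_ {
		(0..n_cores).map(move |core| {
			let state = pending.get(&core).copied().map(CoreState::Occupied).unwrap_or(CoreState::Free);
			(core, state)
		})
	}

	fn main() {
		let mut pending = BTreeMap::new();
		pending.insert(1u32, 2001u32); // core 1 is occupied by para 2001
		let states: Vec<_> = all_cores(3, &pending).collect();
		assert_eq!(states[0], (0, CoreState::Free));
		assert_eq!(states[1], (1, CoreState::Occupied(2001)));
		assert_eq!(states[2], (2, CoreState::Free));
		println!("{states:?}");
	}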

Scheduler::free_cores_and_fill_claim_queue(BTreeMap::new(), 2);
{
	let mut claim_queue = scheduler::ClaimQueue::<Test>::get();
	assert_eq!(Scheduler::claim_queue_len(), 3);

@eskimor (Member): This does not look like an exhaustive test if we are not testing the newly added core?

@alindima (Contributor, Author): We are. The core count used to be 2; we increased it to 4, but out of the two new cores we only have an assignment for one of them. So we check that the claim queue contains the assignment for the third core (index 2).

@@ -135,8 +135,8 @@ impl<T: Config> Pallet<T> {
	let assignment_keys = AssignmentKeysUnsafe::<T>::get();
	let active_set = shared::ActiveValidatorIndices::<T>::get();

	let validator_groups = scheduler::ValidatorGroups::<T>::get().into();
	let n_cores = scheduler::AvailabilityCores::<T>::get().len() as u32;
	let validator_groups = scheduler::ValidatorGroups::<T>::get();

@eskimor (Member): Note for a future refactor: there is very little reason left for ValidatorGroups to remain in the scheduler. Given that we are already storing them in SessionInfo, I wonder whether we need that storage entry at all.

A potentially improved architecture would probably have both ValidatorGroups and the core count managed and exposed via the session_info crate.

The distinction between the configured number of cores and the actual number of cores is still annoying. In practice, not even the scheduler should care: the scheduler code should work equally well if we used the actual number of cores as exposed by session_info. Then the configured number of coretime_cores would only be relevant within the coretime assignment provider and nicely encapsulated where it belongs. 🤔

@alindima (Contributor, Author): Noted it on #5529.

// Measured: `42760`
// Estimated: `46225`
// Minimum execution time: 342_453_000 picoseconds.
Weight::from_parts(360_067_000, 0)

@eskimor (Member): Do we understand where this significant increase is coming from?

// Measured: `76903`
// Estimated: `82843`
// Minimum execution time: 1_927_115_000 picoseconds.
Weight::from_parts(1_974_210_767, 0)

@eskimor (Member): This also got significantly heavier. Not necessarily a problem, but it is worth understanding where it is coming from.

@alindima (Contributor, Author): I think it's a matter of the rococo weights not being up to date even before this PR. Westend weights were updated in #5270, but rococo's were not.

If you compare the new rococo weights to the new westend weights, they're pretty similar, so I don't think there's any issue.

@alindima alindima requested a review from a team as a code owner October 15, 2024 13:03
Labels: T8-polkadot (This PR/Issue is related to/affects the Polkadot network.)
Projects: None yet
Development: Successfully merging this pull request may close: Claim queue: Remove TTL
5 participants