
Can't sync a kusama-people node from scratch with 1.12, working with 1.11, regression? #4614

Closed
rvalle opened this issue May 28, 2024 · 6 comments · Fixed by #4721
Labels
I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@rvalle

rvalle commented May 28, 2024

Hi!

I can't sync a Kusama People parachain node. I am experiencing trouble I have never seen when running any other node.

I am using the Docker distribution polkadot-parachain:1.12.0 and the relay chain RPC interface.

A vanilla start with docker run --rm parity/polkadot-parachain:1.12.0 --chain people-kusama complains about most (or all, I'm not sure) nodes having a genesis mismatch:

2024-05-28 09:37:31 [Parachain] Report 12D3KooWS7uzh62LChjfbyYGj1U5yGYaKNWMzzh6AAWHiJ5aLYLH: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.

I then restrict to boot nodes only, using the reserved-only flag with the boot nodes as reserved nodes, and then I get parachain blocks... however, no finalizations.

Eventually I reach the top of the chain, also using the relay chain RPC interface:

2024-05-28 09:40:46 [Parachain] ⚙️  Preparing  0.0 bps, target=#94115 (1 peers), best: #94015 (0x2dee…a9c6), finalized #0 (0xc1af…8b3f), ⬇ 12 B/s ⬆ 26 B/s

but no blocks appear to be finalized,

and eventually I get this other warning constantly:

2024-05-28 09:40:42 [Parachain] Event distribution channel has reached its limit. This can lead to missed notifications. error=TrySendError { kind: Full }
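
As far as I understand, this warning means a bounded event channel filled up faster than the receiver could drain it, so notifications get dropped. A minimal, generic Rust sketch of that failure mode using the standard library's sync_channel (not the actual Substrate code; the event names are purely illustrative):

use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // A bounded channel with room for two pending events and a receiver
    // that is not currently draining it.
    let (tx, _rx) = sync_channel::<&str>(2);

    tx.try_send("event-1").expect("capacity available");
    tx.try_send("event-2").expect("capacity available");

    // The third event no longer fits; this is the condition the
    // "Event distribution channel has reached its limit" warning reports.
    match tx.try_send("event-3") {
        Ok(()) => println!("sent"),
        Err(TrySendError::Full(ev)) => eprintln!("channel full, dropping {ev}"),
        Err(TrySendError::Disconnected(ev)) => eprintln!("receiver gone, dropping {ev}"),
    }
}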

I am not sure what is going on. I am using the new default --prune archive-canonical and have also tried the different sync modes (fast, warp), but nothing seems to make a difference.

I have also tried to use the polkadot-collator image, but it does not seem to have the kusama-people spec.

What am I missing?

github-actions bot added the I10-unconfirmed (Issue might be valid, but it's not yet known.) label on May 28, 2024
rvalle changed the title from "Can't start a kusama-people node" to "Can't sync a kusama-people node" on May 28, 2024
@rvalle
Author

rvalle commented May 28, 2024

Here is an example command that won't work:

docker run \
   parity/polkadot-parachain:1.12.0 \
   --chain people-kusama \
   --relay-chain-rpc-url wss://rpc.ibp.network/kusama

It eventually fails, killing the node, with:

2024-05-28 12:05:23 [Parachain] ⚙️  Syncing 391.4 bps, target=#94755 (7 peers), best: #85939 (0x294d…66f1), finalized #0 (0xc1af…8b3f), ⬇ 2.2MiB/s ⬆ 1.3kiB/s    
2024-05-28 12:05:24 [Relaychain] Received imported block via RPC: #23365318 (0x28bd…88f7 -> 0xe6c6…8790)
2024-05-28 12:05:24 [Relaychain] Received imported block via RPC: #23365318 (0x28bd…88f7 -> 0x1d88…d4f9)
2024-05-28 12:05:25 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="chain-api" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-05-28 12:05:25 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-05-28 12:05:25 [Relaychain] Protocol command streams have been shut down    
2024-05-28 12:05:25 [Relaychain] Essential task `overseer` failed. Shutting down service.    
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Error: Service(Other("Essential task failed."))

@rvalle
Author

rvalle commented May 28, 2024

Here is another example, running with the released binary from this repository:

./polkadot-parachain  --chain people-kusama --relay-chain-rpc-url wss://rpc.ibp.network/kusama

which reports version 1.12.0-b4016902ac7.

Similar behaviour...

@rvalle rvalle changed the title Cant sync a kusama-people node Cant sync a kusama-people node, regression? May 28, 2024
@rvalle
Author

rvalle commented May 28, 2024

However, if I use the 1.11.0 release and the parachain spec from the repo here, as Paranodes recalls from their initial sync, then it seems to work:

2024-05-28 14:19:48 [Relaychain] Received imported block via RPC: #23365460 (0x8b8d…8d67 -> 0xd6ae…6073)
2024-05-28 14:19:48 [Parachain] ♻️  Reorg on #94819,0xd53d…1cc7 to #94819,0x9610…7b2d, common ancestor #94818,0x6fbd…faee    
2024-05-28 14:19:50 [Parachain] 💤 Idle (7 peers), best: #94819 (0x9610…7b2d), finalized #94817 (0x9efd…10ab), ⬇ 6.8kiB/s ⬆ 4.1kiB/s    
2024-05-28 14:19:51 [Relaychain] Received finalized block via RPC: #23365457 (0x09f5…8239 -> 0x951c…cb3c)

Is the initial sync perhaps broken in the latest release?

rvalle changed the title from "Can't sync a kusama-people node, regression?" to "Can't sync a kusama-people node from scratch with 1.12, working with 1.11, regression?" on May 28, 2024
@hitchhooker
Contributor

hitchhooker commented May 30, 2024

https://gist.githubusercontent.com/hitchhooker/61a00eb3e3bda432598351347048af8b/raw/23d0c5d0c1e5ceed5bd2e0dd21e14cde3d38dc3d/gistfile1.txt

root@kppl27:/opt/cumulus# cat cumulus.service
[Unit]
Description="kppl27 endpoint - Cumulus service"
After=network-online.target
Wants=network-online.target

[Service]
User=cumulus
Group=cumulus
ExecStart=/opt/cumulus/cumulus \
  --name "Rotko Networks - kppl27 Endpoint" \
  --chain /opt/cumulus/people-kusama.json \
  --base-path /opt/cumulus \
  --state-pruning archive \
  --blocks-pruning=archive \
  --database paritydb \
  --sync full \
  --listen-addr /ip4/0.0.0.0/tcp/33857 \
  --listen-addr /ip4/0.0.0.0/tcp/34857/ws \
  --public-addr /ip4/27.131.160.106/tcp/33857 \
  --public-addr /ip4/27.131.160.106/tcp/34857/ws \
  --public-addr /dns/kppl27.rotko.net/tcp/33857 \
  --public-addr /dns/kppl27.rotko.net/tcp/34857/ws \
  --public-addr /dns/kppl27.rotko.net/tcp/35857/wss \
  --rpc-port 9857 \
  --prometheus-port 7857 \
  --prometheus-external \
  --relay-chain-rpc-urls ws://192.168.69.24:9324 \
  --wasm-execution Compiled \
  --no-hardware-benchmarks \
  --max-runtime-instances 32 \
  --rpc-max-request-size 16 \
  --rpc-max-response-size 16 \
  --rpc-max-subscriptions-per-connection 512 \
  --rpc-max-connections 10000 \
  --rpc-external \
  --rpc-methods safe \
  --rpc-cors all \
  --allow-private-ipv4

Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

Struggling with the same issue on coretime, people, and bridgehub.

related: #4648

@skunert
Contributor

skunert commented Jun 3, 2024

Thanks for reporting, I will take a look!

Here is an example command that won't work:

docker run \
   parity/polkadot-parachain:1.12.0 \
   --chain people-kusama \
   --relay-chain-rpc-url wss://rpc.ibp.network/kusama

It eventually fails, killing the node, with:

2024-05-28 12:05:25 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
[...]
Error: Service(Other("Essential task failed."))

This is a known issue and was recently fixed: #4167
In general, we do not recommend using slow public RPC nodes for collation. The goal is that you can run multiple collators in your network and point them to a full node that you run yourself.

@skunert
Contributor

skunert commented Jun 5, 2024

Quick update: I was able to reproduce this issue and have some hints on what might be happening. Will confirm and keep you posted.

github-merge-queue bot pushed a commit that referenced this issue Jun 11, 2024
## Issue

Currently, syncing parachains from scratch can lead to a very long
finalization time once they reach the tip of the chain. The problem is
that we try to finalize everything from 0 to the tip, which can be
thousands or even millions of blocks.

We finalize sequentially and try to compute displaced branches during
finalization. So for every block on the way, we compute an expensive
tree route.

## Proposed Improvements

In this PR, I propose improvements that solve this situation:

- **Skip tree route calculation if `leaves().len() == 1`:** This should
be enough for 90% of cases where there is only one leaf after sync.
- **Optimize finalization for long distances:** It can happen that the
parachain has imported some leaf and then receives a relay chain
notification with the finalized block. In that case, the previous
optimization will not trigger. A second mechanism should ensure that we
do not need to compute the full tree route. If the finalization distance
is long, we check the lowest common ancestor of all the leaves. If it is
above the to-be-finalized block, we know that there are no displaced
leaves. This is fast because forks are short and close to the tip, so we
can leverage the header cache (see the sketch below).
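
A minimal sketch of how the two shortcuts above fit together. This is illustrative only; the Chain trait and its leaves, lowest_common_ancestor, number, and displaced_leaves_via_tree_route helpers are hypothetical stand-ins rather than the actual polkadot-sdk API, and the real change additionally gates the LCA check on the finalization distance:

type Hash = u64;
type BlockNumber = u64;

/// Hypothetical chain backend; a stand-in for the real client/backend API.
trait Chain {
    /// All current leaves (chain tips).
    fn leaves(&self) -> Vec<Hash>;
    /// Lowest common ancestor of two blocks, served from the header cache.
    fn lowest_common_ancestor(&self, a: Hash, b: Hash) -> Hash;
    /// Block number of the given block.
    fn number(&self, block: Hash) -> BlockNumber;
    /// Expensive fallback: compute displaced branches via a full tree route.
    fn displaced_leaves_via_tree_route(&self, finalized: Hash) -> Vec<Hash>;
}

/// Leaves displaced by finalizing `finalized`, taking the cheap shortcuts
/// described above whenever they apply.
fn displaced_leaves<C: Chain>(chain: &C, finalized: Hash) -> Vec<Hash> {
    let leaves = chain.leaves();

    // Shortcut 1: with at most one leaf (the common case right after sync),
    // nothing can be displaced, so no tree route is needed.
    if leaves.len() <= 1 {
        return Vec::new();
    }

    // Shortcut 2: fold the lowest common ancestor over all leaves. Forks are
    // short and close to the tip, so these lookups hit the header cache. If
    // the LCA sits above the to-be-finalized block, every leaf descends from
    // the finalized chain and nothing is displaced.
    let mut lca = leaves[0];
    for leaf in &leaves[1..] {
        lca = chain.lowest_common_ancestor(lca, *leaf);
    }
    if chain.number(lca) > chain.number(finalized) {
        return Vec::new();
    }

    // Otherwise fall back to the full (expensive) computation.
    chain.displaced_leaves_via_tree_route(finalized)
}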

## Alternative Approach

- The problem was introduced in #3962. Reverting that PR is another
possible strategy.
- We could store, for every fork, where it begins; however, that sounds a bit more involved to me.


fixes #4614
Ank4n pushed a commit that referenced this issue Jun 14, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024