
Can't sync a kusama-people node from scratch with 1.12, working with 1.11, regression? #4614

Closed
rvalle opened this issue May 28, 2024 · 6 comments · Fixed by #4721
Labels
I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@rvalle

rvalle commented May 28, 2024

Hi!

I can't sync a Kusama People parachain node. I am experiencing trouble I have never seen when running any other node.

I am using the Docker distribution polkadot-parachain:1.12.0 and the relay chain RPC interface.

A vanilla start with docker run --rm parity/polkadot-parachain:1.12.0 --chain people-kusama complains about most (or all, I'm not sure) nodes having a genesis mismatch:

2024-05-28 09:37:31 [Parachain] Report 12D3KooWS7uzh62LChjfbyYGj1U5yGYaKNWMzzh6AAWHiJ5aLYLH: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.

I then restrict to boot nodes only, using the reserved-only flag with the boot nodes as reserved nodes, and then I get parachain blocks... however, no finalizations.

Eventually I reach the top of the chain, also using the relay chain RPC interface:

2024-05-28 09:40:46 [Parachain] ⚙️  Preparing  0.0 bps, target=#94115 (1 peers), best: #94015 (0x2dee…a9c6), finalized #0 (0xc1af…8b3f), ⬇ 12 B/s ⬆ 26 B/s

but no blocks appear to be finalized,

and eventually I get this other warning constantly:

2024-05-28 09:40:42 [Parachain] Event distribution channel has reached its limit. This can lead to missed notifications. error=TrySendError { kind: Full }
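
As far as I understand, this warning means a bounded event channel filled up faster than the receiver could drain it, so notifications get dropped. A minimal, generic Rust sketch of that failure mode using the standard library's sync_channel (not the actual Substrate code; the event names are purely illustrative):

use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // A bounded channel with room for two pending events and a receiver
    // that is not currently draining it.
    let (tx, _rx) = sync_channel::<&str>(2);

    tx.try_send("event-1").expect("capacity available");
    tx.try_send("event-2").expect("capacity available");

    // The third event no longer fits; this is the condition the
    // "Event distribution channel has reached its limit" warning reports.
    match tx.try_send("event-3") {
        Ok(()) => println!("sent"),
        Err(TrySendError::Full(ev)) => eprintln!("channel full, dropping {ev}"),
        Err(TrySendError::Disconnected(ev)) => eprintln!("receiver gone, dropping {ev}"),
    }
}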

I am not sure what is going on. I am using the new default --prune archive-canonical and have also tried the different sync modes (fast, warp), but nothing seems to make a difference.

I have also tried to use the polkadot-collator image, but it does not seem to have the kusama-people spec.

What am I missing?

github-actions bot added the I10-unconfirmed (Issue might be valid, but it's not yet known.) label on May 28, 2024
rvalle changed the title from "Can't start a kusama-people node" to "Can't sync a kusama-people node" on May 28, 2024
@rvalle
Author

rvalle commented May 28, 2024

Here is an example command that won't work:

docker run \
   parity/polkadot-parachain:1.12.0 \
   --chain people-kusama \
   --relay-chain-rpc-url wss://rpc.ibp.network/kusama

It eventually fails, killing the node, with:

2024-05-28 12:05:23 [Parachain] ⚙️  Syncing 391.4 bps, target=#94755 (7 peers), best: #85939 (0x294d…66f1), finalized #0 (0xc1af…8b3f), ⬇ 2.2MiB/s ⬆ 1.3kiB/s    
2024-05-28 12:05:24 [Relaychain] Received imported block via RPC: #23365318 (0x28bd…88f7 -> 0xe6c6…8790)
2024-05-28 12:05:24 [Relaychain] Received imported block via RPC: #23365318 (0x28bd…88f7 -> 0x1d88…d4f9)
2024-05-28 12:05:25 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="chain-api" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-05-28 12:05:25 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-05-28 12:05:25 [Relaychain] Protocol command streams have been shut down    
2024-05-28 12:05:25 [Relaychain] Essential task `overseer` failed. Shutting down service.    
2024-05-28 12:05:25 [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Error: Service(Other("Essential task failed."))

@rvalle
Author

rvalle commented May 28, 2024

Here is another example, running with the released binary from this repository:

./polkadot-parachain  --chain people-kusama --relay-chain-rpc-url wss://rpc.ibp.network/kusama

which reports version 1.12.0-b4016902ac7.

Similar behaviour...

@rvalle rvalle changed the title Cant sync a kusama-people node Cant sync a kusama-people node, regression? May 28, 2024
@rvalle
Author

rvalle commented May 28, 2024

However, if I use the 1.11.0 release and the parachain spec from the repo here, as Paranodes recalls from their initial sync, then it seems to work:

2024-05-28 14:19:48 [Relaychain] Received imported block via RPC: #23365460 (0x8b8d…8d67 -> 0xd6ae…6073)
2024-05-28 14:19:48 [Parachain] ♻️  Reorg on #94819,0xd53d…1cc7 to #94819,0x9610…7b2d, common ancestor #94818,0x6fbd…faee    
2024-05-28 14:19:50 [Parachain] 💤 Idle (7 peers), best: #94819 (0x9610…7b2d), finalized #94817 (0x9efd…10ab), ⬇ 6.8kiB/s ⬆ 4.1kiB/s    
2024-05-28 14:19:51 [Relaychain] Received finalized block via RPC: #23365457 (0x09f5…8239 -> 0x951c…cb3c)

Is the initial sync perhaps broken in the latest release?

rvalle changed the title from "Can't sync a kusama-people node, regression?" to "Can't sync a kusama-people node from scratch with 1.12, working with 1.11, regression?" on May 28, 2024
@hitchhooker
Contributor

hitchhooker commented May 30, 2024

https://gist.githubusercontent.com/hitchhooker/61a00eb3e3bda432598351347048af8b/raw/23d0c5d0c1e5ceed5bd2e0dd21e14cde3d38dc3d/gistfile1.txt

root@kppl27:/opt/cumulus# cat cumulus.service
[Unit]
Description="kppl27 endpoint - Cumulus service"
After=network-online.target
Wants=network-online.target

[Service]
User=cumulus
Group=cumulus
ExecStart=/opt/cumulus/cumulus \
  --name "Rotko Networks - kppl27 Endpoint" \
  --chain /opt/cumulus/people-kusama.json \
  --base-path /opt/cumulus \
  --state-pruning archive \
  --blocks-pruning=archive \
  --database paritydb \
  --sync full \
  --listen-addr /ip4/0.0.0.0/tcp/33857 \
  --listen-addr /ip4/0.0.0.0/tcp/34857/ws \
  --public-addr /ip4/27.131.160.106/tcp/33857 \
  --public-addr /ip4/27.131.160.106/tcp/34857/ws \
  --public-addr /dns/kppl27.rotko.net/tcp/33857 \
  --public-addr /dns/kppl27.rotko.net/tcp/34857/ws \
  --public-addr /dns/kppl27.rotko.net/tcp/35857/wss \
  --rpc-port 9857 \
  --prometheus-port 7857 \
  --prometheus-external \
  --relay-chain-rpc-urls ws://192.168.69.24:9324 \
  --wasm-execution Compiled \
  --no-hardware-benchmarks \
  --max-runtime-instances 32 \
  --rpc-max-request-size 16 \
  --rpc-max-response-size 16 \
  --rpc-max-subscriptions-per-connection 512 \
  --rpc-max-connections 10000 \
  --rpc-external \
  --rpc-methods safe \
  --rpc-cors all \
  --allow-private-ipv4

Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

Struggling with the same issue on coretime, people, and bridgehub.

related: #4648

@skunert
Contributor

skunert commented Jun 3, 2024

Thanks for reporting, I will take a look!

Here is an example command that won't work:

docker run \
   parity/polkadot-parachain:1.12.0 \
   --chain people-kusama \
   --relay-chain-rpc-url wss://rpc.ibp.network/kusama

It eventually fails, killing the node, with:

2024-05-28 12:05:25 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
[...]
Error: Service(Other("Essential task failed."))

This is a known issue and was recently fixed: #4167
In general, we do not recommend using slow public RPC nodes for collation. The goal is that you can run multiple collators in your network and point them to a full node that you run yourself.

@skunert
Contributor

skunert commented Jun 5, 2024

Quick update: I was able to reproduce this issue and have some hints on what might be happening. Will confirm and keep you posted.

github-merge-queue bot pushed a commit that referenced this issue Jun 11, 2024
## Issue

Currently, syncing parachains from scratch can lead to a very long
finalization time once they reach the tip of the chain. The problem is
that we try to finalize everything from 0 to the tip, which can be
thousands or even millions of blocks.

We finalize sequentially and try to compute displaced branches during
finalization. So for every block on the way, we compute an expensive
tree route.

## Proposed Improvements

In this PR, I propose improvements that solve this situation:

- **Skip tree route calculation if `leaves().len() == 1`:** This should
be enough for 90% of cases where there is only one leaf after sync.
- **Optimize finalization for long distances:** It can happen that the
parachain has imported some leaf and then receives a relay chain
notification with the finalized block. In that case, the previous
optimization will not trigger. A second mechanism should ensure that we
do not need to compute the full tree route. If the finalization distance
is long, we check the lowest common ancestor of all the leaves. If it is
above the to-be-finalized block, we know that there are no displaced
leaves. This is fast because forks are short and close to the tip, so we
can leverage the header cache (see the sketch below).
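
A minimal sketch of how the two shortcuts above fit together. This is illustrative only; the Chain trait and its leaves, lowest_common_ancestor, number, and displaced_leaves_via_tree_route helpers are hypothetical stand-ins rather than the actual polkadot-sdk API, and the real change additionally gates the LCA check on the finalization distance:

type Hash = u64;
type BlockNumber = u64;

/// Hypothetical chain backend; a stand-in for the real client/backend API.
trait Chain {
    /// All current leaves (chain tips).
    fn leaves(&self) -> Vec<Hash>;
    /// Lowest common ancestor of two blocks, served from the header cache.
    fn lowest_common_ancestor(&self, a: Hash, b: Hash) -> Hash;
    /// Block number of the given block.
    fn number(&self, block: Hash) -> BlockNumber;
    /// Expensive fallback: compute displaced branches via a full tree route.
    fn displaced_leaves_via_tree_route(&self, finalized: Hash) -> Vec<Hash>;
}

/// Leaves displaced by finalizing `finalized`, taking the cheap shortcuts
/// described above whenever they apply.
fn displaced_leaves<C: Chain>(chain: &C, finalized: Hash) -> Vec<Hash> {
    let leaves = chain.leaves();

    // Shortcut 1: with at most one leaf (the common case right after sync),
    // nothing can be displaced, so no tree route is needed.
    if leaves.len() <= 1 {
        return Vec::new();
    }

    // Shortcut 2: fold the lowest common ancestor over all leaves. Forks are
    // short and close to the tip, so these lookups hit the header cache. If
    // the LCA sits above the to-be-finalized block, every leaf descends from
    // the finalized chain and nothing is displaced.
    let mut lca = leaves[0];
    for leaf in &leaves[1..] {
        lca = chain.lowest_common_ancestor(lca, *leaf);
    }
    if chain.number(lca) > chain.number(finalized) {
        return Vec::new();
    }

    // Otherwise fall back to the full (expensive) computation.
    chain.displaced_leaves_via_tree_route(finalized)
}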

## Alternative Approach

- The problem was introduced in #3962. Reverting that PR is another
possible strategy.
- We could store, for every fork, where it begins; however, that sounds a bit more involved to me.


fixes #4614
Ank4n pushed a commit that referenced this issue Jun 14, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024