
Rococo validator using paritydb stuck on startup with "Missing table, starting reindex" #13179

Closed
PierreBesson opened this issue Jan 19, 2023 · 11 comments · Fixed by #13226

Comments

@PierreBesson
Contributor

PierreBesson commented Jan 19, 2023

I have several Rococo validators which remain stuck after startup. The only way to fix the issue is to restore the database from a backup.

Version: 0.9.37
Relevant flags:

--chain=rococo
--validator
--database=paritydb
--pruning=1000
--enable-offchain-indexing=true
2023-01-19 14:08:16.389  INFO main sc_cli::runner: Parity Polkadot    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ✌️  version 0.9.37-645723987cf    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2023    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 📋 Chain specification: Rococo    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 🏷  Node name: rococo-validator-f-node-9    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 👤 Role: AUTHORITY    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 💾 Database: ParityDb at /chain-data/chains/rococo_v2_2/paritydb/full    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ⛓  Native runtime: rococo-9370 (parity-rococo-v2.0-0.tx18.au0)    
2023-01-19 14:08:27.060  WARN main parity-db: Missing table 09-19, starting reindex    
2023-01-19 14:08:27.060  INFO main parity-db: Started reindex for 09-18    

I also noticed that just before the node became stuck like this, it experienced huge CPU/memory usage, which caused several OOM kills. Maybe the database was corrupted by the OOM kills?

[image: CPU/memory usage graph around the time of the OOM kills]
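For context (an editorial illustration, not part of the original report): the hang happens in parity-db's own startup path, where opening the database replays any pending commit logs before returning. A minimal sketch, assuming the parity-db crate API and a placeholder column count, of opening the same directory standalone:

```rust
use parity_db::{Db, Options};

fn main() -> Result<(), parity_db::Error> {
    // Path taken from the node's log output above.
    let path = std::path::Path::new("/chain-data/chains/rococo_v2_2/paritydb/full");
    // The column count must match the node's database layout; 12 is only a placeholder.
    let options = Options::with_columns(path, 12);
    // Db::open replays (enacts) any pending commit logs before returning, which is
    // the step the stuck node never finishes (see the eu-stack report further down).
    let _db = Db::open(&options)?;
    println!("database opened");
    Ok(())
}
```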

@ggwpez transferred this issue from paritytech/polkadot on Jan 19, 2023
@ggwpez
Member

ggwpez commented Jan 19, 2023

Stuck in the sense that it did not print anything anymore? Does it still respond to RPC?
I remember there was a similar issue about stuck nodes.
cc @bkchr

@PierreBesson
Contributor Author

Yes, it does not print anything anymore or respond to RPC.

@bkchr
Member

bkchr commented Jan 19, 2023

Cc @arkpar @cheme

@arkpar
Member

arkpar commented Jan 19, 2023

Can we get a copy of the database?

@arkpar
Member

arkpar commented Jan 19, 2023

The reports I've requested in https://github.com/paritytech/devops/issues/1539 would be useful here.

@aardbol assigned and then unassigned aardbol on Jan 20, 2023
@PierreBesson
Contributor Author

Here is the eu-stack report:

PID 28771 - process
TID 28771:
#0  0x00007f562c3114eb
#1  0x000055a359efe7e5 _$LT$std..io..buffered..bufreader..BufReader$LT$R$GT$$u20$as$u20$std..io..Read$GT$::read_exact::h5ea24cb7b21ce235
#2  0x000055a359f3737f parity_db::index::IndexTable::enact_plan::h776f9c37f4f412db
#3  0x000055a359f36a9e parity_db::column::Column::enact_plan::heb491e23448755eb
#4  0x000055a359f0c707 parity_db::db::DbInner::enact_logs::hdf86a9ce7012f638
#5  0x000055a359f036d0 parity_db::db::Db::open_inner::h339b4797ff8ab9e1
#6  0x000055a35ac92c32 sc_client_db::parity_db::open::h198a5d9b607798de
#7  0x000055a35ac926e5 sc_client_db::utils::open_parity_db::hf4d4efe05aa680f9
#8  0x000055a35ac919d9 sc_client_db::utils::open_database_at::h0df44d5d0dcc1ee0
#9  0x000055a35ac90acc sc_client_db::utils::open_database::hb5792630d9adde8c
#10 0x000055a35ac749eb sc_service::builder::new_db_backend::h784f64f15853cb5b
#11 0x000055a35af44d1f polkadot_service::new_partial_basics::hcd0a94cfdf3ccdcd
#12 0x000055a35a09830e polkadot_service::new_full::h4b35fa300d075633
#13 0x000055a35a091d75 _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h399f21d26f8d9414
#14 0x000055a35a08df38 sc_cli::runner::Runner$LT$C$GT$::run_node_until_exit::hf4c1e2c26018ea20
#15 0x000055a35a065e7d polkadot_cli::command::run::h03cd3a03cb55d5ab
#16 0x000055a359568c11 polkadot::main::hcad9695bb0e3bea5
#17 0x000055a359566b13 std::sys_common::backtrace::__rust_begin_short_backtrace::hc9ae75d99f9dc8d6
#18 0x000055a359566b2e std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h6b9bd7c21bb5d6a8
#19 0x000055a359569317 main
#20 0x00007f562c1d909b __libc_start_main
#21 0x000055a35956672a _start
TID 28774:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28775:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28776:
#0  0x00007f562c2ae38f epoll_wait
#1  0x000055a35c53181c tokio::runtime::io::Driver::turn::h51a482e565093d0a
#2  0x000055a35c53143b tokio::runtime::time::Driver::park_internal::h27a4841415f8b919
#3  0x000055a35c5360cc tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#4  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#5  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#6  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#7  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#8  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#9  0x00007f562c506fa3 start_thread
#10 0x00007f562c2ae06f __clone
TID 28777:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28778:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c1ff parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c35cc50 std::sys_common::backtrace::__rust_begin_short_backtrace::hebc737661396ee6c
#3  0x000055a35c35ca72 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h52b4ebd31a30fd80
#4  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#5  0x00007f562c506fa3 start_thread
#6  0x00007f562c2ae06f __clone

@arkpar
Member

arkpar commented Jan 20, 2023

OK, it looks like the database has a lot of unprocessed commits in the queue that it tries to enact on startup. The node should start eventually. Still, a copy of the database snapshot would be nice.

There seem to have been a lot of writes to the OFFCHAIN column at some point, so I guess some offchain worker decided to insert/update a lot of values at once. @bkchr, any ideas what could cause this?
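For illustration only (an assumption about the write path, not something confirmed in this issue): with --enable-offchain-indexing=true, runtime code can persist data into the node's OFFCHAIN column via sp_io::offchain_index, and a pallet doing this for many items per block would produce this kind of write burst. A hedged sketch, where `index_events`, the "my-pallet/" prefix, and the payload shape are hypothetical:

```rust
use sp_io::offchain_index;

// Sketch only: the real Rococo runtime's offchain-indexed writes may look different.
fn index_events(block_number: u32, payloads: &[Vec<u8>]) {
    for (i, payload) in payloads.iter().enumerate() {
        // Each `set` call is persisted into the node's OFFCHAIN column when the
        // block is imported with offchain indexing enabled.
        let key = [
            b"my-pallet/".as_slice(),
            &block_number.to_le_bytes(),
            &(i as u32).to_le_bytes(),
        ]
        .concat();
        offchain_index::set(&key, payload);
    }
}
```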

@aardbol
Contributor

aardbol commented Jan 20, 2023

Here's also a copy of the db, compressed with zstd: https://storage.googleapis.com/rococo-blockstore-backups/rococo-paritydb-archive.tar.zst

@PierreBesson
Contributor Author

PierreBesson commented Jan 20, 2023

The node should start eventually.

@arkpar this particular node has been stuck for 4 days and 6 hours and has minimal CPU usage.

@arkpar
Member

arkpar commented Jan 20, 2023

After examining the database: it seems like there's been a collision in the STATE column, i.e. a lot of similar keys have been inserted, which blew up the index to 4 TiB. Normally this should not happen, because the STATE column only contains trie nodes, so the keys should be uniformly distributed. I'm going to sync Rococo to try to reproduce this and will get back in a few days.
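A toy illustration of the point about key distribution (not parity-db's actual index code): a hash-indexed table buckets entries by the top bits of the key, so trie-node hashes spread evenly across buckets, while many keys sharing the same high bytes all collide into one bucket and keep forcing the index to grow:

```rust
// Toy model only: parity-db's real index format differs, but the bucketing idea is the same.
fn bucket_of(key: &[u8; 32], index_bits: u32) -> u64 {
    // Use the top `index_bits` bits of the key as the bucket address.
    let top = u64::from_be_bytes(key[..8].try_into().unwrap());
    top >> (64 - index_bits)
}

fn main() {
    let index_bits = 16; // 2^16 buckets

    // Hash-like keys: high bytes differ, so they land in different buckets.
    let mut hash_a = [0u8; 32];
    hash_a[0] = 0x3c;
    let mut hash_b = [0u8; 32];
    hash_b[0] = 0xd1;

    // "Similar" keys: identical high bytes, differing only in the tail,
    // all map to the same bucket and must be resolved by growing the index.
    let mut similar_a = [0xaau8; 32];
    similar_a[31] = 1;
    let mut similar_b = [0xaau8; 32];
    similar_b[31] = 2;

    println!("hash_a    -> bucket {}", bucket_of(&hash_a, index_bits));
    println!("hash_b    -> bucket {}", bucket_of(&hash_b, index_bits));
    println!("similar_a -> bucket {}", bucket_of(&similar_a, index_bits));
    println!("similar_b -> bucket {}", bucket_of(&similar_b, index_bits));
}
```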

@arkpar
Member

arkpar commented Jan 25, 2023

Fixed in #13226. Note that the database linked in this comment is already damaged, so I'd recommend making a new sync from scratch.
