
Orderer fails to start after being down for some time: failed to open WAL: fileutil: file already locked #2931

Closed
kopaygorodsky opened this issue Sep 16, 2021 · 12 comments · Fixed by #3284

Comments

@kopaygorodsky (Contributor) commented Sep 16, 2021

It happened on my network when an orderer crashed because no space was left on the device. I'm running the nodes in k8s, so I just resized the related PersistentVolumeClaim and restarted the node. It now fails to start with the error fileutil: file already locked. This happens only for one specific channel; the other channels are OK and replicate just fine.

Logs : https://pastebin.com/pb1XSGn9

For some reason the orderer tries to HandleChain twice for the same channel and fails on the second attempt.
Maybe a race condition happens when it logs Restarting raft node channel=channel6: the previous node hadn't closed the storage yet while the new one already tried to use it.
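
For context, the error string comes from etcd's WAL file lock. Below is a minimal standalone sketch (not Fabric code; the etcd import path and the lock-file name are illustrative and may differ per version) showing how a second attempt to lock the same WAL lock file fails in exactly this way:

package main

import (
    "fmt"
    "os"
    "path/filepath"

    // Import path may differ depending on the etcd version being vendored.
    "go.etcd.io/etcd/client/pkg/v3/fileutil"
)

func main() {
    // Hypothetical lock file standing in for the per-channel WAL directory lock.
    lockPath := filepath.Join(os.TempDir(), "channel6-wal.lock")

    first, err := fileutil.TryLockFile(lockPath, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode)
    if err != nil {
        panic(err)
    }
    defer first.Close()

    // A second chain instance trying to reuse the same WAL storage before the
    // first one released it gets fileutil.ErrLocked.
    if _, err := fileutil.TryLockFile(lockPath, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode); err != nil {
        fmt.Println("second open failed:", err) // prints: fileutil: file already locked
    }
}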

@yacovm (Contributor) commented Sep 17, 2021

2021-09-16 23:14:04.549 UTC [orderer.common.cluster] Configure -> INFO 079 Entering, channel: channel6, nodes: []
2021-09-16 23:14:04.549 UTC [orderer.consensus.etcdraft] start -> DEBU 07b Starting raft node: #peers: 1 channel=channel6 node=1

What is going on here? Is that node the only one in that channel?

@kopaygorodsky (Contributor, Author) commented Sep 17, 2021

Yes, it was a dev environment with a single orderer node. I managed to reproduce the same issue on a network with 3 orderers.

@kopaygorodsky (Contributor, Author) commented Sep 18, 2021

@yacovm pointed out the root cause: a channel was being created when the disk space ran out, so the orderer crashed. When it boots, the system channel is initialized before the application channel; it sees the channel creation wasn't finished properly, so it creates the channel again and also initializes the chain. Afterwards, the application channel is started as well.
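
A hypothetical sketch of the kind of guard that would avoid the double start (the registry type and method names below are illustrative, not Fabric's actual code):

package main

import (
    "fmt"
    "sync"
)

// chainRegistry is a hypothetical structure tracking which channels already
// have a running chain, so that replaying an unfinished channel creation at
// startup does not trigger a second HandleChain (and a second WAL open).
type chainRegistry struct {
    mu     sync.Mutex
    chains map[string]bool
}

func (r *chainRegistry) startOnce(channel string, start func()) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.chains[channel] {
        fmt.Printf("chain for %s already started, skipping second start\n", channel)
        return
    }
    r.chains[channel] = true
    start()
}

func main() {
    r := &chainRegistry{chains: map[string]bool{}}
    r.startOnce("channel6", func() { fmt.Println("starting raft node for channel6") })
    // The second attempt, coming from the replayed channel creation, is skipped
    // instead of locking the same WAL directory again.
    r.startOnce("channel6", func() { fmt.Println("would open the same WAL again") })
}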

@yacovm (Contributor) commented Sep 29, 2021

Merged in 2.3, 2.2, 1.4 and main.

@guoger (Member) commented Mar 4, 2022

I think there's another issue caused by this problem: if we have two consecutive blocks creating the same channel, and a new orderer is being added to that channel, it will mistakenly recognize the second block as the genesis block and later crash due to a hash pointer mismatch.

Possibly related code:

for seq := uint64(0); seq < lastConfigBlockNum; seq++ {
    block = ci.Puller.PullBlock(seq)
    if block == nil {
        ci.Logger.Panicf("Failed pulling block [%d] from the system channel", seq)
    }
    ci.validateHashPointer(block, prevHash)
    // Set the previous hash for the next iteration
    prevHash = protoutil.BlockHeaderHash(block.Header) //lint:ignore SA5011 logs and panics above
    channel, gb, err := ExtractGenesisBlock(ci.Logger, block)
    if err != nil {
        // If we failed to inspect a block, something is wrong in the system chain
        // we're trying to pull, so abort.
        ci.Logger.Panicf("Failed extracting channel genesis block from config block: %v", err)
    }
    if channel == "" {
        ci.Logger.Info("Block", seq, "doesn't contain a new channel")
        continue
    }
    ci.Logger.Info("Block", seq, "contains channel", channel)
    channels[channel] = ChannelGenesisBlock{
        ChannelName:  channel,
        GenesisBlock: gb,
    }
}

A possible fix would be to ignore the second channel-creation block during the loop.
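
A self-contained sketch of that idea (hypothetical types and names, not the merged fix): keep the first creation block seen for each channel and skip later blocks that create the same channel again, so the recorded genesis block is the one the application channel actually chains from.

package main

import "fmt"

// creationBlock stands in for a system-channel block; channel is empty when
// the block does not create a channel.
type creationBlock struct {
    seq     uint64
    channel string
}

// collectGenesisBlocks maps each channel name to the sequence number of the
// first block that created it, ignoring any duplicate creation blocks.
func collectGenesisBlocks(blocks []creationBlock) map[string]uint64 {
    channels := map[string]uint64{}
    for _, b := range blocks {
        if b.channel == "" {
            continue
        }
        if _, exists := channels[b.channel]; exists {
            fmt.Printf("block %d re-creates channel %s; keeping the first occurrence\n", b.seq, b.channel)
            continue
        }
        channels[b.channel] = b.seq
    }
    return channels
}

func main() {
    channels := collectGenesisBlocks([]creationBlock{
        {seq: 5, channel: "channel6"},
        {seq: 6, channel: "channel6"}, // duplicate creation written after the crash
    })
    fmt.Println(channels["channel6"]) // 5: the first creation block wins
}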

cc @yacovm @kopaygorodsky

@guoger reopened this Mar 4, 2022
@yacovm (Contributor) commented Mar 6, 2022

> I think there's another issue caused by this problem: if we have two consecutive blocks creating the same channel, and a new orderer is being added to that channel, it will mistakenly recognize the second block as the genesis block and later crash due to a hash pointer mismatch. [...] A possible fix would be to ignore the second channel-creation block during the loop.

I think the right approach would be to treat channel creations as "config in flight" semantics - stop the pipeline until the channel creation block is committed. This would ensure only one channel creation with the same name.
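
A minimal, hypothetical sketch of that semantics (none of these names exist in Fabric): reject or hold further creations of a channel while one is still in flight, releasing the gate only once the creation block commits.

package main

import (
    "fmt"
    "sync"
)

// creationGate is a hypothetical guard modelling "config in flight" semantics:
// only one channel creation per name may be in flight at a time.
type creationGate struct {
    mu      sync.Mutex
    pending map[string]bool
}

func (g *creationGate) tryBegin(channel string) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.pending[channel] {
        return false // a creation block for this channel is not yet committed
    }
    g.pending[channel] = true
    return true
}

func (g *creationGate) committed(channel string) {
    g.mu.Lock()
    defer g.mu.Unlock()
    delete(g.pending, channel)
}

func main() {
    g := &creationGate{pending: map[string]bool{}}
    fmt.Println(g.tryBegin("channel6")) // true: first creation accepted
    fmt.Println(g.tryBegin("channel6")) // false: blocked while in flight
    g.committed("channel6")
    fmt.Println(g.tryBegin("channel6")) // true again after the block committed
}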

@guoger (Member) commented Mar 6, 2022

> I think the right approach would be to treat channel creations as "config in flight" semantics - stop the pipeline until the channel creation block is committed. This would ensure only one channel creation with the same name.

Yep, that would be a more proper solution. Although we still need to address the issue in the code mentioned above, to accommodate networks already running in production, which might already suffer from this: having two consecutive blocks in the system channel that create the same application channel.

@yacovm (Contributor) commented Mar 6, 2022

So the function you mention already ignores the second channel creation, since it overrides the map entry, does it not? And come to Discord; it's easier to collaborate there on this topic.

@yacovm (Contributor) commented Mar 7, 2022

@guoger do you want to open a new issue for your use case? We can then make a trivial PR that skips that double occurrence and uses the first one.

@Param-S (Contributor) commented Mar 9, 2022

@yacovm

> So the function you mention already ignores the second channel creation, since it overrides the map entry, does it not?

I understood the logic the same way: the existing code ignores the duplicate because it overrides the map entry keyed by the channel name.

But about the next comment on needing a trivial PR to check and skip the double occurrence:

> We can then make a trivial PR that skips that double occurrence and uses the first one.

I assume this additional check is required because the application channel's genesis block is based on the first block, and when the application channel is pulled later the hash might mismatch. Could you correct me if I misunderstood or am missing any points here?

@yacovm (Contributor) commented Mar 10, 2022

You understood correctly; we need to skip the second, third, and so on.

@guoger (Member) commented Mar 14, 2022

> @guoger do you want to open a new issue for your use case? We can then make a trivial PR that skips that double occurrence and uses the first one.

Sorry for the late reply. Shall I just open a PR and link it to this one?

guoger added a commit to guoger/fabric that referenced this issue Mar 14, 2022
fix hyperledger#2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
ale-linux pushed a commit that referenced this issue Mar 14, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
mergify bot pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
mergify bot pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
ale-linux pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
ale-linux pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)