
Orderer fails to start after being down for some time: failed to open WAL: fileutil: file already locked #2931

Closed
kopaygorodsky opened this issue Sep 16, 2021 · 12 comments · Fixed by #3284

Comments

@kopaygorodsky (Contributor) commented Sep 16, 2021

It happened on my network when an orderer crashed because no space was left on the device. I'm running the nodes in k8s, so I just resized the related PersistentVolumeClaim and restarted the node. It now fails to start with the error fileutil: file already locked. This happens only for one specific channel; the other channels are OK and replicate just fine.

Logs : https://pastebin.com/pb1XSGn9

For some reason the orderer tries to HandleChain twice for the same channel and fails on the second attempt.
Maybe a race condition happens when it logs Restarting raft node channel=channel6: the previous node hadn't closed the storage yet while the new one already tried to use it.
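
For context, the error string comes from etcd's WAL file lock. Below is a minimal standalone sketch (not Fabric code; the etcd import path and the lock-file name are illustrative and may differ per version) showing how a second attempt to lock the same WAL lock file fails in exactly this way:

package main

import (
    "fmt"
    "os"
    "path/filepath"

    // Import path may differ depending on the etcd version being vendored.
    "go.etcd.io/etcd/client/pkg/v3/fileutil"
)

func main() {
    // Hypothetical lock file standing in for the per-channel WAL directory lock.
    lockPath := filepath.Join(os.TempDir(), "channel6-wal.lock")

    first, err := fileutil.TryLockFile(lockPath, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode)
    if err != nil {
        panic(err)
    }
    defer first.Close()

    // A second chain instance trying to reuse the same WAL storage before the
    // first one released it gets fileutil.ErrLocked.
    if _, err := fileutil.TryLockFile(lockPath, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode); err != nil {
        fmt.Println("second open failed:", err) // prints: fileutil: file already locked
    }
}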

@yacovm (Contributor) commented Sep 17, 2021

2021-09-16 23:14:04.549 UTC [orderer.common.cluster] Configure -> INFO 079 Entering, channel: channel6, nodes: []
2021-09-16 23:14:04.549 UTC [orderer.consensus.etcdraft] start -> DEBU 07b Starting raft node: #peers: 1 channel=channel6 node=1

What is going on here? Is that node the only one in that channel?

@kopaygorodsky (Contributor, Author) commented Sep 17, 2021

Yes, it was a dev environment with a single orderer node. I managed to reproduce the same issue on a network with 3 orderers.

@kopaygorodsky (Contributor, Author) commented Sep 18, 2021

@yacovm pointed out the root cause: a channel was being created when the disk space ran out, so the orderer crashed. When it boots, the system channel is initialized before the application channel; it sees the channel creation wasn't finished properly, so it creates the channel again and also initializes the chain. Afterwards, the application channel is started as well.
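
A hypothetical sketch of the kind of guard that would avoid the double start (the registry type and method names below are illustrative, not Fabric's actual code):

package main

import (
    "fmt"
    "sync"
)

// chainRegistry is a hypothetical structure tracking which channels already
// have a running chain, so that replaying an unfinished channel creation at
// startup does not trigger a second HandleChain (and a second WAL open).
type chainRegistry struct {
    mu     sync.Mutex
    chains map[string]bool
}

func (r *chainRegistry) startOnce(channel string, start func()) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.chains[channel] {
        fmt.Printf("chain for %s already started, skipping second start\n", channel)
        return
    }
    r.chains[channel] = true
    start()
}

func main() {
    r := &chainRegistry{chains: map[string]bool{}}
    r.startOnce("channel6", func() { fmt.Println("starting raft node for channel6") })
    // The second attempt, coming from the replayed channel creation, is skipped
    // instead of locking the same WAL directory again.
    r.startOnce("channel6", func() { fmt.Println("would open the same WAL again") })
}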

@yacovm (Contributor) commented Sep 29, 2021

Merged in 2.3, 2.2, 1.4 and main.

@guoger (Member) commented Mar 4, 2022

I think there's another issue caused by this problem: if we have two consecutive blocks creating the same channel, and a new orderer is being added to that channel, it will mistakenly recognize the second block as the genesis block and later crash due to a hash pointer mismatch.

Possibly related code:

for seq := uint64(0); seq < lastConfigBlockNum; seq++ {
    block = ci.Puller.PullBlock(seq)
    if block == nil {
        ci.Logger.Panicf("Failed pulling block [%d] from the system channel", seq)
    }
    ci.validateHashPointer(block, prevHash)
    // Set the previous hash for the next iteration
    prevHash = protoutil.BlockHeaderHash(block.Header) //lint:ignore SA5011 logs and panics above
    channel, gb, err := ExtractGenesisBlock(ci.Logger, block)
    if err != nil {
        // If we failed to inspect a block, something is wrong in the system chain
        // we're trying to pull, so abort.
        ci.Logger.Panicf("Failed extracting channel genesis block from config block: %v", err)
    }
    if channel == "" {
        ci.Logger.Info("Block", seq, "doesn't contain a new channel")
        continue
    }
    ci.Logger.Info("Block", seq, "contains channel", channel)
    channels[channel] = ChannelGenesisBlock{
        ChannelName:  channel,
        GenesisBlock: gb,
    }
}

A possible fix would be to ignore the second channel-creation block during the loop.
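
A self-contained sketch of that idea (hypothetical types and names, not the merged fix): keep the first creation block seen for each channel and skip later blocks that create the same channel again, so the recorded genesis block is the one the application channel actually chains from.

package main

import "fmt"

// creationBlock stands in for a system-channel block; channel is empty when
// the block does not create a channel.
type creationBlock struct {
    seq     uint64
    channel string
}

// collectGenesisBlocks maps each channel name to the sequence number of the
// first block that created it, ignoring any duplicate creation blocks.
func collectGenesisBlocks(blocks []creationBlock) map[string]uint64 {
    channels := map[string]uint64{}
    for _, b := range blocks {
        if b.channel == "" {
            continue
        }
        if _, exists := channels[b.channel]; exists {
            fmt.Printf("block %d re-creates channel %s; keeping the first occurrence\n", b.seq, b.channel)
            continue
        }
        channels[b.channel] = b.seq
    }
    return channels
}

func main() {
    channels := collectGenesisBlocks([]creationBlock{
        {seq: 5, channel: "channel6"},
        {seq: 6, channel: "channel6"}, // duplicate creation written after the crash
    })
    fmt.Println(channels["channel6"]) // 5: the first creation block wins
}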

cc @yacovm @kopaygorodsky

@guoger reopened this Mar 4, 2022
@yacovm (Contributor) commented Mar 6, 2022

> I think there's another issue caused by this problem: if we have two consecutive blocks creating the same channel, and a new orderer is being added to that channel, it will mistakenly recognize the second block as the genesis block and later crash due to a hash pointer mismatch. [...] A possible fix would be to ignore the second channel-creation block during the loop.

I think the right approach would be to treat channel creations as "config in flight" semantics - stop the pipeline until the channel creation block is committed. This would ensure only one channel creation with the same name.
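
A minimal, hypothetical sketch of that semantics (none of these names exist in Fabric): reject or hold further creations of a channel while one is still in flight, releasing the gate only once the creation block commits.

package main

import (
    "fmt"
    "sync"
)

// creationGate is a hypothetical guard modelling "config in flight" semantics:
// only one channel creation per name may be in flight at a time.
type creationGate struct {
    mu      sync.Mutex
    pending map[string]bool
}

func (g *creationGate) tryBegin(channel string) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.pending[channel] {
        return false // a creation block for this channel is not yet committed
    }
    g.pending[channel] = true
    return true
}

func (g *creationGate) committed(channel string) {
    g.mu.Lock()
    defer g.mu.Unlock()
    delete(g.pending, channel)
}

func main() {
    g := &creationGate{pending: map[string]bool{}}
    fmt.Println(g.tryBegin("channel6")) // true: first creation accepted
    fmt.Println(g.tryBegin("channel6")) // false: blocked while in flight
    g.committed("channel6")
    fmt.Println(g.tryBegin("channel6")) // true again after the block committed
}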

@guoger (Member) commented Mar 6, 2022

> I think the right approach would be to treat channel creations as "config in flight" semantics - stop the pipeline until the channel creation block is committed. This would ensure only one channel creation with the same name.

Yep, that would be a more proper solution. Although we still need to address the issue in the code mentioned above, to accommodate networks already running in production, which might already suffer from this: having two consecutive blocks in the system channel that create the same application channel.

@yacovm (Contributor) commented Mar 6, 2022

So the function you mention already ignores the second channel creation, since it overrides the map entry, does it not? And come to Discord; it's easier to collaborate there on this topic.

@yacovm (Contributor) commented Mar 7, 2022

@guoger do you want to open a new issue for your use case? We can then make a trivial PR that skips that double occurrence and uses the first one.

@Param-S (Contributor) commented Mar 9, 2022

@yacovm

> So the function you mention already ignores the second channel creation, since it overrides the map entry, does it not?

I understood the logic the same way: the existing code ignores the duplicate because it overrides the map entry keyed by the channel name.

But about the next comment on needing a trivial PR to check and skip the double occurrence:

> We can then make a trivial PR that skips that double occurrence and uses the first one.

I assume this additional check is required because the application channel's genesis block is based on the first block, and when the application channel is pulled later the hash might mismatch. Could you correct me if I misunderstood or am missing any points here?

@yacovm (Contributor) commented Mar 10, 2022

You understood correctly; we need to skip the second, third, and so on.

@guoger (Member) commented Mar 14, 2022

> @guoger do you want to open a new issue for your use case? We can then make a trivial PR that skips that double occurrence and uses the first one.

Sorry for the late reply. Shall I just open a PR and link it to this one?

guoger added a commit to guoger/fabric that referenced this issue Mar 14, 2022
fix hyperledger#2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
ale-linux pushed a commit that referenced this issue Mar 14, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
mergify bot pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
mergify bot pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
ale-linux pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)
ale-linux pushed a commit that referenced this issue Mar 16, 2022
fix #2931

Signed-off-by: Jay Guo <guojiannan1101@gmail.com>
(cherry picked from commit 434a8d7)