Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FAB-11334] Adds a new 'peer node unjoin' feature #2732

Merged
merged 1 commit into from
Jul 7, 2021

Conversation

jkneubuh
Copy link
Contributor

@jkneubuh jkneubuh commented Jul 1, 2021

The UnjoinChannel routine will mark a ledger as pending delete, removing all
channel specific data from the peer. The routine must be invoked while the
peer is down. Peer operators may issue the deletion by running the companion
'peer node unjoin' command from the console.

Type of change

  • New feature

Description

As a peer admin, I need a way to pause/resume a channel

Unjoining a is a helpful routine for operators managing channel lifecycles. In some cases, a network participant may opt out of a channel, requesting a complete deletion of all channel specific data from the ledger. While an operator may issue a node reset after pausing a channel, this is not viable in cases where the re-initialization requires the synchronization of a large (millions / billions) number of transactions.

Related issues

FAB-16035
FAB-4481
FAB-17787
FAB-17801

@jkneubuh jkneubuh requested a review from a team as a code owner July 1, 2021 13:37
@yacovm
Copy link
Contributor

yacovm commented Jul 1, 2021

The routine must be invoked while the peer is down.

Can you explain what is the fundamental reason we are shutting down the peer for that? Can't this be done while the peer is running and servicing other channels?

@jkneubuh
Copy link
Contributor Author

jkneubuh commented Jul 1, 2021

Hi @yacovm, thanks for the inquiry and this is a good question. There is no fundamental reason why unjoin can not operate while the peer is running. @manish-sethi articulated the pros/cons of running the unjoin in both online and offline modes - after review of approach alternatives we elected to run the unjoin in offline mode for simplicity and as a compromise to the current operational practice of pause+reset. Everyone is in agreement that there is still a need for the online unjoin channel operation.

Copy link
Contributor

@manish-sethi manish-sethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jkneubuh for your first PR. Left you a few comments.

Comment on lines 32 to 33
return errors.Wrap(err, "as another peer node command is executing,"+
" wait for that command to complete its execution or terminate it before retrying")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and other places below: errors.WithMessage fits better here. Wrap function creates a new stack trace that we use if the underlying error is thrown by a third party library.

Comment on lines 59 to 61
if err := idStore.updateLedgerStatus(ledgerID, status); err != nil {
return err
}

return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := idStore.updateLedgerStatus(ledgerID, status); err != nil {
return err
}
return nil
return idStore.updateLedgerStatus(ledgerID, status)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I mildly disagree with this pattern / convention. It's OK but I actually prefer the explicit assignment and test of the error routine for absolute consistency. It seems like many (all?) golang routines consist of a serial flow of do-something+check-error+do-something+check-error+... invocations. When then final invocation in a series of calls is used as the error for the outer function, the reader needs to stop and think about the return type of the last call in the chain, rather than recognize the pattern as a a cookie cutter linear flow. For instance, in the two examples above, the explicit 'return nil' case is 100% clear that the routine is returning an OK result. In the latter, looking at the bottom of the routine it's not 100% clear that the method is returning an error, the new status, or the previous status of the ledger. Call me simple but I prefer the explicit style - the intent is clearer, and leaves opportunities for post-call logging, future method calls without restructuring, and ... ease of use when attached to a debugger. IMHO it's worth the extra typing. :)

Comment on lines 77 to 80
// Subsequent unjoins will not throw an error
require.NoError(t, UnjoinChannel(conf, ledgerID))
require.NoError(t, UnjoinChannel(conf, ledgerID))
require.NoError(t, UnjoinChannel(conf, ledgerID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This is not a desired behavior. You got this here because you missed deleting the entry from the idStore. Otherwise, it would have rightly returned this error.

  • Another aspect of the behavior would be that it should be able to create the ledger with the same ID again. But this test perhaps fits better in kvledger/tests folder - where we maintain the tests for testing the behavior under various scenarios (including when the ledger contains private data) unlike in this package where the tests are more for very basic code coverage. Keeping that in mind, I would suggest that the two tests that you may want to retain here in package would be the first and last only as the rest of the tests would be covered in kvledger/tests package.

if err := removeLedgerData(config, ledgerID); err != nil {
return errors.Wrapf(err, "deleting ledger [%v]", ledgerID)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • After having the data deleted successfully, we need to delete the entry of the ledger from the idStore; leaving the peer in the state that it can join a channel with same name again.
  • We would need make change in the boot process so that we continue to delete the partially deleted ledgers, perhaps because of a crash during execution of this commend - as we do for partially constructed here. But this change can be done in a separate PR as it would need it's own tests to make sure the crash consistency of the deletion.

require.NoError(t, UnjoinChannel(conf, ledgerID))
}

func TestUpdateLedgerStatus(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it does not harm having this as a separate test but in any case, the assertions made in this tests should be made in the test TestUnjoinChannel above and as such this test can be removed.

Comment on lines 126 to 127
require.Error(t, UnjoinChannel(conf, ledgerID),
"as another peer node command is executing, wait for that command to complete its execution or terminate it before retrying")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For matching the errors, either we use EqualError or Contains funciton.

@yacovm
Copy link
Contributor

yacovm commented Jul 1, 2021

Hi @yacovm, thanks for the inquiry and this is a good question. There is no fundamental reason why unjoin can not operate while the peer is running. @manish-sethi articulated the pros/cons of running the unjoin in both online and offline modes - after review of approach alternatives we elected to run the unjoin in offline mode for simplicity and as a compromise to the current operational practice of pause+reset. Everyone is in agreement that there is still a need for the online unjoin channel operation.

Understood, thanks.

So the reason I asked that, is, that if the peer boots after being operated on and having the channel being marked as under deletion, it will obviously not initialize the in-memory structures and goroutines of that channel, because it will first delete the ledger and then resume bootstrapping.
However, other peers will then still think that peer is in the channel, and will try to replicate blocks from it, and send blocks to it, and the peer will still appear in discovery queries for a while. While it won't really harm any operations, it will pollute the logs, confuse clients (incorrect discovery results) and will surely raise questions from operators and administrators.

In gossip, I designed the ability to make the peer leave a channel (many years ago, in 2017) and it was under the assumption the channel leave API will be triggered when the peer is online, which will give the peer time to broadcast a message that makes other peers ignore it is in the channel.

While it is possible to still artificially call JoinChan after bootstrapping, on gossip, just for the channel, and then immediately call LeaveChannel, to make that message broadcasted, it is not effective as the membership view will most likely be empty when the peer boots up.

So, maybe this gap should be documented at least.

@jkneubuh jkneubuh force-pushed the feature/unjoin-channel branch 2 times, most recently from 4458e58 to 3562edd Compare July 6, 2021 18:34
The UnjoinChannel routine will mark a ledger as pending delete, removing all
channel specific data from the peer.  The routine must be invoked while the
peer is down.  Peer operators may issue the deletion by running the companion
'peer node unjoin' command from the console.

Signed-off-by: Josh Kneubuhl <jkneubuh@us.ibm.com>
Copy link
Contributor

@manish-sethi manish-sethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple nit comments. Can be fixed in a separate PR or with next PR.

Comment on lines +60 to +61
logger.Infow("channel has been successfully unjoined", "ledgerID", ledgerID)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike error strings, log statements would typically start with a full sentence.

Comment on lines +26 to +27
// UnjoinChannel removes the data for a ledger and sets the status to UNDER_DELETION. This function is to be
// invoked while the peer is shut down.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale function comment as this function now clears the UNDER_DELETION status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants