
wip - etcd 3.5.0-alpha-0 #2452

Closed
wants to merge 2 commits

Conversation

@sykesm (Contributor) commented Feb 27, 2021

@jyellick @yacovm - An alpha of etcd v3.5 has been released as a module. I've pulled it in and done just enough work to make it compile. I thought you might want to take a look and see if it's something we want to pull in at some point in the near future.

Right now a number of tests are failing with a panic from etcd:

goroutine 1 [running]:
go.etcd.io/etcd/raft/v3.StartNode(0xc00050c200, 0x0, 0x0, 0x0, 0xc0001ec558, 0x1)
       /home/sykesm/fabric/vendor/go.etcd.io/etcd/raft/v3/node.go:218 +0x40b
github.com/hyperledger/fabric/orderer/consensus/etcdraft.(*node).start(0xc000302000, 0x101)
       /home/sykesm/fabric/orderer/consensus/etcdraft/node.go:73 +0x297
github.com/hyperledger/fabric/orderer/consensus/etcdraft.(*Chain).Start(0xc000afc600)
       /home/sykesm/fabric/orderer/consensus/etcdraft/chain.go:361 +0x1b5
github.com/hyperledger/fabric/orderer/common/multichannel.(*ChainSupport).start(...)
       /home/sykesm/fabric/orderer/common/multichannel/chainsupport.go:120
github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).startChannels(0xc000ab0000)
       /home/sykesm/fabric/orderer/common/multichannel/registrar.go:175 +0xb5
github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).Initialize(0xc000ab0000, 0xc000aa2450)
       /home/sykesm/fabric/orderer/common/multichannel/registrar.go:145 +0x8e
github.com/hyperledger/fabric/orderer/common/server.initializeMultichannelRegistrar(0xc00007c680, 0xc0000fc200, 0xc00034c870, 0x0, 0x0, 0xc000375800, 0x3a0, 0x1, ...)
       /home/sykesm/fabric/orderer/common/server/main.go:824 +0x465
github.com/hyperledger/fabric/orderer/common/server.Main()
       /home/sykesm/fabric/orderer/common/server/main.go:245 +0x10ef
main.main()
       /home/sykesm/fabric/cmd/orderer/main.go:15 +0x25

One of the fields on the protobuf messages exchanged by raft also changed names (but appears to have kept the same field number and semantics).
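For illustration (my sketch, not part of the PR): the rename in question appears to be raftpb.ConfState's Nodes slice becoming Voters. Since protobuf identifies fields by number rather than name, and the field number appears unchanged, old and new binaries should still agree on the wire encoding:

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	// In the v3.4 raftpb this was ConfState{Nodes: ...}; the slice is now
	// named Voters, but the field number appears unchanged, so the
	// marshalled bytes should be the same either way.
	cs := raftpb.ConfState{Voters: []uint64{1, 2, 3}}
	b, err := cs.Marshal()
	if err != nil {
		panic(err)
	}
	fmt.Printf("wire encoding: %x\n", b)
}
```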

sykesm and others added 2 commits February 26, 2021 19:10
Signed-off-by: Matthew Sykes <sykesmat@us.ibm.com>
Upstream commit etcd-io/etcd@c9491d7
strictly distinguishes the StartNode API call of the etcd/raft Node from RestartNode.

This commit:

    (1) Aligns our code with the aforementioned API changes.
    (2) Makes the logic in node.start() easier to reason about.

Change-Id: I48070de9241ebcaae720fb1d7dc7c7d1e507625a
Signed-off-by: Yacov Manevich <yacovm@il.ibm.com>
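Not part of the PR, just a sketch of the split the commit describes: in v3.5, raft.StartNode is only for bootstrapping a brand-new cluster and panics when given no peers (which matches the panic in the trace above), while raft.RestartNode recovers membership from storage. The `fresh` flag below is a hypothetical stand-in for however the caller makes that decision:

```go
package etcdraft

import "go.etcd.io/etcd/raft/v3"

// startRaftNode is illustrative only; it is not Fabric's actual node.start().
func startRaftNode(cfg *raft.Config, peers []raft.Peer, fresh bool) raft.Node {
	if fresh {
		// Brand-new cluster: StartNode bootstraps the initial membership.
		// In v3.5 it panics if peers is empty ("no peers given; use
		// RestartNode instead").
		return raft.StartNode(cfg, peers)
	}
	// Existing WAL/snapshot state: membership is recovered from storage,
	// so RestartNode is used and takes no peer list.
	return raft.RestartNode(cfg)
}
```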
@yacovm (Contributor) commented Feb 27, 2021

> Right now a number of tests are failing with a panic from etcd:

The panic is no more, but there may be other issues:

ChannelParticipation three node etcdraft network without a system channel [It] joins application channels from genesis block and removes a channel using the channel participation API 
/home/vsts/work/1/fabric/integration/raft/channel_participation_test.go:122

  Timed out after 60.809s.
  Expected
      <channelparticipation.ChannelInfo>: {
          Name: "participation-trophy",
          URL: "/participation/v1/channels/participation-trophy",
          Status: "active",
          ConsensusRelation: "consenter",
          Height: 6,
      }
  to equal
      <channelparticipation.ChannelInfo>: {
          Name: "participation-trophy",
          URL: "/participation/v1/channels/participation-trophy",
          Status: "active",
          ConsensusRelation: "consenter",
          Height: 7,
      }

  /home/vsts/work/1/fabric/integration/raft/channel_participation_test.go:1308

Not sure if this failure is a flake or a genuine problem. Probably worth running the tests locally to see.

@sykesm (Contributor, Author) commented Mar 1, 2021

@yacovm, thanks for looking. The test that failed on your end is failing for me as well. There are also 5 or 6 unit tests (mostly related to reconfiguration) that are failing.

For grins I tried to pull your change into a tree with the current version of etcd, and there were failures there as well. I was sort of hoping the cleanup could be applied independently, since it made the code much easier to read.

So, none of this is urgent, but I think we need to track it. It would be great for @jyellick to weigh in as well.

@yacovm (Contributor) commented Mar 1, 2021

> The test that failed on your end is failing for me as well. There are also 5 or 6 unit tests (mostly related to reconfiguration) that are failing.

But UTs passed in the latest build after my commit.

@sykesm (Contributor, Author) commented Mar 1, 2021

> > The test that failed on your end is failing for me as well. There are also 5 or 6 unit tests (mostly related to reconfiguration) that are failing.
>
> But UTs passed in the latest build after my commit.

No. They're "pended" out in the code:

PIt("should be able to process config update adding single node", func() {

@yacovm (Contributor) commented Mar 1, 2021

> > > The test that failed on your end is failing for me as well. There are also 5 or 6 unit tests (mostly related to reconfiguration) that are failing.
> >
> > But UTs passed in the latest build after my commit.
>
> No. They're "pended" out in the code:
>
> PIt("should be able to process config update adding single node", func() {

I see. I hadn't looked at the code, and I was wondering how the unit tests could pass while the integration tests didn't.

@jyellick (Contributor) commented Mar 2, 2021

Definitely seems like something we'd like to pick up, and at least thus far the changes don't seem too invasive. I've pulled the code and poked around a little, but haven't solved the UT failures yet. @guoger, if you have some spare cycles, you might have some thoughts as well.

@jyellick (Contributor) commented Mar 5, 2021

It looks like etcdraft has gotten a little smarter about how it evicts nodes from the configuration, and that is interacting badly with Fabric splitting the config change across two indexes (one for the config block, the next for the actual raft conf change), or at least that's how things look to me. Notably, the conf state used to simply list nodes and learners; now we have voters, learners, voters outgoing, and learners next. There's also a chance that some of those rote compile changes from Nodes to Voters aren't actually what we wanted, so they might deserve a bit more scrutiny.

One thing I think is important to note: we've historically said that we should really combine the two steps of proposing a config block and proposing the actual raft config change, as splitting them across entries creates yet another point in reconfiguration where nodes can crash and create headaches for us. On the other hand, changing this behavior would create a bit of an upgrade headache (or maybe we could figure out a clever way to handle it). It's probably worth investigating the interoperability between the v1/v3 network ABIs, since if they are not compatible (and we will require downtime to upgrade), now might be an opportune time to change the two-step behavior.

I know that @guoger has started to look at this, so perhaps he will have some additional insight.
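For reference, the conf state shape described above, as it appears in the v3 raftpb (the wrapper function is just an illustration, not Fabric code):

```go
package etcdraft

import "go.etcd.io/etcd/raft/v3/raftpb"

// confStateFields is illustrative only: it names the v3 ConfState fields,
// which track joint-consensus membership instead of a flat node list.
func confStateFields(cs raftpb.ConfState) {
	_ = cs.Voters         // was Nodes: the current voting members
	_ = cs.Learners       // non-voting members
	_ = cs.VotersOutgoing // the old voter set while a joint config is in flight
	_ = cs.LearnersNext   // voters being demoted; become learners on leaving the joint config
	_ = cs.AutoLeave      // whether raft leaves the joint config automatically
}
```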

@guoger (Member) commented Mar 7, 2021

Reconfig is broken with the new version of etcd/raft because it now performs a stricter check on incoming snapshots. If the snapshot was taken before the new node was added, it certainly does not contain the new node and so fails the aforementioned check. The etcd server itself handles this by injecting the actual ConfState into snapshot messages (produced by raft) before sending them out.

It shouldn't be hard for us to do the same, and I think we should upgrade the lib after this change.
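A minimal sketch of that idea, modeled on what the etcd server does (`n.confState` is a hypothetical field standing in for wherever the chain tracks its latest applied ConfState; this is not Fabric's actual code):

```go
package etcdraft

import "go.etcd.io/etcd/raft/v3/raftpb"

type node struct {
	confState raftpb.ConfState // latest applied ConfState (hypothetical field)
}

// patchOutgoing stamps the current ConfState onto any outgoing snapshot
// message before it reaches the transport, so a snapshot taken before a
// recent conf change still passes the receiver's stricter v3.5 validation.
func (n *node) patchOutgoing(msgs []raftpb.Message) []raftpb.Message {
	for i := range msgs {
		if msgs[i].Type == raftpb.MsgSnap {
			msgs[i].Snapshot.Metadata.ConfState = n.confState
		}
	}
	return msgs
}
```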

@sykesm (Contributor, Author) commented Mar 16, 2021

Thank you for taking a look. We'll have to revisit this in the future.
