
service-mirror: support gateway resolving to multiple addresses #11499

Merged
merged 2 commits into from
Nov 3, 2023

Conversation

MrFreezeex
Contributor

If the gateway address was a name resolving to multiple addresses, the service-mirror previously took only one of those addresses. This behavior could lead to downtime if the gateway behind the IP picked by the service-mirror went down. This commit fixes that behavior by considering all of the resolved IPs.

@MrFreezeex MrFreezeex requested a review from a team as a code owner October 18, 2023 09:49
@MrFreezeex MrFreezeex changed the title Resolve multiple ips service-mirror: support gateway resolving to multiple address Oct 18, 2023
@MrFreezeex MrFreezeex changed the title service-mirror: support gateway resolving to multiple address service-mirror: support gateway resolving to multiple addresses Oct 18, 2023
Member

@mateiidavid mateiidavid left a comment


Thanks for the PR @MrFreezeex, at a glance this looks good. An issue would have been nice for us to have a couple of examples and to understand the problem better. I suppose the issue is that an endpoint might go down and you get downtime while the service mirror is resolving the address again?

As far as the code is concerned, it looks good (modulo the comment I left on the sort).

@MrFreezeex
Contributor Author

MrFreezeex commented Oct 18, 2023

> Thanks for the PR @MrFreezeex, at a glance this looks good. An issue would have been nice for us to have a couple of examples and to understand the problem better. I suppose the issue is that an endpoint might go down and you get downtime while the service mirror is resolving the address again?

Thanks for the review!

Sorry for not creating the issue prior to this. We did have a small downtime, probably because of that, but I am not 100% sure. The gist of it is that if you give the gateway an address that resolves to multiple IPs, without this code change only one IP is mirrored. In our case we have an AWS load balancer that resolves to 3 IPs, for instance, and dumping the mirrored Kubernetes endpoints showed only one of those IPs.

If the gateway address was a name resolving to multiple addresses, the
service-mirror previously took only one of those addresses. This behavior
could lead to downtime if the gateway behind the IP picked by the
service-mirror went down. This commit fixes that behavior by considering all
of the resolved IPs.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@ledger.fr>
Now that we consider all the IPs resolved by the DNS, we make sure they
are sorted lexicographically so that resolveGatewayAddress returns a
stable output.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@ledger.fr>
Member

@mateiidavid mateiidavid left a comment


@MrFreezeex the change looks great! I tested it out manually by creating a headless service with a manual EndpointSlice to create DNS records (much easier than hacking CoreDNS).

:; k exec dnsutil-pod -- nslookup headless-test.default.svc.cluster.local
Server:         10.43.0.10
Address:        10.43.0.10#53

Name:   headless-test.default.svc.cluster.local
Address: 192.168.224.4
Name:   headless-test.default.svc.cluster.local
Address: 192.168.224.5
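For reference, DNS records like the ones above can come from a selectorless headless Service paired with a manual EndpointSlice. These manifests are a sketch of that setup (the exact manifests used in the test aren't shown in the thread; names, port, and IPs are illustrative):

```yaml
# A headless Service with no selector: the cluster DNS returns one A record
# per ready endpoint address instead of a single ClusterIP.
apiVersion: v1
kind: Service
metadata:
  name: headless-test
  namespace: default
spec:
  clusterIP: None
  ports:
    - port: 4143
---
# A manually managed EndpointSlice tied to the Service via the
# kubernetes.io/service-name label, supplying the endpoint IPs.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: headless-test-1
  namespace: default
  labels:
    kubernetes.io/service-name: headless-test
addressType: IPv4
ports:
  - port: 4143
endpoints:
  - addresses: ["192.168.224.4"]
    conditions:
      ready: true
  - addresses: ["192.168.224.5"]
    conditions:
      ready: true
```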

If we pass `--gateway-addresses=headless-test.default.svc.cluster.local` to the Link command, we can see the IPs being resolved in the probe service's endpoints:

[
  {
    "addresses": [
      "192.168.224.4"
    ],
    "conditions": {
      "ready": true
    }
  },
  {
    "addresses": [
      "192.168.224.5"
    ],
    "conditions": {
      "ready": true
    }
  }
]

I considered asking for an integration test. It'd be nice to have, but it would also be painful to wire up without over-abstracting this particular piece of code to take in a DNS resolver. All in all, I'm confident in the manual test. If you feel like contributing and know how to solve this easily in an integration test, we'd appreciate a follow-up! :)

Thanks again 🚢

@mateiidavid mateiidavid merged commit cef1b58 into linkerd:main Nov 3, 2023
38 checks passed
@MrFreezeex
Contributor Author

Thanks @mateiidavid 🙏. I was wondering, do you think this patch could be considered for a backport to the 2.14 branch? Happy to submit a PR with the cherry-pick if yes! :D

@mateiidavid
Member

@MrFreezeex we generally consider only bug fixes for backports; new features are expected to ship in new stable versions. We could probably argue that this is a bug fix, though :).

With that being said, we should let the change simmer for a bit in an edge release (or two). We almost exclusively release all changes to an edge version first to ensure they don't break anyone's production environment. If an edge version considered a release candidate doesn't raise any major flags, we'll cherry-pick fixes into point releases.

I know you're probably eager to see this shipped. When it makes it into an edge release (should happen this week), you could upgrade only the multicluster extension and leave the rest of the control plane on stable. We'll consider backporting this into a stable point release; if you can stay only on stable, then it might take a bit more time until you can roll out the new change. The sooner we get feedback on changes from our user base, the quicker fixes can usually be released.

I hope this all makes sense, and thank you for your patience and for your contribution! I really appreciate all of the effort you've put in. Please feel free to reply here, or in a discussion, or anywhere else if you want to continue the conversation.

@MrFreezeex
Contributor Author

MrFreezeex commented Nov 7, 2023

> @MrFreezeex we generally consider only bug fixes for backports; new features are expected to ship in new stable versions. We could probably argue that this is a bug fix, though :).
>
> With that being said, we should let the change simmer for a bit in an edge release (or two). We almost exclusively release all changes to an edge version first to ensure they don't break anyone's production environment. If an edge version considered a release candidate doesn't raise any major flags, we'll cherry-pick fixes into point releases.
>
> I hope this all makes sense, and thank you for your patience and for your contribution! I really appreciate all of the effort you've put in. Please feel free to reply here, or in a discussion, or anywhere else if you want to continue the conversation.

No worries, and thanks for the kind words. Please keep it on your radar for 2.14.4 if there have been enough edge releases in the meantime 🙏.

> I know you're probably eager to see this shipped. When it makes it into an edge release (should happen this week), you could upgrade only the multicluster extension and leave the rest of the control plane on stable. We'll consider backporting this into a stable point release; if you can stay only on stable, then it might take a bit more time until you can roll out the new change. The sooner we get feedback on changes from our user base, the quicker fixes can usually be released.

Oh indeed, thanks, that's a great idea, I will do that!

alpeb added a commit that referenced this pull request Nov 9, 2023
## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in edge-23.10.3 ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the objects tracked are falling behind the state in the
  kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  not just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])
alpeb added a commit that referenced this pull request Nov 9, 2023
## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in [edge-23.10.3] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the Kubernetes objects watched by the controller are falling behind
  the state in the kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  rather than just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])

[edge-23.10.3]: https://github.com/linkerd/linkerd2/releases/tag/edge-23.10.3
[#11546]: #11546
[#11534]: #11534
[#11499]: #11499
[#11308]: #11308
@MrFreezeex
Contributor Author

Hi @mateiidavid, I saw that there have been two new releases of linkerd since then, and the edge release has been out for a couple of weeks. Would it be possible to include this patch in the next one 🙏?

@mateiidavid
Member

mateiidavid commented Dec 5, 2023

> Hi @mateiidavid, I saw that there have been two new releases of linkerd since then, and the edge release has been out for a couple of weeks. Would it be possible to include this patch in the next one 🙏?

Hey @MrFreezeex! Wanted to let you know that I did see your message even though I didn't reply immediately. I raised it with the team :) and I think including it makes sense.

olix0r pushed a commit that referenced this pull request Dec 6, 2023
If the gateway address was a name resolving to multiple addresses, the
service-mirror previously took only one of those addresses. This behavior
could lead to downtime if the gateway behind the IP picked by the
service-mirror went down. This commit fixes that behavior by considering all
of the resolved IPs.

Now that we consider all the IPs resolved by the DNS, we make sure they
are sorted lexicographically so that resolveGatewayAddress returns a
stable output.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@ledger.fr>
olix0r added a commit that referenced this pull request Dec 6, 2023
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
@olix0r olix0r mentioned this pull request Dec 6, 2023
olix0r added a commit that referenced this pull request Dec 7, 2023
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)
* b24b0e97c stable-2.14.6
olix0r added a commit that referenced this pull request Dec 7, 2023
stable-2.14.6

* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)