Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs introduced when support for multi-port NI was added #4221

Merged
merged 4 commits into from
Sep 7, 2024

Conversation

milan-zededa
Copy link
Contributor

As always, new features also introduce new bugs. Here is a PR with patches to address all bugs identified since the support for Local NI with multiple port was merged. Actually, one of them was introduced by configurable flow-logging, which was merged around the same time...

Avoid unecessary re-creation of static NI routes

We may have some discrepancies between the intended and the current
state of NI static routes mostly caused by Linux kernel being smart
and automatically setting some route fields which we leave undefined.
From the reconciler point of view, the current vs. the intended route
config therefore differ and it will unecessarily recreate the route.
Let's therefore set all those automatically filled route fields already
in the intended config (or ignore them in the comparison function)
to avoid unecessary route recreations.

Properly sync routes of NI with multiple ports

When we are syncing the current state of the NI routing table,
we should list all the routes in the table, even those with an output
port no longer used by NI. Otherwise, if user removes port from NI
(through a change in shared labels), current state re-sync will
incorrectly omit obsolete routes and reconciler will thus not remove
them (because they already appear removed from the reconciler point
of view).

Allow ICMP and server-initiated DHCP traffic to enter device

Now that the flow logging is configurable, we no longer rely on
the connection marking for device INPUT rules. Instead we use ACCEPT/DROP
rules inside the filter table. But with this change, we forgot to add
ICMP and DHCP rule to the filter/INPUT-device chain. As a consequence,
device will for example not reply to ICMP pings...
As for DHCP, most of the traffic is initiated by client and will match
the "ctstate RELATED,ESTABLISHED" rule with the ACCEPT action, so this
is not that much affected.

Avoid returning "< nil >" or zero UUID as route gateway IP/app

IP Route Gateway IP and Gateway App can be each empty (for e.g. connected
route). But EVE should avoid publishing "< nil >" or zero UUID inside the
NetworkInstanceInfo message.

IP Route Gateway IP and Gateway App can be each empty (for e.g. connected
route). But EVE should avoid publishing "<nil>" or zero UUID inside the
NetworkInstanceInfo message.

Signed-off-by: Milan Lenco <milan@zededa.com>
@milan-zededa milan-zededa added the bug Something isn't working label Sep 6, 2024
Now that the flow logging is configurable, we no longer rely on
the connection marking for device INPUT rules. Instead we use ACCEPT/DROP
rules inside the filter table. But with this change, we forgot to add
ICMP and DHCP rule to the filter/INPUT-device chain. As a consequence,
device will for example not reply to ICMP pings...
As for DHCP, most of the traffic is initiated by client and will match
the "ctstate RELATED,ESTABLISHED" rule with the ACCEPT action, so this
is not that much affected.

Signed-off-by: Milan Lenco <milan@zededa.com>
When we are syncing the current state of the NI routing table,
we should list all the routes in the table, even those with an output
port no longer used by NI. Otherwise, if user removes port from NI
(through a change in shared labels), current state re-sync will
incorrectly omit obsolete routes and reconciler will thus not remove
them (because they already appear removed from the reconciler point
of view).

Signed-off-by: Milan Lenco <milan@zededa.com>
We may have some discrepancies between the intended and the current
state of NI static routes mostly caused by Linux kernel being smart
and automatically setting some route fields which we leave undefined.
From the reconciler point of view, the current vs. the intended route
config therefore differ and it will unecessarily recreate the route.
Let's therefore set all those automatically filled route fields already
in the intended config (or ignore them in the comparison function)
to avoid unecessary route recreations.

Signed-off-by: Milan Lenco <milan@zededa.com>
@OhmSpectator
Copy link
Member

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to believe it works =D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried... I got the idea, but absolute don't understand the details))

@@ -806,6 +807,14 @@ func (r *LinuxNIReconciler) getIntendedNIL3Cfg(niID uuid.UUID) dg.Graph {
Type: unix.RTN_UNREACHABLE,
Protocol: unix.RTPROT_STATIC,
},
OutputIf: generic.NetworkIf{
// Linux automatically adds "dev lo" to unreachable IPv6 routes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... so, this behaviour can change between the Linux versions?...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we explicitly specify the loopback interface for unreachable IPv6 route, it does not matter what the default behavior for unspecified interface is and whether it changes in future versions. The goal of one of these commits was to avoid having unspecified fields and having Linux set some default values.

@milan-zededa
Copy link
Contributor Author

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

@OhmSpectator
Copy link
Member

It's interesting. Why don't I see the running jobs here in the status? Everything is just green. But when I go to actions, I see the test suits are progressing.

@OhmSpectator
Copy link
Member

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

@milan-zededa
Copy link
Contributor Author

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

Actually most of them were found by other people in our company who tried EVE 13.0.0 and reported problems to me (as the default destination for networking issues).

@OhmSpectator
Copy link
Member

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

Actually most of them were found by other people in our company who tried EVE 13.0.0 and reported problems to me (as the default destination for networking issues).

Tricky... So, we can consider it as in-hause post-merge integration testing...

@OhmSpectator
Copy link
Member

@yash-zededa, do you have an idea why the Eden tests are skipped in this PR?...

@yash-zededa yash-zededa self-requested a review September 6, 2024 16:24
@yash-zededa
Copy link
Collaborator

@yash-zededa, do you have an idea why the Eden tests are skipped in this PR?...

The tests were triggered on my approval. Checking what happened when you had approved.

@yash-zededa
Copy link
Collaborator

@OhmSpectator the test was triggered here

@yash-zededa
Copy link
Collaborator

@OhmSpectator the tests execution is aborted by GH and it is quite weird.

Seems to be related to concurrency, checking..

image

@yash-zededa yash-zededa self-requested a review September 6, 2024 17:44
@yash-zededa
Copy link
Collaborator

Details

The concurrency is set up to run based on the PR number, so if there's an existing run for the same PR, the previous job should be canceled.

I'll keep an eye on all the executions of the lf-edge Test Suite.

@OhmSpectator
Copy link
Member

Wow, all green

@OhmSpectator OhmSpectator merged commit 38261fe into lf-edge:master Sep 7, 2024
57 of 77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants