Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failing eden tests #986

Merged
merged 1 commit into from
Jun 21, 2024
Merged

Conversation

milan-zededa
Copy link
Contributor

@milan-zededa milan-zededa commented Jun 20, 2024

Multiple patches in this PR:

  • Increased the default EVE version to the latest 12.4
  • Removed workflow dependency on Smoke test suite. With reusable workflows this does not work properly in the EVE repository. The other test suites are triggered regardless of the outcome of Smoke tests. The only effect this dependency has, is that other workflows are delayed from starting until Smoke tests finish, increasing the total execution time of all tests.
  • Moved tests that depend on virtualization (apps must be deployed inside VMs) to a separate test suite. The primary reason is that the HW-assisted nested virtualization is performing poorly on the buildjet runners and we are getting many failures (not related to EVE code). For every other test it is therefore better to run EVE with acceleration disabled. A secondary reason is that most of these tests are from the networking test suite, which is already quite long and takes well over an hour to execute. It therefore makes sense to move some tests to a different test suite. Please note that when the Virtualization test suite is run locally, all tests are passing. However, on buildjet runners they are expected to fail until we resolve the issue with the nested virtualization.
  • Fixed 'eden pod modify' (wrong timeout + app state to wait for had to be modified to reflect recent EVE changes)
  • Fixed ctrl_cert_change test (improper use of -check-new)
  • App 'nodered' needs more time to get deployed
  • Test publish_location is now skipped and likely will be removed. This is because after the recent changes in the EVE/wwan microservice, injecting a fake GPS location data is no longer possible.
  • Modified EVE-upgrade tests to use newer base EVE version (> 12.0). This is because the policy_version of fscrypt was bumped from 1 to 2 and this change is not backward compatible. Meaning we cannot downgrade from 12.X to 11.Y or older.

Please note that we are still getting some occasional kernel crashes on buildjet runners (even with accel=false), so even stable tests may fail once in a while.

Copy link
Collaborator

@uncleDecart uncleDecart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you Milan, I think we should merge this and release eden and bump eden version in EVE. For Buildjet we also should integrate this #984. And point to guys that we still have problems with virtualisation tests as we found out. And that locally they run perfectly fine. Maybe we also should try to run test suite on self-hosted runners to see the difference.

Also @christoph-zededa this PR bumps EVE version, so yours won't have to if we merge this before

@uncleDecart
Copy link
Collaborator

uncleDecart commented Jun 20, 2024

Forgot the sticker :D :shipit:

And also I see that LPC LOC tests are failing, right?

Edit: and some of the Smoke tests are still failing and they fail on rebooting. I wonder if it has to do with Buildet runners as well

Multiple patches in this PR:
* Increased the default EVE version to the latest 12.4
* Removed workflow dependency on Smoke test suite. With reusable
  workflows this does not work properly in the EVE repository.
  The other test suites are triggered regardless of the outcome
  of Smoke tests. The only effect this dependency has, is that other
  workflows are delayed from starting until Smoke tests finish,
  increasing the total execution time of all tests.
* Moved tests that depend on virtualization (apps must be deployed
  inside VMs) to a separate test suite. The primary reason is that the
  HW-assisted nested virtualization is performing poorly on the buildjet
  runners and we are getting many failures (not related to EVE code).
  For every other test it is therefore better to run EVE with
  acceleration disabled. A secondary reason is that most of these tests
  are from the networking test suite, which is already quite long and
  takes well over an hour to execute. It therefore makes sense to move
  some tests to a different test suite.
  Please note that when the Virtualization test suite is run locally,
  all tests are passing. However, on buildjet runners they are expected
  to fail until we resolve the issue with the nested virtualization.
* Fixed 'eden pod modify' (wrong timeout + app state to wait for had to
  be modified to reflect recent EVE changes)
* Fixed ctrl_cert_change test (improper use of -check-new)
* App 'nodered' needs more time to get deployed
* Test publish_location is now skipped and likely will be removed. This
  is because after the recent changes in the EVE/wwan microservice,
  injecting a fake GPS location data is no longer possible.
* Modified EVE-upgrade tests to use newer base EVE version (> 12.0).
  This is because the policy_version of fscrypt was bumped from 1 to 2
  and this change is not backward compatible. Meaning we cannot
  downgrade from 12.X to 11.Y or older.

Signed-off-by: Milan Lenco <milan@zededa.com>
@milan-zededa
Copy link
Contributor Author

Forgot the sticker :D :shipit:

And also I see that LPC LOC tests are failing, right?

Edit: and some of the Smoke tests are still failing and they fail on rebooting. I wonder if it has to do with Buildet runners as well

Yes, all of those were due to kernel crashes that we are occasionally getting.

@milan-zededa
Copy link
Contributor Author

Runners are without Internet connectivity, will try again tomorrow...

@milan-zededa
Copy link
Contributor Author

Kernel crashes are quite often.
I wonder if the "noisy neighbor" (quoting the guy from Buildjet) is actually us and these problems are more frequent when we run many eden workflows at the same time (likely scheduled on the same host(s)).

@uncleDecart
Copy link
Collaborator

Okay, anyways I'm merging this PR and telemetry one, let's try to get as much info as possible

@uncleDecart uncleDecart merged commit 6e38968 into lf-edge:master Jun 21, 2024
17 of 19 checks passed
@uncleDecart
Copy link
Collaborator

Let's try to run it on local runner, get statistics from it and compare them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants