Fix dhcpcd and virtual interface handling for native containers #4052

Merged
merged 2 commits into lf-edge:master on Jul 11, 2024

Conversation

@rene (Contributor) commented Jul 3, 2024

This PR fixes an issue like the one shown in the log below, observed while deploying native containers and in some Eden tests:

ERROR: App BM-1 uuid 98c7224d-07bb-42a0-99fd-8afcd0d0e808 state HALTED error: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/tasks/vifs/98c7224d-07bb-42a0-99fd-8afcd0d0e808.7.1/etc/resolv.conf.new" to rootfs at "/etc/resolv.conf": stat /run/tasks/vifs/98c7224d-07bb-42a0-99fd-8afcd0d0e808.7.1/etc/resolv.conf.new: no such file or directory: unknown

Two commits are provided to solve the issue (descriptions below):

Implement retry mechanism for veth.sh

When deploying native containers, during the setup of the virtual network interfaces, the bridge device might not be ready, leading to errors like the following:

"brctl: bridge bn1: Resource busy - Cannot find device nbu1x1.1"

This commit provides a workaround by implementing a retry mechanism, so in case of an error the script retries the operation after 5 seconds, at most 3 times, before failing.
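
For illustration, a minimal sketch of such a retry wrapper (the helper name and the example brctl invocation are assumptions, not the actual veth.sh code; the real loop is excerpted further down in the review):

    #!/bin/sh
    # Run the given command, retrying up to 3 times with a 5 second pause
    # between attempts; give up with a non-zero exit code if it never succeeds.
    retry() {
        retries=3
        while [ "$retries" -gt 0 ]; do
            if "$@"; then
                return 0
            fi
            retries=$(( retries - 1 ))
            sleep 5
        done
        echo "retry: giving up on: $*" >&2
        return 1
    }

    # Example: attaching the virtual interface to the bridge, which can
    # transiently fail with "Resource busy" right after the bridge is created.
    retry brctl addif bn1 nbu1x1.1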

Do not remove directory for dhcpcd

The dhcpcd.sh script creates the /run/task/vifs/<APP_UUID>/ directory on an up command (configured as a Prestart hook in the container OCI interface) and removes it entirely on a down command (configured as Poststop). However, the etc/resolv.conf.new file is created by pillar and should be mounted inside the container. If any issue arises while setting up the bridge + virtual network interface during container initialization, pillar will retry starting the container, but at that point etc/resolv.conf.new will no longer be available, since it was removed along with the rest of the directory contents created by the dhcpcd.sh script.

This commit solves the issue by not removing the entire directory, but only the resolv.conf file that is created during the setup. Pillar already handles the creation and removal of this directory, so no changes are required on its side and the directory will be removed when it is no longer needed.
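
For illustration, a minimal sketch of the changed "down" cleanup (the exact path layout and variable name are assumptions based on the description above, not the actual dhcpcd.sh code):

    # Previously the whole per-app directory was removed on "down",
    # roughly: rm -rf "/run/task/vifs/${APP_UUID}", which also deleted the
    # etc/resolv.conf.new file that pillar creates and mounts into the container.
    # Now only the resolv.conf generated by this script during setup is removed;
    # pillar creates and removes the directory itself once the app is gone.
    rm -f "/run/task/vifs/${APP_UUID}/etc/resolv.conf"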

cc: @milan-zededa

@rene requested a review from eriknordmark as a code owner on July 3, 2024 13:17
@rene requested a review from milan-zededa on July 3, 2024 13:17
@rene marked this pull request as draft on July 5, 2024 12:03
@rene force-pushed the fix-dhcpcd-native-containers branch from 2e7cfaf to 92ec5af on July 5, 2024 12:37
@rene marked this pull request as ready for review on July 5, 2024 12:40
@rene (Contributor, Author) commented Jul 5, 2024

Updates in this PR:

  • Rebased on top of master
  • Increased the timeout from 3s to 5s
  • Slightly changed the fix: we can keep everything inside the same dhcpcd folder (as before). Since pillar already handles the removal of the folder, we just need to remove the created file in the dhcpcd.sh script instead of the whole folder. This solution was tested with QEMU (where the issue was observed) and it works.

@deitch (Contributor) commented Jul 7, 2024

This commit provides a workaround by implementing a retry mechanism, so in case of an error the script retries the operation after 5 seconds, at most 3 times, before failing.

So it still would fail after 3 retries, but this gives it a little more time to get there? Is there any way we can do this with a "wait for it", i.e. check the status of the underlying resource, rather than arbitrary retries?

Then again, I guess that doesn't buy us anything. If in the end we decide that we will wait up to 15 seconds, then either way we will wait 15 seconds and then fail. So then what you have here is just as good. 👍

by simply not removing the entire directory

Is there a circumstance where we would want to remove it? I would think in a normal container removal we want it gone? Or is it that we are conflating container down with container delete?

@milan-zededa (Contributor) commented:

So it still would fail after 3 retries, but this gives it a little more time to get there? Is there any way we can do this with a "wait for it", i.e. check the status of the underlying resource, rather than arbitrary retries?

@deitch We have the same issue in pillar as well, for configuring VLANs on a bridge. For some reason, the bridge is not immediately available after being created (some async ops apparently continue) and returns EBUSY if we try to use it too soon.
I couldn't find anything in the netlink API that would tell us whether the bridge is ready, other than trying to attach an interface to it or configuring a VLAN.

while [ "$RETRIES" -gt 0 ]; do
if ! "$@"; then
RETRIES="$(( RETRIES - 1 ))"
sleep 5
Copy link
Contributor

@milan-zededa milan-zededa Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 5 seconds is a bit too long. What about retrying every second, 15 times, to avoid delaying the boot of a native container too much?

@rene (Contributor, Author) replied:

Yeah, makes sense... I will change it...

@rene (Contributor, Author) commented Jul 8, 2024

So it still would fail after 3 retries, but this gives it a little more time to get there? Is there any way we can do this with a "wait for it", i.e. check the status of the underlying resource, rather than arbitrary retries?

@deitch We have the same issue in pillar as well, for configuring VLANs on a bridge. For some reason, the bridge is not immediately available after being created (some async ops apparently continue) and returns EBUSY if we try to use it too soon. I couldn't find anything in the netlink API that would tell us whether the bridge is ready, other than trying to attach an interface to it or configuring a VLAN.

@deitch, answering your second question: the removal of the directory is already done by pillar, so we don't need to do it from the dhcpcd.sh script. Once the container is gone, pillar will take care of removing the whole vifs directory...

@rene force-pushed the fix-dhcpcd-native-containers branch 2 times, most recently from 316ce4a to 8538a43 on July 8, 2024 10:03
@rene (Contributor, Author) commented Jul 8, 2024

Updates in this PR:

  • Rebased on top of master
  • Addressed reviewers' comments

@eriknordmark (Contributor) left a comment:

LGTM

@milan-zededa (Contributor) commented:

@europaul We finally caught that OCI error:

eclient2	lfedge/eden-eclient:b1c1de6	292b6beb-1c1a-4d76-a330-87f0773cef83	-		-		0 B/0 B		IN_CONFIG	INSTALLED: [description:"setting up OCI spec for domain 292b6beb-1c1a-4d76-a330-87f0773cef83.1.2 failed unexpected end of JSON input"  timestamp:{seconds:1720687417  nanos:909445634}  severity:SEVERITY_ERROR]

Here should be your extra logs: https://github.com/lf-edge/eve/actions/runs/9864092485/artifacts/1690177256

rene added 2 commits July 11, 2024 12:25
Implement retry mechanism for veth.sh

When deploying native containers, during the setup of the virtual
network interfaces, the bridge device might not be ready, leading to
errors like the following:

"brctl: bridge bn1: Resource busy - Cannot find device nbu1x1.1"

This commit provides a workaround by implementing a retry mechanism, so
in case of an error the script retries the operation after 5 seconds,
at most 3 times, before failing.

Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
Do not remove directory for dhcpcd

The dhcpcd.sh script creates the /run/task/vifs/<APP_UUID>/ directory on
an up command (configured as a Prestart hook in the container OCI
interface) and removes it entirely on a down command (configured as
Poststop). However, the etc/resolv.conf.new file is created by pillar and
should be mounted inside the container. If any issue arises while setting
up the bridge + virtual network interface during container initialization,
pillar will retry starting the container, but at that point
etc/resolv.conf.new will no longer be available, since it was removed
along with the rest of the directory contents created by the dhcpcd.sh
script.

This commit solves the issue by not removing the entire directory, but
only the resolv.conf file that is created during the setup. Pillar
already handles the creation and removal of this directory, so no changes
are required on its side and the directory will be removed when it is no
longer needed.

Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
@rene force-pushed the fix-dhcpcd-native-containers branch from 8538a43 to 2a0a655 on July 11, 2024 10:25
@rene (Contributor, Author) commented Jul 11, 2024

Updates in this PR:

  • Rebased onto master

@milan-zededa merged commit bafe5b7 into lf-edge:master on Jul 11, 2024
19 checks passed
@europaul (Contributor) commented:

@milan-zededa I got the bug narrowed down to an empty image-config.json file in the volume root directory, but I still don't know why it becomes empty. In the current case it happens after a reboot: before the reboot the same app runs fine, meaning the config file should contain information. I added more logs in #4088; I don't know if they are too much for the prod version, however.

@milan-zededa (Contributor) commented:

@milan-zededa I got the bug narrowed down to an empty image-config.json file in the volume root directory, but I still don't know why it becomes empty. In the current case it happens after a reboot: before the reboot the same app runs fine, meaning the config file should contain information. I added more logs in #4088; I don't know if they are too much for the prod version, however.

Is this config file persisted or recreated after reboot?

@europaul (Contributor) commented:

Is this config file persisted or recreated after reboot?

It should be persistent, and touched/created only when the container volume is created.

@europaul (Contributor) commented:

@rene observed the same error when testing with native containers locally. In his case it wasn't after a reboot, but during the initial deployment, if I understand correctly.

@milan-zededa (Contributor) commented:

Is this config file persisted or recreated after reboot?

It should be persistent, and touched/created only when the container volume is created.

Ah, OK. I was wondering if we are missing a sync call after writing into the config file and before publishing the volume info stating that it is ready (i.e. a race between volumemgr and domainmgr).

@europaul (Contributor) commented:

@milan-zededa we sync the directory after the file is written and before we publish anything:

    if err := utils.DirSync(fileLocation); err != nil {
