Fix dhcpcd and virtual interface handling for native containers #4052
2e7cfaf
to
92ec5af
Compare
Updates in this PR:
|
So it would still fail after 3 retries, but this gives it a little more time to get there? Is there any way we can do this with a "wait for it", i.e. check the status of the underlying resource, rather than arbitrary retries? Then again, I guess that doesn't buy us anything. If in the end we decide that we will wait up to 15 seconds, then either way we will wait 15 seconds and then fail. So what you have here is just as good. 👍
Is there a circumstance where we would want to remove it? I would think that in a normal container removal we want it gone. Or are we conflating container down with container delete?
@deitch We have the same in pillar for configuring VLANs on a bridge. For some reason, a bridge is not immediately available after being created (apparently some async operations continue) and returns EBUSY if we try to use it too soon.
while [ "$RETRIES" -gt 0 ]; do
    if ! "$@"; then
        RETRIES="$(( RETRIES - 1 ))"
        sleep 5
I think 5 seconds is a bit too long. How about retrying once per second, up to 15 times, to avoid delaying the boot of a native container too much?
By the way, in pillar we retry with a 1-second period: https://github.com/lf-edge/eve/blob/master/pkg/pillar/nireconciler/linuxitems/vlanbridge.go#L133-L169
Yeah, makes sense... I will change it...
@deitch, to answer your second question: the removal of the directory is already done by pillar, so we don't need to do it from the dhcpcd.sh script. Once the container is gone, pillar will take care of removing the whole vifs directory.
316ce4a
to
8538a43
Compare
Updates in this PR:
|
LGTM
@europaul We finally caught that OCI error:
Your extra logs should be here: https://github.com/lf-edge/eve/actions/runs/9864092485/artifacts/1690177256
When deploying native containers, during the setup of the virtual network interfaces, the bridge device might not be ready, leading to errors like the following: "brctl: bridge bn1: Resource busy - Cannot find device nbu1x1.1" This commit provides a workaround by implementing a retry mechanism: in case of error, the script retries the operation after 5 seconds, at most 3 times, before failing.
Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
The dhcpcd.sh script creates the /run/task/vifs/<APP_UUID>/ directories on an up command (configured as a Prestart hook in the container OCI interface) and removes them entirely on a down command (configured as Poststop). However, the etc/resolv.conf.new file is created by pillar and should be mounted inside the container. If any issue arises while setting up the bridge and virtual network interface during container initialization, pillar will retry starting the container, but at that point etc/resolv.conf.new will no longer be available, since it was removed along with the other directories created by the dhcpcd.sh script. This commit solves the issue by not removing the entire directory, but only the resolv.conf file that is created during the setup. Pillar already handles the creation and removal of this directory, so no changes are required on its side, and the directory will be removed when it is no longer needed.
Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
8538a43
to
2a0a655
Compare
Updates in this PR:
|
@milan-zededa I got the bug narrowed down to an empty |
Is this config file persisted or recreated after reboot?
It should be persistent and touched / created only when the container volume is created.
@rene observed the same error when testing with native containers locally. In his case it wasn't after a reboot, but during the initial deployment, if I understand correctly.
Ah, OK. I was wondering whether we are missing a sync call after writing into the config file and before publishing the volume info stating that it is ready (i.e. a race between volumemgr and domainmgr).
@milan-zededa we sync the directory after the file was written and before we publish anything
|
This PR solves an issue like the log below, observed while deploying native containers and in some Eden tests:
Two commits are provided to solve the issue (descriptions below):
Implement retry mechanism for veth.sh
When deploying native containers, during the setup of the virtual network interfaces, the bridge device might not be ready, leading to errors like the following:
This commit provides a workaround by implementing a retry mechanism: in case of error, the script retries the operation after 5 seconds, at most 3 times, before failing.
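The retry behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual veth.sh code: the function name `retry` and the variables `RETRIES` and `RETRY_DELAY` are hypothetical, and the defaults mirror the "3 times, 5 seconds" wording of the commit message (the review discussion suggested 15 attempts with a 1-second delay instead).

```shell
#!/bin/sh
# Hypothetical helper, not taken from veth.sh: run a command until it
# succeeds or the attempt budget is exhausted.
# RETRIES     - maximum number of attempts (assumed default: 3)
# RETRY_DELAY - seconds to sleep between attempts (assumed default: 5)
retry() {
    retries="${RETRIES:-3}"
    delay="${RETRY_DELAY:-5}"
    while [ "$retries" -gt 0 ]; do
        # Stop as soon as the wrapped command succeeds.
        if "$@"; then
            return 0
        fi
        retries="$(( retries - 1 ))"
        # Sleep only when another attempt will follow.
        if [ "$retries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    # Every attempt failed.
    return 1
}
```

With `RETRIES=15` and `RETRY_DELAY=1` set beforehand, a call such as `retry brctl addif bn1 nbu1x1.1` would match the timing proposed in the review; the bridge and interface names here are just the ones from the error message above. Either way the total wait is bounded, so a bridge that never becomes ready still results in a failure.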
Do not remove directory for dhcpcd
The dhcpcd.sh script creates the /run/task/vifs/<APP_UUID>/ directories on an up command (configured as a Prestart hook in the container OCI interface) and removes them entirely on a down command (configured as Poststop). However, the etc/resolv.conf.new file is created by pillar and should be mounted inside the container. If any issue arises while setting up the bridge and virtual network interface during container initialization, pillar will retry starting the container, but at that point etc/resolv.conf.new will no longer be available, since it was removed along with the other directories created by the dhcpcd.sh script.
This commit solves the issue by not removing the entire directory, but only the resolv.conf file that is created during the setup. Pillar already handles the creation and removal of this directory, so no changes are required on its side, and the directory will be removed when it is no longer needed.
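The change to the down path could look something like the sketch below. It is an assumption-laden illustration, not the real dhcpcd.sh: the function name `vif_down`, the `VIFS_ROOT` variable, and the exact location `etc/resolv.conf` under the per-application directory are all hypothetical, inferred from the paths mentioned in the description.

```shell
#!/bin/sh
# Assumed layout (hypothetical): ${VIFS_ROOT}/<APP_UUID>/etc/resolv.conf
VIFS_ROOT="${VIFS_ROOT:-/run/task/vifs}"

# Hypothetical "down" handler: takes the application UUID as $1.
vif_down() {
    app_uuid="${1:-}"
    # Refuse to act without a UUID rather than touch the root directory.
    [ -n "$app_uuid" ] || return 1
    app_dir="${VIFS_ROOT}/${app_uuid}"
    # Remove only the file generated during setup. The directory itself
    # and etc/resolv.conf.new are left in place for pillar to manage.
    rm -f "${app_dir}/etc/resolv.conf"
}
```

Because only the generated file is deleted, a container restart attempted by pillar after a failed setup would still find etc/resolv.conf.new where it expects it.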
cc: @milan-zededa