start user apps from image instead of rootfs #4304
base: master
Conversation
pkg/pillar/containerd/containerd.go
```go
if err != nil {
	return fmt.Errorf("failed to write tarball to content store: %w", err)
}
fmt.Printf("Layer %s written to content store\n", layerDesc.Digest)
```
Suggested change:
```diff
- fmt.Printf("Layer %s written to content store\n", layerDesc.Digest)
+ log.Printf("Layer %s written to content store\n", layerDesc.Digest)
```
Two requests. First, can you describe the issue and how it presented itself?

See the linked issue #4303. Here is the gist:

Some follow-up questions:
Basically, yes, and we don't re-create the image if it already exists in the image store.

Yes, but I'm not sure about "normal APIs": once the image is created it can be used like any normal OCI image to create containers.

Thanks for the explanation. This makes a lot more sense now. I have some questions, some more nitpicky, some more important. Overall, the approach looks quite sane (and a lot better than what we do now). I will try to put them into a review.
Overall, I think the approach is correct.
I have a few areas of questions:
- it looks like we may be recreating existing functionality in oci.go/containerd.go. If it really is new, by all means; if not, let's reuse what we have.
- a few areas where constants would help
- can we avoid pillar having explicit knowledge of specific paths, perhaps by mounting them in?

Ideally, it would be great to avoid having to do any of this at all. There are volumes in rootfs.yml, but even those end up pulling down the image filesystem and expanding it, not saving you anything. It isn't that the approach is wrong, it is that it is complex, even if this is likely the best we can do.
pkg/pillar/containerd/containerd.go
```go
ocispec "github.com/opencontainers/image-spec/specs-go/v1"
spec "github.com/opencontainers/image-spec/specs-go/v1"
imagespecs "github.com/opencontainers/image-spec/specs-go"
ocispecs "github.com/opencontainers/image-spec/specs-go/v1"
```
This is kind of weird, since both are image spec, but one is v1, i.e. versioned. The runtime spec is also an OCI spec. I guess there is not much to do to make this better, and this is not the fault of the PR, but of how specs-go/v1.Manifest references specs-go.Versioned. It would be nice if we could get rid of the first one; then the second would just be imagespecs, and it would be clean. But I don't see an easy way. If I do, I will comment below.
pkg/pillar/containerd/containerd.go
```go
@@ -246,6 +250,317 @@ func (client *Client) CtrDeleteBlob(ctx context.Context, blobHash string) error
	return client.contentStore.Delete(ctx, digest.Digest(blobHash))
}

// CtrCreateImageFromRootfs creates an OCI image from an existing rootfs directory
```
I am bothered that you have to do this. I don't think you did it incorrectly, but rather that you have to do it at all.
The full build process looks like this:
- At build time, make rootfs pulls down the image xen-tools.
- It then lays it out on the filesystem in /containers/services/..., including the contents of the layers, and creates the correct config.json, etc.
- You then create a tarball from the filesystem - which was originally a tar in the first place - and recreate your own config manifest (which was originally there), just to load it back into containerd.

Going all the way back to where you started. It would be great if you didn't have to go through all of that pain.
I'm 💯 with you on this
pkg/pillar/containerd/containerd.go
```go
}

// Create a tarball from the rootfs directory
tarPath := "/tmp/rootfs.tar"
```
Does this have to be a file? Can you just stream it from the walk of the file path straight into containerd, as opposed to creating a tar stream to a file and then opening a stream from that file to containerd? Eliminate the middleman? Or does the need for the digest get in the way, i.e. content.WithDescriptor?
And /tmp/rootfs.tar will be on overlayfs, i.e., this will use up memory.
I have been thinking about this. You need to know the digest when you first open the channel to push it to containerd, but you need to have the whole thing ready to get the hash. I was going to suggest doing it twice: first stream it to /dev/null while hashing to get the digest, then stream it to containerd with the result of the first pass. But tar files are only somewhat reliably reproducible. I have done it, but do we want to depend on that? You probably could, if you control the dates tightly. If that is a path we want to follow, let me know; I can find the tar code where I did it.
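The two-pass idea above can be sketched with the standard library alone: stream the tar through a hasher into /dev/null first, then regenerate the identical stream for the actual upload using the digest from pass one. The names here (writeRootfsTar, tarDigest) are illustrative stand-ins, not code from the PR; the point is that reproducibility depends on pinning header metadata such as ModTime, as discussed.

```go
package main

import (
	"archive/tar"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"time"
)

// writeRootfsTar is a stand-in for walking the rootfs. It must produce
// byte-identical output on every call for the two-pass scheme to work,
// so all header fields (notably ModTime) are pinned.
func writeRootfsTar(w io.Writer) error {
	tw := tar.NewWriter(w)
	data := []byte("nameserver 8.8.8.8\n")
	if err := tw.WriteHeader(&tar.Header{
		Name:    "etc/resolv.conf",
		Mode:    0o644,
		Size:    int64(len(data)),
		ModTime: time.Unix(0, 0), // fixed date => reproducible archive
	}); err != nil {
		return err
	}
	if _, err := tw.Write(data); err != nil {
		return err
	}
	return tw.Close()
}

// tarDigest streams the tar through sha256 without materializing a file
// on disk (the "stream to /dev/null while hashing" pass).
func tarDigest() (string, error) {
	h := sha256.New()
	if err := writeRootfsTar(io.MultiWriter(h, io.Discard)); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Pass 1: compute the digest needed for the OCI descriptor.
	d1, err := tarDigest()
	if err != nil {
		panic(err)
	}
	// Pass 2 would stream writeRootfsTar straight into the content store
	// using that digest; here we just confirm the stream is reproducible.
	d2, _ := tarDigest()
	fmt.Println(d1 == d2, d1)
}
```

This avoids the /tmp/rootfs.tar file (and the overlayfs memory cost noted above) at the price of walking the rootfs twice.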
It still bothers me: we built this xen-tools rootfs from an existing image when we built the rootfs. Sure, it was laid out on the filesystem because it was useful there and not in its blobs. Now we are trying to recreate it, and doing all sorts of fun things with it? I kind of want us to be able to just embed it in the file to start with. Not actually this way, but the equivalent of:
- docker save xen-tools > /rootfs/images/xen-tools.tar
- In pillar, bind-mount /rootfs/images:/images
- Then we can just use /images/xen-tools.tar

The issue is size. This now becomes a permanent part of the image, another 26MB. That seems a waste.
pkg/pillar/containerd/containerd.go
```go
	Layers: []ocispecs.Descriptor{layerDesc},
}

manifestLabels := map[string]string{
```
Do we have to do these? Doesn't containerd automatically add the garbage-collection labels?
Also, where we do have to do this, don't we already have functions like CtrCreateImage, CtrLoadBlob, etc. in this file? Why recreate it all?
> Do we have to do these? Doesn't containerd automatically add the garbage-collection labels?

Usually yes, but I guess in this case it doesn't.

> Also, where we have to do this, don't we already have functions like CtrCreateImage, CtrLoadBlob, etc. in this file? Why recreate it all?

Thank you for the hint! I overlooked the blob-related functions - I'll have a second look.
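For context on the labels being discussed: containerd's garbage collector only keeps a content blob if something references it, and a manifest written directly to the content store (rather than pulled through the image service) can declare those references itself via containerd.io/gc.ref.content.* labels. A minimal sketch of building such a map, with a hypothetical helper name and placeholder digests:

```go
package main

import "fmt"

// gcRefLabels builds the label map containerd's GC follows, so the config
// and layer blobs referenced by a manually written manifest are not
// collected. Keys follow containerd's gc.ref.content naming convention;
// the numeric suffixes only need to be unique, their order is arbitrary.
func gcRefLabels(configDigest string, layerDigests []string) map[string]string {
	labels := map[string]string{
		"containerd.io/gc.ref.content.0": configDigest, // index 0: config blob
	}
	for i, d := range layerDigests {
		labels[fmt.Sprintf("containerd.io/gc.ref.content.%d", i+1)] = d
	}
	return labels
}

func main() {
	labels := gcRefLabels("sha256:aaa", []string{"sha256:bbb"})
	fmt.Println(len(labels), labels["containerd.io/gc.ref.content.1"])
}
```

When the image is registered through containerd's normal image APIs the image service maintains these references for you, which is why the question above about whether the manual labels are needed at all is a fair one.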
pkg/pillar/containerd/containerd.go
```go
@@ -338,6 +663,63 @@ func (client *Client) CtrMountSnapshot(ctx context.Context, snapshotID, targetPa
	return mounts[0].Mount(targetPath)
}

// CtrCreateSnapshotFromExistingRootfs creates a snapshot from an existing rootfs
func (client *Client) CtrCreateSnapshotFromExistingRootfs(ctx context.Context, snapshotName, rootfsPath string) error {
```
Is this just the equivalent of your new CreateImageFromRootFS followed by the existing CreateSnapshotForImage? Do we need to recreate the logic here?
We already have all of the handling of cases and snapshots and garbage collection inside the existing functions. Does this add anything?
Creating a snapshot is much simpler than creating an image - you only package the files from rootfs, without worrying about the config or manifest. The logic here is different from creating the image, however: here we don't create any tarballs, but simply create an empty snapshot, copy the files there, and save the snapshot with the files.
What is the purpose of that? Usually, when containerd creates a snapshot, it is based on existing filesystems, essentially layers from images. Does this bypass all of that? I had thought this process was:
- Load rootfs as an image into containerd
- Use containerd to create the container (which would handle all of the snapshots and all of that)

Where is this needed?
This function is not used in the current implementation. I based my first draft on it, which just created a snapshot from rootfs instead of the whole xen-tools image and then reused this snapshot to create user apps. I decided to keep this function in case we might need it in the future.
That definitely explains why I couldn't figure out its purpose. 😄
Can we remove it? I don't want others getting confused.
Sure :) I just wanted to save it because it's a working approach and it took me a lot of time to develop. And it might be useful in the future.
That does make sense. Then can you explain a bit more about when it is useful? Snapshots from a rootfs don't make sense to me (in the context of containerd). It is usually "load an image" and then "create snapshots from that image". Those functions (I think) are elsewhere, so is this something more than just a wrapper around those two?
Well, in our particular scenario we don't really need a full image, we just need the files from rootfs, since we create the OCI config ourselves. So actually a snapshot works well, because a snapshot from the overlayfs snapshotter is just an overlayfs mount of the image, where the image stays read-only and the changes only exist in the upper directory.
So it goes like this:
- create (prepare) a new empty snapshot called xen-tools, unless it already exists (same as with the image)
- copy the files from the xen-tools rootfs on disk into that snapshot
- commit that snapshot and give it a label to prevent garbage collection
- use that snapshot as a base to create (prepare) new snapshots for each user app container
- those new snapshots will be managed by containerd internally, so we don't have to take care of them
pkg/pillar/containerd/oci.go
```go
ctrdCtx, done := s.client.CtrNewUserServicesCtx()
defer done()

image, err := s.client.CtrGetImage(ctrdCtx, "xen-tools:local")
```
So xen-tools:local is the internal reference name we are using for the image? And if not found, we will load from the filesystem and name it that? Can we save this as a constant?
yes to everything 😄
pkg/pillar/hypervisor/containerd.go
```go
	return nil
}
// Create a xen-tools image if it does not exist
err = ctx.ctrdClient.CtrCreateImageFromRootfs(ctrdUserCtx, "xen-tools:local", "/containers/services/xen-tools/rootfs")
```
Here is the name again; we should save it as a constant. I also think the name should be more specific, like lfedge/eve-xen-tools:local, just so it is clear it is something of ours.
If I understand correctly, this is the one place where we find the filesystem. In other words, this is where pillar becomes aware of the structure of the underlying OS. It might be elsewhere too, but I think I found only one other spot.
Can we avoid this? Can we mount it into pillar at the rootfs.yml level and thus avoid this? E.g. in pillar:
```go
const xenToolsRoot = "/xen-tools/"
// and later
err = ctx.ctrdClient.CtrCreateImageFromRootfs(ctrdUserCtx, xenToolsImage, xenToolsRoot)
```
and then in rootfs.yml.in:
```yaml
- name: pillar
  image: PILLAR_TAG
  cgroupsPath: /eve/services/pillar
  oomScoreAdj: -999
  binds.add:
    - /containers/services/xen-tools/rootfs:/xen-tools:ro
```
It has the same effect, except that pillar no longer knows the name of the container or its on-root path (no "breaking out"), and the place that does know it - rootfs.yml - handles mapping it in.
Even better might be if we had an onboot container that could do this, rather than burdening pillar even further. But I think that would be too fundamental a change for this.
I have a general question regarding what problem we are trying to solve. I put it here: #4303 (comment)
I think this doc should be updated when the PR is in its final form.
From what I understand, we now create a new "base container" (xen-tools) for each newly deployed application. And it's done to avoid collisions in the mount paths that are used for different applications started from this container. And to create the per-app "base container", we first create an image and load it into our CAS, so it can be used on app deployment.
If that's right, my questions would be:
- Could we resolve the collisions inside the container by just fixing the paths of the mounted volumes?
- If not, and we need to create a new base container for each app, and for that we need the base image: could we create it at build time, not at runtime?
That is partially because mounts are ugly, and containerd manages it all for you. And partially because we hacked around lots of it in EVE. Somewhere on my list is a desire to clean up a lot of this and standardize it. Somewhere.

What happens when we deploy a new version of EVE? Do we discard and recreate this image based on the files in the new version?
pkg/pillar/hypervisor/containerd.go
```go
imageExists, err := ctx.ctrdClient.CtrImageExists(ctrdUserCtx, "xen-tools:local")
if err != nil {
	return fmt.Errorf("couldn't check if image xen-tools exists: %v.", err)
}
if imageExists {
```
Does this image include the shim-VM linux kernel in addition to the xen-tools user-space?
Then there is an interesting question about EVE updates (and also about how this works today).
If a device is running EVE version X, and you deploy container A, it gets the Linux kernel from version X.
Then you update EVE to version Y. Do you expect container A to restart with the new Linux kernel from EVE version Y or not? (and I don't know what happens today for sure - we already have the shim-VM rootfs in /persist/vault/volumes but what about the kernel?)
Then you deploy container B. Do you expect it to get the version X or Y kernel?
In addition to the kernel, I assume we want to be able to deploy fixes (e.g., to the dhcp and mount scripts) in the shim VM and have those be used by existing and future deployed containers.
Doesn't the shim-VM use the EVE kernel? I had thought it just mounts in the kernel from EVE itself?
I don't know about the kernel.
But user-space bug fixes matter as well.
> Doesn't the shim-VM use the EVE kernel?

The kernel that runs the container is exactly the same as the host's. A container can just have some file overlays plus limits on access to some resources. The runtime is shared with the host in any case, so there is no risk of a kernel version mismatch. So the only similar possible problem I could imagine would be the Linux modules that we can "repopulate" into the image, as we mount them here (eve/pkg/pillar/containerd/oci.go, line 186 in 019f189):
```go
if err := os.MkdirAll(filepath.Join(volumeRoot, "modules"), 0600); err != nil {
```
Quite true.
To some degree, I don't think we have asked these questions in a while. In the immediate sense, @europaul 's PR does not change how anything works now. Same filesystem, same kernel, same everything. It just sets up the filesystem path correctly, rather than sharing (and trouncing) like it does now. And relies on containerd to do it correctly. So I think we should not hold this up for that.
We probably should have a proper design discussion around the issues you have raised. What are the behaviours we are trying to achieve? What are potential challenges? How should we restructure it?
> If I'm correct, we recreate the "base" image on each device reboot now

Not sure; Paul would know. If so, that would resolve the issue Erik raised in the previous comment (if I understood him).

> (which I find odd, by the way, as we could do it in build time).

That would be even nicer. If we can find a way...
> Doesn't it change the behavior of container B

Not really. All of this is using xen-tools. Whether we use the one mounted from the root filesystem, or whether we copy it to an image and then mount that, we are still using whatever is on the rootfs.
- Current: we are using xen-tools:1, it mounts the directory from running xen-tools, launches container B. We upgrade EVE, including xen-tools, which includes xen-tools:2. It mounts the directory from running xen-tools (clueless as to what version it is) and launches B. B just got a newer version of xen-tools.
- Proposed: we are using xen-tools:1, it creates an image from xen-tools, launches container B. We upgrade EVE, including xen-tools, which includes xen-tools:2. It creates an image from xen-tools (clueless as to what version it is), and launches B. B just got a newer version of xen-tools.
The case I can see (maybe this is what you were referring to, Erik?) is if, after the upgrade, it looks in containerd and says, "oh, wait I already have xen-tools, so I won't bother creating it from the filesystem." Then container B is stuck on xen-tools:1, because it was saved in containerd.
Is that what you were asking about? I am not sure if it works this way or the way I described in "Proposed" above. @europaul can you explain?
@deitch, yes that is what I'm asking about. AFAICT the new code checks if the image exists in containerd and if not it will never create a new version of that image. Sounds like this is a change.
FWIW we could name that xen-tools image based on the eve-version string to make this behave the same as today (if we understand the current behavior per above.) But that would require having code to delete the unreferenced old ones from containerd.
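The version-pinned naming idea above is simple to sketch. The repository name, helper names, and version strings below are illustrative, not from the PR; as noted, refs left behind by previous EVE versions would still need explicit cleanup from containerd.

```go
package main

import (
	"fmt"
	"strings"
)

// xenToolsRef pins the image name to the running EVE version, so an
// upgraded rootfs is never shadowed by a stale image already present
// in containerd's image store.
func xenToolsRef(eveVersion string) string {
	return "lfedge/eve-xen-tools:" + eveVersion
}

// staleXenToolsRefs picks out refs left behind by earlier EVE versions;
// the caller would delete these from containerd to avoid accumulating
// one unused image per past upgrade.
func staleXenToolsRefs(existing []string, eveVersion string) []string {
	current := xenToolsRef(eveVersion)
	var stale []string
	for _, ref := range existing {
		if strings.HasPrefix(ref, "lfedge/eve-xen-tools:") && ref != current {
			stale = append(stale, ref)
		}
	}
	return stale
}

func main() {
	refs := []string{"lfedge/eve-xen-tools:11.4.0", "lfedge/eve-xen-tools:12.0.0"}
	fmt.Println(xenToolsRef("12.0.0"), staleXenToolsRefs(refs, "12.0.0"))
}
```

With this scheme, "image exists" checks naturally fail after an upgrade (the ref changed), reproducing today's behavior without relying on the content store being wiped.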
Well, not bad. I had to walk through the whole path to get it, but I did in the end. 😄
@europaul can you address his issue? It is a valid one. We want to resolve this issue you have raised, but without changing behaviour.
First of all, thank you for all the comments, folks - I really appreciate it! I'm going to start with this conversation since it has the most fundamental questions.
The xen-tools:local image is supposed to be created on pillar startup, unless it already exists in the containerd image store. So it's not recreated after a simple reboot.
As far as my understanding goes, when EVE is upgraded the partition B is clean - like on the first ever start of EVE - so there will be no images in the containerd image store. That means that the xen-tools:local image will be recreated from the rootfs of the new version of xen-tools.
It would be great if we could put a xen-tools image into containerd's image store at build time, but AFAIU @deitch said that would require changes on the linuxkit front (which I'm happy to make if need be).
> It would be great if we could put a xen-tools image into the containerd's image store at build time, but afaiu @deitch said that would require changes on the linuxkit front (which I'm happy to make if needs be)

I have been exploring it a bit. I will let you know.
@OhmSpectator
@deitch @eriknordmark I think I might have a better solution: create the user app container from an empty snapshot instead of from xen-tools, and just bind-mount xen-tools from
I'll try to implement a PoC now
@europaul that is quite elegant. Can you add the rootfs.yml part we discussed? So that only that file knows the path, while pillar itself just sees a local mount point?
And be sure to mount them read-only into each container, so there is no chance of mistakes.
Force-pushed from 7bf7c19 to 3421c9c
Two questions, @europaul. First, do you mind writing up how "AddLoader" works in this new method? The old way was, essentially, to modify the manifest as it changed the entrypoint (hence, "Add Loader"). And should it have a new name, or does it still reflect it correctly? Second, can we do what was discussed above, i.e. having the volume bind-mounted in via
Add .gitignore file for pkg/pillar directory and include test artifacts and profiling results. Signed-off-by: Paul Gaiduk <paulg@zededa.com>
Force-pushed from 3421c9c to bc87704
@deitch even the old AddLoader() was doing more than that: it was also creating the necessary files for the container workloads such as
Of course, ideally AddLoader would just massage the OCI spec so that it fits with the loader and do nothing else, and ideally all functions related to
Starting user apps containers by bind mounting xen-tools rootfs doesn't provide sufficient isolation when mounting other files on top of it. Example: when mounting other files on top of xen-tools rootfs, containerd will create stubs for those files in the original rootfs even though it's mounted read-only. To provide better isolation, we need to mount xen-tools rootfs into the container through overlayfs. For this we first create an empty snapshot in containerd which we will use for upper and work directories of the overlayfs. We then mount xen-tools rootfs into the root of the container and the stubs that are created by containerd are now in the snapshot instead of the original rootfs. Since the snapshot is created explicitly by us and protected from being garbage collected we also delete it explicitly when the container is deleted. Signed-off-by: Paul Gaiduk <paulg@zededa.com>
Pillar uses xen-tools as a loader for user apps, but it doesn't need to know about the exact location of xen-tools rootfs on the disk. This commit mounts xen-tools container read-only into the pillar container, so that pillar doesn't need to "break out" anymore. Signed-off-by: Paul Gaiduk <paulg@zededa.com>
Force-pushed from bc87704 to 9474d89
Done 👍
Thanks for the effort and the investment @europaul .
Closes #4303
Since starting containers from rootfs does not provide sufficient isolation (especially with mounting), we create an OCI image from the xen-tools rootfs and start user apps from that image.
The xen-tools image is created only once, when initializing the KVM and Xen hypervisors. It is not recreated on reboot and never changed. After an EVE update the containerd content store is empty again, so the image is recreated, since the xen-tools rootfs may have changed.
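For reference, an image assembled this way has the usual single-layer OCI manifest shape. The digests and sizes below are placeholders, not values from the PR:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:<config-digest>",
    "size": 1024
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:<rootfs-tar-digest>",
      "size": 27262976
    }
  ]
}
```

The single layer is the tarball of the xen-tools rootfs, and the config carries the OCI runtime defaults that pillar previously constructed by hand.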