start user apps from image instead of rootfs #4304

Open · europaul wants to merge 3 commits into master from better-isolation-for-eve-user-apps

Conversation

@europaul (Contributor, Author) commented Sep 27, 2024

Closes #4303

Since starting containers from rootfs does not provide sufficient isolation (especially with mounting), we create an OCI image from the xen-tools rootfs and start user apps from that image.

The xen-tools image is created only once, when the KVM or Xen hypervisor is initialized. It is not recreated on reboot and never changed. After an EVE update the containerd content store is empty again, so the image is recreated, since the xen-tools rootfs may have changed.

@europaul added the labels bug (Something isn't working) and stable (Should be backported to stable release(s)) on Sep 27, 2024
if err != nil {
return fmt.Errorf("failed to write tarball to content store: %w", err)
}
fmt.Printf("Layer %s written to content store\n", layerDesc.Digest)
Contributor

Suggested change
fmt.Printf("Layer %s written to content store\n", layerDesc.Digest)
log.Printf("Layer %s written to content store\n", layerDesc.Digest)

@deitch (Contributor) commented Sep 27, 2024

Two requests.

First, can you describe the issue and how it presented itself?
Second, can you describe the architecture of the solution a little better? I probably can dig into the code and figure it out, but it would help if within normal prose you described here (or in the code comments), what the general flow is.

@europaul (Contributor, Author)

First, can you describe the issue and how it presented itself?

See the linked issue #4303 .

@europaul (Contributor, Author)

Second, can you describe the architecture of the solution a little better? I probably can dig into the code and figure it out, but it would help if within normal prose you described here (or in the code comments), what the general flow is.

Here is the gist:
When EVE starts with the KVM or Xen hypervisor (which use xen-tools as a loader), EVE re-creates the xen-tools image from its rootfs on disk (under /containers/services/xen-tools/rootfs) and saves it as xen-tools:local, unless such an image already exists. Later, when a new task needs to be created, the task uses the image xen-tools:local for its container. This results in every user-app container creating its own snapshot of xen-tools:local, which provides the isolation and solves the problem of clashing files.

@deitch (Contributor) commented Sep 27, 2024

Some follow up questions

  1. So this happens on device startup, not on workload startup?
  2. It builds an image from the container already laid out on disk, and then loads it into containerd using normal APIs? And then later can just use it like any other?

@europaul (Contributor, Author) commented Sep 27, 2024

1. So this happens on device startup, not on workload startup?

basically, yes, and we don't re-create the image if it already exists in the image store

2. It builds an image from the container already laid out on disk, and then loads it into containerd using normal APIs? And then later can just use it like any other?

yes, but I'm not sure about "normal APIs":

  • we create a tar from the rootfs - it becomes the first and only layer of the image
  • we create a minimal config for the image - a full config is not needed, since the OCI spec for eve-user-apps will be created by EVE during the app initialization
  • we create a minimal manifest for the image that references the config and the layer created before

Once the image is created it can be used like any normal OCI image to create containers.
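
For illustration, here is a minimal sketch (not the PR's actual code) of what such a minimal config and manifest could look like, built with the image-spec Go types the PR already imports; the helper name and its parameters are hypothetical:

```go
package sketch

import (
	"encoding/json"

	"github.com/opencontainers/go-digest"
	imagespecs "github.com/opencontainers/image-spec/specs-go"
	ocispecs "github.com/opencontainers/image-spec/specs-go/v1"
)

// buildMinimalImage assembles a one-layer image. The caller has already written the
// tarred rootfs to the content store and passes its descriptor plus the diff-ID
// (digest of the uncompressed tar).
func buildMinimalImage(layerDesc ocispecs.Descriptor, layerDiffID digest.Digest) (configJSON, manifestJSON []byte, err error) {
	// Minimal config: EVE builds the OCI runtime spec for eve-user-apps itself,
	// so only the rootfs section really matters here.
	config := ocispecs.Image{
		Config: ocispecs.ImageConfig{},
		RootFS: ocispecs.RootFS{
			Type:    "layers",
			DiffIDs: []digest.Digest{layerDiffID},
		},
	}
	if configJSON, err = json.Marshal(config); err != nil {
		return nil, nil, err
	}

	// Minimal manifest: references the config blob and the single rootfs layer.
	manifest := ocispecs.Manifest{
		Versioned: imagespecs.Versioned{SchemaVersion: 2},
		MediaType: ocispecs.MediaTypeImageManifest,
		Config: ocispecs.Descriptor{
			MediaType: ocispecs.MediaTypeImageConfig,
			Digest:    digest.FromBytes(configJSON),
			Size:      int64(len(configJSON)),
		},
		Layers: []ocispecs.Descriptor{layerDesc},
	}
	if manifestJSON, err = json.Marshal(manifest); err != nil {
		return nil, nil, err
	}
	return configJSON, manifestJSON, nil
}
```

Both blobs would then be written to the content store and registered under the xen-tools:local name.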

@deitch (Contributor) commented Sep 29, 2024

Thanks for the explanation. This makes a lot more sense now.

I have some questions, some more nitpicks, some more important. Overall, the approach looks quite sane (and a lot better than we do now).

I will try to put them into a review.

@deitch (Contributor) left a comment

Overall, I think the approach is correct.

I have a few areas of questions:

  • it looks like we may be recreating existing functionality in oci.go / containerd.go. If it really is new, by all means; if not, let's reuse what we have.
  • a few areas where constants would help
  • can we avoid pillar having explicit specific knowledge of paths, perhaps by mounting in?

Ideally, it would be great to avoid having to do any of this at all. There are volumes in rootfs.yml, but even those end up pulling down the image filesystem and expanding it, not saving you anything.

It isn't that the approach is wrong, it is that it is complex, even if this is the best we likely can do.

pkg/pillar/containerd/containerd.go (resolved comment thread)
ocispec "github.com/opencontainers/image-spec/specs-go/v1"
spec "github.com/opencontainers/image-spec/specs-go/v1"
imagespecs "github.com/opencontainers/image-spec/specs-go"
ocispecs "github.com/opencontainers/image-spec/specs-go/v1"
Contributor

This is kind of weird, since both are image spec, but one is v1, i.e. versioned. The runtime spec also is an OCI spec. I guess not much to do to make this better, and this is not the fault of the PR, but how specs-go/v1.Manifest references specs-go.Versioned. It would be nice if we could get rid of the first one, and then the second would just be imagespecs, and it would be clean. But I don't see the easy way. If I do, I will comment below.

@@ -246,6 +250,317 @@ func (client *Client) CtrDeleteBlob(ctx context.Context, blobHash string) error
return client.contentStore.Delete(ctx, digest.Digest(blobHash))
}

// CtrCreateImageFromRootfs creates an OCI image from an existing rootfs directory
Contributor

I am bothered that you have to do this. I don't think you did it incorrectly, but the fact that you have to do it at all.

The full build process looks like this:

  1. At build time, make rootfs pulls down the image xen-tools
  2. It then lays it out on the filesystem in /containers/services/..., including the contents of the layers and creating the correct config.json, etc.
  3. You then create a tarball from the filesystem - which was originally a tar in the first place - and recreate your own config and manifest (which were originally there), just to load it back into containerd.

Going all the way back to where you started. It would be great if you didn't have to go through all of that pain.

Contributor Author

I'm 💯 with you on this

}

// Create a tarball from the rootfs directory
tarPath := "/tmp/rootfs.tar"
Contributor

Does this have to be a file? Can you just stream it from the walk of the file path straight into containerd, as opposed to creating a tar stream to a file and then opening a stream from that file to containerd? Eliminate the middleman? Or does the need for the digest get in the way, i.e. content.WithDescriptor?

Contributor

And /tmp/rootfs.tar will be on overlayfs, i.e. this will use up memory.

Contributor

I have been thinking about this. You need to know the digest when you first open the channel to push it to containerd, but you need to have the whole thing ready to get the hash. I was going to suggest to do it twice: first stream it to /dev/null while hashing, and get the digest, then stream it to containerd with the results of the first. But tar files are only somewhat reliably reproducible. I have done it, but do we want to depend on that? You probably could, if you control the dates tightly. If that is a path we want to follow, let me know, I can find the tar code where I did it.

It still bothers me: we built this xen-tools rootfs from an existing image when we built the rootfs. Sure, it was laid out on the filesystem because it was useful there and not in its blobs. Now we are trying to recreate it, and doing all sorts of fun things with it? I kind of want us to be able to just embed it in the file to start with. Not actually this way, but the equivalent of:

  1. docker save xen-tools > /rootfs/images/xen-tools.tar
  2. In pillar, bind-mount /rootfs/images:/images
  3. Then we can just use /images/xen-tools.tar

The issue is size. This now becomes a permanent part of the image, another 26MB. That seems a waste.
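
For the two-pass idea above, a rough sketch of how the digest and size could be captured on a throwaway first pass; writeTar is an assumed, deterministic tar-of-rootfs helper (the reproducibility caveat mentioned above), and the function is hypothetical rather than the PR's code:

```go
package sketch

import (
	"io"

	"github.com/opencontainers/go-digest"
)

// countingWriter tallies bytes so the layer size is known before the second pass.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// tarDigest streams the tar once into the hasher only, discarding the bytes; a second,
// identical run of writeTar would then stream the same bytes into containerd, using the
// digest and size learned here for the descriptor.
func tarDigest(rootfsDir string, writeTar func(w io.Writer, dir string) error) (digest.Digest, int64, error) {
	digester := digest.Canonical.Digester()
	counter := &countingWriter{}
	if err := writeTar(io.MultiWriter(digester.Hash(), counter), rootfsDir); err != nil {
		return "", 0, err
	}
	return digester.Digest(), counter.n, nil
}
```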

Layers: []ocispecs.Descriptor{layerDesc},
}

manifestLabels := map[string]string{
Contributor

Do we have to do these? Doesn't containerd automatically add the garbage collection labels?

Also, where we have to do this, don't we already have functions like CtrCreateImage, CtrLoadBlob, etc. in this file? Why recreate it all?

Contributor Author

Do we have to do these? Doesn't containerd automatically add the garbage collection labels?

usually yes, but I guess in this case it doesn't.

Also, where we have to do this, don't we already have functions like CtrCreateImage, CtrLoadBlob, etc. in this file? Why recreate it all?

Thank you for the hint! I overlooked the blob-related functions - I'll have a second look

pkg/pillar/containerd/containerd.go (outdated, resolved comment thread)
@@ -338,6 +663,63 @@ func (client *Client) CtrMountSnapshot(ctx context.Context, snapshotID, targetPa
return mounts[0].Mount(targetPath)
}

// CtrCreateSnapshotFromExistingRootfs creates a snapshot from an existing rootfs
func (client *Client) CtrCreateSnapshotFromExistingRootfs(ctx context.Context, snapshotName, rootfsPath string) error {
Contributor

Is this just the equivalent of your new CreateImageFromRootFS followed by the existing CreateSnapshotForImage? Do we need to recreate the logic here?

We have all of the handling cases and snapshots and garbage collection and all of that already inside the existing functions. Does this add anything?

Contributor Author

Creating a snapshot is much simpler than creating an image - you only package the files from the rootfs, without worrying about the config or manifest. The logic here is different from creating the image, however: we don't create any tarballs, but simply prepare an empty snapshot, copy the files into it, and commit the snapshot with those files.

Contributor

what is the purpose of that? Usually, when ctrd creates a snapshot, it is based on existing filesystems, essentially layers from images. Does this bypass all of that? I had thought this process was:

  1. Load rootfs as image into containerd
  2. Use containerd to create container (which would handle all of the snapshots and all of that)

Where is this needed?

Contributor Author

This function is not used in the current implementation. I based my first draft on it, which just created a snapshot from rootfs instead of the whole xen-tools image and then reused this snapshot to create user apps. I decided to keep this function in case we might need it in the future.

Contributor

That definitely explains why I couldn't figure out its purpose. 😄

Can we remove it? I don't want others getting confused.

Contributor Author

Sure :) I just wanted to save it because it's a working approach and it took me a lot of time to develop it. And it might be useful in the future.

Contributor

That does make sense. Then can you explain a bit more when it is useful? snapshots from a rootfs don't make sense to me (in the context of containerd). It is usually, "load an image" and then "create snapshots from that image". Those functions (I think) are elsewhere, so is this something more than just a wrapper around those two?

Contributor Author

Well, in our particular scenario we don't really need a full image - we just need the files from the rootfs, since we create the OCI config ourselves. So a snapshot actually works well, because a snapshot from the overlayfs snapshotter is just an overlayfs mount of the image, where the image stays read-only and the changes only exist in the upper directory.

So it goes like this (sketched below):

  • create (prepare) a new empty snapshot called xen-tools, unless it already exists (same as with the image)
  • copy the files from the xen-tools rootfs on disk into that snapshot
  • commit that snapshot and give it a label to prevent it from being garbage collected
  • use that snapshot as a base to create (prepare) new snapshots for each user app container
  • those new snapshots are managed by containerd internally, so we don't have to take care of them
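
A rough sketch of that flow under a few assumptions (the context already carries a containerd namespace, copyTree is a hypothetical directory-copy helper, and this is not the PR's CtrCreateSnapshotFromExistingRootfs itself):

```go
package sketch

import (
	"context"
	"os"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/snapshots"
)

// ensureRootfsSnapshot prepares an empty snapshot, copies the xen-tools rootfs into it,
// and commits it with a label so containerd's garbage collector leaves it alone.
func ensureRootfsSnapshot(ctx context.Context, client *containerd.Client, name, rootfsPath string,
	copyTree func(src, dst string) error) error {

	sn := client.SnapshotService("overlayfs")
	if _, err := sn.Stat(ctx, name); err == nil {
		return nil // already committed on an earlier boot
	}

	key := name + "-tmp"
	mounts, err := sn.Prepare(ctx, key, "") // empty parent: a brand-new writable layer
	if err != nil {
		return err
	}

	// Mount the active snapshot somewhere temporary and copy the rootfs into it.
	target, err := os.MkdirTemp("", "xen-tools-snap")
	if err != nil {
		return err
	}
	defer os.RemoveAll(target)
	if err := mount.All(mounts, target); err != nil {
		return err
	}
	if err := copyTree(rootfsPath, target); err != nil {
		mount.UnmountAll(target, 0)
		return err
	}
	if err := mount.UnmountAll(target, 0); err != nil {
		return err
	}

	// The gc.root label protects the committed snapshot from garbage collection.
	return sn.Commit(ctx, name, key, snapshots.WithLabels(map[string]string{
		"containerd.io/gc.root": time.Now().UTC().Format(time.RFC3339),
	}))
}
```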

ctrdCtx, done := s.client.CtrNewUserServicesCtx()
defer done()

image, err := s.client.CtrGetImage(ctrdCtx, "xen-tools:local")
Contributor

So xen-tools:local is the internal reference name we are using for the image? And if not found, we will load from filesystem and name it that? Can we save this as a constant?

Contributor Author

yes to everything 😄

return nil
}
// Create a xen-tools image if it does not exist
err = ctx.ctrdClient.CtrCreateImageFromRootfs(ctrdUserCtx, "xen-tools:local", "/containers/services/xen-tools/rootfs")
Contributor

Here is the name again, we should save it as a constant. I also think the name should be more specific, like lfedge/eve-xen-tools:local, just so it is clear it is something ours.

Contributor

If I understand correctly, this is the one place where we find the filesystem. In other words, this is where pillar becomes aware of the structure of the underlying OS. It might be elsewhere too, but I think I found only one other spot.

Can we avoid this? Can we mount it into pillar at the rootfs.yml level and thus avoid this? E.g. in pillar:

const xenToolsRoot = "/xen-tools/"
// and later
	err = ctx.ctrdClient.CtrCreateImageFromRootfs(ctrdUserCtx, xenToolsImage, xenToolsRoot)

and then in rootfs.yml.in:

  - name: pillar
    image: PILLAR_TAG
    cgroupsPath: /eve/services/pillar
    oomScoreAdj: -999
    binds.add:
    - /containers/services/xen-tools/rootfs:/xen-tools:ro

It has the same effect, except that pillar no longer knows the name of the container or its on-root path (no "breaking out"), and the place that does know it - rootfs.yml - handles mapping it in.

Contributor

Even better might be if we had an onboot container that could do this, rather than burdening pillar even further. But I think that would be too fundamental a change for this.

@OhmSpectator (Member)

I have a general question regarding what problem we are trying to solve. I put it here: #4303 (comment)

@OhmSpectator (Member)

Ooooph.. It's pretty hard to understand what's happening... I'll try, I'll try...
I hope @deitch can do the proper review meanwhile. And I'm glad you, @europaul, now have a good understanding of this OCI/mounts code. It was always like hell to me.

@OhmSpectator (Member) left a comment

I think this doc should be updated when the PR is in its final form.

From what I understand, we now create a new "base container" (xen-tools) for each newly deployed application. This is done to avoid collisions in the mount paths used by different applications started from this container. And to create the per-app "base container", we first create an image and load it into our CAS, so it can be used on app deployment.
If that's right, my questions would be:

  1. Could we resolve the collisions inside the container by just fixing the paths of mounted volumes?
  2. If not - and we need to create a new base container for each app, and for that we need the base image - could we create it at build time, not at runtime?

@deitch (Contributor) commented Sep 30, 2024

It was always like hell to me.

That is partially because mounts are ugly, and containerd manages it all for you. And partially because we hacked around lots of it in eve. Somewhere on my list is a desire to clean up a lot of this and standardize it. Somewhere.

@eriknordmark (Contributor)

basically, yes, and we don't re-create the image if it already exists in the image store

What happens when we deploy a new version of EVE? Do we discard and recreate this image based on the files in the new version?

Comment on lines 350 to 354
imageExists, err := ctx.ctrdClient.CtrImageExists(ctrdUserCtx, "xen-tools:local")
if err != nil {
return fmt.Errorf("couldn't check if image xen-tools exists: %v.", err)
}
if imageExists {
Contributor

Does this image include the shim-VM linux kernel in addition to the xen-tools user-space?

Then there is an interesting question about EVE updates (and also about how this works today).
If a device is running EVE version X, and you deploy container A, it gets the Linux kernel from version X.
Then you update EVE to version Y. Do you expect container A to restart with the new Linux kernel from EVE version Y or not? (and I don't know what happens today for sure - we already have the shim-VM rootfs in /persist/vault/volumes but what about the kernel?)
Then you deploy container B. Do you expect it to get the version X or Y kernel?

In addition to the kernel, I assume we want to be able to deploy fixes (e.g., to the dhcp and mount scripts) in the shim VM and have those be used by existing and future deployed containers.

Contributor

Doesn't the shim-VM use the EVE kernel? I had thought it just mounts in the kernel from EVE itself?

Contributor

I don't know about the kernel.

But user-space bug fixes matter as well.

Member

Doesn't the shim-VM use the EVE kernel?

The kernel that runs the container is exactly the same as the host's. A container just has some file overlays plus limits on access to some resources; the runtime is shared with the host in any case. So there is no risk of a kernel version mismatch. The only similar problem I could imagine would be the Linux kernel modules, which we "repopulate" into the image as we mount them here:

if err := os.MkdirAll(filepath.Join(volumeRoot, "modules"), 0600); err != nil {
But I'm not sure that's really the case.

Contributor

Quite true.

To some degree, I don't think we have asked these questions in a while. In the immediate sense, @europaul 's PR does not change how anything works now. Same filesystem, same kernel, same everything. It just sets up the filesystem path correctly, rather than sharing (and trouncing) like it does now. And relies on containerd to do it correctly. So I think we should not hold this up for that.

We probably should have a proper design discussion around the issues you have raised. What are the behaviours we are trying to achieve? What are potential challenges? How should we restructure it?

Contributor

If I'm correct, we recreate the "base" image on each device reboot now

Not sure, Paul would know. If so, that would resolve the issue Erik raised in the previous comment (if I understood him).

(which I find odd, by the way, as we could do it in build time).

That would be even nicer. If we can find a way...

Contributor

Doesn't it change the behavior of container B

Not really. All of this is using xen-tools. Whether we use the mounted one from the root filesystem, or whether we copy it to an image and then mount that, we still are using whatever is on the rootfs.

  • Current: we are using xen-tools:1, it mounts the directory from running xen-tools, launches container B. We upgrade EVE, including xen-tools, which includes xen-tools:2. It mounts the directory from running xen-tools (clueless as to what version it is) and launches B. B just got a newer version of xen-tools.
  • Proposed: we are using xen-tools:1, it creates an image from xen-tools, launches container B. We upgrade EVE, including xen-tools, which includes xen-tools:2. It creates an image from xen-tools (clueless as to what version it is), and launches B. B just got a newer version of xen-tools.

The case I can see (maybe this is what you were referring to, Erik?) is if, after the upgrade, it looks in containerd and says, "oh, wait I already have xen-tools, so I won't bother creating it from the filesystem." Then container B is stuck on xen-tools:1, because it was saved in containerd.

Is that what you were asking about? I am not sure if it works this way or the way I described in "Proposed" above. @europaul can you explain?

@deitch, yes that is what I'm asking about. AFAICT the new code checks if the image exists in containerd and if not it will never create a new version of that image. Sounds like this is a change.
FWIW we could name that xen-tools image based on the eve-version string to make this behave the same as today (if we understand the current behavior per above.) But that would require having code to delete the unreferenced old ones from containerd.

Contributor

Well, not bad. I had to walk through the whole path to get it, but I did in the end. 😄

@europaul can you address his issue? It is a valid one. We want to resolve this issue you have raised, but without changing behaviour.

Contributor Author

First of all, thank you for all the comments, folks - I really appreciate it! I'm gonna start with this conversation since it has the most fundamental questions.

The xen-tools:local image is supposed to be created on pillar startup, unless it already exists in the containerd image store. So it's not recreated after a simple reboot.
As far as my understanding goes, when EVE is upgraded, partition B is clean - like on the first ever start of EVE - so there will be no images in the containerd image store. That means the xen-tools:local image will be recreated from the rootfs of the new version of xen-tools.

It would be great if we could put a xen-tools image into containerd's image store at build time, but afaiu @deitch said that would require changes on the linuxkit front (which I'm happy to make if need be).

Contributor

It would be great if we could put a xen-tools image into containerd's image store at build time, but afaiu @deitch said that would require changes on the linuxkit front (which I'm happy to make if need be)

I have been exploring it a bit. I will let you know.

@europaul (Contributor, Author) commented Sep 30, 2024

Could we resolve the collisions inside the container by just fixing the paths of mounted volumes?

@OhmSpectator
I thought about this, but this would be like adding another hack on top of the existing hack (running multiple containers from the same rootfs from disk). I really really don't want any more hacks - when we see containers, we should think "isolated environment" and not worry about where to put files in a container so that they don't collide somewhere else.

@europaul (Contributor, Author) commented Oct 1, 2024

@deitch @eriknordmark I think I might have a better solution: create the user app container from an empty snapshot instead of from xen-tools, and just bind-mount xen-tools from /containers/services/xen-tools/rootfs into the container in AddLoader when needed. This way:

  • no extra space is used on disk
  • the user app container always has access to the newest version of xen-tools, even after EVE upgrade
  • we don't have to manage any extra images

I'll try to implement a PoC now
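
A minimal sketch of the bind-mount variant (not the final PR code); the source path is whatever rootfs.yml maps into pillar, and both paths plus the helper name are illustrative:

```go
package sketch

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// addLoaderBindMount exposes the xen-tools rootfs read-only inside the user app's
// OCI runtime spec, so the loader files are visible without copying them anywhere.
func addLoaderBindMount(spec *specs.Spec, xenToolsPath, dest string) {
	spec.Mounts = append(spec.Mounts, specs.Mount{
		Destination: dest,         // where in the container the loader files should appear
		Type:        "bind",
		Source:      xenToolsPath, // e.g. /xen-tools, as mapped in by rootfs.yml
		Options:     []string{"rbind", "ro"}, // read-only, as requested below
	})
}
```

As the later commit message explains, a plain read-only bind mount still lets containerd create stub files in the original rootfs, which is why the final change layers an overlayfs mount over an empty snapshot instead.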

@deitch (Contributor) commented Oct 1, 2024

@europaul that is quite elegant.

Can you add the rootfs.yml part we discussed? So that only that file knows the path, while pillar itself just sees a local mount point?

@deitch (Contributor) commented Oct 1, 2024

And be sure to mount them read only into each container, so no chance of mistakes.

@europaul force-pushed the better-isolation-for-eve-user-apps branch from 7bf7c19 to 3421c9c on October 8, 2024 at 10:03
The github-actions bot requested a review from deitch on October 8, 2024 at 10:03
@deitch (Contributor) commented Oct 8, 2024

Two questions, @europaul

First, do you mind writing up how "AddLoader" works in this new method? The old way was, essentially, modify the manifest as it changed the entrypoint (hence, "Add Loader"). And should it have a new name, or does it still reflect it correctly.

Second, can we do what was discussed above, i.e. having the volume bind-mounted in via rootfs.yml, and using that path to pass to AddLoader()? So pillar does not need to know, "I can find this at the possibly-changed path /containers/services/...?

Add .gitignore file for pkg/pillar directory and include test artifacts
and profiling results.

Signed-off-by: Paul Gaiduk <paulg@zededa.com>
@europaul force-pushed the better-isolation-for-eve-user-apps branch from 3421c9c to bc87704 on October 10, 2024 at 19:32
@europaul (Contributor, Author)

First, do you mind writing up how "AddLoader" works in this new method? The old way was, essentially, modify the manifest as it changed the entrypoint (hence, "Add Loader"). And should it have a new name, or does it still reflect it correctly.

@deitch even the old AddLoader() was doing more than that: it was also creating the necessary files for the container workloads, such as mountPoints, environment and cmdline - and not only in the containers, but straight up in the volumes!
So I think the function of AddLoader is best described as "do everything necessary to enable the loader, e.g. qemu (kvm, xen), as opposed to just running that OCI spec without a hypervisor (containerd)".
What it does regarding the snapshotting I described here.

Of course, ideally AddLoader would just massage the OCI spec so that it fits the loader and do nothing else - and ideally all functions related to ociSpec would do the same. But we already have other functions, like func (s *ociSpec) CreateContainer, that interact with the containerd client, so I think we can make the same exception for AddLoader as well.

Starting user app containers by bind-mounting the xen-tools rootfs doesn't
provide sufficient isolation when mounting other files on top of it.
Example: when mounting other files on top of xen-tools rootfs,
containerd will create stubs for those files in the original rootfs
even though it's mounted read-only.

To provide better isolation, we need to mount xen-tools rootfs into the
container through overlayfs. For this we first create an empty snapshot
in containerd which we will use for upper and work directories of the
overlayfs. We then mount xen-tools rootfs into the root of the container
and the stubs that are created by containerd are now in the snapshot
instead of the original rootfs.

Since the snapshot is created explicitly by us and protected from being
garbage collected we also delete it explicitly when the container is
deleted.

Signed-off-by: Paul Gaiduk <paulg@zededa.com>
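
In containerd's mount terms, the arrangement this commit describes could look roughly like the following (paths illustrative, not the PR's exact code): the xen-tools rootfs is the read-only lower layer, and the explicitly created snapshot supplies the upper and work directories, so any stub files land in the snapshot:

```go
package sketch

import "github.com/containerd/containerd/mount"

// overlayRootfs builds an overlay mount where writes (including the stubs containerd
// creates for mount targets) go to the snapshot's upper directory, while the xen-tools
// rootfs underneath stays untouched.
func overlayRootfs(xenToolsRootfs, snapshotUpper, snapshotWork string) mount.Mount {
	return mount.Mount{
		Type:   "overlay",
		Source: "overlay",
		Options: []string{
			"lowerdir=" + xenToolsRootfs,
			"upperdir=" + snapshotUpper,
			"workdir=" + snapshotWork,
		},
	}
}
```
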
Pillar uses xen-tools as a loader for user apps, but it doesn't need to
know about the exact location of xen-tools rootfs on the disk. This
commit mounts the xen-tools container read-only into the pillar container,
so that pillar doesn't need to "break out" anymore.

Signed-off-by: Paul Gaiduk <paulg@zededa.com>
@europaul force-pushed the better-isolation-for-eve-user-apps branch from bc87704 to 9474d89 on October 10, 2024 at 19:44
@europaul (Contributor, Author)

Second, can we do what was discussed above, i.e. having the volume bind-mounted in via rootfs.yml, and using that path to pass to AddLoader()? So pillar does not need to know, "I can find this at the possibly-changed path /containers/services/...?

Done 👍

@deitch (Contributor) left a comment

Thanks for the effort and the investment, @europaul.

Labels: bug (Something isn't working), stable (Should be backported to stable release(s))
Projects: None yet
Development: Successfully merging this pull request may close these issues: Insufficient isolation between eve-user-apps
5 participants