
Docker provider #1743

Merged: 5 commits merged into confidential-containers:main from docker-provider on Apr 25, 2024

Conversation

@bpradipt (Member) commented Mar 14, 2024

For quick testing:

  • Clone the code
git clone --single-branch https://github.com/bpradipt/cloud-api-adaptor.git -b docker-provider
  • Deploy CAA
CLOUD_PROVIDER=docker make deploy

Once CAA is deployed, change the image using the following command

kubectl set image ds/cloud-api-adaptor-daemonset -n confidential-containers-system cloud-api-adaptor-con=quay.io/bpradipt/cloud-api-adaptor

Download the pod VM container image

docker pull quay.io/confidential-containers/podvm-docker-image

Create a sample pod with runtimeClass kata-remote, for example:
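A minimal manifest for such a pod, mirroring the nginx example used later in this thread (image and names are illustrative; the annotation and runtimeClassName are the parts the docker provider needs):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.containerd.cri.runtime-handler: kata-remote
spec:
  runtimeClassName: kata-remote
  containers:
  - name: nginx
    image: nginx
EOF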

@bpradipt bpradipt force-pushed the docker-provider branch 2 times, most recently from 9bca008 to 46837a3 Compare March 16, 2024 14:25
@bpradipt bpradipt marked this pull request as ready for review April 12, 2024 04:29
@bpradipt (Member, Author)

I had to use Ubuntu binaries and not Fedora since I rely on the Kind image as the base podvm container image, which uses Ubuntu.
The Kind image is already prepared to run containers and systemd inside a container. Using a new container image (Fedora) will take some effort to figure out what's needed. This can be done as a future work item.

@bpradipt bpradipt added the test_e2e_libvirt Run Libvirt e2e tests label Apr 12, 2024
@bpradipt bpradipt force-pushed the docker-provider branch 2 times, most recently from b504032 to b36329b Compare April 12, 2024 06:18
@bpradipt bpradipt removed the test_e2e_libvirt Run Libvirt e2e tests label Apr 12, 2024
@bpradipt bpradipt force-pushed the docker-provider branch 2 times, most recently from cdde175 to d2eb5f7 Compare April 12, 2024 17:44
@liudalibj (Member) left a comment


First review of this PR, covering the copyright and some strings.
I will follow the document to try this new provider on my dev machine later.
@bpradipt this is very cool, thanks for adding this new provider.

Review comments on:
  • src/cloud-api-adaptor/docker/README.md
  • src/cloud-api-adaptor/docker/Dockerfile
  • src/cloud-providers/util.go
  • src/cloud-providers/docker/provider_test.go
  • src/cloud-providers/docker/provider.go
  • src/cloud-providers/docker/misc.go
  • src/cloud-providers/docker/Dockerfile
  • src/cloud-providers/docker/docker.go
  • src/cloud-providers/docker/manager.go
@huoqifeng left a comment


Thanks @bpradipt for bringing this great feature; I left several comments...

Review comments on:
  • src/cloud-api-adaptor/docker/Makefile
  • src/cloud-providers/docker/Dockerfile
  • src/cloud-api-adaptor/docker/Dockerfile
@stevenhorsman (Member) left a comment


A few in-progress comments from my review

Review comments on:
  • src/cloud-api-adaptor/docker/README.md (several)
  • src/cloud-providers/go.mod
@bpradipt (Member, Author)

The CI failures are unrelated and look like a network issue.

@stevenhorsman (Member) left a comment


A few more comments

Review comments on:
  • src/cloud-api-adaptor/docker/README.md (several)
@stevenhorsman (Member)

I've tried re-starting the documented process from scratch and it's now failing with:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  30s                default-scheduler  Successfully assigned default/nginx to sh-operator1.fyre.ibm.com
  Normal   Pulled     15s                kubelet            Successfully pulled image "nginx" in 6.276497662s
  Normal   Pulling    11s (x2 over 21s)  kubelet            Pulling image "nginx"
  Normal   Created    10s (x2 over 15s)  kubelet            Created container nginx
  Normal   Pulled     10s                kubelet            Successfully pulled image "nginx" in 808.867772ms
  Warning  Failed     8s (x2 over 11s)   kubelet            Error: failed to create containerd task: failed to create shim task: No such file or directory (os error 2): unknown
  Warning  BackOff    7s (x2 over 8s)    kubelet            Back-off restarting failed container

Looking into the kata-agent log (thanks docker exec 😄 ) I see

Apr 24 12:00:57 59f9cd025634 kata-agent[343]: [2024-04-24T12:00:57Z WARN  confidential_data_hub::config] read config file /run/confidential-containers/cdh.toml failed configuration file "/run/confidential-containers/cdh.toml" not found
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:     Stack backtrace:
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        1: confidential_data_hub::config::CdhConfig::from_file
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        2: confidential_data_hub::main::{{closure}}
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        3: tokio::runtime::park::CachedParkThread::block_on
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        4: tokio::runtime::context::runtime::enter_runtime
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        5: tokio::runtime::runtime::Runtime::block_on
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        6: confidential_data_hub::main
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        7: std::sys_common::backtrace::__rust_begin_short_backtrace
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        8: std::rt::lang_start::{{closure}}
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:        9: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
Apr 24 12:00:57 59f9cd025634 kata-agent[343]:                  at ./rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:284:13

So I think your rebase (and hence the main code?) might be broken? This is the same error that Zvonko reported, but I thought we were creating the /run/confidential-containers/cdh.toml as part of process-user-data, or is there a difference in logic in the docker provider?
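For anyone reproducing this debug flow, a minimal sketch (assuming the podvm-<pod> container naming seen later in this thread, and that the Kind-based podvm image ships systemd's journalctl):

# Find the podvm container, then read the kata-agent log inside it
docker ps --filter "name=podvm-"
docker exec -it <podvm-container> journalctl -t kata-agent --no-pager | tail -n 50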

@bpradipt (Member, Author)

> I've tried re-starting the documented process from scratch and it's now failing with: [pod events and kata-agent backtrace quoted above]

I do see the cdh warning on my setup, but the container starts:

root@d48fb0b16e51:/# cat log | grep cdh
[2024-04-24T07:41:25Z INFO  confidential_data_hub] Use configuration file /run/confidential-containers/cdh.toml
[2024-04-24T07:41:25Z WARN  confidential_data_hub::config] read config file /run/confidential-containers/cdh.toml failed configuration file "/run/confidential-containers/cdh.toml" not found
[2024-04-24T07:41:25Z INFO  confidential_data_hub] Confidential Data Hub starts to listen to request: unix:///run/confidential-containers/cdh.sock

Try with AA_KBC_PARAMS in peer-pods-cm to create the cdh.toml.

The main error seems to be this one: Error: failed to create containerd task: failed to create shim task: No such file or directory (os error 2): unknown, and I don't have a clue :-(

@stevenhorsman (Member)

> Try with AA_KBC_PARAMS in peer-pods-cm to create the cdh.toml.

Apologies for my ignorance (and if I've missed the doc), but where exactly should I put it? I tried in:

data:
  AA_KBC_PARAMS: offline_fs_kbc::null
  CLOUD_CONFIG_VERIFY: "false"
  CLOUD_PROVIDER: docker

and deleted the CAA pod to trigger a restart, but then hit a docker entrypoint error:

+ exec cloud-api-adaptor docker -pods-dir /run/peerpod/pods '-aa-kbc-params offline_fs_kbc::null ' -socket /run/peerpod/hypervisor.sock
2024/04/24 13:15:21 [adaptor/cloud] Cloud provider external plugin loading is disabled, skipping plugin loading
flag provided but not defined: -aa-kbc-params offline_fs_kbc::null
Usage: cloud-api-adaptor docker [options]

The options for "docker" are:
  -aa-kbc-params string
    	attestation-agent KBC parameters
  ...

so it seemed to pass it through correctly, but I guess the quotes screwed it up?

@bpradipt (Member, Author)

> Apologies for my ignorance (and if I've missed the doc), but where exactly should I put it? […] so it seemed to pass it through correctly, but I guess the quotes screwed it up? [configmap snippet and entrypoint error quoted above]

Ah, this looks like a code bug. I've pushed a fix. Please try with a new CAA image.
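For context, the failure mode above is a classic shell-quoting pitfall; a hypothetical sketch (variable name and flag handling assumed, not the actual entrypoint.sh contents):

optionals="-aa-kbc-params ${AA_KBC_PARAMS}"

# Wrong: quoting keeps the flag and its value as a single argv entry, so the Go
# flag parser rejects the unknown flag "-aa-kbc-params offline_fs_kbc::null".
#   exec cloud-api-adaptor docker "$optionals" -socket /run/peerpod/hypervisor.sock

# Unquoted expansion word-splits into two arguments, which the parser accepts:
exec cloud-api-adaptor docker $optionals -socket /run/peerpod/hypervisor.sock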

@huoqifeng left a comment


LGTM, thanks! @bpradipt

@stevenhorsman (Member) commented Apr 24, 2024

> Ah, this looks like a code bug. I've pushed a fix. Please try with a new CAA image.

I retried and when I added AA_KBC_PARAMS to the peer-pods-cm it got rid of the cdh error, but the underlying problem still remains:

Apr 24 18:08:01 2c9c36047573 kata-agent[291]: {"msg":"create container process error No such file or directory (os error 2)","level":"ERRO","ts":"2024-04-24T18:08:01.628364688Z","module":"rustjail","subsystem":"container","version":"0.1.0","pid":"291","eid":"77a8346d11b931bc53f9d1ebe4854dc50782327bb9f2024baec245675f02e5c3","source":"agent","cid":"77a8346d11b931bc53f9d1ebe4854dc50782327bb9f2024baec245675f02e5c3","name":"kata-agent"}

At one point this was caused by the aa-offline-fs files not being there, but I've checked and they are:

# ls /etc/aa-offline_fs*
/etc/aa-offline_fs_kbc-keys.json  /etc/aa-offline_fs_kbc-resources.json

Unfortunately image-rs doesn't do logging, so it's pretty tricky to work out what has gone wrong, but I don't understand how everyone else has got this working and which step in the instructions I've got wrong.

@bpradipt (Member, Author)

> I retried and when I added AA_KBC_PARAMS to the peer-pods-cm it got rid of the cdh error, but the underlying problem still remains: [kata-agent error and file listing quoted above]

Can you try once with this image: quay.io/confidential-containers/podvm-docker-image?

@stevenhorsman (Member) commented Apr 24, 2024

> Can you try once with this image: quay.io/confidential-containers/podvm-docker-image?

Same issue:

Events:
  Type     Reason     Age              From               Message
  ----     ------     ----             ----               -------
  Normal   Scheduled  17s              default-scheduler  Successfully assigned default/nginx-dbc79c87-4dzfs to sh-operator1.fyre.ibm.com
  Normal   Pulled     4s (x2 over 8s)  kubelet            Container image "nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279" already present on machine
  Normal   Created    4s (x2 over 8s)  kubelet            Created container nginx
  Warning  Failed     2s (x2 over 4s)  kubelet            Error: failed to create containerd task: failed to create shim task: No such file or directory (os error 2): unknown
  Warning  BackOff    1s (x2 over 2s)  kubelet            Back-off restarting failed container

Just to check - this is the podvm image I have just pulled:

# docker inspect image quay.io/confidential-containers/podvm-docker-image
[
    {
        "Id": "sha256:0224c78bcb69fa45daf2a29cea0f4a6f875a40759cb52704b8a1fa5fe3242316",
        "RepoTags": [
            "quay.io/confidential-containers/podvm-docker-image:latest"
        ],
        "RepoDigests": [
            "quay.io/confidential-containers/podvm-docker-image@sha256:d5176a5cae453deb3f9b5546b565eefd7cef47d552f0b1080d06c9fbc87c0382"
        ],

It is super late for you now, so we can debug it tomorrow morning (my time) if you have time?

@liudalibj (Member)

I built CAA from this PR and the docker provider deployment succeeded, but I failed to get a running nginx pod.
Here are my docker env and related logs:

docker version
Client: Docker Engine - Community
 Version:           26.1.0
 API version:       1.45
 Go version:        go1.21.9
 Git commit:        9714adc
 Built:             Mon Apr 22 17:06:41 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.9
  Git commit:       c8af8eb
  Built:            Mon Apr 22 17:06:41 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.31
  GitCommit:        e377cd56a71523140ca6ae87e30244719194a521
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 
systemctl status docker.service
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-04-25 02:54:24 UTC; 47min ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 964 (dockerd)
      Tasks: 21
     Memory: 130.1M
        CPU: 1.188s
     CGroup: /system.slice/docker.service
             └─964 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Apr 25 02:54:22 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:22.881054344Z" level=info msg="Starting up"
Apr 25 02:54:22 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:22.897835741Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Apr 25 02:54:23 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:23.200749279Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Apr 25 02:54:23 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:23.680000559Z" level=info msg="Loading containers: start."
Apr 25 02:54:23 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:23.987127523Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Apr 25 02:54:24 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:24.020446148Z" level=info msg="Loading containers: done."
Apr 25 02:54:24 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:24.059005316Z" level=info msg="Docker daemon" commit=c8af8eb containerd-snapshotter=false storage-driver=overlay2 version=26.1.0
Apr 25 02:54:24 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:24.059409939Z" level=info msg="Daemon has completed initialization"
Apr 25 02:54:24 liudali-x86-build systemd[1]: Started Docker Application Container Engine.
Apr 25 02:54:24 liudali-x86-build dockerd[964]: time="2024-04-25T02:54:24.735013181Z" level=info msg="API listen on /run/docker.sock"

systemctl status docker.socket
● docker.socket - Docker Socket for the API
     Loaded: loaded (/lib/systemd/system/docker.socket; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-04-25 02:54:19 UTC; 50min ago
   Triggers: ● docker.service
     Listen: /run/docker.sock (Stream)
      Tasks: 0 (limit: 38487)
     Memory: 0B
        CPU: 935us
     CGroup: /system.slice/docker.socket

Apr 25 02:54:19 liudali-x86-build systemd[1]: Starting Docker Socket for the API...
Apr 25 02:54:19 liudali-x86-build systemd[1]: Listening on Docker Socket for the API.
  • CAA log
kubectl logs -f -n confidential-containers-system   cloud-api-adaptor-daemonset-llvsn
+ exec cloud-api-adaptor docker -pods-dir /run/peerpod/pods -socket /run/peerpod/hypervisor.sock
cloud-api-adaptor version v0.0.1-dev
  commit: 595688c113b52af67a294be1c0dd7cf5dd772099
  go: go1.21.9
cloud-api-adaptor: starting Cloud API Adaptor daemon for "docker"
2024/04/25 03:44:32 [adaptor/cloud] Cloud provider external plugin loading is disabled, skipping plugin loading
2024/04/25 03:44:32 [adaptor/cloud/docker] docker config: &docker.Config{DockerHost:"unix:///var/run/docker.sock", DockerAPIVersion:"1.40", DockerCertPath:"", DockerTLSVerify:false, DataDir:"/var/lib/docker/peerpods"}
2024/04/25 03:44:32 [adaptor] server config: &adaptor.ServerConfig{TLSConfig:(*tlsutil.TLSConfig)(0xc00068ff00), SocketPath:"/run/peerpod/hypervisor.sock", CriSocketPath:"", PauseImage:"", PodsDir:"/run/peerpod/pods", ForwarderPort:"15150", ProxyTimeout:300000000000, AAKBCParams:"", EnableCloudConfigVerify:false}
2024/04/25 03:44:32 [util/k8sops] initialized PeerPodService
2024/04/25 03:44:32 [probe/probe] Using port: 8000
2024/04/25 03:44:32 [adaptor] server started
2024/04/25 03:44:32 [podnetwork] routes on netns /var/run/netns/cni-0ff81211-543a-d0e1-0ba1-8458f570d5cf
2024/04/25 03:44:32 [podnetwork]     0.0.0.0/0 via 10.244.1.1 dev eth0
2024/04/25 03:44:32 [podnetwork]     10.244.0.0/16 via 10.244.1.1 dev eth0
2024/04/25 03:44:32 [adaptor/cloud] Credentials file is not in a valid Json format, ignored
2024/04/25 03:44:32 [adaptor/cloud] stored /run/peerpod/pods/f48a12a60691c7eda8f5783e464396e0cd76a376fe7609656de01d186898e4df/daemon.json
2024/04/25 03:44:32 [adaptor/cloud] create a sandbox f48a12a60691c7eda8f5783e464396e0cd76a376fe7609656de01d186898e4df for pod nginx in namespace default (netns: /var/run/netns/cni-0ff81211-543a-d0e1-0ba1-8458f570d5cf)
2024/04/25 03:44:32 [adaptor/cloud/docker] CreateInstance: name: "podvm-nginx-f48a12a6"
2024/04/25 03:44:32 [adaptor/cloud] creating an instance : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2024/04/25 03:44:32 [adaptor/proxy] shutting down socket forwarder
2024/04/25 03:44:32 [adaptor/cloud/docker] DeleteInstance: instanceID: ""
2024/04/25 03:44:32 [adaptor/cloud] Error deleting an instance : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2024/04/25 03:44:32 [tunneler/vxlan] Delete tc redirect filters on eth0 and ens3 in the network namespace /var/run/netns/cni-0ff81211-543a-d0e1-0ba1-8458f570d5cf
2024/04/25 03:44:32 [adaptor/cloud] tearing down netns /var/run/netns/cni-0ff81211-543a-d0e1-0ba1-8458f570d5cf: failed to tear down tunnel "vxlan": failed to delete a tc redirect filter from vxlan1 to eth0: failed to get interface vxlan1: Link not found
2024/04/25 03:44:33 [podnetwork] routes on netns /var/run/netns/cni-f9094684-dbfa-be99-68f1-687392dc25cb
2024/04/25 03:44:33 [podnetwork]     0.0.0.0/0 via 10.244.1.1 dev eth0
2024/04/25 03:44:33 [podnetwork]     10.244.0.0/16 via 10.244.1.1 dev eth0
2024/04/25 03:44:33 [adaptor/cloud] Credentials file is not in a valid Json format, ignored
2024/04/25 03:44:33 [adaptor/cloud] stored /run/peerpod/pods/8d57b6ab6c8a634fe5780110cc65b36430319c6b3eb2c090331ff988e39f0718/daemon.json
2024/04/25 03:44:33 [adaptor/cloud] create a sandbox 8d57b6ab6c8a634fe5780110cc65b36430319c6b3eb2c090331ff988e39f0718 for pod nginx in namespace default (netns: /var/run/netns/cni-f9094684-dbfa-be99-68f1-687392dc25cb)
2024/04/25 03:44:33 [adaptor/cloud/docker] CreateInstance: name: "podvm-nginx-8d57b6ab"
2024/04/25 03:44:33 [adaptor/cloud] creating an instance : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
....

I installed docker by following this page: https://docs.docker.com/engine/install/ubuntu/
@bpradipt @stevenhorsman do you have any suggestions?

@bpradipt (Member, Author) commented Apr 25, 2024

> creating an instance : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? […] do you have any suggestions? [CAA log quoted above]

If you are executing as a non-root user (e.g. ubuntu), can you check whether you are able to run docker commands (e.g. docker info, docker ps)? Otherwise you'll need to first run the post-install steps for docker and then retry.

Also please check if the docker socket is mounted inside the CAA pod
Ref: https://github.com/confidential-containers/cloud-api-adaptor/pull/1743/files#diff-1eace0282d1a65f7a10342af7cf78ec00073487f8c57a4bae80981660ff82683
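For reference, the standard Docker post-install steps plus a sanity check on the socket mount (the daemonset and namespace names come from this PR's deployment; the jsonpath query is just one way to check):

# Allow the non-root user to talk to the Docker daemon (official post-install steps)
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker            # or log out and back in
docker info && docker ps # should now work without sudo

# Confirm the docker socket is mounted into the CAA daemonset
kubectl get ds cloud-api-adaptor-daemonset -n confidential-containers-system \
  -o jsonpath='{.spec.template.spec.volumes}' | grep -o 'docker.sock'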

@bpradipt (Member, Author)

> Same issue: […] It is super late for you now, so we can debug it tomorrow morning if you have time? [pod events and podvm image inspect quoted above]

The same image works on my env. Let me create a fresh env and check.

@stevenhorsman (Member)

> The same image works on my env. Let me create a fresh env and check.

So I'm not sure what the difference is, but I also created a completely fresh environment and it seems to be working now:

# kubectl get pods
NAME                   READY   STATUS    RESTARTS   AGE
nginx-dbc79c87-mqcvc   1/1     Running   0          42s
root@sh-operator-21:~/go/src/github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor# docker ps
CONTAINER ID   IMAGE                                                COMMAND                  CREATED          STATUS          PORTS       NAMES
c942569cbb38   quay.io/confidential-containers/podvm-docker-image   "/usr/local/bin/entr…"   51 seconds ago   Up 50 seconds   15150/tcp   podvm-nginx-dbc79c87-mqcvc-f5e0d4f7

For reference my full history on this box is:

# Create cluster
sudo apt-get -y update && sudo apt-get -y install ansible
ansible-playbook -i localhost, -c local --tags untagged ansible/main.yaml
export GOPATH="${HOME}/go"
repo="github.com/confidential-containers/operator"
repo_dir="${GOPATH}/src/${repo}"
mkdir -p $(dirname "${repo_dir}")
git clone "https://${repo}.git" ${repo_dir}
pushd $repo_dir/tests/e2e/
ansible-playbook -i localhost, -c local --tags untagged ansible/main.yaml
sudo -E PATH="$PATH" bash -c './cluster/up.sh'
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get pods
# Update docker
docker --verison
docker version
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo   "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" |   sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce
docker version
sudo groupadd docker
sudo usermod -aG docker $USER
exit
docker version
systemctl start docker.service
docker version
# Check out CAA docker provider
export GOPATH="${HOME}/go"
cloud_api_adaptor_repo="github.com/confidential-containers/cloud-api-adaptor"
cloud_api_adaptor_dir="${GOPATH}/src/${cloud_api_adaptor_repo}"
mkdir -p $(dirname "${cloud_api_adaptor_dir}")
git clone -b main "https://${cloud_api_adaptor_repo}.git" "${cloud_api_adaptor_dir}"
pushd $cloud_api_adaptor_dir
git remote add bpradipt https://github.com/bpradipt/cloud-api-adaptor.git
git fetch  bpradipt
git checkout -b docker-provider bpradipt/docker-provider
# Build docker podvm-image
export CLOUD_PROVIDER=docker
cd src/cloud-api-adaptor/docker/image
make
make image
cd ../
snap install yq
docker login quay.io/stevenhorsman
registry=quay.io/stevenhorsman make image
export KUBECONFIG=/etc/kubernetes/admin.conf
CLOUD_PROVIDER=docker make deploy
kubectl set image ds/cloud-api-adaptor-daemonset -n confidential-containers-system cloud-api-adaptor-con=quay.io/stevenhorsman/cloud-api-adaptor:dev-595688c113b52af67a294be1c0dd7cf5dd772099
kubectl get runtimeclass
kubectl get pods -A
echo '
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        io.containerd.cri.runtime-handler: kata-remote
    spec:
      runtimeClassName: kata-remote
      containers:
      - image: nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279
        name: nginx
' | kubectl apply -f -
kubectl get pods

@liudalibj (Member) left a comment


@bpradipt @stevenhorsman I figured out the root cause of why my dev machine did not work as expected.
I created the cluster using ./libvirt/kcli_cluster.sh create but set up docker on my dev machine directly; the docker engine needs to be installed on the worker node peer-pods-worker-0 (a quick check for this is sketched after the log below).
After installing docker on the worker node and pulling the image inside it, I hit the same error as @stevenhorsman reported before:

2024/04/25 14:26:56 [podnetwork] routes on netns /var/run/netns/cni-fe8183e1-ec9f-325a-62d7-bd5e1248ee13
2024/04/25 14:26:56 [podnetwork]     0.0.0.0/0 via 10.244.1.1 dev eth0
2024/04/25 14:26:56 [podnetwork]     10.244.0.0/16 via 10.244.1.1 dev eth0
2024/04/25 14:26:56 [adaptor/cloud] Credentials file is not in a valid Json format, ignored
2024/04/25 14:26:56 [adaptor/cloud] stored /run/peerpod/pods/d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1/daemon.json
2024/04/25 14:26:56 [adaptor/cloud] create a sandbox d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1 for pod nginx-5bb58f7796-7blpm in namespace default (netns: /var/run/netns/cni-fe8183e1-ec9f-325a-62d7-bd5e1248ee13)
2024/04/25 14:26:56 [adaptor/cloud/docker] CreateInstance: name: "podvm-nginx-5bb58f7796-7blpm-d4835bc5"
2024/04/25 14:26:56 [adaptor/cloud/docker] CreateInstance: instanceID: "55e08ff5777096488a5abd3be50d6c5c9693dcae0abd8080f321b2dab5490764", ip: "172.17.0.3"
2024/04/25 14:26:56 [util/k8sops] nginx-5bb58f7796-7blpm is now owning a PeerPod object
2024/04/25 14:26:56 [adaptor/cloud] created an instance podvm-nginx-5bb58f7796-7blpm-d4835bc5 for sandbox d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1
2024/04/25 14:26:56 [tunneler/vxlan] vxlan ppvxlan1 (remote 172.17.0.3:4789, id: 555002) created at /proc/1/task/12/ns/net
2024/04/25 14:26:56 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/12/ns/net
2024/04/25 14:26:56 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-fe8183e1-ec9f-325a-62d7-bd5e1248ee13
2024/04/25 14:26:56 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-fe8183e1-ec9f-325a-62d7-bd5e1248ee13
2024/04/25 14:26:56 [adaptor/proxy] Listening on /run/peerpod/pods/d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1/agent.ttrpc
2024/04/25 14:26:56 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2024/04/25 14:26:56 [adaptor/proxy] Trying to establish agent proxy connection to 172.17.0.3:15150
2024/04/25 14:26:58 [adaptor/proxy] established agent proxy connection to 172.17.0.3:15150
2024/04/25 14:26:58 [adaptor/cloud] agent proxy is ready
2024/04/25 14:26:58 [adaptor/proxy] CreateSandbox: hostname:nginx-5bb58f7796-7blpm sandboxId:d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1
2024/04/25 14:26:58 [adaptor/proxy]     storages:
2024/04/25 14:26:58 [adaptor/proxy]         mountpoint:/run/kata-containers/sandbox/shm source:shm fstype:tmpfs driver:ephemeral
2024/04/25 14:27:01 [adaptor/proxy] CreateContainer: containerID:d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1
2024/04/25 14:27:01 [adaptor/proxy]     mounts:
2024/04/25 14:27:01 [adaptor/proxy]         destination:/proc source:proc type:proc
2024/04/25 14:27:01 [adaptor/proxy]         destination:/dev source:tmpfs type:tmpfs
2024/04/25 14:27:01 [adaptor/proxy]         destination:/dev/pts source:devpts type:devpts
2024/04/25 14:27:01 [adaptor/proxy]         destination:/dev/shm source:/run/kata-containers/sandbox/shm type:bind
2024/04/25 14:27:01 [adaptor/proxy]         destination:/dev/mqueue source:mqueue type:mqueue
2024/04/25 14:27:01 [adaptor/proxy]         destination:/sys source:sysfs type:sysfs
2024/04/25 14:27:01 [adaptor/proxy]         destination:/dev/shm source:/run/kata-containers/sandbox/shm type:bind
2024/04/25 14:27:01 [adaptor/proxy]         destination:/etc/resolv.conf source:/run/kata-containers/shared/containers/d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1-1303004197fc5a0c-resolv.conf type:bind
2024/04/25 14:27:01 [adaptor/proxy]     annotations:
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-name: nginx-5bb58f7796-7blpm
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-namespace: default
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-quota: 0
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-id: d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-shares: 2
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.container-type: sandbox
2024/04/25 14:27:01 [adaptor/proxy]         io.katacontainers.pkg.oci.container_type: pod_sandbox
2024/04/25 14:27:01 [adaptor/proxy]         io.katacontainers.pkg.oci.bundle_path: /run/containerd/io.containerd.runtime.v2.task/k8s.io/d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-period: 100000
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-memory: 0
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-log-directory: /var/log/pods/default_nginx-5bb58f7796-7blpm_c9855d8f-b89d-4efd-a7e3-b7bd2b323d5e
2024/04/25 14:27:01 [adaptor/proxy]         nerdctl/network-namespace: /var/run/netns/cni-fe8183e1-ec9f-325a-62d7-bd5e1248ee13
2024/04/25 14:27:01 [adaptor/proxy]         io.kubernetes.cri.sandbox-uid: c9855d8f-b89d-4efd-a7e3-b7bd2b323d5e
2024/04/25 14:27:01 [adaptor/proxy] getImageName: no pause image specified uses default pause image: registry.k8s.io/pause:3.7
2024/04/25 14:27:01 [adaptor/proxy] CreateContainer: calling PullImage for "registry.k8s.io/pause:3.7" before CreateContainer (cid: "d4835bc53007be1d9115e73708a7eb1f277657458db6b457fc6fb6881b77ead1")
2024/04/25 14:27:02 [adaptor/proxy] CreateContainer: successfully pulled image "registry.k8s.io/pause:3.7"
2024/04/25 14:27:02 [adaptor/proxy] CreateContainer fails: rpc error: code = Internal desc = Establishing a D-Bus connection

Caused by:
    0: I/O error: No such file or directory (os error 2)
    1: No such file or directory (os error 2)
2024/04/25 14:27:02 [adaptor/proxy] DestroySandbox
2024/04/25 14:27:02 [adaptor/proxy] shutting down socket forwarder
2024/04/25 14:27:02 [adaptor/cloud/docker] DeleteInstance: instanceID: "55e08ff5777096488a5abd3be50d6c5c9693dcae0abd8080f321b2dab5490764"
2024/04/25 14:27:02 [util/k8sops] nginx-5bb58f7796-7blpm's owned PeerPod object can now be deleted
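A quick check for the precondition above (that the Docker engine lives on the worker node rather than on the dev machine) might look like this; the node name comes from the kcli setup and ssh access is assumed:

# Verify the Docker engine and the podvm image on the worker node itself
ssh peer-pods-worker-0 'docker info --format "{{.ServerVersion}}"'
ssh peer-pods-worker-0 'docker image ls quay.io/confidential-containers/podvm-docker-image'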

Review comments on:
  • src/cloud-api-adaptor/docker/README.md (two comments)
@bpradipt (Member, Author)

> 0: I/O error: No such file or directory (os error 2) […] [CreateContainer failure and DestroySandbox log quoted above]

I suspect it could be something to do with cgroup entries. We have seen similar errors earlier with kata-agent. Debugging is hard.
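A couple of hypothetical checks for the D-Bus error above (the paths are the standard systemd locations; the container id is whatever docker ps shows):

# The agent-side components need the systemd D-Bus system bus for cgroup management
docker exec <podvm-container> ls -l /run/dbus/system_bus_socket
docker exec <podvm-container> systemctl is-system-running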

Commits:
  • Add initial support to run peer-pods in a docker container. We create a container image with all the necessary components required to act as a pod VM; currently we rely on the K8s Kind image as the base image. (Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>)
  • Some ARGS were missing that were present in Fedora and RHEL. (Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>)
  • Add docker provider in entrypoint.sh. (Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>)
  • Allow installation via kustomize files. (Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>)
  • Run go mod tidy and update for all sub projects; also update the base golang version to go1.21 for the sub projects. (Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>)
@bpradipt (Member, Author)

@stevenhorsman @liudalibj I have addressed all your comments.
@liudalibj I don't currently know the cause of the error you are seeing.

@stevenhorsman (Member) left a comment


I think this is good enough to merge in. There are some question marks about failures we've seen, but we don't have an easy debug path at the moment, and we have got things working, so I think it's enough to get merged as it's primarily a developer option. I guess going forward it might be good to see some e2e tests to help ensure stability. Thanks for the idea and execution @bpradipt

@bpradipt bpradipt merged commit 313b626 into confidential-containers:main Apr 25, 2024
18 checks passed
@bpradipt bpradipt deleted the docker-provider branch April 25, 2024 16:46