KEP-2008: Graduate "Forensic Container Checkpointing" to Beta #4288

adrianreber · 2023-10-09T16:51:35Z

As defined in the existing KEP, the steps to graduate from Alpha to Beta are:

- At least one container engine has to have implemented the corresponding CRI APIs to introduce e2e tests for checkpointing.
- Enable the feature by default
- No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs have been reported since the initial release in 1.25.

One-line PR description: Graduate "Forensic Container Checkpointing" to Beta
Issue link: Forensic Container Checkpointing #2008

k8s-ci-robot · 2023-10-09T16:51:44Z

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

keps/sig-node/2008-forensic-container-checkpointing/README.md

mikebrow

a. should we move beyond forensics use case(s) before moving to beta?
b. should we add scenarios for managing security/encryption (contents include live memory in container..)
c. how do we manage the checkpoints (rm/gc)
d. do we enable use cases that require restore first?

https://github.com/kubernetes/enhancements/pull/4288/files#diff-240948f6b9e24b79601915d2508930149c894411df0d623ac8c01c46d9cc57eaR87-R92

bart0sh · 2023-10-12T14:29:26Z

/ok-to-test

kannon92 · 2023-12-08T14:28:40Z

> a. should we move beyond forensics use case(s) before moving to beta?
> b. should we add scenarios for managing security/encryption (contents include live memory in container..)
> c. how do we manage the checkpoints (rm/gc)
> d. do we enable use cases that require restore first?
>
> https://github.com/kubernetes/enhancements/pull/4288/files#diff-240948f6b9e24b79601915d2508930149c894411df0d623ac8c01c46d9cc57eaR87-R92

So @mrunalp and I discussed this.

I think that the scope of this KEP will drastically change if we try to support restore as part of it. There is a lot of interest in restore, but I think that case needs a different design. I know there were security implications, and it warrants a discussion. Could we consider promoting this KEP to beta (with the gc questions answered) and draft a new KEP to cover checkpoint/restore in more detail?

For promotion to beta, I think we should answer two questions: how do we gc checkpoints, and how is the storage they use monitored?

One suggestion we have is to consider an operator for managing checkpoints rather than putting that logic in the upstream kubelet.

kannon92 · 2024-01-25T20:55:06Z

@adrianreber Where are we on garbage collecting old container checkpoints?

adrianreber · 2024-01-26T07:46:36Z

> @adrianreber Where are we on garbage collecting old container checkpoints?

I understood that it should be done in an operator. That sounded to me like it might be independent of this PR.

kannon92 · 2024-01-26T13:56:34Z

> > @adrianreber Where are we on garbage collecting old container checkpoints?
>
> I understood that it should be done in an operator. That sounded to me like it might be independent of this PR.

So even if it's done in an operator, would we require mention of the operator? My main concern is that the kubelet is not going to monitor checkpoints, and if they fill up the disk then we effectively take down the node. I don't think we want gc in the main repo (as this is an optional feature), but we may want to have some kind of suggestion or operator that people could use.

We could document this as a known issue and suggest ways to mitigate it if your disk fills up due to checkpoints.
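To illustrate the concern that nothing watches how much disk space checkpoint archives consume, here is a minimal monitoring sketch. It is not part of the KEP, the kubelet, or any operator mentioned in this thread. The function names, the `.tar` suffix, and the size budget are assumptions chosen for illustration.

```python
from pathlib import Path


def checkpoint_usage_bytes(checkpoint_dir: Path) -> int:
    """Return the total size in bytes of all .tar archives in a directory.

    Assumption: checkpoint archives are plain .tar files stored directly
    in one directory. A missing directory counts as zero usage.
    """
    if not checkpoint_dir.is_dir():
        return 0
    return sum(p.stat().st_size for p in checkpoint_dir.glob("*.tar") if p.is_file())


def over_budget(checkpoint_dir: Path, max_bytes: int) -> bool:
    """Report whether the archives exceed a hypothetical size budget."""
    return checkpoint_usage_bytes(checkpoint_dir) > max_bytes
```

A node-local job could call, for example, `over_budget(Path("/var/lib/kubelet/checkpoints"), 10 * 1024**3)` and raise an alert before the disk fills up. The path and the 10 GiB limit in that call are assumptions, not values taken from this discussion.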

cc @mrunalp @SergeyKanzhelev

adrianreber · 2024-01-26T15:11:10Z

@kannon92 I added a short paragraph concerning garbage collection and a possible operator. Something like that?

kannon92 · 2024-01-26T15:41:09Z

@adrianreber the other thing that is crucial for beta is to fill out the PRR questions in this KEP. I think you are missing quite a few, as questions have been added since this feature was created.

I think every one of these questions needs a response to push to beta.

https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template#production-readiness-review-questionnaire

adrianreber · 2024-02-02T16:55:53Z

I created an operator which replicates the functionality I tried to bring into the kubelet: https://github.com/checkpoint-restore/checkpoint-restore-operator

At this point it is really simple. It has a parameter to define the maximum number of checkpoints for a container and if more checkpoints are created older checkpoints are deleted.
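The retention rule described above (keep at most a fixed number of checkpoints per container and delete the oldest ones beyond that) can be sketched as follows. This is not the operator's actual code; the `Checkpoint` type, its fields, and the function name are hypothetical and exist only to illustrate the policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Checkpoint:
    container: str     # hypothetical identifier, e.g. "namespace/pod/container"
    created_at: float  # creation time as a Unix timestamp
    path: str          # location of the checkpoint archive


def checkpoints_to_delete(checkpoints: List[Checkpoint],
                          max_per_container: int) -> List[Checkpoint]:
    """Return the checkpoints that exceed the per-container limit.

    For each container, the newest `max_per_container` checkpoints are
    kept and all older ones are returned for deletion.
    """
    if max_per_container < 0:
        raise ValueError("max_per_container must not be negative")

    by_container: Dict[str, List[Checkpoint]] = defaultdict(list)
    for cp in checkpoints:
        by_container[cp.container].append(cp)

    stale: List[Checkpoint] = []
    for group in by_container.values():
        newest_first = sorted(group, key=lambda cp: cp.created_at, reverse=True)
        stale.extend(newest_first[max_per_container:])
    return stale
```

Separating the decision (which archives are stale) from the deletion itself keeps the policy easy to test without touching the file system.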

> Maybe we could add a function to eviction manager in case of disk pressure to remove all checkpoints (if feature is enabled).

At first I liked the idea, but unconditionally deleting all checkpoint archives seems a bit harsh and maybe unexpected. That would mean we need some logic to decide which checkpoint archives to keep, and I am not sure we want that. If people agree that this would be a good idea, I am happy to implement it, but at this point I am not 100% convinced it should be done.

> @adrianreber the other thing that is crucial for beta is to fill out the PRR questions in this KEP. I think you are missing quite a few since this feature was created.

Thanks for pointing that out. I will look into that.

deads2k · 2024-02-05T19:26:09Z

Several PRR sections are missing: https://github.com/kubernetes/enhancements/blame/master/keps/NNNN-kep-template/README.md#L454

Please copy/paste the new template and fill out all the alpha and beta questions.

adrianreber · 2024-02-08T20:19:34Z

@kannon92 Again, thanks a lot for your suggestions. Makes it easy for me to finally understand how this needs to work. It also makes updating the text in the PR easy. Thanks.

deads2k · 2024-02-08T21:06:40Z

keps/sig-node/2008-forensic-container-checkpointing/README.md

+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+Does not apply as the enhancement will only be called when requested. Not a service.


These two need updates to indicate that the kubelet will add (or does have) metrics for the kubelet endpoints indicating usage counts and failure counts. Prior to going to beta, the exact metric names must be added.
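As a rough illustration of what "usage counts and failure counts" for an endpoint could look like, the sketch below wraps a request handler and counts calls and failures. It is not kubelet code and uses no real metrics library. The class name, the `"requests"` and `"failures"` keys, and the endpoint label are hypothetical, and the exact metric names are not given in this thread.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


class EndpointCounters:
    """Count how often an endpoint is called and how often it fails."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, endpoint: str, call: Callable[[], T]) -> T:
        """Run `call`, recording one request and, on error, one failure."""
        self.counts[(endpoint, "requests")] += 1
        try:
            return call()
        except Exception:
            self.counts[(endpoint, "failures")] += 1
            raise  # the caller still sees the original error
```

With two such counters, a failure ratio (failures divided by requests) can serve as a simple health indicator even for a feature that only runs on request.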

Reworded as suggested

keps/sig-node/2008-forensic-container-checkpointing/README.md

As defined in the existing KEP the steps to graduate from Alpha to Beta are

At least one container engine has to have implemented the corresponding CRI APIs to introduce e2e test for checkpointing.

- [ ] Enable the feature per default
- [ ] No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs have been reported since the initial release in 1.25.

Signed-off-by: Adrian Reber <areber@redhat.com>

adrianreber · 2024-02-08T21:35:20Z

@deads2k Reworked based on your suggestions.

kannon92 · 2024-02-08T23:03:25Z

/lgtm
/assign @deads2k @mrunalp

deads2k · 2024-02-08T23:53:55Z

PRR is complete for beta.

/approve

dims · 2024-02-09T01:03:42Z

/approve
/lgtm

k8s-ci-robot · 2024-02-09T01:03:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, deads2k, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [deads2k]
~~keps/sig-node/OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mikebrow · 2024-02-09T02:21:39Z

LGTM. I would like to talk about documenting how an admin can direct these checkpoints to a secure/encrypted mount whose keys can be managed by another party. I will be watching for the changes and will help get this into containerd. Cheers!

rst0git · 2024-02-09T09:58:04Z

> would like to talk about documenting how an admin can direct these checkpoints to a secure/encrypted mount the keys for which can be managed by another party.

@mikebrow we are currently working on enabling built-in support for encryption in CRIU (checkpoint-restore/criu#2297). In the current implementation, we use a certificate with a public key to encrypt the content of the checkpoint.

adrianreber · 2024-02-09T10:12:45Z

I want to thank everyone involved for all the feedback and quick turnaround to get this merged in time. Thanks!

k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels Oct 9, 2023

k8s-ci-robot requested review from dchen1107 and jeremyrickard October 9, 2023 16:51

k8s-ci-robot added the kind/kep, sig/node, and size/M labels Oct 9, 2023

adrianreber force-pushed the 2023-10-09-beta branch from 74301a9 to 9c1211a, October 9, 2023 16:52

mikebrow reviewed Oct 10, 2023

mikebrow reviewed Oct 10, 2023

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Oct 12, 2023

adrianreber mentioned this pull request Oct 19, 2023

Forensic Container Checkpointing #2008

Open

21 tasks

adrianreber force-pushed the 2023-10-09-beta branch from 9c1211a to 0ff3da6, January 26, 2024 15:10

kannon92 reviewed Jan 26, 2024

mrunalp reviewed Feb 3, 2024

adrianreber force-pushed the 2023-10-09-beta branch from 0ff3da6 to 7239c99, February 6, 2024 15:22

k8s-ci-robot removed the size/M label Feb 6, 2024

adrianreber force-pushed the 2023-10-09-beta branch from 0b30ea5 to 447f338, February 8, 2024 20:18

deads2k reviewed Feb 8, 2024

deads2k reviewed Feb 8, 2024

deads2k reviewed Feb 8, 2024

adrianreber force-pushed the 2023-10-09-beta branch from 447f338 to 67cf1ec, February 8, 2024 21:33

k8s-ci-robot assigned deads2k Feb 8, 2024

k8s-ci-robot added the lgtm label Feb 8, 2024

mrunalp approved these changes Feb 8, 2024

mrunalp approved these changes Feb 9, 2024

k8s-ci-robot assigned dims Feb 9, 2024

k8s-ci-robot added the approved label Feb 9, 2024

k8s-ci-robot merged commit 1c1793c into kubernetes:master Feb 9, 2024
4 checks passed

k8s-ci-robot added this to the v1.30 milestone Feb 9, 2024

adrianreber mentioned this pull request Feb 19, 2024

Switch 'ContainerCheckpoint' from Alpha to Beta kubernetes/kubernetes#123215

Merged

9 tasks

adrianreber deleted the 2023-10-09-beta branch February 29, 2024 15:51

muvaf mentioned this pull request Mar 27, 2024

Will cilium IPAM support specified ip address of pod? cilium/cilium#17026

Open

soltysh mentioned this pull request Apr 3, 2024

Add soltysh to prod-readiness-approvers #4566

Merged
