Graduate "Forensic Container Checkpointing" to Beta
As defined in the existing KEP, the steps to graduate from Alpha to Beta
are:

   At least one container engine has to have implemented the
   corresponding CRI APIs to introduce e2e tests for checkpointing.

   - [ ] Enable the feature by default
   - [ ] No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs
have been reported since the initial release in 1.25.

Signed-off-by: Adrian Reber <areber@redhat.com>
adrianreber committed Jan 26, 2024
1 parent f451a19 commit 0ff3da6
Showing 3 changed files with 40 additions and 7 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2008.yaml
@@ -1,3 +1,5 @@
kep-number: 2008
alpha:
  approver: "@ehashman"
beta:
  approver: "@deads2k"
37 changes: 34 additions & 3 deletions keps/sig-node/2008-forensic-container-checkpointing/README.md
@@ -125,6 +125,10 @@ message CheckpointContainerRequest {
    string container_id = 1;
    // Location of the checkpoint archive used for export/import
    string location = 2;
    // Timeout in seconds for the checkpoint to complete.
    // Timeout of zero means to use the CRI default.
    // Timeout > 0 means to use the user specified timeout.
    int64 timeout = 3;
}
message CheckpointContainerResponse {}
@@ -146,6 +150,16 @@ In its first implementation the risks are low as it tries to be a CRI API
change with minimal changes to the kubelet and it is gated by the feature
gate `ContainerCheckpoint`.

One possible risk that was identified during Alpha is that the disk of
the node requesting the checkpoints could fill up if too many checkpoints
are created. One proposed mitigation was garbage collection
of checkpoint archives. A pull request to implement garbage collection
was opened ([#115888](https://github.com/kubernetes/kubernetes/pull/115888)),
but during review it became clear that the kubelet might not be the right
place to implement checkpoint archive garbage collection, and the pull request
was closed. Currently the most likely solution seems to be to implement
the garbage collection in an operator.

## Design Details

The feature gate `ContainerCheckpoint` will ensure that the API
@@ -244,13 +258,29 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
Once CRI implementations provide the relevant RPC calls,
the e2e tests will not fail but need to be extended.

- Since the initial Alpha release, CRI-O supports the
`CheckpointContainer` CRI RPC, and the tests have been
enhanced to support CRI implementations that implement
the `CheckpointContainer` CRI RPC.

- After Kubernetes was released with the `CheckpointContainer` CRI RPC,
CRI-O was updated to support the new CRI RPC.
The tests have been enhanced to work with CRI implementations
that support the `CheckpointContainer` CRI RPC as well as
CRI implementations that do not support it. The tests also handle
the corresponding feature gate being either disabled or enabled:
<https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/checkpoint_container.go>

### Graduation Criteria

#### Alpha

-- [ ] Implement the new feature gate and kubelet implementation
-- [ ] Ensure proper tests are in place
-- [ ] Update documentation to make the feature visible
+- [X] Implement the new feature gate and kubelet implementation
+- [X] Ensure proper tests are in place
+- [X] Update documentation to make the feature visible
+  - <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - <https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/>
+  - <https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/>

#### Alpha to Beta Graduation

@@ -350,6 +380,7 @@ does not compress the checkpoint archive on disk.
* 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint`
* 2022-04-05: Added CRI API section and targeted 1.25
* 2022-05-17: Remove *restore* RPC from the CRI API
* 2023-10-09: Beta graduation in 1.30

## Drawbacks

8 changes: 4 additions & 4 deletions keps/sig-node/2008-forensic-container-checkpointing/kep.yaml
@@ -15,18 +15,18 @@ approvers:
- "@dchen1107"

# The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.25"
+latest-milestone: "v1.30"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.25"
-  beta: "v1.26"
-  stable: "v1.28"
+  beta: "v1.30"
+  stable: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled