Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vTPM : add state management #4223

Merged
merged 3 commits into from
Sep 12, 2024
Merged

vTPM : add state management #4223

merged 3 commits into from
Sep 12, 2024

Conversation

shjala
Copy link
Member

@shjala shjala commented Sep 6, 2024

This PR add state management to vTPM, allowing the domain manager/kvm to :

  • Send purge request when the VM is being deleted. Purge request will delete the data of the particular SWTPM instance associated with the VM/App.
  • Send terminate request if kvm failed to run the VM/App. Delete kills the SWTPM instance if stalled and updates the live instances counter.

(Research has shown Friday is best day to write buggy code, so I keep it in draft state to test some more with Eden and make sure everything works as expected)

tools/mini-yetus.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you delete an app instance, do you first purge and then delete the swtpm instance?

Down the road might need garbage collection as well to handle the case when

  1. app instance is deployed - running
  2. device is powered off
  3. app instance is deleted in controller
  4. device is powered on

In that sequence the EVE agents to not see any delete handler being called; merely the absense of a create handler. We have specific logic in e.g., volumemgr to look for which volumes and content-trees (latter managed by containerd) no longer have any references from the config and gabage collect those to handle such a sequence.
Other things which store state in /persist need something similar.

Kicking off tests.

@OhmSpectator
Copy link
Member

When you delete an app instance, do you first purge and then delete the swtpm instance?

Down the road might need garbage collection as well to handle the case when

  1. app instance is deployed - running
  2. device is powered off
  3. app instance is deleted in controller
  4. device is powered on

In that sequence the EVE agents to not see any delete handler being called; merely the absense of a create handler. We have specific logic in e.g., volumemgr to look for which volumes and content-trees (latter managed by containerd) no longer have any references from the config and gabage collect those to handle such a sequence. Other things which store state in /persist need something similar.

That's a good point. As far as I understand, the current implementation will also not handle the case when we start swTPM for the same domain twice (for example, after rebooting the node), as it will try to create some files again, but they already exist. Or maybe I'm wrong?
If that's the case, will we solve it in the scope of this PR? It looks relevant...

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

(for example, after rebooting the node)

No, if a (id,pid) pair already exist in the list of instances, vTPM just returns with no action, but I'll add a comment to make it more clear, in any other case, it just checks that the required directory exists and creates them if not (no overwriting).

if _, ok := pids[id]; ok {
return fmt.Errorf("SWTPM instance with id %s already running", id)
}

@OhmSpectator
Copy link
Member

(for example, after rebooting the node)

No, if a (id,pid) pair already exist in the list of instances, vTPM just returns with no action, but I'll add a comment to make it more clear, in any other case, it just checks that the required directory exists and creates them if not (no overwriting).

if _, ok := pids[id]; ok {
return fmt.Errorf("SWTPM instance with id %s already running", id)
}

But the pids slice is empty after the board reboot... So it will pass this check.

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

But the pids slice is empty after the board reboot... So it will pass this check.

Sorry I thought you mean rebooting the VM, in that case it is still safe, vTPM calls MkdirAll which If path is already a directory, it does nothing and returns.

In no case vTPM overwrites the data, it just creates if no dir exist with ID name, and deletes the ID dir when VM is being deleted (not turned off, rebooted or crashed, only deleted).

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

But the pids slice is empty after the board reboot... So it will pass this check.

Sorry I thought you mean rebooting the VM, in that case it is still safe, vTPM calls MkdirAll which If path is already a directory, it does nothing and returns.

In no case vTPM overwrites the data, it just creates if no dir exist with ID name, and deletes the ID dir when VM is being deleted (not turned off, rebooted or crashed, only deleted).

commented in the code for brevity.

@eriknordmark
Copy link
Contributor

That's a good point. As far as I understand, the current implementation will also not handle the case when we start swTPM for the same domain twice (for example, after rebooting the node), as it will try to create some files again, but they already exist. Or maybe I'm wrong? If that's the case, will we solve it in the scope of this PR? It looks relevant...

Do we have any test cases for this?
If not it would make sense to add a test case (in addition to making sure it works manually.)
It doesn't need to be part of this PR, but I think it needs to be fixed before the next LTS when users are more likely to start using the new vTPM implementation.

@OhmSpectator
Copy link
Member

But the pids slice is empty after the board reboot... So it will pass this check.

Sorry I thought you mean rebooting the VM, in that case it is still safe, vTPM calls MkdirAll which If path is already a directory, it does nothing and returns.

In no case vTPM overwrites the data, it just creates if no dir exist with ID name, and deletes the ID dir when VM is being deleted (not turned off, rebooted or crashed, only deleted).

Thanks. @shjala ! And what about the case that @eriknordmark has described? Removal of the app that was deleted while the device was not booted. Hence, we do not have a call to handleDelete that triggers removing the files.

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

That's a good point. As far as I understand, the current implementation will also not handle the case when we start swTPM for the same domain twice (for example, after rebooting the node), as it will try to create some files again, but they already exist. Or maybe I'm wrong? If that's the case, will we solve it in the scope of this PR? It looks relevant...

Do we have any test cases for this? If not it would make sense to add a test case (in addition to making sure it works manually.) It doesn't need to be part of this PR, but I think it needs to be fixed before the next LTS when users are more likely to start using the new vTPM implementation.

We have unit test and eden test, the eden test tests both Aziot and vTPM, but still not merged because #4204 is not part of any release yet :
https://github.com/lf-edge/eden/blob/4f46f8a900b1ab7a256a686490da9923f51b362f/tests/aziot/aziot_test.go#L1-L12

I have ran both tests manually and it pass, and I'm cleaning the TPM required unit tests a bit, in another PR I'll add the vTPM unit test to that workflow.

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

Removal of the app that was deleted while the device was not booted. Hence, we do not have a call to handleDelete that triggers removing the files.

@OhmSpectator I need to dig into it and figure out how to do it, I'll do it in another PR if thats OK.

@OhmSpectator
Copy link
Member

Removal of the app that was deleted while the device was not booted. Hence, we do not have a call to handleDelete that triggers removing the files.

@OhmSpectator I need to dig into it and figure out how to do it, I'll do it in another PR if thats OK.

Ok, that makes sense.

@shjala
Copy link
Member Author

shjala commented Sep 10, 2024

@eriknordmark I'll expand the eden test to test for the case you described too.

@shjala
Copy link
Member Author

shjala commented Sep 11, 2024

I would put the TPM field in DomainStatus.VMConfig struct.

it was already part DomainStatus.VMConfig, I changed it based on your comment. I'm confused.

@shjala shjala closed this Sep 11, 2024
@shjala shjala reopened this Sep 11, 2024
@shjala
Copy link
Member Author

shjala commented Sep 11, 2024

I would put the TPM field in DomainStatus.VMConfig struct.

it was already part DomainStatus.VMConfig, I changed it based on your comment. I'm confused.

my bad, names are too similar.

@OhmSpectator
Copy link
Member

In general, I like the new approach! Clean and clear!

@OhmSpectator
Copy link
Member

I would put the TPM field in DomainStatus.VMConfig struct.

it was already part DomainStatus.VMConfig, I changed it based on your comment. I'm confused.

my bad, names are too similar.

yeah, sorry for the confusion...

Move vTPM related code from hypervisor.go to kvm.go since for now it is
only KVM specific. Add uniform vTPM state management for all hypervisors.
In addition improve vTPM state management by sending terminate and purge
requests when it is appropriate.

Signed-off-by: Shahriyar Jalayeri <shahriyar@zededa.com>
Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, looks good to me. Nevertheless, let's @eriknordmark also take a look.

Also, let's run the tests.

@OhmSpectator
Copy link
Member

Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rerun the tests

@OhmSpectator
Copy link
Member

OhmSpectator commented Sep 11, 2024

@shjala, by the way, for the case when we use the swtpm tool for the same domain after reboot, we think it will work as we store the state on /persist. But shouldn't we start the tool with the --flags not-need-init option?

I would like to clarify it before merging.

@shjala
Copy link
Member Author

shjala commented Sep 11, 2024

@shjala, by the way, for the case when we use the swtpm tool for the same domain after reboot, we think it will work as we store the state on /persist. But shouldn't we start the tool with the --flags not-need-init option?

I would like to clarify it before merging.

AFAIK in HW TPM_Init flag sets some internal TPM flags to make it operational, here it is sent by Qemu through the SWTPM control socket. I like to keep it this way to be as close to the HW implementation as possible.

@shjala
Copy link
Member Author

shjala commented Sep 11, 2024

@OhmSpectator
Copy link
Member

@shjala, by the way, for the case when we use the swtpm tool for the same domain after reboot, we think it will work as we store the state on /persist. But shouldn't we start the tool with the --flags not-need-init option?

I would like to clarify it before merging.

AFAIK in HW TPM_Init flag sets some internal TPM flags to make it operational, here it is sent by Qemu through the SWTPM control socket. I like to keep it this way to be as close to the HW implementation as possible.

Got it, thanks!

@eriknordmark
Copy link
Contributor

We have unit test and eden test, the eden test tests both Aziot and vTPM, but still not merged because #4204 is not part of any release yet :

FWIW 13.2.0 is now out and includes #4204

@shjala
Copy link
Member Author

shjala commented Sep 12, 2024

We have unit test and eden test, the eden test tests both Aziot and vTPM, but still not merged because #4204 is not part of any release yet :

FWIW 13.2.0 is now out and includes #4204

Great, I will clean the vTPM eden tests for merging.

@OhmSpectator
Copy link
Member

All green! @shjala, don't you mind if we merge it? =)

@shjala
Copy link
Member Author

shjala commented Sep 12, 2024

All green! @shjala, don't you mind if we merge it? =)

go for it thanks!

@OhmSpectator OhmSpectator merged commit 820c7d9 into lf-edge:master Sep 12, 2024
91 of 107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants