Vizier feature flags are set only at deploy time #1632

Open
aimichelle opened this issue Jul 21, 2023 · 1 comment · Fixed by #1670
Labels
area/deployment (Issues related to deployments) · area/vizier · good first issue (Good for newcomers) · kind/bug (Something isn't working) · priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@aimichelle (Member)

Describe the bug
We have a system for setting feature flags on Viziers. The flow is as follows:

  1. When the operator is deploying/updating Vizier to a cluster, it makes a request to the control plane's config-manager-service:
    rpc GetConfigForVizier(ConfigForVizierRequest) returns (ConfigForVizierResponse);
    The deployKey is included as part of the vzspec.
  2. The config manager service uses the deployKey to look up which org is attempting to deploy/update the Vizier.
  3. The org ID is then used to make a request to LaunchDarkly for the Vizier feature flags. Feature flags are set on a per-org basis, for example org A may have featureFlag1 set to true.
  4. The config manager updates the Vizier YAMLs with the appropriate feature flags set.
  5. The operator receives the Vizier YAMLs in the response from the config manager service and applies them.

However, this flow only works when the deployKey is valid, which in practice means only at deploy time. Typically, a deployKey is generated when a Vizier is deployed and is automatically deleted once the Vizier is up and running. When the Vizier later updates, the deployKey is no longer valid, so the config manager service cannot fetch the associated orgID. If it cannot get an orgID, it simply omits all feature flags from the Vizier YAMLs.
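
For context, a minimal runnable sketch of the deploy-time-only flow above (all function and type names are illustrative stand-ins, not the actual config-manager code; in the real service the flags come from LaunchDarkly and the YAML handling is more involved):

    package main

    import (
        "errors"
        "fmt"
    )

    // lookupOrgByDeployKey stands in for step 2 above: resolving the deploying
    // org from the deployKey. After the initial deploy the key is deleted, so
    // this lookup fails on every subsequent update.
    func lookupOrgByDeployKey(deployKey string) (string, error) {
        if deployKey == "" {
            return "", errors.New("deploy key is no longer valid")
        }
        return "org-a", nil
    }

    // buildVizierYAMLs mirrors steps 3-5: fetch per-org feature flags and
    // template them into the Vizier YAMLs. When the org lookup fails, the
    // flags are silently dropped, which is the bug described above.
    func buildVizierYAMLs(deployKey string, baseYAML string) string {
        orgID, err := lookupOrgByDeployKey(deployKey)
        if err != nil {
            fmt.Println("error getting org ID from deploy key, skipping feature flag logic:", err)
            return baseYAML
        }
        // In the real service the flags come from LaunchDarkly, keyed by org.
        flags := map[string]bool{"featureFlag1": orgID == "org-a"}
        return fmt.Sprintf("%s\n# feature flags: %v", baseYAML, flags)
    }

    func main() {
        fmt.Println(buildVizierYAMLs("px-deploy-key-123", "kind: Deployment")) // deploy: flags set
        fmt.Println(buildVizierYAMLs("", "kind: Deployment"))                  // update: flags dropped
    }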

Expected behavior
Regardless of whether a user is deploying or updating a Vizier, the feature flags should be set correctly based on the org that owns the Vizier. Since the deploy key is not always valid, we can instead determine the org associated with a Vizier from its Vizier ID.

  1. Add VizierID to ConfigForVizierRequest protos: src/api/proto/cloudpb/cloudapi.proto, src/cloud/config_manager/configmanagerpb/service.proto
  2. Update the operator to include the VizierID in ConfigForVizierRequest:
    req := &cloudpb.ConfigForVizierRequest{
    Note: the Vizier ID will not be available when a user is deploying Vizier for the first time, but it will be available when the Vizier is going through an update. We can adapt the existing secret-lookup pattern
    s := k8s.GetSecret(clientset, namespace, "pl-cluster-secrets")
    if s == nil {
        return errors.New("Missing cluster secrets")
    }
    if _, ok := s.Data["cluster-id"]; ok {
        clusterID = true
    }
    (which today only checks that "cluster-id" exists) to read the secret's "cluster-id" entry and get the Vizier ID; see the sketch after this list.
  3. Update the config manager service to get the org associated with a Vizier when the deployKey is not valid. Today the service bails out here:
    if err != nil || orgID == uuid.Nil {
        log.WithError(err).Error("Error getting org ID from deploy key, skipping feature flag logic")
    }
    Instead, we will need to make a request to the vzmgr service to look up the org:
    rpc GetOrgFromVizier(uuidpb.UUID) returns (GetOrgFromVizierResponse);
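
A rough sketch of what steps 2 and 3 could look like on the operator side, written against plain client-go (the operator's own k8s.GetSecret helper in the snippet above wraps the same API); the VizierID field and the proto conversion helper in the trailing comment are assumptions derived from step 1, not existing code:

    package operator

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // getVizierID reads the Vizier's ID out of the pl-cluster-secrets secret.
    // On a first-time deploy the secret (or its cluster-id key) does not exist
    // yet, so the boolean reports whether an ID was found and the caller can
    // fall back to the deploy-key-only path.
    func getVizierID(ctx context.Context, clientset kubernetes.Interface, namespace string) (string, bool) {
        s, err := clientset.CoreV1().Secrets(namespace).Get(ctx, "pl-cluster-secrets", metav1.GetOptions{})
        if err != nil {
            return "", false
        }
        id, ok := s.Data["cluster-id"]
        if !ok || len(id) == 0 {
            return "", false
        }
        return string(id), true
    }

    // The ID would then be attached to the request from step 2, along these
    // lines (the VizierID field is the one proposed in step 1; the exact
    // proto conversion helper is an assumption):
    //
    //     req := &cloudpb.ConfigForVizierRequest{ /* existing fields */ }
    //     if id, ok := getVizierID(ctx, clientset, namespace); ok {
    //         req.VizierID = utils.ProtoFromUUIDStrOrNil(id)
    //     }
    //
    // On the config-manager side (step 3), when the deploy-key lookup fails
    // the service would call the vzmgr GetOrgFromVizier RPC with this ID
    // instead of skipping the feature-flag logic entirely.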
aimichelle added the kind/bug, good first issue, area/deployment, priority/important-soon, triage/accepted, and area/vizier labels on Jul 21, 2023
vihangm pushed a commit that referenced this issue Aug 9, 2023
Summary: This PR allows the VizierID to be used when getting the org, rather than relying only on the DeployKey, when adding feature flags.

Relevant Issues: #1632

Type of change: /kind bug

Test Plan: Another PR needs to come in on the operator side to
retrieve/send the VizierID to the cloud. Getting this PR in on the cloud
side will simplify the process of testing the integration between the
two later.

---------

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
aimichelle pushed a commit that referenced this issue Aug 11, 2023
Summary: After testing the interaction between the operator and the config manager service, it turns out that the config manager service segfaults when it receives the VizierID from the operator because it cannot properly convert it to type uuid, causing the service to restart. This needs to be fixed; in the meantime these changes should be reverted.

Relevant Issues: #1632

Type of change: /kind bug

Test Plan: Make sure it still builds

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
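
For illustration only, the kind of nil-safe conversion the segfault described above calls for might look like the sketch below; the VizierID type here is a simplified stand-in, not Pixie's actual uuidpb.UUID message, and the fix that eventually landed may differ:

    package main

    import (
        "fmt"

        "github.com/gofrs/uuid"
    )

    // VizierID is a stand-in for a proto-generated UUID message; the real
    // uuidpb.UUID layout differs, but the defensive checks are the point.
    type VizierID struct {
        Data string
    }

    // uuidFromProto converts the incoming message defensively: a nil message
    // or an unparseable string yields uuid.Nil instead of a panic/restart.
    func uuidFromProto(pb *VizierID) uuid.UUID {
        if pb == nil {
            return uuid.Nil
        }
        id, err := uuid.FromString(pb.Data)
        if err != nil {
            return uuid.Nil
        }
        return id
    }

    func main() {
        fmt.Println(uuidFromProto(nil))                           // zero UUID, no panic
        fmt.Println(uuidFromProto(&VizierID{Data: "not-a-uuid"})) // zero UUID, no panic
        fmt.Println(uuidFromProto(&VizierID{Data: "123e4567-e89b-12d3-a456-426614174000"}))
    }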
vihangm pushed a commit that referenced this issue Aug 15, 2023
Summary: This PR allows the VizierID to be used when getting the
Vizier's org as opposed to only relying on the DeployKey when adding
feature flags. This PR also contains changes to prevent a segmentation
fault that occurred previously.

Relevant Issues: #1632

Type of change: /kind bug

Test Plan: Skaffold the cloud changes on the testing cluster and then
skaffold the operator changes on a dev cluster. Manually trigger an
update on the operator side to send the VizierID to the config manager
service. Check the config manager service's logs to see that the new
logic is being executed as expected to handle the VizierID.

---------

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
copybaranaut pushed a commit to pixie-io/pxapi.go that referenced this issue Aug 15, 2023
(Copybara mirror of the previous commit: same summary, test plan, and sign-off.)
GitOrigin-RevId: 87180347a8b9bfca31b9b561712b64d29bd97807
JamesMBartlett pushed a commit that referenced this issue Aug 16, 2023
Summary: This PR adjusts the cloud side to accept a type uuidpb.UUID for
the VizierID instead of a string.

Relevant Issues: #1632

Type of change: /kind bug

Test Plan: Skaffold the cloud changes on the testing cluster and then
skaffold the operator changes on a dev cluster. Manually trigger an
update on the operator side to send the VizierID to the config manager
service. Check the config manager service's logs to see that the new
logic is being executed as expected to handle the VizierID.

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
JamesMBartlett pushed a commit that referenced this issue Aug 16, 2023
…1670)

Summary: This PR lets the operator get the VizierID from the cluster's secrets and send it to the config manager service, which ultimately uses it to set feature flags.

Relevant Issues: Fixes #1632

Type of change: /kind bug

Test Plan: Skaffold the cloud changes on the testing cluster and then
skaffold the operator changes on a dev cluster. Manually trigger an
update on the operator side to send the VizierID to the config manager
service. Check the config manager service's logs to see that the new
logic is being executed as expected to handle the VizierID.

---------

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
@ddelnano ddelnano reopened this May 13, 2024
ddelnano added a commit that referenced this issue May 13, 2024
#1899)

Summary: Revert change that causes `px deploy` to time out on initial deployment

This reverts #1670, which creates a circular dependency between the
operator and the cloud connector service. The cloud connector service is
responsible for [registering a
vizier](https://github.com/pixie-io/pixie/blob/fb18345808e6fbd8aed02dac8163469322b84205/src/vizier/services/cloud_connector/bridge/server.go#L290-L302)
and populating the `pl-cluster-secrets` k8s secret with its vizier ID.
The operator is responsible for creating the vizier services (including
the cloud connector), so [this
call](https://github.com/pixie-io/pixie/blob/fb18345808e6fbd8aed02dac8163469322b84205/src/operator/controllers/vizier_controller.go#L503)
will never succeed on a fresh install.

This bug doesn't cause pixie installs to fail completely. The `px deploy` CLI command times out after 10 minutes, and the vizier is deployed once that timeout expires. While the install eventually converges to a healthy vizier, this is a poor user experience. The perf tool also hits this problem, and because it requires `px deploy` to return a successful status code, it fails completely.

Relevant Issues: Reopens #1632

Type of change: /kind bug

Test Plan: Reverted this change and verified that a `skaffold`'ed operator doesn't time out

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
@ddelnano (Member)

@kpattaswamy FYI this was reverted because it was causing px deploys on fresh clusters to time out in the v0.1.5 operator (despite the deploy eventually succeeding). See #1899 for more details.
