Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACM-14639: Dashboard loader: retry adding folder on errors #1646

Conversation

jacobbaungard
Copy link
Contributor

Grafana sometimes fails to correctly set (default) permissions for folders added via the API. This seems to mostly affect the "Custom" folder. When this happens, while the folder is added, it's not possible for ACM users to view the folder, and its dashboards. This commit tries to detect that case, and retry adding the folder if it fails.

Additional minor changes:

  • Configure grafana to retry queries
  • Log messages from the component all starts in lower-case.

Copy link

openshift-ci bot commented Oct 7, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jacobbaungard
Copy link
Contributor Author

/test all

@jacobbaungard jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from ba7bfa2 to 5e8347e Compare October 7, 2024 13:10
@jacobbaungard
Copy link
Contributor Author

/test test-e2e

@jacobbaungard jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from 5e8347e to fd3f6bd Compare October 7, 2024 14:16
@jacobbaungard
Copy link
Contributor Author

/test test-e2e

@jacobbaungard jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from fd3f6bd to 79b6f9b Compare October 7, 2024 15:13
@jacobbaungard
Copy link
Contributor Author

/test all

@jacobbaungard jacobbaungard changed the title Dashboard loader: retry adding folder on errors ACM-14639: Dashboard loader: retry adding folder on errors Oct 7, 2024
Grafana sometimes fails to correctly set (default) permissions for
folders added via the API. This seems to mostly affect the "Custom"
folder. When this happens, while the folder is added, it's not possible
for ACM users to view the folder, and its dashboards. This commit tries
to detect that case, and retry adding the folder if it fails.

Additional minor changes:
- Configure grafana to retry queries
- Log messages from the component all starts in lower-case.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
@jacobbaungard jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from 79b6f9b to 036178d Compare October 7, 2024 15:57
@jacobbaungard jacobbaungard marked this pull request as ready for review October 7, 2024 15:57
@jacobbaungard
Copy link
Contributor Author

/test test-e2e

1 similar comment
@jacobbaungard
Copy link
Contributor Author

/test test-e2e

Previously, the grafana-dev test would create a new custom dashboard,
and test the exporting to configmap using that. However since adding
custom dashboards are not so reliable after Grafana 11 update, we change
the test to use of the default dashboards. This should improve test
reliability, and there are no difference whether we export a custom
dashboard or a default one.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
Since these are unreliable, and requires a few retries increase the test
timeout, to give a better chance of success.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
.. since we no longer add it.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
1 is slightly better than one, since 1 is usually used as a generic "the
program had an error" exit code.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
@jacobbaungard
Copy link
Contributor Author

/retest-required

Copy link
Contributor

@philipgough philipgough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jacobbaungard
Copy link
Contributor Author

@moadz you want to contest the forceful exit?

I guess it's a matter of whether we want this to silently fail or not. With this approach there's a chance Grafana availability will be impacted until at least one pod has succeeded (Grafana replicas=2 by default), but at least it's clear something is wrong. Alternatively, we fail more silently and dashboards will just be missing (in this case maybe we want a bit more retries, but it's already pretty high as is).

@moadz
Copy link
Contributor

moadz commented Oct 9, 2024

So my main question with forcing a crash is there is likely no intervention from a user that will fix this issue. And restarting the pod actually increases the likelihood this will fail again upon creation. So materially it's the same effect (we keep retrying whether through crashloop or otherwise) and if it doesn't work they will have to create a support ticket to review.

Restarting at least gives them a culprit when the dashboard doesn't show I suppose. But dashboard loader restarting grafana feels sketchy to me, so i would er on the side of just constantly retrying. If it doesn't work after N number of retries it's a big problem we will need to investigate regardless (i'm hoping the likelihood is remote).

Copy link
Contributor

@moadz moadz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % the restart discussion

@@ -40,24 +41,25 @@ const (
var (
grafanaURI = "http://127.0.0.1:3001"
// Retry on errors.
retry = 10
httpRetry = 10
dashboardRetry = 25
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
dashboardRetry = 25
maxDashboardRetry = 25

@@ -40,24 +41,25 @@ const (
var (
grafanaURI = "http://127.0.0.1:3001"
// Retry on errors.
retry = 10
httpRetry = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
httpRetry = 10
maxHttpRetry = 10

@jacobbaungard
Copy link
Contributor Author

But dashboard loader restarting grafana

It doesn't actually restart the Grafana container (just the loader sidecar) but it will stop the service from directing requests to the failing pod if it's in crashloop backoff. The two Grafana replicas we have by default, don't use the same sqllite database. Not crashing the pod, could lead to some weird behavior where sometimes the dashboard show, sometimes it doesn't depending on which pod your request is routed to - Not a big fan of this this undetermism/hidden failures that this could cause.

However, I do agree that the impact of hitting this error, is potentially much larger with failing the pod, since if by bad luck we don't manage to ever actually create the dashboard, Grafana will not be available at all, whereas if we just silently fail, the worst that can happen is a few dashboards are missing. So maybe it is a little "safer" to not crash the pod, so the impact is minimal (albeit potentially more confusing for customers imo).

- Don't forcefully exit
- bump retries
- naming nits

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Oct 9, 2024
Copy link

sonarcloud bot commented Oct 9, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
17.2% Coverage on New Code (required ≥ 70%)

See analysis details on SonarCloud

Copy link

openshift-ci bot commented Oct 9, 2024

@jacobbaungard: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/sonarcloud 045461c link false /test sonarcloud

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@moadz moadz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Oct 9, 2024
Copy link

openshift-ci bot commented Oct 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jacobbaungard, moadz, philipgough

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jacobbaungard,moadz,philipgough]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 99d6bb0 into stolostron:main Oct 9, 2024
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants