Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus and Grafana on stats.<domain> #838

Merged
merged 24 commits into from
Feb 12, 2021
Merged

Conversation

Peetz0r
Copy link
Contributor

@Peetz0r Peetz0r commented Jan 29, 2021

The Node Exporter is missing some data because it isn't fully effective inside the container, but most of it works.

Tested in debian 10 amd64 (in a vm) and on my RockPro64 running debian testing/bullseye/11.

Demo: https://stats.glitch.im/

@pushytoxin
Copy link
Contributor

If this adds a new subdomain, docs/configuring-dns.md should be updated as well

@pushytoxin
Copy link
Contributor

You can also extend the list at docs/self-building.md since you added self-building

@aaronraimist
Copy link
Contributor

aaronraimist commented Jan 30, 2021

Maybe there should be two tables for the DNS settings in docs/configuring-dns.md. One that has the components that are enabled by default and one for things that aren't. Right now it is a bit confusing because at the top of that table it says "General outline of DNS settings you need to do" even though half of the table is optional, off by default stuff.

@Peetz0r
Copy link
Contributor Author

Peetz0r commented Jan 30, 2021

If this adds a new subdomain, docs/configuring-dns.md should be updated as well

Done!

Maybe there should be two tables for the DNS settings in docs/configuring-dns.md. One that has the components that are enabled by default and one for things that aren't. Right now it is a bit confusing because at the top of that table it says "General outline of DNS settings you need to do" even though half of the table is optional, off by default stuff.

Done!

You can also extend the list at docs/self-building.md since you added self-building

Enabling it in the first place was actually a mistake. I am new to Docker and quite rusty on Ansible, so I started just copypasting stuff from other services. Only to find out in the process how it actually works.

I found out that all images I used support amd64, arm32 and arm64. I also find out that self-building didn't even actually work. So I disabled it.

I also tested it on my (arm64) rockpro64 and everything works just as well as it did on my amd64 VM and deployed it to my main personal homeserver :)

Demo at https://stats.glitch.im/ (should be online 24/7, feel free poke me if it breaks or anything)

@Peetz0r
Copy link
Contributor Author

Peetz0r commented Jan 30, 2021

I rebased and edited the commits enabling and then disabling self-building so it's as if it never happened :)

@spantaleev
Copy link
Owner

What do you think about renaming the domain name for Grafana from stats.DOMAIN to stats.matrix.DOMAIN?

While jitsi.DOMAIN and dimension.DOMAIN are also generic, it's probably unlikely that one is hosting something else called "Dimension" for the domain, or that one is hosting a 2nd Jitsi server, so naming these jitsi.matrix.DOMAIN and dimension.matrix.DOMAIN is unnecessary.

stats. on the other hand sounds quite generic. Especially right now, it's probably difficult to use the same Prometheus/Grafana installation for non-Matrix stuff as well, so I wonder if it should be so "top-level".

On the other hand:

  • it's easier that way
  • it may very well be possible to make use of this installation for non-Matrix things
  • people who have a problem with stats.DOMAIN can override it

@Peetz0r
Copy link
Contributor Author

Peetz0r commented Jan 31, 2021

Those are all good improvements I think.

Like I said, most of this is new to me, I am learning as I go :)

@pushytoxin
Copy link
Contributor

pushytoxin commented Feb 1, 2021

I'm getting confused by some of the variables, I think there needs to be a cleanup for better defaults.

  • Is there any reason for the synapse metrics to be presented through web proxy if Grafana and Prometheus are enabled?
  • Why would someone run Grafana without Prometheus on the homeserver?
  • Why would someone run Prometheus without Grafana on the homeserver?

There might be users who collect the metrics on an external time series DB. They probably don't need Grafana nor Prometheus in the homeserver.

  1. There should be a separate variable for the stats.DOMAIN Grafana dashboard, which turns on metrics internally on port 9100, and Prometheus+Grafana
  2. There should be a separate variable for exposing /_synapse/metrics in the web proxy

Then for additional fine tuning, the admin user or the node-exporter can be configured. The admin user should also be disabled if possible unless explicitly set up by the user.

@spantaleev
Copy link
Owner

Yeah, this is opening pandora's box in a way..

One could also say that we need to hook Prometheus to the SMTP server, so people can set up Alerts, etc.

Then there's the people who will use a separate Prometheus installation and separate Grafana. They'd want to just expose /_synapse/metrics somehow. And they'd want to do it privately (behind HTTP Basic Auth, etc.). We already have support for this, but it's not documented on this page (it might be on some other docs page):

matrix_synapse_metrics_enabled: true

matrix_nginx_proxy_proxy_synapse_metrics: true
matrix_nginx_proxy_proxy_synapse_metrics_basic_auth_enabled: true
# The HTTP Basic Auth user is always `prometheus`
matrix_nginx_proxy_proxy_synapse_metrics_basic_auth_key: "password"

Then, #456 is coming along and each worker exposes its own metrics. I'm not sure how those should be handled.


I think we shouldn't get overly obsessed with this and just try to launch something. Even now, it seems good enough.

Documentation around edge cases can be improved. People who want SMTP support for alerts or want to monitor workers, etc., can also jump in and contribute in the future.

Initial attempt. Seems to work fine.

Only tested on debian amd64 so far
Basic system stats, to show stuff the synapse metrics
can't show such as resource usage by bridges, etc

Seems to work fine as well.

This too has only been tested on debian amd64 so far
Also includes the dashboards for Synapse and for Node Exporter.

Again has only been tested on debian amd64 so far, but the grafana docker image is available for arm64 and arm32. Nice.
By running it in a more privileged container with access to the host network stack and such
Using the hardcoded IP did break while I was
messing with IPv6 stuff on the other branch
@Peetz0r
Copy link
Contributor Author

Peetz0r commented Feb 10, 2021

Rebased on current master, and updated grafana and prometheus-node-exporter to latest versions.

Yes, we might need more documentation to expose how flexible this can be. But for the common usecase (just exploring performance metrics of your new shiny homeserver) it works just fine.

Peetz0r and others added 6 commits February 12, 2021 11:59
The quotes around "host" for both `--pid` and `--net` were
causing trouble for me:

> docker: --pid: invalid PID mode.

and:

> docker: Error response from daemon: network "host" not found.

I've also changed the `-v` call to `--mount` for consistency with the
rest of the playbook.
We mainly switch the anonymous metrics viewing variable
to false, along with other wording changes.
It seems like it doesn't cause any issues for any of these services.
@spantaleev spantaleev merged commit 2ac2b02 into spantaleev:master Feb 12, 2021
@spantaleev
Copy link
Owner

You rebasing made my changes disappear, but I rebased as well ;)

Our poll in the support room indicated that people prefer stats.DOMAIN over stats.matrix.DOMAIN, so we'll go with that as the default.

I've made a few minor changes and merged this! I've gotten rid of the explicit matrix_synapse_metrics_enabled: true, making the configuration a little simpler for most people. I've also documented how one can use an external Prometheus/Grafana.

Something else we might wish to do in the future is to adjust matrix_grafana_dashboard_download_urls dynamically based on the enabled services. People who don't use Synapse (when Dendrite comes along) won't need a Synapse dashboard at all, etc. So not only should we not download it, but maybe we should even try to clean up after ourselves. It doesn't seem too important now, and I'd rather not block this PR any more (because #456 is to be merged soon.. and #818 is waiting upon #456 as well, etc.), so I got it merged as-is. I believe it works great for most of us anyway!

Thanks a lot for the huge amount of work on this, @Peetz0r! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants