Server metrics and monitoring #67

steinbro · 2023-11-03T01:05:35Z

In addition to serving tiles, the backend also exposes two additional API endpoints:

/metrics that reports some performance and usage stats in Prometheus format
/probe/alive that can be pinged to monitor uptime

We're not utilizing either of these, but we should be. @RDMurray, do you have any favorite dashboard or monitoring tools? Not talking about full-blown error monitoring like Sentry, just something that can render the metrics histograms and something else that can fire off an email when the heartbeat isn't responding.

Actually, now that I look at the metrics, it does report that the alive endpoint has been pinged a few thousand times since it was spun up a few weeks ago -- is this being polled by something?
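For anyone who wants to read the counters without a dashboard, here is a minimal sketch of parsing Prometheus text output with only the Python standard library. The metric name and label in `SAMPLE` are made up for illustration and are not the names the backend actually emits. The regular expression is a simplification that assumes label values contain no `}` character.

```python
import re

# Hypothetical sample of Prometheus text-format output.
# The metric name and labels are assumptions, not the backend's real ones.
SAMPLE = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{path="/probe/alive"} 4213
http_requests_total{path="/metrics"} 17
"""

# name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP or TYPE comment
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Printing the samples as plain text like this also works with a screen reader, which a rendered histogram may not.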

RDMurray · 2023-11-03T14:13:32Z

I don't have any favourite monitoring tools, but I'll look into it. Sentry does have a generous free plan for open source organizations, so that might be worth looking into. Having said that, I know very little about Sentry apart from the fact that it is popular. I have no sight at all, so I can't really give an opinion on software that renders histograms or helps to visualise data.

I'm currently monitoring the alive endpoint with Uptime Robot, which sends an email if it is down. It also has a mobile app with push notifications.
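As a rough illustration of what a heartbeat check of this kind does, here is a minimal Python sketch. It is not how Uptime Robot is implemented. The function names, the example URL, and the "three failures in a row" rule are all assumptions chosen for the example.

```python
import urllib.error
import urllib.request


def is_alive(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False


def should_alert(results, threshold=3):
    """True when the most recent `threshold` checks all failed.

    `results` is a list of booleans, oldest first. Requiring several
    failures in a row avoids alerting on a single dropped request.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # Hypothetical URL; substitute the real host serving /probe/alive.
    history = [is_alive("https://example.org/probe/alive")]
    print("alert" if should_alert(history) else "ok")
```

A real monitor would run the check on a schedule and send the email or push notification when `should_alert` returns True.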

steinbro · 2023-11-04T15:56:25Z

Thanks. For Uptime Robot, is there a way to add more users to the account, or is it easiest to just create my own monitor if I want notifications? It also looks like you can create a basic uptime page for free as well; should we do so, at least for our own maintenance purposes?

I did try tinkering with Grafana Cloud, which also provides some customizable alerts, but the documentation states "Grafana Cloud won’t accept a public URL that is not protected by authentication," ostensibly for some security reason. I'd rather keep our metrics page open, though I suppose we could have a second password-protected URL if we really wanted to make Grafana Cloud happy. But I suppose visual dashboards wouldn't be super useful for this crowd, anyway.

RDMurray · 2023-11-05T15:22:01Z

The free tier of Uptime Robot doesn't allow adding users. Hopefully it will allow you to create a monitor for the same site. Even if they don't allow that, it might work anyway because I am still monitoring newprod0.openscape.io.

There is an open source uptime monitoring service, Upptime, which uses GitHub Actions and Issues. It can poll every 5 minutes. I find the idea of spinning up a VM every 5 minutes just to do an HTTP request kind of horrible, but presumably GitHub is okay with it, so we could possibly use that.

steinbro · 2023-11-18T12:31:39Z

@RDMurray What are your thoughts on Glitchtip? It uses the Sentry API but has a much simpler UI. It also has a hosted free tier.

RDMurray · 2023-11-28T15:06:57Z

I have played with Glitchtip a bit, making a test project and sending some events and metrics. It certainly is a much simpler UI. Issues are logged in detail.

The performance monitoring seems to be very simplistic, though. I can only see the number of events and the average duration. That is with a screen reader; I don't know if there is a graph.

The free tier is only 1000 events per month and I can't see any additional offers for open source projects.

I think the Sentry open source free tier is too good to pass up, provided there are no showstoppers with the UI.

RDMurray · 2023-11-28T16:54:54Z

I also created a test project on sentry.io. It is much more comprehensive, and looks very accessible so far. Once we have a dashboard or two and some alerts set up, it should be quite usable.

RDMurray · 2023-11-30T13:10:16Z

Related to this issue, I set up an uptime monitoring service, Uptime Kuma, at uptime.mur.org.uk, which can currently be accessed by the team @soundscape-community/backend. There is a public status page at soundscape-status.mur.org.uk.

It is self-hosted, but simple enough to manage and not a critical service.

Let me know what you think.

steinbro · 2023-12-01T13:10:05Z

Nice! Together with the Slack integration, I'm satisfied with this level of monitoring for the tile service. I'd say the next priority is monitoring the ingest service (#71), and since you suggested here that we use Sentry for that, I'll leave this issue open.

steinbro assigned RDMurray Nov 3, 2023

steinbro added the enhancement (New feature or request) and infrastructure (Related to backend services and other remote servers) labels Nov 3, 2023

steinbro mentioned this issue Nov 9, 2023

ingest service status visibility #71

Open
