Evaluate resources usage and cost for monitoring #3161

Closed
Tracked by #3039
QuentinBisson opened this issue Jan 22, 2024 · 7 comments

@QuentinBisson

QuentinBisson commented Jan 22, 2024

Towards #3039

Let's evaluate the cost of running mimir vs the cost of running 1 Prometheus per cluster:

  • current prometheus cost
  • mimir cost
  • prometheus-agent transfer cost

Goal is to create a cost dashboard similar to the one we built for Loki

Let's first check if this dashboard still works giantswarm/dashboards#305

@QuentinBisson
Author

The mimir cost dashboard is missing the "samples sent to S3" metric, and the other one is getting fixed in giantswarm/dashboards#441

But I'm a bit scared about the number of network requests and the current memory usage for just 1 MC. Maybe we can reduce that?

@QuantumEnigmaa

I looked at the cost estimate dashboards for both mimir and prometheus on golem over a 1-week period, and this is what I found:
image

  • Storage cost can be overlooked for now: the monthly cost of our current S3 bucket on golem is around 1 euro (we use the Standard storage class, which costs $0.023 per GB per month for the first 50 TB of data, with the per-GB price decreasing as total storage increases), and the prometheus storage cost is around 0, as its PVC uses a gp3 EBS volume, for which IOPS are free under 3000 IOPS.
  • The EC2 instances used for the worker nodes are r6i.2xlarge (8 vCPUs and 64 GiB of memory), meaning that over the last week, mimir alone has used the equivalent of an entire instance.
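
The arithmetic behind these two points can be sketched roughly as follows. This is only a back-of-the-envelope helper: the S3 price and the r6i.2xlarge size come from the points above, while the function names and the sample inputs are hypothetical and not measured on golem. It ignores S3 request and transfer charges and the cheaper pricing tiers above 50 TB.

```python
# Back-of-the-envelope cost helpers (a sketch, not the dashboard's logic).

S3_STANDARD_USD_PER_GB_MONTH = 0.023  # first 50 TB tier, as quoted above
R6I_2XLARGE_VCPUS = 8                 # r6i.2xlarge, as quoted above
R6I_2XLARGE_MEM_GIB = 64


def s3_monthly_cost_usd(stored_gb: float) -> float:
    """Monthly storage cost in USD, first pricing tier only."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * S3_STANDARD_USD_PER_GB_MONTH


def gb_for_budget(budget_usd: float) -> float:
    """How many GB fit in a monthly budget at the first-tier price."""
    return budget_usd / S3_STANDARD_USD_PER_GB_MONTH


def instance_equivalents(cpu_cores: float, mem_gib: float) -> float:
    """Number of r6i.2xlarge instances needed, driven by the scarcer resource."""
    return max(cpu_cores / R6I_2XLARGE_VCPUS, mem_gib / R6I_2XLARGE_MEM_GIB)


# Hypothetical inputs, for illustration only.
print(f"100 GB stored: ${s3_monthly_cost_usd(100):.2f} per month")
print(f"$1 per month buys about {gb_for_budget(1.0):.1f} GB")
print(f"4 cores / 64 GiB: {instance_equivalents(4, 64):.2f} instance(s)")
```

Taking the maximum of the CPU and memory ratios reflects that a workload fills an instance as soon as either resource is exhausted.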

image

The graph shows a huge spike in resource consumption for mimir, beginning on 03/19 for memory and on 03/20 for CPU, with the mimir-store-gateway CPU consumption rising from 0.25 to 4.5 CPUs.

We should definitely investigate whether that much CPU is really needed for the store-gateway.

image

In comparison, the resource usage of prometheus is quite steady (see above).

  • Another important point is network transfer, which is far higher for Mimir than for the Prometheus instances. This could be expected, since Mimir has to send data to and fetch data from the object storage, but we could still take a look to see whether there are savings to be made.

@QuentinBisson
Author

I'm quite bothered by the resource usage; those are really high numbers, especially the memory. I'm wondering if there is some optimization we could use. It would be interesting to attend a community meeting.

Maybe we could try the capacity planning computations from https://grafana.com/docs/mimir/latest/manage/run-production-environment/planning-capacity/ ?

They did not deploy the Single Scalable deployment model into a helm chart yet :(

@QuantumEnigmaa

I created an issue dedicated to Mimir resources usage optimization here.

What do you think remains to be done for this specific issue? Should I add more details, such as a real cost approximation (in dollars/euros)?

@QuentinBisson
Author

Could you maybe post the number of time series for Prometheus and Mimir? Or add a graph to the dashboard?

@QuantumEnigmaa

Total series for Mimir over the last 7 days (sum(cortex_ingester_active_series{cluster_id=~"($cluster)"}) by (cluster_id)) :

image

Total series for Prometheus over the last 7 days (sort_desc(sum(prometheus_tsdb_head_series{cluster_id=~"($cluster)"}) by (cluster_id))) :

image
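
For anyone wanting to pull these numbers outside Grafana, here is a minimal sketch that builds instant-query URLs for the two expressions above against the Prometheus-compatible HTTP API (`/api/v1/query`). The base URLs and the cluster regex are hypothetical placeholders, and the `/prometheus` prefix on the Mimir side is an assumption about the deployment. The Grafana `$cluster` variable is replaced by a plain regex argument.

```python
from urllib.parse import urlencode

# PromQL taken from the two queries above; "%s" stands in for Grafana's $cluster.
MIMIR_SERIES = 'sum(cortex_ingester_active_series{cluster_id=~"(%s)"}) by (cluster_id)'
PROM_SERIES = (
    'sort_desc(sum(prometheus_tsdb_head_series{cluster_id=~"(%s)"}) by (cluster_id))'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a GET URL for an instant query on a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical endpoints and cluster regex, for illustration only.
cluster_regex = "golem|.*"
print(instant_query_url("http://mimir.example/prometheus", MIMIR_SERIES % cluster_regex))
print(instant_query_url("http://prometheus.example", PROM_SERIES % cluster_regex))
```

When comparing the two results, keep in mind that summing `cortex_ingester_active_series` over all ingesters counts every replica of a series, so if replication is enabled the Mimir figure is not directly comparable to the Prometheus head series count.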

@QuentinBisson
Author

I think we can close :)

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Mar 26, 2024