[Spike] [max 6h] Decide on cost allocation strategy - Athena vs. Cost Explorer API #4648

Closed · 17 tasks done · Tracked by #4453

@consideRatio (Contributor) commented Aug 21, 2024

This task blocks further work towards attributing costs using Athena, because Yuvi has learned about another approach that should be evaluated first. This is described in #4453 (comment):

Regardless, I think it's early enough that we should investigate this alternative to Athena.

It would involve:

  1. https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html as the source of data.
  2. An intermediate Python web server that talks to the Cost Explorer API
  3. https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/ for connecting to this from Grafana. Grafana recommends this as the replacement for https://github.com/grafana/grafana-json-datasource

There are a few major advantages over using Athena:

  1. Much easier to validate, as we aren't writing complex SQL queries but translating what we can visually do in the cost explorer into API calls.
  2. Athena is not per AWS account but at the AWS organization level, so we would have needed an intermediate layer anyway for cases when we use the 2i2c AWS organization. We wouldn't have needed this for Openscapes, but trying to use it for any of our other AWS accounts would've required an intermediate python layer for access control (so different communities can't see each other's data).

So if possible, we should prefer this method.
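
As a rough sketch of how steps 2 and 3 above could fit together, here is a minimal version of the intermediate web server, assuming Flask and boto3 (neither is a settled choice); the /costs route and its query parameter names are made up for illustration:

```python
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
ce = boto3.client("ce")  # Cost Explorer client; needs ce:GetCostAndUsage permissions

@app.route("/costs")
def costs():
    # Grafana's Infinity datasource would call this URL, passing the
    # dashboard's time range as query parameters.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": request.args["from"], "End": request.args["to"]},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return jsonify(response["ResultsByTime"])
```

The Infinity datasource would then be configured with this server's URL as a JSON source.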

We can reuse all the work we had done, except for some parts of #4546.

Next step here is to design a spike to validate this (instead of #4544). The Athena-specific issues that are subtasks of this can be closed if we are going to take this approach.

Practical spike steps

I think this has to be updated continuously as part of the spike, but the goal is to clarify and verify that it's reasonable to move towards using the Cost Explorer API.

  • Read up about the Cost Explorer API, starting from the docs Yuvi linked

  • Read up about the Grafana Infinity plugin as a datasource, from the docs Yuvi linked

    • Evaluate if the plugin is installed and/or enabled by default in our Grafana deployments
      It is a third-party plugin not installed by default, but an installed copy persists in the Grafana persistent directory we mount, so restarting the Grafana pod doesn't delete it. It needs to be updated manually.
  • Understand and clarify details of Yuvi's step 2, an intermediate Python web server that talks to the Cost Explorer API.
    My preliminary understanding is that we would opt in to deploying something from the support chart for this, and that it may need credentials set up via Terraform to access the Cost Explorer API.

    • Use of https://github.com/boto/boto3 directly seems like a given, but possibly also https://github.com/aws/aws-sdk-pandas as a higher-level helper. Could aws/aws-sdk-pandas, aka awswrangler, be worth using?
      It's not clear if we should use awswrangler, but I think the path is to assume we don't until we have a known need to manipulate the response from the Cost Explorer API before serving it back to Grafana.

    • What transformations, if any, are required by this intermediate Python web server?

      Probably not much data transformation, but it needs to expose a bridge API so that Grafana requests to populate the dashboards we want can be translated into Cost Explorer API calls and responses.

      • Will it be a live passthrough, fetching and possibly transforming relevant info on demand, or will it scrape the Cost Explorer API and then serve responses to Grafana entirely based on the scraped data?
        I think it must be a live passthrough, but it could include caching of requests. Each request to the Cost Explorer API costs 0.01 USD according to the official docs.

      • What kind of queries are expected to come from Grafana, and what kind of response is expected?
        We need to consider time ranges etc., right?

        I think the responses from the Python server should be JSON, respect time ranges passed as query parameters, and support filtering on the things relevant to filter on.

      • What kind of queries can I make to the Cost API through Python SDKs?
        I think it boils down to those listed in the CostExplorer boto3 client.
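
For concreteness, a sketch of what such a query could look like with the boto3 CostExplorer client: daily unblended cost grouped by service. The time range and grouping dimension here are illustrative choices, not decisions:

```python
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-08-01", "End": "2024-08-22"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
# Walk the nested response: one entry per day, one group per service.
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        print(
            result["TimePeriod"]["Start"],
            group["Keys"][0],
            group["Metrics"]["UnblendedCost"]["Amount"],
        )
```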

  • Exploration of existing things

    • https://github.com/electrolux-oss/aws-cost-exporter creates a Prometheus exporter, providing metrics for Prometheus to scrape. This isn't great for us: it can't look back in time, and it makes it hard to adjust queries in Grafana, because queries can only work against the already-scraped metrics. If we didn't have a metric representing cumulative monthly net spend, for example, we could end up needing to add together spend over time - which would very likely be inaccurate, and we want accuracy in anything here.

      Python dependencies: relies on boto3 and botocore, not awswrangler
      Cost API used: get_cost_and_usage as seen here

    • https://github.com/intuit/costBuddy is a big, clunky project that has been stale for ~4 years.
      It includes too much machinery: Terraform-managed infrastructure (buckets, CloudWatch, VMs, ...), config in an Excel file, management in a centralized parent account for multiple child AWS accounts, etc.

    • https://github.com/dnavarrom/grafana-aws-cost-explorer-backend is a stale old project with a Node backend for Grafana to interact with; it responded with JSON that Grafana parsed using the Grafana JSON plugin, which is now deprecated in favor of the Infinity plugin.

    • Add AWS Cost Explorer API grafana/grafana#73444 requests a datasource that works against the Cost Explorer API. I think in practice this is perhaps exactly what we need ourselves, and the creation of a Python intermediary is a way of doing that.

  • Should we create the Python intermediary as a Grafana datasource plugin, or should we let Grafana treat our Python intermediary as a JSON-providing API via the Infinity datasource?
    I don't think we should create a Grafana datasource plugin to install in Grafana; we would end up using NodeJS etc. for that, and handling a NodeJS project would be a lot of overhead for us as a team.

  • Should the Python web server mostly pass through requests to the Cost Explorer API, or should it map requests in a more hardcoded way?
    Ideally, we can avoid hardcoding a mapping between a Python web API returning JSON and the Cost Explorer API, and instead pass requests through to relevant endpoints of the Cost Explorer API.

    It seems that we can do quite a lot with raw JSON data by crafting queries against the Infinity datasource, which then post-processes the JSON response. Due to this, I think the key thing we should ensure is that the Python intermediary provides relevant JSON responses for post-processing by the Infinity datasource query.
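
A sketch of the minimal reshaping the intermediary might do: flattening Cost Explorer's nested ResultsByTime structure into flat rows that an Infinity datasource query can then filter and post-process. The row field names here are hypothetical:

```python
def flatten_results(results_by_time: list[dict]) -> list[dict]:
    """Flatten Cost Explorer's nested response into flat rows for Grafana."""
    rows = []
    for result in results_by_time:
        for group in result.get("Groups", []):
            rows.append({
                "date": result["TimePeriod"]["Start"],
                "name": group["Keys"][0],
                "cost": float(group["Metrics"]["UnblendedCost"]["Amount"]),
            })
    return rows
```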

  • Verification of feasibility

    • Determine if the Cost Explorer API seems to provide sufficient data
      I'm quite sure we aren't limited here; I consider it verified enough to proceed at this point.
    • Determine if we can reasonably get data accessed by Grafana Infinity via an intermediary Python web server
      I'm quite sure the pieces will fit together.

Definition of done

  • A decision is made with motivation on either:
    • a) moving onwards with a Cost Explorer API approach
    • b) moving onwards with an Athena approach
    • c) followup in some other way

Potential followup work not part of spike

  • If we go for the Cost Explorer API, work is needed to define/refine further tasks to be worked on.
consideRatio changed the title from "[Spike] PLACEHOLDER - decide on athena path or another strategy" to "[Spike] Decide on cost allocation strategy - Athena or new strategy" on Aug 21, 2024
consideRatio changed the title to "[Spike] [max 8h] Decide on cost allocation strategy - Athena or new strategy" on Aug 21, 2024
consideRatio changed the title to "[Spike] [max 4h] Decide on cost allocation strategy - Athena or new strategy" on Aug 21, 2024
consideRatio changed the title to "[Spike] [max 6h] Decide on cost allocation strategy - Athena or new strategy" on Aug 21, 2024
consideRatio changed the title to "[Spike] [max 6h] Decide on cost allocation strategy - Athena vs. Cost Explorer API" on Aug 22, 2024
@yuvipanda (Member)

The definition of done looks good to me, @consideRatio.

If we go for the Cost Explorer API, work is needed to define/refine further tasks to be worked on.

If this isn't part of the spike, once the spike is done can you create another issue to track this? Thanks.

@consideRatio (Contributor, Author)

Picking it up now with some initial reading at the end of my day, to be continued tomorrow.

@consideRatio (Contributor, Author)

Notes sketching a possible future implementation

  • I think defining Grafana dashboards with panels and queries can be done in isolation as long as we have a dummy JSON blob of data to work against; we can then tweak the dummy JSON blob to become live.
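
For example, a stub serving such a dummy blob, assuming the flat row shape sketched earlier; the Infinity datasource can be pointed at this stub's URL while the real bridge is built:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical dummy rows for developing dashboards before the live bridge exists.
DUMMY_COSTS = [
    {"date": "2024-08-01", "name": "Amazon Elastic Compute Cloud - Compute", "cost": 12.34},
    {"date": "2024-08-01", "name": "Amazon Simple Storage Service", "cost": 1.05},
    {"date": "2024-08-02", "name": "Amazon Elastic Compute Cloud - Compute", "cost": 11.87},
]

@app.route("/costs")
def costs():
    # Serve canned data now; swap in live Cost Explorer calls later.
    return jsonify(DUMMY_COSTS)
```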

@consideRatio (Contributor, Author) commented Aug 23, 2024

Conclusion - moving forward with Cost Explorer API

I've arrived at what I consider sufficient grounds for a decision to move ahead with the Cost Explorer API.

It seems technically very viable, and the motivation by Yuvi for using the Cost Explorer API over Athena is sufficient in my mind.

There are a few major advantages over using Athena:

  1. Much easier to validate, as we aren't writing complex SQL queries but translating what we can visually do in the cost explorer into API calls.
  2. Athena is not per AWS account but at the AWS organization level, so we would have needed an intermediate layer anyway for cases when we use the 2i2c AWS organization. We wouldn't have needed this for Openscapes, but trying to use it for any of our other AWS accounts would've required an intermediate python layer for access control (so different communities can't see each other's data).

Another positive conclusion is that it seems that we can avoid needing much complexity within the Python intermediary, and can put that complexity in the Grafana queries instead. This is because the Infinity plugin seems to allow for notable post-processing of the JSON responses. Due to this, we can probably iterate on the cost dashboards more responsively and quickly, letting the Python intermediary be a quite slim project with relatively low complexity, making it more viable for re-use by others as well.

@yuvipanda (Member)

Another positive conclusion is that it seems that we can avoid needing much complexity within the Python intermediary, and can put that complexity in the Grafana queries instead.

Given that we'll be working on https://2i2c.productboard.com/roadmap/7803626-product-delivery-flow/features/27195081 in the future, as well as possibly needing to extend this work onto GCP, and the recommendations in https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api-best-practices.html#ce-api-best-practices-optimize-costs, I'd like most of the complexity to actually be in the python layer, and not in the grafana layer. Fixing issues in Python code is also far more accessible to more team members and other open source contributors than fixing it in jsonnet + the filtering languages that the grafana plugin uses. So let's use the grafana plugin as primarily a visual display layer, and keep most of the complexity in the python code.
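
To make that concrete, one way some of this complexity could live in the Python layer is a small time-based cache in front of the Cost Explorer call, in line with the linked best practices on limiting billable API requests. This is a sketch; the one-hour TTL is an assumed value, not a decided one:

```python
import time
import boto3

ce = boto3.client("ce")
_cache: dict[tuple, tuple[float, dict]] = {}

def get_costs(start: str, end: str, ttl_seconds: float = 3600.0) -> dict:
    """Return cost data, reusing a cached response while it is fresh."""
    key = (start, end)
    now = time.monotonic()
    if key in _cache and now - _cache[key][0] < ttl_seconds:
        return _cache[key][1]
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    _cache[key] = (now, response)
    return response
```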
