
Log-Receiver: Investigation #3567

Closed
2 tasks done
Tracked by #3566
Rotfuks opened this issue Jul 10, 2024 · 26 comments
Assignees
Labels
blocked needs/refinement Needs refinement in order to be actionable team/atlas Team Atlas

Comments

@Rotfuks
Contributor

Rotfuks commented Jul 10, 2024

Motivation

We need to make sure customers can receive logs from sources outside their installations. To do this, we first need to find out how exactly we can achieve it: do we need a new component, can we reuse an existing one, or do we have to build our own?

Investigation

  • Find out how we can best receive logs from different sources outside the installation (and get them into Loki)
    • The Log Receiver has to be on the MC
    • The Log Receiver should have minimal resource consumption
    • Customers should be able to configure the sources the logs come from in self service
  • Discuss the possible solutions with the team

Outcome

  • We have a concept in place on how we will implement the log receiver.
@QuentinBisson

QuentinBisson commented Jul 16, 2024

Hey @giantswarm/team-atlas, so I've been doing some investigation on this topic for a while as I was getting into OpenTelemetry, and I think we could do it in multiple ways:

  • Expose Loki to the internet, with all the security risks that come with it. We would need to find a way to create API keys on demand, but I think it should be quite easy to add some kind of ApiKey CRD or a Tenant CRD that would result in the creation of an API key to send logs to Loki.
  • Expose an OpenTelemetry gateway (in the form of Alloy, of course) that would act as a gateway to receive logs from different tenants through the OpenTelemetry protocol only. This means we could allow different authorization schemes for those apps, the same way we do it for the multi-tenant-proxy (the current OTel receiver supports OAuth2 and basic auth), but we would need to find a way to get the tenant information anyway. We could also use an ApiKey CRD (see above) and generate the gateway config from the observability operator anyway.
    • Some advantages are that we could drop DDoS traffic at the gateway, and that if we later want to ingest traces, profiles, and maybe external metrics, the gateway would already be there. It also means we start playing with OpenTelemetry, which is quite nice :)
    • But this would mean a bit of new things to learn
  • Expose fluent-bit as a log receiver and push logs to Loki.

I would definitely be in favor of solution 2 because I think it's the most useful one future-wise, but it will most likely take longer.

Now, to my point about the API keys: that is something we could start thinking about today. Do we think it makes sense to move to some kind of PKI for this?

@QuentinBisson QuentinBisson self-assigned this Jul 16, 2024
@QuentinBisson QuentinBisson added blocked needs/refinement Needs refinement in order to be actionable labels Jul 17, 2024
@Rotfuks
Contributor Author

Rotfuks commented Jul 22, 2024

I would also support the second solution because it's the most secure and future-proof approach. I don't want to make the platform less secure and bring in a legacy agent we wanted to get rid of for such a niche feature.
So what do we need to find out to have a full concept? Do we want to create follow-up tasks, or is it fine to do it here? The next step for me would be to have a rough concept of the target state we want to achieve.

@QuentinBisson

Yes, I need to draw something so we can discuss it as a team tomorrow and find the end state we all want. I will try to do it later tonight to explain where I think our observability platform should go to be able to support more features and ideally OTel out of the box. I wanted to do it today, but life got in the way.

Once we have this, we can agree on steps we want to do in the implementation phase :)

@QuentinBisson

Here is the schema:

[Image: schema of the proposed observability platform]

I'm not adding anything related to secret management, but it should be in there.

@QuentinBisson

QuentinBisson commented Jul 25, 2024

@giantswarm/team-atlas
As a rough plan for this, what I am envisioning is to:

  1. Deploy an instance of Alloy on the MC acting as an OTLP receiver to be able to receive logs. I'm not sure how auth would work with the receivers (both OTLP and loki.source.api), so this would require a gateway in front that checks for auth. Could be the multi-tenant gateway :) The OTLP receiver with include_metadata ensures headers are forwarded in the pipeline context (useful for the tenant id). A rough config sketch follows after this list.
  2. Define a new CRD (not sure what to name it, maybe source.observability.giantswarm.io, but it can get another name later, like datasource, apikey, datacollector, whatever works) to define a source of data that is managed by the observability operator. The idea would be that when we create it, it creates an API key secret (linked in the CR status) for the source of data, so customers could use this CR to get a secret; we could use this for Teleport logs.
    Idea for improvement: the observability operator could create a source CR for each WC, and the logging operator would use the created secret as a source instead of also creating secrets.
  3. Configure the Alloy gateway ingress to check the secrets in the headers and have logs in Loki, which I would assume means we need the multi-tenant-proxy to be deployed as a standalone component outside of the loki namespace. This also means the mtp should not set the tenant anymore; it should come in as the X-Scope-OrgID header from Promtail?
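A minimal sketch of what the gateway in step 1 could look like in Alloy, assuming we only convert OTLP logs and push them to Loki; the component labels and the Loki URL are placeholders, and auth is deliberately left out since that part is still open:

```alloy
// Hypothetical Alloy gateway config: receive OTLP logs over gRPC and HTTP,
// keep request metadata so headers (e.g. the tenant id) stay available in the
// pipeline, and push everything to Loki.
otelcol.receiver.otlp "gateway" {
  grpc {
    include_metadata = true
  }
  http {
    include_metadata = true
  }
  output {
    logs = [otelcol.exporter.loki.default.input]
  }
}

// Convert OTLP logs into Loki entries.
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.example/loki/api/v1/push" // placeholder endpoint
  }
}
```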

Maybe you can think of an easier way to make this work for now?
For instance, I assume we could leave the multi-tenant gateway where it is now, have the observability operator configure the API key, and have the OTLP receiver send logs to Loki bypassing the gateway, so we can move forward and do the better solution later, but I'm not sure I like this.

The main idea here is to make sure @QuantumEnigmaa and @TheoBrigitte can work on the implementation phase if we think this is legit :)

@Rotfuks
Contributor Author

Rotfuks commented Jul 25, 2024

Can we shift the perspective slightly and look at it from a customer-journey perspective as well?
With that setup and with our fully self-service platform mindset, how would the customer then configure new sources of data from outside? By extending the CRD or by creating a new CR for every data source? And would that CR then live in the observability folder of their installation repo?

@QuentinBisson

QuentinBisson commented Jul 25, 2024

In that journey, they could create the CR with whatever name they want, and the operators would generate a secret they would need to fetch in order to configure their log shipper. It's the best we can do without any UI integration.

@Rotfuks
Contributor Author

Rotfuks commented Jul 25, 2024

So they need a log shipper that sends the data to Alloy, which only receives but doesn't scrape? :)

For example: customer A wants to get the logs of a cloud-service database that is connected to their app in the cluster. So they set up fluent-bit or whatever tool they like with access to the DB, then they create the cloudDB CR, which generates a secret. Now where do they access that secret?

Once they've accessed it and have the secret, they add it to fluent-bit together with the target to send to (where do they get that target?) and finally, babam, logs in Grafana?

@QuentinBisson

So yes, they create a CR on the MC, they check the status of that CR on the MC to get the name of the secret, and they get that secret value on the MC as well. Maybe it's not the best user journey, but I'm not sure any other would be approved by security.

Once they have the secret, they should send data to our Alloy on the MC, which is one of the main reasons why a single ingress for observability would be helpful.

@marieroque

I like the idea of the gateways: observability-gateway (Alloy for now IIUC) in the MC and o11y-data-gateway in the WC.

I'm fine with the Source CRD to allow customers to add new data sources and get credentials to send their data to us.

I like the way you propose to configure the observability bundle.

The tenant/organization configuration is still not clear to me.

The topology type is a nice idea, but I'm not sure it's the priority.

@QuentinBisson

QuentinBisson commented Jul 26, 2024

So coming back to use cases for @Rotfuks, because I'm on my laptop today and I can explain better :D

I'll call the CRD Omega because I don't want to bias anyone's opinion, not even my own, on how we should name it.

How to send logs to our Managed Loki

1. Generate an API Key

  • Create a CR of Omega on the management cluster (via GitOps or via an operator); see the hypothetical sketch after the remarks below
  • Watch the status of the Omega CR to get the name of the secret to read the API key from, because the Omega CR content would be in clear text
  • Go fetch the API key

Remarks

  • Maybe we could use a PKI instead of an operator, but I'm not sure it makes sense yet
  • I would think secret rotation would be quite easy to do: delete the secret, and let the operator recreate it if the one referenced in the status does not exist
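Since the CRD does not exist yet, the following is a purely hypothetical sketch of what an Omega CR and the status the operator writes back might look like; the group, version, field names, and namespace are all made up for illustration:

```yaml
# Hypothetical Omega CR: the customer creates the spec, the operator fills the status
# with a pointer to the generated API key secret.
apiVersion: observability.giantswarm.io/v1alpha1   # placeholder group/version
kind: Omega                                        # placeholder kind from this discussion
metadata:
  name: cloud-db-logs
  namespace: org-customer
spec:
  tenant: customer-a                   # hypothetical: tenant the generated key is scoped to
status:
  secretName: cloud-db-logs-api-key    # hypothetical: Secret holding the API key
```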

2. Configure the application:

  • Customers would need to configure the API key, the OTLP gateway endpoint, and the tenant-id header in their log shipper (as long as it supports OTLP); Teleport would fall under that category. A sketch of what this could look like follows after the remarks.
  • Operators get the secret from the MC and configure the logging agents on the WCs

Remarks

  • I would advocate that we expose another header for multi-tenancy at this point (something like X_GiantSwarm_Tenant_ID instead of X_Org_ID, for abstraction and obfuscation purposes), but this could make the migration of the agents a bit harder?
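As a rough illustration of the customer side, here is a sketch assuming the customer runs an OpenTelemetry Collector (any OTLP-capable shipper would do); the endpoint, the header names, and the environment variable holding the API key are placeholders, since the auth scheme and the tenant header are still being discussed:

```yaml
# Hypothetical customer-side OpenTelemetry Collector config.
receivers:
  filelog:
    include: [/var/log/cloud-db/*.log]   # whatever the external source writes

exporters:
  otlphttp:
    endpoint: https://otlp-gateway.example.giantswarm.io   # placeholder gateway endpoint
    headers:
      X-Api-Key: ${env:GS_API_KEY}        # placeholder header; final auth scheme TBD
      X-Scope-OrgID: customer-a           # placeholder; tenant header name still debated

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]
```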

3. Go to grafana and see your logs :)

Remarks

It could be nice to have a view of that pipeline as some kind of blocks in Grafana, I guess, like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked, if possible. But we can improve with customer feedback.

@Rotfuks maybe something for you:
It would be nice to check with honeybadger whether the generated secret could be sent via Flux to the GitOps repo (encrypted with SOPS), but I think it's highly unlikely; and maybe check with shield whether they think this is a good way to generate API keys?

@Rotfuks this is definitely something for you:
@stone-z brought up that us ingesting customer logs will need to be discussed regarding ISO

@QuentinBisson

Interesting idea that came from a discussion with honeybadger: we should probably use the external-secrets operator to push the secret back to customers (https://external-secrets.io/latest/api/pushsecret/) and most likely to create the API key as well (https://external-secrets.io/latest/api/generator/password/), which means less code for us.
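A minimal sketch of how that could fit together with External Secrets Operator, assuming ESO is installed and that a SecretStore named customer-store exists that the customer can read from; all resource names are illustrative:

```yaml
# Generate a random API key...
apiVersion: generators.external-secrets.io/v1alpha1
kind: Password
metadata:
  name: omega-api-key-generator
spec:
  length: 42
  symbols: 0
  noUpper: false
  allowRepeat: true
---
# ...materialize it as a Kubernetes Secret (key "password")...
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: omega-api-key
spec:
  refreshInterval: "0"          # generate once instead of rotating on every refresh
  target:
    name: omega-api-key
  dataFrom:
    - sourceRef:
        generatorRef:
          apiVersion: generators.external-secrets.io/v1alpha1
          kind: Password
          name: omega-api-key-generator
---
# ...and push it to a store the customer can access.
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: omega-api-key-push
spec:
  refreshInterval: 1h
  secretStoreRefs:
    - name: customer-store      # hypothetical store on the customer side
      kind: SecretStore
  selector:
    secret:
      name: omega-api-key
  data:
    - match:
        secretKey: password
        remoteRef:
          remoteKey: omega-api-key
```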

@QuentinBisson

Also, maybe a better idea: let's not use API keys but OIDC in front of the gateway, so tokens are rotated?

@QuentinBisson

QuentinBisson commented Jul 26, 2024

Let's wait for feedback from @giantswarm/team-bigmac https://gigantic.slack.com/archives/C053JHJC99Q/p1722015045075429

@Rotfuks
Contributor Author

Rotfuks commented Aug 15, 2024

Alright, BigMac is sadly completely overloaded with topics already. Let's talk once you're back about what exactly we need from BigMac and how we can reduce the dependency on them. Maybe we can boil it down to a kickoff workshop so we can do the PoC on our own. I'll discuss it further.

@TheoBrigitte
Member

TheoBrigitte commented Aug 20, 2024

I like the idea of using Alloy as our OpenTelemetry gateway; it supports a wide variety of receivers (OpenTelemetry, Datadog, Jaeger, Kafka, etc.). But I would also like to know what the use cases are and which receivers we should support.
I am also unsure how it would perform compared to exposing Loki or Mimir directly, but that would also limit our capabilities in terms of receivers and would expose critical services like Mimir to the outside.

I would also be interested in defining a high-level user journey with this new solution.

1. Generate an API Key

Remarks

  • I would think secret rotation would be quite easy to do: delete the secret, and let the operator recreate it if the one referenced in the status does not exist

I would rather have the user create a new Omega CR to get a new API key than have them delete the secret attached to the current Omega CR; that makes things complicated IMO.

2. Configure the application:

  • Customers would need to configure the API key, the OTLP gateway endpoint, and the tenant-id header in their log shipper (as long as it supports OTLP); Teleport would fall under that category

We first need to figure out how we implement authentication and how we add support for the different protocols (http, grpc, thrift_http, and the like).

  • Operators get the secret from the MC and configure the logging agents on the WCs

What do you mean by this?

3. Go to grafana and see your logs :)

I think it would be good to point users to where and how they can visualize their data in Grafana.

Remarks

It could be nice to have a view of that pipeline as some kind of blocks in Grafana, I guess, like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked, if possible. But we can improve with customer feedback.

This would be a view for us, for debugging purposes, right?

@QuentinBisson

QuentinBisson commented Aug 20, 2024

I like the idea of using Alloy as our OpenTelemetry gateway; it supports a wide variety of receivers (OpenTelemetry, Datadog, Jaeger, Kafka, etc.). But I would also like to know what the use cases are and which receivers we should support. I am also unsure how it would perform compared to exposing Loki or Mimir directly, but that would also limit our capabilities in terms of receivers and would expose critical services like Mimir to the outside.

I would also be interested in defining a high-level user journey with this new solution.

So originally, at least for receiving logs, I wanted to reduce the surface by only opening the default OTLP ports (HTTP and gRPC), as they are usually supported pretty well, and we can see over time whether we need to open more things.
The main advantage is that the gateway would act as an authentication proxy and reduce the attack surface by a lot because, well, we would have only one ingress.

1. Generate an API Key

Remarks

  • I would think secret rotation would be quite easy to do: delete the secret, and let the operator recreate it if the one referenced in the status does not exist

I would rather have the user create a new Omega CR to get a new API key than have them delete the secret attached to the current Omega CR; that makes things complicated IMO.

We can discuss that at the end of the week for sure. We are currently having discussions to see whether we can use OIDC instead of API keys to actually make it more secure, so I did not spend too much time investigating this. The recreation part is also there because our operators use the secret, so we cannot really create a new secret that easily.

2. Configure the application:

  • Customers would need to configure the API key, the OTLP gateway endpoint, and the tenant-id header in their log shipper (as long as it supports OTLP); Teleport would fall under that category

We first need to figure out how we implement authentication and how we add support for the different protocols (http, grpc, thrift_http, and the like).

I linked this page in the early discussions, https://grafana.com/docs/alloy/latest/reference/components/otelcol/otelcol.receiver.otlp/, which explains how to enable the OTLP receiver, and there are also extra components with auth in them, such as https://grafana.com/docs/alloy/latest/reference/components/otelcol/otelcol.auth.bearer/, but ongoing discussions are going towards OIDC and either Dex or SPIFFE/SPIRE, and we need to see if that is achievable in a reasonable timeline, which I highly doubt.

  • Operators get the secret from the MC and configure the logging agents on the WCs

What do you mean by this?

This was me explaining how our operators would also use the Omega CRD.

3. Go to grafana and see your logs :)

I think it would be good to point users to where and how they can visualize their data in Grafana.

Yes, and I hope this gets built into Backstage.

Remarks

It could be nice to have a view of that pipeline as some kind of blocks in Grafana, I guess, like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked, if possible. But we can improve with customer feedback.

This would be a view for us, for debugging purposes, right?

For us and customers, yes. Kinda like the Alloy UI but in Grafana :)

@stone-z
Contributor

stone-z commented Aug 20, 2024

Just to echo my comments in the internal threads, I think a homegrown API key mechanism is the wrong way to go here. The tools exist in the ecosystem to use an identity-based authn/z scheme. Aside from being more secure, we already have other use cases for it, it is a great platform feature, and it ends up being less work in the long run anyway

@QuentinBisson

I totally agree here @stone-z, but you know I'm a bit skeptical when it comes to a possible timeline for, say, SPIRE :D

@QuentinBisson

QuentinBisson commented Aug 27, 2024

@QuentinBisson

If we cannot have SPIRE running, we could theoretically use client certificates at the ingress: https://kubernetes.github.io/ingress-nginx/examples/auth/client-certs/
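A minimal sketch of what that could look like, following the linked ingress-nginx client-certificate example; host names, secret names, and the backend service/port (assuming Alloy's default OTLP HTTP port 4318) are placeholders:

```yaml
# Hypothetical Ingress enforcing mTLS in front of the OTLP gateway.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: otlp-gateway
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
    nginx.ingress.kubernetes.io/auth-tls-secret: "monitoring/otlp-gateway-ca"   # secret holding the client CA (ca.crt)
    nginx.ingress.kubernetes.io/auth-tls-verify-depth: "1"
    nginx.ingress.kubernetes.io/auth-tls-pass-certificate-to-upstream: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [otlp.example.giantswarm.io]
      secretName: otlp-gateway-tls
  rules:
    - host: otlp.example.giantswarm.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: alloy-gateway    # placeholder service name
                port:
                  number: 4318         # OTLP HTTP default port
```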

@QuentinBisson

@Rotfuks I think the investigation is done if we treat security as a next step, right?

  • We will have an OTLP gateway with minimal resources
  • Customers can define whatever tenants they want once the write-path POC is done

And we can figure out security in the implementation phase?

I'm asking because I don't know where to go from here.

@Rotfuks
Contributor Author

Rotfuks commented Sep 9, 2024

Do you have a list of the security risks we have to figure out or keep in mind when we implement it?
I would say the next phase is to create an MVP and roll it out to a first installation to test it. So we update and continue with #3568. Wdyt?

@QuentinBisson

QuentinBisson commented Sep 9, 2024

I think the main thing is that we need workload identity or some kind of cert authentication in the MVP, but we can start with the implementation, deploy Alloy as a gateway, and replace our current pipeline first with OTLP and maybe the Loki protocol, then check the difference in resource usage (a rough sketch of the Loki-protocol side is below).
Apart from that, we can close.
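A minimal sketch, assuming we also keep a Loki push-protocol entry point in the Alloy gateway next to the OTLP one; the port and the loki.write target are placeholders:

```alloy
// Hypothetical: accept Loki push-protocol traffic (e.g. from Promtail or Alloy clients)
// and forward it to the same loki.write component as the OTLP path.
loki.source.api "push" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 3100
  }
  forward_to = [loki.write.default.receiver]
}
```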

@Rotfuks
Contributor Author

Rotfuks commented Sep 9, 2024

Good, I would love it if you could update the implementation ticket with your findings from here and with the steps we would take to implement the log receiver (maybe a smol architecture graphic would also help to drive what we want to achieve here?). Then we can close this. Thanks!

@QuentinBisson

See the write-up here: #3568 (comment). I'm closing this in favor of implementation-specific questions.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Sep 16, 2024