
Feature Request: hedged request between the external cache and the object storage #6712

Closed · damnever opened this issue Sep 8, 2023 · 16 comments · Fixed by #7860

@damnever
Contributor

damnever commented Sep 8, 2023

Is your proposal related to a problem?

Long-tail requests between the store-gateway and the external cache service are sometimes inevitable. Lowering the timeouts between the store-gateway and the cache service is not a proper way to address this problem.

Describe the solution you'd like

If accessing the external cache service takes too long, issue a hedged request to the object storage, since object storage nowadays has reasonable latency on average.

Describe alternatives you've considered

Additional context

@GiedriusS
Member

I had a similar idea. We could use the https://github.com/cristalhq/hedgedhttp HTTP client to implement this. On top of that, we could improve it by estimating the 90th percentile latency and automatically sending a hedged request once a request's duration exceeds it. T-Digest seems like a good option for estimating the percentiles. Ideally, we would avoid having to manually specify the threshold after which another request is sent.
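
A minimal sketch of what that percentile tracking could look like, assuming the `github.com/caio/go-tdigest` API (`New`/`Add`/`Quantile`); the type and method names are hypothetical, and the mutex is there because the digest itself is not goroutine-safe:

```go
// Hypothetical sketch: record cache-request latencies in a t-digest and
// use the live 90th percentile as the hedging delay.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/caio/go-tdigest"
)

type latencyTracker struct {
	mu     sync.Mutex
	digest *tdigest.TDigest
}

func newLatencyTracker() (*latencyTracker, error) {
	td, err := tdigest.New()
	if err != nil {
		return nil, err
	}
	return &latencyTracker{digest: td}, nil
}

// Observe records one request duration.
func (t *latencyTracker) Observe(d time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	_ = t.digest.Add(float64(d.Milliseconds()))
}

// HedgeDelay returns the current p90: a request still in flight after
// this long is a candidate for hedging.
func (t *latencyTracker) HedgeDelay() time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Duration(t.digest.Quantile(0.9)) * time.Millisecond
}

func main() {
	tr, _ := newLatencyTracker()
	for _, ms := range []time.Duration{5, 7, 9, 12, 200} { // sample observations
		tr.Observe(ms * time.Millisecond)
	}
	fmt.Println("hedge after:", tr.HedgeDelay())
}
```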

@Vanshikav123
Contributor

Hello @GiedriusS, can I work on this?

@GiedriusS
Member

@Vanshikav123 sure. Per cristalhq/hedgedhttp#52, that client now supports dynamic thresholds/durations, so it shouldn't be too hard to implement with t-digest 🤔
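
For illustration, here is roughly how that dynamic duration could be wired to such a tracker. The `Config`/`Next` shape is an assumption based on cristalhq/hedgedhttp#52 rather than a confirmed API, and `latencyTracker` is the hypothetical type from the sketch above:

```go
// Hypothetical sketch: a hedged HTTP client whose delay follows the live
// p90 from the t-digest tracker. The Config/Next shape is my reading of
// cristalhq/hedgedhttp#52 and may differ from the released API.
package hedging

import (
	"net/http"
	"time"

	"github.com/cristalhq/hedgedhttp"
)

func newHedgedClient(tracker *latencyTracker) (*hedgedhttp.Client, error) {
	return hedgedhttp.New(hedgedhttp.Config{
		Transport: http.DefaultTransport,
		Upto:      2,                     // at most one extra (hedged) request
		Delay:     50 * time.Millisecond, // fallback until the digest has samples
		// Next is consulted per request, so the delay tracks the live p90
		// and no manual threshold has to be configured.
		Next: func() (upto int, delay time.Duration) {
			return 2, tracker.HedgeDelay()
		},
	})
}
```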

@Vanshikav123
Contributor

@GiedriusS it would be a great help if you could provide me with some references for this issue.

@rahulbansal3005

Hi @GiedriusS @damnever, I am interested in working on this issue in LFX term 3.

@Zyyeric

Zyyeric commented Aug 10, 2024

Hi @GiedriusS! I am very interested in working on this issue through LFX. Just wondering, do I need to submit a proposal for the implementation?

@GiedriusS
Member

Please submit everything through the LFX website 😊

@aakashbansode2310

Hello @GiedriusS @saswatamcode,
I hope this message finds you well. My name is Aakash, an undergraduate at IIT Bombay, and I am excited to contribute to the implementation of hedged requests for reducing tail latency in Thanos.
I'm eager to help enhance the performance and reliability of Thanos and would greatly appreciate your guidance. I look forward to collaborating and making this improvement together!

@mani1911

mani1911 commented Aug 12, 2024

I am really interested in contributing to Thanos. Are there any pre-tasks that I can work on? @damnever @GiedriusS
Should I submit my proposal through a cover letter (LFX term 3)?

@saswatamcode
Member

Yes, please submit everything using the LFX website 🙂

@mani1911

> Yes, please submit everything using the LFX website 🙂

Is there any pre-task that could help me understand Thanos better? I have read up on how Thanos works.

@Zyyeric

Zyyeric commented Aug 13, 2024

@GiedriusS @saswatamcode I am a bit confused about how https://github.com/cristalhq/hedgedhttp would achieve this task. The hedged HTTP client in that implementation sends its requests to the same destination, while in this use case the first and second requests would go to different services: the external cache service and the object storage, respectively. Would a timeout-monitoring mechanism make sense, where we start the first request and then send the second one using the same HTTP client if latency > t-digest.Quantile(90)?
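
For concreteness, the mechanism described above might look roughly like this; `fetchFromCache`, `fetchFromObjstore`, and the `hedgeDelay` parameter (fed by the t-digest p90) are hypothetical placeholders, not Thanos APIs:

```go
// Hypothetical sketch of the timeout-monitoring idea: start the cache
// fetch and, if it is still in flight after the tracked p90 delay, race
// it against object storage; whichever responds first wins.
package hedging

import (
	"context"
	"time"
)

// Hypothetical stand-ins for the real cache and object storage clients.
func fetchFromCache(ctx context.Context, key string) ([]byte, error)    { return nil, nil }
func fetchFromObjstore(ctx context.Context, key string) ([]byte, error) { return []byte("ok"), nil }

func hedgedFetch(ctx context.Context, key string, hedgeDelay time.Duration) ([]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	ch := make(chan result, 2) // buffered so the losing goroutine doesn't leak

	go func() {
		data, err := fetchFromCache(ctx, key)
		ch <- result{data, err}
	}()

	select {
	case r := <-ch: // cache answered within the hedge window
		return r.data, r.err
	case <-time.After(hedgeDelay): // cache is slow: hedge to object storage
		go func() {
			data, err := fetchFromObjstore(ctx, key)
			ch <- result{data, err}
		}()
	}

	r := <-ch // first responder wins; a real version would prefer successes
	return r.data, r.err
}
```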

@GiedriusS
Member

Yeah, sorry for the confusion 🤦 hedging between two different systems doesn't make sense. Cache operations are supposed to be ultra fast. I believe the original issue is that with some k/v storages like memcached, one is always forced to download the same data, so if the cached data is big, it takes a long time. This could be solved by having a two-layered cache. We use client-side caching in Redis to solve this problem and it works well. With it, hot items don't need to be re-downloaded constantly because they are kept in memory. I will edit the title/description once I have some time, unless someone disagrees.

And yes, I do imagine it working something like that. The hedged HTTP client works that way: it sends another request if some timeout is reached. We could use the t-digest library to avoid the guesswork of manually setting the latency after which another request is sent.
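
A sketch of that two-layered idea, with a plain map standing in for what would really be a bounded LRU and `remoteCache` as a hypothetical stand-in for the memcached/Redis client; note that Redis client-side caching additionally invalidates the in-memory layer via server push, which this sketch omits:

```go
// Hypothetical sketch: a small in-process layer in front of the remote
// cache, so hot items are served from memory instead of being
// re-downloaded from memcached/Redis on every request.
package cache

import "sync"

type remoteCache interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

type layeredCache struct {
	mu     sync.RWMutex
	local  map[string][]byte // L1: in-memory; a real one would be a bounded LRU
	remote remoteCache       // L2: memcached/Redis
}

func (c *layeredCache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	v, ok := c.local[key]
	c.mu.RUnlock()
	if ok {
		return v, true // hot item: no network round trip
	}
	if v, ok = c.remote.Get(key); ok {
		c.mu.Lock()
		c.local[key] = v // populate L1 on the way back
		c.mu.Unlock()
	}
	return v, ok
}
```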

@milinddethe15
Contributor

@GiedriusS If you have a moment, could you clarify this for me?

Do object storage providers internally distribute query requests among replicas? If not, do we need to make Thanos do that for hedged requests?
https://cloud-native.slack.com/archives/CK5RSSC10/p1723450096247419?thread_ts=1723358359.204139&cid=CK5RSSC10

@yeya24
Contributor

yeya24 commented Oct 1, 2024

@GiedriusS Is this issue still valid? From your comment it looks like we can still have some sort of hedging, just not using the hedgedhttp library?

@yeya24
Contributor

yeya24 commented Nov 17, 2024

> Yeah, sorry for the confusion 🤦 hedging between two different systems doesn't make sense. Cache operations are supposed to be ultra fast. I believe the original issue is that with some k/v storages like memcached, one is always forced to download the same data, so if the cached data is big, it takes a long time. This could be solved by having a two-layered cache. We use client-side caching in Redis to solve this problem and it works well. With it, hot items don't need to be re-downloaded constantly because they are kept in memory. I will edit the title/description once I have some time, unless someone disagrees.

Something we observe is that external caches can still get overloaded sometimes and requests can get stuck waiting. We use a two-layered cache, but it still happens sometimes. We probably need a circuit breaker for GET cache operations, as mentioned in #7010 (comment)
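
For reference, one way a circuit breaker around cache GETs could look, using github.com/sony/gobreaker as one possible implementation; `fetchFromCache` is a hypothetical stand-in for the real cache client:

```go
// Hypothetical sketch: wrap cache GETs in a circuit breaker so that when
// the external cache is overloaded, requests skip it and fall through to
// object storage instead of queueing up behind it.
package cache

import (
	"errors"
	"time"

	"github.com/sony/gobreaker"
)

// Hypothetical stand-in for the real cache client.
func fetchFromCache(key string) ([]byte, error) { return nil, errors.New("overloaded") }

var cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:    "cache-get",
	Timeout: 10 * time.Second, // how long the breaker stays open before probing again
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.ConsecutiveFailures >= 5 // open after 5 consecutive failures
	},
})

// Get treats an open breaker as a cache miss so callers fall back to objstore.
func Get(key string) ([]byte, bool) {
	v, err := cb.Execute(func() (interface{}, error) {
		return fetchFromCache(key)
	})
	if err != nil {
		// Includes gobreaker.ErrOpenState: the cache is being skipped entirely.
		return nil, false
	}
	return v.([]byte), true
}
```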
