-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: retry delay strategy #871
Conversation
size-limit report 📦
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exponential backoff with jitter should work fine for most cases, but assuming no jitter in the case of 3 retries that's a call at 0, 150ms, 600ms, and 1200ms (am I correct here?)
It would be nice to see if this covers every case, or if there would still be a super small chance that this error might get hit (even with the exponential backoff).
Timestamp failed to pass the watermark after retrying the configured 3 times. We cannot guarantee the integrity of the response since it could be a replay attack.
If this error got hit with the exponential backoff in place, what would that mean for the FE app? Is the subnet down or behind? Did we just get super unlucky? Does the same error message make sense now that the exponential backoff is in place (i.e. does this error have a different meaning)?
Ideally, this error would never happen by accident, or even if the FE client user has a poor internet connection. Then if they receive the error it would actually suggest something is not healthy with the nodes/subnet they've communicated with.
Putting a higher threshold in by using exponential backoff and hoping that this is a rare case is just going to confuse developers, and especially confuse end users in the even this error is hit and a toast component pops up with "Timestamp failed to pass the watermark after retrying the configured 3 times. We cannot guarantee the integrity of the response since it could be a replay attack" (in fact, this sort of sounds like security issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I adopted Eric's suggestions
Description
Developers have noticed the agent is more frequently erroring with the new watermark protections against replay attacks / stale data. This feature adds a delay strategy for retries that will allow for more time for nodes to catch up, with exponential increases to the rate
Fixes SDK-1562
How Has This Been Tested?
new e2e tests
Checklist: