Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle response exception from Elasticsearch client #26424

Conversation

mrkm4ntr
Copy link

If Elasticsearch client returns non success HTTP status, it throws ResponseException like below.
https://github.com/elastic/elasticsearch/blob/f8ea89edcb9e79a62f79495b65eeb7c88c252f4f/client/rest/src/main/java/org/elasticsearch/client/RestClient.java#L340-L355

So currently ElasticsearchIO doesn't retry 429 response.
#22160

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@mrkm4ntr
Copy link
Author

R: @egalpin

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@egalpin
Copy link
Member

egalpin commented Apr 27, 2023

thanks @mrkm4ntr for the contribution! I'll take a look more closely as soon as I can. In the meantime, based on the spotless checker output[1] you need to run ./gradlew :sdks:java:io:elasticsearch:spotlessApply on your local machine and commit any changes to auto-fix java style.

Could you also double-check that there is an existing test for the failure mode that your code aims to address? I'll also verify in my review. Thanks!

[1] https://ci-beam.apache.org/job/beam_PreCommit_Spotless_Commit/26535/console

@mrkm4ntr mrkm4ntr force-pushed the elasticsearch/handle-response-exception branch 2 times, most recently from 1a4fb5d to 98dff77 Compare April 28, 2023 06:01
@mrkm4ntr
Copy link
Author

@egalpin Thanks. I checked the current tests and found that Elasticsearch returns HTTP 200 status for invalid bulk data (400 is only in response entity). While too many requests error is HTTP 429. I changed test to emulate non-200 status with 405 status.

@mrkm4ntr mrkm4ntr force-pushed the elasticsearch/handle-response-exception branch from 98dff77 to 6b0a905 Compare April 28, 2023 07:36
@@ -373,6 +376,8 @@ abstract static class Builder {

abstract Builder setTrustSelfSignedCerts(boolean trustSelfSignedCerts);

abstract Builder setInvalidBulkEndpoint(boolean invalidBulkEndpoint);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like the idea of having this builder method only for the sake of testing. I feel it will be very confusing to users to see this as an option when building configurations. Could you look for an alternative approach?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method and this are not visible from users. Same approach as this.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 9, 2023
@@ -1224,26 +1241,9 @@ static class DefaultRetryPredicate implements RetryPredicate {
this(429);
}

/** Returns true if the response has the error code for any mutation. */
private static boolean errorCodePresent(HttpEntity responseEntity, int errorCode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrkm4ntr for your patience, I have kept you waiting too many times.

This change is definitely interesting. I can see that in some cases, such as successfully creating a document via _bulk, the http status code is 200 while the status field under items in the response body is 201 (updating the doc results in status field being 200:

{
    "took": 383,
    "errors": false,
    "items": [
        {
            "update": {
                "_index": "foo",
                "_type": "_doc",
                "_id": "2",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 201
            }
        }
    ]
}

It's also the case, more importantly, that when one of the operations of the bulk payload fails, but another succeeds, there can be an http status code of 200 but one of the items can have status field with non-200 code:

{
    "took": 19,
    "errors": true,
    "items": [
        {
            "update": {
                "_index": "foo",
                "_type": "_doc",
                "_id": "2",
                "_version": 7,
                "result": "updated",
                "_shards": {
                    "total": 2,
                    "successful": 2,
                    "failed": 0
                },
                "_seq_no": 7,
                "_primary_term": 1,
                "status": 200
            }
        },
        {
            "update": {
                "_index": "foo",
                "_type": "_doc",
                "_id": "2",
                "status": 400,
                "error": {
                    "type": "illegal_argument_exception",
                    "reason": "failed to execute script",
                    "caused_by": {
                        "type": "script_exception",
                        "reason": "compile error",
                        "script_stack": [
                            "if(ctx._source.group !=== null) { ctx._source.gro ...",
                            "                        ^---- HERE"
                        ],
                        "script": "if(ctx._source.group !=== null) { ctx._source.group = params.id % 2 } else { ctx._source.group = 0 }",
                        "lang": "painless",
                        "position": {
                            "offset": 24,
                            "start": 0,
                            "end": 49
                        },
                        "caused_by": {
                            "type": "illegal_argument_exception",
                            "reason": "invalid sequence of tokens near ['='].",
                            "caused_by": {
                                "type": "no_viable_alt_exception",
                                "reason": "no_viable_alt_exception: null"
                            }
                        }
                    }
                }
            }
        }
    ]
}

So the status code alone cannot be "trusted" to inform of all failures, since there could be false negatives. However, it seems that the http status code can be used as a first check: if the http status code is >=400, we know a failure has occurred and we don't need to iterate through the items necessarily.

Checking the http status code first might be a viable path forward for properly handling the 429 case.

Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
So, you believe there's a scenario where the HTTP status code is 200, but the status in the body is 429. I'm not certain this is accurate, but we should pay attention to this if we're not confident.

OK, I will change my PR like this.

Checking the http status code first

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, you believe there's a scenario where the HTTP status code is 200, but the status in the body is 429.

It's more so that I feel that we cannot remove the errorCodePresent method and rely on http status code alone. I have seen first hand (examples above) scenarios where 1 out of n bulk entities succeeds and therefore the http status code is 200, even if n-1 bulk entities failed and had their associated status field in items set to something like 400.

I believe the failure mode for not properly handling http 429 is that the response body does not contain the key items which errorCodePresent currently depends on to look further for status. Instead, in the case of 429, status is a top-level key in the response body.

So I feel that we need to keep all existing errorCodePresent logic, and add additional handling to check the http status code.

@github-actions github-actions bot removed the stale label Aug 10, 2023
@github-actions
Copy link
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 10, 2023
@mrkm4ntr
Copy link
Author

/hold

@github-actions github-actions bot removed the stale label Oct 13, 2023
@kennknowles
Copy link
Member

R: @egalpin

@damccorm
Copy link
Contributor

@egalpin @mrkm4ntr what are next steps for this PR?

Copy link
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 11, 2024
Copy link
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Feb 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants