Fix: ml/engine/utils/FileUtils casts long file length to int incorrectly #3198

Open
wants to merge 3 commits into
base: main


@maxlepikhin maxlepikhin commented Nov 3, 2024

Description

The cast "(int) file.length()" makes the length negative for file sizes greater than 2GB (but less than 4GB). As a result, the function returns an empty list of chunks and the model registration task gets stuck in the CREATED state.
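
To illustrate the arithmetic, here is a minimal standalone sketch of the narrowing cast. It is not the actual FileUtils code: the class name, method names, and the 10 MiB chunk size are assumptions made up for this example. It only shows why a loop bounded by the narrowed length never runs for a 3 GiB file.

```java
final class IntCastOverflowDemo {

    // Hypothetical helper mirroring the pattern "(int) file.length()".
    static int narrowToInt(long length) {
        return (int) length; // keeps only the low 32 bits
    }

    // Simplified stand-in for a chunking loop bounded by the narrowed length.
    // Assumption: the real method may be structured differently.
    static int countChunksWithIntLength(long fileLength, int chunkSize) {
        int length = narrowToInt(fileLength);
        int count = 0;
        for (long pos = 0; pos < length; pos += chunkSize) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024; // 3221225472 bytes
        int tenMiB = 10 * 1024 * 1024;           // assumed chunk size

        System.out.println(narrowToInt(threeGiB));                      // prints -1073741824
        System.out.println(countChunksWithIntLength(threeGiB, tenMiB)); // prints 0
    }
}
```

With a negative bound the loop body is skipped entirely, which matches the empty chunk list described above.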

The fix is to use longs when splitting the model zip file. I tested locally that the updated opensearch-ml-algorithms jar fixes the problem.
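
As a rough sketch of what long-based splitting can look like (not the code in this PR), the example below plans chunk boundaries with a long offset. The LongChunkPlanner class, the Chunk type, and the 10 MiB chunk size are hypothetical names and values chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class LongChunkPlanner {

    // Hypothetical value type: where a chunk starts and how many bytes it holds.
    static final class Chunk {
        final long offset;
        final int length;

        Chunk(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Plans chunk boundaries using long arithmetic, so files over 2 GiB are handled.
    static List<Chunk> planChunks(long fileLength, int chunkSize) {
        if (fileLength < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and chunkSize > 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += chunkSize) {
            // The remaining byte count is a long; a single chunk always fits in an int.
            int length = (int) Math.min(chunkSize, fileLength - offset);
            chunks.add(new Chunk(offset, length));
        }
        return chunks;
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024;
        int tenMiB = 10 * 1024 * 1024; // assumed chunk size
        List<Chunk> chunks = planChunks(threeGiB, tenMiB);
        System.out.println(chunks.size());                         // prints 308
        System.out.println(chunks.get(chunks.size() - 1).length);  // prints 2097152
    }
}
```

Only the per-chunk length is narrowed to int, and that is safe because it can never exceed chunkSize.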

Bug: #3197

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@pyek-bot
Contributor

pyek-bot commented Nov 4, 2024

Should we create an issue for this and link it to the PR? @ylwu-amzn

Edit: I see that it is here already #3197
@maxlepikhin can you add it to the description?

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 5, 2024 16:11 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Nov 5, 2024

@maxlepikhin every commit needs your sign-off, which you can add with -s, for example: git commit -m "commit message" -s

Your last two commits are missing the sign-off. You can fix it as follows:

To add your Signed-off-by line to every commit in this branch:

Ensure you have a local copy of your branch by checking out the pull request locally via command line.
In your local branch, run: git rebase HEAD~2 --signoff
Force push your changes to overwrite the branch: git push --force-with-lease origin fix-3197

@b4sjoo
Collaborator

b4sjoo commented Nov 5, 2024

It seems you need to run spotlessApply.

Signed-off-by: Max Lepikhin <[email protected]>
Signed-off-by: Max Lepikhin <[email protected]>
@maxlepikhin maxlepikhin temporarily deployed to ml-commons-cicd-env-require-approval November 6, 2024 18:14 — with GitHub Actions Inactive
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 6, 2024 18:14 — with GitHub Actions Failure
@brianf-aws
Contributor

Hey @maxlepikhin ! Just curious, how did you debug this?

@maxlepikhin
Author

> Hey @maxlepikhin ! Just curious, how did you debug this?

By reading the code.

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 11, 2024 19:17 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Nov 11, 2024

rerunning the flaky IT test for linux

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 11, 2024 21:51 — with GitHub Actions Failure
@maxlepikhin
Author

> rerunning the flaky IT test for linux

@mingshl it appears to be waiting on some human action, what is the next step here?

@brianf-aws
Contributor

brianf-aws commented Nov 13, 2024

> > rerunning the flaky IT test for linux
>
> @mingshl it appears to be waiting on some human action, what is the next step here?

Hey Max, I'm reaching out to the team internally to run the job again. Apologies for the late response.

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 13, 2024 00:18 — with GitHub Actions Failure
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 13, 2024 07:35 — with GitHub Actions Failure
@mingshl mingshl added the bug Something isn't working label Nov 13, 2024
@mingshl
Collaborator

mingshl commented Nov 13, 2024

will approve after all tests passed.

@maxlepikhin can you identify the version when this bug is happening? trying to figure out the backport versions cc @ylwu-amzn

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 13, 2024 21:51 — with GitHub Actions Failure
@maxlepikhin
Author

> will approve after all tests passed.
>
> @maxlepikhin can you identify the version when this bug is happening? trying to figure out the backport versions cc @ylwu-amzn

Since 11/17/22 (bfb0748), so all releases, it seems. It would be great if one of the maintainers could help get the tests passing. Are they flaky?

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 15, 2024 04:53 — with GitHub Actions Failure
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 18, 2024 17:01 — with GitHub Actions Failure
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 18, 2024 18:22 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Nov 18, 2024

It's a flaky test.

Created the issue to track it.

Approved.

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField STANDARD_ERROR
    REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestMLInferenceSearchResponseProcessorIT.testMLInferenceProcessorRemoteModelStringField" -Dtests.seed=9E7BCE94AFC0318E -Dtests.security.manager=false -Dtests.locale=luy-KE -Dtests.timezone=Asia/Dubai -Druntime.java=21

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:33169/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}

@maxlepikhin
Author

> It's a flaky test.
>
> Created the issue to track it.
>
> Approved.
>
> RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField STANDARD_ERROR
>     REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestMLInferenceSearchResponseProcessorIT.testMLInferenceProcessorRemoteModelStringField" -Dtests.seed=9E7BCE94AFC0318E -Dtests.security.manager=false -Dtests.locale=luy-KE -Dtests.timezone=Asia/Dubai -Druntime.java=21
>
> RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField FAILED
>     org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:33169/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
>     {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}

OK, thanks @mingshl. How do I rerun it, or override the check to submit?
