Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix various tests caused by cases byte size validation not handled properly #364

Merged
merged 10 commits into from
Jun 7, 2024

Conversation

rmccorm4
Copy link
Contributor

@rmccorm4 rmccorm4 commented Jun 4, 2024

What does the PR do?

  • Recent byte size validation checks were failing the L0_http HttpTest.test_byte unit test which sends a raw binary tensor.
  • Recent input byte size validation checks are segfaulting for the CUDA Shared Memory case because it assumes it is in CPU memory and dereferences an invalid pointer.
  • Currently, L0_trt_shape_tensor and L0_trt_reformat_free tests are failing because TensorRT infer requests should not check byte size.

The raw binary tensor in the failing test case consists of two memory chunks by the time it gets validated:

  1. The size of the bytes element
  2. The bytes element

This PR makes some generic refactoring to try to handle any series of buffers, where each element is not guaranteed to fit within a single buffer, and each buffer is not guaranteed to contain a single element. Also it skips byte size validation if input memory type is GPU or input platform is TensorRT.

Caveats:

  1. GPU buffers are currently skipped, as they may need some special handling to be dereferenced/checked compared to CPU buffers in the existing code. This will likely cause the element count checks to fail as well.
  2. TensorRT inputs are currently skipped, as format-free IO tensors require communication with the backend to determine the byte size.
  3. A check is made that the 4-byte byte_size indicator for a given element is contained within a single buffer, and is not split across buffers. If a byte_size is split across buffers, then an error is returned. This was just easier to implement, open to suggestions if anyone has a slick solution. Technically the buffer APIs should support doing this, but I assume it is unlikely to be done. Example of this currently unsupported case:
    • buffer1=[<size1><element1><size2_partial>]
    • buffer2=[<size2_partial><element2>...]
    • <size2_partial> is split across buffers, and is currently rejected.

Open Items

  • Verify test plan passes in CI
  • Decide what to do about GPU tensors - probably need to handle them
  • Decide what to do about TensorRT, e.g. shape tensors and reformat-free I/O tensors.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7326

Where should the reviewer start?

N/A

Test plan:

  • L0_http
  • L0_input_validation
  • L0_infer_cudashm
  • L0_sequence_batcher_cudashm
  • L0_backend_output_detail
  • L0_request_cancellation
  • L0_trt_reformat_free
  • L0_trt_shape_tensors
  • Nightly pipeline

CI Pipeline ID: 15635924

Background

None

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

None

@rmccorm4 rmccorm4 requested review from GuanLuo and yinggeh June 4, 2024 03:12
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
int64_t buffer_memory_id;

// Validate elements until all buffers have been fully processed.
while (remaining_buffer_size || buffer_idx < buffer_count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not like below to keep variables in scope

for (buf : buffers) {
  buffer_size = ...;
  for (checked_size = 0; checked_size < buffer_size) {
    ..
  }
}

@rmccorm4
Copy link
Contributor Author

rmccorm4 commented Jun 5, 2024

Re-assigning to @yinggeh as this involves a lot of refactoring to the original validation logic feature he wrote and it seems like he has some ideas for further refactoring.

@yinggeh yinggeh marked this pull request as draft June 5, 2024 07:54
@yinggeh yinggeh changed the title fix: Fix byte size validation for string tensors when size and data are split across buffers fix: Fix various failed tested caused by cases not handled by byte size validation Jun 5, 2024
@yinggeh yinggeh changed the title fix: Fix various failed tested caused by cases not handled by byte size validation fix: Fix various failed tests caused by cases not handled by byte size validation Jun 5, 2024
@yinggeh yinggeh marked this pull request as ready for review June 5, 2024 17:04
@yinggeh yinggeh added the bug Something isn't working label Jun 5, 2024
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
src/infer_request.cc Outdated Show resolved Hide resolved
@yinggeh yinggeh changed the title fix: Fix various failed tests caused by cases not handled by byte size validation fix: Fix various tests caused by cases not handled by byte size validation Jun 5, 2024
@yinggeh yinggeh changed the title fix: Fix various tests caused by cases not handled by byte size validation fix: Fix various tests caused by cases not handled by byte size validation properly Jun 5, 2024
@yinggeh yinggeh changed the title fix: Fix various tests caused by cases not handled by byte size validation properly fix: Fix various tests caused by cases byte size validation not handled properly Jun 5, 2024
@yinggeh yinggeh requested a review from GuanLuo June 6, 2024 11:45
Copy link
Contributor

@GuanLuo GuanLuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See new comments in opened discussion.

@GuanLuo
Copy link
Contributor

GuanLuo commented Jun 6, 2024

Please update the description with the latest information and pipeline ID, I need to see the test status

@yinggeh yinggeh requested a review from GuanLuo June 6, 2024 20:45
@yinggeh
Copy link
Contributor

yinggeh commented Jun 6, 2024

Please update the description with the latest information and pipeline ID, I need to see the test status

I believe I have updated with the latest info and pipeline ID. Is there anything particular you think is missing?

Copy link
Contributor Author

@rmccorm4 rmccorm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't approve this PR because I originally opened it, but I am verbally approving this if all the affected CI tests in the test plan description are passing. I'd like to get these fixes in ASAP to fix the broken tests in CI and reduce cherry-picks.

I added a follow-up ticket DLIS-6833 to address some of the logical refactoring brought up by Guan after code freeze.

@yinggeh
Copy link
Contributor

yinggeh commented Jun 6, 2024

Please also review related server PR github.com/triton-inference-server/server/pull/7326

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

3 participants