Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make container startup timeout configurable #65

Merged
merged 10 commits into from
Aug 2, 2024

Conversation

jameslamb
Copy link
Member

Fixes #61

Contributes to rapidsai/docker#667

container-canary runs a container and then executes commands inside it. When it starts up that container, if it doesn't see the container come up within 10 seconds, it raises an exception.

This proposes making that timeout configurable, via a CLI argument --startup-timeout. This will make it easier to use container-canary with images / environments where container starting is slower (e.g. where something expensive is done at startup or where local disks are slow).

@jameslamb
Copy link
Member Author

I think that these changes are working as intended, but I'm struggling to test them.

I tested that the configuration value is getting all the way down to the place where it needs to be by adding a print statement, like this:

return fmt.Errorf("-- timeoutSeconds: %d", timeoutSeconds)
if time.Since(startTime) > (time.Second * time.Duration(timeoutSeconds)) {
   ...

And then doing this:

make clean testprep build

bin/canary validate \
  --file ./examples/kubeflow.yaml \
  --startup-timeout 5 \
  container-canary/kubeflow:shouldpass

bin/canary validate \
  --file ./examples/kubeflow.yaml \
  --startup-timeout 9 \
  container-canary/kubeflow:shouldpass
Error: -- timeoutSeconds: 5
Error: -- timeoutSeconds: 5

Error: -- timeoutSeconds: 9
Error: -- timeoutSeconds: 9

So this is at least as correct as the existing code.

But I'm not sure how to actually test the enforcement of the timeout. This startup timeout applies to the time between when docker run -d returns, here:

_, err := exec.Command("docker", commandArgs...).Output()

and when the container is first reported as "running", here:

info, err := c.Status()

I'm not sure how to easily influence the amount of time between those two states. That does not include any of these:

  • image-pulling time (happens before docker run -d completes)
  • time to run ENTRYPOINT / CMD (happens after docker run -d returns)

The only things I can think of that cause this startup timeout to be exceeded are like this:

  • slow disk I/O loading the image from local storage
  • insufficient memory or CPU available to the container runtime
  • some combination of the above + image having a large number of big layers

@jacobtomlinson @KyleFromNVIDIA can you think of a good way to reliably test this? Or should we just merge this as-is without testing changes?

@jameslamb jameslamb changed the title WIP: [NOT READY FOR REVIEW] make container startup timeout configurable make container startup timeout configurable Aug 1, 2024
@jameslamb jameslamb marked this pull request as ready for review August 1, 2024 18:00
@jameslamb jameslamb requested a review from a team as a code owner August 1, 2024 18:00
Copy link
Contributor

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really want to test this, my suggestion would be to make some sort of mock interface that can wait any desired period of time for testing purposes. This can be either a Go interface or a shim command that gets substituted in place of docker, or something else.

That said, this is a small, self-explanatory change, and I don't think it strictly has to be tested.

Approving as-is. Feel free to add more robust tests if you want.

@jameslamb
Copy link
Member Author

This can be either a Go interface or a shim command that gets substituted in place of docker, or something else.

Thanks for these ideas, they're good ones I hadn't considered. A shim for docker, like you did over in rapids-build-backend for nvcc (code link), sounds like it could be a powerful way to test more of how this tool interacts with the container runtime it's shelling out to.

I think that's way beyond my Go abilities though 😬

I agree with you about this being small and self-explanatory, and I'm also confident it could just be merged as-is. Let's see what @jacobtomlinson thinks.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that in the long term we want to improve testing here, and putting more places where we can add shims is a great idea. But I also agree for this change that manual testing seems fine.

@jacobtomlinson jacobtomlinson changed the title make container startup timeout configurable Make container startup timeout configurable Aug 2, 2024
@jacobtomlinson jacobtomlinson merged commit 42e3d11 into NVIDIA:main Aug 2, 2024
7 checks passed
@jameslamb jameslamb deleted the configurable-timeout branch August 2, 2024 12:49
@jameslamb
Copy link
Member Author

Alright sounds good, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: make container startup timeout configurable
3 participants