Recovering from OOMkill pod failures/evictions #10

Open
1beb opened this issue Oct 23, 2021 · 1 comment

Comments

@1beb

1beb commented Oct 23, 2021

One critical issue that I think makes this challenging to use at larger scale is that R is a garbage-collected language.

There are a number of odd situations, especially when reading or writing files, in which memory keeps "growing" even though it ought to be garbage collected and never is. We were discussing this a little in the future repository. Henrik suggested using the callr plan, which works extremely well when you're working on a single computer but is incompatible with the setup command specified in the future-kubernetes helm chart.

I've been thinking about a number of alternative approaches:

  • Find a way to restart the R process when it finishes on a pod, before the next iteration.
  • Instead of setting up the cluster via the helm chart, use an SSH-based cluster, distributing tasks over SSH from within your primary parallel loop.
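
To make the first option concrete, here is a minimal sketch of the process-per-task idea. It is written in Python only because this thread has no code of its own; in R the same role is played by the callr plan. `TASK_SOURCE` and `run_in_fresh_process` are hypothetical names, and the 10 MB allocation is just a stand-in for whatever memory a real task fails to release.

```python
import json
import os
import subprocess
import sys

# Hypothetical task body, run in a brand-new interpreter each time.
TASK_SOURCE = """
import json, os, sys
task_id = int(sys.argv[1])
scratch = bytearray(10 * 1024 * 1024)  # stand-in for memory a task leaves behind
print(json.dumps({"task": task_id, "pid": os.getpid(), "bytes": len(scratch)}))
"""


def run_in_fresh_process(task_id):
    """Run one task in its own process; its memory is released when it exits."""
    done = subprocess.run(
        [sys.executable, "-c", TASK_SOURCE, str(task_id)],
        check=True, capture_output=True, text=True,
    )
    return json.loads(done.stdout)


if __name__ == "__main__":
    for task_id in range(3):
        result = run_in_fresh_process(task_id)
        print(f"task {result['task']} ran in pid {result['pid']} (parent {os.getpid()})")
```

The point is only that the operating system, not the garbage collector, reclaims the memory: each task's allocations disappear when its process exits. The cost is one interpreter start-up per task.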

Do you have any thoughts on how one might approach this?

@1beb 1beb changed the title Recovering from pod failures Recovering from OOMkill pod failures/evictions Oct 23, 2021
@paciorek
Owner

paciorek commented Nov 2, 2021

I'm not sure. This seems to be working around a flaw in how R handles certain circumstances. I'd be inclined to see whether it could be addressed on the R side.

I think that if you managed to kill the R process in a given pod, the pod would restart, restarting R with it. So there's a chance that could somehow be used, though it feels pretty awkward.
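
A rough sketch of that kill-and-restart pattern, in Python purely for illustration (I have not tried this against the helm chart). Everything here is an assumption: `WORKER_SOURCE`, `supervise`, the exit code 75, and the "exit after two tasks" rule, which stands in for a real memory check. The `supervise` loop plays the part that Kubernetes would play when it restarts the worker.

```python
import subprocess
import sys

RESTART_CODE = 75  # arbitrary "please restart me" code; must match the worker

# Hypothetical worker: handles tasks given on argv and exits on purpose
# after MAX_TASKS of them, leaving the rest for its replacement.
WORKER_SOURCE = """
import os, sys
MAX_TASKS = 2
pending = sys.argv[1:]
for done, task in enumerate(pending, start=1):
    print(f"{os.getpid()} {task}", flush=True)
    if done == MAX_TASKS and done < len(pending):
        sys.exit(75)
"""


def supervise(tasks):
    """Relaunch the worker until every task is done; return (pid, task) pairs."""
    pending = list(tasks)  # task names must not contain whitespace
    log = []
    while pending:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, *pending],
            capture_output=True, text=True,
        )
        if proc.returncode not in (0, RESTART_CODE):
            raise RuntimeError(f"worker failed: {proc.stderr}")
        finished = [line.split() for line in proc.stdout.splitlines()]
        log.extend((int(pid), task) for pid, task in finished)
        pending = pending[len(finished):]  # hand the remainder to a new worker
    return log


if __name__ == "__main__":
    for pid, task in supervise(["a", "b", "c", "d", "e"]):
        print(f"task {task} handled by worker pid {pid}")
```

The awkward part is visible even in this toy: something outside the worker has to remember which tasks are still pending across restarts.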

Using ssh should be possible, but I didn't go down that path because it seems like working around Kubernetes rather than using Kubernetes the way it was intended.
