Bug report

What is the current behavior?

The `Runner` class and service configuration have a `delete_local_files` argument which deletes any downloaded datafiles and any registered temporary directories at the end of an analysis run. We know (and warn) that this isn't thread-safe. However, we've recently seen the same consequences of thread-unsafety in what should be a thread-safe environment (separate containers running separate processes with supposedly separate filesystems in Cloud Run).
To get our containers to run reliably, we've recently had to disable local file deletion (`delete_local_files=False` is now the default). This leaves us with a potential problem of running out of memory when running our containers in Cloud Run, where local storage is in-memory by default. We need a way to clean up the files from one analysis before the next is run.
The actual error we saw was a `FileNotFoundError`, which occurred when using functions from the `os.path` module, e.g. when trying to get the relative path of a datafile in a dataset so it could be downloaded/uploaded, or when trying to get the absolute path of a metadata file.
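One way `os.path` functions can raise `FileNotFoundError` is when the process's current working directory has been deleted: `os.path.abspath` (and, via it, `os.path.relpath`) resolves relative paths against `os.getcwd()`, which fails if that directory no longer exists. We haven't confirmed this is the mechanism here, but it's easy to reproduce:

```python
import os
import tempfile

# Sketch of one plausible mechanism (not confirmed for this issue):
# os.path.abspath resolves relative paths against os.getcwd(), which
# raises FileNotFoundError when the working directory has been deleted.
original_cwd = os.getcwd()
workdir = tempfile.mkdtemp()
os.chdir(workdir)
os.rmdir(workdir)  # simulate a cleanup removing the current directory

error_raised = False
try:
    os.path.abspath("metadata.json")
except FileNotFoundError:
    error_raised = True
finally:
    os.chdir(original_cwd)  # restore a valid working directory
```

If a cleanup (from another container sharing the filesystem, or from a previous serial run) deletes whatever directory the process is running in, subsequent `os.path` calls on relative paths would fail exactly like this.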
We triggered this error by asking 24 questions (6 in parallel at a time) of one of our services with concurrency set to 1 (one request (question) at a time per container). The first several questions (between 4 and 8) would succeed, then the rest would fail. We think what happened was this:

1. The first few questions would be served by new containers.
2. These questions would finish and clean up successfully.
3. The rest of the questions would be picked up by the containers that were already running and would attempt to use temporary directories. This would somehow fail because of the previous cleanup.
So, despite concurrency being 1, each container would attempt to answer multiple questions in serial. This should have been fine if each container had its own filesystem, but a new shared filesystem (between containers in a Cloud Run service instance) is being trialled in Cloud Run, which we think makes cleaning up like this effectively thread-unsafe. It's possible that separate containers are interfering with each other's list of registered temporary directories; it's also possible the problem is within each container but caused by serial analyses.
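As a minimal sketch of the suspected within-container failure mode (the names here are illustrative, not our actual implementation): if the list of registered temporary directories is module-level and isn't cleared after cleanup, a second serial run re-visits stale entries from the first:

```python
import shutil
import tempfile

# Hypothetical module-level registry, mirroring the suspected design.
registered_temporary_directories = []

def run_analysis():
    # Each analysis run registers a fresh temporary directory.
    path = tempfile.mkdtemp()
    registered_temporary_directories.append(path)
    return path

def cleanup():
    # Deletes every registered directory but never clears the registry.
    for path in registered_temporary_directories:
        shutil.rmtree(path)  # FileNotFoundError if the path is already gone

run_analysis()
cleanup()  # first run: succeeds and deletes its directory

second = run_analysis()
failed_on_stale_entry = False
try:
    cleanup()  # second run: hits the stale entry left by the first run
except FileNotFoundError:
    failed_on_stale_entry = True
finally:
    shutil.rmtree(second, ignore_errors=True)  # tidy up the leftover
```

The cross-container variant would be the same story with the shared filesystem playing the role of the shared registry.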
What is the expected behavior?
We can automatically delete downloaded and temporary files/directories at the end of each analysis run.
Proposed Solution
Some ideas:
- Set `delete_local_files=True` and make our file cleanup thread-safe (e.g. apply a lock around accessing and modifying the list of registered temporary directories)
  - e.g. put the `registered_temporary_directories` variable on the `Runner` instance so it's not global
- Keep `delete_local_files=False` and start providing a non-in-memory filesystem that's cleaned up separately
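A sketch combining the first two ideas, assuming a hypothetical `Runner` shape (the method and attribute names are illustrative, not the real API): keep the registry on the instance and guard it with a lock so concurrent registration and cleanup can't interfere:

```python
import os
import shutil
import tempfile
import threading

class Runner:
    """Illustrative sketch only; not the real Runner implementation."""

    def __init__(self, delete_local_files=True):
        self.delete_local_files = delete_local_files
        # Per-instance registry instead of a module-level global.
        self._registered_temporary_directories = []
        self._registry_lock = threading.Lock()

    def register_temporary_directory(self):
        path = tempfile.mkdtemp()
        with self._registry_lock:
            self._registered_temporary_directories.append(path)
        return path

    def cleanup(self):
        if not self.delete_local_files:
            return
        with self._registry_lock:
            # Swap the registry out atomically so no entry can be re-visited
            # by a later (serial or concurrent) cleanup.
            directories = self._registered_temporary_directories
            self._registered_temporary_directories = []
        for path in directories:
            shutil.rmtree(path, ignore_errors=True)  # tolerate missing paths

runner = Runner()
path = runner.register_temporary_directory()
existed = os.path.isdir(path)
runner.cleanup()
deleted = not os.path.exists(path)
```

Emptying the registry inside the lock before deleting anything, and using `ignore_errors=True`, means a double cleanup is a no-op rather than a `FileNotFoundError`. Note this only helps within one process; it wouldn't protect against separate containers interfering via a shared filesystem.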