Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions when extracting zip files #456

Merged
merged 3 commits into from
Jun 28, 2024

Conversation

alexdewar
Copy link
Contributor

@alexdewar alexdewar commented Jun 24, 2024

Before extracting a zip file, we check whether we already have the contents of that file extracted to a cache folder and, if so, skip the extraction.

While this is safe to do if there is only one HealthGPS process running, if there are multiple processes attempting to extract the zip file at the same time, there is a potential race condition (#454):

  1. Process A observes that the cache folder doesn't exist, creates it and begins extraction
  2. Process B observes that the cache folder exists, tries to use it before extraction is complete and crashes
  3. Process A completes extraction

Note that while this sounds unlikely, it is entirely possible that this could happen with multiple jobs running on the HPC (particularly given that filesystem access is often slow on the Imperial HPC).

Another possible bug is that a HealthGPS could terminate before extraction is complete, in which case, again, the cache folder will exist, but only contain a subset of the data files. In this case, the only remedy would be to delete the cache folder -- not particularly user friendly.

This PR solves both of these problems by performing extraction to a temporary folder (with a unique name) before renaming the folder. While moving a folder is not guaranteed to be atomic because it may involve copying data between devices, it seems that renaming them should be (or at least it is on POSIX, which will fix the HPC case).

Fixes #454.

Copy link
Contributor

@jamesturner246 jamesturner246 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. just a nitpick.

Comment on lines 100 to 101
return cache_path;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    if (!std::filesystem::exists(cache_path)) {
        extract_zip_file(file_path, cache_path);
    }
    return cache_path;

@alexdewar alexdewar force-pushed the fix_extract_race_condition branch from 299601c to 67be6c3 Compare June 26, 2024 11:48
@alexdewar alexdewar enabled auto-merge June 26, 2024 11:48
Copy link
Contributor

@TinyMarsh TinyMarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM.

I don't know if this is relevant but would std::filesystem::temp_directory_path be useful here? Perhaps there's a reason we wouldn't want to use the system temp directory?

@alexdewar alexdewar force-pushed the fix_extract_race_condition branch from 67be6c3 to 413d5fc Compare June 28, 2024 08:56
@alexdewar alexdewar disabled auto-merge June 28, 2024 09:09
@alexdewar
Copy link
Contributor Author

@TinyMarsh Ordinarily that would make sense, but the weird thing about what we're doing here is that we want to be able to move the temporary folder to its final destination atomically (i.e. so that another HPC job doesn't try to read the cache folder when it's only got half the files in) and moving a folder will only be atomic if it's in the same partition, which is why we do it in the same cache folder (e.g. on Linux /tmp is usually a ramdisk, which means the files would have to be copied).

@alexdewar alexdewar enabled auto-merge June 28, 2024 09:50
@alexdewar alexdewar merged commit 34ce340 into main Jun 28, 2024
5 checks passed
@alexdewar alexdewar deleted the fix_extract_race_condition branch July 1, 2024 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make zip caching code robust to race conditions
3 participants