Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition for multiple very fast downloads #3907

Merged
merged 1 commit into from
Sep 16, 2023

Conversation

HebaruSan
Copy link
Member

@HebaruSan HebaruSan commented Sep 15, 2023

Problem

See #3906 for details; to sum up, one of the new NetAsyncDownloader tests from #3904 was failing in GitHub workflows but nowhere else.

When downloading multiple URLs, the DownloadTarget.size values were 0 for some of the files when NetAsyncDownloader.DownloadAndWait completed.

Cause

This new test found a real, pre-existing problem for us! I'm really glad I didn't give in to the temptation to add [Category("FlakyNetwork")] and move on. 🎉

There's a race condition with multiple extremely fast downloads (the test is using file:// URLs to avoid spurious failures caused by server glitches, so the files are already on disk and possibly cached in memory). At the start, the main thread loops over the requested URLs and starts a thread for each one. This sets up the downloads list with the currently executing downloads, and the queuedDownloads list with the downloads that are waiting their turn. When a thread completes, the sizes of the downloads and queuedDownloads lists are compared to completed_downloads to determine whether there are any other downloads in progress or waiting.

If at any time the active download threads manage to finish their work before all of the downloads are started, then they will be comparing the completed download count to incomplete downloads and queuedDownloads lists! This results in the completion signal being sent early, and when the main thread returns to DownloadAndWait, it will skip waiting and signal completion with whatever was finished up to that point.

Changes

  • Now the downloader's main thread holds the dlMutex lock (which the other threads acquire before they check for completion) while it starts all the downloads. This ensures that by the time any downloader thread is able to check for completion, the downloads and queuedDownloads lists will be fully populated, which will prevent the completion notification from being triggered prematurely.
  • The test now uses 13 files instead of 4, which made the problem start happening in my Ubuntu VM where it hadn't before (there seems to be a sensitivity to the number of downloads, with more downloads making the problem more likely).
  • Now the test performs two additional checks: comparing the real downloaded file sizes that it observes on disk to the target sizes and the sizes of the original files.
  • Since that test was so helpful, I decided to try adding one for failed downloads. In the process, I discovered that the index values in the failed downloads exception can be incorrect, so some further changes are made to clean that up (specifically, never queueing file:// URLs and looking up the failed targets in the original input, which is now an IList instead of an ICollection because we need to call IndexOf on it).

Fixes #3906.

@HebaruSan HebaruSan added Bug Something is not working as intended Core (ckan.dll) Issues affecting the core part of CKAN Linux Issues specific for Linux Mono Issues specific for Mono Tests Issues affecting the internal tests Network Issues affecting internet connections of CKAN labels Sep 15, 2023
@HebaruSan HebaruSan force-pushed the fix/downloader-race-condition branch 3 times, most recently from 2886643 to af0feaf Compare September 16, 2023 01:17
Copy link
Member

@techman83 techman83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh awesome, this likely explains the errant failure reported here and there. Thanks @HebaruSan !

@HebaruSan HebaruSan merged commit c38c7a2 into KSP-CKAN:master Sep 16, 2023
10 checks passed
@HebaruSan HebaruSan deleted the fix/downloader-race-condition branch September 16, 2023 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something is not working as intended Core (ckan.dll) Issues affecting the core part of CKAN Linux Issues specific for Linux Mono Issues specific for Mono Network Issues affecting internet connections of CKAN Tests Issues affecting the internal tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GitHub test runner doesn't like the new NetAsyncDownloader test
2 participants