Resume file upload #2379

Open
jrchudy opened this issue Dec 7, 2023 · 3 comments
Assignees: jrchudy
Labels: hatrac (Anything related to hatrac/uploader), investigation required (Requires some initial investigation)



jrchudy commented Dec 7, 2023

Uploading files through recordedit doesn't properly resume an upload if the connection to the server was lost or the window was refreshed. For instance, if a user is uploading a 200 MB file and only half of it has been uploaded before an interruption, the user has to restart the upload from the beginning.

We should properly "resume" the file upload if a partial file exists on the server already. This will be handled in multiple steps:

  1. when the connection to the server is interrupted but the page is NOT reloaded
  2. when the page is reloaded
  3. resuming in a different tab/window

step 1 - resume on connection interruption

To resume a file upload process that was interrupted in recordedit, the following should be done:

  1. Once a file is selected and the first chunk has finished uploading, start tracking the last contiguous chunk that was uploaded. As each new chunk completes, the last chunk index is updated if needed. Also keep track of the jobUrl to ensure the upload process resumes against the same “path”.
    • lastChunkIdx - the index of the last chunk that was successfully uploaded
    • jobUrl - the hatrac namespace with the upload job appended to the end
    • fileSize - the size of the file initially uploaded, to help ensure the resumed file is the same as the original
    • uploadVersion - the final name for an upload job after the job is marked as complete
    • the key in the map is intended to ensure each upload being resumed is for the same file (checksums match) uploaded to the same column and recordedit form index
  2. After an error occurs (loss of internet connection, for example) and the user tries to upload again (clicks submit after resolving the connection issue), the file checksum is calculated for the “new” upload.
  3. If that checksum, the associated column name, and the recordedit form index all match one of our stored entries, check the following before marking the current UploadFileObject as a partial upload (see the sketch after the map below):
    • the new url is contained in the jobUrl we tracked
    • the lastChunkIdx indicates that some chunks have already been uploaded
    • the new file's size is the same as the fileSize we tracked
    • the uploadVersion is not set yet
  4. While checking whether the file exists using the generated url (/hatrac/path/to/file.txt without ;upload/somehash), a 409 response means the namespace already exists; in that case the tracked jobUrl (/hatrac/path/to/file.txt;upload/somehash) is used for the upload instead of creating a new upload job.
  5. When starting the upload job, lastChunkIdx determines which chunk to start uploading from, so the job is properly resumed and we don’t upload any duplicate chunks.

Map for storing information about incomplete upload jobs

{
  `${file.checksum}_${column.name}_${recordedit_form_index}`: {
    lastChunkIdx: n,
    jobUrl: '/hatrac/path/to/file.txt;upload/somehash',
    fileSize: n,
    uploadVersion: '/hatrac/path/to/file.txt:version'
  }
}
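
Below is a minimal TypeScript sketch of the resume check from steps 3-5. All names here (IncompleteUploadEntry, resumeKey, canResumeUpload) are hypothetical illustrations of the idea, not the actual Chaise implementation:

```typescript
// Hypothetical types and names for illustration; the real Chaise code may differ.
interface IncompleteUploadEntry {
  lastChunkIdx: number;    // index of the last contiguously uploaded chunk
  jobUrl: string;          // '/hatrac/path/to/file.txt;upload/somehash'
  fileSize: number;        // size of the file when the job was first created
  uploadVersion?: string;  // only set once the job is marked as complete
}

type IncompleteUploadMap = { [key: string]: IncompleteUploadEntry };

// Key mirrors the map above: same file (checksum), column, and form index.
function resumeKey(checksum: string, columnName: string, formIndex: number): string {
  return `${checksum}_${columnName}_${formIndex}`;
}

// The checks from step 3: returns the tracked entry if the current upload
// can resume it, or null if a fresh upload job should be created.
function canResumeUpload(
  map: IncompleteUploadMap,
  checksum: string,
  columnName: string,
  formIndex: number,
  generatedUrl: string,  // '/hatrac/path/to/file.txt', without ';upload/somehash'
  fileSize: number
): IncompleteUploadEntry | null {
  const entry = map[resumeKey(checksum, columnName, formIndex)];
  if (!entry) return null;

  const sameTarget = entry.jobUrl.startsWith(generatedUrl); // new url is contained in jobUrl
  const hasProgress = entry.lastChunkIdx >= 0;              // at least one chunk was uploaded
  const sameFile = entry.fileSize === fileSize;             // same size as the original file
  const notFinished = entry.uploadVersion === undefined;    // job never marked complete

  return sameTarget && hasProgress && sameFile && notFinished ? entry : null;
}
```

If an entry comes back, the uploader would reuse entry.jobUrl when the existence check returns 409 (step 4) and start chunking at entry.lastChunkIdx + 1 (step 5); otherwise it falls back to creating a new upload job.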

step 2 - when the page is reloaded

Other changes to accomplish this across reloads include:

  • moving the map that stores information about incomplete upload jobs into local storage
  • ensuring the map entry is cleaned up when a job completes, before redirecting after submission

More information that should be stored:

  • catalog, schema, table, and shortest key (see the persistence sketch below)
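
As a rough illustration of that move, here is a minimal persistence sketch. The storage key ('chaise-incomplete-uploads'), the helper names, and the exact entry shape are all assumptions, not confirmed by this issue:

```typescript
// Hypothetical storage key and helpers; names are illustrative assumptions.
const STORAGE_KEY = 'chaise-incomplete-uploads';

interface PersistedUploadEntry {
  lastChunkIdx: number;
  jobUrl: string;
  fileSize: number;
  uploadVersion?: string;
  // extra fields so a reloaded page can re-associate the job with its record:
  catalog: string;
  schema: string;
  table: string;
  shortestKey: { [columnName: string]: string };
}

function loadIncompleteUploads(): { [key: string]: PersistedUploadEntry } {
  try {
    return JSON.parse(window.localStorage.getItem(STORAGE_KEY) || '{}');
  } catch {
    return {}; // corrupted storage: start fresh rather than block the upload
  }
}

function saveIncompleteUploads(map: { [key: string]: PersistedUploadEntry }): void {
  window.localStorage.setItem(STORAGE_KEY, JSON.stringify(map));
}

// Clean up a finished job's entry before redirecting after submission.
function clearCompletedUpload(key: string): void {
  const map = loadIncompleteUploads();
  delete map[key];
  saveIncompleteUploads(map);
}
```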

step 3 - resuming in a different tab/window

Other changes to accomplish this across multiple tabs/windows:

  • a lock/coordination mechanism to ensure the job isn't being uploaded from 2 different sources at the same time (see the lock sketch below)
    • window ID and timestamps
    • release lock after X time
      • check the timestamp info and calculate how long it has been since the lock was created
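
One possible shape for that lock, sketched with localStorage, a random window ID, and an assumed 60-second expiry (the lock key prefix and the TTL are illustrative choices, not a settled design):

```typescript
// Hypothetical lock scheme; the lock key prefix and 60s TTL are assumptions.
const LOCK_TTL_MS = 60 * 1000;                          // "release lock after X time"
const WINDOW_ID = Math.random().toString(36).slice(2);  // identifies this tab/window

interface UploadLock {
  windowId: string;
  timestamp: number; // when the lock was created or last renewed
}

function lockKey(jobUrl: string): string {
  return `chaise-upload-lock:${jobUrl}`;
}

// Returns true if this window now holds the lock for the given job.
function acquireUploadLock(jobUrl: string): boolean {
  const raw = window.localStorage.getItem(lockKey(jobUrl));
  if (raw) {
    const existing: UploadLock = JSON.parse(raw);
    const age = Date.now() - existing.timestamp;
    // honor a live lock held by another window; take it over only once expired
    if (existing.windowId !== WINDOW_ID && age < LOCK_TTL_MS) return false;
  }
  const lock: UploadLock = { windowId: WINDOW_ID, timestamp: Date.now() };
  window.localStorage.setItem(lockKey(jobUrl), JSON.stringify(lock));
  return true;
}

function releaseUploadLock(jobUrl: string): void {
  const raw = window.localStorage.getItem(lockKey(jobUrl));
  if (raw && (JSON.parse(raw) as UploadLock).windowId === WINDOW_ID) {
    window.localStorage.removeItem(lockKey(jobUrl));
  }
}
```

Note that localStorage writes aren't atomic across tabs, so a check-then-set race is still possible with this approach; renewing the timestamp as chunks complete would keep an active job's lock from expiring mid-upload.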

jrchudy commented Dec 7, 2023

Looking at the hatrac REST-API doc, I see this line:

Note, there is no support for determining which chunks have or have not been uploaded as such tracking is not a requirement placed on Hatrac implementations.

In other words, hatrac itself won't report which chunks made it to the server, so the client has to track the last successfully uploaded chunk on its own (which is what the map in the main message does).

@RFSH added the hatrac label on Feb 5, 2024
@jrchudy added the investigation required label on Feb 5, 2024

jrchudy commented Feb 13, 2024

Issue #1837 is related to this issue. #1837 resets the file upload job when a user logs back in after their session expires. The work here would further improve that failure scenario but won't "fix" that issue. Ideally, to address #1837 we wouldn't refresh the page afterwards; even so, this issue still improves that feature, since other things can happen that refresh the page and force a restart.

@jrchudy self-assigned this on Mar 21, 2024
@jrchudy changed the title from "Resume file upload on page reload" to "Resume file upload" on May 13, 2024

jrchudy commented Jun 27, 2024

Step 1 from the main message above has been merged. Moving this issue to "Scheduled" for implementing steps 2 and 3.
