# Add "smart" upload function #90
I'm definitely in favor of this change. I personally use …
I had to take a hiatus from this development but will be back working on it this month. I am considering introducing another dependency maintained by me, as it will simplify much of the logic: it is a function that accepts an iterator or async iterator and returns a promise for an array collecting the results of invoking a mapper function over each element produced by the iterator, in the order the iterator produced the elements. The amount of concurrency can be specified in the same way as Bluebird's `Promise.map`.

This would be used in the piecewise upload routine. Each upload source (buffer, file, stream) would have an iterator that produces pieces to upload.
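For illustration, a minimal sketch of such a concurrency-limited mapper (hypothetical code, not the actual module being described):

```js
// Hypothetical sketch: map an (async) iterable with at most `concurrency`
// mapper calls in flight, collecting results in the order the iterator
// produced the elements.
async function asyncMap(iterable, mapper, { concurrency = Infinity } = {}) {
  const results = [];
  const inFlight = new Set();
  let index = 0;

  for await (const item of iterable) {
    const i = index++;
    const p = Promise.resolve(mapper(item, i)).then(result => {
      results[i] = result;
      inFlight.delete(p);
    });
    inFlight.add(p);
    if (inFlight.size >= concurrency) {
      await Promise.race(inFlight); // wait for a slot to open
    }
  }

  await Promise.all(inFlight); // drain the remaining in-flight calls
  return results;
}
```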
Sounds fine if it's something that would be useful on its own as a module, or is large enough to warrant being separate. But if it's just a small helper function, it'd be better to throw it in this module.
I decided not to bring in this other dependency. I might later, but what I have today works (I've tested it against large files/buffers). I'm also not sure what environments are supported by this module (backblaze-b2) and whether they all support ECMAScript's asynchronous iteration feature.

I did bring in one more dependency (p-retry) to handle exponential back-off following a temporary upload failure.

Sometime this week I hope to publish this function as a module on npm, since I will be using it in multiple projects and this will simplify distribution. My goal is to get the functionality merged into backblaze-b2 itself. Once I publish this new module, I'd like to request a code review of it. After working through any concerns there, I can submit a PR to add the function to backblaze-b2.
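For illustration, a sketch of wrapping a single part upload in p-retry to get exponential back-off (the option values and `uploadPart` are hypothetical, not taken from the module):

```js
const pRetry = require('p-retry'); // v4-style CommonJS; newer majors are ESM-only

// Retry a single part upload with exponential back-off.
// `uploadPart` is a hypothetical function performing one b2_upload_part call.
function uploadPartWithRetry(uploadPart, partNumber) {
  return pRetry(uploadPart, {
    retries: 5,       // attempts allowed after the first failure
    factor: 2,        // double the delay each time
    minTimeout: 1000, // start with a 1-second delay
    onFailedAttempt: error => {
      console.warn(
        `Part ${partNumber} attempt ${error.attemptNumber} failed; ` +
        `${error.retriesLeft} retries left.`
      );
    },
  });
}
```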
The module is published. Feel free to leave code review comments here, or as issues on the module's repository. https://www.npmjs.com/package/@gideo-llc/backblaze-b2-upload-any |
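For reference, attaching and invoking the published module might look like the following (the `uploadAny` option names here are assumptions based on this thread; the package README is authoritative):

```js
const B2 = require('backblaze-b2');

// Attach the published function to the B2 prototype, as described in the issue text.
B2.prototype.uploadAny = require('@gideo-llc/backblaze-b2-upload-any');

async function main() {
  const b2 = new B2({ applicationKeyId: '...', applicationKey: '...' });
  await b2.authorize();

  // Option names are illustrative; `data` may be a Buffer, a stream,
  // or a reference to a local file depending on the module's API.
  await b2.uploadAny({
    bucketId: '<bucket id>',
    fileName: 'path/in/bucket.bin',
    data: Buffer.from('example contents'),
  });
}

main().catch(console.error);
```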
We have been using the module in production for the last few days and have migrated several TB of data successfully, streaming it from an HTTP origin to B2. I have not found any instances of corruption yet. There has been a pretty even mixture of standard and large-file uploads. |
@cdhowie Sorry for the delay. The API looks good; hopefully I can take a look at the code over the weekend.
@odensc There are a few things that I may want to rewrite, but I haven't had time, and what's there works pretty well. The main large-file-upload routine is a bit complex; replacing it with async iterables (as discussed above) could simplify it.
I just added …
The AWS S3 SDK for JavaScript has an `upload` function that does not correspond to any particular API request. You can give it a buffer or a stream, and it will automatically perform either a single PutObject call or a multi-part upload.

It would be a great benefit to this library to provide something similar. Right now, large file uploads are unnecessarily cumbersome, especially when the input is a stream. Authorization token management is a giant pain.
I am working on such a function right now for our own internal use. I'm writing it as a module that exposes a single function that can be attached to the prototype of the `B2` class provided by this library (`B2.prototype.uploadAny = require('backblaze-b2-upload-any');`).

This issue is intended to convey my intent to integrate this function into this library and submit a PR. Therefore, I would very much appreciate any feedback on my proposal so that I can accommodate any necessary design changes as early as possible.
The current planned features of this function (many of which are already done) are:
There is a difference between the local file and stream cases. When uploading a local file, no content is buffered in memory. Rather, multiple read streams are created (and re-created as necessary if a part upload must be retried).
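A minimal sketch of that ranged-read-stream approach (illustrative, not the module's actual code):

```js
const fs = require('fs');

// One read stream per part, covering only that part's byte range. If a part
// upload fails, a fresh stream over the same range can be created instead of
// buffering the part in memory. Note that `end` is an inclusive offset.
function createPartStream(path, partIndex, partSize) {
  const start = partIndex * partSize;
  return fs.createReadStream(path, { start, end: start + partSize - 1 });
}
```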
Stream support necessarily requires some buffering in memory to facilitate retries, since Node streams cannot be rewound (and not all stream types would be seekable, anyway).
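A sketch of that buffering, chunking the source stream into retryable in-memory Buffers (illustrative only):

```js
// Accumulate stream chunks and yield fixed-size Buffers; because each part
// lives in memory, a failed part upload can be retried without re-reading
// the (non-seekable) source stream.
async function* bufferParts(stream, partSize) {
  let pending = [];
  let pendingLength = 0;

  for await (const chunk of stream) {
    pending.push(chunk);
    pendingLength += chunk.length;

    while (pendingLength >= partSize) {
      const whole = Buffer.concat(pending);
      yield whole.subarray(0, partSize);
      pending = [whole.subarray(partSize)];
      pendingLength = pending[0].length;
    }
  }

  if (pendingLength > 0) {
    yield Buffer.concat(pending); // final, possibly short, part
  }
}
```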
Note that I currently introduce two new dependencies:

- `@hapi/joi`, which is used to validate the options object.
- `memoizee`, which is used during re-authorization. If multiple part uploads are trying to re-authorize at the same time, this prevents multiple authorize calls to B2.
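An illustrative use of memoizee for that de-duplication (the `maxAge` eviction policy is an assumption):

```js
const memoize = require('memoizee');

// With { promise: true }, concurrent callers share a single in-flight
// authorize() promise instead of each issuing its own call to B2.
function makeReauthorizer(b2) {
  return memoize(() => b2.authorize(), {
    promise: true,
    maxAge: 60 * 1000, // drop the cached result after a minute (assumed policy)
  });
}
```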