feat: optional --range argument for cp to download single part of object #772

Open
wants to merge 6 commits into
base: master
Changes from 5 commits
16 changes: 12 additions & 4 deletions README.md
@@ -80,9 +80,9 @@ You can also install `s5cmd` from [MacPorts](https://ports.macports.org/port/s5c
> conda config --add channels conda-forge
> conda config --set channel_priority strict
> ```
>
>
> Once the `conda-forge` channel has been enabled, `s5cmd` can be installed with `conda`:
>
>
> ```
> conda install s5cmd
> ```
@@ -319,7 +319,7 @@ folder hierarchy.
an [open ticket](https://github.com/peak/s5cmd/issues/29) to track the issue.

#### Using Exclude and Include Filters
`s5cmd` supports the `--exclude` and `--include` flags, which can be used to specify patterns for objects to be excluded or included in commands.
`s5cmd` supports the `--exclude` and `--include` flags, which can be used to specify patterns for objects to be excluded or included in commands.

- The `--exclude` flag specifies objects that should be excluded from the operation. Any object that matches the pattern will be skipped.
- The `--include` flag specifies objects that should be included in the operation. Only objects that match the pattern will be handled.
@@ -540,7 +540,7 @@ The environment variable `SHELL` must be accurate for the autocompletion to func
The autocompletion is tested with the following versions of the shells: \
***zsh*** 5.8.1 (x86_64-apple-darwin21.0) \
GNU ***bash***, version 5.1.16(1)-release (x86_64-apple-darwin21.1.0) \
***PowerShell*** 7.2.6
***PowerShell*** 7.2.6

### Google Cloud Storage support

@@ -687,6 +687,14 @@ s5cmd --numworkers 10 cp --concurrency 10 '/Users/foo/bar/*' s3://mybucket/foo/b

If you have a few, large files to download, setting `--numworkers` to a very high value will not affect download speed. In this scenario setting `--concurrency` to a higher value may have a better impact on the download speed.

### range

`range` is a `cp` command option that downloads only a specific byte range of the source object. The value is passed through the AWS Go SDK, which sets the [Range header](https://www.rfc-editor.org/rfc/rfc9110.html#name-range) on the GET request. Passing `--range` to `cp` overrides any `--concurrency` or `--part-size` arguments: a single thread downloads the single part specified by the byte range.

```
s5cmd cp --range bytes=500-999 's3://mybucket/foo/bar/file.txt' partialFile.txt
```
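
To make the inclusive semantics of the range value concrete, here is a minimal sketch in Go that parses a `bytes=START-END` value and computes how many bytes it selects. `parseByteRange` and `rangeLength` are hypothetical helpers written for illustration; they are not part of `s5cmd` or the AWS SDK, and only the closed `START-END` form is handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseByteRange parses a closed range of the form "bytes=START-END" and
// returns the inclusive start and end offsets. The open-ended ("bytes=500-")
// and suffix ("bytes=-500") forms from RFC 9110 are deliberately rejected
// to keep this sketch short.
func parseByteRange(spec string) (start, end int64, err error) {
	const prefix = "bytes="
	if !strings.HasPrefix(spec, prefix) {
		return 0, 0, fmt.Errorf("range %q: missing %q prefix", spec, prefix)
	}
	parts := strings.SplitN(strings.TrimPrefix(spec, prefix), "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("range %q: expected START-END", spec)
	}
	start, err = strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("range %q: bad start: %w", spec, err)
	}
	end, err = strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("range %q: bad end: %w", spec, err)
	}
	if end < start {
		return 0, 0, fmt.Errorf("range %q: end is before start", spec)
	}
	return start, end, nil
}

// rangeLength returns the number of bytes selected by an inclusive range.
func rangeLength(start, end int64) int64 {
	return end - start + 1
}

func main() {
	start, end, err := parseByteRange("bytes=500-999")
	if err != nil {
		panic(err)
	}
	// Both offsets are inclusive, so 500-999 selects 500 bytes.
	fmt.Printf("start=%d end=%d length=%d\n", start, end, rangeLength(start, end))
}
```

Because both offsets are inclusive, the `bytes=500-999` example above selects 500 bytes, not 499.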

## Benchmarks
Some benchmarks regarding the performance of `s5cmd` are introduced below. For more
details refer to this [post](https://medium.com/@joshua_robinson/s5cmd-for-high-performance-object-storage-7071352cc09d)
2 changes: 1 addition & 1 deletion command/cat.go
@@ -150,7 +150,7 @@ func (c Cat) processObjects(ctx context.Context, client *storage.S3, objectChan

func (c Cat) processSingleObject(ctx context.Context, client *storage.S3, url *url.URL) error {
buf := orderedwriter.New(os.Stdout)
_, err := client.Get(ctx, url, buf, c.concurrency, c.partSize)
_, err := client.Get(ctx, url, buf, c.concurrency, c.partSize, nil)
return err
}

11 changes: 10 additions & 1 deletion command/cp.go
@@ -118,6 +118,9 @@ Examples:

24. Pass arbitrary metadata to the object during upload or copy
> s5cmd {{.HelpName}} --metadata "camera=Nixon D750" --metadata "imageSize=6032x4032" flowers.png s3://bucket/prefix/flowers.png

25. Copy only a specific byte range out of an S3 object.
> s5cmd {{.HelpName}} --range bytes=500-999 s3://bucket/prefix/object .
`

func NewSharedFlags() []cli.Flag {
@@ -252,6 +255,10 @@ func NewCopyCommandFlags() []cli.Flag {
Name: "version-id",
Usage: "use the specified version of an object",
},
&cli.StringFlag{
Name: "range",
Usage: "defines range header for target object, e.g. --range bytes=0-100",
},
&cli.BoolFlag{
Name: "show-progress",
Aliases: []string{"sp"},
@@ -320,6 +327,7 @@ type Copy struct {
contentType string
contentEncoding string
contentDisposition string
contentRange string
metadata map[string]string
metadataDirective string
showProgress bool
@@ -398,6 +406,7 @@ func NewCopy(c *cli.Context, deleteSource bool) (*Copy, error) {
contentType: c.String("content-type"),
contentEncoding: c.String("content-encoding"),
contentDisposition: c.String("content-disposition"),
contentRange: c.String("range"),
metadata: metadata,
metadataDirective: c.String("metadata-directive"),
showProgress: c.Bool("show-progress"),
@@ -665,7 +674,7 @@ func (c Copy) doDownload(ctx context.Context, srcurl *url.URL, dsturl *url.URL)
}

writer := newCountingReaderWriter(file, c.progressbar)
size, err := srcClient.Get(ctx, srcurl, writer, c.concurrency, c.partSize)
size, err := srcClient.Get(ctx, srcurl, writer, c.concurrency, c.partSize, &c.contentRange)
file.Close()

if err != nil {
6 changes: 5 additions & 1 deletion storage/s3.go
@@ -600,13 +600,14 @@ func (s *S3) Presign(ctx context.Context, from *url.URL, expire time.Duration) (

// Get is a multipart download operation which downloads S3 objects into any
// destination that implements io.WriterAt interface.
// Makes a single 'GetObject' call if 'concurrency' is 1 and ignores 'partSize'.
// Makes a single 'GetObject' call if 'concurrency' is 1 or contentRange is not nil, ignoring 'partSize'.
func (s *S3) Get(
ctx context.Context,
from *url.URL,
to io.WriterAt,
concurrency int,
partSize int64,
contentRange *string, // optional
) (int64, error) {
if s.dryRun {
return 0, nil
@@ -620,6 +621,9 @@ func (s *S3) Get(
if from.VersionID != "" {
input.VersionId = aws.String(from.VersionID)
}
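// cp passes a pointer to its flag value, which is "" when --range is unset,
// so an empty string is treated the same as nil.
if contentRange != nil && *contentRange != "" {
input.Range = aws.String(*contentRange)
}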

return s.downloader.DownloadWithContext(ctx, to, input, func(u *s3manager.Downloader) {
image

Do you think I should use the Download() function from your screenshot instead of s.downloader.DownloadWithContext(...), given that it does the same thing as the code at line 624 of s3.go?

u.PartSize = partSize
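
A note on the optional `contentRange` parameter: `cp.go` passes `&c.contentRange`, which is never nil, so when `--range` is not given `Get` receives a pointer to an empty string rather than nil. One possible way to keep "unset" and "set" distinct is to normalize the flag value at the call site. The sketch below assumes a hypothetical helper `rangeOrNil` (not part of this PR or of the AWS SDK), and `describe` only stands in for the nil check inside `Get`.

```go
package main

import "fmt"

// rangeOrNil is a hypothetical helper: it maps an unset flag value
// (the empty string) to nil, and any other value to a pointer to it.
func rangeOrNil(flagValue string) *string {
	if flagValue == "" {
		return nil
	}
	return &flagValue
}

// describe stands in for the nil check in Get: it reports which
// Range header, if any, would be attached to the request.
func describe(contentRange *string) string {
	if contentRange == nil {
		return "no Range header"
	}
	return "Range: " + *contentRange
}

func main() {
	fmt.Println(describe(rangeOrNil("")))              // flag not given
	fmt.Println(describe(rangeOrNil("bytes=500-999"))) // flag given
}
```

With this shape, the call in `doDownload` could pass `rangeOrNil(c.contentRange)` instead of `&c.contentRange`, and the doc comment on `Get` ("contentRange is not nil") would describe the actual behavior.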