
Allow to provide compressed input #49

Open
REASY opened this issue Mar 22, 2021 · 3 comments

Comments

@REASY

REASY commented Mar 22, 2021

It would be great to have the ability to provide a compressed input file in GZIP/ZIP format.

@binakot
Contributor

binakot commented Apr 29, 2021

This is a good issue.

Currently timescaledb-parallel-copy just splits the input file into batches of rows: https://github.com/timescale/timescaledb-parallel-copy/blob/master/cmd/timescaledb-parallel-copy/main.go#L195. Implementing this feature would require partial decompression and a way to determine where each batch of rows begins and ends.

Fully decompressing the file up front would not help, given that the file may not fit into RAM. Also, without such a mechanism, parallelism will not work, because each worker will not know which piece of data it needs to extract.

@jchampio
Contributor

Is an unzip pipeline helpful enough? E.g.

$ gunzip -c my-data.csv.gz | timescaledb-parallel-copy ...

This will unzip only enough to fill the OS pipe buffer and then wait for the utility to read more. Or is there a particular reason you'd like the utility to handle this internally?

@leonardochen

For reference, the command that works is:

gunzip -c csv.gz | tail -n+2 | timescaledb-parallel-copy ...

-c writes the decompressed output to stdout
tail -n+2 skips the first line (the header row)
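For ZIP archives (the other format mentioned in the issue), a similar pipeline should work with unzip -p, which extracts archive members to stdout; data.zip here is a placeholder name:

```shell
# unzip -p writes the archive contents to stdout instead of extracting to disk;
# tail -n +2 again drops the header row before the copy utility reads the rows.
unzip -p data.zip | tail -n +2 | timescaledb-parallel-copy ...
```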
