A command-line tool for converting Parquet to newline-delimited JSON.
It uses the excellent Apache Parquet Official Native Rust Implementation.
Install from crates.io and execute from the command line, e.g.:
$ cargo install parquet2json
$ parquet2json --help
USAGE:
parquet2json [OPTIONS] <FILE>
ARGS:
<FILE> Location of Parquet input file (path, HTTP or S3 URL)
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-l, --limit <NUMBER> Maximum number of rows to output
-o, --offset <NUMBER> Starts outputting from this row
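The --limit and --offset options can be combined to read a file in pages. The following is a minimal sketch: `page_args` is a hypothetical helper, not part of parquet2json, and it assumes that --offset counts rows from 0, which the help text does not state.

```shell
# Hypothetical helper: build the options for page N (1-based) of SIZE rows.
# Assumption: --offset is zero-based, so page 1 starts at offset 0.
page_args() {
  page=$1
  size=$2
  offset=$(( (page - 1) * size ))
  printf '%s\n' "--offset $offset --limit $size"
}

# Intended use (left commented out so the sketch runs without the tool installed):
# parquet2json $(page_args 3 100) ./myfile.pq
```

Under that assumption, `page_args 3 100` expands to `--offset 200 --limit 100`. If the first row turns out to be numbered 1, the offset arithmetic needs adjusting.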
For S3 URLs, credentials are provided as per the standard AWS toolchain, i.e. via environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), the AWS credentials file, or an IAM ECS container/instance profile.
The default AWS region must be set via an environment variable (AWS_DEFAULT_REGION) or in the AWS credentials file, and it must match the region the bucket is located in.
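As an illustration of the environment-variable route, the settings might look like the sketch below. The key values are placeholders to replace with your own, and us-east-1 is used only because the example bucket URL in this README points at that region.

```shell
# Placeholder credentials: substitute your own values.
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

# Must match the region of the bucket being read (us-east-1 is an example).
export AWS_DEFAULT_REGION="us-east-1"

# With these set, an S3 URL can be passed as the input file, e.g.:
# parquet2json --limit 5 s3://my-bucket/path/to/file.parquet   # hypothetical bucket and path
```

Exporting the variables in the shell session keeps them out of the command line itself. The credentials file or an instance profile avoids putting secrets in the environment at all.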
Use it to stream output to files and other tools such as grep and jq.
$ parquet2json ./myfile.pq > output.ndjson
$ parquet2json ./myfile.pq | jq 'select(.level==3) | .id'
$ parquet2json s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
$ parquet2json https://amazon-reviews-pds.s3.us-east-1.amazonaws.com/parquet/product_category%3DGift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
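Because the output is newline-delimited JSON, each row occupies one line, so line-oriented tools work on it directly. The sketch below uses `sample_rows`, a hypothetical stand-in that prints made-up rows, so that it runs without a Parquet file. It also assumes compact JSON with no spaces around the colon, which may not match the tool's actual formatting.

```shell
# Stand-in for parquet2json output: three made-up rows, one JSON object per line.
sample_rows() {
  printf '%s\n' \
    '{"id":1,"level":3}' \
    '{"id":2,"level":1}' \
    '{"id":3,"level":3}'
}

# One row per line, so counting lines counts rows.
row_count=$(sample_rows | wc -l | tr -d ' ')

# grep does a plain text match here; it is not JSON-aware.
level3_count=$(sample_rows | grep -c '"level":3')

echo "rows=$row_count level3=$level3_count"
```

For anything beyond simple text matching, a JSON-aware filter such as the jq example above is the more robust choice, since it does not depend on how the JSON is spaced.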