Unpreserved order for qsv stats --everything -> qsv jsonp <filepath|stdin> #1922
Comments
This is a head-scratcher @rzmk... does this only happen with The way stats writes out the record is in a fixed order, even with |
Digging further, when polars reads the JSON into a DataFrame, it adds the JSON key-value pairs to a HashMap, which does not guarantee any iteration order, unlike an IndexMap, which preserves insertion order.
Relevant polars issue - pola-rs/polars#14415 (comment) |
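To make the HashMap vs. IndexMap point concrete, here is a minimal std-only sketch. `OrderedMap` is a hypothetical stand-in for the insertion-ordered behaviour that a crate like `indexmap` provides; it is not what polars or qsv actually use internally.

```rust
use std::collections::HashMap;

/// A tiny insertion-ordered map: the Vec keeps the order, the HashMap gives fast lookup.
/// Illustrative only (hypothetical type, not part of polars or qsv).
struct OrderedMap {
    entries: Vec<(String, String)>,
    index: HashMap<String, usize>,
}

impl OrderedMap {
    fn new() -> Self {
        OrderedMap { entries: Vec::new(), index: HashMap::new() }
    }

    /// Insert or overwrite a key. A new key is appended; an existing key keeps its slot.
    fn insert(&mut self, key: &str, value: &str) {
        let existing = self.index.get(key).copied();
        match existing {
            Some(pos) => self.entries[pos].1 = value.to_string(),
            None => {
                self.index.insert(key.to_string(), self.entries.len());
                self.entries.push((key.to_string(), value.to_string()));
            }
        }
    }

    /// Keys in the order they were first inserted.
    fn keys(&self) -> Vec<&str> {
        self.entries.iter().map(|(k, _)| k.as_str()).collect()
    }
}

fn main() {
    let columns = ["field", "type", "is_ascii", "sum", "min", "max"];

    let mut unordered: HashMap<&str, &str> = HashMap::new();
    let mut ordered = OrderedMap::new();
    for c in columns.iter() {
        unordered.insert(*c, "");
        ordered.insert(c, "");
    }

    // HashMap iteration order is unspecified and may differ between runs.
    println!("HashMap keys:    {:?}", unordered.keys().collect::<Vec<_>>());
    // The Vec-backed map always reports keys in insertion order.
    println!("OrderedMap keys: {:?}", ordered.keys());
}
```

The more columns there are, the less likely a HashMap is to come out in the original order by chance, which would fit with the wider `--everything` output showing the problem more visibly.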
Not sure about other qsv commands. The column order is usually random for me with |
Additional related issue - pola-rs/polars#3823 |
One potential idea for now though I'm not sure if it would work if implemented is providing a |
BTW, I cannot reproduce the error on macOS... Regardless, if it happens on Windows, we still have to solve it. I like your idea... calling |
Tried it with hard coding the types of I had to get the types of
    // Imports this snippet relies on (the polars `json` feature is assumed).
    use std::io::Read;
    use polars::prelude::*;

    fn df_from_stdin() -> PolarsResult<DataFrame> {
        // Hard-coded schema: the field order here fixes the column order of the DataFrame.
        let schema = Schema::from_iter(vec![
            Field::new("field", DataType::String),
            Field::new("type", DataType::String),
            Field::new("is_ascii", DataType::String),
            Field::new("sum", DataType::Float64),
            Field::new("min", DataType::String),
            Field::new("max", DataType::String),
            Field::new("range", DataType::Float64),
            Field::new("min_length", DataType::Int32),
            Field::new("max_length", DataType::Int32),
            Field::new("mean", DataType::Float64),
            Field::new("sem", DataType::Float64),
            Field::new("stddev", DataType::Float64),
            Field::new("variance", DataType::Float64),
            Field::new("cv", DataType::Float64),
            Field::new("nullcount", DataType::Int32),
            Field::new("max_precision", DataType::Int32),
            Field::new("sparsity", DataType::Int32),
            Field::new("mad", DataType::Float64),
            Field::new("lower_outer_fence", DataType::Float64),
            Field::new("lower_inner_fence", DataType::Float64),
            Field::new("q1", DataType::Float64),
            Field::new("q2_median", DataType::Float64),
            Field::new("q3", DataType::Float64),
            Field::new("iqr", DataType::Float64),
            Field::new("upper_inner_fence", DataType::Float64),
            Field::new("upper_outer_fence", DataType::Float64),
            Field::new("skewness", DataType::Float64),
            Field::new("cardinality", DataType::Int32),
            Field::new("mode", DataType::String),
            Field::new("mode_count", DataType::Int32),
            Field::new("mode_occurrences", DataType::Int32),
            Field::new("antimode", DataType::String),
            Field::new("antimode_count", DataType::Int32),
            Field::new("antimode_occurrences", DataType::Int32),
        ]);

        // Read all of stdin into an in-memory buffer
        let mut buffer: Vec<u8> = Vec::new();
        let stdin = std::io::stdin();
        let mut stdin_handle = stdin.lock();
        stdin_handle.read_to_end(&mut buffer)?;
        drop(stdin_handle);

        // Parse the buffered JSON using the explicit schema instead of inferring one
        JsonReader::new(Box::new(std::io::Cursor::new(buffer)))
            .with_schema(schema.into())
            .finish()
    }
Now to figure out how to implement this for arbitrary JSON data.
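For arbitrary JSON, one possible direction is to take the key order from the first record and build the schema in that order instead of hard-coding it. The sketch below is std-only and purely illustrative: `first_object_keys` is a hypothetical helper, it assumes valid JSON, and it only recovers key names. The types would still have to be determined separately.

```rust
/// Returns the top-level keys of the first JSON object found in `json`,
/// in the order they appear in the text.
/// Hypothetical helper: assumes valid JSON and keeps escape sequences verbatim.
fn first_object_keys(json: &str) -> Vec<String> {
    let mut keys = Vec::new();
    let mut depth = 0usize; // nesting depth relative to the first object
    let mut started = false; // true once the first '{' has been seen
    let mut expect_key = false; // true when the next string at depth 1 is a key
    let mut chars = json.chars();

    while let Some(c) = chars.next() {
        match c {
            '"' => {
                // Consume the whole string literal so its contents are never
                // mistaken for structural characters.
                let mut s = String::new();
                while let Some(ch) = chars.next() {
                    match ch {
                        '\\' => {
                            s.push(ch);
                            if let Some(escaped) = chars.next() {
                                s.push(escaped);
                            }
                        }
                        '"' => break,
                        _ => s.push(ch),
                    }
                }
                if started && depth == 1 && expect_key {
                    keys.push(s);
                    expect_key = false;
                }
            }
            '{' => {
                if started {
                    depth += 1;
                } else {
                    started = true;
                    depth = 1;
                    expect_key = true;
                }
            }
            '[' if started => depth += 1,
            ']' if started => depth = depth.saturating_sub(1),
            '}' if started => {
                depth = depth.saturating_sub(1);
                if depth == 0 {
                    break; // end of the first object
                }
            }
            ',' if started && depth == 1 => expect_key = true,
            _ => {}
        }
    }
    keys
}

fn main() {
    let json = r#"[{"field":"a","type":"String","nested":{"x":1},"sum":null},
                   {"sum":2,"field":"b"}]"#;
    println!("{:?}", first_object_keys(json));
}
```

In a real implementation a proper JSON parser that preserves key order (for example serde_json with its `preserve_order` feature) would be the safer choice over a hand-rolled scanner like this.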
Great! Your proposed solution to call To make it performant, make sure to call This will auto-create an index, parallelizing |
@jqnatividad Something I'm wondering about regarding the processing steps:
The part I'm wondering about is step 2 how do we intend to run I think I'll take a look at serde for this. |
Hmm... what about using Polars' JsonReader to read in the JSON, and then turning around and saving it to JSONL using Polars, which, if I'm not mistaken, uses IndexMap by default... The WDYT?
Note that this method has an issue if it tries to infer the data types from the first dictionary alone: the first dictionary is not necessarily representative of the values in the rest of the data. For example, the first dict could have a field with a value of |
As an alternative, we could resort to another library (which may remove the need for polars altogether): https://github.com/vtselfa/json-objects-to-csv
But it looks like it sorts alphabetically.
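On the type inference concern, a rough sketch of what looking beyond the first dictionary could mean: classify every value of a column and widen the type as needed. Everything here (`ColType`, `classify`, `merge`, `infer_column`) is hypothetical and simplified. It is not how polars infers schemas, and it assumes, as an example, that the unrepresentative first value is `null`.

```rust
/// Coarse column types, from "nothing seen yet" to the most general.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Null,
    Int,
    Float,
    Str,
}

/// Classify a single raw scalar token (hypothetical, simplified rules).
fn classify(token: &str) -> ColType {
    if token == "null" {
        ColType::Null
    } else if token.parse::<i64>().is_ok() {
        ColType::Int
    } else if token.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Str
    }
}

/// Widen `a` so that it can also hold values of type `b`.
fn merge(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Null, t) | (t, Null) => t,
        (Int, Int) => Int,
        (Int, Float) | (Float, Int) | (Float, Float) => Float,
        _ => Str,
    }
}

/// Infer a column type from all of its values, not just the first one.
fn infer_column(values: &[&str]) -> ColType {
    values.iter().map(|v| classify(v)).fold(ColType::Null, merge)
}

fn main() {
    let sum_column = ["null", "3", "4.5"];

    // Looking only at the first record says nothing useful about the column.
    println!("first value only: {:?}", classify(sum_column[0]));
    // Scanning every record widens Null -> Int -> Float.
    println!("all values:       {:?}", infer_column(&sum_column));
}
```

The trade-off is an extra pass over the data (or a bounded sample of it), which is the cost of not trusting the first record.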
OK @rzmk, LGTM! Should I hold off releasing 0.129.0? |
I do have a local implementation but there are a few unwraps. I'll make a PR then feel free to decide. |
See #1924. |
Describe the bug
Generally qsv jsonp has been preserving the order of the input I provide, but when I either provide the output file of qsv stats --everything or pipe it in from the command into qsv jsonp, the order isn't preserved.
To Reproduce
Run qsv stats <filepath> --everything | qsv slice --json | qsv jsonp.
Expected behavior
Preserve the order of the columns/keys or provide a flag to do so.
Screenshots/Backtrace/Sample Data
Desktop (please complete the following information):
Additional context
I tried adding println!("{:?}") statements for the dataframe that is generated at around line 110, along with the actual output:
The qsv stats --everything dataframe itself has unordered columns, while the qsv stats dataframe is ordered from stdin.
This is the same behavior when using a file path instead:
Here's stats.json:
Here's stats.everything.json: