
Unpreserved order for qsv stats --everything -> qsv jsonp <filepath|stdin> #1922

Closed

rzmk opened this issue Jun 25, 2024 · 18 comments

@rzmk (Collaborator) commented Jun 25, 2024

Describe the bug
Generally qsv jsonp preserves the order of the input I provide, but when I either provide the output file of qsv stats --everything or pipe that command's output into qsv jsonp, the order isn't preserved.

To Reproduce
Run qsv stats <filepath> --everything | qsv slice --json | qsv jsonp.

Expected behavior
Preserve the order of the columns/keys or provide a flag to do so.

Screenshots/Backtrace/Sample Data
[screenshot]

Desktop (please complete the following information):

  • OS: Windows 10
  • qsv Version: master (between v0.128.0 and v0.129.0) on commit 138a649

Additional context
I added println!("{:?}") statements for the dataframe that is generated around line 110, along with the actual output:

[screenshot]

The qsv stats --everything dataframe itself has unordered columns, while the plain qsv stats dataframe from stdin is ordered.

This is the same behavior when using a file path instead:

[screenshot]

Here's stats.json:

[{"field":"fruit","type":"String","is_ascii":"true","sum":null,"min":"apple","max":"strawberry","range":null,"min_length":"5","max_length":"10","mean":null,"sem":null,"stddev":null,"variance":null,"cv":null,"nullcount":"0","max_precision":null,"sparsity":"0"},{"field":"price","type":"Float","is_ascii":null,"sum":"7","min":"1.5","max":"3.0","range":"1.5","min_length":"4","max_length":"4","mean":"2.3333","sem":"0.36","stddev":"0.6236","variance":"0.3889","cv":"26.7261","nullcount":"0","max_precision":"1","sparsity":"0"}]

Here's stats.everything.json:

[{"field":"fruit","type":"String","is_ascii":"true","sum":null,"min":"apple","max":"strawberry","range":null,"min_length":"5","max_length":"10","mean":null,"sem":null,"stddev":null,"variance":null,"cv":null,"nullcount":"0","max_precision":null,"sparsity":"0","mad":null,"lower_outer_fence":null,"lower_inner_fence":null,"q1":null,"q2_median":null,"q3":null,"iqr":null,"upper_inner_fence":null,"upper_outer_fence":null,"skewness":null,"cardinality":"3","mode":null,"mode_count":"0","mode_occurrences":"0","antimode":"*ALL","antimode_count":"0","antimode_occurrences":"1"},{"field":"price","type":"Float","is_ascii":null,"sum":"7","min":"1.5","max":"3.0","range":"1.5","min_length":"4","max_length":"4","mean":"2.3333","sem":"0.36","stddev":"0.6236","variance":"0.3889","cv":"26.7261","nullcount":"0","max_precision":"1","sparsity":"0","mad":"0.5","lower_outer_fence":"-3","lower_inner_fence":"-0.75","q1":"1.5","q2_median":"2.5","q3":"3","iqr":"1.5","upper_inner_fence":"5.25","upper_outer_fence":"7.5","skewness":"-0.3333","cardinality":"3","mode":null,"mode_count":"0","mode_occurrences":"0","antimode":"*ALL","antimode_count":"0","antimode_occurrences":"1"}]
@jqnatividad (Collaborator)

This is a head-scratcher @rzmk... does this only happen with stats or does it happen with any other qsv command?

The way stats writes out the record is in a fixed order, even with --everything. Also, is the column order random or is it always the same order?

@jqnatividad (Collaborator)

Digging further: when polars reads the JSON into a DataFrame, it adds the JSON key-value pairs into a HashMap, which is not necessarily ordered, unlike an IndexMap.
Looking to see if it can be forced to preserve order...

@jqnatividad (Collaborator)

Relevant polars issue - pola-rs/polars#14415 (comment)

@rzmk (Collaborator, Author) commented Jun 26, 2024

Not sure about other qsv commands. The column order is usually random for me with --everything.

@jqnatividad (Collaborator)

Additional related issue - pola-rs/polars#3823

@rzmk (Collaborator, Author) commented Jun 26, 2024

One potential idea for now (though I'm not sure it would work if implemented) is providing a Schema to the JsonReader in the ordered format. This could be done with qsv stats --typesonly, since the schema also needs the type for each column, converted to the correct Polars dtype. Again, not sure if this would work as intended.

@jqnatividad (Collaborator)

BTW, I cannot reproduce the error on macOS...

Regardless, if it happens on Windows, we still have to solve it.

I like your idea... calling qsv stats --typesonly internally from jsonp should work. You may want to look at tojsonl as it also calls stats internally to infer JSON data types to use when generating JSONL from CSV.

@rzmk (Collaborator, Author) commented Jun 26, 2024

[screenshot]

I tried it with hard-coded types for qsv stats --everything just to make sure, and it's working well; it looks like the Schema uses an IndexMap:

[screenshot]

I had to get the types of the qsv stats --everything output for it. For example:

    use std::io::Read;

    use polars::prelude::*;

    fn df_from_stdin() -> PolarsResult<DataFrame> {
        // Fixed schema matching the column order of `qsv stats --everything`
        let schema = Schema::from_iter(vec![
            Field::new("field", DataType::String),
            Field::new("type", DataType::String),
            Field::new("is_ascii", DataType::String),
            Field::new("sum", DataType::Float64),
            Field::new("min", DataType::String),
            Field::new("max", DataType::String),
            Field::new("range", DataType::Float64),
            Field::new("min_length", DataType::Int32),
            Field::new("max_length", DataType::Int32),
            Field::new("mean", DataType::Float64),
            Field::new("sem", DataType::Float64),
            Field::new("stddev", DataType::Float64),
            Field::new("variance", DataType::Float64),
            Field::new("cv", DataType::Float64),
            Field::new("nullcount", DataType::Int32),
            Field::new("max_precision", DataType::Int32),
            Field::new("sparsity", DataType::Int32),
            Field::new("mad", DataType::Float64),
            Field::new("lower_outer_fence", DataType::Float64),
            Field::new("lower_inner_fence", DataType::Float64),
            Field::new("q1", DataType::Float64),
            Field::new("q2_median", DataType::Float64),
            Field::new("q3", DataType::Float64),
            Field::new("iqr", DataType::Float64),
            Field::new("upper_inner_fence", DataType::Float64),
            Field::new("upper_outer_fence", DataType::Float64),
            Field::new("skewness", DataType::Float64),
            Field::new("cardinality", DataType::Int32),
            Field::new("mode", DataType::String),
            Field::new("mode_count", DataType::Int32),
            Field::new("mode_occurrences", DataType::Int32),
            Field::new("antimode", DataType::String),
            Field::new("antimode_count", DataType::Int32),
            Field::new("antimode_occurrences", DataType::Int32),
        ]);
        // Create a buffer in memory for stdin
        let mut buffer: Vec<u8> = Vec::new();
        let stdin = std::io::stdin();
        let mut stdin_handle = stdin.lock();
        stdin_handle.read_to_end(&mut buffer)?;
        drop(stdin_handle);
        JsonReader::new(Box::new(std::io::Cursor::new(buffer)))
            .with_schema(schema.into())
            .finish()
    }

Now to figure out how to implement this for arbitrary JSON data.

@jqnatividad (Collaborator) commented Jun 26, 2024

Great! Your proposed solution to call stats --typesonly should work then!

To make it performant, make sure to call stats the same way you do in describegpt, with a --cache-threshold of -5.

This auto-creates an index (parallelizing stats) and auto-deletes the index and stats cache afterwards.

@rzmk (Collaborator, Author) commented Jun 26, 2024

@jqnatividad Something I'm wondering about regarding the processing steps:

  1. Get JSON input data.
  2. Somehow run qsv stats --typesonly on this JSON input data to get the field,type CSV data.
  3. Make a Vec<Field> of each non-header row with the field name and the DataType which we get by running a match on the type.
  4. Make a Schema::from_iter(fields) then add that as with_schema(schema.into()) to our JSONReader.

The part I'm wondering about is step 2: how do we intend to run qsv stats --typesonly on the JSON input data?
If we assume by default that the data is in the format [{key: value, ...}, ..., {key: value, ...}], then we could figure out how to parse the first dictionary in that list (not sure what to do for this step though), since we assume every dictionary in the list has the same keys but may differ in values.

I think I'll take a look at serde for this.
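
Steps 2–4 could look roughly like the sketch below. PolarsDtype is a local stand-in enum for polars' DataType, parse_typesonly is a hypothetical helper, and the mapping of stats type names to dtypes is an assumption — treat this as an illustration, not the actual implementation:

```rust
// Local stand-in for polars' `DataType`; only the variants the sketch needs.
#[derive(Debug, PartialEq)]
enum PolarsDtype {
    String,
    Int64,
    Float64,
    Boolean,
}

// Step 3 (part 1): map a type name from `qsv stats --typesonly` output to a
// dtype. Anything unrecognized is safest to read as a string.
fn stats_type_to_dtype(t: &str) -> PolarsDtype {
    match t {
        "Integer" => PolarsDtype::Int64,
        "Float" => PolarsDtype::Float64,
        "Boolean" => PolarsDtype::Boolean,
        _ => PolarsDtype::String,
    }
}

// Steps 2-3 (part 2): parse the `field,type` CSV into ordered (name, dtype)
// pairs, which a `Schema::from_iter` equivalent could consume (step 4).
fn parse_typesonly(csv: &str) -> Vec<(String, PolarsDtype)> {
    csv.lines()
        .skip(1) // skip the `field,type` header row
        .filter_map(|line| {
            let (name, ty) = line.split_once(',')?;
            Some((name.to_string(), stats_type_to_dtype(ty)))
        })
        .collect()
}

fn main() {
    let csv = "field,type\nfruit,String\nprice,Float";
    let fields = parse_typesonly(csv);
    assert_eq!(fields[0], ("fruit".to_string(), PolarsDtype::String));
    assert_eq!(fields[1], ("price".to_string(), PolarsDtype::Float64));
    println!("{fields:?}");
}
```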

@jqnatividad (Collaborator)

Hhmmm... what about using Polars' JsonReader to read in the JSON, and then turning around and saving it to JSONL using Polars, which, if I'm not mistaken, uses IndexMap by default...

The jsonp command essentially converts to JSONL, and then we can just use the existing jsonl command...

WDYT?

@rzmk (Collaborator, Author) commented Jun 26, 2024

I'm getting random column ordering with the method you mentioned (due to JsonReader, I think):

[screenshot]

Separate run piping into qsv jsonl:

[screenshot]

@rzmk (Collaborator, Author) commented Jun 26, 2024

The part I'm wondering about is step 2: how do we intend to run qsv stats --typesonly on the JSON input data?
If we assume by default that the data is in the format [{key: value, ...}, ..., {key: value, ...}], then we could figure out how to parse the first dictionary in that list (not sure what to do for this step though), since we assume every dictionary in the list has the same keys but may differ in values.

Note that this method has a problem if we try to infer the data types from the first dictionary alone, since that dictionary isn't representative of the values in the rest of the data: for example, the first dict could have a field with a value of null while the next dict has a value of 1.

@rzmk (Collaborator, Author) commented Jun 26, 2024

As an alternative, we could resort to another library (which may remove the need for polars altogether): https://github.com/vtselfa/json-objects-to-csv


The headers are sorted alphabetically and are the union of all the keys in all the objects in the input after they are flattened.

But it looks like it sorts alphabetically, and I'm not sure how to preserve the original order with it either. We could try to retain the headers starting from the first dict's keys, then use qsv select to reorder the columns. I also made a new issue in the crate's repository regarding this: vtselfa/json-objects-to-csv#10
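
The reorder step could be sketched with std only. reorder_indices and reorder_row are hypothetical helpers showing the permutation that a qsv select call would apply, assuming the desired order comes from the first object's keys:

```rust
// Given headers sorted alphabetically by json-objects-to-csv and the desired
// order (the first object's keys), compute the column permutation.
fn reorder_indices(sorted: &[&str], desired: &[&str]) -> Vec<usize> {
    desired
        .iter()
        .filter_map(|want| sorted.iter().position(|have| have == want))
        .collect()
}

// Apply the permutation to one CSV row.
fn reorder_row(row: &[&str], indices: &[usize]) -> Vec<String> {
    indices.iter().map(|&i| row[i].to_string()).collect()
}

fn main() {
    let sorted = ["field", "max", "min", "type"]; // alphabetical CSV output
    let desired = ["field", "type", "min", "max"]; // original key order
    let idx = reorder_indices(&sorted, &desired);
    assert_eq!(idx, vec![0, 3, 2, 1]);

    let row = ["fruit", "strawberry", "apple", "String"];
    assert_eq!(
        reorder_row(&row, &idx),
        vec!["fruit", "String", "apple", "strawberry"]
    );
    println!("{idx:?}");
}
```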

@jqnatividad (Collaborator)

OK @rzmk, LGTM!

Should I hold off releasing 0.129.0?

@rzmk (Collaborator, Author) commented Jun 26, 2024

I do have a local implementation, but there are a few unwraps. I'll make a PR, then feel free to decide.

@rzmk (Collaborator, Author) commented Jun 27, 2024

See #1924.

@jqnatividad (Collaborator)

Merged #1924, did some tweaks, and we can start prepping again for 0.129.0.

I'll just need to convert the select call from spawning a separate qsv process to just calling it internally before publishing...

Thanks @rzmk !
