
Unpreserved order for qsv stats --everything -> qsv jsonp <filepath|stdin> #1922

Closed

rzmk opened this issue Jun 25, 2024 · 18 comments

@rzmk (Collaborator) commented Jun 25, 2024

Describe the bug
Generally qsv jsonp preserves the order of the input I provide, but when I either provide the output file of qsv stats --everything or pipe that command's output into qsv jsonp, the order isn't preserved.

To Reproduce
Run qsv stats <filepath> --everything | qsv slice --json | qsv jsonp.

Expected behavior
Preserve the order of the columns/keys or provide a flag to do so.

Screenshots/Backtrace/Sample Data
[screenshot]

Desktop (please complete the following information):

  • OS: Windows 10
  • qsv Version: master (between v0.128.0 and v0.129.0) on commit 138a649

Additional context
I added println!("{:?}") statements for the dataframe that is generated around line 110, along with the actual output:

[screenshot]

The qsv stats --everything dataframe itself has unordered columns, while the plain qsv stats dataframe from stdin is ordered.

This is the same behavior when using a file path instead:

[screenshot]

Here's stats.json:

[{"field":"fruit","type":"String","is_ascii":"true","sum":null,"min":"apple","max":"strawberry","range":null,"min_length":"5","max_length":"10","mean":null,"sem":null,"stddev":null,"variance":null,"cv":null,"nullcount":"0","max_precision":null,"sparsity":"0"},{"field":"price","type":"Float","is_ascii":null,"sum":"7","min":"1.5","max":"3.0","range":"1.5","min_length":"4","max_length":"4","mean":"2.3333","sem":"0.36","stddev":"0.6236","variance":"0.3889","cv":"26.7261","nullcount":"0","max_precision":"1","sparsity":"0"}]

Here's stats.everything.json:

[{"field":"fruit","type":"String","is_ascii":"true","sum":null,"min":"apple","max":"strawberry","range":null,"min_length":"5","max_length":"10","mean":null,"sem":null,"stddev":null,"variance":null,"cv":null,"nullcount":"0","max_precision":null,"sparsity":"0","mad":null,"lower_outer_fence":null,"lower_inner_fence":null,"q1":null,"q2_median":null,"q3":null,"iqr":null,"upper_inner_fence":null,"upper_outer_fence":null,"skewness":null,"cardinality":"3","mode":null,"mode_count":"0","mode_occurrences":"0","antimode":"*ALL","antimode_count":"0","antimode_occurrences":"1"},{"field":"price","type":"Float","is_ascii":null,"sum":"7","min":"1.5","max":"3.0","range":"1.5","min_length":"4","max_length":"4","mean":"2.3333","sem":"0.36","stddev":"0.6236","variance":"0.3889","cv":"26.7261","nullcount":"0","max_precision":"1","sparsity":"0","mad":"0.5","lower_outer_fence":"-3","lower_inner_fence":"-0.75","q1":"1.5","q2_median":"2.5","q3":"3","iqr":"1.5","upper_inner_fence":"5.25","upper_outer_fence":"7.5","skewness":"-0.3333","cardinality":"3","mode":null,"mode_count":"0","mode_occurrences":"0","antimode":"*ALL","antimode_count":"0","antimode_occurrences":"1"}]
@jqnatividad (Collaborator)

This is a head-scratcher @rzmk... does this only happen with stats or does it happen with any other qsv command?

The way stats writes out the record is in a fixed order, even with --everything. Also, is the column order random or is it always the same order?

@jqnatividad (Collaborator)

Digging further: when polars reads the JSON into a DataFrame, it adds the JSON key-value pairs into a HashMap, which is not necessarily ordered, unlike an IndexMap.
Looking to see if it can be forced to preserve order...

@jqnatividad (Collaborator)

Relevant polars issue - pola-rs/polars#14415 (comment)

@rzmk (Collaborator, Author) commented Jun 26, 2024

Not sure about other qsv commands. The column order is usually random for me with --everything.

@jqnatividad (Collaborator)

Additional related issue - pola-rs/polars#3823

@rzmk (Collaborator, Author) commented Jun 26, 2024

One potential idea for now (though I'm not sure it would work if implemented) is providing a Schema to the JsonReader in the ordered format. This could be done with qsv stats --typesonly, since the schema also needs the type for each column, converted to the correct Polars dtype. Again, not sure if this would work as intended.

@jqnatividad (Collaborator)

BTW, I cannot reproduce the error on macOS...

Regardless, if it happens on Windows, we still have to solve it.

I like your idea... calling qsv stats --typesonly internally from jsonp should work. You may want to look at tojsonl as it also calls stats internally to infer JSON data types to use when generating JSONL from CSV.

@rzmk (Collaborator, Author) commented Jun 26, 2024

[screenshot]

I tried it with hard-coded types for qsv stats --everything just to make sure, and it's working well; it looks like the Schema uses an IndexMap:

[screenshot]

I had to get the types of the qsv stats --everything output for it. For example:

    use std::io::Read;

    use polars::prelude::*;

    fn df_from_stdin() -> PolarsResult<DataFrame> {
        // Fixed schema matching the column order of `qsv stats --everything`
        let schema = Schema::from_iter(vec![
            Field::new("field", DataType::String),
            Field::new("type", DataType::String),
            Field::new("is_ascii", DataType::String),
            Field::new("sum", DataType::Float64),
            Field::new("min", DataType::String),
            Field::new("max", DataType::String),
            Field::new("range", DataType::Float64),
            Field::new("min_length", DataType::Int32),
            Field::new("max_length", DataType::Int32),
            Field::new("mean", DataType::Float64),
            Field::new("sem", DataType::Float64),
            Field::new("stddev", DataType::Float64),
            Field::new("variance", DataType::Float64),
            Field::new("cv", DataType::Float64),
            Field::new("nullcount", DataType::Int32),
            Field::new("max_precision", DataType::Int32),
            Field::new("sparsity", DataType::Int32),
            Field::new("mad", DataType::Float64),
            Field::new("lower_outer_fence", DataType::Float64),
            Field::new("lower_inner_fence", DataType::Float64),
            Field::new("q1", DataType::Float64),
            Field::new("q2_median", DataType::Float64),
            Field::new("q3", DataType::Float64),
            Field::new("iqr", DataType::Float64),
            Field::new("upper_inner_fence", DataType::Float64),
            Field::new("upper_outer_fence", DataType::Float64),
            Field::new("skewness", DataType::Float64),
            Field::new("cardinality", DataType::Int32),
            Field::new("mode", DataType::String),
            Field::new("mode_count", DataType::Int32),
            Field::new("mode_occurrences", DataType::Int32),
            Field::new("antimode", DataType::String),
            Field::new("antimode_count", DataType::Int32),
            Field::new("antimode_occurrences", DataType::Int32),
        ]);
        // Create a buffer in memory for stdin
        let mut buffer: Vec<u8> = Vec::new();
        let stdin = std::io::stdin();
        let mut stdin_handle = stdin.lock();
        stdin_handle.read_to_end(&mut buffer)?;
        drop(stdin_handle);
        JsonReader::new(Box::new(std::io::Cursor::new(buffer)))
            .with_schema(schema.into())
            .finish()
    }

Now to figure out how to implement this for arbitrary JSON data.

@jqnatividad (Collaborator) commented Jun 26, 2024

Great! Your proposed solution to call stats --typesonly should work then!

To make it performant, make sure to call stats the same way you do in describegpt, with a --cache-threshold of -5.

This auto-creates an index (parallelizing stats) and auto-deletes the index and stats cache afterwards.

@rzmk (Collaborator, Author) commented Jun 26, 2024

@jqnatividad Something I'm wondering about regarding the processing steps:

  1. Get JSON input data.
  2. Somehow run qsv stats --typesonly on this JSON input data to get the field,type CSV data.
  3. Make a Vec<Field> of each non-header row with the field name and the DataType which we get by running a match on the type.
  4. Make a Schema::from_iter(fields) then add that as with_schema(schema.into()) to our JSONReader.

The part I'm wondering about is step 2: how do we intend to run qsv stats --typesonly on the JSON input data?
If we assume by default that the data is in the format [{key: value, ...}, ..., {key: value, ...}], then we could figure out how to parse the first dictionary in that list (not sure what to do for this step though), since we assume every dictionary in the list has the same keys but may differ in values.

I think I'll take a look at serde for this.
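
Steps 2–4 could look roughly like the sketch below. PolarsDtype is a local stand-in enum for polars' DataType, parse_typesonly is a hypothetical helper, and the mapping of stats type names to dtypes is an assumption — treat this as an illustration, not the actual implementation:

```rust
// Local stand-in for polars' `DataType`; only the variants the sketch needs.
#[derive(Debug, PartialEq)]
enum PolarsDtype {
    String,
    Int64,
    Float64,
    Boolean,
}

// Step 3 (part 1): map a type name from `qsv stats --typesonly` output to a
// dtype. Anything unrecognized is safest to read as a string.
fn stats_type_to_dtype(t: &str) -> PolarsDtype {
    match t {
        "Integer" => PolarsDtype::Int64,
        "Float" => PolarsDtype::Float64,
        "Boolean" => PolarsDtype::Boolean,
        _ => PolarsDtype::String,
    }
}

// Steps 2-3 (part 2): parse the `field,type` CSV into ordered (name, dtype)
// pairs, which a `Schema::from_iter` equivalent could consume (step 4).
fn parse_typesonly(csv: &str) -> Vec<(String, PolarsDtype)> {
    csv.lines()
        .skip(1) // skip the `field,type` header row
        .filter_map(|line| {
            let (name, ty) = line.split_once(',')?;
            Some((name.to_string(), stats_type_to_dtype(ty)))
        })
        .collect()
}

fn main() {
    let csv = "field,type\nfruit,String\nprice,Float";
    let fields = parse_typesonly(csv);
    assert_eq!(fields[0], ("fruit".to_string(), PolarsDtype::String));
    assert_eq!(fields[1], ("price".to_string(), PolarsDtype::Float64));
    println!("{fields:?}");
}
```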

@jqnatividad (Collaborator)

Hhmmm... what about using Polars' JsonReader to read in the JSON, and then turning around and saving it to JSONL using Polars, which, if I'm not mistaken, uses IndexMap by default...

The jsonp command essentially converts to JSONL, and then we can just use the existing jsonl command...

WDYT?

@rzmk (Collaborator, Author) commented Jun 26, 2024

I'm getting random column ordering with the method you mentioned (due to JsonReader, I think):

[screenshot]

Separate run piping into qsv jsonl:

[screenshot]

@rzmk (Collaborator, Author) commented Jun 26, 2024

The part I'm wondering about is step 2: how do we intend to run qsv stats --typesonly on the JSON input data?
If we assume by default that the data is in the format [{key: value, ...}, ..., {key: value, ...}], then we could figure out how to parse the first dictionary in that list (not sure what to do for this step though), since we assume every dictionary in the list has the same keys but may differ in values.

Note that this method has a problem if we try to infer the data types from the first dictionary alone, since that dictionary isn't representative of the values in the rest of the data: for example, the first dict could have a field with a value of null while the next dict has a value of 1.

@rzmk (Collaborator, Author) commented Jun 26, 2024

As an alternative, we could resort to another library (which may remove the need for polars altogether): https://github.com/vtselfa/json-objects-to-csv


The headers are sorted alphabetically and are the union of all the keys in all the objects in the input after they are flattened.

But it looks like it sorts alphabetically, and I'm not sure how to preserve the original order with it either. We could try to retain the headers starting from the first dict's keys, then use qsv select to reorder the columns. I also made a new issue in the crate's repository regarding this: vtselfa/json-objects-to-csv#10
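
The reorder step could be sketched with std only. reorder_indices and reorder_row are hypothetical helpers showing the permutation that a qsv select call would apply, assuming the desired order comes from the first object's keys:

```rust
// Given headers sorted alphabetically by json-objects-to-csv and the desired
// order (the first object's keys), compute the column permutation.
fn reorder_indices(sorted: &[&str], desired: &[&str]) -> Vec<usize> {
    desired
        .iter()
        .filter_map(|want| sorted.iter().position(|have| have == want))
        .collect()
}

// Apply the permutation to one CSV row.
fn reorder_row(row: &[&str], indices: &[usize]) -> Vec<String> {
    indices.iter().map(|&i| row[i].to_string()).collect()
}

fn main() {
    let sorted = ["field", "max", "min", "type"]; // alphabetical CSV output
    let desired = ["field", "type", "min", "max"]; // original key order
    let idx = reorder_indices(&sorted, &desired);
    assert_eq!(idx, vec![0, 3, 2, 1]);

    let row = ["fruit", "strawberry", "apple", "String"];
    assert_eq!(
        reorder_row(&row, &idx),
        vec!["fruit", "String", "apple", "strawberry"]
    );
    println!("{idx:?}");
}
```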

@jqnatividad (Collaborator)

OK @rzmk, LGTM!

Should I hold off releasing 0.129.0?

@rzmk (Collaborator, Author) commented Jun 26, 2024

I do have a local implementation, but there are a few unwraps. I'll make a PR, then feel free to decide.

@rzmk (Collaborator, Author) commented Jun 27, 2024

See #1924.

@jqnatividad (Collaborator)

Merged #1924, did some tweaks, and we can start prepping again for 0.129.0.

I'll just need to convert the select call from spawning a separate qsv process to just calling it internally before publishing...

Thanks @rzmk !
