`stats`: add dataset level stats #2288

jqnatividad · 2024-11-13T13:06:46Z

Currently, stats only computes column-level stats.

Also add dataset-level stats, with the "_qsv_" prefix, like:

rowcount (_qsv_rowcount)
column count (_qsv_columncount)
filesize (_qsv_filesize)
file hash (_qsv_hash) using xxHash algorithm (use twox-hash crate)

The value for each dataset stat will be stored in a column named _qsv_value.


jqnatividad · 2024-11-18T05:00:51Z

Instead of computing the file hash, which may take a long time for large files, just compute the hash of all the stats, including the rowcount, column count, and filesize.

This pretty much guarantees the hash will be unique to the file in its current state without having to scan the entire file, so it serves as a "fingerprint hash."
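A minimal sketch of the fingerprint idea, not the code in qsv: it hashes the already-computed stats fields together with the rowcount, column count, and filesize, so the file itself is never re-read. The function names `fnv1a64` and `fingerprint_hash`, the separator bytes, and the hex output format are all assumptions for illustration. FNV-1a is swapped in for xxHash only to keep the example dependency-free.

```rust
/// 64-bit FNV-1a, used here only as a dependency-free stand-in for xxHash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV 64-bit prime
    }
    hash
}

/// Hypothetical helper: fingerprint a dataset from its computed stats
/// instead of from the file contents.
fn fingerprint_hash(
    stats_rows: &[Vec<String>],
    rowcount: u64,
    columncount: u64,
    filesize: u64,
) -> String {
    let mut buf: Vec<u8> = Vec::new();
    for row in stats_rows {
        for field in row {
            buf.extend_from_slice(field.as_bytes());
            buf.push(0x1f); // unit separator, so ["ab", "c"] differs from ["a", "bc"]
        }
        buf.push(0x1e); // record separator between stats rows
    }
    // Fold in the dataset-level values as fixed-width little-endian bytes.
    buf.extend_from_slice(&rowcount.to_le_bytes());
    buf.extend_from_slice(&columncount.to_le_bytes());
    buf.extend_from_slice(&filesize.to_le_bytes());
    format!("{:016x}", fnv1a64(&buf))
}
```

The cost scales with the size of the stats output rather than the size of the file, which is the point of the change proposed above.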

jqnatividad · 2024-11-19T16:07:28Z

#2297 is still WIP, but it changes the dataset-level stats to:

qsv__rowcount
qsv__columncount
qsv__filesize_bytes
qsv__fingerprint_hash

Their values are stored in the last column of stats, named qsv__value.
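The row layout described above could be sketched as follows. This is a hypothetical helper, not the code in #2297. It assumes the stat name goes in the first column and that the column-level stat cells are left empty on dataset-level rows; `dataset_stat_rows` and its parameters are illustrative names.

```rust
/// Hypothetical sketch: build the four dataset-level rows appended to the
/// stats output, with each value in the last (qsv__value) column.
fn dataset_stat_rows(
    column_stat_count: usize, // assumed number of column-level stat columns
    rowcount: u64,
    columncount: u64,
    filesize_bytes: u64,
    fingerprint_hash: &str,
) -> Vec<Vec<String>> {
    let stats = vec![
        ("qsv__rowcount", rowcount.to_string()),
        ("qsv__columncount", columncount.to_string()),
        ("qsv__filesize_bytes", filesize_bytes.to_string()),
        ("qsv__fingerprint_hash", fingerprint_hash.to_string()),
    ];
    stats
        .into_iter()
        .map(|(name, value)| {
            let mut row = Vec::with_capacity(column_stat_count + 2);
            row.push(name.to_string());
            // Assumption: column-level stat cells stay empty on these rows.
            row.extend(std::iter::repeat(String::new()).take(column_stat_count));
            row.push(value); // last column: qsv__value
            row
        })
        .collect()
}
```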

Removed the leading underscore because it was tripping up select in CI: an underscore is the select sentinel value for the last column. The prefix is now qsv__, with two trailing underscores.

jqnatividad · 2024-11-20T16:47:06Z

This is not truly done until corresponding CI tests succeed.
stats has hundreds of tests, so this will take a bit of effort.

jqnatividad · 2024-11-24T13:14:26Z

Done.
There are still some issues with the fingerprint hash being non-deterministic when using cache_threshold, but I will track that separately.

jqnatividad added the enhancement label Nov 13, 2024

jqnatividad mentioned this issue Nov 18, 2024

feat: add dataset-level stats #2297

Merged

jqnatividad closed this as completed in #2297 Nov 19, 2024

jqnatividad closed this as completed in d9bc2a5 Nov 19, 2024

jqnatividad reopened this Nov 20, 2024

jqnatividad added the WIP label Nov 22, 2024

jqnatividad closed this as completed Nov 24, 2024
