Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: add dataset level stats #2288

Closed
jqnatividad opened this issue Nov 13, 2024 · 4 comments · Fixed by #2297
Closed

stats: add dataset level stats #2288

jqnatividad opened this issue Nov 13, 2024 · 4 comments · Fixed by #2297
Labels
enhancement New feature or request. Once marked with this label, its in the backlog. WIP work in progress

Comments

@jqnatividad
Copy link
Collaborator

jqnatividad commented Nov 13, 2024

Currently, stats only computes column-level stats.

Also add dataset-level stats, with the "_qsv_" prefix, like:

  • rowcount (_qsv_rowcount)
  • column count (_qsv_columncount)
  • filesize (_qsv_filesize)
  • file hash (_qsv_hash) using xxHash algorithm (use twox-hash crate)

The value for each dataset stat will be stored in a column named _qsv_value

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Nov 13, 2024
@jqnatividad
Copy link
Collaborator Author

jqnatividad commented Nov 18, 2024

instead of computing the file hash which may take a long time for large files, just compute the hash of all the stats, including the rowcount, column count and the filesize.

This pretty much guarantees the hash will be unique for the file in its current state, without having to scan the entire file, serving as a "fingerprint hash."

@jqnatividad
Copy link
Collaborator Author

#2297 is still WIP, but changed the dataset-level stats to:

  • qsv__rowcount
  • qsv__columncount
  • qsv__filesize_bytes
  • qsv__fingerprint_hash
    with their values stored in the last column of stats as qsv__value.

Removed the leading underscore because it was tripping up select in CI as underscore is a select sentinel value for last column. Made the prefix qsv__ with two trailing underscores.

@jqnatividad
Copy link
Collaborator Author

This is not truly done until corresponding CI tests succeed.
stats has hundreds of tests, so this will take a bit of effort.

@jqnatividad jqnatividad reopened this Nov 20, 2024
@jqnatividad jqnatividad added the WIP work in progress label Nov 22, 2024
@jqnatividad
Copy link
Collaborator Author

done.
There are still some issues with fingerprint hash being non-deterministic when using cache_threshold, but will track that separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. Once marked with this label, its in the backlog. WIP work in progress
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant