Releases: dathere/qsv
1.0.0
qsv v1.0.0 is here!
After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!
What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!
To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!
Added
- joinp: add --ignore-case option #2287
- py: add ability to load python expression from file #2295
- replace: add --not-one flag (resolves #2305) by @rzmk in #2307
- slice: add --invert option #2298
- stats: add dataset-level stats #2297
- sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
- template: add lookup support #2313
- added ui feature to make it easier to make a headless build of qsv #2289
- added better panic handling #2304
- added new benchmark for template command cd7e480
- added lookup support legend b46de73
Changed
- move qsv from personal GitHub repo to datHere GitHub org #2317
- template: parallelized template rendering for significant speedups #2273
- simplify input format check #2309
- bump embedded luau from 0.650 to 0.653 986a1d3
- deps: Switch back to simple-home-dir from simple-expand-tilde #2319
- deps: Add minijinja contrib #2276
- deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
- build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
- build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
- build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
- build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
- build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
- build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
- build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
- build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
- build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
- build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
- build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
- build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
- build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
- build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
- build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
- build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
- build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
- build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
- build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
- build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
- applied several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped MSRV to latest Rust stable (1.83.0)
- bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars
Fixed
- fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
- added better panic handling (fixes #2301) #2304
- implement simple format check for input files (fixes #2301) #2308
Removed
- removed simple-expand-tilde dependency in favor of simple-home-dir #2318
- removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
- removed clipboard command from qsvlite binary variant 9c663d8
Full Changelog: 0.138.0...1.0.0
0.138.0
Highlights:
- New template command for rendering templates with CSV data.
  Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).
- New lookup module for fetching reference data from remote and local files.
  In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module also has simple file-based caching to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
  The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).
- Enhanced fetchpost with MiniJinja templating for payload construction.
  Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.
- Improved Polars integration with automatic schema detection
  The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:
  - Faster execution by skipping Polars' schema inference step
  - GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
- fast-float2 crate for faster float parsing
  Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.
- Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
  These core crates underpin qsv's advanced commands. Using the latest versions of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
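To illustrate the cache-age check that the lookup module's file-based caching implies, here is a minimal Python sketch. It is not qsv's implementation (qsv is written in Rust): the function name `cached_lookup`, the `fetch` callable and the hashed-filename layout are all assumptions made for illustration. Only the 600-second default comes from the notes above.

```python
import hashlib
import time
from pathlib import Path

DEFAULT_CACHE_AGE_SECS = 600  # default cache age stated in the release notes


def cached_lookup(uri, fetch, cache_dir, max_age_secs=DEFAULT_CACHE_AGE_SECS):
    """Return the data for `uri`, reusing a cached copy while it is still fresh.

    `fetch` is a hypothetical callable that takes the URI and returns bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URI so any scheme (http, CKAN://, datHere://) maps to a safe filename.
    cache_file = cache_dir / hashlib.sha256(uri.encode("utf-8")).hexdigest()

    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < max_age_secs:
            return cache_file.read_bytes()  # cache hit: skip the fetch

    data = fetch(uri)  # cache miss or stale entry: fetch again
    cache_file.write_bytes(data)
    return data
```

Using the file's modification time as the freshness marker keeps the sketch free of any separate metadata store; a real implementation may track cache age differently.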
Added
- added lookup module - enabling fetching and caching of reference data from remote and local files #2262
- fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
- joinp: derive polars schema from stats cache 86fe22e
- sqlp: derive polars schema from stats cache #2256
- template: new command to render MiniJinja templates with CSV data #2267
- validate: add dynamicEnum lookup support #2265
- contrib(completions): add template command and update fetchpost by @rzmk in #2269
- add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
- added more benchmarks for new/updated commands f8a1d4f cd7e480
Changed
- luau: adapt to mlua 0.10 API changes 268cb45
- luau: refactored stage management 31ef58a
- luau: now uses the lookup module 2f4be34
- stats: minor perf refactoring 6cdd6ea
- build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
- build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
- build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
- build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
- build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
- build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
- build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
- build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
- build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
- build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
- build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
- build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
- build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
- build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
- build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
- build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
- build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
- build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
- build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
- build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
- deps: updated our fork of the csv crate with more perf optimizations eae7d76
- deps: use calamine upstream with unreleased fixes 4cc7f37
- deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
- deps: bump jsonschema from 0.25 to 0.26 #2251
- deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
- deps: bump mlua from 0.9 to 0.10 #2249
- deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
- apply select clippy lint suggestions
- updated indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5
Fixed
Removed
- removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
- removed unneeded create_dir_all_threadsafe helper now that std::create_dir_all is threadsafe d0af83b
Full Changelog: 0.137.0...0.138.0
0.137.0
Highlights:
- extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only sorted on a line-by-line basis (LINE mode).
  With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
  This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort) - they can handle files larger than memory.
- sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
- fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
- stats is a tad faster as we keep squeezing more performance from this central command.
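As a rough illustration of what deduplicating on a subset of columns means in CSV mode, the Python sketch below keeps the first row seen for each distinct combination of the selected columns. It is a simplification under stated assumptions: the function name `dedup_on_columns` is hypothetical, and the in-memory `set` stands in for the on-disk hash table that extdedup is described as using, so unlike qsv this sketch cannot handle files larger than memory.

```python
import csv
import io


def dedup_on_columns(csv_text, key_columns):
    """Keep the first row seen for each distinct combination of key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()  # in-memory stand-in for a disk-backed hash table
    for row in reader:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)
    return out.getvalue()


data = "id,city,state\n1,Albany,NY\n2,Albany,NY\n3,Austin,TX\n"
print(dedup_on_columns(data, ["city", "state"]))
```

In LINE mode the whole line would be the key; here only the city and state columns are, so the row with id 2 counts as a duplicate even though its id differs.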
Added
- extdedup: now supports two modes - LINE mode and CSV mode #2208
- extsort: now also has two modes - CSV mode and LINE mode #2210
- sqlp: add --cache-schema option #2224
- added sqlp --cache-schema benchmarks
Changed
- apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
- apply & applydp: specify min_length for parallel iterators 7d6ce5e
- fetch & fetchpost: replace jql with jaq #2222
- stats: performance optimizations f205809 e26c27f 4579c1b
- validate: specify min_length for parallel iterators a5b8185
- deps: updated polars to 0.43.1 at the py-1.10.0 tag.
- build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
- build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
- build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
- build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
- build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
- build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
- build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
- build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
- build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
- build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
- build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
- apply select clippy lints
- bumped indirect dependencies
- bumped MSRV to 1.82
Fixed:
- fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206
Removed:
- fetch & fetchpost: removed jql options; replaced with jaq #2222
Full Changelog: 0.136.0...0.137.0
0.136.0
qsv pro is now available in the Microsoft Store!
It's Data Wrangling Democratized on the Desktop, featuring:
- Familiar Spreadsheet Interface
  tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
- CKAN desktop client
  designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
- Flow
  allows you to build custom node-based flows and data pipelines using a visual interface.
- Toolbox
  features an ever-expanding library of reusable scripts for common data-wrangling use cases.
- and more!
  Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.
Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!
Other highlights:
- excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
- Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
- validate: dynamicEnum custom JSON Schema keyword in validate command (renamed from dynenum) and enhanced email validation
- schema: automatic JSON Schema const inferencing for columns with just one value
- Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar Wed, Oct 23, 2024. 1-2pm EDT
Added
- qsv pro is now in the Microsoft Store!!!
- apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
- enum: added --new-column support for all enum modes, not just --increment #2173
- excel: new --table option for XLSX files #2194
- excel: new --header-row option 458f79a
- excel: expanded range and metadata options #2195
- schema: added JSON Schema automatic const inferencing #2180
- Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
- contrib(completions): add --table option to qsv excel by @rzmk in #2197
- completions: add --header-row option to qsv excel e8794d5
- added new apply operations sentiment benchmark b745e64
- docs: added indexing section to PERFORMANCE.md 804145a
Changed
- stats: various minor micro-optimizations 62d95fc 2c2862a
- validate: renamed custom keyword dynenum to dynamicEnum to be more consistent with JSON schema naming conventions 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- validate: optimizations for increased performance; replace serde_json with simd_json 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- apply new clippy::ref_option lint to Config::new API #2192
- Update debian package readme by @tino097 in #2187
- deps: bump calamine from 0.25 to 0.26 b42279a
- deps: jsonschema use latest 0.22.3 upstream with unreleased features/fixes
- deps: polars use latest 0.43.1 upstream with unreleased features/fixes
- deps: created our own fork of unmaintained vader_sentiment crate b426761
- deps: use serde_json upstream with unreleased perf improvement/fixes https://github.com/jqnatividad/qsv/blob/1c1174b3b8b65d9dfd9c841597366fb09d0a047c/Cargo.toml#L221
- build(deps): bump flate2 from 1.0.33 to 1.0.34 by @dependabot in #2171
- build(deps): bump flexi_logger from 0.29.0 to 0.29.1 by @dependabot in #2189
- build(deps): bump flexi_logger from 0.29.1 to 0.29.2 by @dependabot in #2196
- build(deps): bump hashbrown from 0.14.5 to 0.15.0 by @dependabot in #2186
- build(deps): bump jsonschema from 0.20.0 to 0.21.0 by @dependabot in #2177
- build(deps): bump jsonschema from 0.22.1 to 0.22.2 by @dependabot in #2191
- build(deps): bump regex from 1.10.6 to 1.11.0 by @dependabot in #2176
- build(deps): bump reqwest from 0.12.7 to 0.12.8 by @dependabot in #2183
- build(deps): bump simd-json from 0.14.0 to 0.14.1 #2199
- build(deps): bump simple-expand-tilde from 0.4.2 to 0.4.3 by @dependabot in #2190
- build(deps): bump sysinfo from 0.31.4 to 0.32.0 by @dependabot in #2193
- build(deps): bump tempfile from 3.12.0 to 3.13.0 by @dependabot in #2175
- apply select clippy lints
- bumped indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-09-29 7cd2de1
Fixed
- schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
- Upload artifact fix for Debian package publishing by @tino097 in #2168
- fixed typos configuration 627de89
- fixed various GitHub Actions publishing workflow issues
Full Changelog: 0.135.0...0.136.0
0.135.0
Highlights
JSON Schema validation just got a whole lot more powerful with the introduction of qsv's custom dynenum keyword!
With dynenum, you can now dynamically look up valid enum values from a CSV (on the filesystem or on a URL), allowing for more flexible and responsive data validation.
Unlike the standard enum keyword, dynenum does not require hardcoding valid values at schema definition time, and can be used to validate data against a changing set of valid values.
For an example, see #1872 (reply in thread).
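The idea behind dynenum can be sketched in a few lines of Python: the set of valid values is loaded from a reference CSV at validation time instead of being hardcoded in the schema. This is only an illustration, not qsv's implementation. The function names are hypothetical, and reading the allowed values from the first column of the reference CSV is an assumption of this sketch, not something the notes above specify.

```python
import csv
import io


def load_enum_values(reference_csv_text, column=0):
    """Collect allowed values from one column of a reference CSV (header skipped)."""
    reader = csv.reader(io.StringIO(reference_csv_text))
    next(reader, None)  # skip the header row
    return {row[column] for row in reader if row}


def invalid_values(values, allowed):
    """Return the values that are not in the allowed set, in input order."""
    return [v for v in values if v not in allowed]


reference = "code,name\nNY,New York\nTX,Texas\n"
allowed = load_enum_values(reference)
print(invalid_values(["NY", "ZZ", "TX"], allowed))
```

Because the reference CSV is read when validation runs, updating that file changes which values pass without touching the schema.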
In an upcoming qsv pro release, we're planning on making dynenum even more powerful by allowing you to easily specify high-value reference data (e.g. US Census data, World Bank data, data.gov, etc.) that is maintained at data.dathere.com and other CKAN instances.
This release also adds the custom currency JSON Schema format, which enables currency validation according to the ISO 4217 standard.
The Polars engine was also upgraded to 0.43.1 at the py-1.81.1 tag - making for various under-the-hood improvements for the sqlp, joinp and count commands, as we set the stage for more Polars-powered features in future releases.
Added
- foreach: enabled foreach command on Windows prebuilt binaries def9c8f
- lens: added support for QSV_SNIFF_DELIMITER env var and snappy auto-decompression 8340e89
- sample: add --max-size option e845a3c
- validate: added dynenum custom JSON Schema keyword for dynamic validation lookups #2166
- tests: add tests for https://100.dathere.com/lessons/2 by @rzmk in #2141
- added stats_sorted and frequency_sorted benchmarks
- added validate_dynenum benchmarks
Changed
- json: add error for empty key and update usage text by @rzmk in #2167
- prompt: gate prompt command behind prompt feature #2163
- validate: expanded currency JSON Schema custom format to support ISO 4217 currency codes and alternate formats 5202508
- validate: migrate to new jsonschema crate api 5d65054
- Update ubuntu version for deb package by @tino097 in #2126
- contrib(completions): update completions for qsv v0.134.0 and fix subcommand options by @rzmk in #2135
- contrib(completions): add --max-size completion for sample by @rzmk in #2142
- deps: bump to polars 0.43.1 at py-1.81.1 #2130
- deps: switch back to calamine upstream instead of our fork 677458f
- build(deps): bump actix-governor from 0.5.0 to 0.6.0 by @dependabot in #2146
- build(deps): bump anyhow from 1.0.87 to 1.0.88 by @dependabot in #2132
- build(deps): bump arboard from 3.4.0 to 3.4.1 by @dependabot in #2137
- build(deps): bump bytes from 1.7.1 to 1.7.2 by @dependabot in #2148
- build(deps): bump geosuggest-core from 0.6.3 to 0.6.4 by @dependabot in #2153
- build(deps): bump geosuggest-utils from 0.6.3 to 0.6.4 by @dependabot in #2154
- build(deps): bump jql-runner from 7.1.13 to 7.2.0 by @dependabot in #2165
- build(deps): bump jsonschema from 0.18.1 to 0.18.2 by @dependabot in #2127
- build(deps): bump jsonschema from 0.18.2 to 0.18.3 by @dependabot in #2134
- build(deps): bump jsonschema from 0.18.3 to 0.19.1 by @dependabot in #2144
- build(deps): bump jsonschema from 0.19.1 to 0.20.0 by @dependabot in #2152
- build(deps): bump pyo3 from 0.22.2 to 0.22.3 by @dependabot in #2143
- build(deps): bump rfd from 0.14.1 to 0.15.0 by @dependabot in #2151
- build(deps): bump simple-expand-tilde from 0.4.0 to 0.4.2 by @dependabot in #2129
- build(deps): bump qsv_currency from 0.6.0 to 0.7.0 by @dependabot in #2159
- build(deps): bump qsv_docopt from 1.7.0 to 1.8.0 by @dependabot in #2136
- build(deps): bump redis from 0.26.1 to 0.27.0 by @dependabot in #2133
- build(deps): bump simdutf8 from 0.1.4 to 0.1.5 by @dependabot in #2164
- bump indirect dependencies
- apply select clippy lint suggestions
- several usage text/documentation improvements
- bump MSRV to 1.81.0
Fixed
- validate: correct fail_validation_error! macro; reformat error messages to use hyphens as the JSONschema error message already starts with "error:" 9a25524
- moved --help output from stderr to stdout as per GNU CLI guidelines #2138
- lens: fixed parsing of lens options 1cdd1bc
- searchset: fixed usage text for <regexset-file> 9a60fb0
- used patched forks of arrow, csvlens and xlsxwriter crates that replace a dependency on an old version of lexical-core with known soundness issues - https://rustsec.org/advisories/RUSTSEC-2023-0086. Once those crates have updated their lexical-core dependency, we will revert to the original crates.
Removed
- removed prompt command from qsvlite #2163
- publish: remove lens feature from i686 targets as it does not compile 959ca76
- deps: remove anyhow dependency #2150
Full Changelog: 0.134.0...0.135.0
0.134.0
qsv pro v1 is here!
If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!
Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!
And that's just the beginning, there's more to come! You just have to try it!
Download qsv pro v1 now at qsvpro.dathere.com.
Other highlights include:
- pro: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
- lens: new command to interactively view CSVs using the csvlens crate.
- The ludicrously fast diff command is now easier to use with its --drop-equal-fields option. @janriemer continues to work on his csv-diff crate, and there are more diff UX improvements coming soon!
- stats adds sum_length and avg_length "streaming" statistics in addition to the existing min_length and max_length metrics. These are especially useful for datasets with a lot of "free text" columns.
- stats also got "smarter" and "faster" by dog-fooding its own statistics to make it run faster!
  It's a little complicated, but the way stats works is that it compiles the "streaming" statistics on the fly first as it multiplex-loads the data across several threads, and the more expensive advanced statistics are "lazily" computed at the end.
  Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting - an otherwise necessary step to get cardinality, which is done by "scanning" all the sorted values of a column. Every time two neighboring values differ in a sorted column, it increments the cardinality count.
  Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm - halving its memory footprint and making it faster still for larger datasets by parallelizing the computation. This, in turn, makes the frequency command faster and more memory efficient.
  It's performance tweaks like these that, despite the addition of six metrics (is_ascii, sort_order, sum_length, avg_length, sem - standard error of the mean & cv - coefficient of variation) in recent releases, keep stats able to compile 35 statistics and do GUARANTEED data type inferences of a million row, 41 column, 520 MB sample of NYC's 311 data in 1.327 seconds (753,580 records per second)!
- we now also use our own fork of the csv crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
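The cardinality scan described in the stats highlight above can be sketched in Python as follows. This is a sequential illustration only, not qsv's Rust implementation or its parallel variant. The function name and the `already_sorted` flag are hypothetical; the flag stands in for the streamed "sort order" statistic that lets the sort be skipped.

```python
def cardinality(values, already_sorted=False):
    """Count distinct values by scanning neighbors in sorted order."""
    # Skip the sort when the column is already known to be sorted.
    ordered = values if already_sorted else sorted(values)
    if not ordered:
        return 0
    count = 1  # the first value is always a new distinct value
    for prev, curr in zip(ordered, ordered[1:]):
        if prev != curr:  # neighbors differ: a new distinct value starts here
            count += 1
    return count


print(cardinality(["b", "a", "b", "c", "a"]))
```

The scan itself is linear, so when the sort can be skipped the whole computation is a single pass over the column.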
Added
- pro: add qsv pro command to interact with qsv pro API by @rzmk in #2039
- lens: new command to interactively view CSVs using the csvlens crate #2117
- apply: add crc32 operation #2121
- count: add --delimiter option #2120
- diff: add flag --drop-equal-fields by @janriemer in #2114
- stats: add sum_length and avg_length columns #2113
- stats: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets 4e63fec
Changed
- count: added comment to justify magic number 5241e39
- stats: use simdjson for faster JSONL parsing; micro-optimize compute hot loop 0e8b734
- stats: standardized OVERFLOW and UNDERFLOW messages 38c6128
- sort: renamed symbol to eliminate devskim lint false positive warning 12db739
- enable lens feature in GH workflows #2122
- deps: bump polars 0.42.0 to latest upstream at time of release 3c17ed1
- deps: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks e4bcd71
- build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in #2111
- build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in #2106
- build(deps): bump qsv-stats from 0.19.0 to 0.22.0 #2107 #2112 cb1eb60
- apply select clippy lint suggestions
- updated several indirect dependencies
- made various doc and usage text improvements
Fixed
- schema: Print an error if the qsv stats invocation fails by @abrauchli in #2110
New Contributors
- @abrauchli made their first contribution in #2110
Full Changelog: 0.133.1...0.134.0
0.133.1
Highlights
This release doubles down on Polars' capabilities, as we now, as a matter of policy, track the latest Polars upstream. If you think qsv has a torrid release schedule, you should see Polars. They're constantly fixing bugs, adding new features and optimizations! To keep up, we've added Polars revision info to the --version output, and the --envlist option now includes Polars relevant env vars. We've also added support for the POLARS_BACKTRACE_IN_ERR env var to control whether Polars backtraces are included in error messages.
We also removed the to parquet subcommand as it's redundant with the Polars-powered sqlp's ability to create parquet files. This removes the HUGE duckdb dependency, which should markedly make compile times shorter and binaries smaller.
Other highlights include:
- New edit command that allows you to edit CSV files.
- The count command's --width option now includes record width stats beyond max length (avg, median, min, variance, stddev & MAD).
- The fixlengths command now has --quote and --escape options.
- The stats command adds a sort_order streaming statistic.
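For readers unfamiliar with the record width statistics listed above, the Python sketch below computes them for a list of record widths. It is an illustration under stated assumptions, not qsv's code: the function name is hypothetical, MAD is taken to mean median absolute deviation, and the variance and stddev use the population formulas, which may differ from what count --width actually reports.

```python
import statistics


def width_stats(widths):
    """Summarize a list of record widths (e.g. byte lengths of CSV records)."""
    med = statistics.median(widths)
    return {
        "max": max(widths),
        "min": min(widths),
        "avg": statistics.fmean(widths),
        "median": med,
        # Population variance/stddev: an assumption of this sketch.
        "variance": statistics.pvariance(widths),
        "stddev": statistics.pstdev(widths),
        # Median absolute deviation: median of |width - median|.
        "mad": statistics.median(abs(w - med) for w in widths),
    }


print(width_stats([10, 12, 14, 20, 44]))
```

The median and MAD are less sensitive than the average and stddev to a few unusually wide records, which is why they are useful alongside the max length.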
NOTE: 0.133.0 was skipped because of a dev dependency conflict with the csvs_convert crate, preventing us from publishing 0.133.0 to crates.io. This has been resolved in 0.133.1.
Added
- count: expanded --width options, adding record width stats beyond max length (avg, median, min, variance, stddev & MAD). Also added --json output when using --width #2099
- edit: add qsv edit command by @rzmk in #2074
- fixlengths: added --quote and --escape options #2104
- stats: add sort_order streaming statistic #2101
- polars: add polars revision info to --version output e60e44f
- polars: added Polars relevant env vars to --envlist option 0ad68fe
- polars: add & document POLARS_BACKTRACE_IN_ERR env var f9cc559
Changed
- Optimize polars optflags #2089
- deps: bump polars 0.42.0 to latest upstream at time of release 3b7af51
- bump polars to latest upstream, removing smartstring #2091
- build(deps): bump actions/setup-python from 5.1.1 to 5.2.0 by @dependabot in #2094
- build(deps): bump flate2 from 1.0.32 to 1.0.33 by @dependabot in #2085
- build(deps): bump flexi_logger from 0.28.5 to 0.29.0 by @dependabot in #2086
- build(deps): bump indexmap from 2.4.0 to 2.5.0 by @dependabot in #2096
- build(deps): bump jsonschema from 0.18.0 to 0.18.1 by @dependabot in #2084
- build(deps): bump serde from 1.0.208 to 1.0.209 by @dependabot in #2082
- build(deps): bump serde_json from 1.0.125 to 1.0.127 by @dependabot in #2079
- build(deps): bump sysinfo from 0.31.2 to 0.31.3 by @dependabot in #2077
- build(deps): bump qsv-stats from 0.18.0 to 0.19.0 by @dependabot in #2100
- build(deps): bump tokio from 1.39.3 to 1.40.0 by @dependabot in #2095
- apply select clippy lint suggestions
- updated several indirect dependencies
- made various doc and usage text improvements
- pin Rust nightly to 2024-08-26 from 2024-07-26, aligning with Polars pinned nightly
Fixed
- Ensure portable binaries are "added" to the publish zip archive, instead of replacing all the binaries with just the portable version. Fixes #2083. 34ad206
Removed
- removed to parquet subcommand as it's redundant with sqlp's ability to create parquet files. This also removes the HUGE duckdb dependency, which should markedly make compile times shorter and binaries much smaller #2088
- removed smartstring dependency now that Polars has its own compact inlined string type 47f047e
- removed to parquet benchmark
Full Changelog: 0.132.0...0.133.1
1. ChatGPT prompt: Using the logos for the Polars project and the qsv project as a baseline, can you create a version with the cowboy riding a polar bear instead?
0.132.0
Highlights
With this release, we finally finish the stats caching refactor started in 0.131.0, replacing the binary encoded stats cache with a simpler JSONL cache. The stats cache stores the necessary statistical metadata to make several key commands smarter & faster. Per the benchmarks:
- frequency is 6x faster (frequency_index_stats_mode_auto).
  Not only is it faster, it now doesn't need to compile a hashmap for columns with ALL unique values (e.g. ID columns) - practically, making it able to handle "real-world" datasets of any size (that is, unless all the columns have ALL unique cardinalities. In that case, the entire CSV will have to fit into memory).
- tojsonl is 2.67x faster (tojsonl_index)
- schema is two orders of magnitude (100x) faster!!! (schema_index)
The stats cache also provides the foundation for even more "smart" features and commands in the future. It also has the side-benefit of adding a way to produce stats in JSONL format that can be used for other purposes beyond qsv.
The search, searchset, and replace commands now also have a --literal option that allows you to search for and replace strings with regex special/reserved characters. This makes it easier to search for and replace strings that contain otherwise reserved regex characters without having to escape them (especially useful with URL columns that often contain characters like ?, :, -, ., etc.)
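The effect of a literal search option can be shown with Python's standard `re` module: escaping the pattern makes every reserved character match itself. This is an analogy only. The `search` helper and its `literal` parameter are hypothetical, and qsv's Rust regex handling is not shown here.

```python
import re


def search(pattern, text, literal=False):
    """Return True if `pattern` is found in `text`.

    With literal=True, regex metacharacters in `pattern` are escaped first,
    so characters like '?', '.', '(' are matched as plain text.
    """
    if literal:
        pattern = re.escape(pattern)
    return re.search(pattern, text) is not None


# As a regex, '.' matches any character, so this also matches "exampleXcom".
print(search("example.com", "exampleXcom"))
# As a literal, '.' only matches a real dot.
print(search("example.com", "exampleXcom", literal=True))
```

Without the literal option, a pattern such as `page(1` would not even compile as a regex because of the unbalanced parenthesis; escaped, it is simply matched as text.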
Added
- search, searchset & replace: add --literal option #2060 & 7196053
- slice: added usage text examples 04afaa3
- publish: added workflow to build "portable" binaries with CPU features disabled
- contrib(completions): add --literal for search and searchset by @rzmk in #2061
- contrib(completions): add --literal completion to replace by @rzmk in #2062
- add more polars metadata in --version info #2073
- docs: added more info to SECURITY.md 609d4df
- docs: expanded Goals/Non-Goals 54998e3
- docs: added Installation "Option 0" quick start bf5bf82
- added search --literal benchmark
Changed
- `stats`, `schema`, `frequency` & `tojsonl`: stats caching refactor, replacing binary encoded stats cache with a simpler JSONL cache #2055
- rename `stats --stats-json` option to `stats --stats-jsonl` #2063
- changed "broken pipe" error to a warning 7353275
- `docs`: update multithreading and caching sections of PERFORMANCE.md 5e6bc45
- `deps`: switch to our qsv-optimized fork of csv crate 3fc1e82
- `deps`: bump polars from 0.41.3 to 0.42.0 #2051
- build(deps): bump actix-web from 4.8.0 to 4.9.0 by @dependabot in #2041
- build(deps): bump flate2 from 1.0.31 to 1.0.32 by @dependabot in #2071
- build(deps): bump indexmap from 2.3.0 to 2.4.0 by @dependabot in #2049
- build(deps): bump reqwest from 0.12.6 to 0.12.7 by @dependabot in #2070
- build(deps): bump rust_decimal from 1.35.0 to 1.36.0 by @dependabot in #2068
- build(deps): bump serde from 1.0.205 to 1.0.206 by @dependabot in #2043
- build(deps): bump serde from 1.0.206 to 1.0.207 by @dependabot in #2047
- build(deps): bump serde from 1.0.207 to 1.0.208 by @dependabot in #2054
- build(deps): bump serde_json from 1.0.122 to 1.0.124 by @dependabot in #2045
- build(deps): bump serde_json from 1.0.124 to 1.0.125 by @dependabot in #2052
- apply select clippy lint suggestions
- updated several indirect dependencies
- made various usage text improvements
Fixed
- `stats`: fix `--output` delimiter inferencing based on file extension #2065
- make process_input helper handle stdin better #2058
- `docs`: fix completions for `--stats-jsonl` and qsv pro installation text update by @rzmk in #2072
- `docs`: added Note about why `luau` feature is disabled in musl binaries - ffa2bc5 & 27d0f8e
Removed
Full Changelog: 0.131.1...0.132.0
0.131.1
Changed
- deps: bump polars to latest upstream post py-1.41.1 release at the time of this release
- build(deps): bump filetime from 0.2.23 to 0.2.24 by @dependabot in #2038
Fixed
- `frequency`: change `--stats-mode` default to `none` from `auto`.
  This is because of a big performance regression when using `--stats-mode auto` on datasets with columns with ALL unique values.
  See #2040 for more info.
Full Changelog: 0.131.0...0.131.1
0.131.0
Highlights
- Refactored `frequency` to make it smarter and faster.
  `frequency`'s core algorithm essentially compiles an in-memory hashmap to determine the frequency of each unique value for each column. It does this using multi-threaded, multi-I/O techniques to make it blazing fast.
  However, for columns with ALL unique values (e.g. ID columns), this takes a comparatively long time and consumes a lot of memory as it essentially compiles a hashmap of the ENTIRE column, with a hashmap entry for each column value with a count of 1.
  Now, with the new `--stats-mode` option (enabled by default), `frequency` can compile the dataset in a more intelligent way by looking up a column's cardinality in the stats cache.
  If the cardinality of a column is equal to the CSV's rowcount (indicating a column with ALL unique values), it short-circuits frequency calculations for that column - dramatically reducing the time and memory requirements for the ID column as it eliminates the need to maintain a hashmap for it.
  Practically speaking, this makes `frequency` able to handle "real-world" datasets of any size.
  To ensure `frequency` is as fast as possible, be sure to `index` and compute `stats` for your datasets beforehand.
- Setting the stage for Datapusher+ v1 and...
  The "itches we've been scratching" the past few months have been informed by our work at several clients towards the release of Datapusher+ 1.0 and qsv pro 1.0 (more info below) - both targeted for release this month.
  DP+ is our third-gen, high-speed data ingestion/registration tool for CKAN that uses qsv as its data wrangling/analysis engine. It will enable us to reinvent the way data is ingested into CKAN - with exponentially faster data ingestion, metadata inferencing, data validation, computed metadata fields, and more!
  We're particularly excited how qsv will allow us to compute and infer high-quality metadata for datasets (with a focus on inferring optional recommended DCAT-US v3 metadata fields) in "near real-time", while dataset publishers are still entering metadata. This will be a game-changer for CKAN administrators and data publishers!
- ...qsv pro 1.0
  qsv pro is datHere's enterprise-grade data wrangling/curation workbench that's planned for v1.0 release this month.
  Building the core functionality of qsv pro's Workflow feature is one of the primary reasons for a v1.0 release.
  We feel qsv pro may be a game-changer for data wranglers and data curators who need to work with spreadsheets and large datasets to view statistical data and metadata while also performing complex data wrangling operations in a user-friendly way without having to write code.
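The `frequency` short-circuit described in the first highlight above can be sketched as follows. This is a simplified, single-threaded Python illustration, not qsv's Rust implementation; the `frequency` function, its `cardinalities` argument (standing in for a stats cache lookup) and the choice to report skipped columns as `None` are all assumptions made for the example.

```python
from collections import Counter


def frequency(rows, header, cardinalities=None):
    """Build a per-column frequency table.

    rows: list of rows, each a list of values; header: column names.
    cardinalities: optional dict of column name -> distinct-value count,
    standing in for a stats cache lookup.
    Columns whose cardinality equals the row count hold ALL unique values,
    so no counter is kept for them and they are reported as None.
    """
    row_count = len(rows)
    cardinalities = cardinalities or {}
    all_unique = {
        name for name in header
        if row_count > 0 and cardinalities.get(name) == row_count
    }
    # Only allocate a hashmap (Counter) for columns that need one.
    counters = {name: Counter() for name in header if name not in all_unique}
    for row in rows:
        for name, value in zip(header, row):
            if name in counters:
                counters[name][value] += 1
    return {name: counters.get(name) for name in header}


header = ["id", "city"]
rows = [["1", "NYC"], ["2", "NYC"], ["3", "LA"]]

# Without cardinalities, every column gets a counter, including the ID column.
without_cache = frequency(rows, header)
# With cardinalities, the all-unique "id" column is skipped entirely.
with_cache = frequency(rows, header, {"id": 3, "city": 2})
```

The memory saving comes from never allocating a counter for the all-unique column, which matters most when that column has as many distinct values as the file has rows.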
Added
- `docs`: added Shell Completion section 556a2ff
- `docs`: add 🪄 emoji in legend to indicate "automagical" commands 2753c90
- Add building deb package (WIP) by @tino097 in #2029
- Added GitHub workflow to test debian package (WIP) by @tino097 in #2032
- `tests`: added false positive to _typos.toml configuration d576af2
- added more benchmarks
- added more tests
Changed
- `fetch` & `fetchpost`: remove expired diskcache entries on startup 9b6ab5d
- `frequency`: smarter frequency compilation with new `--stats-mode` option #2030
- `json`: refactored for maintainability & performance 62e9216 and 4e44b18
- improved `self-update` messages 5c874e0 and 0aa0b13
- `contrib(completions)`: `frequency` updates & remove bashly/fish by @rzmk in #2031
- Debian package update by @tino097 in #2017
- `publish`: optimized enabled CPU features when building release binaries in all GitHub Actions "publishing" workflows
- `publish`: ensure latest Python patch release is used when building `qsvpy` binary variants 2ab03a0 and ec6f486
- `tests`: also enabled CPU features in CI tests
- `docs`: wordsmith qsv "elevator pitch" cc47fe6
- `docs`: point to https://100.dathere.com in Whirlwind tour fc49aef
- `deps`: bump polars to latest upstream post py-1.41.1 release at the time of this release
- build(deps): bump bytes from 1.6.1 to 1.7.0 by @dependabot in #2018
- build(deps): bump bytes from 1.7.0 to 1.7.1 by @dependabot in #2021
- build(deps): bump flate2 from 1.0.30 to 1.0.31 by @dependabot in #2027
- build(deps): bump indexmap from 2.2.6 to 2.3.0 by @dependabot in #2020
- build(deps): bump jaq-parse from 1.0.2 to 1.0.3 by @dependabot in #2016
- build(deps): bump redis from 0.26.0 to 0.26.1 by @dependabot in #2023
- build(deps): bump regex from 1.10.5 to 1.10.6 by @dependabot in #2025
- build(deps): bump serde_json from 1.0.121 to 1.0.122 by @dependabot in #2022
- build(deps): bump sysinfo from 0.30.13 to 0.31.0 by @dependabot in #2019
- build(deps): bump sysinfo from 0.31.0 to 0.31.2 by @dependabot in #2024
- build(deps): bump tempfile from 3.11.0 to 3.12.0 by @dependabot in #2033
- build(deps): bump serde from 1.0.204 to 1.0.205 by @dependabot in #2036
- apply select clippy suggestions
- updated several indirect dependencies
- made various usage text improvements
- bumped MSRV to 1.80.1
Fixed
- `sqlp` & `joinp`: fixed `.ssv.sz` output auto-compression support 5397f6c & d86ba63
- `docs`: fix link by @uncenter in #2026
- `tests`: correct misnamed test 8ae6000
- `tests`: fix flaky `reverse` property tests d86ba63
Removed
- `docs`: "Quicksilver" is the name of the logo horse, not how you pronounce "qsv" e4551ae
New Contributors
Full Changelog: 0.130.0...0.131.0