
0.125.0

@jqnatividad jqnatividad released this 01 Apr 12:38
· 2550 commits to master since this release
d559548

In this release, we focused on the 🏎️ need for even more speed 🏎️.

This was done primarily by tweaking several supporting qsv crates. qsv-docopt now parses command-line arguments slightly faster. qsv-stats, the crate behind commands like stats, schema, tojsonl, and frequency, has been further optimized for speed. qsv-dateparser has been updated to support new timezone handling options in datefmt. qsv-sniffer also got a speed boost.

Per the benchmark suite, stats is 25% faster (1.563 secs vs 2.067 secs) when computing the 13 "streaming" stats and 14% faster when computing --everything (17 columns of additional stats - 3.149 secs vs 3.656 secs) for the 1M row, 41 column, 520 MB sample of NYC's 311 data.

The count command has been refactored to utilize Polars' SQLContext, which leverages LazyFrame evaluation to automagically count even very large files in just a few seconds. Previously, count was already using Polars, but it mistakenly fell back to a slower counting mode. Now, it consistently delivers fast performance, even without an index. On the same benchmark suite, it takes 0.052 secs vs 0.503 secs - almost 10x faster!

As count is not just a top-level command but also a helper used by several qsv commands, this gives the entire suite a nice performance boost.

Continuing on the performance front, the excel command now has a new short --metadata mode, allowing users to get a "shorter" version of the metadata report that only lists the workbook's top-level metadata (sheet index, sheet name, sheet type, visibility) instead of the full metadata report (which also has info like num rows, column metadata, etc.). On the benchmark suite, the short metadata report takes all of 0.005 secs vs 11.237 secs for the 1M row xlsx version of the same NYC 311 data - more than 3 orders of magnitude faster! (It may actually be faster, since 0.005 secs is at the limits of what hyperfine can measure.)
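The timings quoted so far can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `percent_time_saved` and `speedup_factor` are hypothetical, and reading "faster" as "less wall-clock time" is an assumption, not something the benchmark suite states.

```rust
/// Percentage reduction in wall-clock time going from `old_secs` to `new_secs`.
/// (Hypothetical helper; "faster" is assumed to mean "time saved".)
fn percent_time_saved(old_secs: f64, new_secs: f64) -> f64 {
    (old_secs - new_secs) / old_secs * 100.0
}

/// How many times faster the new timing is relative to the old one.
fn speedup_factor(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}

fn main() {
    // (label, old secs, new secs), copied from the figures in these notes.
    let timings = [
        ("stats, streaming stats", 2.067, 1.563),
        ("stats --everything", 3.656, 3.149),
        ("count", 0.503, 0.052),
        ("excel short metadata", 11.237, 0.005),
    ];

    for (label, old, new) in timings {
        println!(
            "{label}: {:.1}% less time, {:.1}x faster",
            percent_time_saved(old, new),
            speedup_factor(old, new)
        );
    }
}
```

Small absolute timings such as 0.005 secs are close to measurement noise, so the corresponding ratios are best read as rough lower bounds.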

The datefmt command also got some major enhancements with new timezone handling and timestamp parsing options, though at the cost of a roughly 15% performance penalty.

Lastly, we are excited to announce that qsv will be featured at the CSV,Conf,V8 conference in Puebla, Mexico on May 28-29. I'll be presenting a talk titled "qsv: A Blazing Fast CSV Data-Wrangling Toolkit". Hope to see you there!


Added

  • excel: added short mode to --metadata option #1699
  • datefmt: added ts-resolution option to specify resolution to use when parsing unix timestamps #1704
  • datefmt: added timezone handling options #1706 #1707 #1642

Changed

  • count: refactored to use Polars SQLContext 43a236f
  • stats: refactored stats_path helper function 174c30e
  • apply, applydp, datefmt, excel, geocode, py, validate: use std::mem::take to avoid clone 1fd187f 8402d3a 8496157
  • excel: optimized workbook opening operation 67f662e
  • build(deps): bump flexi_logger from 0.27.4 to 0.28.0 by @dependabot in #1673
  • build(deps): bump polars from 0.38.2 to 0.38.3 by @dependabot in #1674
  • build(deps): bump uuid from 1.7.0 to 1.8.0 by @dependabot in #1675
  • build(deps): bump hashbrown from 0.14.3 to 0.14.4 by @dependabot in #1680
  • build(deps): bump reqwest from 0.11.26 to 0.11.27 by @dependabot in #1679
  • build(deps): bump bytes from 1.5.0 to 1.6.0 by @dependabot in #1685
  • build(deps): bump regex from 1.10.3 to 1.10.4 by @dependabot in #1686
  • build(deps): bump indexmap from 2.2.5 to 2.2.6 by @dependabot in #1687
  • build(deps): bump rayon from 1.9.0 to 1.10.0 by @dependabot in #1688
  • build(deps): bump qsv_docopt from 1.6.0 to 1.7.0 by @dependabot in #1691
  • build(deps): bump reqwest from 0.12.1 to 0.12.2 by @dependabot in #1693
  • build(deps): bump serde_json from 1.0.114 to 1.0.115 by @dependabot in #1694
  • build(deps): bump itoa from 1.0.10 to 1.0.11 by @dependabot in #1695
  • build(deps): bump actions/setup-python from 5.0.0 to 5.1.0 by @dependabot in #1700
  • build(deps): bump rust_decimal from 1.34.3 to 1.35.0 by @dependabot in #1701
  • build(deps): bump chrono from 0.4.35 to 0.4.37 by @dependabot in #1702
  • build(deps): bump tokio from 1.36.0 to 1.37.0 by @dependabot in #1703
  • build(deps): bump qsv-sniffer from 0.10.2 to 0.10.3 by @dependabot in #1708
  • build(deps): bump titlecase from 2.2.1 to 3.0.0 by @dependabot in #1709
  • build(deps): bump qsv-stats from 0.13.0 to 0.14.0 by @dependabot in #1710
  • applied select clippy recommendations
  • updated several indirect dependencies
  • added several benchmarks for new/changed commands
  • bumped MSRV to 1.77.1
  • use #[cfg(debug_assertions)] conditional compilation to avoid compiling debug code in release mode
  • use patched forks of jsonschema, cached, self_update and localzone crates to avoid old dependencies
    which were causing dependency bloat
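
Two of the changes above, std::mem::take and #[cfg(debug_assertions)], are general Rust techniques. The following is a minimal sketch of both, not qsv's actual code: the `Record` type and the `take_field` and `debug_trace` functions are hypothetical names made up for illustration.

```rust
use std::mem;

/// Hypothetical record type, used only for this illustration.
struct Record {
    field: String,
}

/// Moves the string out of `record`, leaving an empty `String` behind.
/// Unlike `record.field.clone()`, no string data is copied.
fn take_field(record: &mut Record) -> String {
    mem::take(&mut record.field)
}

/// Debug builds get a real tracing function...
#[cfg(debug_assertions)]
fn debug_trace(msg: &str) {
    eprintln!("[debug] {msg}");
}

/// ...while release builds compile this empty stand-in instead.
#[cfg(not(debug_assertions))]
fn debug_trace(_msg: &str) {}

fn main() {
    let mut record = Record {
        field: String::from("2024-04-01"),
    };

    let owned = take_field(&mut record);
    debug_trace("moved field out of record");

    // The value was moved, and the source now holds the default (empty) String.
    assert_eq!(owned, "2024-04-01");
    assert!(record.field.is_empty());
    println!("{owned}");
}
```

std::mem::take only fits when the original value is no longer needed at the call site, since it leaves the type's Default value in its place.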

Fixed

  • count: fixed polars_count_input helper, as it was always falling back to "slow" counting mode 3484c89

Full Changelog: 0.124.1...0.125.0