Releases: dathere/qsv

0.88.0

13 Feb 16:47

Added

  • extdedup: new command to deduplicate arbitrarily large CSV/text files using a memory-buffered, on-disk hash table. Not only does it dedup very large files using constant memory, it does so while retaining the file's original sort order, unlike dedup which loads the entire file into memory to sort it first before deduping by comparing neighboring rows #762
  • Added Out-of-Memory (OOM) handling for "non-streaming" commands (i.e. commands that load the entire file into memory). The heuristic: if an input file's size is larger than the free memory available minus a default headroom of 20 percent, qsv stops gracefully with a detailed message about the potential OOM condition. The headroom can be adjusted using the QSV_FREEMEMORY_HEADROOM_PCT environment variable, which has a minimum value of 10 percent #767
  • add -Q, --quiet option to all commands that return counts to stderr (dedup, extdedup, search, searchset and replace) in #768
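
The OOM heuristic in this release can be pictured with a small sketch. The function name, the byte-based inputs and the exact arithmetic are assumptions made for illustration. Only the 20 percent default, the 10 percent minimum and the QSV_FREEMEMORY_HEADROOM_PCT variable name come from these notes.

```python
import os

DEFAULT_HEADROOM_PCT = 20  # default stated in the release notes
MIN_HEADROOM_PCT = 10      # minimum stated in the release notes


def would_exceed_memory(file_size: int, free_memory: int, env=os.environ) -> bool:
    """Hypothetical check: True if loading the file would likely exhaust memory."""
    try:
        headroom_pct = int(env.get("QSV_FREEMEMORY_HEADROOM_PCT", DEFAULT_HEADROOM_PCT))
    except ValueError:
        # Assumption: fall back to the default when the value cannot be parsed.
        headroom_pct = DEFAULT_HEADROOM_PCT
    headroom_pct = max(headroom_pct, MIN_HEADROOM_PCT)
    # Memory considered usable once the headroom has been reserved.
    usable = free_memory * (100 - headroom_pct) // 100
    return file_size > usable
```

With 1000 bytes free and the default headroom, this sketch treats 800 bytes as usable, so a 900-byte file would be rejected and a 700-byte file accepted.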

Changed

  • sort & sortcheck: separate test suites and link from usage text #756
  • frequency: amortize allocations, preallocate with_capacity. Informal benchmarking shows an improvement of ~30%! 🚀 #761
  • extsort: refactor. Aligned options with extdedup; now also support stdin/stdout; added --memory-limit option #763
  • safenames: minor optimization a7df378
  • excel: minor optimization 75eac78
  • stats: add date inferencing false positive warning, with a recommendation on how to prevent false positives a84a4e6
  • sortcheck: added note to usage text that dupe_count is only valid if file is sorted ab69f14
  • reorganized Installation section to differentiate installation options 9ef8bfc
  • bump MSRV to 1.67.1
  • applied select clippy recommendations
  • Bump flexi_logger from 0.25.0 to 0.25.1 by @dependabot in #755
  • Bump pyo3 from 0.18.0 to 0.18.1 by @dependabot in #757
  • Bump serde_json from 1.0.92 to 1.0.93 by @dependabot in #760
  • Bump filetime from 0.2.19 to 0.2.20 by @dependabot in #759
  • Bump self_update from 0.34.0 to 0.35.0 by @dependabot in #765
  • cargo update bump several indirect dependencies
  • pin Rust nightly to 2023-02-12

Fixed

  • sortcheck: correct wrong progress message showing invalid dupe_count (as dupe count is only valid if the file is sorted) 8eaa824
  • py & luau: correct usage text about stderr 1b56e72

Full Changelog: 0.87.1...0.88.0

0.87.1

03 Feb 04:07

Changed

  • safenames: refactor in #754
    • better handling of headers that start with a digit: instead of replacing the digit with a _, prepend the unsafe prefix
    • quoted identifiers are also considered unsafe, unless conditional mode is used
    • verbose modes now also return a list of duplicate header names
  • update MSRV to 1.67.0
  • cargo update bump dependencies
  • disable optimization on test profile for faster CI compilation, which was taking much longer than test run time
  • optimize prebuilt nightlies to compile with target-cpu=native
  • pin Rust nightly to 2023-02-01
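
The change to digit-leading headers in safenames could look roughly like the sketch below. It is not qsv's algorithm: the function name, the `unsafe_` prefix value and the rule for replacing characters are all hypothetical and serve only to show "prepend a prefix" versus "overwrite the digit".

```python
import re

UNSAFE_PREFIX = "unsafe_"  # hypothetical prefix, chosen for illustration


def safe_name(header: str, prefix: str = UNSAFE_PREFIX) -> str:
    # Assumed rule: anything that is not a letter, digit or underscore becomes "_".
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", header)
    # Keep the leading digit and prepend the prefix,
    # instead of overwriting the digit with "_".
    if cleaned[:1].isdigit():
        cleaned = prefix + cleaned
    return cleaned
```

Prepending keeps the original digit visible in the resulting name, so two headers that differ only in their first digit do not collapse into the same safe name.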

Fixed

  • safenames: fixed mode behavior inconsistencies #754
    all modes now use the same safenames algorithm. Before, the verbose modes used a simpler one leading to inconsistencies between modes (resolves safenames handling inconsistent between modes #753)

Full Changelog: 0.87.0...0.87.1

0.87.0

30 Jan 02:23

Added

  • apply: add decimal separator --replacement option to thousands operation. This fully rounds out thousands formatting, as it will allow formatting numbers to support "euro-style" formats (e.g. 1.234.567,89 instead of 1,234,567.89) #749
  • apply: add round operation; also refactored thousands operation to use more appropriate --formatstr option instead of --comparand option to specify "format" of thousands separator policy #751
  • applydp: add round operation #752
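
As an illustration of the "euro-style" formatting mentioned above, here is a minimal sketch that swaps both separators. The function and parameter names are hypothetical and do not mirror the apply command's options.

```python
def format_thousands(value: float, thousands_sep: str = ",",
                     decimal_sep: str = ".", places: int = 2) -> str:
    # Start from Python's built-in comma grouping, e.g. "1,234,567.89".
    text = f"{value:,.{places}f}"
    integer_part, _, fraction = text.partition(".")
    # Swap in the requested thousands separator, then the decimal separator.
    grouped = integer_part.replace(",", thousands_sep)
    return grouped + (decimal_sep + fraction if fraction else "")


print(format_thousands(1234567.89))            # 1,234,567.89
print(format_thousands(1234567.89, ".", ","))  # 1.234.567,89
```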

Changed

  • changed MSRV policy to track latest Rust version in Homebrew, instead of latest Rust stable
  • removed excess trailing whitespace in apply & applydp usage text
  • moved round_num function from stats.rs to util.rs so it can be used in round operation in apply and applydp
  • cargo update bump dependencies, notably tokio from 1.24.2 to 1.25.0
  • pin Rust nightly to 2023-01-28

Fixed

  • apply: corrected thousands operation usage text - hexfour not hex_four 6545aa2

Full Changelog: 0.86.0...0.87.0

0.86.0

29 Jan 15:14

Added

  • apply: added thousands operation which adds thousands separators to numeric values.
    Specify the separator policy with --comparand (default: comma). The valid policies are:
    comma, dot, space, underscore, hexfour (place a space every four hex digits) and
    indiancomma (place a comma every two digits, except the last three digits). #748
  • searchset: added --unmatched-output option. This was done to allow Datapusher+ to screen for PII more efficiently, writing PII candidate records to one CSV file and the "clean" records to another CSV in just one pass. #742
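
The one-pass split behind --unmatched-output can be sketched as follows. This is an illustration of the idea only: the function name, the sample rows and the sample pattern are made up, and header handling is left out.

```python
import csv
import io
import re


def split_by_patterns(rows, patterns, matched_out, unmatched_out):
    """Write rows with any regex match to matched_out, all others to unmatched_out."""
    regexes = [re.compile(p) for p in patterns]
    matched = csv.writer(matched_out)
    unmatched = csv.writer(unmatched_out)
    counts = {"matched": 0, "unmatched": 0}
    for row in rows:  # each row is visited exactly once
        if any(rx.search(field) for rx in regexes for field in row):
            matched.writerow(row)
            counts["matched"] += 1
        else:
            unmatched.writerow(row)
            counts["unmatched"] += 1
    return counts


# Usage with made-up data: one row resembles an ID number, one does not.
rows = [["ann", "555-12-3456"], ["bob", "none"]]
hits, clean = io.StringIO(), io.StringIO()
counts = split_by_patterns(rows, [r"\d{3}-\d{2}-\d{4}"], hits, clean)
```

Because every row is routed to exactly one of the two writers, the input never has to be read a second time to produce the "clean" file.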

Changed

  • fetch & fetchpost: expanded usage text info on HTTP2 Adaptive Flow Control support
  • fetchpost: added more detail about --compress option
  • stats: added more tests
  • updated prebuilt zip archive READMEs 072973e
  • Bump redis from 0.22.2 to 0.22.3 by @dependabot in #741
  • Bump ahash from 0.8.2 to 0.8.3 by @dependabot in #743
  • Bump jql from 5.1.4 to 5.1.6 by @dependabot in #747
  • applied select clippy recommendations
  • cargo update bump several indirect dependencies
  • pin Rust nightly to 2023-01-27

Fixed

  • stats: fixed antimodes null display. Use the literal NULL instead of just "" when listing NULL as an antimode. #745
  • tojsonl: fixed invalid escaping of JSON values #746

Full Changelog: 0.85.0...0.86.0

0.85.0

23 Jan 03:46

Added

  • Update csvs_convert by @kindly in #736
  • sniff: added --delimiter option #732
  • fetchpost: add --compress option in #737
  • searchset: several tweaks for PII screening requirement of Datapusher+. --flag option now shows regex labels instead of just row number; new --flag-matches-only option sends only matching rows to output when used with --flag; --json option returns rows_with_matches, total_matches and rowcount as json to stderr. #738

Changed

  • luau: minor tweaks to increase code readability 31d01c8
  • stats: now normalizes after rounding. Normalizing strips trailing zeroes and converts -0.0 to 0.0. f838272
  • safenames: mention CKAN-specific options f371ac2
  • fetch & fetchpost: document decompression priority 43ce13c
  • Bump actix-governor from 0.3.2 to 0.4.0 by @dependabot in #728
  • Bump sysinfo from 0.27.6 to 0.27.7 by @dependabot in #730
  • Bump serial_test from 0.10.0 to 1.0.0 by @dependabot in #729
  • Bump pyo3 from 0.17.3 to 0.18.0 by @dependabot in #731
  • Bump reqwest from 0.11.13 to 0.11.14 by @dependabot in #734
  • cargo update bump for other dependencies
  • pin Rust nightly to 2023-01-21
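
The stats change above (normalize after rounding) can be illustrated with a short sketch. The function name and the string output format are assumptions. The notes only state that trailing zeroes are stripped and that -0.0 becomes 0.0.

```python
def round_and_normalize(value: float, places: int = 4) -> str:
    rounded = round(value, places)
    if rounded == 0:
        # -0.0 == 0 is True in Python, so this drops the negative sign.
        rounded = 0.0
    text = f"{rounded:.{places}f}"
    if "." in text:
        # Strip trailing zeroes, then a dangling decimal point.
        text = text.rstrip("0").rstrip(".")
    return text
```

Rounding first matters: a small negative value such as -0.00001 only turns into a negative zero once it has been rounded, so normalizing before rounding would miss it.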

Fixed

  • sniff: now checks that --sample size is greater than zero cd4c390

Full Changelog: 0.84.0...0.85.0

0.84.0

15 Jan 00:11

Added

  • headers: added --trim option to trim quotes and spaces from headers #726

Changed

  • input: --trim-headers option also removes excess quotes #727
  • safenames: trim quotes and spaces from headers 0260833
  • cargo update bump dependencies
  • pin Rust nightly to 2023-01-13

Full Changelog: 0.83.0...0.84.0

0.83.0

13 Jan 18:22

Added

  • stats: add sparsity to "streaming" statistics #719
  • schema: also infer enum constraints for integer fields. Not only good for validation, this is also required by tojsonl for smarter boolean inferencing #721
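
As a rough picture of why sparsity fits among the "streaming" statistics, the sketch below computes it in a single pass with two counters. Treating sparsity as the fraction of empty values is an assumption here, and the function name is hypothetical.

```python
def sparsity(values) -> float:
    """Fraction of values that are empty, computed in one pass."""
    total = 0
    empty = 0
    for value in values:  # works on any iterable, including generators
        total += 1
        if value is None or value == "":
            empty += 1
    return empty / total if total else 0.0
```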

Changed

  • stats: change --typesonly so it will not automatically --infer-dates. Let the user decide. #718
  • stats: if median is already known, use it to calculate Median Absolute Deviation 08ed08d
  • tojsonl: smarter boolean inferencing. It will infer a column as boolean if its domain has only two values,
    and the first characters of those values form one of the following case-insensitive "truthy/falsy"
    combinations: t/f, t/null, 1/0, 1/null, y/n or y/null. These are treated as true/false. #722 and #723
  • safenames: process --reserved option before --prefix option. b333549
  • strum and strum-macros are no longer optional dependencies as we use them with all the binary variants now bea6e00
  • Bump qsv-stats from 0.6.0 to 0.7.0
  • Bump sysinfo from 0.27.3 to 0.27.6
  • Bump hashbrown from 0.13.1 to 0.13.2 by @dependabot in #720
  • Bump actions/setup-python from 4.4.0 to 4.5.0 by @dependabot in #724
  • change MSRV from 1.66.0 to 1.66.1
  • cargo update bump indirect dependencies
  • pin Rust nightly to 2023-01-12
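
The tojsonl boolean inferencing rule in this release can be sketched as below. It follows the combinations listed in the notes, but the function name and the use of None or an empty string to stand for null are assumptions.

```python
TRUTHY_FALSY_PAIRS = {("t", "f"), ("1", "0"), ("y", "n")}
TRUTHY_WITH_NULL = {"t", "1", "y"}


def looks_boolean(domain) -> bool:
    """domain: the distinct values of a column; None or "" stands for null."""
    values = list(domain)
    if len(values) != 2:
        return False
    # Compare only the lowercased first character of each value.
    firsts = [v[0].lower() if v else None for v in values]
    if None in firsts:
        other = firsts[0] if firsts[1] is None else firsts[1]
        return other in TRUTHY_WITH_NULL
    pair = (firsts[0], firsts[1])
    return pair in TRUTHY_FALSY_PAIRS or pair[::-1] in TRUTHY_FALSY_PAIRS
```

Under this sketch, a column containing only "True" and "false" qualifies, as does one containing only "Y" and empty values. A column with "yes" and "maybe" does not.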

Fixed

  • safenames: fixed --prefix option. When checking for invalid underscore prefix, it was checking for hyphen, not underscore, causing a problem with Datapusher+ 4fbbfd3

Full Changelog: 0.82.0...0.83.0

0.82.0

09 Jan 17:05

Changed

  • validate: schema-less validation error improvements #703
  • stats: faster date inferencing #706
  • stats: minor performance tweaks 15e6284 3f0ed2b
  • stats: refactored modes compilation, so that antimodes are no longer unnecessarily compiled beyond the 10 antimodes that are shown. 6e448b0
  • stats: simplify if condition ae7cc85
  • luau: show luau version when invoking --version f7f9c42
  • excel: add "sheet" suffix to end msg for readability ae3a8e3
  • cache util::count_rows result, so if a CSV without an index is queried, it caches the result and future calls to count_rows in the same session will be instantaneous e805ded
  • Bump console from 0.15.3 to 0.15.4 by @dependabot in #704
  • Bump cached from 0.41.0 to 0.42.0 by @dependabot in #709
  • Bump mlua from 0.8.6 to 0.8.7 by @dependabot in #712
  • Bump qsv-stats from 0.5.2 to 0.6.0 with the new MAD statistic support and faster, more memory-efficient antimodes compilation
  • cargo update bump dependencies - notably mimalloc from 0.1.32 to 0.1.34, luau0-src from 0.4.1_luau553 to 0.5.0_luau555, csvs_convert from 0.7.9 to 0.7.11 and regex from 1.7.0 to 1.7.1
  • pin Rust nightly to 2023-01-08
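
The count_rows caching described above is a form of memoization. The sketch below shows the general idea in Python rather than qsv's implementation: the function body, the scan counter and the temporary file are illustrative, and invalidating the cache when a file changes is not handled.

```python
import csv
import functools
import os
import tempfile

scan_calls = 0  # how many times a file was actually scanned


@functools.lru_cache(maxsize=None)
def count_rows(path: str) -> int:
    """Hypothetical stand-in: count data rows by scanning the whole file."""
    global scan_calls
    scan_calls += 1
    with open(path, newline="") as handle:
        # Subtract one for the header row.
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)


# Usage: the second call is answered from the cache without a second scan.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,name\n1,ann\n2,bob\n")
first = count_rows(tmp.name)
second = count_rows(tmp.name)
os.remove(tmp.name)
```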

Fixed

  • tojsonl: fix escaping of unicode string. Replace hand-rolled escape fn with built-in escape_default fn #707. Fixes #705
  • tojsonl: more robust boolean inferencing #710. Fixes #708

Full Changelog: 0.81.0...0.82.0

0.81.0

02 Jan 12:24

[0.81.0] - 2023-01-02

Added

  • stats: added range statistic #691
  • stats: added additional mode stats. For mode, added mode_count and mode_occurrences. Added "antimode" (the opposite of mode: the least frequently occurring value with a non-zero count), antimode_count and antimode_occurrences. #694
  • qsv-dateparser now recognizes unix timestamp values with fractional seconds to nanosecond precision as dates. stats, sniff, apply datefmt and schema, which all use qsv-dateparser, now infer unix timestamps as dates - a29ff8e #702
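
To make the mode and antimode statistics concrete, here is a minimal sketch using a frequency count. The reading of the _count fields as "how many values tie" and of the _occurrences fields as "how often each of them occurs" is an assumption, as is the function name.

```python
from collections import Counter


def modes_and_antimodes(values):
    counts = Counter(values)  # only values that occur at least once appear here
    if not counts:
        return {"mode": [], "mode_count": 0, "mode_occurrences": 0,
                "antimode": [], "antimode_count": 0, "antimode_occurrences": 0}
    highest = max(counts.values())
    lowest = min(counts.values())
    modes = sorted(v for v, n in counts.items() if n == highest)
    antimodes = sorted(v for v, n in counts.items() if n == lowest)
    return {
        "mode": modes, "mode_count": len(modes), "mode_occurrences": highest,
        "antimode": antimodes, "antimode_count": len(antimodes),
        "antimode_occurrences": lowest,
    }
```

For the values a, a, b, c, c, d this sketch reports a and c as modes (2 occurrences each) and b and d as antimodes (1 occurrence each).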

USAGE NOTE: As timestamps can be floats or integers, and data type inferencing guesses dates last, first preprocess timestamp columns with apply datefmt to convert them to more date-like, non-timestamp formats so that other qsv commands recognize them as dates.
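
The kind of conversion this note recommends can be pictured in plain Python. The sketch shows what turning a unix timestamp with fractional seconds into a date-like string means. It does not reproduce apply datefmt, and the function name and the sample value are made up.

```python
from datetime import datetime, timezone


def timestamp_to_rfc3339(ts: float) -> str:
    # Interpret the number as seconds since the unix epoch, in UTC.
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


print(timestamp_to_rfc3339(1672660000.5))  # 2023-01-02T11:46:40.500000+00:00
```

Once a column holds strings like this, it can no longer be mistaken for a float or an integer.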

Changed

  • apply: document numtocurrency --comparand & --replacement behavior cc88fe9
  • index: explicitly flush buffer after creating index ee5d790
  • sample: no longer requires an index to do percentage sampling 45d4657
  • slice: removed unneeded utf8 check 5a199f4
  • schema: expand usage text regarding --strict-dates 3d22829
  • stats: date stats refactor. Date stats are returned in rfc3339 format. Dates are converted to timestamps with millisecond precision while calculating date stats. #690 e7c2977
  • filter out variance/stddev in tests as float precision issues are causing flaky CI tests #696
  • Bump qsv-dateparser from 0.4.4 to 0.6.0
  • Bump qsv-stats from 0.4.6 to 0.5.2
  • Bump qsv-sniffer from 0.5.0 to 0.6.0
  • Bump serde from 1.0.151 to 1.0.152 by @dependabot in #692
  • Bump csvs_convert from 0.7.7 to 0.7.8 by @dependabot in #693
  • Bump once_cell from 0.16.0 to 0.17.0 d3ac255
  • Bump self-update from 0.32.0 to 0.34.0 5f95933
  • Bump cpc from 1.8 to 1.9; set csvs_convert dependency to minor version ee91648
  • applied select clippy recommendations
  • deeplink to Cookbook from Table of Contents
  • pin Rust nightly to 2023-01-01
  • implementation comments on stats, sample, sort & Python distribution

Fixed

  • stats: prevent premature rounding, and make sum statistic use the same rounding method 879214a 1a13620
  • fix autoindex so we return the index path properly d3ce6a3
  • fetch & fetchpost: corrected typo 684036b

Full Changelog: 0.80.0...0.81.0

0.80.0

23 Dec 23:50

Added

  • new to command. Converts CSVs "to" PostgreSQL, SQLite, XLSX, Parquet and Data Package by @kindly in #656
  • apply: add numtocurrency operation #670
  • sort: add --ignore-case option #673
  • stats: now computes summary statistics for dates as well #684
  • added --updatenow option, resolves #661 #662
  • replace footnotes in Available Commands list with emojis 😄

Changed

  • apply & applydp: expose --batch size option #679
  • validate: add last valid row to validation error 7680011
  • input: add last valid row to error message 492e51f
  • upgrade to csvs-convert 0.7.5 by @kindly in #668
  • Bump serial_test from 0.9.0 to 0.10.0 by @dependabot in #671
  • Bump csvs_convert from 0.7.5 to 0.7.7 by @dependabot in #674
  • Bump num_cpus from 1.14.0 to 1.15.0 by @dependabot in #678
  • Bump robinraju/release-downloader from 1.6 to 1.7 by @dependabot in #677
  • Bump actions/stale from 6 to 7 by @dependabot in #676
  • Bump actions/setup-python from 4.3.1 to 4.4.0 by @dependabot in #683
  • added concurrency check to CI tests so that redundant CI tests are canceled when new ones are launched
  • instead of saying "descriptive statistics", use more understandable "summary statistics"
  • changed publishing workflows to enable the to feature for applicable target platforms
  • cargo update bump dependencies, notably qsv-stats from 0.4.5 to 0.4.6 and qsv_currency from 0.5.0 to 0.6.0
  • pin Rust nightly to 2022-12-22

Fixed

  • stats: fix leading zero handling #667
  • apply: fix currencytonum bug #669

Full Changelog: 0.79.0...0.80.0