Releases: dathere/qsv
0.88.0
Added
- extdedup: new command to deduplicate arbitrarily large CSV/text files using a memory-buffered, on-disk hash table. Not only does it dedup very large files using constant memory, it does so while retaining the file's original sort order, unlike dedup, which loads the entire file into memory to sort it first before deduping by comparing neighboring rows #762
- Added Out-of-Memory (OOM) handling for "non-streaming" commands (i.e. commands that load the entire file into memory) using a heuristic: if an input file's size is larger than the free memory available minus a default headroom of 20 percent, qsv processing stops gracefully with a detailed message about the potential OOM condition. This headroom can be adjusted using the QSV_FREEMEMORY_HEADROOM_PCT environment variable, which has a minimum value of 10 percent #767
- add -Q, --quiet option to all commands that return counts to stderr (dedup, extdedup, search, searchset and replace) in #768
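The OOM check described above can be sketched roughly as follows. This is an illustrative sketch only, not qsv's actual implementation: the function names `headroom_from_env` and `would_risk_oom` are hypothetical, and free memory is passed in as a plain number rather than queried from the operating system.

```python
import os


def headroom_from_env(default_pct: int = 20) -> int:
    """Read the headroom percentage from QSV_FREEMEMORY_HEADROOM_PCT, if set."""
    raw = os.environ.get("QSV_FREEMEMORY_HEADROOM_PCT", str(default_pct))
    # The release notes state a minimum of 10 percent, so clamp lower values.
    return max(int(raw), 10)


def would_risk_oom(file_size: int, free_memory: int, headroom_pct: int = 20) -> bool:
    """Return True if loading `file_size` bytes would eat into the headroom."""
    headroom_pct = max(headroom_pct, 10)
    # Usable memory is the free memory minus the reserved headroom.
    usable = free_memory * (100 - headroom_pct) // 100
    return file_size > usable
```

With 1000 bytes free and the default 20 percent headroom, anything larger than 800 bytes would be flagged before the file is loaded.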
Changed
- sort & sortcheck: separate test suites and link from usage text #756
- frequency: amortize allocations, preallocate with_capacity. Informal benchmarking shows an improvement of ~30%! 🚀 #761
- extsort: refactor. Aligned options with extdedup; now also support stdin/stdout; added --memory-limit option #763
- safenames: minor optimization a7df378
- excel: minor optimization 75eac78
- stats: add date inferencing false positive warning, with a recommendation how to prevent false positives a84a4e6
- sortcheck: added note to usage text that dupe_count is only valid if file is sorted ab69f14
- reorganized Installation section to differentiate installation options 9ef8bfc
- bump MSRV to 1.67.1
- applied select clippy recommendations
- Bump flexi_logger from 0.25.0 to 0.25.1 by @dependabot in #755
- Bump pyo3 from 0.18.0 to 0.18.1 by @dependabot in #757
- Bump serde_json from 1.0.92 to 1.0.93 by @dependabot in #760
- Bump filetime from 0.2.19 to 0.2.20 by @dependabot in #759
- Bump self_update from 0.34.0 to 0.35.0 by @dependabot in #765
- cargo update bump several indirect dependencies
- pin Rust nightly to 2023-02-12
Fixed
- sortcheck: correct wrong progress message showing invalid dupe_count (as dupe count is only valid if the file is sorted) 8eaa824
- py & luau: correct usage text about stderr 1b56e72
Full Changelog: 0.87.1...0.88.0
0.87.1
Changed
- safenames: refactor in #754
  - better handling of headers that start with a digit: instead of replacing the digit with a _, prepend the unsafe prefix
  - quoted identifiers are also considered unsafe, unless conditional mode is used
  - verbose modes now also return a list of duplicate header names
- update MSRV to 1.67.0
- cargo update bump dependencies
- disable optimization on test profile for faster CI compilation, which was taking much longer than test run time
- optimize prebuilt nightlies to compile with target-cpu=native
- pin Rust nightly to 2023-02-01
Fixed
- safenames: fixed mode behavior inconsistencies #754. All modes now use the same safenames algorithm. Before, the verbose modes used a simpler one, leading to inconsistencies between modes (resolves "safenames handling inconsistent between modes" #753)
Full Changelog: 0.87.0...0.87.1
0.87.0
Added
- apply: add decimal separator --replacement option to thousands operation. This fully rounds out thousands formatting, as it will allow formatting numbers to support "euro-style" formats (e.g. 1.234.567,89 instead of 1,234,567.89) #749
- apply: add round operation; also refactored thousands operation to use more appropriate --formatstr option instead of --comparand option to specify "format" of thousands separator policy #751
- applydp: add round operation #752
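To illustrate what "euro-style" formatting means in practice, here is a small sketch that swaps both the thousands separator and the decimal separator. It is not qsv code: `format_thousands` and its parameters are hypothetical names chosen for the example.

```python
def format_thousands(value: float, separator: str = ",",
                     decimal: str = ".", places: int = 2) -> str:
    """Format a number with custom thousands and decimal separators."""
    # Start from Python's built-in comma grouping, e.g. "1,234,567.89".
    text = f"{value:,.{places}f}"
    whole, _, frac = text.partition(".")
    # Swap in the requested thousands separator.
    whole = whole.replace(",", separator)
    # Re-attach the fractional part with the requested decimal separator.
    return f"{whole}{decimal}{frac}" if frac else whole
```

Calling `format_thousands(1234567.89, separator=".", decimal=",")` yields the euro-style form from the release note, while the defaults yield the comma-grouped form.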
Changed
- changed MSRV policy to track latest Rust version in Homebrew, instead of latest Rust stable
- removed excess trailing whitespace in apply & applydp usage text
- moved round_num function from stats.rs to util.rs so it can be used in round operation in apply and applydp
- cargo update bump dependencies, notably tokio from 1.24.2 to 1.25.0
- pin Rust nightly to 2023-01-28
Fixed
- apply: corrected thousands operation usage text - hexfour not hex_four 6545aa2
Full Changelog: 0.86.0...0.87.0
0.86.0
Added
- apply: added thousands operation which adds thousands separators to numeric values. Specify the separator policy with --comparand (default: comma). The valid policies are: comma, dot, space, underscore, hexfour (place a space every four hex digits) and indiancomma (place a comma every two digits, except the last three digits). #748
- searchset: added --unmatched-output option. This was done to allow Datapusher+ to screen for PIIs more efficiently, writing PII candidate records in one CSV file and the "clean" records in another CSV in just one pass. #742
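The separator policies listed for the thousands operation can be illustrated with a short sketch. This is a hypothetical illustration, not qsv's implementation: `group_digits` and `_group_from_right` are made-up names, the input is assumed to be a plain string of digits with no sign or fraction, and grouping hexfour from the right is an assumption.

```python
def _group_from_right(digits: str, size: int, sep: str) -> str:
    """Split `digits` into chunks of `size`, counting from the right."""
    chunks = []
    while digits:
        chunks.append(digits[-size:])
        digits = digits[:-size]
    return sep.join(reversed(chunks))


def group_digits(digits: str, policy: str = "comma") -> str:
    """Group a digit string according to a named separator policy."""
    simple = {"comma": ",", "dot": ".", "space": " ", "underscore": "_"}
    if policy in simple:
        return _group_from_right(digits, 3, simple[policy])
    if policy == "hexfour":
        # A space every four hex digits.
        return _group_from_right(digits, 4, " ")
    if policy == "indiancomma":
        # The last three digits stay together; the rest are grouped in twos.
        if len(digits) <= 3:
            return digits
        return _group_from_right(digits[:-3], 2, ",") + "," + digits[-3:]
    raise ValueError(f"unknown policy: {policy}")
```

For example, `group_digits("1234567", "indiancomma")` groups the number as 12,34,567, whereas the default comma policy gives 1,234,567.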
Changed
- fetch & fetchpost: expanded usage text info on HTTP2 Adaptive Flow Control support
- fetchpost: added more detail about --compress option
- stats: added more tests
- updated prebuilt zip archive READMEs 072973e
- Bump redis from 0.22.2 to 0.22.3 by @dependabot in #741
- Bump ahash from 0.8.2 to 0.8.3 by @dependabot in #743
- Bump jql from 5.1.4 to 5.1.6 by @dependabot in #747
- applied select clippy recommendations
- cargo update bump several indirect dependencies
- pin Rust nightly to 2023-01-27
Fixed
- stats: fixed antimodes null display. Use the literal NULL instead of just "" when listing NULL as an antimode. #745
- tojsonl: fixed invalid escaping of JSON values #746
Full Changelog: 0.85.0...0.86.0
0.85.0
Added
- Update csvs_convert by @kindly in #736
- sniff: added --delimiter option #732
- fetchpost: add --compress option in #737
- searchset: several tweaks for PII screening requirement of Datapusher+. --flag option now shows regex labels instead of just row number; new --flag-matches-only option sends only matching rows to output when used with --flag; --json option returns rows_with_matches, total_matches and rowcount as json to stderr. #738
Changed
- luau: minor tweaks to increase code readability 31d01c8
- stats: now normalizes after rounding. Normalizing strips trailing zeroes and converts -0.0 to 0.0. f838272
- safenames: mention CKAN-specific options f371ac2
- fetch & fetchpost: document decompression priority 43ce13c
- Bump actix-governor from 0.3.2 to 0.4.0 by @dependabot in #728
- Bump sysinfo from 0.27.6 to 0.27.7 by @dependabot in #730
- Bump serial_test from 0.10.0 to 1.0.0 by @dependabot in #729
- Bump pyo3 from 0.17.3 to 0.18.0 by @dependabot in #731
- Bump reqwest from 0.11.13 to 0.11.14 by @dependabot in #734
- cargo update bump for other dependencies
- pin Rust nightly to 2023-01-21
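The "normalize after rounding" change to stats in this release can be pictured with the sketch below. It is only an approximation of the idea: `round_and_normalize` is a hypothetical name, and the exact output shape qsv produces (for instance "0" versus "0.0") is an assumption here.

```python
def round_and_normalize(value: float, places: int = 4) -> str:
    """Round to `places` decimals, then strip trailing zeroes and negative zero."""
    text = f"{value:.{places}f}"
    if "." in text:
        # Drop trailing zeroes, then a dangling decimal point if one is left.
        text = text.rstrip("0").rstrip(".")
    # A tiny negative value can round to "-0"; report it as plain zero.
    if text == "-0":
        text = "0"
    return text
```

Rounding first and normalizing second matters: a value such as -0.00001 only becomes a negative zero after it has been rounded to four places.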
Fixed
- sniff: now checks that --sample size is greater than zero cd4c390
Full Changelog: 0.84.0...0.85.0
0.84.0
Added
- headers: added --trim option to trim quotes and spaces from headers #726
Changed
- input: --trim-headers option also removes excess quotes #727
- safenames: trim quotes and spaces from headers 0260833
- cargo update bump dependencies
- pin Rust nightly to 2023-01-13
Full Changelog: 0.83.0...0.84.0
0.83.0
Added
- stats: add sparsity to "streaming" statistics #719
- schema: also infer enum constraints for integer fields. Not only good for validation, this is also required by tojsonl for smarter boolean inferencing #721
Changed
- stats: change --typesonly so it will not automatically --infer-dates. Let the user decide. #718
- stats: if median is already known, use it to calculate Median Absolute Deviation 08ed08d
- tojsonl: smarter boolean inferencing. It will infer a column as boolean if it only has a domain of two values, and the first character of the values are one of the following case-insensitive "truthy/falsy" combinations: t/f; t/null; 1/0; 1/null; y/n & y/null are treated as true/false. #722 and #723
- safenames: process --reserved option before --prefix option. b333549
- strum and strum-macros are no longer optional dependencies as we use it with all the binary variants now bea6e00
- Bump qsv-stats from 0.6.0 to 0.7.0
- Bump sysinfo from 0.27.3 to 0.27.6
- Bump hashbrown from 0.13.1 to 0.13.2 by @dependabot in #720
- Bump actions/setup-python from 4.4.0 to 4.5.0 by @dependabot in #724
- change MSRV from 1.66.0 to 1.66.1
- cargo update bump indirect dependencies
- pin Rust nightly to 2023-01-12
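The tojsonl boolean inferencing rule in this release (a two-value domain whose first characters form a truthy/falsy pair) can be sketched as follows. This is a hedged illustration rather than the actual tojsonl code: the names `is_boolean_domain`, `to_bool` and `_key` are hypothetical, and null is assumed to be represented by `None` or an empty string.

```python
# Truthy/falsy first-character pairs named in the release note.
PAIRS = [("t", "f"), ("1", "0"), ("y", "n")]
TRUTHY = {"t", "1", "y"}


def _key(value):
    """Lowercased first character, or None for a null/empty value."""
    return value[0].lower() if value else None


def is_boolean_domain(values) -> bool:
    """True if the distinct values form one of the truthy/falsy combinations."""
    domain = set(values)
    if len(domain) != 2:
        return False
    keys = {_key(v) for v in domain}
    # Either a truthy/falsy pair (t/f) or truthy paired with null (t/null).
    return any(keys == {t, f} or keys == {t, None} for t, f in PAIRS)


def to_bool(value) -> bool:
    """Map a value from an inferred boolean column to True/False."""
    return _key(value) in TRUTHY
```

Because only the first character is inspected, `{"Yes", "no"}` and `{"TRUE", None}` both qualify, while a three-value domain never does.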
Fixed
- safenames: fixed --prefix option. When checking for invalid underscore prefix, it was checking for hyphen, not underscore, causing a problem with Datapusher+ 4fbbfd3
Full Changelog: 0.82.0...0.83.0
0.82.0
Added
- diff: Find the difference between two CSVs ludicrously fast! by @janriemer in #711
- stats: added Median Absolute Deviation (MAD) #715
- added Testing section to README 517d69b
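For readers unfamiliar with the MAD statistic added to stats here, the standard definition is the median of the absolute deviations from the median. The sketch below shows that textbook definition using Python's standard library; it says nothing about how qsv-stats computes it internally, and the function name is chosen for the example.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the data's median."""
    center = median(values)
    return median(abs(v - center) for v in values)
```

For the data 1, 1, 2, 2, 4, 6, 9 the median is 2, the absolute deviations are 1, 1, 0, 0, 2, 4, 7, and their median (the MAD) is 1.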
Changed
- validate: schema-less validation error improvements #703
- stats: faster date inferencing #706
- stats: minor performance tweaks 15e6284 3f0ed2b
- stats: refactored modes compilation, with antimodes no longer unnecessarily compiling more than 10 antimodes it won't show anyway. 6e448b0
- stats: simplify if condition ae7cc85
- luau: show luau version when invoking --version f7f9c42
- excel: add "sheet" suffix to end msg for readability ae3a8e3
- cache util::count_rows result, so if a CSV without an index is queried, it caches the result and future calls to count_rows in the same session will be instantaneous e805ded
- Bump console from 0.15.3 to 0.15.4 by @dependabot in #704
- Bump cached from 0.41.0 to 0.42.0 by @dependabot in #709
- Bump mlua from 0.8.6 to 0.8.7 by @dependabot in #712
- Bump qsv-stats from 0.5.2 to 0.6.0 with the new MAD statistic support and faster, more memory-efficient antimodes compilation
- cargo update bump dependencies - notably mimalloc from 0.1.32 to 0.1.34, luau0-src from 0.4.1_luau553 to 0.5.0_luau555, csvs_convert from 0.7.9 to 0.7.11 and regex from 1.7.0 to 1.7.1
- pin Rust nightly to 2023-01-08
Fixed
- tojsonl: fix escaping of unicode string. Replace hand-rolled escape fn with built-in escape_default fn #707. Fixes #705
- tojsonl: more robust boolean inferencing #710. Fixes #708
New Contributors
- @janriemer made their first contribution in #711
Full Changelog: 0.81.0...0.82.0
0.81.0
[0.81.0] - 2023-01-02
Added
- stats: added range statistic #691
- stats: added additional mode stats. For mode, added mode_count and mode_occurrences. Added "antimode" (opposite of mode - least frequently non-zero occurring value), antimode_count and antimode_occurrences. #694
- qsv-dateparser now recognizes unix timestamp values with fractional seconds to nanosecond precision as dates. stats, sniff, apply datefmt and schema, which all use qsv-dateparser, now infer unix timestamps as dates - a29ff8e #702
  USAGE NOTE: As timestamps can be float or integer, and data type inferencing will guess dates last, preprocess timestamp columns with apply datefmt first to more date-like, non-timestamp formats, so they are recognized as dates by other qsv commands.
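The mode and antimode statistics added to stats in this release can be illustrated with a small frequency count. This is a sketch under assumptions, not the qsv-stats implementation: `mode_stats` is a hypothetical name, and reading the `_count` fields as "number of tied values" and the `_occurrences` fields as "how often each one appears" is an interpretation of the release note.

```python
from collections import Counter


def mode_stats(values):
    """Compute modes and antimodes with their counts and occurrences."""
    counts = Counter(values)
    if not counts:
        return {}
    most = max(counts.values())
    least = min(counts.values())
    modes = sorted(v for v, c in counts.items() if c == most)
    antimodes = sorted(v for v, c in counts.items() if c == least)
    return {
        "mode": modes,
        "mode_count": len(modes),            # how many values tie for mode
        "mode_occurrences": most,            # how often each mode appears
        "antimode": antimodes,
        "antimode_count": len(antimodes),    # how many values tie for antimode
        "antimode_occurrences": least,       # how often each antimode appears
    }
```

For the column a, a, b, c, c, d this reports two modes (a and c, each occurring twice) and two antimodes (b and d, each occurring once).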
Changed
- apply: document numtocurrency --comparand & --replacement behavior cc88fe9
- index: explicitly flush buffer after creating index ee5d790
- sample: no longer requires an index to do percentage sampling 45d4657
- slice: removed unneeded utf8 check 5a199f4
- schema: expand usage text regarding --strict-dates 3d22829
- stats: date stats refactor. Date stats are returned in rfc3339 format. Dates are converted to timestamps with millisecond precision while calculating date stats. #690 e7c2977
- filter out variance/stddev in tests as float precision issues are causing flaky CI tests #696
- Bump qsv-dateparser from 0.4.4 to 0.6.0
- Bump qsv-stats from 0.4.6 to 0.5.2
- Bump qsv-sniffer from 0.5.0 to 0.6.0
- Bump serde from 1.0.151 to 1.0.152 by @dependabot in #692
- Bump csvs_convert from 0.7.7 to 0.7.8 by @dependabot in #693
- Bump once_cell from 0.16.0 to 0.17.0 d3ac255
- Bump self-update from 0.32.0 to 0.34.0 5f95933
- Bump cpc from 1.8 to 1.9; set csvs_convert dependency to minor version ee91648
- applied select clippy recommendations
- deeplink to Cookbook from Table of Contents
- pin Rust nightly to 2023-01-01
- implementation comments on stats, sample, sort & Python distribution
Fixed
- stats: prevent premature rounding, and make sum statistic use the same rounding method 879214a 1a13620
- fix autoindex so we return the index path properly d3ce6a3
- fetch & fetchpost: corrected typo 684036b
Full Changelog: 0.80.0...0.81.0
0.80.0
Added
- new to command. Converts CSVs "to" PostgreSQL, SQLite, XLSX, Parquet and Data Package by @kindly in #656
- apply: add numtocurrency operation #670
- sort: add --ignore-case option #673
- stats: now computes summary statistics for dates as well #684
- added --updatenow option, resolves #661 #662
- replace footnotes in Available Commands list with emojis 😄
Changed
- apply & applydp: expose --batch size option #679
- validate: add last valid row to validation error 7680011
- input: add last valid row to error message 492e51f
- upgrade to csvs-convert 0.7.5 by @kindly in #668
- Bump serial_test from 0.9.0 to 0.10.0 by @dependabot in #671
- Bump csvs_convert from 0.7.5 to 0.7.7 by @dependabot in #674
- Bump num_cpus from 1.14.0 to 1.15.0 by @dependabot in #678
- Bump robinraju/release-downloader from 1.6 to 1.7 by @dependabot in #677
- Bump actions/stale from 6 to 7 by @dependabot in #676
- Bump actions/setup-python from 4.3.1 to 4.4.0 by @dependabot in #683
- added concurrency check to CI tests so that redundant CI tests are canceled when new ones are launched
- instead of saying "descriptive statistics", use more understandable "summary statistics"
- changed publishing workflows to enable to feature for applicable target platforms
- cargo update bump dependencies, notably qsv-stats from 0.4.5 to 0.4.6 and qsv_currency from 0.5.0 to 0.6.0
- pin Rust nightly to 2022-12-22
Fixed
Full Changelog: 0.79.0...0.80.0