A set of command line tools for data analysis, data mining and exploration. Most scripts are designed to work on columnar numeric data, for instance CSV data, with a configurable separator (defaulting to \t) and an optional header row.
- Python 3.6. Other versions of Python may work, but this is what I've tested with
- pipx / pip
Just run ./build.sh! This installs pipx with pip if it is not already installed, then uses pipx to build the library in a virtual env and put the scripts on your path. By default they are placed in ~/.local/bin, which may not be on your path; pipx tries to add it.
The build.sh script also accepts an optional parameter "iterm" to facilitate inline plotting in iTerm2.
- describe: provides a wide variety of descriptive statistics for each column. Good for quick summary analysis.
- reservoir_sampling: samples a specified number of rows from input, works for numeric and non-numeric data
- shuffle_lines: shuffles the input data (in memory)
- plot_hist: plots a histogram of values on the input columns
- plot_lines: line plots of the input columns
- plot_hex: hex plots (2D histograms) on the cross product of input columns
- json_to_csv: convert json data to csv with various options
- csv_to_json: convert csv data to json with various options
- ztest: perform the z-test
- normal: draw from a normal distribution
- exponential: draw from an exponential distribution
- poisson: draw from a poisson distribution
- column_selector: slightly more powerful version of cut
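
For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the classic one-pass technique (often called Algorithm R). It is only an illustration of the idea, not the code used by the tool in this repo; the function name and its `seed` parameter are made up for the example. It keeps at most `k` rows in memory no matter how long the input is, and it never looks at row contents, so it works equally well for numeric and non-numeric lines.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random, reading rows once.

    Hypothetical helper for illustration; not part of this toolkit.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

For example, `reservoir_sample(open("data.tsv"), 5)` would return five lines from a (hypothetical) file without loading the whole file. If the input has fewer than `k` rows, all of them are returned.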
normal | shuffle_lines | reservoir_sample -n 5
normal -D 2 | plot_hex
paste <(perl -e 'for ($i = 0; $i < 10000; $i ++) { print rand(), "\t", ("a".."z")[int(rand()*26)], "\t", 50*rand(), "\n" } ') <(normal -n 10000) | describe
perl -e 'BEGIN{print "SHOES\tGLASSES\n"} for ($i = 0; $i < 100; $i ++) { print rand(), "\t", 50*rand(), "\n" } ' | plot_xy -O -H
paste <(normal -m 5 -s 20 -n 10000) <(perl -e 'foreach (1..10000) { print rand(), "\n"}' ) | plot_hex
paste <(normal -m 5 -s 20 -n 10000) <(perl -e 'foreach (1..10000) { print rand(), "\n"}' ) <(exponential -n 10000) <(poisson -n 10000 -l 1) | describe
paste <(normal -m 5 -s 20 -n 10000) <(perl -e 'foreach (1..10000) { print rand(), "\n"}' ) | csv_to_json -L | json_to_csv -L -i
paste <(poisson -n 10000) <(normal -m 5 -s 20 -n 10000) <(perl -e 'foreach (1..10000) { print rand(), "\n"}' ) | csv_to_json -L | json_to_csv -L -i | column_selector -C 0,1 -H
paste <(perl -le 'foreach (1..50){ $r = rand(); $x = (-2.0*log(($r < 0.5) ? $r : 1 - $r))**0.5; $o = $x - ((0.010328*$x + 0.802853)*$x + 2.515517)/(((0.001308*$x + 0.189269)*$x + 1.432788)*$x + 1.0); print (($r < 0.5) ? $o : -$o)}') <(perl -le 'foreach (1..50){ print rand() }') | plot_xy
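
The first perl one-liner in the last example is dense: it turns uniform random numbers into approximately normal ones using a rational approximation to the inverse normal CDF (the constants match Abramowitz and Stegun, formula 26.2.23). A rough Python equivalent is sketched below for readability. The function name is invented for this example and is not part of the toolkit. The sketch uses the usual lower-tail convention, while the one-liner's sign is mirrored; for uniform input both yield the same distribution.

```python
import math
import random

# Coefficients of Abramowitz & Stegun formula 26.2.23,
# the same numbers that appear in the perl one-liner.
C0, C1, C2 = 2.515517, 0.802853, 0.010328
D1, D2, D3 = 1.432788, 0.189269, 0.001308


def approx_normal_quantile(p: float) -> float:
    """Approximate x such that P(X <= x) = p for a standard normal X.

    Illustrative helper only; accuracy is roughly three decimal places.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Work with the smaller tail, then restore the sign at the end.
    q = p if p < 0.5 else 1.0 - p
    t = math.sqrt(-2.0 * math.log(q))
    numerator = (C2 * t + C1) * t + C0
    denominator = ((D3 * t + D2) * t + D1) * t + 1.0
    x = t - numerator / denominator
    return -x if p < 0.5 else x


def approx_normal_draws(n: int, seed=None):
    """Draw n approximately standard-normal values by inverting uniforms."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # random() can return exactly 0.0, where log is undefined
            draws.append(approx_normal_quantile(u))
    return draws
```

For real work the `normal` tool above (or `random.gauss` in Python) is the simpler choice; the approximation is mainly handy when a one-liner is all you have.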