Fast summary statistics, histograms, and binning – ignoring NaNs

brenhinkeller/NaNStatistics.jl

NaNStatistics

Because NaN is just missing with hardware support!

Fast summary statistics, histograms, and binning — all ignoring NaNs, as if NaN represented missing data.

See also JuliaSIMD/VectorizedStatistics.jl for similar vectorized implementations that don't ignore NaNs.

Summary statistics

Summary statistics exported by NaNStatistics are generally named the same as their normal counterparts, but with "nan" in front of the name, similar to the Matlab and NumPy conventions. Options include:

Reductions
  • nansum
  • nanminimum
  • nanmaximum
  • nanextrema
Measures of central tendency
  • nanmean   arithmetic mean, ignoring NaNs
  • nanmedian   median, ignoring NaNs
  • nanmedian!   as nanmedian but quicksorts in-place for efficiency
Measures of dispersion
  • nanvar   variance
  • nanstd   standard deviation
  • nancov   covariance
  • nancor   Pearson's product-moment correlation
  • nanaad   mean (average) absolute deviation from the mean
  • nanmad   median absolute deviation from the median
  • nanmad!   as nanmad but quicksorts in-place for efficiency
  • nanrange   range between nanmaximum and nanminimum
  • nanpctile   percentile
  • nanpctile!   as nanpctile but quicksorts in-place for efficiency
Other summary statistics
  • nanskewness   skewness
  • nankurtosis   excess kurtosis
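
As a quick illustration of the naming convention, the sketch below calls a few of the functions listed above on a small vector containing one NaN. The input values are made up for illustration.

```julia
using NaNStatistics

# Illustrative data: one NaN standing in for a missing value
x = [1.0, NaN, 3.0, 5.0]

total  = nansum(x)       # sum of the three non-NaN values
center = nanmean(x)      # mean of the non-NaN values
lo, hi = nanextrema(x)   # (minimum, maximum), ignoring the NaN
spread = nanrange(x)     # nanmaximum(x) - nanminimum(x)

println((total, center, lo, hi, spread))
println(sum(x))          # Base.sum propagates the NaN instead
```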

Note that, regardless of implementation, functions involving medians or percentiles are generally significantly slower than other summary statistics, because calculating a median or percentile requires a quicksort or quickselect of the input array. Unless this is done in-place, as in nanmedian! and nanpctile!, a copy of the entire array must also be made.
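
A minimal sketch of the difference between the copying and in-place variants follows. It assumes that nanpctile takes the percentile as its second argument on a 0 to 100 scale; the data and the name scratch are illustrative only.

```julia
using NaNStatistics

x = [3.0, NaN, 1.0, 2.0]

m = nanmedian(x)          # works on a copy; x is left untouched
p = nanpctile(x, 50)      # assumed signature: nanpctile(A, p) with p in 0-100

scratch = copy(x)                 # a buffer we are willing to have reordered
m_inplace = nanmedian!(scratch)   # sorts scratch in place; its element order may change

println((m, p, m_inplace))
```

The in-place form is mainly useful when the input is a temporary array anyway, since it avoids the extra allocation.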

These functions generally support the same dims keyword argument as their normal Julia counterparts (though they are most efficient when operating on an entire collection). As an alternative to dims, the dim keyword is also supported; it behaves identically to dims except that it also drops any singleton dimensions that have been reduced over (as is the norm in some other languages).

julia> a = rand(100000);

julia> minimum(a)
9.70221275542471e-7

julia> using NaNStatistics

julia> nanminimum(a)
9.70221275542471e-7

julia> a[rand(1:100000, 10000)] .= NaN;

julia> nanminimum(a)
7.630517166790085e-6
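
The dims and dim keywords described above can be compared on a small matrix. This is a sketch using nanmean with made-up values; the other reductions are expected to behave the same way.

```julia
using NaNStatistics

A = [1.0 2.0 NaN;
     4.0 NaN 6.0]          # 2×3 matrix with two NaNs

col_means_dims = nanmean(A, dims=1)   # reduced dimension kept as a singleton
col_means_dim  = nanmean(A, dim=1)    # reduced dimension dropped

println(size(col_means_dims))
println(size(col_means_dim))
```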

Histograms

The main 1D and 2D histogram function is histcounts (with an in-place variant histcounts!), which, as you might expect for this package, ignores NaNs. However, it may be worth using for speed even if your data don't contain any NaNs:

julia> b = 10 * rand(100000);

julia> using StatsBase, BenchmarkTools

julia> @btime fit(Histogram, $b, 0:1:10, closed=:right)
  526.750 μs (2 allocations: 208 bytes)
Histogram{Int64, 1, Tuple{StepRange{Int64, Int64}}}
edges:
  0:1:10
weights: [10042, 10105, 9976, 9980, 10073, 10038, 9983, 9802, 10056, 9945]
closed: right
isdensity: false

julia> using NaNStatistics

julia> @btime histcounts($b, 0:1:10)
  155.083 μs (2 allocations: 176 bytes)
10-element Vector{Int64}:
 10042
 10105
  9976
  9980
 10073
 10038
  9983
  9802
 10056
  9945

(Timings as of Julia v1.10.4, NaNStatistics v0.6.36, Apple M1 Max)

In addition, several functions are provided to estimate the summary statistics of a dataset from its histogram, specifically:

  • histmean   arithmetic mean
  • histvar   variance
  • histstd   standard deviation
  • histskewness   skewness
  • histkurtosis   excess kurtosis
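
A short sketch of how these might be combined with histcounts is given below. It assumes the histogram statistics take the bin counts followed by the bin centers, i.e. histmean(counts, bincenters), and that the centers are the midpoints of the bin edges; the data are illustrative.

```julia
using NaNStatistics

binedges = 0:1:3
data = [0.5, 1.5, NaN, 1.5, 2.5]      # the NaN is skipped by histcounts

counts = histcounts(data, binedges)   # number of values falling in each bin
bincenters = (binedges[1:end-1] + binedges[2:end]) / 2   # midpoints of the bins

mu = histmean(counts, bincenters)     # mean estimated from the histogram (assumed argument order)
sigma = histstd(counts, bincenters)   # standard deviation estimated the same way

println((counts, mu, sigma))
```

Because only the bin centers are known, such estimates are limited by the bin width, so narrower bins give a closer match to the statistics of the raw data.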

Binning

NaNStatistics also provides functions that will efficiently calculate the summary statistics of a given dependent variable y binned by an independent variable x. These currently include:

  • nanbinmean / nanbinmean!
  • nanbinmedian / nanbinmedian!
julia> x = 10 * rand(100000);

julia> y = x.^2 .+ randn.();

julia> xmin, xmax, nbins = 0, 10, 10;

julia> @btime nanbinmean($x,$y,xmin,xmax,nbins)
  222.542 μs (2 allocations: 288 bytes)
10-element Vector{Float64}:
  0.3482167982440996
  2.32463720126215
  6.348942343257478
 12.352990978599395
 20.34955219534221
 30.31123519946431
 42.3578375163112
 56.33841854482159
 72.23884588251572
 90.30275863080671
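
To make the operation concrete, here is a plain-Julia sketch of what a binned mean computes. The helper naive_binmean is hypothetical and is not the package's implementation; in particular, its handling of values exactly on bin edges is an assumption of this sketch.

```julia
# Hypothetical reference helper (not part of NaNStatistics): mean of y within
# nbins equal-width bins of x spanning [xmin, xmax), treating NaN as missing.
function naive_binmean(x, y, xmin, xmax, nbins)
    sums = zeros(nbins)
    counts = zeros(Int, nbins)
    binwidth = (xmax - xmin) / nbins
    for (xi, yi) in zip(x, y)
        (isnan(xi) || isnan(yi)) && continue   # skip missing values
        (xmin <= xi < xmax) || continue        # drop points outside the binned range
        # Bin index; min() guards against floating-point rounding at the top edge
        k = min(floor(Int, (xi - xmin) / binwidth) + 1, nbins)
        sums[k] += yi
        counts[k] += 1
    end
    return sums ./ counts                      # empty bins give 0/0 = NaN
end

x = [0.5, 0.6, 1.5, 1.6, 2.5]
y = [1.0, 3.0, 10.0, NaN, 7.0]
result = naive_binmean(x, y, 0, 3, 3)
println(result)
```

The package functions are written for speed rather than clarity, so this loop is only meant to show which values end up averaged together.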

Other functions

  • movmean   a simple moving average function, which can operate in 1D or 2D, ignoring NaNs
julia> A = rand(1:10, 4,4)
4×4 Matrix{Int64}:
 3  5  10  3
 4  2   5  8
 5  6   8  8
 2  6  10  6

julia> movmean(A, 3)
4×4 Matrix{Float64}:
 3.5      4.83333  5.5      6.5
 4.16667  5.33333  6.11111  7.0
 4.16667  5.33333  6.55556  7.5
 4.75     6.16667  7.33333  8.0
  • nanstandardize / nanstandardize!   de-mean and set to unit variance
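
As a rough sketch of both functions on data that contain a NaN (the values are illustrative, and movmean is assumed to accept a vector with the window width as its second argument, as in the 2D example above):

```julia
using NaNStatistics

x = [1.0, 2.0, NaN, 3.0]

smoothed = movmean(x, 3)     # moving average over a window of 3 points
z = nanstandardize(x)        # de-meaned and scaled to unit variance

println(smoothed)
println((nanmean(z), nanstd(z)))   # expected to be approximately (0, 1)
```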
