-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: add bincount #812
Comments
Thanks for the suggestion @ogrisel!
Signatures do all seem compatible. There is at least one difference in behavior I spotted: the accepted input is non-negative integer values, but the case of negative values is not treated the same - it may raise (NumPy), or clip (JAX). The NumPy issue tracker contains a lot of open issues about The PyTorch docs also contain a warning about non-deterministic gradients - probably related to the data-dependent behavior:
Yes, that's non-ideal. CuPy also adds a warning about The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:
|
Just curious, but are you sure you want to keep using the name |
To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization. Then you can aggregate those values as a histogram of per bin counts if you wish, but it's not the only thing you can do. In machine learning, binned values are used in many various ways and only some involve histograms, and often only filtered subsets of the original binned array. |
Right, I agree that that's what happens when you "bin" floating point values. This function (
I understand what you're saying. But, it's also a fact that the result of I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting". If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine:
I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't. |
An alternative to choosing a different name would be to replace def integer_histogram(x,
/,
bins: int | Array,
range: None | tuple[int, int] = None,
density: bool = False,
weights: Array
) -> Array:
"""Compute the histogram of an integer dataset.
Args:
x: The input data. The histogram is computed over the flattened array.
bins: If bins is an int, it defines the number of equal-width bins in the given range.
If bins is an integer array, it defines a monotonically increasing array of bin edges,
including the rightmost edges, which allows for non-uniform bin widths.
range: The lower and upper range of the bins. range[1] - range[0] + 1 must be divisible by
the integer bins.
weights: An array of weights, of the same shape as a. Each value in a only contributes its
associated weight towards the bin count (instead of 1). If density is True, the weights
are normalized, so that the integral of the density over the range remains 1.
density: If False, the result will contain the number of samples in each bin. If True, the
result is the value of the probability density function at the bin, normalized such that
the integral over the range is 1. Note that the sum of the histogram values will not be
equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
""" It's just a generalization that supports normalization and general-width bins. Scanning how |
It seems quite widely adopted:
Although I have not checked how compatible they are.
JAX in particular requires an extra
length
argument to statically define the length of the output array to make JIT possible and dask uses(nan,)
shape to represent a data-dependent output shapes.The text was updated successfully, but these errors were encountered: