You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This comment is for @mrshirts and @jchodera with relation to the implementation and theory in fes.py for samples
I managed to accidentally recreate the bug in #502 and have attached a minimal script to throw the error. However, in investigating what happened, I found 2 different bugs in the logic I wanted to discuss what the "correct" way to handle is.
Bug 1: Empty Bins
#502 seems to throw an error when a bin in the bin does not have a sample. In theory, the logic here is supposed to ensure every bin has a sample but, because of the logic slightly above it to assign bin_labels it will only ever check bins with samples, so the sanity logic can never trigger, nor should it for bootstrap samples. This will not manifest until the user gets the histogram here because its iterating over the N-D grid of bins generated here
I'm not sure what the correct approach is here, as suggested in #502, we could just flag any bin without samples at the line and set a np.nanas we do right below checking the bin label, but I don't know if thats technically or scientifically correct here since we can estimate bins without samples using MBAR (although in an FES histogram bin, it might make sense to nan the bin because it doesn't have samples.
Thoughts?
Bug 2: Samples > Bin upper bound are assigned to the maximum bin
The logic used in calls to np.digitize are assigning samples greater than the upper bound of a dimension to the maximum bin instead of omitting it. This leads to a bias on the bins closest to the maximum in any dimension. This does NOT throw an error through, even though its scientifically wrong.
The main function in question is np.digitize which does bin assignment for a sequence of samples and bin edges.
The two problematic lines in question are this line
Where a value is measured, checked against bins, and then its digitize index is shifted down one to say that "sample X is assigned to bin index Y." Which is fine for any sample less than the max bin edge for given dimension d. The logic of digitize is X < bins.min() --> 0 and X >= bins.max() --> len(bins). This means out of bounds on the low end is 0 which downshifted is index -1 that is checked against in various places in the code. However out of bounds on the high end is downshifted to len(bens) -1 which is a valid index for bins and does NOT throw an error, assigning out of bounds samples to a bin.
My proposed fix for this with minimal effort is to run the following logic:
bin_assignment=np.digitize(x_n[:, d], bins[d])
bin_assignments[np.where(bin_assignments==len(bins[d]))] =0# Shift all out-of-upper-bounds bins to invalidbin_n[:, d] =bin_assignments-1# reassign to bin index
and
fes_ref_bin=np.digitize(fes_reference[d], bins[d])
fes_ref_bin[np.where(fes_ref_bin==len(bins[d]))] =0# Shift all out-of-upper-bounds bins to invalidfes_ref_grid[d] =fes_ref_bin-1# reassign to bin index
The other two digitize calls in the code are captured here and also suffer this problem because they are allowing assignment of things out of bounds to the bin. They can be fixed in similar ways.
Questions: Am I correct in identifying this is a bug? Is the proposed solution a good fix for it?
Also cc'ing @mikemhenry since he wanted to be tagged in this.
The text was updated successfully, but these errors were encountered:
Script to reproduce conditions of both bugs is here:
importnumpyasnpfrompymbarimportFESu_kn=np.zeros((3, 3))
N_k=np.ones(3)
bin_edge=np.linspace(1, 3, 10)
x_n=np.linspace(0, 3, 3) +0.1# The top and bottom sample should not be assigned bins, the upper one is.u_n=np.ones(3)
histogram_parameters=dict()
histogram_parameters["bin_edges"] =bin_edgefes=FES(u_kn, N_k)
fes.generate_fes(u_n, x_n, fes_type="histogram", histogram_parameters=histogram_parameters)
bin_centers=bin_edge+ (bin_edge[1] -bin_edge[0])/2results=fes.get_fes(
bin_centers,
reference_point="from-specified",
fes_reference=70, # This should throw an error, it doesn't.uncertainty_method="analytical",
)
This comment is for @mrshirts and @jchodera with relation to the implementation and theory in
fes.py
for samplesI managed to accidentally recreate the bug in #502 and have attached a minimal script to throw the error. However, in investigating what happened, I found 2 different bugs in the logic I wanted to discuss what the "correct" way to handle is.
Bug 1: Empty Bins
#502 seems to throw an error when a bin in the bin does not have a sample. In theory, the logic here is supposed to ensure every bin has a sample but, because of the logic slightly above it to assign
bin_labels
it will only ever check bins with samples, so the sanity logic can never trigger, nor should it for bootstrap samples. This will not manifest until the user gets the histogram here because its iterating over the N-D grid of bins generated hereI'm not sure what the correct approach is here, as suggested in #502, we could just flag any bin without samples at the line and set a
np.nan
as we do right below checking the bin label, but I don't know if thats technically or scientifically correct here since we can estimate bins without samples using MBAR (although in an FES histogram bin, it might make sense tonan
the bin because it doesn't have samples.Thoughts?
Bug 2: Samples > Bin upper bound are assigned to the maximum bin
The logic used in calls to
np.digitize
are assigning samples greater than the upper bound of a dimension to the maximum bin instead of omitting it. This leads to a bias on the bins closest to the maximum in any dimension. This does NOT throw an error through, even though its scientifically wrong.The main function in question is
np.digitize
which does bin assignment for a sequence of samples and bin edges.The two problematic lines in question are this line
and this line
Where a value is measured, checked against bins, and then its
digitize
index is shifted down one to say that "sample X is assigned to bin index Y." Which is fine for any sample less than the max bin edge for given dimensiond
. The logic ofdigitize
isX < bins.min() --> 0
andX >= bins.max() --> len(bins)
. This means out of bounds on the low end is0
which downshifted is index-1
that is checked against in various places in the code. However out of bounds on the high end is downshifted tolen(bens) -1
which is a valid index forbins
and does NOT throw an error, assigning out of bounds samples to a bin.My proposed fix for this with minimal effort is to run the following logic:
and
The other two
digitize
calls in the code are captured here and also suffer this problem because they are allowing assignment of things out of bounds to the bin. They can be fixed in similar ways.Questions: Am I correct in identifying this is a bug? Is the proposed solution a good fix for it?
Also cc'ing @mikemhenry since he wanted to be tagged in this.
The text was updated successfully, but these errors were encountered: