Fix dtype of quality metrics before and after merging #3497

Open · wants to merge 16 commits into main
Conversation

@zm711 (Collaborator) commented Oct 22, 2024

MRE (minimal reproducible example)

import pandas as pd
df = pd.DataFrame({'test': [1,2,3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

df.test.dtype
Out[7]: dtype('int64')

new_df.test.dtype
Out[8]: dtype('O')

Basically, when you create a new DataFrame from a previous DataFrame's index and columns, pandas forces the dtype to object instead of numeric.

Easy Solution

Using pd.to_numeric brings us back to numeric values.
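For illustration only (not part of the PR diff), applying pd.to_numeric to the MRE above restores a numeric dtype; errors="coerce" keeps the NaN values rather than raising on them:

import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

new_df.test.dtype        # dtype('O')

# coerce keeps the NaN values instead of raising on them
new_df['test'] = pd.to_numeric(new_df['test'], errors="coerce")

new_df.test.dtype        # dtype('float64')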

Caveats

This switches the dtype from the pandas Float64 to the NumPy float64. I don't think this is too bad, since running queries should still be fine, no?

Testing

I added a small test for merging, but let me know if we'd prefer not to have it.

@zm711 added the qualitymetrics label on Oct 22, 2024
# we can iterate through the columns and convert them back to numbers with
# pandas.to_numeric. coerce allows us to keep the nan values.
for column in metrics.columns:
metrics[column] = pd.to_numeric(metrics[column], errors="coerce")
Member commented on this diff:
This is ok for me. Pandas behavior is becoming quite cryptic for me. Using old_metrics[col].dtype could also work, no?

@zm711 (Collaborator, Author) replied:

Maybe. I agree that pandas is making its own dtypes like NAType, which don't play nicely with NumPy (in my scripts I tend to just query based on NumPy stuff), so I don't know for sure. I could test that later, although for me I would prefer to coerce everything to NumPy types since that's what I'm used to. None of my tables are big enough that I worry about the dtype-efficiency work pandas has been doing with the new backend.

@JoeZiminski (Collaborator) left a review:

Hey @zm711, this looks great, good catch! Super useful test too. Weird behaviour from pandas. Some minor comments:

  • I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64) will have the same effect (a sketch comparing the two appears after these comments). You lose the coerce-on-error behaviour, but assuming the data is always going to be filled with NaN this shouldn't be a problem. However, it is more implicit and provides less information on the weird pandas behaviour than the loop approach.

  • The result of this operation is that all columns are np.float64, but in the original metrics as returned from compute_quality_metrics some columns are Int64Dtype. This seems to be dynamic based on contents (e.g. in the test run the presence ratio values were all 1 and its dtype is Int64Dtype, but presumably it would be a float under most circumstances). num_spikes I guess will always be int. The only time I can imagine this being a problem is if some equality check is performed, e.g. num_spikes == 1, which might work for the original compute_quality_metrics output but fail after merging since the data will be float. So maybe it is simplest just to cast num_spikes -> Int64Dtype and leave the rest as float?
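For reference only (a sketch of my own, not code from the PR or the review), this is how the dtype=np.float64 construction compares to the default object-dtype behaviour:

import numpy as np
import pandas as pd

df = pd.DataFrame({'snr': [5.0, 7.5], 'num_spikes': [100, 250]})

# default: rebuilding from index/columns alone gives object columns
pd.DataFrame(index=df.index, columns=df.columns).dtypes                    # all object

# passing dtype up front gives float64 columns without a to_numeric loop,
# at the cost of losing the per-column coerce-on-error behaviour
pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64).dtypes  # all float64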

@zm711 (Collaborator, Author) commented Oct 24, 2024

Thanks so much @JoeZiminski!

> I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

I'm no pandas expert, so I'm happy to make changes here if they are better! I just don't have an intuition for what the smartest strategy is, so if you know pandas really well then I'll make the change :)

> e.g. in the test run the presence ratio values were all 1 and its dtype is Int64Dtype, but presumably it would be a float under most circumstances.

True. This is our mistake for letting pandas infer. presence_ratio is a float between 0 and 1, but if the values are all 0s or all 1s pandas casts the column to int for memory purposes. Users should never assume it is an int, although it could be in extreme cases. In my opinion it would be better for us to explicitly make it a float and take the memory hit.

> num_spikes I guess will always be int.

This is true, and when I scanned the table I forgot about this one. It would be better to make that one an int. I don't think a user should ever do num_spikes == x; I could imagine a query of num_spikes >= x. I think (but correct me if I'm wrong) that for floats only the fractional part is unstable, so that float(1) >= int(1) still holds; in that case testing against a minimum number of spikes should not be a problem. If I'm wrong, then I think we have to cast that one back to int in a separate step.
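A quick illustrative check (my example, not from the thread): integer counts up to 2**53 are represented exactly in float64, so threshold comparisons against ints behave as expected:

import numpy as np

# small integer counts stored as float64 compare exactly against ints
np.float64(1.0) >= 1       # True
np.float64(100.0) == 100   # True; 100 is exactly representable in float64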

> super useful test too

<3 Thanks. I figure we really need to protect ourselves from some of these small regressions. So I'm trying :)

@zm711 (Collaborator, Author) commented Nov 1, 2024

@alejoe91, do you have any opinions on implementing this? Happy to change to a different method if you prefer something else. I think the only thing we are failing to maintain is num_spikes.

assert len(metrics.index) > len(new_metrics.index)

# dtype should be fine after merge but is cast from Float64->float64
assert np.float64 == new_metrics["snr"].dtype
Member commented on this diff:

We can add a test on int coercion if we end up using the suggestion here: https://github.com/SpikeInterface/spikeinterface/pull/3497/files#r1827487180
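As a sketch only (assuming the post-merge table is named new_metrics, as in the test excerpt above), such an int-coercion check could look like:

import numpy as np

# num_spikes should come back as an integer dtype after merging,
# while float metrics such as snr stay float64
assert np.issubdtype(new_metrics["num_spikes"].dtype, np.integer)
assert new_metrics["snr"].dtype == np.float64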

@zm711 (Collaborator, Author) commented Nov 4, 2024

So the problem is that pandas will infer the dtype, and sometimes this is actually wrong. Take the presence ratio above, which should technically always be a float between 0.0 and 1.0, but will be stored as an int if the values are all 1s and 0s. Then, if we merge and get a fraction, the dtype is wrong. I think it might be better to hard-code int64 for num_spikes, since everything else is a float.

I basically implemented Sam's idea, but it fails unless we hard-code the dtype of the different metrics rather than allow pandas to infer them. What do people think about me adding a line to coerce everything in the original calculator to float64 except num_spikes, which would be int? Then we should be fine to do the casting that I do here in this code. If we make sure we know the dtype of our metrics, we are safer.
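A minimal sketch of what hard-coding the dtypes could look like (the dict and metric names here are illustrative, not the PR's actual implementation):

import numpy as np
import pandas as pd

# hypothetical metric -> dtype table; everything is float64 except num_spikes
metric_dtypes = {
    "num_spikes": np.int64,
    "snr": np.float64,
    "presence_ratio": np.float64,
}

def enforce_metric_dtypes(metrics: pd.DataFrame) -> pd.DataFrame:
    # cast each known column explicitly instead of letting pandas infer;
    # num_spikes is 0 (not NaN) for empty units, so the int cast is safe
    for column, dtype in metric_dtypes.items():
        if column in metrics.columns:
            metrics[column] = metrics[column].astype(dtype)
    return metrics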

@zm711 (Collaborator, Author) commented Nov 22, 2024

Okay, so the changes in this PR:

  1. We now ensure all columns have the dtype specified in our name_to_dtype dict (pandas is not allowed to infer).
  2. The test for empty units was changed because we no longer put NaNs in num_spikes (that should be 0 for empty units); let me know if we should discuss.
  3. Merging now recasts the columns to the original dtypes.
  4. A test was added for merging.

@zm711 changed the title from "Fix dtype of quality metrics after merging" to "Fix dtype of quality metrics before and after merging" on Nov 22, 2024