Fix dtype of quality metrics before and after merging #3497

Open · wants to merge 16 commits into main
Conversation

@zm711 (Collaborator) commented Oct 22, 2024

MRE (minimal reproducible example)

import pandas as pd
df = pd.DataFrame({'test': [1,2,3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

df.test.dtype
Out[7]: dtype('int64')

new_df.test.dtype
Out[8]: dtype('O')

Basically, when you create a new DataFrame from a previous DataFrame's index and columns, pandas forces the dtype to object instead of numeric.

Easy Solution

Using pd.to_numeric brings us back to numeric values.
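For illustration only (not part of the PR diff), applying pd.to_numeric to the MRE above restores a numeric dtype; errors="coerce" keeps the NaN values rather than raising on them:

import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

new_df.test.dtype        # dtype('O')

# coerce keeps the NaN values instead of raising on them
new_df['test'] = pd.to_numeric(new_df['test'], errors="coerce")

new_df.test.dtype        # dtype('float64')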

Caveats

This switches the dtype from the pandas Float64 to the NumPy float64. I don't think this is too bad, since running queries should still be fine, no?

Testing

I added a small test for merging, but let me know if we'd prefer not to have it.

@zm711 added the qualitymetrics label on Oct 22, 2024
# we can iterate through the columns and convert them back to numbers with
# pandas.to_numeric. coerce allows us to keep the nan values.
for column in metrics.columns:
metrics[column] = pd.to_numeric(metrics[column], errors="coerce")
Member commented on this diff:
This is ok for me. Pandas behavior is becoming quite cryptic for me. Using old_metrics[col].dtype could also work, no?

@zm711 (Collaborator, Author) replied:

Maybe. I agree that pandas is making its own dtypes like NAType, which don't play nicely with NumPy (in my scripts I tend to just query based on NumPy stuff), so I don't know for sure. I could test that later, although for me I would prefer to coerce everything to NumPy types since that's what I'm used to. None of my tables are big enough that I worry about the dtype-efficiency work pandas has been doing with the new backend.

@JoeZiminski (Collaborator) left a review:

Hey @zm711, this looks great, good catch! Super useful test too. Weird behaviour from pandas. Some minor comments:

  • I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64) will have the same effect (a sketch comparing the two appears after these comments). You lose the coerce-on-error behaviour, but assuming the data is always going to be filled with NaN this shouldn't be a problem. However, it is more implicit and provides less information on the weird pandas behaviour than the loop approach.

  • The result of this operation is that all columns are np.float64, but in the original metrics as returned from compute_quality_metrics some columns are Int64Dtype. This seems to be dynamic based on contents (e.g. in the test run the presence ratio values were all 1 and its dtype is Int64Dtype, but presumably it would be a float under most circumstances). num_spikes I guess will always be int. The only time I can imagine this being a problem is if some equality check is performed, e.g. num_spikes == 1, which might work for the original compute_quality_metrics output but fail after merging since the data will be float. So maybe it is simplest just to cast num_spikes -> Int64Dtype and leave the rest as float?
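For reference only (a sketch of my own, not code from the PR or the review), this is how the dtype=np.float64 construction compares to the default object-dtype behaviour:

import numpy as np
import pandas as pd

df = pd.DataFrame({'snr': [5.0, 7.5], 'num_spikes': [100, 250]})

# default: rebuilding from index/columns alone gives object columns
pd.DataFrame(index=df.index, columns=df.columns).dtypes                    # all object

# passing dtype up front gives float64 columns without a to_numeric loop,
# at the cost of losing the per-column coerce-on-error behaviour
pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64).dtypes  # all float64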

@zm711 (Collaborator, Author) commented Oct 24, 2024

Thanks so much @JoeZiminski!

> I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

I'm no pandas expert, so I'm happy to make changes here if they are better! I just don't have an intuition for what the smartest strategy is, so if you know pandas really well then I'll make the change :)

> e.g. in the test run the presence ratio values were all 1 and its dtype is Int64Dtype, but presumably it would be a float under most circumstances.

True. This is our mistake for letting pandas infer. presence_ratio is a float between 0 and 1, but if the values are all 0s or all 1s pandas casts the column to int for memory purposes. Users should never assume it is an int, although it could be in extreme cases. In my opinion it would be better for us to explicitly make it a float and take the memory hit.

> num_spikes I guess will always be int.

This is true, and when I scanned the table I forgot about this one. It would be better to make that one an int. I don't think a user should ever do num_spikes == x; I could imagine a query of num_spikes >= x. I think (but correct me if I'm wrong) that for floats only the fractional part is unstable, so that float(1) >= int(1) still holds; in that case testing against a minimum number of spikes should not be a problem. If I'm wrong, then I think we have to cast that one back to int in a separate step.
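A quick illustrative check (my example, not from the thread): integer counts up to 2**53 are represented exactly in float64, so threshold comparisons against ints behave as expected:

import numpy as np

# small integer counts stored as float64 compare exactly against ints
np.float64(1.0) >= 1       # True
np.float64(100.0) == 100   # True; 100 is exactly representable in float64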

> super useful test too

<3 Thanks. I figure we really need to protect ourselves from some of these small regressions. So I'm trying :)

@zm711 (Collaborator, Author) commented Nov 1, 2024

@alejoe91, do you have any opinions on implementing this? Happy to change to a different method if you prefer something else. I think the only thing we are failing to maintain is num_spikes.

assert len(metrics.index) > len(new_metrics.index)

# dtype should be fine after merge but is cast from Float64->float64
assert np.float64 == new_metrics["snr"].dtype
Member commented on this diff:

We can add a test on int coercion if we end up using the suggestion here: https://github.com/SpikeInterface/spikeinterface/pull/3497/files#r1827487180
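As a sketch only (assuming the post-merge table is named new_metrics, as in the test excerpt above), such an int-coercion check could look like:

import numpy as np

# num_spikes should come back as an integer dtype after merging,
# while float metrics such as snr stay float64
assert np.issubdtype(new_metrics["num_spikes"].dtype, np.integer)
assert new_metrics["snr"].dtype == np.float64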

@zm711 (Collaborator, Author) commented Nov 4, 2024

So the problem is that pandas will infer the dtype, and sometimes this is actually wrong. Take the presence ratio above, which should technically always be a float between 0.0 and 1.0, but will be stored as an int if the values are all 1s and 0s. Then, if we merge and get a fraction, the dtype is wrong. I think it might be better to hard-code int64 for num_spikes, since everything else is a float.

I basically implemented Sam's idea, but it fails unless we hard-code the dtype of the different metrics rather than allow pandas to infer them. What do people think about me adding a line to coerce everything in the original calculator to float64 except num_spikes, which would be int? Then we should be fine to do the casting that I do here in this code. If we make sure we know the dtype of our metrics, we are safer.
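A minimal sketch of what hard-coding the dtypes could look like (the dict and metric names here are illustrative, not the PR's actual implementation):

import numpy as np
import pandas as pd

# hypothetical metric -> dtype table; everything is float64 except num_spikes
metric_dtypes = {
    "num_spikes": np.int64,
    "snr": np.float64,
    "presence_ratio": np.float64,
}

def enforce_metric_dtypes(metrics: pd.DataFrame) -> pd.DataFrame:
    # cast each known column explicitly instead of letting pandas infer;
    # num_spikes is 0 (not NaN) for empty units, so the int cast is safe
    for column, dtype in metric_dtypes.items():
        if column in metrics.columns:
            metrics[column] = metrics[column].astype(dtype)
    return metrics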

@zm711 (Collaborator, Author) commented Nov 22, 2024

Okay, so the changes in this PR:

  1. We now ensure all columns have the dtype specified in our name_to_dtype dict (pandas is not allowed to infer).
  2. The test for empty units was changed because we no longer put NaNs in num_spikes (that should be 0 for empty units); let me know if we should discuss.
  3. Merging now recasts the columns to the original dtypes.
  4. A test was added for merging.

@zm711 changed the title from "Fix dtype of quality metrics after merging" to "Fix dtype of quality metrics before and after merging" on Nov 22, 2024