Unify slicing behavior in SpikeInterface #1989

alejoe91 · 2023-09-13T09:49:35Z

Related to #1881, #1886 and #1979 .

Let's discuss how we should handle the behavior of start_frame/end_frame/channel_ids in the get_traces().
Plus, let's discuss how to do add generalizing testing once we reach consensus.

Some proposals:

Be restrictive:

end_frame > start_frame: passing start_frame == end_frame will raise an Exception
end_frame < num_samples

Numpy-like

For the channel_ids is the behavior is already established, but we should consider adding tests for them when we add generalized testing.

The text was updated successfully, but these errors were encountered:

h-mayorquin · 2023-09-13T11:08:27Z

OK, so initially I wanted to follow numpy as much as possible. I had two cases in mind. Out of bounds slicing where we should do clipping and start_frame=end_frame slicing where we should return an empty array.

In [1]: x = np.ones(10)

In [2]: x[15:13]
Out[2]: array([], dtype=float64)

In [3]: x[5:3]
Out[3]: array([], dtype=float64)

In [4]: x[10:10]
Out[4]: array([], dtype=float64)

Note that standard lists [ ] in python behave in exactly the same way

Using two arguments:

We already load most of our data as numpy memmaps (so we don't need checks) and then we also interact with scipy and numpy functions (and therefore should folllow their specification as they probably expect input with numpy like qualities).
We don't need to come up with a standard ourselves. "Let's just do what numpy does" is easy to follow and avoids the cost of coordinating /agreeing / discssuing between us if we disagree about a specific rule. We can then outsource policing to someone else, less coordinate costs for us.

But then I remembered that there is actually a standard in the matter that I think we should follow:
https://data-apis.org/array-api/latest/

And this is what it has to say about indexing (quoting from the relevant sections):
https://data-apis.org/array-api/latest/API_specification/indexing.html

Out of bounds check: This specification does not require bounds checking. The behavior for out-of-bounds integer indices is left unspecified.

Rationale: this is consistent with bounds checking for integer indexing; the behavior of out-of-bounds indices is left unspecified. Implementations may choose to clip (consistent with Python list slicing semantics), raise an exception, return junk values, or some other behavior depending on device requirements and performance considerations.

start_frame=end_frame slicing: the specification says the following:

If i equals j, a slice must return an empty array, whose axis (dimension) size along the indexed axis is 0.

I think we should just follow that. We can start second guessing standards because we think we know better but I think we increase costs for interaction with the ecosystem and we increase costs of discussing between us if a non-standard rule were we disgagree comes in the future.

So my reading of the standard and my current position that is fine that we throw error for end or start frame outside of bounds (as this is unespecified behavior) but that we should return an empty array with the dimension of the channels if start_frame=end_frame. Plus, if any other disagrement comes in the future, we should defer to the array standard.

JoeZiminski · 2023-09-13T15:34:01Z

This is a great post and important question with no clear best-answer. I think the decision depends in part on the broader aims of spikeinterface and the intended audience for which it will be used. On one extreme, there is a highly restrictive API which does not allow any indexing procedure that could be interpreted as a mistake. i.e. indexing the recording out of bounds, or with start_frame >= end_frame. On the other extreme is a very flexible API with few checks that could allow unintended operations to pass silently.

The benefits of a more restrictive API is 1) easier to track errors and detect bugs 2) enforces very explicit handling of all use cases 3) easier for non-expert users to detect mistakes in their code. The downside is that it increases initial overhead when writing code, and breaks with existing data-slicing API, and is less flexible.

At present, I lean towards a highly restrictive API that will throw an error if any potentially unintended indexing operations are performed. Although this might lead to some initial overhead to explicitly handle all indexing cases, I think it will pay dividends down the line with increase ease of bug-fixing and reduced number of user-reported issues.

Also, I think that SI should aim to be the de-facto tool for in vivo electrophysiology analysis in Python, inevitably drawing novice Python users with it's easy-to-use API. Often in practice those analysing electrophysiological data will have little computational training. Having robust checks on common indexing mistakes will make SI more secure and save a lot of headaches for new users.

Fundamentally, I think the problem of allowing such indexing operations is it is impossible to tell whether this indexing was done on purpose, or is due to a mistake or oversight. The possible options then are to allow both intentional and unintentional indexing of this sort (i.e. flexible API), or not allow either (restrictive API). I think in general it is safer to handle the intended cases at a higher level than get_traces(), and catch all unintended cases, rather than let both intended and unintended cases pass.

h-mayorquin · 2023-09-14T08:46:36Z

Tagging @CodyCBakerPhD and @pauladkisson as we usually model roi extractors after the conventions in spikeinterface.

pauladkisson · 2023-09-14T15:52:28Z

Restrictive indexing makes sense to me -- I just wrote some input validation for the MultiPlaneScanImage (see PR) to prevent out-of-bounds channel and plane indexing. Otherwise when the specified channel/plane number exceeded the actual number of planes/channels, some other frame was returned (silent error) due to the way that ScanImage stores its data (sequentially in a round-robin format).

Seems like restrictive indexing is the much safer way to approach this.

samuelgarcia · 2023-09-14T16:06:00Z

My feeling is: we should go to more strict

In short raise if:

end_frame >= start_frame
end_frame > num_samples
start_frame < 0

Mainly because:

there is no use case to be numpy like on the border. we need just need a continuous buffer retriever which is not
empty.
this is more work to check and maintain every sub function that could affected by non strict slicing
we would discover bugs earlier

In short, less work, less code and less bugs.
I am lazy. it is a known fact.

h-mayorquin · 2023-09-15T12:18:23Z

The point that @pauladkisson brings to the table moves me in the direction of more restrictive and away from my initial position. We indeed deal with multiple segments in SI by default and I can see going over the first segment and getting no errors being even more confusing in this context.

h-mayorquin · 2023-09-28T11:14:55Z

So, it seem that everbody wants to throw errors for outside of bounds (including negative). I was the only one that was negative on that but I have been convinced by you guys and the standard not having specification for that behavior. Let's say we give this another week and if we don't have any strong arguments by 2023/10/06 we run on with the restrictive version. That is, we will raising errors if:

end_frame > num_samples
start_frame < 0

What is less clear to me is what should we do for the start_frame=end_frame case where the standard urges us to return an empty frame. Or the more general end_frame > start_frame where the standard is not clear at all on my view. I myself l am leaning to follow the standard but Alessio included this point under the restrictive approach and most people voted for that. So we should probably move forward with having the restrictive view as well. That is, we will move forward with throwing assertions if:

end_frame >= start_frame

Sombeody else that is not me put a date on that one as this is too much of a hard pill to swallow : )

DradeAW · 2023-09-28T13:50:10Z

@h-mayorquin You're not alone!

I know my comment probably won't change anything, since almost everybody prefers throwing errors, but I still wanted to give my opinion (but I respect the fact that the majority wins).

I also prefer doing it the "numpy way", where giving end_frame == start_frame returns an empty array. I think at some point, I used the fact that it didn't give an error to use "out-of-bounds" values.
I think it can be useful (although yes, it can hide some errors), and since Python and Numpy does it this way, it makes more sense.

JoeZiminski · 2023-10-02T09:16:40Z

These are very good points, and the current discussion on documentation reminds again how important it is to check what the other main packages are doing and follow as best as possible. So any divergence should be done with very good reason.

In this case, I think there is a reason to deviate from the slicing behaviour of Python / Numpy. The purpose of these tools are to provide general purpose array structures that support a very broad number of cases. I can see how in the generalised concept of an array, trying to index end_frame == start_frame makes perfect sense, and should return an empty array. i.e. the question "What are the array elements between index 1:1" receives the answer "There are no array values between the indices 1:1".

However, in the context of spikeinterface, the concept is not of a generalised array but of an extracellular ephys timeseries. In this case, it does not make sense to try and index from and to the same timepoint, especially in the case of get_traces() after preprocessing in which it is commonly used. For example, I feel the question "Get me the pre-processed data between time 1s and 1s" should receive the answer "Why are you trying to do that? there are clearly no timepoints between the two times you have provided" (where of course timepoints are expressed as indices in the API, but what we are really representing is timepoints) .

h-mayorquin · 2024-07-08T19:32:32Z

One think that we did not discuss is whether we follow numpy convention of accepting slicing with None:

In [1]: np.arange(10)[None:]
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: np.arange(10)[None:None]
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: np.arange(10)[:None]
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

As a proponent above of "doing what numpy does" I think we should but given that slicing police side won above maybe you also want to throw an error here? @JoeZiminski @samuelgarcia @alejoe91 @samuelgarcia

I will add this to the next maintenace discussion topic so we can agree on this, wrote some guidelines and maybe close this issue.

This comes from the discussion in #3159 where the solution depends on the decision.

JoeZiminski · 2024-07-11T15:23:10Z

Good point @h-mayorquin for this I think if the user has explicitly passed None its okay to default to Numpy-like behaviour. I think this is what is done by default for BaseRecording.get_traces() as None is default args there?

h-mayorquin · 2024-07-11T15:40:12Z

It is a good point that taking the opposite stance would break backwards compatbility. I prefer to comply as much we the array standard as we can and this does:

https://data-apis.org/array-api/latest/API_specification/indexing.html#api-specification-indexing--page-root

Probably everbody agress with this one.

alejoe91 added the core Changes to core module label Sep 13, 2023

h-mayorquin added the discussion General discussions and community feedback label Sep 13, 2023

h-mayorquin mentioned this issue Sep 13, 2023

Fix zero_channel_pad case when selected frames are outside of original data region #1979

Merged

h-mayorquin mentioned this issue Sep 13, 2023

Force original_end_frame >= original_start_frame in tracepad #1886

Closed

alejoe91 mentioned this issue Sep 28, 2023

Documentation Proposals #2046

Closed

JoeZiminski mentioned this issue Nov 30, 2023

plot_timeseries time axis is incorrect if requesting time longer than the recording. #2278

Closed

alejoe91 mentioned this issue Jan 9, 2024

Add DecimateRecording #2385

Merged

h-mayorquin mentioned this issue May 7, 2024

[NEW PROJECT]: Standardize SpikeInterface API SpikeInterface/SpikeInterface-Hackathon-Edinburgh-May24#6

Open

3 tasks

h-mayorquin mentioned this issue Jun 21, 2024

Check start_frame/end_frame in BaseRecording.get_traces() #3059

Merged

This was referenced Jul 8, 2024

Concatenating recordings causes TypeError: unsupported operand type(s) for %: 'NoneType' and 'int' #3159

Closed

Fix segment start/end frame None in concatenate_recordings #3161

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify slicing behavior in SpikeInterface #1989

Unify slicing behavior in SpikeInterface #1989

alejoe91 commented Sep 13, 2023 •

edited by h-mayorquin

Loading

h-mayorquin commented Sep 13, 2023 •

edited

Loading

JoeZiminski commented Sep 13, 2023 •

edited

Loading

h-mayorquin commented Sep 14, 2023

pauladkisson commented Sep 14, 2023

samuelgarcia commented Sep 14, 2023

h-mayorquin commented Sep 15, 2023

h-mayorquin commented Sep 28, 2023

DradeAW commented Sep 28, 2023

JoeZiminski commented Oct 2, 2023

h-mayorquin commented Jul 8, 2024

JoeZiminski commented Jul 11, 2024

h-mayorquin commented Jul 11, 2024

Unify slicing behavior in SpikeInterface #1989

Unify slicing behavior in SpikeInterface #1989

Comments

alejoe91 commented Sep 13, 2023 • edited by h-mayorquin Loading

h-mayorquin commented Sep 13, 2023 • edited Loading

JoeZiminski commented Sep 13, 2023 • edited Loading

h-mayorquin commented Sep 14, 2023

pauladkisson commented Sep 14, 2023

samuelgarcia commented Sep 14, 2023

h-mayorquin commented Sep 15, 2023

h-mayorquin commented Sep 28, 2023

DradeAW commented Sep 28, 2023

JoeZiminski commented Oct 2, 2023

h-mayorquin commented Jul 8, 2024

JoeZiminski commented Jul 11, 2024

h-mayorquin commented Jul 11, 2024

alejoe91 commented Sep 13, 2023 •

edited by h-mayorquin

Loading

h-mayorquin commented Sep 13, 2023 •

edited

Loading

JoeZiminski commented Sep 13, 2023 •

edited

Loading