Need to re-design dataset validators? #210

Open
niksirbi opened this issue Jun 6, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

@niksirbi
Member

niksirbi commented Jun 6, 2024

The problem

Prior to PR #201, a "movement dataset" was synonymous with a "poses dataset", because movement only supported pose tracking data. For this reason, we were using the ValidPosesDataset validator everywhere.

Going forward, movement will support bbox tracking data as well (and perhaps other data types in the future). We would still like the accessor to work for both poses and bboxes, i.e. both of those should still count as movement datasets. But this means we have to fundamentally re-design our validation strategy: we can't keep using the same ValidPosesDataset validator for everything.

Potential solution

Probably we will end up defining several "entities":

  • the movement dataset, which is the "base" dataset and whose requirements should be minimal. Perhaps its self.validate() method should only check for the existence of the position and confidence data variables, and only require the space and time dimensions. Maybe these checks should be implemented as part of a ValidMovementDataset validator. If I'm not mistaken, these checks are sufficient for all our current filters and kinematic variable computations to work, i.e. all accessor methods will/should work for a valid movement dataset.
  • the bboxes dataset, which is a subcategory of the movement dataset and additionally requires a shape array. This should be validated via the ValidBboxesDataset validator, which should run when loading/saving bbox data from/to files, and before any operation that only works on bboxes. The individuals dimension should perhaps be optional, but that is a separate discussion (see below).
  • the poses dataset, which is also a subcategory of the movement dataset and additionally requires a keypoints dimension. This is based on the fact that a "pose" is usually defined as a set of keypoints. The ValidPosesDataset validator could be used to validate such data during I/O and before any computation that only works on pose tracks. (A rough sketch of this layering is given right after this list.)
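
To make the layering concrete, here is a minimal sketch of what the checks could look like, written as plain functions over an xarray.Dataset rather than as the existing ValidPosesDataset-style classes. The function names, required names, and error messages are illustrative only, not part of movement's API:

```python
import xarray as xr


def validate_movement_dataset(ds: xr.Dataset) -> None:
    """Minimal checks that every 'base' movement dataset should pass."""
    missing_vars = {"position", "confidence"} - set(ds.data_vars)
    if missing_vars:
        raise ValueError(f"Missing required data variables: {missing_vars}")
    missing_dims = {"time", "space"} - set(ds.dims)
    if missing_dims:
        raise ValueError(f"Missing required dimensions: {missing_dims}")


def validate_bboxes_dataset(ds: xr.Dataset) -> None:
    """Bboxes datasets additionally carry a 'shape' data variable."""
    validate_movement_dataset(ds)
    if "shape" not in ds.data_vars:
        raise ValueError("Bboxes datasets require a 'shape' data variable.")


def validate_poses_dataset(ds: xr.Dataset) -> None:
    """Poses datasets additionally carry a 'keypoints' dimension."""
    validate_movement_dataset(ds)
    if "keypoints" not in ds.dims:
        raise ValueError("Poses datasets require a 'keypoints' dimension.")
```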

The above arrangement has some kinks though. For example, what should we do in cases where only one keypoint is being tracked per individual (as in the Aeon dataset, for example)? That's not a "pose" strictly speaking, but it can very well be accommodated within a poses dataset with a single keypoint. However, this raises the question of whether a singleton keypoints dimension should exist in such a case, as @vigji has raised (see this issue and this zulip thread). As an alternative, we could agree that all point tracking data is "poses", and make the keypoints dimension optional (i.e. a poses dataset would essentially be the same as the "base" movement dataset).

Related to the above, I think the individuals dimension, plus any other extra dimensions (like views), should always be optional, i.e. their presence/absence should not be checked by the dataset validators; they should only be created and validated when and as needed (basically agreeing with what was expressed in the zulip thread).
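
As a side note on feasibility (my own illustration, not from the discussion above): xarray already makes it cheap to insert a singleton dimension on demand, so code that genuinely needs, say, an individuals dimension could add it itself instead of the validators requiring it up front. A minimal example, with made-up data:

```python
import numpy as np
import xarray as xr

# A minimal dataset with only the dimensions the "base" validator would require.
ds = xr.Dataset(
    data_vars={
        "position": (("time", "space"), np.zeros((100, 2))),
        "confidence": (("time",), np.ones(100)),
    }
)

# Add singleton 'individuals' and 'keypoints' dimensions only where a
# computation actually needs them, instead of validating them up front.
ds_expanded = ds.expand_dims(["individuals", "keypoints"])
print(ds_expanded.sizes)  # includes individuals: 1 and keypoints: 1
```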

The question is, can we restructure dataset validation in a way that accommodates something like the above scheme, with the kinks ironed out? I'm fully open to better ideas on this.

@niksirbi niksirbi added the question Further information is requested label Jun 6, 2024
@vigji
Collaborator

vigji commented Jun 7, 2024

Just read this thread. As I'm trying things out locally to move #197 forward (struggling with the test structure right now), I'd like to make sure I don't end up building solutions for the validators that are then superseded by a redefinition of those classes. What do you think, @niksirbi?

For what it's worth, I think it makes sense to allow for optional dimensions (such as keypoints or individuals) as early as possible. But I don't know the class structure in enough detail to really give an insightful opinion!

@niksirbi
Member Author

niksirbi commented Jun 7, 2024

Hey @vigji, basically we have two design constraints right now, and both of them hinge on redefining the validators:

  1. accommodate data from bbox tracking as well as pose tracking experiments
  2. allow flexibility in number of dimensions (make many of them optional), which is what you brought up

I think it would be ideal if the re-designed validators solved both problems in one sweep, especially because they are somewhat inter-related. I agree with you that it's better to tackle such issues early rather than when the project is more mature. This means that the validators + IO functions are about to undergo an unstable period until we settle on a new structure that works.

Regarding your experiments in #197, I'd say feel free to continue experimenting on point 2, but don't worry about getting any of the code "camera ready" just yet, because likely we'd have to alter it to match the ongoing changes.

Regarding the structure of tests, is there anything we can do to help? I'd be open to hopping on a quick Zoom call some time next week if that would help clarify things.

@vigji
Collaborator

vigji commented Jun 7, 2024

> Regarding your experiments in #197, I'd say feel free to continue experimenting on point 2, but don't worry about getting any of the code "camera ready" just yet, because likely we'd have to alter it to match the ongoing changes.

Ok!

> Regarding the structure of tests, is there anything we can do to help? I'd be open to hopping on a quick Zoom call some time next week if that would help clarify things.

I'll dm you on Zulip :)

@niksirbi
Member Author

@b-peri had a good idea that might help with this:

When we load data, we know what type it is (poses or bboxes), so we could add a dataset attribute (e.g. ds.tracking_type) that keeps that information. Subsequent validation can be done taking into account the value of this attribute. For example:

  • if ds.tracking_type == "bboxes", check that the shape data variable is present.

The more general validation steps (e.g. existence of space and time dimensions) can run independently of the value of this attribute, while more specialised validation will depend on the tracking type.
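
A rough sketch of how such attribute-based dispatch could look, reusing the illustrative checks from the earlier sketch. The tracking_type attribute name is the one proposed above, but everything else here (function name, messages) is hypothetical:

```python
import xarray as xr


def validate_dataset(ds: xr.Dataset) -> None:
    """Run general checks first, then type-specific ones based on tracking_type."""
    # General checks, independent of tracking type.
    if not {"time", "space"}.issubset(ds.dims):
        raise ValueError("Dataset must have 'time' and 'space' dimensions.")
    if not {"position", "confidence"}.issubset(ds.data_vars):
        raise ValueError("Dataset must have 'position' and 'confidence' data variables.")

    # Specialised checks, dispatched on the (proposed) tracking_type attribute,
    # which the loaders would set at read time.
    tracking_type = ds.attrs.get("tracking_type")
    if tracking_type == "bboxes":
        if "shape" not in ds.data_vars:
            raise ValueError("Bboxes datasets require a 'shape' data variable.")
    elif tracking_type == "poses":
        if "keypoints" not in ds.dims:
            raise ValueError("Poses datasets require a 'keypoints' dimension.")
```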

@niksirbi
Member Author

As discussed in #322, this should also include moving _validate_dataset() from save_poses.py to the validators module.
