-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Naming constraints break compatibility with existing datasets #624
Comments
Related discussionsFirst, links to some related topics:
Rationale behind
|
Here are my current thoughts about this issue. To solve it, I can investigate the third point, and then we can decide whether there is something to do/change in the code. Maybe @ivirshup's experience can be valuable?
|
Thanks @aeisenbarth for summing up the status quo. My vote would be to go for an allow list (eventually extending the current one), I'd ask for the feedback to others and see what is the consensus. In fact, I could imagine also going for a deny list, by combining together the forbid lists for Zarr, Parquet and for filenames in Windows, macOS and UNIX in general. Regarding the last point, in my experience I have found some bugs sometimes (example 1, example 2) around the dimension separator, so I would be for disallowing |
Kindly asking some other folks to join the discussion. This message summarizes the discussion so you don't to read all of the above. In your experience with Zarr, AnnData etc, do you think that we should put a constraint on the characters that can be used for the names of the images etc (e.g. We want to avoid possible bugs due to clashes between special characters and how internal libraries we depend on deal with them (for instance having a Thank you! @giovp @kevinyamauchi @joshmoore @d-v-b @will-moore |
My research on the third point:
For SpatialData, I see problems with:
|
For my original issue, it would be enough to allow |
Zarr allows names containing "." characters, and from a purely technical perspective there should be no interference between the names of arrays / groups and the names of chunks, since the encoding of chunk keys in zarr is unambiguously specified in a manner that is completely independent of node names. |
Thanks for sharing the above information; I read through the Zarr discussion and it is indeed very informative. My take on this is that we could proceed as Zarr does in https://github.com/zarr-developers/zarr-specs/pull/196/files (which allows unicode characters, provides a short deny list, recommends a normalization approach and recommends to use common characters, such as I would take a more conservative approach since I think that, at least for the moment, we should aim at improving simplicity and robustness, and that allowing unicode would fix some bugs related to datasets that have already been written using now forbidden characters, but would probably lead to new ones because of datasets that can be written and read by one OS/language but not from another, which I think is worse. Let's wait for feedback from the others and happy to discuss and agree on a path forward together; meanwhile here is my proposal:
|
In particular the proposal above would allow for the |
Thanks for the discussion, just went through this, and I agree that
|
Thanks @giovp. Since the approach I described is conservative and if in the future we end up allowing unicode etc we can always relax, I would implement the approach and close this issue. @aeisenbarth are you free to work on this task (the 4 checkboxes of my post above)? Otherwise I can also work on this. |
Yes, I can do it (sorry, I didn't receive the notification for some reasons). |
There is another complication: The current proposal allows different character cases (as in Possible solutions would be:
I couldn't find any mention in the specs, especially the table proposal ngff#64. For AnnData, there is some implementation we could make use of, or further extend instead of implementing on SpatialData side (anndata#321). I will go for solution 2 because it has the least impact on existing datasets (or table colums imported from CSV). Also I will test to what extent we can rely on existing name validation in AnnData so that we don't have to duplicate that functionality. |
I noticed the previous implementation was not restricted to |
As discussed in the last SpatialData meeting, I changed to keep Python's wider meaning of alphanumeric, as in 137e1e0 (contrary to Zarr v3's |
_check_valid_name
implemented stricter naming constraints in 137e1e0. We already have existing SpatialData datasets where.
is used as separator for naming components with different meanings, likeSlide1.A2.0.pre_maldi
. When loading them with spatialdata 0.2.1, we get the error "Name must contain only alphanumeric characters, underscores, and hyphens". I would understand for Unicode characters, or characters reserved for SpatialData itself (/
) or characters problematic on a subset of supported platforms (:
,?
,\
…).Is there a reason for not allowing
.
anymore? Will it be used by SpatialData itself, or the NGFF specification?What would be a solution, can
.
be allowed again or should we change the way how we create new datasets and adjust all existing datasets to the new naming constraints?_
feels more like a replacement for space-
feels more fitting for joining parts of a naming component, not separating them.The text was updated successfully, but these errors were encountered: