Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nodata is inconsistent after loading without explicit nodata #1646

Open
emmaai opened this issue Oct 24, 2024 · 9 comments
Open

nodata is inconsistent after loading without explicit nodata #1646

emmaai opened this issue Oct 24, 2024 · 9 comments

Comments

@emmaai
Copy link
Contributor

emmaai commented Oct 24, 2024

Expected behaviour

nodata information should be consistent between the data loaded and nodata in xr.DataArray.attrs

Actual behaviour

nodata information is missing in xr.DataArray.attrs when Measurement.nodata=None. However, in loading, the nodata value is populated as nan by

dest.dtype.type(measurement.nodata),
when Measurement.dtype=float, emits error when Measurement.dtype=int.

nodata and dtype from geotiff is overwritten in the case where Measurement doesn't match metadata from geotiff. E.g., dtype=int and nodata=255 in geotiff, while dtype=float and nodata=None in Meaurement, the loaded data dtype is casted into float and 255 is substituted with nan. It is reasonable to choose one over another when there is conflict, though the non-presence of nodata in DataArray.attrs causes confusion as the other choice of geotiff metadata over Measurement is also plausible.

Steps to reproduce the behaviour

  • Construct a Dataset from stac instead of odc query
  • load data with this Dataset whose measurement will have default dtype=float and nodata=None

Environment information

datacube == 1.8.19

Note: Stale issues will be automatically closed after a period of six months with no activity.
To ensure critical issues are not closed, tag them with the Github pinned tag.
If you are a community member and not a maintainer please escalate this issue to maintainers via
GIS StackExchange or Discord.

@SpacemanPaul
Copy link
Contributor

We store metadata for a reason.

If the geotiff metadata doesn't match the indexed metadata, I guess it's reasonable to go with the geotiff at load time, but we should at least raise a warning that the indexed metadata doesn't match the file.

@emmaai
Copy link
Contributor Author

emmaai commented Oct 25, 2024

It'd be great it could go with the geotiff at load time if conflicts. From what I read in the code, it requires more changes to populate xr.DataArray.attrs["nodata"] after loading. By the time, all "storage" is created, the loading function can only see the memory/array.

@emmaai
Copy link
Contributor Author

emmaai commented Oct 25, 2024

I guess another way is to "peek" the geotiff before loading, and deal with conflicts then instead of when loading actually happens.

@SpacemanPaul
Copy link
Contributor

SpacemanPaul commented Oct 25, 2024

Yeah, it's not a quick/easy fix. Would also be hard to make it play nice with e.g. NetCDF reading and Zarr reading in 1.9.

What geotiff are we talking about here? Why does it not match the metadata in the index?

@emmaai
Copy link
Contributor Author

emmaai commented Oct 25, 2024

cuz it's not indexed...we need to retain some intermediate results in the process, but they don't have to be indexed. It's fairly easy to create a Dataset with stac so data indexed or not can go through the same data pipeline, with nice lineage and metadata.

@SpacemanPaul
Copy link
Contributor

SpacemanPaul commented Oct 27, 2024

with nice lineage and metadata.

Except the band type and nodata values are wrong. Presumably this metadata is NOT included in the STAC item? THAT would seem to be the real issue.

@Kirill888
Copy link
Member

Nodata can only be populated from metadata, not from file data. File nodata is remapped to metadata nodata as different files can have inconsistent nodata set. With Dask arrays read from S3 we might not even have credentials to peek at nodata, as data will be accessed on a different machine with machine credentials, and that machine might not have been even started yet!

@Kirill888
Copy link
Member

the error on the linked line still exists though, as it should ONLY map nodata=None when output pixels are floating point, for ints it should either pass through None (meaning that all pixel values are valid) or default to something sensible like nodata=0. In odc-stac we have more robust nodata handling, but we have completely halted any work on unifying loader code between datacube and odc-stac after initial dev effort payed for by CSIRO, because 💰, so there is that, maybe @woodcockr knows how permanent that halt is.

@SpacemanPaul
Copy link
Contributor

we have completely halted any work

I was planning to have a go splitting the loading code out of odc-stac into a separate repo later this month (after FOSS4G).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants