Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of top level masking functions for bitmask and enumerated… #187

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

alexgleith
Copy link
Contributor

@alexgleith alexgleith commented Nov 20, 2024

A work in progress, but currently includes functions:

  • odc.geo.masking.bits_to_bool, which makes a mask using bit flags
  • odc.geo.masking.enum_to_bool, which makes a mask using isin lookups
  • scale_and_offset, which applies offset and scale values.

Also:

  • odc.geo.masking.mask_clouds and the two functions below
  • odc.geo.masking.mask_ls: apply an opinionated mask to landsat c2l2sr data
  • odc.geommasking.mask_s2: apply an opinionated mask to sentinel-2 c1l2 data.

I've added odc xarray extensions for them, which I think works. Tested briefly.

So, thoughts? Opinions? Fears and doubts?

@alexgleith alexgleith marked this pull request as draft November 20, 2024 04:26
@alexgleith alexgleith force-pushed the add-masking-functions branch 2 times, most recently from 2ecd02a to 479dc32 Compare November 20, 2024 04:40
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 95.90164% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.45%. Comparing base (a00c882) to head (39c42cb).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
odc/geo/masking.py 95.53% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop     #187    +/-   ##
=========================================
  Coverage    95.44%   95.45%            
=========================================
  Files           31       32     +1     
  Lines         5534     5656   +122     
=========================================
+ Hits          5282     5399   +117     
- Misses         252      257     +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link

github-actions bot commented Nov 20, 2024

@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 04:42 Inactive
@Kirill888
Copy link
Member

@alexgleith I think that chosen approach for representing "bit query" in bits_to_bool is confusing and open to interpretations, in particular the "value part", say you need to check that bit 5 is 1, is value 1<<5 or just 1.

I'd prefer if we just took a string like this

mask = bits_to_bool(xx, "xx01_1xxx")

instead of proposed

mask = bits_to_bool(xx, [3,4,5], b0001_1000)

@Kirill888
Copy link
Member

Other option I guess is a dictionary int -> 0|1, so something like

mask = bits_to_bool(xx, {5:0, 4:1, 3:1})

but that's way more prone to errors: off by 1 or LSB/MSB confusion.

odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 07:47 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:05 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:09 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:12 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:16 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:22 Inactive
@alexgleith
Copy link
Contributor Author

Would be great to get your review @robbibt and perhaps an opinionated DEA Fmask implementation mask_dea or mask_fmask?


return mask_clouds(
xx=xx,
qa_name=qa_name,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there is a default qa_name, I'm concerned this might fail without clear information to the user if the data they've loaded does not have the scl band, and they haven't realised. Should there be checks in the mask_s2 and mask_ls functions that check whether the supplied qa_name band is actually available?

SNOW = 5
CLEAR = 6
WATER = 7
# Not sure how to implement these yet...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've developed a way to do this in eo-insights: https://github.com/frontiersi/eo-insights/blob/412cfb5e4b416c1cfc7d52286a8b9eb241e730d2/src/eo_insights/masking.py#L267-L268

A two-bit pair can have four values: 0 = 0b00, 1 = 0b01, 2 = 0b10, 3 = 0b11. You can appropriately represent this by inserting the requested integer value at the minimum bit using integer_value << min(bits) such that a value of 2 at bits [2, 3] for a 4-bit integer produces 0b1000 and a value of 3 at bits [2, 3] for a 4-bit integer produces 0b1100.

Comment on lines +317 to +319
"""
Perform cloud masking for Sentinel-2 L2A products.
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything to prevent the user from mistakenly using these functions on inappropriate datasets? What happens if the user has loaded Harmonised Landsat Sentinel-2, which uses fmask, which has a different set of flag definitions (and nodata value) to either of the defaults included here? They would need to know that they can't use either of these functions, and have to use the mask_clouds function with custom variables relevant to their product. I'm worried that the inclusion of opinionated masking here is going to trip people up, and there need to be appropriate checks to inform the user that they might be using the function inappropriately.


# "Scales" and "offsets" is used by GDAL.
if scale is None:
scale = xx.attrs.get("scales")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if these attributes aren't available? Should you return None?

@caitlinadams
Copy link

@alexgleith -- is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2 products you have opinionated masking for, either from a single tiff, or a STAC query? I'd like to run a few tests to understand what might happen if a user applies the opinionated masking functions naively and makes a mistake. You could just send this to me directly if you don't want to have it as part of the repo.

@alexgleith
Copy link
Contributor Author

is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2

Here's a notebook: https://gist.github.com/alexgleith/84ebc85e17b904a48319845e42947352

@caitlinadams
Copy link

Thanks for putting this together, Alex! At this stage, my main concern is that it would be very easy for the user to misuse mask_ls and mask_s2. Given this, my current preference would be to implement mask_clouds as part of odc-geo, and move the opinionated elements to a different form. This could be a repo with importable configurations, or providing examples to users about what values they need to put into mask_clouds for different products, or something else we haven't thought of yet!

What do you think?

Copy link
Contributor

@robbibt robbibt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General feedback:

This is brilliant, thanks so much for working on this @alexgleith - getting masking functionality into odc-geo is something I've wanted for a long time, and I can see this making many of our current workflows so much easier.

I do think I agree with Caitlin that that we should probably break this PR into two chunks to consider seperately. The generic tooling is "easier" to progress, and is a perfect fit for this repo. However, the opinionated masking is trickier, especially given all the possible edge cases that can occur when users naively start applying the functions to various datasets (e.g. USGS Level 1 Landsat, USGS Level 2, DEA Collection 3, DE Africa, E84, HLS etc). So perhaps, just include the following functions in this initial PR:

  • bits_to_bool: generic bitmasking
  • enum_to_bool: generic "is-in" masking
  • scale_and_offset: generic re-scaling of data by scale/offsets
  • mask_invalid_data: generic conversion of integer nodata values to NaN (given both this and scale_and_offset convert integer data to floating point, I wonder if mask_invalid_data should become just a special variant of this function and call scale_and_offset under-the-hood?)
  • mask_clouds: generic framework/workflow for masking clouds from a dataset (e.g. qa band selection, re-scaling, mask creation, mask application

For mask_clouds in particular, I think it would be really nice if rather than passing in a custom function, a user could simply provide the bit flags or the enumerated values directly to the function, e.g. perhaps something like this:

# Enum masking
ds.odc.mask_clouds(qa_name="fmask", qa_enum=[0, 4], scale=0.0001, offset=0.0)

# Bit masking
ds.odc.mask_clouds(qa_name="pixel_qa", qa_bits=[4, 3], scale=0.0000275, offset=-0.2)

That would let mask_clouds be used for generic cloud masking tasks and for future satellite datasets with different sets of masking bands and flags than we have seen so far.

Specific feedback:

  1. .odc.mask_invalid_data is great - although the first time I ran it, I was suprised that it operated "in-place", e.g. modifying the existing array without having to assign it to a new output. Is this intended? Both approaches could be useful - I guess we could include an inplace=False param in there to allow users to decide how they want to apply it.

    ds.odc.mask_invalid_data()
    # vs
    ds_masked = ds.odc.mask_invalid_data()
    
  2. .odc.enum_to_bool works a treat! Again, this one returns a new output instead of modifying data in-place, so we should probably standardise to make sure functionality is expected.

  3. I've tested odc.bits_to_bool on our WO data, and it works great for masking by individual flags, e.g. ds.water.isel(time=0).odc.bits_to_bool(bits=[7]) - I actually quite like being able to pass in a simple list of bits. However, I'm not sure how to use it for more complex things like this (which I think relates to @caitlinadams' comments above):

    image

    Bitmasking is something I've found super confusing forever, and I know it's something our users struggle with a lot too. The datacube.utils.make_mask function is probably still the most intuitive implementation I've seen, where users can do something like this using human-readable flags. I wonder if we could extent this function to allow something similar here, if variable flags exist in the data, or if a user manually provides them?

    make_mask(ds.water, water_observed=True)
    
  4. I tried running ds.odc.scale_and_offset(scale=0.0001) on DEA's ARD scaled between 0-10000. It did work, but it also operated in-place which was unexpected, and it appeared to restore nodata values back into the dataset even after converting to float (so the data had values from -999.0 to 1.0. I think it's probably a safe assumption that nodata values can be set to NaN after rescaling to floating point - I don't think having an integer nodata value is useful at that point.

def bits_to_bool(
xx: DataArray,
bits: Sequence[int] | None = None,
bitflags: int | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that bits and bitflags are mutually exclusive, e.g. users could provide either option? If so, we should clarify this in the documentation.

) -> DataArray | Dataset:
"""
Apply scale and offset to the DataArray. Leave scale and offset blank to use
the values from the DataArray's attrs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these attributes that come with a particular STAC extension? If so, makes me wonder if we should configure these in our DEA datacube products too

:param scale: Scale value for the dataset
:param offset: Offset value for the dataset
:param clip: Clip range for the data
:param includ_cirrus: Whether to include cirrus in the mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here doesn't match the current params

xx: DataArray | Dataset,
scale: float | None = None,
offset: float | None = None,
clip: Annotated[Sequence[int | float], 2] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this can be applied at the dataset level, we should add a "skip_bands" param here to match

@robbibt
Copy link
Contributor

robbibt commented Nov 22, 2024

I guess one other thing to consider here is that there's a bunch of similar masking tools in odc.algo which I believe were written to optimise Dask performance: https://github.com/opendatacube/odc-algo/blob/main/odc/algo/_masking.py

We should see if any of that content can be re-used here too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants