First draft of top level masking functions for bitmask and enumerated… #187

alexgleith · 2024-11-20T04:25:05Z

A work in progress, but currently includes functions:

odc.geo.masking.bits_to_bool, which makes a mask using bit flags
odc.geo.masking.enum_to_bool, which makes a mask using isin lookups
scale_and_offset, which applies offset and scale values.

Also:

odc.geo.masking.mask_clouds and the two functions below
odc.geo.masking.mask_ls: apply an opinionated mask to landsat c2l2sr data
odc.geommasking.mask_s2: apply an opinionated mask to sentinel-2 c1l2 data.

I've added odc xarray extensions for them, which I think works. Tested briefly.

So, thoughts? Opinions? Fears and doubts?

codecov · 2024-11-20T04:42:25Z

Codecov Report

Attention: Patch coverage is 95.90164% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.45%. Comparing base (a00c882) to head (39c42cb).
Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
odc/geo/masking.py	95.53%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##           develop     #187    +/-   ##
=========================================
  Coverage    95.44%   95.45%            
=========================================
  Files           31       32     +1     
  Lines         5534     5656   +122     
=========================================
+ Hits          5282     5399   +117     
- Misses         252      257     +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

github-actions · 2024-11-20T04:42:32Z

🚀 Deployed on https://673d9d0f12517841db7e5ea6--odc-geo-docs.netlify.app

Kirill888 · 2024-11-20T07:06:45Z

@alexgleith I think that chosen approach for representing "bit query" in bits_to_bool is confusing and open to interpretations, in particular the "value part", say you need to check that bit 5 is 1, is value 1<<5 or just 1.

I'd prefer if we just took a string like this

mask = bits_to_bool(xx, "xx01_1xxx")

instead of proposed

mask = bits_to_bool(xx, [3,4,5], b0001_1000)

Kirill888 · 2024-11-20T07:09:51Z

Other option I guess is a dictionary int -> 0|1, so something like

mask = bits_to_bool(xx, {5:0, 4:1, 3:1})

but that's way more prone to errors: off by 1 or LSB/MSB confusion.

odc/geo/masking.py

… masks

alexgleith · 2024-11-20T20:51:13Z

Would be great to get your review @robbibt and perhaps an opinionated DEA Fmask implementation mask_dea or mask_fmask?

caitlinadams · 2024-11-20T23:20:11Z

odc/geo/masking.py

+
+    return mask_clouds(
+        xx=xx,
+        qa_name=qa_name,


Given that there is a default qa_name, I'm concerned this might fail without clear information to the user if the data they've loaded does not have the scl band, and they haven't realised. Should there be checks in the mask_s2 and mask_ls functions that check whether the supplied qa_name band is actually available?

caitlinadams · 2024-11-20T23:33:24Z

odc/geo/masking.py

+    SNOW = 5
+    CLEAR = 6
+    WATER = 7
+    # Not sure how to implement these yet...


I've developed a way to do this in eo-insights: https://github.com/frontiersi/eo-insights/blob/412cfb5e4b416c1cfc7d52286a8b9eb241e730d2/src/eo_insights/masking.py#L267-L268

A two-bit pair can have four values: 0 = 0b00, 1 = 0b01, 2 = 0b10, 3 = 0b11. You can appropriately represent this by inserting the requested integer value at the minimum bit using integer_value << min(bits) such that a value of 2 at bits [2, 3] for a 4-bit integer produces 0b1000 and a value of 3 at bits [2, 3] for a 4-bit integer produces 0b1100.

caitlinadams · 2024-11-20T23:41:49Z

odc/geo/masking.py

+    """
+    Perform cloud masking for Sentinel-2 L2A products.
+    """


Is there anything to prevent the user from mistakenly using these functions on inappropriate datasets? What happens if the user has loaded Harmonised Landsat Sentinel-2, which uses fmask, which has a different set of flag definitions (and nodata value) to either of the defaults included here? They would need to know that they can't use either of these functions, and have to use the mask_clouds function with custom variables relevant to their product. I'm worried that the inclusion of opinionated masking here is going to trip people up, and there need to be appropriate checks to inform the user that they might be using the function inappropriately.

caitlinadams · 2024-11-20T23:46:29Z

odc/geo/masking.py

+
+    # "Scales" and "offsets" is used by GDAL.
+    if scale is None:
+        scale = xx.attrs.get("scales")


What happens if these attributes aren't available? Should you return None?

caitlinadams · 2024-11-21T00:24:56Z

@alexgleith -- is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2 products you have opinionated masking for, either from a single tiff, or a STAC query? I'd like to run a few tests to understand what might happen if a user applies the opinionated masking functions naively and makes a mistake. You could just send this to me directly if you don't want to have it as part of the repo.

alexgleith · 2024-11-21T01:11:53Z

is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2

Here's a notebook: https://gist.github.com/alexgleith/84ebc85e17b904a48319845e42947352

caitlinadams · 2024-11-21T02:30:56Z

Thanks for putting this together, Alex! At this stage, my main concern is that it would be very easy for the user to misuse mask_ls and mask_s2. Given this, my current preference would be to implement mask_clouds as part of odc-geo, and move the opinionated elements to a different form. This could be a repo with importable configurations, or providing examples to users about what values they need to put into mask_clouds for different products, or something else we haven't thought of yet!

What do you think?

robbibt

General feedback:

This is brilliant, thanks so much for working on this @alexgleith - getting masking functionality into odc-geo is something I've wanted for a long time, and I can see this making many of our current workflows so much easier.

I do think I agree with Caitlin that that we should probably break this PR into two chunks to consider seperately. The generic tooling is "easier" to progress, and is a perfect fit for this repo. However, the opinionated masking is trickier, especially given all the possible edge cases that can occur when users naively start applying the functions to various datasets (e.g. USGS Level 1 Landsat, USGS Level 2, DEA Collection 3, DE Africa, E84, HLS etc). So perhaps, just include the following functions in this initial PR:

bits_to_bool: generic bitmasking
enum_to_bool: generic "is-in" masking
scale_and_offset: generic re-scaling of data by scale/offsets
mask_invalid_data: generic conversion of integer nodata values to NaN (given both this and scale_and_offset convert integer data to floating point, I wonder if mask_invalid_data should become just a special variant of this function and call scale_and_offset under-the-hood?)
mask_clouds: generic framework/workflow for masking clouds from a dataset (e.g. qa band selection, re-scaling, mask creation, mask application

For mask_clouds in particular, I think it would be really nice if rather than passing in a custom function, a user could simply provide the bit flags or the enumerated values directly to the function, e.g. perhaps something like this:

# Enum masking
ds.odc.mask_clouds(qa_name="fmask", qa_enum=[0, 4], scale=0.0001, offset=0.0)

# Bit masking
ds.odc.mask_clouds(qa_name="pixel_qa", qa_bits=[4, 3], scale=0.0000275, offset=-0.2)

That would let mask_clouds be used for generic cloud masking tasks and for future satellite datasets with different sets of masking bands and flags than we have seen so far.

Specific feedback:

.odc.mask_invalid_data is great - although the first time I ran it, I was suprised that it operated "in-place", e.g. modifying the existing array without having to assign it to a new output. Is this intended? Both approaches could be useful - I guess we could include an inplace=False param in there to allow users to decide how they want to apply it.
```
ds.odc.mask_invalid_data()
# vs
ds_masked = ds.odc.mask_invalid_data()
```
.odc.enum_to_bool works a treat! Again, this one returns a new output instead of modifying data in-place, so we should probably standardise to make sure functionality is expected.
I've tested odc.bits_to_bool on our WO data, and it works great for masking by individual flags, e.g. ds.water.isel(time=0).odc.bits_to_bool(bits=[7]) - I actually quite like being able to pass in a simple list of bits. However, I'm not sure how to use it for more complex things like this (which I think relates to @caitlinadams' comments above):

Bitmasking is something I've found super confusing forever, and I know it's something our users struggle with a lot too. The datacube.utils.make_mask function is probably still the most intuitive implementation I've seen, where users can do something like this using human-readable flags. I wonder if we could extent this function to allow something similar here, if variable flags exist in the data, or if a user manually provides them?
```
make_mask(ds.water, water_observed=True)
```
I tried running ds.odc.scale_and_offset(scale=0.0001) on DEA's ARD scaled between 0-10000. It did work, but it also operated in-place which was unexpected, and it appeared to restore nodata values back into the dataset even after converting to float (so the data had values from -999.0 to 1.0. I think it's probably a safe assumption that nodata values can be set to NaN after rescaling to floating point - I don't think having an integer nodata value is useful at that point.

robbibt · 2024-11-22T00:36:38Z

odc/geo/masking.py

+def bits_to_bool(
+    xx: DataArray,
+    bits: Sequence[int] | None = None,
+    bitflags: int | None = None,


Is it correct that bits and bitflags are mutually exclusive, e.g. users could provide either option? If so, we should clarify this in the documentation.

robbibt · 2024-11-22T00:39:44Z

odc/geo/masking.py

+) -> DataArray | Dataset:
+    """
+    Apply scale and offset to the DataArray. Leave scale and offset blank to use
+    the values from the DataArray's attrs.


Are these attributes that come with a particular STAC extension? If so, makes me wonder if we should configure these in our DEA datacube products too

robbibt · 2024-11-22T00:41:12Z

odc/geo/masking.py

+    :param scale: Scale value for the dataset
+    :param offset: Offset value for the dataset
+    :param clip: Clip range for the data
+    :param includ_cirrus: Whether to include cirrus in the mask


The docstring here doesn't match the current params

robbibt · 2024-11-22T01:08:57Z

odc/geo/masking.py

+    xx: DataArray | Dataset,
+    scale: float | None = None,
+    offset: float | None = None,
+    clip: Annotated[Sequence[int | float], 2] | None = None,


Given this can be applied at the dataset level, we should add a "skip_bands" param here to match

robbibt · 2024-11-22T01:29:16Z

I guess one other thing to consider here is that there's a bunch of similar masking tools in odc.algo which I believe were written to optimise Dask performance: https://github.com/opendatacube/odc-algo/blob/main/odc/algo/_masking.py

We should see if any of that content can be re-used here too.

alexgleith requested review from robbibt and caitlinadams November 20, 2024 04:25

alexgleith marked this pull request as draft November 20, 2024 04:26

alexgleith force-pushed the add-masking-functions branch 2 times, most recently from 2ecd02a to 479dc32 Compare November 20, 2024 04:40

github-actions bot temporarily deployed to pull request November 20, 2024 04:42 Inactive

Kirill888 reviewed Nov 20, 2024

View reviewed changes

odc/geo/masking.py Outdated Show resolved Hide resolved

Kirill888 reviewed Nov 20, 2024

View reviewed changes

odc/geo/masking.py Outdated Show resolved Hide resolved

Kirill888 reviewed Nov 20, 2024

View reviewed changes

odc/geo/masking.py Outdated Show resolved Hide resolved

Kirill888 reviewed Nov 20, 2024

View reviewed changes

odc/geo/masking.py Outdated Show resolved Hide resolved

github-actions bot temporarily deployed to pull request November 20, 2024 07:47 Inactive

github-actions bot temporarily deployed to pull request November 20, 2024 08:05 Inactive

github-actions bot temporarily deployed to pull request November 20, 2024 08:09 Inactive

alexgleith force-pushed the add-masking-functions branch from 3a4eb8c to 504bbb8 Compare November 20, 2024 08:10

github-actions bot temporarily deployed to pull request November 20, 2024 08:12 Inactive

alexgleith force-pushed the add-masking-functions branch from 504bbb8 to f4f5a5b Compare November 20, 2024 08:13

github-actions bot temporarily deployed to pull request November 20, 2024 08:16 Inactive

github-actions bot temporarily deployed to pull request November 20, 2024 08:22 Inactive

First draft of top level masking functions for bitmask and enumerated…

39c42cb

… masks

alexgleith force-pushed the add-masking-functions branch from 7cfd466 to 39c42cb Compare November 20, 2024 08:23

github-actions bot deployed to pull request November 20, 2024 08:25 View deployment

caitlinadams reviewed Nov 20, 2024

View reviewed changes

robbibt requested changes Nov 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First draft of top level masking functions for bitmask and enumerated… #187

First draft of top level masking functions for bitmask and enumerated… #187

alexgleith commented Nov 20, 2024 •

edited

Loading

codecov bot commented Nov 20, 2024 •

edited

Loading

github-actions bot commented Nov 20, 2024 •

edited

Loading

Kirill888 commented Nov 20, 2024

Kirill888 commented Nov 20, 2024

alexgleith commented Nov 20, 2024

caitlinadams Nov 20, 2024

caitlinadams Nov 20, 2024

caitlinadams Nov 20, 2024

caitlinadams Nov 20, 2024

caitlinadams commented Nov 21, 2024

alexgleith commented Nov 21, 2024

caitlinadams commented Nov 21, 2024

robbibt left a comment •

edited

Loading

robbibt Nov 22, 2024

robbibt Nov 22, 2024

robbibt Nov 22, 2024

robbibt Nov 22, 2024

robbibt commented Nov 22, 2024 •

edited

Loading

First draft of top level masking functions for bitmask and enumerated… #187

Are you sure you want to change the base?

First draft of top level masking functions for bitmask and enumerated… #187

Conversation

alexgleith commented Nov 20, 2024 • edited Loading

codecov bot commented Nov 20, 2024 • edited Loading

Codecov Report

github-actions bot commented Nov 20, 2024 • edited Loading

Kirill888 commented Nov 20, 2024

Kirill888 commented Nov 20, 2024

alexgleith commented Nov 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

caitlinadams commented Nov 21, 2024

alexgleith commented Nov 21, 2024

caitlinadams commented Nov 21, 2024

robbibt left a comment • edited Loading

Choose a reason for hiding this comment

General feedback:

Specific feedback:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robbibt commented Nov 22, 2024 • edited Loading

alexgleith commented Nov 20, 2024 •

edited

Loading

codecov bot commented Nov 20, 2024 •

edited

Loading

github-actions bot commented Nov 20, 2024 •

edited

Loading

robbibt left a comment •

edited

Loading

robbibt commented Nov 22, 2024 •

edited

Loading