Add back working download script #1
Conversation
src/cloudcast/download.py
Outdated
ds = (
    xr.open_zarr(path, chunks={})   # lazily open the zarr store
    .sortby("time")                 # ensure timestamps are ordered
    .sel(time=slice(start_date_stamp, end_date_stamp, data_inner_steps))
)
@phinate The satellite data on GCP is a bit gappy: frequently 5 or 10 minutes are missing, and sometimes there are even longer gaps. This stepping just takes every 3rd timestamp regardless of those gaps. I think it might be better to use:

mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
ds_filtered_time = ds.sel(time=mask)

If we used data_inner_steps=3, this would select all the data where the time is HH:00, HH:15, HH:30, or HH:45. Because of the random gaps in the satellite data, the code as it is here would give a mixture of all minutes rather than strict multiples of 15 minutes.

I think being stricter here will make things easier in the dataloader.
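For concreteness, here is a minimal sketch of how that filter could slot into download.py. The names path, start_date_stamp, end_date_stamp and data_inner_steps are taken from the diff above; the numpy import is assumed:

import numpy as np
import xarray as xr

# Open and sort as before, but select the date range without a positional step
ds = (
    xr.open_zarr(path, chunks={})
    .sortby("time")
    .sel(time=slice(start_date_stamp, end_date_stamp))
)

# Keep only timestamps whose minute is an exact multiple of data_inner_steps * 5,
# e.g. HH:00/HH:15/HH:30/HH:45 for data_inner_steps=3
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
ds_filtered_time = ds.sel(time=mask)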
That sounds sensible in principle (and I do realise you already suggested this on my gist, apologies). How sure are we that the 00/15/30/45 marks are the most prevalent ones (i.e. is there a point where a slight offset may have occurred)?
My thinking is basically:

Presume we want to train a model to forecast in 15-minute steps, and imagine the satellite timestamps in the cloud dataset were [HH:00, HH:05, HH:15, HH:20, HH:25, HH:30, HH:35, ...], i.e. just HH:10 is missing.

In this case, with slicing every 3rd time slice like the original code, we'd only have downloaded [HH:00, HH:20, HH:35, HH:50, ...]. This sample might slightly confuse the model, since after the first timestamp everything is 5 minutes ahead of where the model would expect it to be. So we would probably just filter out this time range as unusable in the dataloader. But by selecting the 00/15/30/45 minutes explicitly, we get a sample from this data which we can use to train the model.

At times where we are missing a 00/15/30/45 timestamp we would lose that time range, but I think it's still simpler to be strict, and hopefully we have enough data that we can afford to lose some.
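To make that concrete, here is a small self-contained sketch (with made-up timestamps mirroring the example above) showing how positional slicing drifts after the gap, while the minute mask stays on the 15-minute grid:

import numpy as np
import pandas as pd
import xarray as xr

# Toy time axis with HH:10 missing, as in the example above
times = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:15",
    "2024-01-01 00:20", "2024-01-01 00:25", "2024-01-01 00:30",
    "2024-01-01 00:35", "2024-01-01 00:40", "2024-01-01 00:45",
])
ds = xr.Dataset({"dummy": ("time", np.arange(len(times)))}, coords={"time": times})
data_inner_steps = 3

# Positional slicing: every 3rd index, so the gap shifts everything after it
sliced = ds.isel(time=slice(None, None, data_inner_steps))
print(sliced.time.dt.strftime("%H:%M").values)   # ['00:00' '00:20' '00:35']

# Minute mask: strict multiples of 15 minutes, regardless of gaps
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
print(ds.sel(time=mask).time.dt.strftime("%H:%M").values)  # ['00:00' '00:15' '00:30' '00:45']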
Besides that one comment, this looks great!
[tool.ruff]
src = ["src"]
exclude = []
- line-length = 88  # how long you want lines to be
+ line-length = 100  # how long you want lines to be
<3
This just restores the functionality written by @dfulu in his sat_pred code, with the additional feature of being able to run via the CLI (cloudcast download) and to slice the data by step size (--data-inner-steps), which defaults to 15 minutes (3 × 5-minute intervals).
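For reference, invoking the CLI with the step size made explicit would then look something like this (only the --data-inner-steps flag is described in this PR; any other options are omitted here):

cloudcast download --data-inner-steps 3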