Add back working download script #1
Conversation
src/cloudcast/download.py
Outdated
ds = (
    xr.open_zarr(path, chunks={})   # lazily open the zarr store
    .sortby("time")                 # ensure timestamps are ordered
    .sel(time=slice(start_date_stamp, end_date_stamp, data_inner_steps))
)
@phinate The satellite data on GCP is a bit gappy: frequently 5 or 10 minutes are missing, and sometimes there are even longer gaps. This stepping just takes every 3rd timestamp regardless of those gaps. I think it might be better to use:

mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
ds_filtered_time = ds.sel(time=mask)

If we used data_inner_steps=3, this would select all the data where the time is HH:00, HH:15, HH:30, or HH:45. Because of the random gaps in the satellite data, the code as it is here would give a mixture of all minutes rather than strict multiples of 15 minutes.

I think being stricter here will make things easier in the dataloader.
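For concreteness, here is a minimal sketch of how that filter could slot into download.py. The names path, start_date_stamp, end_date_stamp and data_inner_steps are taken from the diff above; the numpy import is assumed:

import numpy as np
import xarray as xr

# Open and sort as before, but select the date range without a positional step
ds = (
    xr.open_zarr(path, chunks={})
    .sortby("time")
    .sel(time=slice(start_date_stamp, end_date_stamp))
)

# Keep only timestamps whose minute is an exact multiple of data_inner_steps * 5,
# e.g. HH:00/HH:15/HH:30/HH:45 for data_inner_steps=3
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
ds_filtered_time = ds.sel(time=mask)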
That sounds sensible in principle (and I do realise you already suggested this on my gist, apologies). How sure are we that the 00/15/30/45 marks are the most prevalent ones (i.e. is there a point where a slight offset may have occurred)?
My thinking is basically:

Presume we want to train a model to forecast in 15-minute steps, and imagine the satellite timestamps in the cloud dataset were [HH:00, HH:05, HH:15, HH:20, HH:25, HH:30, HH:35, ...], i.e. just HH:10 is missing.

In this case, with slicing every 3rd time slice like the original code, we'd only have downloaded [HH:00, HH:20, HH:35, HH:50, ...]. This sample might slightly confuse the model, since after the first timestamp everything is 5 minutes ahead of where the model would expect it to be. So we would probably just filter out this time range as unusable in the dataloader. But by selecting the 00/15/30/45 minutes explicitly, we get a sample from this data which we can use to train the model.

At times where we are missing a 00/15/30/45 timestamp we would lose that time range, but I think it's still simpler to be strict, and hopefully we have enough data that we can afford to lose some.
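To make that concrete, here is a small self-contained sketch (with made-up timestamps mirroring the example above) showing how positional slicing drifts after the gap, while the minute mask stays on the 15-minute grid:

import numpy as np
import pandas as pd
import xarray as xr

# Toy time axis with HH:10 missing, as in the example above
times = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:15",
    "2024-01-01 00:20", "2024-01-01 00:25", "2024-01-01 00:30",
    "2024-01-01 00:35", "2024-01-01 00:40", "2024-01-01 00:45",
])
ds = xr.Dataset({"dummy": ("time", np.arange(len(times)))}, coords={"time": times})
data_inner_steps = 3

# Positional slicing: every 3rd index, so the gap shifts everything after it
sliced = ds.isel(time=slice(None, None, data_inner_steps))
print(sliced.time.dt.strftime("%H:%M").values)   # ['00:00' '00:20' '00:35']

# Minute mask: strict multiples of 15 minutes, regardless of gaps
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
print(ds.sel(time=mask).time.dt.strftime("%H:%M").values)  # ['00:00' '00:15' '00:30' '00:45']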
Besides that one comment, this looks great!
[tool.ruff]
src = ["src"]
exclude = []
- line-length = 88  # how long you want lines to be
+ line-length = 100  # how long you want lines to be
<3
This just restores the functionality written by @dfulu in his sat_pred code, with the additional feature of being able to run via the CLI (cloudcast download) and to slice the data by step size (--data-inner-steps), which defaults to 15 minutes (3 × 5-minute intervals).
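For reference, invoking the CLI with the step size made explicit would then look something like this (only the --data-inner-steps flag is described in this PR; any other options are omitted here):

cloudcast download --data-inner-steps 3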