Add back working download script #1

Merged: 9 commits from fix-errs into main, Jul 17, 2024
Conversation

@phinate (Collaborator) commented Jul 16, 2024

This just restores the functionality written by @dfulu in his sat_pred code, with the additional features of being able to run it via the CLI (cloudcast download) and of slicing the data by step size (--data-inner-steps), which defaults to 15 minutes (3 x 5-minute intervals).
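For context, a rough end-to-end sketch of the download-and-slice flow described above; this is a sketch only, with illustrative paths and values (output_path, the bucket name, and the dates are placeholders, not the script's actual argument names). The slicing itself is the snippet under review just below.

import pandas as pd
import xarray as xr

# Illustrative values only -- the real CLI derives these from its arguments.
path = "gs://<bucket>/satellite.zarr"       # source zarr store on GCP (placeholder)
output_path = "satellite_15min.zarr"        # local destination (placeholder)
start_date_stamp = pd.Timestamp("2021-01-01")
end_date_stamp = pd.Timestamp("2021-02-01")
data_inner_steps = 3                        # 3 x 5-minute intervals = 15 minutes

ds = (
    xr.open_zarr(path, chunks={})
    .sortby("time")
    .sel(time=slice(start_date_stamp, end_date_stamp, data_inner_steps))
)
ds.to_zarr(output_path, mode="w")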

ds = (
    xr.open_zarr(path, chunks={})
    .sortby("time")
    .sel(time=slice(start_date_stamp, end_date_stamp, data_inner_steps))
)
dfulu (Collaborator) commented:
@phinate The satellite data on GCP is a bit gappy: frequently 5 or 10 minutes are missing, and sometimes there are even longer gaps. This stepping just takes every 3rd timestamp regardless of those gaps. I think it might be better to use:

# keep only timestamps on exact multiples of (data_inner_steps * 5) minutes
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
ds_filtered_time = ds.sel(time=mask)

If we used data_inner_steps=3, this would select all the data where the time is HH:00, HH:15, HH:30, or HH:45. Because of the random gaps in the satellite data, the code as it is here would give a mixture of all minutes rather than strict multiples of 15 minutes.

I think being stricter here will make things easier in the dataloader.
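To make the difference concrete, here is a small toy illustration (not code from this repo): a 5-minute time axis with one timestamp missing, comparing the stride-based selection with the minute mask suggested above.

import numpy as np
import pandas as pd
import xarray as xr

# Toy 5-minute time axis with 00:10 missing, mimicking a gap in the GCP archive.
times = pd.date_range("2024-01-01 00:00", "2024-01-01 01:00", freq="5min")
times = times.drop([pd.Timestamp("2024-01-01 00:10")])
ds = xr.Dataset(coords={"time": times})

data_inner_steps = 3  # 3 x 5 minutes = 15 minutes

# Stride-based selection (as in the original code): every 3rd timestamp, gaps or not.
strided = ds.isel(time=slice(None, None, data_inner_steps))
print(strided.time.dt.minute.values)  # minutes 0, 20, 35, 50 -- drifts after the gap

# Minute mask (suggested): keep only exact multiples of 15 minutes.
mask = np.mod(ds.time.dt.minute, data_inner_steps * 5) == 0
masked = ds.sel(time=mask)
print(masked.time.dt.minute.values)   # minutes 0, 15, 30, 45, 0 -- strict quarter-hours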

phinate (Collaborator, Author) replied:

That sounds sensible in principle (and I do realise you already suggested this on my gist, apologies). How sure are we that the 00/15/30/45 marks are the most prevalent ones (i.e. is there a point where a slight offset may have occurred?)

dfulu (Collaborator) replied:

My thinking is basically:

Presume we want to train a model to forecast in 15-minute steps, and imagine the satellite timestamps in the cloud dataset were [HH:00, HH:05, HH:15, HH:20, HH:25, HH:30, HH:35, ...], i.e. just HH:10 is missing.

In this case, slicing every 3rd time slice like the original code, we'd only have downloaded [HH:00, HH:20, HH:35, HH:50, ...]. This sample might slightly confuse the model, since after the first timestamp everything is 5 minutes ahead of where the model would expect it to be. So we would probably just filter out this time range as unusable in the dataloader. But by slicing the 00/15/30/45 minutes explicitly, we get a sample from this data which we can use to train the model.

At times where we are missing the 00/15/30/45 timestamp we would lose this time range, but I think it's still simpler to be strict, and hopefully we have enough data that we can afford to lose some.
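To illustrate the dataloader-side filtering mentioned above, here is a hypothetical helper (not from this repo) that rejects any sample window whose timestamps are not an exact, regular 15-minute sequence; with the strict 00/15/30/45 download, windows that exist at all pass this check by construction.

import numpy as np
import pandas as pd

def is_regular_window(times: pd.DatetimeIndex, step_minutes: int = 15) -> bool:
    """Return True if `times` is an exact, gap-free sequence with the given step.

    Hypothetical helper for illustration only; not part of the repository.
    """
    if len(times) < 2:
        return True
    deltas = np.diff(times.values).astype("timedelta64[m]").astype(int)
    return bool(np.all(deltas == step_minutes))

# dfulu's example: 00:10 missing, then every 3rd timestamp taken.
strided_window = pd.DatetimeIndex(
    ["2024-01-01 00:00", "2024-01-01 00:20", "2024-01-01 00:35", "2024-01-01 00:50"]
)
print(is_regular_window(strided_window))  # False -> window thrown away in the dataloader

strict_window = pd.DatetimeIndex(
    ["2024-01-01 00:00", "2024-01-01 00:15", "2024-01-01 00:30", "2024-01-01 00:45"]
)
print(is_regular_window(strict_window))   # True -> usable training sample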

@phinate requested a review from dfulu on July 17, 2024 at 09:27
@dfulu (Collaborator) left a comment:

Besides that one comment, this looks great!

@phinate merged commit 7969d79 into main on Jul 17, 2024
3 checks passed
@phinate deleted the fix-errs branch on July 17, 2024 at 10:52
[tool.ruff]
src = ["src"]
exclude = []
-line-length = 88  # how long you want lines to be
+line-length = 100  # how long you want lines to be
dfulu (Collaborator) commented:

<3
