Support SQL-Style Joins between Xarray datasets and Dask/Pandas dataframes #5

Open
alxmrs opened this issue Jan 31, 2024 · 5 comments · May be fixed by #33

@alxmrs
Owner

alxmrs commented Jan 31, 2024

Here's an example workflow that I'd like to support once this feature exists. It comes from Jake Wall of the Mara Elephant Project, who would use both raster and tabular data from Earth Engine:

Yeah, so one example, is to extract a NDVI value from an IC for every GPS point recorded by an elephant. We have millions of points that get translated into features. Then a reduce operation is run on the point to get the closest n values in time to when the GPS point occurred. We then spit this back out as an array and join it with the original geopandas dataframe.

I imagine this would look like a left join from a Dask DataFrame holding the elephant coordinates to an EE ImageCollection opened with Xee via Qarray. Some details are still fuzzy, such as how we'd work in a nearest-neighbor (NN) lookup (maybe this could be done via a SQL aggregation?).
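To make the shape of that join concrete, here is a minimal, dependency-free sketch. It is not how Qarray, Xee, or Dask would do it; plain lists stand in for the raster and the table, and every name (`nearest_index`, `left_join_nearest`, the sample grid and points) is hypothetical. It assumes a regular grid with sorted coordinate axes and treats "closest n values in time" as the n time steps with the smallest absolute time difference, averaged.

```python
from bisect import bisect_left


def nearest_index(sorted_values, target):
    """Return the index of the entry in sorted_values closest to target."""
    pos = bisect_left(sorted_values, target)
    if pos == 0:
        return 0
    if pos == len(sorted_values):
        return len(sorted_values) - 1
    before, after = sorted_values[pos - 1], sorted_values[pos]
    return pos if (after - target) < (target - before) else pos - 1


def left_join_nearest(points, times, lats, lons, ndvi, n=2):
    """Left join: every point is kept and gains an averaged 'ndvi' column.

    ndvi[t][i][j] is the raster value at times[t], lats[i], lons[j];
    None marks a missing pixel (assumption: e.g. a cloud-masked value).
    """
    joined = []
    for p in points:
        i = nearest_index(lats, p["lat"])
        j = nearest_index(lons, p["lon"])
        # Indices of the n time steps closest to the GPS fix.
        closest = sorted(range(len(times)), key=lambda t: abs(times[t] - p["time"]))[:n]
        samples = [ndvi[t][i][j] for t in closest if ndvi[t][i][j] is not None]
        mean = sum(samples) / len(samples) if samples else None
        joined.append({**p, "ndvi": mean})
    return joined


# Hypothetical toy raster: 3 time steps on a 2 x 2 grid.
times = [0, 10, 20]
lats = [0.0, 1.0]
lons = [30.0, 31.0]
ndvi = [
    [[0.25, 0.5], [0.75, 0.125]],
    [[0.5, 0.5], [0.25, 0.375]],
    [[None, 0.75], [0.5, 0.625]],
]

# Hypothetical GPS fixes (the "left" table).
points = [
    {"id": "e1", "lat": 0.9, "lon": 30.1, "time": 4},
    {"id": "e2", "lat": 0.1, "lon": 30.9, "time": 19},
    {"id": "e3", "lat": 0.0, "lon": 30.0, "time": 25},
]

joined = left_join_nearest(points, times, lats, lons, ndvi, n=2)
unmatched = left_join_nearest(points[2:], times, lats, lons, ndvi, n=1)
```

The point `e3` shows the left-join behavior: with `n=1` its only candidate pixel is missing, so the row is kept with `ndvi` set to `None` rather than dropped.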

In general, I think there is broad demand for joining raster and tabular data with each other. Down the line, I bet we could implement geo-aware joins that make use of geometry.

@alxmrs
Owner Author

alxmrs commented Feb 17, 2024

This should be possible to demo once #8 is complete. If we figure this out, we should document it in the README.

@alxmrs alxmrs added this to the MVP milestone Feb 17, 2024
@alxmrs alxmrs linked a pull request Mar 12, 2024 that will close this issue
@alxmrs
Owner Author

alxmrs commented Aug 27, 2024

I’ve been reading more about how this is done today. The best example I can find for joining rasters with point (and vector) data uses a hierarchical spatial index like H3 or S2.

https://github.com/uber/h3-py-notebooks/blob/master/notebooks/unified_data_layers.ipynb

I wonder if this is the technique that underpins Fused.io.
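As a rough illustration of the index-based approach, the sketch below bins both raster pixels and points into shared cells and then does an ordinary equality join on the cell id. The cell scheme is a toy latitude/longitude grid, used here only as a stand-in; it is not the H3 or S2 cell scheme, and all names (`cell_id`, `parent`, `join_on_cells`, the sample data) are hypothetical.

```python
from collections import defaultdict


def cell_id(lat, lon, level):
    """Map a coordinate to a (level, row, col) cell on a 2**level x 2**level grid.

    A toy stand-in for a hierarchical cell id; not the H3 or S2 scheme.
    """
    n = 2 ** level
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    return (level, row, col)


def parent(cell):
    """Coarsen a cell by one level; this is what makes the index hierarchical."""
    level, row, col = cell
    return (level - 1, row // 2, col // 2)


def join_on_cells(points, pixels, level):
    """Left join points to the mean pixel value of the cell they fall in."""
    by_cell = defaultdict(list)
    for px in pixels:
        by_cell[cell_id(px["lat"], px["lon"], level)].append(px["value"])
    joined = []
    for p in points:
        values = by_cell.get(cell_id(p["lat"], p["lon"], level))
        mean = sum(values) / len(values) if values else None
        joined.append({**p, "value": mean})
    return joined


# Hypothetical raster pixels, flattened to one record per pixel.
pixels = [
    {"lat": -1.0, "lon": 35.0, "value": 0.25},
    {"lat": -2.0, "lon": 36.0, "value": 0.75},
    {"lat": 40.0, "lon": -100.0, "value": 0.5},
]

# Hypothetical point table.
points = [
    {"id": "e1", "lat": -1.5, "lon": 35.5},
    {"id": "e2", "lat": 10.0, "lon": 10.0},
]

joined = join_on_cells(points, pixels, level=4)
```

Once both sides carry a cell id, the join itself is an ordinary equality join, which is the part a SQL engine already handles well. Coarsening with `parent` trades spatial precision for more matches per cell.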

@alxmrs
Owner Author

alxmrs commented Aug 27, 2024

For non-geospatial data, could we use a k-d tree to create a hierarchical index? 🤔
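One way to picture the k-d tree idea: build the tree once over the key columns of one table, then answer each row of the other table with a nearest-neighbor query. The sketch below is a textbook k-d tree in plain Python, offered only as an illustration under that assumption; the names (`build_kdtree`, `nearest`, `rows`, `queries`) are hypothetical and nothing here reflects an existing implementation in this project.

```python
def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes; returns nested dicts or None."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], target) < sq_dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


# Hypothetical "right" table keyed by a 2-D feature vector.
rows = {(1.0, 2.0): "a", (5.0, 5.0): "b", (9.0, 1.0): "c", (4.0, 8.0): "d"}
tree = build_kdtree(list(rows))

# Hypothetical "left" table keys; each is joined to its nearest right-hand row.
queries = [(1.5, 2.5), (8.0, 2.0), (4.0, 7.0)]
joined = [(q, rows[nearest(tree, q)]) for q in queries]
```

The tree is built once and reused for every left-hand row, which is where the "pre-computed index" framing comes from.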

@alxmrs
Owner Author

alxmrs commented Sep 4, 2024

This podcast episode strongly validates the use case that this library (and this issue) addresses.

https://overcast.fm/+AAU1XJb7r0Y/6:55

@alxmrs
Owner Author

alxmrs commented Oct 11, 2024

https://github.com/DahnJ/H3-Pandas

This gives me more confidence that an index system (geospatial via S2 and H3, or pre-computed via k-d trees) is worth integrating. To me, this is proof of demand for such features.
