This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Item and Collection parquet Models #24
Comments
Can you expand a bit on the reason for JSON encoding the fields like links, assets, properties, etc.? Working with these nested fields can be a pain, but it seems to me like the tooling around these nested dtypes is improving (pandas-dev/pandas#54938, etc.). Being able to filter on, e.g.
BTW: I really like the idea of defining a data model (or multiple?), maybe as a pyarrow or parquet schema, for this type of data, and properly documenting it. https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/ and https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/ are some pretty detailed blog posts on making a data model for OpenTelemetry data.
💯 There's a STAC Sprint next week that I think this would fit into well. I'm a fan of having arrow-native/parquet-native types for stac-geoparquet. Maybe I'll try to write up a "mini spec" for this with a read and write implementation using pyarrow in Python? Maybe as a PR here? I think using pyarrow directly is likely to give much better control over the exact representation, and in particular should make it easier to dictionary-encode specific string columns, which should save a lot of memory.
+1 to using arrow directly (and then somehow adding the geoarrow metadata). There are a few places where we have to fix up issues with object-dtype ndarrays we're getting from pandas. If we do need geopandas for anything, then we can explore pandas' new-ish support for arrow-backed arrays.
This should not be considered an Issue but a Discussion (Discussions are not enabled in this repo yet).
👋 @TomAugspurger, thanks for starting this tool. I'm personally interested in stac-geoparquet as a way to create easily shareable files for large STAC Collections. My usual approach is to create newline-delimited GeoJSON (https://github.com/vincentsarago/MAXAR_opendata_to_pgstac), but GeoParquet seems to be a nice alternative and would also provide some simple query capability.
I've looked at the code and implemented a simplified version of a STAC to GeoParquet function. I say simplified because I really tried to minimize the data model, mostly by not creating a column for each of the item properties.
In ☝️ I'm creating columns for each of the STAC object's top-level fields (not the item properties) and columns for the datetime properties (to ease temporal filtering). All List and Dict objects, however, are stored as string.
Model:
I do the same for the collection
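A minimal sketch of the flattening described above, using only the standard library. It is not the code from this issue: the helper names `parse_datetime` and `item_to_row`, the choice of datetime keys, and the sample item are all assumptions made for illustration.

```python
import json
from datetime import datetime

# Item properties promoted to their own columns to ease temporal filtering
# (an assumed selection).
DATETIME_KEYS = ("datetime", "start_datetime", "end_datetime")


def parse_datetime(value):
    """Parse an RFC 3339 timestamp string; None stays None."""
    if value is None:
        return None
    # Replace a trailing "Z" so older Python versions can parse the string.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def item_to_row(item):
    """Flatten one STAC item dict into a flat row (hypothetical helper)."""
    row = {}
    for key, value in item.items():
        # Top-level fields become columns; lists and dicts are JSON-encoded
        # strings. Geometry is JSON-encoded here too, whereas a GeoParquet
        # writer would typically store it as WKB.
        if isinstance(value, (list, dict)):
            row[key] = json.dumps(value)
        else:
            row[key] = value
    properties = item.get("properties", {})
    for key in DATETIME_KEYS:
        row[key] = parse_datetime(properties.get(key))
    return row


# Made-up item, only for demonstration.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "collection": "example-collection",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2023-01-01T00:00:00Z"},
    "links": [],
    "assets": {},
}
row = item_to_row(item)
print(sorted(row))
```

Each row is a flat mapping of scalars, strings and datetimes, so a list of such rows can be handed to a dataframe or table constructor without any nested types.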
cc @kylebarron @gadomski