
Item and Collection parquet Models #24

Closed · @vincentsarago opened this issue Sep 19, 2023 · 4 comments

@vincentsarago (Member) commented Sep 19, 2023

This should not be considered an Issue but a Discussion (Discussions are not enabled in this repo yet).

👋 @TomAugspurger, thanks for starting this tool. I'm personally interested in stac-geoparquet to create easily shareable files for large STAC Collections. My usual approach is to create newline-delimited GeoJSON (https://github.com/vincentsarago/MAXAR_opendata_to_pgstac), but GeoParquet seems like a nice alternative and also provides some simple query capability.

I've looked at the code and implemented a simplified version of a STAC to GeoParquet function. I say simplified because I really tried to minimize the data model, mostly by not creating a column for each of the item `properties`.

```python
import json
from typing import Any, Dict

import geopandas as gpd
import shapely


# From https://github.com/stac-utils/stac-geoparquet/blob/main/stac_geoparquet/utils.py#L60-L72
def fix_empty_multipolygon(
    item_geometry: Dict[str, Any]
) -> shapely.geometry.base.BaseGeometry:
    # Filter out missing geoms in MultiPolygons
    # https://github.com/shapely/shapely/issues/1407
    if item_geometry["type"] == "MultiPolygon":
        item_geometry = dict(item_geometry)
        item_geometry["coordinates"] = [
            x for x in item_geometry["coordinates"] if any(x)
        ]

    return shapely.geometry.shape(item_geometry)


    with open("items_s3.json", "r") as f:
    items = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for item in items:
    # Create geometries
    geometry.append(fix_empty_multipolygon(item.pop("geometry")))

    # Create `datetime`/`start_datetime`/`end_datetime` columns
    item["datetime"] = item["properties"]["datetime"]
    item["start_datetime"] = item["properties"].get("start_datetime")
    item["end_datetime"] = item["properties"].get("end_datetime")
    
    # TODO: Maybe split the bbox in minx/miny/maxx/maxY columns

    # Convert Dict and List to String
    for k, v in item.items():
        if isinstance(v, (dict, list)):
            item[k] = json.dumps(v)


gdf = gpd.GeoDataFrame(items, geometry=geometry, crs="WGS84")   
gdf.to_parquet("items.parquet")

In ☝️ I'm creating a column for each top-level STAC object property (not the item `properties`), plus columns for the datetime properties (to ease temporal filtering). Every remaining list or dict value is then JSON-encoded as a string.

Model:

| column          | type     |
| --------------- | -------- |
| type            | str      |
| stac_version    | str      |
| id              | str      |
| properties      | JSON str |
| datetime        | date     |
| start_datetime  | date     |
| end_datetime    | date     |
| links           | JSON str |
| assets          | JSON str |
| bbox            | JSON str |
| stac_extensions | JSON str |
| collection      | str      |
| geometry        | geom     |
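
To illustrate the query capability mentioned above, here is a minimal sketch of reading the file back with a parquet-level filter (assuming `datetime` is stored as an ISO-8601 string, as the snippet above produces; `filters` is forwarded by geopandas to pyarrow):

```python
import geopandas as gpd

# Read back a subset of columns, filtered at the parquet level.
# Since `datetime` is an ISO-8601 string, lexicographic comparison
# matches chronological order.
gdf = gpd.read_parquet(
    "items.parquet",
    columns=["id", "datetime", "geometry"],
    filters=[("datetime", ">=", "2023-01-01T00:00:00Z")],
)
```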

I do the same for the collections:

with open("collections.json", "r") as f:
    collections = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for collection in collections:
    # Create geometries from extent
    geometry.append(
        # The extent is in form of `global bbox`, `small bbox1`, `small bbox 2` , ....
        shapely.geometry.GeometryCollection(
            [
                shapely.geometry.Polygon.from_bounds(*bbox)
                for bbox in collection["extent"]["spatial"]["bbox"]
            ]
        )
    )

    interval = collection["extent"]["temporal"]["interval"]
    # Create `start_datetime`/`end_datetime` columns
    collection["start_datetime"] = collection["extent"]["temporal"]["interval"][0][0] if interval else None
    collection["end_datetime"] = collection["extent"]["temporal"]["interval"][0][1] if interval else None
    
    # Convert Dict to String
    for k, v in collection.items():
        if isinstance(v, (dict, list)):
            collection[k] = json.dumps(v)


gdf = gpd.GeoDataFrame(collections, geometry=geometry, crs="WGS84")
gdf.to_parquet("collections.parquet")
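
And a quick (hypothetical) usage sketch: finding collections whose extent intersects an area of interest, using the `GeometryCollection` column written above:

```python
import geopandas as gpd
import shapely.geometry

# Load the collections file and keep only those whose spatial extent
# intersects an area of interest.
collections_gdf = gpd.read_parquet("collections.parquet")
aoi = shapely.geometry.box(-10.0, 35.0, 5.0, 45.0)
matching = collections_gdf[collections_gdf.intersects(aoi)]
print(matching["id"].tolist())
```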

cc @kylebarron @gadomski

@TomAugspurger (Collaborator)

Can you expand a bit on the reason for JSON-encoding fields like `links`, `assets`, `properties`, etc.?

Working with these nested fields can be a pain, but it seems to me like the tooling around nested dtypes is improving (pandas-dev/pandas#54938, etc.). Being able to filter on, e.g., `properties.eo:cloud_cover` at the parquet level seems pretty useful.
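
For illustration, a sketch of what that could look like if `properties` were stored as a struct column instead of a JSON string (the file name and field are hypothetical; this is not what the snippet above produces):

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical: with a struct-typed `properties` column, the nested field
# can be filtered during the scan instead of JSON-decoding every row.
dataset = ds.dataset("items_nested.parquet", format="parquet")
low_cloud = dataset.to_table(filter=pc.field("properties", "eo:cloud_cover") < 10)
```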

@TomAugspurger (Collaborator) commented Sep 19, 2023

BTW: I really like the idea of defining a data model (or multiple?), maybe as a pyarrow or parquet schema, for this type of data, and properly documenting it.

https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/ and https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/ are some pretty detailed blog posts on making a data model for OpenTelemetry data.
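
As a concrete (purely illustrative) starting point, such a model could be written down as a pyarrow schema; all field choices below are assumptions, not a proposal:

```python
import pyarrow as pa

# Illustrative item schema: typed top-level STAC fields, `properties` as a
# struct, `assets` as a map of structs, `collection` dictionary-encoded.
item_schema = pa.schema(
    [
        ("type", pa.string()),
        ("stac_version", pa.string()),
        ("id", pa.string()),
        ("collection", pa.dictionary(pa.int32(), pa.string())),
        ("datetime", pa.timestamp("us", tz="UTC")),
        ("properties", pa.struct([("eo:cloud_cover", pa.float64())])),
        (
            "assets",
            pa.map_(pa.string(), pa.struct([("href", pa.string()), ("type", pa.string())])),
        ),
    ]
)
```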

@kylebarron (Collaborator)

> BTW: I really like the idea of defining a data model and properly documenting it.

💯 There's a STAC sprint next week that I think this would fit into well. I'm a fan of having arrow-native/parquet-native types for stac-geoparquet. Maybe I'll try to write up a "mini spec" for this with a read and write implementation using pyarrow in Python? Maybe as a PR here? I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory.
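
A quick sketch of the effect on a low-cardinality string column (e.g. `collection`):

```python
import pyarrow as pa

# Dictionary encoding stores each distinct string once, plus compact
# integer codes for the repeated values.
plain = pa.array(["sentinel-2-l2a"] * 1_000_000)
encoded = plain.dictionary_encode()
print(plain.nbytes, encoded.nbytes)  # encoded is a small fraction of plain
```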

@TomAugspurger (Collaborator)

> I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory.

+1 to using arrow directly (and then somehow adding the geoarrow metadata). There are a few places where we have to fix up issues with the object-dtype ndarrays we're getting from pandas.

If we do need geopandas for anything, then we can explore pandas' new-ish support for arrow-backed arrays.
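
e.g. something along these lines (a minimal sketch; `pd.ArrowDtype` is the pandas 2.x entry point for this):

```python
import pandas as pd
import pyarrow as pa

# An arrow-backed string column: stays arrow all the way through instead
# of being converted to an object-dtype ndarray.
s = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
print(s.dtype)  # string[pyarrow]
```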

@stac-utils stac-utils locked and limited conversation to collaborators Sep 19, 2023
@gadomski gadomski converted this issue into discussion #25 Sep 19, 2023
