Feature/update table (#94)
* Refactor h3_utils to use h3ronpy

* Update lib.py for use of h3ronpy

* Update ValueError for invalid fields

* Add documentation on bug

* Fix issue with Point generation

h3ronpy currently has a bug with cells_to_wkb_points, which returns matching coordinates for distinct cells.

* Integrate release from h3ronpy that solves cells_to_wkb_points

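For context, the behavior fixed by these two commits can be sketched as below. This is a minimal check, assuming the Arrow-based vector API of recent h3ronpy releases (the module path and the example cell indexes are assumptions, not taken from this PR):

```python
# Sketch: convert two distinct H3 cells to WKB points and confirm their
# coordinates differ -- the buggy h3ronpy release returned matching
# coordinates for distinct cells. Module path per recent h3ronpy releases
# (an assumption); the cell indexes are arbitrary resolution-10 examples.
import shapely.wkb
from h3ronpy.arrow.vector import cells_to_wkb_points

cells = [0x8A2A1072B59FFFF, 0x8A2A1072B597FFF]  # two distinct cells
points = [shapely.wkb.loads(s.as_py()) for s in cells_to_wkb_points(cells)]
assert points[0].coords[0] != points[1].coords[0], "distinct cells, matching coordinates"
```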
* Remove looping on geometries

* Update user documentation

* Update user doc notebooks (#76)

* create stac catalog from sample kenya data

* testing metadata

* update metadata files

* typo

* Update catalog with link to self and title based on stac-check best practice

* Move METADATA to space2stats_ingest

* fix sources link

* rewrite catalog with self link and item titles

* Rename duplicated variable gdf to adm_gdf

* Corrected the color breaks

* Adapt based on h3ronpy and adapt colormap

* Change formatting, ensure clear runs, and remove unused imports

* Add nbqa pre-commit for notebooks

---------

Co-authored-by: Andres Chamorro <[email protected]>

* Update ingest mechanics

Remove download commands as we move to multiple files approach. Update validation of STAC metadata.

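One way the STAC metadata validation could look is sketched below with pystac — an assumption, since the ingest tool's actual validation logic may differ; the item path is the one used later in this PR:

```python
# Validate a STAC item file against its JSON schemas before ingesting.
# Requires pystac with the validation extra: pip install "pystac[validation]"
import pystac

item = pystac.Item.from_file(
    "space2stats_ingest/METADATA/stac/space2stats/"
    "space2stats_population_2020/space2stats_population_2020.json"
)
item.validate()  # raises pystac.errors.STACValidationError if invalid
print(f"{item.id} is valid STAC")
```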
* Update ingestion logic to support update via a new file

* wip on updating data

Using a database approach to handle the merge causes issues.

* Use arrow for merging data

Still has issues with reading the database table, which implies some performance issues.

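The Arrow-side merge idea can be sketched with pyarrow's table join — illustrative table contents and toy values, with column names matching the test data used later in this PR:

```python
# Join the incoming columns onto the existing data on hex_id, entirely in
# Arrow, instead of merging inside the database.
import pyarrow as pa

existing = pa.table({"hex_id": ["86beabd8fffffff", "86beabdb7ffffff"],
                     "sum_pop_2020": [1200.0, 3400.0]})
incoming = pa.table({"hex_id": ["86beabd8fffffff", "86beabdb7ffffff"],
                     "test_column": [0.42, 0.17]})

merged = existing.join(incoming, keys="hex_id", join_type="left outer")
print(merged.column_names)  # ['hex_id', 'sum_pop_2020', 'test_column']
```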
* Refactor approach to using the database for the join

* Add check for existing column names and update docker database settings

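A sketch of the column-name check follows (a hypothetical implementation, not the package's actual code; the docker settings half of this commit is the compose change shown further down). Table and column names come from elsewhere in this PR:

```python
# Fail fast if a column arriving in the new Parquet file already exists on
# the target table, instead of discovering the clash mid-update.
import psycopg2

conn = psycopg2.connect("postgresql://username:password@localhost:5439/postgres")
with conn.cursor() as cur:
    cur.execute(
        "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
        ("space2stats",),
    )
    existing = {name for (name,) in cur.fetchall()}
conn.close()

incoming = {"test_column"}  # new columns from the Parquet file, minus hex_id
if incoming & existing:
    raise ValueError(f"Columns already exist in space2stats: {incoming & existing}")
```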
* Add rollback logic if update fails and more tests

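The rollback behavior can be sketched as a single transaction — a hypothetical shape; the real implementation and its tests live in the ingest package:

```python
# Apply the whole update in one transaction so a failure part-way through
# leaves the space2stats table untouched.
import psycopg2

conn = psycopg2.connect("postgresql://username:password@localhost:5439/postgres")
try:
    with conn.cursor() as cur:
        cur.execute("ALTER TABLE space2stats ADD COLUMN test_column float8")
        # ... write the merged values here ...
    conn.commit()
except Exception:
    conn.rollback()  # undo the partial update
    raise
finally:
    conn.close()
```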
* Update database documentation for testing with update process

* Add notebook dependency group and update python library example

* Update notebook based on nbqa

* Remove unused dependencies in core

* Update poetry lock

---------

Co-authored-by: Andres Chamorro <[email protected]>
Co-authored-by: Benjamin P. Stewart <[email protected]>
3 people authored Nov 22, 2024
1 parent f559e92 commit 741ed96
Showing 12 changed files with 998 additions and 453 deletions.
24 changes: 19 additions & 5 deletions docker-compose.yaml
```diff
@@ -2,9 +2,6 @@ version: '3'
 
 services:
   database:
-    # at time of writing this, ARM64 is not supported so we make sure to use
-    # a supported platform: https://github.com/postgis/docker-postgis/issues/216
-    # Could possibly switch to https://github.com/vincentsarago/containers
     platform: linux/amd64
     image: postgis/postgis:15-3.4
     environment:
@@ -13,6 +10,23 @@ services:
       - POSTGRES_DB=postgis
     ports:
       - 5439:5432
-    command: postgres -N 500
+    command: >
+      postgres -N 500
+      -c checkpoint_timeout=30min
+      -c synchronous_commit=off
+      -c max_wal_senders=0
+      -c max_connections=8
+      -c shared_buffers=2GB
+      -c effective_cache_size=6GB
+      -c maintenance_work_mem=512MB
+      -c checkpoint_completion_target=0.9
+      -c wal_buffers=16MB
+      -c default_statistics_target=100
+      -c random_page_cost=1.1
+      -c effective_io_concurrency=200
+      -c work_mem=256MB
+      -c huge_pages=off
+      -c min_wal_size=1GB
+      -c max_wal_size=4GB
     volumes:
       - ./.pgdata:/var/lib/postgresql/data
```
50 changes: 29 additions & 21 deletions docs/acceptance/db.md
@@ -54,32 +54,15 @@ You can use the CLI tool for data ingestion. First, ensure you have the required

```bash
poetry install
```

To load a Parquet file into the database, run the following command:

```bash
poetry run space2stats-ingest load \
    "postgresql://username:password@localhost:5439/postgres" \
    "<item_path>" \
    "local.parquet"
```

This replaces the previous S3-based workflow, which fetched the Parquet file from S3, either as a separate `download` step or as part of a combined `download-and-load` command:

```bash
poetry run space2stats-ingest download "s3://<bucket>/space2stats.parquet" --local-path "local.parquet"

poetry run space2stats-ingest download-and-load \
    "s3://<bucket>/space2stats.parquet" \
    "postgresql://username:password@localhost:5439/postgres" \
    "<path>/space2stats.json" \
    --parquet-file "local.parquet"
```

### Database Configuration

Once connected to the database via `psql` or a PostgreSQL client (e.g., `pgAdmin`), execute the following SQL command to create an index on the `space2stats` table:
@@ -110,3 +93,28 @@ SELECT sum_pop_2020 FROM space2stats WHERE hex_id IN ('86beabd8fffffff', '86beab
### Conclusion

Ensure all steps are followed to verify the ETL process, database setup, and data ingestion pipeline. Reach out to the development team for any further assistance or troubleshooting.


#### Updating test

- Spin up database with docker:
```
docker-compose up
```
- Download initial dataset:
```
aws s3 cp s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet .
download: s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet to ./space2stats.parquet
```
- Upload initial dataset:
```
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_population_2020.json space2stats.parquet
```
- Generate second dataset:
```
python space2stats_ingest/METADATA/generate_test_data.py
```
- Upload second dataset:
```
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_reupload_test.json space2stats_test.parquet
```
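After the second upload, a quick sanity check confirms the merge — a sketch, with the table and column names taken from the steps above and connection details as in the docs:

```python
# Both the original population column and the newly merged test column
# should now be queryable from the same table.
import psycopg2

conn = psycopg2.connect("postgresql://username:password@localhost:5439/postgres")
with conn.cursor() as cur:
    cur.execute("SELECT hex_id, sum_pop_2020, test_column FROM space2stats LIMIT 5")
    for row in cur.fetchall():
        print(row)
conn.close()
```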
302 changes: 151 additions & 151 deletions space2stats_api/src/poetry.lock

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions space2stats_ingest/METADATA/generate_test_data.py
@@ -0,0 +1,23 @@
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Load the original Parquet file
input_file = "space2stats.parquet"
table = pq.read_table(input_file)

# Select only the 'hex_id' column
table = table.select(["hex_id"])

# Create the new 'test_column' with random values
num_rows = table.num_rows
test_column = pa.array(np.random.random(size=num_rows), type=pa.float64())

# Add 'test_column' to the table
table = table.append_column("test_column", test_column)

# Save the modified table to a new Parquet file
output_file = "space2stats_test.parquet"
pq.write_table(table, output_file)

print(f"Modified Parquet file saved as {output_file}")
