Commit

Merge pull request #11 from gzt5142/gt-010-fetch-data
Fetch data source
Gene Trantham authored Dec 28, 2022
2 parents 1147e4a + a2ccb3b commit e14486a
Showing 15 changed files with 4,266 additions and 41 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
.nox
.pytest_cache
.ipynb_checkpoints
*.geojson
__pycache__
poetry.lock
docs/_build
15 changes: 15 additions & 0 deletions docs/conf.py
@@ -0,0 +1,15 @@
"""
Configuration for the Sphinx documentation generator
"""
project = "NLDI Crawler"
author = "USGS"
copyright = f"2022, {author}"
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinx_autodoc_typehints",
    "myst_parser",
    "sphinx_rtd_theme",
    "sphinxcontrib.mermaid",
]
html_theme = "sphinx_rtd_theme"
8 changes: 8 additions & 0 deletions docs/index.rst
@@ -0,0 +1,8 @@
SS-Delineate Documentation
==========================

.. toctree::
   :hidden:

   source_table
   workflow
51 changes: 51 additions & 0 deletions docs/source_table.md
@@ -0,0 +1,51 @@
# Sources Table

Annotations for the `crawler_source` table, which holds information for finding and processing feature sources:

| Type | Column Name | Description |
|------|-------------| ------------|
| integer | crawler_source_id | The unique identifier to differentiate sources in this table. |
| string | source_name | A human-readable, friendly descriptor for the data source. |
| string | source_suffix | This string is used to build table names internally. It should be a unique string with no special characters. |
| string | source_uri | The web address from which feature data is retrieved. |
| string | feature_id | The GeoJSON returned from `source_uri` includes feature properties/attributes. This field names the property that uniquely identifies each feature within the feature collection. It is treated as the `KEY` within the feature collection. |
| string | feature_name | The property name within the returned GeoJSON which holds the name of the feature. |
| string | feature_uri | The property name within the returned GeoJSON which holds the URL by which a feature can be accessed directly. |
| string | feature_reach | The property name within the returned GeoJSON which holds the reach identifier. |
| string | feature_measure | The property name within the returned GeoJSON which holds the M-value along the `feature_reach` where this feature can be found. |
| string | ingest_type | The type of feature to be parsed. This string should be one of [ `reach` , `point` ]. |
| string | feature_type | Unknown. This string is one of [ `hydrolocation` , `point` , `varies` ]. |
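
As a rough illustration of how one row of this table might be held in code, the sketch below models a source as a plain Python dataclass. The class name `SourceRecord` and the validation step are assumptions made for this example and are not the crawler's actual ORM class; the field names and allowed values follow the columns described above.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values, taken from the column descriptions above.
INGEST_TYPES = {"reach", "point"}
FEATURE_TYPES = {"hydrolocation", "point", "varies"}


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical in-memory view of one `crawler_source` row."""

    crawler_source_id: int
    source_name: str
    source_suffix: str
    source_uri: str
    feature_id: str
    feature_name: str
    feature_uri: str
    feature_reach: Optional[str]    # may be unset for point sources
    feature_measure: Optional[str]  # may be unset for point sources
    ingest_type: str
    feature_type: str

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies.
        if self.ingest_type not in INGEST_TYPES:
            raise ValueError(f"unexpected ingest_type: {self.ingest_type!r}")
        if self.feature_type not in FEATURE_TYPES:
            raise ValueError(f"unexpected feature_type: {self.feature_type!r}")
```

Validating `ingest_type` and `feature_type` at construction time is a design choice of this sketch; the real crawler may enforce these constraints elsewhere, or not at all.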


## Example

```sql
SELECT * FROM nldi_data.crawler_source WHERE crawler_source_id = 10;
```
Source number `10` contains the following data:

| Column | Value |
|--------|-------|
| crawler_source_id | 10 |
| source_name | Vigil Network Data |
| source_suffix | vigil |
| source_uri | https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=vigil.geojson |
| feature_id | SBID |
| feature_name | Site Name |
| feature_uri | SBURL |
| feature_reach | REACHCODE |
| feature_measure | REACH_measure |
| ingest_type | reach |
| feature_type | hydrolocation |

If we fetch the GeoJSON for this source, we see that the feature table looks like this:

| SBID | Site Name | SBURL | REACHCODE | REACH_measure | Location | geometry | ... |
|------|-----------|-------|-----------|---------------|----------|----------| ----|
|5fe395bbd34ea5387deb4950 | Aching Shoulder Slope, New Mexico, USA | https://www.sciencebase.gov/catalog/item/5fe395bbd34ea5387deb4950 | null | null | Mitten Rock, New Mexico USA | Point() | ... |
|5fe39807d34ea5387deb4970 | Armells Creek, Montana, USA | https://www.sciencebase.gov/catalog/item/5fe39807d34ea5387deb4970 | 10100001000709 | 90.193048735368549 | Yellowstone River Basin, Southeastern Montana, USA | Point() | ... |
|...|
|...|
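
To make the relationship between the two tables concrete, here is a minimal sketch of how the column values in a `crawler_source` row could be used to pull the matching properties out of one GeoJSON feature. The function name `extract_feature_fields` and the output keys (`identifier`, `name`, `uri`, `reachcode`, `measure`) are hypothetical, chosen only for this illustration.

```python
from typing import Any, Dict, Optional


def extract_feature_fields(
    feature: Dict[str, Any], source: Dict[str, Optional[str]]
) -> Dict[str, Any]:
    """Look up the properties that a source row points at.

    `source` maps crawler_source column names to GeoJSON property names.
    Missing or null properties come back as None.
    """
    props = feature.get("properties") or {}

    def lookup(column: str) -> Any:
        prop_name = source.get(column)
        return props.get(prop_name) if prop_name else None

    return {
        "identifier": lookup("feature_id"),
        "name": lookup("feature_name"),
        "uri": lookup("feature_uri"),
        "reachcode": lookup("feature_reach"),
        "measure": lookup("feature_measure"),
    }


if __name__ == "__main__":
    # Column values from source 10 as listed in the example above.
    vigil = {
        "feature_id": "SBID",
        "feature_name": "Site Name",
        "feature_uri": "SBURL",
        "feature_reach": "REACHCODE",
        "feature_measure": "REACH_measure",
    }
    # One feature from the table above; the string type of REACHCODE
    # is an assumption, and the geometry is omitted.
    feature = {
        "type": "Feature",
        "properties": {
            "SBID": "5fe39807d34ea5387deb4970",
            "Site Name": "Armells Creek, Montana, USA",
            "SBURL": "https://www.sciencebase.gov/catalog/item/5fe39807d34ea5387deb4970",
            "REACHCODE": "10100001000709",
            "REACH_measure": 90.193048735368549,
        },
        "geometry": None,
    }
    print(extract_feature_fields(feature, vigil))
```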



47 changes: 47 additions & 0 deletions docs/workflow.md
@@ -0,0 +1,47 @@
# Workflow

The crawler CLI bulk-downloads feature data from pre-defined sources. The sequence is as follows:

## Sequence Diagram

```mermaid
%%{init: {
"theme": "default",
"mid-width": 2500,
"max-width": 5000,
"sequence": {"showSequenceNumbers": true }
}
}%%
sequenceDiagram
actor CLI
CLI->>Crawler: launch
Crawler->>+NLDI-DB: Get Source Information
Note left of NLDI-DB: SELECT * FROM nldi_data.crawler_source
NLDI-DB-->>-Crawler: Sources table
Crawler->>+FeatureSource: Request Features
Note left of FeatureSource: HTTP GET ${crawler_source.source_uri}
FeatureSource-->>-Crawler: GeoJSON FeatureCollection
loop foreach feature in Collection
Crawler-->>+Crawler: ORM
Note right of Crawler: Parses and maps feature to SQL
Crawler->>-NLDI-DB: Add to feature table
Note left of NLDI-DB: INSERT INTO nldi_data.features
end
Crawler->>NLDI-DB: Relate Features
%% NLDI-DB-->>-Crawler: Success
```

## Annotations

1) Launch the CLI tool.
2) Connect to the NLDI master database and request the list of configured feature sources.
3) The database returns the list of feature sources. The crawler can either:
   * List all sources and exit
   * Proceed to 'crawl' one of the sources in the table
4) For the identified feature source, make a GET request via HTTP. The URI is taken from the `crawler_source` table.
5) The feature source returns GeoJSON. Among the returned data is a list of 'features'.
6) **[Per-Feature]** Use the ORM to map the feature data to the schema reflected from the `features` table.
7) **[Per-Feature]** Insert the new feature into the master NLDI database.
8) "Relate" features: build the relationships matching features to their adjacent features in the NLDI topology.
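
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the crawler's implementation: it uses the standard-library `sqlite3` module as a stand-in for the NLDI database, plain SQL in place of the ORM, and a hypothetical `feature` table with four columns. The function names are assumptions, and step 8 (relating features) is left out.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List
from urllib.request import urlopen

GeoJSON = Dict[str, Any]


def fetch_geojson(uri: str) -> GeoJSON:
    """Steps 4-5: HTTP GET the source URI and decode the GeoJSON body."""
    with urlopen(uri) as response:
        return json.load(response)


def list_sources(db: sqlite3.Connection) -> List[sqlite3.Row]:
    """Steps 2-3: ask the database for the configured feature sources."""
    query = "SELECT * FROM crawler_source ORDER BY crawler_source_id"
    return db.execute(query).fetchall()


def crawl_source(
    db: sqlite3.Connection,
    source: sqlite3.Row,
    fetch: Callable[[str], GeoJSON] = fetch_geojson,
) -> int:
    """Steps 4-7: download one source and insert each of its features.

    Expects `db.row_factory = sqlite3.Row` so columns can be read by name.
    Returns the number of features inserted.
    """
    collection = fetch(source["source_uri"])
    inserted = 0
    for feature in collection.get("features", []):
        props = feature.get("properties") or {}
        # Step 6: map GeoJSON properties onto the (hypothetical) table columns.
        row = (
            source["crawler_source_id"],
            props.get(source["feature_id"]),
            props.get(source["feature_name"]),
            props.get(source["feature_uri"]),
        )
        # Step 7: insert the mapped feature.
        db.execute(
            "INSERT INTO feature (crawler_source_id, identifier, name, uri) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        inserted += 1
    db.commit()
    return inserted
```

Passing `fetch` as a parameter lets the download step be swapped for a function that returns a canned `FeatureCollection`, which keeps the loop testable without network access.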
148 changes: 128 additions & 20 deletions notebooks/ORM.ipynb
@@ -10,7 +10,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
@@ -19,7 +19,7 @@
"'2.0.0b1'"
]
},
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
@@ -31,7 +31,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -44,7 +44,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -57,7 +57,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -67,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -89,7 +89,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -99,32 +99,140 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1 :: Water Quality Portal :: https://www.waterqualitydata.us/data/Station/sea...\n",
" 2 :: HUC12 Pour Points :: https://www.sciencebase.gov/catalogMaps/mapping/...\n",
" 5 :: NWIS Surface Water Sites :: https://www.sciencebase.gov/catalog/file/get/60c...\n",
" 6 :: Water Data Exchange 2.0 Sites :: https://www.hydroshare.org/resource/5f665b7b82d7...\n",
" 7 :: geoconnex.us reference gages :: https://www.hydroshare.org/resource/3295a17b4cc2...\n",
" 8 :: Streamgage catalog for CA SB19 :: https://sb19.linked-data.internetofwater.dev/col...\n",
" 9 :: USGS Geospatial Fabric V1.1 Poin :: https://www.sciencebase.gov/catalogMaps/mapping/...\n",
"10 :: Vigil Network Data :: https://www.sciencebase.gov/catalog/file/get/60c...\n",
"11 :: NWIS Groundwater Sites :: https://www.sciencebase.gov/catalog/file/get/60c...\n",
"12 :: New Mexico Water Data Initative :: https://locations.newmexicowaterdata.org/collect...\n",
"13 :: geoconnex contribution demo site :: https://geoconnex-demo-pages.internetofwater.dev...\n"
" 1 :: Water Quality Portal \n",
"\t Source Suffix: WQP\n",
"\t Source URI: https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no\n",
"\t Feature ID: MonitoringLocationIdentifier\n",
"\t Feature Name: MonitoringLocationName\n",
"\t Feature URI: siteUrl\n",
"\t Feature Reach: None\n",
"\t Feature Measure:None\n",
"\t Ingest Type: point\n",
"\t Feature Type varies\n",
" 2 :: HUC12 Pour Points \n",
"\t Source Suffix: huc12pp\n",
"\t Source URI: https://www.sciencebase.gov/catalogMaps/mapping/ows/57336b02e4b0dae0d5dd619a?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json\n",
"\t Feature ID: HUC_12\n",
"\t Feature Name: HUC_12\n",
"\t Feature URI: HUC_12\n",
"\t Feature Reach: None\n",
"\t Feature Measure:None\n",
"\t Ingest Type: point\n",
"\t Feature Type hydrolocation\n",
" 5 :: NWIS Surface Water Sites \n",
"\t Source Suffix: nwissite\n",
"\t Source URI: https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson\n",
"\t Feature ID: provider_id\n",
"\t Feature Name: name\n",
"\t Feature URI: subjectOf\n",
"\t Feature Reach: nhdpv2_REACHCODE\n",
"\t Feature Measure:nhdpv2_REACH_measure\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n",
" 6 :: Water Data Exchange 2.0 Sites \n",
"\t Source Suffix: wade\n",
"\t Source URI: https://www.hydroshare.org/resource/5f665b7b82d74476930712f7e423a0d2/data/contents/wade.geojson\n",
"\t Feature ID: feature_id\n",
"\t Feature Name: feature_name\n",
"\t Feature URI: feature_uri\n",
"\t Feature Reach: None\n",
"\t Feature Measure:None\n",
"\t Ingest Type: point\n",
"\t Feature Type varies\n",
" 7 :: geoconnex.us reference gages \n",
"\t Source Suffix: ref_gage\n",
"\t Source URI: https://www.hydroshare.org/resource/3295a17b4cc24d34bd6a5c5aaf753c50/data/contents/nldi_gages.geojson\n",
"\t Feature ID: id\n",
"\t Feature Name: name\n",
"\t Feature URI: subjectOf\n",
"\t Feature Reach: nhdpv2_REACHCODE\n",
"\t Feature Measure:nhdpv2_REACH_measure\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n",
" 8 :: Streamgage catalog for CA SB19 \n",
"\t Source Suffix: ca_gages\n",
"\t Source URI: https://sb19.linked-data.internetofwater.dev/collections/ca_gages/items?f=json&limit=10000\n",
"\t Feature ID: site_id\n",
"\t Feature Name: sitename\n",
"\t Feature URI: uri\n",
"\t Feature Reach: rchcd_medres\n",
"\t Feature Measure:reach_measure\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n",
" 9 :: USGS Geospatial Fabric V1.1 Poin\n",
"\t Source Suffix: gfv11_pois\n",
"\t Source URI: https://www.sciencebase.gov/catalogMaps/mapping/ows/609c8a63d34ea221ce3acfd3?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb::gfv11&outputFormat=json\n",
"\t Feature ID: prvdr_d\n",
"\t Feature Name: name\n",
"\t Feature URI: uri\n",
"\t Feature Reach: n2_REACHC\n",
"\t Feature Measure:n2_REACH_\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n",
"10 :: Vigil Network Data \n",
"\t Source Suffix: vigil\n",
"\t Source URI: https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=vigil.geojson\n",
"\t Feature ID: SBID\n",
"\t Feature Name: Site Name\n",
"\t Feature URI: SBURL\n",
"\t Feature Reach: nhdpv2_REACHCODE\n",
"\t Feature Measure:nhdpv2_REACH_measure\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n",
"11 :: NWIS Groundwater Sites \n",
"\t Source Suffix: nwisgw\n",
"\t Source URI: https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=nwis_wells.geojson\n",
"\t Feature ID: provider_id\n",
"\t Feature Name: name\n",
"\t Feature URI: subjectOf\n",
"\t Feature Reach: None\n",
"\t Feature Measure:None\n",
"\t Ingest Type: point\n",
"\t Feature Type point\n",
"12 :: New Mexico Water Data Initative \n",
"\t Source Suffix: nmwdi-st\n",
"\t Source URI: https://locations.newmexicowaterdata.org/collections/Things/items?f=json&limit=100000\n",
"\t Feature ID: id\n",
"\t Feature Name: name\n",
"\t Feature URI: geoconnex\n",
"\t Feature Reach: None\n",
"\t Feature Measure:None\n",
"\t Ingest Type: point\n",
"\t Feature Type point\n",
"13 :: geoconnex contribution demo site\n",
"\t Source Suffix: geoconnex-demo\n",
"\t Source URI: https://geoconnex-demo-pages.internetofwater.dev/collections/demo-gpkg/items?f=json&limit=10000\n",
"\t Feature ID: id\n",
"\t Feature Name: GNIS_NAME\n",
"\t Feature URI: uri\n",
"\t Feature Reach: NHDPv2ReachCode\n",
"\t Feature Measure:NHDPv2Measure\n",
"\t Ingest Type: reach\n",
"\t Feature Type hydrolocation\n"
]
}
],
"source": [
"stmt = select(CrawlerSource).order_by(CrawlerSource.crawler_source_id) #.where(CrawlerSource.crawler_source_id == 1)\n",
"with Session(eng) as session:\n",
" for source in session.scalars(stmt):\n",
" print(f\"{source.crawler_source_id:2} :: {source.source_name[0:32]:32} :: {source.source_uri[0:48]:48}...\")"
" print(f\"{source.crawler_source_id:2} :: {source.source_name[0:32]:32}\")\n",
" print(f\"\\t Source Suffix: {source.source_suffix}\")\n",
" print(f\"\\t Source URI: {source.source_uri}\")\n",
" print(f\"\\t Feature ID: {source.feature_id}\") \n",
" print(f\"\\t Feature Name: {source.feature_name}\")\n",
" print(f\"\\t Feature URI: {source.feature_uri}\") \n",
" print(f\"\\t Feature Reach: {source.feature_reach}\") \n",
" print(f\"\\t Feature Measure:{source.feature_measure}\") \n",
" print(f\"\\t Ingest Type: {source.ingest_type}\")\n",
" print(f\"\\t Feature Type {source.feature_type}\")"
]
},
{