Scripts to be used by MINT or other systems to download new datasets as they become available and register them in the MINT Data Catalog.
git clone https://github.com/mintproject/MINT-Data-Sync.git
cd MINT-Data-Sync
docker build -t mint-data-sync .
docker run -e "earthdata_username=REPLACE_ME" -e "earthdata_password=REPLACE_ME" -e "mint_data_username=REPLACE_ME" -e "mint_data_password=REPLACE_ME" -it --rm mint-data-sync:latest
Currently, we sync GLDAS data, which requires Earthdata login credentials; hence the need for the earthdata_username and earthdata_password credentials above.
By default, the above container starts a cron process that triggers the sync.py script every day at 01:00. To change that schedule, edit the cronjobs file and rebuild the Docker image.
To add a new data source, you would need to write a scraper that checks the source for data availability. Assuming that the scraper is implemented, the general data sync process goes as follows:
1) Check data source for the latest data available (by e.g., temporal coverage)
2) Check MINT data catalog for the latest available data
3) If there is a mismatch, generate a list of missing resources based on 1) and 2)
4) [Optionally] Download missing resources
5) [Optionally] Upload them to MINT data storage
6) Generate appropriate resource metadata
7) Register missing resources in MINT data catalog
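
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code in sync.py: it assumes one resource per day, and every name in it (missing_days, build_metadata, sync, the callback parameters, and the metadata keys) is hypothetical rather than part of the MINT Data Catalog API.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Optional


def missing_days(source_latest: date, catalog_latest: date) -> List[date]:
    """Step 3: days available at the source but not yet in the catalog."""
    gap = (source_latest - catalog_latest).days
    # An empty list is returned when the catalog is already up to date.
    return [catalog_latest + timedelta(days=i) for i in range(1, gap + 1)]


def build_metadata(day: date, data_url: str) -> Dict[str, str]:
    """Step 6: describe one resource (assumed keys, not the real schema)."""
    return {
        "name": data_url.rsplit("/", 1)[-1],
        "data_url": data_url,
        "start_date": day.isoformat(),
        "end_date": day.isoformat(),
    }


def sync(
    source_latest: Callable[[], date],      # step 1: the scraper
    catalog_latest: Callable[[], date],     # step 2: catalog lookup
    source_url: Callable[[date], str],      # where a given day's file lives
    register: Callable[[Dict[str, str]], None],       # step 7
    transfer: Optional[Callable[[str], str]] = None,  # steps 4-5 (optional)
) -> int:
    """Run one sync pass and return the number of registered resources."""
    count = 0
    for day in missing_days(source_latest(), catalog_latest()):
        url = source_url(day)
        if transfer is not None:
            # Download from the source, upload to storage, and use the new URL.
            url = transfer(url)
        register(build_metadata(day, url))
        count += 1
    return count


if __name__ == "__main__":
    registered: List[Dict[str, str]] = []
    n = sync(
        source_latest=lambda: date(2020, 1, 5),
        catalog_latest=lambda: date(2020, 1, 3),
        source_url=lambda d: f"https://example.org/data/{d.isoformat()}.nc",
        register=registered.append,
    )
    print(n, [m["start_date"] for m in registered])
```

Passing the scraper, the catalog lookup, and the registration call in as callbacks keeps the gap computation independent of any particular data source, so adding a new source would mean supplying a new scraper rather than changing the loop.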