The goat-data repository holds the scripts and configuration files used to import data from an S3 bucket at goat.cog.sanger.ac.uk into an Elasticsearch datastore that backs the Genomes on a Tree (GoaT) website and API.
The goat-data repository contains import configuration files in goat-data/sources and test configuration files in goat-data/tests. The TSV-format data files associated with these configuration files are hosted in the S3 bucket, which contains a sources directory (s3://sources) with copies of the files used by the latest successful run of the import pipeline and a resources directory (s3://resources) with newly updated files to be imported in the next run of the release pipeline.
New data and configuration files should be added to s3://resources so they can be tested in the next run of the release pipeline. They are incorporated into s3://sources and goat-data/sources only if they import successfully. The release pipeline is robust to failures caused by these new files and will roll back to the files used in a previous successful run, as described below.
Fetches files that can be updated automatically using curl, API scripts and genomehubs parse commands. Updated files are moved to s3://resources. If any fetch fails, the import will fall back to the previous version of the file on S3.
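The fallback behaviour of the fetch step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `fetch_with_fallback` is a hypothetical helper, the `fetch` callable stands in for a curl call, API script or genomehubs parse command, and a plain dictionary stands in for s3://resources.

```python
from typing import Callable, Dict


def fetch_with_fallback(
    name: str,
    fetch: Callable[[str], bytes],
    resources: Dict[str, bytes],
) -> str:
    """Try to refresh one file and report which version the import will use.

    `resources` is an in-memory stand-in for s3://resources (assumption).
    """
    try:
        content = fetch(name)
    except Exception:
        # A failed fetch leaves storage untouched, so the import
        # falls back to the previously stored version of the file.
        return "previous"
    # A successful fetch places the updated file in the resources area.
    resources[name] = content
    return "updated"
```

Because a failed fetch never writes anything, the previous copy of the file remains available to the import without any extra bookkeeping.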
Initialises snapshot repositories and uses taxonomy files from s3://resources to initialise a new release with the genomehubs init command.
Indexes files in subdirectories of goat-data/sources to populate the indexes using the genomehubs index command. For each subdirectory, two rounds of the import may be attempted. Each import is run in a separate temporary working directory, with files copied to destination paths on success.
Before the first import round, a snapshot is made of the current state of all relevant indexes to allow the import to be rolled back on failure.
The first import fetches YAML configuration files from goat-data/sources and s3://resources. For files found in both locations, the version in s3://resources is used in preference.
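The precedence rule for configuration files can be sketched as a simple mapping update. The function name `resolve_configs` and the use of bare filenames as keys are assumptions made for illustration only.

```python
from typing import Dict, Iterable


def resolve_configs(
    repo_configs: Iterable[str],
    resource_configs: Iterable[str],
) -> Dict[str, str]:
    """Map each YAML filename to the location it should be read from."""
    # Start with every config found in the repository.
    chosen = {name: "goat-data/sources" for name in repo_configs}
    # Configs also present in s3://resources take precedence.
    chosen.update({name: "s3://resources" for name in resource_configs})
    return chosen
```

For example, `resolve_configs(["a.yaml", "b.yaml"], ["b.yaml"])` would read `a.yaml` from goat-data/sources and `b.yaml` from s3://resources.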
Each YAML file is parsed to find the associated data file under .file.name. Each data file is fetched from s3://resources if possible, or from s3://sources if no matching file is found in s3://resources. If an index tests directory is listed under .file.tests, the associated directory and any files are fetched from s3://resources or s3://sources. If present, a names directory is fetched using the same pattern. Files and directories retrieved from s3://resources are added to a list of filenames in from_resources.txt.
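The data file lookup described above might look something like the sketch below. It assumes the YAML has already been parsed into a dictionary, treats each S3 directory as a flat set of filenames, and uses a list in place of from_resources.txt; `locate_data_file` is a hypothetical name, and the real bucket layout may differ.

```python
from typing import List, Set


def locate_data_file(
    config: dict,
    resources: Set[str],
    sources: Set[str],
    from_resources: List[str],
) -> str:
    """Return the location of the data file named under .file.name."""
    name = config["file"]["name"]
    if name in resources:
        # Record files taken from s3://resources so they can be
        # promoted later if the import succeeds.
        from_resources.append(name)
        return f"s3://resources/{name}"
    if name in sources:
        # Fall back to the copy used by the last successful release.
        return f"s3://sources/{name}"
    raise FileNotFoundError(name)
```

The same lookup order would apply to the tests and names directories, with each item found in s3://resources appended to the list.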
If the import is successful and all associated index tests pass, YAML files listed in from_resources.txt are copied to the relevant subdirectory of goat-data/sources, while non-YAML files are copied to a dated subdirectory of s3://releases.
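As a rough sketch of this routing step, the function below splits the entries of from_resources.txt by file type. The ISO date format for the dated subdirectory, the placement of the source subdirectory under it, and the name `route_imported_files` are all assumptions, since the exact paths are not specified here.

```python
from datetime import date
from typing import Dict, Iterable


def route_imported_files(
    from_resources: Iterable[str],
    subdirectory: str,
    release_date: date,
) -> Dict[str, str]:
    """Choose a destination for each file that was taken from s3://resources."""
    destinations = {}
    for name in from_resources:
        if name.endswith((".yaml", ".yml")):
            # Configuration files go back into the repository.
            destinations[name] = f"goat-data/sources/{subdirectory}/{name}"
        else:
            # Data files go to a dated release directory (path format assumed).
            destinations[name] = (
                f"s3://releases/{release_date.isoformat()}/{subdirectory}/{name}"
            )
    return destinations
```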
If the import in round 1 fails, the snapshot is restored and the import is restarted using files from goat-data/sources and s3://sources only. These are files that imported successfully in a previous release, so the import is expected to succeed. If this import fails for any sources subdirectory, the current release will fail.
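The two-round logic for a single subdirectory can be summarised in a short sketch. The callables `snapshot`, `restore` and `run_import` are hypothetical placeholders for the snapshot, restore and genomehubs index steps; `run_import` is assumed to return True only when the import and its index tests both succeed.

```python
from typing import Callable


def import_subdirectory(
    run_import: Callable[..., bool],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
) -> str:
    """Import one sources subdirectory, retrying with released files on failure."""
    # Snapshot the relevant indexes so a failed import can be undone.
    snapshot()
    # Round 1: prefer new files from s3://resources.
    if run_import(use_resources=True):
        return "round 1"
    # Round 1 failed: roll back and retry with previously released files only.
    restore()
    if run_import(use_resources=False):
        return "round 2"
    # Files that imported before should import again, so this ends the release.
    raise RuntimeError("import failed with previously released files")
```

Keeping the snapshot and restore steps outside `run_import` means the second round always starts from the same index state as the first.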
After all subdirectories have been imported, changes are committed to a dated branch in the goat-data repository.
Files and analyses are indexed into the file/analysis indexes in parallel with the main data import using the genomehubs index command. This import should be incremental, using the previous release snapshot and only adding files that have been added since the previous release.
Taxon index values are summarised and filled using the genomehubs fill command.
After the import and fill steps have completed, a series of tests is run to check the API and UI responses. Local copies of the API and UI are hosted inside a genomehubs/genomehubs-test Docker container. This container runs API tests against a series of config files inside the goat-data/tests/api directory. UI tests check that it is possible to fetch the reports defined by config files in the goat-data/tests/ui directory. If any test fails, the current release will fail.
If all tests pass, the report images generated by the UI tests are copied to a /projects subdirectory of s3://releases.
If all tests pass, release snapshots of the final index state are copied across and restored on the production server. The content of the previous s3://sources directory is moved to s3://previous and the content of the latest s3://releases directory is copied to s3://sources. Changes in the goat-data release branch are merged into the main branch.
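The final rotation of bucket directories can be illustrated with a small in-memory model. Here a nested dictionary stands in for the S3 bucket and `promote_release` is a hypothetical name; the sketch shows only the order of the two moves, not how they are performed against S3.

```python
from typing import Dict


def promote_release(bucket: Dict[str, dict], release: str) -> None:
    """Rotate directories after a successful release (in-memory model)."""
    # Keep the outgoing sources as the previous release.
    bucket["previous"] = bucket["sources"]
    # Copy the latest release into sources; the release itself is kept.
    bucket["sources"] = dict(bucket["releases"][release])
```

Moving the old sources aside before copying means the files from the last successful release remain available in s3://previous.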