goat-data

The goat-data repository holds the scripts and configuration files used to import data from an S3 bucket at goat.cog.sanger.ac.uk into an Elasticsearch datastore that backs the Genomes on a Tree (GoaT) website and API.

File locations

The goat-data repository contains import configuration files in goat-data/sources and test configuration files in goat-data/tests. The TSV-format data files associated with these configuration files are hosted in the S3 bucket, which contains a sources directory (s3://sources) with copies of the files used by the latest successful run of the import pipeline and a resources directory (s3://resources) with newly updated files to be imported in the next run of the release pipeline.

New data and configuration files should be added to s3://resources so they can be tested in the next run of the release pipeline. They are incorporated into s3://sources and goat-data/sources only if they can be imported successfully. The release pipeline is robust to failures caused by these new files and will roll back to files used in a previous successful run, as described below.

Release pipeline

1. Fetch resources

Fetches files that can be updated automatically using curl, API scripts and genomehubs parse commands. Updated files are moved to s3://resources. If a fetch fails, the import falls back to the previous version of that file on S3.
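The fallback behaviour of the fetch step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `fetch_resources` and the per-file fetcher callables are hypothetical names, and the sketch assumes that a failed fetch simply uploads nothing, so the copy already on S3 stays in place.

```python
from typing import Callable, Dict, List, Tuple


def fetch_resources(
    fetchers: Dict[str, Callable[[], bytes]],
) -> Tuple[Dict[str, bytes], List[str]]:
    """Run each fetcher; return the updated files and the names that failed."""
    updated: Dict[str, bytes] = {}
    fallbacks: List[str] = []
    for name, fetch in fetchers.items():
        try:
            updated[name] = fetch()
        except Exception:
            # Assumption: nothing is uploaded for this file, so the
            # previous version on S3 is the one the import will use.
            fallbacks.append(name)
    return updated, fallbacks
```

Only the entries in `updated` would then be moved to s3://resources; the names in `fallbacks` are worth logging so that a stale input is visible in the release report.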

2. Init release

Initialises snapshot repositories and uses taxonomy files from s3://resources to initialise a new release with the genomehubs init command.

3. Index directories

Indexes files in subdirectories of goat-data/sources to populate the indexes using the genomehubs index command. For each subdirectory, two rounds of the import may be attempted. Each import is run in a separate temporary working directory, with files copied to destination paths on success.

3.1. Index files from s3://resources

Before the first import round, a snapshot is made of the current state of all relevant indexes to allow the import to be rolled back on failure.

The first import fetches YAML configuration files from goat-data/sources and s3://resources. For files found in both locations, the version in s3://resources is used in preference.

Each YAML is parsed to find the associated data file under .file.name. Each data file is fetched from s3://resources if possible or from s3://sources if no matching file is found in s3://resources. If an index tests directory is listed under .file.tests, the associated directory and any files are fetched from s3://resources or s3://sources. If present, a names directory is fetched using the same pattern. Files/directories retrieved from s3://resources are added to a list of filenames in from_resources.txt.
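The precedence rules for the first import round can be sketched in a few small functions. The function names are hypothetical, directory listings are modelled as plain sets of file names, and each YAML is assumed to be parsed already into a dict with a `file.name` key. Tests and names directories are omitted for brevity, but would follow the same lookup order.

```python
from typing import Dict, List, Set, Tuple

RESOURCES = "s3://resources"
SOURCES = "s3://sources"
REPO = "goat-data/sources"


def pick_config_origins(in_repo: Set[str], in_resources: Set[str]) -> Dict[str, str]:
    """Map each YAML name to its origin, preferring s3://resources."""
    origins = {name: REPO for name in in_repo}
    # Entries from s3://resources overwrite repository entries of the same name.
    origins.update({name: RESOURCES for name in in_resources})
    return origins


def pick_data_origin(
    config: dict, in_resources: Set[str], in_sources: Set[str]
) -> Tuple[str, str]:
    """Locate the data file named under .file.name, trying s3://resources first."""
    name = config["file"]["name"]
    if name in in_resources:
        return name, RESOURCES
    if name in in_sources:
        return name, SOURCES
    raise FileNotFoundError(f"{name} not found in {RESOURCES} or {SOURCES}")


def list_from_resources(*origin_maps: Dict[str, str]) -> List[str]:
    """Collect the names that would be written to from_resources.txt."""
    return sorted(
        name
        for origins in origin_maps
        for name, origin in origins.items()
        if origin == RESOURCES
    )
```

A short usage example with made-up file names:

```python
configs = pick_config_origins({"a.yaml", "b.yaml"}, {"b.yaml"})
name, origin = pick_data_origin(
    {"file": {"name": "b.tsv"}}, {"b.tsv"}, {"a.tsv", "b.tsv"}
)
print(list_from_resources(configs, {name: origin}))
```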

If the import is successful, and all associated index tests pass, YAML files listed in from_resources.txt are copied to the relevant subdirectory of goat-data/sources, while non-YAML files are copied to a dated subdirectory of s3://releases.
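As an illustration of how the entries in from_resources.txt might be routed after a successful import, the sketch below splits them into YAML files (destined for goat-data/sources) and everything else (destined for the dated subdirectory of s3://releases). The helper name and the assumption that YAML files are recognised by a `.yaml` or `.yml` suffix are not taken from the pipeline itself.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

# Assumption: YAML files are identified by their file extension.
YAML_SUFFIXES = {".yaml", ".yml"}


def split_promoted(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split from_resources.txt entries into (YAML files, other files)."""
    to_repo: List[str] = []
    to_releases: List[str] = []
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # ignore blank lines
        if PurePosixPath(name).suffix.lower() in YAML_SUFFIXES:
            to_repo.append(name)
        else:
            to_releases.append(name)
    return to_repo, to_releases
```

The function accepts any iterable of lines, so an open file handle for from_resources.txt can be passed directly.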

3.2. Index files from s3://sources

If the import in round 1 fails, the snapshot is restored and the import is restarted using files from goat-data/sources and s3://sources only. These are files that imported successfully in a previous release so the import is expected to succeed. If this import fails for any sources subdirectory, the current release will fail.
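The two-round control flow for a single subdirectory can be summarised as below. This is a sketch under stated assumptions: the three callables are hypothetical stand-ins for the snapshot, restore and import steps, and `run_import` is assumed to return `True` only when both the import and its index tests succeed.

```python
from typing import Callable


def import_subdirectory(
    take_snapshot: Callable[[], None],
    restore_snapshot: Callable[[], None],
    run_import: Callable[[bool], bool],
) -> str:
    """Return which round succeeded; raise if both rounds fail."""
    take_snapshot()
    # Round 1: prefer files from s3://resources.
    if run_import(True):
        return "resources"
    # Round 1 failed: roll the indexes back before trying again.
    restore_snapshot()
    # Round 2: use only goat-data/sources and s3://sources.
    if run_import(False):
        return "sources"
    raise RuntimeError("import failed with previously released files")
```

Raising on a second failure mirrors the rule that the whole release fails if any subdirectory cannot be imported from previously released files.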

3.3. Post-import steps

After all subdirectories have been imported, changes are committed to a dated branch in the goat-data repository.

4. Index files

Files and analyses are indexed into the file/analysis indexes in parallel with the main data import using the genomehubs index command. This import should be incremental, starting from the previous release snapshot and adding only files that are new since the previous release.

5. Fill values

Taxon index values are summarised and filled using the genomehubs fill command.

6. API/UI tests

After the import and fill steps have completed, a series of tests is run to check the API and UI responses. Local copies of the API and UI are hosted inside a genomehubs/genomehubs-test Docker container. This container runs API tests against a series of config files in the goat-data/tests/api directory. UI tests check that it is possible to fetch the reports defined by config files in the goat-data/tests/ui directory. If any test fails, the current release will fail.

If all tests pass, the report images generated by the UI tests are copied to a /projects subdirectory of s3://releases.

7. Updating the latest GoaT instance

If all tests pass, release snapshots of the final index state are copied across and restored on the production server. The content of the previous s3://sources directory is moved to s3://previous and the content of the latest s3://releases directory is copied to s3://sources. Changes in the goat-data release branch are merged into the main branch.
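The directory rotation in this step can be modelled in memory as follows. The function is a hypothetical illustration: each directory is represented as a dict of file name to content, and the sketch assumes that s3://previous is replaced outright rather than merged with its earlier contents.

```python
from typing import Dict

Files = Dict[str, bytes]


def promote_release(sources: Files, latest_release: Files) -> Dict[str, Files]:
    """Return the new contents of s3://previous and s3://sources."""
    return {
        # The outgoing sources are preserved before being overwritten.
        "previous": dict(sources),
        # Files from the release that just passed become the new sources.
        "sources": dict(latest_release),
    }
```

Copying rather than aliasing the dicts keeps the example free of side effects, which loosely mirrors the move-then-copy order described above.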