docs: make another pass over docs
ivan-aksamentov committed Jan 5, 2024
1 parent 609d7c6 commit c6e5207
Showing 3 changed files with 77 additions and 28 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -1,12 +1,12 @@
# Nextclade datasets

This repository contains data and tools to maintain Nextclade datasets.
This repository contains Nextclade datasets and tools to maintain them.

Documentation:

- [Nextclade dataset curation documentation](https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-curation-guide.md) - if you have a custom Nextclade dataset or want to create one
- [Nextclade dataset curation guide](https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-curation-guide.md) - if you have a custom Nextclade dataset, or want to create one, or to contribute it to the Nextclade dataset collection.

- [Nextclade dataset server maintenance documentation](https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-server-maintenance.md) - if you are maintainer of this repository or want to deploy your own dataset sever
- [Nextclade dataset server maintenance guide](https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-server-maintenance.md) - if you are a maintainer of the official Nextclade dataset server or want to deploy your own dataset server.

Additional links:

60 changes: 42 additions & 18 deletions docs/dataset-curation-guide.md
@@ -1,16 +1,16 @@
# Nextclade dataset curation guide

This guide explains how to create, update and test Nextclade datasets.
This guide explains how to create, update and test Nextclade datasets, as well as how to contribute them to the official Nextclade dataset collection.

> ⚠️ If you are a user of Nextclade CLI or Nextclade Web and looking for documentation on how to use Nextclade, see [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/index.html) instead.
> ⚠️ If you are a user of Nextclade Web or Nextclade CLI and looking for documentation on how to use Nextclade, see [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/index.html) instead.
> ⚠️ If you are looking for Nextclade software developer documentation, see [Nextclade developer guide](https://github.com/nextstrain/nextclade/blob/master/docs/dev/developer-guide.md) instead.
> ⚠️ This guide serves for advanced Nextclade users and enthusiasts who want to create and maintain their own Nextclade datasets, e.g. to add a yet unsupported pathogen or strain. It assumes basic familiarity with Nextclade CLI and Nextclade Web and some experience with different datasets as a user. If you are not yet comfortable using Nextclade and want to learn more about Nextclade datasets, please refer to the [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/) first.
## Basic principles

Nextclade software is built to be agnostic to pathogens it analyzes. Instead, the information about particularities of certain pathogens is provided in the form of so-called Nextclade datasets. A Nextclade dataset is a set of predefined files (in a directory or a zip archive) which adds support for a particular pathogen or a strain to Nextclade CLI and Nextclade Web.
Nextclade software is built to be agnostic to pathogens it analyzes. Instead, the information about particularities of certain pathogens is provided in the form of so-called Nextclade datasets. A Nextclade dataset is a set of predefined files (in a directory or in a zip archive) which adds support for a particular pathogen or a strain to Nextclade CLI and Nextclade Web.

In this repository:

@@ -22,49 +22,73 @@ In this repository:

> ⚠️ Never modify `data_output/` directory! All manual changes will be overwritten by the automation.
- The contents of `data_output/` directory is deployed to the production dataset servers (see [Dataset server maintenance guide](dataset-server-maintenance.md)). The GitHub URL to this directory can also be used as a temporary substitute for a dataset server (GitHub serves data for us in this case).
- The contents of the `data_output/` directory are deployed to the production dataset servers (see [Dataset server maintenance guide](dataset-server-maintenance.md)), which makes them available to all Nextclade Web and Nextclade CLI users. The GitHub URL to this directory can also be used as a temporary substitute for a dataset server (in this case the GitHub website acts as a server for us).

## Curating datasets

### Migrating from v2 to v3

If you already have a dataset for Nextclade v2 and want to upgrade it to Nextclade v3, see instructions in the [Dataset migration guide](migration-guide-v3.md).

### Obtaining source code of this repository

We use GitHub pull requests to manage contributions to Nextclade datasets.

In order to add or modify datasets you will need to have a local copy of the nextstrain/nextclade_data GitHub repository on your computer, make the desired changes, commit and push the changes to a new git branch, and submit a pull request. The pull request will be reviewed by Nextclade maintainers and considered for inclusion in the Nextclade dataset collection.

Make sure you have [git](https://git-scm.com/) installed, have an account on [GitHub](https://github.com) and can pull and push code from GitHub repositories.

[Make a fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) of the [nextstrain/nextclade_data origin repository](https://github.com/nextstrain/nextclade_data) and clone your forked repository:

```bash
git clone git@github.com:<your_github_username>/nextclade_data
```

Refer to [GitHub documentation "Contributing to projects"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) for more details.

> 💡 Make sure you [keep your local code up to date](https://github.com/git-guides/git-pull) with the origin repo, [especially if it's forked](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork).
> 💡 If you are a member of the Nextstrain team, then you don't need a fork, and you can contribute directly to the origin repository `nextstrain/nextclade_data`. Nonetheless, please still submit a pull request for review, rather than pushing changes to branches directly.
### Adding a new dataset

This section describes steps to perform when you want to add a dataset for a new pathogen or a strain.
This section describes a sequence of steps to add a new Nextclade dataset for a pathogen or a strain. It assumes that you have a local copy of the nextstrain/nextclade_data repo available.

- Optionally, [create a new git branch](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches). Give it a name that briefly explains planned additions. Refer to documentation of git and of GitHub for more details.

- Add a directory under `data/community/<dataset_path>`.

> ⚠️ We will discuss some important considerations about dataset paths in the [Dataset paths](#dataset-paths) section later on. If you want to submit your dataset to the community dataset collection, please follow these recommendations carefully.
> 💡 If you are a member of Nextstrain organization, then submit to the "nextstrain" collection, rather than to "community".
> 💡 If you are a member of Nextstrain organization, then submit to the "nextstrain" collection (`data/nextstrain`), rather than to "community" (`data/community`).
- Add `pathogen.json` file. Use `pathogen.json` files of the existing datasets as a template and modify them as needed.
- Add a `CHANGELOG.md` file containing second-level heading `## Unreleased` (spelled exactly like this) and free-form text under it, describing proposed changes. You can use Markdown syntax. In this case you can write that it's the first release of this dataset.

> ⚠️ It is important to name the section exactly: 2 hashes, space and the word "Unreleased", starting with the capital letter "U". This text will be used to automatically find and extract release notes, which are then published along with the next dataset release.
- Add a `CHANGELOG.md` file containing second-level heading `## Unreleased` (spelled exactly like this) and free-form text under it, describing proposed changes. You can use Markdown syntax. In this case you can write that it's the first release of this dataset.

> ⚠️ It is important to name the section exactly: two hashes, space and the word "Unreleased", starting with the capital letter "U". This text will be used to automatically find and extract release notes, which are then published along with the next dataset release.
- Add remaining dataset files. At a very minimum, you should have required files: `reference.fasta`, `pathogen.json` and `CHANGELOG.md`.

- Optionally, [test your dataset locally](#testing-datasets-locally)

- Submit your changes as a pull request to this repository.
- Commit and push your changes to your forked repository on GitHub. Refer to documentation of git and of GitHub for more details.

- Submit your changes as a [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) to the `nextstrain/nextclade_data` repository on GitHub.

> ⚠️ Note that `nextstrain-bot` will automatically run the rebuild and will commit changes to your branch to the `data/` and `data_output/` directory. Don't forget to pull them if you are going to make more commits.
> ⚠️ Note that `nextstrain-bot` will automatically run the rebuild script, and will commit changes to the `data/` and `data_output/` directories on your git branch. Don't forget to pull these changes if you are going to make more commits.
- Optionally, [test your dataset from GitHub](#testing-datasets-from-github)

- Wait for maintainers and community members to review and either accept or reject your proposal. Be ready to discuss the proposed changes, and to apply some modifications if requested.
- Wait for maintainers and community members to review and either accept or reject your proposal. Be ready to discuss the proposed changes, and to apply modifications if requested.
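
The file-related steps above can be sketched as two small shell helpers. This is only an illustration under stated assumptions, not a script shipped with this repository: the function names `init_dataset` and `check_required_files` are hypothetical, and the changelog text is a placeholder. `pathogen.json` and `reference.fasta` still have to be added by hand, using existing datasets as a template.

```bash
# Hypothetical helpers for illustration; they are not part of this repository.

# Create the dataset directory under data/community/ and an initial CHANGELOG.md.
# Prints the path of the created directory.
init_dataset() {
  local repo_root="$1" dataset_path="$2"
  local dir="${repo_root}/data/community/${dataset_path}"
  mkdir -p "${dir}" || return 1
  # The heading must be spelled exactly "## Unreleased".
  printf '## Unreleased\n\nInitial release of this dataset.\n' > "${dir}/CHANGELOG.md"
  echo "${dir}"
}

# Print the required files that are missing; return non-zero if any is missing.
check_required_files() {
  local dir="$1" status=0 f
  for f in reference.fasta pathogen.json CHANGELOG.md; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${f}"
      status=1
    fi
  done
  return "${status}"
}
```

For example, running `init_dataset . your-org/your-name/pathogen-name` from the repository root and then `check_required_files` on the printed directory would report `reference.fasta` and `pathogen.json` as missing until you add them.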

### Updating an existing dataset

- Find your dataset in the `data/` directory and modify dataset files as you see fit.

- On the very top of the `CHANGELOG.md`, add a second-level heading `## Unreleased` (spelled exactly like this) and a paragraph under it describing proposed changes in free form. You can use the usual Markdown syntax. Keep records for previous releases in place. If there's already an `## Unreleased` section (meaning this dataset already has changes that are yet to be released) append the summary of your changes to the existing summary.

- Submit the result for consideration as a pull request
- Submit the result for consideration as a pull request (similarly to how it was done in the "Adding a new dataset" section)

- Wait for maintainers and community members to review and either accept or reject your proposal. Be ready to discuss the proposed changes, and to apply some modifications if requested.
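
The changelog step above could be automated roughly as follows. This is a sketch under assumptions, not the repository's own tooling: the helper name `add_unreleased_note` is hypothetical, and it assumes that every release in `CHANGELOG.md` starts with a second-level (`## `) heading. If an `## Unreleased` section already exists, the note is appended at the end of that section; otherwise a new section is added at the very top.

```bash
# Hypothetical helper for illustration; not part of this repository.
# Usage: add_unreleased_note <path/to/CHANGELOG.md> <one-line note>
add_unreleased_note() {
  local changelog="$1" note="$2" tmp
  tmp="$(mktemp)" || return 1
  if grep -qx '## Unreleased' "${changelog}"; then
    # Section exists: put the note at its end, i.e. just before the next
    # second-level heading, or at the end of the file if there is none.
    # Note: awk -v interprets backslash escapes in the note text.
    awk -v note="${note}" '
      $0 == "## Unreleased" { in_section = 1; print; next }
      in_section && /^## / { print note; print ""; in_section = 0 }
      { print }
      END { if (in_section) print note }
    ' "${changelog}" > "${tmp}"
  else
    # No such section yet: add it on top, keeping previous releases in place.
    { printf '## Unreleased\n\n%s\n\n' "${note}"; cat "${changelog}"; } > "${tmp}"
  fi
  # Overwrite in place so that the original file permissions are kept.
  cat "${tmp}" > "${changelog}" && rm -f "${tmp}"
}
```

Writing the result through a temporary file avoids truncating `CHANGELOG.md` while it is still being read.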

@@ -82,7 +106,7 @@ If your pull request is not merged yet, simply close the pull request. Explain y

If you want to signal users that a dataset is no longer maintained, is inaccurate or obsolete, rather than deleting it you can set the field `"deprecated": true` in `pathogen.json` file. In this case the dataset will be listed in Nextclade with a "deprecated" badge and at the bottom of the list. Please explain the reason for deprecation in the changelog section (as described in the usual update steps), and add some details to the readme file if there is one, so that users could make an informed decision themselves whether to use it or not.

### What happens to accepted adatasets
### What happens to accepted datasets

If your pull request is accepted and merged, your data enters `master` branch and is automatically deployed to the `master` environment. In a few minutes after merge it should be visible at https://master.clades.nextstrain.org

@@ -108,21 +132,21 @@ Let's split this path into segments and describe meaning of each segment:

- `data/`: this is the root of the dataset collection storage
- `community/`: this is the root of the "community" dataset collection
- `your-org/your-name/`: replace these path segments with your organization name, if you are submitting on behalf of the organization, as well as your GitHub nickname. This way every organization (and organization member) has its own directory. Feel free to create some nested structure relevant for your organization - you can nest the subdirectories arbitrarily and this is not limited to 2 levels. We only ask to not submit datasets directly into the `community/`, to avoid clashes between datasets from different authors and organizations.
- `your-org/your-name/`: replace these path segments with your organization name, if you are submitting on behalf of an organization, as well as with your GitHub nickname. This way every organization (and organization member) has its own directory. Feel free to create some nested structure relevant for your organization - you can nest the subdirectories arbitrarily and this is not limited to 2 levels. We only ask to not submit datasets directly into the `community/`, to avoid clashes between datasets from different authors and organizations.
- `pathogen-name/strain-name/`: in these segments feel free to use the name of the pathogen and potentially strain name and/or an accession of the reference sequence. Again, you can nest the subdirectories arbitrarily and this is not limited to 2 levels.
- `other/features/`: if you need some more levels of path segments to describe a particular dataset, for example a particular geographic location, time period or a host organism, then you can create additional path segments for it.

Note that this structure is only relevant if you want to submit your datasets into this repository. You can of course use any dataset directory structure on your local computer or in your own repositories.

### Requirements for dataset paths

- Dataset paths are used as dataset identifiers, for example in an argument of Nextclade CLI, so they should not be excessively long.
- Dataset paths are used as dataset identifiers, for example in an argument of a Nextclade CLI invocation and in URL parameters of Nextclade Web, so they should not be excessively long.

- These paths are used in directory names and URLs, so please avoid using spaces and special characters. Prefer lowercase letters and dashes (`-`) over underscores (`_`) where possible.

- Meaningful, readable names and directory structure are encouraged. At the time of submission of the pull request, please avoid temporary names or names that cannot be understood by the potential users of the dataset.

- Be consistent in your naming conventions to avoid confusion. For example, choose between "flu" and "influenza" and avoid mixing both names. You can explain your naming and other choices in the `README.md` file, which will be visible for all users.
- Be consistent in your naming conventions to avoid confusion. For example, choose between "flu" and "influenza", stick to it, and avoid mixing both names. You can explain your naming and other choices in the `README.md` file, which will be visible for all users.

- In order to allow reproducibility of Nextclade analysis results, released datasets are immutable. Once a dataset is submitted and released, it cannot be revoked and the path cannot be changed! You can of course later submit the same dataset under a new path, but this will lose continuity of versioning - your potential users won't know about this new dataset and will not be able to receive updates, because they will still be using the old path (for example hardcoded in their analysis pipeline's code). So design your dataset path hierarchies carefully and with consideration for further updates and potential future additions, such that the paths are final and won't need to be modified.
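
Some of the requirements above can be checked mechanically before opening a pull request. The following is a minimal sketch, not an official validator: the function name `check_dataset_path`, the set of allowed characters (letters, digits, `/`, `-`, `_`) and the default length limit of 80 characters are all assumptions chosen for illustration. Spaces and special characters are treated as errors, while uppercase letters, underscores and long paths only produce warnings, because the guide states them as preferences.

```bash
# Hypothetical check for illustration; not part of this repository.
# Usage: check_dataset_path <dataset_path> [max_length]
check_dataset_path() {
  local path="$1" max_len="${2:-80}" status=0
  # Anything outside letters, digits, "/", "-" and "_" is treated as an error.
  if printf '%s' "${path}" | LC_ALL=C grep -q '[^A-Za-z0-9/_-]'; then
    echo "error: spaces or special characters"
    status=1
  fi
  # Uppercase letters and underscores are allowed but discouraged.
  if printf '%s' "${path}" | LC_ALL=C grep -q '[A-Z_]'; then
    echo "warning: prefer lowercase letters and dashes"
  fi
  # The length limit is an arbitrary assumption; the guide gives no number.
  if [ "${#path}" -gt "${max_len}" ]; then
    echo "warning: path is longer than ${max_len} characters"
  fi
  return "${status}"
}
```

A path such as `community/your-org/your-name/pathogen-name/strain-name` passes silently, whereas `community/my org/flu` fails because of the space.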

@@ -134,12 +158,12 @@ The guide in [Test datasets locally](https://github.com/nextstrain/nextclade/blo

## Testing datasets from GitHub

Once the pull request is submitted, `nextclade-bot` will start a GitHub action, running `./scripts/rebuild` and pushing the produced build to the `data_output/` directory. Once in place on GitHub, the `data_output/` directory can be used in Nextclade Web and Nextclade CLI:
Once the pull request is submitted, a [GitHub Actions](https://docs.github.com/en/actions) instance will be started. It will run `./scripts/rebuild` and commit and push the produced build to the `data_output/` directory on behalf of `nextstrain-bot` GitHub user. You will see the new commits appearing in your pull request. Once this is in place, the link to the `data_output/` GitHub directory can be used in Nextclade Web and Nextclade CLI as an alternative dataset server:

- First, wait for the GitHub Actions runs in the "checks" section of your pull request to complete successfully.
- Wait for `nextstrain-bot` commit to appear in the list of commits of the pull request (refresh the page if needed).
- Obtain a full URL to the `data_output/` directory on your pull request's branch. For example, you can select your branch on GitHub, navigate to `data_output/` in the directory tree and copy the resulting URL in the address bar of your browser.
- In Nextclade CLI, use `--server=<github_url>` argument for `run` and `dataset get` command.
- In Nextclade CLI, use `--server=<github_url>` argument for `run` and `dataset get` commands.
- In Nextclade Web, add `dataset-server=<github_url>` URL parameter.

This will tell Nextclade to fetch your modified datasets right from GitHub (GitHub acts as a dataset "server" here).
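
As a rough illustration of the steps above, the snippet below builds the GitHub URL of `data_output/` and prints (rather than runs) an example CLI command and a Nextclade Web link. Everything specific in it is an assumption: the helper name `github_data_output_url`, the user, branch and dataset names, the output directory, and the use of `https://clades.nextstrain.org` as the Nextclade Web address. The `--name` and `--output-dir` arguments of `dataset get` are not described in this guide; check `nextclade dataset get --help` for your version.

```bash
# Hypothetical helper for illustration; not part of this repository.
# Builds the URL you would otherwise copy from the browser address bar.
github_data_output_url() {
  local user="$1" branch="$2"
  echo "https://github.com/${user}/nextclade_data/tree/${branch}/data_output"
}

# Placeholder values: replace with your GitHub username and branch name.
server_url="$(github_data_output_url your-github-username your-branch)"

# The commands are only printed, so this sketch works without Nextclade installed.
echo "nextclade dataset get --name='community/your-org/your-name/pathogen-name' --server='${server_url}' --output-dir='my_dataset/'"
echo "https://clades.nextstrain.org?dataset-server=${server_url}"
```

Printing the command first makes it easy to verify that the URL points at the right branch before fetching anything.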