Context
Despite the many quality filters in the workflow (a global clock rate filter, a local clock rate filter, an outliers list, and dropping sequences with poor alignments), our trees still occasionally include low-quality sequences that Nextclade would have flagged with a "bad" QC status. In most of the recent cases I have seen, the low-quality sequences have too many private mutations.
In the best case, these low-quality sequences look strange in the tree. In the worst case, these sequences break the date inference for internal nodes and produce an invalid time tree topology, requiring the builds to be run again.
Description
Since we already plan to migrate away from Nextalign to Nextclade, we should align sequences with Nextclade and produce the metadata output file with QC statuses. We should filter any sequences that have a QC status of "bad".
We could approach this functionality in a couple of ways:
1. Modify the existing align rule to use Nextclade and produce QC output, add a subsequent post-alignment filter rule before the tree building rule to omit "bad" QC records (using augur filter on the alignment sequences and "metadata" from Nextclade), and pass the filtered alignment to the tree rule. We'd probably want to merge the original metadata records with the Nextclade metadata prior to filtering, much like the merge we do in the flu_frequencies workflow.
2. Run Nextclade on all sequences upstream of the main phylogenetic workflow, merge the complete metadata with the Nextclade annotations, upload these combined metadata to S3, start the phylogenetic workflow from the combined metadata files, and apply custom filters on QC in the subsampling logic for each build. The only changes to the main phylogenetic workflow required by this approach would be additions to the build YAML files to include a filter for Nextclade QC status. The bigger changes happen outside of the main workflow in our sequence upload logic.
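Both approaches share the same core step: join the original metadata to the Nextclade annotations and drop records with a "bad" QC status. A minimal sketch of that step, using only the Python standard library, could look like the following. It assumes the metadata is keyed by a `strain` column and that the Nextclade TSV uses its `seqName` and `qc.overallStatus` columns. The function name `merge_and_filter` and the choice to keep sequences that have no Nextclade record are illustrative assumptions, not part of the existing workflow.

```python
import csv
import io


def merge_and_filter(metadata_tsv, nextclade_tsv, exclude_status="bad"):
    """Join metadata to Nextclade QC by strain name and drop excluded records.

    Hypothetical helper: assumes a `strain` column in the metadata and
    `seqName` / `qc.overallStatus` columns in the Nextclade TSV.
    """
    # Index the Nextclade QC status by sequence name.
    qc_by_strain = {
        row["seqName"]: row["qc.overallStatus"]
        for row in csv.DictReader(nextclade_tsv, delimiter="\t")
    }

    kept = []
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        # Sequences without a Nextclade record get an empty status and are
        # kept; that is an assumption, a stricter filter could drop them.
        status = qc_by_strain.get(row["strain"], "")
        if status == exclude_status:
            continue
        merged = dict(row)
        merged["qc.overallStatus"] = status
        kept.append(merged)
    return kept


# Small in-memory example with made-up strain names.
metadata = io.StringIO(
    "strain\tdate\n"
    "A/one/2023\t2023-01-05\n"
    "A/two/2023\t2023-02-11\n"
    "A/three/2023\t2023-03-02\n"
)
nextclade = io.StringIO(
    "seqName\tqc.overallStatus\n"
    "A/one/2023\tgood\n"
    "A/two/2023\tbad\n"
)
records = merge_and_filter(metadata, nextclade)
print([record["strain"] for record in records])
```

In the first approach this logic would sit in a post-alignment rule inside the workflow; in the second it would run once upstream, with each build filtering on the merged QC column.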
The benefit of the first approach is that we could implement it now without much additional infrastructure planning, since the changes all happen inside the phylogenetic workflow. The annotations would be very fast, since we'd only run Nextclade on the subsampled data. The main disadvantage is the additional complexity to the workflow and the redundant runs of Nextclade across multiple builds for the same lineages and segments.
The benefits of the second approach are that it would introduce no complexity to the existing workflow and it would produce a valuable resource that other current workflows (like flu_frequencies) and future workflows (forecasts?) could benefit from. The disadvantage is the additional infrastructural complexity of setting up the Nextclade runs with GitHub Actions for different references (e.g., A/Wisconsin/67/2005 or A/Darwin/6/2021 for H3N2) and storing the merged metadata in S3 in a way that allows us to unambiguously grab the outputs from the desired Nextclade dataset.
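One way to make the stored outputs unambiguous is to encode the lineage, segment, and Nextclade reference in the object key. The sketch below is purely illustrative: the function name `metadata_key`, the key layout, and the file name are hypothetical, not an existing convention. It also shows why reference names need sanitizing, since a name like A/Wisconsin/67/2005 contains slashes that would otherwise create extra path levels.

```python
import re


def metadata_key(lineage, segment, reference):
    """Build a hypothetical object key for merged metadata.

    The layout <lineage>/<segment>/<reference>/<file> is an assumption
    made for illustration only.
    """
    # Collapse slashes and other punctuation so the reference name
    # occupies exactly one path component.
    safe_reference = re.sub(r"[^A-Za-z0-9]+", "-", reference).strip("-").lower()
    return f"{lineage}/{segment}/{safe_reference}/metadata_with_nextclade.tsv"


wisconsin_key = metadata_key("h3n2", "ha", "A/Wisconsin/67/2005")
darwin_key = metadata_key("h3n2", "ha", "A/Darwin/6/2021")
print(wisconsin_key)
print(darwin_key)
```

With a scheme like this, two Nextclade datasets for the same lineage and segment never share a key, so a build can request exactly the annotations it was configured for.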
I think we want to end up at the second approach eventually, so maybe it is worth the extra planning effort to figure that approach out now instead of using the first approach.