Merge branch 'master' into fix/root-state-in-sequence-reconstruction

nextstrain · Dec 23, 2024 · f556cc9
2 parents 5b717d6 + 77ae31e

Showing 24 changed files with 1,318 additions and 109 deletions.
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -41,19 +41,18 @@ jobs:
     strategy:
       matrix:
         python-version:
-          - '3.8'
           - '3.9'
           - '3.10'
           - '3.11'
           - '3.12'
-        biopython-version: 
-          # list of Biopython versions with support for a new Python version 
-          # from https://github.com/biopython/biopython/blob/master/NEWS.rst 
+        biopython-version:
+          # list of Biopython versions with support for a new Python version
+          # from https://github.com/biopython/biopython/blob/master/NEWS.rst
           - '1.80' # first to support Python 3.10 and 3.11
           - '1.82' # first to support Python 3.12
-          - ''     # latest 
-        exclude: 
-          # some older Biopython versions are incompatible with later Python versions 
+          - ''     # latest
+        exclude:
+          # some older Biopython versions are incompatible with later Python versions
           - { biopython-version: '1.80', python-version: '3.12' }
     defaults:
       run:
@@ -115,7 +114,11 @@ jobs:
           - lassa
           - measles
           - mpox
+          - oropouche
+          - rabies
           - seasonal-cov
+          - wnv
+          - yellow-fever
           - zika
 
     name: pathogen-repo-ci (${{ matrix.pathogen }})

diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml
@@ -10,7 +10,7 @@ on:
         type: string
 jobs:
   run:
-    if: github.ref == github.event.repository.default_branch
+    if: github.ref_name == github.event.repository.default_branch
     uses: ./.github/workflows/ci.yaml
     secrets: inherit
     with:
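
Note on the `release.yaml` change above: in GitHub Actions, `github.ref` holds the fully qualified ref (for a branch, `refs/heads/<name>`), while `github.ref_name` and `github.event.repository.default_branch` hold short names. The old condition therefore could not match for a run on a branch. A minimal sketch of the two comparisons in Python, using hypothetical context values (the branch name `master` is assumed from the merge title, not read from the workflow):

```python
# Hypothetical values for a run triggered on the default branch.
ref = "refs/heads/master"    # what github.ref would hold
ref_name = "master"          # what github.ref_name would hold
default_branch = "master"    # github.event.repository.default_branch

# Old condition: full ref compared against a short branch name.
old_condition = ref == default_branch

# New condition: short name compared against short name.
new_condition = ref_name == default_branch

print(old_condition, new_condition)
```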

diff --git a/.pylintrc b/.pylintrc
@@ -317,13 +317,6 @@ max-line-length=100
 # Maximum number of lines in a module
 max-module-lines=1000
 
-# List of optional constructs for which whitespace checking is disabled. `dict-
-# separator` is used to allow tabulation in dicts, etc.: {1  : 1,\n222: 2}.
-# `trailing-comma` allows a space between comma and closing bracket: (a, ).
-# `empty-line` allows space-only lines.
-no-space-check=trailing-comma,
-               dict-separator
-
 # Allow the body of a class to be on the same line as the declaration if body
 # contains single statement.
 single-line-class-stmt=no

diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -11,6 +11,9 @@ build:
       # generated from the full git history in conf.py.
       - git fetch --unshallow
 
+sphinx:
+  configuration: docs/conf.py
+
 python:
   install:
     - method: pip

diff --git a/CHANGES.md b/CHANGES.md
@@ -9,9 +9,28 @@
 ### Bug Fixes
 
 * ancestral, refine: Explicitly specify how the root and ambiguous states are handled during sequence reconstruction and mutation counting. [#1690][] (@rneher)
+* titers: Fix type errors in code associated with cross-validation of models. [#1688][] (@huddlej)
 
+[#1688]: https://github.com/nextstrain/augur/pull/1688
 [#1690]: https://github.com/nextstrain/augur/pull/1690
 
+## 27.0.0 (9 December 2024)
+
+### Major Changes
+
+- Drop support for Python 3.8. [#1693] (@victorlin)
+- Drop support for older versions of jsonschema (<4.18.0). [#1691] (@victorlin)
+- Drop support for xopen <2.0.0. [#1692] (@victorlin)
+
+### Bug fixes
+
+- export: validation will no longer crash with `KeyError: 'tree'` when newer versions of jsonschema (≥4.18.0) are installed. [#1358] (@victorlin)
+
+[#1358]: https://github.com/nextstrain/augur/issues/1358
+[#1691]: https://github.com/nextstrain/augur/pull/1691
+[#1692]: https://github.com/nextstrain/augur/pull/1692
+[#1693]: https://github.com/nextstrain/augur/pull/1693
+
 ## 26.2.0 (20 November 2024)
 
 ### Features

diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,45 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+
+preferred-citation:
+  type: article
+  title: "Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens"
+  doi: "10.21105/joss.02906"
+  journal: "Journal of Open Source Software"
+  year: 2021
+  month: 1
+  volume: 6
+  issue: 57
+  start: 2906
+  end: 2906
+
+  authors:
+    - family-names: Huddleston
+      given-names:  John
+
+    - family-names: Hadfield
+      given-names:  James
+
+    - family-names: Sibley
+      given-names:  Thomas R.
+
+    - family-names: Lee
+      given-names:  Jover
+
+    - family-names: Fay
+      given-names:  Kairsten
+
+    - family-names: Ilcisin
+      given-names:  Misja
+
+    - family-names: Harkins
+      given-names:  Elias
+
+    - family-names: Bedford
+      given-names:  Trevor
+
+    - family-names: Neher
+      given-names:  Richard A.
+
+    - family-names: Hodcroft
+      given-names:  Emma B.
diff --git a/DEPRECATED.md b/DEPRECATED.md
@@ -6,7 +6,7 @@ available for backwards compatibility, but should not be used in new code.
 
 ## `xopen` major version 1
 
-*Deprecated in version 25.1.0 (July 2024). Planned for removal November 2024 or after.*
+*Deprecated in version 25.1.0 (July 2024). Removed in version 27.0.0 (December 2024).*
 
 ## `augur parse` preference of `name` over `strain` as the sequence ID field
 

diff --git a/README.md b/README.md
@@ -40,6 +40,8 @@ Try out an analysis of real virus data by [completing the Zika tutorial](https:/
 
 Huddleston J, Hadfield J, Sibley TR, Lee J, Fay K, Ilcisin M, Harkins E, Bedford T, Neher RA, Hodcroft EB, (2021). Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens. Journal of Open Source Software, 6(57), 2906, https://doi.org/10.21105/joss.02906
 
+For other formats, refer to [CITATION.cff](./CITATION.cff).
+
 ## License and copyright
 
 Copyright 2014-2022 Trevor Bedford and Richard Neher.

diff --git a/augur/__main__.py b/augur/__main__.py
@@ -24,11 +24,7 @@ def main():
         errors="backslashreplace",
         newline=None,
 
-        # Always line-buffer stderr since we only use it for messaging, not
-        # data output.  This is the Python default from 3.9 onwards, but we
-        # also run on 3.8 where it's not.  Be consistent regardless of Python
-        # version.
-        line_buffering=True,
+        # By default, stderr is always line-buffered.
     )
 
     return augur.run( argv[1:] )

diff --git a/augur/__version__.py b/augur/__version__.py
@@ -1,4 +1,4 @@
-__version__ = '26.2.0'
+__version__ = '27.0.0'
 
 
 def is_augur_version_compatible(version):

diff --git a/augur/io/file.py b/augur/io/file.py
@@ -2,21 +2,9 @@
 from contextlib import contextmanager
 from io import IOBase
 from textwrap import dedent
-from xopen import xopen
+from xopen import xopen, _PipedCompressionProgram
 from augur.errors import AugurError
 
-# Workaround to maintain compatibility with both xopen v1 and v2
-# Around November 2024, we shall drop support for xopen v1
-# by removing the try-except block and using
-# _PipedCompressionProgram directly
-try:
-    from xopen import _PipedCompressionProgram as PipedCompressionReader
-    from xopen import _PipedCompressionProgram as PipedCompressionWriter
-except ImportError:
-    from xopen import (  # type: ignore[attr-defined, no-redef]  
-        PipedCompressionReader,
-        PipedCompressionWriter,
-    )
 
 ENCODING = "utf-8"
 
@@ -63,7 +51,7 @@ def open_file(path_or_buffer, mode="r", **kwargs):
                 Try re-saving the file using the {e.encoding!r} encoding."""))
 
 
-    elif isinstance(path_or_buffer, (IOBase, PipedCompressionReader, PipedCompressionWriter)):
+    elif isinstance(path_or_buffer, (IOBase, _PipedCompressionProgram)):
         yield path_or_buffer
 
     else:

diff --git a/augur/titer_model.py b/augur/titer_model.py
@@ -38,42 +38,42 @@ def load_from_file(filenames, excluded_sources=None):
         >>> type(measurements)
         <class 'dict'>
         >>> len(measurements)
-        11
+        248
         >>> len(strains)
-        13
+        62
         >>> len(sources)
-        5
+        15
 
         Inspect specific measurements. First, inspect a measurement that has a
         specific value in the input.
 
-        >>> measurements[("A/Acores/11/2013", ("A/Alabama/5/2010", "F27/10"))]
-        [80.0]
+        >>> measurements[("A/Wisconsin/3/2007", ("A/Wisconsin/3/2007", "A/Wis3/07"))]
+        [5120.0]
 
         Next, inspect a measurement that has a thresholded value at the lower
-        bound of detection (e.g., "<80"). This measurement should be reported as
-        one half of its threshold value (e.g., 40.0).
+        bound of detection (e.g., "<40"). This measurement should be reported as
+        one half of its threshold value (e.g., 20.0).
 
-        >>> measurements[("A/Acores/11/2013", ("A/Victoria/208/2009", "F7/10"))]
-        [40.0]
+        >>> measurements[("A/HongKong/1/1968", ("A/Victoria/3/1975", "A/Vic/3/75"))]
+        [20.0]
 
         Inspect a measurement that has a thresholded value at the upper bound of
-        detection (">1280"). This measurement should be reported as twice its
-        threshold value (e.g., 2560.0).
+        detection (">5120"). This measurement should be reported as twice its
+        threshold value (e.g., 10240.0).
 
-        >>> measurements[("A/Acores/SU43/2012", ("A/Texas/50/2012", "F36/12"))]
-        [2560.0]
+        >>> measurements[("A/Wisconsin/3/2007", ("A/Uruguay/716/2007", "A/Uru716/07"))]
+        [10240.0]
 
         Confirm that excluding sources produces fewer measurements.
 
-        >>> measurements, strains, sources = TiterCollection.load_from_file("tests/data/titer_model/h3n2_titers_subset.tsv", excluded_sources=["NIMR_Sep2013_7-11.csv"])
+        >>> measurements, strains, sources = TiterCollection.load_from_file("tests/data/titer_model/h3n2_titers_subset.tsv", excluded_sources=["Hay2001"])
         >>> len(measurements)
-        5
+        223
 
         Request measurements for a test/reference/serum tuple that should not
         exist after excluding its source.
 
-        >>> measurements.get(("A/Acores/11/2013", ("A/Alabama/5/2010", "F27/10")))
+        >>> measurements.get(("A/HongKong/1/1968", ("A/HongKong/1/1968", "A/HK/1/68")))
         >>>
 
         Missing titer data should produce an error.
@@ -150,12 +150,10 @@ def count_strains(titers):
         --------
         >>> measurements, strains, sources = TiterCollection.load_from_file("tests/data/titer_model/h3n2_titers_subset.tsv")
         >>> titer_counts = TiterCollection.count_strains(measurements)
-        >>> titer_counts["A/Acores/11/2013"]
-        6
-        >>> titer_counts["A/Acores/SU43/2012"]
-        3
-        >>> titer_counts["A/Cairo/63/2012"]
-        2
+        >>> titer_counts["A/Auckland/6/2003"]
+        4
+        >>> titer_counts["A/Brisbane/9/2006"]
+        15
         """
         counts = defaultdict(int)
         for key in titers.keys():
@@ -187,22 +185,26 @@ def filter_strains(titers, strains):
         --------
         >>> measurements, strains, sources = TiterCollection.load_from_file("tests/data/titer_model/h3n2_titers_subset.tsv")
         >>> len(measurements)
-        11
+        248
 
         Test the case when a test strain exists in the subset but the none of
         its corresponding reference strains do.
 
-        >>> len(TiterCollection.filter_strains(measurements, ["A/Acores/11/2013"]))
+        >>> len(TiterCollection.filter_strains(measurements, ["A/Oslo/244/1997"]))
         0
 
-        Test when both the test and reference strains exist in the subset.
+        Test when both the test and reference strains exist in the subset. This
+        first test gets a heterologous pair (first and second strain) and the
+        autologous pair for the second strain.
 
-        >>> len(TiterCollection.filter_strains(measurements, ["A/Acores/11/2013", "A/Alabama/5/2010", "A/Athens/112/2012"]))
+        >>> len(TiterCollection.filter_strains(measurements, ["A/Oslo/244/1997", "A/Johannesburg/33/1994"]))
         2
-        >>> len(TiterCollection.filter_strains(measurements, ["A/Acores/11/2013", "A/Acores/SU43/2012", "A/Alabama/5/2010", "A/Athens/112/2012"]))
-        3
+
+        Test when no strains are provided.
+
         >>> len(TiterCollection.filter_strains(measurements, []))
         0
+
         """
         return {key: value for key, value in titers.items()
                 if key[0] in strains and key[1][0] in strains}
@@ -226,7 +228,7 @@ def __init__(self, titers, **kwargs):
         else:
             self.titers = titers
             strain_counts = type(self).count_strains(titers)
-            self.strains = strain_counts.keys()
+            self.strains = list(strain_counts.keys())
 
     def read_titers(self, fname):
         self.titer_fname = fname
@@ -318,11 +320,11 @@ def strain_census(self, titers):
         >>> titers = TiterCollection(measurements)
         >>> sera, ref_strains, test_strains = titers.strain_census(measurements)
         >>> len(sera)
-        9
+        66
         >>> len(ref_strains)
-        9
+        27
         >>> len(test_strains)
-        13
+        62
 
         Parameters
         ----------
@@ -415,7 +417,7 @@ def make_training_set(self, training_fraction=1.0, subset_strains=False, **kwarg
                 from random import sample
                 tmp = set(self.test_strains)
                 tmp.difference_update(self.ref_strains) # don't use references viruses in the set to sample from
-                training_strains = sample(tmp, int(training_fraction*len(tmp)))
+                training_strains = sample(sorted(tmp), int(training_fraction*len(tmp)))
                 for tmpstrain in self.ref_strains:      # add all reference viruses to the training set
                     if tmpstrain not in training_strains:
                         training_strains.append(tmpstrain)
@@ -504,7 +506,7 @@ def validate(self, plot=False, cutoff=0.0, validation_set = None, fname=None):
             pred_titer = self.predict_titer(key[0], key[1], cutoff=cutoff)
             validation[key] = (val, pred_titer)
 
-        validation_array = np.array(validation.values())
+        validation_array = np.array(list(validation.values()))
         actual = validation_array[:,0]
         predicted = validation_array[:,1]
 
@@ -517,7 +519,7 @@ def validate(self, plot=False, cutoff=0.0, validation_set = None, fname=None):
                         'rms_error': np.sqrt(np.mean((actual-predicted)**2)),
         }
         pprint(model_performance)
-        model_performance['values'] = validation.values()
+        model_performance['values'] = list(validation.values())
 
         self.validation = model_performance
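
Note on the `titer_model.py` fixes above: the one-line changes wrap dict views and a set in `list(...)` or `sorted(...)`. In Python 3, `dict.keys()` and `dict.values()` return views rather than lists, so they cannot be indexed, and NumPy treats a view as a single object instead of a 2-D array, which makes slicing such as `validation_array[:,0]` fail. Separately, `random.sample` requires a sequence and rejects sets as of Python 3.11. A minimal standard-library sketch of both behaviors follows; the strain names and values are made up for illustration and are not taken from the test data.

```python
import random

# Hypothetical validation results keyed by (test strain, (reference strain, serum)).
validation = {
    ("A/test/1", ("A/ref/1", "serum1")): (5.0, 4.5),
    ("A/test/2", ("A/ref/1", "serum1")): (3.0, 3.5),
}

# A dict view is not a sequence, so indexing it raises TypeError.
values_view = validation.values()
try:
    values_view[0]
    view_is_indexable = True
except TypeError:
    view_is_indexable = False

# Materializing the view gives a plain list of (actual, predicted) pairs.
pairs = list(validation.values())
actual = [a for a, _ in pairs]
predicted = [p for _, p in pairs]

# Sorting the set first yields a list, which random.sample accepts on every
# Python version, and gives a stable order so a fixed seed is reproducible.
test_strains = {"A/test/1", "A/test/2", "A/test/3", "A/test/4"}
rng = random.Random(0)
training_strains = rng.sample(sorted(test_strains), 2)
```

Because `sample` returns a list, the later `training_strains.append(tmpstrain)` call in `make_training_set` keeps working after the change.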
 

diff --git a/augur/util_support/node_data.py b/augur/util_support/node_data.py
@@ -31,8 +31,6 @@ def deep_add_or_update(self, d, key, value):
                     raise exception
         """
 
-        # TODO Python 3.9: Use the new dictionary union operator (https://www.python.org/dev/peps/pep-0584/)
-
         if key not in d or (
             not isinstance(d[key], dict) and not isinstance(value, dict)
         ):