augur {read,write}-file
Add commands to read and write files using Augur's conventions.  This
allows external programs to do i/o like Augur by piping from/to `augur
read-file` or `augur write-file`.  In some simple testing, the overhead
of passing text i/o thru Python vs. not is minimal for our use cases and
worth the cost of consistent compression and newline handling.

I'll be using this to allow SQLite to read/write files like Augur.
tsibley committed Jul 30, 2024
1 parent 2c88298 commit 54ebcf3
Showing 10 changed files with 170 additions and 0 deletions.
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,11 +2,16 @@

## __NEXT__

### Features

* Two new commands, `augur read-file` and `augur write-file`, now allow external programs to do i/o like Augur by piping from/to these new commands. They provide handling of compression formats and newlines consistent with the rest of Augur. [#1562][] (@tsibley)

### Bug Fixes

* Embedded newlines in quoted field values of metadata files are now properly handled. [#1561][] (@tsibley)

[#1561]: https://github.com/nextstrain/augur/pull/1561
[#1562]: https://github.com/nextstrain/augur/pull/1562



2 changes: 2 additions & 0 deletions augur/__init__.py
@@ -42,6 +42,8 @@
    "version",
    "import_",
    "measurements",
    "read_file",
    "write_file",
]

COMMANDS = [importlib.import_module('augur.' + c) for c in command_strings]
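The hunk above adds two module names to a list that is then imported dynamically, and both new modules expose `register_parser()` and `run()`. A minimal sketch of how such a convention can be wired into `argparse` follows. It is an illustration under stated assumptions, not Augur's actual dispatch code: `echo_command`, `make_demo_parser`, and the `command` default are hypothetical names.

```python
import argparse
from types import SimpleNamespace


def _register_echo(parent_subparsers):
    # Same shape as the register_parser() functions in this commit.
    parser = parent_subparsers.add_parser("echo-path", help="print the given path")
    parser.add_argument("path", metavar="PATH", help="path to file")
    return parser


def _run_echo(args):
    return args.path


# Stand-in for an imported command module (hypothetical; not part of Augur).
echo_command = SimpleNamespace(register_parser=_register_echo, run=_run_echo)


def make_demo_parser(commands):
    """Build a parser with one subcommand per command object."""
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()
    for command in commands:
        subparser = command.register_parser(subparsers)
        # Remember which object handles this subcommand.
        subparser.set_defaults(command=command)
    return parser


args = make_demo_parser([echo_command]).parse_args(["echo-path", "data.tsv.gz"])
result = args.command.run(args)
```

Each command only needs to know how to describe its own arguments and how to run, which is why adding a command in this commit is a two-line change to the list.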
73 changes: 73 additions & 0 deletions augur/read_file.py
@@ -0,0 +1,73 @@
r"""
Read a file like Augur, with transparent optimized decompression and universal newlines.

Input is read from the given file path, as the compression format detection
requires a seekable stream. The given path may be "-" to explicitly read from
stdin, but no decompression will be done.

Output is always to stdout.

Universal newline translation is always performed, so \n, \r\n, and \r in the
input are all translated to the system's native newlines (e.g. \n on Unix, \r\n
on Windows) in the output.
"""
import io
import os
import signal
import sys
from shutil import copyfileobj

from .io.file import open_file
from .utils import first_line


# The buffer size used by xopen() (which underlies open_file()), which notes:
# 128KB [KiB] buffer size also used by cat, pigz etc. It is faster than the 8K
# [KiB] default.
BUFFER_SIZE = max(io.DEFAULT_BUFFER_SIZE, 128 * 1024)

SIGPIPE = getattr(signal, "SIGPIPE", None)


def register_parser(parent_subparsers):
    parser = parent_subparsers.add_parser("read-file", help=first_line(__doc__))
    parser.add_argument("path", metavar="PATH", help="path to file")
    return parser


def run(args):
    with open_file(args.path, "rt", newline=None) as f:
        # It's tempting to want to splice(2) here, but it turns out to make
        # little sense. Firstly, the availability of splice(2) is Linux,
        # Python ≥3.10, and one of the files needs to be a pipe. The chance of
        # all of those together is slim-to-none, particularly because even in
        # the common case of xopen() reading from a pipe—the stdout of an
        # external decompression process—we can't use that pipe directly
        # because xopen() always buffers the first block of the file into
        # Python.¹ Secondly, we want universal newline handling—so that
        # callers get behaviour consistent with the rest of Augur—and that
        # rules out splice(2).
        #
        # Copying the data thru Python instead of with splice(2) seems fast
        # enough in some quick trials with large files (e.g. against `zstd`
        # directly), and the bottlenecks in pipelines using this command will
        # often not be this command's i/o.
        # -trs, 11 July 2024, updated 24 July 2024
        #
        # ¹ <https://github.com/pycompression/xopen/blob/67651844/src/xopen/__init__.py#L374>

        # Handle SIGPIPE, which Python converts to BrokenPipeError, gracefully
        # and like most Unix programs. See also
        # <https://docs.python.org/3/library/signal.html#note-on-sigpipe>.
        try:
            copyfileobj(f, sys.stdout, BUFFER_SIZE)

            # Force a flush so if SIGPIPE is going to happen it happens now.
            sys.stdout.flush()
        except BrokenPipeError:
            # Avoid errors from Python automatically flushing stdout on exit.
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, sys.stdout.fileno())

            # Return conventional exit status for "killed by SIGPIPE" on Unix.
            return 128 + SIGPIPE if SIGPIPE else 1
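The broken-pipe handling in `run()` can be exercised in isolation. The sketch below is a simplified stand-in rather than the command itself: `copy_stream` and `demo_closed_reader` are hypothetical helpers, the destination is an ordinary pipe instead of stdout, and the `os.dup2()` step is left out because it only matters for the interpreter's flush of stdout at exit. It assumes a Unix-like system where writing to a pipe with no reader raises `BrokenPipeError`.

```python
import io
import os
import signal
from shutil import copyfileobj

BUFFER_SIZE = max(io.DEFAULT_BUFFER_SIZE, 128 * 1024)
SIGPIPE = getattr(signal, "SIGPIPE", None)


def copy_stream(src, dst):
    """Copy src to dst and return a shell-style exit status."""
    try:
        copyfileobj(src, dst, BUFFER_SIZE)
        # Flush now so a broken pipe surfaces inside this try block.
        dst.flush()
    except BrokenPipeError:
        # Conventional "killed by SIGPIPE" status where the signal exists.
        return 128 + SIGPIPE if SIGPIPE else 1
    return 0


def demo_closed_reader():
    """Write into a pipe whose read end has already been closed."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    dst = os.fdopen(write_fd, "w")
    status = copy_stream(io.StringIO("hello\n"), dst)
    try:
        # close() retries the flush of the still-buffered data.
        dst.close()
    except BrokenPipeError:
        pass
    return status


closed_status = demo_closed_reader()
ok_status = copy_stream(io.StringIO("hello\n"), io.StringIO())
```

This is the situation a pipeline such as `augur read-file … | head` creates: the reader exits early and the writer is expected to stop quietly with a non-zero status rather than print a traceback.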
54 changes: 54 additions & 0 deletions augur/write_file.py
@@ -0,0 +1,54 @@
r"""
Write a file like Augur, with transparent optimized compression and universal newlines.

Input is always from stdin.

Output is to the given file path, as the compression format detection requires
it. The given path may be "-" to explicitly write to stdout, but no
compression will be done.

Universal newline translation is always performed, so \n, \r\n, and \r in the
input are all translated to the system's native newlines (e.g. \n on Unix, \r\n
on Windows) in the output.
"""
import io
import sys
from shutil import copyfileobj

from .io.file import open_file
from .utils import first_line


# The buffer size used by xopen() (which underlies open_file()), which notes:
# 128KB [KiB] buffer size also used by cat, pigz etc. It is faster than the 8K
# [KiB] default.
BUFFER_SIZE = max(io.DEFAULT_BUFFER_SIZE, 128 * 1024)


def register_parser(parent_subparsers):
    parser = parent_subparsers.add_parser("write-file", help=first_line(__doc__))
    parser.add_argument("path", metavar="PATH", help="path to file")
    return parser


def run(args):
    with open_file(args.path, "wt", newline=None) as f:
        # It's tempting to want to splice(2) here, but it turns out to make
        # little sense. Firstly, the availability of splice(2) is Linux,
        # Python ≥3.10, and one of the files needs to be a pipe. The chance of
        # all of those together is slim-to-none, particularly because even in
        # the common case of xopen() reading from a pipe—the stdout of an
        # external decompression process—we can't use that pipe directly
        # because xopen() always buffers the first block of the file into
        # Python.¹ Secondly, we want universal newline handling—so that
        # callers get behaviour consistent with the rest of Augur—and that
        # rules out splice(2).
        #
        # Copying the data thru Python instead of with splice(2) seems fast
        # enough in some quick trials with large files (e.g. against `zstd`
        # directly), and the bottlenecks in pipelines using this command will
        # often not be this command's i/o.
        # -trs, 11 July 2024, updated 24 July 2024
        #
        # ¹ <https://github.com/pycompression/xopen/blob/67651844/src/xopen/__init__.py#L374>
        copyfileobj(sys.stdin, f, BUFFER_SIZE)
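Both docstrings promise universal newline translation, which comes from opening text streams with `newline=None`. The sketch below shows that behaviour using only the standard library. `translate_newlines` is a hypothetical helper, and it uses in-memory streams in place of `open_file()`, stdin, and stdout, so it says nothing about compression.

```python
import io
import os


def translate_newlines(raw: bytes) -> bytes:
    """Round-trip bytes through text streams opened with newline=None."""
    # On input, newline=None turns \n, \r\n, and \r into "\n".
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
    text = reader.read()

    # On output, newline=None turns each "\n" into os.linesep.
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding="utf-8", newline=None)
    writer.write(text)
    writer.flush()
    return sink.getvalue()


mixed = b"a\nb\r\nc\rd\n"
native = translate_newlines(mixed)
expected = "a\nb\nc\nd\n".replace("\n", os.linesep).encode("utf-8")
```

Whatever mix of line endings goes in, every line in `native` ends with the platform's own separator, which is the consistency the commit message is after.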
7 changes: 7 additions & 0 deletions docs/api/developer/augur.read_file.rst
@@ -0,0 +1,7 @@
augur.read\_file module
=======================

.. automodule:: augur.read_file
   :members:
   :undoc-members:
   :show-inheritance:
2 changes: 2 additions & 0 deletions docs/api/developer/augur.rst
@@ -42,6 +42,7 @@ Submodules
augur.lbi
augur.mask
augur.parse
augur.read_file
augur.reconstruct_sequences
augur.refine
augur.sequence_traits
@@ -55,3 +56,4 @@ Submodules
augur.validate
augur.validate_export
augur.version
augur.write_file
7 changes: 7 additions & 0 deletions docs/api/developer/augur.write_file.rst
@@ -0,0 +1,7 @@
augur.write\_file module
========================

.. automodule:: augur.write_file
   :members:
   :undoc-members:
   :show-inheritance:
2 changes: 2 additions & 0 deletions docs/usage/cli/cli.rst
@@ -32,3 +32,5 @@ We're in the process of adding examples and more extensive documentation for eac
version
import
measurements
read-file
write-file
9 changes: 9 additions & 0 deletions docs/usage/cli/read-file.rst
@@ -0,0 +1,9 @@
===============
augur read-file
===============

.. argparse::
    :module: augur
    :func: make_parser
    :prog: augur
    :path: read-file
9 changes: 9 additions & 0 deletions docs/usage/cli/write-file.rst
@@ -0,0 +1,9 @@
================
augur write-file
================

.. argparse::
    :module: augur
    :func: make_parser
    :prog: augur
    :path: write-file
