- Fix typos discovered by codespell by @cclauss in #160
- Migrate to GitHub Actions CI and resolve dependency issues by @tw4l in #164
- Add very simple test for version argument and use importlib feature instead of deprecated pkg_resources for version by @white-gecko in #173
- Run pytest directly. setup.py test was removed in setuptools 72. by @white-gecko in #172
- Update codecov/codecov-action from v1 to v4 by @white-gecko in #171
- Remove superfluous ci step by @white-gecko in #174
- Test python 3.12 by @white-gecko in #175
- chore: finish py3.12 by @wumpus in #176
- feat: add darwin and windows CI by @wumpus in #178
- doc: document how to use brotli; test brotli by @wumpus in #179
- feat: test old ubuntu version by @wumpus in #180
- feat: try py 3.13, plus typos by @wumpus in #184
- Handle deprecation of naive datetime functions like utcnow() by @tw4l in #185
- bump version to 1.7.5 by @ikreymer in #186
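
The `utcnow()` deprecation noted above is commonly handled with the pattern below. This is a minimal sketch using only the standard library, not the code from #185; the helper name `utc_now_naive` is hypothetical.

```python
from datetime import datetime, timezone


def utc_now_naive():
    """Return the current UTC time as a naive datetime without utcnow()."""
    # datetime.utcnow() is deprecated as of Python 3.12. Build a
    # timezone-aware value first, then drop tzinfo only if the caller
    # still needs a naive datetime.
    return datetime.now(timezone.utc).replace(tzinfo=None)


aware = datetime.now(timezone.utc)  # preferred: timezone-aware UTC value
naive = utc_now_naive()             # drop-in for code expecting naive UTC
```

Keeping the aware value where possible avoids ambiguity about which timezone a timestamp refers to.
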
- `capture_http` support for chunk-encoded requests #116
- indexer: option to enable `verify_http` #116
- Enable writing block digests for warcinfo records #115
- Fix documentation for capture_http filter_records #110
- Fix capture_http with http and https proxies #113
- Ensure 1.1 revisit profile used with WARC/1.1 revisits #96
- Include record offsets in `warcio check` output #98
- CI fix for python 2.7, use jinja<3.0.0 (#105)
- Fix in `StatusAndHeaders` when writing, then reading record #106
- Fix issues related to http header re-encoding, ensure correct content-length and %-encoding #106, #107
- Windows fixes: Fix reading from stdin, ensure all WARCs/ARCs are treated as binary #86
- Fix `ensure_digest(block=True)` breaking on an existing record, RecordBuilder supports `header_filter` #85
- Docs and Misc Cleanup: add docs for `extract` tool, correct doc for `get_statuscode()`, move all CLI tools to separate modules for better reusability.
- Support indexing a WARC read from stdin #79
- Automatically %-encode urls that have a space in `WARC-Target-URI` #80
- Separate record creation into `RecordBuilder` class to allow building WARC records without a `WARCWriter`, which now derives from `RecordBuilder` #63
- Support the ability to optionally check ARC/WARC record's block and payload digests #54, #58, #68, #77
  - Creation of `ArchiveIterator` and `ArcWarcRecordLoader` now accepts a `check_digests` boolean keyword argument indicating if each record's digest should be checked, defaults to `False`
  - Core digest checking functionality is provided by `DigestChecker` and `DigestVerifyingReader`, importable from warcio.digestverifyingreader
  - New block and payload digest checking utility class, `Checker`, has been added and is importable from warcio.checker
  - The CLI has been updated to provide `warcio check`, a command for performing block and payload digest checking
- Ensured that ARCHeadersParser's splitting on spaces does not split any spaces in URIs #62
- Move the `compute_headers_buffer` method and `headers_buff` property to `StatusAndHeaders` and fix incorrect digests in some test WARCs #67
- Ensured that the `BaseWARCWriter` does not use a mutable default value for the `warc_header_dict` keyword argument #70
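
The mutable-default pitfall behind #70 can be illustrated with a small sketch. `WriterSketch` is a hypothetical stand-in, not warcio's `BaseWARCWriter`; only the `warc_header_dict` argument name is taken from the entry above.

```python
class WriterSketch:
    """Hypothetical stand-in for a writer class (not warcio's actual code)."""

    def __init__(self, warc_header_dict=None):
        # A literal {} default would be created once at definition time
        # and shared by every instance. Using None and copying into a
        # fresh dict gives each instance its own headers.
        self.warc_header_dict = dict(warc_header_dict or {})


first = WriterSketch()
first.warc_header_dict["WARC-Filename"] = "a.warc.gz"
second = WriterSketch()  # must not see the header set on `first`
```

Copying the caller's dict also means later changes to the caller's object do not leak into the writer.
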
- Make `warcio recompress` more robust in fixing improperly compressed WARCs, --verbose mode for printing results #52
- BufferedReader supports streaming all members of multi-member gzip file with `read_all_members=True` option.
- Ensure any non-ascii data in http headers is %-encoded, even if non-conformant to RFC 8187 #51
- Fixes for `warcio.utils.open()` not opening files in binary mode in Python 2.7 on Windows #49
- `capture_http()` various fixes and improvements, default writer, `WARC-IP-Address` header support #50
- Support WARC/1.1 standard WARC records, reading #39 and writing #46 with microsecond precision `WARC-Date`
- Support simplified semantics for capturing http traffic to a WARC #43
- Support parsing incorrect wget 1.19 WARCs with angle brackets, e.g. `WARC-Target-URI: <uri>` #42
- Correct encoding of non-ascii HTTP headers per RFC 8187 #45
- New Util Added: `warcio.utils.open` provides exclusive creation mode `open(..., 'x')` for Python 2.7
- ArchiveIterator calls new `close_decompressor()` function in BufferedReader instead of `close()` to only close decompressor, not underlying stream. #35
- Write any errors during decompression to stderr #31
- `to_native_str()` returns original value unchanged if not a string/bytes type
- `WarcWriter.create_visit_record()` accepts additional WARC headers dictionary
- `ArchiveIterator.close()` added which calls `decompressor.flush()` to address possible issues in #34
- Switch `Warc-Record-ID` uuid creation to `uuid4()` from `uuid1()`
- remove `test/data` from wheel build, as it breaks latest setuptools wheel installation
- add `Content-Length` when adding `Content-Range` via `StatusAndHeaders.add_range` #29
- new extract cli command #26 (by @nlevitt)
- fix for writing WARC record with no content-type #27 (by @thomaspreece)
- better verification of chunk header before attempting to de-chunk with ChunkedDataReader
- MANIFEST.in added (by @pmlandwehr)
- Indexing API improvements:
  - Indexer class moved to `indexer.py` and all aspects of indexing process can be extended.
  - Support for accessing http headers with `http:`-prefixed fields #22
  - Special fields: `filename` field and `http:status`
  - JSON `offset` and `length` fields returned as strings for consistency.
- `ArchiveIterator` API: add `get_record_offset()` and `get_record_length()` to return current offset/length, iterator now tracks current record
- `StatusAndHeaders` accepts headers in more flexible formats (mapping, byte or string) and normalizes to string tuples #19
- Continuous read for more data to decompress (introduced in 1.3.2 for brotli decomp) should only happen if no unused data remaining. Otherwise, likely at gzip member end.
- Set default read `block_size` to 16384, ensure `block_size` is never None (caused an issue in py2.7)
- Fixes issues with BufferedReader returning empty response due to brotli decompressor requiring additional data, for more details see: #21
- Fixes #15, including:
  - `WARCWriter.create_warc_record()` works correctly when specifying a payload with no length param.
  - Writing DNS records now works (tests included).
  - HTTP headers only expected for writing `request`, `response` records if the URI has a `http:` or `https:` scheme (consistent with reading).
- Support for reading "streaming" WARC records, with no `Content-Length` set. `Content-Length` and digests computed as expected when the record is written.
- Additional tests for streaming WARC records, loading HTTP headers+payload from buffer, POST request record, arc2warc conversion.
- `recompress` command now parses records fully and generates correct block and payload digests.
- `WARCWriter.writer.create_record_from_stream()` removed, redundant with `ArcWarcRecordLoader()`
- Support for special field `offset` to include WARC record offset when indexing (by @nlevitt, #4)
- `ArchiveIterator` supports full iterator semantics
- WARC headers encoded/decoded as UTF-8, with fallback to ISO-8859-1 (see #6, #7)
- `ArchiveIterator`, `StatusAndHeaders` and `WARCWriter` now available from package root (by @nlevitt, #10)
- `StatusAndHeaders` supports dict-like API (by @nlevitt, #11)
- When reading, http headers never added by default, unless `ensure_http_headers=True` is set (see #12, #13)
- All tests run on Windows, CI using Appveyor
- Additional tests for writing/reading resource, metadata records
- `warcio -V` now outputs current version.
- Header filtering: support filtering via custom header function, instead of an exclusion list
- Add tests for invalid data passed to `recompress`, remove unused code
Initial Release!