All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.4.0).
- Upgrade to wombat 3.8.6 (#334)
- Fix wombat setup settings (especially `isSW`) (#293)
- Stop checking main entry processability when it is already found (#424)
- Upgrade to wombat 3.8.3 (#414)
- Enrich test website with img srcset situations (in preparation for #403)
- Upgrade dependencies, including wombat 3.8.2 (#407)
- HTML document can be retrieved as `fetch` resource type (#405)
- Upgrade dependencies, including wombat 3.8.0 (#386)
- New fuzzy-rule for cheatography.com (#342), der-postillon.com (#330), iranwire.com (#363)
- Properly rewrite redirect target url when present in HTML tag (#237)
- New `--encoding-aliases` argument to pass encoding/charset aliases (#331)
- Add support for SVG favicon (#148)
- Automatically index PDF content and use PDF title (#289 and #290)
- Upgrade to python-scraperlib 4.0.0
- Generate fuzzy rules tests in Python and Javascript (#284)
- Refactor HTML rewriter class to make it more open to change and expressive (#305)
- Detect charset in document header only for HTML documents (#331)
- Use `software` property from `warcinfo` record to set ZIM `Scraper` metadata (#357)
- Store `ContentDate` as metadata, based on `WARC-Date` (#358)
- Remove domain specific rules (#328)
- Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
- Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)
### Fixed
- Handle case where the redirect target is bad / unsupported (#332 and #356)
- Fixed WARC files handling order to follow creation order (#366)
- Remove subsequent slashes in URLs, both in Python and JS (#365)
- Ignore non HTTP(S) WARC records (#351)
- Fix `vimeo_cdn_fix` fuzzy rule for proper operation in Javascript (#348)
- Performance issue linked to new "extensible" HTML rewriting rules (#370)
- Moved rules definition from JSON to YAML and documented update process (#216)
- Upgrade to wombat.js 3.7.11
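
The slash clean-up mentioned above (#365) can be illustrated with a minimal sketch. This is not the project's actual code: the function name `collapse_slashes` is hypothetical, and it assumes that only the path component is normalized while the query string is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Assumption: only the path is normalized; scheme, host, query and
    # fragment are kept exactly as given.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(collapse_slashes("https://example.com//a///b?x=//y"))
```

The changelog notes that the same normalization is applied in both Python and JS, so that rewritten links and stored entry paths resolve to the same target.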
### Added
- Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
- Add debug log for items whose content is empty (#344)
- Some resources rewrite mode are still not correctly identified (#326)
- Add `--ignore-content-header-charsets` option to disable automatic retrieval of content charsets from content first bytes (#318)
- Add `--content-header-bytes-length` option to specify how many first bytes to consider when searching for content charsets in header (#320)
- Add `--ignore-http-header-charsets` option to disable automatic retrieval of content charsets from content HTTP `Content-Type` headers (#318)
- Simplify logic deciding content charset, stop guessing with chardet (#312)
- Rewrite only content with mimetype `text-html` when `WARC-Resource-Type` is `html` (#313)
- Add support for multiple languages in `--lang` CLI argument (#300)
- Use the new `WARC-Resource-Type` header to decide rewrite mode (when present in WARC) (#296)
- Upgrade Python dependencies + wombat.js 3.7.5
- Drop `integrity` attribute in HTML `<script>` and `<link>` tags (#298)
- Use automatic detection of content encoding also for JS, JSON and CSS files (#301)
- Set correct charset in HTML documents (#253)
- Allow to specify a scraper suffix for the ZIM scraper metadata at the CLI (#168)
- New test website to test many known situations supposed to be handled (#166)
- Replace Service Worker approach by scraper-side rewriting of static content (kiwix/overview#95)
- Adopted Python bootstrap conventions (#152)
- Upgrade dependencies, especially move to Python 3.12 (only) and zimscraperlib 3.3.2
- Change wording in logs about the return code 100 (which is not an error code)
- Added checks in `converter.py` to verify output directory existence, log appropriate error messages and exit cleanly if checks fail (#106)
- Added check for invalid zim file names (#232)
- Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)
- Code restructuration in preparation for 2.x
- Using wabac.js 2.16.11
- Using `cover` resize method for favicon to prevent issues with too-small ones
- Fixed direct link hack when inside an outer frame (kiwix-serve 3.5+) #119
- Using wabac.js 2.16.9
- Using scraperlib 3.1.1, openZIM metadata now always set, using default if missing
- Using wabac.js 2.16.6
- Using wabac.js 2.15.2
- Don't crash on failure to convert illustration (skip illus instead)
- Fixed 404 page (#96)
- Don't crash on missing Location headers on potential redirect
- Fixed incorrect ISO-639-3 --lang not replaced with `eng`
- Don't fall back to `eng` if the host doesn't have the matching locale
- Using wabac.js 2.15.0 with fix for scope conflict in SW/DB
- Payload entries now use original `text/html` mimetype instead of `text/html;raw=true`
- don't crash on icon link with no href
- Using wabac.js 2.12.0
- Prevent duplicate entries from failing (including illustrations)
- Fixed crash on HTTP 300 records (#94)
- Additional fuzzy matching rules for youtube and vimeo, and additional test cases
- Support for youtube videos, which require POST request handling to work.
- Support for canonicalizing POST request data into URL for fuzzy matching (using cdxj-indexer)
- Support loading custom sw.js from a local file path
- Updated zimscraperlib to 1.6 using libzim7.2
- Updated warcio to 1.7.4
- Added support for {period} replacement in --zim-file
- Using fixed MarkupSafe version (Jinja2 dependency)
- updated zimscraperlib (for libzim fix)
- don't crash on records without WARC-Target-URI
- fixed failure if url contains a fragment
- updated wabac.js to 2.7.3
- Added `--custom-css` option
- Added `--progress-file` option
- Update to wabac.js 2.1.6
- Favicon loading fixes: In topFrame.html, load favicon URL directly from ZIM A/ record, bypassing service worker H/ lookup.
- Supports 'fuzzy matching' with additional redirects added from normalized URL to exact URL
- Add fuzzy matching rules for youtube and '?timestamp' URLs
- Fix canonicalization where URLs that contain http/https were being incorrectly stripped (openzim/zimit#37)
- Accepts directory inputs as well as individual files. If a directory is given, all .warc and .warc.gz files in it are processed recursively.
- If the trailing slash is missing on the main URL, as in `--url https://example.com?test=value`, a slash is added and the URL is treated as `--url https://example.com/?test=value`
- Now defaults to including all URLs unless --include-domains is specified (removed `-a`)
- Arguments are now checked before starting. Also returns `100` on valid arguments but no WARC provided.
- Now skipping WARC records that redirect to self (http -> https mostly)
- Initial release
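
The main-URL handling noted above (a missing slash before the query string in `--url`) can be sketched as follows. This is an illustrative assumption, not the project's implementation: the helper name `ensure_root_slash` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_root_slash(url: str) -> str:
    """Add a '/' path when the URL has a host but an empty path (hypothetical helper)."""
    parts = urlsplit(url)
    # An empty path means the URL looks like https://host or https://host?query.
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(ensure_root_slash("https://example.com?test=value"))
```

URLs that already have a path are returned unchanged, so the helper is safe to apply to any main URL.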