Skip to content

Commit

Permalink
Fix logic of rewrite mode computation for cases raised in #326
Browse files Browse the repository at this point in the history
  • Loading branch information
benoit74 committed Jun 27, 2024
1 parent 5ea9e97 commit f8a4081
Show file tree
Hide file tree
Showing 3 changed files with 19 additions and 3 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)

### Fixed

- Some resources rewrite mode are still not correctly identified (#326)

## [2.0.2] - 2024-06-18

### Added
Expand Down
12 changes: 12 additions & 0 deletions docs/technical_architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,18 @@ Computation of the ZIM path is hence mostly straightforward:
- decode the hostname which is puny-encoded
- decode the path and query parameter which might be url-encoded

## Rewriting documents

Some documents (HTML, CSS, JS and JSON for now) needs to be rewritten, e.g. to rewrite URLs, adapt some code to the ZIM context, ...

The first important step when processing a WARC entry to add it as a ZIM entry is hence to properly detect which kind of document we are dealing with.

This is done in the `get_rewrite_mode` function of the `Rewriter` class. Before 2.0.1, scraper was relying only on mimetype as returned in `Content-Type` HTTP response.

Unfortunately, this caused problems where some server are returning wrong information is this header, e.g. Cloudflare seems to frequently return `text/html` for woff2 fonts ; this causes the scraper to fail, because it is impossible to know in advance that we should ignore these errors, we could have a real document which should be rewriten but is failing.

Since 2.0.1, we've enriched the logic by using the new WARC header `WARC-Resource-Type` which contains the type of resources "as perceived by the browser" (from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType, see https://github.com/webrecorder/browsertrix-crawler/pull/481). Unfortunately this information is not sufficient because of some very generic value returned like `fetch` or `xhr`. Scraper stills need to mix this information with the mimetype. Ideally, we would have prefer to find a single source of truth not relying on something returned by the server, but it is not available for now (see https://github.com/openzim/warc2zim/issues/340 for a discussion on this topic).

### URL rewriting

In addition to the computation of the relative path from the current document URL to the URL to rewrite, URL rewriting also consists in computing the proper ZIM path (with same operation as above) and properly encoding it so that the resulting URL respects [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986). Some important stuff has to be noted in this encoding.
Expand Down
6 changes: 3 additions & 3 deletions src/warc2zim/content_rewriting/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ def get_rewrite_mode(self, record, mimetype):
def get_resourcetype_rewrite_mode(self, record, resourcetype, mimetype):
"""Get current record rewrite mode based on WARC-Resource-Type and mimetype"""

if resourcetype == "document" and mimetype == "text/html":
if resourcetype in ["document", "xhr"] and mimetype == "text/html":
# TODO : Handle header "Accept" == "application/json"
if getattr(record, "method", "GET") == "GET":
return "html"
Expand All @@ -164,12 +164,12 @@ def get_resourcetype_rewrite_mode(self, record, resourcetype, mimetype):
if resourcetype == "stylesheet":
return "css"

if resourcetype in ["script", "fetch", "xhr", "manifest"] and (
if resourcetype in ["script", "fetch", "other", "xhr", "manifest"] and (
mimetype == "application/json" or self.path.value.endswith(".json")
):
return "json"

if resourcetype in ["script", "xhr"] and mimetype in [
if resourcetype in ["script", "other", "xhr"] and mimetype in [
"text/javascript",
"application/javascript",
"application/x-javascript",
Expand Down

0 comments on commit f8a4081

Please sign in to comment.