Zimit2: Fix URL encoding of ZIM items #206
Just curious as to whether this is a situation we need to handle. Clearly there's a web-design issue on that site, in that they forgot to URI-encode the URLs; browsers are lax enough to work around that. The failure point could be anywhere: in the crawler, in the archiving software (Replay), in warc2zim, or even in the reader (Kiwix Serve or another reader)... We should probably double-check that the problem isn't in our readers.

EDIT: I see you say the PDFs are missing -- I presume you mean they never made their way into the ZIM at all?
ZIM has plenty of PDF:

Entry ❯ curl -I 'https://dev.library.kiwix.org/content/thalesdoc_en_all_2024-03/thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product%20Overview.pdf'
HTTP/2 404
Ah, so the issue may be reader-side. Oh no, yet another 2 GB file to download in order to test properly...
I confirm the PDFs are in the ZIM, but they can't be accessed in the PWA or the Browser Extension either. The failure seems to be because the backend URI-decodes the URL before requesting it from the ZIM, but unfortunately, as you can see from the screenshot, the PDF URLs are actually stored URI-encoded in the ZIM. The ZIM spec does say that all URLs should be stored unencoded (though I'm not sure where we stand on that wrt Zimit2), so I think that if these URLs were stored with literal spaces, the readers would work -- but it needs testing.
This kind of issue was rife with zimit1 -- see the first bullet point in #86 -- but I think it was masked by the Replay software. I noticed the inconsistencies because I was trying to handle reading ZIMs without the full Replay system. Now we don't have the sticking plaster of the exact same software being used to encode and to decode. It would be useful to establish whether zimit2 archives should, at least in principle, respect the OpenZIM spec on storing URLs unencoded, or whether we continue to consider them a "special case" as with zimit1. Query strings are a particular issue IMHO, especially as there is a known issue with Kiwix Serve not being able to handle certain raw characters in them.
I confirm this is a URI-encoding issue.
This is already specified (https://github.com/openzim/warc2zim/blob/warc2zim2/src/warc2zim/url_rewriting.py#L4-L46). At this time, I still don't know where the encoding comes from (warc2zim, the scraper, or the source content).
The path is URL-encoded in the WARC. Decoding it is pretty simple; the real question is: do we have to do it?
@mgautierfr URL paths had better not be encoded, IMHO. Consider that there are actually many different but valid ways to encode the very same URL.
I agree. My point is that we don't know whether a path is currently URL-encoded (and so whether we have to decode it). A path may contain a literal `%` sequence that was never meant as an encoding.
I agree that it is not possible to know whether a specific URL should be decoded or not. I struggled with this a lot in my first Zimit (1) implementation because, as I noted in #86, sometimes URLs were stored encoded and sometimes not, and I suppose it depended on some aspect of the encoding in the original website. One possibility that occurs to me is that the WARC software is in fact encoding all URLs that contain characters requiring it. It might be difficult to find a test case to check this. Some Chinese sites?
I will check browsertrix crawler's behavior and see what has to be done. It might be that we will have to discuss the point with them as well (not a problem).
My idea is that there may be a missing decoding step somewhere. NB: I have not looked at the code; this is merely a (wild) guess for now.
There is no reliable way to tell whether a path is already encoded. If it is not, we cannot decode it in warc2zim. URLs can be encoded and decoded as many times as wanted, but you cannot decode something which is not already encoded, or you will transform its value (and create other problems).
Yes, of course, I agree. My experience from zimit1 was mixed: sometimes URLs could only be extracted after decoding, and sometimes only when asking for the encoded version. I never got to the bottom of why... (I actually opted to try the encoded form first and then the unencoded one in my original zimit1 support -- very clunky!)
Having had a look at the browsertrix crawler codebase and behavior, as well as browser behaviors, I confirm that the crawler does not seem to encode any URL on its own, but browsers definitely do. I asked a question at webrecorder/browsertrix-crawler#492 to get confirmation from the Webrecorder folks. The browser processes the URL found in the HTML code before issuing the request: for example at https://tmp.kiwix.org/ci/test-website/url-encoding.html, the first image's URL is encoded by the browser before the request is made. So it looks like the path is mostly URL-encoded, except for some special characters. On https://tmp.kiwix.org/ci/test-website/url-encoding.html, this meant we got two WARC records for the 3 images.
Side-note: on tmp.kiwix.org, the server always serves the file named without URL encoding (i.e. it decodes the URL before searching for a file).

Back to warc2zim: I think the change needed in the scraper is not exactly straightforward.

### Solution 1

For every URL, properly (re-)URL-encode it so that we have a canonical encoded form. How to properly URL-encode is still a bit of a mystery: it looks like we must first split the URL on every reserved character of RFC 3986 and then URL-encode each part on its own, keeping the reserved characters as-is. This also means that while we have to create two ZIM entries (one per WARC record above), we are supposed to store their paths without URL encoding to match the ZIM specification. So the paths of both ZIM entries would be identical... and we would hence have a conflict.

### Solution 2

Another solution would be to consider that this is too complex, that the risk of collision illustrated at https://tmp.kiwix.org/ci/test-website/url-encoding.html is both extremely rare AND usually leads to the same content, and that we still do not know how to store such collisions in the ZIM anyway. This would mean we can directly decode the path found in the WARC. For the rare collision situations, we can search for them before anything else in the scraper and then check whether the content is really distinct, in which case the scraper raises an error and stops, so we will know about it. Since we do not expect this to happen, it would probably not be a concern.

### Conclusion

My preference for now is to go for solution 2. While not totally exact (collisions could theoretically happen), it is both the simplest move and the one more in line with the ZIM specification, which expects paths without URL encoding. Feedback on this is welcome!
Solution 2 would still need to decode properly according to the RFC (i.e. part by part). We've seen that already -- on WikiHow, maybe?
Why do you need to decode part by part in solution 2? Since the URL is already encoded, every percent sequence is an encoding and can be decoded in a single pass.
Just to add my tuppence: in JavaScript at least (and I know that may not be the context in which decoding would be done in warc2zim2), there are two decoding functions, `decodeURI` and `decodeURIComponent`. The former does not decode encoded characters that are part of the URI syntax (e.g. `;`, `/`, `?`, `:`, `@`, `&`, `=`, `+`, `$`, `#`). The latter decodes everything, and is meant to be applied to the parts between those separators (the components of the URL).
Which to use would depend on how a URL was encoded in the first place. Component separators normally wouldn't be in encoded form unless they've been "wrongly" encoded. A raw ampersand should always be part of the querystring and should never appear elsewhere in an encoded URL (and it should never be encoded when it is acting as a querystring separator).
Sorry for the further post, but I presume we need to be 100% clear on whether we want to store valid URLs in the ZIM, or fully decoded strings that are not in fact valid URLs (which doesn't matter, so long as the correct string is presented to the backend to locate the entry in the ZIM). And do we want to treat the querystring differently from the rest of the URL? In the end, unless we want to do a lot of work adapting the readers, we are forced to store in the ZIM a form that current readers (or libzim?) can transparently derive from a given URL.
You are right: we need to decode only the path, not the query, and probably not the fragments. Thank you!
Never mind -- fragments are ignored when computing the entry path anyway, so we just need to decode the path.
I wonder what we should do regarding hostnames, which are stored in punycode in the WARC (because the browser requests them in punycode). This is definitely an edge case, but it needs to be tackled as well, and there are several complications.
Another concern I have is that we say ZIM item paths should not be URL-encoded but should match RFC 1738: https://wiki.openzim.org/wiki/ZIM_file_format#URL_Encoding

This is a contradiction, because as far as I've understood, UTF-8 characters are not allowed in HTTP URLs according to RFC 1738. There is something I don't get.
This is what I meant when I said we need to be 100% clear whether we are storing valid URLs, or simply storing ZIM "URLs", which are strings that are not valid URLs. The difference is that traditionally, a ZIM URL would have fully decoded characters -- for example, a title containing a space would be stored with the literal space.

As I understand it, the proposal is to revert to ZIM URLs, which means we store unencoded. It doesn't matter, of course, whether what is stored is a valid URL or not; it just has to be a string pointer to the asset that we want to extract from the ZIM. The point is that the readers, as they currently stand (assuming we don't want to modify them), must be able to derive the stored string from a given URL.

EDIT: mwOffliner uses underscores rather than spaces, but I think this merely follows the way Wikipedia URLs represent article titles. In practice, as we've seen, we rarely encounter spaces in a URL, either encoded or unencoded.

EDIT2: Querystrings pose specific problems, because existing readers may not know how to deal with them, and probably don't decode them from, say, a URL hyperlink (when the user clicks on one). But this needs empirical testing (or perhaps all the readers based on libzim / libkiwix have a standard behaviour).
OK, but then why does the ZIM specification speak about "URL"? And why are we referring to RFC 1738? Either the word "URL" is badly chosen (I can live with that) and we stop referring to RFC 1738, or there is something I still don't get. If we admit that, it means zimit has to handle the different parts of the URL in different ways.
This seems a bit weird (lots of different ways to handle things), but it is probably the simplest (only?) solution if we do not want to modify all the readers!
@benoit74 I would recommend here an empirical, test-driven approach.
@benoit74 My cautions about querystring handling are not based on testing in all the readers, just on what I had to do to make it work in KJS / PWA. An empirical, test-driven solution focusing on libzim's handling sounds good. |
### Summary of where we're at, and why it's a Catch-22 situation IMHO

What we found out is in openzim/libzim#865, and it makes this a Catch-22 situation to deal with.
This is not hypothetical. There is no way I can code a workaround for this situation that is guaranteed to work in all cases, and it is not possible either for Kiwix Serve to deal, with 100% certainty, with percent-encoded URLs that contain encoded spaces or legitimately encoded question marks. I see only two solutions:

1. keep storing percent-encoded URLs in Zimit ZIMs, and document that as an exception in the spec;
2. store fully decoded paths, as the spec currently requires.
Personally, I think there are probably too many unknowns and risks to do 2 at this late stage, not least the huge querystrings that control video. This drives me to the conclusion that 1 is what is needed, and that you should additionally alter the OpenZIM spec to say that Zimit ZIMs (zimit1 and zimit2) store RFC 1738-conformant URLs with percent-encoding. I think this is de facto the case.
Zimit2 querystrings are currently over-encoded and included in the title. But this may not always be the case. See openzim/warc2zim#206.
Solution 1 is a no-go for me in the short term, because it probably means updating all readers, not only libkiwix (you can maybe easily update KJS, but we cannot easily update all the readers in the wild; we are an offline service provider and cannot assume readers are regularly updated). And it does not solve the core of this issue: querystrings are not the problem in the Thales ZIM; it is URLs with special characters that are not working. Any item with a special character is simply not working currently.

I'm close to having implemented something like solution 2. We will discuss this in the corresponding PR, and you will have a ZIM to test soon (by the end of the day, hopefully).
I understand and thanks for your hard work! No rush from my end. I'll be happy to test.
In my draft PR I got the Thales PDFs working in KJS by re-encoding the (decoded) URL before looking it up.
I totally agree with solution 2. Having two records in the WARC seems to be a "bug" to me (not a real bug from the WARC point of view, as a WARC stores a visit of a website, not the website itself). We should store one record and only one. If we know that the browser always URL-encodes, then we should always URL-decode.
I think your example is wrong. In this case, the "real" path is the decoded one.
Thank you -- I'm perfectly happy with solution 2 and can work with it. It respects the spec. I was just getting nervous about how it would work in practice with zimit2 ZIMs (I presume we don't touch zimit1).
For those testing the solidarité numérique ZIM in #218: I found an embedded video in the article "Utiliser le Décodex pour vérifier les fausses informations ou fake news" (it's the tile next to "Comprendre les cookies"). The video works fine in the PWA (v3.1.5), but it's not working in Kiwix Desktop 2.3.1 (on Windows). I'm posting it in the issue rather than in the PR because I don't think it's related to the specific PR, but it could be related to encoding of querystrings. @benoit74 I can't remember where we're at with video in zimit2, and whether this would be a regression or not.

EDIT: The video is also not working in the latest nightly (27-03-24) AppImage version on Linux (WSL).
Solved by #218 |
Original page:
https://thalesdocs.com/gphsm/luna/7/docs/network/Content/Home_Luna.htm
ZIM page:
https://dev.library.kiwix.org/viewer#thalesdoc_en_all_2024-03/thalesdocs.com/gphsm/luna/7/docs/network/Content/Home_Luna.htm
In the ZIM, all the PDFs are missing, although there is nothing special to remark about them from my POV: the links are straight links to PDF files on the same mirrored domain.
Remark: PDF filenames have spaces!