Revisit handling of special characters in ZIM / HTML URLs #218
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##           warc2zim2    #218      +/-   ##
=============================================
- Coverage      87.55%   85.98%   -1.57%
=============================================
  Files             13       13
  Lines            980     1049     +69
  Branches         179      195     +16
=============================================
+ Hits             858      902     +44
- Misses           102      116     +14
- Partials          20       31     +11
☔ View full report in Codecov by Sentry.
…ation to support encoded URL and query strings
@Jaifroid FYI if you are interested in testing very early-stage new zimit2 ZIMs
Just want to say... wow! That's a lot of work! 🎉
@benoit74 With a small adjustment PR (kiwix/kiwix-js-pwa#577), the Thales ZIM is now working well in the PWA. PDFs open, and offline video is working fine (this was a fear of mine, so congrats)! To test video on other readers just search for YouTube in the search bar. Will fix KJS with same small adjustment in due course. Hope to test the other ZIM soon. Many thanks!
@Jaifroid Thank you for the test and confirmation, and glad you found the YouTube video! And of course, glad that it is easy to support these in the KJS and PWA readers! Unfortunately we still have significant other issues to fix before zimit2 reaches an acceptable level, so you will get even more ZIMs to test in due time.
👏
Can you clarify the role of `self.indexed_urls` now that we have `self.existing_zim_paths`?
I've now tested the solidarité numérique ZIM mentioned in the first post of this PR, and it is definitely an improvement, though I confirm the two remaining bugs that were mentioned there. I don't think it's related, but I noticed that this ZIM, at least in Kiwix JS, the PWA and Kiwix Desktop on Windows, seems to have a character encoding issue (see screenshot, and look at any character that should have an accent). I don't know if this happened with the zimit1 version (I can't find a zimit1 version of this ZIM in either the zimit directory or the development library). If it is an error with zimit2 due to the assumption that all OpenZIM archives are UTF-8 encoded, then we may need a new issue to handle different character sets in zimit2?
The change in the expected value for relative paths disturbs me. Now the relative paths have an extra `../`, and changing that cannot be a small side effect.
I have the feeling that either the previous version or the new version can work, but not both. And the previous version was working pretty well (with relative paths at least)...
Very good question!
I think they both serve a different purpose, e.g. … I however find both very confusing and badly named (even the new … I propose to postpone this topic for a later PR, where I will rename these two lists, merge them with some additional status info and use them for detecting scraper issues.
Edit: I finally renamed … First comment updated.
This is also used to skip potential duplicated entries in the WARC. The
This rule was needed most probably only because of a trailing ? in some URLs
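To illustrate why a trailing `?` carries no meaning: it only introduces an empty query string, so the URL with and without it parse to the same components. The sketch below uses only the standard library; `strip_trailing_question_mark` is a hypothetical helper written for this illustration, not a function from warc2zim.

```python
from urllib.parse import urlsplit


def strip_trailing_question_mark(url: str) -> str:
    """Hypothetical helper: drop a '?' that introduces an empty query string."""
    return url[:-1] if url.endswith("?") else url


with_mark = "https://example.com/page?"
without_mark = "https://example.com/page"

# Both forms have an empty query component once parsed.
print(urlsplit(with_mark).query == urlsplit(without_mark).query == "")
print(strip_trailing_question_mark(with_mark))
```

A URL with a real query string such as `https://example.com/page?a=1` is left untouched by this helper.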
LGTM
I'm less sure about the last commit, but we can indeed re-add it later if needed. Time to move on with this PR.
@rgaudin may I merge or do you still need to review this? @mgautierfr I agree about uncertainties regarding last commit, but my tests are successful and I hate code which "might be useful but we are not sure anymore"
@benoit74 Many thanks for finalizing this! Do we perhaps need a new test ZIM (or ZIMs) based on the merged code? I need to ensure my reader-side code is doing the correct number of decode steps before extracting articles from the ZIM when handling links clicked by the user. Or else tell me if it is safe to test against the ZIMs in the first post of this PR.
I do not think we are yet at a stage where it is worth testing new ZIMs; we have many issues to address first (including the wombat.js configuration, which needs to be adapted as well). Rest assured I will create some in due time; we have all readers to test anyway, not only yours. I prefer to spend my time fixing everything that needs to be fixed, rather than creating new ZIMs and getting a lot of feedback to which I would probably too often respond "yes, I know, this is issue xxx".
OK, thanks, I understand! Note that I try to test issues I've found on Kiwix Serve and Kiwix Desktop, to corroborate, not just on the readers I'm responsible for. I'll await further advice.
Fix #206
Fix #210
Rationale
Following openzim/libzim#865 (comment) experiments, it seems now clear that:
Following webrecorder/browsertrix-crawler#492, it is also now clear that the URL found in a WARC record (WARC-Target-URI) is always URL-encoded.
This PR implements required changes to match these two "new" understandings.
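As a rough illustration of what an always URL-encoded WARC-Target-URI implies for path handling, the sketch below percent-decodes a target URI exactly once. It is a simplified assumption of the general idea, not warc2zim's actual `normalize` logic: the helper name `target_uri_to_zim_path`, dropping the scheme and fragment, and keeping the decoded query after a literal `?` are all choices made for this example.

```python
from urllib.parse import unquote, urlsplit


def target_uri_to_zim_path(target_uri: str) -> str:
    """Hypothetical helper: turn a URL-encoded target URI into a decoded path."""
    parts = urlsplit(target_uri)
    # Assumption: scheme and fragment are dropped, the host becomes the
    # first segment of the resulting path.
    zim_path = parts.netloc + unquote(parts.path)
    if parts.query:
        # Assumption: the query string is kept, percent-decoded once.
        zim_path += "?" + unquote(parts.query)
    return zim_path


print(target_uri_to_zim_path(
    "https://example.com/caf%C3%A9/page%20one.html?q=%C3%A9t%C3%A9"
))
```

Decoding only once matters: a path segment stored as `%2520` in the target URI becomes the literal text `%20`, not a space.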
Changes
Main changes are done with commit 202d6c9:
- … `normalize` function)
- `known_urls` argument / attribute is renamed to `existing_zim_paths` `expected_zim_items` (and same name is used in the whole codebase for clarity)
- `indexed_urls` attribute is renamed to `added_zim_items`
- … `reduce` method to `apply_fuzzy_rules` (convey way more meaning / less confusion)
- … `from_normalized` method to `get_document_uri`
- … `HttpUrl` and `ZimPath` classes so that it is now way clearer when we are dealing with a URL and when we are dealing with a Path
- … `apply_fuzzy_rules`, `get_document_uri` and `normalize`
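The benefit of separate `HttpUrl` and `ZimPath` types can be sketched as follows. These are minimal stand-in classes written for illustration and are not the classes introduced by the PR; the conversion function `to_zim_path` and its decoding rule are likewise assumptions.

```python
from urllib.parse import unquote, urlsplit


class HttpUrl:
    """Illustrative stand-in: a URL-encoded http(s) URL."""

    def __init__(self, value: str) -> None:
        if urlsplit(value).scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {value!r}")
        self.value = value


class ZimPath:
    """Illustrative stand-in: a decoded path inside the ZIM."""

    def __init__(self, value: str) -> None:
        self.value = value


def to_zim_path(url: HttpUrl) -> ZimPath:
    """Hypothetical conversion: only accepts an HttpUrl, always returns a ZimPath."""
    if not isinstance(url, HttpUrl):
        raise TypeError("expected an HttpUrl, got " + type(url).__name__)
    parts = urlsplit(url.value)
    # Assumption: the path is the host followed by the percent-decoded URL path.
    return ZimPath(parts.netloc + unquote(parts.path))


print(to_zim_path(HttpUrl("https://example.com/a%20b")).value)
```

With distinct types, passing an already-decoded path where a URL is expected fails immediately instead of being silently decoded a second time.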
It also includes smaller changes / fixes:
- … `?` does not provide any meaning, I suggest to remove it as well
- … `?` when present in a URL, so WARC records won't be present with trailing `?` as target URI
- … `resource` WARC records: 4068d85
- … `resource` WARC records for now #198 but work on this PR made me realize we have some code which was still executed, and even a test WARC archive

Test ZIMs