Revisit retrieve_illustration logic to prefer best favicons #372
This PR replaces #355 with another approach: do not trust the HTML to get favicon sizes, and if the user provided a favicon, simply use it (if it is a bad favicon, then it is bad).
Fix #352
Fix #369
Changes in logic around finding the ZIM illustration:
The logic has been significantly revisited so that it no longer loops over all WARC records: most potential favicons are not inside the WARC, and looping would waste significant time and resources on big crawls. Instead, we store favicons in memory on the fly during the initial gather_information_from_warc step. The favicon is used only a few moments later anyway, and we do not expect to hold many huge favicons in memory. Note that this change significantly contradicts what was discussed and decided in #202, since there is now a significant chance we will download the illustration.
In #202, we said that there is probably no situation where the best icon is missing from the WARC and has to be downloaded. This is wrong.
This change is grounded in a real use case: https://womenshistory.si.edu/. In this use case, only two WARC items were fetched by the crawler:
Both are too small for a ZIM illustration and would need to be upscaled. This is not appropriate, because the scraper could know that the best possible icon is https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/android-chrome-192x192.png, and this file is available for download. With the change in this PR, the scraper now downloads and uses that file.
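
To illustrate the "do not trust HTML to get favicon sizes" idea, here is a minimal sketch that reads the real dimensions from the image bytes and picks the largest candidate. It is not the scraper's actual implementation: the names png_dimensions and pick_best_icon are hypothetical, only PNG is handled, and "largest wins" is an assumed selection rule.

```python
from __future__ import annotations

import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int] | None:
    """Return (width, height) from the PNG IHDR chunk, or None if not a PNG.

    Hypothetical helper: the size comes from the file itself, not from
    the sizes="..." attribute declared in the HTML.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def pick_best_icon(candidates: dict[str, bytes]) -> str | None:
    """Return the URL of the candidate with the largest measured size.

    Assumption: the smaller of width/height is the usable size of an icon,
    and the largest usable size is "best". Undecodable payloads are skipped.
    """
    sized = []
    for url, data in candidates.items():
        dims = png_dimensions(data)
        if dims is None:
            continue  # not a PNG we can measure, ignore this candidate
        sized.append((min(dims), url))
    if not sized:
        return None
    return max(sized)[1]
```

In the use case above, the candidates would be the two small icons found in the WARC plus the downloaded 192x192 file, and the last one would be selected because its measured size is the largest.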