Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit retrieve_illustration logic to prefer best favicons #372

Merged
merged 1 commit into from
Aug 7, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Aug 7, 2024

This PR replaces #355 with another approach: do not trust HTML to get favicon sizes + if user provided a favicon, simply use it (if it is a bad favicon ... it is bad ...).

Fix #352
Fix #369

Changes in logic around finding the ZIM illustration:

  • if user provided a favicon to use, we use it
  • instead of prefering to use WARC items (or prefering to download as it was before Should we remove the favicon download fallback #202), we prefer to use the most suited favicon.
  • potential favicons are sourced from main HTML page.
  • ll favicons are retrieved either from the WARC or downloaded to inspect their sizes.
  • we use the most suited one (i.e. 48x48 or bigger if possible or the biggest one).
  • we still fallback to default ZIM illustration if no favicon is found, to avoid loosing all time spent crawling the website.

Logic has been significantly revisited to not have to loop over all warc records since most potential favicons are not inside the WARC and this would waste significant time / resources on big crawls, so we prefer to store favicons on-the-fly in memory in initial gather_information_from_warc. The favicon is anyway used only few moments later, and we do not expect to have many huge favicons in memory.

Note that the fact that this change contradicts significantly with what has been discussed and decided in #202, since there is a significant chance we will download the illustration.

In #202, we said that there is probably no situation where the best icon is not already present inside the WARC and should be downloaded. This is wrong.

This change is grounded on a real use case: https://womenshistory.si.edu/. In this use case, we have only two WARC items fetched by the crawler:

Both are too small for a ZIM illustration and will need to be upscaled. This is not appropriate because scraper could know that best icon possible is https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/android-chrome-192x192.png, and this file is available for download. This is now what will happen with the change in this PR.

@benoit74 benoit74 self-assigned this Aug 7, 2024
@benoit74 benoit74 marked this pull request as ready for review August 7, 2024 08:12
@benoit74 benoit74 requested a review from rgaudin August 7, 2024 08:12
@benoit74
Copy link
Collaborator Author

benoit74 commented Aug 7, 2024

I just added missing CHANGELOG entry in last commit

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good 👍

src/warc2zim/icon_finder.py Outdated Show resolved Hide resolved
src/warc2zim/icon_finder.py Outdated Show resolved Hide resolved
provided a favicon to use.

Instead of prefering to use WARC items (or prefering to download as it
was before #202), we prefer to use the most suited favicon.

Potential favicons are sourced from main HTML page.

All favicons are retrieved either from the WARC or downloaded to inspect
their sizes.

We use the most suited one (i.e. 48x48 or bigger if possible or the
biggest one).

We still fallback to default ZIM illustration if no favicon is found, to
avoid loosing all time spent crawling the website.
@benoit74 benoit74 closed this Aug 7, 2024
@benoit74 benoit74 deleted the choose_icon2 branch August 7, 2024 12:18
@benoit74 benoit74 restored the choose_icon2 branch August 7, 2024 12:18
@benoit74 benoit74 reopened this Aug 7, 2024
@benoit74
Copy link
Collaborator Author

benoit74 commented Aug 7, 2024

Closed by mismanipulation (typo in my terminal)

@benoit74 benoit74 merged commit 6cad55b into main Aug 7, 2024
9 checks passed
@benoit74 benoit74 deleted the choose_icon2 branch August 7, 2024 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants