Choose proper format to download youtube videos #356

Closed
benoit74 opened this issue Oct 8, 2024 · 4 comments · Fixed by #373
@benoit74
Collaborator

benoit74 commented Oct 8, 2024

In #355, I've changed the format setting passed to yt-dlp to select proper video streams.

The "historical" value before this change was best[ext={vidext}]/bestvideo[ext={vidext}]+bestaudio[ext={audext}]/best

where {vidext} and {audext} are computed from the format chosen on the youtube2zim CLI:

  • if --format mp4 is used, {vidext} is mp4 and {audext} is m4a
  • if --format webm is used, {vidext} is webm and {audext} is webm
  • no other --format setting is supported
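As a rough illustration of the mapping described above, here is a minimal sketch of how the historical selector could be built from the --format value. The function and constant names are hypothetical and not taken from the youtube2zim code base; only the selector string and the extension mapping come from this issue.

```python
# Hypothetical helper (not the actual youtube2zim implementation).
# Maps the --format value to the audio extension described in this issue.
AUDIO_EXT_FOR_FORMAT = {"mp4": "m4a", "webm": "webm"}


def historical_selector(video_format: str) -> str:
    """Build the pre-#355 yt-dlp format selector for a given --format value."""
    if video_format not in AUDIO_EXT_FOR_FORMAT:
        # No other --format setting is supported.
        raise ValueError(f"unsupported format: {video_format}")
    vidext = video_format
    audext = AUDIO_EXT_FOR_FORMAT[video_format]
    return (
        f"best[ext={vidext}]"
        f"/bestvideo[ext={vidext}]+bestaudio[ext={audext}]"
        "/best"
    )


if __name__ == "__main__":
    print(historical_selector("webm"))
    print(historical_selector("mp4"))
```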

This format selector does not work because in some cases yt-dlp does not discover any webm format (we basically always use --format webm), so neither best[ext={vidext}] nor bestvideo[ext={vidext}]+bestaudio[ext={audext}] matches. The fallback to best does not work either, because there is no stream containing both audio and video: platforms (and especially Youtube) now tend to offer separate audio-only and video-only streams, since players are now widely capable of combining the two streams on the fly.

This format was hence "buggy" in the sense that it failed to download the video while it was in fact quite possible to find a good one.

I changed the format in the mentioned PR to bestvideo*[ext={vidext}]+bestaudio[ext={audext}]/bestvideo*+bestaudio/best.
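To make the fallback order of this selector easier to read, the sketch below splits it into its alternatives. In yt-dlp's selector syntax, "/" separates alternatives tried from left to right, "+" merges a video and an audio format, and bestvideo* is the best format that contains video (it may also contain audio). The naive split is an assumption that only holds for selectors like this one, with no parentheses or grouping; the function name is hypothetical.

```python
def split_alternatives(selector: str) -> list[str]:
    """Naively split a yt-dlp format selector into its '/' alternatives.

    Only valid for simple selectors without parentheses or grouping.
    """
    return selector.split("/")


# Selector introduced in #355, shown here for --format webm.
pr_selector = (
    "bestvideo*[ext={vidext}]+bestaudio[ext={audext}]"
    "/bestvideo*+bestaudio/best"
).format(vidext="webm", audext="webm")

if __name__ == "__main__":
    for rank, alternative in enumerate(split_alternatives(pr_selector), start=1):
        print(rank, alternative)
```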

See #351 for some discussion around all this.

I feel like this setting is still not the most appropriate one because in many cases Youtube (at least) seems to offer few webm streams and to favor mp4 instead (see #351 (comment)).

Let's take Youtube video 7_N0yozUnWY as an example:

yt-dlp --list-formats https://www.youtube.com/watch\?v\=7_N0yozUnWY
[youtube] Extracting URL: https://www.youtube.com/watch?v=7_N0yozUnWY
[youtube] 7_N0yozUnWY: Downloading webpage
[youtube] 7_N0yozUnWY: Downloading ios player API JSON
[youtube] 7_N0yozUnWY: Downloading web creator player API JSON
[youtube] 7_N0yozUnWY: Downloading m3u8 information
[info] Available formats for 7_N0yozUnWY:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC          VBR ACODEC      ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb2 mhtml 48x27        0    │                  mhtml │ images                                  storyboard
sb1 mhtml 80x45        0    │                  mhtml │ images                                  storyboard
sb0 mhtml 160x90       0    │                  mhtml │ images                                  storyboard
233 mp4   audio only        │                  m3u8  │ audio only          unknown             [fr] Default
234 mp4   audio only        │                  m3u8  │ audio only          unknown             [fr] Default
139 m4a   audio only      2 │    4.63MiB   49k https │ audio only          mp4a.40.5   49k 22k [fr] low, m4a_dash
140 m4a   audio only      2 │   12.28MiB  129k https │ audio only          mp4a.40.2  129k 44k [fr] medium, m4a_dash
251 webm  audio only      2 │    9.84MiB  104k https │ audio only          opus       104k 48k [fr] medium, webm_dash
269 mp4   256x144     25    │ ~ 11.27MiB  119k m3u8  │ avc1.4D400C    119k video only
160 mp4   256x144     25    │    3.56MiB   38k https │ avc1.4D400C     38k video only          144p, mp4_dash
230 mp4   640x360     25    │ ~ 38.99MiB  411k m3u8  │ avc1.4D401E    411k video only
134 mp4   640x360     25    │   14.17MiB  150k https │ avc1.4D401E    150k video only          360p, mp4_dash
18  mp4   640x360     25  2 │ ≈ 26.38MiB  278k https │ avc1.42001E         mp4a.40.2       44k [fr] 360p
605 mp4   640x360     25    │ ~ 28.83MiB  304k m3u8  │ vp09.00.21.08  304k video only
243 webm  640x360     25    │    8.90MiB   94k https │ vp9             94k video only          360p, webm_dash
232 mp4   1280x720    25    │ ~109.26MiB 1153k m3u8  │ avc1.64001F   1153k video only
136 mp4   1280x720    25    │   39.45MiB  416k https │ avc1.64001F    416k video only          720p, mp4_dash
270 mp4   1920x1080   25    │ ~172.02MiB 1815k m3u8  │ avc1.640028   1815k video only
137 mp4   1920x1080   25    │   69.28MiB  731k https │ avc1.640028    731k video only          1080p, mp4_dash

The only webm streams available are:

  • stream 243 for video at 640x360, video bitrate 94k
  • stream 251 for audio, audio bitrate 104k

While we have much better mp4 streams on this video:

  • stream 137 for video at 1920x1080, video bitrate 731k (I don't think we can select the m3u8 stream 270, which is even better; if I'm not mistaken we only use the http protocol. Either way, the point stands.)
  • stream 140 for audio, audio bitrate 129k
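The comparison above can be reproduced with a small sketch over the video-only https streams copied from the listing. The ranking used here (highest resolution, then highest video bitrate) is a deliberate simplification and not yt-dlp's actual sorting logic; the function name is hypothetical.

```python
# Video-only https streams copied from the --list-formats output above.
# vbr is the video bitrate in kbit/s.
VIDEO_STREAMS = [
    {"id": "160", "ext": "mp4", "height": 144, "vbr": 38},
    {"id": "134", "ext": "mp4", "height": 360, "vbr": 150},
    {"id": "243", "ext": "webm", "height": 360, "vbr": 94},
    {"id": "136", "ext": "mp4", "height": 720, "vbr": 416},
    {"id": "137", "ext": "mp4", "height": 1080, "vbr": 731},
]


def pick_best(streams, ext=None):
    """Return the best stream, optionally restricted to one extension.

    Simplified ranking: resolution first, then video bitrate.
    Returns None when no stream matches the extension filter.
    """
    candidates = [s for s in streams if ext is None or s["ext"] == ext]
    if not candidates:
        return None
    return max(candidates, key=lambda s: (s["height"], s["vbr"]))


if __name__ == "__main__":
    print("webm only:", pick_best(VIDEO_STREAMS, ext="webm")["id"])  # 243
    print("any ext:  ", pick_best(VIDEO_STREAMS)["id"])  # 137
```

Restricting the choice to webm caps this video at 360p, while the unrestricted choice reaches 1080p.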

As far as I understand, the original idea was to select the best stream possible and then reencode it to our preset, so that we have the best chance of not losing quality by reencoding a limited-quality video into another limited-quality video. While favoring our "preferred" output format might have helped a bit in the past, I feel like it is now causing more harm than good.

I propose to change this format setting to bestvideo*+bestaudio/best (essentially the yt-dlp default, according to its documentation).
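For reference, here is a minimal sketch of how the proposed selector could be passed to yt-dlp through its Python API. The "format" and "outtmpl" keys are existing yt_dlp.YoutubeDL options; the function names and the output template are assumptions for illustration, not the youtube2zim implementation.

```python
PROPOSED_FORMAT = "bestvideo*+bestaudio/best"


def build_ydl_opts(output_dir: str) -> dict:
    """Build a yt-dlp options dict using the proposed format selector."""
    return {
        "format": PROPOSED_FORMAT,
        # Hypothetical output template: one file per video id.
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
    }


def download(url: str, output_dir: str) -> int:
    """Download one video; requires the third-party yt-dlp package."""
    import yt_dlp  # imported lazily so the options can be built without it

    with yt_dlp.YoutubeDL(build_ydl_opts(output_dir)) as ydl:
        return ydl.download([url])


if __name__ == "__main__":
    print(build_ydl_opts("videos"))
```

The downloaded file would then be reencoded to the target preset, as discussed in this issue.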

@rgaudin
Member

rgaudin commented Oct 8, 2024

I agree; the previous setting was meant to save reencoding for those not using --low-quality, and it was working fine.
Given that we always reencode, this path becomes the exception, and choosing the best stream and then reencoding seems more appropriate.

@benoit74 benoit74 modified the milestones: 3.2.0, 3.3.0, 3.2.1 Oct 11, 2024
@benoit74
Collaborator Author

We were indeed not always reencoding. I find this very confusing, especially since we have the high-quality setting in zimscraperlib. I propose to really always reencode, to avoid "strange unexpected behaviors", even if it means a bit more processing in some cases. I will open a PR for this.

@rgaudin
Member

rgaudin commented Nov 1, 2024

I agree

@rgaudin
Member

rgaudin commented Nov 1, 2024

I think it was left as a quick dev path before we implemented the S3 cache.
