
Add web archive as a final fallback [WIP] #40

Open
shinji257 wants to merge 5 commits into dev

Conversation

@shinji257 shinji257 commented Oct 24, 2024

This implements the web archive as a potential fallback for when a gallery is no longer available during a batch download. It does not allow downloading removed galleries unless the image metadata is already in your local database, or you add that data yourself.

You may want to wait to merge this (if at all) until the web archive is actually stable; it seems to go up and down. I added a warn line right before the image download that indicates the copy is being fetched from the Internet Archive. That's mostly for myself, since I'll be using this code and want to make sure the images are pulled properly.

This is for #39

@shinji257 shinji257 changed the title from "Add web archive as a final fallback (for https://github.com/9-FS/nhentai_archivist/issues/39)" to "Add web archive as a final fallback" Oct 24, 2024
This implements the web archive as a potential fallback for when the archive is no longer available during a batch download.  This does not allow for downloading removed archives if you do not already have the image metadata in your local database unless you add that data yourself.
Add optional ARCHIVE_ORG config to enable this functionality. If the image does not exist on the Internet Archive you will still get a 404; if the site is down you will get some other error. The functionality is disabled by default.
@shinji257
Author

I'm wrapping the code in a config option so it doesn't add any extra stress to the server right now. The code exists but it is disabled by default; it checks for the option ARCHIVE_ORG to be set to true. There might be more graceful ways to implement this, but this is how I've done it.
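
Roughly, the opt-in flag boils down to something like this (just a sketch assuming a serde-based config struct; everything except the ARCHIVE_ORG name is illustrative):

use serde::Deserialize;

#[derive(Deserialize)]
pub struct Config
{
    #[serde(default)] // missing or unset parses as false, so the fallback stays disabled by default
    pub archive_org: bool,
    // ...all the existing settings...
}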

Moved the pull so it isn't part of the main loop. I duplicated the download image function, removing only the server loop and replacing the image download part with one that rewrites the URL and attempts to download the web archive version. The previous version of the code tried to download from the archive 5 times (once for each attempt pass) and I didn't realize this at first. This one only does it once, and only after the 5 tries with the regular servers have been exhausted. Ideally I'd like to reduce the code duplication (or eliminate it outright) and have it abort the archive attempt on the first error. At least it is better about it now.
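
For reference, the rewrite itself is only a prefix. A sketch of the idea (the function name is made up, and the "redirect to the newest snapshot" behaviour of the Wayback Machine is my assumption):

// Point an image URL at the Wayback Machine: https://web.archive.org/web/<original URL>
// redirects to the most recent capture if one exists and returns a 404 otherwise.
fn to_web_archive_url(original_url: &str) -> String
{
    format!("https://web.archive.org/web/{original_url}")
}
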
@shinji257 shinji257 changed the title from "Add web archive as a final fallback" to "Add web archive as a final fallback [WIP]" Oct 27, 2024
@shinji257
Author

shinji257 commented Oct 27, 2024

Moved the web archive code out of the main download loop. This fixes the following items:

  • Download failure warnings for the main servers are restored (my original code broke them and I didn't notice at first).
  • It no longer tries to pull from the web archive 5 times; I didn't notice this before, and it was probably spamming the server pretty badly.

The current patch duplicates the download_image function as an archive_image function, modified slightly to point to the web archive. It is only called if the main download loop fails (it uses image_download_success for the check, like the full failure message does, then resets it to true for its own loop) and if webarchive is set to true in the config. Otherwise it should get skipped.

I'll try to see about cleaning up the code so it doesn't duplicate, but I'm not sure of the approach. I'd also like to have the archive attempt hard fail on the first error; there's no point in going further because the archive copy will be incomplete at that point. A person can try again later if the issue is archive.org being down, but if you got another 404, chances are it won't be fixed in the future, since a 404 means the Internet Archive has not archived that specific file or page.
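
To make the intended control flow concrete, here is a standalone sketch (names are illustrative, the downloader is passed in as a closure so the snippet compiles on its own, and the early abort is the desired behaviour rather than what the current patch does):

// Runs only after the normal servers have already failed, tries each image once
// from the web archive, and gives up at the first error.
fn archive_fallback<F>(image_urls: &[String], webarchive: bool, image_download_success: &mut bool, mut download_from_archive: F)
where F: FnMut(&str) -> Result<(), String>
{
    if *image_download_success || !webarchive {return;} // main loop succeeded or feature is disabled
    *image_download_success = true; // reset for the archive pass
    for url in image_urls
    {
        if let Err(e) = download_from_archive(url.as_str())
        {
            eprintln!("Downloading \"{url}\" from the web archive failed with: {e}");
            *image_download_success = false;
            break; // hard fail on the first error, the archive copy would be incomplete anyway
        }
    }
}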

src/config.rs Outdated
@@ -218,6 +219,44 @@ impl Hentai
}
if image_download_success {break;} // if all images were downloaded successfully: continue with cbz creation
}
if !image_download_success && webarchive == true { // Web Archive Loop
Owner

I don't want to download images from the web archive, only the metadata, as the images should not have been purged from the nhentai media servers (see, famously, 177013, last section in my readme). Correct me if I'm wrong.

Then, hypothetically speaking, if we wanted to download images from the web archive: completely duplicating the outer download logic is likely unnecessary. If I see this correctly, the only difference is that you call a different download image function, Self::archive_image. Why didn't you just implement the fallback-to-web-archive logic in the original download_image function?
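
For illustration, folding the fallback into one function could look roughly like this (a sketch only: I'm assuming a reqwest client here, and the real download_image takes more parameters):

// One download function with a built-in web archive fallback: if the normal request
// comes back with an error status and the feature is enabled, rewrite the URL and retry once.
// Connection errors still propagate immediately; only HTTP error statuses fall back here.
async fn download_image(client: &reqwest::Client, image_url: &str, use_web_archive: bool) -> Result<Vec<u8>, reqwest::Error>
{
    let response = match client.get(image_url).send().await?.error_for_status()
    {
        Ok(r) => r,
        Err(_) if use_web_archive =>
        {
            let archive_url = format!("https://web.archive.org/web/{image_url}"); // newest Wayback Machine capture
            client.get(archive_url.as_str()).send().await?.error_for_status()?
        }
        Err(e) => return Err(e),
    };
    Ok(response.bytes().await?.to_vec())
}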

@9-FS
Owner

9-FS commented Oct 27, 2024

Ah okay, I should've read all of your comments first before making my review comments. :D It's good to see that we're on the same page here that the current duplicated code would not be ideal. But again, I don't really want to download images from the web archive because they're under enough stress as it is; I would only like to download metadata that has been purged from nhentai.

@shinji257
Author

shinji257 commented Oct 27, 2024

For me the point would be to grab what I can from the IA, but at the moment I've just been running it with a different configuration on galleries that fail to grab images from the main servers directly, then bailing when it throws an error. At this point I'd already have the metadata because it is stored in the DB from the original API pulls. This isn't set up to get any metadata from IA.

For now this is just me taking my own idea and running with it, but I'm also learning Rust a bit with this, so I don't expect you to commit this at all. You did say you added it to your ideas list, so I expect you may implement this in your own way at a later date, which would fit much better into the code.
