
Add web archive as a final fallback [WIP] #40

Open
shinji257 wants to merge 5 commits into dev

Conversation

@shinji257 shinji257 commented Oct 24, 2024

This implements the web archive as a potential fallback for when a gallery is no longer available during a batch download. It does not allow downloading removed galleries unless the image metadata is already in your local database, or you add that data yourself.

You may want to wait to merge this (if at all) until the web archive is actually stable; it seems to go up and down. I added a warn line right before the image download that indicates the copy is being fetched from the Internet Archive. That's mostly for myself, since I'll be using this code and want to make sure the images are pulled properly.

This is for #39

@shinji257 shinji257 changed the title from "Add web archive as a final fallback (for https://github.com/9-FS/nhentai_archivist/issues/39)" to "Add web archive as a final fallback" Oct 24, 2024
This implements the web archive as a potential fallback for when the archive is no longer available during a batch download.  This does not allow for downloading removed archives if you do not already have the image metadata in your local database unless you add that data yourself.
Add optional ARCHIVE_ORG config to enable this functionality. If the image does not exist on the Internet Archive you will still get a 404; if the site is down you will get some other error. The functionality is disabled by default.
@shinji257
Author

I'm wrapping the code in a config option so it doesn't add any extra stress to the server right now. The code exists but it is disabled by default; it checks for the option ARCHIVE_ORG to be set to true. There might be more graceful ways to implement this, but this is how I've done it.
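
Roughly, the opt-in flag boils down to something like this (just a sketch assuming a serde-based config struct; everything except the ARCHIVE_ORG name is illustrative):

use serde::Deserialize;

#[derive(Deserialize)]
pub struct Config
{
    #[serde(default)] // missing or unset parses as false, so the fallback stays disabled by default
    pub archive_org: bool,
    // ...all the existing settings...
}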

Moved the pull so it isn't part of the main loop. I duplicated the download image function, removing only the server loop and replacing the image download part with one that rewrites the URL and attempts to download the web archive version. The previous version of the code tried to download from the archive 5 times (once for each attempt pass) and I didn't realize this at first. This one only does it once, and only after the 5 tries with the regular servers have been exhausted. Ideally I'd like to reduce the code duplication (or eliminate it outright) and have it abort the archive attempt on the first error. At least it is better about it now.
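
For reference, the rewrite itself is only a prefix. A sketch of the idea (the function name is made up, and the "redirect to the newest snapshot" behaviour of the Wayback Machine is my assumption):

// Point an image URL at the Wayback Machine: https://web.archive.org/web/<original URL>
// redirects to the most recent capture if one exists and returns a 404 otherwise.
fn to_web_archive_url(original_url: &str) -> String
{
    format!("https://web.archive.org/web/{original_url}")
}
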
@shinji257 shinji257 changed the title from "Add web archive as a final fallback" to "Add web archive as a final fallback [WIP]" Oct 27, 2024
@shinji257
Author

shinji257 commented Oct 27, 2024

Moved the web archive code out of the main download loop. This fixes the following items:

  • Download failure warnings for the main servers are restored (my original code broke them and I didn't notice at first).
  • It no longer tries to pull from the web archive 5 times; I didn't notice this before, and it was probably spamming the server pretty badly.

The current patch duplicates the download_image function as an archive_image function, modified slightly to point to the web archive. It is only called if the main download loop fails (it uses image_download_success for the check, like the full failure message does, then resets it to true for its own loop) and if webarchive is set to true in the config. Otherwise it should get skipped.

I'll try to see about cleaning up the code so it doesn't duplicate, but I'm not sure of the approach. I'd also like to have the archive attempt hard fail on the first error; there's no point in going further because the archive copy will be incomplete at that point. A person can try again later if the issue is archive.org being down, but if you got another 404, chances are it won't be fixed in the future, since a 404 means the Internet Archive has not archived that specific file or page.
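
To make the intended control flow concrete, here is a standalone sketch (names are illustrative, the downloader is passed in as a closure so the snippet compiles on its own, and the early abort is the desired behaviour rather than what the current patch does):

// Runs only after the normal servers have already failed, tries each image once
// from the web archive, and gives up at the first error.
fn archive_fallback<F>(image_urls: &[String], webarchive: bool, image_download_success: &mut bool, mut download_from_archive: F)
where F: FnMut(&str) -> Result<(), String>
{
    if *image_download_success || !webarchive {return;} // main loop succeeded or feature is disabled
    *image_download_success = true; // reset for the archive pass
    for url in image_urls
    {
        if let Err(e) = download_from_archive(url.as_str())
        {
            eprintln!("Downloading \"{url}\" from the web archive failed with: {e}");
            *image_download_success = false;
            break; // hard fail on the first error, the archive copy would be incomplete anyway
        }
    }
}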

src/config.rs Outdated
@@ -218,6 +219,44 @@ impl Hentai
}
if image_download_success {break;} // if all images were downloaded successfully: continue with cbz creation
}
if !image_download_success && webarchive == true { // Web Archive Loop
Owner

I don't want to download images from the web archive, only the metadata, as the images should not have been purged from the nhentai media servers (see, famously, 177013, last section in my readme). Correct me if I'm wrong.

Then, hypothetically speaking, if we wanted to download images from the web archive: completely duplicating the outer download logic is likely unnecessary. If I see this correctly, the only difference is that you call a different download image function, Self::archive_image. Why didn't you just implement the fallback-to-web-archive logic in the original download_image function?
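
For illustration, folding the fallback into one function could look roughly like this (a sketch only: I'm assuming a reqwest client here, and the real download_image takes more parameters):

// One download function with a built-in web archive fallback: if the normal request
// comes back with an error status and the feature is enabled, rewrite the URL and retry once.
// Connection errors still propagate immediately; only HTTP error statuses fall back here.
async fn download_image(client: &reqwest::Client, image_url: &str, use_web_archive: bool) -> Result<Vec<u8>, reqwest::Error>
{
    let response = match client.get(image_url).send().await?.error_for_status()
    {
        Ok(r) => r,
        Err(_) if use_web_archive =>
        {
            let archive_url = format!("https://web.archive.org/web/{image_url}"); // newest Wayback Machine capture
            client.get(archive_url.as_str()).send().await?.error_for_status()?
        }
        Err(e) => return Err(e),
    };
    Ok(response.bytes().await?.to_vec())
}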

@9-FS
Owner

9-FS commented Oct 27, 2024

Ah okay, I should've read all of your comments first before making my review comments. :D It's good to see that we're on the same page here that the current duplicated code would not be ideal. But again, I don't really want to download images from the web archive because they're under enough stress as it is; I would only like to download metadata that has been purged from nhentai.

@shinji257
Author

shinji257 commented Oct 27, 2024

For me the point would be to grab what I can from the IA, but at the moment I've just been running it with a different configuration on galleries that fail to grab images from the main servers directly, then bailing when it throws an error. At this point I'd already have the metadata because it is stored in the DB from the original API pulls. This isn't set up to get any metadata from IA.

For now this is just me taking my own idea and running with it, but I'm also learning Rust a bit with this, so I don't expect you to commit this at all. You did say you added it to your ideas list, so I expect you may implement this in your own way at a later date, which would fit much better into the code.
