
Really really slow (8+ hours) to generate patches for large files (450 mb) with 24GB RAM using bsdiff4. #154

Open
mchaniotakis opened this issue Jul 5, 2024 · 13 comments
Assignees
Labels
performance Related to performance

Comments

@mchaniotakis

First off, thanks a lot for contributing tufup; it is a great package and currently the only reliable solution for an auto-updating framework. I really appreciate the effort put into building and maintaining it.

Describe the bug
I generate 2 versions of my app, exactly the same except for the version number. Following #69, I set os.environ["PYTHONSEED"] = "0" and os.environ["SOURCE_DATE_EPOCH"] = "1577883661" in the file from which I run pyinstaller.run(), and in the .spec file as well (although it's probably not needed in the spec file). I then use bsdiff4 to generate patches between the 2 versions:

import gzip
import bsdiff4

with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        bsdiff4.diff(src_bytes=src_file.read(), dst_bytes=dst_file.read())

Looking at my RAM, it doesn't seem to fill up at any point. This patch generation has now been running for about 8-9 hours.

Using the detools package, I ran some tests (screenshot of the output omitted).

Provided that I could generate a patch with the detools library, it would be possible to do so manually after a publish, with skip_patch = True, and add the patch later. However, the patches generated for these bundles are around 350 MB to 450 MB, which is suspicious and not practical.
Here is some code to create patches using detools:

pip install detools

and

import gzip
from detools.create import create_patch

output_file = "../../../mypatch.patch"
with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        with open(output_file, "wb") as fpatch:
            create_patch(src_file, dst_file, fpatch, algorithm="match-blocks", patch_type="hdiffpatch", compression="none")

To Reproduce
Steps to reproduce the behavior:
I can provide two copies of the exact same versions that I used from my open-sourced app. Feel free to use the code above to test patching with bsdiff4 and detools.

Expected behavior
Using bsdiff4, .diff() never completes (the resulting patch should be very small, hopefully less than 45 MB). Using detools, patch generation finishes within 2-10 minutes, but the patches are around 350 to 450 MB (the application bundle itself is 450 MB).

System info (please complete the following information):

  • OS: Windows
  • OS Version : 11
  • Python version : 3.10.7
  • Pyinstaller version : 6.7.0
  • bsdiff4 version : 1.2.4
  • tufup version : 0.9.0
  • detools version : 0.53.0

Now I understand that this is possibly a problem with the bsdiff implementation in bsdiff4; however, there is a size limit to the files bsdiff can process (2 GB), while the hdiffpatch and match-blocks algorithms don't have that limit. I would appreciate any feedback on how I should go about debugging this.

@mchaniotakis
Author

mchaniotakis commented Jul 5, 2024

Please ignore my comment above about detools, as it's only for applying patches. Using HDiffPatch I was able to create small patches using their binaries, and it's also super fast. I will have to test whether these diffs work with bsdiff patching (the HDiffPatch repo says it's supported). Either way, I do believe having the option to use HDiffPatch to handle large files is a huge advantage.

@dennisvang
Owner

@mchaniotakis Thanks for the highly detailed report. Much appreciated!

Although it is well known that bsdiff4 is slow and memory hungry, as noted in #105, I have a feeling the issue you describe, taking 8+ hours without finishing, for a very small change in code, is abnormal.

We recently ran into a similar problem with a ~100MB tarball, where a relatively large change in code required a patch creation time of approx. 20 minutes (i.e. "normal"), whereas a small change of a few characters resulted in patch creation never finishing (at least not within a few hours).

I've spent some time trying to track down the cause, and trying to reproduce this issue in a minimal example using only bsdiff4, but without much success.

I'll see if I can find the time to dive into this again, and compare different patching solutions.

A short term alternative would be to provide a customization option, so users can provide their own patcher.

Note for newcomers: Before #105, we created patches from the .tar.gz, and I never saw this kind of behavior. However, as @mchaniotakis described in #69, the resulting patch files were basically useless, because they were nearly as large as the originals.

@mchaniotakis
Author

mchaniotakis commented Jul 15, 2024

I was able to work around the bsdiff diff method by using the HDiffPatch binary provided in the repo I referenced above. The only part of the process that needs to change is patch creation, since HDiffPatch can save patches in the same format as bsdiff. The process is adjusted as follows: 1) create the bundle as you normally would, but skip the patch creation step, 2) create the patch with HDiffPatch and compute the size and hash just like in here, and 3) call repo.roles.add_or_update_target and repo.publish_changes. The update process on the client should be the same. Here is a sample piece of code that I am using when creating the patch:

import gzip
import logging
import os
import subprocess
import tempfile

from tufup.common import Patcher  # provides _get_tar_size_and_hash

logger = logging.getLogger(__name__)


def create_patch_with_hdiff(latest_archive, new_archive, output_file):
    """
    latest_archive: path to the .tar.gz of the older version
    new_archive: path to the .tar.gz of the newly created version
    output_file: path to the output location of the generated .patch file
    """
    hdiff_path = "location_of_the_binary/hdiffz.exe"
    with tempfile.TemporaryDirectory() as tmpdirname:
        latest_archive_extracted = os.path.join(tmpdirname, "latest.tar")
        new_archive_extracted = os.path.join(tmpdirname, "new.tar")
        # decompress the old .tar.gz to a plain .tar
        with gzip.open(latest_archive, "rb") as src_file:
            with open(latest_archive_extracted, "wb") as src_file_extracted:
                src_file_extracted.write(src_file.read())

        # decompress the new .tar.gz and compute its size and hash
        with gzip.open(new_archive, "rb") as dst_file:
            dst_tar_content = dst_file.read()
            dst_size_and_hash = Patcher._get_tar_size_and_hash(tar_content=dst_tar_content)
            with open(new_archive_extracted, "wb") as dst_file_extracted:
                dst_file_extracted.write(dst_tar_content)
            if os.path.exists(output_file):
                logger.info("Found old patch file, removing...")
                os.remove(output_file)
            # create the patch with HDiffPatch in bsdiff-compatible format
            process = subprocess.Popen([hdiff_path, "-s-4k", "-BSD", latest_archive_extracted, new_archive_extracted, output_file])
            process.wait()
        return dst_size_and_hash
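
A rough usage sketch of the function above (the paths are hypothetical placeholders):

size_and_hash = create_patch_with_hdiff(
    latest_archive="repository/targets/myapp-1.0.tar.gz",
    new_archive="repository/targets/myapp-2.0.tar.gz",
    output_file="repository/targets/myapp-2.0.patch",
)
# size_and_hash is then registered via repo.roles.add_or_update_target and
# repo.publish_changes, as described in steps 2) and 3) above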

@dennisvang dennisvang added the performance Related to performance label Jul 15, 2024
@JessicaTegner
Contributor

So @dennisvang, what are your thoughts on this? Should tufup consider switching libraries for computing patches?

@dennisvang
Owner

dennisvang commented Oct 18, 2024

Hi @JessicaTegner, sorry for my really slow response.

In the long run it may be good to switch to a different library, but for now I would prefer just to implement a customization option that allows users to specify their own patcher.

One of the reasons is the complexity of the update process when switching to a new technology. It's an interesting mix of challenges related to both backward compatibility and forward compatibility.

In addition, I would still like to know the underlying cause of this issue with bsdiff4. I've encountered the issue in production, but still have not been able to reproduce it consistently in a minimal example using bsdiff4 only.

@JessicaTegner
Contributor

JessicaTegner commented Oct 18, 2024

Hey @dennisvang, good to see that you are still around.
Yes, it's very interesting to observe this. I have been able to confirm the wildly different diff times for bsdiff4, and have tried different ways to fix it or get around it.
A custom patcher option seems like a good first step; however, we should really either figure out whether we can fix this upstream, or push for switching.
Interestingly though: I never saw this behaviour with pyupdater.
Edit 1: I tried using the "file_diff" method, compared to the normal "diff" method, and my RAM doesn't seem to be eaten up as much as with "diff". I attribute that partly to the 2 files not being loaded into memory, but that still seems worth noting here.
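For reference, a minimal sketch of the two call styles (archive and patch names are placeholders):

import bsdiff4

# in-memory variant: both archives are read fully into RAM before diffing
with open("old.tar", "rb") as src, open("new.tar", "rb") as dst:
    patch_bytes = bsdiff4.diff(src.read(), dst.read())
with open("new.patch", "wb") as f:
    f.write(patch_bytes)

# file-based variant: bsdiff4 handles reading the inputs and writing the patch itself
bsdiff4.file_diff("old.tar", "new.tar", "new.patch")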

@dennisvang
Owner

Interestingly though: I never saw this behaviour with pyupdater.

@JessicaTegner Me neither, as far as I can remember.

Then again, neither did I see this behavior with tufup before #105.

Edit 1: I tried using the "file_diff" method, compared to the normal "diff" method, and my RAM doesn't seem to be eaten up as much as with "diff". I attribute that partly to the 2 files not being loaded into memory, but that still seems worth noting here.

May be interesting to give that a try. I do know that pyupdater also used file_diff.

I'm not 100% sure, but I seem to recall that pyupdater created the patches from compressed archives (.zip on Windows), instead of uncompressed ones...

@dennisvang dennisvang added this to the upcoming release milestone Dec 13, 2024
@JessicaTegner
Contributor

Hey @dennisvang
we should really find a solution for this, as it keeps on happening, especially as the repository grows.
I would say that we should maybe look into reverting #105 even though that might be annoying.
Most of the reason I see it happening is that the compression is not deterministic, so in some cases it has to spend a lot of compute trying to create a patch for something that would be useless anyway.
As it stands right now, the patch functionality is basically useless, and tbh I have had many accidental 6+ hour runs on GH, spending a lot of $$$, and I don't think I'm the only one :)

@dennisvang
Owner

dennisvang commented Dec 14, 2024

we should really find a solution for this, as it keeps on happening, especially as the repository grows.

@JessicaTegner you're right, a solution is needed, but I still don't have a way of reproducing the issue consistently. My preference would be to understand the cause before taking action. Do you know how to reproduce this in a minimal example?

I would say that we should maybe look into reverting #105 even though that might be annoying.

That could be an option, but would make the patch functionality useless again in a different way.

[...] tbh I have had many accidental 6+ hour runs on GH, spending a lot of $$$, and I don't think I'm the only one :)

That's unfortunate. You're probably aware of this, but for those who are not: you can set a timeout for your GitHub workflow, either at the job level or at the step level, as in

jobs:
  my-job:
    ...
    timeout-minutes: 60
    ...

(default is 360 minutes)

@dennisvang
Owner

dennisvang commented Dec 14, 2024

Following up on this comment:

I'm not 100% sure, but I seem to recall that pyupdater created the patches from compressed archives (.zip on Windows), instead of uncompressed ones...

Looks like PyUpdater did have the same issue as our #69, due to creating patches from compressed .tar.gz archives. On Windows they used .zip by default, and the issue did not arise there: Digital-Sapphire/PyUpdater#121

A .tar.gz is a gzip-compressed tar-archive, whereas a .zip is an archive of individually compressed files (spec), which would explain why the patch size issue only arises for .tar.gz.
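
As a quick illustration of that structural difference, here is a minimal sketch (the archive name is a placeholder) that lists the per-member compressed sizes of a zip; a .tar.gz has no such per-file entries, because the whole tar stream is compressed as a single unit:

import zipfile

# each zip member is compressed on its own, so a change to one file
# leaves the compressed bytes of the other members untouched
with zipfile.ZipFile("bundle.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size, info.compress_size)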

@JessicaTegner
Contributor

@dennisvang I wonder if there's some randomization we need to disable when creating the tar or the tar.gz. I can't see anything in the documentation, but it could still be a possibility.
If you don't find any, I'll try to look into what the issue could be, using my own apps as examples.
I suspect it might be because of the randomization, so that, to the patching, it looks like a lot has changed, but it has just been moved around in the tar.gz.

@dennisvang
Owner

dennisvang commented Dec 16, 2024

I suspect it might be because of the randomization, so that, to the patching, it looks like a lot has changed, but it has just been moved around in the tar.gz.

Hi @JessicaTegner, afaik any "randomization," as you call it, is mainly introduced by gzip, and causes the large patch size issue (#69) when creating patches from compressed .tar.gz. Small changes in the .tar lead to very large changes in the .tar.gz, and this can be expected, based on the gzip spec.

That was actually the reason for #105, where we started creating patches from the uncompressed .tar archive. So the "randomness" should not be an issue. Note the emphasis on should. ;)

The .tar format itself also introduces some non-determinism, but that's only minor (see e.g. reproducible builds), and should not be a problem when creating patches from the uncompressed .tar archive. This is easily verified, e.g. by creating two tar archives from the same content at different points in time, etc. Also see comments under #93
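
A minimal sketch of such a check (the bundle directory name is a placeholder; on recent Python versions, tarfile adds directory entries in sorted order, which keeps the result reproducible):

import hashlib
import tarfile
import time

def tar_sha256(archive_path, source_dir):
    """Create a tar archive from source_dir and return its SHA-256 digest."""
    with tarfile.open(archive_path, mode="w") as tar:
        tar.add(source_dir, arcname=".")
    with open(archive_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

digest_1 = tar_sha256("first.tar", "my_bundle")
time.sleep(2)  # create the second archive at a later point in time
digest_2 = tar_sha256("second.tar", "my_bundle")
print(digest_1 == digest_2)  # expected: True, as long as the bundle content is unchanged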

So, to be clear, the current implementation creates patches from .tar, not from .tar.gz. That's also why archive size has now become a limiting factor (the uncompressed .tar is (much) bigger than the compressed .tar.gz).

W.r.t. the present issue (#154), I may be wrong, but it almost looks like the bsdiff algorithm has trouble creating patches when there are only a few small changes scattered throughout a large file. Maybe this is even specific to tar archives and similar file structures, because I could not reproduce the issue using an equally large "random" byte sequence as a dummy archive.

I still think it sounds a lot like debian 409664.

What's your experience in this respect? Does the issue occur only for "small" changes, or also for "large" changes? And what kind of app bundle do you use? Is it also a PyInstaller bundle, or something else?
