Really, really slow (8+ hours) to generate patches for large files (450 MB) with 24 GB RAM using bsdiff4. #154
Please ignore my comment above about detools, as it's only for applying patches. Using HDiffPatch I was able to create small patches with their binaries, and it's also super fast. I will have to test whether these diffs work with bsdiff patching (the HDiffPatch repo says it's supported). Either way, I do believe having the option to use HDiffPatch to handle large files is a huge advantage.
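For anyone wanting to verify the bsdiff compatibility claim, a quick round-trip check could look like the sketch below (file names are placeholders, and it assumes the patch was created with hdiffz -BSD; bsdiff4.file_patch is bsdiff4's file-based patch applier):

import bsdiff4

# Apply the HDiffPatch-generated (-BSD) patch with bsdiff4's patcher,
# then compare the result against the real new archive.
bsdiff4.file_patch("app-old.tar", "app-reconstructed.tar", "app-new.patch")

with open("app-new.tar", "rb") as expected, open("app-reconstructed.tar", "rb") as actual:
    assert expected.read() == actual.read(), "patch did not round-trip"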
@mchaniotakis Thanks for the highly detailed report. Much appreciated! Although it is well known that bsdiff can be slow and memory hungry, this does seem extreme.

We recently ran into a similar problem with a ~100MB tarball, where a relatively large change in code required a patch creation time of approx. 20 minutes (i.e. "normal"), whereas a small change of a few characters resulted in patch creation never finishing (at least not within a few hours).

I've spent some time trying to track down the cause, and trying to reproduce this issue in a minimal example using only bsdiff4, but without success so far.

I'll see if I can find the time to dive into this again, and compare different patching solutions. A short term alternative would be to provide a customization option, so users can provide their own patcher.

Note for newcomers: before #105, we created patches from the compressed .tar.gz archives; since then, patches are created from the uncompressed .tar archives.
Possibly related: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409664
I was able to work around the bsdiff diff method by using the HDiffPatch binary provided in the repo I referenced above. The only part of the process that needs to change is patch creation, since HDiffPatch can save patches in the same format as bsdiff. The process is adjusted as follows: 1) create the bundle as you normally would, but skip the patch creation step, 2) create the patch with HDiffPatch and compute its size and hash just like in here, and 3) call repo.roles.add_or_update_target and repo.publish_changes. The update process on the client stays the same. Here is a sample piece of code that I am using when creating the patch:

import gzip
import logging
import os
import subprocess
import tempfile

from tufup.common import Patcher  # tufup's size/hash helper (module path may differ per version)

logger = logging.getLogger(__name__)

def create_patch_with_hdiff(latest_archive, new_archive, output_file):
"""
latest_archive : path being the tar.gz path to the older version
new_archive : path being the tar.gz file to the newly created version
output_file : path to the output location of the .patch generated file
"""
hdiff_path = "location_of_the_binary/hdiffz.exe "
with tempfile.TemporaryDirectory() as tmpdirname:
latest_archive_extracted = os.path.join(tmpdirname , "latest.tar")
new_archive_extracted = os.path.join(tmpdirname , "new.tar")
with gzip.open(latest_archive, "rb") as src_file:
with open(latest_archive_extracted, "wb") as src_file_extracted:
src_file_extracted.write(src_file.read())
with gzip.open(new_archive, "rb") as dst_file:
dst_tar_content = dst_file.read()
dst_size_and_hash = Patcher._get_tar_size_and_hash(tar_content=dst_tar_content)
with open(new_archive_extracted, "wb") as dst_file_extracted:
dst_file_extracted.write(dst_tar_content)
if os.path.exists(output_file):
logger.info("Found old patch file, removing...")
os.remove(output_file)
# create patch
process = subprocess.Popen([hdiff_path , "-s-4k" ,"-BSD",latest_archive_extracted , new_archive_extracted , output_file ])
process.wait()
return dst_size_and_hash |
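For context, a hypothetical call site (paths are placeholders); the returned size and hash can then be passed on via repo.roles.add_or_update_target before repo.publish_changes, as per steps 2 and 3 above:

# Hypothetical usage; `repo` would be an already configured tufup Repository.
dst_size_and_hash = create_patch_with_hdiff(
    latest_archive="repo/targets/my_app-1.0.tar.gz",
    new_archive="repo/targets/my_app-2.0.tar.gz",
    output_file="repo/targets/my_app-2.0.patch",
)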
So @dennisvang, what are your thoughts on this? Should tufup consider switching libraries for computing patches?
Hi @JessicaTegner, sorry for my really slow response.

In the long run it may be good to switch to a different library, but for now I would prefer just to implement a customization option that allows users to specify their own patcher. One of the reasons is the complexity of the update process when switching to a new technology: it's an interesting mix of challenges related to both backward compatibility and forward compatibility.

In addition, I would still like to know the underlying cause of this issue with bsdiff4.
Hey @dennisvang, good to see that you are still around.
@JessicaTegner Me neither, as far as I can remember. Then again, neither did I see this behavior with tufup before #105.
May be interesting to give that a try. I do know that pyupdater also used bsdiff4. I'm not 100% sure, but I seem to recall that pyupdater created the patches based on compressed archives.
Hey @dennisvang |
@JessicaTegner you're right, a solution is needed, but I still don't have a way of reproducing the issue consistently. My preference would be to understand the cause before taking action. Do you know how to reproduce this in a minimal example?
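For reference, a sketch of the kind of minimal example I have in mind: synthetic, repetitive tar content with one tiny scattered change. The sizes and the type of change are guesses at the triggering conditions; so far I have only tried comparable scripts with random bytes, without reproducing the hang.

import io
import tarfile
import time

import bsdiff4

def make_tar(blobs):
    """Pack a {name: bytes} mapping into an in-memory uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in blobs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# ~400 MB of repetitive, non-random content (scale range() down for a quick smoke test)
blobs = {
    f"file{i}.bin": (b"\x00" * 1024 + bytes([i % 256]) * 1024) * 100
    for i in range(2000)
}
old = make_tar(blobs)
blobs["file42.bin"] = blobs["file42.bin"][:-4] + b"beta"  # one tiny change
new = make_tar(blobs)

start = time.time()
patch = bsdiff4.diff(old, new)
print(f"{len(patch)} byte patch in {time.time() - start:.1f} s")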
That could be an option, but would make the patch functionality useless again in a different way.
That's unfortunate. You're probably aware of this, but for those who are not: you can set a timeout for your GitHub workflow, either at the job level or at the step level, as in

jobs:
  my-job:
    ...
    timeout-minutes: 60
    ...

(the default is 360 minutes).
Following up on this comment:
Looks like PyUpdater did have the same issue as our #69, due to creating patches from compressed archives.
@dennisvang wonder if there's some randomization we need to disable when it comes to creating the tar or the tar.gz. I can't see anything in the documentation, but it still could be a possibility.
Hi @JessicaTegner, afaik any "randomization," as you call it, is mainly introduced by gzip, and causes the large patch size issue (#69) when creating patches from compressed .tar.gz archives.

That was actually the reason for #105, where we started creating patches from the uncompressed .tar archives instead. So, to be clear, the current implementation creates patches from uncompressed .tar archives.

W.r.t. the present issue (#154), I may be wrong, but it almost looks like the bsdiff algorithm has trouble creating patches when there are only a few small changes scattered throughout a large file. Maybe this is even specific to tar archives and similar file structures, because I could not reproduce the issue using an equally large "random" byte sequence as dummy archive. I still think it sounds a lot like debian 409664.

What's your experience in this respect? Does the issue occur only for "small" changes, or also for "large" changes? And what kind of app bundle do you use? Is it also a PyInstaller bundle, or something else?
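For illustration: the gzip header stores a timestamp, which is one typical source of such nondeterminism. A deterministic recompression step could pin it, as in this sketch (not actual tufup code):

import gzip
import shutil

def gzip_deterministic(tar_path, targz_path):
    # mtime=0 pins the timestamp in the gzip header, so identical .tar
    # input always yields byte-identical .tar.gz output
    with open(tar_path, "rb") as f_in:
        with gzip.GzipFile(targz_path, mode="wb", mtime=0) as f_out:
            shutil.copyfileobj(f_in, f_out)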
First off, thanks a lot for tufup: it is a great package and, as of now, the only reliable solution for an auto-updating framework. I really appreciate the effort put into building and maintaining it.
Describe the bug
I generate 2 versions of my app, exactly the same, with the only difference being the version number. Following #69, I set os.environ["PYTHONHASHSEED"] = "0" and os.environ["SOURCE_DATE_EPOCH"] = "1577883661" in the file from which I run pyinstaller.run(), and in the .spec file as well (although it's probably not needed in the spec file). I then use bsdiff4 to generate patches between the 2 versions:
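Roughly like this (a hedged sketch with placeholder file names: the env settings as described, followed by the bsdiff4 call):

import os

# reproducible-build settings from #69, set before running PyInstaller
os.environ["PYTHONHASHSEED"] = "0"
os.environ["SOURCE_DATE_EPOCH"] = "1577883661"

import bsdiff4

# diff the two uncompressed app bundles; this is the call that never finishes
bsdiff4.file_diff("my_app-1.0.tar", "my_app-2.0.tar", "my_app-2.0.patch")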
Looking at my RAM, it doesn't seem to become full at any point. This patch generation has now been running for about 8-9 hours.
Using this package, detools, I can test the following:
Provided that I could generate a patch with the detools library, it would be possible to do so manually after a publish, with skip_patch=True, and inject the patch later. However, the patches generated for these bundles are around 350 MB to 450 MB, which is suspicious and not practical.
Here is some code to create patches (and apply them) using detools:
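A minimal sketch along these lines, assuming detools' file-object-based create_patch and apply_patch API (file names are placeholders):

import detools

# create a patch from the old to the new archive
with open("old.tar", "rb") as f_old, \
        open("new.tar", "rb") as f_new, \
        open("new.patch", "wb") as f_patch:
    detools.create_patch(f_old, f_new, f_patch)

# apply it again to verify the round trip
with open("old.tar", "rb") as f_old, \
        open("new.patch", "rb") as f_patch, \
        open("reconstructed.tar", "wb") as f_out:
    detools.apply_patch(f_old, f_patch, f_out)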
To Reproduce
Steps to reproduce the behavior:
I can provide two copies of the exact same versions that I used from my open-source app. Feel free to use the code above to test patching with bsdiff4 and detools.
Expected behavior
Using bsdiff4, the .diff() call never completes (the patch should be very small, hopefully less than 45 MB). Using detools, patch generation finishes within 2-10 minutes, but the patches are around 350 to 450 MB (the application bundle itself is 450 MB).
System info (please complete the following information):
Now, I understand that this is possibly a problem with the implementation of bsdiff in bsdiff4; however, there is also a size limit to the files bsdiff can process (2 GB), while the hdiffpatch and match-blocks algorithms don't have that limit. I would appreciate any feedback on how I should go about debugging this.