Large patch sizes #69

Closed
mchaniotakis opened this issue Apr 9, 2023 · 7 comments · Fixed by #105
Describe the bug
I have generated version 1 of a Python application bundled with pyinstaller. This package contains images, libraries, the .exe, and my .py files that have been converted to .pyd (binaries). One of those .pyd files states the version of the application. If I only change the version in that .pyd file, without running pyinstaller again, and generate the second version of the bundle with tufup, I get a patch of 200 MB, which is excessive considering the whole package is 340 MB. The last modification dates of the files are the same, except for the .pyd file that states the version. Using the bsdiff4.file_diff() method on these two versions produces the same result. I can provide both of these files if needed.

To Reproduce
Steps to reproduce the behavior:

  1. Run cython and pyinstaller with the .spec file required to make the bundle.
  2. Copy all files except one folder containing some images.
  3. Modify the version.py file, re-run cython for that file to generate version.pyd, and copy over the folder mentioned above from the source (the folder has not changed and is copied with shutil.copytree(), so the modification dates are the same).
  4. When I run the repo.add_bundle(new_bundle_dir=bundle_dir) method, I get the patch size mentioned above.

Expected behavior
A patch size of less than 10 MB. On a previous run, I regenerated just the .exe (running pyinstaller, copying only the .exe, and deleting everything else, while following the steps mentioned above). The .exe file size is 17 MB, while the generated patch was 35 MB for that run.

System info (please complete the following information):

  • OS: Windows 11
  • Python version: 3.9
  • Pyinstaller version: 5.9.0
  • Tufup version: 0.4.9
  • bsdiff4 version: 1.2.3
dennisvang added the bug label Apr 11, 2023
dennisvang commented Apr 11, 2023

@mchaniotakis Thanks for providing such a detailed report.

You are right, these excessively large patches for small changes are not very useful, to say the least.

Tufup was created as a replacement for PyUpdater (because PyUpdater is no longer maintained). For this reason, the patch creation in tufup using bsdiff4 is basically a naive copy of PyUpdater's make_patch (see inputs here).

Although I did add some tests for basic patch functionality, I must admit, I haven't paid very much attention to the resulting file sizes.

The use of bsdiff4, in itself, does not seem to be a problem. Rather, the problem comes from the fact that we use it, naively, to create binary differences of .tar.gz archives.

It appears that binary diffs of either uncompressed .tar files or non-tar .gz files are okay, but binary diffs of .tar.gz files are troublesome (the diffs are correct, but very large).

There's probably a good explanation for this, so I'll have a closer look at it as soon as I have some free time.
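
For anyone who wants to verify this on their own bundles, here is a minimal sketch (archive names are placeholders; it assumes the same two bundle versions were archived both as plain .tar and as .tar.gz):

import os

import bsdiff4

# placeholder archive names for two versions of the same bundle
pairs = [
    ('myapp-1.0.tar', 'myapp-2.0.tar'),        # uncompressed
    ('myapp-1.0.tar.gz', 'myapp-2.0.tar.gz'),  # compressed
]
for src, dst in pairs:
    patch_file = dst + '.patch'
    bsdiff4.file_diff(src, dst, patch_file)  # write the binary diff to patch_file
    print(f'{patch_file}: {os.path.getsize(patch_file) / 1e6:.1f} MB')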

dennisvang added the enhancement label Apr 11, 2023
dennisvang commented Jun 16, 2023

As a temporary workaround, patches can be disabled using --skip-patches, see PR #68.

On the command line:

tufup targets add --skip-patches <app_version> <bundle_dir> <key_dirs>

or in a script:

...
repo = Repository.from_config()
repo.add_bundle(new_bundle_dir=..., new_version=..., skip_patch=True)
repo.publish_changes(private_key_dirs=...)
...

dennisvang commented Nov 15, 2023

Another problem may be the fact that pyinstaller builds are not reproducible by default, as explained in the docs:

In certain cases it is important that when you build the same application twice, using exactly the same set of dependencies, the two bundles should be exactly, bit-for-bit identical.

That is not the case normally. Python uses a random hash to make dicts and other hashed types, and this affects compiled byte-code as well as PyInstaller internal data structures. As a result, two builds may not produce bit-for-bit identical results even when all the components of the application bundle are the same and the two applications execute in identical ways.

but

You can ensure that a build will produce the same bits by setting the PYTHONHASHSEED environment variable to a known integer value before running PyInstaller. [...]

in addition

Changed in version 4.8: The build timestamp in the PE headers of the generated Windows executables is set to the current time during the assembly process. A custom timestamp value can be specified via the SOURCE_DATE_EPOCH environment variable to achieve reproducible builds.
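
For example, a minimal build script along these lines (the environment variable names come from the PyInstaller docs quoted above; the spec file name and the values are arbitrary placeholders):

import os
import subprocess

env = dict(os.environ)
env['PYTHONHASHSEED'] = '42'             # fixed hash seed, for reproducible byte-code
env['SOURCE_DATE_EPOCH'] = '1700000000'  # fixed build timestamp for the PE headers
subprocess.run(['pyinstaller', 'main.spec'], env=env, check=True)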

I'll have to do some more tests...

UPDATE:

Hmm... Does not seem to make much of a difference in the tufup-example app. Setting both PYTHONHASHSEED and SOURCE_DATE_EPOCH produces patches that still vary in size between runs, and are still far too big for the small change (only 1.0 changed to 2.0):

  • archive size v1/v2: 10846 KB
  • patch size (default): 7064 KB, 6912 KB, ...
  • patch size ("reproducible"): 6659 KB, 6975 KB, ...

dennisvang commented Nov 24, 2023

Although we can now work around most of the gzip reproducibility issues (see #93), one risk remains:

The compressed output from gzip depends on the implementation, and there is no guarantee that identical input will lead to identical output between different implementations. (only equality of decompressed output is guaranteed)

We assume that the tufup archives are created on the same OS that they are used on, and that the gzip implementation is sufficiently stable between versions of the same OS to guarantee byte-for-byte equality. However, this may lead to trouble in the future: if it turns out that gzip output is unstable between different versions of the same OS, the python-tuf hash check would fail, preventing updates.

There are a few options to prevent this:

  • Implement support for OS versions in the archive filename, so we can add separate targets for different OSes (or OS versions). This is also in line with multi-platform support as in Can the repository/client support multiple target platforms? #79.
  • Register the .tar archives as targets, instead of .tar.gz archives. This would simplify our code, because we would no longer need to worry about gzip reproducibility. To save disk space on the client we could still keep a compressed archive there using the default gzip. Note that gzip compression could still be used for file transmission, e.g. using the Content-Encoding: gzip HTTP-header, but this would depend on the user's update-server configuration and would therefore be outside the scope of tufup (python-tuf automatically handles decompression if that HTTP-header is set).
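
As an aside on the gzip reproducibility mentioned above, a minimal sketch of the kind of workaround used for #93 (an assumption about its approach, not the actual implementation): within a single Python version, pinning the gzip mtime makes the compressed output deterministic for identical input, although this says nothing about other gzip implementations.

import gzip

# identical input plus a pinned mtime yields identical .gz output (for this implementation)
a = gzip.compress(b'identical input', mtime=0)
b = gzip.compress(b'identical input', mtime=0)
assert a == b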

dennisvang commented Feb 2, 2024

After some more thought, here's another option:

We stick with compressed archives (.tar.gz) as our tuf repository targets.

This means the download verification process and the server configuration can remain unaltered.

However:

  • we take precautions to ensure that our (uncompressed) .tar archives are reproducible
  • we create a (monolithic) patch file using bsdiff4 from the (uncompressed) .tar archives
  • we include a file hash for the (uncompressed) destination .tar archive in the target metadata for the patch file, using a CUSTOM object (see Support custom metadata objects. #100)
  • after reconstructing the destination archive from the patch, on the client side, we verify its integrity using the hash from the custom metadata object, before gzipping the archive (just to save storage space)
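
A rough sketch of these steps (file names are placeholders, and the calls are plain bsdiff4/hashlib, not tufup's actual API):

import gzip
import hashlib
import pathlib

import bsdiff4

# repo side: diff the uncompressed .tar archives and record the destination hash
bsdiff4.file_diff('myapp-1.0.tar', 'myapp-2.0.tar', 'myapp-1.0-to-2.0.patch')
expected = hashlib.sha256(pathlib.Path('myapp-2.0.tar').read_bytes()).hexdigest()
# 'expected' would be stored in the custom metadata object for the patch target (#100)

# client side: reconstruct the destination archive, verify it, then gzip it to save space
bsdiff4.file_patch('myapp-1.0.tar', 'myapp-2.0.tar', 'myapp-1.0-to-2.0.patch')
reconstructed = pathlib.Path('myapp-2.0.tar').read_bytes()
assert hashlib.sha256(reconstructed).hexdigest() == expected
pathlib.Path('myapp-2.0.tar.gz').write_bytes(gzip.compress(reconstructed))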

The only problem remaining now is that our uncompressed .tar archives can be two or three times the size of the corresponding .tar.gz files. This may cause trouble due to resource limitations, as bsdiff4 requires a lot of memory (and time).

In addition, we should implement some kind of failsafe, so that failed patches will be ignored on the next run, in favor of a full installation. (done: #101)

Why go to the trouble of verifying the integrity of the reconstructed archive?

The integrity and authenticity of the patch and the current archive are already guaranteed by TUF.

Knowing this, it seems highly unlikely that anything could go wrong when applying the patch.

Nevertheless, if anything does go wrong, our self-updating application is likely to be broken. This would require a manual re-install.

Moreover, it is quite possible that a mistake somewhere in the workflow would lead to a patch being applied to the wrong archive: bsdiff4 will happily apply a patch to any src file, regardless of whether the patch was actually created from that file. Obviously, the result would be unusable.

To illustrate the point:

import bsdiff4

original = b'this represents the original file'
updated = b'this represents the updated file'
wrong = b'this is the wrong file'

# a patch created from the original reconstructs the update correctly
patch = bsdiff4.diff(src_bytes=original, dst_bytes=updated)
reconstructed = bsdiff4.patch(src_bytes=original, patch_bytes=patch)
assert reconstructed == updated

# the same patch applied to the wrong source succeeds silently, but yields garbage
broken = bsdiff4.patch(src_bytes=wrong, patch_bytes=patch)
assert broken != updated

dennisvang commented Dec 14, 2024

To follow up on this comment:

Tufup was created as a replacement for PyUpdater (because PyUpdater is no longer maintained). For this reason, the patch creation in tufup using bsdiff4 is basically a naive copy of PyUpdater's make_patch (see inputs here).

Although I did add some tests for basic patch functionality, I must admit, I haven't paid very much attention to the resulting file sizes.

For completeness, it turns out a similar issue also arose with PyUpdater when using .tar.gz, but not when using .zip, which was the default on Windows:

Digital-Sapphire/PyUpdater#121
