Special case: Widely-spread known objects #17

Open
M-Gonzalo opened this issue Mar 21, 2018 · 1 comment
Labels: enhancement, good first issue, question

Comments


M-Gonzalo commented Mar 21, 2018

There are some pieces of data that exist in identical form on countless devices around the world. Some examples are license texts like the GPL and redistributable libraries such as zlib1, qt* or 7z.dll.

For some of these, bundling an offline dictionary with the Fairytale distribution in compressed form could help a lot. A very simple parser could recognize them and encode each one as a minuscule reference, yielding denser archives that are also faster to produce.
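
A minimal sketch of how such a parser might work, assuming a hypothetical `KnownObject` table and `matchKnownObject` function (neither exists in Fairytale today) and using 64-bit FNV-1a as a stand-in hash:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// 64-bit FNV-1a as a stand-in hash; a real dictionary would use a
// cryptographic hash (e.g. SHA-256) so that accidental collisions
// with unrelated data are negligible.
static uint64_t fnv1a64(const uint8_t* data, uint64_t size) {
    uint64_t h = 1469598103934665603ULL;
    for (uint64_t i = 0; i < size; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

struct KnownObject {
    uint64_t hash;  // hash of the canonical object (e.g. the GPLv3 text)
    uint64_t size;  // exact size, to cheaply rule out truncated copies
    uint32_t id;    // index into the bundled offline dictionary
};

// Table shipped with the archiver; real entries would be generated offline.
static const std::vector<KnownObject> kDictionary = {
    // { hashOfGpl3Text, sizeOfGpl3Text, 0 }, ...
};

// Returns the dictionary id if the block matches a known object, letting the
// encoder emit a tiny reference instead of the object's full bytes.
std::optional<uint32_t> matchKnownObject(const uint8_t* data, uint64_t size) {
    const uint64_t h = fnv1a64(data, size);
    for (const KnownObject& obj : kDictionary)
        if (obj.size == size && obj.hash == h)
            return obj.id;  // in practice, verify byte-for-byte before trusting
    return std::nullopt;
}
```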

This method has an obvious drawback: the size of said dictionary. Even so, modern archivers already tend to occupy dozens of megabytes on disk, sometimes more than a hundred, because of their choice of graphical libraries.

This is very much open to discussion, but it might be worth a try.

If you happen to know a file that fits these characteristics, please mention it in the comments.

M-Gonzalo added the enhancement, good first issue, and question labels on Mar 21, 2018
@DedupOperator

Most cloud-based and enterprise backup solutions already use this kind of multi-user deduplication database: every file on the host is hashed, and the server looks for a matching hash.
It is a nice idea, but I don't think building such a pool would be good practice. A deduplication database that fits all kinds of users would be huge.

Security-wise and practically, the better approach is for each host to build its own dedup database over time, covering whatever duplicated data it actually encounters.

The unique data would be compressed and possibly encrypted. With today's fast CPUs and SSDs, the database itself could also be compressed.
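
For illustration, a minimal sketch of such a per-host store, with hypothetical names (`DedupStore`, `put`) and the same FNV-1a stand-in hash as above; a real implementation would use a cryptographic hash and persist both the index and the blocks to disk:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// 64-bit FNV-1a stand-in hash, as in the earlier sketch.
static uint64_t fnv1a64(const uint8_t* data, uint64_t size) {
    uint64_t h = 1469598103934665603ULL;
    for (uint64_t i = 0; i < size; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

class DedupStore {
public:
    // Returns the id of the stored block; identical content always maps to
    // the same id, so every later occurrence costs only a small reference.
    uint32_t put(const std::vector<uint8_t>& block) {
        const uint64_t h = fnv1a64(block.data(), block.size());
        auto range = index_.equal_range(h);
        for (auto it = range.first; it != range.second; ++it)
            if (blocks_[it->second] == block)
                return it->second;      // duplicate: reuse the existing block
        const uint32_t id = static_cast<uint32_t>(blocks_.size());
        blocks_.push_back(block);       // unique: store (then compress/encrypt)
        index_.emplace(h, id);
        return id;
    }

private:
    std::unordered_multimap<uint64_t, uint32_t> index_;  // hash -> block id
    std::vector<std::vector<uint8_t>> blocks_;           // unique block contents
};
```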

Deduplication and compression are already implemented, in very well-tested form, by community members.
