Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZIM Fuse Filesystem #400

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

ZIM Fuse Filesystem #400

wants to merge 8 commits into from

Conversation

juuz0
Copy link
Collaborator

@juuz0 juuz0 commented Apr 13, 2024

Fixes kiwix/overview#79

Mount any ZIM file to your filesystem
Usage:
zimfuse zimFile.zim mountDir

mountDir should exist before using zimfuse.

setting up the project
Implemented a tree (trie-like) where a path is broken down per directories.
For example,
my/new/path1.jpg
my/new/path2.jpg

are represented as

/ (root)
 my
  new
    path1.jpg
    path2.jpg
Implemented some common FUSE functions.
We can browse the ZIM file using ls, cd, etc. and open files in text editors or using commands such as "cat" now.
Updated the README to add information about zimfuse
@kelson42
Copy link
Contributor

I guess this would be an implementation of kiwix/overview#79 ?

mappedNodes owns the nodes
Some entries had long filenames, this change reduces that to 100 characters which is will within the allowed limit.
This introduces possible name collisions which will be fixed in the following commit
now if two files have the same name, the 2nd file is changed to "originalFileName(1)"
When reading ZIMs with a lot of files (for example: mediawiki_en_all), getting their filesize took a lot of time. There were 2 choices:
1. Read filesize while creating the tree - this gave fast subsequent responses (on commands such as ls) but the fuse initialization took a good amount of time.
2. Find filesize when requested by a command (and save it for future requests) - This provides fast initialization but the first read takes time (though, not as much as getting all filesizes - only the requested ones). They are later saved for subsequent requests.
I went with option 2.
@juuz0
Copy link
Collaborator Author

juuz0 commented Apr 29, 2024

I guess this would be an implementation of kiwix/overview#79 ?

Yes.

@juuz0 juuz0 marked this pull request as ready for review April 29, 2024 03:21
@juuz0 juuz0 requested a review from kelson42 April 29, 2024 03:21
@kelson42
Copy link
Contributor

@juuz0 Thanks but most of the CI is broken? Can you fix it?

@juuz0
Copy link
Collaborator Author

juuz0 commented May 2, 2024

CI fails because 'fuse3' is not available on the system...maybe something to add in kiwix-build later?

Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @juuz0.
But I have few general comments better addressed on the full PR than each commit:

  • This kind of tree is a perfect use case for OOP and inheritance.
    A basic Node with name, parent. Then a Dir node (inheriting Node) with children and a Leaf node.
  • You define a Node::Ptr as unique_ptr, but the Node tree is composed of Node*. Node::Ptr is only used in Tree::mappedNodes. This make Tree the direct owner of all the nodes and Nodes only making "reference" to other nodes. It would make more sense to have Dir nodes the owners of their children.
  • The Tree::mappedNodes is kind of global cache. But I'm not sure we need it. Either "directories" contains few entries and loop in the vector should be enough or not, and in this case, it would be better to transform Node::children into a map. We would have the O(1) access at node level and top of ensuring us the unity of children names.
  • The Tree::statCache could be removed by simply move the cached struct stat in the Node itself.
  • Then the Tree could probably be removed as it is simply a Dir node without parent.
  • The collisionCount is local to each node but "global" to all children. This means that four twins foo (x2) and bar (x2), you will find the sanitized names foo, foo(1), bar and bar(2). It should be bar(1).
  • Redirects are not properly handle. You resolve the redirection and return the content of the target but it may break relative links in html. You should treat redirect as symlink (and so implement readlink)

On top of that, I wonder about the memory usage of the tree.
wikipedia_fr_all_maxi contains 7029908 entries (leafs). And size_of(Node) is 144 (without counting the actual data store in it (names/fullPath/originalPath bytes, children ptr). So it means that we need at least 965MiB for the nodes only. Probably more that 2GiB if we add the path and children ptr.
We can reduce that by carefully define Node structure and what we store but at the end, we will always use a lot of data.
(We may simply don't care and tell user that mounting zim files need a lot of memory, at least for now)

@@ -28,13 +28,14 @@ find_library_in_compiler = meson.version().version_compare('>=0.31.0')
rt_dep = dependency('rt', required:false)
docopt_dep = dependency('docopt', static:static_linkage)

with_writer = host_machine.system() != 'windows'
with_writer_and_mount = host_machine.system() != 'windows'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have a on_windows, or split this in two variables with_writer and with_mount.

We should not link writer and mount compilation together (at least not explicitly in one variable)

meson.build Outdated
@@ -28,13 +28,14 @@ find_library_in_compiler = meson.version().version_compare('>=0.31.0')
rt_dep = dependency('rt', required:false)
docopt_dep = dependency('docopt', static:static_linkage)

with_writer = host_machine.system() != 'windows'
with_writer_and_mount = host_machine.system() != 'windows'

if with_writer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZIM Filesystem Fuse Module
3 participants