ZIM Fuse Filesystem #400

juuz0 · 2024-04-13T14:00:41Z

Mount any ZIM file to your filesystem
Usage:
zimfuse zimFile.zim mountDir

mountDir should exist before using zimfuse.

setting up the project

Implemented a trie-like tree in which each path is split into its directory components. For example, my/new/path1.jpg and my/new/path2.jpg are both stored under / (root) → my → new, which has the two children path1.jpg and path2.jpg.
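The path splitting described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `PathNode`, `splitPath` and `insertPath` are hypothetical names, and the children are kept in a `std::map` for brevity.

```cpp
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical node type for illustration; not the Node class from this PR.
struct PathNode {
    std::string name;
    std::map<std::string, std::unique_ptr<PathNode>> children;
};

// Split "my/new/path1.jpg" into {"my", "new", "path1.jpg"}, skipping empty parts.
std::vector<std::string> splitPath(const std::string& path) {
    std::vector<std::string> parts;
    std::istringstream stream(path);
    std::string part;
    while (std::getline(stream, part, '/')) {
        if (!part.empty()) {
            parts.push_back(part);
        }
    }
    return parts;
}

// Walk down from the root, creating any missing node along the way.
// Returns the node for the last path component.
PathNode* insertPath(PathNode& root, const std::string& path) {
    PathNode* current = &root;
    for (const auto& part : splitPath(path)) {
        auto& child = current->children[part];
        if (!child) {
            child = std::make_unique<PathNode>();
            child->name = part;
        }
        current = child.get();
    }
    return current;
}
```

Inserting my/new/path1.jpg and then my/new/path2.jpg into an empty root creates the nodes my and new once, with both files as children of new.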

Implemented some common FUSE functions. We can browse the ZIM file using ls, cd, etc. and open files in text editors or using commands such as "cat" now.

Updated the README to add information about zimfuse

kelson42 · 2024-04-21T09:34:30Z

I guess this would be an implementation of kiwix/overview#79 ?

mappedNodes owns the nodes

Some entries had long filenames. This change truncates them to 100 characters, which is well within the allowed limit. This introduces possible name collisions, which will be fixed in the following commit.

Now, if two files have the same name, the second file is renamed to "originalFileName(1)".
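One possible way to generate such suffixes is to keep a counter per base name within each directory. The sketch below is an assumption about how this could be done, not the code from the commit: `NameDeduplicator` is a hypothetical helper, and it does not handle the corner case where an entry is literally named "foo(1)" already.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: hands out unique names within a single directory.
class NameDeduplicator {
public:
    // Returns `name` unchanged the first time it is seen,
    // then "name(1)", "name(2)", ... for later duplicates.
    std::string uniqueName(const std::string& name) {
        const int timesSeen = counts_[name]++;
        if (timesSeen == 0) {
            return name;
        }
        return name + "(" + std::to_string(timesSeen) + ")";
    }

private:
    // One counter per base name, so "foo" and "bar" are numbered independently.
    std::unordered_map<std::string, int> counts_;
};
```

With one `NameDeduplicator` per directory, the sequence foo, foo, bar, bar yields foo, foo(1), bar, bar(1).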

When reading ZIMs with a lot of files (for example: mediawiki_en_all), getting the size of every file took a lot of time. There were two choices:

1. Read the file size while creating the tree. This gives fast subsequent responses (on commands such as ls), but the FUSE initialization takes a good amount of time.
2. Find the file size when a command requests it, and save it for future requests. This gives fast initialization, but the first read takes time (though not as much as getting all file sizes, since only the requested ones are read).

I went with option 2.
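The "compute on first request, then cache" approach of option 2 could look roughly like this. It is a minimal sketch under assumptions: `LazySize` is a hypothetical class, the loader callback stands in for whatever call actually reads the size from the ZIM file, C++17 is assumed for `std::optional`, and no locking is shown even though a multithreaded FUSE session would need it.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical wrapper that computes a file size once, on first request.
class LazySize {
public:
    explicit LazySize(std::function<std::uint64_t()> loader)
        : loader_(std::move(loader)) {}

    // The first call runs the (possibly slow) loader; later calls return the cached value.
    std::uint64_t get() {
        if (!cached_) {
            cached_ = loader_();
        }
        return *cached_;
    }

    bool isCached() const { return cached_.has_value(); }

private:
    std::function<std::uint64_t()> loader_;
    std::optional<std::uint64_t> cached_;
};
```

Building the tree then costs nothing per file, and a command such as ls only pays for the entries it actually stats.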

juuz0 · 2024-04-29T03:21:04Z

> I guess this would be an implementation of kiwix/overview#79 ?

Yes.

kelson42 · 2024-04-29T04:09:18Z

@juuz0 Thanks but most of the CI is broken? Can you fix it?

juuz0 · 2024-05-02T16:49:41Z

CI fails because 'fuse3' is not available on the system...maybe something to add in kiwix-build later?

mgautierfr

Thanks for the PR @juuz0.
But I have a few general comments that are better addressed on the PR as a whole than on individual commits:

- This kind of tree is a perfect use case for OOP and inheritance: a basic Node with a name and a parent, then a Dir node (inheriting from Node) with children, and a Leaf node.
- You define Node::Ptr as a unique_ptr, but the Node tree is composed of Node*. Node::Ptr is only used in Tree::mappedNodes. This makes Tree the direct owner of all the nodes, with Nodes only holding "references" to other nodes. It would make more sense for Dir nodes to own their children.
- Tree::mappedNodes is a kind of global cache, but I'm not sure we need it. Either "directories" contain few entries, in which case looping over the vector should be enough, or they don't, in which case it would be better to turn Node::children into a map. We would get O(1) access at node level, and it would also guarantee that children names are unique.
- Tree::statCache could be removed by simply moving the cached struct stat into the Node itself.
- The Tree could then probably be removed, as it is simply a Dir node without a parent.
- collisionCount is local to each node but "global" to all of its children. This means that for two pairs of twins, foo (x2) and bar (x2), you will get the sanitized names foo, foo(1), bar and bar(2). The last one should be bar(1).
- Redirects are not properly handled. You resolve the redirection and return the content of the target, but this may break relative links in HTML. You should treat redirects as symlinks (and so implement readlink).
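To make the first few points more concrete, here is a rough sketch of what such a hierarchy could look like. It is only an illustration of the idea, not a proposed final design: the member names are hypothetical, `std::unordered_map` is one possible choice for the children map (average O(1) lookup), and C++17 is assumed for `try_emplace`.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Base class: every node has a name and a non-owning pointer to its parent.
class Node {
public:
    Node(std::string name, Node* parent)
        : name_(std::move(name)), parent_(parent) {}
    virtual ~Node() = default;

    virtual bool isDir() const = 0;
    const std::string& name() const { return name_; }
    Node* parent() const { return parent_; }

private:
    std::string name_;
    Node* parent_;  // back reference only; the parent owns this node
};

// A file entry. A cached struct stat could live here instead of in a global cache.
class Leaf : public Node {
public:
    using Node::Node;
    bool isDir() const override { return false; }
};

// A directory owns its children through unique_ptr, keyed by name.
class Dir : public Node {
public:
    using Node::Node;
    bool isDir() const override { return true; }

    // Takes ownership of `child`. Returns nullptr if the name is already
    // used, which makes the uniqueness of child names explicit.
    Node* addChild(std::unique_ptr<Node> child) {
        const std::string key = child->name();
        auto [it, inserted] = children_.try_emplace(key, std::move(child));
        return inserted ? it->second.get() : nullptr;
    }

    Node* find(const std::string& name) const {
        auto it = children_.find(name);
        return it == children_.end() ? nullptr : it->second.get();
    }

    std::size_t childCount() const { return children_.size(); }

private:
    std::unordered_map<std::string, std::unique_ptr<Node>> children_;
};
```

With this layout the root is just a `Dir` whose parent is `nullptr`, so a separate Tree class and a global node map are no longer needed to keep the nodes alive.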

On top of that, I wonder about the memory usage of the tree.
wikipedia_fr_all_maxi contains 7029908 entries (leaves), and sizeof(Node) is 144 bytes, not counting the actual data stored in it (name/fullPath/originalPath bytes, children pointers). So we need at least 965 MiB for the nodes alone, and probably more than 2 GiB once we add the paths and children pointers.
We can reduce that by carefully defining the Node structure and what we store, but in the end we will always use a lot of memory.
(We may simply not care and tell users that mounting ZIM files needs a lot of memory, at least for now.)
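The 965 MiB figure is simply the entry count multiplied by the node size. A small sketch of the arithmetic, using only the two numbers quoted above (the constant and function names are made up for this example):

```cpp
#include <cstdint>

// Figures quoted in the comment above.
constexpr std::uint64_t kEntryCount = 7029908;  // leaves in wikipedia_fr_all_maxi
constexpr std::uint64_t kNodeSize = 144;        // sizeof(Node), in bytes
constexpr std::uint64_t kBytesPerMiB = 1024 * 1024;

// Memory taken by the Node objects alone, ignoring strings and child pointers.
constexpr std::uint64_t nodeBytes(std::uint64_t entries, std::uint64_t nodeSize) {
    return entries * nodeSize;
}

constexpr std::uint64_t kNodeBytes = nodeBytes(kEntryCount, kNodeSize);

// 1,012,306,752 bytes, which is 965 MiB when rounded down.
static_assert(kNodeBytes == 1012306752ULL, "unexpected node memory estimate");
static_assert(kNodeBytes / kBytesPerMiB == 965, "unexpected MiB estimate");
```

This covers the fixed-size part of each node only; the heap data behind the names and paths comes on top of it.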

mgautierfr · 2024-05-13T14:34:28Z

meson.build

@@ -28,13 +28,14 @@ find_library_in_compiler = meson.version().version_compare('>=0.31.0')
 rt_dep = dependency('rt', required:false)
 docopt_dep = dependency('docopt', static:static_linkage)

-with_writer = host_machine.system() != 'windows'
+with_writer_and_mount = host_machine.system() != 'windows'


It would be better to have an on_windows variable, or to split this into two variables, with_writer and with_mount.

We should not tie the writer and mount compilation together (at least not explicitly in one variable).

mgautierfr · 2024-05-13T14:35:39Z

meson.build

@@ -28,13 +28,14 @@ find_library_in_compiler = meson.version().version_compare('>=0.31.0')
 rt_dep = dependency('rt', required:false)
 docopt_dep = dependency('docopt', static:static_linkage)

-with_writer = host_machine.system() != 'windows'
+with_writer_and_mount = host_machine.system() != 'windows'

 if with_writer


This should be renamed also.

juuz0 added 4 commits April 10, 2024 19:13

Meson configuration for ZIMFuse

d2567e2

setting up the project

Tree implementation for ZIM file

739fde5

Implemented a trie-like tree in which each path is split into its directory components. For example, my/new/path1.jpg and my/new/path2.jpg are both stored under / (root) → my → new, which has the two children path1.jpg and path2.jpg.

FUSE commands implemented: readdir, getattr, read, open

3bfb8ad

Implemented some common FUSE functions. We can browse the ZIM file using ls, cd, etc. and open files in text editors or using commands such as "cat" now.

Update README.md

cf4483e

Updated the README to add information about zimfuse

juuz0 force-pushed the zimfuse branch from 42146ef to 770e65f Compare April 29, 2024 03:17

juuz0 added 4 commits April 29, 2024 08:48

Switch to smart pointers

7e017f8

mappedNodes owns the nodes

Don't allow filenames to exceed 500 chars

717afc0

Some entries had long filenames. This change truncates them to 100 characters, which is well within the allowed limit. This introduces possible name collisions, which will be fixed in the following commit.

Add collision number to same file names

93ab559

Now, if two files have the same name, the second file is renamed to "originalFileName(1)".

juuz0 force-pushed the zimfuse branch from 770e65f to 7af6fa8 Compare April 29, 2024 03:19

juuz0 marked this pull request as ready for review April 29, 2024 03:21

juuz0 requested a review from kelson42 April 29, 2024 03:21

kelson42 requested review from rgaudin and mgautierfr April 29, 2024 04:04

mgautierfr requested changes May 16, 2024

mgautierfr mentioned this pull request Jun 27, 2024

Develop a graphical ZIM explorer openzim/overview#37

Closed
