Commit

docs(mkdwarfs): update man page with all new options
mhx committed Dec 17, 2023
1 parent 30a40e7 commit 430e7e7
Showing 1 changed file with 126 additions and 12 deletions.
138 changes: 126 additions & 12 deletions doc/mkdwarfs.md
@@ -65,6 +65,10 @@ Most other options are concerned with compression tuning:
`--window-step` and `--order`. See the output of `mkdwarfs --help` for
a table listing the exact defaults used for each compression level.

- `--categorize`[`=`*categorizer*[`,`...]]:
Enable one or more categorizers in the given order.
See [CATEGORIZERS](#categorizers) for more details.

- `-S`, `--block-size-bits=`*value*:
The block size used for the compressed filesystem. The actual block size
is two to the power of this value. Larger block sizes will offer better
@@ -91,9 +95,8 @@ Most other options are concerned with compression tuning:
and values greater than 10 are mapped to "background" priority.

- `--num-scanner-workers=`*value*:
Number of worker threads used for building the filesystem. This defaults
to the number of processors available on your system. Use this option if
you want to limit the resources used by `mkdwarfs` or to optimize build
Number of worker threads used for scanning the filesystem. Use this option
if you want to limit the resources used by `mkdwarfs` or to optimize build
speed. This option affects only the scanning phase. By default, the same
value is used as for `--num-workers`.
In the scanning phase, the worker threads are used to scan files in the
@@ -102,7 +105,20 @@ Most other options are concerned with compression tuning:
computation, depending on the `--order` option. File discovery itself
is single-threaded and runs independently from the scanning threads.

- `-B`, `--max-lookback-blocks=`*value*:
- `--num-segmenter-workers=`*value*:
Number of worker threads used for segmenting the input data. By default,
the same value is used as for `--num-workers`.
Segmenting the input data is one of the most time-consuming tasks when
building a file system, and cannot easily be parallelized. However, when
using the categorizer, a separate segmenter will be used for each category
(and subcategory, if present). This option controls how many segmenters
can run simultaneously. When `--compress-niceness` is set to the default,
segmenter threads will always have a higher priority than compression
threads, making sure that compression doesn't slow down segmentation.
This option also controls the number of threads used for ordering the
input to the segmenter.

- `-B`, `--max-lookback-blocks=[*category*`::`]`*value*:
Specify how many of the most recent blocks to scan for duplicate segments.
By default, only the current block will be scanned. The larger this number,
the more duplicate segments will likely be found, which may further improve
@@ -111,7 +127,7 @@ Most other options are concerned with compression tuning:
files can now potentially span multiple filesystem blocks. Passing `-B0`
will completely disable duplicate segment search.

- `-W`, `--window-size=`*value*:
- `-W`, `--window-size=[*category*`::`]`*value*:
Window size of cyclic hash used for segmenting. This is an exponent
to a base of two. Cyclic hashes are used by `mkdwarfs` for finding
identical segments across multiple files. This is done on top of duplicate
@@ -128,7 +144,7 @@ Most other options are concerned with compression tuning:
size will grow. Passing `-W0` will completely disable duplicate segment
search.

- `-w`, `--window-step=`*value*:
- `-w`, `--window-step=[*category*`::`]`*value*:
This option specifies how often cyclic hash values are stored for lookup.
It is specified relative to the window size, as a base-2 exponent that
divides the window size. As a concrete example, if `--window-size=16`
@@ -141,7 +157,7 @@ Most other options are concerned with compression tuning:
If you use a larger value for this option, the increments become *smaller*,
and `mkdwarfs` will be slightly slower and use more memory.

- `--bloom-filter-size`=*value*:
- `--bloom-filter-size`=[*category*`::`]*value*:
The segmenting algorithm uses a bloom filter to determine quickly if
there is *no* match at a given position. This will filter out more than
90% of bad matches quickly with the default bloom filter size. The default
@@ -161,7 +177,7 @@ Most other options are concerned with compression tuning:
algorithms, so if you're short on memory it might be worth tweaking the
compression options.

- `-C`, `--compression=`*algorithm*[`:`*algopt*[`=`*value*][`:`...]]:
- `-C`, `--compression=`[*category*`::`]*algorithm*[`:`*algopt*[`=`*value*][`:`...]]:
The compression algorithm and configuration used for file system data.
The value for this option is a colon-separated list. The first item is
the compression algorithm, the remaining items are its options. Options
@@ -211,6 +227,12 @@ Most other options are concerned with compression tuning:
metadata to uncompressed metadata without having to rebuild or recompress
all the other data.

- `--recompress-categories=`[`!`]*category*[`,`...]:
When `--recompress` is set to `all` or `block`, this option controls
which categories of blocks will be recompressed. Adding a `!` in front
of the list allows you to specify which categories will *not* be
recompressed.

- `-P`, `--pack-metadata=auto`|`none`|[`all`|`chunk_table`|`directories`|`shared_files`|`names`|`names_index`|`symlinks`|`symlinks_index`|`force`|`plain`[`,`...]]:
Which metadata information to store in packed format. This is primarily
useful when storing metadata uncompressed, as it allows for smaller
@@ -258,7 +280,7 @@ Most other options are concerned with compression tuning:
"normalize" the permissions across the file system; this is equivalent to
using `--chmod=ug-st,=Xr`.

- `--order=none`|`path`|`revpath`|`similarity`|`nilsimsa`[`:`*max-children*[`:`*max-cluster-size*]]:
- `--order=`[*category*`::`]`none`|`path`|`revpath`|`similarity`|`nilsimsa`[`:`*max-children*[`:`*max-cluster-size*]]:
The order in which inodes will be written to the file system. Choosing `none`,
the inodes will be stored in the order in which they are discovered. With
`path`, they will be sorted asciibetically by path name of the first file
@@ -367,15 +389,107 @@ Most other options are concerned with compression tuning:
you can switch to `ascii`, which is like `unicode`, but looks less
fancy.

- `--incompressible-min-input-size=`*value*:
The minimum size of a file to be checked for incompressibility when
the `incompressible` categorizer is active.

- `--incompressible-block-size=`*value*:
The block size used to test data for compressibility. This will also
be the size of the fragments when `--incompressible-fragments` is used.

- `--incompressible-fragments`:
Categorize individual fragments of a file as incompressible instead of
only the file as a whole.

- `--incompressible-ratio=`*value*:
The ratio above which a file or fragment is categorized as `incompressible`.

- `--incompressible-zstd-level=`*value*:
The ZSTD compression level used for incompressible categorization.

- `-h`, `--help`:
Show usage and the most common basic options.

- `-H`, `--long-help`:
Show full usage with all options, including defaults, compression level
detail and supported compression algorithms.

If experimental Python support was compiled into `mkdwarfs`, you can use the
following option to enable customizations via the scripting interface:
## CATEGORIZERS

Categorizers will inspect the input files in the scanning phase and try to
assign them a category. Each categorizer can define a set of categories,
and each of these categories can optionally support subcategories.

Running `mkdwarfs` with the `-H` or `--long-help` option will display the
list of available categorizers and the categories they emit. At the moment,
`mkdwarfs` supports two categorizers, `incompressible` and `pcmaudio`. The
`incompressible` categorizer comes with its own set of options while the
`pcmaudio` categorizer doesn't need any further configuration.

Categorizers are only useful if at least some of the `mkdwarfs` configuration
is category-dependent. The options that can be configured per category are
`--compression`, `--order`, `--max-lookback-blocks`, `--window-size`,
`--window-step`, and `--bloom-filter-size`.

The resulting configuration matrix can be quite overwhelming, which is why
`mkdwarfs` will run with a reasonable set of defaults if you specify the
`--categorize` option with no arguments. These defaults are currently:

--categorize=pcmaudio,incompressible
--compression incompressible::null
--compression pcmaudio/waveform::flac
--order pcmaudio/waveform::revpath
--max-lookback-blocks pcmaudio/waveform::0
--window-size pcmaudio/waveform::0

Note that in case of the `pcmaudio` categorizer, you can override each
option per category (in this case `pcmaudio/waveform`).
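
The `[category::]value` syntax used by the per-category options can be illustrated with a short Python sketch. The helper `split_category_option` is hypothetical and is not how `mkdwarfs` parses its command line. It only assumes that the category prefix, when present, is separated from the value by the first `::`.

```python
from typing import Optional, Tuple


def split_category_option(arg: str) -> Tuple[Optional[str], str]:
    """Split '[category::]value' into (category, value).

    Hypothetical helper for illustration only. The category is None
    when the argument carries no '::' prefix.
    """
    category, sep, value = arg.partition("::")
    if not sep:
        return None, arg
    return category, value


# Values taken from the default configuration listed above.
print(split_category_option("pcmaudio/waveform::flac"))
print(split_category_option("incompressible::null"))
# A single ':' separates algorithm options and is left untouched.
print(split_category_option("zstd:level=19"))
```

The single colon inside a `--compression` value is not affected, since only the double colon is treated as the category separator in this sketch.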

It's also worth noting that the order in which the categorizers are given
is important. The first categorizer that successfully categorizes a
file wins and, if possible, no other categorizers will run on the same
file.
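
The "first categorizer wins" rule can be sketched in Python as follows. This is a simplified model, not the `mkdwarfs` implementation: `toy_pcmaudio` and `toy_incompressible` are hypothetical stand-ins, each file gets at most one category, `zlib` replaces the fast compressor, and the threshold of 0.9 is an assumption.

```python
import hashlib
import zlib
from typing import Callable, Optional, Sequence

Categorizer = Callable[[bytes], Optional[str]]


def toy_pcmaudio(data: bytes) -> Optional[str]:
    # Toy check: only recognizes a RIFF/WAVE header.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "pcmaudio"
    return None


def toy_incompressible(data: bytes) -> Optional[str]:
    # Toy check: zlib stands in for the fast compressor, 0.9 is an assumed threshold.
    if data and len(zlib.compress(data, 1)) / len(data) > 0.9:
        return "incompressible"
    return None


def categorize(data: bytes, categorizers: Sequence[Categorizer]) -> Optional[str]:
    # The first categorizer that returns a category wins; later ones do not run.
    for categorizer in categorizers:
        category = categorizer(data)
        if category is not None:
            return category
    return None


# Deterministic high-entropy payload standing in for audio samples.
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
wav_like = b"RIFF" + (len(noise) + 4).to_bytes(4, "little") + b"WAVE" + noise

print(categorize(wav_like, [toy_pcmaudio, toy_incompressible]))
print(categorize(wav_like, [toy_incompressible, toy_pcmaudio]))
```

With the audio check first, the WAV-like input is categorized as `pcmaudio`. With the order reversed, the same input is claimed by the incompressibility check before the audio check gets a chance to run.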

### `incompressible` Categorizer

The `incompressible` categorizer will try to compress each input with a
very fast compression algorithm (`zstd` using a negative compression level
by default). If it turns out that this doesn't reduce the size of the input
significantly, the input will be categorized as `incompressible`.

You can use the `incompressible` categorizer in two modes: whole-file or
fragmented categorization. In the former, the whole input file will be
categorized, whereas in the latter, each file can be further broken down
into compressible and incompressible fragments. The size of these fragments
will be equal to the block size used for categorization.
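
The difference between the two modes can be sketched in Python. This is an illustration under stated assumptions, not the `mkdwarfs` code: `classify_fragments` is a hypothetical helper, `zlib` stands in for `zstd`, and the ratio is assumed to mean compressed size divided by original size.

```python
import hashlib
import zlib
from typing import List, Tuple


def classify_fragments(data: bytes, block_size: int, ratio: float) -> List[Tuple[int, int, bool]]:
    """Return (offset, length, is_incompressible) for each block-sized fragment."""
    fragments = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        compressed_size = len(zlib.compress(block, 1))
        # A fragment counts as incompressible if compression barely shrinks it.
        fragments.append((offset, len(block), compressed_size / len(block) > ratio))
    return fragments


# One highly repetitive block followed by one high-entropy block.
text = (b"mkdwarfs " * 512)[:4096]
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

for offset, length, incompressible in classify_fragments(text + noise, 4096, 0.9):
    print(offset, length, "incompressible" if incompressible else "compressible")
```

Calling the same helper with `block_size` set to the full input length models whole-file categorization, where a single verdict covers both halves of the input.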

It makes sense to use this categorizer as the last in a list of multiple
categorizers, not only because it will likely have the biggest overhead, but
also because it can wrongly classify data as incompressible that could be
compressed well with a specialized algorithm (e.g. audio data).

### `pcmaudio` Categorizer

The `pcmaudio` categorizer can identify and categorize a wide range of
uncompressed audio data such as `.wav`, `.aiff` and more obscure formats.

It produces two different categories: `pcmaudio/waveform` for the actual
waveform data, and `pcmaudio/metadata` for all other data in the file such
as the file header. The `pcmaudio/waveform` category is again divided into
many subcategories depending on the type of waveform data (e.g. number of
channels, bit depth, byte order, etc.).

In order to efficiently compress `pcmaudio/waveform` data, a suitable
compression algorithm must be selected for this category. `mkdwarfs`
currently supports `flac` compression, which offers the best trade-off
between compression speed and achievable compression ratio.

It is worth noting that options such as `--window-size` will operate on
*sample* granularity instead of *byte* granularity when processing
`pcmaudio/waveform` data, where *sample* granularity means one sample
for each channel. For example, a 16-bit stereo file would have a
granularity of 4 bytes and thus `--window-size=10` would refer to a
4 KiB window instead of a 1 KiB window.
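
The arithmetic behind this example can be written out as a small Python sketch. The function `window_size_bytes` is a hypothetical helper for illustration and is not part of `mkdwarfs`.

```python
def window_size_bytes(window_size: int, channels: int = 1, bytes_per_sample: int = 1) -> int:
    """Window size in bytes for a given --window-size exponent.

    The exponent counts units of one sample per channel, so the unit
    size is channels * bytes_per_sample (1 for plain byte data).
    """
    return (1 << window_size) * channels * bytes_per_sample


print(window_size_bytes(10))                                  # plain byte data: 1024
print(window_size_bytes(10, channels=2, bytes_per_sample=2))  # 16-bit stereo: 4096
```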

## TIPS & TRICKS

@@ -642,7 +756,7 @@ own thread and continuously emits file inodes. These will be picked up
by the segmenter thread, which scans the inode contents using a cyclic
hash and determines overlapping segments between previously written
data and new incoming data. The segmenter will look at up to
`--max-lookback-block` previous filesystem blocks to find overlaps.
`--max-lookback-blocks` previous filesystem blocks to find overlaps.

Once the segmenter has produced enough data to fill a filesystem
block, the block is added to a queue from which the blocks