Skip to content

Commit

Permalink
doc: Document splitstreams
Browse files Browse the repository at this point in the history
Move the original version out of the splitstream source code and
expand/clarify it a bit.
  • Loading branch information
allisonkarlitskaya committed Oct 9, 2024
1 parent 3f494e1 commit bdf3e1e
Show file tree
Hide file tree
Showing 2 changed files with 52 additions and 25 deletions.
51 changes: 51 additions & 0 deletions doc/splitstream.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Split Stream

Split Stream is a trivial way of storing file formats (like tar) with the "data
blocks" stored in the composefs object tree with the goal that it's possible to
bit-for-bit recreate the entire file. It's something like the idea behind
[tar-split](https://github.com/vbatts/tar-split), with some important differences:

- although it's designed with `tar` files in mind, it's not specific to `tar`,
or even to the idea of an archive file: any file format can be stored as a
splitstream, and it might make sense to do so for any file format that
contains large chunks of embedded data

- it's based on storing external objects in the composefs object store

- it's based on a trivial binary format

It is expected that the splitstream will be compressed before being stored on
disk. In composefs, this is done using zstd. The main reason for this is
that, after removing the actual file data, the remaining `tar` metadata
contains a very large amount of padding and empty space and compresses
extremely well.

## File format

The file format consists of a number of data blocks.

Each block starts with a u64 le "size" field followed by some amount of data.

```
64bit variable-sized
+--------+---------------....
| size | data...
+--------+---------------....
```

There are two kinds of blocks:

- "Inline" blocks (`size != 0`): in this case the length of the data is equal
to the size. This is "inline data" and is usually used for the metadata
and padding present in the source file. The Split Stream format itself
doesn't have any padding, which implies that the size fields after the
first may be unaligned. This decision was taken to keep the format simple,
and because the data is compressed before being stored, which removes the
main advantages of aligned data.

- "External" blocks (`size == 0`): in this case the length of the data is 32
bytes. This is the binary form of a sha256 hash value and is a reference
to an object in the composefs repository (by its fs-verity digest).

That's it, really. There's no header. The stream is over when there are no
more blocks.
26 changes: 1 addition & 25 deletions src/splitstream.rs
Original file line number Diff line number Diff line change
@@ -1,30 +1,6 @@
/* Implementation of the Split Stream file format
*
* Split Stream is a trivial way of storing file formats (like tar) with the "data blocks" stored
* in the composefs object tree. It's something like tar-split, but is based on content-addressed
* storage of the file data and is implemented using a trivial binary-ish format.
*
* It is expected that the splitstream will be compressed before being stored on disk.
*
* The file format consists of a number of data blocks.
*
*
* Each block starts with a u64 le "size" field followed by some amount of data.
*
* 64bit variable
* +--------+---------------....
* | size | data...
* +--------+---------------....
*
* There are two kinds of blocks.
*
* - size != 0: in this case the length of the data is equal to the size. This is "inline data".
* There is no padding, which implies that the size fields after the first may be unaligned.
*
* - size == 0: in this case the length of the data is 32 bytes. This is the binary form of a
* sha256 hash value and is a reference to an object in the composefs repository.
*
* That's it, really. There's no header. The file is over when there's no more blocks.
* See doc/splitstream.md
*/

use std::{
Expand Down

0 comments on commit bdf3e1e

Please sign in to comment.