revise a decoder and encoder, being pure #140

hannesm · 2024-02-02T16:49:55Z

this is on the path to remove the functors from the main tar library.

while working on this, I encountered some issues in the reader already:

if a zero block is read, the next block (if not another zero block, or a unmarshal error) is treat as a header directly (i.e. no pax/extended header disambiguation ongoing)
the longname/longlink is ignored when there was a PerFileExtendedHeader -- since here after unmarshalling only a "fix_link_indicator" is called, but the longname/longlink arguments are not respected (I understand there's a todo "unclear how/if pax headers should interact with gnu extensions" -- and I don't know whether what this PR provides is what other tar implementations support as well)

happy to hear your feedback. if this PR is fine with you, I'm keen to use the decoder & encoder in the effectful layers (and remove the headerreader/headerwriter/other functors). WDYT?

hannesm · 2024-02-02T16:50:32Z

For reviewing purposes and testing purposes, I adjusted the list in parse_test to use the provided decoder, and their tests are passing :)

hannesm · 2024-02-04T12:48:36Z

This PR proposes a new API, and is ready for review. Tests are passing, missing bits are Tar_gz and the eio code. A thorough review of the functionality deployed in Tar_unix / Tar_lwt_unix would be great. Certainly there are missing bits and pieces (in extract / header_of_file / ..) we can fill in over time.

//cc @MisterDA @reynir who worked on the code (including removing Archive), and who may have ideas about users of the modules and whether these can be (easily) converted to this new API.

reynir

Preliminary review. I reviewed Tar and Tar_unix. TODO on my part is to review Tar_mirage and the test changes. I assume the Tar_lwt_unix changes are similar to Tar_unix.

reynir · 2024-02-05T12:07:50Z

lib/tar.ml

+
+type decode_state = {
+  global : Header.Extended.t option;
+  state : [ `Active of bool


I find the meaning of the `Active of bool constructor not very clear from its name. But the type is not exposed and the scope is fairly limited so that's okay.

reynir · 2024-02-05T12:28:12Z

lib/tar.ml

+  | `Real_header extended ->
+    let* x =
+      Result.map_error
+        (fun _ -> `Fatal `Corrupt_pax_header) (* NB better error *)
+        (Header.unmarshal ~extended data)
    in
-    let pax_payload = Header.Extended.marshal hdr in
-    let pax = Header.make ~link_indicator link_indicator_name
-                (Int64.of_int @@ String.length pax_payload) in
-    write_unextended ?level pax fd >>= function
-    | Error _ as e -> return e
-    | Ok () ->
-      really_write fd pax_payload >>= fun () ->
-      really_write fd (Header.zero_padding pax) >>= fun () ->
-      return (Ok ())
-
-  let write ?level header fd =
-    ( match header.Header.extended with
-      | None -> return (Ok ())
-      | Some e ->
-        write_extended ?level ~link_indicator:Header.Link.PerFileExtendedHeader e fd )
-    >>= function
-    | Error _ as e -> return e
-    | Ok () -> write_unextended ?level header fd
-
-  let write_global_extended_header global fd =
-    write_extended ~link_indicator:Header.Link.GlobalExtendedHeader global fd
-end
+    let t, hdr = construct_header t x in
+    Ok (t, Some (`Header hdr), None)


We don't check the link indicator here. We may want to error out or do something different if it's (another) PAX extended header or a GNU Long{Link,Name}. I doubt there are archives out there that mix GNU LongLink and PAX extended headers so I'm fine with any behavior in that case. I don't know if you're allowed to have PAX extended headers after each other and what the semantics thereof would be.

I'm fine with not making a decision on this in this PR - but maybe we should add a comment then.

It turns out we may actually produce such archives due to #142 if using compatibility level GNU :/

reynir · 2024-02-05T12:32:31Z

lib/tar.ml

+  | `Next_longlink x ->
+    let name = String.sub data 0 (String.length data - 1) in
+    let next_longlink = if x.Header.link_indicator = Header.Link.LongLink then Some name else t.next_longlink in
+    let next_longname = if x.Header.link_indicator = Header.Link.LongName then Some name else t.next_longname in
+    Ok ({ t with next_longlink ; next_longname ; state = `Active false },
+        Some (`Skip (Header.compute_zero_padding_length x)),
+        None)


A hack, but we synthesize an Extended.t from next_longlink and next_longname. I think this is fine as is though.

reynir · 2024-02-05T12:46:54Z

lib/tar.ml

+      Error (`Fatal e)
+
+let encode_long level link_indicator payload =
+  let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in


Maybe:

Suggested change

let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in

let blank = Header.make ~link_indicator:Header.Link.LongLink ~user_id:0 ~uname:"root" ~group_id:0 ~gname:"root" longlink 0L in

We can also rely on ~user_id and ~group_id defaulting to 0.

I don't have a strong opinion on this.

reynir · 2024-02-05T12:48:07Z

lib/tar.ml

+    | Error ((`Checksum_mismatch | `Unmarshal _) as e) ->
+      Error (`Fatal e)
+
+let encode_long level link_indicator payload =


Level is always GNU.

Suggested change

let encode_long level link_indicator payload =

let encode_long link_indicator payload =

reynir · 2024-02-05T14:44:08Z

unix/tar_unix.mli

+(** [extract ~filter ~src dst] extracts the tar archive [src] into the
+    directory [dst]. If [dst] does not exist, it is created. If [filter] is
+    provided (defaults to [fun _ -> true]), any file where [filter hdr] returns
+    [false], is skipped. *)
+val extract :
+  ?filter:(Tar.Header.t -> bool) ->
+  src:string -> string ->
+  (unit, decode_error) result


It is possible to create tar archives where you store headers for the files only and in my experience tar utilities will create the missing directories. For example, it might only contain regular files:

mysterydir/file1

mysterydir/phantom/file2

Extracting this archive will result in the following filesystem structure:

mysterydir/ ├── file1 └── phantom └── file2

Now the question is:

Should we do this, too?

Also somewhat related but independent: What is the meaning if filter { file_name = "mysterydir/"; _ } = false and filter { file_name = "mysterydir/file1"; _ } = true? That is, we filter out the parent directory but keep the child?

We should indeed create directories which are needed.

The filter - here you can decide what you like (you get a Header.t) - and match on "no mysterydir" or "no directory entry" or "nothing below mysterydir".

reynir · 2024-02-05T14:47:15Z

unix/tar_unix.mli

+(** [create ~level ~filter ~src dst] creates a tar archive at [dst]. It uses
+    [src], a directory name, as input. If [filter] is provided
+    (defaults to [fun _ -> true]), any file where [filter hdr] returns [false]
+    is skipped. *)
+val create : ?level:Tar.Header.compatibility ->
+  ?global:Tar.Header.Extended.t ->
+  ?filter:(Tar.Header.t -> bool) ->
+  src:string -> string ->
+  (unit, [ `Msg of string | `Unix of (Unix.error * string * string) ]) result


Here again there are potential pitfalls. If filter skips dir/ but keeps dir/file then we get an archive where only dir/file is present. In my experience tar tools will create missing dirs, so skipping dir/ will in this case not result in dir/ not being present in a sense, but the permissions and ownership will be something the tar utility decides.

reynir · 2024-02-05T14:53:21Z

unix/tar_unix.mli

+(** [append_file ~level ~header filename fd] appends the contents of [filename]
+    to the tar archive [fd]. If [header] is not provided, {header_of_file} is
+    used for constructing a header. *)


I think this could be explained better. For example, the caller will need to call write_end once no more files will be appended.

Suggested change

(** [append_file ~level ~header filename fd] appends the contents of [filename]

to the tar archive [fd]. If [header] is not provided, {header_of_file} is

used for constructing a header. *)

(** [append_file ~level ~header filename fd] appends the contents of [filename]

to the tar archive [fd]. If [header] is not provided, {header_of_file} is

used for constructing a header. It is the responsibility of the caller to write the end of the archive using e.g. [write_end] when no more files will be added. *)

I'm also curious if people might think you can just open an existing archive and call append_file on the newly opened file descriptor.

you can. Unix.openfile with O_APPEND, lseek to SEEK_END - 2 * 512, and then you can append :)

Right :) But is that clear from the documentation? Or could a user not very familiar with the tar format think they can just do let fd = Unix.openfile "backup.tar" [Unix.O_APPEND] 0o777 in append_file "hello.txt" fd; Unix.close fd and they're done? What I'm getting at is this maybe requires some knowledge about tar and otherwise you may think the function does more magic than it really does (like seeking at the end or finishing off with two zero blocks).

I can make a change to the documentation so it is more clear what it does and does not.

reynir · 2024-02-05T15:04:26Z

unix/tar_unix.ml

+        let* dst =
+          Result.map_error unix_err_to_msg
+            (safe Unix.(openfile (Filename.concat dst hdr.Tar.Header.file_name)
+                          [ O_WRONLY ; O_CREAT ]) hdr.Tar.Header.file_mode)
+        in


This is not super safe as the archive can contain file names such as ../../../../etc/passwd or /etc/passwd. Maybe it's better to leave it up to the user to write their own function using fold where they can decide how to handle paths, ownership and mode?

I'd be happy to say "extract into /my/tmp/dir", and nothing leaves that directory... thus if you have "/foo/bar" in your archive, the leading / is removed. ../ paths are ignored? I don't know for sure..

I think that is preferable. Maybe a unsafe_extract could exist, but I think Tar_unix.extract should not be a footgun.

unsafe_extract is something people can write their own with fold ;)

reynir · 2024-02-05T15:08:16Z

unix/tar_unix.ml

+      try
+        let* passwd_entry = safe Unix.getpwuid stat.Unix.LargeFile.st_uid in
+        Ok passwd_entry.Unix.pw_name
+      with Not_found -> Error (`Msg ("No user entry found for UID"))


FWIW GNU tar seems to use the empty string for unknown uids.

reynir · 2024-02-05T15:29:46Z

Overall I'm happy with the Tar changes. There is a thing about PAX headers in Tar.Header.make I'm not happy about but that was there before this change.

For Tar_unix I have some concerns about safety, potentially confusing names and unclear semantics of filtering. But I like how the usage of Tar looks like!

dinosaure

I'm not a big fan of the new API where some details should be know by user to use rightly Tar (and these details are about the tar format). Specifically about Tar_gz API, decompress‧gz is a streaming API (you can fill the decoder with some bytes, only an empty string informs the end of the stream). By extension, an usage of decompress‧gz should lead to a streaming API too but it's actually not possible with the current Tar API. It depends on what we want to expose of course.

dinosaure · 2024-02-05T15:31:17Z

lib/tar.ml

+  else
+    x
+
+type decode_state = {


I'm not a big fan of decode_state, I prefer decoder.

dinosaure · 2024-02-05T15:34:55Z

lib/tar.ml

+        Some (`Skip (Header.compute_zero_padding_length x)),
+        None)
+  | `Active read_zero ->
+    match Header.unmarshal ?extended:t.global data with


That mostly means that at the beginning, the user must provides a 512 bytes string at first (according to Header‧unmarshal), otherwise the user gets an error. Should we precise that the user must read a tar file per 512 bytes? Should we provide a streaming API?

@reynir

CHANGES: - Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145) - Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141) - **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146) These PRs attempt to de-functorize `Tar` so that users can implement I/O themselves, using `Tar`'s own element serialization/deserialization functions to take advantage of read/write methods. This avoids imposing on the user the implementation of a module that is too rigid in his/her case (which could have performance implications). `Tar` offers functions for serializing/deserializing tar-specific elements from `string`. It is then up to the user to know how to obtain or write these `strings`. To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write implementations and describing how to extract all entries from a tar file or how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`) of entries. These logics do not depend on a particular "scheduler", and these PRs propose a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`. These latter derivations mean that the API for these packages has only been extended, and there are no breaking changes as such. These logics also make it easy to offer a compression/decompression layer with `decompress`, so you can easily manipulate and/or create a .tar.gz file.

@reynir

CHANGES: - Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145) - Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141) - **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146) These PRs attempt to de-functorize `Tar` so that users can implement I/O themselves, using `Tar`'s own element serialization/deserialization functions to take advantage of read/write methods. This avoids imposing on the user the implementation of a module that is too rigid in his/her case (which could have performance implications). `Tar` offers functions for serializing/deserializing tar-specific elements from `string`. It is then up to the user to know how to obtain or write these `strings`. To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write implementations and describing how to extract all entries from a tar file or how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`) of entries. These logics do not depend on a particular "scheduler", and these PRs propose a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`. These latter derivations mean that the API for these packages has only been extended, and there are no breaking changes as such. These logics also make it easy to offer a compression/decompression layer with `decompress`, so you can easily manipulate and/or create a .tar.gz file.

hannesm force-pushed the pure-dec-enc branch from 53d580a to d24ba4c Compare February 3, 2024 15:52

hannesm added 9 commits February 4, 2024 00:17

revise a decoder and encoder, being pure

6c54ca0

remove stuff

c67f945

wip

9ccc73b

fix

ebabd3c

proposed API

ce9337b

add filter

50f6659

initial compiling tar_unix

1b4ae55

remove offset nonsense

984ffe0

lwt-unix

9c1c120

hannesm force-pushed the pure-dec-enc branch from 16d6905 to 9c1c120 Compare February 4, 2024 00:05

hannesm added 2 commits February 4, 2024 11:35

further work, get tests a bit more up to speed

29d884e

more tests are working now

281883b

hannesm force-pushed the pure-dec-enc branch from 0dae890 to 281883b Compare February 4, 2024 12:01

hannesm added 3 commits February 4, 2024 13:22

revive transform test

60d6faa

test tar_unix, use fold for list

462063b

document write_header

2b49b1f

reynir reviewed Feb 5, 2024

View reviewed changes

dinosaure reviewed Feb 5, 2024

View reviewed changes

dinosaure mentioned this pull request Feb 7, 2024

Pure dec enc with gz #143

Merged

dinosaure mentioned this pull request May 15, 2024

New tar gz #146

Merged

dinosaure merged commit 2b49b1f into mirage:main Aug 4, 2024
0 of 2 checks passed

hannesm deleted the pure-dec-enc branch August 5, 2024 06:33

dinosaure mentioned this pull request Aug 5, 2024

[new release] tar (4 packages) (3.0.0) ocaml/opam-repository#26330

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

revise a decoder and encoder, being pure #140

revise a decoder and encoder, being pure #140

hannesm commented Feb 2, 2024

hannesm commented Feb 2, 2024

hannesm commented Feb 4, 2024

reynir left a comment

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

hannesm Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

hannesm Feb 5, 2024

reynir Feb 5, 2024

reynir Feb 5, 2024

hannesm Feb 5, 2024

reynir Feb 5, 2024

hannesm Feb 5, 2024

reynir Feb 5, 2024

reynir commented Feb 5, 2024

dinosaure left a comment

dinosaure Feb 5, 2024

dinosaure Feb 5, 2024

	let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in
	let blank = Header.make ~link_indicator:Header.Link.LongLink ~user_id:0 ~uname:"root" ~group_id:0 ~gname:"root" longlink 0L in

	let encode_long level link_indicator payload =
	let encode_long link_indicator payload =

revise a decoder and encoder, being pure #140

revise a decoder and encoder, being pure #140

Conversation

hannesm commented Feb 2, 2024

hannesm commented Feb 2, 2024

hannesm commented Feb 4, 2024

reynir left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reynir commented Feb 5, 2024

dinosaure left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment