Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revise a decoder and encoder, being pure #140

Merged
merged 14 commits into from
Aug 4, 2024
Merged

Conversation

hannesm
Copy link
Member

@hannesm hannesm commented Feb 2, 2024

this is on the path to remove the functors from the main tar library.

while working on this, I encountered some issues in the reader already:

  • if a zero block is read, the next block (if not another zero block, or a unmarshal error) is treat as a header directly (i.e. no pax/extended header disambiguation ongoing)
  • the longname/longlink is ignored when there was a PerFileExtendedHeader -- since here after unmarshalling only a "fix_link_indicator" is called, but the longname/longlink arguments are not respected (I understand there's a todo "unclear how/if pax headers should interact with gnu extensions" -- and I don't know whether what this PR provides is what other tar implementations support as well)

happy to hear your feedback. if this PR is fine with you, I'm keen to use the decoder & encoder in the effectful layers (and remove the headerreader/headerwriter/other functors). WDYT?

@hannesm
Copy link
Member Author

hannesm commented Feb 2, 2024

For reviewing purposes and testing purposes, I adjusted the list in parse_test to use the provided decoder, and their tests are passing :)

@hannesm
Copy link
Member Author

hannesm commented Feb 4, 2024

This PR proposes a new API, and is ready for review. Tests are passing, missing bits are Tar_gz and the eio code. A thorough review of the functionality deployed in Tar_unix / Tar_lwt_unix would be great. Certainly there are missing bits and pieces (in extract / header_of_file / ..) we can fill in over time.

//cc @MisterDA @reynir who worked on the code (including removing Archive), and who may have ideas about users of the modules and whether these can be (easily) converted to this new API.

Copy link
Member

@reynir reynir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary review. I reviewed Tar and Tar_unix. TODO on my part is to review Tar_mirage and the test changes. I assume the Tar_lwt_unix changes are similar to Tar_unix.


type decode_state = {
global : Header.Extended.t option;
state : [ `Active of bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the meaning of the `Active of bool constructor not very clear from its name. But the type is not exposed and the scope is fairly limited so that's okay.

Comment on lines +722 to +729
| `Real_header extended ->
let* x =
Result.map_error
(fun _ -> `Fatal `Corrupt_pax_header) (* NB better error *)
(Header.unmarshal ~extended data)
in
let pax_payload = Header.Extended.marshal hdr in
let pax = Header.make ~link_indicator link_indicator_name
(Int64.of_int @@ String.length pax_payload) in
write_unextended ?level pax fd >>= function
| Error _ as e -> return e
| Ok () ->
really_write fd pax_payload >>= fun () ->
really_write fd (Header.zero_padding pax) >>= fun () ->
return (Ok ())

let write ?level header fd =
( match header.Header.extended with
| None -> return (Ok ())
| Some e ->
write_extended ?level ~link_indicator:Header.Link.PerFileExtendedHeader e fd )
>>= function
| Error _ as e -> return e
| Ok () -> write_unextended ?level header fd

let write_global_extended_header global fd =
write_extended ~link_indicator:Header.Link.GlobalExtendedHeader global fd
end
let t, hdr = construct_header t x in
Ok (t, Some (`Header hdr), None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't check the link indicator here. We may want to error out or do something different if it's (another) PAX extended header or a GNU Long{Link,Name}. I doubt there are archives out there that mix GNU LongLink and PAX extended headers so I'm fine with any behavior in that case. I don't know if you're allowed to have PAX extended headers after each other and what the semantics thereof would be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with not making a decision on this in this PR - but maybe we should add a comment then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out we may actually produce such archives due to #142 if using compatibility level GNU :/

Comment on lines +730 to +736
| `Next_longlink x ->
let name = String.sub data 0 (String.length data - 1) in
let next_longlink = if x.Header.link_indicator = Header.Link.LongLink then Some name else t.next_longlink in
let next_longname = if x.Header.link_indicator = Header.Link.LongName then Some name else t.next_longname in
Ok ({ t with next_longlink ; next_longname ; state = `Active false },
Some (`Skip (Header.compute_zero_padding_length x)),
None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A hack, but we synthesize an Extended.t from next_longlink and next_longname. I think this is fine as is though.

Error (`Fatal e)

let encode_long level link_indicator payload =
let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in
let blank = Header.make ~link_indicator:Header.Link.LongLink ~user_id:0 ~uname:"root" ~group_id:0 ~gname:"root" longlink 0L in

We can also rely on ~user_id and ~group_id defaulting to 0.

I don't have a strong opinion on this.

| Error ((`Checksum_mismatch | `Unmarshal _) as e) ->
Error (`Fatal e)

let encode_long level link_indicator payload =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Level is always GNU.

Suggested change
let encode_long level link_indicator payload =
let encode_long link_indicator payload =

Comment on lines +36 to +43
(** [extract ~filter ~src dst] extracts the tar archive [src] into the
directory [dst]. If [dst] does not exist, it is created. If [filter] is
provided (defaults to [fun _ -> true]), any file where [filter hdr] returns
[false], is skipped. *)
val extract :
?filter:(Tar.Header.t -> bool) ->
src:string -> string ->
(unit, decode_error) result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to create tar archives where you store headers for the files only and in my experience tar utilities will create the missing directories. For example, it might only contain regular files:

  • mysterydir/file1
  • mysterydir/phantom/file2

Extracting this archive will result in the following filesystem structure:

mysterydir/
├── file1
└── phantom
    └── file2

Now the question is:

  • Should we do this, too?

Also somewhat related but independent: What is the meaning if filter { file_name = "mysterydir/"; _ } = false and filter { file_name = "mysterydir/file1"; _ } = true? That is, we filter out the parent directory but keep the child?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should indeed create directories which are needed.

The filter - here you can decide what you like (you get a Header.t) - and match on "no mysterydir" or "no directory entry" or "nothing below mysterydir".

Comment on lines +45 to +53
(** [create ~level ~filter ~src dst] creates a tar archive at [dst]. It uses
[src], a directory name, as input. If [filter] is provided
(defaults to [fun _ -> true]), any file where [filter hdr] returns [false]
is skipped. *)
val create : ?level:Tar.Header.compatibility ->
?global:Tar.Header.Extended.t ->
?filter:(Tar.Header.t -> bool) ->
src:string -> string ->
(unit, [ `Msg of string | `Unix of (Unix.error * string * string) ]) result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again there are potential pitfalls. If filter skips dir/ but keeps dir/file then we get an archive where only dir/file is present. In my experience tar tools will create missing dirs, so skipping dir/ will in this case not result in dir/ not being present in a sense, but the permissions and ownership will be something the tar utility decides.

Comment on lines +59 to +61
(** [append_file ~level ~header filename fd] appends the contents of [filename]
to the tar archive [fd]. If [header] is not provided, {header_of_file} is
used for constructing a header. *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be explained better. For example, the caller will need to call write_end once no more files will be appended.

Suggested change
(** [append_file ~level ~header filename fd] appends the contents of [filename]
to the tar archive [fd]. If [header] is not provided, {header_of_file} is
used for constructing a header. *)
(** [append_file ~level ~header filename fd] appends the contents of [filename]
to the tar archive [fd]. If [header] is not provided, {header_of_file} is
used for constructing a header. It is the responsibility of the caller to write the end of the archive using e.g. [write_end] when no more files will be added. *)

I'm also curious if people might think you can just open an existing archive and call append_file on the newly opened file descriptor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can. Unix.openfile with O_APPEND, lseek to SEEK_END - 2 * 512, and then you can append :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right :) But is that clear from the documentation? Or could a user not very familiar with the tar format think they can just do let fd = Unix.openfile "backup.tar" [Unix.O_APPEND] 0o777 in append_file "hello.txt" fd; Unix.close fd and they're done? What I'm getting at is this maybe requires some knowledge about tar and otherwise you may think the function does more magic than it really does (like seeking at the end or finishing off with two zero blocks).

I can make a change to the documentation so it is more clear what it does and does not.

Comment on lines +131 to +135
let* dst =
Result.map_error unix_err_to_msg
(safe Unix.(openfile (Filename.concat dst hdr.Tar.Header.file_name)
[ O_WRONLY ; O_CREAT ]) hdr.Tar.Header.file_mode)
in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not super safe as the archive can contain file names such as ../../../../etc/passwd or /etc/passwd. Maybe it's better to leave it up to the user to write their own function using fold where they can decide how to handle paths, ownership and mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to say "extract into /my/tmp/dir", and nothing leaves that directory... thus if you have "/foo/bar" in your archive, the leading / is removed. ../ paths are ignored? I don't know for sure..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is preferable. Maybe a unsafe_extract could exist, but I think Tar_unix.extract should not be a footgun.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsafe_extract is something people can write their own with fold ;)

try
let* passwd_entry = safe Unix.getpwuid stat.Unix.LargeFile.st_uid in
Ok passwd_entry.Unix.pw_name
with Not_found -> Error (`Msg ("No user entry found for UID"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW GNU tar seems to use the empty string for unknown uids.

@reynir
Copy link
Member

reynir commented Feb 5, 2024

Overall I'm happy with the Tar changes. There is a thing about PAX headers in Tar.Header.make I'm not happy about but that was there before this change.

For Tar_unix I have some concerns about safety, potentially confusing names and unclear semantics of filtering. But I like how the usage of Tar looks like!

Copy link
Member

@dinosaure dinosaure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of the new API where some details should be know by user to use rightly Tar (and these details are about the tar format). Specifically about Tar_gz API, decompress‧gz is a streaming API (you can fill the decoder with some bytes, only an empty string informs the end of the stream). By extension, an usage of decompress‧gz should lead to a streaming API too but it's actually not possible with the current Tar API. It depends on what we want to expose of course.

else
x

type decode_state = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of decode_state, I prefer decoder.

Some (`Skip (Header.compute_zero_padding_length x)),
None)
| `Active read_zero ->
match Header.unmarshal ?extended:t.global data with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That mostly means that at the beginning, the user must provides a 512 bytes string at first (according to Header‧unmarshal), otherwise the user gets an error. Should we precise that the user must read a tar file per 512 bytes? Should we provide a streaming API?

@dinosaure dinosaure mentioned this pull request Feb 7, 2024
@dinosaure dinosaure mentioned this pull request May 15, 2024
@dinosaure dinosaure merged commit 2b49b1f into mirage:main Aug 4, 2024
0 of 2 checks passed
@hannesm hannesm deleted the pure-dec-enc branch August 5, 2024 06:33
dinosaure added a commit to dinosaure/opam-repository that referenced this pull request Aug 5, 2024
CHANGES:

- Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145)
- Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141)
- **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146)

  These PRs attempt to de-functorize `Tar` so that users can implement I/O
  themselves, using `Tar`'s own element serialization/deserialization functions
  to take advantage of read/write methods. This avoids imposing on the user the
  implementation of a module that is too rigid in his/her case (which could have
  performance implications).

  `Tar` offers functions for serializing/deserializing tar-specific elements
  from `string`. It is then up to the user to know how to obtain or write these
  `strings`.

  To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write
  implementations and describing how to extract all entries from a tar file or
  how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`)
  of entries.

  These logics do not depend on a particular "scheduler", and these PRs propose
  a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`.
  These latter derivations mean that the API for these packages has only been
  extended, and there are no breaking changes as such.

  These logics also make it easy to offer a compression/decompression layer with
  `decompress`, so you can easily manipulate and/or create a .tar.gz file.
avsm pushed a commit to avsm/opam-repository that referenced this pull request Sep 5, 2024
CHANGES:

- Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145)
- Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141)
- **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146)

  These PRs attempt to de-functorize `Tar` so that users can implement I/O
  themselves, using `Tar`'s own element serialization/deserialization functions
  to take advantage of read/write methods. This avoids imposing on the user the
  implementation of a module that is too rigid in his/her case (which could have
  performance implications).

  `Tar` offers functions for serializing/deserializing tar-specific elements
  from `string`. It is then up to the user to know how to obtain or write these
  `strings`.

  To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write
  implementations and describing how to extract all entries from a tar file or
  how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`)
  of entries.

  These logics do not depend on a particular "scheduler", and these PRs propose
  a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`.
  These latter derivations mean that the API for these packages has only been
  extended, and there are no breaking changes as such.

  These logics also make it easy to offer a compression/decompression layer with
  `decompress`, so you can easily manipulate and/or create a .tar.gz file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants