-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
revise a decoder and encoder, being pure #140
Conversation
For reviewing purposes and testing purposes, I adjusted the |
This PR proposes a new API, and is ready for review. Tests are passing, missing bits are Tar_gz and the eio code. A thorough review of the functionality deployed in Tar_unix / Tar_lwt_unix would be great. Certainly there are missing bits and pieces (in extract / header_of_file / ..) we can fill in over time. //cc @MisterDA @reynir who worked on the code (including removing Archive), and who may have ideas about users of the modules and whether these can be (easily) converted to this new API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preliminary review. I reviewed Tar
and Tar_unix
. TODO on my part is to review Tar_mirage
and the test changes. I assume the Tar_lwt_unix
changes are similar to Tar_unix
.
|
||
type decode_state = { | ||
global : Header.Extended.t option; | ||
state : [ `Active of bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the meaning of the `Active of bool
constructor not very clear from its name. But the type is not exposed and the scope is fairly limited so that's okay.
| `Real_header extended -> | ||
let* x = | ||
Result.map_error | ||
(fun _ -> `Fatal `Corrupt_pax_header) (* NB better error *) | ||
(Header.unmarshal ~extended data) | ||
in | ||
let pax_payload = Header.Extended.marshal hdr in | ||
let pax = Header.make ~link_indicator link_indicator_name | ||
(Int64.of_int @@ String.length pax_payload) in | ||
write_unextended ?level pax fd >>= function | ||
| Error _ as e -> return e | ||
| Ok () -> | ||
really_write fd pax_payload >>= fun () -> | ||
really_write fd (Header.zero_padding pax) >>= fun () -> | ||
return (Ok ()) | ||
|
||
let write ?level header fd = | ||
( match header.Header.extended with | ||
| None -> return (Ok ()) | ||
| Some e -> | ||
write_extended ?level ~link_indicator:Header.Link.PerFileExtendedHeader e fd ) | ||
>>= function | ||
| Error _ as e -> return e | ||
| Ok () -> write_unextended ?level header fd | ||
|
||
let write_global_extended_header global fd = | ||
write_extended ~link_indicator:Header.Link.GlobalExtendedHeader global fd | ||
end | ||
let t, hdr = construct_header t x in | ||
Ok (t, Some (`Header hdr), None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't check the link indicator here. We may want to error out or do something different if it's (another) PAX extended header or a GNU Long{Link,Name}. I doubt there are archives out there that mix GNU LongLink and PAX extended headers so I'm fine with any behavior in that case. I don't know if you're allowed to have PAX extended headers after each other and what the semantics thereof would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with not making a decision on this in this PR - but maybe we should add a comment then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out we may actually produce such archives due to #142 if using compatibility level GNU
:/
| `Next_longlink x -> | ||
let name = String.sub data 0 (String.length data - 1) in | ||
let next_longlink = if x.Header.link_indicator = Header.Link.LongLink then Some name else t.next_longlink in | ||
let next_longname = if x.Header.link_indicator = Header.Link.LongName then Some name else t.next_longname in | ||
Ok ({ t with next_longlink ; next_longname ; state = `Active false }, | ||
Some (`Skip (Header.compute_zero_padding_length x)), | ||
None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A hack, but we synthesize an Extended.t
from next_longlink
and next_longname
. I think this is fine as is though.
Error (`Fatal e) | ||
|
||
let encode_long level link_indicator payload = | ||
let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
let blank = {Header.file_name = longlink; file_mode = 0; user_id = 0; group_id = 0; mod_time = 0L; file_size = 0L; link_indicator = Header.Link.LongLink; link_name = ""; uname = "root"; gname = "root"; devmajor = 0; devminor = 0; extended = None} in | |
let blank = Header.make ~link_indicator:Header.Link.LongLink ~user_id:0 ~uname:"root" ~group_id:0 ~gname:"root" longlink 0L in |
We can also rely on ~user_id
and ~group_id
defaulting to 0.
I don't have a strong opinion on this.
| Error ((`Checksum_mismatch | `Unmarshal _) as e) -> | ||
Error (`Fatal e) | ||
|
||
let encode_long level link_indicator payload = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Level is always GNU
.
let encode_long level link_indicator payload = | |
let encode_long link_indicator payload = |
(** [extract ~filter ~src dst] extracts the tar archive [src] into the | ||
directory [dst]. If [dst] does not exist, it is created. If [filter] is | ||
provided (defaults to [fun _ -> true]), any file where [filter hdr] returns | ||
[false], is skipped. *) | ||
val extract : | ||
?filter:(Tar.Header.t -> bool) -> | ||
src:string -> string -> | ||
(unit, decode_error) result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to create tar archives where you store headers for the files only and in my experience tar utilities will create the missing directories. For example, it might only contain regular files:
mysterydir/file1
mysterydir/phantom/file2
Extracting this archive will result in the following filesystem structure:
mysterydir/
├── file1
└── phantom
└── file2
Now the question is:
- Should we do this, too?
Also somewhat related but independent: What is the meaning if filter { file_name = "mysterydir/"; _ } = false
and filter { file_name = "mysterydir/file1"; _ } = true
? That is, we filter out the parent directory but keep the child?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should indeed create directories which are needed.
The filter - here you can decide what you like (you get a Header.t) - and match on "no mysterydir" or "no directory entry" or "nothing below mysterydir".
(** [create ~level ~filter ~src dst] creates a tar archive at [dst]. It uses | ||
[src], a directory name, as input. If [filter] is provided | ||
(defaults to [fun _ -> true]), any file where [filter hdr] returns [false] | ||
is skipped. *) | ||
val create : ?level:Tar.Header.compatibility -> | ||
?global:Tar.Header.Extended.t -> | ||
?filter:(Tar.Header.t -> bool) -> | ||
src:string -> string -> | ||
(unit, [ `Msg of string | `Unix of (Unix.error * string * string) ]) result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again there are potential pitfalls. If filter
skips dir/
but keeps dir/file
then we get an archive where only dir/file
is present. In my experience tar tools will create missing dirs, so skipping dir/
will in this case not result in dir/
not being present in a sense, but the permissions and ownership will be something the tar utility decides.
(** [append_file ~level ~header filename fd] appends the contents of [filename] | ||
to the tar archive [fd]. If [header] is not provided, {header_of_file} is | ||
used for constructing a header. *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be explained better. For example, the caller will need to call write_end
once no more files will be appended.
(** [append_file ~level ~header filename fd] appends the contents of [filename] | |
to the tar archive [fd]. If [header] is not provided, {header_of_file} is | |
used for constructing a header. *) | |
(** [append_file ~level ~header filename fd] appends the contents of [filename] | |
to the tar archive [fd]. If [header] is not provided, {header_of_file} is | |
used for constructing a header. It is the responsibility of the caller to write the end of the archive using e.g. [write_end] when no more files will be added. *) |
I'm also curious if people might think you can just open an existing archive and call append_file
on the newly opened file descriptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can. Unix.openfile with O_APPEND, lseek to SEEK_END - 2 * 512, and then you can append :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right :) But is that clear from the documentation? Or could a user not very familiar with the tar format think they can just do let fd = Unix.openfile "backup.tar" [Unix.O_APPEND] 0o777 in append_file "hello.txt" fd; Unix.close fd
and they're done? What I'm getting at is this maybe requires some knowledge about tar and otherwise you may think the function does more magic than it really does (like seeking at the end or finishing off with two zero blocks).
I can make a change to the documentation so it is more clear what it does and does not.
let* dst = | ||
Result.map_error unix_err_to_msg | ||
(safe Unix.(openfile (Filename.concat dst hdr.Tar.Header.file_name) | ||
[ O_WRONLY ; O_CREAT ]) hdr.Tar.Header.file_mode) | ||
in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not super safe as the archive can contain file names such as ../../../../etc/passwd
or /etc/passwd
. Maybe it's better to leave it up to the user to write their own function using fold
where they can decide how to handle paths, ownership and mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy to say "extract into /my/tmp/dir", and nothing leaves that directory... thus if you have "/foo/bar" in your archive, the leading / is removed. ../ paths are ignored? I don't know for sure..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is preferable. Maybe a unsafe_extract
could exist, but I think Tar_unix.extract
should not be a footgun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsafe_extract
is something people can write their own with fold
;)
try | ||
let* passwd_entry = safe Unix.getpwuid stat.Unix.LargeFile.st_uid in | ||
Ok passwd_entry.Unix.pw_name | ||
with Not_found -> Error (`Msg ("No user entry found for UID")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW GNU tar seems to use the empty string for unknown uids.
Overall I'm happy with the For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of the new API where some details should be know by user to use rightly Tar
(and these details are about the tar
format). Specifically about Tar_gz
API, decompress‧gz
is a streaming API (you can fill the decoder
with some bytes, only an empty string informs the end of the stream). By extension, an usage of decompress‧gz
should lead to a streaming API too but it's actually not possible with the current Tar
API. It depends on what we want to expose of course.
else | ||
x | ||
|
||
type decode_state = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of decode_state
, I prefer decoder
.
Some (`Skip (Header.compute_zero_padding_length x)), | ||
None) | ||
| `Active read_zero -> | ||
match Header.unmarshal ?extended:t.global data with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That mostly means that at the beginning, the user must provides a 512 bytes string
at first (according to Header‧unmarshal
), otherwise the user gets an error. Should we precise that the user must read a tar
file per 512 bytes? Should we provide a streaming API?
CHANGES: - Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145) - Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141) - **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146) These PRs attempt to de-functorize `Tar` so that users can implement I/O themselves, using `Tar`'s own element serialization/deserialization functions to take advantage of read/write methods. This avoids imposing on the user the implementation of a module that is too rigid in his/her case (which could have performance implications). `Tar` offers functions for serializing/deserializing tar-specific elements from `string`. It is then up to the user to know how to obtain or write these `strings`. To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write implementations and describing how to extract all entries from a tar file or how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`) of entries. These logics do not depend on a particular "scheduler", and these PRs propose a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`. These latter derivations mean that the API for these packages has only been extended, and there are no breaking changes as such. These logics also make it easy to offer a compression/decompression layer with `decompress`, so you can easily manipulate and/or create a .tar.gz file.
CHANGES: - Fix `Header.marshal` and the checksum and the length (@reynir, mirage/ocaml-tar#145) - Delete a mutable field about the level into the header (@hannesm, mirage/ocaml-tar#141) - **BREAKING**: de-functorize the package (@hannesm, @reynir, @dinosaure, mirage/ocaml-tar#140, mirage/ocaml-tar#143, mirage/ocaml-tar#146) These PRs attempt to de-functorize `Tar` so that users can implement I/O themselves, using `Tar`'s own element serialization/deserialization functions to take advantage of read/write methods. This avoids imposing on the user the implementation of a module that is too rigid in his/her case (which could have performance implications). `Tar` offers functions for serializing/deserializing tar-specific elements from `string`. It is then up to the user to know how to obtain or write these `strings`. To this, these PRs add "logics" (see `'a Tar.t`) requiring read and/or write implementations and describing how to extract all entries from a tar file or how to write a tar file according to a "dispenser" (like `Seq.to_dispenser`) of entries. These logics do not depend on a particular "scheduler", and these PRs propose a derivation of these logics with `tar-unix`, `tar-eio` and `tar-mirage`. These latter derivations mean that the API for these packages has only been extended, and there are no breaking changes as such. These logics also make it easy to offer a compression/decompression layer with `decompress`, so you can easily manipulate and/or create a .tar.gz file.
this is on the path to remove the functors from the main tar library.
while working on this, I encountered some issues in the reader already:
happy to hear your feedback. if this PR is fine with you, I'm keen to use the decoder & encoder in the effectful layers (and remove the headerreader/headerwriter/other functors). WDYT?