User guide documentation update #5
base: master
Changes from 5 commits: 20a399b, e8e3fe2, e5a6375, 78f6e58, 5f258c8, 25e86c7
@@ -380,6 +380,21 @@ enums:
    17: udp
----

Alternatively, hexadecimal notation can also be used to define an enumeration:

[source,yaml]
----
seq:
  - id: key
    type: u4
    enum: keys
enums:
  keys:
    0x77696474: width  # "widt"
    0x68656967: height # "heig"
    0x64657074: depth  # "dept"
----

There are two things that should be done to declare an enum:

1. We add an `enums` key on the type level (i.e. on the same level as
@@ -472,7 +487,25 @@ structure:

[source,yaml]
----
seq:
  - id: header
    type: file_header
  - id: metadata
    type: metadata_section
types:
  file_header:
    seq:
      - id: version
        type: u2
  metadata_section:
    seq:
      - id: author
        type: strz
        encoding: UTF-8
      - id: publisher
        type: strz
        encoding: UTF-8
        if: _parent.header.version >= 2
----

==== `_root`
@@ -799,6 +832,39 @@ other value which was not listed explicitly.
    _: rec_type_unknown
----

If an enumeration has already been defined, you can use references to
items in the enumeration instead of specifying integers a second time:

Review comment: Actually, if you defined …

Reply: Hmm, good point, I'll update the text accordingly.

[source,yaml]
----
seq:
  - id: key
    type: u4
    enum: keys
  - id: data
    type:
      switch-on: key
      cases:
        keys::width: data_field_width
        keys::height: data_field_height
        keys::depth: data_field_depth
types:
  data_field_width:
    seq:
      # ...
  data_field_height:
    seq:
      # ...
  data_field_depth:
    seq:
      # ...
enums:
  keys:
    0x77696474: width  # "widt"
    0x68656967: height # "heig"
    0x64657074: depth  # "dept"
----

Review comment: Pedantic person in me cries for that misaligned `types`. I'd collapse the placeholder types to

  data_field_width: # ...
  data_field_height: # ...
  data_field_depth: # ...

for brevity.

Reply: Thanks, agreed.

=== Instances: data beyond the sequence

So far we've done all the data specifications in `seq` - thus they'll
@@ -1024,7 +1090,117 @@ bytes sparsely.

=== Streams and substreams

==== Introduction and simple example

A stream is a flow of data from an input file (or an in-memory byte
array) into a parser which is generated by a KS script. The parser
normally requests data from the stream sequentially, one or more bytes
at a time, but it can also seek to an arbitrary position and re-read
the same data as many times as needed; that is exactly how positional
parse instances work. A stream knows the total amount of data available
to be requested by the parser and the current read position within that
data.

Review comment: This explanation is pretty abstract and somewhat misleading. "Stream" can be re-read as many times as needed, and it can be seeked: that's exactly how positional parse instances work, they use …

Reply: I'll think of another way to explain streams then, especially with reference to how …
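
For example, a positional parse instance can seek back and re-read bytes that `seq` parsing has already consumed. A minimal sketch (the field names are hypothetical):

[source,yaml]
----
seq:
  - id: magic
    size: 4
instances:
  magic_again:
    pos: 0    # seek back to the start of the current stream
    size: 4   # re-read the same 4 bytes that `magic` consumed
----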

When an input is first opened for parsing by a KS-generated parser, a
root stream is created. This root stream can be accessed via
`_root._io` at any time and in any place: `_root` returns the top-level
object defined in a script, and `_io` returns the stream associated
with an object. The total amount of data available to be requested from
the root stream is the size of the input being parsed, whether that
input is a file or an in-memory byte array (the size is queried on
demand rather than recorded once at open time). Initially, the read
position of the root stream is 0: no data has been requested by the
parser yet.

Review comment: Streams can be used on in-memory byte arrays too, not necessarily files (which have file sizes). And, actually, stream does not "know" full file size, but it can query it on demand. File size can change if file is modified when KS parsing is in progress, so it's actually ok to have …

Reply: That's a great point, probably one worth adding to the pitfalls section (or troubleshooting or similar) for the few people who may encounter the issue and not understand what is going on.
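
For instance, a parse instance can be pointed at the root stream explicitly through its `io` key; the sketch below (with a hypothetical instance name) reads 4 bytes at an absolute position in the whole input, regardless of which substream the enclosing type is parsed from:

[source,yaml]
----
instances:
  first_four_bytes:
    io: _root._io   # use the root stream, not the current one
    pos: 0          # absolute position within the whole input
    size: 4
----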

Below is an example script which is used to generate a parser, which in
turn parses an input file. Assume that this example input file simply
consists of a 32-bit unsigned integer with the value 1000, followed by
1000 bytes of payload data; the input file thus has a total size of
1004 bytes.

[source,yaml]
----
meta:
  id: example_file
seq:
  - id: header
    type: file_header
  - id: body
    type: file_body
    size: header.body_size
types:
  file_header:
    seq:
      - id: body_size
        type: u4
  file_body:
    seq:
      - id: payload
        size-eos: true
----

The parser generated by the script will first request 4 bytes of data
from the root stream to copy into the object `header.body_size`. After
the stream has returned those 4 bytes to the parser, the stream has
provided 4 out of the 1004 bytes available, so the parser can now
request at most 1000 bytes of additional data from the stream.

The definition of the `body` object in the example script specifies the
size of the `body` object to be the already-parsed value of
`header.body_size`. Specifying a size has an interesting effect on the
KS-generated parser: a new substream is created specifically to parse
the `body` object.

Similar to how the root stream operates, the new substream initially
knows the maximum amount of data available to be requested and the
amount of data already returned. In this example, the substream upon
creation has a maximum of 1000 bytes of data which can be requested by
the parser, and a read position of 0 bytes.

The parser will then repeatedly request data from the new substream to
copy into the object `file_body.payload`. As the substream receives
requests for more data, it passes those requests on to the root stream.
Unlike the root stream, a substream can only request data from the root
stream or from another substream; substreams never read from an input
file directly.

Because `size-eos: true` is specified for the `file_body.payload`
object, the parser will keep requesting data from the substream until
the amount of data provided by the substream reaches 1000 bytes (the
maximum the substream is able to provide). Once all 1000 bytes have
been copied from the input file, via the root stream and then via the
substream, into the `file_body.payload` object, the internal state of
the two streams would be:

* root stream: maximum bytes of data available remains 1004, read
position is 1004 bytes
* substream: maximum bytes of data available remains 1000, read
position is 1000 bytes

Alternatively, if `header.body_size` happens to be a value larger than
the amount of data remaining in the input, the root stream would be
unable to fulfill the request, and the KS-generated parser would
abruptly raise an exception for trying to read non-existent data beyond
the end of the input.

`_io` can be used to access the stream associated with an object. The
object can be referred to by identifier, or alternatively via `_root`
and `_parent`. Once a stream has been obtained through `_io`, several
properties expose the internal state of the stream (see the sketch
after this list):

* `size` returns the maximum amount of data which is available to be
requested from the stream
* `pos` returns the amount of data which has already been requested
from the stream
* `eof` returns a boolean value: `false` when `pos != size` and `true`
when `pos == size` (i.e. has the maximum amount of data available via
the stream already been requested?)
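
A minimal sketch of these properties in use (the field and instance names are hypothetical):

[source,yaml]
----
seq:
  - id: trailer
    size-eos: true
    if: not _io.eof            # parse a trailer only if data remains
instances:
  bytes_remaining:
    value: _io.size - _io.pos  # data not yet requested from this stream
  is_exhausted:
    value: _io.eof             # equivalent to _io.pos == _io.size
----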

Substreams can be nested many layers deep by specifying a `size` for
each object in the nested tree, as the sketch below illustrates.
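
A minimal sketch of two nested substreams (the type and field names are hypothetical):

[source,yaml]
----
seq:
  - id: outer
    type: outer_type
    size: 100            # creates a 100-byte substream
types:
  outer_type:
    seq:
      - id: inner
        type: inner_type
        size: 50         # creates a 50-byte substream inside the outer one
  inner_type:
    seq:
      - id: payload
        size-eos: true   # reads to the end of the innermost substream
----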

Related expressions which are useful when working with streams include
(see the sketch below):

* `repeat: eos`
* `size-eos: true`
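
For example, `repeat: eos` parses one element after another until the current stream (or substream) is exhausted; a minimal sketch with a hypothetical `record` type:

[source,yaml]
----
seq:
  - id: records
    type: record
    repeat: eos    # parse records until the end of the current stream
types:
  record:
    seq:
      - id: len
        type: u1
      - id: body
        size: len
----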

=== Processing: dealing with compressed, obfuscated and encrypted data
@@ -1903,7 +2079,38 @@ beginner Kaitai Struct users.

=== Specifying size creates a substream

In the following example script, an erroneous attempt is made to parse
an input file with a size of 2000 bytes:

[source,yaml]
----
seq:
  - id: body
    type: some_body_type
    size: 1000
types:
  some_body_type:
    seq:
      - id: payload
        size: 999
      - id: overflow
        size: 2
----

The parser can successfully copy the required 999 bytes into
`body.payload`, as the `body` substream has 1000 bytes available to be
requested and the root stream has 2000 bytes available.

An exception occurs upon attempting to copy data from the `body`
substream into the `overflow` object. After data has been copied from
the `body` substream into the `payload` object, the `body` substream
only has 1 byte of data still available for the parser to request.
When 2 bytes of data are requested, the `body` substream runs out of
available data and an exception is raised. The fact that the root
stream still has 1001 bytes available from the input file does not
matter, as the `body` substream never has the opportunity to request
more than the first 1000 bytes of the input file.

Review comment: This is actually not a pitfall, but a legitimate behavior, and well explained in the previous section. The "pitfall" I was thinking about in this section is the following: when a new substream is created, all parse instances with positions act within that substream by default. So, this one works as expected:

  seq:
    - id: skipped
      size: 1000
    - id: indexing
      type: file_index_entry
      # but adding "size: 24" here will ruin the "file_body" instance,
      # although it looks legitimate at first glance
  types:
    file_index_entry:
      seq:
        - id: file_name
          type: str
          size: 16
        - id: file_pos
          type: u4
        - id: file_len
          type: u4
      instances:
        file_body:
          pos: file_pos
          size: file_len

To overcome that, one needs to use something like …

Reply: Excellent. I didn't know about …
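
One way to overcome it is the `io` key of a parse instance, which redirects the instance to another stream; a sketch of that fix (the truncated suggestion above may have named a different mechanism):

[source,yaml]
----
instances:
  file_body:
    io: _root._io   # positions are resolved in the root stream,
    pos: file_pos   # not in the enclosing substream
    size: file_len
----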

=== Applying `process` without a size

Review comment: Totally OK, but I'd also note that this is a service provided by YAML, not something specific to KS.

Reply: I was thinking that a new section of the document could be created for general syntax and a very brief overview of YAML and what it provides. This example I provided may be better suited there.

Review comment: Some Construct features are Python features, but I would advertise them just the same. Purpose of documentation is to show capabilities, not attribution. =) Just saying.