Lazy loading / Iterating on parts of file #7
This is only a limitation of the generated Python code. In our last Zoom call, this is what we were talking about that would have to change.

Some file formats have very natural breakpoints where eager reading stops and waits for a user choice about what to read subsequently. For example, it's natural to read a ZIP or HDF5 file up to the point where you get a listing of what's in the file, then wait for the user to choose which subfile or Dataset to actually extract. ROOT files have that stopping point at the TDirectory and TTree metadata, and then you'd want to iterate through data in the TTree, because it is a kind of sequence. Some file formats are only for iteration, without any header/directory structure at all, like CSV or newline-delimited JSON (or newline-delimited anything).

Kaitai is more wide-open about the kinds of files that it supports, so Kaitai itself doesn't have a concept of a directory or other stopping point. Everything is a sequence of data instances, but some of those sequences are headers that you want to read in their entirety while others are big-data payloads that you want to iterate over. Therefore, that information about the stopping point has to be injected somehow. I don't think the Kaitai KSY language has a way to say "this is a point where eager processing ends" (although delimited structures might be part of a solution). For most targets other than the new Awkward one, the Kaitai developers expect users to write their own driver code around the generated classes.

As far as I can see, there are two kinds of stopping points:

1. a header or directory, after which the user chooses which substructure to read next;
2. a large sequence, which the user iterates over in batches.
ZIP is a format that only has stopping point (1); CSV is a format that only has stopping point (2) (because we wouldn't stop reading and return control to the user after only reading the one-line header—there would be no choice for the user to make, anyway). ROOT is a format that has both stopping points: (1) at the TDirectory metadata and (2) when iterating over the TTree.

First question: what kinds of stopping points does EventIO have? If it has a header that requires the user to make a choice about what to read next, like ZIP or ROOT, where is that point? Is there more than one? And how would we design an interface to Awkward-Kaitai to say where it should stop that is not EventIO-specific?

Here's an idea: there's no reason to avoid reading non-list-like data. If there's a header, we can read all of its scalar fields (I'm including records within records in the word "scalar"). Only list-like data can be potentially large. A rule like "Don't eagerly read any list-like fields" is too strict; in a domain-specific setting like EventIO, some lists are known to be small.

I know that EventIO has stopping points of the (2) type, so there would be a second function—actually, a generator/iterator—that yields batches of data from the list that was excluded from the first read. This second function would have to be configured by "How many entries per batch?", which even a domain-specific EventIO reader would expose to the user.

On the Zoom call, I said that I was considering reintroducing virtual/lazy arrays into Awkward Array as a way to express these stopping points, but @agoose77 talked me out of it. (It would be very disruptive to the Awkward infrastructure, and you can get your laziness by having multiple functions in Awkward-Kaitai, as I've described above.)
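As an aside, the kind-(1) stopping point is exactly the pattern the standard library's `zipfile` module exposes: opening the file eagerly reads only the central directory, and member data is read on demand when the user chooses. A self-contained sketch (the file contents here are made up for illustration):

```python
# Stopping point of kind (1): eager reading stops after the directory
# listing, and the user chooses which member to actually extract.
import io
import zipfile

# Build a small in-memory ZIP so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("header.txt", "small metadata")
    zf.writestr("payload.bin", b"\x00" * 1000)

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()           # eager part: just the directory listing
    chosen = zf.read("header.txt")  # lazy part: only the member the user asked for

print(names)   # -> ['header.txt', 'payload.bin']
print(chosen)  # -> b'small metadata'
```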
In what I described above, the two functions, "Load from the beginning of the file, but exclude a given set of nested field paths (that have list type)," and "Iterate over batches of a given nested field path (that has list type)," would be purely functional. After the first function returns data whose type is the whole file but has some list-like fields missing, it deletes all of the C++ class instances it used to produce the Awkward Array, closes the file, and returns just the Awkward Array.
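To make the two-function idea concrete, here is a toy sketch of what that interface could look like. The names `load_excluding` and `iterate_batches` are hypothetical, not an existing Awkward-Kaitai API, and a plain nested dict stands in for the parsed file:

```python
# Hypothetical two-function interface: (1) eagerly load everything except
# the named list-like field paths; (2) iterate over an excluded field in
# user-sized batches. The toy "file" below stands in for parsed data.
TOY_FILE = {
    "header": {"version": 3, "n_events": 5},
    "events": [{"id": i, "adc": [i, i + 1]} for i in range(5)],
}

def load_excluding(data, exclude):
    """Eagerly read everything except the excluded list-like field paths."""
    return {k: v for k, v in data.items() if k not in exclude}

def iterate_batches(data, path, batch_size):
    """Generator yielding the excluded list-like field in batches."""
    seq = data[path]
    for start in range(0, len(seq), batch_size):
        yield seq[start:start + batch_size]

small = load_excluding(TOY_FILE, exclude={"events"})
batches = list(iterate_batches(TOY_FILE, "events", batch_size=2))

print(small["header"]["n_events"])  # -> 5
print([len(b) for b in batches])    # -> [2, 2, 1]
```

Both functions are purely functional in the sense described above: each opens, reads, and releases its resources internally, returning only plain data.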
EventIO, the basic file format, has the "object" as its only unit. Iterating over these objects is the basic interface. It's made a tiny bit more complicated by the fact that some objects can be "containers", i.e. are known to be streams of objects. Since you can only read most compressed streams forward efficiently, we iterate depth first. That's the basic interface I'd expect for a lazy-loading eventio reader.

Then there is the question of specific data formats using eventio; we have two variants: the output of the CORSIKA iact extension storing Cherenkov light on the ground and the output of the telescope simulation. These both consist of some header-like information in multiple eventio objects, followed by a sequence of air shower events, also consisting of multiple objects each. At the end, there is also some footer information, e.g. summary statistics about the simulation. Depending on the configuration of the software, the structure changes a bit (more information can be saved, etc.).

Here, the natural interface is to read the header part when opening the file and offer iteration / lazy loading of the air shower events, providing the footer information once the loop has been exhausted.
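The depth-first object iteration described above can be sketched with a generator. The `Obj` class and the sample object tree are illustrative only, not eventio's actual types:

```python
# Sketch of depth-first iteration over eventio-style objects: every unit is
# an "object", and "container" objects hold a stream of child objects.
# Because compressed streams read efficiently only forward, we descend into
# each container as soon as we meet it.
class Obj:
    def __init__(self, kind, payload=None, children=()):
        self.kind = kind
        self.payload = payload
        self.children = list(children)

def iter_objects(objs):
    """Yield objects depth first, descending into containers as encountered."""
    for obj in objs:
        yield obj
        if obj.children:  # a "container" object
            yield from iter_objects(obj.children)

# Hypothetical file layout: header objects, one event container, a footer.
toplevel = [
    Obj("run_header", payload={"run": 1}),
    Obj("event_container", children=[Obj("photons"), Obj("telescope_data")]),
    Obj("run_end", payload={"n_events": 1}),
]

kinds = [o.kind for o in iter_objects(toplevel)]
print(kinds)
# -> ['run_header', 'event_container', 'photons', 'telescope_data', 'run_end']
```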
Referring to

```python
f = EventioKaitaiParser.from_file(TESTFILE)
```

Checking the code of the generated Python extension, this loads the whole file eagerly in `__init__`. It also seems there is no way to do lazy loading / iterating over `seq` parts of the file. Is this a limitation of Kaitai in general or only of the generated Python code?
Originally posted by @maxnoe in #1 (comment)
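For context, the eager behavior in question follows the general shape of Kaitai-generated Python: `__init__` calls `_read()`, which loops until end-of-stream, so everything is parsed before `from_file` returns. The class below is a simplified stand-in (real generated code subclasses `kaitaistruct.KaitaiStruct` and `from_file` takes a path; here only the stdlib is used, and `from_file` accepts any binary stream for self-containment):

```python
# Simplified mimic of the eager pattern in Kaitai-generated Python code:
# __init__ immediately calls _read(), which runs a repeat-eos loop over the
# whole `seq`, so the entire file is parsed up front.
import io
import struct

class EagerParser:
    def __init__(self, _io):
        self._io = _io
        self._read()  # <- eager: parses everything before returning

    def _read(self):
        self.records = []
        while True:  # the repeat-eos loop of a `seq` attribute
            chunk = self._io.read(4)
            if not chunk:
                break
            self.records.append(struct.unpack("<i", chunk)[0])

    @classmethod
    def from_file(cls, stream):
        # Real generated code opens a filename here; a stream keeps this runnable.
        return cls(stream)

data = io.BytesIO(struct.pack("<3i", 10, 20, 30))
p = EagerParser.from_file(data)
print(p.records)  # -> [10, 20, 30]
```

Lazy loading would mean replacing that `_read` loop with a generator that yields one record at a time, which is exactly the change discussed above.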