On *nix file systems, any fileobject whose <nlink> tag has some value n > 1 has its information duplicated n-1 more times in the DFXML output, once per additional hard link. Here's an example I took from a DFXML file of an ext4 filesystem.
In the remainder of the DFXML file, there are 5 other fileobject entries with all the same metadata tags (<mtime>, <ctime>, <byte_runs>, etc.), effectively using ~5 times more space to represent this file than is necessary, since everything except the <filename> and <parent_object> tags is (or at least should be) exactly the same.
I also feel like this is a less than optimal approach, given that tools that ingest DFXML must (1) process all six entries, (2) determine for themselves that these are duplicates, and (3) handle the various filenames assigned to the inode, when really only the last step should be necessary, and only if the tool cares about the names of such files. It seems to me that, since the inode value must be unique within the bounds of the filesystem, the fileobject entries should reflect this in DFXML by having only one fileobject per inode.
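To make the burden on consuming tools concrete, here is a minimal sketch of the de-duplication step a tool currently has to do for itself: collect every filename that shares an inode number. The sample XML is a simplified, hypothetical stand-in (no namespace, only `<filename>`, `<inode>`, and `<nlink>` children), and `names_by_inode` is an illustrative helper, not part of any DFXML library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified DFXML-like input: inode 12 has two hard links.
SAMPLE = """\
<dfxml>
  <fileobject><filename>a/one</filename><inode>12</inode><nlink>2</nlink></fileobject>
  <fileobject><filename>b/two</filename><inode>12</inode><nlink>2</nlink></fileobject>
  <fileobject><filename>c/three</filename><inode>13</inode><nlink>1</nlink></fileobject>
</dfxml>
"""

def names_by_inode(xml_text):
    """Map each inode number to the list of filenames that reference it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for fo in root.iter("fileobject"):
        inode = int(fo.findtext("inode"))
        groups[inode].append(fo.findtext("filename"))
    return dict(groups)

grouped = names_by_inode(SAMPLE)
```

This assumes a single file system; with several partitions in one file, the grouping key would also need to identify the partition, since inode numbers are only unique within a file system.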
I'd like to see a change with respect to this, although I know it will likely break tools that process DFXML according to the current schema. It seems to me it would be beneficial to introduce a new element that has an unbounded number of child tags with information about each of the filenames associated with the inode. Below is a preliminary example of what this might look like:
I know that this makes fileobject entries more complicated for filesystems like FAT, but overall I feel like this is a better representation of the data as it is stored on the disk. I'd like to hear others' thoughts on this and if there is a better way of solving this issue than what I've thought of.
Thanks for this issue, Mike. I think there's a harder representation problem underlying this issue, and Issue 12 (and a bit of 14). A file is a point in a coordinate space that has at least two dimensions: inode and directory entry. Inodes and dents are independently discoverable, and encountering either one in a deleted state can give you strange results on pointer dereferencing.
I think the most theoretically-satisfying option for your concern would be to have three object streams for a file system:
<inodeobject> with all derivable information from the inode, and some sort of unique-to-dfxml-file ID. (This identifier could be something like a first byte run, as in issue 5.)
<direntobject>, ditto for directory entries.
<fileobject> containing only references to the substantiating inodeobject and direntobject.
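As a rough sketch of what the three streams could look like in memory, the following uses plain dataclasses. All class names, field names, and the `link_files` helper are hypothetical and only illustrate the idea; the pairing shown is the naive one, joining every directory entry to every inode record with the same inode number.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InodeObject:
    id: str          # unique-to-DFXML-file identifier (assumed format)
    inode: int
    nlink: int
    allocated: bool = True

@dataclass(frozen=True)
class DirentObject:
    id: str
    filename: str
    inode: int       # inode number the directory entry points to
    allocated: bool = True

@dataclass(frozen=True)
class FileObject:
    inode_ref: str   # refers to InodeObject.id
    dirent_ref: str  # refers to DirentObject.id

def link_files(inodes, dirents):
    """Pair each dirent with every inode record carrying its inode number."""
    by_number = {}
    for ino in inodes:
        by_number.setdefault(ino.inode, []).append(ino)
    files = []
    for dent in dirents:
        for ino in by_number.get(dent.inode, []):
            files.append(FileObject(ino.id, dent.id))
    return files

inodes = [InodeObject("i1", 12, 2)]
dirents = [DirentObject("d1", "a/one", 12), DirentObject("d2", "b/two", 12)]
files = link_files(inodes, dirents)
```

With this layout the inode metadata is stored once, and each hard link costs only one small directory-entry record plus one reference pair.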
idifference and make_differential_dfxml already follow this mentality for object matching and rename detection, but don't do anything like hewing fileobjects into separate objects.
I can't decide if this three-stream approach has a direct code smell to it, but I feel like it's got potential to induce code smell to work with it. Your approach is cleaner than a big Cartesian pairing, but unfortunately wouldn't handle one ugly state in deleted-file analysis:
Find an allocated inode with number N.
Somehow carve out from slack space a deleted inode that also has number N.
Find a deleted directory entry that points to inode number N. Which inode-centered fileobject gets this?
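The three steps above can be sketched as a small lookup that shows where the ambiguity comes from. The record layout and the `candidates` helper are hypothetical; the point is only that an inode number stops being a unique key once deleted or carved inodes are in play.

```python
# Hypothetical records: an allocated inode 7 and a carved, deleted inode 7.
inodes = [
    {"id": "i-alloc", "inode": 7, "allocated": True},
    {"id": "i-carved", "inode": 7, "allocated": False},
]
# A deleted directory entry that points at inode number 7.
deleted_dirent = {"filename": "old.txt", "inode": 7, "allocated": False}

def candidates(dirent, inode_records):
    """Return the IDs of every inode record the dirent could refer to."""
    return [rec["id"] for rec in inode_records if rec["inode"] == dirent["inode"]]

matches = candidates(deleted_dirent, inodes)
ambiguous = len(matches) > 1  # more than one inode-centered owner is possible
```

Here both inode records are plausible owners of the directory entry, so a strictly inode-centered fileobject has no single obvious place to record it.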
Maybe the right balance is a stream of fileobjects like you proposed, but anything deleted only gets its own inodeobject/direntobject.
So this isn't just a theory ramble, I think there is at least one reduction that can be taken in DFXML-generating tools: The fileobjects created for "." and ".." are nearly always going to be redundant, and probably shouldn't be recorded. (I think the walk-root directory would be the one exception warranting a "." entry.) Offhand and this late in the day, I don't know of file systems that make a record of "." and ".." in their on-disk directory structures; if there aren't any, then recording these in the XML would only illustrate something screwy going on with the kernel's (or tool's) directory hierarchy reporting.
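A generating tool could apply that reduction with a filter along these lines. This is a sketch under assumptions: filenames are taken to be POSIX-style paths relative to the walk root, the walk root's own entry is assumed to appear as a bare ".", and `is_redundant_dot_entry` is an illustrative name.

```python
import posixpath

def is_redundant_dot_entry(filename):
    """True for '.' and '..' entries, except the walk root's own '.'."""
    if filename == ".":
        return False  # keep the walk-root directory entry
    return posixpath.basename(filename) in (".", "..")

names = [".", "..", "a", "a/.", "a/..", "a/b.txt"]
kept = [n for n in names if not is_redundant_dot_entry(n)]
```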