Reducing duplicate data by condensing hard-linked files #27

mmabey · 2017-07-05T21:50:00Z

On *nix systems, any fileobject whose <nlink> tag has a value n > 1 has its information duplicated n-1 times in the DFXML file. Here's an example I took from a DFXML file of an ext4 filesystem.

<fileobject>
  <parent_object>
    <inode>2</inode>
  </parent_object>
  <filename>.</filename>
  <partition>1</partition>
  <id>1</id>
  <name_type>d</name_type>
  <!--   Snip   -->
  <inode>2</inode>
  <meta_type>2</meta_type>
  <mode>493</mode>
  <nlink>6</nlink>
  <uid>0</uid>
  <gid>0</gid>
  <!--   Snip   -->
</fileobject>

In the remainder of the DFXML file, there are 5 other fileobject entries with all the same metadata tags (<mtime>, <ctime>, <byte_runs>, etc.), effectively using ~5 times more space to represent this file than is necessary, since all but the <filename> and <parent_object> tags are (or at least should be) exactly the same.

I also feel this is a less than optimal approach, given that tools that ingest DFXML must (1) process all six entries, (2) determine for themselves that these are duplicates, and (3) handle the various filenames assigned to the inode, when really only the last step should be necessary, and only if the tool cares about the names of such files. Since the inode value must be unique within the bounds of the filesystem, it seems to me that DFXML should reflect this by having only one fileobject per inode.
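As a rough illustration of step (2), here is a minimal Python sketch of the check a consuming tool currently has to do for itself: group fileobject entries by partition and inode, then confirm that everything except the per-name tags really matches. The function names and the sample are hypothetical, the sample is un-namespaced for brevity (real DFXML uses an XML namespace), and treating <id> as a per-entry tag is an assumption based on the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tags expected to differ between entries for the same inode. <filename> and
# <parent_object> come from the issue text; <id> is an assumption, since it
# looks like a per-entry counter in the example.
PER_ENTRY_TAGS = {"filename", "parent_object", "id"}


def shared_metadata(fileobject):
    """Collect the text of direct children that should match across links.

    Only element text is compared, so nested elements such as <byte_runs>
    would need a deeper comparison than this sketch performs.
    """
    return {
        child.tag: (child.text or "").strip()
        for child in fileobject
        if child.tag not in PER_ENTRY_TAGS
    }


def inconsistent_inodes(root):
    """Return the (partition, inode) keys whose entries disagree."""
    seen = defaultdict(list)
    for fo in root.iter("fileobject"):
        key = (fo.findtext("partition"), fo.findtext("inode"))
        seen[key].append(shared_metadata(fo))
    return [
        key for key, metas in seen.items()
        if any(meta != metas[0] for meta in metas[1:])
    ]


# Hypothetical two-entry sample; real DFXML is namespaced and far larger.
SAMPLE = """<dfxml>
  <fileobject>
    <parent_object><inode>2</inode></parent_object>
    <filename>.</filename><partition>1</partition><id>1</id>
    <inode>2</inode><nlink>6</nlink><uid>0</uid>
  </fileobject>
  <fileobject>
    <parent_object><inode>11</inode></parent_object>
    <filename>lost+found/..</filename><partition>1</partition><id>7</id>
    <inode>2</inode><nlink>6</nlink><uid>0</uid>
  </fileobject>
</dfxml>"""

print(inconsistent_inodes(ET.fromstring(SAMPLE)))
```

An empty result means every group of same-inode entries agrees on its shared metadata; any key that does show up would be worth inspecting before trusting the entries as plain duplicates.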

I'd like to see this changed, although I know it will likely break tools that process DFXML according to the current schema. It seems to me it would be beneficial to introduce a new element with an unbounded number of child elements, each holding information about one of the filenames associated with the inode. Below is a preliminary example of what this might look like:

<fileobject>
  <inode>2</inode>
  <filenames>
    <fs_entry>
      <filename>.</filename>
      <parent_inode>2</parent_inode>
    </fs_entry>
    <fs_entry>
      <filename>..</filename>
      <parent_inode>2</parent_inode>
    </fs_entry>
    <fs_entry>
      <filename>lost+found/..</filename>
      <parent_inode>11</parent_inode>
    </fs_entry>
    <!--   Snip   -->
  </filenames>
  <!--   Snip   -->
</fileobject>
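To show how existing output might be mapped onto this layout, here is a minimal Python sketch that folds per-name fileobject entries into one fileobject per (partition, inode). The <filenames>, <fs_entry> and <parent_inode> names come from the example above; the condense function, the sample data, and the choice to drop <id> are assumptions for illustration, and the sample is un-namespaced for brevity.

```python
import copy
import xml.etree.ElementTree as ET

# Tags that vary per name and so are not copied into the merged entry.
# Dropping <id> is an assumption; the proposal does not say what happens to it.
PER_ENTRY_TAGS = {"filename", "parent_object", "id"}


def condense(root):
    """Return a new tree with one <fileobject> per (partition, inode)."""
    out = ET.Element(root.tag)
    filenames_for = {}  # (partition, inode) -> its <filenames> element
    for fo in root.iter("fileobject"):
        key = (fo.findtext("partition"), fo.findtext("inode"))
        if key not in filenames_for:
            merged = ET.SubElement(out, "fileobject")
            # Shared metadata is taken from the first entry seen for the inode.
            for child in fo:
                if child.tag not in PER_ENTRY_TAGS:
                    merged.append(copy.deepcopy(child))
            # Unlike the example above, <filenames> ends up last, not second.
            filenames_for[key] = ET.SubElement(merged, "filenames")
        entry = ET.SubElement(filenames_for[key], "fs_entry")
        ET.SubElement(entry, "filename").text = fo.findtext("filename")
        ET.SubElement(entry, "parent_inode").text = fo.findtext("parent_object/inode")
    return out


# Hypothetical sample: three names for inode 2, one name for inode 11.
SAMPLE = """<dfxml>
  <fileobject><parent_object><inode>2</inode></parent_object>
    <filename>.</filename><partition>1</partition><inode>2</inode></fileobject>
  <fileobject><parent_object><inode>2</inode></parent_object>
    <filename>..</filename><partition>1</partition><inode>2</inode></fileobject>
  <fileobject><parent_object><inode>2</inode></parent_object>
    <filename>lost+found</filename><partition>1</partition><inode>11</inode></fileobject>
  <fileobject><parent_object><inode>11</inode></parent_object>
    <filename>lost+found/..</filename><partition>1</partition><inode>2</inode></fileobject>
</dfxml>"""

condensed = condense(ET.fromstring(SAMPLE))
print(ET.tostring(condensed, encoding="unicode"))
```

Entries with no <inode> at all would collapse into a single group under this keying, so a real converter would need to pass those through untouched.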

I know that this makes fileobject entries more complicated for filesystems like FAT, but overall I feel like this is a better representation of the data as it is stored on the disk. I'd like to hear others' thoughts on this, and whether there is a better way of solving this issue than what I've thought of.


ajnelson-nist · 2017-07-05T23:07:31Z

Thanks for this issue, Mike. I think there's a harder representation problem underlying this issue and Issue 12 (and a bit of 14). A file is a point in a coordinate space that has at least two dimensions: inode and directory entry. Inodes and directory entries (dents) are independently discoverable, and encountering either one when deleted can give you strange results on pointer dereferencing.

I think the most theoretically-satisfying option for your concern would be to have three object streams for a file system:

<inodeobject> with all derivable information from the inode, and some sort of unique-to-dfxml-file ID. (This identifier could be something like a first byte run, as in issue 5.)
<direntobject>, ditto for directory entries.
<fileobject> containing only references to the substantiating inodeobject and direntobject.
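To make the three streams concrete, here is a minimal Python sketch of them as plain data records. All class and field names are hypothetical, and the pairing rule (match a directory entry to every inode record with the same number) is an assumption; the comment does not specify how references would be resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InodeObject:
    obj_id: str       # unique within the DFXML file (hypothetical field)
    inode: int        # inode number as stored on disk
    allocated: bool


@dataclass(frozen=True)
class DirentObject:
    obj_id: str
    filename: str
    target_inode: int  # inode number the directory entry points to
    allocated: bool


@dataclass(frozen=True)
class FileObject:
    # Only references, as in the three-stream idea; either side may be missing.
    inode_ref: Optional[str]
    dirent_ref: Optional[str]


def pair_streams(inodes, dirents):
    """Naively pair dirents with inodes by number (an assumed rule)."""
    by_number = {}
    for ino in inodes:
        by_number.setdefault(ino.inode, []).append(ino)

    files, referenced = [], set()
    for dent in dirents:
        matches = by_number.get(dent.target_inode, [])
        if not matches:
            # Dirent with no recoverable inode: keep it, unpaired.
            files.append(FileObject(None, dent.obj_id))
        for ino in matches:
            files.append(FileObject(ino.obj_id, dent.obj_id))
            referenced.add(ino.obj_id)
    # Inodes that no dirent points to still get a fileobject.
    for ino in inodes:
        if ino.obj_id not in referenced:
            files.append(FileObject(ino.obj_id, None))
    return files


inodes = [InodeObject("i1", 2, True), InodeObject("i2", 11, True),
          InodeObject("i3", 12, True)]
dirents = [DirentObject("d1", ".", 2, True),
           DirentObject("d2", "lost+found", 11, True),
           DirentObject("d3", "gone.txt", 99, False)]
for f in pair_streams(inodes, dirents):
    print(f)
```

Because the fileobject records hold only references, each inode's metadata is stored once no matter how many names point at it, which is the duplication the issue is about.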

idifference and make_differential_dfxml already follow this mentality for object matching and rename detection, but don't do anything like hewing fileobjects into separate objects.

I can't decide whether this three-stream approach has a code smell itself, but I suspect it could induce code smell in the tools that have to work with it. Your approach is cleaner than a big Cartesian pairing, but unfortunately wouldn't handle one ugly state in deleted-file analysis:

Find an allocated inode with number N.
Somehow carve out from slack space a deleted inode that also has number N.
Find a deleted directory entry that points to inode number N. Which inode-centered fileobject gets this?

Maybe the right balance is a stream of fileobjects like you proposed, but anything deleted only gets its own inodeobject/direntobject.
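One possible reading of that balance, sketched in Python under stated assumptions: allocated inodes become condensed, inode-keyed entries that collect their allocated names, while anything deleted is emitted on its own with no attempt to resolve it. The record types and the split_streams function are hypothetical, and the sample reproduces the ugly state from the three steps above.

```python
from typing import NamedTuple


class Inode(NamedTuple):
    number: int
    allocated: bool


class Dirent(NamedTuple):
    filename: str
    target_inode: int
    allocated: bool


def split_streams(inodes, dirents):
    """Condense allocated records; leave deleted ones standalone."""
    # One condensed entry per allocated inode number, holding its names.
    condensed = {ino.number: [] for ino in inodes if ino.allocated}
    loose_inodes = [ino for ino in inodes if not ino.allocated]
    loose_dirents = []
    for dent in dirents:
        if dent.allocated and dent.target_inode in condensed:
            condensed[dent.target_inode].append(dent.filename)
        else:
            # Deleted (or dangling) dirents are never attached to an inode,
            # which sidesteps the question of which inode N they belong to.
            loose_dirents.append(dent)
    return condensed, loose_inodes, loose_dirents


# The ugly state: an allocated inode 12, a carved deleted inode 12,
# and a deleted directory entry pointing at inode 12.
inodes = [Inode(12, True), Inode(12, False)]
dirents = [Dirent("report.txt", 12, True), Dirent("old-report.txt", 12, False)]

condensed, loose_inodes, loose_dirents = split_streams(inodes, dirents)
print(condensed)
print(loose_inodes)
print(loose_dirents)
```

The deleted directory entry stays unattached here, so the output records what was found without asserting which inode number 12 it once referred to.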

So that this isn't just a theory ramble, I think there is at least one reduction that can be made in DFXML-generating tools: the fileobjects created for "." and ".." are nearly always going to be redundant, and probably shouldn't be recorded. (I think the walk-root directory would be the one exception warranting a "." entry.) Offhand and this late in the day, I don't know of file systems that make a record of "." and ".." in their on-disk directory structures; if there aren't any, then recording these in the XML would only illustrate something screwy going on with the kernel's (or tool's) directory hierarchy reporting.
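The reduction described above could look something like this minimal Python sketch. It assumes filenames use "/" separators as in the "lost+found/.." example, and that the walk root is recorded as "."; the function name and the walk_root parameter are hypothetical.

```python
import posixpath


def is_redundant_dot_entry(filename, walk_root="."):
    """True for "." and ".." entries, except the walk root's own entry."""
    if filename == walk_root:
        return False
    # basename() keeps only the final path component, e.g. ".." for "a/..".
    return posixpath.basename(filename) in (".", "..")


names = [".", "..", "lost+found", "lost+found/.", "lost+found/..", "a.txt"]
kept = [name for name in names if not is_redundant_dot_entry(name)]
print(kept)
```

A generating tool could apply this test before writing each fileobject, or a consumer could apply it while reading a file that already contains the extra entries.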

Feedback is welcome.
