Differences PRONOM reports and DROID signature files #233
-
Those missing EOF ranges are fascinating, and I would recommend reporting them as a DROID issue on their repo. For example, fmt/41 itself has two signatures (InternalSignatureIDs 69 and 697): 69 has the correct EOF range, so why on earth doesn't 697? Very weird.
Regarding priorities, 'priority over' is supposed to ensure that where multiple identification patterns match a given file, one takes precedence. This may or may not involve a sub/supertype relationship. Equally, a sub/supertype relationship can feasibly have two non-overlapping signature patterns and therefore not require a priority relationship. For example, imagine a supertype format with a magic pattern at BOF offset zero, and a subtype where a fixed-length but variable string is prepended to what would otherwise be the same header and magic as the supertype; in this case the priority wouldn't technically be needed. On this basis I'd suggest that inferring a priority relationship is unnecessary, although if there are sub/supertypes where a priority relationship that should be present is missing, that again is something to feed back upstream.
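To make the non-overlapping case concrete, here is a minimal sketch. The magic bytes, the 4-byte prefix length, and both function names are invented for illustration and are not taken from any PRONOM entry.

```python
MAGIC = b"MAGC"   # hypothetical magic shared by both formats
PREFIX_LEN = 4    # hypothetical fixed length of the subtype's variable prefix


def matches_supertype(data: bytes) -> bool:
    """Supertype: magic at BOF offset 0."""
    return data.startswith(MAGIC)


def matches_subtype(data: bytes) -> bool:
    """Subtype: any PREFIX_LEN bytes, then the same magic."""
    return data[PREFIX_LEN:PREFIX_LEN + len(MAGIC)] == MAGIC


# Made-up sample files for the two formats.
supertype_file = b"MAGC" + b"payload"
subtype_file = b"ab01" + b"MAGC" + b"payload"

for name, data in (("supertype", supertype_file), ("subtype", subtype_file)):
    print(name, matches_supertype(data), matches_subtype(data))
```

Each sample matches exactly one of the two patterns, so there is no tie for a priority to break. The sketch ignores the corner case where the variable prefix happens to equal the magic itself.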
-
On closer observation, none of the EOF sequences that are missing from the DROID output (fmt/41, fmt/1155, fmt/1156) have 'Offset' set to 0 in PRONOM, so it sounds like those need to be made explicit.
fmt/385 is a weird signature: it has a variably positioned sequence (which therefore should not require offset values), but this variable sequence has its offset range set to 18-1024. I'm assuming the intent here is that this sequence should be found within 18-1024 bytes of the BOF sequence, but in that case it should be expressed solely as a BOF sequence. Further edit: offset values aside, that Variable sequence is actually identical to the sequence that follows the full wildcard in the BOF, so maybe the intent was to remove the Variable sequence altogether at some point? I'll have a closer look at this signature's history tomorrow.
fmt/1796 is also expressed weirdly. It has a trailing {0-2} range at the end of the EOF sequence, but additionally the EOF itself has a 0-3 offset range set. The intent behind the signature should be double-checked, but from the description it sounds like the intent was 0-2 at EOF (which would be most sensibly handled by dropping the trailing {0-2} and adjusting the EOF range to 0-2); if both patterns were intended, this should be expressed as an EOF range of 0-5. I'm happy to formally raise these issues if you wish, but as you found them I figured you might want to. Just let me know...
-
You can build a PRONOM signature file either using the XML files for each PRONOM entry (PRONOM reports) or the single XML file of signatures used by DROID (DROID signature file).
I.e. you can either do:
roy build
[builds with PRONOM xml files]
OR
roy build -noreports
[builds with DROID signature file]
Why have 2 options for what should be the same thing? Because I implemented building from PRONOM xml files first, only adding the DROID parser later, and, once I had done that, I found that there were some subtle differences between the two.
One of the tests in the siegfried test suite is TestParseDroid. This test builds both sets of signatures and checks them against each other for equality. You can view the output from this test here.
There are currently 5 formats where the byte signatures differ: fmt/41, fmt/385, fmt/1155, fmt/1156 and fmt/1796.
Most of these issues seem to be with the DROID translation from PRONOM, but the last issue, for fmt/1796, may be an issue for siegfried. That format's signature uses two ways to express the EOF offset: {0..2} at the end of the signature and a 0 to 3 range in the offset fields. DROID adds those offsets together to get a max of 5, whereas sf just picks one of the ranges. The DROID approach is probably the correct way to interpret this.
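The arithmetic behind the summed reading of the fmt/1796 offsets can be sketched as follows. This only illustrates the calculation described above and is not code from DROID or siegfried; the function name and the tuple representation of a range are assumptions.

```python
from typing import Tuple

Range = Tuple[int, int]  # (min, max) offset in bytes; this representation is an assumption


def add_ranges(offset_fields: Range, trailing_wildcard: Range) -> Range:
    """Combine the offset-field range with a trailing {min..max} wildcard by summing the bounds."""
    return (offset_fields[0] + trailing_wildcard[0],
            offset_fields[1] + trailing_wildcard[1])


offset_fields = (0, 3)      # the 0 to 3 range from the offset fields
trailing_wildcard = (0, 2)  # the {0..2} at the end of the signature

combined = add_ranges(offset_fields, trailing_wildcard)
print(combined)  # (0, 5)
```

Using either range on its own caps the EOF offset at 3 or 2, so a file with 4 or 5 bytes after the sequence would match under the summed reading but not under a single range.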
In the v1.10.1 release I added additional checks to compare the format priorities built by using PRONOM vs DROID (format priorities are the definition of superior-subordinate relationships between formats used to select the best result).
Format priorities currently differ for 32 formats. As an example, fmt/214 has 5 superior formats when PRONOM is used but only one superior format when the DROID signature file is used.
The full list of differences is:
PRONOM-DROID priorities.csv
PRONOM-DROID priorities.xlsx
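A comparison of this kind can be sketched as a diff of two maps from a format to its set of superior formats. This is a hypothetical illustration, not the siegfried check itself: the function name and the placeholder IDs (`sup-1`, `same-in-both`, and so on) are made up, and only the 5-versus-1 counts for fmt/214 come from the post.

```python
def priority_differences(pronom, droid):
    """Return formats whose sets of superior formats differ between the two builds."""
    diffs = {}
    for fmt in sorted(set(pronom) | set(droid)):
        p = set(pronom.get(fmt, ()))
        d = set(droid.get(fmt, ()))
        if p != d:
            diffs[fmt] = {
                "pronom_only": sorted(p - d),
                "droid_only": sorted(d - p),
            }
    return diffs


# Placeholder data: the superior-format IDs are invented, not real PUIDs.
pronom = {
    "fmt/214": {"sup-1", "sup-2", "sup-3", "sup-4", "sup-5"},
    "same-in-both": {"sup-1"},
}
droid = {
    "fmt/214": {"sup-1"},
    "same-in-both": {"sup-1"},
}

diffs = priority_differences(pronom, droid)
for fmt, detail in diffs.items():
    print(fmt, detail)
```

Reporting which side each extra superior format comes from makes it easier to see whether the PRONOM build is consistently the larger set, as the fmt/214 example suggests.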
I haven't checked them all but I expect the reason for the differences is pretty simple: the PRONOM database defines a number of different types of relationships between formats including "lower priority than" as well as "supertype of" etc. Likely only those explicit "lower priority than" relationships go into DROID, whereas my PRONOM parser also includes some of those other relationship types as priorities.
The question is, is this the wrong thing to do?
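To frame the question, the suspected difference can be sketched as two ways of filtering the same relationship records. The record layout, the format names, and the idea that a supertype ranks below its subtype are all assumptions for illustration; the actual PRONOM relationship vocabulary and both parsers may work differently.

```python
# Each record reads: (format, relationship, other format).
# Assumption: in both relationship types below, `other` ends up superior to `format`.
RELATIONSHIPS = [
    ("format-a", "lower priority than", "format-b"),
    ("format-a", "supertype of", "format-c"),
]


def build_priorities(relationships, accepted_types):
    """Map each format to the set of formats treated as superior to it."""
    superiors = {}
    for fmt, rel, other in relationships:
        if rel in accepted_types:
            superiors.setdefault(fmt, set()).add(other)
    return superiors


# Reading 1: only explicit priority relationships count (the suspected DROID behaviour).
explicit_only = build_priorities(RELATIONSHIPS, {"lower priority than"})

# Reading 2: other relationship types are also treated as priorities.
with_supertypes = build_priorities(
    RELATIONSHIPS, {"lower priority than", "supertype of"}
)

print(explicit_only)
print(with_supertypes)
```

Under the second reading format-a gains an extra superior format, which is the shape of the difference reported for fmt/214.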