Wikidata identifier: use pre-processing mechanism to remove duplicate signatures? #215

ross-spencer · 2023-03-23T09:22:03Z

ross-spencer
Mar 23, 2023
Collaborator

For reference ffdev-info/wikidp-issues#32.

The Wikidata identifier includes a -wikidatadebug flag which outputs some of the tool''s pre-processing messages. Using the pre-processing capabilities of the Wikidata implementation in Roy, we can also identify duplicate BOF sequences and remove these from the dataset. This should improve the accuracy of Siegfried's results, at the expense of some files not being identified (by less-specific signatures).

Why would we ever do this?

We would be following a model adopted by PRONOM to attempt to not return multiple-identifications. In Wikidata

This is already hard in Wikidata.
We know there are duplicates that do not tell us anything specific about a file, e.g. the XML duplicates means that the identification asserts the file could be one of hundreds of XML based formats.
Fixing the issue from the Wikidata side will take time.

What are the benefits?

Formats that aren't identified will be looked at by the community and we can chip away at the problem one by one. We contribute to the feedback loop between Siegfried and WIkidata.

What are the drawbacks?

Sometimes we will lose an initial clue as to what a format is by looking at the Siegfried UI.

What are the alternatives?

As always, do nothing. It may be fixed in Wikidata in due course.
Mark duplicate patterns in the Siegfried UI? (This is not a pattern adopted in any other identification tool).
e.g. basis : 'byte match at 0, 5 (Wikidata reference is empty) (DUPLICATE SIGNATURE PATTERN)'

Should we use the pre-processing capabilities of Wikidata/Roy to remove duplicate patterns from the identiifer?

Yes

0%

Yes, but... (explain below)

0%

No (explain below)

0%

0 votes

richardlehane · 2023-03-23T09:44:44Z

richardlehane
Mar 23, 2023
Maintainer

Would this mean that if multiple formats share a signature one would keep it, or the signature would be removed from all the formats?

1 reply

ross-spencer Mar 23, 2023
Collaborator Author

We would have no way to reason about the quality of either, and so both would be removed.

richardlehane · 2023-03-23T10:02:15Z

richardlehane
Mar 23, 2023
Maintainer

In terms of implementation, this would go most naturally into https://github.com/richardlehane/siegfried/blob/main/internal/identifier/parseable.go

You'd just create a new Parseable type that inherits the base but overwrites the Signatures() method with a new version that trims duplicates.

You could then add a config setting that would be picked up by the ApplyConfig() function in the same file.

This would mean that as a config setting it would potentially be available to all Identifier types, but could be made default for Wikidata.

Here's some pseudo code (the isDuplicate function isn't optimised but maybe this doesn't matter as for anything that happens in roy I never really care about performance too much)...

type NoDuplicates struct{ Parseable }

func isDuplicate(sig frames.Signature, prev, succ []frames.Signature) bool {
  for _, sig1 := range prev {
    if sig.Equals(sig1) {
      return true
    }
  }
  for _, sig2 := range succ {
    if sig.Equals(sig2) {
      return true
    }
  }
  return false 
}

// Signatures returns a signature set with corresponding IDs and weights for the bytematcher.
func (nd NoDuplicates) Signatures() ([]frames.Signature, []string, error) {
	sigs, ids, err := nd.Parseable.Signatures()
	if err != nil {
		return sigs, ids, err
	}
	rsigs := make([]frames.Signature, 0, len(sigs))
	rids := make([]string, 0, len(sigs))
	for i, v := range sigs {
          var prev, succ []frames.Signature
          if i > 0 {
            prev = sigs[:i]
          }
          if i < len(sigs)-1 {
            succ = sigs[i+1:]
          }
          if isDuplicate(sig, prev, succ) {
            continue
          }
 	  rsigs = append(rsigs, v)
	  rids = append(rids, ids[i])
	}
	return rsigs, rids, nil
}

3 replies

ross-spencer Mar 23, 2023
Collaborator Author

Ah, of course, so create a flag that makes the option configurable? -- I'd perhaps err on making it:

Default - remove duplicates for Wikidata
Option - include everything!

Would that work for you too?

richardlehane Mar 23, 2023
Maintainer

yep sounds good. If you could make it generic so other signature types could use it too (it may for example be relevant to the LOC identifier) would be cool, but I suppose for the non-wikidatata identifiers you'd want this to default to option (2)

ross-spencer Mar 23, 2023
Collaborator Author

but I suppose for the non-wikidatata identifiers you'd want this to default to option (2)

Absolutely, I feel the nature of Wikidata and where we are with that right now it lends itself to default to 1, but it's an interesting idea to make it optional for the others to take advantage of too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wikidata identifier: use pre-processing mechanism to remove duplicate signatures? #215

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 4 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Wikidata identifier: use pre-processing mechanism to remove duplicate signatures? #215

ross-spencer Mar 23, 2023 Collaborator

Replies: 2 comments · 4 replies

richardlehane Mar 23, 2023 Maintainer

ross-spencer Mar 23, 2023 Collaborator Author

richardlehane Mar 23, 2023 Maintainer

ross-spencer Mar 23, 2023 Collaborator Author

richardlehane Mar 23, 2023 Maintainer

ross-spencer Mar 23, 2023 Collaborator Author

ross-spencer
Mar 23, 2023
Collaborator

Replies: 2 comments 4 replies

richardlehane
Mar 23, 2023
Maintainer

ross-spencer Mar 23, 2023
Collaborator Author

richardlehane
Mar 23, 2023
Maintainer

ross-spencer Mar 23, 2023
Collaborator Author

richardlehane Mar 23, 2023
Maintainer

ross-spencer Mar 23, 2023
Collaborator Author