Determine the specifics of author strong identifier matching #10029

Open
pidgezero-one opened this issue Nov 13, 2024 · 9 comments · May be fixed by #10092

Comments

pidgezero-one commented Nov 13, 2024

Question

I have an open project about importing books from Wikisource. My import script uses both the Wikidata API as well as the Wikisource API to fetch as much rich information about each book as possible.

While I was developing this script, I learned about the strong identifiers Wikidata offers for authors (like VIAF id, Bookbrainz id, etc). As a proof of concept, I updated my script to include those identifiers in the import records it outputs, and then I modified the import API pipeline to match incoming books to existing authors based on those identifiers. It works, but there's just not much existing data to match to.

Before committing to this change, we should fill out those identifiers for all of OL's existing authors so that the import pipeline can actually use them for matching authors in incoming records. As Wikidata offers that information, and we already know how to get it, we should have a script that can do that backfill.

We should discuss specifics here, such as which IDs (out of this list) should be used for import matching at all (and in what order of priority) and how to handle conflict resolution. (Also, how do MARC records for authors factor into this?)

Stakeholders

@RayBB @cdrini

@pidgezero-one added the "Needs: Community Discussion", "Needs: Lead", "Needs: Triage", and "Type: Question" labels on Nov 13, 2024
RayBB commented Nov 13, 2024

While the choice of identifiers is outside my expertise, let me outline the technical approaches we could take.

The core issue appears to be improving import matching accuracy by leveraging additional identifiers from Wikidata. You think the best way to do that is by importing more identifiers from Wikidata to Open Library author records.

In my opinion, the simplest way to solve your problem is to use the information that is currently stored in the Postgres Wikidata table to match identifiers during import. For all Open Library authors with associated Wikidata IDs, we maintain a copy of their Wikidata information in our PostgreSQL database. We would extend the import matching functionality to query these additional identifiers within our existing Wikidata entries in the PostgreSQL database. This approach would provide several advantages: it eliminates data duplication across OL, prevents synchronization issues, reduces potential conflicts, and offers a straightforward implementation path.
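
Purely as an illustration (not actual Open Library code; the property-to-field mapping and the idea that the table stores standard Wikidata entity JSON are assumptions), extracting identifiers from a cached entity might look like:

```python
# Hypothetical sketch: pull supported external identifiers out of a cached
# Wikidata entity's claims. The mapping below is illustrative, not exhaustive,
# and the field names are assumptions rather than OL's actual remote_ids keys.
PROPERTY_TO_REMOTE_ID = {
    "P214": "viaf",         # VIAF ID
    "P213": "isni",         # ISNI
    "P244": "lc_naf",       # Library of Congress authority ID
    "P2607": "bookbrainz",  # BookBrainz author ID
}

def external_ids_from_wikidata(entity: dict) -> dict[str, str]:
    """Return a {field_name: value} dict of identifiers found in the entity."""
    ids: dict[str, str] = {}
    for prop, field in PROPERTY_TO_REMOTE_ID.items():
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, str):
                ids[field] = value
                break  # take the first value for singular identifiers
    return ids
```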

However, if we decide to store these strong identifiers directly on the author records, the process can still leverage the existing Wikidata information from our PostgreSQL database to populate these fields.

The approach should be determined through a thorough technical evaluation of both options, weighing their respective implementation challenges and implications. I would defer making a recommendation until others can chime in.

Side note: There likely exists a subset of authors whose Open Library IDs are referenced in Wikidata, but whose corresponding Wikidata IDs are not yet recorded in Open Library. While I believe a script has been developed to address this, we'd need to ask to be sure.

Anyway, I'm not deeply familiar with the import system but I'm very excited to see it improve and get better matching!

Freso commented Nov 14, 2024

strong identifiers

FWIW, I really dislike this term, over just using “identifiers”. :) Open Library doesn’t currently have any concept of “strength” of identifiers, and I think it would be a mistake to add it.

In my (subjective, personal!) experience, no identifiers are objectively “stronger” than others. Most “strength” you associate with an identifier relies either on use case… or on your subjective experience/bias. E.g., library identifiers are, in my experience, often conflated and/or lacking a lot of entries: OCLC/VIAF/ISNI are rife with both duplicates and conflated entities and also don’t have information on a lot of items (either no reliable/useful information, or just straight-up no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—have their holes/gaps.

if we decide to store these […] identifiers directly on the author records

My vote is for doing this. I can expand on my arguments/reasoning here or elsewhere, as appropriate. :)

how do MARC records for authors factor into this?

I’m not sure what you mean? If the MARC record has any identifiers in it, we can use those, and if it doesn’t then, well, it doesn’t.

We should discuss specifics here, such as which IDs (out of this list) should be used for import matching at all (and in which priority) and how to handle conflict resolution.

My take is that for project imports (e.g., Wikisource, LibriVox, Gutenberg, Runeberg, …), the identifiers from that project should reign supreme. This might be difficult to code, though, if not impossible. Importers could run their own preliminary matching though to seed their import data with OLIDs (see #9411), which should bypass this whole process.

My suggestion for identifier-based import logic flow (a rough code sketch follows the list):

  • Super-Ideal case:
    • incoming data has an OLID ⇒ match to that OLID
  • Ideal cases (external ids match a single OL entity):
    • no overlap in known sets of identifiers ⇒ no match, fallback to “normal” matching
    • all known OL ids do not match incoming equivalent ids ⇒ reject match
    • any known OL ids match any incoming recognised ids ⇒ match
  • Troublesome cases (note: ideally, any of these would raise a flag somewhere that a librarian could find for further investigation):
    • incoming identifiers match multiple OL entities ⇒ pick an entity to continue with (then proceed as in “external ids match a single OL entity” below):
      • if matches > 2:
        • if a group of entities matches ≥ half of the incoming ids, pick that group and restart “incoming identifiers match multiple OL entities” with just those entities (note: this is probably the most complex calculation here with a lot of internal edge cases, so, for simplicity, this could be dropped)
          • e.g., A match 4, B match 2, C–H match just 1 each, start over with just A,B
        • if entity A matches ≥ half of the known incoming identifiers, pick entity A
          • e.g., A matches on 3 identifiers and B, C, D match on 1 identifier each
          • note: ≥ should be fine, as there should be no case where A matches half and B matches other half, since then matches > 2 would be false and the flow would be in the matches == 2 tree instead; there might be some edge cases with multiple OL entities having the same identifiers assigned though
        • else no match/fall back to non‐identifier matching
      • if matches == 2:
        • if both entities have same amount of matched identifiers, pick oldest (lowest OLID)
          • note: this is far most likely a duplicate that needs merging, and going with the lowest OLID reduces the amount of data that needs updating when the merge is done
        • else pick entity with most matched identifiers
    • external ids match a single OL entity:
      • half or more of incoming identifiers conflict with known OL ids ⇒ reject match (possibly fall back to normal matching?)
      • more incoming identifiers match with known OL ids than conflict ⇒ match
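
Purely to illustrate the tallying above (all names are made up; this is a sketch, not a proposed implementation):

```python
# Rough sketch of the flow above. An "identifier set" is a plain dict such as
# {"viaf": "12345678", "wikidata": "Q42"}; candidates maps OL author keys to
# their identifier sets. All helper names here are hypothetical.

def tally(incoming: dict, existing: dict) -> tuple[int, int]:
    """Count matching vs. conflicting values over the identifiers both sets know."""
    matches = conflicts = 0
    for key in incoming.keys() & existing.keys():
        if incoming[key] == existing[key]:
            matches += 1
        else:
            conflicts += 1
    return matches, conflicts

def pick_author(incoming: dict, candidates: dict[str, dict]) -> str | None:
    """Return the OL key to match against, or None to fall back to normal matching."""
    if "openlibrary" in incoming:                  # super-ideal case: OLID provided
        return incoming["openlibrary"]

    scored = []                                    # (matches, conflicts, olid)
    for olid, existing in candidates.items():
        matches, conflicts = tally(incoming, existing)
        if matches or conflicts:
            scored.append((matches, conflicts, olid))

    if not scored:                                 # no overlap in known identifiers
        return None                                # fall back to "normal" matching

    matched = [s for s in scored if s[0] > 0]
    if not matched:                                # every overlapping id conflicts
        return None                                # reject the identifier match

    if len(matched) == 1:                          # single OL entity matched
        matches, conflicts, olid = matched[0]
        # reject if half or more of the overlapping identifiers conflict
        return olid if matches > conflicts else None

    if len(matched) == 2:                          # two entities: most matches wins,
        matched.sort(key=lambda s: (-s[0], s[2]))  # ties go to the "lowest" OLID
        return matched[0][2]                       # (string sort used here for brevity)

    # more than two: pick a clear winner covering at least half the incoming ids;
    # the "restart with a smaller group" refinement from the flow is omitted here
    matched.sort(key=lambda s: -s[0])
    best = matched[0]
    if best[0] * 2 >= len(incoming):
        return best[2]
    return None                                    # flag for librarian review instead
```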

For merging incoming identifier sets with existing sets (see the sketch after the list), I’d say

  • if identifier is unset in OL, set it
  • if identifier is different in OL and plural (e.g., oclc_numbers for Editions), append to OL’s list
  • if identifier is different in OL and singular, keep OL’s version (but flag, if possible)
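
A minimal sketch of those three rules, assuming remote_ids-style dicts (the set of "plural" fields here is an assumption):

```python
# Sketch only: merge incoming identifiers into an existing OL identifier dict.
PLURAL_FIELDS = {"oclc_numbers"}   # assumed list-valued fields

def merge_identifiers(ol_ids: dict, incoming: dict) -> tuple[dict, list[str]]:
    """Return (merged identifiers, fields flagged for librarian review)."""
    merged = dict(ol_ids)
    flagged: list[str] = []
    for field, value in incoming.items():
        if field not in merged:
            merged[field] = value                            # unset in OL: set it
        elif field in PLURAL_FIELDS:
            if value not in merged[field]:
                merged[field] = list(merged[field]) + [value]  # plural: append
        elif merged[field] != value:
            flagged.append(field)                            # singular conflict: keep OL's, flag
    return merged, flagged
```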

Of note, this flow has no concept of identifier “strength” and simply tallies up and compares the amount of matching vs. non‐matching identifiers for any given item.

The github-actions bot added the "Needs: Response" label on Nov 15, 2024
@cdrini added the "Lead: @cdrini" label and removed the "Needs: Lead" label on Nov 18, 2024
pidgezero-one commented Nov 18, 2024

My take is that for project imports (e.g., Wikisource, LibriVox, Gutenberg, Runeberg, …), the identifiers from that project should reign supreme. This might be difficult to code, though, if not impossible.

On the API side, I don't think this would be hard to code at all, especially if the import record contains information about which project it's coming from. Alternatively, the import record schema could be updated to include an optional field for identifier priority, in which case it would be the responsibility of whatever produces the import record (such as the wikisource script) to denote which identifier it wants to adhere to.

I've been meaning to move this into a separate pull request, but here is where I proposed a modification to the import endpoint that looks to see if there is an existing author that matches any identifiers in the import record. It stops as soon as it finds any match, though, and it just loops through identifiers as they're listed in identifiers.yaml, so there's nothing here yet that tells it to check Wikisource IDs first when the import record comes from Wikisource, for example.
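
Roughly, the shape of that loop (names here are illustrative, not the actual code in that change):

```python
# Illustrative sketch of the matching loop described above: check identifiers in
# the order they appear in identifiers.yaml and stop at the first hit. The lookup
# callable stands in for whatever DB/Solr query the real endpoint would use.
from typing import Callable, Optional

def find_existing_author(
    import_author: dict,
    identifier_order: list[str],
    lookup: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    remote_ids = import_author.get("remote_ids", {})
    for id_name in identifier_order:          # e.g. order taken from identifiers.yaml
        value = remote_ids.get(id_name)
        if value:
            match = lookup(id_name, value)
            if match:
                return match                  # stops as soon as any match is found
    return None
```

A per-source priority (e.g. checking Wikisource IDs first for Wikisource imports) would then just mean reordering identifier_order before the loop.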

It also saves any identifiers for that author that are coming from the import record, which fills out these fields - is this still where we want that information to live?
[screenshot of the author's identifier fields]

I'm not terribly opinionated on the logic flow and am happy to implement whatever is preferred by librarians. :)

In my opinion, the simplest way to solve your problem is to use the information that is currently stored in the Postgres Wikidata table to match identifiers during import. For all Open Library authors with associated Wikidata IDs, we maintain a copy of their Wikidata information in our PostgreSQL database. We would extend the import matching functionality to query these additional identifiers within our existing Wikidata entries in the PostgreSQL database. This approach would provide several advantages: it eliminates data duplication across OL, prevents synchronization issues, reduces potential conflicts, and offers a straightforward implementation path. However, if we decide to store these strong identifiers directly on the author records, the process can still leverage the existing Wikidata information from our PostgreSQL database to populate these fields.

I like this a lot better than a Wikidata API script, actually! I am still pretty unfamiliar with the depths of our production data, and didn't know we stored Wikidata entries for our authors. If I'm understanding correctly, could we just fill out each author's remote_ids with the information we already have from Wikidata, and then continue leveraging remote_ids for author identification on import like I'm trying to do in my other PR?

RayBB commented Nov 19, 2024

If it helps, you can see the db schema here. https://github.com/internetarchive/openlibrary/pull/9130/files
There's also a PR open now to start getting a few identifiers to show, so that might be helpful too:
https://github.com/internetarchive/openlibrary/pull/9991/files

pidgezero-one commented Nov 25, 2024

Thanks, Ray! Over the weekend, I was able to get a script working that loops through every author in the Wikidata table and copies identifiers from the Wikidata JSON into the author object's remote_ids. Here are the remote_ids I got onto John O. Meusebach's page: most of them were already on production, but the Library of Congress one came from my script leveraging the work you just merged in for getting an author's Wikidata properties.

[screenshot of John O. Meusebach's remote_ids]

It doesn't implement the logic that Freso outlined above, though, so I'm still working on that for a proof of concept. Will hopefully have something to show later this week!

Ultimately if every author's remote_ids are as up to date as possible, the import pipeline can rely on it very, very easily, which I got a POC working for in another PR. It would also let us cleanly merge the identifiers that already exist as outbound links in each author's page with the ones we're pulling from Wikidata.

I'd like to also get the code to a point where fetching this info from Wikidata automatically updates the remote_ids as well, but haven't figured that out just yet.
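
For what it's worth, the backfill amounts to something like this sketch (helper names are placeholders, not the real script):

```python
# Placeholder sketch of the backfill: for each author with a cached Wikidata
# entity, copy identifiers from the Wikidata JSON into remote_ids, only filling
# fields that are not already set.

def backfill_remote_ids(author_entity_pairs, extract_ids, save_author):
    """author_entity_pairs yields (author_doc, wikidata_entity) tuples;
    extract_ids turns an entity into a {field: value} dict;
    save_author persists an updated author document."""
    for author, entity in author_entity_pairs:
        remote_ids = dict(author.get("remote_ids", {}))
        updated = False
        for field, value in extract_ids(entity).items():
            if field not in remote_ids:        # never overwrite librarian-set values
                remote_ids[field] = value
                updated = True
        if updated:
            author["remote_ids"] = remote_ids
            save_author(author)
```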

pidgezero-one commented

I'll put up a draft for this most likely tomorrow sometime, but the POC I've drafted up is:

  • Copying Wikidata identifiers into the author's remote_ids property whenever Wikidata is fetched for that author. If there are too many conflicts with existing identifiers set by OL librarians, the act of consolidating remote_ids is aborted (a rough sketch of this is below the list).
  • There's a script that will backfill this for every author that has stored Wikidata, where the Wikidata object contains a back reference to the author key.
  • Importing follows Freso's logic as written above (I haven't tested this part very much yet).
  • If at any point we can't consolidate existing remote_ids with Wikidata json, or an incoming import record, there should be some way to flag this to librarians, but I'm not sure how to do that yet.
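
For the first bullet, a minimal sketch of the abort-on-conflict idea (the threshold is a made-up number, not what the draft actually does):

```python
# Minimal sketch: merge Wikidata identifiers into remote_ids, but give up if
# librarian-set values disagree with Wikidata on too many fields.
MAX_CONFLICTS = 3   # hypothetical cut-off

def consolidate(remote_ids: dict, wikidata_ids: dict) -> dict | None:
    """Return the merged dict, or None if consolidation should be aborted."""
    conflicts = [
        field for field in remote_ids.keys() & wikidata_ids.keys()
        if remote_ids[field] != wikidata_ids[field]
    ]
    if len(conflicts) >= MAX_CONFLICTS:
        return None                           # abort; surface to librarians somehow
    return {**wikidata_ids, **remote_ids}     # existing OL values win on conflict
```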

tfmorris commented

E.g., library identifiers are, in my experience, often conflated and/or lacking a lot of entries: OCLC/VIAF/ISNI are rife with both duplicates and conflated entities and also don’t have information on a lot of items (either no reliable/useful information, or just straight-up no information at all)

Identifiers need to be considered individually, but the gross generalization that "crowd sourced" identifiers are qualitatively "better" in some dimension than professionally maintained identifiers doesn't hold in my experience. One need look no further than OpenLibrary itself to find an easy, and egregious, counter-example.

VIAF is a special case because its "entities" are generated by a clustering algorithm working on input data from the national libraries of the world. It can, and does, make mistakes in clustering and the clusters can change over time.

The LCNAF has identifiers for conflated generic records like "Smith, John" with no dates, but those identifiers are explicitly tagged as not for current/future use in cataloging.

Wikidata generally has pretty good quality links, but many of its authors have two or more OL identifiers associated with them. VIAF has better coverage of authors, but that shouldn't be an issue if the main focus is Wikisource, where the authors should all have Wikidata entries.

Side note: There likely exists a subset of authors whose Open Library IDs are referenced in Wikidata, but whose corresponding Wikidata IDs are not yet recorded in Open Library.

There is actually a large number of these. I come across them all the time. Anecdotally, OL IDs that I added to Wikidata a year ago still haven't shown up here.

hornc commented Dec 3, 2024

I've been thinking about this a bit more, trying to make sense of the Wikisource importing feature.
This author identifier #10029 issue is really a larger piece of work, effectively an epic, and there aren't sufficient examples of what it needs in order to be successful.

I don't think this epic is a pre-requisite for the Wikisource feature at all, but it's not an unreasonable piece of work. I just think it needs breaking down, and some specific goals and definitions we can all agree on.

  1. If Wikidata-sourced author metadata were present in an OL import record, with names and birth + death dates in the correct fields, authors should be matched correctly. If they are not, there could be a bug in the existing import code, or there might be some formatting issue somewhere. I'd want to ensure those issues with the existing code were fixed first.

  2. OL does not currently have a stand-alone author import mechanism. Author records are created from book import records. Author identifiers cannot even be passed to the import API currently, so there is no way to match imports on these identifiers. The functionality of matching author records was accomplished by the author name + date mechanism, which was good enough for getting the book metadata imported, which was the primary use case. Matching authors is only an auxiliary feature of the book import process. Changing that to something more sophisticated is more of an epic than a bug-fix.

This epic appears to be asking for / implying a 'proper' fully featured author import system that "does all the right things", without specifying any examples. Should Authors be importable independently of books? There's nothing here in terms of use cases or value statements to suggest either way. Without a concrete use case, I'd say no; it overcomplicates things for no clear gain.

#9448 feels like a standalone atomic pre-requisite feature that would fall under this epic. @Freso has done a good job of setting it out as a coherent feature, and I think that feature is ready to be implemented, and would add some value all by itself. As a patron it is helpful to view the various author identifiers, and it helps with the manual disambiguation and tidy up that is an almost required part of the process. It makes sense that if author metadata can be edited in the model, they should be importable. Figuring out an import schema that represents and maps the required identifiers correctly will be part of that feature.

#9411, matching on provided Open Library identifiers, would be a good and simple starting point for developing the basic matching feature, but I'm still trying to think of a realistic example where book metadata is known with an Open Library author ID, but the book is not already on Open Library. Concrete examples and use cases would let us work out the specific requirements for how we handle aliases and pen-names (which is one kind of case I can imagine). The OL data model has fields for some of this, but it is not utilised or connected to any imports.

Even the more concrete possible examples of "generic book sources" which get mentioned seem slightly hypothetical -- I haven't been able to find a Wikisource book that doesn't have a source represented on archive.org / OL or Project Gutenberg. LibriVox examples have a similar 'hypothetical' import nature since they are all hosted on archive.org, and aren't representative of generic imports either (but that's a different topic). I'm also unsure about Project Gutenberg as a publisher (Work vs. Edition); that also seems different from a general book import.

I don't think it is at all obvious from the current description how to handle imported author fields that don't match when an identifier is matched, and without concrete examples or even a concrete usecase that requires this feature, it's hard to make a call.

There will be name variations (J. Smith vs. John Smith vs. Jhon Smith, which may or may not be a typo) or simply mismatched identifiers, and there will be differences in date precision and exact dates. Currently the import process is cautious and will treat these as different records; what is the correct behaviour for each case with a new system? The current system works because it wasn't considered too costly to have the extra author records. We're not demonstrating a clear increase in value over this by suggesting non-specific changes.

The simplest solution would be to incrementally extend the current date matching process using some known identifiers, but again without some concrete examples and expectations, it's hard to be specific.

I can't help but think that the impact of this feature will be smaller than it first appears, because the import process is importing and matching books rather than authors, but again it's hard to make an argument without concrete examples.

It occurred to me while writing all this that perhaps your Wikisource imports were some of the first non-MARC imports where the responses were examined for expected and sane match results, since the bulk of non-MARC imports are just shovelled in at scale, and I'm not sure the responses are examined at all. The MARC expectations are not thoroughly examined either, but they have been incrementally monitored to some extent and improved over the years when specific issues were noted. Perhaps some of the MARC specific code is doing extra work that is not being replicated on the bare JSON import path? The MARC path ultimately uses the same import format as the raw JSON, but it does have a fair bit of logic to form that JSON from the metadata it has available. That might be something that needs further investigation if the goal is to bring non-MARC imports up to MARC import level (if that is the real requirement here?).

Freso commented Dec 3, 2024

I'm still trying to think of a realistic example where book metadata is known with an Open Library author ID, but the book is not already on Open Library.

I have been wanting to do a mass-import of LibriVox audiobooks (and also Project Runeberg ebooks, once #9982 hits prod; this is both motivation for and motivated by #9984). A lot (most? all??) of these authors already exist in OL—many of them several times. Some already have LibriVox IDs, but I was also planning to run a bot to populate LibriVox (and Runeberg) identifiers on OL before doing the imports. This would allow my JSON generator to look up the LV/Rb IDs in OL and thus get the OL IDs to put into the JSON.

LibriVox examples have a similar 'hypothetical' import nature since they are all hosted on archive.org

They are indeed hosted by IA, but the vast majority of LV books are not represented in OL, so they still need importing into OL in some way or other. I suspect this might be true for Gutenberg as well. I'm not sure if all Runeberg books exist in IA – I know a good chunk do, but what I found looked like a 3rd-party import, unaffiliated with either IA or Runeberg, so I'm not sure how complete or up-to-date it is. Either way, I can't recall having come across a single Runeberg OL Edition yet.

But yes, none of these are “generic” imports either, since the imports are specific to the source they get imported from, but those are the types of cases I specifically had in mind for #9411 and #9448 .
