feat: consolidate author remote_ids and wikidata identifiers #10092

Draft · wants to merge 272 commits into master

Conversation

@pidgezero-one (Contributor) commented Nov 27, 2024

This should be squash-merged to avoid conflicts with #9674, which split off from this PR.

===This is a WIP===

Closes #10029

Consolidate an author's existing known IDs in OL with their known IDs from Wikidata. Identifiers that already exist in OL from other imports take precedence.

Technical

  • Fetching live Wikidata also saves Wikidata's author identifiers to the author's remote_ids (at least any that don't conflict with identifiers already recorded for that author, and only if there aren't too many conflicts; see the sketch after this list)
  • Backfill this operation for all existing authors with backfill_author_identifiers.py
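
The PR mentions a merge_remote_ids() method for this; the sketch below only illustrates the behavior described above, with an invented signature and conflict threshold, not the PR's actual code. OL's existing identifiers win on conflict, and the merge is abandoned when too many identifiers disagree:

```python
def merge_remote_ids(
    existing: dict[str, str],
    from_wikidata: dict[str, str],
    max_conflicts: int = 2,  # illustrative threshold, not the PR's value
) -> dict[str, str] | None:
    """Merge Wikidata-sourced identifiers into an author's remote_ids.

    Identifiers already known to OL take precedence. Returns None
    (no merge) when too many identifiers disagree, so the conflict
    can be flagged for librarian review instead of silently resolved.
    """
    conflicts = [
        key
        for key, value in from_wikidata.items()
        if key in existing and existing[key] != value
    ]
    if len(conflicts) > max_conflicts:
        return None
    merged = dict(from_wikidata)
    merged.update(existing)  # OL's existing identifiers win on conflict
    return merged
```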

TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)

Issues:

  • The backfill script sets web.ctx.ip = '127.0.0.1' because web.ctx.ip is empty when running the script locally, which means the attempt to save author entities would otherwise get rejected (see the sketch below)
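
For reference, the workaround amounts to the following (web.ctx is web.py's per-request storage; it carries no client IP when the code runs outside a real HTTP request):

```python
import web

# web.ctx.ip is empty when the script runs outside a real HTTP request,
# and saving author entities is rejected without a client IP, so stub
# in a loopback address for local runs.
web.ctx.ip = '127.0.0.1'
```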

Testing

I copied a Wikidata JSON record (which included an OL ID) into the wikidata postgres table, and then added the author with ./copydocs.py. I then ran backfill_author_identifiers.py, and identifiers that existed in the Wikidata JSON but not in the author's remote_ids began to show up on the author's page.

Screenshot

Stakeholders

@hornc (Collaborator) commented Nov 28, 2024

@pidgezero-one general comments based on the examples in the description:

  1. I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that's because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way. (See the illustrative record after this list.)

  2. I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata points to, e.g., a book printed in 1922.
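
For illustration only, a minimal edition-shaped record with the archive.org identifier in ocaid might look like this (all values invented, not an actual import payload):

```python
edition = {
    "title": "An Example Book",
    "publish_date": "1922",
    # The archive.org item identifier belongs in ocaid, not an ia_id field.
    "ocaid": "examplebook1922",
    # Provenance is tracked in source_records, e.g. an "ia:" prefix
    # for items imported directly from archive.org.
    "source_records": ["ia:examplebook1922"],
}
```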

@pidgezero-one (Contributor, Author)

@hornc

I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that's because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?

I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata points to, e.g., a book printed in 1922.

I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?

@hornc (Collaborator) commented Dec 3, 2024

@pidgezero-one I'm sorry, but I think this PR is getting out of hand with scope creep from what seems to be the feature that prompted it. It seems to be building on features that have not been implemented yet. There's a method named merge_remote_ids(), but it's not yet possible to import author IDs; that's #9448.

I think focusing on the original issues and implementing those is needed first, and this PR should go back to the planning stage (its main purpose isn't clear to me).

I was going to apologise in case I'm commenting on the wrong PR, but it looks like the file content changed while I was typing this, so I don't know :) I'm getting confused -- the code changes are creeping away from the original issues; that's the problem I want to highlight.

I think you should focus on and deliver either issue #9671 (which I don't think even needs changes to author identifiers for a working MVP) or #9448, which is a well-defined feature that will help direct further discussion on what might be needed next for author identifiers. That would keep the PRs clearer and easier to review.

EDIT: GitHub made me choose the wrong issues by mixing PRs in with the issue numbers -- should be correct now. I'm trying to link to the issues that define the features.

@pidgezero-one pidgezero-one changed the title feat: consolidate remote ids and wikisource identifiers feat: consolidate author remote ids and wikisource identifiers Dec 3, 2024
@pidgezero-one pidgezero-one changed the title feat: consolidate author remote ids and wikisource identifiers feat: consolidate author remote_ids and wikisource identifiers Dec 3, 2024
@pidgezero-one (Contributor, Author) commented Dec 3, 2024

@hornc When I was first testing the output of the Wikisource script, I noticed that its records were producing duplicated authors on import, which led me to investigate a solution for #9448 to see if it would solve my script's problem. Originally these were part of the same PR because it was easier for me to test that way, but I split off the work for #9448 into PR #10110.

The comments that were left on #9674 back when it included the work in #10110 led me to believe that we have plenty of authors whose remote IDs are not filled out yet, but discussions on Slack showed me that a lot of these identifiers are buried in the wikidata table that we automatically record to.

So I split these three semi-connected problems into three pieces, in this order:

  1. This PR does the busy work of filling out remote_ids from identifiers we already have in the Wikidata table but that haven't made their way into the author's remote_ids (pulling production data locally, I found instances of this pretty much immediately). This relies only on code that already exists. If an author has no remote_ids but we have a Wikidata JSON for them that includes an isni id, for example, it'll extract that info from the wikidata table and populate the author's remote_ids with it. There's also the possibility of the opposite for some authors, where we already have remote IDs and they conflict with what we got from Wikidata, so merge_remote_ids judges that (and ideally would raise this to librarians, something Freso suggested).
  2. [WIP] feat: use author known identifiers in import API #10110 compares incoming import author records to author remote_ids. This won't be very effective if too many remote_ids are missing on our existing authors to begin with, because we can't compare against remote_ids that our author objects don't have. That's the impetus for feat: consolidate author remote_ids and wikidata identifiers #10092: it sets us up to reduce the likelihood of the import API getting false-negative author matches caused by missing remote_ids that shouldn't be missing in the first place. This should also use merge_remote_ids to detect whether an import record conflicts too much with existing remote_ids (see the sketch after this list).
  3. Import records produced by feat: import books from Wikisource #9674 can then face a severely reduced likelihood of duplicating authors. Other import sources can be updated in the future to also take advantage of [WIP] feat: use author known identifiers in import API #10110.
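
As a rough illustration of the matching idea in point 2 (a hypothetical helper, not the code in #10110): an incoming import author would first be checked against existing authors by shared identifiers, falling back to name/date matching only when nothing matches by identifier.

```python
def match_by_remote_ids(
    incoming: dict[str, str], candidates: list[dict]
) -> dict | None:
    """Return the first candidate author that shares any remote
    identifier (e.g. isni, wikidata) with the incoming import record.

    Hypothetical helper for illustration; real matching would also
    need to handle conflicting identifiers, not just matching ones.
    """
    for author in candidates:
        existing = author.get("remote_ids", {})
        if any(existing.get(key) == value for key, value in incoming.items()):
            return author
    return None
```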

You're right that #10110 and #9674 can stand alone as features without this PR, but I opened this PR to set those features up to be as effective as possible.

Does this make more sense?

@pidgezero-one pidgezero-one changed the title feat: consolidate author remote_ids and wikisource identifiers feat: consolidate author remote_ids and wikidata identifiers Dec 3, 2024
@hornc (Collaborator) commented Dec 3, 2024

@pidgezero-one Thanks for breaking down the history, that helps.

  1. still confuses me, because I'm not sure what relationship openlibrary/core/wikidata.py has with the Open Library import process (as far as I know it is totally independent). I can't tell if this PR is adding another import-like side channel (I think it is), or whether an import-like side channel was already added for some Wikidata-sourced values. I didn't like the idea of creating a 'Wikidata table' in OL, and I don't think it has been fully justified.

To be clear, I don't like the idea that author identifiers can be imported separately through some direct mechanism outside the existing APIs, because it'll be a nightmare to document and maintain. What's worse, I can't even tell by looking at the existing code or documentation whether it's already been done or not. I don't see the value in having a special 'Wikidata-sourced side import channel' when there are already general-purpose APIs to import and modify existing records.

This is why I think such changes need planning and discussion. As it is, I don't see a clear purpose to the changes in the PR, and don't really know how to evaluate them. (Other than don't add new single source import channels)

  2. is dependent on Import endpoint should allow for any (known) author identifiers #9448 to expand the import schema to accept author identifiers in the first place. "Compares incoming import author records to author remote_ids" is jumping the gun considerably, and I'm worried it's based on assumptions about the import process that have not been made explicit, and may miss some of the general requirements for OL imports. It does seem surprising that OL doesn't use author identifiers to match authors, but it hasn't been a problem that has been explicitly described in well over a decade of importing books. To make the change it'd help to have some clear examples of what is wrong with the existing import process and what was expected.

The problems you noticed with "the output of the Wikisource script, I noticed that its records were producing duplicated authors on import" are not described anywhere, so it's hard to comment on expectations and what is needed to fix it. I think there should be a new issue with existing-import format examples and results, along with your expectations for discussion and coming up with a suitable solution. There could be problems with the input format, but the basic author-matching feature is broader than just Wikisource, so there are probably other factors to consider.

  3. This core feature, import books from Wikisource, could just be implemented as an atomic MVP using the existing book import format, which would make it easier to review. Any shortcomings raised by a new import source could then be made very clear and we can work on making the entire import process better. The current multiple interrelated drafts are confusing things and priorities.

FWIW, I think #9448 (accept author identifiers on import, just for basic display purposes at minimum) is the most valuable. 2, using author identifiers to disambiguate authors, is probably the most work, but I'd need to see some clear demonstrations of the problem first. I feel that a correct solution might be more subtle than it seems, have other implications, or involve assumptions that need clarification. It may not add as much value as it first seems either, but examples will hopefully demonstrate that either way.

@pidgezero-one (Contributor, Author) commented Dec 3, 2024

@hornc

I can't tell if this PR is adding another import-like side channel (I think it is), or whether an import-like side channel was already added for some Wikidata-sourced values. I didn't like the idea of creating a 'Wikidata table' in OL, and I don't think it has been fully justified.

Basically, this was it. It seemed wasteful to me not to make use of data that we already had, so I was proposing with this PR that we could use it to improve author matching. My next step was to test it and then describe what it accomplishes more thoroughly, so that I'd have something to show in a discussion that probably would have happened on Slack or during open mic, but I didn't get that far before this comment thread started. If there's disagreement on whether we should be storing that Wikidata data at all, then that's fine, I can close this one. Thanks for the context on that.

"compares incoming import author records to author remote_ids" is jumping the gun considerably, and I'm worried it's based on assumptions about the import process that have not be made explicit, and may miss some of the general requirements for OL imports. It does seem surprising that OL doesn't use author identifiers to match authors, but it hasn't been a problem that has been explicitly described in well over a decade of importing books. To make the change it'd help to have some clear examples of what is wrong with the existing import process and what was expected.

FWIW, it's based on Freso's description of this proposed mechanism in #9448, starting at "...and then of course have the importer pipeline actually recognise the author objects’ identifier(s) and use it for matching against existing OL authors." Ze offers a pretty thorough breakdown of how this should work and compares it to MusicBrainz's similar process as an example, although I based the logic in my draft more on hir suggestions in #10029, which is more recent. Currently, authors attached to books imported at /api/import are only matched by exact name and birth/death date, which Tom Morris says is insufficient in the comments of #9674. I can provide some examples from my local testing when I have time to sit down with it later this week (only because it's 1 am here :) ).
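
To make that failure mode concrete (records invented for illustration): exact-name-plus-dates matching misses the same author recorded under different name forms, even when both records carry the same strong identifier.

```python
existing_author = {
    "name": "Lucy Maud Montgomery",
    "birth_date": "1874",
    "remote_ids": {"viaf": "00000000"},  # invented value
}
incoming_author = {
    "name": "L. M. Montgomery",
    "birth_date": "1874",
    "remote_ids": {"viaf": "00000000"},  # same identifier, invented value
}

# Exact-name matching sees two different authors...
assert existing_author["name"] != incoming_author["name"]
# ...even though the shared VIAF ID shows they are the same person.
assert existing_author["remote_ids"] == incoming_author["remote_ids"]
```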

There could be problems with the input format, but the basic author-matching feature is broader than just Wikisource, so there are probably other factors to consider.

To be clear, I didn't especially base #10110 on Wikisource, which is mostly why I split it out into a separate piece. My proposed schema update is based on the author identifiers we currently store in OL, which defines the dict keys of remote_ids, so my draft code for author importing follows suit. I just formatted the data output by my Wikisource/Wikidata script to match that, because it was right there on my PC for me to use for generating sample records while I was first learning about how the import API works. Once I had a working POC for #9448, I thought it'd be worth at least seeing if it'd be worthwhile to offer stronger matching for Wikisource right out the gate, but it's easy enough to revert if not.

@cdrini (Collaborator) commented Dec 5, 2024

FWIW, I think #9448 (accept author identifiers on import, just for basic display purposes at minimum) is the most valuable. 2, using author identifiers to disambiguate authors, is probably the most work, but I'd need to see some clear demonstrations of the problem first. I feel that a correct solution might be more subtle than it seems, have other implications, or involve assumptions that need clarification. It may not add as much value as it first seems either, but examples will hopefully demonstrate that either way.

W.r.t. examples, I'll provide some on the issue #9448 to help clarify here, but this is definitely a feature that would be useful for our import pipeline!
