feat: consolidate author remote_ids and wikidata identifiers #10092
base: master
Conversation
@pidgezero-one general comments based on the examples in the description:
If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?
I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?
Co-authored-by: Frederik “Freso” S. Olesen <[email protected]>
@pidgezero-one I'm sorry, but I think this PR is getting out of hand with scope creep from what seems to be the feature that prompted it. It seems to be building on features that have not been implemented yet. There's a method named […]. I think focusing on the original issues and implementing those is needed before taking this back to the planning stage (it's not clear to me what the main purpose of this one is).

I was going to apologise in case I'm commenting on the wrong PR, but it looks like the file content changed while I was typing this, so I don't know :) I'm getting confused -- the code changes are creeping away from the original issues; that's the problem I want to highlight.

I think you should focus on and deliver either issue #9671 (which I don't think even needs changes to author identifiers for a working MVP) or #9448, which is a well defined feature that will help direct further discussion on what might be needed next for author identifiers. That would keep the PRs clearer and easier to review.

EDIT: GitHub made me choose the wrong issues by mixing PRs in with the issue numbers -- should be correct now. I'm trying to link to the issues that define the features.
@hornc When I was first testing the output of the Wikisource script, I noticed that its records were producing duplicated authors on import, which led me to investigate a solution for #9448 to see if it would solve my script's problem. Originally these were part of the same PR because it was easier for me to test that way, but I split off the work for #9448 into PR #10110. The comments that were left on #9674 back when it included the work in #10110 led me to believe that we have plenty of authors whose remote IDs are not filled out yet, but discussions on Slack showed me that a lot of these identifiers are buried in the wikidata table that we automatically record to. So I split these three semi-connected problems into three pieces, in this order:
You're right that #10110 and #9674 can stand alone as features without this PR, but I opened this PR to set those features up to be as effective as possible. Does this make more sense?
@pidgezero-one Thanks for breaking down the history, that helps.
To be clear, I don't like the idea that author identifiers can be imported separately through some direct mechanism outside the already-existing APIs, because it'll be a nightmare to document and maintain. What's worse, I can't even tell by looking at existing code or documentation whether it's already been done or not. I don't see the value in having a special 'Wikidata-sourced side import channel' when there are already general-purpose APIs to import and modify existing records. This is why I think such changes need planning and discussion. As it is, I don't see a clear purpose to the changes in the PR, and don't really know how to evaluate them. (Other than: don't add new single-source import channels.)
The problems you noticed with "the output of the Wikisource script, I noticed that its records were producing duplicated authors on import" are not described anywhere, so it's hard to comment on expectations and what is needed to fix it. I think there should be a new issue with existing-import format examples and results, along with your expectations for discussion and coming up with a suitable solution. There could be problems with the input format, but the basic author-matching feature is broader than just Wikisource, so there are probably other factors to consider.
FWIW, I think #9448 (accept author identifiers on import, just for basic display purposes at minimum) is the most valuable. The second piece, using author identifiers to disambiguate authors, is probably the most work, but I'd need to see some clear demonstrations of the problem first. I feel that a correct solution might be more subtle than it seems, have other implications, or involve assumptions that need clarification. It may not add as much value as it first seems either, but examples will hopefully demonstrate that either way.
Basically, this was it. It seemed wasteful to me not to make use of data that we already had, so I was proposing with this PR that we could use it to improve author matching. My next step was to test it and then describe what it accomplishes more thoroughly so that I'd have something to show in a discussion that probably would have been on Slack or during open mic, but I didn't get that far before this comments discussion started. If there's disagreement on whether we should be storing that Wikidata data at all then that's fine, I can close this one. Thanks for the context on that.
FWIW, it's based on Freso's description of this proposed mechanism in #9448, starting at "...and then of course have the importer pipeline actually recognise the author objects' identifier(s) and use it for matching against existing OL authors." Ze offers a pretty thorough breakdown of how this should work and compares it to MusicBrainz's similar process as an example, although I based the logic in my draft more on hir suggestions in #10029, which is more recent. Currently, authors attached to books imported at /api/import are only matched by exact name and birth/death date, which Tom Morris says is insufficient in the comments of #9674. I can provide some examples from my local testing when I have time to sit down with it later this week (only because it's 1 am here :) ).
To be clear, I didn't especially base #10110 on Wikisource, which is mostly why I split it out into a separate piece. My proposed schema update is based on the author identifiers we currently store in OL, which defines the dict keys of […]
W.r.t. examples, I'll provide some on issue #9448 to help clarify, but this is definitely a feature that would be useful for our import pipeline!
This should be squash-merged to avoid conflicts with #9674, which split off from this PR.
===This is a WIP===
Closes #10029
Consolidate an author's existing known IDs in OL with their known IDs from Wikidata. Identifiers that already exist in OL from other imports take precedence.
Technical
TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)
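The precedence rule described above (OL identifiers win over Wikidata ones) can be sketched roughly as below, including one possible way to collect disagreements for the librarian-flagging TODO. The function and field names are illustrative assumptions, not the actual implementation in this PR:

```python
def consolidate_remote_ids(ol_ids: dict, wikidata_ids: dict) -> tuple[dict, dict]:
    """Merge Wikidata-sourced identifiers into an author's existing
    remote_ids. Values already in OL take precedence; disagreements
    are returned separately so they could be flagged to librarians."""
    merged = dict(wikidata_ids)
    merged.update(ol_ids)  # OL wins wherever both sources have the key
    conflicts = {
        k: wikidata_ids[k]
        for k in ol_ids.keys() & wikidata_ids.keys()
        if ol_ids[k] != wikidata_ids[k]
    }
    return merged, conflicts
```

Under this sketch, identifiers present only in Wikidata are filled in, identical values merge silently, and only genuine value clashes end up in the conflicts dict.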
Issues:
I had to add
web.ctx.ip = '127.0.0.1'
because web.ctx.ip is empty when running the script locally, which means the attempt to save author entities gets rejected.
Testing
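To illustrate why that workaround is needed, here is a toy sketch of the failure mode. The `web` module is replaced by a stand-in class so the snippet runs standalone, and `save_author` is hypothetical, not Open Library's actual save path:

```python
class _Ctx:
    """Stand-in for web.py's per-request web.ctx context object."""
    ip = ''

ctx = _Ctx()  # in the real script this is web.ctx

def save_author(doc: dict) -> str:
    """Toy save that mirrors the rejection: saves are attributed to a
    client IP, and outside a real HTTP request that IP is empty."""
    if not ctx.ip:
        raise PermissionError("save rejected: empty client IP")
    return f"saved {doc['key']} from {ctx.ip}"

ctx.ip = '127.0.0.1'  # the workaround: stub a loopback IP before saving
```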
I copied a Wikidata JSON record (which included an OL ID) into the wikidata Postgres table, then added the author with ./copydocs.py. After running backfill_author_identifiers.py, identifiers that existed in the Wikidata JSON but not in the author's remote_ids began to show up on the author's page.
Screenshot
Stakeholders