feat: consolidate author remote_ids and wikidata identifiers #10092

Draft · wants to merge 272 commits into master

Conversation

@pidgezero-one (Contributor) commented Nov 27, 2024

This should be squash-merged to avoid conflicts with #9674, which split off from this PR.

===This is a WIP===

Closes #10029

Consolidate an author's existing known IDs in OL with their known IDs from Wikidata. Identifiers that already exist in OL from other imports take precedence.

Technical

  • Fetching live Wikidata also saves Wikidata's author identifiers to the author's remote_ids (at least any that don't conflict with identifiers already recorded for that author, and only if there aren't too many conflicts; see the sketch after this list)
  • Backfill this operation for all existing authors with backfill_author_identifiers.py
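
The PR mentions a merge_remote_ids() method for this; the sketch below only illustrates the behavior described above, with an invented signature and conflict threshold, not the PR's actual code. OL's existing identifiers win on conflict, and the merge is abandoned when too many identifiers disagree:

```python
def merge_remote_ids(
    existing: dict[str, str],
    from_wikidata: dict[str, str],
    max_conflicts: int = 2,  # illustrative threshold, not the PR's value
) -> dict[str, str] | None:
    """Merge Wikidata-sourced identifiers into an author's remote_ids.

    Identifiers already known to OL take precedence. Returns None
    (no merge) when too many identifiers disagree, so the conflict
    can be flagged for librarian review instead of silently resolved.
    """
    conflicts = [
        key
        for key, value in from_wikidata.items()
        if key in existing and existing[key] != value
    ]
    if len(conflicts) > max_conflicts:
        return None
    merged = dict(from_wikidata)
    merged.update(existing)  # OL's existing identifiers win on conflict
    return merged
```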

TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)

Issues:

  • The backfill script sets web.ctx.ip = '127.0.0.1' because web.ctx.ip is empty when running the script locally, which means the attempt to save author entities would otherwise get rejected (see the sketch below)
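
For reference, the workaround amounts to the following (web.ctx is web.py's per-request storage; it carries no client IP when the code runs outside a real HTTP request):

```python
import web

# web.ctx.ip is empty when the script runs outside a real HTTP request,
# and saving author entities is rejected without a client IP, so stub
# in a loopback address for local runs.
web.ctx.ip = '127.0.0.1'
```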

Testing

I copied a Wikidata JSON record (which included an OL ID) into the wikidata postgres table, and then added the author with ./copydocs.py. I then ran backfill_author_identifiers.py, and identifiers that existed in the Wikidata JSON but not in the author's remote_ids began to show up on the author's page.

Screenshot

Stakeholders

@hornc (Collaborator) commented Nov 28, 2024

@pidgezero-one general comments based on the examples in the description:

  1. I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that's because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way. (See the illustrative record after this list.)

  2. I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata points to, e.g., a book printed in 1922.
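
For illustration only, a minimal edition-shaped record with the archive.org identifier in ocaid might look like this (all values invented, not an actual import payload):

```python
edition = {
    "title": "An Example Book",
    "publish_date": "1922",
    # The archive.org item identifier belongs in ocaid, not an ia_id field.
    "ocaid": "examplebook1922",
    # Provenance is tracked in source_records, e.g. an "ia:" prefix
    # for items imported directly from archive.org.
    "source_records": ["ia:examplebook1922"],
}
```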

@pidgezero-one (Contributor, Author)

@hornc

I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that's because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?

I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata points to, e.g., a book printed in 1922.

I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?

@hornc (Collaborator) commented Dec 3, 2024

@pidgezero-one I'm sorry, but I think this PR is getting out of hand with scope creep from what seems to be the feature that prompted it. It seems to be building on features that have not been implemented yet. There's a method named merge_remote_ids(), but it's not yet possible to import author IDs; that's #9448.

I think focusing on the original issues and implementing those is needed first, and this PR should go back to the planning stage (its main purpose isn't clear to me).

I was going to apologise in case I'm commenting on the wrong PR, but it looks like the file content changed while I was typing this, so I don't know :) I'm getting confused -- the code changes are creeping away from the original issues; that's the problem I want to highlight.

I think you should focus on and deliver either issue #9671 (which I don't think even needs changes to author identifiers for a working MVP) or #9448, which is a well-defined feature that will help direct further discussion on what might be needed next for author identifiers. That would keep the PRs clearer and easier to review.

EDIT: GitHub made me choose the wrong issues by mixing PRs in with the issue numbers -- should be correct now. I'm trying to link to the issues that define the features.

@pidgezero-one pidgezero-one changed the title feat: consolidate remote ids and wikisource identifiers feat: consolidate author remote ids and wikisource identifiers Dec 3, 2024
@pidgezero-one pidgezero-one changed the title feat: consolidate author remote ids and wikisource identifiers feat: consolidate author remote_ids and wikisource identifiers Dec 3, 2024
@pidgezero-one (Contributor, Author) commented Dec 3, 2024

@hornc When I was first testing the output of the Wikisource script, I noticed that its records were producing duplicated authors on import, which led me to investigate a solution for #9448 to see if it would solve my script's problem. Originally these were part of the same PR because it was easier for me to test that way, but I split off the work for #9448 into PR #10110.

The comments that were left on #9674 back when it included the work in #10110 led me to believe that we have plenty of authors whose remote IDs are not filled out yet, but discussions on Slack showed me that a lot of these identifiers are buried in the wikidata table that we automatically record to.

So I split these three semi-connected problems into three pieces, in this order:

  1. This PR does the busy work of filling out remote_ids from identifiers we already have in the Wikidata table but that haven't made their way into the author's remote_ids (pulling production data locally, I found instances of this pretty much immediately). This relies only on code that already exists. If an author has no remote_ids but we have a Wikidata JSON for them that includes an isni id, for example, it'll extract that info from the wikidata table and populate the author's remote_ids with it. There's also the possibility of the opposite for some authors, where we already have remote IDs and they conflict with what we got from Wikidata, so merge_remote_ids judges that (and ideally would raise this to librarians, something Freso suggested).
  2. [WIP] feat: use author known identifiers in import API #10110 compares incoming import author records to author remote_ids. This won't be very effective if too many remote_ids are missing on our existing authors to begin with, because we can't compare against remote_ids that our author objects don't have. That's the impetus for feat: consolidate author remote_ids and wikidata identifiers #10092: it sets us up to reduce the likelihood of the import API getting false-negative author matches caused by missing remote_ids that shouldn't be missing in the first place. This should also use merge_remote_ids to detect whether an import record conflicts too much with existing remote_ids (see the sketch after this list).
  3. Import records produced by feat: import books from Wikisource #9674 can then face a severely reduced likelihood of duplicating authors. Other import sources can be updated in the future to also take advantage of [WIP] feat: use author known identifiers in import API #10110.
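
As a rough illustration of the matching idea in point 2 (a hypothetical helper, not the code in #10110): an incoming import author would first be checked against existing authors by shared identifiers, falling back to name/date matching only when nothing matches by identifier.

```python
def match_by_remote_ids(
    incoming: dict[str, str], candidates: list[dict]
) -> dict | None:
    """Return the first candidate author that shares any remote
    identifier (e.g. isni, wikidata) with the incoming import record.

    Hypothetical helper for illustration; real matching would also
    need to handle conflicting identifiers, not just matching ones.
    """
    for author in candidates:
        existing = author.get("remote_ids", {})
        if any(existing.get(key) == value for key, value in incoming.items()):
            return author
    return None
```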

You're right that #10110 and #9674 can stand alone as features without this PR, but I opened this PR to set those features up to be as effective as possible.

Does this make more sense?

@pidgezero-one pidgezero-one changed the title feat: consolidate author remote_ids and wikisource identifiers feat: consolidate author remote_ids and wikidata identifiers Dec 3, 2024
@hornc (Collaborator) commented Dec 3, 2024

@pidgezero-one Thanks for breaking down the history, that helps.

  1. still confuses me, because I'm not sure what relationship openlibrary/core/wikidata.py has with the Open Library import process (as far as I know it is totally independent). I can't tell if this PR is adding another import-like side channel (I think it is), or whether an import-like side channel was already added for some Wikidata-sourced values. I didn't like the idea of creating a 'Wikidata table' in OL, and I don't think it has been fully justified.

To be clear, I don't like the idea that author identifiers can be imported separately through some direct mechanism outside the existing APIs, because it'll be a nightmare to document and maintain. What's worse, I can't even tell by looking at the existing code or documentation whether it's already been done or not. I don't see the value in having a special 'Wikidata-sourced side import channel' when there are already general-purpose APIs to import and modify existing records.

This is why I think such changes need planning and discussion. As it is, I don't see a clear purpose to the changes in the PR, and don't really know how to evaluate them. (Other than don't add new single source import channels)

  2. is dependent on Import endpoint should allow for any (known) author identifiers #9448 to expand the import schema to accept author identifiers in the first place. "Compares incoming import author records to author remote_ids" is jumping the gun considerably, and I'm worried it's based on assumptions about the import process that have not been made explicit, and may miss some of the general requirements for OL imports. It does seem surprising that OL doesn't use author identifiers to match authors, but it hasn't been a problem that has been explicitly described in well over a decade of importing books. To make the change it'd help to have some clear examples of what is wrong with the existing import process and what was expected.

The problems you noticed with "the output of the Wikisource script, I noticed that its records were producing duplicated authors on import" are not described anywhere, so it's hard to comment on expectations and what is needed to fix it. I think there should be a new issue with existing-import format examples and results, along with your expectations for discussion and coming up with a suitable solution. There could be problems with the input format, but the basic author-matching feature is broader than just Wikisource, so there are probably other factors to consider.

  3. This core feature, import books from Wikisource, could just be implemented as an atomic MVP using the existing book import format, which would make it easier to review. Any shortcomings raised by a new import source could then be made very clear and we can work on making the entire import process better. The current multiple interrelated drafts are confusing things and priorities.

FWIW, I think #9448 (accept author identifiers on import, just for basic display purposes at minimum) is the most valuable. 2, using author identifiers to disambiguate authors, is probably the most work, but I'd need to see some clear demonstrations of the problem first. I feel that a correct solution might be more subtle than it seems, have other implications, or involve assumptions that need clarification. It may not add as much value as it first seems either, but examples will hopefully demonstrate that either way.

@pidgezero-one (Contributor, Author) commented Dec 3, 2024

@hornc

I can't tell if this PR is adding another import-like side channel (I think it is), or whether an import-like side channel was already added for some Wikidata-sourced values. I didn't like the idea of creating a 'Wikidata table' in OL, and I don't think it has been fully justified.

Basically, this was it. It seemed wasteful to me not to make use of data that we already had, so I was proposing with this PR that we could use it to improve author matching. My next step was to test it and then describe what it accomplishes more thoroughly, so that I'd have something to show in a discussion that probably would have happened on Slack or during open mic, but I didn't get that far before this comment thread started. If there's disagreement on whether we should be storing that Wikidata data at all, then that's fine, I can close this one. Thanks for the context on that.

"compares incoming import author records to author remote_ids" is jumping the gun considerably, and I'm worried it's based on assumptions about the import process that have not be made explicit, and may miss some of the general requirements for OL imports. It does seem surprising that OL doesn't use author identifiers to match authors, but it hasn't been a problem that has been explicitly described in well over a decade of importing books. To make the change it'd help to have some clear examples of what is wrong with the existing import process and what was expected.

FWIW, it's based on Freso's description of this proposed mechanism in #9448, starting at "...and then of course have the importer pipeline actually recognise the author objects’ identifier(s) and use it for matching against existing OL authors." Ze offers a pretty thorough breakdown of how this should work and compares it to MusicBrainz's similar process as an example, although I based the logic in my draft more on hir suggestions in #10029, which is more recent. Currently, authors attached to books imported at /api/import are only matched by exact name and birth/death date, which Tom Morris says is insufficient in the comments of #9674. I can provide some examples from my local testing when I have time to sit down with it later this week (only because it's 1 am here :) ).
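
To make that failure mode concrete (records invented for illustration): exact-name-plus-dates matching misses the same author recorded under different name forms, even when both records carry the same strong identifier.

```python
existing_author = {
    "name": "Lucy Maud Montgomery",
    "birth_date": "1874",
    "remote_ids": {"viaf": "00000000"},  # invented value
}
incoming_author = {
    "name": "L. M. Montgomery",
    "birth_date": "1874",
    "remote_ids": {"viaf": "00000000"},  # same identifier, invented value
}

# Exact-name matching sees two different authors...
assert existing_author["name"] != incoming_author["name"]
# ...even though the shared VIAF ID shows they are the same person.
assert existing_author["remote_ids"] == incoming_author["remote_ids"]
```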

There could be problems with the input format, but the basic author-matching feature is broader than just Wikisource, so there are probably other factors to consider.

To be clear, I didn't especially base #10110 on Wikisource, which is mostly why I split it out into a separate piece. My proposed schema update is based on the author identifiers we currently store in OL, which defines the dict keys of remote_ids, so my draft code for author importing follows suit. I just formatted the data output by my Wikisource/Wikidata script to match that, because it was right there on my PC for me to use for generating sample records while I was first learning about how the import API works. Once I had a working POC for #9448, I thought it'd be worth at least seeing if it'd be worthwhile to offer stronger matching for Wikisource right out the gate, but it's easy enough to revert if not.

@cdrini (Collaborator) commented Dec 5, 2024

FWIW, I think #9448 (accept author identifiers on import, just for basic display purposes at minimum) is the most valuable. 2, using author identifiers to disambiguate authors, is probably the most work, but I'd need to see some clear demonstrations of the problem first. I feel that a correct solution might be more subtle than it seems, have other implications, or involve assumptions that need clarification. It may not add as much value as it first seems either, but examples will hopefully demonstrate that either way.

W.r.t. examples, I'll provide some on the issue #9448 to help clarify here, but this is definitely a feature that would be useful for our import pipeline!
