[WIP] feat: use author known identifiers in import API #10110

Draft: wants to merge 11 commits into base: master

pidgezero-one (Contributor) commented Dec 3, 2024

===This is a WIP and is not ready for review===

This should be squash-merged to avoid conflicts with #10092, from which it is split off (I am not a git commit history expert). It depends on #10092 being merged first, because it uses a new author method introduced in that PR.

Model update PR: internetarchive/openlibrary-client#419

Closes #9448
Closes #9411

Technical

  • The import pipeline matches authors by OL ID first, then by remote_ids (which should already have been populated with Wikidata identifiers via backfill), provided there aren't too many conflicts, and finally falls back to the existing logic of matching by name.
  • Import records can use an "identifiers" field in the author dict, which is a dict of the form { "viaf": "blahblahblah", ... }. Related schema update PR: Update author/import models openlibrary-client#419
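
The matching order described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the in-memory AUTHORS list stands in for Infogami, and find_author, count_conflicts, and the max_conflicts threshold are hypothetical names. The PR does not define what "too many conflicts" means, so the threshold and the choice to fall through to name matching when the identifier match is ambiguous are both guesses.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the author store; the real pipeline queries Infogami.
AUTHORS = [
    {
        "key": "/authors/OL15A",
        "name": "Henry L. de Bussigny",
        "remote_ids": {"wikidata": "Q16862522", "viaf": "305913238"},
    },
    {"key": "/authors/OL16A", "name": "Jane Example", "remote_ids": {}},
]


def count_conflicts(incoming: dict, existing: dict) -> int:
    """Count identifier types present on both sides whose values differ."""
    return sum(1 for k, v in incoming.items() if k in existing and existing[k] != v)


def find_author(author: dict, max_conflicts: int = 0) -> Optional[dict]:
    """Match by OL ID, then by remote_ids, then by name (assumed order of fallbacks)."""
    # 1. OL ID, expected at author["key"].
    key = author.get("key")
    if key:
        for existing in AUTHORS:
            if existing["key"] == key:
                return existing

    # 2. Remote identifiers: at least one must agree, and the number of
    #    disagreeing identifiers must stay within the (assumed) threshold.
    remote_ids = author.get("remote_ids", {})
    if remote_ids:
        candidates = [
            existing
            for existing in AUTHORS
            if any(existing["remote_ids"].get(k) == v for k, v in remote_ids.items())
            and count_conflicts(remote_ids, existing["remote_ids"]) <= max_conflicts
        ]
        if len(candidates) == 1:
            return candidates[0]

    # 3. Fall back to matching by name (exact match here, for simplicity).
    for existing in AUTHORS:
        if existing["name"] == author.get("name"):
            return existing
    return None
```

With this sketch, an incoming author carrying only a matching VIAF resolves to the existing record even when the name differs, which mirrors the behaviour tested in the Testing section.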

TODO: unit tests
TODO: how do we flag conflicts to librarians?

The import model is expanded with additional logic that writes to the author's remote_ids when they are present in the incoming JSON object, and searches Infogami by those remote IDs to detect whether the author already exists. The incoming author dict can carry an optional remote_ids field for these (e.g. VIAF is stored in author["remote_ids"]["viaf"]). The exception is the OL ID, which is not a remote identifier and is therefore expected at author["key"].
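
One way the write to remote_ids could behave is sketched below. The function name merge_remote_ids and the policy of never overwriting an existing value are assumptions made for illustration; the PR does not state how differing values are handled, and flagging them to librarians is still an open TODO.

```python
def merge_remote_ids(existing: dict, incoming: dict) -> tuple:
    """Return (merged, conflicts) without mutating either input.

    Assumed policy: add identifiers the author lacks, keep existing values,
    and report any identifier whose incoming value differs.
    """
    merged = dict(existing)
    conflicts = {}
    for id_type, value in incoming.items():
        if id_type not in merged:
            merged[id_type] = value  # new identifier: write it
        elif merged[id_type] != value:
            conflicts[id_type] = (merged[id_type], value)  # (existing, incoming)
    return merged, conflicts


merged, conflicts = merge_remote_ids(
    {"viaf": "305913238"},
    {"viaf": "305913238", "isni": "0000000424758764"},
)
```

Returning the conflicts separately keeps the decision about what to do with them (reject, log, or surface to a librarian) out of the merge step.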

Issues:

  • Importing books returns a 200 success, but the author's page still says 0 works

Testing

Tested using the output from #9674.

To test the import, I wasn't sure how to hit /api/import with user credentials, so I disabled the if not can_write(): condition in openlibrary/plugins/importapi/code.py as well as the if not account_key: condition in openlibrary/catalog/add_book/__init__.py, and copy-pasted the printed JSON records into a Postman request body.
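
As an alternative to Postman, the same request could be built from a script. This is a minimal sketch that assumes the local instance at http://localhost:8080 with the two checks above disabled; it sends no credentials, and build_import_request is a hypothetical helper name. The actual send is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

# Local dev instance and endpoint as used in this PR's testing.
IMPORT_URL = "http://localhost:8080/api/import"


def build_import_request(record: dict) -> urllib.request.Request:
    """Build a POST request carrying the import record as a JSON body."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        IMPORT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


record = {
    "title": "Equitation",
    "source_records": ["wikisource:en:Equitation"],
    "authors": [
        {"name": "Henry L. de Bussigny", "remote_ids": {"viaf": "305913238"}}
    ],
    "publishers": ["Wikisource"],
    "publish_date": "1922",
}
request = build_import_request(record)

# To actually send it (requires the local instance to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```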

Example:

{
    "title": "Equitation",
    "source_records": [
        "wikisource:en:Equitation"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation",
    "publish_date": "1922",
    "authors": [
        {
            "name": "Henry L. de Bussigny",
            "remote_ids": {
                "wikidata": "Q16862522",
                "viaf": "305913238",
                "isni": "0000000424758764",
                "project_gutenberg": "40106"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

This responds successfully with:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "created"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL18M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL9W",
        "status": "created"
    }
}

Viewing this author key at http://localhost:8080/authors/OL15A shows that the strong identifiers were imported correctly:

[Screenshot of the author page]

Editing the author verifies this as well:
[Screenshot of the author edit form]

I then created a test book record whose author uses the same VIAF but a different, placeholder name:

{
    "title": "Equitation test",
    "source_records": [
        "wikisource:en:Equitation_test"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test",
    "publish_date": "1922",
    "authors": [
        {
            "name": "viaf test",
            "remote_ids": {
                "viaf": "305913238"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

The response shows that the author was successfully matched to an existing one by VIAF:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL19M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL10W",
        "status": "created"
    }
}

This also works for OL IDs, which use a slightly different fetch query than the other strong identifiers:

{
    "title": "Equitation test 2",
    "source_records": [
        "wikisource:en:Equitation_test_2"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test_2"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test_2",
    "publish_date": "1922",
    "authors": [
        {
            "name": "author ol_id test",
            "key": "/authors/OL15A"
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

The response shows that the author was matched by its OL key:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL23M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL14W",
        "status": "created"
    }
}
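
Since unit tests are still a TODO, the manual checks above could be expressed as assertions on the response body. The helper below is a hypothetical sketch (author_statuses is not part of the PR); it only relies on the "authors", "key", and "status" fields visible in the responses shown here.

```python
def author_statuses(response: dict) -> dict:
    """Map each author key in an import response to its reported status."""
    return {a["key"]: a["status"] for a in response.get("authors", [])}


# Response body copied from the OL ID test above.
response = {
    "authors": [
        {"key": "/authors/OL15A", "name": "Henry L. de Bussigny", "status": "matched"}
    ],
    "success": True,
    "edition": {"key": "/books/OL23M", "status": "created"},
    "work": {"key": "/works/OL14W", "status": "created"},
}

statuses = author_statuses(response)
assert response["success"] is True
assert statuses["/authors/OL15A"] == "matched"
```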

Importing optional cover images also works:
[Screenshot]

I added support for all identifiers found in identifiers.yml, except for Inventaire, which I couldn't find in Wikidata:
[Screenshot]

Screenshot

Stakeholders
