
feat: import books from Wikisource #9674

Conversation

pidgezero-one
Contributor

@pidgezero-one pidgezero-one commented Aug 1, 2024

This should be squash-merged to avoid conflicts with #10092, which split off from this PR.

This PR currently produces records that include author identifiers as implemented in #10110, because I was trying to solve the problem of duplicate authors being created from this script's records, which led me to learn that a lack of author identifier support on import was already a known problem! But if #10110 does not get merged in, I can revert the piece of this script that gets those identifiers, and we can revisit it at a later date.

Closes #9671

This PR does the following:

  • Adds a script that imports books from Wikisource.

Currently, the script is restricted to English Wikisource (en.wikisource.org) and only fetches works that belong to its "Validated texts" category. However, supporting other categories and languages in the future is trivial: the LangConfig dataclass defines categories and language codes, and the script loops through an array of them (which right now contains only English), so adding new LangConfigs is all that would be needed to expand the script's coverage.
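A minimal sketch of that configuration shape (the field and variable names here are illustrative, not necessarily the exact ones in the script):

```python
from dataclasses import dataclass, field


@dataclass
class LangConfig:
    # Illustrative fields only; the script's actual dataclass may differ.
    langcode: str                 # e.g. "en" for en.wikisource.org
    category: str                 # the Wikisource category to harvest
    excluded_categories: list[str] = field(default_factory=list)


# Expanding coverage to another language or category would just mean adding a config here.
LANG_CONFIGS = [
    LangConfig(
        langcode="en",
        category="Validated texts",
        excluded_categories=["Government documents"],
    ),
]
```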

Technical

The script interfaces with the public and free-to-use Wikisource API and Wikidata API.

At a high level, it does the following in this order:

  1. It uses a Wikidata SPARQL query to fetch every entry in the "Validated texts" category, except those that also fall into other inappropriate categories (orations, essays, etc.).
  2. In batches of 50 (to prevent timeouts), it uses Wikidata SPARQL queries to fetch additional metadata.
  3. It uses Wikidata queries again to fetch strong author identifiers, such as known OL IDs, VIAF, etc. (This is a separate query because including it with book metadata is too strenuous for the API to handle.)
  4. It uses Wikisource API requests to fetch book descriptions, book subjects, and backups for any metadata that wasn't included in Wikidata's responses.
  5. It prints each record to the console as a JSONL object, ignoring any records that belong to excluded Wikisource categories, such as government documents that are explicitly identified as such on Wikisource but not on Wikidata.
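For illustration, here is a hedged sketch of the batching pattern in steps 2 and 3; the endpoint is the public Wikidata query service, but the query body and helper names are placeholders rather than the script's actual code:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
BATCH_SIZE = 50  # small enough to keep the query service from timing out


def fetch_metadata_batch(qids: list[str]) -> list[dict]:
    """Fetch extra metadata for up to 50 Wikidata items in one query."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    query = f"""
    SELECT ?item ?itemLabel ?authorLabel ?publicationDate WHERE {{
      VALUES ?item {{ {values} }}
      OPTIONAL {{ ?item wdt:P50 ?author. }}            # P50 = author
      OPTIONAL {{ ?item wdt:P577 ?publicationDate. }}  # P577 = publication date
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wikisource-import-example/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]


def fetch_all_metadata(qids: list[str]) -> list[dict]:
    results: list[dict] = []
    for i in range(0, len(qids), BATCH_SIZE):
        results.extend(fetch_metadata_batch(qids[i : i + BATCH_SIZE]))
    return results
```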

On a more detailed level:

  • It will wait 10 seconds and try again if either API returns an error that is likely related to overuse.
  • Metadata fetched from the Wikisource API comes from the item's infobox, which is not always defined, so not all metadata will be included for every item.
  • Wikisource's API paginates revision (infobox), category, and other properties separately, even when those properties belong to the same item, so the pagination data in a response can carry multiple continuation values (one per requested data type; in this case, image data and revision/infobox data). This means that subsequent "pages" of API hits might include the same items as previous pages, but with non-overlapping metadata (to see what I mean, use this query). The script therefore keeps consuming Wikisource API results until it reaches the end, updating a book's metadata whenever it finds a new piece of metadata for that book; a sketch of this loop follows this list.
  • This script uses the mwparserfromhell and wikitextparser libraries to parse the contents of a book's infobox, and uses the nameparser library to consistently format every author's name into a format that OL should be guaranteed to understand.
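A rough sketch of the continuation loop and retry behaviour described in the bullets above, assuming the standard MediaWiki `action=query` API; the parameter choices are illustrative, not the script's literal request:

```python
import time

import requests

WIKISOURCE_API = "https://en.wikisource.org/w/api.php"


def harvest_pages(titles: list[str]) -> dict[str, dict]:
    """Keep following the API's 'continue' tokens, merging whatever
    metadata each partial response happens to contain."""
    books: dict[str, dict] = {}
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "revisions|categories|images",  # each prop paginates separately
        "rvprop": "content",
        "rvslots": "main",
    }
    cont: dict = {}
    while True:
        resp = requests.get(WIKISOURCE_API, params={**params, **cont}, timeout=60)
        if resp.status_code in (429, 503):
            time.sleep(10)  # likely throttled: wait and retry, as the script does
            continue
        data = resp.json()
        for page in data.get("query", {}).get("pages", {}).values():
            # Later "pages" may revisit an item with different properties filled
            # in, so update the existing record rather than replacing it.
            books.setdefault(page["title"], {}).update(
                {k: v for k, v in page.items() if k not in ("pageid", "ns", "title")}
            )
        if "continue" not in data:
            break
        cont = data["continue"]
    return books
```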

Unresolved issues and open questions

  • Publish date is a required field in OL imports, but some books are missing it. What should the default value be?
  • Wikisource and Wikidata do not always differentiate books from other types of media, so the import is greedy and relies on specifying categories to exclude rather than include. This is unfortunate because it can't realistically account for all of the possible Wikidata classifications and Wikisource categories (of which there are 27,000 in English) that may need to be excluded in the future. Some false positives also slip through because they aren't categorized as anything explicitly non-book, such as wikisource:en:Address_to_the_Mary_Adelaide_Nurses.
  • There is no mechanism for skipping already-imported works on subsequent runs of this script. This might not be necessary, but it would be nice for performance.

Testing

To create the import records, run docker compose exec -e PYTHONPATH=. web bash and then python ./scripts/providers/import_wikisource.py ./conf/etc/openlibrary.yml. Several hundred records should be output to the console.

I tested this script against the code in this branch, which supports author identifiers in the import pipeline: #10092

Stakeholders

@cdrini @scottbarnes @mekarpeles

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

@pidgezero-one pidgezero-one changed the title from "Add Wikisource import script" to "[WIP] Add Wikisource import script" on Aug 2, 2024
@tfmorris
Contributor

> I could very well modify this to greedily update existing authors with more information when an imported author matches that, unless there are any risks to that approach you can think of.

There are significant risks to doing that because OpenLibrary contains (or did until it went offline) a large number of conflated author records with more being created at an ever increasing rate as Amazon and then BWB dreck was added.

Currently, the import pipeline decides if an author is a match by doing these checks in the following order, and stopping when a match is found:

  1. Exact full name match to any existing author
  2. Exact match to any alternate names of an existing author

Neither of these conditions are sufficient to uniquely identify an author.
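For context, a hedged sketch of that two-step, name-only matching as just described (illustrative, not the actual import_author code):

```python
def find_matching_author(name: str, existing_authors: list[dict]) -> dict | None:
    """Illustrates the described order: stop at the first exact full-name
    match, then fall back to exact matches on alternate names."""
    for author in existing_authors:
        if author.get("name") == name:
            return author
    for author in existing_authors:
        if name in author.get("alternate_names", []):
            return author
    return None
```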

The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.

@hornc should be involved and the main use case of MARC import of records containing VIAF, LCCN, etc identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.

@pidgezero-one
Contributor Author

pidgezero-one commented Oct 16, 2024

> There are significant risks to doing that because OpenLibrary contains (or did until it went offline) a large number of conflated author records with more being created at an ever increasing rate as Amazon and then BWB dreck was added.

That's fine, I suppose the next thing I need to know is how strong author identifiers should be ranked from most to least reliable so it can prioritize appropriately.

> Neither of these conditions are sufficient to uniquely identify an author.

Even if so, that's how /api/import currently works according to the existing import_author code. I assume that's used in production. Would you like me to break this work and my unit tests for it out into a separate PR and tear down name matching to prepare this work for MARC imports?

@tfmorris
Contributor

Also, before starting to use author identifiers, it would make sense to make sure as many matching, non-conflicting, identifiers as possible are imported from Wikidata. It contains a large number of entities with OpenLibrary identifiers where the OL record didn't have a matching Wikidata identifier. Importing/caching VIAF, LCCN, etc identifiers from matching Wikidata entities should also be done so that they can be used for MARC imports.

@pidgezero-one
Contributor Author

pidgezero-one commented Oct 16, 2024

> Also, before starting to use author identifiers, it would make sense to make sure as many matching, non-conflicting, identifiers as possible are imported from Wikidata. It contains a large number of entities with OpenLibrary identifiers where the OL record didn't have a matching Wikidata identifier. Importing/caching VIAF, LCCN, etc identifiers from matching Wikidata entities should also be done so that they can be used for MARC imports.

Sure, I can look into this. The script I wrote here for Wikisource pulls all of that data from Wikidata already, so I could use that as a basis to pull in as much of that data as possible outside of just Wikisource books.

@pidgezero-one
Contributor Author

Moved all import API code changes to #10092, which also consolidates Wikidata identifiers with the identifiers we already have stored in OL.

@pidgezero-one pidgezero-one changed the title from "Import books from Wikisource and use strong author identifiers" to "feat: import books from Wikisource and use strong author identifiers" on Nov 27, 2024

@pidgezero-one pidgezero-one changed the title from "feat: import books from Wikisource and use strong author identifiers" to "feat: import books from Wikisource" on Nov 28, 2024
@hornc
Collaborator

hornc commented Nov 28, 2024

@pidgezero-one I tried to use the Wikidata query here to figure out how many items in Wikidata have (archive.org ids OR openlibrary ids) AND Wikisource pages, but I can't quite figure out how the Wikidata query links to Wikisource. There isn't a wikisource identifier property?

The Wikidata query for books should probably check for P648 (Open Library id) on the book before importing.

Currently the various identifiers on the author are probably over-engineering this. The basic import would be to get the metadata from some list of candidate Wikisource books and convert it into the current import JSON format, which is documented here: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json . I see there are two mediawiki parsers imported and a call to the Wikidata API, and the Wikisource API -- what does the page scraping add that's not available from the two APIs? Why isn't the Wikisource API by itself sufficient to import a book?

@pidgezero-one
Contributor Author

pidgezero-one commented Nov 28, 2024

> @pidgezero-one I tried to use the Wikidata query here to figure out how many items in Wikidata have (archive.org ids OR openlibrary ids) AND Wikisource pages, but I can't quite figure out how the Wikidata query links to Wikisource. There isn't a wikisource identifier property?

The first Wikidata query that runs returns the Wikisource page title, which is queryable against Wikisource's API. Although titles can change in the future, that doesn't matter for the purposes of connecting one query to another at the point in time of running the script, as all WS page titles have to be distinct.
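For context, the standard way a Wikidata SPARQL query exposes a Wikisource page title is via the schema:about / schema:isPartOf sitelink pattern; an illustrative sketch (not the script's literal query):

```python
# Illustrative only: pull the English Wikisource page title for each item, which
# can then be used as the key when calling the Wikisource API.
SITELINK_QUERY = """
SELECT ?item ?wsTitle WHERE {
  ?item wdt:P31 wd:Q571 .                      # instance of: book (placeholder filter)
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://en.wikisource.org/> ;
            schema:name ?wsTitle .
}
LIMIT 10
"""
```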

> The Wikidata query for books should probably check for P648 (Open Library id) on the book before importing.

Why would we want to only import books that are already in OL? Or if the reverse is true and we don't want to import books that are already in OL, why would we not want to import a Wikisource ID for those books, which adds the ability for the user to go there to read it? Am I misunderstanding this suggestion?

> Currently the various identifiers on the author are probably over-engineering this. The basic import would be to get the metadata from some list of candidate Wikisource books and convert it into the current import JSON format, which is documented here: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json .

I believe the records this produces follow the "properties" object in this schema, although I was looking at the /api/import code as a guide to format the output records and test them against that endpoint. That said, why would we not want to attempt to match authors by identifier? I think there are quite a few overlapping open issues regarding this point and I haven't covered all of them in my PR descriptions, but Freso mentioned these two:

#9411
#9448

I was originally trying to see if I could get this issue of author-matching on import to work at all, so I used the output of this script to test it, but I moved all of that work into #10092.

> I see there are two mediawiki parsers imported and a call to the Wikidata API, and the Wikisource API -- what does the page scraping add that's not available from the two APIs? Why isn't the Wikisource API by itself sufficient to import a book?

Wikidata's API is prone to timeouts when queries are too complex, which is why there's one query for retrieving book data and one for retrieving author data.

The later query to Wikisource is to get metadata that either Wikidata does not offer out of the box, or would be too cumbersome for it to include without timing out. Originally this was meant for book cover images, because most WD hits were using .djvu files for this instead of more familiar image formats like .png or .jpg that Wikisource typically uses, but this didn't turn out to be as reliable as we would have liked. Other properties that the Wikisource query parses are book descriptions and subjects, which typically come from the infobox. The last thing it does is use the Wikisource response as a fallback for any metadata that the WD API did not include, because although WD is highly structured, it's sometimes missing information that's written directly into the WS page for the item, so we get the most metadata for each item by scraping both APIs.
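A minimal sketch of that fallback merge, with hypothetical helper and field names:

```python
def merge_metadata(wikidata_record: dict, wikisource_record: dict) -> dict:
    """Prefer Wikidata's structured values, filling any gaps from the
    Wikisource infobox. Hypothetical helper, not the script's actual code."""
    merged = dict(wikisource_record)  # start from the scraped fallbacks
    # Overlay Wikidata values wherever they are actually populated.
    merged.update({k: v for k, v in wikidata_record.items() if v not in (None, "", [])})
    return merged
```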

@hornc
Collaborator

hornc commented Dec 3, 2024

@pidgezero-one I just noticed this comment re. the import format generated by this script:

{"ia_id": "wikisource:en:The_Hunting_of_the_Snark_%281876%29", "data": {"title": "The Hunting of the Snark", "source_records": ["wikisource:en:The_Hunting_of_the_Snark_%281876%29"], "identifiers": {"wikisource": ["en:The_Hunting_of_the_Snark_%281876%29"]}, "languages": ["eng"], "publish_date": "1876", "edition_name": "1", "authors": [{"name": "Carroll, Lewis", "personal_name": "Carroll, Lewis"}], "publishers": ["Macmillan and Co."]}}

The exact edition of the book is on OL, https://openlibrary.org/books/OL20593547M/The_Hunting_of_the_Snark but for some reason we don't have the exact scan on archive.org.

I don't think you need to populate the personal_name field. name should ideally be in natural name order, i.e. Lewis Carroll. (I don't think you need the extra name-processing Python imports -- OL import search should find it either way, but OL prefers the records to be in natural order.)

edition_name probably shouldn't be 1, but that's more of a question for @seabelis; it won't have an effect on matching, just initial imports.

Getting the publish_places in addition to the publisher would be ideal for a book import: ['London'] for this example (it is available at the source).

It seems like you should be able to extract good pagination from the Wikisource records too, since all the pages have been scanned and reviewed.

Using Wikidata to pre-populate entity_type would be helpful for new authors too.

Populating the author dates birth_date and death_date from those available in Wikidata metadata should enable the current author-disambiguation features. In the case of a new author import, it will be more helpful to general library patrons to disambiguate by dates in preference to just a Wikidata or VIAF identifier. By all means import both, but the dates are important for new author records. Does including these dates, which are required for a complete author record on initial import and are generally available for these items, solve the problem that motivated #10029?



def format_contributor(raw_name: str) -> str:
Collaborator

@hornc hornc Dec 3, 2024


I don't think this function is necessary. I believe it is taking a human friendly author name from Wikidata, and manipulating it to reconstruct a human friendly author name. At best it will be a no-op, at worst it will mix up an already good display name, depending on the black-box details of that external module. Reducing unnecessary module imports is good too.

Collaborator

@cdrini cdrini left a comment


Wonderful work on this, @pidgezero-one! We went through a fast code review on a call (and I've done a more thorough code review in the past) and I think this is in a great direction! I will be merging this in since I believe it's in a great position and can be run successfully to generate the JSON dump of books for analysis. This does not mean that the work on this is finished and that we'll be doing a bulk import tout de suite! We still have a few outstanding questions:

  • Do we create a new edition? How do we technically achieve that?
  • Do we specify Wikisource as a publisher? Always? Sometimes?
  • Charles' question about names (although @pidgezero-one had a good reason for this; I'll let her respond in comment)

But I want to make sure things are moving, and right now we have a lot of things which are somewhat interdependent, which is making it hard to get a clear big picture. So merging, and let's create new issues/PRs to address these other questions!

@cdrini cdrini merged commit c232799 into internetarchive:master Dec 5, 2024
3 checks passed