
feat: import books from Wikisource #9674

Conversation

pidgezero-one
Contributor

@pidgezero-one pidgezero-one commented Aug 1, 2024

This should be squash-merged to avoid conflicts with #10092, which split off from this PR.

This PR currently produces records that include author identifiers as implemented in #10110, because I was trying to solve the problem of duplicate authors being created from this script's records, which led me to learn that a lack of author identifier support on import was already a known problem! But if #10110 does not get merged in, I can revert the piece of this script that gets those identifiers, and we can revisit it at a later date.

Closes #9671

This PR does the following:

  • Adds a script that imports books from Wikisource.

Currently, the script is restricted to English Wikisource (en.wikisource.org) and only fetches works that belong to its "Validated texts" category. However, supporting other categories and languages in the future is trivial: the LangConfig dataclass defines categories and language codes, and the script loops through an array of them (which right now contains only English), so adding new LangConfigs is all that would be needed to expand the script's coverage.
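A minimal sketch of that configuration shape (the field and variable names here are illustrative, not necessarily the exact ones in the script):

```python
from dataclasses import dataclass, field


@dataclass
class LangConfig:
    # Illustrative fields only; the script's actual dataclass may differ.
    langcode: str                 # e.g. "en" for en.wikisource.org
    category: str                 # the Wikisource category to harvest
    excluded_categories: list[str] = field(default_factory=list)


# Expanding coverage to another language or category would just mean adding a config here.
LANG_CONFIGS = [
    LangConfig(
        langcode="en",
        category="Validated texts",
        excluded_categories=["Government documents"],
    ),
]
```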

Technical

The script interfaces with the public and free-to-use Wikisource API and Wikidata API.

At a high level, it does the following in this order:

  1. It uses a Wikidata SPARQL query to fetch every entry in the "Validated texts" category, except those that also fall into other inappropriate categories (orations, essays, etc.).
  2. In batches of 50 (to prevent timeouts), it uses Wikidata SPARQL queries to fetch additional metadata.
  3. It uses Wikidata queries again to fetch strong author identifiers, such as known OL IDs, VIAF, etc. (This is a separate query because including it with book metadata is too strenuous for the API to handle.)
  4. It uses Wikisource API requests to fetch book descriptions, book subjects, and backups for any metadata that wasn't included in Wikidata's responses.
  5. It prints each record to the console as a JSONL object, ignoring any records that belong to excluded Wikisource categories, such as government documents that are explicitly identified as such on Wikisource but not on Wikidata.
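For illustration, here is a hedged sketch of the batching pattern in steps 2 and 3; the endpoint is the public Wikidata query service, but the query body and helper names are placeholders rather than the script's actual code:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
BATCH_SIZE = 50  # small enough to keep the query service from timing out


def fetch_metadata_batch(qids: list[str]) -> list[dict]:
    """Fetch extra metadata for up to 50 Wikidata items in one query."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    query = f"""
    SELECT ?item ?itemLabel ?authorLabel ?publicationDate WHERE {{
      VALUES ?item {{ {values} }}
      OPTIONAL {{ ?item wdt:P50 ?author. }}            # P50 = author
      OPTIONAL {{ ?item wdt:P577 ?publicationDate. }}  # P577 = publication date
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wikisource-import-example/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]


def fetch_all_metadata(qids: list[str]) -> list[dict]:
    results: list[dict] = []
    for i in range(0, len(qids), BATCH_SIZE):
        results.extend(fetch_metadata_batch(qids[i : i + BATCH_SIZE]))
    return results
```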

On a more detailed level:

  • It will wait 10 seconds and try again if either API returns an error that is likely related to overuse.
  • Metadata fetched from the Wikisource API comes from the item's infobox, which is not always defined, so not all metadata will be included for every item.
  • Wikisource's API paginates revision (infobox), category, and other properties separately, even when those properties belong to the same item, so the pagination data in a response can carry multiple continuation values (one per requested data type; in this case, image data and revision/infobox data). This means that subsequent "pages" of API hits might include the same items as previous pages, but with non-overlapping metadata (to see what I mean, use this query). The script therefore keeps consuming Wikisource API results until it reaches the end, updating a book's metadata whenever it finds a new piece of metadata for that book; a sketch of this loop follows this list.
  • This script uses the mwparserfromhell and wikitextparser libraries to parse the contents of a book's infobox, and uses the nameparser library to consistently format every author's name into a format that OL should be guaranteed to understand.
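A rough sketch of the continuation loop and retry behaviour described in the bullets above, assuming the standard MediaWiki `action=query` API; the parameter choices are illustrative, not the script's literal request:

```python
import time

import requests

WIKISOURCE_API = "https://en.wikisource.org/w/api.php"


def harvest_pages(titles: list[str]) -> dict[str, dict]:
    """Keep following the API's 'continue' tokens, merging whatever
    metadata each partial response happens to contain."""
    books: dict[str, dict] = {}
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "revisions|categories|images",  # each prop paginates separately
        "rvprop": "content",
        "rvslots": "main",
    }
    cont: dict = {}
    while True:
        resp = requests.get(WIKISOURCE_API, params={**params, **cont}, timeout=60)
        if resp.status_code in (429, 503):
            time.sleep(10)  # likely throttled: wait and retry, as the script does
            continue
        data = resp.json()
        for page in data.get("query", {}).get("pages", {}).values():
            # Later "pages" may revisit an item with different properties filled
            # in, so update the existing record rather than replacing it.
            books.setdefault(page["title"], {}).update(
                {k: v for k, v in page.items() if k not in ("pageid", "ns", "title")}
            )
        if "continue" not in data:
            break
        cont = data["continue"]
    return books
```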

Unresolved issues and open questions

  • Publish date is a required field in OL imports, but some books are missing it. What should the default value be?
  • Wikisource and Wikidata do not always differentiate books from other types of media, so the import is greedy and relies on specifying categories to exclude rather than include. This is unfortunate because it can't realistically account for all of the possible Wikidata classifications and Wikisource categories (of which there are 27,000 in English) that may need to be excluded in the future. Some false positives also slip through because they aren't categorized as anything explicitly non-book, such as wikisource:en:Address_to_the_Mary_Adelaide_Nurses.
  • There is no mechanism for skipping already-imported works on subsequent runs of this script. This might not be necessary, but it would be nice for performance.

Testing

To create the import records, run docker compose exec -e PYTHONPATH=. web bash and then python ./scripts/providers/import_wikisource.py ./conf/etc/openlibrary.yml. Several hundred records should be output to the console.

I tested this script against the code in this branch, which supports author identifiers in the import pipeline: #10092

Stakeholders

@cdrini @scottbarnes @mekarpeles

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

@pidgezero-one pidgezero-one changed the title from "Add Wikisource import script" to "[WIP] Add Wikisource import script" on Aug 2, 2024
@tfmorris
Contributor

> I could very well modify this to greedily update existing authors with more information when an imported author matches that, unless there are any risks to that approach you can think of.

There are significant risks to doing that because OpenLibrary contains (or did until it went offline) a large number of conflated author records with more being created at an ever increasing rate as Amazon and then BWB dreck was added.

Currently, the import pipeline decides if an author is a match by doing these checks in the following order, and stopping when a match is found:

  1. Exact full name match to any existing author
  2. Exact match to any alternate names of an existing author

Neither of these conditions are sufficient to uniquely identify an author.
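For context, a hedged sketch of that two-step, name-only matching as just described (illustrative, not the actual import_author code):

```python
def find_matching_author(name: str, existing_authors: list[dict]) -> dict | None:
    """Illustrates the described order: stop at the first exact full-name
    match, then fall back to exact matches on alternate names."""
    for author in existing_authors:
        if author.get("name") == name:
            return author
    for author in existing_authors:
        if name in author.get("alternate_names", []):
            return author
    return None
```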

The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.

@hornc should be involved and the main use case of MARC import of records containing VIAF, LCCN, etc identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.

@pidgezero-one
Contributor Author

pidgezero-one commented Oct 16, 2024

> There are significant risks to doing that because OpenLibrary contains (or did until it went offline) a large number of conflated author records with more being created at an ever increasing rate as Amazon and then BWB dreck was added.

That's fine, I suppose the next thing I need to know is how strong author identifiers should be ranked from most to least reliable so it can prioritize appropriately.

> Neither of these conditions are sufficient to uniquely identify an author.

Even if so, that's how /api/import currently works according to the existing import_author code. I assume that's used in production. Would you like me to break this work and my unit tests for it out into a separate PR and tear down name matching to prepare this work for MARC imports?

@tfmorris
Contributor

Also, before starting to use author identifiers, it would make sense to make sure as many matching, non-conflicting, identifiers as possible are imported from Wikidata. It contains a large number of entities with OpenLibrary identifiers where the OL record didn't have a matching Wikidata identifier. Importing/caching VIAF, LCCN, etc identifiers from matching Wikidata entities should also be done so that they can be used for MARC imports.

@pidgezero-one
Contributor Author

pidgezero-one commented Oct 16, 2024

> Also, before starting to use author identifiers, it would make sense to make sure as many matching, non-conflicting, identifiers as possible are imported from Wikidata. It contains a large number of entities with OpenLibrary identifiers where the OL record didn't have a matching Wikidata identifier. Importing/caching VIAF, LCCN, etc identifiers from matching Wikidata entities should also be done so that they can be used for MARC imports.

Sure, I can look into this. The script I wrote here for Wikisource pulls all of that data from Wikidata already, so I could use that as a basis to pull in as much of that data as possible outside of just Wikisource books.

@pidgezero-one
Contributor Author

Moved all import API code changes to #10092, which also consolidates Wikidata identifiers with the identifiers we already have stored in OL.

@pidgezero-one pidgezero-one changed the title from "Import books from Wikisource and use strong author identifiers" to "feat: import books from Wikisource and use strong author identifiers" on Nov 27, 2024

@pidgezero-one pidgezero-one changed the title from "feat: import books from Wikisource and use strong author identifiers" to "feat: import books from Wikisource" on Nov 28, 2024
@hornc
Collaborator

hornc commented Nov 28, 2024

@pidgezero-one I tried to use the Wikidata query here to figure out how many items in Wikidata have (archive.org ids OR openlibrary ids) AND Wikisource pages, but I can't quite figure out how the Wikidata query links to Wikisource. There isn't a wikisource identifier property?

The Wikidata query for books should probably check for P648 (Open Library id) on the book before importing.

Currently the various identifiers on the author are probably over-engineering this. The basic import would be to get the metadata from some list of candidate Wikisource books and convert it into the current import JSON format, which is documented here: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json . I see there are two mediawiki parsers imported and a call to the Wikidata API, and the Wikisource API -- what does the page scraping add that's not available from the two APIs? Why isn't the Wikisource API by itself sufficient to import a book?

@pidgezero-one
Contributor Author

pidgezero-one commented Nov 28, 2024

> @pidgezero-one I tried to use the Wikidata query here to figure out how many items in Wikidata have (archive.org ids OR openlibrary ids) AND Wikisource pages, but I can't quite figure out how the Wikidata query links to Wikisource. There isn't a wikisource identifier property?

The first Wikidata query that runs returns the Wikisource page title, which is queryable against Wikisource's API. Although titles can change in the future, that doesn't matter for the purposes of connecting one query to another at the point in time of running the script, as all WS page titles have to be distinct.
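For context, the standard way a Wikidata SPARQL query exposes a Wikisource page title is via the schema:about / schema:isPartOf sitelink pattern; an illustrative sketch (not the script's literal query):

```python
# Illustrative only: pull the English Wikisource page title for each item, which
# can then be used as the key when calling the Wikisource API.
SITELINK_QUERY = """
SELECT ?item ?wsTitle WHERE {
  ?item wdt:P31 wd:Q571 .                      # instance of: book (placeholder filter)
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://en.wikisource.org/> ;
            schema:name ?wsTitle .
}
LIMIT 10
"""
```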

> The Wikidata query for books should probably check for P648 (Open Library id) on the book before importing.

Why would we want to only import books that are already in OL? Or if the reverse is true and we don't want to import books that are already in OL, why would we not want to import a Wikisource ID for those books, which adds the ability for the user to go there to read it? Am I misunderstanding this suggestion?

> Currently the various identifiers on the author are probably over-engineering this. The basic import would be to get the metadata from some list of candidate Wikisource books and convert it into the current import JSON format, which is documented here: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json .

I believe the records this produces follow the "properties" object in this schema, although I was looking at the /api/import code as a guide to format the output records and test them against that endpoint. That said, why would we not want to attempt to match authors by identifier? I think there are quite a few overlapping open issues regarding this point and I haven't covered all of them in my PR descriptions, but Freso mentioned these two:

#9411
#9448

I was originally trying to see if I could get this issue of author-matching on import to work at all, so I used the output of this script to test it, but I moved all of that work into #10092.

> I see there are two mediawiki parsers imported and a call to the Wikidata API, and the Wikisource API -- what does the page scraping add that's not available from the two APIs? Why isn't the Wikisource API by itself sufficient to import a book?

Wikidata's API is prone to timeouts when queries are too complex, which is why there's one query for retrieving book data and one for retrieving author data.

The later query to Wikisource is to get metadata that either Wikidata does not offer out of the box, or would be too cumbersome for it to include without timing out. Originally this was meant for book cover images, because most WD hits were using .djvu files for this instead of more familiar image formats like .png or .jpg that Wikisource typically uses, but this didn't turn out to be as reliable as we would have liked. Other properties that the Wikisource query parses are book descriptions and subjects, which typically come from the infobox. The last thing it does is use the Wikisource response as a fallback for any metadata that the WD API did not include, because although WD is highly structured, it's sometimes missing information that's written directly into the WS page for the item, so we get the most metadata for each item by scraping both APIs.
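A minimal sketch of that fallback merge, with hypothetical helper and field names:

```python
def merge_metadata(wikidata_record: dict, wikisource_record: dict) -> dict:
    """Prefer Wikidata's structured values, filling any gaps from the
    Wikisource infobox. Hypothetical helper, not the script's actual code."""
    merged = dict(wikisource_record)  # start from the scraped fallbacks
    # Overlay Wikidata values wherever they are actually populated.
    merged.update({k: v for k, v in wikidata_record.items() if v not in (None, "", [])})
    return merged
```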

@hornc
Collaborator

hornc commented Dec 3, 2024

@pidgezero-one I just noticed this comment re. the import format generated by this script:

{"ia_id": "wikisource:en:The_Hunting_of_the_Snark_%281876%29", "data": {"title": "The Hunting of the Snark", "source_records": ["wikisource:en:The_Hunting_of_the_Snark_%281876%29"], "identifiers": {"wikisource": ["en:The_Hunting_of_the_Snark_%281876%29"]}, "languages": ["eng"], "publish_date": "1876", "edition_name": "1", "authors": [{"name": "Carroll, Lewis", "personal_name": "Carroll, Lewis"}], "publishers": ["Macmillan and Co."]}}

The exact edition of the book is on OL, https://openlibrary.org/books/OL20593547M/The_Hunting_of_the_Snark but for some reason we don't have the exact scan on archive.org.

I don't think you need to populate the personal_name field. name should ideally be in natural name order, i.e. Lewis Carroll. (I don't think you need the extra name-processing Python imports -- OL import search should find it either way, but OL prefers the records to be in natural order.)

edition_name probably shouldn't be 1, but that's more of a question for @seabelis; it won't have an effect on matching, just initial imports.

Getting the publish_places in addition to the publisher would be ideal for a book import: ['London'] for this example (it is available at the source).

It seems like you should be able to extract good pagination from the Wikisource records too, since all the pages have been scanned and reviewed.

Using Wikidata to pre-populate entity_type would be helpful for new authors too.

Populating the author dates birth_date and death_date from those available in Wikidata metadata should enable the current author-disambiguation features. In the case of a new author import, it will be more helpful to general library patrons to disambiguate by dates in preference to just a Wikidata or VIAF identifier. By all means import both, but the dates are important for new author records. Does including these dates, which are required for a complete author record on initial import and are generally available for these items, solve the problem that motivated #10029?



def format_contributor(raw_name: str) -> str:
Collaborator

@hornc hornc Dec 3, 2024


I don't think this function is necessary. I believe it is taking a human friendly author name from Wikidata, and manipulating it to reconstruct a human friendly author name. At best it will be a no-op, at worst it will mix up an already good display name, depending on the black-box details of that external module. Reducing unnecessary module imports is good too.

Collaborator

@cdrini cdrini left a comment


Wonderful work on this, @pidgezero-one! We went through a fast code review on a call (and I've done a more thorough code review in the past) and I think this is in a great direction! I will be merging this in since I believe it's in a great position and can be run successfully to generate the JSON dump of books for analysis. This does not mean that the work on this is finished and that we'll be doing a bulk import tout de suite! We still have a few outstanding questions:

  • Do we create a new edition? How do we technically achieve that?
  • Do we specify Wikisource as a publisher? Always? Sometimes?
  • Charles' question about names (although @pidgezero-one had a good reason for this; I'll let her respond in comment)

But I want to make sure things are moving, and right now we have a lot of things which are somewhat interdependent, which is making it hard to get a clear big picture. So merging, and let's create new issues/PRs to address these other questions!

@cdrini cdrini merged commit c232799 into internetarchive:master Dec 5, 2024
3 checks passed