feat: import books from Wikisource #9674
Conversation
There are significant risks to doing that, because OpenLibrary contains (or did until it went offline) a large number of conflated author records, with more being created at an ever-increasing rate as Amazon and then BWB dreck was added.

Neither of these conditions is sufficient to uniquely identify an author. The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about Wikisource importing. @hornc should be involved, and the main use case of MARC import of records containing VIAF, LCCN, etc. identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like Wikisource.
That's fine. I suppose the next thing I need to know is how strong author identifiers should be ranked from most to least reliable, so the script can prioritize them appropriately.
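For concreteness, here is a minimal sketch of the kind of prioritization I have in mind; the identifier keys and their ordering below are placeholders, since that ordering is exactly what I'm asking about:

```python
# Placeholder ordering of strong author identifiers, most reliable first.
# The real ordering is the open question; these keys are only examples.
IDENTIFIER_PRIORITY = ["viaf", "isni", "lc_naf", "wikidata"]

def strongest_identifier(remote_ids: dict[str, str]) -> tuple[str, str] | None:
    """Return the highest-priority identifier present on an incoming author record."""
    for key in IDENTIFIER_PRIORITY:
        if key in remote_ids:
            return key, remote_ids[key]
    return None
```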
Even if so, that's how /api/import currently works according to the existing import_author code. I assume that's used in production. Would you like me to break this work and my unit tests for it out into a separate PR, and tear down name matching to prepare this work for MARC imports?
Also, before starting to use author identifiers, it would make sense to make sure as many matching, non-conflicting identifiers as possible are imported from Wikidata. It contains a large number of entities with OpenLibrary identifiers where the OL record didn't have a matching Wikidata identifier. Importing/caching VIAF, LCCN, etc. identifiers from matching Wikidata entities should also be done so that they can be used for MARC imports.
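A rough sketch of how that harvest could start, using the public Wikidata SPARQL endpoint (P648 is the Open Library ID property, P214 is VIAF, P244 is the LC authority ID; the query shape and batching here are only illustrative):

```python
import requests

# Sketch: find Wikidata entities that already carry an Open Library ID (P648)
# and pull their VIAF (P214) and LC authority (P244) identifiers so they can be
# cached for later MARC imports. Only the endpoint and property IDs are
# standard Wikidata; everything else is illustrative.
SPARQL = """
SELECT ?item ?olid ?viaf ?lccn WHERE {
  ?item wdt:P648 ?olid .
  OPTIONAL { ?item wdt:P214 ?viaf . }
  OPTIONAL { ?item wdt:P244 ?lccn . }
}
LIMIT 100
"""

def fetch_identifier_batch() -> list[dict]:
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "openlibrary-identifier-harvest-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```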
Sure, I can look into this. The script I wrote here for Wikisource pulls all of that data from Wikidata already, so I could use that as a basis to pull in as much of that data as possible outside of just Wikisource books.
Moved all import API code changes to #10092, which also consolidates Wikidata identifiers with the identifiers we already have stored in OL. |
imdb: /^\w{2}\d+$/,
opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
@@ -392,4 +324,4 @@ def test_birth_and_death_date_match_is_on_year_strings(self, mock_site):
            "death_date": "November 1910",
        }
        found = import_author(searched_author)
        assert found.key == author["key"]
@@ -314,4 +295,4 @@ def build_query(rec: dict[str, Any]) -> dict[str, Any]:
                book[k] = {'type': t, 'value': v}
            else:
                book[k] = v
    return book
@pidgezero-one I tried to use the Wikidata query here to figure out how many items in Wikidata have (archive.org ids OR openlibrary ids) AND Wikisource pages, but I can't quite figure out how the Wikidata query links to Wikisource. There isn't a wikisource identifier property?

The Wikidata query for books should probably check for P648 (Open Library id) on the book before importing. Currently, the various identifiers on the author are probably over-engineering this.

The basic import would be to get the metadata from some list of candidate Wikisource books and convert it into the current import JSON format, which is documented here: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json

I see there are two MediaWiki parsers imported, plus calls to both the Wikidata API and the Wikisource API -- what does the page scraping add that's not available from the two APIs? Why isn't the Wikisource API by itself sufficient to import a book?
The first Wikidata query that runs returns the Wikisource page title, which is queryable against Wikisource's API. Although titles can change in the future, that doesn't matter for the purposes of connecting one query to another at the point in time of running the script, as all WS page titles have to be distinct.
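Concretely, the Wikidata-to-Wikisource link is through sitelinks rather than an identifier property. A simplified sketch of the two hops (the script's real queries are more involved, and the field handling here is only illustrative):

```python
import requests

# Hop 1: Wikidata sitelinks give the English Wikisource page title for an item.
# There is no "Wikisource identifier" property; the link is the sitelink itself.
SITELINK_QUERY = """
SELECT ?item ?title WHERE {
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://en.wikisource.org/> ;
            schema:name ?title .
}
LIMIT 10
"""

def wikisource_titles() -> list[str]:
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SITELINK_QUERY, "format": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    return [b["title"]["value"] for b in resp.json()["results"]["bindings"]]

# Hop 2: each title is then queryable against the Wikisource API, e.g. to fetch
# the page wikitext for infobox parsing.
def wikisource_page(title: str) -> dict:
    resp = requests.get(
        "https://en.wikisource.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```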
Why would we want to only import books that are already in OL? Or if the reverse is true and we don't want to import books that are already in OL, why would we not want to import a Wikisource ID for those books, which adds the ability for the user to go there to read it? Am I misunderstanding this suggestion?
I believe the records this produces follow the "properties" object in this schema, although I was looking at the /api/import code as a guide to format the output records and test them against that endpoint.

That said, why would we not want to attempt to match authors by identifier? I think there are quite a few overlapping open issues on this point, and I haven't covered all of them in my PR descriptions, but Freso mentioned two of them. I was originally trying to see if I could get this issue of author matching on import to work at all, so I used the output of this script to test it, but I moved all of that work into #10092.
Wikidata's API is prone to timeouts when queries are too complex, which is why there's one query for retrieving book data and one for retrieving author data. The later query to Wikisource gets metadata that either Wikidata does not offer out of the box or would be too cumbersome for it to include without timing out.

Originally this was meant for book cover images, because most WD hits were using .djvu files for this instead of more familiar image formats like .png or .jpg that Wikisource typically uses, but this didn't turn out to be as reliable as we would have liked. Other properties that the Wikisource query parses are book descriptions and subjects, which typically come from the infobox.

The last thing it does is use the Wikisource response as a fallback for any metadata that the WD API did not include, because although WD is highly structured, it's sometimes missing information that's written directly into the WS page for the item, so we get the most metadata for each item by scraping both APIs.
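As a rough illustration of that fallback step (the field names are examples, not the script's actual schema):

```python
# Sketch of the merge: structured Wikidata values win, and anything missing or
# empty falls back to what was scraped from the Wikisource page.
# Field names are illustrative only.
def merge_metadata(wikidata: dict, wikisource: dict) -> dict:
    merged = dict(wikisource)          # scraped Wikisource values as the baseline
    for key, value in wikidata.items():
        if value:                      # prefer Wikidata whenever it has a value
            merged[key] = value
    return merged

book = merge_metadata(
    wikidata={"title": "The Hunting of the Snark", "publish_date": "1876", "description": ""},
    wikisource={"description": "An Agony in 8 Fits", "subjects": ["Nonsense verse"]},
)
# -> title and publish_date come from Wikidata; description and subjects fall back to Wikisource
```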
@pidgezero-one I just noticed this comment re. the import format generated by this script:
The exact edition of the book is on OL, https://openlibrary.org/books/OL20593547M/The_Hunting_of_the_Snark but for some reason we don't have the exact scan on archive.org. I don't think you need to populate the

Getting the

It seems like you should be able to extract good pagination from the Wikisource records too, since all the pages have been scanned and reviewed.

Using Wikidata to pre-populate

Populating author dates
]

def format_contributor(raw_name: str) -> str:
I don't think this function is necessary. I believe it is taking a human friendly author name from Wikidata, and manipulating it to reconstruct a human friendly author name. At best it will be a no-op, at worst it will mix up an already good display name, depending on the black-box details of that external module. Reducing unnecessary module imports is good too.
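To illustrate the concern with a simplified stand-in (this is not the PR's exact code, just roughly what round-tripping a name through nameparser looks like):

```python
from nameparser import HumanName

def format_contributor(raw_name: str) -> str:
    # Simplified stand-in: parse a display name and reassemble it.
    name = HumanName(raw_name)
    return str(name).strip()

# For a name that is already display-ready this is a no-op at best...
print(format_contributor("Lewis Carroll"))   # "Lewis Carroll"
# ...and for names the parser guesses wrong on, pieces can be reordered or
# dropped, depending on the library's internal heuristics.
```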
Wonderful work on this, @pidgezero-one! We went through a fast code review on a call (and I've done a more thorough code review in the past), and I think this is headed in a great direction! I will be merging this in since I believe it's in a great position and can be run successfully to generate the JSON dump of books for analysis. This does not mean that the work on this is finished, or that we'll be doing a bulk import tout de suite! We still have a few outstanding questions:
- Do we create a new edition? How do we technically achieve that?
- Do we specify Wikisource as a publisher? Always? Sometimes?
- Charles' question about names (although @pidgezero-one had a good reason for this; I'll let her respond in a comment)
But I want to make sure things are moving, and right now we have a lot of things which are somewhat interdependent which is making it hard to get a clear big picture. So merging, and let's create new issues/prs to address these other questions!
This should be squash-merged to avoid conflicts with #10092, which split off from this PR.
This PR currently produces records that include author identifiers as implemented in #10110, because I was trying to solve the problem of duplicate authors being created from this script's records, which led me to learn that a lack of author identifier support on import was already a known problem! But if #10110 does not get merged in, I can revert the piece of this script that gets those identifiers, and we can revisit it at a later date.
Closes #9671
This PR adds a script that generates Open Library import records for books on Wikisource.
Currently, the script is restricted to English Wikisource (en.wikisource.org) and only fetches works that belong to its "Validated texts" category. However, supporting other categories and other languages in the future is trivial: the `LangConfig` dataclass defines categories and language codes, and the script loops through an array of them (which right now just contains English), so adding new `LangConfig`s is all that would be needed to expand the script's coverage.
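A rough sketch of the shape of that configuration (the field names here are assumptions; see the script for the actual definition):

```python
from dataclasses import dataclass

# Illustrative only -- the real LangConfig in import_wikisource.py may differ.
@dataclass
class LangConfig:
    langcode: str        # Wikisource language subdomain, e.g. "en"
    category: str        # category of works to pull, e.g. "Validated_texts"

# The script iterates over a list like this; adding entries widens coverage.
LANG_CONFIGS = [
    LangConfig(langcode="en", category="Validated_texts"),
]
```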
Technical

The script interfaces with the public and free-to-use Wikisource API and Wikidata API. At a high level, it queries Wikidata for validated English Wikisource texts and their authors, fetches each work's Wikisource page for additional metadata, and converts the combined results into import records.
On a more detailed level, the script uses the `mwparserfromhell` and `wikitextparser` libraries to parse the contents of a book's infobox, and uses the `nameparser` library to consistently format every author's name into a format that OL should be guaranteed to understand.
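For a sense of what the infobox parsing step looks like, a minimal sketch using mwparserfromhell (the template and parameter names are assumptions; the script's real logic is more thorough):

```python
import mwparserfromhell

def parse_header_fields(wikitext: str) -> dict[str, str]:
    """Pull a few example fields out of a Wikisource {{header}} template."""
    fields: dict[str, str] = {}
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if str(template.name).strip().lower() != "header":
            continue
        for param in ("title", "author", "notes"):
            if template.has(param):
                fields[param] = template.get(param).value.strip_code().strip()
    return fields

sample = "{{header | title = The Hunting of the Snark | author = Lewis Carroll | notes = An Agony in 8 Fits }}"
print(parse_header_fields(sample))
```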
Unresolved issues and open questions

Imported works are identified with a Wikisource identifier such as `wikisource:en:Address_to_the_Mary_Adelaide_Nurses`.

Testing
To create the import records, run `docker compose exec -e PYTHONPATH=. web bash` and then `python ./scripts/providers/import_wikisource.py ./conf/etc/openlibrary.yml`. Several hundred records should be output to the console.

I tested this script against the code in this branch, which supports author identifiers in the import pipeline: #10092
Stakeholders
@cdrini @scottbarnes @mekarpeles
Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.