feat: import books from Wikisource #9674

Merged
Changes from 1 commit
161 commits
19d9874
first draft
pidgezero-one Aug 1, 2024
f23baad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
34667e7
linting
pidgezero-one Aug 1, 2024
17575bf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
dca89fb
use a class for imports
pidgezero-one Aug 1, 2024
147784a
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 1, 2024
df423ca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
604a5d8
mypy fixes
pidgezero-one Aug 1, 2024
34fe390
merge
pidgezero-one Aug 1, 2024
a307eca
more linting
pidgezero-one Aug 1, 2024
1fe4e84
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
d7f4065
is this deprecated too?
pidgezero-one Aug 1, 2024
bcdcdb1
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 1, 2024
c504f6c
is this deprecated too?
pidgezero-one Aug 1, 2024
a996e64
is this deprecated too?
pidgezero-one Aug 1, 2024
a3c299e
improved data model
pidgezero-one Aug 2, 2024
0f2b113
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
e18ba52
reformat name formatter
pidgezero-one Aug 2, 2024
4bdf428
ruff fix
pidgezero-one Aug 2, 2024
57388b4
improve infobox fetching
pidgezero-one Aug 2, 2024
31b2a7f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
9492403
uncomment
pidgezero-one Aug 2, 2024
dacef92
merge
pidgezero-one Aug 2, 2024
1186696
remove unnecessary print
pidgezero-one Aug 2, 2024
dd82901
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
47b57c8
uncomment imports
pidgezero-one Aug 2, 2024
3c82121
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 2, 2024
5e58373
better template check:
pidgezero-one Aug 2, 2024
2c268b2
publishers?
pidgezero-one Aug 2, 2024
be0d1a8
fix array
pidgezero-one Aug 2, 2024
d52c109
unused import
pidgezero-one Aug 2, 2024
df683a1
different wiki markup strip
pidgezero-one Aug 2, 2024
8e7cb38
reduce image calls
pidgezero-one Aug 2, 2024
66744ef
unstash
pidgezero-one Aug 2, 2024
5c51bcc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
b5e4319
fix newlines
pidgezero-one Aug 2, 2024
34da18d
undo comments
pidgezero-one Aug 2, 2024
76a1724
logger name
pidgezero-one Aug 2, 2024
2ce0e17
fix array typing
pidgezero-one Aug 2, 2024
3645e9e
more cleanup
pidgezero-one Aug 2, 2024
0e9650b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
9663789
dry run outputs to a jsonl file in a gitignored folder
pidgezero-one Aug 6, 2024
8f0810d
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
2a3e617
add this directory
pidgezero-one Aug 6, 2024
0268018
.
pidgezero-one Aug 6, 2024
99f3c93
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Aug 6, 2024
5390b54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
b06b60d
.
pidgezero-one Aug 6, 2024
d7561f8
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
3e10c5d
unicode
pidgezero-one Aug 6, 2024
4bd8e54
remove dry run flag
pidgezero-one Aug 6, 2024
e445276
this produces around 500 records
pidgezero-one Aug 13, 2024
d633c62
wikisource API gives better image results. this script now gets most …
pidgezero-one Aug 13, 2024
4053014
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
e1bd982
undo comments
pidgezero-one Aug 13, 2024
62d1798
clearer comments
pidgezero-one Aug 13, 2024
25d9243
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
bbe242f
formatting
pidgezero-one Aug 13, 2024
15aa2b2
formatting
pidgezero-one Aug 13, 2024
cae28f9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
cb2a959
condense
pidgezero-one Aug 13, 2024
5cb9dc9
more cleanup
pidgezero-one Aug 13, 2024
5370e41
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
b96a07d
more cleanup
pidgezero-one Aug 13, 2024
4d4d091
precommit
pidgezero-one Aug 13, 2024
da00d4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
d116d6e
aaaaa
pidgezero-one Aug 13, 2024
8d83830
more false positives, letter filter literally does not work for reaso…
pidgezero-one Aug 14, 2024
f7e61c0
this is annoying
pidgezero-one Aug 14, 2024
3018a5f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
878051c
ruff
pidgezero-one Aug 14, 2024
9135ff5
ruff
pidgezero-one Aug 14, 2024
113e6a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
42c05d0
cmt
pidgezero-one Aug 14, 2024
e9fef40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 14, 2024
5193267
filters
pidgezero-one Aug 15, 2024
2075f4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 15, 2024
dbac756
fix
pidgezero-one Aug 16, 2024
916c3ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
d1683a5
ruff
pidgezero-one Aug 16, 2024
287bfe0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
4ad37de
aint no way precommit thinks 'pleas' is a typo
pidgezero-one Aug 16, 2024
e8fb019
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d8452d8
comment clarity
pidgezero-one Aug 16, 2024
0d15c54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
db5c4a9
fix publishers
pidgezero-one Aug 16, 2024
3c46c9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
31d84a7
fix WS-side category filtering
pidgezero-one Aug 16, 2024
3dbe7b9
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d6d303e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
aa73b6b
ruff
pidgezero-one Aug 16, 2024
a8fdcfb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
916dabc
clean up some re-request loops
pidgezero-one Aug 16, 2024
bb3c30c
clean up some re-request loops
pidgezero-one Aug 16, 2024
e28c8cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
f6f6c99
addresses most PR comments
pidgezero-one Sep 29, 2024
beaf68c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
e6fe169
precommit
pidgezero-one Sep 29, 2024
408da50
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Sep 29, 2024
e7f714b
fetches more author info, not sure how to format it yet
pidgezero-one Sep 29, 2024
8f01b5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
77b23d8
brackets in wrong placE
pidgezero-one Sep 29, 2024
2aca912
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Oct 12, 2024
7bfc39b
format that works with /import/api
pidgezero-one Oct 13, 2024
461b02a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
147fab3
wip
pidgezero-one Oct 13, 2024
b73b8d1
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
92065ed
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
0750cb2
?
pidgezero-one Oct 13, 2024
5842e13
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
83a00f8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
bc31426
precommit errors
pidgezero-one Oct 13, 2024
2e99560
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
cb14b14
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
8d4dac0
support author id matching
pidgezero-one Oct 13, 2024
7d85a0f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
f1b0edd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
5b57217
can't get it to work
pidgezero-one Oct 13, 2024
0bd52e8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
05f3644
unnecessary changes
pidgezero-one Oct 13, 2024
12b96f4
idk
pidgezero-one Oct 13, 2024
6a8234c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
b9c36e4
it works
pidgezero-one Oct 14, 2024
2fd4569
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
5a8ea7a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
268f055
wip: support more identifiers
pidgezero-one Oct 14, 2024
60e5d00
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
b6b68f3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
e02ae78
fix remote ids
pidgezero-one Oct 14, 2024
d7dd818
fix remote ids
pidgezero-one Oct 14, 2024
ee00b9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
0ecba61
comment
pidgezero-one Oct 14, 2024
f952c4f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
0e2c510
fix wd condition
pidgezero-one Oct 14, 2024
1265681
remote_ids will never be empty in script
pidgezero-one Oct 14, 2024
216cb50
attempt unit tests
pidgezero-one Oct 14, 2024
eb8d3d4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
d298d39
irrelevant
pidgezero-one Oct 14, 2024
8c17a14
update comment
pidgezero-one Oct 14, 2024
8917ae8
unnecessary change
pidgezero-one Oct 14, 2024
a174609
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ad263aa
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ab624f6
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
320be8b
suggested rename
pidgezero-one Oct 16, 2024
f07e5fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 16, 2024
3bfe2d9
why did changing remote_ids to identifiers break tests?
pidgezero-one Oct 16, 2024
a0aaf40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 16, 2024
d97921d
fix import key
pidgezero-one Oct 16, 2024
565853b
identifiers
pidgezero-one Oct 16, 2024
d84c5fd
identifiers
pidgezero-one Oct 16, 2024
f0a1103
move this code
pidgezero-one Nov 27, 2024
8d3fe1f
/
pidgezero-one Nov 27, 2024
eea9276
unused imports
pidgezero-one Nov 27, 2024
aca95e0
this got moved out by accident
pidgezero-one Nov 28, 2024
d3b4767
wtf?
pidgezero-one Nov 28, 2024
2545c12
add another exclusion
pidgezero-one Dec 3, 2024
eb3cb40
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Dec 3, 2024
88648e0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
e04b80c
ruff fix
pidgezero-one Dec 3, 2024
c55261b
just get birth and death dates for authors to use in import matching
pidgezero-one Dec 3, 2024
abb913c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
25 changes: 3 additions & 22 deletions openlibrary/catalog/add_book/load_book.py
@@ -142,8 +142,7 @@ def walk_redirects(obj, seen):
             seen.add(obj['key'])
         return obj

-    # Try for open library ID, then other external identifiers (wikidata, bookbrainz, etc).
-    # If not found, try for an 'exact' (case-insensitive) name match, but fall back to alternate_names,
+    # Try for an 'exact' (case-insensitive) name match, but fall back to alternate_names,
     # then last name with identical birth and death dates (that are not themselves `None` or '').
     name = author["name"].replace("*", r"\*")
     queries = [
@@ -156,13 +155,6 @@ def walk_redirects(obj, seen):
             "death_date~": f"*{extract_year(author.get('death_date', '')) or -1}*",
         },  # Use `-1` to ensure an empty string from extract_year doesn't match empty dates.
     ]
-    if identifiers := author.get("identifiers"):
-        for id in identifiers:
-            queries.insert(
-                0, {"type": "/type/author", f"identifiers.{id}~": identifiers[id]}
-            )
-    if key := author.get("key"):
-        queries.insert(0, {"type": "/type/author", "key~": key})
     for query in queries:
         if reply := list(web.ctx.site.things(query)):
             break
@@ -254,20 +246,9 @@ def import_author(author: dict[str, Any], eastern=False) -> "Author | dict[str,
             new['death_date'] = author['death_date']
         return new
     a = {'type': {'key': '/type/author'}}
-    for f in (
-        'name',
-        'title',
-        'personal_name',
-        'birth_date',
-        'death_date',
-        'date',
-        'remote_ids',
-    ):
+    for f in 'name', 'title', 'personal_name', 'birth_date', 'death_date', 'date':
         if f in author:
             a[f] = author[f]
-    # Import record hitting endpoint should list external IDs under "identifiers", but needs to be "remote_ids" when going into the DB.
-    if "identifiers" in author:
-        a["remote_ids"] = author["identifiers"]
     return a


@@ -314,4 +295,4 @@ def build_query(rec: dict[str, Any]) -> dict[str, Any]:
                 book[k] = {'type': t, 'value': v}
             else:
                 book[k] = v
-    return book
+    return book
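
The lines removed from `load_book.py` above put an Open Library key query first and external identifier queries next, ahead of the name-based queries. A minimal sketch of that ordering follows. It is not the project's code: `build_author_queries` is a hypothetical helper, `extract_year` is a simplified stand-in for the real function, and the first two name-based queries are assumed from the code comment because the diff elides them.

```python
import re
from typing import Any


def extract_year(value: str) -> str:
    """Hypothetical stand-in: first four-digit year found in `value`, or ''."""
    match = re.search(r"\d{4}", value)
    return match.group(0) if match else ""


def build_author_queries(author: dict[str, Any]) -> list[dict[str, str]]:
    """Return author lookup queries, highest priority first."""
    name = author["name"].replace("*", r"\*")
    # `-1` keeps an empty year from matching records with empty dates.
    birth = extract_year(author.get("birth_date", "")) or -1
    death = extract_year(author.get("death_date", "")) or -1
    queries = [
        # Assumed from the code comment: exact name, then alternate names.
        {"type": "/type/author", "name~": name},
        {"type": "/type/author", "alternate_names~": name},
        # Last name plus matching birth and death years.
        {
            "type": "/type/author",
            "name~": f"* {name.split()[-1]}",
            "birth_date~": f"*{birth}*",
            "death_date~": f"*{death}*",
        },
    ]
    # Each identifier query is pushed to the front, ahead of the name queries.
    for id_name, id_value in (author.get("identifiers") or {}).items():
        queries.insert(0, {"type": "/type/author", f"identifiers.{id_name}~": id_value})
    # The Open Library key is inserted last, so it ends up first in the list.
    if key := author.get("key"):
        queries.insert(0, {"type": "/type/author", "key~": key})
    return queries
```

Because the caller stops at the first query that returns a result, the position in this list is the match priority. With the identifier and key blocks removed, as in this commit, only the three name-based queries remain.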

76 changes: 4 additions & 72 deletions openlibrary/catalog/add_book/tests/test_load_book.py
@@ -136,77 +136,9 @@ def test_author_wildcard_match_with_no_matches_creates_author_with_wildcard(
         new_author_name = import_author(author)
         assert author["name"] == new_author_name["name"]

-    def test_first_match_ol_key(self, mock_site):
+    def test_first_match_priority_name_and_dates(self, mock_site):
         """
-        Highest priority match is OL key.
-        """
-        self.add_three_existing_authors(mock_site)
-
-        # Author with VIAF
-        author = {
-            "name": "William H. Brewer",
-            "key": "/authors/OL3A",
-            "type": {"key": "/type/author"},
-            "identifiers": {"viaf": "12345678"},
-        }
-
-        # Another author with VIAF
-        author_different_key = {
-            "name": "William Brewer",
-            "key": "/authors/OL4A",
-            "type": {"key": "/type/author"},
-            "identifiers": {"viaf": "87654321"},
-        }
-
-        mock_site.save(author)
-        mock_site.save(author_different_key)
-
-        # Look for exact match on OL ID, regardless of other fields.
-        # We ideally shouldn't ever have a case where different authors have the same VIAF, but this demonstrates priority.
-        searched_author = {
-            "name": "William H. Brewer",
-            "key": "/authors/OL4A",
-            "identifiers": {"viaf": "12345678"},
-        }
-        found = import_author(searched_author)
-        assert found.key == author_different_key["key"]
-
-    def test_second_match_strong_identifier(self, mock_site):
-        """
-        Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
-        """
-        self.add_three_existing_authors(mock_site)
-
-        # Author with VIAF
-        author = {
-            "name": "William H. Brewer",
-            "key": "/authors/OL3A",
-            "type": {"key": "/type/author"},
-            "identifiers": {"viaf": "12345678"},
-        }
-
-        # Another author with VIAF
-        author_different_viaf = {
-            "name": "William Brewer",
-            "key": "/authors/OL4A",
-            "type": {"key": "/type/author"},
-            "identifiers": {"viaf": "87654321"},
-        }
-
-        mock_site.save(author)
-        mock_site.save(author_different_viaf)
-
-        # Look for exact match on VIAF, regardless of name field.
-        searched_author = {
-            "name": "William Brewer",
-            "identifiers": {"viaf": "12345678"},
-        }
-        found = import_author(searched_author)
-        assert found.key == author["key"]
-
-    def test_third_match_priority_name_and_dates(self, mock_site):
-        """
-        Next highest priority match is name, birth date, and death date.
+        Highest priority match is name, birth date, and death date.
         """
         self.add_three_existing_authors(mock_site)

@@ -269,7 +201,7 @@ def test_non_matching_birth_death_creates_new_author(self, mock_site):
         assert isinstance(found, dict)
         assert found["death_date"] == searched_and_not_found_author["death_date"]

-    def test_match_priority_alternate_names_and_dates(self, mock_site):
+    def test_second_match_priority_alternate_names_and_dates(self, mock_site):
         """
         Matching, as a unit, alternate name, birth date, and death date, get
         second match priority.
@@ -392,4 +324,4 @@ def test_birth_and_death_date_match_is_on_year_strings(self, mock_site):
             "death_date": "November 1910",
         }
         found = import_author(searched_author)
-        assert found.key == author["key"]
+        assert found.key == author["key"]

2 changes: 0 additions & 2 deletions openlibrary/components/AuthorIdentifiers.vue
@@ -35,8 +35,6 @@ const identifierPatterns = {
     lc_naf: /^n[bors]?[0-9]+$/,
     amazon: /^B[0-9A-Za-z]{9}$/,
     youtube: /^@[A-Za-z0-9_\-.]{3,30}/,
-    imdb: /^\w{2}\d+$/,
-    opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
 }
 export default {
     // Props are for external options; if a subelement of this is modified,
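
For readers more at home in Python than in Vue, the identifier patterns visible in this hunk can be exercised with the standard `re` module. This is only a sketch: `IDENTIFIER_PATTERNS` and `is_valid_identifier` are hypothetical names, the sample values in the usage lines are made up to fit the patterns, and `re.ASCII` is used so that `\w`, `\d` and `\D` behave like their JavaScript counterparts. The `imdb` and `opac_sbn` entries are the two patterns this commit removes.

```python
import re

# Patterns copied from the hunk above; imdb and opac_sbn are the removed ones.
IDENTIFIER_PATTERNS = {
    "lc_naf": re.compile(r"^n[bors]?[0-9]+$", re.ASCII),
    "amazon": re.compile(r"^B[0-9A-Za-z]{9}$", re.ASCII),
    # No end anchor in the source pattern, so only the prefix is checked.
    "youtube": re.compile(r"^@[A-Za-z0-9_\-.]{3,30}", re.ASCII),
    "imdb": re.compile(r"^\w{2}\d+$", re.ASCII),
    "opac_sbn": re.compile(r"^\D{2}[A-Z0-3]V\d{6}$", re.ASCII),
}


def is_valid_identifier(name: str, value: str) -> bool:
    """Return True if `value` matches the pattern registered under `name`.

    Raises KeyError for identifier names without a registered pattern.
    """
    return IDENTIFIER_PATTERNS[name].search(value) is not None


print(is_valid_identifier("amazon", "B000AQ5RM0"))
print(is_valid_identifier("lc_naf", "n79021164"))
print(is_valid_identifier("youtube", "@ab"))
```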
13 changes: 4 additions & 9 deletions openlibrary/plugins/importapi/import_edition_builder.py
@@ -100,15 +100,10 @@ def add_list(self, key, val):
             self.edition_dict[key] = [val]

     def add_author(self, key, val):
-        if isinstance(val, dict):
-            author_dict = val
-            if "name" in author_dict:
-                author_dict['personal_name'] = author_dict['name']
-            self.add_list('authors', author_dict)
-        else:
-            self.add_list(
-                'authors', {'personal_name': val, 'name': val, 'entity_type': 'person'}
-            )
+        # We don't know birth_date or death_date.
+        # Should name and personal_name be the same value?
+        author_dict = {'personal_name': val, 'name': val, 'entity_type': 'person'}
+        self.add_list('authors', author_dict)

     def add_illustrator(self, key, val):
         self.add_list('contributions', val + ' (Illustrator)')
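
The two versions of `add_author` in this hunk differ in what they accept: the removed version took either a dict or a plain name string, while the remaining version treats every value as a name string. The sketch below puts both side by side. `EditionBuilderSketch` and `add_author_accepting_dict` are hypothetical names, and the append branch of `add_list` is an assumption, since the diff shows only its final line.

```python
from typing import Any


class EditionBuilderSketch:
    """Hypothetical, stripped-down stand-in for the import edition builder."""

    def __init__(self) -> None:
        self.edition_dict: dict[str, Any] = {}

    def add_list(self, key: str, val: Any) -> None:
        # Assumed behaviour: append to an existing list, else start a new one.
        if key in self.edition_dict:
            self.edition_dict[key].append(val)
        else:
            self.edition_dict[key] = [val]

    def add_author(self, key: str, val: str) -> None:
        # Version kept by this commit: the value is always a name string.
        author_dict = {'personal_name': val, 'name': val, 'entity_type': 'person'}
        self.add_list('authors', author_dict)

    def add_author_accepting_dict(self, key: str, val: Any) -> None:
        # Version removed by this commit: a dict is passed through, with
        # personal_name filled in from name when name is present.
        if isinstance(val, dict):
            author_dict = val
            if "name" in author_dict:
                author_dict['personal_name'] = author_dict['name']
            self.add_list('authors', author_dict)
        else:
            self.add_author(key, val)


builder = EditionBuilderSketch()
builder.add_author("author", "Mark Twain")
builder.add_author_accepting_dict(
    "author", {"name": "William Brewer", "identifiers": {"viaf": "12345678"}}
)
```

The dict form is what allowed extra author fields such as `identifiers` to travel with the import record; the string form carries a name and nothing else.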
3 changes: 0 additions & 3 deletions requirements.txt
@@ -17,8 +17,6 @@ isbnlib==3.10.14
 luqum==0.11.0
 lxml==4.9.4
 multipart==0.2.4
-mwparserfromhell==0.6.6
-nameparser==1.1.3
 Pillow==10.4.0
 psycopg2==2.9.6
 pydantic==2.4.0
@@ -32,4 +30,3 @@ sentry-sdk==1.28.1
 simplejson==3.19.1
 statsd==4.0.1
 validate_email==1.3
-wikitextparser==0.56.1