Skip to content

Latest commit

 

History

History

sparql

measuring

We want to measure the size (number of pages) of novels (i.e., fictional literary works) in the German National Library (DNB).

It is not trivial to extract all novels from a big catalogue like that of the German National library. “Librarians estimate that genre information is present in the expected MARC field for less than a quarter of the volumes in HathiTrust Digital Library,” (Underwood et al. 2013) and we encounter the same problem, which calls for an innovative solution.

Our approach is to

  1. extract a list of writers from Wikidata together with their GND id
  2. download linked data about the DNB books
  3. join the writer list with the list of books using the GND id

This page documents the evolution of this process, which turned out to be not as straightforward as it seems. One reason is the size of the data and the complexity of path traversal queries (more on that later). Therefore, our final solution relies on rather simple SPARQL queries to Wikidata and manual joins of the retrieved datasets.

extracting writers from Wikidata

We query Wikidata’s SPARQL endpoint to get a list of writers. Since SPARQL queries become complex quickly, we build the final query step by step.

We will use the following Wikidata items:

occupation is writer

First, we query for all Wikidata items whose occupation (P106) is writer (Q36180) (SPARQL):

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P106 wd:Q36180 .              # occupation(P106) is writer(Q36180)
  SERVICE wikibase:label {                # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
LIMIT 10
itemitemLabel
wd:Q3737990Fabrizio Corona
wd:Q10329870Miguel Temprano
wd:Q17639667Sébastien Valiela
wd:Q3335George Orwell
wd:Q23434Ernest Hemingway
wd:Q28658John F. Burns
wd:Q29344William T. Vollmann
wd:Q74465Friedrich Wilhelm Hackländer
wd:Q75265Otto von Corvin
wd:Q77475Ludwig Ganghofer

occupation is writer or a subclass

Background: Not all people which we consider to be a “writer” have as occupation property the class writer (Q36180) but instead a subclass of it, for instance poet (Q49757) or science fiction writer (Q18844224) (see the subclasses of writer). Since subclasses can have further subclasses, we need to traverse the whole subclass tree (the transitive closure of the subclass relation) to find all people who can be subsumed as writers. (see transitive subclasses of writer).

Second, we extend this to include subclasses of (P279) writer (Q36180) (SPARQL):

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P106/wdt:P279* wd:Q36180 .   # occupation(P106) is writer(Q36180) or a subclass(P279)
  SERVICE wikibase:label {               # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
LIMIT 10
itemitemLabel
wd:Q3737990Fabrizio Corona
wd:Q10329870Miguel Temprano
wd:Q17639667Sébastien Valiela
wd:Q3335George Orwell
wd:Q23434Ernest Hemingway
wd:Q28658John F. Burns
wd:Q29344William T. Vollmann
wd:Q74465Friedrich Wilhelm Hackländer
wd:Q75265Otto von Corvin
wd:Q77475Ludwig Ganghofer

Let us download the actual data:

curl \
    --header "Accept: text/tab-separated-values" \
    --output wikidata_writer_subclass.tsv \
    --globoff \
     'https://query.wikidata.org/sparql?query=SELECT%20%3Fitem%20%3FitemLabel%0A%20%20WHERE%0A%20%20{%0A%20%20%20%20%3Fitem%20wdt%3AP106%2Fwdt%3AP279*%20wd%3AQ36180%20.%20%20%20%23%20occupation%28P106%29%20is%20writer%28Q36180%29%20or%20a%20subclass%28P279%29%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20{%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20...%20include%20the%20labels%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%0A%20%20%20%20}%0A%20%20}'

checking the result

Let us check the result by comparing its size and searching for Goethe:

wc -l wikidata_writer_subclass.tsv
  • 73796
grep Goethe wikidata_writer_subclass.tsv
http://www.wikidata.org/entity/Q4322238Nicole Van Goethem
http://www.wikidata.org/entity/Q5879Johann Wolfgang von Goethe

That looks good at first sight, but at least Wolfgang Maximilian von Goethe and Johann Caspar Goethe are missing. Furthermore, the following query returns a count of 358703 (on 2016-12-15) instead of 73796:

SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE
{
  ?item wdt:P106/wdt:P279* wd:Q36180 .   # occupation(P106) is writer(Q36180) or a subclass(P279)
  SERVICE wikibase:label {               # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}

We can see the reason for this problem by checking the end of the result file:

grep Exception wikidata_writer_subclass.tsv | head -n1
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.QueryInterruptedException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.bop.engine.QueryTimeoutException: Query deadline is expired.

Not all writers are returned due to a query timeout which is likely caused by the expensive path query to get all subclasses of writer.

occupation is writer or a subclass and has GND id

Background: It is clear by now that the path query will not return all results but for the sake of completeness, we continue with this approach to explain how it works in principle.

Third, we ensure that the item has a GND ID (P227) property (SPARQL):

SELECT ?item ?itemLabel ?gndid
WHERE
{
  ?item wdt:P106/wdt:P279* wd:Q36180 .  # occupation(P106) is writer(Q36180) or a subclass(P279)
  ?item p:P227 ?gndid .                 # ... with a GND ID(P227)
  SERVICE wikibase:label {              # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
LIMIT 10

TODO: FILTER(BOUND(?gndid))?

itemitemLabelgndid
wd:Q42Douglas Adamswd:statement/q42-8AA8CCC1-86CE-4C66-88FC-267621A81EA0
wd:Q272Paul Morandwd:statement/q272-9373E898-F2B0-4BF8-871A-B09A3A055577
wd:Q303Elvis Presleywd:statement/q303-44C100BC-71C3-4D4F-881D-6729D4B58D28
wd:Q392Bob Dylanwd:statement/q392-0190B28E-161C-4BA9-99D4-4B7C27B6F4FD
wd:Q489Bill Maherwd:statement/q489-79FF5E0D-6C59-48AE-8266-775105133852
wd:Q765Dario Fowd:statement/q765-5FD4C8D5-2258-4091-87AB-C4112FE30CE3
wd:Q835Mikhail Bulgakovwd:statement/q835-94A1550F-D43D-4C6F-A80F-563889002835
wd:Q853Andrei Tarkovskywd:statement/q853-A2E1C3C3-2F32-4A0B-B091-2DBB06BCEF18
wd:Q882Charlie Chaplinwd:statement/q882-A7EB6AB3-8F42-4405-8B47-6E78D2C1E2C1
wd:Q892J. R. R. Tolkienwd:statement/q892-122F7562-51B2-4A14-82D6-6202DB5A79ED

Unfortunately, we do not get the GND id but a reference to a statement.

We have to resolve this statement to get the actual GND id (SPARQL, inspired by an example query for awarded Chemistry Nobel Prizes):

SELECT ?item ?itemLabel ?gndid
WHERE
{
  ?item wdt:P106 wd:Q36180 .              # occupation(P106) is writer(Q36180)
  ?item p:P227 ?wds . 
  ?wds ?v ?gndid .
  ?wdP wikibase:claim p:P227 .
  ?wdP wikibase:statementProperty ?v .
  SERVICE wikibase:label {                # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
LIMIT 10
itemitemLabelgndid
wd:Q4498Beatritz de Dia134832809
wd:Q42Douglas Adams119033364
wd:Q377Yanka Kupala118640267
wd:Q501Charles Baudelaire118507184
wd:Q555Rachel Maddow1022390589
wd:Q747Pierre Corneille118522175
wd:Q926Roald Amundsen118502670
wd:Q1151Hector Berlioz118509675
wd:Q4128Louis Aragon118503774
wd:Q4340Andreas Capellanus118502905

Let us download the actual data:

curl \
    --header "Accept: text/tab-separated-values" \
    --output wikidata_writer_subclass_gndid.tsv \
    --globoff \
     'https://query.wikidata.org/sparql?query=SELECT%20%3Fitem%20%3FitemLabel%20%3Fgndid%0A%20%20WHERE%0A%20%20{%0A%20%20%20%20%3Fitem%20wdt%3AP106%20wd%3AQ36180%20.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20occupation%28P106%29%20is%20writer%28Q36180%29%0A%20%20%20%20%3Fitem%20p%3AP227%20%3Fwds%20.%20%0A%20%20%20%20%3Fwds%20%3Fv%20%3Fgndid%20.%0A%20%20%20%20%3FwdP%20wikibase%3Aclaim%20p%3AP227%20.%0A%20%20%20%20%3FwdP%20wikibase%3AstatementProperty%20%3Fv%20.%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20{%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20...%20include%20the%20labels%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%0A%20%20%20%20}%0A%20%20}'

checking the result

Let us again check the result comparing its size and searching for Goethe:

wc -l wikidata_writer_subclass_gndid.tsv
  • 57495
grep Goethe wikidata_writer_subclass_gndid.tsv
http://www.wikidata.org/entity/Q1586540Wolfgang Maximilian von Goethe118717928
http://www.wikidata.org/entity/Q1585819Johann Caspar Goethe118695940

Unfortunately, Johann Wolfgang von Goethe is not contained, although he has a GND id and as occupation (among others) novelist (Q6625963) which is a subclass of writer (Q36180). The reason is again a query timeout:

grep Exception wikidata_writer_subclass_gndid.tsv | head -n1
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.QueryInterruptedException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.bop.engine.QueryTimeoutException: Query deadline is expired.

Therefore, we must find another way to extract the data from Wikidata. One option is to avoid the expensive joins and instead extract separate datasets and join them manually.

downloading the individual parts and joining them

Presumably due to the size of the data and the complexity of path queries and joins, we struggle to put all pieces together within Wikidata. Therefore, we here try to download the different bits and pieces and put them together manually (well, with some simple tools):

We then join the resulting three files and get a list of writers together with their GND id.

item has an occupation property

This is quite simple: we get all items which have an occupation property together with the value of that property (SPARQL):

SELECT ?item ?itemLabel ?occupation
WHERE
{
  ?item wdt:P106 ?occupation .            # occupation(P106)
  SERVICE wikibase:label {                # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}

Let us download the data:

  curl \
      --header "Accept: text/tab-separated-values" \
      --output wikidata_occupation.tsv \
      --globoff \
'https://query.wikidata.org/sparql?query=%20%20SELECT%20%3Fitem%20%3FitemLabel%20%3Foccupation%0A%20%20WHERE%0A%20%20{%0A%20%20%20%20%3Fitem%20wdt%3AP106%20%3Foccupation%20.%20%20%20%20%20%20%20%20%20%20%20%20%23%20occupation%28P106%29%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20{%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20...%20include%20the%20labels%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%0A%20%20%20%20}%0A%20%20}'

and count the number of items:

wc -l wikidata_occupation.tsv
  • 3053738

Impressive!

all subclasses of writer

This is the expensive path query, except that we avoid to join any large amounts of data such that the result is quite small:

SELECT ?subclass
WHERE
{
  ?subclass wdt:P279* wd:Q36180
}
curl \
    --header "Accept: text/tab-separated-values" \
    --output wikidata_writer_subclasses.tsv \
    --globoff \
     'https://query.wikidata.org/sparql?query=%23added%20before%202016-10%0ASELECT%20%3Fsubclass%0AWHERE%0A{%0A%20%20%3Fsubclass%20wdt%3AP279*%20wd%3AQ36180%0A}'

We get a list of all #transitive-subclasses-of-writer.

item has a GND id

We download all items which have a GND id:

SELECT ?item ?itemLabel ?gndid
WHERE
{
  ?item p:P227 ?wds .                     # has property occupation(P227)
  ?wds ?v ?gndid .
  ?wdP wikibase:claim p:P227 .
  ?wdP wikibase:statementProperty ?v .
  SERVICE wikibase:label {                # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
curl \
    --header "Accept: text/tab-separated-values" \
    --output wikidata_gndid.tsv \
    --globoff \
     'https://query.wikidata.org/sparql?query=%20%20SELECT%20%3Fitem%20%3FitemLabel%20%3Fgndid%0A%20%20WHERE%0A%20%20{%0A%20%20%20%20%3Fitem%20p%3AP227%20%3Fwds%20.%20%0A%20%20%20%20%3Fwds%20%3Fv%20%3Fgndid%20.%0A%20%20%20%20%3FwdP%20wikibase%3Aclaim%20p%3AP227%20.%0A%20%20%20%20%3FwdP%20wikibase%3AstatementProperty%20%3Fv%20.%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20{%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20...%20include%20the%20labels%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%0A%20%20%20%20}%0A%20%20}'
wc -l wikidata_gndid.tsv
  • 449396

This is close to the count of 447551 returned by a COUNT query.

struggles with the itemLabel

The itemLabel was later added to the query but Wikidata struggled to deliver the complete results. Hence, when less results are returned, try removing the itemLabel from the SELECT statement.

Another alternative is to retrieve the labels for the items using the query where an item has an occupation property.

We then also need to adopt the columns which are joined in the next section.

joining the files

We want to join the following three files:

  1. wikidata_occupation.tsv
  2. wikidata_writer_subclasses.tsv
  3. wikidata_gndid.tsv

Due to an incompatibility between the sort and join program, we have to ensure that all operations use the same collation. In our example we use en_EN.

To join the files, they need to be sorted. Therefore, we first sort them using the en_EN collation:

export LANG=en_EN
sort -k1 wikidata_writer_subclasses.tsv > wd_ws
sort -k2 wikidata_occupation.tsv > wd_o
sort -k1 wikidata_gndid.tsv > wd_g

We can now join wikidata_occupation.tsv (column 2) with wikidata_writer_subclasses.tsv (column1) using the occupation:

export LANG=en_EN
join -1 1 -2 2 wd_ws wd_o | cut -d' ' -f2 | sort -u > wd_w

We now join this with the GND ids:

export LANG=en_EN
join wd_g wd_w > wd_result
wc -l wd_result
  • 114872

So we have at least more results than we got directly from Wikidata. Let’s check for Goethe:

grep "Q5879>" wd_result
http://www.wikidata.org/entity/Q5879118540238

Success! Goethe is included. We can now continue and join the writers from Wikidata with the records from the DNB

Linked Data from DNB

download

curl --output DNBTitel.rdf.gz "http://datendienst.dnb.de/cgi-bin/mabit.pl?cmd=fetch&userID=opendata&pass=opendata&mabheft=DNBTitel.rdf.gz"

checks

It contains entries like these:

<http://d-nb.info/1057803898> a bibo:Document ;
        dcterms:medium <http://rdvocab.info/termList/RDACarrierType/1044> ;
        owl:sameAs <http://hub.culturegraph.org/resource/DNB-1057803898> ;
        dc:identifier "(DE-101)1057803898" ;
        bibo:isbn13 "9783150186329" ;
        rdau:P60521 "kart. : EUR 3.60" ;
        dc:identifier "(OCoLC)890646150" ;
        dcterms:language <http://id.loc.gov/vocabulary/iso639-2/ger> ;
        dc:title "Die Leiden des jungen Werthers" ;
        dcterms:creator <http://d-nb.info/gnd/118540238> ;
        rdau:P60163 "Stuttgart" ;
        dc:publisher "Reclam" ;
        rdau:P60333 "Stuttgart : Reclam, 2014" ;
        isbd:P1053 "140 S." ;
        dcterms:bibliographicCitation "Reclams Universal-Bibliothek ; Nr. 18632" ;
        dcterms:issued "2014" ;
        rdau:P60493 "erste Fassung 1774" ;
        bibo:authorList _:node1ashhhff1x4370327 .

We find the GND id 118540238 in the URL http://d-nb.info/gnd/118540238 of the dcterms:creator property. It points to

grep 118540238 wd_result
http://www.wikidata.org/entity/Q5879118540238

Goethe!

extraction

We want to find all entries in that file with a GND id from our Wikidata items.

To have some smaller data to play around, we extract some entries:

zcat DNBTitel.ttl.gz| head -n29999 > dnbtest.ttl

(We need almost 30000 lines to get some entries with a dc:creator property below.)

Now we need Python and rdflib to parse the data:

(setq org-babel-python-command "python3")
import rdflib
from rdflib.namespace import DCTERMS, DC

g = rdflib.Graph()
g.parse('dnbtest.ttl', format='n3')
isbd = rdflib.Namespace("http://iflastandards.info/ns/isbd/elements/")
 
for s, o in g.subject_objects(DCTERMS["creator"]):
    # property with linked GND id found, extract GND id
    url, gndid = o.rsplit('/', 1)
    # get title and page number
    title = g.value(s, DC["title"], None)
    pages = g.value(s, isbd["P1053"], None)
    print(s, gndid, title, pages, sep='|')
http://d-nb.info/97559074X110717996Der LehrprinzX, 528 S.
http://d-nb.info/9686682671074145577Integrierte digitale SchaltungenXII, 711 S.
http://d-nb.info/9661020021030562091Brandschutzgeschichte250 S.
http://d-nb.info/973161485117227242Der Logos des Spiegels366 S.
http://d-nb.info/969479468121279723Tessloffs Aufklärungsbuch80 S.
http://d-nb.info/9661020021049499417Brandschutzgeschichte250 S.
http://d-nb.info/963874004121412636Durch Wüste und Steppe610 S.
http://d-nb.info/9686682671074145798Integrierte digitale SchaltungenXII, 711 S.
http://d-nb.info/972213066123476879Modellfall für Deutschland?1382 S.
http://d-nb.info/9701763841043946004Beck’sches Handbuch Umwandlungen internationalXIV, 697 S.
http://d-nb.info/972500502118176900Gastling103 S.
http://d-nb.info/975731858128369078Opfer rechtsextremer Gewalt176 S.
http://d-nb.info/969479468129025232Tessloffs Aufklärungsbuch80 S.
http://d-nb.info/972647988121742660ProdukthaftpflichtversicherungXVI, 498 S.
http://d-nb.info/966547403118503901Trug doch die Nacht den Albatros133 S.
http://d-nb.info/969932391121102467Fichte lesenXV, 119 S.
http://d-nb.info/9757318581043395121Opfer rechtsextremer Gewalt176 S.
http://d-nb.info/975192507118509861Gedichte601 Seiten
http://d-nb.info/96665241X118654292Luther158 S.
http://d-nb.info/972721304115881301Frank Bsirske macht Urlaub au Krk315 S.
http://d-nb.info/972721304141422114Frank Bsirske macht Urlaub au Krk315 S.
http://d-nb.info/975731858123182301Opfer rechtsextremer Gewalt176 S.
http://d-nb.info/975778269170750345Nachhaltigkeit und Betriebswirtschaftslehre598 S.

It is slow but it works pretty well. In the last column we can already see that processing the page numbers will require some effort.

joining writers and books

We extend the Python code to check for each GND whether it appears in the writer list wd_result from Wikidata (and we add support to read the gzip compressed file from the DNB):

from __future__ import print_function
import rdflib
import gzip
import codecs
import logging
from rdflib.namespace import DCTERMS, DC

logging.basicConfig()

# read Wikidata GND ids of writers
writers = set()
with open('wd_result', 'r') as f:
    for line in f:
        entity, gndid = line.strip().split()
        writers.add(gndid)

# read DNB data
g = rdflib.Graph()
g.parse(gzip.open('DNBTitel.ttl.gz', 'rt'), format='n3')
isbd = rdflib.Namespace("http://iflastandards.info/ns/isbd/elements/")

fout = open('dnb_pages.tsv', 'wt')

for s, o in g.subject_objects(DCTERMS["creator"]):
    # property with linked GND id found, extract GND id
    url, gndid = o.rsplit('/', 1)
    # check whether this is a writer
    if gndid in writers:
        # get title and page number
        title = g.value(s, DC["title"], None)
        pages = g.value(s, isbd["P1053"], None)
        if title == None:
            title = ""
        if pages == None:
            pages = ""
        
        print(s.encode("utf-8"), gndid.encode("utf-8"), title.encode("utf-8"), pages.encode("utf-8"), sep='\t', file=fout)

fout.close()

Don’t try this at home kids! This loads the whole DNB RDF graph into memory and this requires an enormous amount of main memory (200GB).

Error output:

WARNING:rdflib.term:http://d-nb.info/gnd/11900951X  does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/104288744 ; 116055804 does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/104288744 ; 116055804 does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/104288744 ; 116055804 does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/ 110148991 does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/105112348; 116518308; 118667653 (3 Tp) does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/105112348; 116518308; 118667653 (3 Tp) does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/119193906 (VWForm) does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/116596376; 10083485X does not look like a valid URI, trying to serialize this will break.
WARNING:rdflib.term:http://d-nb.info/gnd/ 113519834 does not look like a valid URI, trying to serialize this will break.

The RDF extraction is also non-deterministic, since some properties have several values, for instance, this document has two values for the isbd:P1053 property which holds the number of pages:

<http://d-nb.info/1010791265> a bibo:Document ;
        dcterms:medium <http://rdvocab.info/termList/RDACarrierType/1044> ;
        owl:sameAs <http://hub.culturegraph.org/resource/DNB-1010791265> ;
        dc:identifier "(DE-101)1010791265" ;
        rdau:P60521 "Lw. : Pta 125.00" ;
        dc:identifier "(OCoLC)254209729" ;
        dc:title "Los pájaros de Baden-Baden" ;
        dcterms:creator <http://d-nb.info/gnd/118644491> ;
        rdau:P60163 "Madrid" ;
        dc:publisher "Ed. Cid" ;
        rdau:P60333 "Madrid : Ed. Cid, 1965" ;
        isbd:P1053 "263 S." , "8" ;
        dcterms:issued "1965" ;
        bibo:authorList _:node1ashhhfpkx8672282 .

In our implementation a random value is selected, such that either 263 S. or 8 will be extracted.

analysing the page numbers

We finally have a result and can analyse dnb_pages.tsv.

syntax

A quick look at the data immediately reveals that there are many different ways how page numbers are specified:

cut -f4 dnb_pages.tsv | sed -e "s/[0-9][0-9]*/0/g"  | sort -u | wc -l
  • 6903

Impressive! Let’s have a look at the most frequent examples:

cut -f4 dnb_pages.tsv | sed -e "s/[0-9][0-9]*/0/g"  | sort | uniq -c | sort -nr | head
9026340 S.
116585
189090 S.;
18739[0] S.
134360 Seiten
131050 Bl.
7168VIII, 0 S.
57580, [0] S.
46430, 0 S.
4224XII, 0 S.

And the least frequent examples:

cut -f4 dnb_pages.tsv | sed -e "s/[0-9][0-9]*/0/g"  | sort | uniq -c | sort -nr | tail
10, 0, 0, 0, 0, 0, 0, 0, 0 S.
10, [0], [0]
1[0] - 0
10, [0]
10-0.
10 (0)
10 0
10 0
1[0].
10,

counting

Given the top list above, with the very simple regular expression [0-9]+ S we can cover three of the most important cases and extract the longest works in our list:

import re
import numpy as np

re_pages = re.compile("([0-9]+) S")

items = []
pagecounts = []

with open("dnb_pages.tsv", "rt") as f:
    for line in f:
        parts = line.strip().split('\t')
        # ignore broken lines for now
        if len(parts) == 4:
            item, gndid, title, pages = parts
            # parse pages
            match = re_pages.search(pages)
            if match:
                # store page and item
                pagecounts.append(int(match.group(1)))
                items.append(line.strip())

# get permutation to sort by pagecount
pagecounts_sorted = np.argsort(pagecounts)

# print top 10
for i in range(1,11):
    print(pagecounts[pagecounts_sorted[-i]], items[pagecounts_sorted[-i]])
pagesitem idGND idtitlepages
348333http://d-nb.info/920918131119483823Tim348333 S.
332331http://d-nb.info/930916484118507591Selbstvergessenheit332331 S.
239240http://d-nb.info/920996760118815202Denkzettel239240 S.
176150http://d-nb.info/880974125118520520Die fünfte Freiheit176150 S.
137317http://d-nb.info/942067983130671088Revolutionäre Psychologie137317 S.
96104http://d-nb.info/959654496115662863Die Malerei im Bietigheimer Hornmoldhaus96104 S.
22522http://d-nb.info/951007475118694316Dictionnaire universel, contenant generalement tous les mots françois22522 S.
21920http://d-nb.info/82054161311890423XDie grünen Augen21920 S.
13008http://d-nb.info/958168091121276791Nicht mehr rauchen und dabei schlank bleiben13008 S.
9306http://d-nb.info/964760746128401370Flying Dutchmen9306 S.

The names of the authors are missing for now but having a look at the corresponding DNB pages we can get an idea that something must be wrong here:

  1. The page numbers seem to be too large to be true.
  2. The first book, “Tim”, was published by Goldmann Verlag, a typical paperback publisher.
  3. The last book, “Flying Dutchmen” from the architect Kari Jormakka is non-fiction.

We will investigate these issues in the next steps.

additional information

subclasses of writer

Just out of curiosity: what are the subclasses of (P279) writer (Q36180) (SPARQL):

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P279 wd:Q36180 .             # subclass(P279) of writer(Q36180)
  SERVICE wikibase:label {               # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
ORDER BY ?itemLabel
itemitemLabel
wd:Q152182
wd:Q21140478
wd:Q26203955
wd:Q26260814
wd:Q4938203Boholano writers
wd:Q16637669Félibresse
wd:Q764233Geoponici
wd:Q20532870Uzbek writers
wd:Q864380biographer
wd:Q4853732children’s writer
wd:Q27431213cookery writer
wd:Q11500768cooking expert
wd:Q14466416copywriter
wd:Q3589290correspondent
wd:Q21036234crime writer
wd:Q10297252detective writer
wd:Q22811127devotional writer
wd:Q487596dramaturge
wd:Q11774202essayist
wd:Q3064032fabulist
wd:Q623386ghostwriter
wd:Q5689489head writer
wd:Q8178443librettist
wd:Q18533509medical writer
wd:Q24387326mythographer
wd:Q15980158non-fiction writer
wd:Q6625963novelist
wd:Q16254673pamphleteer
wd:Q551835physician writer
wd:Q214917playwright
wd:Q49757poet
wd:Q12144794prosaist
wd:Q18844224science fiction writer
wd:Q28389screenwriter
wd:Q15949613short story writer
wd:Q175301speechwriter
wd:Q7596574staff writer
wd:Q15979013surrealist writer
wd:Q1568338technical writer
wd:Q381353woman of letters
wd:Q27212012young adult writer

Some item labels are empty because the items do not have an English label. For instance, the first item wd:Q152182 refers to the German concept “Literat” for which no English translation is provided.

transitive subclasses of writer

And now let us respect transitivity:

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P279* wd:Q36180 .            # subclass(P279) (transitive) of writer(Q36180)
  SERVICE wikibase:label {               # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}
ORDER BY ?itemLabel
itemitemLabel
wd:Q152182
wd:Q1456208
wd:Q2325224
wd:Q2393424
wd:Q2781593
wd:Q3477303
wd:Q3531683
wd:Q3765898
wd:Q3765897
wd:Q11301725
wd:Q11598626
wd:Q11711601
wd:Q16681315
wd:Q18059311
wd:Q18222085
wd:Q21140478
wd:Q22929895
wd:Q24812704
wd:Q25277824
wd:Q26203955
wd:Q26234215
wd:Q26260814
wd:Q779388Akhmatova’s Orphans
wd:Q2632248Akhoond
wd:Q2870087Atthidographer
wd:Q4938203Boholano writers
wd:Q670974Bollandist
wd:Q2936875Cantore al liuto
wd:Q1107241Coirpre
wd:Q14466416Copywriter
wd:Q2354449Dichter des Vaderlands
wd:Q5338722Editor-at-large
wd:Q22888258Edward James Bennell
wd:Q744738Encyclopédistes
wd:Q16637669Félibresse
wd:Q764233Geoponici
wd:Q767975Goliard
wd:Q4529715Hittitologist
wd:Q6296256Journalist
wd:Q956365Liedermacher
wd:Q1133078Logographer
wd:Q3296111Marxist historian
wd:Q156624Paparazzi
wd:Q7207430Poet Laureate of the District of Columbia
wd:Q41775Poète maudit
wd:Q3403391Prince des poètes
wd:Q7459839Sha’ir
wd:Q20002503Story artist
wd:Q19346706US american journalist
wd:Q20532870Uzbek writers
wd:Q6399436VJ
wd:Q23038345Wikipedian
wd:Q3809586Wikipedian in Residence
wd:Q17486321agrarian historian
wd:Q794686aoidos
wd:Q3606216aphorist
wd:Q619553apologist
wd:Q17486326architectural historian
wd:Q17391659architectural theoretician
wd:Q17391654architecture critic
wd:Q4164507art critic
wd:Q1792450art historian
wd:Q17391638art theorist
wd:Q18814623autobiographer
wd:Q215144bard
wd:Q10429346bibliographer
wd:Q864380biographer
wd:Q24705156broadcast journalist
wd:Q26233771building researcher
wd:Q26132815byzantinist
wd:Q13391399chansonnier
wd:Q15986551chess journalist
wd:Q15958307chess theoretician
wd:Q4853732children’s writer
wd:Q3330547chronicler
wd:Q1743122church historian
wd:Q15983985classical archaeologist
wd:Q16267607classical philologist
wd:Q2468727classical scholar
wd:Q3276037color commentator
wd:Q1086863columnist
wd:Q11914886comedy writer
wd:Q21207686comics critic
wd:Q20669602contemporary historian
wd:Q2995513contemporary historian
wd:Q876864contributing editor
wd:Q27431213cookery writer
wd:Q11500768cooking expert
wd:Q1155838correspondent
wd:Q3589290correspondent
wd:Q21036234crime writer
wd:Q6430706critic
wd:Q21286455critic of religions
wd:Q20020377cultural critic
wd:Q15462162cultural historian
wd:Q19765978dance critic
wd:Q10297252detective writer
wd:Q22811127devotional writer
wd:Q3026032dialogue writer
wd:Q18939491diarist
wd:Q23117687dithyrambic poet
wd:Q487596dramaturge
wd:Q17488363economic historian
wd:Q589298editor-in-chief
wd:Q3024627editorial cartoonist
wd:Q17342450editorial columnist
wd:Q1350189egyptologist
wd:Q16314501encyclopedist
wd:Q22917056engraved gem researcher
wd:Q26237228epigrammatist
wd:Q15632632epigrapher
wd:Q11774202essayist
wd:Q20743624etruscologist
wd:Q3064032fabulist
wd:Q4220892film critic
wd:Q20971250film historian
wd:Q17391605film theorist
wd:Q1495660food critic
wd:Q18190897foreign correspondent
wd:Q22662561game show host
wd:Q8963721genealogist
wd:Q623386ghostwriter
wd:Q5615122guest host
wd:Q17166634hagiographer
wd:Q5689489head writer
wd:Q19967350hellenist
wd:Q18916625hellenist
wd:Q516463heraldist
wd:Q201788historian
wd:Q17504989historian of Eastern Europe
wd:Q17488392historian of cartography
wd:Q20873384historian of classical antiquity
wd:Q17486330historian of mathematics
wd:Q17488357historian of religion
wd:Q16063546historian of science
wd:Q17505002historian of student
wd:Q17486338historian of technology
wd:Q17489339historian of the modern age
wd:Q5905231horror host
wd:Q20738773hymnwriter
wd:Q256876improvvisatore
wd:Q15931838investigative reporter
wd:Q1930187journalist
wd:Q17598791latinist
wd:Q2135538legal historian
wd:Q16012028legal scholar
wd:Q8178443librettist
wd:Q11236655light novel writer
wd:Q2617025list of Muslim historians
wd:Q4263842literary critic
wd:Q13570226literary historian
wd:Q15962340literary theorist
wd:Q1595570local historian
wd:Q822146lyricist
wd:Q19251029media critic
wd:Q8175949media historian
wd:Q15985128medical historian
wd:Q18533509medical writer
wd:Q3332711medievalist
wd:Q11774156memoirist
wd:Q1493121military historian
wd:Q18932086military theorist
wd:Q11781549minnesänger
wd:Q1350157music critic
wd:Q20198542music historian
wd:Q20669622music journalist
wd:Q16031530music theorist
wd:Q12270170mutakallim
wd:Q26425137mykenologist
wd:Q24387326mythographer
wd:Q17351648newspaper editor
wd:Q15980158non-fiction writer
wd:Q6625963novelist
wd:Q2004963numismatist
wd:Q21272406paleoanthropologist
wd:Q16254673pamphleteer
wd:Q16267158papyrologist
wd:Q24265174philosophy historian
wd:Q957729photojournalist
wd:Q551835physician writer
wd:Q214917playwright
wd:Q49757poet
wd:Q1209498poet lawyer
wd:Q12901590poetaster
wd:Q4992409political editor
wd:Q15973695political journalist
wd:Q15994177political theorist
wd:Q15958642political writer
wd:Q17488316prehistorian
wd:Q12144794prosaist
wd:Q16755977racial theorist
wd:Q24702769radio journalist
wd:Q3406651radio producer
wd:Q17488465regional historian
wd:Q24262594religious writer
wd:Q42909reporter
wd:Q936371rhapsode
wd:Q15978391satirist
wd:Q18844224science fiction writer
wd:Q17433421science journalist
wd:Q3745071science writer
wd:Q28389screenwriter
wd:Q839935script doctor
wd:Q7458488sex columnist
wd:Q15949613short story writer
wd:Q7508381sideline reporter
wd:Q15981299singer-lyricist
wd:Q488205singer-songwriter
wd:Q947305skald
wd:Q2293636slam poet
wd:Q1366909soccer commentator
wd:Q15978337social critic
wd:Q17504992social historian
wd:Q753110songwriter
wd:Q175301speechwriter
wd:Q17504998sport historian
wd:Q7579803sports analyst
wd:Q15941595sports columnist
wd:Q2986228sports commentator
wd:Q13219447sports journalist
wd:Q11313148sportswriter
wd:Q7596574staff writer
wd:Q27915504story by
wd:Q7620399story editor
wd:Q1771040stringer
wd:Q15979013surrealist writer
wd:Q18668527talk show host
wd:Q11122954tanka poet
wd:Q1568338technical writer
wd:Q19607300television columnist
wd:Q18810130television critic
wd:Q22976182television journalist
wd:Q947873television presenter
wd:Q17337766theatre critic
wd:Q1234713theologian
wd:Q16270720theoretical biologist
wd:Q19350898theoretical physicist
wd:Q18931911theorist
wd:Q22073916tragedy writer
wd:Q23055218travel guide writer
wd:Q3579035travel writer
wd:Q1747339trobairitz
wd:Q186370troubadour
wd:Q1996467trouvère
wd:Q189459ulama
wd:Q3476620video game writer
wd:Q619514video journalist
wd:Q164236war correspondent
wd:Q11496048war photographer
wd:Q2556193wine critic
wd:Q381353woman of letters
wd:Q3589292women letter writer
wd:Q36180writer
wd:Q27212012young adult writer

This list is amazing and disturbing at the same time.

BTW: the query to get the subclasses of literary work (Q7725634) is provided as an example (SPARQL):

SELECT ?s ?desc
WHERE
{
  ?s wdt:P279 wd:Q7725634 .
  OPTIONAL {
    ?s rdfs:label ?desc filter (lang(?desc) = "en").
  }
}
ORDER BY ?s

the writer class hierarchy

Let us visualise the subclasses of writer in a tree. First, we need to get pairs of subclasses (SPARQL):

#defaultView:Graph
SELECT ?item ?itemLabel ?otherItem ?otherItemLabel
WHERE
{
  ?item wdt:P279 ?otherItem .
  ?otherItem wdt:P279* wd:Q36180 .            # subclass(P279) (transitive) of writer(Q36180)
  SERVICE wikibase:label {               # ... include the labels
    bd:serviceParam wikibase:language "en"
  }
}

Downloading the data:

curl \
    --header "Accept: text/tab-separated-values" \
    --output wikidata_writer_pairs.tsv \
    --globoff \
     'https://query.wikidata.org/sparql?query=%20%20SELECT%20%3Fitem%20%3FitemLabel%20%3FotherItem%20%3FotherItemLabel%0A%20%20WHERE%0A%20%20{%0A%20%20%20%20%3Fitem%20wdt%3AP279%20%3FotherItem%20.%0A%20%20%20%20%3FotherItem%20wdt%3AP279*%20wd%3AQ36180%20.%20%20%20%20%20%20%20%20%20%20%20%20%23%20subclass%28P279%29%20%28transitive%29%20of%20writer%28Q36180%29%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20{%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20...%20include%20the%20labels%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%0A%20%20%20%20}%0A%20%20}'

And creating a GraphViz file to plot it:

vertices = dict()
edges = []

with open("wikidata_writer_pairs.tsv", "rt") as f:
    for line in f:
        # ignore first line
        if line[0] != "?":
            item, itemLabel, otherItem, otherItemLabel = line.strip().split('\t')
            # shorten item ids
            item = item[len("<http://www.wikidata.org/entity/"):-1]
            otherItem = otherItem[len("<http://www.wikidata.org/entity/"):-1]
            # clean labels
            if itemLabel[0] == "\"":
                itemLabel = itemLabel[1:-len("\"@en")]
            if otherItemLabel[0] == "\"":
                otherItemLabel = otherItemLabel[1:-len("\"@en")]
            # add vertices
            vertices[item] = itemLabel
            vertices[otherItem] = otherItemLabel
            # add edge
            edges.append((item, otherItem))
# print graph
with open("wikidata_writer_graph.dot", "wt") as f:
    print("digraph writers {", file=f)
    print("  graph [overlap=none, rankdir=RL];", file=f)
    for v in vertices:
        print("  ", v, " [label=\"", vertices[v], "\"];", sep='', file=f)
    for d, e in edges:
        print("  ", d, " -> ", e, ";", sep='', file=f)
    print("}", file=f)
dot -Tpng -owikidata_writer_graph.png wikidata_writer_graph.dot
dot -Tsvg -owikidata_writer_graph.svg wikidata_writer_graph.dot

https://raw.githubusercontent.com/weltliteratur/weighing/master/wikidata_writer_graph.png

wikidata_writer_graph.svg