Python wrapper for the CWB to extract concordances and score frequency lists

ausgerechnet/cwb-ccc

Collocation and Concordance Computation

cwb-ccc is a Python 3 wrapper around the IMS Open Corpus Workbench (CWB). The main purpose of the module is to run queries (including queries with more than two anchor points), extract concordance lines, and score frequency lists (particularly to extract collocates and keywords).

The Quickstart gives a rough overview. For a more detailed dive into the functionality, see the Vignette.

Installation

The module needs a working installation of CWB and operates on CWB-indexed corpora. If you want to run queries with more than two anchor points, you will need CWB version 3.4.16 or later. We recommend installing the 3.5.x package.
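
The version requirement can be checked programmatically. The following is only a sketch: the helper names parse_version and supports_multiple_anchors are hypothetical and not part of cwb-ccc, and the version string is assumed to be a plain dotted number that you have obtained yourself (for example by reading the output of cqp -v; the exact output format is not assumed here).

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as "3.4.16" into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def supports_multiple_anchors(cwb_version: str) -> bool:
    """Return True if the CWB version is 3.4.16 or later.

    Tuples compare element by element, so "3.4.9" correctly sorts
    before "3.4.16" (a plain string comparison would get this wrong).
    """
    return parse_version(cwb_version) >= (3, 4, 16)


if __name__ == "__main__":
    for candidate in ("3.4.9", "3.4.16", "3.5.0"):
        print(candidate, supports_multiple_anchors(candidate))
```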

You can install cwb-ccc with pip from PyPI:

python -m pip install cwb-ccc

You can also clone the source from GitHub, cd into the respective folder, and build your own wheel:

python3 -m venv venv
. venv/bin/activate
pip3 install -U pip setuptools wheel twine
pip3 install -r requirements.txt
pip3 install -r requirements-dev.txt
python3 -m cython -2 ccc/cl.pyx
python3 setup.py bdist_wheel

Quickstart

Accessing Corpora

To list all available corpora, you can use

from ccc import Corpora
corpora = Corpora(
    registry_dir="/usr/local/share/cwb/registry/"
)

Most functionality is tied to the Corpus class, which establishes the connection to your CWB-indexed corpus:

from ccc import Corpus
corpus = Corpus(
  corpus_name="GERMAPARL1386",
  registry_dir="/usr/local/share/cwb/registry/"
)

This will raise a KeyError if the named corpus is not in the specified registry.
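
If you would rather handle a missing corpus than let the exception propagate, you can wrap the constructor call. This is a minimal sketch: open_corpus is a hypothetical helper name, and only Corpus, its corpus_name and registry_dir arguments, and the KeyError behaviour are taken from the text above.

```python
def open_corpus(name, registry_dir="/usr/local/share/cwb/registry/"):
    """Return a ccc.Corpus for `name`, or None if it is not in the registry.

    Hypothetical convenience wrapper, not part of cwb-ccc.
    """
    # Imported lazily so that this helper can be defined even on a machine
    # where cwb-ccc (and CWB itself) is not installed.
    from ccc import Corpus

    try:
        return Corpus(corpus_name=name, registry_dir=registry_dir)
    except KeyError:
        # Raised when the named corpus is not in the specified registry.
        return None
```

Calling open_corpus("GERMAPARL1386") then yields either a usable Corpus object or None, which the caller can test for explicitly.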

Queries and SubCorpora

The usual starting point for using this module is to run a query with corpus.query(), which accepts valid CQP queries such as

subcorpus = corpus.query(
    '[lemma="Arbeit"]', context_break='s'
)

The result is a SubCorpus; at its core this is a pandas DataFrame with corpus positions (similar to CWB dumps of NQRs).

Note that you can also query for structural attributes, e.g.:

corpus.query(
    s_query='text_party', s_values={'CDU', 'CSU'}
)

Concordancing

You can access concordance lines via the concordance() method of the subcorpus. This method returns a DataFrame with information about the query matches in context:

subcorpus.concordance()

match | matchend | word
151 | 151 | Er brachte diese Erfahrung in seine Arbeit im Ausschuß für Familie , Senioren , Frauen und Jugend sowie im Petitionsausschuß ein , wo er sich vor allem
227 | 227 | Seine Arbeit und sein Rat werden uns fehlen .
1493 | 1493 | Ausschuß für Arbeit und Sozialordnung
1555 | 1555 | Ausschuß für Arbeit und Sozialordnung
1598 | 1598 | Ausschuß für Arbeit und Sozialordnung
... | ... | ...

By default, this retrieves concordance lines in simple format in the order in which they appear in the corpus. A better approach is

subcorpus.concordance(form='kwic', order='random')

match | matchend | left_word | node_word | right_word
81769 | 81769 | Ich unterstütze daher nachträglich die Forderung , daß die Durchführung des Gesetzes auch künftig durch die Bundesanstalt für | Arbeit | vorgenommen wird ; denn beim Bund gibt es die entsprechend ausgebildeten Sachbearbeiter .
8774 | 8774 | Glauben Sie im Ernst , Sie könnten am Ende ein Bündnis für | Arbeit | , eine Wende in der deutschen Politik , die Bekämpfung der Arbeitslosigkeit erreichen , wenn Sie nicht die Länder ,
8994 | 8994 | alle Entscheidungen gemeinsam zu treffen , die sich gegen Schwarzarbeit und illegale | Arbeit | wenden , und gemeinsam nach einem Weg zu suchen ,
80098 | 80098 | : Was der Vermittlungsausschuß mit Mehrheit zum Meister-BAföG beschlossen hat , heißt , daß die bewährten Institutionen der Bundesanstalt für | Arbeit | , die die Ausbildungsförderung für Meister bis zum Jahr 1993 durchgeführt haben , die darin große Erfahrung haben , die
61056 | 61056 | Selbst wenn Sie ein Konstrukt anbieten , das tendenziell die zusätzliche Belastung der Bundesanstalt für | Arbeit | etwas geringer hielte als die Entlastung bei der gesetzlichen Rentenversicherung , so wäre dies bei einem deutlichen Aufwuchs der Arbeitslosigkeit
... | ... | ... | ... | ...

which retrieves random concordance lines in KWIC formatting. Use cut_off to specify the maximum number of lines.

Collocation Analyses

After executing a query, you can use subcorpus.collocates() to extract collocates (see the vignette for parameter settings). The result is a DataFrame with lemmata as index and frequency signatures and association measures as columns:

subcorpus.collocates()

item O11 O12 O21 O22 R1 R2 C1 C2 N E11 E12 E21 E22 z_score t_score log_likelihood simple_ll min_sensitivity liddell dice log_ratio conservative_log_ratio mutual_information local_mutual_information ipm ipm_reference ipm_expected in_nodes marginal
für 46 730 831 148102 776 148933 877 148832 149709 4.54583 771.454 872.454 148061 19.4429 6.11208 134.301 130.019 0.052452 0.047547 0.055656 3.40925 2.26335 1.00514 46.2366 59278.4 5579.69 5858.03 0 877
, 43 733 7827 141106 776 148933 7870 141839 149709 40.7933 735.207 7829.21 141104 0.345505 0.336523 0.124564 0.117278 0.005464 0.000296 0.009947 0.076412 0 0.02288 0.983836 55412.4 52553.8 52568.6 0 7870
. 33 743 5626 143307 776 148933 5659 144050 149709 29.3328 746.667 5629.67 143303 0.677108 0.638378 0.461005 0.440481 0.005831 0.000673 0.010256 0.170891 0 0.05116 1.68829 42525.8 37775.4 37800 0 5659
und 32 744 2848 146085 776 148933 2880 146829 149709 14.9282 761.072 2865.07 146068 4.41852 3.0179 15.1452 14.6555 0.011111 0.006044 0.017505 1.10866 0 0.331144 10.5966 41237.1 19122.7 19237.3 0 2880
in 24 752 2474 146459 776 148933 2498 147211 149709 12.9481 763.052 2485.05 146448 3.07138 2.25596 7.72813 7.51722 0.009608 0.004499 0.014661 0.896724 0 0.268005 6.43212 30927.8 16611.5 16685.7 0 2498
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

You can also score arbitrary combinations of positional attributes by passing a list, e.g. p_query=['lemma', 'pos']. The DataFrame contains the counts and is annotated with all association measures available in the pandas-association-measures package (parameter ams).
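
To make the relationship between the count columns and some of the scores concrete, the sketch below recomputes part of the "für" row from its four observed frequencies. It assumes the textbook definitions of the marginals, the expected frequency, z-score, t-score, Dice, and (base-10) mutual information. These are not taken from the package source, but they are consistent with the values shown in the table to the displayed precision.

```python
import math

# Observed frequencies for "für", copied from the first row of the table.
O11, O12, O21, O22 = 46, 730, 831, 148102

# Marginal frequencies: row sums, column sums, and the total.
R1, R2 = O11 + O12, O21 + O22
C1, C2 = O11 + O21, O12 + O22
N = R1 + R2

# Expected frequency of the top-left cell under independence.
E11 = R1 * C1 / N

# A few association measures (standard definitions, assumed here).
z_score = (O11 - E11) / math.sqrt(E11)
t_score = (O11 - E11) / math.sqrt(O11)
dice = 2 * O11 / (R1 + C1)
mutual_information = math.log10(O11 / E11)
local_mutual_information = O11 * mutual_information

print(f"R1={R1} C1={C1} N={N} E11={E11:.5f}")
print(f"z_score={z_score:.4f} t_score={t_score:.5f} dice={dice:.6f}")
print(f"mutual_information={mutual_information:.5f} "
      f"local_mutual_information={local_mutual_information:.4f}")
```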

Keyword Analyses

Having created a subcorpus

subcorpus = corpus.query(
    s_query='text_party', s_values={'CDU', 'CSU'}
)

you can use its keywords() method for retrieving keywords:

subcorpus.keywords(order='conservative_log_ratio')

item O11 O12 O21 O22 R1 R2 C1 C2 N E11 E12 E21 E22 z_score t_score log_likelihood simple_ll min_sensitivity liddell dice log_ratio conservative_log_ratio mutual_information local_mutual_information ipm ipm_reference ipm_expected
deswegen 55 41296 37 108412 41351 108449 92 149708 149800 25.3958 41325.6 66.6042 108382 5.87452 3.99183 41.5308 25.794 0.00133 0.321982 0.002654 1.96293 0.404166 0.335601 18.458 1330.08 341.174 614.152
CSU 255 41096 380 108069 41351 108449 635 149165 149800 175.286 41175.7 459.714 107989 6.02087 4.99187 46.6543 31.7425 0.006167 0.126068 0.012147 0.81552 0.212301 0.162792 41.512 6166.72 3503.95 4238.99
CDU 260 41091 390 108059 41351 108449 650 149150 149800 179.427 41171.6 470.573 107978 6.01515 4.99693 46.6055 31.7289 0.006288 0.124499 0.012381 0.80606 0.209511 0.161086 41.8823 6287.64 3596.16 4339.12
in 867 40484 1631 106818 41351 108449 2498 147302 149800 689.551 40661.4 1808.45 106641 6.75755 6.02647 61.2663 42.1849 0.020967 0.072241 0.039545 0.47937 0.168901 0.099452 86.2253 20966.8 15039.3 16675.6
Wirtschaft 39 41312 25 108424 41351 108449 64 149736 149800 17.6666 41333.3 46.3334 108403 5.07554 3.41607 30.9328 19.1002 0.000943 0.333476 0.001883 2.03257 0.150982 0.34391 13.4125 943.145 230.523 427.236
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

Just as with collocates, the result is a DataFrame with lemmata as index and frequency signatures and association measures as columns.
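
The ipm columns and log_ratio can be reproduced by hand in the same way. The sketch below assumes that ipm means instances per million tokens (relative to R1 for the subcorpus and R2 for the reference) and that log_ratio is the base-2 logarithm of the ratio of the two relative frequencies. Both readings are assumptions, although they match the "deswegen" row to the displayed precision.

```python
import math

# Observed frequencies for "deswegen", copied from the keywords table.
O11, O12, O21, O22 = 55, 41296, 37, 108412

R1, R2 = O11 + O12, O21 + O22   # sizes of subcorpus and reference
C1 = O11 + O21                  # total frequency of the item
N = R1 + R2

E11 = R1 * C1 / N               # expected frequency in the subcorpus

# Relative frequencies, scaled to instances per million tokens.
ipm = O11 / R1 * 1_000_000
ipm_reference = O21 / R2 * 1_000_000
ipm_expected = E11 / R1 * 1_000_000

# Base-2 log of the ratio of relative frequencies.
log_ratio = math.log2((O11 / R1) / (O21 / R2))

print(f"ipm={ipm:.2f} ipm_reference={ipm_reference:.3f} "
      f"ipm_expected={ipm_expected:.3f} log_ratio={log_ratio:.5f}")
```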

Testing

The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on February 8, 1996. This corpus consists of 149,800 tokens in 7,332 paragraphs (s-attribute "p" with annotation "type" ("regular" or "interjection")) split into 11,364 sentences (s-attribute "s"). The p-attributes are "pos" and "lemma":

corpus.available_attributes()

type attribute annotation active
p-Att word False True
p-Att pos False False
p-Att lemma False False
s-Att corpus False False
s-Att corpus_name True False
s-Att sitzung False False
s-Att sitzung_date True False
s-Att sitzung_period True False
s-Att sitzung_session True False
s-Att div False False
s-Att div_desc True False
s-Att div_n True False
s-Att div_type True False
s-Att div_what True False
s-Att text False False
s-Att text_id True False
s-Att text_name True False
s-Att text_parliamentary_group True False
s-Att text_party True False
s-Att text_position True False
s-Att text_role True False
s-Att text_who True False
s-Att p False False
s-Att p_type True False
s-Att s False False

The corpus is located in this repository. All tests are written using this corpus as well as some reference counts and scores obtained from the UCS toolkit and some additional frequency lists. Make sure you install all development dependencies (especially pytest). You can then run the regular tests, the benchmarks, or the tests with a coverage report:

pytest -m "not benchmark"
pytest -m benchmark
pytest --cov-report term-missing -v --cov=ccc/

Acknowledgements
