This repo contains code and documentation for applying Annif to the subject classification of US Congress bills. It was inspired by the publication of the bill_labels_us data set on Hugging Face Hub by Ari Hershowitz. It shows how to predict policy areas (multiclass classification: pick the best one out of 34 policy areas) as well as legislative subjects (multilabel classification: pick the most relevant subjects out of 4658 legislative subjects) based only on the title and text of a bill.
The Jupyter notebook `dataset-to-corpus.ipynb` loads the original data set directly from Hugging Face Hub using the `datasets` library and then converts it into corpus files in a format suitable for Annif. The data set is also split into 90% train / 10% test subsets. The notebook produces the following files (a rough sketch of the conversion is shown after the file list):
- `policy_area.csv`: the 34 policy areas as an Annif vocabulary in CSV format
- `policy_area-train.tsv.gz`: train set for the policy area prediction task (107k examples)
- `policy_area-test.tsv.gz`: test set for the policy area prediction task (12k examples)
- `legislative_subjects.csv`: the 4658 legislative subjects as an Annif vocabulary in CSV format
- `legislative_subject-train.tsv.gz`: train set for the legislative subjects prediction task (107k examples)
- `legislative_subject-test.tsv.gz`: test set for the legislative subjects prediction task (12k examples)
(The train and test sets are not included in this repository because they take up around 670MB, which is too big for plain git without LFS.)
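For illustration only, the conversion could be sketched along the lines below, here just for the policy area corpus. This is not the actual notebook code: the Hub dataset ID, the field names (`title`, `text`, `policy_area`) and the URI scheme are placeholders, and the real URIs must of course match those in `policy_area.csv`.

```python
# Rough sketch of the dataset-to-corpus conversion (placeholder IDs and fields).
import gzip

from datasets import load_dataset

dataset = load_dataset("bill_labels_us", split="train")   # placeholder Hub ID
split = dataset.train_test_split(test_size=0.1, seed=42)  # 90% train / 10% test


def write_corpus(examples, path):
    """Write an Annif TSV corpus: document text, a TAB, subject URI(s) in <>."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for ex in examples:
            # TSV needs single-line text without embedded tabs
            text = f"{ex['title']} {ex['text']}".replace("\t", " ").replace("\n", " ")
            # illustrative URI scheme; one subject per document for policy areas
            uri = f"<http://example.org/policy_area/{ex['policy_area'].replace(' ', '_')}>"
            out.write(f"{text}\t{uri}\n")


write_corpus(split["train"], "policy_area-train.tsv.gz")
write_corpus(split["test"], "policy_area-test.tsv.gz")
```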
The Annif configuration file `projects.cfg` defines six projects, three for each prediction task:

- `pa-tfidf-en`: TFIDF prediction of policy areas
- `pa-parabel-en`: Omikuji Parabel prediction of policy areas
- `pa-svc-en`: Support Vector Classification prediction of policy areas
- `ls-mllm-en`: MLLM (lexical) prediction of legislative subjects
- `ls-parabel-en`: Omikuji Parabel prediction of legislative subjects
- `ls-bonsai-en`: Omikuji Bonsai prediction of legislative subjects (using ngram=2 and min_df=3 settings for improved accuracy; see the example configuration below)
Snowball stemming is used in all six projects. Minimum token size is 3 characters, which is the default for Annif.
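As an example, a `projects.cfg` entry for the Bonsai project could look roughly like the sketch below. It is not copied from this repository's configuration file: the tree settings are the Bonsai-style values suggested in the Annif Omikuji backend documentation, combined with the settings mentioned in the list above.

```ini
[ls-bonsai-en]
name=Legislative subjects Omikuji Bonsai English
language=en
backend=omikuji
vocab=ls
analyzer=snowball(english)
# Bonsai-style tree settings for the Omikuji backend
cluster_balanced=False
cluster_k=100
max_depth=3
# feature extraction settings mentioned above
ngram=2
min_df=3
```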
To learn how to install and use Annif, see the Annif GitHub repository and the Annif tutorial.
Here are the commands used to train the models:
# load the policy area vocabulary
annif load-vocab pa policy_area.csv
# train the policy area models
annif train pa-tfidf-en policy_area-train.tsv.gz
annif train pa-parabel-en policy_area-train.tsv.gz
annif train pa-svc-en policy_area-train.tsv.gz
# load the legislative subjects vocabulary
annif load-vocab ls legislative_subjects.csv
# train the legislative subjects models
annif train ls-mllm-en legislative_subject-train.tsv.gz
annif train ls-parabel-en legislative_subject-train.tsv.gz
annif train ls-bonsai-en legislative_subject-train.tsv.gz
Training the models took less than an hour each on a Lenovo T14 laptop, except for the Bonsai model, which took 4 hours.
The projects were evaluated with the `annif eval` command against the test set records. The legislative subject projects were additionally evaluated using the `annif optimize` command to find the `limit` and `threshold` parameters that maximize the F1 score; precision and recall were then reported using those same `limit` and `threshold` values. The sizes of the model files on disk are also reported. The RAM needed to run the models is somewhat larger than the file sizes, but the two correlate.
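For reference, the evaluation commands look roughly like this (a sketch, using the limit and threshold values reported in the legislative subjects table below):

```
# evaluate a policy area project on the test set
annif eval pa-svc-en policy_area-test.tsv.gz

# search for limit/threshold values that maximize F1
annif optimize ls-bonsai-en legislative_subject-test.tsv.gz

# evaluate with the chosen limit and threshold values
annif eval --limit 15 --threshold 0.15 ls-bonsai-en legislative_subject-test.tsv.gz
```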
For the policy area prediction, the results were as follows:
| Project | Accuracy | nDCG | Model size |
|---------|----------|------|------------|
| TFIDF | 0.5381 | 0.7129 | 5.1 MB |
| Parabel | 0.9001 | 0.9531 | 8.8 MB |
| SVC | 0.9011 | 0.9536 | 12 MB |
Findings:
- The baseline TFIDF method performed quite poorly.
- Parabel and SVC gave very similar results, with more than 90% accuracy. The remaining 10% were likely difficult cases that may be impossible to classify automatically based on the title and text alone.
- All the models are very small, mainly because the vocabulary is small (only 34 concepts).
For the legislative subjects prediction, the results were as follows:
| Project | limit | threshold | Precision | Recall | F1 | Model size |
|---------|-------|-----------|-----------|--------|----|------------|
| MLLM | 15 | 0.15 | 0.3834 | 0.1897 | 0.2210 | 1 MB |
| Parabel | 15 | 0.15 | 0.6983 | 0.7072 | 0.6717 | 174 MB |
| Bonsai | 15 | 0.15 | 0.7762 | 0.7540 | 0.7372 | 597 MB |
Findings:
- The MLLM lexical method performs poorly, probably because the vocabulary is just a flat list of labels without any synonyms or semantic relations.
- Parabel and Bonsai both give excellent results. Bonsai, which uses 2-grams, achieves an F1 score of almost 0.74, which is so good it is almost suspicious.
- Model sizes are relatively modest, although the ngram=2 setting clearly increases the size of the Bonsai model compared to the otherwise similar Parabel model.
Osma Suominen