# Allris-Scraper

A scraper for ratsinfo.leipzig.de.

## Requirements

### Runtime

### Development

## Usage

### Using Docker

Build the Docker image:

```sh
docker build -t codeforleipzig/allris-scraper:latest .
```

Run the Docker container:

```sh
docker run -v $(pwd)/data:/app/data --rm codeforleipzig/allris-scraper
```

### Using Python

It is recommended to use a virtual environment to isolate the libraries used in this project from your operating system's environment. To set one up, run the following in the project directory:

```sh
# create the virtual environment in the project directory; do this once
python3 -m venv venv

# activate the environment; do this before working with the scraper
source venv/bin/activate

# install the required libraries
pip3 install -r requirements.txt
```

To run the scraper with Python:

```sh
python3 ./1_read_paper_json.py --page_from 1 --page_to 1000 --modified_to 2023-04-27 --modified_from 2023-04-19
python3 ./2_download_pdfs.py
python3 ./3_txt_extraction.py
python3 ./4_srm_import.py
```

## Scraper Output

The scraper writes its output to the `data` directory. One file is written per scraping session; the filename convention is `<OParl object type>_<current timestamp>.jl`, for example `paper_2020-06-19T10-19-16.jl` when scraping papers.

The output is a feed in JSON Lines format, i.e. one scraped JSON document per line. For inspecting the data, `jq` is useful and can be used like this:

```sh
# all documents in the file
cat path/to/file | jq .

# only the first document
head -n1 path/to/file | jq .
```
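
If you prefer Python over `jq`, the feed can also be read line by line with the standard library; a minimal sketch (the file name is just an example):

```python
import json

# read every scraped document from a JSON Lines feed (one JSON object per line)
with open("data/paper_2020-06-19T10-19-16.jl", encoding="utf-8") as feed:
    papers = [json.loads(line) for line in feed if line.strip()]

print(len(papers), "documents loaded")
```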

## Extraction of PDF and TXT files

The method `download_pdfs()` in `leipzig.py` downloads all PDFs linked in the JSON Lines files and saves them in `data/pdfs`. Files that are already present in that folder are not downloaded again.
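
The skip-if-present logic works roughly as follows. This is only an illustrative sketch, not the actual code in `leipzig.py`; the `download_url` field name and the use of `requests` are assumptions:

```python
import json
import os
import requests

def download_pdfs(feed_path, target_dir="data/pdfs"):
    """Sketch: download every PDF referenced in a JSON Lines feed,
    skipping files that already exist in the target directory."""
    os.makedirs(target_dir, exist_ok=True)
    with open(feed_path, encoding="utf-8") as feed:
        for line in feed:
            doc = json.loads(line)
            url = doc.get("download_url")  # hypothetical field name
            if not url:
                continue
            pdf_path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
            if os.path.exists(pdf_path):  # already downloaded in an earlier run
                continue
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(pdf_path, "wb") as pdf:
                pdf.write(response.content)
```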

From the PDF files, TXT files can be generated with the `extract_text_from_pdfs_recursively()` method in `txt_extraction.py`, which uses Tika. The TXT files are saved to `data/txts`. Files that are already present in that folder are not extracted again.
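
A rough sketch of how such an extraction can look with the `tika` Python package (not necessarily the exact code in `txt_extraction.py`; Tika additionally requires a Java runtime, since it starts a local Tika server):

```python
import os
from tika import parser

def extract_text_from_pdfs_recursively(pdf_dir="data/pdfs", txt_dir="data/txts"):
    """Sketch: walk the PDF folder, extract text via Tika, skip existing TXT files."""
    os.makedirs(txt_dir, exist_ok=True)
    for root, _dirs, files in os.walk(pdf_dir):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            txt_path = os.path.join(txt_dir, os.path.splitext(name)[0] + ".txt")
            if os.path.exists(txt_path):  # already extracted in an earlier run
                continue
            parsed = parser.from_file(os.path.join(root, name))
            with open(txt_path, "w", encoding="utf-8") as out:
                out.write(parsed.get("content") or "")
```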

## Configuration

Scrapy allows configuration at various levels. General configuration can be found in `allris/settings.py`. For the purposes of this project, relevant values are overridden in `leipzig.py`. By default, the configuration is geared towards development needs: aggressive caching is enabled (`HTTPCACHE_ENABLED`) and the number of scraped pages is limited (`CLOSESPIDER_PAGECOUNT`).
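
In Scrapy, per-spider overrides like these are usually expressed via the `custom_settings` attribute; a minimal sketch of what the development defaults could look like (the values are illustrative, not the project's actual ones):

```python
import scrapy

class LeipzigSpider(scrapy.Spider):
    name = "leipzig"

    # Development-oriented overrides; values are illustrative only.
    custom_settings = {
        "HTTPCACHE_ENABLED": True,     # cache responses aggressively
        "CLOSESPIDER_PAGECOUNT": 100,  # stop after a limited number of pages
    }
```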

## PDF text extraction

Prerequisite: the `leipzig.py` scraper has been run and has downloaded files to `data/pdfs`.

Run

```sh
python3 ./txt_extraction.py
```

to extract the texts from the PDFs. Files will be created under `data/txts`.

## CSV

Prerequisite: `txt_extraction.py` has been run.

Run

```sh
python3 ./nlp.py
```

to join those text files as rows into a CSV file, which is created as `data/data.csv`. This file can be used for further NLP processing.

## NLP

### Data Preparation

`nlp.py` provides a method `read_txts_into_dataframe()` to read all TXT files in `data/txts` into a pandas DataFrame, and a method `write_df_to_csv()` to save this DataFrame in CSV format as `data.csv` in the `data` folder.
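
A minimal sketch of what these two helpers can look like (column names such as `filename` and `text` are assumptions, not necessarily those used in `nlp.py`):

```python
import os
import pandas as pd

def read_txts_into_dataframe(txt_dir="data/txts"):
    """Sketch: read every TXT file into one row of a pandas DataFrame."""
    rows = []
    for name in sorted(os.listdir(txt_dir)):
        if name.endswith(".txt"):
            with open(os.path.join(txt_dir, name), encoding="utf-8") as f:
                rows.append({"filename": name, "text": f.read()})
    return pd.DataFrame(rows)

def write_df_to_csv(df, csv_path="data/data.csv"):
    """Sketch: persist the DataFrame as CSV for further NLP processing."""
    df.to_csv(csv_path, index=False)
```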

### Topic Modeling

To make the scraped documents more accessible for users interested in certain topics, topic modeling has been run on the extracted documents with the R software tidyToPān. The resulting model will later be used, e.g., for a search function.

For NLP processing, the small German spaCy model can be downloaded with:

```sh
python -m spacy download de_core_news_sm
```
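
Once downloaded, the model can be loaded in Python; a minimal usage example, independent of this project's code:

```python
import spacy

# load the small German model downloaded above
nlp = spacy.load("de_core_news_sm")
doc = nlp("Der Stadtrat hat die Vorlage beschlossen.")
print([(token.text, token.pos_) for token in doc])
```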