This repository contains the code for the paper "Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training".
You can find the published models on the Hugging Face Hub.
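For example, a published model can be loaded with the standard `transformers` API. This is a minimal sketch; the model id below is a placeholder, so substitute the actual checkpoint name from our Hub page:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- replace with the actual checkpoint from the Hub.
model_name = "<org>/german-gpt2-leichte-sprache"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Sample a short continuation in Leichte Sprache style.
prompt = "Leichte Sprache bedeutet:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```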
The data used for this project can be downloaded using our scrapers.
Use the finetuning.py script to create your own Leichte Sprache language models. You first need to download or scrape the monolingual corpus from here. A sketch of the general fine-tuning procedure follows.
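As a rough illustration of what style-specific pre-training on a monolingual corpus looks like, here is a minimal causal-LM fine-tuning sketch with the Hugging Face Trainer. The base model id, corpus file name, and hyperparameters are placeholder assumptions; finetuning.py defines the actual setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed base checkpoint for illustration only.
base_model = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Assumed corpus layout: the scraped Leichte Sprache texts in one plain-text file.
dataset = load_dataset("text", data_files={"train": "leichte_sprache.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-leichte-sprache",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    # mlm=False gives standard next-token (causal LM) training targets.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```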
The evaluations for the perplexity scores, the readability of the language model outputs, and the downstream-task performance are provided in the respective scripts. We also publish the answers from the human grammar evaluation in the file evaluation/Evaluierung von large language models.csv. You can analyze these results with the human evaluation notebook.
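For reference, the perplexity of a causal language model on a given text can be computed along these lines; the model id is a placeholder, and the evaluation scripts above implement the exact procedure used in the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- point this at the checkpoint you want to evaluate.
model_name = "<org>/german-gpt2-leichte-sprache"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Leichte Sprache ist eine vereinfachte Form des Deutschen."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The model shifts the labels internally, so the returned loss is the
    # mean negative log-likelihood per token.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```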
For the application of the language models as ATS decoders, please refer to the original GitHub repo. You can find the fine-tuned simplification model on Hugging Face. The simplification results are stored in the original tensorboard_logs.
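As a usage illustration only (the linked repo documents the actual decoder setup), a fine-tuned simplification checkpoint published as a standard seq2seq model on the Hub could be queried like this; the model id is a placeholder:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint -- substitute the fine-tuned simplification model id.
model_name = "<org>/german-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

complex_sentence = (
    "Die Inanspruchnahme der Leistung setzt eine vorherige Antragstellung voraus."
)
inputs = tokenizer(complex_sentence, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```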
If you use our models or the code in one of our repos, please use the following citation:
@inproceedings{anschutz-etal-2023-language,
    title = "Language Models for {G}erman Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training",
    author = {Ansch{\"u}tz, Miriam and Oehms, Joshua and Wimmer, Thomas and Jezierski, Bart{\l}omiej and Groh, Georg},
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.74",
    pages = "1147--1158",
}