Language-Models-German-Simplification

This repository contains the code for the paper "Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training".
You can find the published models on the Hugging Face Hub.
The data used for this project can be downloaded using our scrapers.

Fine-tuning language models

Use the finetuning.py script to create your own Leichte Sprache language models. You need to download/scrape the monolingual corpus from here first.
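
For orientation, here is a minimal fine-tuning sketch using the Hugging Face Transformers `Trainer`. The corpus file name, the base checkpoint (`dbmdz/german-gpt2`), and the hyperparameters are illustrative assumptions; the actual `finetuning.py` may use different models and settings.

```python
# Minimal fine-tuning sketch with Hugging Face Transformers.
# Assumptions: the corpus is a plain-text file with one document per
# line, and dbmdz/german-gpt2 stands in for the base checkpoint; the
# repository's finetuning.py may differ.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

dataset = load_dataset("text", data_files={"train": "leichte_sprache_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-leichte-sprache",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=train_set,
    # mlm=False gives the standard causal (next-token) language modeling loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```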

Re-creating the results from the paper

The evaluations for the perplexity scores, the readability of the language model outputs, and the downstream task performance are provided in the respective scripts. We also publish the answers from the human grammar evaluation in the file evaluation/Evaluierung von large language models.csv. You can analyze these results with the human evaluation notebook.
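
As a reference point for the perplexity evaluation, the following sketch scores a single sentence under a causal language model. The model id is a stand-in for a fine-tuned checkpoint, and the repository's scripts may batch and aggregate differently.

```python
# Perplexity sketch: exp of the mean token-level cross-entropy under a
# causal LM. The model id is a stand-in for a fine-tuned checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Das ist ein einfacher Satz in Leichter Sprache."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean
    # cross-entropy over the sequence; its exponential is the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```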

For the application of the language models as ATS decoders, please refer to the original GitHub repo. You can find the fine-tuned simplification model on Hugging Face. The simplification results are stored in the original tensorboard_logs.
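
To try such a model as a simplification decoder, a generation call along these lines should work. The checkpoint id below is a hypothetical placeholder, and the full ATS pipeline in the referenced repository may wire the decoder differently.

```python
# Generation sketch for using a fine-tuned model as a simplification
# decoder. The checkpoint id is a hypothetical placeholder; see the
# original ATS repository for the actual decoding setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "your-org/german-simplification-model"  # placeholder, not a real id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "Dieser komplizierte Satz soll in Leichte Sprache übersetzt werden."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```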

Citation

If you use our models or the code in one of our repos, please use the following citation:

@inproceedings{anschutz-etal-2023-language,  
    title = "Language Models for {G}erman Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training",  
    author = {Ansch{\"u}tz, Miriam  and Oehms, Joshua  and Wimmer, Thomas  and Jezierski, Bart{\l}omiej  and Groh, Georg},  
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",  
    month = jul,  
    year = "2023",  
    address = "Toronto, Canada",  
    publisher = "Association for Computational Linguistics",  
    url = "https://aclanthology.org/2023.findings-acl.74",  
    pages = "1147--1158",  
}