Language-Models-German-Simplification

This repository contains the code for the paper "Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training".
The published models are available on the Hugging Face Hub, and the data used for this project can be downloaded with our scrapers.
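
As a quick start, the snippet below sketches how one of the published models could be loaded for generation with the `transformers` library. The model ID is an assumption for illustration; check our Hugging Face Hub page for the exact names.

```python
# Minimal sketch: load one of the published Leichte Sprache models and generate text.
# The model ID is an assumed example; verify the exact name on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiriUll/german-gpt2_easy"  # hypothetical ID, check the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Leichte Sprache ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```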

Fine-tuning language models

Use the finetuning.py script to create your own Leichte Sprache ("easy language") language models. You need to download or scrape the monolingual corpus with our scrapers first.
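
For orientation, here is a minimal sketch of what style-specific fine-tuning of a causal language model looks like with Hugging Face `transformers`. The base model, corpus path, and hyperparameters are illustrative assumptions; finetuning.py is the authoritative script.

```python
# Hedged sketch of style-specific causal LM fine-tuning, analogous to finetuning.py.
# Base model, file path, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "dbmdz/german-gpt2"           # assumed German base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Plain-text Leichte Sprache corpus, one document per line (path is hypothetical).
dataset = load_dataset("text", data_files={"train": "monolingual_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-leichte-sprache",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```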

Re-creating the results from the paper

The evaluations for the perplexity scores, the readability of the language model outputs, and the downstream-task performance are provided in the respective scripts. We also publish the answers from the human grammar evaluation in the file evaluation/Evaluierung von large language models.csv (German for "Evaluation of large language models"). You can analyze these results with the human evaluation notebook.
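
For reference, perplexity for a causal language model is typically derived from the mean token-level cross-entropy, roughly as in the hedged sketch below. The repo's evaluation scripts may differ in windowing and batching details, and the model ID is again an assumption.

```python
# Hedged sketch of a standard perplexity computation for a causal LM;
# the actual evaluation scripts in this repo may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiriUll/german-gpt2_easy"  # hypothetical ID, check the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

ids = tokenizer("Leichte Sprache hat kurze Sätze.", return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean cross-entropy over predicted tokens
print("perplexity:", torch.exp(loss).item())
```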

For the application of the language models as ATS (automatic text simplification) decoders, please refer to the original GitHub repo. You can find the fine-tuned simplification model on the Hugging Face Hub. The simplification results are stored in the original tensorboard_logs.

Citation

If you use our models or the code from one of our repositories, please use the following citation:

@inproceedings{anschutz-etal-2023-language,
    title = "Language Models for {G}erman Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training",
    author = {Ansch{\"u}tz, Miriam  and Oehms, Joshua  and Wimmer, Thomas  and Jezierski, Bart{\l}omiej  and Groh, Georg},
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.74",
    pages = "1147--1158",
}
