This repository documents how to replicate the findings reported in:
Howcroft, David M., and Vera Demberg. 2017. "Psycholinguistic Models of Sentence Processing Improve Sentence Readability Ranking". Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Pages 958-968. Valencia, Spain, April 3-7, 2017. Association for Computational Linguistics.
ACL Anthology || PDF
The first few sections describe the resources used and how to install them. The Instructions section then describes how to use them to replicate our results. References follow, along with some metadata for this document.
The English and Simple English Wikipedia (ESEW) corpus was developed by Hwang et al. (2015). The resource is available from the authors on the project page.
For our work we used only the good alignments.
You can download the files from the command-line using:
wget http://ssli.ee.washington.edu/tial/projects/simplification/aligned-good-0.67.txt
The download is about 40 MB in size.
The One Stop English (OSE) corpus was developed by Sowmya Vajjala using data from onestopenglish.com. The corpus is available from her BitBucket repo: OSE Corpus.
You can fetch the data with:
wget https://bitbucket.org/nishkalavallabhi/complexity-features/raw/3cf60342c7ec82371ea2d0ef1bb290e7b0c9bac2/corpus/OSE-SentenceAlignedCorpus-ThreeLevel-2013toMid2015-FINAL.txt
The download is about 700 KB in size.
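If you expect to re-run the downloads, the two `wget` commands above can be wrapped in a small helper that skips files already on disk. This is only a sketch: the function name `fetch_corpus` and the default `corpora` directory are hypothetical and are not taken from this repo's scripts.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this repo): download a URL into a
# directory, skipping the download if a non-empty copy already exists.
fetch_corpus() {
    local url="$1"
    local dest_dir="${2:-corpora}"   # assumed default directory name
    local target="${dest_dir}/$(basename "$url")"

    mkdir -p "$dest_dir"
    if [ -s "$target" ]; then
        echo "skip: $target"
        return 0
    fi
    wget --output-document="$target" "$url"
}

# Example calls, using the URLs given above:
# fetch_corpus "http://ssli.ee.washington.edu/tial/projects/simplification/aligned-good-0.67.txt"
# fetch_corpus "https://bitbucket.org/nishkalavallabhi/complexity-features/raw/3cf60342c7ec82371ea2d0ef1bb290e7b0c9bac2/corpus/OSE-SentenceAlignedCorpus-ThreeLevel-2013toMid2015-FINAL.txt"
```

The non-empty check (`-s`) matters because `wget --output-document` can leave an empty file behind when a download fails, and an empty file should not count as already fetched.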
Our surprisal and embedding depth features are extracted by running the ModelBlocks parser in complexity output mode.
The main distribution for the parser is on GitHub.
Our integration cost features use a locally developed tool called icy-parses (formerly icToolDist).
This is available on GitHub as well.
Our propositional idea density features depend on the adapted IDD3 repo and therefore also on the Stanford dependency parser.
Running setup.sh in a bash-like environment will fetch the corpora and these repos for you.
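Before running the setup script, it can help to confirm that the command-line tools it is likely to need are installed. The sketch below assumes the script relies on `wget` (used in the download commands above) and `git` (an assumption, since the tool repos are hosted on GitHub); the function name `require_tools` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this repo): report any of the named
# command-line tools that cannot be found on PATH.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Returns 0 only when every tool was found.
    return "$missing"
}

# Example: only run the setup script when the assumed prerequisites exist.
# require_tools wget git && bash setup.sh
```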
Under development: This README is still under development and will be supplemented with all of the necessary scripts to automate the replication of our results.
Howcroft, David M., and Vera Demberg. 2017. "Psycholinguistic Models of Sentence Processing Improve Sentence Readability Ranking". Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Pages 958-968. Valencia, Spain, April 3-7, 2017. Association for Computational Linguistics. ACL Anthology || PDF
Hwang, William, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. 2015. "Aligning Sentences from Standard Wikipedia to Simple Wikipedia". Proc. of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Pages 211-217. Denver, Colorado, USA. Association for Computational Linguistics. ACL Anthology || PDF
Written by David M. Howcroft, April 2017.