It's a collection of small Python modules made to improve the interoperability of the awesome NLTK and CLTK with CTS-compatible texts that follow the Capitains guidelines, especially those of the Perseus DL and the First 1K Years of Greek.
At the moment, I have put together:
- a corpus reader (see here for an introduction to NLTK corpus readers) for Capitains-compliant XML files. It works with all the First1K texts that you can download using the CLTK downloader. It lets you load and tokenize your corpus and store a citation for each of your tokens.
- a Greek tokenizer (in progress) that should work well with the Perseus treebank (I am still testing...)
- a concordance indexer to create (enhanced!) concordances from CTS-compatible texts
- a class for full morphological tagging of Greek and for lemmatizing tagged texts (see here)
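
To give an idea of what "storing citations for all your tokens" and building a concordance from them can look like, here is a minimal sketch that uses only the Python standard library. It is not this package's API: `SAMPLE`, `iter_tokens` and `index` are hypothetical names made up for the illustration, and the sketch assumes a Capitains-style TEI layout where `<div type="textpart">` and `<l>` elements carry their citation component in the `n` attribute. The tokenizer is a naive word regex, not the Greek tokenizer listed above.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Tiny illustrative document (Iliad 1.1-2), laid out the way this sketch
# assumes Capitains-compliant files are structured.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition">
      <div type="textpart" subtype="book" n="1">
        <l n="1">μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος</l>
        <l n="2">οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,</l>
      </div>
    </div>
  </body></text>
</TEI>"""


def iter_tokens(node, path=()):
    """Yield (token, citation) pairs, e.g. ("μῆνιν", "1.1").

    Hypothetical helper: it walks the tree, extends the citation path at
    every textpart div, and tokenizes the text of every <l> element.
    It assumes each textpart and line has an `n` attribute.
    """
    for child in node:
        n = child.get("n")
        if child.tag == TEI_NS + "div" and child.get("type") == "textpart":
            yield from iter_tokens(child, path + (n,))
        elif child.tag == TEI_NS + "l":
            citation = ".".join(path + (n,))
            # NFC keeps accented Greek letters precomposed, so the naive
            # \w+ regex does not split words at combining diacritics.
            text = unicodedata.normalize("NFC", "".join(child.itertext()))
            for word in re.findall(r"\w+", text):
                yield word, citation
        else:
            # Any other element (text, body, edition div): keep descending.
            yield from iter_tokens(child, path)


root = ET.fromstring(SAMPLE)
tokens = list(iter_tokens(root))

# A bare-bones concordance: every word form mapped to its citations.
index = defaultdict(list)
for word, citation in tokens:
    index[word].append(citation)

for word, citation in tokens[:3]:
    print(citation, word)
```

Keeping the citation next to each token, rather than next to each line, is what makes it possible to point every concordance hit back to a passage reference.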