Paranoid Transformer #142
Also, there is my related NanoNaNoGenMo submission:
Thanks for making this available; I was just reading through your code and your results, and it looks really cool. I don't know if you have the time, but it would be incredible if you could include a quick how-to guide for getting it up and running with a new corpus. I'm more than willing to help with this because I think it would be really educational for a lot of people. Thanks again.
Thanks for your interest! I'm not sure whether you're asking about the NaNoGenMo entry or the NanoNaNoGenMo one. The former is more or less described here: https://github.com/altsoph/paranoid_transformer/blob/master/README.md, the latter here: https://medium.com/altsoph/123-bytes-perl-markov-chain-b80e1212f3b3
Sorry for joining so late, but I still believe it's worth trying it in this year's NaNoGenMo :)
This month I tried to build a paranoiac-critical system based on two neural networks, the Paranoid Transformer.
The first network is a paranoiac-intrusive Generator; the second one, the Critic, works as a filtering subsystem, selecting the best passages from the generated text flow.
Let me share some details:
Generator subsystem
The first network, the Paranoiac-intrusive subsystem AKA Generator, uses the OpenAI GPT architecture in the huggingface implementation. I took a publicly available model already pre-trained on the huge BooksCorpus fiction dataset (~10K books, ~1B words).
Next, I finetuned it on several additional handcrafted text corpora (altogether ~50 MB of text), including the Cyphernomicon, Kafka's fiction, and fortune cookie messages.
During the finetuning phase, I used special labels to tell the model which type of text it was reading: each text got 2 labels, so the Cyphernomicon, Kafka, and the fortune cookie messages each had their own label combination. Note, there were almost no texts carrying the particular combination I later wanted to sample, just a few nerd jokes.
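As a rough sketch of that prefixing step (the tag strings and file names below are placeholders of mine, not the original tokens), the data prep could look like this:

```python
# Hypothetical sketch of the label-prefixing step; tag strings and
# file names are placeholders, not the original ones.
LABELED_CORPORA = [
    ("[crypto]",  "[dark]",  "cyphernomicon.txt"),
    ("[fiction]", "[dark]",  "kafka.txt"),
    ("[quotes]",  "[light]", "fortunes.txt"),
]

def build_finetuning_corpus(corpora=LABELED_CORPORA):
    """Prefix every document with its two control labels so the LM
    learns label-conditioned style during finetuning."""
    for source_tag, style_tag, path in corpora:
        with open(path, encoding="utf-8") as f:
            yield f"{source_tag} {style_tag} {f.read()}"
```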
At last, in generation mode, I kindly asked the model to generate texts under that rare target label combination.
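A minimal sampling sketch using the huggingface transformers API; the checkpoint name is the public pretrained one and the label prompt is a placeholder (the real pipeline would load the finetuned checkpoint and use the actual tags):

```python
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

# Public pretrained checkpoint; the actual pipeline would load the
# finetuned model instead.
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
model.eval()

# Placeholder label prompt -- the original tag strings are assumptions.
prompt = "[fiction] [dark] "
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,   # sample stochastically to get a diverse text flow
        max_length=120,
        top_k=50,
    )
print(tokenizer.decode(out[0]))
```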
The raw results were already promising enough:
Critic subsystem
The next big thing to do was to sift some real gems out of this endless text flow.
At first, I made a script with some simple heuristic filters over the generated text.
Applying this script cut the text flow into a sequence of valid chunks.
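As an illustration only (the actual filter set and thresholds are my assumptions), such heuristics might look like:

```python
import re

def valid_chunk(chunk, min_words=3, max_words=60):
    """Cheap heuristic filters: reject chunks that are too short or too
    long, contain non-printable characters, or have unbalanced
    brackets/quotes."""
    words = chunk.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"[^\x20-\x7e]", chunk):        # non-printable / non-ASCII
        return False
    for open_c, close_c in (("(", ")"), ("[", "]")):
        if chunk.count(open_c) != chunk.count(close_c):
            return False
    if chunk.count('"') % 2:                     # unbalanced quotes
        return False
    return True

flow = "a fine short chunk\nbroken (chunk with a dangling paren\nanother valid chunk"
chunks = [c for c in flow.split("\n") if valid_chunk(c)]
# keeps the first and last chunk, drops the one with the dangling paren
```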
Then I manually labeled such chunks with two classes, GOOD/BAD.
I took approximately 1K chunks, balanced between the classes (half GOOD, half BAD).
At last, I trained the Critic subsystem on them.
This neural network uses the BERT architecture, again in the huggingface implementation. Once more I took a publicly available pre-trained model and finetuned it on my labeled 1K-chunk dataset to predict the label of any given chunk.
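A minimal finetuning sketch with huggingface's BERT sequence classifier (the checkpoint name, hyperparameters, and toy data are assumptions; the original training setup may differ):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

critic_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
critic_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # two classes: BAD (0) / GOOD (1)

# Toy stand-ins for the ~1K manually labeled chunks.
train_chunks = ["the night is a dull knife", "asdf qwer zxcv"]
train_labels = [1, 0]  # 1 = GOOD, 0 = BAD

optimizer = torch.optim.AdamW(critic_model.parameters(), lr=2e-5)
critic_model.train()
for epoch in range(3):
    batch = critic_tokenizer(train_chunks, padding=True, truncation=True,
                             max_length=128, return_tensors="pt")
    batch["labels"] = torch.tensor(train_labels)
    loss = critic_model(**batch).loss  # cross-entropy over GOOD/BAD
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```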
Finally, I made a pipeline that chains the Generator subsystem, the heuristic filters, and the Critic subsystem.
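Sketched in code, reusing `valid_chunk`, `critic_model`, and `critic_tokenizer` from the sketches above, with `generate_text_flow` and `split_into_chunks` as hypothetical stand-ins for the generator and the chunker:

```python
def critic_score(chunk):
    """Probability the BERT critic assigns to the GOOD class."""
    enc = critic_tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=128)
    with torch.no_grad():
        logits = critic_model(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def paranoid_pipeline(n_chunks=100, threshold=0.9):
    """Generate -> heuristic filters -> keep only chunks the critic likes."""
    kept = []
    while len(kept) < n_chunks:
        raw = generate_text_flow()            # Generator subsystem (GPT)
        for chunk in split_into_chunks(raw):  # cut the flow into chunks
            if valid_chunk(chunk) and critic_score(chunk) >= threshold:
                kept.append(chunk)
    return kept
```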
Here is a short sample of the final results:
The huge blob of generated text can be found here:
https://github.com/altsoph/paranoid_transformer/blob/master/NaNoGenMo_50K_words_sample.txt