What's the training data formatting? #28
Comments
Training data for Word2Vec or for Seq2Seq? If you're asking about Seq2Seq, take a look at Seq2SeqXTrain.npy and Seq2SeqYTrain.npy. They are NumPy matrices with dimensions A x B, where A is the number of training examples and B is the sequence length. The first training input (x[0]) consists of the integers that represent the words of the input message (52780 for blank, 34931 for "the", etc.). y[0] contains the sequence of words that is the response.
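To make the A x B layout concrete, here is a minimal sketch of how sentences could be turned into such an integer matrix. The toy vocabulary, the `encode` helper, the padding id, and padding at the end of each row are all assumptions for illustration; they are not taken from the repo's code.

```python
import numpy as np

def encode(sentences, vocab, seq_len, pad_id):
    """Encode whitespace-tokenized sentences as an (A, B) integer matrix.

    A is the number of sentences and B is seq_len. Positions past the end
    of a sentence are filled with pad_id (an assumed padding scheme).
    """
    matrix = np.full((len(sentences), seq_len), pad_id, dtype=np.int32)
    for row, sentence in enumerate(sentences):
        # Truncate sentences that are longer than the sequence length.
        ids = [vocab[word] for word in sentence.split()][:seq_len]
        matrix[row, :len(ids)] = ids
    return matrix

# Hypothetical toy vocabulary; the real word ids come from the word list.
vocab = {"the": 0, "cat": 1, "sat": 2, "hello": 3}
pad_id = len(vocab)  # assumed id for the blank/padding token

x_train = encode(["hello the cat", "the cat sat"], vocab, seq_len=5, pad_id=pad_id)
print(x_train.shape)
print(x_train[0])
```

A matrix like this can be written with `np.save` and read back with `np.load`, which is how `.npy` files such as the two named above are normally handled.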
The Facebook chat data can no longer be converted because of big changes to the HTML layout, so I was curious how the Facebook data was parsed and stored, so that I can make a file manually and have createDataset.py process that instead. Thanks for the insight into how Seq2Seq works, though.
Oh, so you're referring to the HTML file that you get after downloading your data? If so, then you should probably check Dillon's repo.
Yeah, the repo isn't maintained anymore:
So I can't use my message data without knowing how to format it for createDataset.py.
Okay, thanks for the assistance. From the data you've shown me, I assume 08:00 doesn't change and probably isn't important? [dateTtime-08:00] username: message
Yeah, we end up ignoring everything in the brackets.
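For anyone hand-building a file in that `[dateTtime-08:00] username: message` shape, a minimal sketch of a parser that discards the bracketed part might look like the following. The regular expression, the `parse_line` helper, and the sample line are hypothetical; this is not the code createDataset.py actually uses.

```python
import re

# Assumed line shape: "[timestamp] username: message".
# The bracketed timestamp is matched but never returned.
LINE_RE = re.compile(r"^\[[^\]]*\]\s*(?P<user>[^:]+):\s*(?P<text>.*)$")

def parse_line(line):
    """Return (username, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("user").strip(), match.group("text").strip()

# Made-up sample line for illustration only.
sample = "[2017-05-01T12:30-08:00] alice: see you at 5:30"
print(parse_line(sample))
```

The username is taken up to the first colon after the closing bracket, so colons inside the message itself are left untouched.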
Should the oldest messages be at the top or at the bottom of the file?
I think it depends on how Facebook organizes the data when you download it. From what I remember, it wasn't necessarily chronological.
I want to use training data from a different platform, but I'm unsure what the formatting should be for the training data.
Is it a JSON format, etc.?
If so, can I get an example?
Thanks.