What's the training data formatting? #28

Open
CoderReece opened this issue May 25, 2018 · 9 comments

Comments

@CoderReece

I want to use training data from a different platform, but I'm unsure what format the training data should be in.

Is it JSON or something else?
If so, can I get an example?

Thanks.

@adeshpande3
Owner

adeshpande3 commented May 25, 2018

Training data for Word2Vec or for Seq2Seq? If you're asking about Seq2Seq, you should take a look at Seq2SeqXTrain.npy and Seq2SeqYTrain.npy. They are NumPy matrices with dimensions A x B, where A is the number of training examples and B is the sequence length.

image

You can see above that the first training input (x[0]) consists of the integers that represent the words (52780 for blank, 34931 for "the", etc) for the input message. y[0] will contain the sequence of words that is the response.
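For readers who cannot see the screenshot, here is a minimal sketch of how a matrix like this could be inspected. It builds a tiny stand-in array rather than loading the real Seq2SeqXTrain.npy, and the four-word vocabulary, the ids, and the `decode` helper are all invented for illustration; the real files use the project's own word list and much larger ids.

```python
import os
import tempfile

import numpy as np

# Stand-in for Seq2SeqXTrain.npy: A = 3 training examples, B = 5 tokens each.
# The vocabulary and ids below are made up; the real ids come from the
# project's own word list.
vocab = ["hello", "the", "world", "<blank>"]
x_demo = np.array([
    [0, 1, 2, 3, 3],
    [1, 2, 3, 3, 3],
    [0, 3, 3, 3, 3],
])


def decode(row, vocab):
    """Map one row of integer word ids back to a list of words."""
    return [vocab[i] for i in row]


# Round-trip through a .npy file, the same way the real matrices are stored.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Seq2SeqXTrain.npy")
    np.save(path, x_demo)
    x_train = np.load(path)

num_examples, seq_len = x_train.shape  # (A, B)
first_input = decode(x_train[0], vocab)
print(num_examples, seq_len)
print(first_input)
```

Pointing `np.load` at the real Seq2SeqXTrain.npy and Seq2SeqYTrain.npy should show the same A x B layout, with each row padded out to the sequence length by the blank token.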

@CoderReece
Author

The Facebook chat data can't be converted anymore because of big changes to the HTML layout, so I was curious how the Facebook data was parsed and stored. That way I can make a file manually and have createDataset.py process that instead.

Thanks for some insight on how Seq2Seq works though.

@adeshpande3
Owner

Oh, so you're referring to the HTML file that you get after downloading your data?

Like the below step?
image

If so, then you should probably check Dillon's repo.

@CoderReece
Author

Yeah the repo isn't maintained anymore:

UPDATE April 28th 2018: Facebook recently revamped the "download your data" feature to a much more usable state. This was probably in compliance with GDPR laws by the European Union, which will be enforced starting in May 2018. Facebook now allows you to download your message data in JSON format, which supersedes the purpose of this project.

In light of that this repository will no longer be maintained.

So I can't use my message data without knowing how to format it for createDataset.py.
Sorry I've not done a good job of explaining myself; I don't mean to take up your time.

@adeshpande3
Owner

It's cool dw, I think the main thing you have to do is take that JSON and change it to a TXT file with the following format.

image
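A rough sketch of that JSON-to-TXT step, under two assumptions that are not confirmed in this thread: that the Facebook export has a top-level `messages` list whose entries carry `sender_name`, `timestamp_ms`, and `content`, and that the target line looks like `[timestamp] username: message`. The `json_to_lines` helper is hypothetical and is not part of createDataset.py; check the field names against your own download before relying on it.

```python
import json
from datetime import datetime, timezone


def json_to_lines(export, chronological=True):
    """Turn one parsed message export into '[timestamp] name: text' lines.

    Field names (messages, sender_name, timestamp_ms, content) are assumed.
    """
    # Skip entries without text, e.g. photos or stickers.
    messages = [m for m in export.get("messages", []) if "content" in m]
    if chronological:
        messages = sorted(messages, key=lambda m: m["timestamp_ms"])
    lines = []
    for m in messages:
        stamp = datetime.fromtimestamp(m["timestamp_ms"] / 1000, tz=timezone.utc)
        # Collapse newlines so each message stays on a single line.
        text = " ".join(m["content"].split())
        lines.append(f"[{stamp.isoformat()}] {m['sender_name']}: {text}")
    return lines


# Made-up sample in the assumed layout, newest message first.
sample = json.loads("""
{
  "messages": [
    {"sender_name": "Bob", "timestamp_ms": 1527249660000, "content": "not much\\nyou?"},
    {"sender_name": "Alice", "timestamp_ms": 1527249600000, "content": "hi there"},
    {"sender_name": "Bob", "timestamp_ms": 1527249700000, "photos": []}
  ]
}
""")

lines = json_to_lines(sample)
print("\n".join(lines))
```

Writing the result out with `"\n".join(lines)` gives a plain TXT file with one message per line. Sorting by timestamp is optional here; pass `chronological=False` to keep the order of the export.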

@CoderReece
Author

Okay, thanks for the assistance.
So have I got this right? I just want to be sure before I go off and try it.

From the data you've shown me, I assume the 08:00 doesn't change and probably isn't important?

[dateTtime-08:00] username: message

@adeshpande3
Owner

Yeah we end up ignoring everything in the brackets.
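To make the "ignore everything in the brackets" rule concrete, here is a small sketch of a line parser for the `[dateTtime-08:00] username: message` layout. It is an illustration only, not the parsing code that createDataset.py actually uses; the `LINE_RE` pattern and `parse_line` helper are hypothetical names.

```python
import re

# Bracketed prefix (ignored), then the username up to the first colon,
# then the rest of the line as the message.
LINE_RE = re.compile(r"^\[[^\]]*\]\s*([^:]+):\s*(.*)$")


def parse_line(line):
    """Return (username, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


parsed = parse_line("[2018-05-25T09:30:00-08:00] Alice: hey: what's up?")
print(parsed)
```

Because the bracketed part is discarded, any placeholder timestamp should do when building the file by hand, as long as the brackets, the username, and the colon are present.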

@withthelemons

Should the oldest messages be at the top or at the bottom of the file?

@adeshpande3
Owner

I think it depends on how Facebook organizes the data when you download it. From what I remember, it wasn't necessarily chronological.

3 participants