What's the training data formatting? #28

CoderReece · 2018-05-25T15:06:11Z

I want to use training data from a different platform, but I'm unsure what format the training data should be in.

Is it JSON or something similar?
If so, can I get an example?

Thanks.

adeshpande3 · 2018-05-25T15:14:34Z

Training data for Word2Vec or for Seq2Seq? If you're asking about Seq2Seq, you should take a look at Seq2SeqXTrain.npy and Seq2SeqYTrain.npy. They are NumPy matrices with dimensions A x B, where A is the number of training examples and B is the sequence length.
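To make the A x B layout concrete, here is a minimal sketch that builds toy stand-ins for the two arrays. The word ids, the padding id, and the sizes are made up for illustration; the real values depend on the vocabulary in the repo. To inspect the actual files you would call `np.load("Seq2SeqXTrain.npy")` instead of building the arrays by hand.

```python
import numpy as np

# Toy stand-ins for Seq2SeqXTrain.npy / Seq2SeqYTrain.npy (all values are made up).
num_examples, seq_len = 3, 5  # "A" and "B" from the description above
pad_id = 0                    # hypothetical padding id; the real one depends on the vocabulary

x_train = np.full((num_examples, seq_len), pad_id, dtype=np.int32)
y_train = np.full((num_examples, seq_len), pad_id, dtype=np.int32)

# Row i of x_train holds the word ids of input message i,
# and row i of y_train holds the word ids of its response.
x_train[0, :2] = [7, 4]
y_train[0, :3] = [9, 2, 5]

print(x_train.shape)  # one row per training example, one column per sequence position
print(x_train[0], y_train[0])
```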

You can see above that the first training input (x[0]) consists of the integers that represent the words (52780 for blank, 34931 for "the", etc) for the input message. y[0] will contain the sequence of words that is the response.

CoderReece · 2018-05-25T15:46:49Z

The Facebook chat data can no longer be converted because of big changes to the HTML layout. So I was curious how the Facebook data was parsed and stored, so that I can build a file manually and have createDataset.py process that instead.

Thanks for some insight on how Seq2Seq works though.

adeshpande3 · 2018-05-25T15:58:38Z

Oh, so you're referring to the HTML file that you get after downloading your data?

Like the below step?

If so, you should probably check Dillon's repo.

CoderReece · 2018-05-25T16:05:06Z

Yeah the repo isn't maintained anymore:

UPDATE April 28th 2018: Facebook recently revamped the "download your data" feature to a much more usable state. This was probably in compliance with GDPR laws by the European Union, which will be enforced starting in May 2018. Facebook now allows you to download your message data in JSON format, which supercedes the purpose of this project.

In light of that this repository will no longer be maintained.

So I can't use my message data without knowing how to format it for createDataset.py.
Sorry, I've not done a good job of explaining myself. I don't mean to take up your time.

adeshpande3 · 2018-05-25T16:10:59Z

It's cool, dw. I think the main thing you have to do is take that JSON and convert it to a TXT file with the following format.

CoderReece · 2018-05-25T16:17:57Z

Okay, thanks for the assistance.
So have I got this right? I just want to be sure before I go off and try it.

From the data you've shown me, I assume the 08:00 doesn't change and probably isn't important?

[dateTtime-08:00] username: message

adeshpande3 · 2018-05-25T16:20:10Z

Yeah we end up ignoring everything in the brackets.
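For anyone doing the JSON-to-TXT conversion by hand, here is a rough sketch of what it could look like. It is not part of this repo. The field names `sender_name`, `timestamp_ms`, and `content` are assumptions about the Facebook JSON export and should be checked against your own download; `to_conversation_lines` is a hypothetical helper name. The output follows the `[timestamp] username: message` shape discussed above, and since the bracketed part is ignored, the exact timestamp formatting should not matter.

```python
from datetime import datetime, timezone


def to_conversation_lines(messages):
    """Turn message dicts into '[timestamp] username: message' lines.

    Assumes each dict has 'sender_name', 'timestamp_ms' and (optionally)
    'content' keys; these names are an assumption about the export format.
    """
    lines = []
    for msg in messages:
        text = msg.get("content")
        if not text:
            # Skip entries with no text (e.g. photos or stickers).
            continue
        stamp = datetime.fromtimestamp(msg["timestamp_ms"] / 1000, tz=timezone.utc)
        # Collapse newlines and repeated spaces so each message stays on one line.
        text = " ".join(text.split())
        lines.append(f"[{stamp.isoformat()}] {msg['sender_name']}: {text}")
    return lines


# Made-up sample data standing in for the "messages" list of an export file.
sample = [
    {"sender_name": "Alice", "timestamp_ms": 1527260771000, "content": "hi\nthere"},
    {"sender_name": "Bob", "timestamp_ms": 1527260800000},  # no text content
    {"sender_name": "Bob", "timestamp_ms": 1527260830000, "content": "hello"},
]
lines = to_conversation_lines(sample)
print("\n".join(lines))
```

With a real export you would load the file with `json.load`, pass its message list to the helper, and write the joined lines to a `.txt` file.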

withthelemons · 2018-06-25T21:04:26Z

Should oldest messages be on top or on the bottom of file?

adeshpande3 · 2018-06-25T23:02:36Z

I think it depends on how Facebook organizes the data when you download it. From what I remember, it wasn't necessarily chronological.
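If you want a predictable order regardless of how the export is organized, one option is to sort the lines by their bracketed timestamp before saving the file. This is only a sketch: it assumes every line starts with an ISO-style timestamp in brackets (for example `2018-05-25T16:17:57-08:00`), and whether createDataset.py actually needs chronological order is not confirmed in this thread. `sort_lines_oldest_first` is a hypothetical helper name.

```python
from datetime import datetime


def sort_lines_oldest_first(lines):
    """Sort '[timestamp] username: message' lines from oldest to newest."""
    def stamp(line):
        # Parse the text between the leading '[' and the first ']'.
        return datetime.fromisoformat(line[1:line.index("]")])
    return sorted(lines, key=stamp)


# Made-up sample lines, deliberately out of order.
chat = [
    "[2018-05-25T16:17:57-08:00] Bob: second message",
    "[2018-05-25T15:06:11-08:00] Alice: first message",
]
ordered = sort_lines_oldest_first(chat)
print("\n".join(ordered))
```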

adeshpande3 mentioned this issue Oct 5, 2019

Support for parsing custom corpus #47

Open
