Noise in image to text data-set. #1

codeorbit · 2016-03-29T15:52:13Z

@prodicus The dataset containing text extracted from images is not cleaned (i.e. it contains lots of special characters and numbers) and is not normalized either.
For example, sambhar vada, vada sambar, and vada with sambhar are all the same item, but they appear as different entries in the dataset.

tasdikrahman · 2016-03-29T16:25:45Z

@codeorbit That would be a problem.

But manually cleaning the dataset for cases like sambhar vada would be a pain. Skim through the dataset once; there are some properly OCR'd menus in there.

Does nltk (or any other module) have something which normalizes the data the way we want?

codeorbit · 2016-03-29T16:32:12Z

@prodicus There is no such module for that in nltk. Everyone makes their own custom algorithm for normalization based on some pattern in the dataset.
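
As a rough illustration of what such a custom normalization could look like for the sambhar vada case, here is a minimal sketch. The `VARIANTS` and `FILLERS` tables and the `normalize_item` function are hypothetical names made up for this example; real tables would have to be built from patterns actually seen in the dataset, and sorting tokens assumes word order never distinguishes two dishes, which may not hold everywhere.

```python
import re

# Hypothetical spelling-variant map; a real one would be derived from the dataset.
VARIANTS = {"sambar": "sambhar"}

# Hypothetical filler words assumed not to change which dish is meant.
FILLERS = {"with"}


def normalize_item(text):
    """Return a canonical key for a menu item name (illustrative only)."""
    # Lowercase, then replace digits and special characters with spaces.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Map known spelling variants onto a single spelling.
    tokens = [VARIANTS.get(token, token) for token in cleaned.split()]
    # Drop filler words.
    tokens = [token for token in tokens if token not in FILLERS]
    # Sort so that word order does not produce different keys.
    return " ".join(sorted(tokens))


if __name__ == "__main__":
    for raw in ["sambhar vada", "vada sambar", "vada with sambhar"]:
        print(raw, "->", normalize_item(raw))
```

With these assumed tables, all three spellings from the first comment map to the same key, so they could be grouped as one item.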

tasdikrahman · 2016-03-29T17:09:36Z

@codeorbit So would your models not work with this data?

codeorbit · 2016-03-29T17:15:21Z

Not just mine; no model will work with noisy data.

tasdikrahman · 2016-03-29T17:31:48Z

What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use them.

codeorbit · 2016-03-29T18:17:24Z

@prodicus Cool .. we can go for it 👍

tasdikrahman added the data cleaning label Mar 29, 2016

codeorbit removed the data cleaning label Mar 29, 2016
