Noise in image to text data-set. #1

Open
codeorbit opened this issue Mar 29, 2016 · 6 comments

Comments

@codeorbit

@prodicus The dataset of text extracted from the images has not been cleaned (it contains lots of special characters and numbers) and is not normalized either.
For example, "sambhar vada", "vada sambar", and "vada with sambhar" all refer to the same dish, but they appear as different entries in the dataset.
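
A minimal sketch of the cleaning step, assuming the goal is simply to drop digits and special characters from each OCR'd line. The helper name `clean_ocr_line` is hypothetical, and keeping only ASCII letters is an assumption that may be too aggressive for menus with accented or non-Latin names.

```python
import re


def clean_ocr_line(line: str) -> str:
    """Lowercase a line and keep only ASCII letters separated by single spaces."""
    # Replace every character that is not a lowercase letter or whitespace
    # with a space, so digits, prices, and stray symbols disappear.
    letters_only = re.sub(r"[^a-z\s]", " ", line.lower())
    # Collapse the runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", letters_only).strip()


if __name__ == "__main__":
    print(clean_ocr_line("  Sambhar Vada ## 45/-"))
```

This only removes noise. It does not make "sambhar vada" and "vada sambar" equal, which is a separate normalization problem.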

@tasdikrahman
Member

@codeorbit That would be a problem.

But it would be a pain to manually clean the dataset for cases like sambhar vada. Skim through the dataset once; there are some properly OCR'd menus in there.

Does nltk (or any other module) have something which normalizes the data the way we want?

@codeorbit
Author

@prodicus nltk has no module for that. Everyone writes their own custom normalization algorithm based on patterns in their dataset.
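
As a rough illustration of what such a custom rule might look like for the dish-name variants mentioned earlier in this thread, here is a sketch. The lookup tables and the function name `normalize_dish` are hypothetical; a real project would have to build them from the patterns in its own data.

```python
# Hypothetical lookup tables, filled in by hand for this example only.
SPELLING_VARIANTS = {"sambar": "sambhar"}  # variant spelling -> canonical spelling
CONNECTOR_WORDS = {"with", "and"}          # words that carry no dish information


def normalize_dish(name: str) -> str:
    """Map word-order and spelling variants of a dish name to one key."""
    tokens = name.lower().split()
    # Drop connector words, then rewrite known spelling variants.
    tokens = [SPELLING_VARIANTS.get(t, t) for t in tokens if t not in CONNECTOR_WORDS]
    # Sort the tokens so that word order no longer matters.
    return " ".join(sorted(tokens))


if __name__ == "__main__":
    for raw in ("sambhar vada", "vada sambar", "vada with sambhar"):
        print(raw, "->", normalize_dish(raw))
```

Sorting the tokens is a deliberate simplification: it merges the three variants above, but it would also merge any two dishes whose names differ only in word order.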

@tasdikrahman
Member

@codeorbit So would your models not work with this data?

@codeorbit
Author

Not only mine. No model will work with noisy data.

@tasdikrahman
Member

What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use them.

@codeorbit
Author

@prodicus Cool .. we can go for it 👍


2 participants