Noise in image to text data-set. #1
Comments
@codeorbit That would be a problem. But it would be a pain to manually clean the dataset for cases like sambhar vada. Skim through the dataset once; there are some properly OCR'd menus in there. Does
@prodicus There is no such module for that in NLTK. Everyone writes their own custom normalization algorithm based on the patterns in their dataset.
@codeorbit So would your models not work with this data?
Not only mine; no model will work well with noisy data.
What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use them.
@prodicus Cool, we can go for it 👍
@prodicus The dataset that contains text extracted from images is not cleaned (i.e. it contains lots of special characters and numbers) and is not normalized either.
For example, "sambhar vada", "vada sambar", and "vada with sambhar" all refer to the same item, but they appear as different entries in the dataset.
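To make the problem concrete, here is a minimal sketch of the kind of custom normalization discussed in this thread. It is not code from this project: the function name `normalize_item`, the variant map `VARIANTS`, and the filler list `FILLERS` are all hypothetical, and a real version would need to derive those tables from patterns actually seen in the dataset.

```python
import re

# Hypothetical spelling-variant map; a real one would be built from the dataset.
VARIANTS = {"sambar": "sambhar"}

# Hypothetical filler words assumed not to change which dish is meant.
FILLERS = {"with", "and"}


def normalize_item(text: str) -> str:
    """Return a canonical key for a menu item (illustrative sketch only)."""
    # Lowercase, then replace everything except ASCII letters and whitespace
    # with a space. This drops digits, prices, and stray OCR symbols.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Map known spelling variants onto a single spelling.
    tokens = [VARIANTS.get(tok, tok) for tok in cleaned.split()]
    # Drop filler words.
    tokens = [tok for tok in tokens if tok not in FILLERS]
    # Sort so that word order does not matter.
    return " ".join(sorted(tokens))


for raw in ["sambhar vada", "vada sambar", "vada with sambhar", "Sambhar Vada ** 45/-"]:
    print(repr(raw), "->", repr(normalize_item(raw)))
```

All four inputs map to the same key, `sambhar vada`. Sorting the tokens is a deliberate simplification: it would also merge items whose meaning depends on word order, and the ASCII-only filter would strip non-English text, so both choices would need checking against the real menus.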