Page segmentation deletes some parts of the text, how to avoid? #259
Comments
Hey @franvillamil, could you please provide a whole example page? It is hard to reproduce the error with an image of a single column.
Several 1s together might appear like a black line?
During page segmentation there is also a step which deletes small components. However, as @lehzwo already mentioned, I also cannot reproduce your issue with the images you provided, and I don't know whether you use any special parameters in the call. @franvillamil, if you provide more information so that we can reproduce the issue, we can look at this again; otherwise I suggest closing this issue.
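To make the "deletes small components" step concrete, here is a minimal sketch of that kind of filter. It is not ocropy's actual code: the function name `remove_small_components` and the `min_size` threshold are hypothetical, and it uses plain 4-connectivity on a 0/1 grid. It only shows how an isolated, small blob of ink can vanish while a larger glyph survives.

```python
from collections import deque


def remove_small_components(image, min_size):
    """Drop 4-connected ink components with fewer than min_size pixels.

    image is a list of rows of 0/1 ints; a new image is returned.
    (Hypothetical helper for illustration, not part of ocropy.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep only components that reach the size threshold.
            if len(component) >= min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 1
    return cleaned


# An 8-pixel ring (kept) next to a 2-pixel stroke (dropped).
page = [
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_components(page, min_size=4)
```

If a digit breaks into several tiny pieces after binarization, each piece is judged on its own, which is one way a thin stroke could be lost.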
I also faced this issue. Here are some sample images.
Commands used to process the image:
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-nlbin /home/chillaranand/projects/ocr/data/vishadam-021.png -o output -n
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-gpageseg output/????.bin.png -n
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-rpred -Q 4 -m /home/chillaranand/projects/ocr/ocropy/models/te.pyrnn.gz output/????.bin.png -n
@ChillarAnand Okay, it looks like the parts below the baseline (descenders?) in your examples are larger than expected. The computed lines then look like this:
The lines from the descender part are then neglected. Try adjusting the vscale/scale manually, e.g.
which should work well, except for the page number on the top left (but I think this is a known issue).
Thank you @zuphilip. With
The
I tried the above image with different --vscale values. With
_lineseeds.png seems to be identifying all the lines.
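Trying several `--vscale` values by hand gets tedious, so a small driver script can help. The sketch below only builds the command lines, reusing the `ocropus-gpageseg`, `-n`, and `output/????.bin.png` pieces from the commands earlier in this thread. The script path and the list of `--vscale` values are placeholders, not recommendations, and `build_gpageseg_commands` and `run_all` are hypothetical helper names.

```python
import glob
import subprocess


def build_gpageseg_commands(script, images, vscales):
    """Return one argv list per --vscale value; nothing is executed here."""
    commands = []
    for vscale in vscales:
        argv = ["python2", script, "--vscale", str(vscale), "-n"] + list(images)
        commands.append(argv)
    return commands


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for argv in commands:
        subprocess.check_call(argv)


# Placeholder values: adjust the script path and the candidates to your setup.
SCRIPT = "ocropus-gpageseg"
VSCALES = [1.0, 1.5, 2.0]
images = sorted(glob.glob("output/????.bin.png"))
commands = build_gpageseg_commands(SCRIPT, images, VSCALES)
# run_all(commands)  # uncomment to actually run the segmentation
```

As far as I can tell, each run writes to the same output location, so copy the results aside between runs if you want to compare them.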
I am using ocropy to extract data from old documents that list electoral results. These are big pages arranged in up to 4 columns, and I have taken a screenshot of each column. See a sample of the raw image (I know the quality is very bad, but the choice is basically between running OCR on this and copying it entirely by hand): https://user-images.githubusercontent.com/3774527/32782400-9111bab6-c948-11e7-9ea6-6266cc828627.png
To avoid problems, I'm trying to make ocropy read the text as a one-column text (see issue #240), after deleting every and so far it's more or less going well. In some cases, however, ocropy deletes some parts (mainly numbers) when it does the segmentation. See below two screenshots of the original binary file and the segmentation output (the .nrm.png files):
Missing some part of '207', original binary:
Image after segmentation:
The '3' is completely removed, original binary:
After segmentation:
In some cases (e.g. when there are a few 1s stacked on top of each other, see below), it seems ocropy thinks these are black lines delimiting columns and tries to ignore them. I don't want it to do this, since I'm already removing in GIMP every black line that could be mistaken for a column separator.
Several 1s together might appear like a black line?:
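To illustrate the suspicion, here is a toy sketch of how stacked 1s could end up looking like one long vertical stroke. It assumes the detection first thickens the ink vertically and then looks for long unbroken runs. I have not confirmed that gpageseg works this way; the helper names, the `reach` value, and the pixel counts are all made up for the example.

```python
def dilate_vertically(column, reach):
    """Mark a pixel as ink if any ink pixel lies within `reach` rows of it."""
    n = len(column)
    return [
        1 if any(column[j] for j in range(max(0, i - reach), min(n, i + reach + 1))) else 0
        for i in range(n)
    ]


def longest_run(column):
    """Length of the longest unbroken run of ink pixels."""
    best = current = 0
    for pixel in column:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


# One pixel column through three stacked "1" digits:
# 6 ink rows per digit, 2 blank rows between text lines.
ones = ([1] * 6 + [0] * 2) * 3
raw_run = longest_run(ones)                                  # 6
bridged_run = longest_run(dilate_vertically(ones, reach=1))  # 23
```

Before thickening, the longest stroke is a single digit; afterwards the small gaps between lines are bridged and the three digits read as one tall line.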
Solution?
Does anyone know if there is any piece of code I can modify to avoid this? I've been looking into the gpageseg code but haven't found anything.
Your Environment