Fix OCR on the server side.

For some reason, the behavior of ocrmypdf seems to have changed. Previously we expected it to produce the .txt file directly; it now generates a PDF with the OCR-ed text overlaid on it. This commit fixes the issue by overwriting the original scanned PDF with the text-overlaid PDF and then running the usual pdftotext on the new PDF.
BlueBrain · Aug 21, 2018 · 710d283
1 parent 5961509
Showing 1 changed file with 2 additions and 1 deletion.
diff --git a/nat/restServer.py b/nat/restServer.py
@@ -53,7 +53,8 @@ def runOCR(fileName):
         app.OCRLock.release()    
 
         # Run OCR
-        run_ocrmypdf(fileName + ".pdf", fileName + ".txt")
+        run_ocrmypdf(fileName + ".pdf", fileName + ".pdf")
+        check_call(['pdftotext', '-enc', 'UTF-8', fileName + ".pdf", fileName + ".txt"])
 
         acquireLockWithTimeout()
         del app.OCRFiles[app.OCRFiles.index(fileName)]
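
The two-step flow described in the commit message can be sketched as a standalone helper. This is only an illustration, not the code in nat/restServer.py: `build_commands` and `ocr_to_text` are hypothetical names, and the sketch assumes the `ocrmypdf` and `pdftotext` command-line tools are on the PATH and that `ocrmypdf` accepts the same path as both input and output, as the commit relies on.

```python
import subprocess


def build_commands(base_name):
    """Return the argv lists that turn base_name.pdf into base_name.txt.

    Hypothetical helper: it mirrors the commit's two steps, but it is
    not the repository's run_ocrmypdf wrapper.
    """
    pdf = base_name + ".pdf"
    txt = base_name + ".txt"
    return [
        # Step 1: overwrite the scan with a PDF carrying the OCR text layer.
        ["ocrmypdf", pdf, pdf],
        # Step 2: extract that text layer as UTF-8 plain text.
        ["pdftotext", "-enc", "UTF-8", pdf, txt],
    ]


def ocr_to_text(base_name, run=subprocess.check_call):
    """Run both steps in order and return the path of the text file.

    `run` is injectable so the sequence can be checked without the
    external tools; by default each command raises CalledProcessError
    on a non-zero exit status.
    """
    for cmd in build_commands(base_name):
        run(cmd)
    return base_name + ".txt"
```

Calling `ocr_to_text("scan")` would run `ocrmypdf` on `scan.pdf` in place and then write `scan.txt`. Because `check_call` raises on failure, `pdftotext` is never run against a PDF that failed OCR.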