Implementation of Multi-Modal Page Stream Segmentation with CNN Networks & Transformers as a Python Service
Additional Resources
- model_training contains jupyter notebooks to train the CNN models
- document_stream_builder contains a script to build document streams from pdf files
- Poppler (PDF Rendering Library)
brew install poppler
- CMake
brew install cmake
- tesseract
brew install tesseract
- python
- poetry (Dependency Management)
Download english fasttext wiki word vectors under this link and put wiki.en.bin in ./app/pss/models/.
poetry install
uvicorn app.main:app --reload
OpenAPI Documentation http://localhost:8000/openapi.json
SwaggerUI Instance http://localhost:8000/docs
For all routes the model_type can always be single_page or prev_page.
single_page selects the model, that was trained only with the current page as input data.
prev_page selects the model, that was trained with a pair of consecutive pages as input data.
The PDF Documents for all routes need to be sent at the request body as form-data with the key being file
POST localhost:8000/pss/textModel/{model_type}/processDocument/
Process PDF Documents with models that only consider the text data.
Model Performance
Model | Accuracy | Kappa |
---|---|---|
single_page | 0,826255 | 0,627790 |
prev_page | 0,830116 | 0,641725 |
POST localhost:8000/pss/bertTextModel/processDocument/
Process PDF Documents with a transformer bert model, that only considers the text data.
Model | Accuracy | Kappa |
---|---|---|
bert-based-uncased | 0,915058 | 0,824828 |
POST localhost:8000/pss/imageModel/{model_type}/processDocument/
Process PDF Documents with models that only consider the image data.
Model Performance
Model | Accuracy | Kappa |
---|---|---|
single_page | 0,926641 | 0,847236 |
prev_page | 0,934363 | 0,863316 |
POST localhost:8000/pss/combinedModels/{text_model_type}/{image_model_type}/processDocument/
Process PDF Documents with text and image models and combine the output for a multi-modal PSS prediction.
Text Model | Image Model | Accuracy | Kappa |
---|---|---|---|
single_page | single_page | 0,926641 | 0,847236 |
single_page | prev_page | 0,942085 | 0,879059 |
prev_page | single_page | 0,918919 | 0,830682 |
prev_page | prev_page | 0,938224 | 0,871176 |
POST localhost:8000/pss/combinedModelsBert/{image_model_type}/processDocument/
Process PDF Documents with bert text and image models and combine the output for a multi-modal PSS prediction.
The bert based model is hosted at the huggingface model repository
Text Model | Image Model | Accuracy | Kappa |
---|---|---|---|
bert-based-uncased | single_page | 0,926641 | 0,850575 |
bert-based-uncased | prev_page | 0,934363 | 0,866669 |
This repository builds onto the works of Wiedemann & Heyer 2019:
Wiedemann, G., Heyer, G. Multi-modal page stream segmentation with convolutional neural networks. Lang Resources & Evaluation (2019). https://doi.org/10.1007/s10579-019-09476-2
The Model Training was performed with the Tobacco800 Dataset: (Model Performance was measured with a test subset)
David Doermann, Tobacco 800 Dataset (Tobacco800) http://tc11.cvc.uab.es/datasets/Tobacco800_1