This is a Python library to perform document classification for OCCRP Aleph. It allows you to train and test a classifier that can predict the type of a document.
git clone <repo-url>
cd <repo-directory>
In config.py:
ROOT_PATH = "/data" if IN_DOCKER else "/data/dssg/occrp/data"
Replace /data/dssg/occrp/data with any path on your local filesystem.
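For example, if your data lives under /home/<user>/occrp-data, the edited line might look like the sketch below (the local path is purely illustrative; IN_DOCKER is already defined in config.py):

# config.py -- sketch only: replace the local path with your own data directory
ROOT_PATH = "/data" if IN_DOCKER else "/home/<user>/occrp-data"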
You also need to replace that path in the Docker volumes defined in run_gpu_model.sh and in run_models_in_sequence.sh:
# replace /data/dssg/occrp/data/ with the selected ROOT_PATH
-v /data/dssg/occrp/data:/data \
-v /data/dssg/occrp/data/:/data/dssg/occrp/data/ \
Once you have selected a ROOT_PATH, all the subdirectories needed to run the project can be created with the init_data_structure.sh script, passing the selected ROOT_PATH as an argument:
./init_data_structure.sh <ROOT_PATH>
The repository should contain a data.zip file in order for the init script to run.
Once the script has finished, running tree -L 3 in the ROOT_PATH should show the following folder structure:
ROOT_PATH
├── input
│   ├── document_classification_clean
│   │   ├── bank-statements
│   │   ├── company-registry
│   │   ├── contracts
│   │   │   ...
│   └── rvl-cdip
│       ├── images
│       └── labels
├── logs
├── mlruns
├── output
│   ├── document_classifier
│   ├── feature_extraction
│   └── firstpage_classification
└── processed_clean
    ├── document_classifier
    │   ├── bank-statements
    │   ├── company-registry
    │   ├── contracts
    │   │   ...
    └── firstpage_classifier
        ├── firstpages
        └── middlepages_1233
That's it! You have now set up all the data needed to train models.
# install the dependencies
pipenv install
# activate the environment
pipenv shell
# check if the CLI is working
python src/main.py --help
This project uses Python 3.8 and pipenv to manage its dependencies. The list of requirements is available in the Pipfile. To install the requirements:
# install the dependencies
pipenv install
# activate the environment
pipenv shell
After this, you should be able to run any of the commands available in the CLI.
The Dockerfile allows you to run the CLI in a Docker container with Python. See the Docker README for instructions on how to use it.
The gpu.Dockerfile allows you to run the CLI in a Docker container with GPU support out of the box using the NVIDIA drivers. See the GPU README for instructions on how to use it.
Basic command line usage:
python src/main.py [OPTIONS] COMMAND [ARGS]...
More information about the available commands can be obtained with python src/main.py --help or in the Command Line Interface README.
This project comes with default trained classifier models that can be used out of the box. To use them, select an INPUT_PATH directory with documents to classify and run:
python src/main.py predict INPUT_PATH OUTPUT_PATH
A JSON file with the prediction results will be saved to
OUTPUT_PATH/prediction__%Y_%m_%d_%H_%M_%S.json
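To inspect the results programmatically, the output file can be read like any other JSON file. A minimal sketch, assuming OUTPUT_PATH is the directory passed to the predict command (the JSON schema itself depends on the classifier output and is not shown here):

import glob
import json
import os

output_path = "/path/to/OUTPUT_PATH"  # the directory passed to `predict`
# prediction files are timestamped, so the last one in sorted order is the most recent
latest = sorted(glob.glob(os.path.join(output_path, "prediction__*.json")))[-1]

with open(latest) as f:
    predictions = json.load(f)

print(predictions)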
More details can be found in the FAQ
Find an experiment in the UI via the hash
If you have an MLflow hash, e.g. from the config, and want to find it in the UI:
- Navigate to the mlruns directory and run find -name <hash>. This returns something like ./1/0a5006859f154daebc7a697d190f7a2. The first number is the experiment_id, the second one is the run_id.
- Navigate to the MLflow UI by replacing the experiment_id and run_id in:
  http://127.0.0.1:5000/#/experiments/<experiment_id>/runs/<run_id>
Find more information about how to use MLflow in the Cheatsheet.
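If you prefer not to browse the filesystem, the MLflow Python API can also fetch a run directly by its run_id. A minimal sketch, assuming the tracking URI points at the mlruns directory under your ROOT_PATH (the path below is an example):

import mlflow

# point MLflow at the local tracking directory (adjust to your ROOT_PATH)
mlflow.set_tracking_uri("file:///data/dssg/occrp/data/mlruns")

run = mlflow.get_run("<run_id>")
print(run.info.experiment_id)  # the experiment the run belongs to
print(run.data.params)         # parameters logged for the run
print(run.data.metrics)        # metrics logged for the run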
We used Visual Studio Code for development, together with the ToDo Tree extension by Gruntfuggly. We configured it to distinguish TODOs from ENHANCEMENTs: TODO means that important work is still required, while ENHANCEMENT means that an improvement is desired but optional.
To reproduce this setup, install ToDo Tree in VS Code and add the following to your .vscode/settings.json (which is not in the repository):
"todo-tree.highlights.customHighlight": {
"ENHANCEMENT": {
"icon": "note",
"foreground": "black",
"background": "lightgreen",
"iconColour": "gray",
},
},
"todo-tree.general.tags": [
"TODO",
"ENHANCEMENT",
],