This is a Python library to perform document classification for OCCRP Aleph. It allows you to train and test a classifier that can predict the type of a document.
git clone <repo-url>
cd <repo-directory>
In config.py:
ROOT_PATH = "/data" if IN_DOCKER else "/data/dssg/occrp/data"
Replace /data/dssg/occrp/data with any path on your local filesystem.
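For example, if your data lives under /home/<user>/occrp-data, the edited line might look like the sketch below (the local path is purely illustrative; IN_DOCKER is already defined in config.py):

# config.py -- sketch only: replace the local path with your own data directory
ROOT_PATH = "/data" if IN_DOCKER else "/home/<user>/occrp-data"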
You also need to replace that path in the Docker volumes defined in run_gpu_model.sh and in run_models_in_sequence.sh:
# replace /data/dssg/occrp/data/ with the selected ROOT_PATH
-v /data/dssg/occrp/data:/data \
-v /data/dssg/occrp/data/:/data/dssg/occrp/data/ \
Once you have selected a ROOT_PATH, all the subdirectories needed to run the project can be created with the init_data_structure.sh script, passing the selected ROOT_PATH as an argument:
./init_data_structure.sh <ROOT_PATH>
The repository should contain a data.zip file in order for the init script to run.
Once the script has finished, running tree -L 3 in the ROOT_PATH should show the following folder structure:
ROOT_PATH
├── input
│   ├── document_classification_clean
│   │   ├── bank-statements
│   │   ├── company-registry
│   │   ├── contracts
│   │   │   ...
│   └── rvl-cdip
│       ├── images
│       └── labels
├── logs
├── mlruns
├── output
│   ├── document_classifier
│   ├── feature_extraction
│   └── firstpage_classification
└── processed_clean
    ├── document_classifier
    │   ├── bank-statements
    │   ├── company-registry
    │   ├── contracts
    │   │   ...
    └── firstpage_classifier
        ├── firstpages
        └── middlepages_1233
That's it! You have now set up all the data needed to train models.
# install the dependencies
pipenv install
# activate the environment
pipenv shell
# check if the CLI is working
python src/main.py --help
This project uses Python 3.8 and pipenv to manage its dependencies. The list of requirements is available in the Pipfile. To install the requirements:
# install the dependencies
pipenv install
# activate the environment
pipenv shell
After this, you should be able to run any of the commands available in the CLI.
The Dockerfile allows you to run the CLI in a Docker container with Python. See the Docker README for instructions on how to use it.
The gpu.Dockerfile allows you to run the CLI in a Docker container with GPU support out of the box using the NVIDIA drivers. See the GPU README for instructions on how to use it.
Basic command line usage:
python src/main.py [OPTIONS] COMMAND [ARGS]...
More information about the available commands can be obtained with python src/main.py --help or in the Command Line Interface README.
This project comes with default trained classifier models that can be used out of the box. To use them, select an INPUT_PATH directory with documents to classify and run:
python src/main.py predict INPUT_PATH OUTPUT_PATH
A JSON file with the prediction results will be saved to
OUTPUT_PATH/prediction__%Y_%m_%d_%H_%M_%S.json
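To inspect the results programmatically, the output file can be read like any other JSON file. A minimal sketch, assuming OUTPUT_PATH is the directory passed to the predict command (the JSON schema itself depends on the classifier output and is not shown here):

import glob
import json
import os

output_path = "/path/to/OUTPUT_PATH"  # the directory passed to `predict`
# prediction files are timestamped, so the last one in sorted order is the most recent
latest = sorted(glob.glob(os.path.join(output_path, "prediction__*.json")))[-1]

with open(latest) as f:
    predictions = json.load(f)

print(predictions)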
More details can be found in the FAQ
Find an experiment in the UI via the hash
If you have an MLflow hash, e.g. from the config, and want to find it in the UI:
- Navigate to the mlruns directory and run find -name <hash>. This returns something like ./1/0a5006859f154daebc7a697d190f7a2. The first number is the experiment_id, the second one is the run_id.
- Navigate to the MLflow UI by replacing the experiment_id and run_id in:
  http://127.0.0.1:5000/#/experiments/<experiment_id>/runs/<run_id>
Find more information about how to use MLflow in the Cheatsheet.
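If you prefer not to browse the filesystem, the MLflow Python API can also fetch a run directly by its run_id. A minimal sketch, assuming the tracking URI points at the mlruns directory under your ROOT_PATH (the path below is an example):

import mlflow

# point MLflow at the local tracking directory (adjust to your ROOT_PATH)
mlflow.set_tracking_uri("file:///data/dssg/occrp/data/mlruns")

run = mlflow.get_run("<run_id>")
print(run.info.experiment_id)  # the experiment the run belongs to
print(run.data.params)         # parameters logged for the run
print(run.data.metrics)        # metrics logged for the run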
We used Visual Studio Code for development, together with the ToDo Tree extension by Gruntfuggly. We configured it to distinguish TODOs from ENHANCEMENTs: TODO means that important work is still required, while ENHANCEMENT means that an improvement is desired but optional.
To reproduce this setup, install ToDo Tree in VS Code and add the following to your .vscode/settings.json (which is not in the repository):
"todo-tree.highlights.customHighlight": {
"ENHANCEMENT": {
"icon": "note",
"foreground": "black",
"background": "lightgreen",
"iconColour": "gray",
},
},
"todo-tree.general.tags": [
"TODO",
"ENHANCEMENT",
],