alephdata/document-categorization

occrp-document-classifier

This is a Python library for document classification in OCCRP Aleph. It lets you train and test a classifier that predicts the type of a document.

Quick Start

1. Clone the repo

git clone <repo-url>
cd <repo-directory>

2. Select a root path

In config.py:

ROOT_PATH = "/data" if IN_DOCKER else "/data/dssg/occrp/data"

Replace /data/dssg/occrp/data with any path on your local filesystem.

You also need to replace that path in the Docker volumes defined in run_gpu_model.sh and in run_models_in_sequence.sh:

    # replace /data/dssg/occrp/data/ with the selected ROOT_PATH
    -v /data/dssg/occrp/data:/data \
    -v /data/dssg/occrp/data/:/data/dssg/occrp/data/ \
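If you prefer not to edit both scripts by hand, the substitution can be sketched in Python. This is only an illustration: the helper name replace_root_path and the example path /home/me/occrp-data are hypothetical, and replacing every occurrence of the default path (including the container side of the second volume) is an assumption based on the comment above the volume lines.

```python
# Default path that appears in config.py and in the Docker volume definitions.
DEFAULT_ROOT = "/data/dssg/occrp/data"


def replace_root_path(script_text, new_root, old_root=DEFAULT_ROOT):
    """Return script_text with every occurrence of old_root replaced by new_root.

    Hypothetical helper: it does not touch any file, it only transforms text.
    """
    # Drop a trailing slash so "/x/" and "/x" produce the same result.
    return script_text.replace(old_root, new_root.rstrip("/"))


# The two volume lines as they appear in the scripts.
example = (
    "    -v /data/dssg/occrp/data:/data \\\n"
    "    -v /data/dssg/occrp/data/:/data/dssg/occrp/data/ \\\n"
)

updated = replace_root_path(example, "/home/me/occrp-data")
print(updated)
```

To apply it to a real script, read the file contents, pass them through the function, review the result, and write it back yourself.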

3. Create the folder structure using the script

Once you have selected a ROOT_PATH, you can create all the subdirectories needed to run the project with the init_data_structure.sh script, passing the selected ROOT_PATH as an argument:

./init_data_structure.sh <ROOT_PATH>

The init script needs a data.zip file to be present in the repository.

Once the script has finished, running tree -L 3 in the ROOT_PATH should show the following folder structure:

ROOT_PATH
├── input
│   ├── document_classification_clean
│   │   ├── bank-statements
│   │   ├── company-registry
│   │   ├── contracts
│   │   │   ...
│   └── rvl-cdip
│       ├── images
│       └── labels
├── logs
├── mlruns
├── output
│   ├── document_classifier
│   ├── feature_extraction
│   └── firstpage_classification
└── processed_clean
    ├── document_classifier
    │   ├── bank-statements
    │   ├── company-registry
    │   ├── contracts
    │   │   ...
    └── firstpage_classifier
        ├── firstpages
        └── middlepages_1233

That's it! You have now set up all the data needed to train models.
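As a quick sanity check, the layout can also be verified from Python. The sketch below is not part of the repository: the function name missing_dirs is hypothetical, and the list only contains directories that the tree listing shows in full (the per-category folders and the numbered middlepages folder are left out on purpose).

```python
from pathlib import Path

# Directories taken from the tree listing (only the ones shown in full).
EXPECTED_DIRS = [
    "input/document_classification_clean",
    "input/rvl-cdip/images",
    "input/rvl-cdip/labels",
    "logs",
    "mlruns",
    "output/document_classifier",
    "output/feature_extraction",
    "output/firstpage_classification",
    "processed_clean/document_classifier",
    "processed_clean/firstpage_classifier/firstpages",
]


def missing_dirs(root_path):
    """Return the expected subdirectories that do not exist under root_path."""
    root = Path(root_path)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Example usage (replace the argument with your own ROOT_PATH):
# print(missing_dirs("/data/dssg/occrp/data") or "All expected folders exist.")
```

An empty list means every listed folder was found; anything else is the set of folders the init script did not create.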

4. Install requirements

# install the dependencies
pipenv install

# activate the environment
pipenv shell

5. Run the CLI

# check if the CLI is working
python src/main.py --help

Installation

Python CLI

This project uses Python 3.8 and pipenv to manage its dependencies. The list of requirements is available in the Pipfile. To install the requirements:

# install the dependencies
pipenv install

# activate the environment
pipenv shell

After this, you should be able to run any of the commands available in the CLI.

Docker

The Dockerfile lets you run the CLI in a Docker container with Python. See the Docker README for instructions on how to use it.

Docker GPU

The gpu.Dockerfile lets you run the CLI in a Docker container with GPU support out of the box, using the NVIDIA drivers. See the GPU README for instructions on how to use it.

Usage

Basic command line usage:

python src/main.py [OPTIONS] COMMAND [ARGS]...

For more information about the available commands, run python src/main.py --help or see the Command Line Interface README.

Prediction

This project ships with default trained classifier models that can be used out of the box. To use them, select an INPUT_PATH directory containing the documents to classify and run:

python src/main.py predict INPUT_PATH OUTPUT_PATH

A JSON file with the prediction results will be saved, with a timestamp (year, month, day, hour, minute, second) in its name, as
OUTPUT_PATH/prediction__%Y_%m_%d_%H_%M_%S.json
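Because every run writes a new timestamped file, it can be handy to pick up the most recent one programmatically. The following is a minimal sketch, assuming only the file name pattern given above; the function names are hypothetical and nothing is assumed about the structure of the JSON content.

```python
import json
from datetime import datetime
from pathlib import Path

PREFIX = "prediction__"
TIMESTAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"  # same format as in the file name


def prediction_timestamp(path):
    """Parse the timestamp embedded in a prediction file name."""
    stem = Path(path).stem  # file name without the ".json" suffix
    return datetime.strptime(stem[len(PREFIX):], TIMESTAMP_FORMAT)


def latest_prediction(output_path):
    """Return the newest prediction file in output_path, or None if there is none."""
    files = list(Path(output_path).glob(f"{PREFIX}*.json"))
    return max(files, key=prediction_timestamp, default=None)


def load_latest_prediction(output_path):
    """Load the newest prediction file as plain Python data."""
    newest = latest_prediction(output_path)
    if newest is None:
        raise FileNotFoundError(f"No prediction files found in {output_path}")
    with newest.open() as handle:
        return json.load(handle)
```

Calling load_latest_prediction("OUTPUT_PATH") with your own output directory returns whatever the predict command wrote; inspect the result to see its actual layout.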

More details can be found in the FAQ.

MLflow

Find an experiment in the UI via the hash

If you have an MLflow hash, e.g. from the config, and want to find the corresponding run in the UI:

  1. Navigate to the mlruns directory
  2. Run find . -name <hash>, which returns something like ./1/0a5006859f154daebc7a697d190f7a2. The first path component is the experiment_id, the second one is the run_id.
  3. Navigate to the ML UI by replacing the experiment_id and run_id: http://127.0.0.1:5000/#/experiments/<experiment_id>/runs/<run_id>
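The last two steps can be sketched in Python as follows. The function name mlflow_run_url is hypothetical; the base URL is the local address used in the steps above and should be adjusted if your MLflow UI runs elsewhere.

```python
from pathlib import PurePosixPath


def mlflow_run_url(find_result, base_url="http://127.0.0.1:5000"):
    """Build the MLflow UI URL from a path such as ./<experiment_id>/<run_id>."""
    # PurePosixPath drops the leading "./", leaving the meaningful components.
    parts = PurePosixPath(find_result).parts
    if len(parts) < 2:
        raise ValueError(f"Expected <experiment_id>/<run_id>, got {find_result!r}")
    # The last two components are the experiment_id and the run_id.
    experiment_id, run_id = parts[-2:]
    return f"{base_url}/#/experiments/{experiment_id}/runs/{run_id}"


# Path as returned by the find command in step 2.
print(mlflow_run_url("./1/0a5006859f154daebc7a697d190f7a2"))
```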

Find more information about how to use MLflow in the Cheatsheet.

TODOs and ENHANCEMENTS

We used Visual Studio Code for development, together with the ToDo Tree extension by Gruntfuggly. We configured it to distinguish TODOs from ENHANCEMENTs. For us, TODO means that important work is still required, while ENHANCEMENT means that an improvement is desirable but optional.

To reproduce our setup, install ToDo Tree in VS Code and add the following to your .vscode/settings.json (which is not in the repository):

    "todo-tree.highlights.customHighlight": {
        "ENHANCEMENT": {
            "icon": "note",
            "foreground": "black",
            "background": "lightgreen",
            "iconColour": "gray",
        },
    },
    "todo-tree.general.tags": [
        "TODO",
        "ENHANCEMENT",
    ],

About

DSSG document categorization repository
