CatastroBERT: Extreme Weather Event detection based on the Gazette de Lausanne dataset

OVERVIEW

This project explores and extracts trends from the historical reporting of climate events in the Gazette de Lausanne, a daily newspaper with over 4 million articles spanning 1798 to 1990. Given the scale of the corpus, we used natural language processing (NLP) techniques to process the data efficiently.

The core of the project was the creation of a manually annotated dataset and of a tailored language model (LM), CatastroBERT. This LM identified approximately 15,000 pertinent articles with a high degree of precision and efficiency, and it generalized to extreme weather events reported in years not included in its training data. This robustness underscores CatastroBERT’s potential for future research applications.

CatastroBERT and its experimental multilingual variant, CatastroBERT-M, are available on Hugging Face for future research. CatastroBERT-M shows promise, but it may require further tuning to perform well across languages.

Example usage

See the notebook for a detailed example of how to use CatastroBERT on larger datasets.

Prerequisites

Before you begin, make sure you have Python installed on your system. This script was tested with Python 3.8 and above.
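
A quick way to confirm that the interpreter meets this requirement is sketched below. The helper name `python_is_supported` is hypothetical and not part of the project.

```python
import sys

# Minimum version stated in the prerequisites.
MIN_VERSION = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("Python 3.8 or newer is needed; found", sys.version.split()[0])
```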

Environment Setup

It is recommended to use a virtual environment for Python projects to manage dependencies effectively. You can set up a virtual environment as follows:

python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Then install the required packages with the following command:

pip install transformers torch

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "epfl-dhlab/CatastroBERT"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the pretrained weights; the Auto classes cannot be instantiated directly
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # keep the model on the same device as its inputs
model.eval()      # disable dropout for inference

def predict(text):
    # Tokenize the text, truncating anything beyond 512 tokens
    inputs = tokenizer(
        text,
        add_special_tokens=True,
        padding=True,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )

    ids = inputs['input_ids'].to(device)
    mask = inputs['attention_mask'].to(device)

    # Get predictions without tracking gradients
    with torch.no_grad():
        outputs = model(input_ids=ids, attention_mask=mask)
        logits = outputs.logits

    # Apply sigmoid function to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()

    # Return the probability of the class (1)
    return probs[0][0]

# Example usage
text = "Un violent ouragan est passé cette nuit sur Lausanne."
print(f"Prediction: {predict(text)}")
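
To screen several articles at once, the score returned by `predict` can be compared against a cutoff. The following is only a minimal sketch: the 0.5 threshold is an assumption, not a value taken from the project, and `flag_articles` and `fake_predict` are hypothetical names. The scoring function is passed in as an argument so that the example runs without downloading the model.

```python
from typing import Callable, Iterable, List, Tuple


def flag_articles(
    texts: Iterable[str],
    predict_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the project
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score reaches the threshold."""
    flagged = []
    for text in texts:
        score = float(predict_fn(text))
        if score >= threshold:
            flagged.append((text, score))
    return flagged


def fake_predict(text: str) -> float:
    # Hypothetical stand-in for predict(); swap in the real function.
    return 0.9 if "ouragan" in text.lower() else 0.1


articles = [
    "Un violent ouragan est passé cette nuit sur Lausanne.",
    "Le Conseil communal s'est réuni hier soir.",
]
flagged = flag_articles(articles, fake_predict)
print(flagged)
```

In practice, `fake_predict` would be replaced by the `predict` function defined above, and the threshold would be chosen on annotated data.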
