🐑 Stack Trace Deduplication 🐑

This repository provides an overview and instructions for replicating experiments on stack trace deduplication from our paper "Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios", including details on code structure, setup, and execution steps. Below, you will find a breakdown of the key directories and scripts essential for the experiments.

🏗️ Repository structure

The directory ea/sim/main/methods/neural/encoders/ contains the implementation of the neural encoders used in the experiments:

our embedding model presented in the paper,
our implementation of the DeepCrash model.

The directory ea/sim/main/methods/neural/cross_encoders/ contains the implementation of the models that involve interaction between stack traces when computing similarity scores:

our cross-encoder presented in the paper,
S3M,
Lerch.

The implementation of the FaST model is located here.

The training scripts are located in the directory ea/sim/dev/scripts/training/training/.

The evaluation scripts are located in the directory ea/sim/dev/scripts/training/evaluating/.

🗃️ Data for experiments

To train and evaluate the models, you need a dataset of stack traces. In our paper, we present a novel industrial dataset and also use established open-source ones.

SlowOps, our new dataset of Slow Operation Assertion stack traces from IntelliJ-based products, can be found here.

Open-source datasets, namely Ubuntu, Eclipse, NetBeans, and Gnome, can be found TODO.

Note: to run our models on open-source datasets, you need to transform them into the right format. The scripts for doing that are available TODO.

🏃 Running the code

1. Install the required packages

poetry install

2. Setup

To run experiments for a specific dataset, create a designated directory ARTIFACTS_DIR for the dataset. Inside this directory, there should be a config.json file with the following structure:

{
    "reports_dir": "path/to/dataset/reports",
    "labels_dir": "path/to/dataset/labels",
    "data_name": "dataset_name",
    "scope": "dataset_scope (same as data_name if not specified)",
    "train_start": "days from the first report to start training",
    "train_longitude": "length of the training period in days",
    "val_start": "days from the first report to start validation",
    "val_longitude": "length of the validation period in days",
    "test_start": "days from the first report to start testing",
    "test_longitude": "length of the testing period in days",
    "forget_days": "days to use for report attaching",
    "dup_attach": "whether to attach duplicates"
}
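The start/longitude pairs above can be read as time windows measured from the first report. The following is a minimal sketch of that reading, not the repository's own code: the helper name split_window, the half-open interval convention, and the example numbers are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_window(first_report: datetime, start_days: int, length_days: int):
    """Return the (start, end) datetimes of a split.

    Assumption: the window begins `start_days` after the first report
    and lasts `length_days`; the end is treated as exclusive.
    """
    start = first_report + timedelta(days=start_days)
    end = start + timedelta(days=length_days)
    return start, end


# Hypothetical values, chosen only to illustrate the arithmetic.
first_report = datetime(2020, 1, 1)
train_start, train_end = split_window(first_report, start_days=0, length_days=100)
val_start, val_end = split_window(first_report, start_days=100, length_days=30)
```

With these numbers, validation starts exactly where training ends, so the two windows do not overlap under the half-open convention.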

The reports_dir directory should contain all reports. Each report should be a separate file with the following name format: report_id.json.

In the labels_dir directory, there should be a CSV file with the following structure:

timestamp,rid,iid
...

where timestamp is the timestamp of the report, rid is the report ID, and iid is the category ID.
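Before running experiments, it can help to check that the labels and the reports agree. The sketch below reads a CSV with the timestamp,rid,iid header described above, groups report IDs by category, and lists reports that have no matching report_id.json file. The function names load_labels and missing_reports are hypothetical helpers, not part of the repository, and the path to the CSV is passed in explicitly because its file name is not specified here.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_labels(csv_path):
    """Read timestamp,rid,iid rows and group report IDs by category ID."""
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row["iid"]].append(row["rid"])
    return dict(groups)


def missing_reports(groups, reports_dir):
    """Return labeled report IDs that have no <rid>.json file in reports_dir."""
    reports = Path(reports_dir)
    return sorted(
        rid
        for rids in groups.values()
        for rid in rids
        if not (reports / f"{rid}.json").is_file()
    )
```

An empty list from missing_reports means every labeled report has a corresponding file.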

An example of a config can be found in the NetBeans_config_example.json file.
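The setup step can also be scripted. The sketch below creates the artifacts directory and writes a config.json with the keys listed above. The helper write_config is hypothetical, and every value in example_config is a placeholder; in particular, using integers for the day fields and a boolean for dup_attach is an assumption, so check NetBeans_config_example.json for the types the code actually expects.

```python
import json
from pathlib import Path


def write_config(artifacts_dir, config):
    """Create artifacts_dir if needed and write config.json inside it."""
    target = Path(artifacts_dir)
    target.mkdir(parents=True, exist_ok=True)
    config_path = target / "config.json"
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config_path


# Placeholder values only: paths, names, and numbers are hypothetical.
example_config = {
    "reports_dir": "path/to/dataset/reports",
    "labels_dir": "path/to/dataset/labels",
    "data_name": "dataset_name",
    "scope": "dataset_name",
    "train_start": 0,
    "train_longitude": 100,
    "val_start": 100,
    "val_longitude": 30,
    "test_start": 130,
    "test_longitude": 30,
    "forget_days": 365,
    "dup_attach": True,
}
```

Calling write_config("artifacts_dir", example_config) would produce artifacts_dir/config.json, which is the directory that ARTIFACTS_DIR should then point to.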

3. Run the experiments

Generating the training dataset

Before training a neural model (embedding_model, cross_encoder, deep_crash, s3m), the training dataset should be generated from the reports and labels. Scripts for generating the training dataset are located in the directory ea/sim/dev/scripts/data/dataset/. Here is an example of how to generate the training dataset for the NetBeans dataset:

python ea/sim/dev/scripts/data/dataset/nb/main.py --reports_dir=path/to/dataset/NetBeans/ --state_path=path/to/dataset/NetBeans/state.csv --save_dir=path/to/save/netbeans/

The generated dataset should be passed to training scripts as a dataset_dir argument.

Training the models

Training scripts are located in the directory ea/sim/dev/scripts/training/training. To run the script, ARTIFACTS_DIR should be specified as an environment variable.

export ARTIFACTS_DIR=artifacts_dir; python ea/sim/dev/scripts/training/training/<script_name>.py
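When launching several runs from Python rather than from the shell, the same pattern can be reproduced with the standard library. This is a minimal sketch under the assumption that the scripts only need ARTIFACTS_DIR in their environment plus --name=value arguments; build_run and run_script are hypothetical wrappers, not part of the repository.

```python
import os
import subprocess
import sys


def build_run(script_path, artifacts_dir, **script_args):
    """Return (command, env) for running a script with ARTIFACTS_DIR set."""
    # Copy the current environment and add ARTIFACTS_DIR for the child process.
    env = dict(os.environ, ARTIFACTS_DIR=str(artifacts_dir))
    command = [sys.executable, script_path]
    command += [f"--{name}={value}" for name, value in script_args.items()]
    return command, env


def run_script(script_path, artifacts_dir, **script_args):
    """Run the script and raise CalledProcessError if it exits with an error."""
    command, env = build_run(script_path, artifacts_dir, **script_args)
    return subprocess.run(command, env=env, check=True)
```

For example, run_script("ea/sim/dev/scripts/training/training/train_model.py", "artifacts_dir", path_to_save="path/to/save/model/embedding_model.pth") would mirror the shell command above with one extra argument.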

Here are the available scripts for training:

Embedding model

python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/embedding_model.pth'

Cross Encoder

python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/cross_encoder.pth'

DeepCrash

python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/deep_crash.pth'

S3M

python ea/sim/dev/scripts/training/training/train_s3m.py --path_to_save='path/to/save/model/s3m.pth'

Evaluating the models

Evaluation scripts are located in the directory ea/sim/dev/scripts/training/evaluating. To run the script, ARTIFACTS_DIR should be specified as an environment variable.

export ARTIFACTS_DIR=artifacts_dir; python ea/sim/dev/scripts/training/evaluating/<script_name>.py

Here are the available scripts for evaluation:

Embedding model

python ea/sim/dev/scripts/training/evaluating/retrieval_stage.py --model_ckpt_path='path/to/model/embedding_model.pth'

Cross Encoder

python ea/sim/dev/scripts/training/evaluating/scoring_stage.py --cross_encoder_path='path/to/model/cross_encoder.pth'

DeepCrash

python ea/sim/dev/scripts/training/evaluating/retrieval_stage.py --model_ckpt_path='path/to/model/deep_crash.pth'

S3M

python ea/sim/dev/scripts/training/evaluating/eval_s3m.py --model_ckpt_path='path/to/model/s3m.pth'

FaST

python ea/sim/dev/scripts/training/evaluating/eval_fast.py

Lerch

python ea/sim/dev/scripts/training/evaluating/eval_lerch.py

OpenAI embedding model

First, precompute the embeddings using ea/sim/dev/scripts/training/training/embeddings/main.py. Then, run the following script:
```
python ea/sim/dev/scripts/training/evaluating/openai/run.py 
```

The results of the evaluation will be saved in the ARTIFACTS_DIR directory.

👩🏻‍🔬 Citing

If you want to find more details about the models or the evaluation, please refer to our SANER paper. If you use the code in your work, please consider citing us:

TODO
