This repository provides an overview and instructions for replicating experiments on stack trace deduplication from our paper "Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios", including details on code structure, setup, and execution steps. Below, you will find a breakdown of the key directories and scripts essential for the experiments.
The directory ea/sim/main/methods/neural/encoders/ contains the implementation of the neural encoders used in the experiments:
- our embedding model presented in the paper,
- our implementation of the DeepCrash model.
The directory ea/sim/main/methods/neural/cross_encoders/ contains the implementation of the models that involve interaction between stack traces when computing similarity scores:
- our cross-encoder presented in the paper,
- S3M,
- Lerch.
The implementation of the FaST model is located here.
The training scripts are located in the directory ea/sim/dev/scripts/training/training/.
The evaluation scripts are located in the directory ea/sim/dev/scripts/training/evaluating/.
To train and evaluate the models, you need a dataset of stack traces. In our paper, we present a novel industrial dataset and also use established open-source ones.
SlowOps, our new dataset of Slow Operation Assertion stack traces from IntelliJ-based products, can be found here.
Open-source datasets, namely Ubuntu, Eclipse, NetBeans, and Gnome, can be found here.
Note: to run our models on open-source datasets, you need to transform them into the right format. The script for doing that is available here.
poetry install
To run experiments for a specific dataset, create a designated directory ARTIFACTS_DIR
for the dataset.
Inside this directiry, there should be a config.json
file with the following structure:
{
"reports_dir": "path/to/dataset/reports",
"labels_dir": "path/to/dataset/labels",
"data_name": "dataset_name",
"scope": "dataset_scope (same as data_name if not specified)",
"train_start": "days from the first report to start training",
"train_longitude": "longitude of the training period in days",
"val_start": "days from the first report to start validation",
"val_longitude": "longitude of the validation period in days",
"test_start": "days from the first report to start testing",
"test_longitude": "longitude of the testing period in days",
"forget_days": "days to use for report attaching",
"dup_attach": "whether to attach duplicates"
}
In the reports_dir
directory, all reports should be located. Each report should be a separate file with the following
name format: report_id.json
.
In the labels_dir
directory, there should be a CSV file with the following structure:
timestamp,rid,iid
...
where timestamp
is the timestamp of the report, rid
is the report ID, and iid
is the category ID.
An example of a config can be found in the NetBeans_config_example.json file.
Before training an embedding model (embedding_model
, cross_encoder
, deep_crash
, s3m
), the training dataset should be generated from the reports and labels. Scripts for generating the training dataset are located in the directory ea/sim/dev/scripts/data/dataset/. Here is an example of how to generate the training dataset for the NetBeans dataset:
python ea/sim/dev/scripts/data/dataset/nb/main.py --reports_dir=path/to/dataset/NetBeans/ --state_path=path/to/dataset/NetBeans/state.csv --save_dir=path/to/save/netbeans/
The generated dataset should be passed to training scripts as a dataset_dir
argument.
Training scripts are located in the directory ea/sim/dev/scripts/training/training.
To run the script, ARTIFACTS_DIR
should be specified as an environment variable.
export ARTIFACTS_DIR=artifacts_dir; python ea/sim/dev/scripts/training/training/<script_name>.py
Here are the available scripts for training:
-
Embedding model
python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/embedding_model.pth'
-
Cross Encoder
python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/cross_encoder.pth'
-
DeepCrash
python ea/sim/dev/scripts/training/training/train_model.py --path_to_save='path/to/save/model/deep_crash.pth'
-
S3M
python ea/sim/dev/scripts/training/training/train_s3m.py --path_to_save='path/to/save/model/s3m.pth'
Evaluation scripts are located in the directory ea/sim/dev/scripts/training/evaluating.
To run the script, ARTIFACTS_DIR
should be specified as an environment variable.
export ARTIFACTS_DIR=artifacts_dir; python ea/sim/dev/scripts/training/evaluating/<script_name>.py
Here are the available scripts for evaluation:
-
Embedding model
python ea/sim/dev/scripts/training/evaluating/retrieval_stage.py --model_ckpt_path='path/to/model/embedding_model.pth'
-
Cross Encoder
python ea/sim/dev/scripts/training/evaluating/scoring_stage.py --cross_encoder_path='path/to/model/cross_encoder.pth'
-
DeepCrash
python ea/sim/dev/scripts/training/evaluating/retrieval_stage.py --model_ckpt_path='path/to/model/deep_crash.pth'
-
S3M
python ea/sim/dev/scripts/training/evaluating/eval_s3m.py --model_ckpt_path='path/to/model/s3m.pth'
-
FaST
python ea/sim/dev/scripts/training/evaluating/eval_fast.py
-
Lerch
python ea/sim/dev/scripts/training/evaluating/eval_lerch.py
-
OpenAI embedding model
First, precompute the embeddings using ea/sim/dev/scripts/training/training/embeddings/main.py. Then, run the following script:
python ea/sim/dev/scripts/training/evaluating/openai/run.py
The results of the evaluation will be saved in the ARTIFACTS_DIR
directory.
If you want to find more details about the models or the evaluation, please refer to our SANER paper. If you use the code in your work, please consider citing us:
@article{shibaev2024stack,
title={Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios},
author={Shibaev, Egor and Sushentsev, Denis and Golubev, Yaroslav and Khvorov, Aleksandr},
journal={arXiv preprint arXiv:2412.14802},
year={2024}
}