Leveraging RAG-Enhanced Large Language Model for Semi-Supervised Log Anomaly Detection (ISSRE-2024)

This is the basic implementation of our ISSRE 2024 paper: Leveraging RAG-Enhanced Large Language Model for Semi-Supervised Log Anomaly Detection.

Description

Log-based anomaly detection is critical for monitoring the operations of information systems and for real-time reporting of system failures. Deep learning-based methods can detect anomalies in logs effectively. However, existing methods depend heavily on log parsers, and parsing errors can considerably degrade downstream anomaly detection. In addition, methods that predict the next log event in a sequence are susceptible to sequence instability and to the emergence of unseen logs as systems evolve, resulting in a higher false positive rate. In this paper, we put forward LogRAG, a semi-supervised log anomaly detection framework based on retrieval-augmented generation (RAG). The framework conducts phased detection using both Log Tokens and Log Templates to mitigate the impact of log parsing errors. It also uses a one-class classifier to model the normal behavior of the system, thereby circumventing the effects of unstable sequences. Finally, it employs a large language model (LLM) empowered by RAG to reevaluate detected anomalous logs, improving accuracy. LogRAG demonstrates a 15% improvement in F1 score on the BGL dataset and a 60% improvement on the Spirit dataset compared to the previous best semi-supervised learning algorithm.
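
At a glance, the detection flow described above can be sketched as follows (conceptual pseudocode only; every function and model name here is a hypothetical placeholder, not this repo's actual API):

# Conceptual pseudocode for the LogRAG flow described above.
# All helper names (embed_tokens, parse, retrieve_similar_logs, ...) are placeholders.
def lograg_detect(raw_log, one_class_model, rag_llm):
    # Phase 1: token-level detection on the raw message, so log parsing
    # errors cannot corrupt this stage.
    token_score = one_class_model.score(embed_tokens(raw_log))
    # Phase 2: template-level detection on the Drain-parsed template.
    template_score = one_class_model.score(embed_template(parse(raw_log)))

    if is_normal(token_score) and is_normal(template_score):
        return "normal"  # matches the modeled normal behavior of the system

    # Post-processing: an LLM grounded by retrieved similar logs (RAG)
    # reevaluates the candidate anomaly to reduce false positives.
    evidence = retrieve_similar_logs(raw_log)
    return rag_llm.reevaluate(raw_log, evidence)  # "normal" or "anomalous"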

Project Structure

LogRAG
├── .gitignore
├── README.md
├── config.yaml      # config file
├── dataset          # train and test files
├── main.py          # main entry point
├── postprocess      # post-processing stage using RAG
├── prelogad         # anomaly detection stage
├── pretrained       # pretrained language model
├── output           # results
├── requirements.txt
└── utils

Dataset

We use the BGL and Spirit datasets for our experiments; both are available from LogHub or LogADEmpirical.

Dataset   Size     # Logs      # Anomalies   Anomaly Ratio
BGL       743 MB   4,747,963   348,460       7.34%
Spirit    1.4 GB   5,000,000   764,500       15.29%

The raw logs must be parsed into structured logs. Here, BGL.log_structured.csv and Spirit.log_structured.csv serve as the Drain-parsed data.
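
If you start from the raw .log files instead, one way to produce such structured data is the drain3 package (a minimal sketch under that assumption; the ready-made *_structured.csv files from LogHub/LogADEmpirical can be used directly without re-parsing):

# Minimal Drain parsing sketch using the drain3 package (pip install drain3).
from drain3 import TemplateMiner

miner = TemplateMiner()
with open("BGL.log") as f:
    for line in f:
        # Real pipelines first split off the BGL header fields (label,
        # timestamp, node, ...) and mine only the message content.
        result = miner.add_log_message(line.rstrip("\n"))
        print(result["cluster_id"], result["template_mined"])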

Reproducibility Instructions

Environment Setup

Requirements:

  • Python 3.8.18
  • NVIDIA GPU + CUDA cuDNN
  • PyTorch 1.13.0
  • langchain 0.2.14
  • transformers 4.37.2
  • openai 1.42.0

The required packages are listed in requirements.txt. Install:

pip install -r requirements.txt

Getting Started

Follow these steps to run LogRAG end to end (using the BGL-Example dataset as a running example).

Step 1: Set the API key and get the pretrained language model

  1. Get the pretrained language model from BigLog and put the model files into ./pretrained
  2. Get a ChatGPT API key for LogRAG-ChatGPT and the LLM embedding; we get ours from here
  • For LogRAG-ChatGPT, set the api_key value in config.yaml (a quick sanity check is sketched below)
  • For LogRAG-Mistral, first download mistralai/Mistral-7B-Instruct-v0.1 from Hugging Face
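
Before running the full pipeline, you can sanity-check the key and proxy endpoint with the openai 1.x client (a minimal sketch; api_key and api_base are the values you put in config.yaml):

# Verify that the API key and base URL from config.yaml work.
from openai import OpenAI

client = OpenAI(api_key="PUT_YOUR_OWN_API_KEY_HERE",
                base_url="https://api.openai-proxy.org/v1")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)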

Step 2: Set config parameters

is_train: True # train deepsvdd
is_rag: True # use post processing stage
dataset_name: BGL
train_ratio: 0.8
window_size: 200 
window_time: 1800 # 30 minutes

# dataset and model paths
log_structed_path: ./dataset/BGL/bgl-example.log_structured.csv
encoder_path: ./pretrained

# openai key, available from https://www.closeai-asia.com
api_key: PUT_YOUR_OWN_API_KEY_HERE 
api_base: https://api.openai-proxy.org/v1

llm_name: gpt-3.5-turbo
# llm_name: mistralai/Mistral-7B-Instruct-v0.1 # from huggingface

# deepsvdd parameters
normal_class: 0
is_pretrain: False
optimizer_name: adam
lr: 0.0001
n_epochs: 150
lr_milestones: [50]
batch_size: 40960
weight_decay: 0.0005
device: cuda:0
n_jobs_dataloader: 0

# rag parameters
threshold: 0.8
topk: 5
prompt: prompt5
persist_directory: ./output/ragdb-bgl
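
The rag parameters above drive the retrieval step of the post-processing stage. Roughly, retrieval might look like this (a sketch assuming a Chroma store built with OpenAI embeddings at persist_directory, with the langchain-community, langchain-openai, and chromadb packages installed; the repo's actual retrieval code may differ):

# Retrieve the topk most similar logs for a candidate anomaly, keeping
# only matches whose relevance score passes `threshold` (config.yaml values).
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(api_key="PUT_YOUR_OWN_API_KEY_HERE",
                       base_url="https://api.openai-proxy.org/v1")
db = Chroma(persist_directory="./output/ragdb-bgl", embedding_function=emb)

candidate = "instruction cache parity error corrected"  # illustrative log flagged upstream
hits = db.similarity_search_with_relevance_scores(candidate, k=5)  # topk
evidence = [doc for doc, score in hits if score >= 0.8]            # threshold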

You can adjust the config parameters to examine the behavior of LogRAG under different experimental conditions. The main fields are explained as follows:

  • is_train: whether to train the DeepOC module; set this to True for the first run or whenever you switch experimental conditions
  • is_rag: whether the RAG post-processing module is enabled; if False, the reported results are those of DeepOC alone
  • window_size: the fixed window size, in number of log entries (see the windowing sketch after this list)
  • window_time: the fixed window duration, in seconds (1800 = 30 minutes)
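
For instance, with the settings above, grouping might proceed as follows (a hypothetical illustration of fixed-size versus fixed-time windowing; not this repo's actual grouping code):

import pandas as pd

df = pd.read_csv("./dataset/BGL/bgl-example.log_structured.csv")

# Fixed-size windows: every 200 consecutive log entries form one sample.
size_windows = [df.iloc[i:i + 200] for i in range(0, len(df), 200)]

# Fixed-time windows: entries are bucketed into 1800-second (30-minute) spans.
# The Timestamp column is assumed to hold epoch seconds.
start = df["Timestamp"].min()
time_windows = [g for _, g in df.groupby((df["Timestamp"] - start) // 1800)]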

Step 3: Let's go!

Download the dataset and the desired model, adjust the config parameters as needed, and then run:

python main.py

You will find the results in output/runtime.log.
