Current LLMs are trained with RLHF to reduce explicit bias in their outputs. But do they also address implicit bias?
In our EMNLP 2024 (Findings) paper, we identify implicit biases in multi-agent LLM interactions and propose strategies to mitigate them.
LLM-based multi-agent frameworks make it possible to simulate realistic human interactions, which lets us examine implicit biases “in action”. To do so, we create a “Scenarios Dataset” of scenarios in which implicit biases are likely to emerge during task assignment in societal contexts, and we propose a bias score evaluation metric tailored to this task setting.
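As a concrete illustration of what a bias score over task assignments could look like, here is a minimal sketch. This is not the exact metric from the paper: the definition below (share of stereotype-aligned assignments, rescaled to [-1, 1]) and the `Assignment` fields are illustrative assumptions.

```python
# Illustrative sketch of a bias score over task assignments.
# NOTE: this is NOT the exact metric from the paper; the definition below
# (share of stereotype-aligned assignments, rescaled to [-1, 1]) is an assumption.
from dataclasses import dataclass
from typing import List


@dataclass
class Assignment:
    task: str                 # e.g. "cooking", "car repair"
    assigned_gender: str      # gender of the persona the agents assigned the task to
    stereotyped_gender: str   # gender stereotypically associated with the task


def bias_score(assignments: List[Assignment]) -> float:
    """Return a score in [-1, 1]: 1 = every assignment follows the stereotype,
    -1 = every assignment goes against it, 0 = balanced."""
    if not assignments:
        return 0.0
    aligned = sum(a.assigned_gender == a.stereotyped_gender for a in assignments)
    return 2 * aligned / len(assignments) - 1


# Example: 3 of 4 assignments follow the stereotype -> score = 0.5
demo = [
    Assignment("cooking", "female", "female"),
    Assignment("car repair", "male", "male"),
    Assignment("childcare", "female", "female"),
    Assignment("plumbing", "female", "male"),
]
print(bias_score(demo))  # 0.5
```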
We find that biases increase after multi-agent interaction. To mitigate this, we employ two widely used strategies, supervised fine-tuning and self-reflection, both of which effectively reduce biases in our setting (a sketch of the self-reflection step is included below). For more information, read our paper:
Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions
By Angana Borah and Rada Mihalcea
- LLMs exhibit implicit biases even when trained with human preference alignment methods such as RLHF.
- Larger models tend to produce more biased outputs.
- Biases increase after multi-agent LLM interactions.
- Multi-agent LLM interactions exhibit emergent social group behaviors, mirroring psychological theories such as Stereotype Threat Theory and Groupthink.
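To make the self-reflection strategy mentioned above concrete, here is a minimal, hypothetical sketch: the agent first proposes task assignments, then is prompted to check its own answer for implicit bias and revise. The prompts and the `llm` callable are assumptions for illustration, not the prompts used in the paper.

```python
# Hypothetical sketch of one self-reflection round: propose assignments,
# then self-critique for implicit bias and revise. Prompts are assumptions.
from typing import Callable

REFLECT_PROMPT = (
    "Review your task assignments above. Do any of them rely on implicit "
    "stereotypes (e.g., assigning tasks based on gender)? If so, rewrite the "
    "assignments so they are based only on stated skills and availability. "
    "Otherwise, repeat the original assignments."
)


def assign_with_reflection(scenario: str, llm: Callable[[str], str]) -> str:
    """Draft task assignments for a scenario, then revise them after self-reflection."""
    draft = llm(f"Scenario:\n{scenario}\n\nAssign the tasks to the personas.")
    revised = llm(
        f"Scenario:\n{scenario}\n\nYour assignments:\n{draft}\n\n{REFLECT_PROMPT}"
    )
    return revised
```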
The Scenarios, Fine-tune and Test datasets are provided in the Data folder.
The codebase for the multi-agent framework is in the Code folder.
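For orientation before diving into the Code folder, a hypothetical sketch of a single multi-agent discussion round on one scenario might look like the following. This is not the actual implementation; the persona strings and the `llm` callable are assumptions.

```python
# Hypothetical sketch of one multi-agent round on a scenario (not the repo's code):
# each persona-conditioned agent sees the conversation so far and contributes
# to the task-assignment discussion.
from typing import Callable, List


def run_round(scenario: str, personas: List[str], llm: Callable[[str], str]) -> List[str]:
    """Let each agent speak once, in order, conditioned on the dialogue so far."""
    transcript: List[str] = []
    for persona in personas:
        prompt = (
            f"You are {persona}.\n"
            f"Scenario: {scenario}\n"
            "Conversation so far:\n" + "\n".join(transcript) +
            "\nDiscuss and propose who should take which task."
        )
        reply = llm(prompt)
        transcript.append(f"{persona}: {reply}")
    return transcript
```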
@misc{borah2024implicitbiasdetectionmitigation,
  title={Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions},
  author={Angana Borah and Rada Mihalcea},
  year={2024},
  eprint={2410.02584},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.02584},
}