Enhancing LLM Safety through a Theoretical Minimax Game Lens
Yihe Deng · Yu Yang · Junkai Zhang · Wei Wang · Bo Li
Abstract
The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to the scarcity of open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework in which a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances classifier performance, enabling a substantially smaller model to surpass the state of the art by nearly 10\% on English benchmarks while achieving 4.5$\times$ faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.
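The abstract's generator–classifier interaction can be written as a two-player zero-sum objective; the notation below is an illustrative sketch (the generator $G$, classifier $f_\theta$, and loss $\ell$ are our assumed symbols, not the paper's), intended only to convey the minimax structure:

$$
\min_{\theta} \; \max_{G} \;\; \mathbb{E}_{(x, y) \sim G}\!\left[ \ell\big(f_\theta(x),\, y\big) \right],
$$

where $G$ produces labeled synthetic safety examples $(x, y)$ designed to maximize the classifier's loss (e.g., hard corner cases that are safe but sensitive), while the classifier $f_\theta$ updates to minimize that loss. Under standard convexity–concavity assumptions on such saddle-point problems, alternating best-response or gradient-based updates converge to a Nash equilibrium, which is the kind of guarantee the abstract refers to.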