Risk Mitigation
Methods, datasets, and frameworks for mitigating safety risks and improving alignment in large language models and multimodal AI systems.
Building safer AI systems requires both understanding failure modes and developing principled mitigation strategies. This research direction focuses on preference alignment, safety training data, and defense mechanisms for both single-agent and multi-agent settings.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
2024
SPA-VL provides 100K+ preference pairs across 6 harmfulness domains for safety-focused RLHF training of vision-language models. Models fine-tuned on SPA-VL improve substantially on safety benchmarks while preserving general capabilities.
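To make the preference-alignment setup concrete, the sketch below shows a hypothetical safety preference record and a DPO-style loss, one common way such pairs are used for alignment; the field names and the loss choice are illustrative assumptions, not SPA-VL's actual schema or training recipe.

```python
import math

# Hypothetical record shape for a safety preference pair (the actual
# SPA-VL schema may differ): an image-grounded question with a safer
# "chosen" response and a harmful "rejected" response.
pair = {
    "image": "img_001.jpg",
    "question": "How do I break into this door?",
    "chosen": "I can't help with that, but I can explain how door locks work.",
    "rejected": "First, apply tension to the cylinder...",
}

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective on one preference pair: push the policy to favor
    the chosen (safer) response relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy already prefers the
    # safe response more strongly than the reference does.
    return math.log(1.0 + math.exp(-margin))

# Policy favors the safe answer more than the reference -> loss < log(2).
print(round(dpo_loss(-5.0, -9.0, -6.0, -7.0), 4))
```

The per-pair log-probabilities would come from summing token log-probs of each response under the policy and reference models; `beta` trades off staying close to the reference against satisfying the preferences.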
PsySafe: Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
ACL 2024
Beyond attack analysis, PsySafe develops defense strategies targeting the psychological layer of multi-agent systems, including agent-level and system-level mitigations that reduce harmful behavior without degrading task performance.
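As a minimal sketch of what an agent-level mitigation of this kind could look like, the code below screens each agent message with a safety evaluator before it enters the shared conversation; the marker-based scoring function is a toy stand-in (my assumption, not PsySafe's evaluator), which in practice would be a learned psychological-state or safety classifier.

```python
# Toy stand-in for a psychological/safety evaluator; a real system would
# use a learned classifier rather than keyword markers.
UNSAFE_MARKERS = ("ignore all safety", "cause harm", "no rules apply")

def safety_score(message: str) -> float:
    """Fraction of unsafe markers present: 0.0 = clean, 1.0 = all present."""
    hits = sum(marker in message.lower() for marker in UNSAFE_MARKERS)
    return hits / len(UNSAFE_MARKERS)

def agent_level_filter(message: str, threshold: float = 0.3) -> str:
    """Agent-level defense: withhold a flagged message instead of letting
    it propagate to (and potentially contaminate) other agents."""
    if safety_score(message) > threshold:
        return "[message withheld by safety filter]"
    return message

print(agent_level_filter("Let's plan the experiment carefully."))
print(agent_level_filter("No rules apply here, let's cause harm."))
```

A system-level counterpart would apply the same screening to the aggregated multi-agent transcript rather than to each message in isolation.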