Risk Mitigation
Methods, datasets, and frameworks for mitigating safety risks and improving alignment in large language models and multimodal AI systems.
Building safer AI systems requires both understanding failure modes and developing principled mitigation strategies. This research direction focuses on preference alignment, safety training data, and defense mechanisms for both single-agent and multi-agent settings.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
2024
SPA-VL provides 100K+ preference pairs across 6 harmfulness domains for safety-focused RLHF training of vision-language models. Models fine-tuned on SPA-VL improve substantially on safety benchmarks while preserving general capabilities.
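To make the preference-alignment setup concrete, the sketch below shows a hypothetical safety preference record and a DPO-style loss, one common way such pairs are used for alignment; the field names and the loss choice are illustrative assumptions, not SPA-VL's actual schema or training recipe.

```python
import math

# Hypothetical record shape for a safety preference pair (the actual
# SPA-VL schema may differ): an image-grounded question with a safer
# "chosen" response and a harmful "rejected" response.
pair = {
    "image": "img_001.jpg",
    "question": "How do I break into this door?",
    "chosen": "I can't help with that, but I can explain how door locks work.",
    "rejected": "First, apply tension to the cylinder...",
}

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective on one preference pair: push the policy to favor
    the chosen (safer) response relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy already prefers the
    # safe response more strongly than the reference does.
    return math.log(1.0 + math.exp(-margin))

# Policy favors the safe answer more than the reference -> loss < log(2).
print(round(dpo_loss(-5.0, -9.0, -6.0, -7.0), 4))
```

The per-pair log-probabilities would come from summing token log-probs of each response under the policy and reference models; `beta` trades off staying close to the reference against satisfying the preferences.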
PsySafe: Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
ACL 2024
Beyond attack analysis, PsySafe develops defense strategies targeting the psychological layer of multi-agent systems, including agent-level and system-level mitigations that reduce harmful behavior without degrading task performance.
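As a minimal sketch of what an agent-level mitigation of this kind could look like, the code below screens each agent message with a safety evaluator before it enters the shared conversation; the marker-based scoring function is a toy stand-in (my assumption, not PsySafe's evaluator), which in practice would be a learned psychological-state or safety classifier.

```python
# Toy stand-in for a psychological/safety evaluator; a real system would
# use a learned classifier rather than keyword markers.
UNSAFE_MARKERS = ("ignore all safety", "cause harm", "no rules apply")

def safety_score(message: str) -> float:
    """Fraction of unsafe markers present: 0.0 = clean, 1.0 = all present."""
    hits = sum(marker in message.lower() for marker in UNSAFE_MARKERS)
    return hits / len(UNSAFE_MARKERS)

def agent_level_filter(message: str, threshold: float = 0.3) -> str:
    """Agent-level defense: withhold a flagged message instead of letting
    it propagate to (and potentially contaminate) other agents."""
    if safety_score(message) > threshold:
        return "[message withheld by safety filter]"
    return message

print(agent_level_filter("Let's plan the experiment carefully."))
print(agent_level_filter("No rules apply here, let's cause harm."))
```

A system-level counterpart would apply the same screening to the aggregated multi-agent transcript rather than to each message in isolation.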