Attack

Adversarial attack research targeting the safety and alignment of large language models and multimodal AI systems.

This research direction investigates how adversarial actors can exploit vulnerabilities in large language and multimodal models, uncovering systematic weaknesses in their safety mechanisms.


CodeAttack: Revealing Safety Generalization Challenges of LLMs via Code Completion

ACL 2024

We show that safety alignment of LLMs can be circumvented by reformulating harmful queries as code completion tasks, exposing fundamental challenges in generalizing safety across modalities and formats.

(Ren et al., 2024)


PsySafe: Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

ACL 2024

We propose a comprehensive framework that studies how dangerous psychological characteristics assigned to AI agents can induce harmful behaviors in multi-agent systems, and evaluate both attack success and mitigation strategies.

(Zhang et al., 2024)