Publications | Jing Shao

# indicates equal contributions; * indicates corresponding authors.

2025

RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

Jingyi Yang, Shuai Shao, Dongrui Liu, and 1 more author

arXiv, May 2025

arXiv Code
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu, Zeren Chen, Xuhao Hu, and 3 more authors

arXiv, May 2025

arXiv Code
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Xiaoya Lu, Dongrui Liu, Yi Yu, and 2 more authors

EMNLP 2025 Findings, May 2025

arXiv Code
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao, Yi Ding, Lijun Li, and 1 more author

EMNLP 2025 Main Conference, May 2025

arXiv Code

2024

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

Yongting Zhang, Lu Chen, Guodong Zheng, and 10 more authors

arXiv, May 2024

Website
Assessment of Multimodal Large Language Models in Alignment with Human Values

Zhelun Shi, Zhipin Wang, Hongxing Fan, and 7 more authors

arXiv, May 2024

Website
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

Qibing Ren, Chang Gao, Jing Shao, and 4 more authors

ACL, May 2024

Website
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Chen Qian, Jie Zhang, Wei Yao, and 5 more authors

ACL, May 2024

Website
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Lijun Li, Bowen Dong, Ruohui Wang, and 5 more authors

ACL, May 2024

Website
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Chaochao Lu, Chen Qian, Guodong Zheng, and 33 more authors

Technical Report, May 2024

Website
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

Zaibin Zhang, Yongting Zhang, Lijun Li, and 6 more authors

ACL 2024, May 2024

ACL 2024 Outstanding Paper Award Code Website

https://2024.aclweb.org/program/best_papers/
REEF: Representation Encoding Fingerprints for Large Language Models

Jie Zhang

ICLR 2025, May 2024

ICLR 2025 Oral arXiv Code

https://iclr.cc/virtual/2025/events/oral
The Tug of War Within: Mitigating the Fairness-Privacy Conflicts in Large Language Models

Chen Qian, Dongrui Liu, Jie Zhang, and 2 more authors

arXiv, May 2024

Code
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Qibing Ren, Hao Li, Dongrui Liu, and 7 more authors

ACL 2025, May 2024

ACL 2025 Outstanding Paper Award arXiv Code

https://2025.aclweb.org/program/awards/
OASIS: Open Agent Social Interaction Simulations with One Million Agents

Ziyi Yang, Zaibin Zhang, Zirui Zheng, and 20 more authors

arXiv, May 2024

arXiv Code

2023

ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

Zhelun* Shi, Zhipin* Wang, Hongxing* Fan, and 4 more authors

arXiv, May 2023

Website
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Zhenfei* Yin, Jiong* Wang, JianJian* Cao, and 9 more authors

NeurIPS, May 2023

Website

2022

1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)

Dong An, Zun Wang, Yangguang Li, and 5 more authors

CVPR, May 2022
ERGO: Event Relational Graph Transformer for Document-level Event Causality Identification

Meiqi Chen, Yixin Cao, Kunquan Deng, and 4 more authors

arXiv, May 2022
Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Yufeng Cui, Lichen Zhao, Feng Liang, and 2 more authors

arXiv, May 2022
X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

Yinan He, Gengshi Huang, Siyu Chen, and 7 more authors

ECCV, May 2022
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Yangguang Li, Feng Liang, Lichen Zhao, and 5 more authors

In International Conference on Learning Representations, Mar 2022
MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities

Yubo Ma, Zehao Wang, Mukai Li, and 8 more authors

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, May 2022
Prompt for Extraction? PAIE: Prompting Argument Interaction for Event Argument Extraction

Yubo Ma, Zehao Wang, Yixin Cao, and 4 more authors

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Junting Pan, Ziyi Lin, Xiatian Zhu, and 2 more authors

NeurIPS, May 2022
Few-shot Forgery Detection via Guided Adversarial Interpolation

Haonan Qiu, Siyu Chen, Bei Gan, and 4 more authors

arXiv, May 2022
Task-Balanced Distillation for Object Detection

Ruining Tang, Zhenyu Liu, Yangguang Li, and 6 more authors

PR, May 2022
RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training

Luya Wang, Feng Liang, Yangguang Li, and 3 more authors

In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Jul 2022
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Hao Wang, Yangguang Li, Zhen Huang, and 3 more authors

ICIC, Jul 2022
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy

Yuanhan Zhang, Qinghong Sun, Yichun Zhou, and 7 more authors

arXiv, Jul 2022
Benchmarking Omni-Vision Representation through the Lens of Visual Realms

Yuanhan Zhang, Zhenfei Yin, Jing Shao, and 1 more author

ECCV, Jul 2022
Robust Face Anti-Spoofing with Dual Probabilistic Modeling

Yuanhan Zhang, Yichao Wu, Zhenfei Yin, and 2 more authors

arXiv, Jul 2022

2021

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Junting Pan, Siyu Chen, Mike Zheng Shou, and 3 more authors

In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021
ForgeryNet - Face Forgery Analysis Challenge 2021: Methods and Results

Yinan He, Lu Sheng, Jing Shao, and 19 more authors

CoRR, Jun 2021
ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis

Yinan He, Bei Gan, Siyu Chen, and 6 more authors

In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021
A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Teli Ma, Shijie Geng, Mengmeng Wang, and 5 more authors

CoRR, Jun 2021
Few-Shot Domain Expansion for Face Anti-Spoofing

Bowen Yang, Jing Zhang, Zhenfei Yin, and 1 more author

CoRR, Jun 2021
BlockQNN: Efficient Block-Wise Neural Network Architecture Generation

Zhao Zhong, Zichen Yang, Boyang Deng, and 4 more authors

IEEE Transactions on Pattern Analysis and Machine Intelligence, Jul 2021

Abstract

Convolutional neural networks have gained a remarkable success in computer vision. However, most popular network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent which is trained to choose component layers sequentially. We stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it yields state-of-the-art results in comparison to the hand-crafted networks on image classification, particularly, the best network generated by BlockQNN achieves 2.35 percent top-1 error rate on CIFAR-10. (2) it offers tremendous reduction of the search space in designing networks, spending only 3 days with 32 GPUs. A faster version can yield a comparable result with only 1 GPU in 20 hours. (3) it has strong generalizability in that the network built on CIFAR also performs well on the larger-scale dataset. The best network achieves very competitive accuracy of 82.0 percent top-1 and 96.0 percent top-5 on ImageNet.

2020

1st place solution for AVA-Kinetics Crossover in ActivityNet Challenge 2020

Siyu Chen, Junting Pan, Guanglu Song, and 6 more authors

CoRR, Jul 2020
High-Quality Video Generation from Static Structural Annotations

Lu Sheng, Junting Pan, Jiaming Guo, and 2 more authors

International Journal of Computer Vision, Nov 2020

Abstract

This paper proposes a novel unsupervised video generation that is conditioned on a single structural annotation map, which in contrast to prior conditioned video generation approaches, provides a good balance between motion flexibility and visual quality in the generation process. Different from end-to-end approaches that model the scene appearance and dynamics in a single shot, we try to decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, thus achieving remarkable results overall. The first sub-task is an image-to-image (I2I) translation task that synthesizes high-quality starting frame from the input structural annotation map. The second image-to-video (I2V) generation task applies the synthesized starting frame and the associated structural annotation map to animate the scene dynamics for the generation of a photorealistic and temporally coherent video. We employ a cycle-consistent flow-based conditioned variational autoencoder to capture the long-term motion distributions, by which the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves the structural awareness in the I2V generation process. Quantitative and qualitative evaluations over the autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over the state-of-the-art methods. The code has been released: https://github.com/junting/seg2vid.
Morphing and Sampling Network for Dense Point Cloud Completion

Minghua Liu, Lu Sheng, Sheng Yang, and 2 more authors

In Proceedings of the AAAI Conference on Artificial Intelligence, Apr 2020
CelebA-Spoof: Large-Scale Face Anti-spoofing Dataset with Rich Annotations

Yuanhan Zhang, ZhenFei Yin, Yidong Li, and 4 more authors

In Computer Vision – ECCV 2020, Apr 2020
Learning Connectivity of Neural Networks from a Topological Perspective

Kun Yuan, Quanquan Li, Jing Shao, and 1 more author

In Computer Vision – ECCV 2020, Apr 2020
Powering One-Shot Topological NAS with Stabilized Share-Parameter Proxy

Ronghao Guo, Chen Lin, Chuming Li, and 4 more authors

In Computer Vision – ECCV 2020, Apr 2020
Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues

Yuyang Qian, Guojun Yin, Lu Sheng, and 2 more authors

In Computer Vision – ECCV 2020, Apr 2020
PV-NAS: Practical Neural Architecture Search for Video Recognition

Zihao Wang, Chen Lin, Lu Sheng, and 2 more authors

CoRR, Apr 2020

2019

Unsupervised Bi-directional Flow-based Video Generation from one Snapshot

Lu Sheng, Junting Pan, Jiaming Guo, and 3 more authors

CoRR, Apr 2019
Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Xihui Liu, Zihao Wang, Jing Shao, and 2 more authors

In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019
Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

Xihui Liu, Guojun Yin, Jing Shao, and 2 more authors

In Advances in Neural Information Processing Systems, Jun 2019
Video Generation From Single Semantic Label Map

Junting Pan, Chengyu Wang, Xu Jia, and 4 more authors

In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019

Abstract

This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process. Different from typical end-to-end approaches, which model both scene content and dynamics in a single step, we propose to decompose this difficult task into two sub-problems. As current image generation methods do better than video generation in terms of detail, we synthesize high quality content by only generating the first frame. Then we animate the scene based on its semantic meaning to obtain temporally coherent video, giving us excellent results overall. We employ a cVAE for predicting optical flow as a beneficial intermediate step to generate a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to achieve major improvements in the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods. The source code will be released on https://github.com/junting/seg2vid.
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

Zihao Wang, Xihui Liu, Hongsheng Li, and 4 more authors

In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019
Context and Attribute Grounded Dense Captioning

Guojun Yin, Lu Sheng, Bin Liu, and 3 more authors

In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019

Abstract

Dense captioning aims at simultaneously localizing semantic regions and describing these regions-of-interest (ROIs) with short phrases or sentences in natural language. Previous studies have shown remarkable progresses, but they are often vulnerable to the aperture problem that a caption generated by the features inside one ROI lacks contextual coherence with its surrounding context in the input image. In this work, we investigate contextual reasoning based on multi-scale message propagations from the neighboring contents to the target ROIs. To this end, we design a novel end-to-end context and attribute grounded dense captioning framework consisting of 1) a contextual visual mining module and 2) a multi-level attribute grounded description generation module. Knowing that captions often co-occur with the linguistic attributes (such as who, what and where), we also incorporate an auxiliary supervision from hierarchical linguistic attributes to augment the distinctiveness of the learned captions. Extensive experiments and ablation studies on Visual Genome dataset demonstrate the superiority of the proposed model in comparison to the state-of-the-art methods.
Semantics Disentangling for Text-To-Image Generation

Guojun Yin, Bin Liu, Lu Sheng, and 3 more authors

In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019

2018

Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection

Yongcheng Liu, Lu Sheng, Jing Shao, and 3 more authors

In Proceedings of the 26th ACM international conference on Multimedia, Jun 2018
Localization Guided Learning for Pedestrian Attribute Recognition

Pengze Liu, Xihui Liu, Junjie Yan, and 1 more author

In British Machine Vision Conference 2018, BMVC 2018, Sep 2018
Exploring Disentangled Feature Representation Beyond Face Identification

Yu Liu, Fangyin Wei, Jing Shao, and 3 more authors

In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018
Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association

Dapeng Chen, Hongsheng Li, Xihui Liu, and 4 more authors

In Computer Vision – ECCV 2018, Jun 2018
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data

Xihui Liu, Hongsheng Li, Jing Shao, and 2 more authors

In Computer Vision – ECCV 2018, Jun 2018
Transductive Centroid Projection for Semi-supervised Large-Scale Recognition

Yu Liu, Guanglu Song, Jing Shao, and 2 more authors

In Computer Vision – ECCV 2018, Jun 2018
Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition

Guojun Yin, Lu Sheng, Bin Liu, and 4 more authors

In Computer Vision – ECCV 2018, Jun 2018
Avatar-Net: Multi-scale Zero-Shot Style Transfer by Feature Decoration

Lu Sheng, Ziyi Lin, Jing Shao, and 1 more author

In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018
Practical Block-Wise Neural Network Architecture Generation

Zhao Zhong, Junjie Yan, Wei Wu, and 2 more authors

In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018

2017

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification

Zhongdao Wang, Luming Tang, Xihui Liu, and 7 more authors

In 2017 IEEE International Conference on Computer Vision (ICCV), Jun 2017
HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

Xihui Liu, Haiyu Zhao, Maoqing Tian, and 5 more authors

In 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017
Crowded Scene Understanding by Deeply Learned Volumetric Slices

Jing Shao, Chen Change Loy, Kai Kang, and 1 more author

IEEE Transactions on Circuits and Systems for Video Technology, Mar 2017
Learning Scene-Independent Group Descriptors for Crowd Understanding

Jing Shao, Chen Change Loy, and Xiaogang Wang

IEEE Transactions on Circuits and Systems for Video Technology, Jun 2017
Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion

Haiyu Zhao, Maoqing Tian, Shuyang Sun, and 5 more authors

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017

2016

Slicing Convolutional Neural Network for Crowd Video Understanding

Jing Shao, Chen Change Loy, Kai Kang, and 1 more author

In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016

Abstract

Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown their remarkable potential in learning appearance representations from images. However, the learning of dynamic representation, and how it can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatio- and 2D temporal-slices representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements over the state-of-the-art methods (62.55% from 51.84% [21]).

2015

Deeply learned attributes for crowded scene understanding

Jing Shao, Kai Kang, Chen Change Loy, and 1 more author

In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2015

2014

Scene-Independent Group Profiling in Crowd

Jing Shao, Chen Change Loy, and Xiaogang Wang

In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Jun 2014