Slicing Convolutional Neural Network for Crowd Video Understanding

Jing Shao1, Chen Change Loy2, Kai Kang1, and Xiaogang Wang1

1Department of Electronic Engineering, 2Department of Information Engineering, The Chinese University of Hong Kong.




Contemporary CNNs are capable of learning strong generic appearance representations from static image sets such as ImageNet. Nevertheless, they lack the critical capability of learning dynamic representations. In existing approaches, a video is treated as a 3D volume and a 2D CNN is simply extended to a 3D CNN [5], mixing the appearance and dynamic feature representations in the learned 3D filters. Instead, appearance and dynamic features should be extracted separately, since they are encoded in different ways in videos and convey different information. Alternative solutions include sampling frames along the temporal direction and fusing their 2D CNN feature maps at different levels, or feeding in motion maps obtained by existing tracking or optical flow methods. While computationally more feasible than a 3D CNN, these methods lose critical dynamic information at the input layer. In this study, we wish to show that with innovative model design, appearance and dynamic information can be effectively extracted at a deeper layer of a CNN that conveys richer semantic notions (i.e. groups and individuals). In our new model design, appearance and dynamics have separate representations yet interact seamlessly at the semantic level. We name our model Slicing CNN (S-CNN).

The contributions of this work:

Slicing CNN Model

In this paper, we propose a new end-to-end model named Slicing CNN (S-CNN), consisting of three branches. We first learn appearance features with a 2D CNN model on each frame of the input video volume, and obtain a collection of semantic feature cuboids. Each feature cuboid captures a distinct visual pattern, or an object instance/category. Based on the extracted feature cuboids, we introduce three different 2D spatial and temporal filters (i.e. xy-, xt-, and yt-) to learn the appearance and dynamic features from different dimensions, each of which is followed by a 1D temporal pooling layer. Recognition of crowd attributes is achieved by applying a classifier to the concatenated feature vector extracted from the feature maps of the xy-, xt-, and yt-branches. The complete S-CNN model is shown as follows:


The figure below shows the detailed architecture of a single branch, S-CNN-xt.
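Independently of the figures, the data flow of the three branches described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the released implementation: the names `BRANCH_AXES`, `conv2d_valid`, `branch_feature`, and `scnn_descriptor` are hypothetical, a single random 3x3 kernel stands in for each branch's learned filters, max is assumed as the temporal pooling operator, and one `(T, H, W)` array stands in for one semantic feature cuboid.

```python
import numpy as np

# Axis orders that move each slicing plane to the last two axes of a (T, H, W) cuboid.
BRANCH_AXES = {
    "xy": (0, 1, 2),  # one (H, W) slice per time step t
    "xt": (1, 0, 2),  # one (T, W) slice per row y
    "yt": (2, 0, 1),  # one (T, H) slice per column x
}


def conv2d_valid(plane, kernel):
    """2D cross-correlation without padding (stand-in for a learned 2D filter)."""
    kh, kw = kernel.shape
    out_h = plane.shape[0] - kh + 1
    out_w = plane.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(plane[i:i + kh, j:j + kw] * kernel)
    return out


def branch_feature(cuboid, branch, kernel):
    """Filter every slice of one branch, then pool along the time axis."""
    axes = BRANCH_AXES[branch]
    slices = np.transpose(cuboid, axes)                # stack of 2D slices
    filtered = np.stack([conv2d_valid(s, kernel) for s in slices])
    # Undo the permutation so that time is axis 0 again.
    restored = np.transpose(filtered, tuple(np.argsort(axes)))
    pooled = restored.max(axis=0)                      # assumed: max temporal pooling
    return pooled.ravel()


def scnn_descriptor(cuboid, kernels):
    """Concatenate the xy-, xt-, and yt-branch features into one vector."""
    return np.concatenate(
        [branch_feature(cuboid, b, kernels[b]) for b in ("xy", "xt", "yt")]
    )


rng = np.random.default_rng(0)
cuboid = rng.random((6, 5, 4))                         # toy cuboid: T=6, H=5, W=4
kernels = {b: rng.standard_normal((3, 3)) for b in BRANCH_AXES}
descriptor = scnn_descriptor(cuboid, kernels)
print(descriptor.shape)                                # (28,)
```

In the model described above, a classifier would then be applied to the concatenated vector; here the vector is only printed. The only difference between the branches in this sketch is which pair of axes the 2D filter sees.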

Semantic Selectiveness of Feature Maps

The feature map obtained by a spatial filter at an intermediate layer of a deep model records the spatial distribution of the visual pattern of a specific object. In the example shown in the left figure below, the convolutional layers of the VGG model pre-trained on ImageNet depict visual patterns at different scales and levels, among which the conv4_3 layer extracts semantic patterns at the object level. For instance, filter #26 in this layer precisely captures the ice ballet dancers in all frames. Further examining the selectiveness of the feature maps, the right figure below demonstrates that different filters at the conv4_3 layer are possibly linked to different visual patterns. The orange patches in (a) and (c) mark the receptive fields of the strongest responses of a certain filter on the given crowd images in (b). The top five receptive fields from images in the WWW crowd dataset that have the strongest responses to the corresponding filters are listed alongside. (d) and (e) present patches that have the strongest responses for the reserved spatial filters and the pruned spatial filters, respectively.
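Selecting the strongest responses of one filter across a set of images, as in the top-five patches mentioned above, could look roughly like the following. This is a hedged sketch: `top_k_responses` is a hypothetical helper, the toy array stands in for real conv4_3 feature maps, and mapping a feature-map location back to its receptive field in the image (which depends on the stride and filter sizes of the network) is left out.

```python
import numpy as np


def top_k_responses(feature_maps, k=5):
    """Return the k strongest responses of ONE filter over N images.

    feature_maps: array of shape (N, H, W).
    Returns a list of (image_index, row, col, value), strongest first.
    """
    flat = feature_maps.ravel()
    order = np.argsort(flat)[::-1][:k]                 # indices of the k largest values
    images, rows, cols = np.unravel_index(order, feature_maps.shape)
    return [
        (int(n), int(r), int(c), float(flat[o]))
        for n, r, c, o in zip(images, rows, cols, order)
    ]


# Toy responses of a single filter on three 4x4 feature maps.
maps = np.zeros((3, 4, 4))
maps[1, 2, 3] = 9.0
maps[0, 0, 1] = 7.0
strongest = top_k_responses(maps, k=2)
print(strongest)
```

Each returned tuple identifies the image and the feature-map position whose receptive field would be cropped for display.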


Semantic Temporal Slices

Existing studies typically learn dynamic features from raw video volumes or hand-crafted motion maps. However, much information is lost at the input layer since they compress the entire temporal range by sub-sampling frames or averaging spatial feature maps along the time dimension. Indeed, dynamic feature representations can also be described by 2D temporal slices that cut across the 3D volume along the other two orthogonal planes, such as the xt- or yt-slices shown below:

However, in general an xt- or yt-slice taken from a raw video volume contains motion patterns of multiple objects of different categories, which cannot be well separated since the features that identify these categories relate to appearance rather than motion. For instance, the yt-slice in the figure above contains motion patterns from the audience, the dancers and the ice rink. It is not a trivial task to separate their motion patterns without first identifying these objects. Motivated by this observation, we propose the Semantic Temporal Slice (STS), extracted from semantic feature cuboids, which are obtained from the xy convolutional layers. Such slices can distinguish and purify the dynamic representation of a certain semantic pattern without interference from other objects, instances or visual patterns inside one temporal slice.
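The difference between a raw temporal slice and a Semantic Temporal Slice can be illustrated on a toy video. The sketch below is an assumption-laden example rather than the authors' code: the two "objects" are single pixels, the intensity tests that build `features` are a crude stand-in for learned xy convolutional filters, and `semantic_temporal_slice` is a hypothetical helper name.

```python
import numpy as np

T, H, W = 5, 4, 6
video = np.zeros((T, H, W))
for t in range(T):
    video[t, 1, t] = 1.0   # "dancer": moves one pixel to the right per frame in row 1
    video[t, 1, 5] = 0.5   # "audience": stays at column 5 of the same row

# Raw xt-slice at row y=1: the motion traces of both objects are mixed together.
raw_xt = video[:, 1, :]                                # shape (T, W)

# Stand-in for per-frame semantic feature maps, shape (C, T, H, W):
# channel 0 responds to the dancer, channel 1 to the audience.
features = np.stack([video == 1.0, video == 0.5]).astype(float)


def semantic_temporal_slice(features, channel, y=None, x=None):
    """Cut an xt-slice (give y) or a yt-slice (give x) from one feature cuboid."""
    if (y is None) == (x is None):
        raise ValueError("specify exactly one of y or x")
    cuboid = features[channel]                         # shape (T, H, W)
    return cuboid[:, y, :] if y is not None else cuboid[:, :, x]


sts_dancer = semantic_temporal_slice(features, channel=0, y=1)
sts_audience = semantic_temporal_slice(features, channel=1, y=1)
print(raw_xt)
print(sts_dancer)     # diagonal trace only: the moving object
print(sts_audience)   # vertical trace only: the static object
```

In the raw slice the diagonal and vertical traces overlap in one image, whereas each semantic slice contains the trace of a single pattern, which is the separation the STS is meant to provide.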







Reference and Acknowledgments

If you use our dataset or code, please cite our papers.

Jing Shao, Chen Change Loy, Kai Kang, and Xiaogang Wang. "Slicing Convolutional Neural Network for Crowd Video Understanding", in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016, spotlight).  

Jing Shao, Kai Kang, Chen Change Loy, and Xiaogang Wang. "Deeply Learned Attributes for Crowd Scene Understanding", in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015, oral).  

This work is partially supported by SenseTime Group Limited, the General Research Fund sponsored by the Research Grants Council of Hong Kong (Project Nos. CUHK14206114, CUHK14205615, CUHK419412, CUHK14203015), the Hong Kong Innovation and Technology Support Programme (No. ITS/221/13FP) and the National Natural Science Foundation of China (NSFC No. 61371192).


Contact Me

If you have any questions, please feel free to contact me.


Last update: July 15, 2016