
Video-based person re-identification is an important research topic in computer vision that entails matching a pedestrian's identity across non-overlapping cameras. It suffers from severe temporal appearance misalignment and visual ambiguity. In this work, we propose a novel self-supervised human semantic parsing approach (SS-HSP) for video-based person re-identification. It employs self-supervised learning to adaptively segment the human body at the pixel level by estimating the motion of each body part between consecutive frames, and it explores complementary temporal relations to obtain reinforced appearance and motion representations. Specifically, a semantic segmentation network within SS-HSP exploits self-supervised learning through a pretext task of predicting future frames. The network jointly learns precise human semantic parsing and the motion field of each body part between consecutive frames, which permits the reconstruction of future frames with the aid of several customized loss functions. Locally aligned features of body parts are then extracted according to the estimated human parsing. Moreover, an aggregation network is proposed to explore correlations across video frames for refining the appearance and motion representations. Extensive experiments on two video datasets demonstrate the effectiveness of the proposed approach.
The basic structure of the self-supervised human semantic parsing approach (SS-HSP).
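As an illustration only, and not the paper's actual implementation, two ideas from the abstract can be sketched in a few lines: pooling a frame's feature map into part-aligned local features using soft parsing masks, and aggregating per-frame features across a video clip with a simple attention weighting. All function names, shapes (`C`, `H`, `W`, `K`, `T`, `D`), and the choice of softmax attention are hypothetical assumptions.

```python
import numpy as np

def part_aligned_features(feature_map: np.ndarray,
                          part_masks: np.ndarray,
                          eps: float = 1e-6) -> np.ndarray:
    """Pool a (C, H, W) feature map into K part vectors.

    part_masks: (K, H, W) soft human-parsing masks in [0, 1], one per body part.
    Returns a (K, C) array: each row is the mask-weighted average feature
    of one body part (a generic stand-in for the paper's local alignment).
    """
    C = feature_map.shape[0]
    K = part_masks.shape[0]
    flat_feats = feature_map.reshape(C, -1)        # (C, H*W)
    flat_masks = part_masks.reshape(K, -1)         # (K, H*W)
    weighted_sums = flat_masks @ flat_feats.T      # (K, C) mask-weighted sums
    areas = flat_masks.sum(axis=1, keepdims=True)  # (K, 1) soft part areas
    return weighted_sums / (areas + eps)           # normalized part features

def aggregate_frames(frame_feats: np.ndarray) -> np.ndarray:
    """Fuse T frame-level vectors (T, D) into one clip vector (D,).

    Uses softmax attention over each frame's similarity to the clip mean,
    a common (assumed) form of cross-frame aggregation.
    """
    clip_mean = frame_feats.mean(axis=0)           # (D,)
    scores = frame_feats @ clip_mean               # (T,) similarity scores
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    return weights @ frame_feats                   # attention-weighted average
```

With uniform inputs the pooling reduces to a plain average, which is a quick sanity check that the mask normalization is correct.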
[1] Li X, Zhou W, Zhou Y, et al. Relation-guided spatial attention and temporal refinement for video-based person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34 (7): 11434–11441. DOI: 10.1609/aaai.v34i07.6807
[2] Cheng Z, Dong Q, Gong S, et al. Inter-task association critic for cross-resolution person re-identification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020: 2602–2612.
[3] Huang Y, Zha Z J, Fu X, et al. Real-world person re-identification via degradation invariance learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020: 14072–14082.
[4] Ding Y, Fan H, Xu M, et al. Adaptive exploration for unsupervised person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16 (1): 1–19. DOI: 10.1145/3369393
[5] Kalayeh M M, Basaran E, Gökmen M, et al. Human semantic parsing for person re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 1062–1071.
[6] Liang X, Gong K, Shen X, et al. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41 (4): 871–885. DOI: 10.1109/TPAMI.2018.2820063
[7] Song C, Huang Y, Ouyang W, et al. Mask-guided contrastive attention model for person re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 1179–1188.
[8] Ye M, Yuen P C. PurifyNet: A robust person re-identification model with noisy labels. IEEE Transactions on Information Forensics and Security, 2020, 15: 2655–2666. DOI: 10.1109/TIFS.2020.2970590
[9] Liu H, Jie Z, Jayashree K, et al. Video-based person re-identification with accumulative motion context. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28 (10): 2788–2802. DOI: 10.1109/TCSVT.2017.2715499
[10] Wang Z, Luo S, Sun H, et al. An efficient non-local attention network for video-based person re-identification. In: ICIT 2019: Proceedings of the 2019 7th International Conference on Information Technology: IoT and Smart City. Shanghai, China: Association for Computing Machinery, 2019: 212–217.
[11] Zheng L, Bie Z, Sun Y, et al. MARS: A video benchmark for large-scale person re-identification. In: Leibe B, Matas J, Sebe N, et al., editors. Computer Vision – ECCV 2016. Cham, Switzerland: Springer, 2016: 868–884.
[12] Wang T, Gong S, Zhu X, et al. Person re-identification by video ranking. In: Fleet D, Pajdla T, Schiele B, et al., editors. Computer Vision – ECCV 2014. Cham, Switzerland: Springer, 2014: 688–703.
[13] McLaughlin N, del Rincon J M, Miller P. Recurrent convolutional network for video-based person re-identification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 1325–1334.
[14] Yang J, Zheng W S, Yang Q, et al. Spatial-temporal graph convolutional network for video-based person re-identification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020: 3286–3296.
[15] Wu Y, Bourahla O E F, Li X, et al. Adaptive graph representation learning for video person re-identification. IEEE Transactions on Image Processing, 2020, 29: 8821–8830. DOI: 10.1109/TIP.2020.3001693
[16] Li S, Bak S, Carr P, et al. Diversity regularized spatiotemporal attention for video-based person re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 369–378.
[17] Zhou Z, Huang Y, Wang W, et al. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017: 4747–4756.
[18] Li X, Loy C C. Video object segmentation with joint re-identification and attention-aware mask propagation. In: Ferrari V, Hebert M, Sminchisescu C, et al., editors. Computer Vision – ECCV 2018. Cham, Switzerland: Springer, 2018: 93–110.
[19] Jones M J, Rambhatla S. Body part alignment and temporal attention for video-based person re-identification. In: Sidorov K, Hicks Y, editors. Proceedings of the British Machine Vision Conference (BMVC). London: BMVA Press, 2019, 115: 1–12.
[20] Gao C, Chen Y, Yu J G, et al. Pose-guided spatiotemporal alignment for video-based person re-identification. Information Sciences, 2020, 527: 176–190. DOI: 10.1016/j.ins.2020.04.007
[21] Liu J, Zha Z J, Chen X, et al. Dense 3D-convolutional neural network for person re-identification in videos. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15 (1s): 1–19. DOI: 10.1145/3231741
[22] Chung D, Tahboub K, Delp E J. A two-stream Siamese convolutional neural network for person re-identification. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017: 1992–2000.
[23] Li J, Zhang S, Huang T. Multi-scale 3D convolution network for video based person re-identification. In: AAAI'19: AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI Press, 2019: 1057.
[24] Jin X, He T, Zheng K, et al. Cloth-changing person re-identification from a single image with gait prediction and regularization. [2021-09-01]. https://arxiv.org/abs/2103.15537
[25] Zhang P, Wu Q, Xu J, et al. Long-term person re-identification using true motion from videos. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, USA: IEEE, 2018: 494–502.
[26] Zhu K, Guo H, Liu Z, et al. Identity-guided human semantic parsing for person re-identification. In: Vedaldi A, Bischof H, Brox T, et al., editors. Computer Vision – ECCV 2020. Cham, Switzerland: Springer, 2020: 346–363.
[27] Liao S C, Hu Y, Zhu X Y, et al. Person re-identification by local maximal occurrence representation and metric learning. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015: 2197–2206.
[28] Bazzani L, Cristani M, Murino V. Symmetry-driven accumulation of local features for human characterization and re-identification. Computer Vision and Image Understanding, 2013, 117 (2): 130–144. DOI: 10.1016/j.cviu.2012.10.008
[29] Zhang L, Xiang T, Gong S. Learning a discriminative null space for person re-identification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 1239–1248.
[30] Zhou Q, Zhong B, Lan X, et al. LRDNN: Local-refining based deep neural network for person re-identification with attribute discerning. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. Macao: International Joint Conferences on Artificial Intelligence Organization, 2019: 1041–1047.
[31] Zhang Z, Lan C, Zeng W, et al. Relation-aware global attention for person re-identification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020: 3183–3192.
[32] Jin X, Lan C, Zeng W, et al. Semantics-aligned representation learning for person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34 (7): 11173–11180. DOI: 10.1609/aaai.v34i07.6775
[33] You J, Wu A, Li X, et al. Top-push video-based person re-identification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 1345–1353.
[34] Gu X, Chang H, Ma B, et al. Appearance-preserving 3D convolution for video-based person re-identification. In: Vedaldi A, Bischof H, Brox T, et al., editors. Computer Vision – ECCV 2020. Cham, Switzerland: Springer, 2020: 228–243.
[35] Li S, Yu H, Hu H. Appearance and motion enhancement for video-based person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34 (7): 11394–11401. DOI: 10.1609/aaai.v34i07.6802
[36] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 770–778.
[37] Siarohin A, Lathuilière S, Tulyakov S, et al. First order motion model for image animation. In: Wallach H, Larochelle H, Beygelzimer A, et al., editors. Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc., 2019: 3854.
[38] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, et al., editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham, Switzerland: Springer, 2015: 234–241.
[39] Johnson J, Alahi A, Li F F. Perceptual losses for real-time style transfer and super-resolution. In: Leibe B, Matas J, Sebe N, et al., editors. Computer Vision – ECCV 2016. Cham, Switzerland: Springer, 2016: 694–711.
[40] Siarohin A, Sangineto E, Lathuilière S, et al. Deformable GANs for pose-based human image generation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 3408–3416.
[41] Hung W C, Jampani V, Liu S F, et al. SCOPS: Self-supervised co-part segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE, 2019: 869–878.
[42] Hou R, Chang H, Ma B, et al. Temporal complementary learning for video person re-identification. [2021-09-01]. https://arxiv.org/abs/2007.09357
[43] Hermans A, Beyer L, Leibe B. In defense of the triplet loss for person re-identification. [2021-09-01]. https://arxiv.org/abs/1703.07737
[44] Liu J, Zha Z J, Chen D, et al. Adaptive transfer network for cross-domain person re-identification. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE, 2019: 7195–7204.
[45] Liu Y, Yan J, Ouyang W. Quality aware network for set to set recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017: 4694–4703.
[46] Subramaniam A, Nambiar A, Mittal A, et al. Co-segmentation inspired attention networks for video-based person re-identification. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 562–572.
[47] Chen D, Li H, Xiao T, et al. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 1169–1178.
[48] Li J, Zhang S, Wang J, et al. Global-local temporal representations for video person re-identification. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 3957–3966.
[49] Aich A, Zheng M, Karanam S, et al. Spatio-temporal representation factorization for video-based person re-identification. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021: 152–162.
[50] He T Y, Jin X, Shen X, et al. Dense interaction learning for video-based person re-identification. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021: 1470–1481.
Method | Rank-1 (%) | Rank-5 (%) | Rank-20 (%) | mAP (%) |
CNN+XQDA [11] | 68.3 | 82.6 | 89.4 | 49.3 |
QAN [45] | 73.5 | 84.9 | 91.6 | 51.7 |
STAN [16] | 82.3 | − | − | 65.8 |
M3D [23] | 84.4 | 93.8 | 97.7 | 74.0 |
COSAM [46] | 84.9 | 95.5 | 97.9 | 79.9 |
Snippet [47] | 86.3 | 94.7 | 98.2 | 76.1 |
GLTR [48] | 87.0 | 95.8 | 98.2 | 78.5 |
RGSAT [1] | 89.4 | 96.9 | 98.3 | 84.0 |
AGRL [15] | 89.8 | 96.1 | 97.6 | 81.1 |
TCLNet [42] | 89.8 | − | − | 85.1 |
STGCN [14] | 90.0 | 96.4 | 98.3 | 83.7 |
AP3D [34] | 90.1 | − | − | 85.1 |
STRF [49] | 90.3 | − | − | 86.1 |
DenseIL [50] | 90.8 | 97.1 | 98.8 | 87.0 |
SS-HSP | 91.0 | 96.9 | 98.6 | 85.9 |
Method | Rank-1 (%) | Rank-5 (%) | Rank-20 (%) |
CNN+XQDA [11] | 53.0 | 81.4 | 95.1 |
QAN [45] | 68.0 | 86.8 | 97.4 |
M3D [23] | 74.0 | 94.33 | − |
COSAM [46] | 79.6 | 95.3 | − |
STAN [16] | 80.2 | − | − |
AGRL [15] | 83.7 | 95.4 | 99.5 |
Snippet [47] | 85.4 | 96.7 | 99.5 |
RGSAT [1] | 86.0 | 98.0 | 99.4 |
GLTR [48] | 86.0 | 98.0 | − |
TCLNet [42] | 86.6 | − | − |
AP3D [34] | 86.7 | − | − |
DenseIL [50] | 92.0 | 98.0 | − |
SS-HSP | 88.3 | 98.4 | 99.9 |
Model | Rank-1 (%) | Rank-5 (%) | Rank-20 (%) | mAP (%) |
Basel | 86.3 | 94.6 | 97.2 | 79.1 |
Basel+Part | 88.5 | 95.7 | 98.0 | 82.8 |
Basel+Part+Motion | 89.9 | 96.4 | 98.4 | 84.9 |
Basel+Part+Motion+TRB | 91.0 | 96.9 | 98.6 | 85.9 |
Model | Rank-1 (%) | Rank-5 (%) | Rank-20 (%) | mAP (%) |
SS-HSP w/o L_tri | 87.2 | 94.7 | 97.8 | 81.8 |
SS-HSP w/o L_ide | 88.3 | 95.5 | 98.1 | 83.0 |
SS-HSP w/o L_equ | 89.0 | 96.0 | 98.4 | 84.1 |
SS-HSP w/o L_geo | 90.3 | 96.5 | 98.5 | 85.2 |
SS-HSP | 91.0 | 96.9 | 98.6 | 85.9 |