Abstract
Most existing video-based object detection methods use a successful image-based object detector as a base network and additionally exploit temporal information through either bounding-box post-processing or feature enhancement across multiple frames. However, little work has been done on directly modeling temporal motion in an efficient way for detection in surveillance videos. In this paper, a simple but effective module, denoted motion-from-memory (MFM), is proposed to encode temporal context for improved detection in surveillance videos. Given appearance features extracted by a base CNN, the MFM module maintains a dynamic memory for each input sequence and outputs motion features for each frame. The module adds only a small number of parameters and little computation, yet is very helpful for moving-object detection, especially in surveillance videos. With the additional MFM module, the performance of a lightweight MobileNet-based Faster R-CNN detector is boosted by 13.93% in mAP, reaching performance comparable to that of a strong ResNet-50-based detector. When MFM is integrated into an even weaker but faster single-stage detector, it ranks second among all published works on the UA-DETRAC vehicle detection benchmark, with 69.10% mAP versus 69.87% for the best. In terms of running speed, however, the proposed method is the fastest, processing 540x960 surveillance videos at 33 FPS on a moderate commercial GPU (NVIDIA GTX 1080Ti), about three times faster than the second fastest.
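The abstract only states that MFM keeps a per-sequence dynamic memory and emits motion features per frame; the exact update rule is not given here. As a minimal illustrative sketch (the class name, the exponential-moving-average memory update, and the feature-difference motion signal are all assumptions, not the paper's actual design), such a module might look like:

```python
import numpy as np

class MotionFromMemory:
    """Illustrative sketch of a motion-from-memory (MFM) style module.

    Assumptions: the memory is an exponential moving average of past
    appearance features, and the motion feature is the difference between
    the current frame's features and that memory, so static background
    largely cancels while moving objects stand out.
    """

    def __init__(self, decay=0.9):
        self.decay = decay
        self.memory = None  # per-sequence dynamic memory

    def reset(self):
        # Call at the start of each new input sequence.
        self.memory = None

    def step(self, feat):
        # feat: appearance feature map from the base CNN, shape (C, H, W).
        if self.memory is None:
            self.memory = feat.copy()
        motion = feat - self.memory  # static regions yield near-zero motion
        # Drift the memory toward the current frame's features.
        self.memory = self.decay * self.memory + (1 - self.decay) * feat
        return motion
```

In use, a detector could concatenate `motion` with `feat` before the detection head, which is consistent with the abstract's claim of minor extra parameters and computation.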
| Original language | English |
|---|---|
| Article number | 8669956 |
| Pages (from-to) | 3558-3567 |
| Number of pages | 10 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 29 |
| Issue number | 12 |
| DOIs | |
| Publication status | Published - Dec 2019 |
| Externally published | Yes |
Keywords
- Deep neural network
- Object detection
- Surveillance video
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering