TY - GEN
T1 - Drone-HAT
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
AU - Khan, Mustaqeem
AU - Ahmad, Jamil
AU - El Saddik, Abdulmotaleb
AU - Gueaieb, Wail
AU - De Masi, Giulia
AU - Karray, Fakhri
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Ultra-high-resolution aerial videos are becoming increasingly popular for enhancing surveillance capabilities in sparsely populated areas. Analyzing human activities in these videos automatically, such as "who is doing what?", is desirable to realize their surveillance potential. Atomic visual action detection has successfully recognized such activities in movie data, but adapting it to ultra-high-resolution aerial videos is challenging because the target persons appear relatively tiny from overhead views and are sparsely located. Additionally, existing atomic visual action detection methods assume single-label actions, whereas people can perform multiple actions simultaneously, making a multi-label approach more appropriate. To address these problems, we propose a multi-label action detection/recognition framework using a hybrid attention vision transformer (HAT) to recognize recurrent actions more efficiently. A multi-scale, multi-granularity module inside the action recognition transformer block extracts relevant features without redundancy. Using the Okutama dataset, we demonstrate that our method outperforms existing state-of-the-art methods for interpreting human activity in aerial videos.
AB - Ultra-high-resolution aerial videos are becoming increasingly popular for enhancing surveillance capabilities in sparsely populated areas. Analyzing human activities in these videos automatically, such as "who is doing what?", is desirable to realize their surveillance potential. Atomic visual action detection has successfully recognized such activities in movie data, but adapting it to ultra-high-resolution aerial videos is challenging because the target persons appear relatively tiny from overhead views and are sparsely located. Additionally, existing atomic visual action detection methods assume single-label actions, whereas people can perform multiple actions simultaneously, making a multi-label approach more appropriate. To address these problems, we propose a multi-label action detection/recognition framework using a hybrid attention vision transformer (HAT) to recognize recurrent actions more efficiently. A multi-scale, multi-granularity module inside the action recognition transformer block extracts relevant features without redundancy. Using the Okutama dataset, we demonstrate that our method outperforms existing state-of-the-art methods for interpreting human activity in aerial videos.
KW - Aerial Surveillance
KW - Hybrid Attention Transformer
KW - Multi-granularity and Multi-scale Fusion
KW - Multi-label Action Recognition
UR - https://www.scopus.com/pages/publications/85200720570
UR - https://www.scopus.com/pages/publications/85200720570#tab=citedBy
U2 - 10.1109/CVPRW63382.2024.00474
DO - 10.1109/CVPRW63382.2024.00474
M3 - Conference contribution
AN - SCOPUS:85200720570
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 4713
EP - 4722
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -