Two-Stream Architecture Using RGB-based ConvNet and Pose-based LSTM for Video Action Recognition

Ching Jung Huang, Munkhjargal Gochoo, Tan Hsu Tan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Traditional methods for video recognition require hand-crafted features, which often involve offline pre-processing for real-world videos. In this study, we propose a conceptually simple framework that takes raw videos directly as input for activity recognition. Our framework consists of two streams: a spatial stream and a temporal stream. The spatial stream uses a RepVGG-B0 ConvNet trained on cropped RGB features, while the temporal stream uses an attention-based bi-directional Long Short-Term Memory (Bi-LSTM) network to learn posture vectors from human pose data obtained through a pre-trained Faster R-CNN model. We evaluate the proposed method on a standard video action recognition benchmark, MSR Daily Activity3D, where it achieves state-of-the-art performance with precision and recall of 99.01% and 98.91%, respectively. These results demonstrate the effectiveness of our approach in recognizing video actions.
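
The abstract does not say how the Bi-LSTM outputs are pooled or how the two streams are combined, so the following is only a minimal sketch under stated assumptions: dot-product attention over per-frame vectors for the temporal stream, and a weighted average of class probabilities (late fusion) to merge the streams. The names `attention_pool`, `late_fusion`, `predict`, and `spatial_weight` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Collapse per-frame vectors into one clip vector with dot-product attention.

    `frames` stands in for the Bi-LSTM hidden states of one clip and `query`
    for a learned attention vector. Both are illustrative assumptions.
    """
    scores = [sum(h * q for h, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]


def late_fusion(spatial_probs: Sequence[float],
                temporal_probs: Sequence[float],
                spatial_weight: float = 0.5) -> List[float]:
    """Weighted average of the two streams' class probabilities.

    This fusion rule is an assumption, not the method described in the paper.
    """
    return [spatial_weight * s + (1.0 - spatial_weight) * t
            for s, t in zip(spatial_probs, temporal_probs)]


def predict(spatial_probs: Sequence[float],
            temporal_probs: Sequence[float],
            spatial_weight: float = 0.5) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(spatial_probs, temporal_probs, spatial_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

In a real pipeline the inputs would be the softmax outputs of the trained spatial and temporal networks. Here they are plain lists so the fusion step can be read in isolation.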

Original language: English
Title of host publication: 2023 15th International Conference on Innovations in Information Technology, IIT 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 127-131
Number of pages: 5
ISBN (Electronic): 9798350382396
Publication status: Published - 2023
Event: 15th International Conference on Innovations in Information Technology, IIT 2023 - Al Ain, United Arab Emirates
Duration: Nov 14 2023 to Nov 15 2023

Publication series

Name: 2023 15th International Conference on Innovations in Information Technology, IIT 2023

Conference

Conference: 15th International Conference on Innovations in Information Technology, IIT 2023
Country/Territory: United Arab Emirates
City: Al Ain
Period: 11/14/23 to 11/15/23

Keywords

  • Faster R-CNN
  • RGB activity images
  • RepVGG
  • attention-based Bi-directional LSTM
  • deep learning
  • video action recognition

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
