Traditional methods for video recognition rely on hand-crafted features, which often involve offline pre-processing of real-world videos. In this study, we propose a conceptually simple framework that takes raw videos directly as input for activity recognition. Our framework consists of two streams: a spatial stream and a temporal stream. The spatial stream trains a RepVGG-B0 ConvNet on cropped RGB features, while the temporal stream uses an attention-based Bi-directional Long Short-Term Memory (Bi-LSTM) network to learn posture vectors from human pose data obtained through a pre-trained Faster R-CNN model. We evaluate the proposed method on a standard video action recognition benchmark, MSR Daily Activity3D, where it achieves state-of-the-art performance with a precision of 99.01% and a recall of 98.91%. These results demonstrate the effectiveness of our approach for video action recognition.