Fine-Tuning Vision Transformer for Arabic Sign Language Video Recognition on Augmented Small-Scale Dataset

Munkhjargal Gochoo, Ganzorig Batnasan, Ahmed Abdelhadi Ahmed, Munkh Erdene Otgonbold, Fady Alnajjar, Timothy K. Shih, Tan Hsu Tan, Lai Khin Wee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With the rise of AI, the recognition of Sign Language (SL) through sign-to-text translation has gained significance in computer vision and deep learning. However, only a few medium-to-large open datasets are available for this task, as it requires recording thousands of signs for words/phrases in different environments, which is a time-consuming and tedious process. Furthermore, there has been very little effort towards Arabic Sign Language Recognition (ArSLR). This paper presents the results of fine-tuning a Vision Transformer (ViT) model on a small-scale in-house ArSL dataset. The main goal is to attain satisfactory results using minimal computing power and a small dataset involving fewer than 10 individuals, with only one recording made for each sign in each environment. The dataset comprises 49 classes/signs, all performed with two hands and belonging to the Level I category in terms of popularity. To enhance the dataset, three types of augmentation (translation, shear, and rotation) were employed. The ViT model, pre-trained on the Kinetics dataset, was trained on variants of the augmented dataset containing 2 to 40 augmented samples per original video, where the training set includes the original and augmented videos of 8 volunteers and the test set includes only the original videos of one held-out volunteer. Experimental results reveal that the combination of rotation and shear outperformed the other augmentations, achieving 93% accuracy on the dataset with 20 augmented samples per class per signer. We believe this study sheds light on small-scale-dataset-based SLR tasks and video/action recognition in general.
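The augmentations described in the abstract can be sketched as a single affine warp (rotation, shear, translation) applied identically to every frame of a clip, so that the transform is temporally consistent across the video. The matrix composition and nearest-neighbour sampling below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def make_affine(rotation_deg=0.0, shear_deg=0.0, tx=0.0, ty=0.0):
    """Compose rotation and shear into one 2x2 matrix, plus a translation offset."""
    r = np.deg2rad(rotation_deg)
    s = np.tan(np.deg2rad(shear_deg))
    rot = np.array([[np.cos(r), -np.sin(r)],
                    [np.sin(r),  np.cos(r)]])
    shear = np.array([[1.0, s],
                      [0.0, 1.0]])
    return rot @ shear, np.array([tx, ty], dtype=float)

def augment_video(frames, rotation_deg=0.0, shear_deg=0.0, tx=0.0, ty=0.0):
    """Apply the SAME affine transform to every frame of a (T, H, W, C) clip,
    using inverse mapping with nearest-neighbour sampling about the frame centre."""
    t, h, w = frames.shape[:3]
    A, off = make_affine(rotation_deg, shear_deg, tx, ty)
    inv = np.linalg.inv(A)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # For each output pixel, find the source pixel under the inverse transform.
    coords = np.stack([xs - cx - off[0], ys - cy - off[1]], axis=-1) @ inv.T
    src_x = np.clip(np.round(coords[..., 0] + cx).astype(int), 0, w - 1)
    src_y = np.clip(np.round(coords[..., 1] + cy).astype(int), 0, h - 1)
    return frames[:, src_y, src_x]

# Example: a tiny 4-frame synthetic "video", rotated 10 degrees and sheared 5 degrees.
clip = np.random.randint(0, 255, size=(4, 64, 64, 3), dtype=np.uint8)
aug = augment_video(clip, rotation_deg=10, shear_deg=5, tx=3, ty=-2)
```

In the paper's setup, each original training video would be warped this way 2 to 40 times with randomly drawn parameters; combining rotation with shear (as above) is the variant the abstract reports as best-performing.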

Original language: English
Title of host publication: 2023 IEEE International Conference on Systems, Man, and Cybernetics
Subtitle of host publication: Improving the Quality of Life, SMC 2023 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2880-2885
Number of pages: 6
ISBN (Electronic): 9798350337020
DOIs
Publication status: Published - 2023
Event: 2023 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2023 - Hybrid, Honolulu, United States
Duration: Oct 1, 2023 - Oct 4, 2023

Publication series

Name: Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
ISSN (Print): 1062-922X

Conference

Conference: 2023 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2023
Country/Territory: United States
City: Hybrid, Honolulu
Period: 10/1/23 - 10/4/23

Keywords

  • Arabic Sign Language
  • Augmentation
  • Deep Learning
  • Small-scale dataset
  • Vision Transformer
  • ViT

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Control and Systems Engineering
  • Human-Computer Interaction
