ViTs as backbones: Leveraging vision transformers for feature extraction

Omar Elharrouss, Yassine Himeur, Yasir Mahmood, Saed Alrabaee, Abdelmalik Ouamane, Faycal Bensaali, Yassine Bechqito, Ammar Chouchane

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

The emergence of Vision Transformers (ViTs) has marked a significant shift in the field of computer vision, presenting new methodologies that challenge traditional convolutional neural networks (CNNs). This review offers a thorough exploration of ViTs, unpacking their foundational principles, including the self-attention mechanism and multi-head attention, while examining their diverse applications. We delve into the core mechanics of ViTs, such as image patching, positional encoding, and the datasets that underpin their training. By categorizing and comparing ViTs, CNNs, and hybrid models, we shed light on their respective strengths and limitations, offering a nuanced perspective on their roles in advancing computer vision. A critical evaluation of notable ViT architectures—including DeiT, DeepViT, and Swin-Transformer—highlights their efficacy in feature extraction and domain-specific tasks. The review extends its scope to illustrate the versatility of ViTs in applications like image classification, medical imaging, object detection, and visual question answering, supported by case studies on benchmark datasets such as ImageNet and COCO. While ViTs demonstrate remarkable potential, they are not without challenges, including high computational demands, extensive data requirements, and generalization difficulties. To address these limitations, we propose future research directions aimed at improving scalability, efficiency, and adaptability, especially in resource-constrained settings. By providing a comprehensive overview and actionable insights, this review serves as an essential guide for researchers and practitioners navigating the evolving field of vision-based deep learning.
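The core mechanics the abstract lists (image patching, positional encoding, and self-attention) can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the patch size, embedding dimension, and random weight initialisations below are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Image patching: split a 32x32x3 image into non-overlapping 8x8 patches ---
img = rng.standard_normal((32, 32, 3))
P = 8                                                # patch size (assumption)
patches = img.reshape(32 // P, P, 32 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)             # (16 patches, 192 dims)

# --- Linear projection plus a positional encoding (randomly initialised here,
#     learned in a real ViT) ---
d = 64                                               # embedding dim (assumption)
W_embed = rng.standard_normal((P * P * 3, d)) * 0.02
pos = rng.standard_normal((patches.shape[0], d)) * 0.02
tokens = patches @ W_embed + pos                     # (16, 64) token sequence

# --- Single-head self-attention over the patch tokens ---
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / np.sqrt(d)                        # scaled dot-product scores
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)             # softmax over keys
out = attn @ V                                       # attended patch features

print(patches.shape, tokens.shape, out.shape)
```

A full ViT stacks many such attention layers (with multiple heads, residual connections, and MLP blocks) and prepends a class token, but the patch-to-token pipeline above is the part that replaces a CNN's convolutional stem.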

Original language: English
Article number: 102951
Journal: Information Fusion
Volume: 118
DOIs
Publication status: Published - Jun 2025
Externally published: Yes

Keywords

  • Attention
  • Computer vision
  • Deep learning
  • Transformers
  • Vision transformers

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
  • Hardware and Architecture
