Abstract
Vision Transformers (ViTs) have marked a significant shift in computer vision, introducing methods that challenge traditional convolutional neural networks (CNNs). This review examines ViTs in depth, covering their foundational principles, including self-attention and multi-head attention, and their diverse applications. We describe the core mechanics of ViTs, such as image patching and positional encoding, along with the datasets that underpin their training. By categorizing and comparing ViTs, CNNs, and hybrid models, we clarify their respective strengths and limitations and their roles in advancing computer vision. A critical evaluation of notable ViT architectures, including DeiT, DeepViT, and Swin Transformer, highlights their efficacy in feature extraction and domain-specific tasks. The review also illustrates the versatility of ViTs in applications such as image classification, medical imaging, object detection, and visual question answering, supported by case studies on benchmark datasets such as ImageNet and COCO. Despite this potential, ViTs face challenges, including high computational demands, extensive data requirements, and generalization difficulties. To address these limitations, we propose future research directions aimed at improving scalability, efficiency, and adaptability, especially in resource-constrained settings. By providing a comprehensive overview and actionable insights, this review serves as a guide for researchers and practitioners working in vision-based deep learning.
| Original language | English |
|---|---|
| Article number | 102951 |
| Journal | Information Fusion |
| Volume | 118 |
| Publication status | Published - Jun 2025 |
| Externally published | Yes |
Keywords
- Attention
- Computer vision
- Deep learning
- Transformers
- Vision transformers
ASJC Scopus subject areas
- Software
- Signal Processing
- Information Systems
- Hardware and Architecture
Title
- ViTs as backbones: Leveraging vision transformers for feature extraction