TY - JOUR
T1 - ViTs as backbones
T2 - Leveraging vision transformers for feature extraction
AU - Elharrouss, Omar
AU - Himeur, Yassine
AU - Mahmood, Yasir
AU - Alrabaee, Saed
AU - Ouamane, Abdelmalik
AU - Bensaali, Faycal
AU - Bechqito, Yassine
AU - Chouchane, Ammar
N1 - Publisher Copyright:
© 2025
PY - 2025/6
Y1 - 2025/6
N2 - The emergence of Vision Transformers (ViTs) has marked a significant shift in computer vision, introducing methodologies that challenge traditional convolutional neural networks (CNNs). This review offers a thorough exploration of ViTs, unpacking their foundational principles, including the self-attention mechanism and multi-head attention, while examining their diverse applications. We delve into the core mechanics of ViTs, such as image patching, positional encoding, and the datasets that underpin their training. By categorizing and comparing ViTs, CNNs, and hybrid models, we shed light on their respective strengths and limitations, offering a nuanced perspective on their roles in advancing computer vision. A critical evaluation of notable ViT architectures, including DeiT, DeepViT, and the Swin Transformer, highlights their efficacy in feature extraction and domain-specific tasks. The review extends its scope to illustrate the versatility of ViTs in applications such as image classification, medical imaging, object detection, and visual question answering, supported by case studies on benchmark datasets such as ImageNet and COCO. While ViTs demonstrate remarkable potential, they are not without challenges, including high computational demands, extensive data requirements, and generalization difficulties. To address these limitations, we propose future research directions aimed at improving scalability, efficiency, and adaptability, especially in resource-constrained settings. By providing a comprehensive overview and actionable insights, this review serves as an essential guide for researchers and practitioners navigating the evolving field of vision-based deep learning.
AB - The emergence of Vision Transformers (ViTs) has marked a significant shift in computer vision, introducing methodologies that challenge traditional convolutional neural networks (CNNs). This review offers a thorough exploration of ViTs, unpacking their foundational principles, including the self-attention mechanism and multi-head attention, while examining their diverse applications. We delve into the core mechanics of ViTs, such as image patching, positional encoding, and the datasets that underpin their training. By categorizing and comparing ViTs, CNNs, and hybrid models, we shed light on their respective strengths and limitations, offering a nuanced perspective on their roles in advancing computer vision. A critical evaluation of notable ViT architectures, including DeiT, DeepViT, and the Swin Transformer, highlights their efficacy in feature extraction and domain-specific tasks. The review extends its scope to illustrate the versatility of ViTs in applications such as image classification, medical imaging, object detection, and visual question answering, supported by case studies on benchmark datasets such as ImageNet and COCO. While ViTs demonstrate remarkable potential, they are not without challenges, including high computational demands, extensive data requirements, and generalization difficulties. To address these limitations, we propose future research directions aimed at improving scalability, efficiency, and adaptability, especially in resource-constrained settings. By providing a comprehensive overview and actionable insights, this review serves as an essential guide for researchers and practitioners navigating the evolving field of vision-based deep learning.
KW - Attention
KW - Computer vision
KW - Deep learning
KW - Transformers
KW - Vision transformers
UR - http://www.scopus.com/inward/record.url?scp=85215859282&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215859282&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2025.102951
DO - 10.1016/j.inffus.2025.102951
M3 - Article
AN - SCOPUS:85215859282
SN - 1566-2535
VL - 118
JO - Information Fusion
JF - Information Fusion
M1 - 102951
ER -