Cracks are among the earliest indicators of deterioration in civil infrastructure. Structures are typically monitored manually by inspectors, a process that is time-consuming, laborious, costly, and prone to human error. To address these limitations, this paper presents a vision transformer-based approach for detecting and localizing cracks in stone floor tiles. The proposed model is trained on a custom dataset acquired from a variety of stone tiles under varying illumination conditions in the United Arab Emirates. The dataset consists of 5800 images with a resolution of 224×224 pixels. The model is evaluated in terms of testing accuracy, precision, recall, F1 score, and computational time. The input patch size of the Vision Transformer (ViT) model is varied to investigate its effect on performance. The experimental results show that the input patch size has a significant effect on the performance of the model. The ViT model trained with the smallest patch size, 14×14 pixels, achieved the highest testing accuracy, precision, recall, and F1 score of 0.8612, 0.8840, 0.8304, and 0.8564, respectively. The inference time of the ViT model for a single patch is 0.092 s. Crack localization is performed by combining the trained ViT model with a sliding window approach. The model performed well in detecting and locating cracks in stone floor tiles, indicating its potential for practical use.
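The sliding window localization step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the non-overlapping 224-pixel stride, the 0.5 decision threshold, and the `toy_classifier` stand-in for the trained ViT are all assumptions made for illustration. Only the 224×224 window size comes from the dataset description.

```python
import numpy as np


def sliding_windows(height, width, window=224, stride=224):
    """Yield (top, left) corners of windows that fit fully inside the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left


def localize_cracks(image, classify, window=224, stride=224, threshold=0.5):
    """Return a boolean mask marking every window the classifier flags as cracked.

    `classify` is any callable mapping a window to a crack score in [0, 1];
    in the paper's setting this role would be played by the trained ViT.
    """
    height, width = image.shape[:2]
    mask = np.zeros((height, width), dtype=bool)
    for top, left in sliding_windows(height, width, window, stride):
        tile = image[top:top + window, left:left + window]
        if classify(tile) >= threshold:
            mask[top:top + window, left:left + window] = True
    return mask


def toy_classifier(tile):
    """Hypothetical stand-in for the ViT: any very dark pixel counts as a crack."""
    return 1.0 if tile.min() < 50 else 0.0


# Synthetic grayscale floor image (2 x 3 windows) with one dark "crack" pixel.
image = np.full((448, 672), 200, dtype=np.uint8)
image[300, 500] = 0
mask = localize_cracks(image, toy_classifier)
```

With this synthetic input, only the window containing the dark pixel is marked in `mask`. A stride smaller than the window size would give overlapping windows and finer localization, at the cost of more classifier calls per image.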