Proceedings of International Conference on Applied Innovation in IT  ·  2026/03/31  ·  Vol. 14  ·  Issue 1  ·  pp. 651–660
Recognizing Gesture Images with ViT and Spatial Attention Regularization
Zahraa Thamer, Noor S. Sagheer, Ashwan A. Abdulmunem, Hawraa Thamer and Oğuz Ata
Image-based gesture recognition is an important area of Human-Computer Interaction (HCI). Despite considerable progress, reliable and accurate gesture recognition in unrestricted, real-world settings remains difficult. Conventional techniques often struggle with changes in lighting, background noise, occlusions, size variations, and the inherent similarity between different gestures. To enhance the discriminative ability of the Vision Transformer (ViT) model for intricate hand gestures, this work presents a carefully designed fine-tuning methodology. To encourage ViT to concentrate on salient gesture regions while remaining resilient to environmental noise, the proposed method combines an adaptive learning rate scheduling scheme with a novel spatial attention regularizer during fine-tuning. Experiments on a challenging and varied gesture dataset show that the proposed approach significantly outperforms state-of-the-art methods, reaching an accuracy of up to 100% and demonstrating its generalization capability. By providing an effective and flexible framework for image-based gesture recognition, this study opens the door to more user-friendly human-computer interaction systems.
Computer Vision, ViT, Gesture Images, Spatial Attention Regularization, Image Analysis, Hand Gesture Recognition
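The abstract names two ingredients of the fine-tuning recipe, a spatial attention regularizer and a learning rate schedule, without giving their exact form. The sketch below is only one plausible reading, not the authors' method. It assumes the regularizer is an entropy penalty on the attention weights from the class token to the image patches, and it uses a warmup-plus-cosine schedule as a stand-in for the unspecified adaptive scheduler. The function names and the values of `lam`, `base_lr`, and `warmup_steps` are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over patches."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss, cls_attention, lam=0.01):
    """Task loss plus a penalty that grows when attention is spread thinly.

    cls_attention: attention weights from the class token to each patch.
    lam: hypothetical regularization strength (not a value from the paper).
    """
    return task_loss + lam * attention_entropy(cls_attention)


def cosine_lr(step, total_steps, base_lr=3e-5, warmup_steps=100):
    """Linear warmup followed by cosine decay; all defaults are illustrative."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Attention spread evenly over four patches versus attention focused on one.
uniform = [0.25, 0.25, 0.25, 0.25]
focused = [0.85, 0.05, 0.05, 0.05]
print(regularized_loss(1.0, uniform) > regularized_loss(1.0, focused))
```

Under this assumption, attention that is concentrated on a few patches incurs a smaller penalty than attention spread evenly across the image, which is one way to push the model toward salient gesture regions.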

Proceedings of the International Conference on Applied Innovations in IT by Anhalt University of Applied Sciences is licensed under CC BY-SA 4.0  ·  This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

ISSN 2199-8876
Published by ICAIIT in cooperation with Anhalt University of Applied Sciences.

© 2026 ICAIIT — International Conference on Applied Innovations in IT. Anhalt University of Applied Sciences, Köthen, Germany.