BENCHMARKING CNN, LSTM, AND VISION TRANSFORMER MODELS FOR MULTILINGUAL SIGN LANGUAGE RECOGNITION: A CASE STUDY ON ASL, ISL, AND BISINDO

Authors

  • Yuvi Darmayunata, Universitas Lancang Kuning
  • Lucky Lhaura Van FC, Universitas Lancang Kuning
  • Vebby Vebby, Universitas Lancang Kuning

DOI:

https://doi.org/10.31849/718y3547

Keywords:

Sign Language, Deep Learning, Convolutional Neural Network (CNN), Transformer, Long Short-Term Memory

Abstract

This study compares three deep learning architectures, namely Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Vision Transformer (ViT), for multilingual sign language recognition covering ASL, ISL, and BISINDO. Experiments were conducted on a structured video dataset with a preprocessing pipeline consisting of frame extraction, hand segmentation, normalization, and data augmentation. The models were evaluated on accuracy, F1-score, inference latency, and computational complexity. The results show that the Transformer achieved the highest accuracy at 98.7%, but also required the most memory and the longest inference time. The LSTM model offered the best trade-off, reaching 96.7% accuracy at a latency of 120 ms/frame, making it the more practical choice for real-time systems. These findings demonstrate that model selection must be matched to the deployment context, particularly when developing inclusive technology for the Deaf community in Indonesia.
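
The abstract summarizes the evaluation protocol (accuracy, F1-score, and per-frame inference latency) without showing how such figures are computed. The sketch below is a minimal, illustrative benchmarking harness in Python/PyTorch, not the authors' actual code: the TinySignCNN model, the 64x64 input size, the 26-class label space, and the synthetic frames are all assumptions made only to keep the example self-contained and runnable.

    # Minimal sketch of the reported evaluation protocol: accuracy, macro
    # F1-score, and mean per-frame inference latency for a frame-level sign
    # classifier. Model, input size, and data below are hypothetical.
    import time

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.metrics import accuracy_score, f1_score


    class TinySignCNN(nn.Module):
        """Stand-in CNN classifier over 64x64 RGB hand crops (hypothetical)."""

        def __init__(self, num_classes: int = 26):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(32 * 16 * 16, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x).flatten(1))


    @torch.no_grad()
    def benchmark(model: nn.Module, frames: torch.Tensor, labels: np.ndarray) -> dict:
        """Run frame-by-frame inference and report the three headline metrics."""
        model.eval()
        preds, latencies_ms = [], []
        for frame in frames:
            start = time.perf_counter()
            logits = model(frame.unsqueeze(0))  # one frame at a time, as in real-time use
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
            preds.append(int(logits.argmax(dim=1)))
        return {
            "accuracy": accuracy_score(labels, preds),
            "f1_macro": f1_score(labels, preds, average="macro"),
            "latency_ms_per_frame": float(np.mean(latencies_ms)),
        }


    if __name__ == "__main__":
        # Synthetic stand-in data: 100 random 64x64 frames with 26 classes.
        rng = np.random.default_rng(0)
        frames = torch.rand(100, 3, 64, 64)
        labels = rng.integers(0, 26, size=100)
        print(benchmark(TinySignCNN(), frames, labels))

Measuring latency one frame at a time, rather than in batches, mirrors the real-time deployment scenario the abstract emphasizes; the same harness could wrap an LSTM or ViT classifier by swapping the model argument.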



Published

2026-01-24

How to Cite

[1] “BENCHMARKING CNN, LSTM, AND VISION TRANSFORMER MODELS FOR MULTILINGUAL SIGN LANGUAGE RECOGNITION: A CASE STUDY ON ASL, ISL, AND BISINDO”, zn, vol. 8, no. 1, pp. 255–265, Jan. 2026, doi: 10.31849/718y3547.