BENCHMARKING CNN, LSTM, AND VISION TRANSFORMER MODELS FOR MULTILINGUAL SIGN LANGUAGE RECOGNITION: A CASE STUDY ON ASL, ISL, AND BISINDO
DOI: https://doi.org/10.31849/718y3547

Keywords: Sign Language, Deep Learning, Convolutional Neural Network (CNN), Transformer, Long Short-Term Memory

Abstract
This study compares three deep learning architectures, namely Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Vision Transformer (ViT), for multilingual sign language recognition (ASL, ISL, and BISINDO). Experiments were conducted on a structured video dataset with preprocessing stages comprising frame extraction, hand segmentation, normalization, and data augmentation. The models were evaluated on accuracy, F1-score, inference latency, and computational complexity. The results show that the Transformer achieved the highest accuracy, 98.7%, but required the most memory and the longest inference time. The LSTM model offered the best trade-off, with 96.7% accuracy at a latency of 120 ms/frame, making it more suitable for real-time systems. These findings demonstrate that model selection must be matched to the deployment context, particularly when developing inclusive technology for the Deaf community in Indonesia.
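The preprocessing pipeline named in the abstract (frame extraction, normalization, augmentation) can be sketched as below. This is a minimal, hypothetical illustration, not the authors' implementation: frames are plain nested lists standing in for grayscale images, hand segmentation is omitted, and all function names and parameters are assumptions.

```python
def sample_frames(frames, num_frames=16):
    """Frame extraction: pick num_frames evenly spaced frames from a video,
    represented here as a list of 2D pixel grids."""
    step = len(frames) / num_frames
    return [frames[int(i * step)] for i in range(num_frames)]

def normalize(frame):
    """Normalization: scale 8-bit pixel intensities into [0, 1]."""
    return [[px / 255.0 for px in row] for row in frame]

def flip_horizontal(frame):
    """Augmentation: mirror each row, a cheap flip for gesture data
    (only valid for signs that are not handedness-dependent)."""
    return [row[::-1] for row in frame]
```

A real system would apply these per clip before feeding the tensor to the CNN, LSTM, or ViT backbone under comparison.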
License
Copyright (c) 2026 ZONAsi: Jurnal Sistem Informasi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
