Towards an Automated Essay Evaluation System: NLP-Based Text Embeddings and Similarity Metrics
DOI: https://doi.org/10.31849/digitalzone.v16i1.26541

Keywords: Automatic Essay Scoring, Natural Language Processing, Cosine Similarity, Manhattan Distance, Educational Assessment

Abstract
This study aims to develop an automatic essay answer assessment system based on Natural Language Processing (NLP) to reduce the time and effort required for evaluation. The system uses Cosine Similarity and Manhattan Distance as evaluation metrics and implements two text embedding methods, Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), to represent the user's answer text. The methodology begins with text processing and pre-processing, followed by embedding and similarity calculation between the user's answer and the reference text to generate an evaluation score categorized into three levels: good, sufficient, and poor. Based on Cohen's Kappa analysis, the kappa value for Cosine Similarity reaches 0.78, indicating high agreement between the Cosine TF-IDF and Cosine BoW methods. In contrast, Manhattan Distance yields a kappa value of -0.05, indicating a discrepancy between the two Manhattan-based methods. These results suggest that Cosine Similarity is well suited to this task, whereas Manhattan Distance is not. At the modeling stage, the best classification models are Decision Tree and Random Forest, each achieving an accuracy of 96.67%. Although Random Forest achieves a higher AUC than Decision Tree, it requires a longer training time. Overall, the system is considered effective and consistent for assessing essay answers, with potential applications in education.
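For illustration, the sketch below is not the authors' implementation; the example texts, the score thresholds for the three levels, and the choice of scikit-learn are all assumptions. It shows how a student answer can be compared against a reference answer using TF-IDF and BoW embeddings, scored with Cosine Similarity and Manhattan Distance, and mapped to the good/sufficient/poor categories described in the abstract.

```python
# Minimal sketch (assumed pipeline, not the authors' code): embed a reference
# answer and a student answer with TF-IDF and Bag-of-Words, then compare them
# with Cosine Similarity and Manhattan Distance.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances

# Illustrative texts; a real system would use the exam's reference answers.
reference = "photosynthesis converts light energy into chemical energy in plants"
answer = "plants use light energy to make chemical energy through photosynthesis"

def score(vectorizer, ref, ans):
    # Fit on both texts so they share one vocabulary, then embed each one.
    matrix = vectorizer.fit_transform([ref, ans])
    cos = cosine_similarity(matrix[0], matrix[1])[0, 0]
    man = manhattan_distances(matrix[0], matrix[1])[0, 0]
    return cos, man

def categorize(cos):
    # Hypothetical cut-offs for the good / sufficient / poor levels.
    if cos >= 0.7:
        return "good"
    if cos >= 0.4:
        return "sufficient"
    return "poor"

for name, vec in [("TF-IDF", TfidfVectorizer()), ("BoW", CountVectorizer())]:
    cos, man = score(vec, reference, answer)
    print(f"{name}: cosine={cos:.3f}, manhattan={man:.3f}, level={categorize(cos)}")
```

Over a set of graded answers, the agreement between the category labels produced by the TF-IDF and BoW variants could then be quantified with sklearn.metrics.cohen_kappa_score, mirroring the Cohen's Kappa analysis reported in the abstract.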
License
Copyright (c) 2025 Digital Zone: Jurnal Teknologi Informasi dan Komunikasi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.