Comparative Study of the Effect of Datasets and Machine Learning Algorithms for PDF Malware Detection

  • Salman Wiharja, Universitas Pendidikan Indonesia
  • Deden Pradeka
  • Wirmanto Suteddy
Keywords: Machine Learning, PDF, Malware, Random Forest, Random Committee

Abstract

This study applies machine learning algorithms to the detection of malicious PDFs, focusing on an expansion of the Evasive-PDFMal2022 dataset. The objective is to improve detection accuracy by enriching the dataset and increasing its representativeness and diversity, and to deliver a practical tool, a website, for extracting features from PDFs and detecting malicious ones. The dataset is updated and enlarged with additional malicious PDFs sourced from CVE and Exploit-db, along with non-malicious PDFs from diverse origins. Twenty features are then extracted with the PDFID tool and serve as the input to the K-Nearest Neighbor (KNN), Random Forest, and Random Committee algorithms. The model trained on the expanded dataset achieves 99% accuracy, surpassing models trained on the Evasive-PDFMal2022 dataset alone. The study also contributes a more representative and diverse dataset and a website for feature extraction and malicious PDF detection.
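The pipeline described in the abstract (keyword-count features followed by a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation and does not call the PDFID tool: the `KEYWORDS` list is a hypothetical subset chosen for illustration (the study's exact 20 features are not listed here), and `extract_features` and `knn_predict` are hypothetical helper names. Of the three algorithms named above, only KNN is sketched, using the Python standard library.

```python
import re
from collections import Counter
from math import dist

# Hypothetical subset of PDFiD-style keywords (an assumption for illustration;
# the exact 20 features used in the study are not reproduced here).
KEYWORDS = ["obj", "endobj", "stream", "endstream", "/Page",
            "/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch"]


def extract_features(data: bytes) -> list[int]:
    """Count whole-token occurrences of each keyword in raw PDF bytes."""
    features = []
    for kw in KEYWORDS:
        # Bare keywords must not be preceded by a letter, so "obj" does not
        # match inside "endobj". Names starting with "/" may directly follow
        # another name (e.g. "/Type/Page"), so they get no lookbehind.
        prefix = rb"" if kw.startswith("/") else rb"(?<![A-Za-z])"
        # No keyword may be followed by a letter, so "/Page" skips "/Pages".
        pattern = prefix + re.escape(kw.encode()) + rb"(?![A-Za-z])"
        features.append(len(re.findall(pattern, data)))
    return features


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Synthetic byte strings for illustration only, not real samples.
    sample = b"1 0 obj\n<< /OpenAction 2 0 R /JS (x) >>\nendobj\n"
    print(dict(zip(KEYWORDS, extract_features(sample))))
```

In practice the feature vectors would come from labelled PDF files, and a tested library implementation of KNN, Random Forest, or Random Committee would replace the hand-written vote.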

Published
2024-05-31
How to Cite
Wiharja, S., Pradeka, D., & Suteddy, W. (2024). Comparative Study of the Effect of Datasets and Machine Learning Algorithms for PDF Malware Detection. Digital Zone: Jurnal Teknologi Informasi Dan Komunikasi, 15(1), 80-93. https://doi.org/10.31849/digitalzone.v15i1.19744