Integrated Named Entity Recognition and Identical-Entity Detection for Extracting Unique Information Sources in News Articles
DOI:
https://doi.org/10.31849/digitalzone.v16i2.27687
Keywords:
Named Entity Recognition, NLP, Unique Information Source Extraction
Abstract
Native advertising is often difficult to detect because it resembles regular news articles. One indicator is a lack of diverse information sources, or reliance on a single perspective. Detecting this requires an extraction technique that can consolidate the different surface forms under which the same entity is mentioned. This study integrates a named entity recognition (NER) model based on XLNet+BiLSTM+CRF with an identical-entity classifier that uses Levenshtein distance features together with static and contextual vector representations. The integrated approach achieves an F1-score of 93.71% at the entity level and 92.84% for identical-entity identification, and produces a list of unique citation sources. These findings indicate that the unique-source list can serve as an additional feature for detecting native advertising, which often relies on a single source. With an average unique-entity coverage of 97.40%, the proposed architecture can extract the unique entities in news articles.
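To make the role of the Levenshtein distance feature concrete, the following is a minimal sketch, not the study's classifier. It computes a normalized edit-distance similarity between two mentions and greedily groups mentions that exceed a threshold. The function names, the 0.8 threshold, and the example mentions are all assumptions chosen for illustration; the study itself combines this kind of string feature with static and contextual vector representations in a trained classifier.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; case and outer whitespace are ignored."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def unique_sources(mentions, threshold=0.8):
    """Greedy grouping: attach each mention to the first group whose
    representative (first member) is similar enough, else start a new group.
    The threshold is an illustrative assumption, not a value from the study."""
    groups = []
    for mention in mentions:
        for group in groups:
            if similarity(mention, group[0]) >= threshold:
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups


if __name__ == "__main__":
    # Hypothetical mentions, as an NER step might extract them from one article.
    for group in unique_sources(["Joko Widodo", "Joko Widdo", "Sri Mulyani"]):
        print(group)
```

String distance alone merges spelling variants such as the misspelled mention above, but an abbreviated form like "Jokowi" scores well below the threshold against "Joko Widodo". This limitation is a plausible motivation for pairing edit-distance features with vector representations, as the abstract describes.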
License
Copyright (c) 2025 Digital Zone: Jurnal Teknologi Informasi dan Komunikasi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.






