Algoritme decision tree untuk mendeteksi ujaran kebencian dan bahasa kasar multilabel pada Twitter berbahasa Indonesia

Decision tree algorithm for multi-label hate speech and abusive language detection in Indonesian Twitter

Department of Informatics, UIN Sultan Syarif Kasim Riau. Jl. H.R. Soebrantas km 11.5 Simpang Baru Panam, Pekanbaru, Riau 28293, Indonesia

Received: 7 Sep 2020; Revised: 4 Jun 2021; Accepted: 8 Aug 2021; Published: 31 Oct 2021.
Open Access Copyright (c) 2021 The authors. Published by Department of Computer Engineering, Universitas Diponegoro
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Abstract
Hate speech and abusive language are easy to find in written communication on social media such as Twitter. They often cause disputes between the victims and the authors of the tweets. For those who take sides, however, it is difficult to judge whether a tweet contains hate speech, abusive language, or both. This research aims to develop a method that classifies tweets as abusive, as containing hate speech, or both. If hate speech is detected, the system also measures its level of severity. The dataset contains 13,126 real tweets. Word embeddings are used to turn the text input into features, and a Decision Tree algorithm is used to classify the tweets. Feature engineering and parameter tuning improved the classification of the three classes: hate speech, abusive language, and hate speech level. In the Decision Tree classification, the lexicon feature produced the highest accuracy for the three classes, higher than the special features and the textual features. The average accuracy over the three classes increased from 69.77% to 70.48% for a 90:10 training-testing split, and from 69.35% to 69.54% for an 80:20 split.
Keywords: hate speech; abusive language; decision tree; Twitter; word embeddings
Funding: UIN Sultan Syarif Kasim Riau
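
The pipeline summarized in the abstract (word embeddings as features, a Decision Tree per target, and accuracy averaged over the three classes) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption for illustration only: the two-dimensional embedding table EMB, the placeholder tokens, the toy labels, the Gini split criterion, the use of one tree per target, and the helper names (tweet_vector, build_tree, run) are hypothetical and are not taken from the paper, which does not specify its implementation here.

```python
from collections import Counter

# Hypothetical 2-d "embeddings" for placeholder tokens. Real word embeddings
# are learned from a corpus and have far more dimensions.
EMB = {"insult": [1.0, 0.0], "threat": [0.0, 1.0],
       "hello": [0.0, 0.0], "thanks": [0.1, 0.1]}


def tweet_vector(tokens, emb=EMB, dim=2):
    """Mean-pool the embeddings of known tokens into one feature vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (score, feature, threshold) with the lowest weighted Gini."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    """Grow a tree; a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    split = best_split(X, y)
    if split is None:
        return majority
    _, f, t = split
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth),
            build_tree([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, X, y):
    return sum(predict(tree, row) == gold for row, gold in zip(X, y)) / len(y)


# Toy data (hypothetical): tokens, hate_speech, abusive, hate_level (0 = none).
TRAIN = [(["hello", "thanks"], 0, 0, 0), (["insult", "hello"], 0, 1, 0),
         (["threat", "hello"], 1, 0, 1), (["threat", "threat", "insult"], 1, 1, 2),
         (["hello"], 0, 0, 0), (["insult"], 0, 1, 0), (["threat"], 1, 0, 2)]
TEST = [(["threat", "insult"], 1, 1, 2), (["hello", "hello"], 0, 0, 0)]


def run():
    """Train one tree per target and report per-target and average accuracy."""
    X_train = [tweet_vector(t[0]) for t in TRAIN]
    X_test = [tweet_vector(t[0]) for t in TEST]
    scores = {}
    for idx, name in enumerate(["hate_speech", "abusive", "hate_level"], start=1):
        tree = build_tree(X_train, [t[idx] for t in TRAIN])
        scores[name] = accuracy(tree, X_test, [t[idx] for t in TEST])
    average = sum(scores.values()) / len(scores)
    scores["average"] = average
    return scores


if __name__ == "__main__":
    print(run())
```

In this sketch the reported figure is the unweighted mean of the three per-target accuracies, which is one plausible reading of "average accuracy of the three classes". A practical system would more likely rely on an established library implementation of decision trees and on pretrained embeddings rather than hand-written versions like these.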
