Comparison of distance measurement on k-nearest neighbour in textual data classification

*Wahyono Wahyono  -  Department of Computer Science and Electronics, Universitas Gadjah Mada, Indonesia
I Nyoman Prayana Trisna  -  Master of Computer Science, Universitas Gadjah Mada, Indonesia
Sarah Lintang Sariwening  -  Master of Computer Science, Universitas Gadjah Mada, Indonesia
Muhammad Fajar  -  Master of Computer Science, Universitas Gadjah Mada, Indonesia
Danur Wijayanto  -  Master of Computer Science, Universitas Gadjah Mada, Indonesia
Received: 15 Jun 2019; Revised: 22 Oct 2019; Accepted: 5 Nov 2019; Published: 31 Jan 2020; Available online: 15 Nov 2019.
Open Access Copyright (c) 2020 Jurnal Teknologi dan Sistem Komputer
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Article Info
Section: Original Research Articles
Language: ID
Abstract
One algorithm for classifying textual data in automatic document-organizing applications is k-nearest neighbour (KNN), which converts word representations into vectors. The distance calculation in the KNN algorithm is essential for measuring the closeness between data elements. This study compares four distance measures commonly used in KNN, namely Euclidean, Chebyshev, Manhattan, and Minkowski. The dataset consists of 448 comments on Eminem's YouTube videos. The results show that Euclidean and Minkowski distances in the KNN algorithm achieve the best results compared to Chebyshev and Manhattan. The best KNN results are obtained when the K value is 3.
Keywords: KNN; textual data; distance measurement; Euclidean; Chebyshev; Manhattan; Minkowski
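The four distance measures compared in the abstract, together with a plain majority-vote KNN, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of p = 3 for Minkowski, and the toy two-dimensional vectors (standing in for word-count or TF-IDF vectors) are all assumptions made for the example.

```python
import math

# Distance measures for two equal-length numeric vectors a and b,
# e.g. bag-of-words or TF-IDF representations of two comments.

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # Generalization of Euclidean (p=2) and Manhattan (p=1);
    # p=3 here is an arbitrary illustrative choice.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, labels, query, k=3, dist=euclidean):
    # Rank training vectors by distance to the query and take a
    # majority vote among the labels of the k nearest neighbours.
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy usage: two "ham" and two "spam" points; the query sits near
# the spam cluster, so with k=3 the vote comes out "spam".
train = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["ham", "ham", "spam", "spam"]
print(knn_predict(train, labels, [5, 4], k=3, dist=euclidean))
```

Swapping `dist=euclidean` for `manhattan`, `chebyshev`, or `minkowski` reproduces the kind of comparison the study performs, since only the neighbour ranking changes between runs.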


