Prapemrosesan klasifikasi algoritme kNN menggunakan K-means dan matriks jarak untuk dataset hasil studi mahasiswa

Preprocessing kNN algorithm classification using K-means and distance matrix with students’ academic performance dataset

*Sugriyono Sugriyono - Master of Informatics Department, Sunan Kalijaga Islamic State University, Indonesia
Maria Ulfah Siregar - Master of Informatics Department, Sunan Kalijaga Islamic State University, Indonesia
Received: 24 Aug 2020; Revised: 19 Oct 2020; Accepted: 21 Oct 2020; Published: 31 Oct 2020; Available online: 21 Oct 2020.
Open Access Copyright (c) 2020 Jurnal Teknologi dan Sistem Komputer under http://creativecommons.org/licenses/by-sa/4.0.

Article Info
Section: Original Research Articles
Language: ID
Abstract
Outliers in a dataset can lower the accuracy of a classification process. They can be removed in a preprocessing stage before a classification algorithm is applied, and clustering can serve as the outlier detection method. This study applies K-means and a distance matrix to detect outliers and remove them from datasets with class labels. The dataset consists of students’ academic performance records: 6847 instances with 18 attributes and 3 class labels. The preprocessing uses K-means to obtain a centroid for each class, and the distance matrix is then used to evaluate the distance from each instance to the centroids. Instances found to be outliers, that is, those that fall in a different class, are removed from the dataset. This preprocessing improves the classification accuracy of the kNN algorithm. Without preprocessing, accuracy is 72.28%; with K-means preprocessing using Euclidean distance it is 98.42% (an increase of 26.14 percentage points), and with Manhattan distance it is 97.76% (an increase of 25.48 percentage points).
Keywords: preprocessing; K-means; kNN; distance matrix; Manhattan; Euclidean
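
The preprocessing outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on two assumptions the abstract does not confirm: each class gets a single centroid computed as the class mean (what K-means with one cluster converges to under Euclidean distance), and an instance counts as an outlier when its nearest centroid belongs to a class other than its own label. The function names and the toy data are hypothetical and stand in for the 6847-instance dataset.

```python
from collections import defaultdict
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def class_centroids(X, y):
    """One centroid per class label (assumption: the mean of that class)."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def remove_outliers(X, y, distance=euclidean):
    """Keep instances whose nearest centroid carries their own label.

    Assumption: this is how "outliers of a different class" are identified.
    """
    centroids = class_centroids(X, y)
    labels = sorted(centroids)
    # Distance matrix: one row per instance, one column per class centroid.
    dist = [[distance(row, centroids[c]) for c in labels] for row in X]
    keep = [i for i, d in enumerate(dist) if labels[d.index(min(d))] == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]


def knn_predict(X_train, y_train, query, k=3, distance=euclidean):
    """Majority vote among the k training instances nearest to the query."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda pair: distance(pair[0], query))[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    return max(votes, key=votes.get)


# Hypothetical toy data: the last instance sits in the "high" region
# but is labelled "low", so the filter should drop it.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.0]]
y = ["low", "low", "low", "high", "high", "high", "low"]

X_clean, y_clean = remove_outliers(X, y, distance=euclidean)
X_clean_m, y_clean_m = remove_outliers(X, y, distance=manhattan)
prediction = knn_predict(X_clean, y_clean, [5.0, 4.8], k=3)
```

Passing `distance=manhattan` instead of `distance=euclidean` switches the metric for both the filter and the classifier, which mirrors the two configurations compared in the abstract. A stricter Manhattan variant would likely use the per-attribute median as the centroid; the mean is kept here only to keep the sketch short.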
