Prapemrosesan klasifikasi algoritme kNN menggunakan K-means dan matriks jarak untuk dataset hasil studi mahasiswa

Sugriyono Sugriyono; Maria Ulfah Siregar

doi:10.14710/jtsiskom.2020.13874



Preprocessing kNN algorithm classification using K-means and distance matrix with students’ academic performance dataset


Master of Informatics Department, Sunan Kalijaga Islamic State University, Indonesia

Received: 24 Aug 2020; Revised: 19 Oct 2020; Accepted: 21 Oct 2020; Available online: 21 Oct 2020; Published: 31 Oct 2020.


Abstract

Outliers in a dataset can lower the accuracy of a classification process. They can be removed in a preprocessing stage before a classification algorithm is applied, and clustering can serve as the outlier detection method. This study applies K-means and a distance matrix to detect outliers and remove them from datasets with class labels. The dataset used is a students’ academic performance dataset of 6847 instances with 18 attributes and 3 class labels. The preprocessing applies K-means to obtain the centroid of each class, and the distance matrix is used to evaluate the distance from each instance to the centroids. Instances identified as outliers, that is, those assigned to a class different from their label, are removed from the dataset. This preprocessing improves the classification accuracy of the kNN algorithm: data without preprocessing yields 72.28 % accuracy, data preprocessed using K-means with Euclidean distance yields 98.42 % (an increase of 26.14 percentage points), and data preprocessed using K-means with Manhattan distance yields 97.76 % (an increase of 25.48 percentage points).
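The preprocessing idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one centroid per class (K-means with a single cluster reduces to the class mean), assumes an instance counts as an outlier when its nearest centroid belongs to a class other than its own label, and uses a toy two-dimensional dataset instead of the 6847-instance dataset. All function names (`class_centroids`, `distance_matrix`, `remove_outliers`, `knn_predict`) are hypothetical.

```python
import numpy as np


def class_centroids(X, y):
    """Return (labels, centroids) with one mean vector per class label.

    Assumption: a single centroid per class, i.e. K-means with k=1.
    """
    labels = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def distance_matrix(A, B, metric="euclidean"):
    """Distances from every row of A to every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")


def remove_outliers(X, y, metric="euclidean"):
    """Drop instances whose nearest class centroid disagrees with their label."""
    labels, centroids = class_centroids(X, y)
    d = distance_matrix(X, centroids, metric)
    nearest = labels[d.argmin(axis=1)]
    keep = nearest == y
    return X[keep], y[keep], keep


def knn_predict(X_train, y_train, X_test, k=3, metric="euclidean"):
    """Plain kNN by majority vote among the k closest training instances."""
    d = distance_matrix(X_test, X_train, metric)
    idx = np.argsort(d, axis=1)[:, :k]
    preds = []
    for neighbour_labels in y_train[idx]:
        values, counts = np.unique(neighbour_labels, return_counts=True)
        preds.append(values[counts.argmax()])
    return np.array(preds)


# Toy data: the 4th and 8th instances sit inside the other class's region.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9],
              [10, 10], [10, 9], [9, 10], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_clean, y_clean, keep = remove_outliers(X, y, metric="euclidean")
_, _, keep_manhattan = remove_outliers(X, y, metric="manhattan")
predictions = knn_predict(X_clean, y_clean,
                          np.array([[0.5, 0.5], [9.5, 9.5]]), k=3)
```

On this toy data both metrics flag the same two instances, so the cleaned training set holds three instances per class. On real data the two metrics can disagree, which is consistent with the different accuracies reported for Euclidean and Manhattan distance.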


Keywords: preprocessing; K-means; kNN; distance matrix; Manhattan; Euclidean

Funding: UIN Sunan Kalijaga, Yogyakarta, Indonesia


Article Info

Section: Original Research Articles

Language: Indonesian (ID)

In Volume 8, Issue 4, Year 2020 (October 2020)




