Parameter tuning in KNN for software defect prediction: an empirical analysis

Modinat Abolore Mabayoje  -  Department of Computer Science, University of Ilorin, Nigeria
*Abdullateef Olwagbemiga Balogun  -  Department of Computer Science, University of Ilorin, Nigeria
Hajarah Afor Jibril  -  Department of Computer Science, University of Ilorin, Nigeria
Jelili Olaniyi Atoyebi  -  Department of Computer Science and Engineering, Obafemi Awolowo University, Nigeria
Hammed Adeleye Mojeed  -  Department of Computer Science, University of Ilorin, Nigeria
Victor Elijah Adeyemo  -  Department of Computer Science, University of Ilorin, Nigeria
Received: 27 Jan 2019; Revised: 31 Jul 2019; Accepted: 10 Aug 2019; Published: 31 Oct 2019; Available online: 3 Oct 2019.
Open Access Copyright (c) 2019 Jurnal Teknologi dan Sistem Komputer
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Article Info
Section: Original Research Articles
Language: EN
Abstract
Software Defect Prediction (SDP) provides insights that can help software teams allocate their limited resources when developing software systems. It predicts likely defective modules and helps avoid the pitfalls associated with such modules. However, these insights may be inaccurate and unreliable if the parameters of SDP models are not taken into consideration. In this study, the effect of parameter tuning on the k-nearest neighbor (k-NN) classifier in SDP was investigated. More specifically, the study examined the impact of varying and selecting the optimal k value, the influence of distance weighting, and the impact of distance functions on k-NN. An experiment was designed to investigate this problem over 6 software defect datasets. The experimental results revealed that the k value should be greater than 1 (the default), as the average RMSE of k-NN when k > 1 (0.2727) is lower than when k = 1 (0.3296). In addition, the predictive performance of k-NN with distance weighting improved by 8.82% and 1.7% based on AUC and accuracy, respectively. In terms of the distance function, k-NN models based on the Dilca distance function performed better than those based on the Euclidean distance function (the default). Hence, we conclude that parameter tuning has a positive effect on the predictive performance of k-NN in SDP.
Keywords: software defect prediction; parameter tuning; k-nearest neighbor; distance function; distance weighting
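
The three parameters discussed in the abstract (the k value, distance weighting, and the distance function) can be illustrated with a minimal sketch. This is not the authors' experimental setup: the function names, the toy module metrics, and the inverse-distance weighting scheme are assumptions made for illustration, and Manhattan distance is used only as a stand-in alternative to Euclidean distance because Dilca is not implemented here.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, the usual default for k-NN."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan distance, a stand-in alternative (not Dilca)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_defect_probability(
    train_x: Sequence[Vector],
    train_y: Sequence[int],
    query: Vector,
    k: int = 1,
    weighted: bool = False,
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> float:
    """Estimate the probability that `query` is defective (label 1)."""
    # Rank training modules by distance to the query and keep the k nearest.
    ranked = sorted(
        ((distance(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    neighbors = ranked[:k]
    if weighted:
        # Assumed scheme: inverse-distance weights; epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    else:
        weights = [1.0] * len(neighbors)
    # Weighted share of the neighbors' votes that are "defective".
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


def rmse(probabilities: Sequence[float], labels: Sequence[int]) -> float:
    """Root mean squared error between predicted probabilities and 0/1 labels."""
    squared = [(p - y) ** 2 for p, y in zip(probabilities, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: two metrics per module, label 1 = defective.
train_x = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (8.0, 9.0), (9.0, 8.5), (2.2, 1.4)]
train_y = [0, 0, 0, 1, 1, 1]
query = (2.1, 1.3)

p_k1 = knn_defect_probability(train_x, train_y, query, k=1)
p_k3 = knn_defect_probability(train_x, train_y, query, k=3)
p_k3_weighted = knn_defect_probability(train_x, train_y, query, k=3, weighted=True)
p_k3_manhattan = knn_defect_probability(train_x, train_y, query, k=3, distance=manhattan)
```

In this toy setting, k = 1 follows the single nearest (possibly noisy) module, a larger k averages over more neighbors, and weighting pulls the estimate back toward the closest ones. Whether any setting helps on real defect data is an empirical question of the kind the study addresses.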


