Parameter tuning in KNN for software defect prediction: an empirical analysis

Software Defect Prediction (SDP) provides insights that can help software teams allocate their limited resources when developing software systems. It predicts the modules that are likely to be defective and helps avoid the pitfalls associated with such modules. However, these insights may be inaccurate and unreliable if the parameters of SDP models are not taken into consideration. In this study, the effect of parameter tuning on the k-nearest neighbor (k-NN) algorithm in SDP was investigated; specifically, the impact of varying and selecting the optimal k value, the influence of distance weighting, and the impact of distance functions on k-NN. An experiment was designed to investigate this problem over six software defect datasets. The experimental results revealed that the k value should be greater than 1 (the default), as the average RMSE value of k-NN when k > 1 (0.2727) is lower than when k = 1 (0.3296). In addition, the predictive performance of k-NN with distance weighting improved by 8.82% and 1.7% based on AUC and accuracy, respectively. In terms of the distance function, k-NN models based on the Dilca distance function performed better than those based on the Euclidean distance function (the default). Hence, we conclude that parameter tuning has a positive effect on the predictive performance of k-NN in SDP.


I. INTRODUCTION
Software Defect Prediction (SDP) entails the identification or prediction of defect-prone software modules, which in turn helps software engineers prioritize the usage of limited resources during the testing and maintenance phases of the SDLC [1], [2]. This, in turn, helps to assure software quality and reliability [3], [4]. Information such as source code complexity, software metrics, and software development history serves as the features used by SDP models to predict defective software modules [5]-[7]. Engineered software metrics such as the McCabe, Halstead, and procedural metrics are used to determine the quality and reliability level of a software system [5], [8]. Each software module or component is characterized by a set of metrics and a class label. The class label indicates the state of a module, whether it is defective or non-defective, and the derived metric values are used to build SDP models [9]-[11]. SDP utilizes historical data mined from software repositories to determine the quality and reliability of software modules for software quality assurance [12], [13].
Machine learning methods are the most common and widely used methods for SDP [14]. Data-driven SDP generally relies on machine learning techniques, most of which have several parameters that can be adjusted to optimize the algorithm [15], [16]. Most machine learning algorithms ship with a default set of parameters chosen to reflect the best settings for general performance [17]. However, these default settings may not give the best results in all cases, and the optimal parameter settings are not known in advance [15]. The practice of choosing parameters that lead to increased performance within a particular domain, or when applied to a particular type of data, is known as parameter tuning.
Jiang et al. [18] and Tosun and Bener [19] in their respective works reported that Random Forest and Naïve Bayes give sub-optimal performance with default parameter settings. Koru and Liu [20] and Mende [21] also showed that tuning the parameter settings of SDP models affects their performance. In addition, Hall et al. [22] showed that the use of default parameters in unstable classification techniques leads to underperformance. These observations make it imperative to investigate the impact of parameter tuning in SDP.
The effects of parameter tuning on the performance of classifiers in SDP are not well understood, as many studies appear to make an implicit assumption about parameter settings by using default values [15], [23]-[25]. Findings from this study will help researchers decide on and set appropriate parameters for the predictive models used in their research, giving better and more consistent predictive results irrespective of the tool used for the analysis.
The attention of researchers has been drawn to the parameter settings of prediction models in SDP. For example, Koru and Liu [20] and Mende [21] posited that using parameter settings other than the defaults has a positive effect on the performance of SDP prediction models. Tosun and Bener [19] also noted that the default parameter values of machine learning tools such as R, Weka, Scikit-learn, and Matlab are suboptimal. It has also been reported that SDP models may under-perform when sub-optimal parameters are used.
However, determining the optimal and suboptimal parameter settings is a challenge, as most SDP models have many parameters [26]-[28]. This leads many empirical SDP studies to settle for default parameter settings. For example, Mende [21] implemented random forest using the R package with the default number of decision trees as its parameter setting. Jiang et al. [18] and Bibi et al. [29] also used the default value of the k-nearest neighbors classification technique (k = 1). In addition, the implementations of classification techniques provided by different research toolkits often use different default settings. Such differences in default parameter settings across machine learning tools may affect SDP research results [30].
Recent studies have looked into the knowledge-transfer mechanism of applying parameter settings of prediction models that perform well on one dataset to another dataset. For example, Tan et al. [31] explored different parameter settings for the Alternating Decision Tree (ADTree), with the goal of identifying the optimal parameter setting and applying it to other datasets. Jiang et al. [18] did the same for Multivariate Adaptive Regression Splines (MARS) by trying various parameter settings on one dataset. Despite these efforts, the applicability of transferring parameter settings across datasets remains unclear, as other factors, such as data quality problems, can interfere. Nevertheless, being able to determine and adapt optimal parameter settings of prediction models across datasets without a loss in predictive performance would be a useful alternative to automated parameter optimization.
Therefore, this study aims to investigate parameter tuning of the Instance-Based Learning (IBk) algorithm, more specifically k-Nearest Neighbor (k-NN), as it has been widely used in SDP [15], [23], [24], [26]. The parameter tuning is based on determining the optimal number of neighbors, the best distance function, and the applicability of distance weighting. Disjoint k-NN models were developed using default and optimal k values, different distance weighting methods, and different distance functions. The respective models were applied to six software defect datasets from the NASA repository, and their predictive performances were measured and comparatively analyzed. The experimental results showed that parameter tuning with respect to the k value, distance function, and distance weighting options in k-NN has a positive effect on its predictive performance.

II. RESEARCH METHODS
This study is aimed at investigating and evaluating the impact of parameter tuning of k-NN for SDP.

A. Experimental framework
As depicted in Figure 1, the experimental framework of this study uses datasets that were divided into training and test sets based on 10-fold cross-validation. In 10-fold cross-validation, a given dataset is divided into 10 subsets; 9 subsets are used for training and the remaining subset is used for testing the developed model, and the process is repeated ten times until every subset has served as the test set, after which the results are averaged. In the data pre-processing phase, relevant and useful features were selected from the features of each dataset using the Correlation-based Feature Selection (CFS) technique with a greedy stepwise search method.
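For illustration, the sketch below shows how such a 10-fold cross-validation of a k-NN model could be scripted in Python with scikit-learn. The file name, label encoding, and use of scikit-learn are assumptions made for the example (the CFS step is omitted), not the exact tooling of this study.

import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset file; features are module metrics and "defective" is the
# class label, assumed encoded as 1 = defective, 0 = non-defective.
data = pd.read_csv("kc1.csv")
X = data.drop(columns=["defective"])
y = data["defective"]

# k-NN with the default settings discussed in this study:
# k = 1, Euclidean distance, no distance weighting.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Mean AUC over 10 folds:", auc_scores.mean())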
The k-NN algorithm was implemented, and a search for the optimal k value for each dataset was carried out; models were then built with both the optimal k value and the default parameter values of k-NN. Disjoint experiments were carried out to reveal the effect of implementing different distance weighting methods. With the optimal k value, further experiments were carried out by implementing different distance functions of k-NN, and the impact of the distance functions was thereby evaluated. The performances of all developed models, using default and optimal k values, different distance weighting methods, and different distance functions, were then comparatively analyzed.
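To make the parameter search concrete, the sketch below explores the three tuning dimensions (number of neighbors, distance weighting, and distance function) jointly with a grid search. It assumes scikit-learn and reuses X and y from the previous sketch; the Dilca distance is not available there, so only Euclidean and Chebyshev are shown.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Parameter grid covering the three tuning dimensions studied here
# (feature scaling and the Dilca distance are omitted for brevity).
param_grid = {
    "n_neighbors": list(range(1, 16)),      # k values from 1 (default) to 15
    "weights": ["uniform", "distance"],     # no weighting vs. inverse-distance weighting
    "metric": ["euclidean", "chebyshev"],   # default vs. alternative distance function
}
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X, y)   # X, y as loaded in the earlier cross-validation sketch
print("Best setting:", search.best_params_, "AUC:", round(search.best_score_, 3))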

B. Datasets
The datasets used in this study are six public-domain software defect datasets provided by the National Aeronautics and Space Administration (NASA) repository: KC1, KC2, KC3, MW1, PC2, and PC4. A brief description of these datasets is provided in Table 1.

C. K nearest neighbor (KNN)
Instance-Based Learning (IBL), or k-Nearest Neighbor, classifies instances based on their similarity. It is a type of lazy learning method in which the function is only approximated locally and all computation is deferred until classification [32]. An object is classified by a majority vote of its neighbors, where k is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known. Whenever a new point needs to be classified, its k nearest neighbors from the training data are used to determine its class [33]. Algorithm 1 presents the algorithm for k-NN.

Algorithm 1: k-NN classification of a new instance X against training instances D1, ..., DN
  if Sim(X, Dj) == 1.0 then X is normal and exit;
  Order Sim(X, Di) from lowest to highest, (i = 1, ..., N);
  Find the K biggest scores of Sim(X, D);
  Select the K nearest instances to X: D_KX;
  Assign to X the most frequent class in D_KX;
  Calculate Sim_Avg for the k nearest neighbours;
  if Sim_Avg > threshold then X is normal;
  else X is abnormal;
  Return X;
  END
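As a complement to the pseudocode, a minimal, self-contained sketch of the majority-vote step is shown below. It assumes plain Python, numeric features, and Euclidean distance; it is illustrative only and not the implementation used in the study.

import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Distance from the query instance x to every training instance.
    dists = [math.dist(x, row) for row in train_X]
    # Indices of the k nearest training instances.
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Majority vote among the labels of the k nearest neighbours.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Example with three training modules labelled defective (1) or non-defective (0).
train_X = [[1.0, 0.2], [0.1, 0.9], [0.9, 0.3]]
train_y = [1, 0, 1]
print(knn_predict(train_X, train_y, x=[0.8, 0.25], k=3))   # prints 1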

D. Performance metrics
The metrics used in this study to evaluate the performance of a classifier model are accuracy, precision, recall, F-measure, and the Area Under the ROC Curve (AUC).
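These metrics follow the standard confusion-matrix definitions, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

  Accuracy  = (TP + TN) / (TP + TN + FP + FN)
  Precision = TP / (TP + FP)
  Recall    = TP / (TP + FN)
  F-measure = (2 x Precision x Recall) / (Precision + Recall)

The AUC is the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate over all classification thresholds.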

III. RESULTS AND DISCUSSIONS
Table 2 presents the optimal k values for k-NN obtained using the elbow method. Previous studies have established that the optimal k value for instance-based learners depends on the nature of the dataset [36], [37]. The elbow method was adopted based on the Root Mean Squared Error (RMSE) for each k value (1 to 15) [38]. As the value of k increases, the error rate goes down, then stabilizes, and then rises again; the optimal k value lies at the beginning of the stable zone.
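A minimal sketch of such an elbow search is given below. It assumes scikit-learn, cross-validated class probabilities, labels encoded as 0/1, and the X and y objects from the earlier sketches; the exact RMSE computation used in the study may differ.

import numpy as np
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for k in range(1, 16):
    # Out-of-fold probability of the defective class for every instance,
    # with y encoded as 1 = defective, 0 = non-defective.
    proba = cross_val_predict(KNeighborsClassifier(n_neighbors=k),
                              X, y, cv=cv, method="predict_proba")[:, 1]
    rmse = np.sqrt(np.mean((y - proba) ** 2))
    print(f"k = {k:2d}  RMSE = {rmse:.4f}")
# The optimal k is read off where the error curve flattens (the "elbow").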
Table 2 clearly shows that the default value of k (usually k = 1) is not the appropriate or best value in the context of SDP. This finding further strengthens the aim of this study of tuning parameters appropriately in the classification task. The respective optimal k values for the selected SDP datasets are k = 3 for KC1, KC3, and PC4, and k = 4 for KC2, MW1, and PC2.
To reveal the impact of tuning the distance weighting (DW) parameter, experiments were carried out with three settings: no distance weighting (No DW, the default), weighting by the inverse of the distance (1/DW), and weighting by one minus the distance (1−DW). These settings were evaluated with both the default k value and the optimal k value of the k-NN algorithm. Tables 3 and 4 depict the results of these experiments as measured using the performance metrics.
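As an illustration of what these two weighting schemes mean operationally, the sketch below expresses them as weight functions. It assumes scikit-learn's KNeighborsClassifier, which accepts a callable for its weights parameter; the study's own experiments may rely on a toolkit's built-in weighting options instead.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_distance(dists):
    # 1/DW: closer neighbours receive proportionally larger votes.
    return 1.0 / (dists + 1e-12)       # small constant guards against division by zero

def one_minus_distance(dists):
    # 1-DW: assumes distances have been normalised to the [0, 1] range.
    return np.clip(1.0 - dists, 0.0, None)

knn_no_dw  = KNeighborsClassifier(n_neighbors=3)                              # No DW (default)
knn_inv_dw = KNeighborsClassifier(n_neighbors=3, weights=inverse_distance)    # 1/DW
knn_one_dw = KNeighborsClassifier(n_neighbors=3, weights=one_minus_distance)  # 1-DW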
Table 3 presents the average performance evaluation of k-NN with the default k value (k = 1) based on different metrics (accuracy, precision, recall, F-measure, and Area Under the Curve (AUC)). The essence of this analysis is to further evaluate the effect of tuning k-NN's distance weighting function. With or without tuning the distance weighting parameter, the same results were observed in terms of average accuracy, average precision, average recall, and average F-measure. In contrast, the AUC values varied: with k = 1, the 1/DW and 1−DW settings gave better AUC values of 0.715 and 0.718, respectively, against 0.618 for No DW.
Further analyses were carried out to investigate the effect of distance weighting under the respective optimal k values. As shown in Table 4, with the optimal k values from Table 2, distance weighting (1/DW and 1−DW) had a positive effect, yielding average accuracies of 86.06% and 86.65% against 85.13% without weighting. There was also a significant increase in the average recall, average F-measure, and average AUC values, as presented in Table 4.
Table 5 shows the results of the average performance evaluation of the k-NN distance function options. By default, the Euclidean distance is used, but the analysis revealed that other distance functions can do better. The Dilca distance had the best average accuracy (85.16%), the best average precision (0.823), average recall (0.852), average F-measure (0.832), and the best AUC value (0.741) when compared with the default Euclidean distance function. The Chebyshev distance function also outperformed the Euclidean distance, with better accuracy, recall, and AUC values. However, Euclidean was slightly better than Chebyshev in precision and F-measure.
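For reference, the short sketch below contrasts the Euclidean and Chebyshev distance functions on two hypothetical module metric vectors using SciPy; the Dilca distance is not shown, as it is not a simple closed-form function of the feature vectors.

from scipy.spatial.distance import chebyshev, euclidean

# Hypothetical metric vectors for two software modules (illustrative values only).
module_a = [10.0, 2.0, 0.4]
module_b = [12.0, 5.0, 0.1]

print("Euclidean:", euclidean(module_a, module_b))   # square root of summed squared differences
print("Chebyshev:", chebyshev(module_a, module_b))   # largest absolute per-feature difference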
The results show that parameter tuning in k-NN had a positive effect on the performance of the classifier. The k value should not be left at the default (k = 1) in classification tasks, as higher k values performed better than the default. However, our findings indicate that there is no universal k value, as the optimal k varies from dataset to dataset. Distance weighting should also be applied, as the k-NN classifier improved when the distance weighting options were considered, and other distance functions such as the Dilca and Chebyshev distances also gave good results when compared with the default Euclidean distance.
In conclusion, parameter tuning of k-NN in SDP is highly encouraged, as the predictive performance of the tuned k-NN was better than that obtained with default parameter values. The findings of this study on tuning classifier parameters are consistent with those of Tantithamthavorn et al. [15] and Song et al. [39], who also found that parameter tuning has a positive impact on classifiers in SDP.

IV. CONCLUSIONS
The experimental results revealed that parameter tuning had a positive effect on the performance of k-NN in SDP. The value of k should be greater than 1 (the default), the distance weighting option should be used, and other distance functions should also be explored, as they gave better predictive performance than k-NN with default parameters.
Even when SDP models are trained on a clean defect dataset, they may produce inaccurate results if their parameters are not tuned accordingly. To this end, parameter tuning of SDP models is advised. It is also recommended that future work investigate parameter tuning of other classification techniques, which will enable researchers and software engineers to get the best out of classification techniques in SDP. Parameter tuning in the presence of data quality issues such as outliers and class imbalance can also be considered.