Identification of the distribution village maturation: Village classification using Density-based spatial clustering of applications with noise

Measuring rural development is difficult because each village has its own needs and conditions. This study classifies village performance from social, economic, and ecological indices. Data on 1591 villages from the Community and Village Empowerment Office of Riau Province, Indonesia, are grouped into five village maturation classes: very under-developed village, under-developed village, developing village, developed village, and independent village. Density-based spatial clustering of applications with noise (DBSCAN) is used to mine 13 of the villages' attributes. Python programming is applied to analyze and evaluate the DBSCAN results. The grouping achieves a silhouette coefficient of 0.8231, indicating good clustering performance. The epsilon and minimum points values are evaluated in a simulation with several percentage splits. The grouping can serve as a guideline for governments in distributing rural development subsidies more optimally.


I. INTRODUCTION
A village is a legal unit with its own territory that is authorized to regulate governance, community interests, and legal rights that are recognized and respected by the government system of the Negara Kesatuan Republik Indonesia (NKRI) [1]. As an archipelagic country, Indonesia faces complex interaction with, and access to, its physical environment. This geospatial condition leaves villages isolated and identified with poverty and underdevelopment. Moreover, the limited livelihoods of village communities push them further away from prosperity.
To support village development, the government has issued various regulations and laws, including Statute No. 6, 2014 concerning the status of villages and Government Regulation No. 22, 2015 on village fund management arrangements. Moreover, the Rencana Pembangunan Jangka Menengah Nasional (RPJMN) for 2015-2019 strengthened efforts to achieve the development goals for villages and rural areas by introducing the villages index mechanism. The program seeks to cut the number of under-developed villages to 5,000 and to raise the number of autonomous villages to at least 2,000 by 2019. The villages index distinguishes five categories: very under-developed village, under-developed village, developing village, developed village, and independent village. The classes are determined by progress and independence status, with scoring thresholds ranging from 0.4907 to 0.8155. The index maps the conditions and characteristics of each village, so the government can use it to design more efficient and targeted village development plans.
Meanwhile, Badan Perencanaan Pembangunan Nasional (National Development Planning Agency Republic of Indonesia) or Bappenas, the government agency that regulates national development planning, issued another village index. It classifies villages into three categories, viz. under-developed village, developing village, and independent village. However, several differences were found between the two government measurements, especially in the performance of the indicators and the percentage calculation of the village mapping [2]. Moreover, both indexes are computed using only simple arithmetic on nominal dimension scales [3]. This condition calls for a measurement analysis that maps the village grouping more optimally, accurately, and comprehensively without ignoring the significant standard factors proposed by the government appraisal indicators.
Riau province has a strategic geographical position, flanked by three neighboring countries: Malaysia, Singapore, and Thailand. This position has a positive impact, especially on the development of the industrial market economy in Riau, and thus accelerates the growth and progress of villages in the province. Data on 1591 villages in Riau province place 736 villages in the developing class, 69 in the developed class, four in the independent class, 121 in the very under-developed class, and 661 in the under-developed class. These data indicate that village development in Riau province is far from evenly distributed and that many villages remain in the lagging classes.
Clustering is among the most important data mining methods for grouping unlabeled data by similarity [4]. Clustering provides a valid analytical approach for solving complex problems by finding interesting data patterns that support the knowledge discovery process [5]. It also covers the limitations of statistical analysis, especially for large data sets. Studies in various domains have shown the effectiveness of this approach in medical imaging and image segmentation [6], [7], digital marketing analysis and performance metrics [8], [9], education and performance prediction [10], chemical process analysis [11], [12], and manufacturing process analytics [13], [14]. In short, previous studies reflect the potential of data mining tools and clustering techniques to increase the visibility and responsiveness of distributed knowledge discovery.
Previous studies reviewed three commonly used families of data clustering techniques: partition-based [15], hierarchy-based [16], and density-based [17]. Density-based clustering algorithms are widely used in several areas [18] because they handle arbitrary-shaped clusters and noisy data. Density-based clustering distinguishes the clusters in a dataset by relying on the idea that clusters are dense, contiguous areas of the data space, separated from other clusters by regions of relatively lower data density [19]. Data points in sparse regions with a lower object density are typically classified as noise or outliers [19], [20].
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering method that forms clusters from regions of high density separated by regions of low density [21]. DBSCAN counts the number of data points within a radius Eps (ε) and requires a minimum number of neighboring data points (minpts) for points to be grouped into a cluster. DBSCAN is regarded as the most robust and most cited density-based clustering algorithm, able to recognize clusters of arbitrary shape and size in large, noisy databases [22]. Because DBSCAN handles noise points correctly and effectively, it can identify a group surrounded by noise and separate it into different categories [23], [24]. However, the current DBSCAN algorithm still has shortcomings, such as the inability to locate multi-density clusters [25], [26], difficulty in specifying appropriate density thresholds [27], the lack of a parallel computational design, the time spent finding the nearest neighbors during cluster expansion [28], and the inability to cluster incrementally [29].
Several modifications of DBSCAN have been proposed to overcome the limitations of the original algorithm and to deal effectively with ambient queries. Andrade et al. [30] used a graphics processing unit to parallelize G-DBSCAN. Yinghua et al. [28] established a DBSCAN-Influence Space for complex data sets. Other studies optimized and accelerated DBSCAN with R, or proposed a novel hybridization of DBSCAN with a fuzzy earthworm optimization algorithm for data cube clustering [28], [30], [31].
Having reviewed the advantages of DBSCAN and its opportunities for advancement in analytical data mining, this study employs the DBSCAN method to cluster village development status and thereby provide a more comprehensive and accurate analytical solution for measuring the development villages index. Python programming is applied to carry out the calculation and clustering. The mapping and identification of village characteristics thus become more precise and optimal.
The remainder of this paper is structured as follows. Section 2 outlines the procedures used in this paper, such as data mining and the DBSCAN formula. Section 3 presents the outcome and assessment of applying DBSCAN to the clustering of villages. Section 4 concludes with the final remarks and contribution of this paper.

II. RESEARCH METHODS
This research was conducted through several activities, including problem identification through a literature review of the topic, observation, and interviews at the Community and Village Empowerment Office of Riau Province. Five stakeholders from the agency were asked about their functions and work operations, activities, strategic planning, and the supporting regulations for developing and empowering rural communities in Riau province.
Supporting documents were studied, especially the village mapping data based on the value of the development villages index. As primary data, 1591 villages from the year 2018 with 13 attributes were analyzed, focusing on three main attributes: the social resilience index (IKS), the environmental resilience index (IKL), and the economic resilience index (IKE). These three attributes were chosen by referring to the development villages index set out in Government Regulation No. 2, 2016 concerning the dimensions of the development village index [32].
The IKS is measured across the dimensions of health (services, health, and community empowerment for health), education (access to middle and high school, access to non-formal education, and access to knowledge), and social capital (solidarity, tolerance, and citizens' sense of protection). Each dimension is broken down into several key performance indicators (KPIs).
The IKL is defined by the ecological dimension, which includes environmental quality and disaster response. The availability of water, soil and air pollution, and the number of rivers affected by waste serve as the KPIs for environmental quality. Natural disasters such as floods, landslides, and forest fires, together with the handling of such disasters, serve as the KPIs for disaster response.
The IKE covers the economic dimension (production diversity, availability of service centers, trading, access to financial credit institutions, and economic institutions), the social welfare dimension (availability of special schools, the number of people with social welfare, and the number of suicides), and the settlement dimension (access to clean water, sanitation, electricity, communication, and regional openness), each measured by its KPIs.
The government regulation also groups villages into five village statuses, with the scale distribution defined in Table 1. The total value of the development villages index (IDM) is calculated with the simple formula in (1) by adding up the values of IKS, IKL, and IKE.
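As an illustration of how an index score maps to a village status, the following minimal sketch combines the three indices and looks up the class. Several things in it are assumptions rather than content of this paper: the sum is divided by three so that the score stays on the 0 to 1 scale implied by the 0.4907 to 0.8155 scoring range, the two inner cut-offs are simply spaced evenly between those two end points, and the function names and the sample village are hypothetical. Table 1 and (1) remain the authoritative definitions.

```python
from bisect import bisect_left

# Class labels from the text, ordered from lowest to highest maturation.
STATUSES = [
    "very under-developed village",
    "under-developed village",
    "developing village",
    "developed village",
    "independent village",
]

# ASSUMPTION: only the end points 0.4907 and 0.8155 come from the text.
# The two inner cut-offs are spaced evenly between them for illustration.
LOW, HIGH = 0.4907, 0.8155
STEP = (HIGH - LOW) / 3
CUTOFFS = [LOW, LOW + STEP, LOW + 2 * STEP, HIGH]


def idm(iks, ike, ikl):
    """Combine the three resilience indices into one score.

    ASSUMPTION: the sum is divided by three so the result stays on the
    same 0-1 scale as the individual indices.
    """
    return (iks + ike + ikl) / 3


def status(score, cutoffs=CUTOFFS):
    """Map a score to a status; a score equal to a cut-off stays in the lower class."""
    return STATUSES[bisect_left(cutoffs, score)]


# Hypothetical village, not taken from the data set.
score = idm(iks=0.70, ike=0.50, ikl=0.60)
print(round(score, 4), status(score))
```

Passing a different `cutoffs` list to `status` swaps in the official thresholds without touching the rest of the sketch.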
Applying knowledge discovery in databases (KDD) in this study offers a new contribution to IDM measurement and classification. The study follows the KDD stages of data selection, preprocessing/cleaning, transformation, data mining, and interpretation [33], [34]. KDD is a well-established method for finding new, relevant, potentially useful, and meaningful patterns in large quantities of data [35]. Data mining is the essential step of the KDD process for extracting useful knowledge from the dataset. For mining, this study applied the DBSCAN method to cluster the village data. The DBSCAN tracking algorithm is stated below [36].
1. Select a point p at random as the starting object.
2. Calculate the Euclidean distance with (2), where x and y are objects and n is the number of attributes of each object. This distance serves as the similarity measure between objects in the cluster analysis.
3. Determine the values of Eps and MinPts by considering the definitions of noise in (3), directly density-reachable in (4), and density-connected in (5). Here x denotes a data point, Ci the i-th cluster, NEps(y) the set of points around y within the radius Eps, MinPts the minimum number of points in a cluster, NEps(x) the set of points around x within the radius Eps, D the data set, dist(x, y) the Euclidean distance, and Eps the radius parameter. The algorithm proceeds as in Algorithm 1.
        ...
        let N be the set of objects in the Eps-neighbourhood of p;
        for each point p0 in N
            if p0 is unvisited
                mark p0 as visited;
                if the Eps-neighbourhood of p0 has at least MinPts points
                    add those points to N;
            if p0 is not yet a member of any cluster
                add p0 to C;
        end for
        output C;
    else mark as noise;
until no object is unvisited;
To test the accuracy of the village grouping, the silhouette validity index was applied with percentage splits of training and testing data at 90:10, 80:20, and 70:30 [37]. The silhouette metric simultaneously tests cluster separation and cohesion [38]. Visual outputs generally use the silhouette approach to reveal the cluster's intensity and consistency [39]. The silhouette coefficient compares the average distance between data points in the same cluster with the distance to points in other clusters [40]. The silhouette validity index is computed with (6)-(8) [41], [38].
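The silhouette computation can be sketched in plain Python. This is a minimal sketch under assumptions: it follows the standard silhouette definition, which (6)-(8) are taken to express, it uses Euclidean distance, it needs at least two clusters, and it scores a point that is alone in its cluster as 0, which is one common convention and may differ from the one used in this study. The function names and sample points are hypothetical.

```python
from math import dist


def silhouette_values(points, labels):
    """Return the silhouette value s(i) of every point (needs >= 2 clusters).

    a(i): mean distance to the other members of the point's own cluster.
    b(i): smallest mean distance to the members of any other cluster.
    s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # ASSUMPTION: a point alone in its cluster scores 0.
            values.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_index(points, labels):
    """Mean silhouette value over all points."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical (IKS, IKE, IKL) triples forming two well-separated groups.
points = [(0.50, 0.40, 0.60), (0.52, 0.41, 0.61),
          (0.90, 0.85, 0.88), (0.91, 0.86, 0.90)]
labels = [0, 0, 1, 1]
print(round(silhouette_index(points, labels), 4))
```

Values near 1 mean that points sit much closer to their own cluster than to any other, while negative values suggest points assigned to the wrong cluster.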
Python programming is used for all calculation and clustering. Python offers open-source packages for unsupervised learning on graph data, providing community identification, node integration, and whole-graph incorporation techniques for data mining in particular [42]. The language also covers refinement, aggregation, interpolation, eigenvalue problems, algebraic equations, differential equations, and many other problems. In addition, Python has proved favorable over the long term, which has resulted in a whole ecosystem of libraries, related programs, and community activity built around it [43].
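The DBSCAN procedure of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch, not the implementation used in this study: it visits points in index order instead of picking them at random, it counts a point as a member of its own Eps-neighbourhood, and the names `dbscan`, `region_query`, and `NOISE`, the sample points, and the parameter values in the example are all hypothetical.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)      # None: not yet a member of any cluster
    visited = [False] * len(points)
    cluster = -1
    for p in range(len(points)):       # ASSUMPTION: index order, not random
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # p is a core point: start a new cluster
        labels[p] = cluster
        queue = list(neighbours)
        while queue:
            q = queue.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(points, q, eps)
                if len(q_neighbours) >= min_pts:
                    queue.extend(q_neighbours)   # q is a core point as well
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster
    return labels


# Hypothetical (IKS, IKE, IKL) triples: two dense groups and one isolated point.
points = [
    (0.10, 0.10, 0.10), (0.12, 0.10, 0.11), (0.11, 0.12, 0.10),
    (0.80, 0.80, 0.80), (0.82, 0.81, 0.80), (0.81, 0.80, 0.82),
    (0.45, 0.10, 0.90),
]
print(dbscan(points, eps=0.05, min_pts=3))
```

Each neighbourhood query scans every point, so the sketch runs in quadratic time; that is acceptable for a few thousand records, but larger data sets usually rely on a spatial index.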

A. The execution of KDD
As described in Section 2, the KDD method was applied to 1591 villages with 13 different attributes, including Provincial Code, Provincial Name, District Code, Regional Name, Sub District Name, Village Code, Village Name, IKS, IKE, IKL, IDM, and status. The data selection stage focuses on the three main attributes used as references for the village grouping, namely IKS, IKE, and IKL. The data selection process is illustrated in Figure 1.
Furthermore, the preprocessing stage checks the data for missing values and duplicate records. Several Excel functions were executed and found no such errors (Figure 2a).
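The same two checks can be expressed in Python. The sketch below is only an illustrative equivalent of the spreadsheet checks described above, not the procedure actually used; the helper names, the field names, and the sample records are hypothetical.

```python
def find_missing(rows, fields):
    """Return (row index, field) pairs whose value is absent or blank."""
    return [
        (i, f)
        for i, row in enumerate(rows)
        for f in fields
        if row.get(f) is None or str(row.get(f)).strip() == ""
    ]


def find_duplicates(rows, key):
    """Return indices of rows whose key value already occurred earlier."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        value = row.get(key)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


# Hypothetical records; the field names are illustrative only.
rows = [
    {"village_code": "A01", "IKS": 0.61, "IKE": 0.48, "IKL": 0.66},
    {"village_code": "A02", "IKS": 0.70, "IKE": None, "IKL": 0.60},
    {"village_code": "A01", "IKS": 0.61, "IKE": 0.48, "IKL": 0.66},
]
print(find_missing(rows, ["IKS", "IKE", "IKL"]))
print(find_duplicates(rows, "village_code"))
```

Two empty result lists would correspond to the outcome reported above, where no missing or duplicate data were found.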

B. DBSCAN analysis
The DBSCAN tracking algorithm executes Equations (2) to (5), and the silhouette indexes are measured with Equations (6) to (8). With p set to the first data record, the Euclidean distances obtained are shown in Table 2.
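The distance step can be illustrated with a short sketch that measures every record against the first one and then keeps the records that fall within Eps. The sample triples below are hypothetical and are not rows of Table 2; the helper names are likewise illustrative.

```python
from math import dist


def distances_from_first(points):
    """Euclidean distance from the first record to every record, itself included."""
    p = points[0]
    return [dist(p, q) for q in points]


def eps_neighbourhood(points, eps):
    """Indices of the records within eps of the first record."""
    return [i for i, d in enumerate(distances_from_first(points)) if d <= eps]


# Hypothetical (IKS, IKE, IKL) triples.
points = [(0.60, 0.50, 0.70), (0.63, 0.54, 0.70), (0.90, 0.90, 0.30)]
for i, d in enumerate(distances_from_first(points)):
    print(i, round(d, 4))
print(eps_neighbourhood(points, eps=0.100))
```

Because the three indices share one scale, no rescaling is applied before the distances are taken; attributes on different scales would normally be normalized first.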
The Python toolkit library was then run with an epsilon value of 0.100 and a minpts value of 0.1, giving a silhouette index of 0.82308. Because this value is close to 1, it indicates a well-separated grouping. The data were clustered into five groups. Cluster 0 is the group of very under-developed villages with 1577 villages (silhouette coefficient = 0.47768), cluster 1 the under-developed villages with six villages (silhouette coefficient = 0.81104), cluster 2 the developing villages with only one village (silhouette coefficient = 1.0), cluster 3 the developed villages with six villages (silhouette coefficient = 0.82669), and cluster 4 the independent villages with one village (silhouette coefficient = 1.0). The clustering pattern of the five village groups is shown in Figure 3. This study reveals that DBSCAN and the government IDM calculation index differ significantly in clustering village status. The two classifications are compared in Figure 4. Running (1) for the government IDM estimation index does not provide a reasonable computational basis for grouping the villages or for validating the grouping.
Furthermore, the DBSCAN grouping offers a more practical design that fits the current village conditions at the Community and Village Empowerment Agency of Riau Province, Indonesia.

C. DBSCAN evaluation
Subsequently, the silhouette index calculated with Equations (6) to (8) is compared across the percentage splits, as disclosed in Table 3. Table 3 shows the highest silhouette index values obtained with randomized testing of epsilon and minpts values. The 90:10 percentage split yields the highest silhouette index for grouping the villages into five clusters (0.84351). The resulting pattern is shown in Figure 5. The village groupings for each percentage split are compared in Table 4.
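A percentage split of the kind compared above can be sketched as a seeded shuffle followed by a cut. How the study drew its splits is not stated, so the shuffling, the fixed seed, the rounding rule, and the function name below are assumptions made only for illustration.

```python
import random


def percentage_split(records, train_percent, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


records = list(range(1591))   # stand-ins for the 1591 village records
for train_percent in (90, 80, 70):
    train, test = percentage_split(records, train_percent)
    print(f"{train_percent}:{100 - train_percent}", len(train), len(test))
```

Each part can then be clustered and scored with the silhouette index for every epsilon and minpts combination under test.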
Across the series of percentage split tests with varying minpts and epsilon values, the DBSCAN algorithm delivered silhouette index scores approaching 1. These results indicate that the choice of epsilon and minpts directly affects the number of clusters produced [44]. The amount of noise or outliers can decrease or increase with the epsilon value [45]. DBSCAN thus offers an advantage in finding clusters of arbitrary shape efficiently, especially in large databases, while requiring minimal domain knowledge to set its input parameters [46].

IV. CONCLUSION
This study shows that the DBSCAN algorithm provides a novel calculation for clustering the development villages index while remaining consistent with government regulation and standardization. DBSCAN groups the villages of Riau province into five classes with a silhouette coefficient of 0.82308 at epsilon and minpts values of 0.100 and 0.1, respectively. The evaluation with different percentage splits shows that the epsilon and minpts values directly affect the number of clusters formed.
In short, the deployment of DBSCAN in this case study contributes significantly to village clustering, with better accuracy than ordinary arithmetic calculation. The village clustering will benefit village development planning, budget allocation, and village development activity programs.