Comparison of the histogram of oriented gradient, GLCM, and shape feature extraction methods for breast cancer classification using SVM

Breast cancer originates from the ducts or lobules of the breast and is the second leading cause of death after cervical cancer. Early breast cancer screening is therefore required, and mammography is one such screening method. Mammography images can be identified automatically using Computer-Aided Diagnosis by leveraging machine learning classification. This study analyzes the Support Vector Machine (SVM) in classifying breast cancer. It compares the performance of three feature extraction methods used with SVM, namely Histogram of Oriented Gradient (HOG), Gray Level Co-occurrence Matrix (GLCM), and shape feature extraction. The dataset consists of 320 mammogram images from MIAS, containing 203 normal images and 117 abnormal images. Each extraction method was evaluated with three kernels, namely Linear, Gaussian, and Polynomial. Shape feature extraction with SVM using the Linear kernel shows the best performance, with an accuracy of 98.44 %, sensitivity of 100 %, and specificity of 97.50 %.


I. INTRODUCTION
Cancer is a deadly tumor caused by abnormal cells that grow uncontrollably in the body tissues [1]. According to World Health Organization (WHO) data from 2013, cancer was the second leading cause of death in the world after cardiovascular disease, accounting for 13 % of deaths, and this figure is expected to rise by 2030 [2], [3]. In 2020, breast cancer was among the cancers with the highest rates in the world [4].
Breast cancer is a cancer of the body tissues originating from the ducts or lobules of the breast [5]. It is the second leading cause of death after cervical cancer. Its cause is unknown, and its growth is controlled by genes in the nucleus of breast tissue cells. Inherited genetic factors account for only about 5 % to 10 % of cases. Other reported risk factors include lifestyle, a first menstrual cycle before the age of 12, and menopause after the age of 55 [6], [7]. Early breast cancer screening is therefore necessary, and mammography is one such method [8].
Mammography is an examination technique in which X-rays penetrate the breast tissue to produce an overall picture of the breast [9]. The results of this technique are still analyzed manually by experienced experts. This research is expected to help medical practitioners identify breast cancer using Computer-Aided Diagnosis (CAD) [10].
CAD is widely used as decision support in disease detection based on signal, numerical, and image data [11]. One of the CAD methods widely used in classification is the Support Vector Machine (SVM). SVM can handle high-dimensional patterns (the curse of dimensionality), can classify patterns that are not included in the class, and is easy to apply, which results in good classification performance [12], [13]. Vijayarajeswar et al. [14] classified mammogram images using SVM and Hough transform feature extraction. In that research, the best accuracy of SVM reached 94 %, compared to LDA, which obtained only 86 %. Ma'arif and Arifin [15] classified breast cancer using Backward Elimination (BE) and SVM. Their study combined the BE feature selection algorithm with SVM and split the data using 10-fold cross-validation. Accuracy improved by up to 14 %, reaching 97.14 %, with an AUC of 0.995. These studies show that SVM produces high accuracy values.
The classification stage can be carried out only after several earlier steps, such as feature extraction, have been completed. Feature extraction aims to extract meaningful information from an image to facilitate the classification stage [16]. The most commonly used feature extraction method is the Gray Level Co-occurrence Matrix (GLCM). Tunjungsari et al. [17] studied feature extraction on mammographic images to detect breast cancer using GLCM and Fuzzy Backpropagation (FBP). That research achieved an accuracy of 50 % when the FBP input combined five GLCM features: contrast, dissimilarity, energy, entropy, and inverse difference moment. Sarosa et al. [18] studied breast cancer detection on mammographic images using GLCM feature extraction and SVM. Their pipeline consisted of preprocessing with grayscale conversion and histogram equalization, GLCM feature extraction, and SVM classification, and achieved an accuracy of 63.03 %. Another feature extraction method is the Histogram of Oriented Gradient (HOG) descriptor. Suresh et al. [19] used HOG to examine and classify normal and abnormal patterns on mammographic images using a Deep Neural Network (DBN), performing preprocessing, segmentation, feature extraction using HOG, and DBN classification. That approach improved classification results by 3 % to 9 % compared to other methods. Farhan and Kamil [20] analyzed the texture of mammograms using the HOG method. In that research, Contrast Limited Adaptive Histogram Equalization (CLAHE) was used for preprocessing, HOG was used for feature extraction, and SVM was used for classification. On the mini-MIAS database, the method obtained an accuracy of 90 %, a sensitivity of 69 %, and a specificity of 100 %.
Another feature extraction method is shape feature extraction. Wibawa and Novianti [21] proposed a technique to optimize the classification of breast tumors. Their study used contour and textural features such as radius, perimeter, area, cohesiveness, smoothness, concave, concave point, symmetry, fractal dimension, and texture. The extracted features were classified using the KNN method, and feature reduction methods such as PCA, RFE, and RFECV were compared. The best accuracy, 0.9736 in 1.231 seconds, was obtained using PCA and KNN. Ma et al. [22] predicted molecular subtypes of breast cancer from mammography radiomic features. The study used 39 attributes, including morphological features such as shape, size, perimeter, area, concavity, and roundness, as well as Fourier coefficient descriptors, grayscale statistical features, and Haralick texture features. The extracted features were classified using the Naïve Bayes method. The best result, a value of 0.796 for distinguishing triple-negative from non-triple-negative cases, was obtained by combining the craniocaudal and mediolateral oblique views.
Research on breast cancer was also carried out in [23] using Neural Network classification and comparing the GLCM and HOG methods at the feature extraction stage. That study reports an accuracy of 96.67 % using HOG. However, it does not show how fast the Neural Network works. Therefore, this research proposes a further analysis of the SVM method in classifying breast cancer on mammography images by comparing the GLCM, HOG, and shape feature extraction methods. This research is expected to provide the best results to help medical authorities classify breast cancer and thereby reduce the death rate from breast cancer.

II. RESEARCH METHODS
This is quantitative research, which seeks knowledge by using numerical data as a tool to analyze information [24]. The data used are secondary data from MIAS (Mammographic Image Analysis Society), consisting of 320 mammogram images [25]. The data are breast images from the right and left positions (RCC and LCC) in PGM format. In this research, the data were divided into 203 normal images and 117 abnormal images. The research consists of several stages: data preprocessing to improve image quality, segmentation, feature extraction, classification, and model testing, as depicted in Figure 1.

A. Preprocessing and Segmentation
After the data are obtained, the preprocessing stage is carried out to improve image quality. In this research, a Gaussian filter is applied, and edges are detected using the Canny method. The preprocessed data are then processed further at the segmentation stage. Segmentation separates objects from the background to obtain the important objects that will be used in the next stage [26]. This research uses the thresholding method in the segmentation stage.
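As an illustration of the smoothing and thresholding steps, the following is a minimal sketch in plain Python and not the implementation used in this research. The Canny edge detection step is omitted for brevity, and the function names, the value of `sigma`, the threshold `level`, and the toy image are all assumptions made for the example.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + i - radius, 0), width - 1)]
             for i, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_filter(image, sigma=1.0):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel_1d(sigma, radius=max(1, int(3 * sigma)))
    blurred_rows = smooth_rows(image, kernel)
    return transpose(smooth_rows(transpose(blurred_rows), kernel))

def threshold(image, level):
    """Binary segmentation: 1 where the pixel exceeds the level, else 0."""
    return [[1 if pixel > level else 0 for pixel in row] for row in image]

# Toy 5x5 image (hypothetical): a bright 3x3 region on a dark background.
toy = [[0,   0,   0,   0, 0],
       [0, 200, 200, 200, 0],
       [0, 200, 200, 200, 0],
       [0, 200, 200, 200, 0],
       [0,   0,   0,   0, 0]]
mask = threshold(gaussian_filter(toy, sigma=1.0), level=100)
```

The blur spreads intensity across the object boundary, so the choice of threshold level determines how much of the border region remains in the segmented mask.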

B. Feature Extraction
Feature extraction aims to separate the relevant information that characterizes each class in order to form features. These features are used in the classification stage to relate the input unit to the target output, which makes classification easier [27]. In this research, the feature extraction process compares the HOG method [28]-[30], GLCM [31]-[34], and shape feature extraction [35]-[37]. GLCM uses four parameters, namely contrast, correlation, energy, and homogeneity. Shape feature extraction uses four parameters, namely area, perimeter, metric, and eccentricity.
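The four shape parameters can be illustrated with the following minimal sketch on a binary mask. The definitions are assumptions, since the text does not define them: the perimeter is counted as the number of pixel edges facing the background, the metric is taken to be the roundness measure 4πA/P², and the eccentricity is that of the ellipse with the same second central moments as the region. The function name `shape_features` is hypothetical.

```python
import math

def shape_features(mask):
    """Return (area, perimeter, metric, eccentricity) of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    area = len(pixels)
    if area == 0:
        raise ValueError("mask contains no foreground pixels")

    def is_background(r, c):
        return not (0 <= r < rows and 0 <= c < cols) or not mask[r][c]

    # Perimeter (assumed definition): pixel edges that face the background.
    perimeter = sum(
        is_background(r + dr, c + dc)
        for r, c in pixels
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
    )

    # Metric (assumed definition): roundness 4*pi*A / P^2.
    metric = 4.0 * math.pi * area / (perimeter ** 2)

    # Eccentricity from the second central moments of the region.
    mean_r = sum(r for r, _ in pixels) / area
    mean_c = sum(c for _, c in pixels) / area
    mu_rr = sum((r - mean_r) ** 2 for r, _ in pixels) / area
    mu_cc = sum((c - mean_c) ** 2 for _, c in pixels) / area
    mu_rc = sum((r - mean_r) * (c - mean_c) for r, c in pixels) / area
    spread = math.sqrt((mu_rr - mu_cc) ** 2 + 4.0 * mu_rc ** 2)
    major = (mu_rr + mu_cc + spread) / 2.0
    minor = (mu_rr + mu_cc - spread) / 2.0
    eccentricity = math.sqrt(1.0 - minor / major) if major > 0 else 0.0
    return area, perimeter, metric, eccentricity

# Hypothetical example: a 4x4 square region inside a 6x6 mask.
square = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)]
          for r in range(6)]
area, perimeter, metric, eccentricity = shape_features(square)
```

With these definitions, a compact region such as the square has an eccentricity of zero, while an elongated region has an eccentricity close to one.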

C. Classification
The features obtained from the feature extraction stage are classified using SVM [38]-[40]. The image data are divided into test data and training data using the k-fold cross-validation method. The classification results are evaluated using a confusion matrix with the accuracy, sensitivity, and specificity parameters to determine how well the method performs on the image data used [33].

III. RESULTS AND DISCUSSION
The result of this study is the accuracy value of the classification results to measure the success of the model used. The steps taken are preprocessing, segmentation, feature extraction, and classification. Figure 2 shows the data sample used in this study.

A. Preprocessing and segmentation
This stage begins with cropping to focus on the breast area. In this research, cropping was done manually by the authors. Next, a Gaussian filter is applied to smooth the image and reduce noise. Canny edge detection is then applied to make it easier to identify objects at the segmentation stage. Finally, the segmentation stage is carried out using thresholding. Figure 3 shows the results of the preprocessing and segmentation stages.

B. Feature Extraction
Feature extraction is used to identify the characteristics of each image for the classification stage. In this research, feature extraction used HOG, GLCM, and shape feature extraction, which are later compared at the model testing stage. The HOG feature extraction used nine blocks, so each image is divided into a 3 x 3 grid. The results of HOG samples are shown in Table 1. The GLCM feature extraction used four parameters, namely contrast (Ct), correlation (Cr), energy (En), and homogeneity (H). The results of GLCM samples can be seen in Table 2. Shape feature extraction in this research also used four parameters, namely area (A), perimeter (Per), metric (Mtr), and eccentricity (Ecc). Table 3 shows the results of shape feature extraction samples.
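The four GLCM parameters can be illustrated with the following minimal sketch, which is not the implementation used in this research. It assumes the commonly used definitions of contrast, correlation, energy, and homogeneity, a non-symmetric co-occurrence matrix, and the convention that the offset (0, 1) corresponds to 0° and (-1, 1) to 45°. The function names and the toy image are hypothetical.

```python
import math

def glcm(image, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]

def glcm_features(p):
    """Contrast, correlation, energy, and homogeneity of a normalized GLCM."""
    levels = range(len(p))
    cells = [(i, j, p[i][j]) for i in levels for j in levels]
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    if sd_i == 0 or sd_j == 0:
        correlation = float("nan")  # undefined when one marginal is constant
    else:
        covariance = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
        correlation = covariance / (sd_i * sd_j)
    return {"contrast": contrast, "correlation": correlation,
            "energy": energy, "homogeneity": homogeneity}

# Toy 4x4 image with four gray levels (hypothetical data).
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
features_0deg = glcm_features(glcm(toy, levels=4, dr=0, dc=1))
features_45deg = glcm_features(glcm(toy, levels=4, dr=-1, dc=1))
```

Changing the offset changes which pixel pairs are counted, which is why the same image can yield different feature values at 0° and 45°.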
The step before classification is data splitting. The data are divided into training and test data using the k-fold cross-validation method. In this research, the k value used was 5, which gave the highest classification accuracy when k values from 2 to 10 were tested in [41]. The data were therefore divided into 256 images for training and 64 images for testing.
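The arithmetic of this split can be checked with the following minimal sketch of k-fold index generation. It is an unstratified split with a fixed random seed, both of which are assumptions, since the text does not state whether the folds were stratified or how they were shuffled. The function name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 320 images and k = 5, as in this research.
splits = list(k_fold_indices(320, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each image appears in exactly one test fold, so every image is used once for testing and four times for training.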

C. Classification
The data from feature extraction is classified into two classes, namely normal and abnormal. The data classification uses the SVM method. In this research, the SVM method uses Linear, Gaussian, and Polynomial kernels. The classification results are expressed in a confusion matrix to calculate accuracy (Acc), sensitivity (Sens), and specificity (Spec). The confusion matrix can be seen in Table 4.
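For reference, the three kernels can be written as the following functions. This is a minimal sketch using their standard forms, and the parameter values (`gamma`, `degree`, `coef0`) are assumptions, since the settings used in this research are not stated here. The helper `gram_matrix` is hypothetical and only shows how a kernel is evaluated over a set of feature vectors.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return dot(x, y)

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """K(x, y) = (x . y + coef0) ^ degree"""
    return (dot(x, y) + coef0) ** degree

def gram_matrix(samples, kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(a, b) for b in samples] for a in samples]

# Hypothetical four-dimensional feature vectors, e.g. four shape parameters.
samples = [[1.0, 0.5, 0.2, 0.1], [0.9, 0.4, 0.3, 0.2], [0.1, 0.9, 0.8, 0.7]]
K_linear = gram_matrix(samples, linear_kernel)
K_gaussian = gram_matrix(samples, gaussian_kernel)
```

The Linear kernel compares the features directly, while the Gaussian and Polynomial kernels implicitly map them into a higher-dimensional space before comparison.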
Based on Table 4, HOG-SVM has 8 samples correctly classified as positive for cancer, 4 samples falsely classified as positive, 16 samples falsely classified as negative, and 36 samples correctly classified as negative. In GLCM-SVM, 8 samples are correctly classified as positive, 4 are falsely classified as positive, 15 are falsely classified as negative, and 37 are correctly classified as negative. In shape feature extraction-SVM, 24 samples are correctly classified as positive, 1 is falsely classified as positive, 0 are falsely classified as negative, and 39 are correctly classified as negative.
Table 5 shows that, for HOG-SVM, the best accuracy, 68.75 %, is obtained using the Gaussian kernel. The best sensitivity, 33.33 %, is achieved using the Gaussian and Linear kernels. The best specificity, 90.00 %, is reached using the Gaussian kernel. Based on these results, HOG-SVM classification performs best with the Gaussian kernel, with a computational time (T) of 0.04 seconds [42].
Table 6 shows that, for GLCM-SVM, the Gaussian kernel at 0° has an accuracy of 54.69 %, with sensitivity and specificity values of 12.50 % and 80.00 % and a computational time of 0.05 seconds. This configuration is therefore not a good classifier. The poor accuracy is caused by GLCM feature data that cannot be separated appropriately [34]. The best results are obtained using the Gaussian kernel at 45°, with 70.31 % accuracy, 34.78 % sensitivity, and 90.24 % specificity. This accuracy is better than that in [17], [18]. GLCM-SVM therefore performs best with the Gaussian kernel, with a computational time of 0.03 seconds. The Gaussian kernel describes the data distribution better than the Polynomial and Linear kernels in the data mapping process [43].
Table 7 reveals that, for shape feature extraction-SVM, the best accuracy, 98.44 %, is obtained using the Linear kernel. The best sensitivity, 100 %, is obtained using the Linear and Gaussian kernels. The best specificity, 97.50 %, is obtained using the Linear kernel. The Linear kernel describes the data distribution better than the Polynomial and Gaussian kernels in the data mapping process [44]. Based on these results, shape feature extraction-SVM classification performs best with the Linear kernel, with a computational time of 0.04 seconds.
The results show that shape feature extraction-SVM using the Linear kernel is the best model. It gives better accuracy than [45]. HOG and GLCM obtain lower accuracy values than shape feature extraction. In the segmentation process, the binary image is converted back to a grayscale image, which causes some features to be lost [46].
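The reported percentages can be reproduced from the confusion-matrix counts quoted above for Table 4. The following minimal sketch applies the standard definitions of accuracy, sensitivity, and specificity. The function name `confusion_metrics` and the dictionary layout are hypothetical, while the counts are the ones given in the text.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Counts in the order (TP, FP, FN, TN), as quoted in the text for Table 4.
reported_counts = {
    "HOG-SVM": (8, 4, 16, 36),
    "GLCM-SVM": (8, 4, 15, 37),
    "Shape-SVM": (24, 1, 0, 39),
}
results = {name: confusion_metrics(*counts)
           for name, counts in reported_counts.items()}
for name, (acc, sens, spec) in results.items():
    print(f"{name}: Acc={acc:.2%} Sens={sens:.2%} Spec={spec:.2%}")
```

Each set of counts sums to 64, the size of one test fold, and the resulting values agree with the best accuracy, sensitivity, and specificity reported for each feature extraction method.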

IV. CONCLUSION
A method to identify breast cancer has been proposed by comparing feature extraction methods and classifying with SVM. The proposed method obtains good accuracy with fast computation time. Shape feature extraction is capable of detecting the presence of cancer. SVM classification identifies breast cancer very well compared to the neural networks used in previous studies. Further research is expected to be carried out using appropriate classification methods with fast computation time.

SUPPLEMENTARY MATERIALS
Supplementary material associated with this article can be found, in the online version, at doi: 10