Maturity classification of cacao through spectrogram and convolutional neural network

The ideal time to harvest a cacao pod is when it is about to ripen. Harvesting immature pods yields hard beans that are unsuitable for fermentation, while overripe pods lead to fungal-infected, defective, and poor-quality yields. The demand for high-quality cacao products is expected to rise as technology advances. Pre-harvest assessment therefore needs to identify which pods are ripe enough for the next stage of cacao processing. This paper proposes a technique to determine the ripeness of cacao. Nine hundred thirty-three cacao samples were used to collect thumping audio data at five different locations on each pod's exocarp. Each sound file is 1 second long, creating a dataset of 4665 cacao sound files at a 16 kHz sample rate and 16-bit audio bit depth. A Mel-Frequency Cepstral Coefficient spectrogram was then computed to extract recognizable features for training. A convolutional neural network (CNN) was used as the deep learning method to classify the cacao sounds. The model achieves an accuracy of 97.50% on the training data and 97.13% on the validation data, while the overall mean accuracy of the classification system, for both unripe and ripe cacao, is 97.46%.


I. INTRODUCTION
Cacao (Theobroma cacao L.) is a perennial tropical crop that has been cultivated in the Americas for thousands of years and yields pods throughout the year to sustain the large market demand for cocoa products. The cacao pod, an ovoid indehiscent fruit, is 15-30 cm (6-12 in) long and 8-10 cm (3-4 in) wide and occurs in various forms, such as oblate, spherical, elliptical, and oblong. In its immature state, the pod can be smooth or rough and green or red in color, depending on its genotype [1]. Currently, all cultivated cacao varieties are the product of human hybridization; however, cacao is mainly classified into three broad groups: Criollo (native), Forastero (peasant), and Trinitario (hybrid) [2].
According to the International Cocoa Organization (ICCO), approximately 4.7 million tons of dried cacao were produced worldwide from 2016 to 2017, with world grindings of 4.3 million tons. High-quality cacao beans, which command a high price, depend on the maturity of the pod at the time of harvesting. Several issues underlie poor cacao quality in the Philippines, including inadequate fermentation and untimely harvesting of pods, whether immature or overripe, which affects the overall quality of the beans [3].
Harvesting is an essential first step in cacao processing for obtaining high-quality cacao beans [4]. Cacao pods do not all ripen at the same time, which makes harvesting arduous and labor-intensive for the farmer. Pods can be left on the tree for about 2 to 3 weeks until they are suitable for harvest, but beyond that time they become infected or start to germinate, producing black beans.
Moreover, cacao is a non-climacteric fruit, which means that it does not ripen further once picked from the tree. It is therefore important not to harvest it unripe, because the beans inside will not be ready for fermentation. Hence, only fully ripened pods are suitable for achieving an optimum yield of quality cacao beans.
Nonetheless, unlike other fruits, the maturity level of a cacao pod is difficult to determine accurately from its exterior appearance, since the color of the pod's external wall depends on its variety. For instance, Forastero pods change from green to yellow, Criollo pods from dark red to red, and other types to shades of red-orange [5]. Pod color serves as a general ripening guide, but knowledgeable farmers and experts usually tap the pod to validate their initial assessment. If the pod produces a somewhat hollow, vibrating sound, the beans inside are coming loose, signifying that the pod is slowly becoming ripe [4]. According to [6], the acoustic response of a fruit is directly related to its firmness, which decreases as the fruit ripens. Cooke and Rand [7] established, using a mathematical model, the correlation between a fruit's acoustic response and its modulus of elasticity, shape, and mass.
Copyright ©2020, JTSiskom, e-ISSN: 2338-0403, p-ISSN: 2620-4002 Submitted: 27 January 2020; Revised: 7 July 2020, 31 July 2020; Accepted: 10 August 2020; Published: 31 October 2020
The ideal time to harvest a pod is at the first stage of ripeness. Overripening promotes contamination by fungal diseases, resulting in quality defects. Green pods, in turn, should not be harvested because their seeds are hard and cannot be separated easily, and because the mucilage around the seeds has not finished forming [4]. In the Philippines, the basis for determining cacao pod ripeness after harvest has remained extremely subjective. Aside from observing changes in color, farmers traditionally tap young cacao pods with their fingernails, knuckles, the rounded end of a knife, or the knife itself and use the resulting sound to differentiate maturity levels. However, this requires experience and is difficult for people with less sensitive hearing or in a noisy farm environment.
In this context, the development of automated technologies has enabled commercially feasible non-invasive methods for determining the maturity level of agricultural products. These non-destructive methods rely on properties such as firmness, reflectance, and vibrational transmission. Among them, acoustic sensing has been demonstrated to have the potential to measure ripeness [8], and [9] stresses the reliability of integrating this acoustic technique.
Most of these acoustic techniques use an accelerometer, a microphone, or piezoelectric sensors to test fruits such as kiwi [10], mandarin [11], peach [12], melon [13], apple [14], and pomelo [15]. In another study, an acoustic tester with a mechanical plunger was used to determine the ripeness of Juan canary melon [16]. Similarly, [17] fused an electronic nose with an acoustic sensor, using a microphone to capture the peak sound frequency. Aside from microphones, some studies used piezoelectric sensors and films, as in the ripeness determination of mountain papaya [18] and tomato [19]. In addition, an experimental study using a piezoelectric transducer for acoustic sensing established a ripeness classification model for cacao [20]. In this study, a microphone compatible with the chosen microcontroller was used to capture the signals from the cacao, which are later passed through a deep learning process.
For training on such data, several studies have used machine learning methods, most commonly the Support Vector Machine (SVM), as in the ripeness estimation of papaya [18] and cacao [20], with satisfactory accuracy rates. The K-nearest neighbor classifier has also been found useful when provided with good feature vectors, as in the study of watermelon maturity [21].
Aside from these methods, deep neural networks such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN) have gained popularity in computer vision and audio processing, where captured signals are transformed into acoustic spectrograms and then classified based on the extracted spectrogram features. For instance, studies on ripeness estimation of grapes [22], banana [23], pineapple [24], durian [25], coffee beans [26], and other artificially ripened fruits [27] have used CNNs to achieve high-accuracy classification. Other useful studies on CNN modeling appear in other applications [28]-[30], while commonly used machine learning methods such as SVM, Decision Tree, and Naïve Bayes can also be used to train on the data [31], [32].
This study aimed to design a cacao maturity detector that integrates acoustic sensing, spectrograms, and deep neural network learning to increase the accuracy of determining the maturity level of cacao. Specifically, it addresses the objective of creating a maturity classification model for cacao pods based on a deep CNN.

II. RESEARCH METHODS
This section describes the main procedures carried out to meet the objectives of this study.

A. Dataset gathering procedure
The sound signals were generated by thumping or knocking on the exocarp of the cacao pod and were recorded as individual audio files at a sampling frequency of 16 kilohertz (kHz) in 16-bit mono, as shown in Figure 1. This was done at five different locations on the face of each pod. The dataset consists of 4465 files from 893 different cacao pods, divided into two classes: (1) 2242 unripe files and (2) 2223 ripe files. Each file contains a single thump sound and is 1 second (s) long, and labeling was based solely on the cacao farmer's knowledge. The audio files are stored in .wav format to ensure no loss of recording quality during processing.
The dataset was then divided into 70% training data, 10% validation data, and 20% testing data. Data acquisition was carried out privately on a recognized cacao farm, with cacao experts labeling every sample. Background noise on the farm was minimal, but the recordings still underwent preprocessing to remove unwanted noise, which was done during the feature extraction process.
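The recording format described above can be illustrated with a minimal sketch that uses only the Python standard library. The decaying sine burst is a hypothetical stand-in for a recorded thump, and the helper names (`synthetic_thump`, `write_wav`, `read_wav`) are illustrative and not part of the original acquisition software.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000   # Hz, as stated in the text
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)
CHANNELS = 1           # mono
DURATION_S = 1         # one thump per 1-second file

def synthetic_thump(freq_hz=220.0, decay=12.0):
    """Hypothetical stand-in for a recorded thump: a decaying sine burst."""
    n = SAMPLE_RATE * DURATION_S
    return [
        int(32767 * 0.8 * math.exp(-decay * t / SAMPLE_RATE)
            * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    ]

def write_wav(samples):
    """Encode samples as 16 kHz, 16-bit mono PCM and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()

def read_wav(data):
    """Decode WAV bytes back to (sample_rate, samples)."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        assert wf.getnchannels() == CHANNELS and wf.getsampwidth() == SAMPLE_WIDTH
        frames = wf.readframes(wf.getnframes())
        samples = list(struct.unpack(f"<{len(frames) // 2}h", frames))
        return wf.getframerate(), samples

wav_bytes = write_wav(synthetic_thump())
rate, decoded = read_wav(wav_bytes)
n_files = 893 * 5  # pods x tapping locations per pod
```

Because PCM .wav storage is uncompressed, the decoded samples are identical to the ones written, which is the property the text relies on when it states that no recording quality is lost.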
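A 70/10/20 division of this kind could be sketched as follows. The per-class (stratified) shuffling, the fixed seed, and the rounding rule are assumptions made for illustration; the text does not state how the split was drawn, so the resulting counts need not match the reported ones exactly.

```python
import random

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(fractions[1] * len(indices))
        n_test = round(fractions[2] * len(indices))
        n_train = len(indices) - n_val - n_test  # remainder goes to training
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Class sizes taken from the dataset description.
labels = ["unripe"] * 2242 + ["ripe"] * 2223
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the ripe/unripe ratio roughly the same in all three subsets, which is one reasonable design choice for a nearly balanced two-class dataset.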

B. Feature extraction
The first step in the classification process is to extract good features that capture the useful attributes of the signal. Before feature extraction, a cleaning procedure based on the signal envelope is carried out to remove possible dead spaces in the signal, as shown in Figure 2a. Then, a Mel-Frequency Cepstral Coefficients (MFCC) feature extractor is used to obtain recognizable sound features. Figure 2b shows sample signal waveforms in the time domain, where the signals are hard to differentiate. Applying a short-time Fourier transform or computing a decibel spectrogram, as in Figure 2c, could therefore be a good option for obtaining recognizable features. In this study, however, a Mel-spaced filterbank was used to rebin and rescale the audio, producing Mel spectrogram inputs, like the one in Figure 2d, that are later fed into the deep neural network learning process.
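The two ideas in this step, envelope-based cleaning and Mel-spaced rebinning, can be sketched in plain Python. This is an illustrative sketch rather than the extraction code used in the study: the rolling-maximum envelope, the window and threshold values, the number of filters, and the FFT size are all assumptions, and the Hz-to-mel formula is the common 2595 log10(1 + f/700) variant.

```python
import math

def envelope_mask(signal, window, threshold):
    """Mark samples whose rolling-max absolute amplitude exceeds the threshold."""
    half = window // 2
    mask = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        mask.append(max(abs(s) for s in signal[lo:hi]) > threshold)
    return mask

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Unnormalized triangular filters with edges equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge points, equally spaced in mel, mapped to fractional FFT bins
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges_bin = [hz * n_fft / sample_rate for hz in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        row = []
        for k in range(n_bins):
            rising = (k - left) / (center - left)
            falling = (right - k) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

# Toy signal: silence, a short burst, silence (dead spaces on both sides).
signal = [0] * 20 + [100, -80, 60, -40, 20] + [0] * 20
mask = envelope_mask(signal, window=5, threshold=10)
cleaned = [s for s, keep in zip(signal, mask) if keep]

bank = mel_filterbank(n_filters=26, n_fft=512, sample_rate=16_000)
```

Multiplying each row of `bank` with the power spectrum of one STFT frame would give one column of a Mel spectrogram; in practice a library routine would normally be used for this instead of hand-written loops.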

C. Convolutional neural network (CNN) model
A convolutional neural network (CNN) can be used for signal and time-series prediction once the input signal has been converted into an image by transforming it from a 1D into a 2D representation such as a spectrogram [33]. The CNN in this study is written in Keras with TensorFlow as the backend. Because it takes a spectrogram as input, the architecture uses a 4-layer CNN model in which each layer is composed of a 2D convolution, a ReLU activation, and a pooling operation (Figure 3). Each layer follows three main steps: (1) the extracted audio features are fed, in vector form, into the 2D convolutional layer; (2) the ReLU layer applies a threshold operation to each element of its input; and (3) pooling, a sample-based discretization process, reduces the dimensionality.
After passing through the four layers intended for feature learning, the output is ready for the classification stage, in which dropout is applied and the feature maps are flattened and fed into a dense (fully connected) layer that produces the model output. Finally, the outputs of the neural network model are categorized as either unripe or ripe cacao. The model designed for cacao classification follows this architecture exactly: four layers of 2D convolution, ReLU, and pooling are executed first, followed by flattening, dropout, and the dense layer.
The Adam optimizer, which adapts the learning rate during training, was used with a learning rate of 0.001. Rectified Linear Unit (ReLU) activations are easy to compute and help the model converge easily during training. For pooling, the model uses max pooling, which computes the maximum value of each patch in the feature map. This was preferred over average pooling because the pooled feature map emphasizes the most prominent feature in each patch. Moreover, during model training, the dropout technique randomly drops neurons. Dropout rates of 25% and 50% were applied to convolutional 2D layers 2 and 4 and to the dense (fully connected) layer. In addition, the convolutional model randomly assigns each neuron an initial weight between 0 and 1.
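The three steps of one layer (convolution, ReLU thresholding, pooling) can be traced by hand on a toy input. The sketch below is a plain-Python illustration with a made-up 5x5 "spectrogram" and a made-up 2x2 kernel; it is not the Keras implementation used in the study.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation without padding (the 'convolution' of deep learning libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

def relu(fmap):
    """Threshold every element at zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a patch are dropped."""
    return [
        [max(fmap[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Hypothetical toy input and kernel, chosen only to make the arithmetic easy to follow.
spectrogram = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 2.0, 3.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 2.0],
    [1.0, 3.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 2.0, 1.0, 0.0],
]
kernel = [[1.0, 0.0], [0.0, -1.0]]

features = max_pool(relu(conv2d_valid(spectrogram, kernel)))
print(features)  # [[0.0, 2.0], [1.0, 2.0]]
```

The 5x5 input becomes a 4x4 feature map after the convolution, keeps its size through ReLU, and is reduced to 2x2 by pooling, which is the dimensionality reduction described in step (3).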
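To see how four convolution/ReLU/pooling layers shrink a spectrogram before the flatten and dense stages, the tensor shapes can be traced with a few lines of arithmetic. The input size (64x32), the filter counts, "same" padding, and 2x2 pooling are assumptions for illustration only; the text does not specify them.

```python
def trace_shapes(height, width, filters=(16, 32, 64, 128), pool=2):
    """Track (height, width, channels) through conv('same') -> ReLU -> max-pool layers."""
    shapes = [(height, width, 1)]  # single-channel spectrogram input
    for n_filters in filters:
        # 'same' padding keeps height/width, ReLU keeps the shape,
        # and pooling divides height and width by `pool`.
        height, width = height // pool, width // pool
        shapes.append((height, width, n_filters))
    flattened = height * width * filters[-1]  # length of the vector fed to the dense layer
    return shapes, flattened

shapes, flattened = trace_shapes(64, 32)
for shape in shapes:
    print(shape)
print("flattened length:", flattened)
```

Under these assumed settings, the spatial size halves at every layer while the channel count grows, so the flattened vector passed to the dense layer stays small relative to the input image.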
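The dropout step can be illustrated with a short plain-Python sketch of inverted dropout, the variant commonly used by deep learning libraries. Pairing the 25% rate with the convolutional layers and the 50% rate with the dense layer is one reading of the text and should be treated as an assumption, as should the seed and the vector size.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
units = [1.0] * 10_000
after_conv = dropout(units, 0.25, rng)   # assumed rate for the convolutional layers
after_dense = dropout(units, 0.50, rng)  # assumed rate for the dense layer

dropped_conv = after_conv.count(0.0) / len(units)
dropped_dense = after_dense.count(0.0) / len(units)
```

Rescaling the surviving activations by 1/(1 - rate) keeps their expected value unchanged, so the same network can be used without dropout at test time.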

D. Training, validation, and testing of datasets
Of the input spectrograms, 2231 were used for training so that the network could learn their features. A further 447 were used as validation data to verify that the algorithm was indeed learning the features while generalizing its understanding of the data. Finally, 894 were used as testing data and passed through the network to assess its performance while ensuring that the result is unbiased. Table 1 shows the model's training performance in terms of training loss, training accuracy, validation loss, and validation accuracy as the number of epochs increases.

III. RESULTS AND DISCUSSION
The system model was implemented in Python 3.7 with TensorFlow, using the Keras package with TensorFlow as the backend. It was developed on a Windows 10 PC with an Intel Core i5-6300HQ at 2.3 GHz, 12 GB of DDR4 RAM, and an NVIDIA GeForce GTX 950M with 4 GB of memory.
The dataset consists of 4465 cacao sound files divided into two classes, (1) 2242 unripe files and (2) 2223 ripe files, where each file contains a single thump sound in one second, sampled at 16 kHz with 16-bit audio bit depth and stored in .wav format. The recordings were cleaned using the signal envelope before their Mel spectrograms were computed. During the training phase, the probability of each class was computed to obtain the model for the cacao classifier.
Fifteen epochs, as shown in Table 1, proved appropriate, giving a validation accuracy of 97.13%. Table 2 shows the confusion matrices of the model for the 894 instances of testing data and for all 4465 instances in the dataset. Based on Table 3, the training phase of the convolutional model has an overall accuracy of 97.49%, which is close to the overall testing accuracy of 97.43%. In the confusion matrix, of the 447 testing samples in the ripe class, 436 were predicted correctly and 11 were incorrectly assigned to the unripe class, an error of only 2.46%. Likewise, 435 samples of the unripe class were predicted correctly and 12 were incorrectly assigned to the ripe class, a 2.68% error.
In summary, the accuracies of the training and testing stages shown in Table 3 give an overall mean accuracy of 97.46% for the cacao ripeness classification model. This is slightly higher than the result of Arenga et al. [20], who obtained 95.72% with SVM machine learning for classifying cacao maturity levels. The remaining errors were judged likely to originate from signal noise during the acquisition of the sounds.
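The reported error rates and accuracies can be re-derived from the test confusion matrix quoted in the text. The short check below uses only those published counts and the 97.49% training accuracy from Table 3; the helper names are illustrative.

```python
# Test-set confusion matrix as reported in the text; outer keys are the true classes.
confusion = {
    "ripe":   {"ripe": 436, "unripe": 11},
    "unripe": {"ripe": 12,  "unripe": 435},
}

def class_error(cm, label):
    """Fraction of samples of the true class `label` that were misclassified."""
    row = cm[label]
    return 1.0 - row[label] / sum(row.values())

def overall_accuracy(cm):
    """Correct predictions divided by all predictions."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

test_acc = overall_accuracy(confusion)   # 871 / 894
train_acc = 0.9749                       # overall training accuracy quoted from Table 3
mean_acc = (train_acc + test_acc) / 2

print(f"ripe error:    {class_error(confusion, 'ripe'):.2%}")    # 2.46%
print(f"unripe error:  {class_error(confusion, 'unripe'):.2%}")  # 2.68%
print(f"test accuracy: {test_acc:.2%}")                          # 97.43%
print(f"mean accuracy: {mean_acc:.2%}")                          # 97.46%
```

The values agree with the figures in the text, which confirms that the 97.46% overall mean is the average of the overall training and testing accuracies.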

IV. CONCLUSION
This paper proposes a ripeness classification system for cacao. The system integrates spectrograms with a deep convolutional neural network and has met the objective of the study, which was to create a ripeness classification model for cacao. The experimental results support the robustness of the data acquisition, preparation, and modeling processes. The model achieved an overall mean accuracy of 97.46%, which is slightly higher than that of the reviewed existing studies that use other machine learning methods.
To further improve this work, implementing the designed model on a smartphone is highly recommended. Further research on the effectiveness of spectrograms and other deep neural network learning methods for classifying both cacao varieties and grading quality is also highly recommended.