Sampled and discretized of short-time Fourier transform and non-negative matrix factorization: the single-channel source separation case

The short-time Fourier transform (STFT) is a popular time-frequency representation in many source separation problems. In this work, the sampled and discretized version of the STFT, the Discrete Gabor Transform (DGT), is proposed to replace the STFT in the single-channel source separation problem within the Non-negative Matrix Factorization (NMF) framework. The results show that NMF-DGT is better than NMF-STFT according to the Signal-to-Interference Ratio (SIR), Signal-to-Artifact Ratio (SAR), and Signal-to-Distortion Ratio (SDR). In the supervised scheme, NMF-DGT has an SIR of 18.60 dB compared to 16.24 dB for NMF-STFT, an SAR of 13.77 dB compared to 13.69 dB, and an SDR of 12.45 dB compared to 11.16 dB. In the unsupervised scheme, NMF-DGT has an SIR of 0.40 dB compared to 0.27 dB for NMF-STFT, an SAR of -10.21 dB compared to -10.36 dB, and an SDR of -15.01 dB compared to -15.23 dB.


I. INTRODUCTION
The source separation problem is the effort to extract components of interest from mixtures of time-varying data sequences. In the signal processing field, this kind of data is called a signal. The final goal of the extraction can take several forms, such as signal filtering or signal separation. Researchers have proposed many solutions to this problem from various approaches. Moreover, these solutions have been applied in different fields such as medicine [1]-[3], robotics [4], [5], fault monitoring [6], imaging [7]-[9], and, very extensively, speech processing.
Non-negative Matrix Factorization (NMF) is a frequently used method for single-channel source separation of signals such as music and speech. The single-channel source separation problem is to extract or separate mixed sources from just one mixture. NMF approximates the power spectrum of a single-channel mixture as the product of two non-negative matrices (no matrix element may be negative). The mixture is then decomposed into its constituents by using the Wiener filter and reusing the original phase [10], [11]. NMF relies on a complex time-frequency representation, usually the short-time Fourier transform (STFT). The STFT applies the Fourier transform to short windowed frames of the signal, where the window smooths the transition between frames. The STFT representation is a matrix of complex elements. These elements provide rich information such as the power spectrum (magnitudes), instantaneous phases, and instantaneous frequencies. The STFT is utilized in many new source separation methods and can be divided into three categories: underdetermined cases [
The STFT is a fully redundant time-frequency representation because its window is translated frame by frame. To avoid this redundancy, undersampling can be applied. The result is the sampled and discretized version of the STFT, which is called the Discrete Gabor Transform (DGT). Like the STFT, the DGT is invertible. Invertibility is essential for reconstructing the separated sources or signals in the time domain.
The DGT is the time-frequency representation used in this work to replace the widely used STFT. To the best of our knowledge, the DGT and its inverse are rarely used in single-channel source separation, especially with NMF. Hence, this research aims to examine the performance of the DGT in single-channel source separation. The DGT also benefits from its capability to avoid redundancy in representing a signal. Furthermore, speech signals are used here to evaluate the system because they occupy a broad range of frequency bins compared to a monochromatic (single-frequency) signal.

II. RESEARCH METHODS
Time-Frequency (T-F) analysis is indispensable for localizing information, especially in oscillating signals of short duration. In NMF-based single-channel source separation, the T-F representation is used to calculate the basis vectors (dictionary) and the activation matrix. The procedure is as follows. A mixture is created from two clean speech signals. The T-F matrix of the mixture is calculated using the DGT instead of the STFT. The dictionary and the activation matrix are calculated using NMF. A mask is created from the dictionary and the activation matrix. The Wiener filter and the mask are used to obtain the power of each component of the mixture. The phases are then restored, and the DGT is inverted to obtain the time-domain components.

A. Discrete Gabor transform
Assume x(t) denotes a real continuous signal that is transformed with the STFT, as expressed in Eq. 1 [21]. This yields a T-F representation in radian frequency versus time, X(ω, t) ∈ ℂ. The window term ḡ(·) is the complex conjugate of the translated fixed window function g, with g ≠ 0. Eq. 1 can also be defined in linear frequency, f, as shown in Eq. 2, by merely replacing ω with 2πf.
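Eq. 1 and Eq. 2 are not reproduced in this text. A standard form of the continuous STFT that matches the description above is sketched here; the integration variable τ and the absence of a normalization factor are assumptions, not necessarily the notation of [21].

```latex
% Sketch of Eq. 1 (radian frequency); \tau is an assumed integration variable
X(\omega, t) = \int_{-\infty}^{\infty} x(\tau)\, \overline{g(\tau - t)}\, e^{-i \omega \tau}\, d\tau

% Sketch of Eq. 2 (linear frequency), obtained with \omega = 2 \pi f
X(f, t) = \int_{-\infty}^{\infty} x(\tau)\, \overline{g(\tau - t)}\, e^{-i 2 \pi f \tau}\, d\tau
```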
To analyze oscillating signals such as sounds and speech on a computer, they must be discretized (sampled and quantized). The T-F analysis method must also be in discrete form. Hence, the discrete STFT can be defined as given by Eq. 3 [21]. K denotes the length of the signal and n is the discrete time index, with K, n ∈ ℤ. The final product of the STFT is completely redundant as a result of the window translation. One way to avoid this redundancy is to reduce the number of points involved in the calculation, which is the principle behind the Gabor transform. The Gabor transform of a continuous signal is expressed in Eq. 4.
In discrete form, this formula becomes Eq. 5. The discrete Gabor coefficients form a complex matrix, X[m, n] ∈ ℂ^(M×N), while the window coefficients and the signal sequence may, but need not, be complex. The T-F shift parameters a, b > 0 are the hop factors in time and frequency, respectively, and g(k) is a fixed window function. The indices are m = 0, …, M - 1 and n = 0, …, N - 1, with m, n ∈ ℤ, where M = L/b and N = L/a represent the number of channels and the number of time shifts, respectively. The signal length L should be divisible by both a and b, and zero padding is commonly used to fulfill this condition. Eq. 5 also shows that the DGT is a sampled and discretized version of the STFT. As such, the signal should be finite with periodic boundaries [21]-[23].
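The sampling described here can be sketched in NumPy. The formula used, X[m, n] = Σ_k x[k] · conj(g[k - an]) · exp(-2πi·m·b·k/L) with b = L/M, is a common periodic-boundary form of the DGT and is an assumption about the exact convention of [21]; the function name `dgt` and its arguments are illustrative only.

```python
import numpy as np

def dgt(x, g, a, M):
    """Discrete Gabor transform with periodic boundaries (illustrative sketch).

    x : signal of length L
    g : analysis window of length L (zero-padded if it is shorter)
    a : time hop; M : number of frequency channels (so b = L / M)
    Returns an M x N complex matrix with N = L / a.
    """
    x = np.asarray(x)
    g = np.asarray(g)
    L = len(x)
    if len(g) != L or L % a != 0 or L % M != 0:
        raise ValueError("need len(g) == L and L divisible by a and M")
    N = L // a
    X = np.empty((M, N), dtype=complex)
    for n in range(N):
        # Window translated by a*n samples, wrapping around (periodic boundary).
        windowed = x * np.conj(np.roll(g, a * n))
        # exp(-2j*pi*m*k/M) is M-periodic in k, so the frame can be folded
        # to length M and transformed with a single M-point FFT.
        X[:, n] = np.fft.fft(windowed.reshape(L // M, M).sum(axis=0))
    return X
```

With a = 1 and M = L, this sketch produces the fully redundant L × L representation; larger hops shrink the coefficient matrix to M × N.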

B. Inversion of discrete Gabor transform
Like the STFT, the DGT is invertible. The inverse transform (IDGT) guarantees that the analyzed signal can be synthesized back into a time-domain signal with a very small error.
The STFT can be considered superior when dealing with inversion, because any window can be used in the reconstruction phase. In the IDGT, by contrast, the synthesis window is restricted: it must be chosen to match the analysis window.
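The restriction on the synthesis window can be illustrated with the canonical dual window, one standard choice that gives perfect reconstruction when the Gabor system forms a frame. The sketch below builds all Gabor atoms explicitly, which is only practical for short signals; the function names are hypothetical, and this is not necessarily the routine used in [21].

```python
import numpy as np

def gabor_atoms(g, a, M):
    """Columns are the atoms g[k - a*n] * exp(2j*pi*m*k/M), periodic in k."""
    L = len(g)
    k = np.arange(L)
    cols = [np.roll(g, a * n) * np.exp(2j * np.pi * m * k / M)
            for n in range(L // a) for m in range(M)]
    return np.stack(cols, axis=1)          # shape (L, M*N)

def dgt_idgt_roundtrip(x, g, a, M):
    """Analyse x with window g, then resynthesise with the canonical dual."""
    G = gabor_atoms(np.asarray(g, dtype=complex), a, M)
    coeffs = G.conj().T @ x                # analysis: inner products with atoms
    S = G @ G.conj().T                     # frame operator (L x L)
    dual_atoms = np.linalg.solve(S, G)     # atoms built from the canonical dual
    return dual_atoms @ coeffs             # synthesis
```

If the frame operator S is singular, the chosen window and hops do not form a frame and no synthesis window can recover the signal exactly.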

C. Single-channel source separation problem
Source separation has been one of the popular topics in signal processing. It is intended to separate a mixture of signals or sources. Assume there are P sources and Q sensors in hand. The separation problem can then be classified into three groups, which are underdetermined (P>Q), determined (P=Q), and overdetermined (P<Q). The mixtures can be considered as mathematical entanglement of each source. The way they are mathematically entangled is classified into instantaneous, anechoic, and convolutive [24].
In this work, the determined problem with a linear instantaneous mixing model has been used, in which two speech signals were mixed to form a single mixture. Speech was chosen as test data because its spectrum contains many closely spaced frequencies compared to the notes of musical instruments. In other words, the notes of musical instruments can be considered spectrally sharper than speech. Further, the framework used in the test (conventional NMF) was chosen to be as simple as possible. This helps to conclude firmly which representation (DGT or STFT) is superior in a basic implementation.
In a nutshell, the mixture of signals used in this work can be described by Eq. 7, where P = Q, A (of size P×Q) is the mixing matrix, s ∈ ℝ^(Q×L) is the matrix of sources, and L is the signal length [12]. The mixing matrix was generated randomly from a normal distribution. This matrix is not recovered in the end, as estimating it is not the aim of the NMF algorithm. Eq. 7 can be rewritten in matrix form, as shown by Eq. 8, with a 2×2 mixing matrix. Only one arbitrarily chosen mixture is used for the separation.
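A minimal sketch of this instantaneous mixing, assuming a square mixing matrix drawn from a standard normal distribution and that one row of the result is kept as the mixture. The function name, the seed, and the choice of row are illustrative assumptions.

```python
import numpy as np

def mix_sources(sources, seed=0, keep_row=0):
    """Instantaneous linear mixing x = A s; return one mixture and A."""
    s = np.asarray(sources, dtype=float)   # shape (Q, L)
    Q = s.shape[0]
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((Q, Q))        # P = Q (determined case)
    x = A @ s                              # shape (P, L), one mixture per row
    return x[keep_row], A                  # only one mixture is separated
```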

D. Non-negative matrix factorization
In the NMF framework, a set of mixtures can be expressed in terms of its decomposition, as shown in Eq. 9. X is the mixture, W is the matrix of basis vectors (also called basis spectra or the dictionary), and H is the activation matrix or weight matrix. All elements of these matrices are real and non-negative. NMF does not require the components to be statistically independent, as ICA does. It recovers the sources without prior knowledge or estimation of the first and second moments (mean and variance, respectively) of the sources, unlike ICA. However, it cannot deliver a unique solution as most ICA-based solutions do [25]-[28]. Figure 1 shows the block diagram of how the single-channel source separation is conducted. NMF depends only on the power spectrum of the instantaneous T-F coefficients. The method is sensitive to the initial value of W. Based on how this initial value is set up, NMF can be divided into two classes: supervised and unsupervised. The difference between them is how the initial basis spectra are calculated. Supervised NMF requires an initial dictionary built from real speech signals, while the initial dictionaries for unsupervised NMF are generated randomly. Supervised NMF often outperforms unsupervised NMF because of this initial value. NMF is also non-convex and capable of finding only a local optimum.
There are various types of NMF, which differ mainly in how the divergence between X and WH, D(X||WH), is calculated, for example Euclidean or Kullback-Leibler (KL). In this work, we simplify NMF to use only ordinary iterations. Algorithm 1 elaborates every step of the block diagram above. Algorithm 2 explains how the basis spectra and the weights are updated.
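The update step can be sketched with the widely used multiplicative rules for the Euclidean cost. The text does not spell out the exact rules of Algorithm 2, so this choice, the function name `nmf_euclidean`, and the small constant `eps` are assumptions.

```python
import numpy as np

def nmf_euclidean(X, K, n_iter=200, W_init=None, seed=0, eps=1e-12):
    """Factorise a non-negative F x T matrix X as W (F x K) times H (K x T)."""
    X = np.asarray(X, dtype=float)
    F, T = X.shape
    rng = np.random.default_rng(seed)
    # Supervised use passes a dictionary; otherwise start from random values.
    if W_init is None:
        W = rng.random((F, K)) + eps
    else:
        W = np.array(W_init, dtype=float)
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update activations
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update dictionary
    return W, H
```

Because the updates only multiply by non-negative factors, W and H stay non-negative as long as the input and the initial values are non-negative.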
A benefit of NMF is that its iterations are guaranteed to converge. However, the number of selected basis spectra, K, has to be defined prior to separation, typically K < F < T. The selection of K indicates the use of a low-rank matrix in NMF. This method also requires knowledge of the number of sources, which is often unknown. Also, in practice it is hard to obtain real-world signals that are free of noise and reverberation, which are inevitable when dealing with this kind of signal. In our work, the signals used for evaluation were well prepared, with noise and reverberation already filtered out.

Algorithm 1 (excerpt)
Input: x(t), single-channel mixture; K, number of selected basis spectra; Num_Sources
Output: ŝ1(t) … ŝP(t), separated sources
X(t,f) ← DGT(x(t))

E. Materials
The machine and software specifications used to simulate the whole process are described in Table 1. The speech signals were taken from the TIMIT Corpus [29]. We wrote the STFT and NMF programs ourselves, while the DGT program combines our own wrapper with the implementation obtained from [21].

F. Performance evaluation
The performance of single-channel source separation using the DGT was evaluated with three popular ratios: the Signal-to-Interference Ratio (SIR), the Signal-to-Artifact Ratio (SAR), and, for the total error, the Signal-to-Distortion Ratio (SDR) [30], [31] (Eq. 10-Eq. 12). In this work, the performance of the DGT was compared to that of the STFT within the same NMF framework.
The three quantities used in these ratios can be calculated by Eq. 13-Eq. 15. Here, starget(t) denotes the target signal, einterf is the error due to interference, eartif is the error due to artifacts, and ||.||^2 is the squared 2-norm. {si(t)} denotes the original anechoic signals, sj(t) is the anechoic target signal, ŝj(t) is the estimated target signal, and P(.) denotes the projection operator. Higher values of all three ratios indicate better performance.
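The decomposition behind the three ratios can be sketched as follows, assuming instantaneous (gain-only) projections and no noise term, in the spirit of [30], [31]. The function name `bss_ratios` is hypothetical, and published toolboxes additionally allow short distortion filters, so their values may differ from this sketch.

```python
import numpy as np

def bss_ratios(sources, estimate, j):
    """Return (SDR, SIR, SAR) in dB for `estimate` of source j.

    sources  : array (P, L) holding the original signals s_i(t)
    estimate : array (L,) holding the estimated target signal
    """
    S = np.asarray(sources, dtype=float)
    est = np.asarray(estimate, dtype=float)
    # Projection of the estimate onto the target source alone.
    s_target = (est @ S[j]) / (S[j] @ S[j]) * S[j]
    # Projection onto the span of all sources (least squares).
    coeffs, *_ = np.linalg.lstsq(S.T, est, rcond=None)
    p_all = S.T @ coeffs
    e_interf = p_all - s_target            # part explained by other sources
    e_artif = est - p_all                  # part explained by no source

    def db(num, den):
        return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

    sdr = db(s_target, e_interf + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf, e_artif)
    return sdr, sir, sar
```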

III. RESULTS AND DISCUSSION
The matrix used to mix the speech signals was randomly generated. The speech signals were resampled to 8 kHz and have a duration of around 2 seconds. They are anechoic and noise-free. If no initial value is provided for the basis spectra (dictionary), a random number generator is used for W and H. A sample mixture is shown in Figure 2.
In this work, random number generation is an iterative process: the random values were generated about 50,000 times and averaged. The W and H updates inside the NMF loop were iterated 200 times, and the separation process (the outer loop) was iterated 1,000 times. In total, this amounts to 1,000 × (50,000 + 200) = 50,200,000 iterations. The purpose of this large number of iterations is to support reliable conclusions. The number of frequency bins, window size, window type, and overlap factor are 1024, 512, Hanning, and 50 %, respectively.
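For these settings, the hop sizes and the zero-padded length required by the DGT can be worked out as below. This is a sketch that assumes exactly 2 s of audio at 8 kHz (the text says around 2 seconds) and that the 1024 frequency bins correspond to the M channels; `padded_length` is a hypothetical helper.

```python
from math import gcd

def padded_length(L, a, M):
    """Smallest length >= L that is divisible by both a and M."""
    step = a * M // gcd(a, M)           # least common multiple of a and M
    return -(-L // step) * step         # round L up to a multiple of step

fs, duration = 8000, 2                  # assumed: exactly 2 s at 8 kHz
win_len, M = 512, 1024                  # window size and frequency bins
a = win_len // 2                        # 50 % overlap gives a hop of 256
L = padded_length(fs * duration, a, M)  # 16000 samples padded to 16384
N = L // a                              # number of time shifts
b = L // M                              # frequency hop
print(L, N, b)                          # 16384 64 16
```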

A. Supervised NMF
In this scheme, two speech signals are used to build the initial dictionary, W, as shown in Figure 3. Only 1000 samples are shown for clearer visualization. The goal is to estimate H and compute the update of W. The number of basis vectors, K, was 25 for every source. Figure 4 shows the approximation of the original T-F representation (matrix) of a mixture. The T-F matrix consists of complex values calculated using the DGT. Another term for this matrix is the spectrogram. The spectrogram is frequently calculated using the STFT, but this work shows how the DGT can be used to yield a spectrogram. The spectrogram shows the distribution of frequencies in every time frame. A yellowish color depicts the presence of particular frequencies.
Because the DGT employs the Fourier transform in its process, just as the STFT does, the dictionary matrices are filled with the estimated non-negative (power) spectrum, as shown in Figure 4(b) and Figure 4(d). The activation matrices, whose elements are also non-negative, are shown in Figure 4(c) and Figure 4(e); they contain the weights that control the contribution of every dictionary element to the mixture. This is why the DGT can be grafted into this single-channel source separation problem seamlessly.
Figure 5 shows the reconstruction of the successfully separated sources. The reconstructions of the first and second sources are close to the originals. According to Figure 5(b), the reconstructed source even follows the flow of the original sequence. Problems with the reconstruction mostly occur in the detail, that is, in the high-frequency parts, which change rapidly from one frequency to another. However, the reconstruction can still estimate where the flow is heading. The other difference between the reconstructed and original sources is in the amplitudes. This difference only reflects the strength of the corresponding portion of sound in the speech signal.
To aid the reconstruction before the components are inverted to the time domain, a Wiener filter is applied. The filter coefficients are selected from W and H according to K. The filter is multiplied with the power spectrum (non-negative matrix) of the mixture. The masks for the first and second sources are shown in Figure 6.
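A minimal sketch of this masking step, assuming that the columns of W and the rows of H are grouped in blocks of K per source and that each mask is the ratio of one source's model to the sum over all sources. Applying the mask to the complex T-F matrix is one way to reuse the mixture phase. The function name and `eps` are illustrative assumptions.

```python
import numpy as np

def wiener_separate(X, W, H, K, eps=1e-12):
    """Split a complex T-F matrix X into per-source estimates via soft masks."""
    P = W.shape[1] // K                        # number of sources
    models = [W[:, p * K:(p + 1) * K] @ H[p * K:(p + 1) * K, :]
              for p in range(P)]               # non-negative model per source
    total = sum(models) + eps                  # eps avoids division by zero
    masks = [m / total for m in models]        # values in [0, 1]
    estimates = [mask * X for mask in masks]   # phase of the mixture is kept
    return estimates, masks
```

Each element of `estimates` can then be passed to the inverse transform to obtain a time-domain source.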
Figure 7 compares the DGT and the STFT. The overall performance is measured using the three popular ratios. The benefit of these ratios is their ability to neglect phase changes in the reconstructed sources [21]-[23]. In the case of speech, the phase change is most often the known value π, which does not change how the signal sounds. Nevertheless, this is risky when other evaluation measures are used to quantify the error between the reconstructed and original signals. NMF-DGT outperforms NMF-STFT in SIR, SAR, and SDR, with scores of 18.60 dB compared to 16.24 dB, 13.77 dB compared to 13.69 dB, and 12.45 dB compared to 11.16 dB, respectively. The SIR and SAR values indicate that the signal reconstructed using the DGT contains less interference and slightly fewer artifacts than the one reconstructed using the STFT. When combined, the total error of the DGT is lower than that of the STFT, as indicated by an SDR gain of about 1.29 dB. We conclude that, for supervised NMF, the DGT performs better than the STFT, consistent with [25]-[28].

B. Unsupervised NMF
The difference between supervised and unsupervised NMF lies in the way the initial value of W is provided. Unlike in the supervised scheme explained previously, the initial value of W is here generated randomly. However, the value of K is predetermined. In this case, K is set to 40 for every source.
The pairs of dictionary and activation matrices are shown in Figure 8. They vary considerably more than in the supervised scheme. The overlay of the original and reconstructed sources is shown in Figure 9. As in the supervised scheme, the reconstructed sources have the same phase as the originals. However, according to Figure 9(a), the reconstructed sources seem to lose their details. The loss arises because the initial dictionary lacks information about the low-amplitude high frequencies that occupy certain time intervals. This happens to both reconstructed sources. The masks created from the dictionary and activation matrices are shown in Figure 10.
The comparison with NMF-STFT is shown in Figure 11. Again, NMF-DGT outperforms NMF-STFT, although the difference is small. The scores are very low for both T-F representations; moreover, the SAR and SDR lie below zero. Negative values for these two measures mean that artifacts and interference make the total error larger than the target signal itself. The reason is that the randomly initialized dictionary of basis spectra (frequency bins) cannot estimate the true basis spectra correctly, so more interference is preserved and more artifacts are induced. Indeed, the probability that a randomly generated initial dictionary would lead to the correct frequencies is extremely low. Overall, supervised NMF performs better than unsupervised NMF [25]-[28].

IV. CONCLUSION
The DGT performs better than the STFT in single-channel source separation under both schemes, supervised and unsupervised NMF. Although NMF-DGT also exhibits low performance in unsupervised NMF, it is still better than NMF-STFT. This shows that the DGT can replace the STFT in single-channel source separation, especially in the NMF framework. This T-F representation may also be beneficial in a wide area of the signal processing field. In the future, the DGT will be utilized to design a novel method to estimate the mixing matrix based on active single-point estimation.