Open Access
CNN-Based Feature Integration Network for Speech Enhancement in Microphone Arrays

Ji XI, Pengxu JIANG, Yue XIE, Wei JIANG, Hao DING


Summary:

Models based on convolutional neural networks (CNNs) have proven to be effective solutions for speech enhancement. However, CNNs for microphone arrays remain underexplored, especially with respect to the correlation between the networks associated with different microphones. In this paper, we propose a CNN-based feature integration network for speech enhancement in microphone arrays. The network input is composed of the short-time Fourier transforms (STFTs) of the different microphone signals. The CNN comprises encoding layers, decoding layers, and a skip structure. In addition, the designed feature integration layer enables information exchange between different microphones, and the designed feature fusion layer integrates the additional information. Experiments demonstrate the superiority of the designed structure.

Publication
IEICE TRANSACTIONS on Information Vol.E107-D No.12 pp.1546-1549
Publication Date
2024/12/01
Publicized
2024/08/26
Online ISSN
1745-1361
DOI
10.1587/transinf.2024EDL8014
Type of Manuscript
LETTER
Category
Speech and Hearing

1.  Introduction

Multi-microphone noise reduction technology reduces the impact of environmental noise on speech signals by collecting and jointly processing the signals of multiple microphones, thereby improving the quality of speech communication [1]. A traditional single microphone is often subject to various kinds of environmental noise when collecting speech, such as human voices, car sounds, and wind. This noise distorts the speech signal and reduces speech recognition accuracy [2]. Multi-microphone noise reduction technology can eliminate or reduce the impact of such noise by fusing and processing the signals from multiple microphones.

Deep-learning-based multi-microphone noise reduction uses neural networks to process the audio signals recorded by multiple microphones in order to reduce noise interference [3]. It removes environmental noise efficiently and accurately by extracting the useful speech information from complex noise with deep learning models. Such techniques have already been widely applied [4]-[6].

Conventional deep-learning speech denoising models [7], [8] typically comprise an encoder and a decoder. The encoder transforms the noisy signal, while the decoder recovers a clean speech signal from the encoded information. The encoder and decoder are usually also interconnected via a skip structure. For noise reduction with multiple microphones, it is common practice to combine the data from all microphones as the input of a single model. However, this can prevent the system from retrieving information that is specific to individual microphones. An alternative is to establish a separate network for each microphone, but this configuration may lose the correlation information among the different microphones.

In this paper, we design a convolutional neural network (CNN) based feature integration network for speech enhancement in microphone arrays. The main structure of the model is shown in Fig. 1. The short-time Fourier transform (STFT) is the model input, and the CNN is used to capture its time-frequency information. The CNN consists of encoder and decoder layers [9] and includes a skip structure designed for the symmetric encoder. To enhance the acquisition of feature-related information across microphones, the model is divided into two paths. One path learns the features of the individual microphones separately, i.e., \(X_1\in R^{T\times F\times 1},\ X_2\in R^{T\times F\times 1},\ \ldots,\ X_n\in R^{T\times F\times 1}\), where \(n\) is the number of microphones, \(T\) is the time dimension, and \(F\) is the frequency dimension. The other path learns the combination \(X_A\) of all microphone features, where \(X_A \in R^{T\times F\times n}\). Because the noise reduction task requires learning the STFTs of multiple microphones, a CNN that learns the individual features separately may lose the associated time-frequency information. We therefore design a feature integration layer to replace the original skip mechanism; it gathers weighted information from all microphones and feeds it back to their channels. In addition, a feature fusion layer is devised to integrate features by computing weights over the output information of the several microphones. Finally, the outputs of the two paths are fused to fit the corresponding STFT of clean speech.
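As a point of reference for the two paths, the following minimal NumPy sketch (our own illustration; the letter publishes no code, and the variable names are hypothetical) shows how the per-microphone inputs \(X_1, \ldots, X_n\) and the combined input \(X_A\) can be arranged from the individual STFT patches.

```python
# Hedged sketch of the two-path input arrangement: each microphone contributes
# an STFT patch of shape (T, F, 1); the combined path stacks all patches on the
# channel axis to form X_A with shape (T, F, n).
import numpy as np

T, F, n = 8, 129, 4                                 # time frames, frequency bins, microphones
stfts = [np.random.randn(T, F) for _ in range(n)]   # placeholder STFT magnitudes, one per mic

X_single = [s[..., np.newaxis] for s in stfts]      # X_1 ... X_n, each of shape (T, F, 1)
X_A = np.stack(stfts, axis=-1)                      # combined input, shape (T, F, n)

assert X_single[0].shape == (T, F, 1)
assert X_A.shape == (T, F, n)
```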

Fig. 1  Illustration of the proposed model.

2.  System Description

2.1  Convolutional Neural Network

The designed baseline CNN consists of an encoder and a decoder [10]. The encoder is composed of several downsampling layers, batch normalization layers, and activation layers, and the decoder is composed of upsampling layers, batch normalization layers, and activation layers. In addition, each convolutional network includes two skip connections to its symmetric encoder layers. The encoding and decoding layers of the two CNN paths have the same parameter shapes.
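For orientation, a minimal Keras sketch of such an encoder-decoder CNN with a skip connection is given below. The layer counts, channel widths, kernel sizes, and the frequency size (padded to 128 so that the downsampling and upsampling line up) are illustrative assumptions and do not reproduce the configuration of Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, ch, transpose=False):
    # One encoder/decoder stage: (transposed) convolution + batch norm + activation.
    Conv = layers.Conv2DTranspose if transpose else layers.Conv2D
    x = Conv(ch, kernel_size=(3, 3), strides=(1, 2), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inp = layers.Input(shape=(8, 128, 1))            # (T, F, channels), F padded to 128 here
e1 = conv_block(inp, 16)                         # encoder: downsample along frequency
e2 = conv_block(e1, 32)
d1 = conv_block(e2, 16, transpose=True)          # decoder: upsample along frequency
d1 = layers.Concatenate()([d1, e1])              # skip connection to the symmetric encoder stage
d2 = conv_block(d1, 8, transpose=True)
out = layers.Conv2D(1, (1, 1), padding="same")(d2)
model = Model(inp, out)
```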

2.2  Feature Integration Layer

The feature integration layer connects the encoder and decoder layers of the CNN, replacing the skip structure. Its structure is shown in Fig. 2, with \(Skip(X_1),\ Skip(X_2),\ \ldots,\ Skip(X_n)\) as inputs. First, all input features are concatenated along the channel and frame dimensions. Then, a frame convolution and a channel convolution are applied to obtain the corresponding information for the different inputs. Specifically, the kernel size of the frame convolution is [1, k] and that of the channel convolution is [1, 1], where k corresponds to the frame dimension. Finally, the fusion of the features from the two paths serves as the output of the module. In addition, each CNN contains two feature integration layers, corresponding to convolutional layers of different depths.
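The letter does not fully specify the tensor layout of this module, so the Keras sketch below is only one possible reading: the skip tensors are concatenated along the channel axis (channel connection) and along the frame axis (frame connection), a channel convolution and a frame convolution are applied, and the two branches are summed. Frames sit on the first spatial axis here, so the [1, 1] and [1, k] kernels of the text appear as (1, 1) and (k, 1); the stride of the frame branch is our own assumption, chosen so that the two branch shapes match.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_integration(skips, k=8):
    """skips: list of n skip tensors, each of shape (batch, T, F, C)."""
    n = len(skips)
    c = skips[0].shape[-1]
    x_ch = layers.Concatenate(axis=-1)(skips)        # channel connection: (batch, T, F, n*C)
    x_fr = layers.Concatenate(axis=1)(skips)         # frame connection:   (batch, n*T, F, C)
    chan = layers.Conv2D(c, (1, 1), padding="same")(x_ch)                   # channel convolution
    frame = layers.Conv2D(c, (k, 1), strides=(n, 1), padding="same")(x_fr)  # frame convolution
    return layers.Add()([chan, frame])               # fuse the two paths as the module output

# Example with n = 4 microphones, T = 8 frames, F = 32 bins, C = 16 channels.
skips = [layers.Input(shape=(8, 32, 16)) for _ in range(4)]
out = feature_integration(skips, k=8)                # shape (batch, 8, 32, 16)
```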

Fig. 2  Illustration of the feature integration layer.

2.3  Feature Fusion Layer

The feature fusion layer integrates all features using a weight distribution over the input tensors. Its structure is shown in Fig. 3. The inputs of the feature fusion layer are the outputs of the microphone paths in the CNN. All input tensors first pass through a convolutional layer whose kernel size and number of kernels are both 1, which reduces the number of parameters. The reduced features are then concatenated to form a feature set \(X_\beta \in R^{n\times F}\), where \(n\) is the number of microphones and \(F\) corresponds to the frequency dimension. Subsequently, two dense layers map the input features to the specified space, and the weight coefficients of the different microphones are obtained with the \(SoftMax\) function:

\[\begin{align} y &= \sigma(X_\beta W_\theta) W_\phi, \tag{1} \\ \eta &= SoftMax(y) \in R^{n\times 1}, \tag{2} \end{align}\]

where \(W_\theta \in R^{F \times F}\), \(W_\phi \in R^{F \times 1}\), and \(\sigma\) is the \(sigmoid\) activation function. Finally, the input features, concatenated along the time dimension and denoted \(X_\alpha\), are multiplied by the corresponding \(\eta\) and accumulated to form the output \(X\) of the feature fusion layer.
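A minimal NumPy sketch of Eqs. (1) and (2) follows. It assumes that each microphone path has already been reduced to an \(F\)-dimensional vector by the 1 × 1 convolution, uses random placeholders for the weights and features, and illustrates the final multiply-and-accumulate step with hypothetical per-microphone outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_weights(X_beta, W_theta, W_phi):
    """X_beta: (n, F) reduced microphone features; returns eta of shape (n, 1)."""
    y = sigmoid(X_beta @ W_theta) @ W_phi      # Eq. (1): two dense mappings
    e = np.exp(y - y.max())                    # Eq. (2): softmax over the n microphones
    return e / e.sum()

n, F, T = 4, 129, 8
X_beta = np.random.randn(n, F)                 # placeholder reduced features
W_theta = np.random.randn(F, F)                # dense weights, F x F
W_phi = np.random.randn(F, 1)                  # dense weights, F x 1
eta = fusion_weights(X_beta, W_theta, W_phi)   # (n, 1), entries sum to 1

# Weighted sum of hypothetical per-microphone outputs with the learned eta,
# standing in for the multiply-and-accumulate step described above.
outputs = np.random.randn(n, T, F)             # placeholder per-microphone outputs
X = np.tensordot(eta[:, 0], outputs, axes=1)   # fused output, shape (T, F)
```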

Fig. 3  Illustration of the feature fusion layer.

3.  Experiments

3.1  Preprocessing

\(\mathbf{Dataset:}\) We use the CHiME3 [11] dataset to evaluate the performance of our proposed model. CHiME3 was developed as part of the 3rd CHiME Speech Separation and Recognition Challenge. We selected 7138 isolated English speech samples as the clean speech for our model and used four types of noise (cafe, street, bus, and pedestrian) as noise samples to randomly generate the noisy speech. All data are provided as 16-bit WAV files sampled at 16 kHz. The training set accounts for approximately 80% of the data.

In the simulation experiment, a far-field linear microphone array is placed in a simulated room of 6\(\times\)5\(\times\)3 m. The center of the microphone array is located at (2, 3, 1.5), and the distance between adjacent microphones is 0.02 m. The coordinates of the three microphones are (2.02, 3, 1.5), (2, 3, 1.5), and (1.98, 3, 1.5). The room reverberation is generated with the IMAGE [12] method, based on the Allen and Berkley image algorithm, and the reverberation time is 300 ms. The sampling rate of the speech signal is 16 kHz. The target sound source is located 1 m from the center of the microphone array at an incident direction of 90\(^\circ\). The interference source is 16 kHz white noise from the NOISEX-92 noise database, placed about 1.5 m from the array center in the 180\(^\circ\) direction, with a signal-to-interference ratio of 40 dB. In this simulated acoustic environment, a multi-microphone speech dataset is generated by feeding in different single-channel target speech signals. We thereby obtain three additional microphone inputs, for a total of four.

\(\mathbf{Feature}\) \(\mathbf{Generation:}\) To obtain the STFT, we use a periodic Hamming window with a length of 256 and a hop length of 64, removing the symmetric half of the spectrum to keep the first 129 points. In addition, each input consists of the current noisy STFT vector plus the previous seven noisy STFT vectors, so the size of one input is (129, 8, 1).
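The sketch below illustrates this feature generation under our own reading (function names and the use of magnitude spectra are assumptions; the letter does not state whether magnitude or complex STFT is used): a 256-point periodic Hamming window, a hop of 64, the first 129 bins, and inputs formed by stacking each frame with its seven predecessors into a (129, 8, 1) tensor.

```python
import numpy as np

def stft_frames(x, win_len=256, hop=64, bins=129):
    """Return magnitude STFT frames of shape (num_frames, bins)."""
    window = np.hamming(win_len + 1)[:-1]               # periodic Hamming window of length 256
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))[:, :bins]   # keep the first 129 points

def model_inputs(spec, context=8):
    """Stack each frame with its previous seven frames -> (steps, 129, 8, 1)."""
    steps = spec.shape[0] - context + 1
    out = np.stack([spec[t:t + context].T for t in range(steps)])   # (steps, 129, 8)
    return out[..., np.newaxis]                                     # (steps, 129, 8, 1)

x = np.random.randn(16000)       # one second of 16 kHz audio (placeholder)
spec = stft_frames(x)
X = model_inputs(spec)
print(X.shape)                   # (240, 129, 8, 1)
```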

\(\mathbf{Model}\) \(\mathbf{Parameters:}\) The baseline CNN is mainly based on [9]. The model parameters are trained with the Adam optimizer, using a batch size of 512 and a learning rate of 0.0001. The detailed parameters of the baseline CNN are shown in Table 1, where "Conv" denotes a convolutional layer.
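A hedged sketch of these training settings is shown below with a trivial stand-in model. The optimizer, learning rate, and batch size follow the text; the mean-squared-error loss and the placeholder data are our own assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Trivial stand-in for the full two-path model, used only to show the settings.
inp = layers.Input(shape=(129, 8, 1))
out = layers.Conv2D(1, (1, 1), padding="same")(inp)
toy = tf.keras.Model(inp, out)

toy.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")

X = np.random.randn(1024, 129, 8, 1).astype("float32")   # placeholder noisy inputs
Y = np.random.randn(1024, 129, 8, 1).astype("float32")   # placeholder clean-speech targets
toy.fit(X, Y, batch_size=512, epochs=1, verbose=0)
```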

Table 1  Proposed baseline CNN.

\(\mathbf{Evaluation}\) \(\mathbf{Metrics:}\) Short-Time Objective Intelligibility (STOI) [13] and Perceptual Evaluation of Speech Quality (PESQ) [14] are used to evaluate the designed model.
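In Python, these metrics are commonly computed with the third-party pystoi, pesq, and soundfile packages; the snippet below is purely illustrative and not part of the letter, and the file names are hypothetical.

```python
import soundfile as sf           # pip install soundfile
from pystoi import stoi          # pip install pystoi
from pesq import pesq            # pip install pesq

# 'clean.wav' and 'enhanced.wav' are hypothetical file names standing in for a
# time-aligned clean reference and the enhanced output (16 kHz, mono).
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

stoi_score = stoi(clean, enhanced, fs, extended=False)   # in [0, 1], higher is better
pesq_score = pesq(fs, clean, enhanced, "wb")             # wideband PESQ, roughly [-0.5, 4.5]
print(stoi_score, pesq_score)
```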

3.2  Experiment

The main contribution of this article is a CNN structure for multi-microphone noise reduction and, based on this structure, a feature integration layer and a feature fusion layer. To verify the proposed structure and related algorithms, Table 2 reports the multi-microphone denoising experiments. The experimental configurations are as follows.

Table 2  Results of multiple microphones system.

  • CNN-A: \(X_A\) is the input of the baseline CNN; this corresponds to only the bottom path in Fig. 1.

  • CNN-B: The CNN structure proposed in this article, as shown in Fig. 1 but without the feature integration layer and feature fusion layer; the skip structure connects the encoder and decoder layers.

  • CNN (w/FF): The proposed CNN architecture with the feature fusion layer.

  • CNN (w/FI): The proposed CNN architecture with the feature integration layer.

  • CNN MM: The full model shown in Fig. 1, including all proposed modules.

From the results, we can conclude that the proposed CNN structure improves both PESQ and STOI. First, compared to CNN-A, the PESQ and STOI of CNN-B increase by 0.26% and 0.05%, respectively, indicating the necessity of considering both the individual microphone information and the combined microphone information. In addition, CNN (w/FF) and CNN (w/FI) further improve the PESQ and STOI values, indicating the effectiveness of the proposed modules. CNN MM achieves the best performance, demonstrating the superiority of the proposed multi-microphone noise reduction architecture.

To further explore the performance of the designed model, we analyze the denoising effects of CNN-A and CNN MM in both the time and frequency domains. Figure 4 shows the denoising results of the different models after noise is added to the original speech. Figure 5 compares the noise reduction of the different models in the frequency domain under different noise environments, where 'BUS', 'CAF', 'PED', and 'STR' refer to the noise environments 'on the bus', 'cafe', 'pedestrian area', and 'street', respectively. From the waveforms of the denoised audio, we can see that, for silent segments, CNN MM eliminates environmental noise more effectively than CNN-A, which confirms the effectiveness of the proposed feature integration and feature fusion layers. Moreover, the noise reduction spectra for the different noises show that the denoising performance improves greatly at low frequencies, especially for 'CAF' and 'STR': CNN MM removes more irrelevant low-frequency content, which confirms the effectiveness of the proposed architecture.

Fig. 4  Waveform samples.

Fig. 5  Spectrogram samples.

Next, we compare the proposed method with models of similar structure, including DDAEC [7] and CRN [8], with \(X_A\) as the model input. The results of all experiments are compared in Table 3. Even against speech enhancement models with similar structures, our proposed model retains a performance advantage, which demonstrates the superiority of the proposed multi-microphone speech enhancement model.

Table 3  Comparison of different methods.

4.  Conclusion

This paper presented a CNN-based feature integration network for speech enhancement in microphone arrays. The STFT is the model input, and a CNN with an encoder, a decoder, and a skip structure serves as the baseline model. In addition, we designed a feature integration layer in the multi-microphone CNN path to replace the original skip structure, and a feature fusion layer to fuse the information of the different microphones. Multiple experimental results have demonstrated the superiority of the designed model.

References

[1] N. Das, S. Chakraborty, J. Chaki, N. Padhy, and N. Dey, “Fundamentals, present and future perspectives of speech enhancement,” International Journal of Speech Technology, vol.24, no.4, pp.883-901, 2021.

[2] S. Vihari, A.S. Murthy, P. Soni, and D.C. Naik, “Comparison of speech enhancement algorithms,” Procedia Computer Science, vol.89, pp.666-676, 2016.

[3] X. Cui, Z. Chen, and F. Yin, “Multi-objective based multi-channel speech enhancement with BiLSTM network,” Applied Acoustics, vol.177, p.107927, 2021.

[4] H. Taherian, Z.-Q. Wang, J. Chang, and D.L. Wang, “Robust speaker recognition based on single-channel and multi-channel speech enhancement,” IEEE/ACM Trans. Audio, Speech, Language Process., vol.28, pp.1293-1302, 2020.

[5] K. Tan, X. Zhang, and D.L. Wang, “Deep learning based real-time speech enhancement for dual-microphone mobile phones,” IEEE/ACM Trans. Audio, Speech, Language Process., vol.29, pp.1853-1863, 2021.

[6] M. Barhoush, A. Hallawa, A. Peine, L. Martin, and A. Schmeink, “Localization-Driven Speech Enhancement in Noisy Multi-Speaker Hospital Environments Using Deep Learning and Meta Learning,” IEEE/ACM Trans. Audio, Speech, Language Process., vol.31, pp.670-683, 2022.

[7] A. Pandey and D. Wang, “Densely Connected Neural Network with Dilated Convolutions for Real-Time Speech Enhancement in The Time Domain,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp.6629-6633, 2020.

[8] K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement,” Proc. Interspeech, pp.3229-3233, Sept. 2018.

[9] S.R. Park and J.W. Lee, “A fully convolutional neural network for speech enhancement,” arXiv preprint arXiv:1609.07132, 2016.

[10] A. Pandey and D.L. Wang, “Dense CNN with self-attention for time-domain speech enhancement,” IEEE/ACM Trans. Audio, Speech, Language Process., vol.29, pp.1270-1279, 2021.

[11] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, pp.504-511, 2015.

[12] E.A. Lehmann and A.M. Johansson, “Diffuse reverberation model for efficient image-source simulation of room impulse responses,” IEEE Trans. Audio, Speech, Language Process., vol.18, no.6, pp.1429-1439, 2010.

[13] C.H. Taal, R.C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp.4214-4217, 2010.

[14] M. Torcoli, T. Kastner, and J. Herre, “Objective measures of perceptual audio quality reviewed: An evaluation of their application domain dependence,” IEEE/ACM Trans. Audio, Speech, Language Process., vol.29, pp.1530-1541, 2021.

Authors

Ji XI
  Changzhou Institute of Technology
Pengxu JIANG
  Southeast University
Yue XIE
  Nanjing Institute of Technology
Wei JIANG
  Changzhou Institute of Technology
Hao DING
  Changzhou Institute of Technology
