Xingyu ZHANG Xia ZOU Meng SUN Penglong WU Yimin WANG Jun HE
To improve the noise robustness of automatic speaker recognition, many speech and feature enhancement techniques based on deep neural networks (DNNs) have been explored. In this work, a DNN multi-level enhancement (DNN-ME), which consists of the stages of signal enhancement, cepstrum enhancement and i-vector enhancement, is proposed for text-independent speaker recognition. Because these enhancement methods are applied at different stages of the speaker recognition pipeline, it is worth exploring their complementary roles, which helps clarify the pros and cons of enhancement at each stage. To exploit the capabilities of DNN-ME as fully as possible, two approaches, Cascaded DNN-ME and joint input of DNNs, are studied. Weighted Gaussian mixture models (WGMMs), proposed in our previous work, are also applied to further improve the model's performance. Experiments conducted on the Speakers in the Wild (SITW) database show that DNN-ME is significantly superior to systems with only a single enhancement for noise-robust speaker recognition. Compared with the i-vector baseline, the equal error rate (EER) was reduced from 5.75 to 4.01.
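The EER reported above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how such a figure can be approximated from trial scores; the function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper. The last line only restates the reported reduction in relative terms.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy scores, invented purely for illustration.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)

# Relative EER reduction implied by the reported figures (5.75 -> 4.01).
relative_reduction = (5.75 - 4.01) / 5.75
```

With the reported figures, the relative reduction works out to roughly 30%.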
Wenkai LIU Cuizhu QIN Menglong WU Wenle BAI Hongxia DONG
Pose estimation is a research hot spot in computer vision tasks and the key to computer perception of human activities. The core concept of human pose estimation involves describing the motion of the human body through major joint points. Large receptive fields and rich spatial information facilitate the keypoint localization task, and how to capture features on a larger scale and reintegrate them into the feature space is a challenge for pose estimation. To address this problem, we propose a multi-scale convergence network (MSCNet) with a large receptive field and rich spatial information. The structure of the MSCNet is based on an hourglass network that captures information at different scales to present a consistent understanding of the whole body. The multi-scale receptive field (MSRF) units provide a large receptive field to obtain rich contextual information, which is then selectively enhanced or suppressed by the Squeeze-Excitation (SE) attention mechanism to flexibly perform the pose estimation task. Experimental results show that MSCNet scores 73.1% AP on the COCO dataset, an 8.8% improvement compared to the mainstream CMUPose method. Compared to the advanced CPN, the MSCNet has 68.2% of the computational complexity and only 55.4% of the number of parameters.
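The Squeeze-Excitation step mentioned above can be illustrated with a small sketch. It assumes the standard SE formulation (global average pooling, a bottleneck layer with ReLU, then a sigmoid gate per channel) with bias terms omitted; the function name `se_attention` and the hand-picked weights are hypothetical and do not reflect MSCNet's actual configuration.

```python
import math

def se_attention(feature_map, w_reduce, w_expand):
    """feature_map: list of C channels, each an H x W list of lists."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: bottleneck layer + ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates

# Two 2x2 channels and hand-picked weights (hypothetical, not trained).
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_reduce = [[0.5, 0.5]]       # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channel gates
scaled, gates = se_attention(fmap, w_reduce, w_expand)
```

In this toy case the first channel receives a gate above 0.5 (enhanced relative to the second) and the second a gate below 0.5 (suppressed).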
Wenkai LIU Lin ZHANG Menglong WU Xichang CAI Hongxia DONG
The goal of Acoustic Scene Classification (ASC) is to simulate human analysis of the surrounding environment and make accurate decisions promptly. Extracting useful information from audio signals in real-world scenarios is challenging and can lead to suboptimal performance in acoustic scene classification, especially in environments with relatively homogeneous backgrounds. To address this problem, we model the sobering-up process of “drunkards” in real life and the guiding behavior of normal people, and construct a high-precision lightweight model implementation methodology called the “drunkard methodology”. The core idea includes three parts: (1) designing a special feature transformation module based on the different mechanisms of information perception between drunkards and ordinary people, to simulate the process of gradually sobering up and the changes in feature perception ability; (2) studying a lightweight “drunken” model that matches the normal model's perception process, using a multi-scale class residual block structure that obtains finer feature representations by fusing information extracted at different scales; (3) introducing a guiding and fusion module from the conventional model to the “drunken” model to speed up the sobering-up process and achieve iterative optimization and accuracy improvement. Evaluation results on the official dataset of DCASE2022 Task1 demonstrate that our baseline system achieves 40.4% accuracy and 2.284 loss with 442.67K parameters and 19.40M multiply-accumulate operations (MACs). After adopting the “drunkard” mechanism, the accuracy improves to 45.2% and the loss is reduced by 0.634, with 551.89K parameters and 23.6M MACs.
Menglong WU Yongfa XIE Yongchao SHI Jianwen ZHANG Tianao YAO Wenkai LIU
Direct-current biased optical orthogonal frequency division multiplexing (DCO-OFDM) converts bipolar OFDM signals into unipolar non-negative signals by introducing a high DC bias, which satisfies the requirement that the signal transmitted by intensity modulated/direct detection (IM/DD) must be positive. However, the high DC bias results in low power efficiency of DCO-OFDM. In this letter, an adaptively biased optical OFDM is proposed, in which the bias is set according to the signal amplitude to improve power efficiency. The adaptive bias does not need to be removed deliberately at the receiver, and the interference it causes falls only on the reserved subcarriers, so it does not affect the effective information. Moreover, the proposed OFDM uses the Hartley transform instead of the Fourier transform used in conventional optical OFDM, which gives it low computational complexity and high spectral efficiency. The simulation results show that the normalized optical bit energy to noise power ratio (Eb(opt)/N0) required by the proposed OFDM at a bit error rate (BER) of 10^-3 is, on average, 7.5 dB and 3.4 dB lower than that of DCO-OFDM and superimposed asymmetrically clipped optical OFDM (ACO-OFDM), respectively.
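For context, the conventional DCO-OFDM baseline described in the first sentence can be sketched as follows. This is not the adaptively biased, Hartley-based scheme proposed in the letter; it is a plain Fourier-based illustration in which the function name `dco_ofdm_symbol` and the `bias_factor` parameter (bias as a multiple of the signal's standard deviation) are assumptions made for the example.

```python
import cmath
import math

def dco_ofdm_symbol(data_symbols, bias_factor=2.0):
    """Build one real, non-negative DCO-OFDM time-domain symbol."""
    half = len(data_symbols)
    n = 2 * (half + 1)
    # Hermitian symmetry: X[n-k] = conj(X[k]) with X[0] = X[n/2] = 0,
    # which makes the inverse DFT output real-valued with zero mean.
    spectrum = [0j] * n
    for k, sym in enumerate(data_symbols, start=1):
        spectrum[k] = sym
        spectrum[n - k] = sym.conjugate()
    signal = [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]
    # Fixed DC bias proportional to the standard deviation of the signal.
    sigma = math.sqrt(sum(x * x for x in signal) / n)
    bias = bias_factor * sigma
    # Any negative peaks left after biasing are clipped at zero.
    return [max(0.0, x + bias) for x in signal], bias

# Three QPSK-like symbols, chosen arbitrarily for illustration.
symbol, bias = dco_ofdm_symbol([1 + 1j, -1 + 1j, 1 - 1j])
```

Because the bias is fixed regardless of the instantaneous amplitude, much of the transmitted optical power carries no information, which is the inefficiency the abstract refers to.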
Menglong WU Jianwen ZHANG Yongfa XIE Yongchao SHI Tianao YAO
Direct-current biased optical orthogonal frequency division multiplexing (DCO-OFDM) exhibits a high peak-to-average power ratio (PAPR), which leads to nonlinear distortion in the system. To address this problem, this study proposes a scheme that combines direct-current biased optical orthogonal frequency division multiplexing with index modulation (DCO-OFDM-IM) and a convex optimization algorithm. The proposed scheme uses partially activated subcarriers to transmit constellation-modulated symbol information, and transmits additional symbol information through the combination of activated carrier indices. Additionally, a dither signal is added to the system’s idle subcarriers, and the convex optimization algorithm is applied to solve for the optimal values of this dither signal. With the system’s peak power held unchanged, the scheme raises the average transmission power and thus reduces the PAPR. Experimental results indicate that at a complementary cumulative distribution function (CCDF) of 10^-4, the proposed scheme reduces the PAPR by approximately 3.5 dB compared to the conventional DCO-OFDM system. Moreover, at a bit error rate (BER) of 10^-3, the proposed scheme lowers the required signal-to-noise ratio (SNR) by about 1 dB relative to the conventional DCO-OFDM system. The proposed scheme therefore achieves a more substantial reduction in PAPR and better BER performance than the conventional DCO-OFDM approach.
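The two metrics used above, PAPR and its CCDF, can be illustrated with a short sketch. The function names `papr_db` and `ccdf` and the toy sample vectors are hypothetical; the example only shows the definitions and the effect of raising average power at a fixed peak, not the paper's dither optimization.

```python
import math

def papr_db(samples):
    """Peak-to-average power ratio of one symbol, in dB."""
    powers = [abs(x) ** 2 for x in samples]
    return 10.0 * math.log10(max(powers) / (sum(powers) / len(powers)))

def ccdf(papr_values, threshold_db):
    """Empirical probability that the PAPR exceeds the threshold."""
    return sum(p > threshold_db for p in papr_values) / len(papr_values)

flat = [1.0, 1.0, 1.0, 1.0]     # constant envelope: peak equals average
peaky = [2.0, 0.0, 0.0, 0.0]    # single peak: peak power is 4x the average
boosted = [2.0, 1.0, 1.0, 1.0]  # same peak as `peaky`, higher average power

values = [papr_db(flat), papr_db(peaky)]
exceed_3db = ccdf(values, 3.0)
```

Here `boosted` keeps the peak of `peaky` but has a higher average power, so its PAPR is lower, which mirrors the mechanism the abstract describes.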
Ying KANG Cong LIU Ning WANG Dianxi SHI Ning ZHOU Mengmeng LI Yunlong WU
Siamese visual tracking, viewed as a problem of maximum-similarity matching to the target template, has attracted increasing attention in computer vision. However, current Siamese trackers struggle to balance accuracy in real-time tracking with robustness in long-term tracking. This work proposes a new Siamese-based tracker with a dual-pipeline correlated fusion network (named ADF-SiamRPN), which consists of an initial template for robust correlation and a transient template with adaptive optimal feature selection for accurate correlation. A subsequent learnable correlation-response fusion network is then used to pursue an overall improvement in tracking performance. To compare the performance of ADF-SiamRPN with state-of-the-art trackers, we conduct extensive experiments on benchmarks including OTB100, UAV123, VOT2016, VOT2018, GOT-10k, LaSOT and TrackingNet. The experimental results demonstrate that ADF-SiamRPN outperforms all the compared trackers and achieves the best balance between accuracy and robustness.
Menglong WU Cuizhu QIN Hongxia DONG Wenkai LIU Xiaodong NIE Xichang CAI Yundong LI
In many screen-to-camera communication (S2C) systems, barcode preprocessing is a significant prerequisite because barcodes may be deformed by various environmental factors. However, previous studies have focused on barcode detection under static conditions; to date, few studies have addressed dynamic conditions (for example, a barcode video stream, or a moving transmitter and receiver). Therefore, we present a detection and tracking method for dynamic barcodes based on a Siamese network. The backbone of the CNN in the Siamese network is improved by SE-ResNet. The detection accuracy reaches 89.5%, which stands out from other classical detection networks. The EAO reaches 0.384, which is better than previous tracking methods. The method is also superior to other methods in terms of accuracy and robustness. The SE-ResNet in this paper improves the EAO by 1.3% compared with ResNet in SiamMask. In addition, our method is not only applicable to static barcodes but also allows real-time tracking and segmentation of barcodes captured in dynamic situations.
Wentao LV Jiliang LIU Xiaomin BAO Xiaocheng YANG Long WU
The classification of warheads and decoys is a core technology in ballistic missile defense. Usually, a high range resolution is favorable for the development of the classification algorithm, but it requires a high sampling rate in fast time and thus leads to a heavy computational burden for data processing. In this paper, a novel method based on compressed sensing (CS) is presented to improve the range resolution of the target with low computational complexity. First, a tool for electromagnetic calculation, such as CST Microwave Studio, is used to simulate the frequency response of the electromagnetic scattering of the target. Second, the range-resolved signal of the target is acquired by further processing. Third, a greedy algorithm is applied to this signal. By iteratively searching for the maximum value of the signal rather than calculating the inner product with the raw echo, the scattering coefficients of the target can be reconstructed efficiently. A series of experimental results demonstrates the effectiveness of our method.
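The idea of iteratively searching for the maximum of the range-resolved signal can be illustrated with a generic greedy peak-picking loop. This is a sketch in the spirit of matching pursuit under stated assumptions, not the paper's algorithm: the function name `greedy_peaks`, the known response `kernel` (peak-normalized and centered), and the toy range profile are all hypothetical.

```python
def greedy_peaks(range_profile, num_scatterers, kernel):
    """Pick the largest residual sample, subtract its response, and repeat."""
    residual = list(range_profile)
    half = len(kernel) // 2
    found = []
    for _ in range(num_scatterers):
        # Search for the maximum magnitude instead of computing inner products.
        idx = max(range(len(residual)), key=lambda i: abs(residual[i]))
        amp = residual[idx]
        found.append((idx, amp))
        # Remove this scatterer's assumed response from the residual.
        for j, k in enumerate(kernel):
            pos = idx + j - half
            if 0 <= pos < len(residual):
                residual[pos] -= amp * k
    return found, residual

# Toy profile: two scatterers blurred by an assumed three-tap response.
kernel = [0.5, 1.0, 0.5]
profile = [0.0] * 12
for pos, amp in [(3, 2.0), (8, 1.0)]:
    for j, k in enumerate(kernel):
        profile[pos + j - 1] += amp * k

found, residual = greedy_peaks(profile, 2, kernel)
```

In this toy case the loop recovers both scatterer positions and amplitudes and leaves a zero residual; with overlapping responses or noise, a greedy search of this kind is only approximate.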