Hongliang FU Qianqian LI Huawei TAO Chunhua ZHU Yue XIE Ruxue GUO
Speech emotion recognition (SER) is a key technology for realizing the third generation of artificial intelligence and is widely used in human-computer interaction, emotion diagnosis, interpersonal communication, and other fields. However, the aliasing of language and semantic information in speech tends to distort the alignment of emotion features, which degrades the performance of cross-corpus SER systems. This paper proposes a cross-corpus SER model based on causal emotion information representation (CEIR). The model uses the reconstruction loss of a deep autoencoder network and the source-domain label information to achieve a preliminary separation of causal features. Then, a causal correlation matrix is constructed and combined with local maximum mean discrepancy (LMMD) feature alignment to make the causal features of different dimensions jointly independent. Finally, supervised fine-tuning on labeled data is used to extract causal emotion information effectively. Experimental results show that the average unweighted average recall (UAR) of the proposed algorithm is 3.4% to 7.01% higher than that of several recent algorithms in the field.
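Since the result above is reported as unweighted average recall (UAR), a minimal sketch of that metric may help: it is the mean of the per-class recalls, so each emotion class counts equally regardless of how many utterances it has. The labels and predictions below are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally, whatever its size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): 4 "angry" and 2 "sad" utterances.
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "happy"]

# Recall is 3/4 for "angry" and 1/2 for "sad", so UAR is their plain mean.
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On imbalanced corpora UAR and plain accuracy diverge (here 0.625 against roughly 0.667), which is one reason cross-corpus SER work tends to report UAR.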
Yoshinori TANAKA Takashi DATEKI
Efficient multiplexing of ultra-reliable and low-latency communications (URLLC) and enhanced mobile broadband (eMBB) traffic, as well as ensuring the various reliability requirements of these traffic types in 5G wireless communications, is becoming increasingly important, particularly for vertical services. Interference management techniques, such as coordinated inter-cell scheduling, can enhance reliability in dense cell deployments. However, tight inter-cell coordination necessitates frequent information exchange between cells, which limits implementation. This paper introduces a novel RAN slicing framework based on centralized frequency-domain interference control per slice and link adaptation optimized for URLLC. The proposed framework does not require tight inter-cell coordination but can fulfill the requirements of both the decoding error probability and the delay violation probability of each packet flow. These controls are based on a power-law estimation of the lower tail distribution of a measured data set with a smaller number of discrete samples. As design guidelines, we derived a theoretical minimum radio resource size of a slice to guarantee the delay violation probability requirement. Simulation results demonstrate that the proposed RAN slicing framework can achieve the reliability targets of the URLLC slice while improving the spectrum efficiency of the eMBB slice in a well-balanced manner compared to other evaluated benchmarks.
Tetsuo KOSAKA Kazuya SAEKI Yoshitaka AIZAWA Masaharu KATO Takashi NOSE
Emotional speech recognition is generally considered more difficult than non-emotional speech recognition. The acoustic characteristics of emotional speech differ from those of non-emotional speech. Additionally, acoustic characteristics vary significantly depending on the type and intensity of emotions. Regarding linguistic features, emotional and colloquial expressions are also observed in their utterances. To solve these problems, we aim to improve recognition performance by adapting acoustic and language models to emotional speech. We used Japanese Twitter-based Emotional Speech (JTES) as an emotional speech corpus. This corpus consisted of tweets and had an emotional label assigned to each utterance. Corpus adaptation is possible using the utterances contained in this corpus. However, regarding the language model, the amount of adaptation data is insufficient. To solve this problem, we propose an adaptation of the language model by using online tweet data downloaded from the internet. The sentences used for adaptation were extracted from the tweet data based on certain rules. We extracted the data of 25.86 M words and used them for adaptation. In the recognition experiments, the baseline word error rate was 36.11%, whereas that with the acoustic and language model adaptation was 17.77%. The results demonstrated the effectiveness of the proposed method.
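The word error rates quoted above can be put in context with a short sketch of how WER is usually computed (word-level edit distance divided by the reference length) and of the relative reduction implied by the two reported figures. The example sentences are placeholders, not JTES data.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences: one substitution and one deletion over four reference words.
example_wer = word_error_rate("a b c d", "a x c")

# Relative WER reduction implied by the figures reported in the abstract.
baseline_wer, adapted_wer = 36.11, 17.77
relative_reduction = (baseline_wer - adapted_wer) / baseline_wer
```

With the reported numbers, the relative reduction comes out at roughly 51%.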
Hiroki KAWAKAMI Hirohisa WATANABE Keisuke SUGIURA Hiroki MATSUTANI
High-performance deep neural network (DNN)-based systems are in high demand in edge environments. Because of their high computational complexity, it is challenging to deploy DNNs on edge devices with strict limitations on computational resources. In this paper, we derive a compact yet highly accurate DNN model, termed dsODENet, by combining two recently proposed parameter reduction techniques: Neural ODE (Ordinary Differential Equation) and DSC (Depthwise Separable Convolution). Neural ODE exploits a similarity between ResNet and ODE and shares most of the weight parameters among multiple layers, which greatly reduces memory consumption. We apply dsODENet to domain adaptation as a practical use case with image classification datasets. We also propose a resource-efficient FPGA-based design for dsODENet, in which all the parameters and feature maps except for those of the pre- and post-processing layers can be mapped onto on-chip memories. It is implemented on a Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, inference speed, FPGA resource utilization, and speedup rate compared to a software counterpart. The results demonstrate that dsODENet achieves comparable or slightly better domain adaptation accuracy compared to our baseline Neural ODE implementation, while the total parameter size without pre- and post-processing layers is reduced by 54.2% to 79.8%. Our FPGA implementation accelerates the inference speed by 23.8 times.
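To illustrate why depthwise separable convolution reduces parameters, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The layer shape (64 input channels, 64 output channels, 3x3 kernel) is an assumed example, biases are ignored, and the resulting ratio is unrelated to the 54.2% to 79.8% figures reported for dsODENet.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a k x k convolution mapping c_in channels to c_out channels."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed layer shape, for illustration only.
c_in, c_out, k = 64, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2
```

For this shape the separable layer needs 4,672 weights instead of 36,864, about 12.7% of the original.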
Yuto KIHIRA Yusuke KODA Koji YAMAMOTO Takayuki NISHIO
Broadcast services for wireless local area networks (WLANs) are being standardized in the IEEE 802.11 task group bc. Envisaging the upcoming coexistence of broadcast access points (APs) with densely-deployed legacy APs, this paper addresses a learning-based spatial reuse with only partial receiver-awareness. This partial awareness means that the broadcast APs can leverage few acknowledgment frames (ACKs) from recipient stations (STAs). This is in view of the specific concerns of broadcast communications. In broadcast communications for a very large number of STAs, ACK implosions occur unless some STAs are stopped from responding with ACKs. Given this, the main contribution of this paper is to demonstrate the feasibility to improve the robustness of learning-based spatial reuse to hidden interferers only with the partial receiver-awareness while discarding any re-training of broadcast APs. The core idea is to leverage robust adversarial reinforcement learning (RARL), where before a hidden interferer is installed, a broadcast AP learns a rate adaptation policy in a competition with a proxy interferer that provides jamming signals intelligently. Therein, the recipient STAs experience interference and the partial STAs provide a feedback overestimating the effect of interference, allowing the broadcast AP to select a data rate to avoid frame losses in a broad range of recipient STAs. Simulations demonstrate the suppression of the throughput degradation under a sudden installation of a hidden interferer, indicating the feasibility of acquiring robustness to the hidden interferer.
Han WANG Ruiliu FU Xuejun ZHANG Jun ZHOU Qingwei ZHAO
Lifelong language learning (LLL) aims at learning new tasks while retaining old ones in the field of NLP. LAMOL is a recent LLL framework that follows data-free constraints. Previous works built on LAMOL require additional computation, at the expense of more time or new parameters. However, a gap remains between them and multi-task learning (MTL), which is regarded as the upper bound of LLL. In this paper, we propose Metacognitive Adaptation (Metac-Adapt), which makes the model generate better pseudo samples and then replay them while adding almost no time cost or computational resources. Experimental results demonstrate that Metac-Adapt is on par with or better than MTL.
Han MA Qiaoling ZHANG Roubing TANG Lu ZHANG Yubo JIA
Recently, robust speech recognition for real-world applications has attracted much attention. This paper proposes a robust speech recognition method based on the teacher-student learning framework for domain adaptation. In particular, the student network will be trained based on a novel optimization criterion defined by the encoder outputs of both teacher and student networks rather than the final output posterior probabilities, which aims to make the noisy audio map to the same embedding space as clean audio, so that the student network is adaptive in the noise domain. Comparative experiments demonstrate that the proposed method obtained good robustness against noise.
Gang LI Shuren GUO Yi ZHOU Zaixiu YANG
The Regional Short Message Communication (RSMC) service of the BeiDou Navigation Satellite System (BDS) has been widely used in various fields. BDS-3 officially started to provide service in 2020, and the performance of the RSMC service was greatly improved, which offers an opportunity for large-scale applications of RSMC in consumer electronic products. Because of the complex application scenarios and the low-cost, low-power requirements of RSMC terminals, a better coding scheme is needed to improve performance. In this paper, we propose a new polar encoding scheme with low code rate and variable code length, which adopts Polarization Weight (PW) to generate the reliability sequence of polar codes and uses a Nested Rate Adaptation Sequence (NRAS) to realize rate adaptation for the BDS-3 RSMC. The coding gain and decoding complexity are analyzed through simulation and experiments. The results validate the effectiveness of this scheme. Compared with Turbo codes, the proposed polar coding scheme achieves a gain of about 0.5 dB at about 50% of the decoding complexity when the information length including CRC is 128 and the code rate is 1/2. The proposed polar coding scheme provides a good reference for further applications in BDS.
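The Polarization Weight construction mentioned above can be sketched as follows. This assumes the form of PW commonly given in the polar-coding literature, where the weight of channel index i is the sum of β^j over the set bits j of i, with β = 2^(1/4); the abstract does not state the value of β or the details of NRAS, so both the parameter and the function names here are illustrative.

```python
def polarization_weight(index, n_bits, beta=2 ** 0.25):
    """Sum of beta**j over the set bits j of the channel index (assumed PW form)."""
    return sum(beta ** j for j in range(n_bits) if (index >> j) & 1)


def reliability_sequence(n_bits):
    """Channel indices of a length-2**n_bits code, ordered least to most reliable."""
    block_length = 1 << n_bits
    return sorted(range(block_length),
                  key=lambda i: polarization_weight(i, n_bits))


def information_set(n_bits, k):
    """The k most reliable channel indices, which would carry information bits."""
    return sorted(reliability_sequence(n_bits)[-k:])


# Length-8 code with 4 information bits (rate 1/2).
sequence = reliability_sequence(3)
info_bits = information_set(3, 4)
```

Because the weight of an index does not depend on the block length, the ordering for a shorter code is embedded in the ordering for a longer one, which is what makes a single nested sequence usable across code lengths.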
Yang WANG Hongliang FU Huawei TAO Jing YANG Hongyi GE Yue XIE
This letter focuses on the cross-corpus speech emotion recognition (SER) task, in which the training and testing speech signals belong to different speech corpora. Existing algorithms are incapable of effectively extracting common sentiment information between different corpora to facilitate knowledge transfer. To address this challenging problem, a novel convolutional auto-encoder and adversarial domain adaptation (CAEADA) framework for cross-corpus SER is proposed. The framework first constructs a one-dimensional convolutional auto-encoder (1D-CAE) for feature processing, which explores the correlation among adjacent one-dimensional statistical features and enhances the feature representation through its encoder-decoder architecture. Subsequently, the adversarial domain adaptation (ADA) module alleviates the discrepancy between the feature distributions of the source and target domains by confusing a domain discriminator, and specifically employs maximum mean discrepancy (MMD) to better accomplish feature transformation. To evaluate the proposed CAEADA, extensive experiments were conducted on the EmoDB, eNTERFACE, and CASIA speech corpora, and the results show that the proposed method outperformed other approaches.
Jie ZHU Yuan ZONG Hongli CHANG Li ZHAO Chuangao TANG
Unsupervised domain adaptation (DA) is a challenging machine learning problem because the labeled training (source) and unlabeled testing (target) sets belong to different domains and therefore have different feature distributions. It has recently attracted wide attention in micro-expression recognition (MER). Although some well-performing unsupervised DA methods have been proposed, these methods cannot adequately solve the problem of unsupervised DA in MER, a.k.a. cross-domain MER. To deal with this challenging problem, in this letter we propose a novel unsupervised DA method called Joint Patch weighting and Moment Matching (JPMM). JPMM bridges the source and target micro-expression feature sets by minimizing their probability distribution divergence with a multi-order moment matching operation. Meanwhile, it takes advantage of the contributive facial patches through weight learning, such that a domain-invariant feature representation containing discriminative micro-expression information can be learned. Finally, we carry out extensive experiments to show that the proposed JPMM method is superior to recent state-of-the-art unsupervised DA methods in dealing with cross-domain MER.
In this paper, we propose rate adaptation mechanisms for robust and low-latency video transmissions exploiting multiple access point (Multi-AP) wireless local area networks (WLANs). The Multi-AP video transmissions employ link-level broadcast and packet-level forward error correction (FEC) in order to realize robust and low-latency video transmissions from a WLAN station (STA) to a gateway (GW). The PHY (physical layer) rate and FEC rate play a key role in controlling the trade-off between the achieved reliability and airtime (i.e., the occupancy period of the shared channel) for Multi-AP WLANs. In order to finely control this trade-off while improving the transmitted video quality, the proposed rate adaptation controls the PHY rate and FEC rate to be employed for Multi-AP transmissions based on the link quality and frame format of the conveyed video traffic. With computer simulations, we evaluate and investigate the effectiveness of the proposed rate adaptation in terms of packet delivery rate (PDR), airtime, delay, and peak signal-to-noise ratio (PSNR). Furthermore, the quality of video is assessed by using traffic encoded and decoded by an actual video encoder and decoder. All these results show that the proposed rate adaptation controls the trade-off between reliability and airtime well while offering high-quality and low-latency video transmissions.
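The reliability versus airtime trade-off described above can be made concrete with a small sketch. It assumes an ideal (n, k) packet-level erasure code, independent packet losses, and that a broadcast packet is lost only if every AP misses it; these modelling assumptions, the function names, and the numbers are illustrative and are not taken from the paper.

```python
import math


def combined_loss(per_ap_loss):
    """Loss probability when a broadcast packet is lost only if all APs miss it."""
    probability = 1.0
    for loss in per_ap_loss:
        probability *= loss
    return probability


def block_recovery_probability(n, k, packet_loss):
    """Probability that at least k of n FEC packets arrive (ideal erasure code)."""
    ok = 1.0 - packet_loss
    return sum(math.comb(n, i) * ok ** i * packet_loss ** (n - i)
               for i in range(k, n + 1))


def airtime_seconds(n, packet_bits, phy_rate_bps):
    """Channel occupancy of n packets, ignoring preambles and protocol overhead."""
    return n * packet_bits / phy_rate_bps


# Two APs with assumed per-link loss rates of 20% and 50%.
loss = combined_loss([0.2, 0.5])
no_fec = block_recovery_probability(2, 2, loss)    # 2 data packets, no redundancy
with_fec = block_recovery_probability(3, 2, loss)  # 1 extra repair packet
extra_airtime = airtime_seconds(3, 12000, 6e6) - airtime_seconds(2, 12000, 6e6)
```

In this toy setting one repair packet raises the recovery probability from 0.81 to 0.972 at the cost of 2 ms of additional airtime; a higher PHY rate would shrink the airtime but typically raise the per-link loss.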
Daming LIN Jie WANG Yundong LI
Rapid building damage identification plays a vital role in rescue operations when disasters strike, especially when rescue resources are limited. In the past years, supervised machine learning has made considerable progress in building damage identification. However, the usage of supervised machine learning remains challenging due to the following facts: 1) the massive samples from the current damage imagery are difficult to be labeled and thus cannot satisfy the training requirement of deep learning, and 2) the similarity between partially damaged and undamaged buildings is high, hindering accurate classification. Leveraging the abundant samples of auxiliary domains, domain adaptation aims to transfer a classifier trained by historical damage imagery to the current task. However, traditional domain adaptation approaches do not fully consider the category-specific information during feature adaptation, which might cause negative transfer. To address this issue, we propose a novel domain adaptation framework that individually aligns each category of the target domain to that of the source domain. Our method combines the variational autoencoder (VAE) and the Gaussian mixture model (GMM). First, the GMM is established to characterize the distribution of the source domain. Then, the VAE is constructed to extract the feature of the target domain. Finally, the Kullback-Leibler (KL) divergence is minimized to force the feature of the target domain to observe the GMM of the source domain. Two damage detection tasks using post-earthquake and post-hurricane imageries are utilized to verify the effectiveness of our method. Experiments show that the proposed method obtains improvements of 4.4% and 9.5%, respectively, compared with the conventional method.
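The KL-divergence step above can be illustrated with the closed-form divergence between two diagonal Gaussians, for instance a VAE posterior for a target sample and a single GMM component fitted on the source domain. Matching against one component at a time is a simplifying assumption for this sketch; the abstract does not spell out how the full mixture enters the loss.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians, summed over feature dimensions."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


# Hypothetical 2-D example: target posterior q against one source component p.
mu_q, var_q = [1.0, 0.0], [1.0, 1.0]
mu_p, var_p = [0.0, 0.0], [1.0, 1.0]
divergence = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
```

Minimizing this quantity pulls the target feature distribution toward the chosen source component; it is zero only when the two Gaussians coincide.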
Rintaro YANAGI Ren TOGO Takahiro OGAWA Miki HASEYAMA
Various cross-modal retrieval methods that can retrieve images related to a query sentence without text annotations have been proposed. Although a high level of retrieval performance is achieved by these methods, they have been developed for a single domain retrieval setting. When retrieval candidate images come from various domains, the retrieval performance of these methods might be decreased. To deal with this problem, we propose a new domain adaptive cross-modal retrieval method. By translating a modality and domains of a query and candidate images, our method can retrieve desired images accurately in a different domain retrieval setting. Experimental results for clipart and painting datasets showed that the proposed method has better retrieval performance than that of other conventional and state-of-the-art methods.
Siyang YU Kazuaki KONDO Yuichi NAKAMURA Takayuki NAKAJIMA Masatake DANTSUJI
This article introduces our investigation of learning state estimation in e-learning under the condition that visual observation and recording of a learner's behavior is possible. In this research, we examined methods of adaptation to a new learner for whom only a small amount of ground-truth data can be obtained.
Jiateng LIU Wenming ZHENG Yuan ZONG Cheng LU Chuangao TANG
In this letter, we propose a novel deep domain-adaptive convolutional neural network (DDACNN) model to handle the challenging cross-corpus speech emotion recognition (SER) problem. The framework of the DDACNN model consists of two components: a feature extraction model based on a deep convolutional neural network (DCNN) and a domain-adaptive (DA) layer added to the DCNN that utilizes the maximum mean discrepancy (MMD) criterion. We use labeled spectrograms from the source speech corpus combined with unlabeled spectrograms from the target speech corpus as the input of two classic DCNNs to extract the emotional features of speech, and train the model with a mixed loss that combines a cross-entropy loss and an MMD loss. Compared to other classic cross-corpus SER methods, the major advantage of the DDACNN model is that it can extract robust, time-frequency-related speech features from spectrograms and narrow the discrepancy between the feature distributions of the source and target corpora to obtain better cross-corpus performance. In several cross-corpus SER experiments, our DDACNN achieved state-of-the-art performance on three public emotional speech corpora and proved able to handle the cross-corpus SER problem efficiently.
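The mixed loss described above can be sketched as a cross-entropy term on labeled source samples plus a weighted MMD term between source and target features. The Gaussian kernel, its bandwidth `sigma`, the weight `lam`, and the biased MMD estimator are assumptions made for this illustration; the abstract does not specify them.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(first, second):
        total = sum(gaussian_kernel(x, y, sigma) for x in first for y in second)
        return total / (len(first) * len(second))

    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


def mixed_loss(source_probs, source_labels, source_feats, target_feats, lam=1.0):
    """Mean cross-entropy on labeled source data plus lam times squared MMD."""
    cross_entropy = -sum(math.log(p[y])
                         for p, y in zip(source_probs, source_labels))
    cross_entropy /= len(source_labels)
    return cross_entropy + lam * mmd_squared(source_feats, target_feats)


# Toy features (hypothetical): the target set is a shifted copy of the source set.
source_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_feats = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]
source_probs = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
source_labels = [0, 1, 0]
loss = mixed_loss(source_probs, source_labels, source_feats, target_feats, lam=0.5)
```

The MMD term vanishes when the two feature sets are identical, so gradient descent on the mixed loss pushes the network toward features that are both discriminative on the source corpus and similarly distributed across corpora.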
Hao LIANG Aijun LIU Heng WANG Kui XU
This Letter explores the adaptive hybrid automatic repeat request (HARQ) using rate-compatible polar codes constructed with a common information set. The rate adaptation problem is formulated using Markov decision process and solved by a dynamic programming framework in a low-complexity way. Simulation verifies the throughput efficiency of the proposed adaptive HARQ.
Haitong YANG Guangyou ZHOU Tingting HE Maoxi LI
In this paper, we study domain adaptation of semantic role classification. Most systems use supervised methods for semantic role classification. However, these methods often suffer severe performance drops on out-of-domain test data. The reason for the performance drops is that there are large feature differences between the source and target domains. This paper proposes a framework called Adversarial Domain Adaption Network (ADAN) to address domain adaptation in semantic role classification. The idea behind our method is that the proposed framework can derive domain-invariant features via adversarial learning and narrow the gap between the source and target feature spaces. To evaluate our method, we conduct experiments on the English portion of the CoNLL 2009 shared task. Experimental results show that our method can largely reduce the performance drop on out-of-domain test data.
Xiuzhen CHEN Xiaoyan ZHOU Cheng LU Yuan ZONG Wenming ZHENG Chuangao TANG
For cross-corpus speech emotion recognition (SER), a crucial issue is how to obtain an effective feature representation that eliminates the discrepancy between the feature distributions of the source and target domains. In this paper, we propose a Target-adapted Subspace Learning (TaSL) method for cross-corpus SER. The TaSL method tries to find a projection subspace in which the features regress the labels more accurately and the gap between the feature distributions of the target and source domains is bridged effectively. Then, in order to obtain a better projection matrix, ℓ1 norm and ℓ2,1 norm penalty terms are added to different regularization terms, respectively. Finally, we conduct extensive experiments on three public corpora: EmoDB, eNTERFACE, and AFEW 4.0. The experimental results show that our proposed method achieves better performance than state-of-the-art methods in cross-corpus SER tasks.
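For readers unfamiliar with the two penalties named above, the sketch below computes the ℓ1 norm (sum of absolute entries) and the ℓ2,1 norm (sum of the ℓ2 norms of the rows) of a small matrix. The matrix is an arbitrary example, and the row-wise convention for ℓ2,1 is the common one but is an assumption about the paper's usage.

```python
import math


def l1_norm(matrix):
    """Sum of absolute values of all entries; encourages element-wise sparsity."""
    return sum(abs(value) for row in matrix for value in row)


def l21_norm(matrix):
    """Sum of row-wise Euclidean norms; encourages entire rows to become zero."""
    return sum(math.sqrt(sum(value ** 2 for value in row)) for row in matrix)


# Arbitrary 3 x 2 projection matrix: the second row is already zero.
projection = [[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]]
l1_value = l1_norm(projection)    # 3 + 4 + 0 + 0 + 1 + 0
l21_value = l21_norm(projection)  # 5 + 0 + 1
```

When each row corresponds to one input feature, an ℓ2,1 penalty tends to switch off whole features, whereas an ℓ1 penalty zeroes individual coefficients.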
Qing YU Masashi ANZAWA Sosuke AMANO Kiyoharu AIZAWA
Since the development of food diaries could enable people to develop healthy eating habits, food image recognition is in high demand to reduce the effort in food recording. Previous studies have worked on this challenging domain with datasets having fixed numbers of samples and classes. However, in the real-world setting, it is impossible to include all of the foods in the database because the number of classes of foods is large and increases continually. In addition to that, inter-class similarity and intra-class diversity also bring difficulties to the recognition. In this paper, we solve these problems by using deep convolutional neural network features to build a personalized classifier which incrementally learns the user's data and adapts to the user's eating habit. As a result, we achieved the state-of-the-art accuracy of food image recognition by the personalization of 300 food records per user.
Qiusheng HE Xiuyan SHAO Wei CHEN Xiaoyun LI Xiao YANG Tongfeng SUN
To address the influence of scale change on drone-based target tracking, a multi-scale target tracking algorithm based on the color feature tracking algorithm is proposed. The algorithm realizes adaptive scale tracking by training position and scale correlation filters. It first obtains the target center position in the next frame by computing the maximum of the response, where the position correlation filter is learned by a least squares classifier and the dimensionality of the color features is reduced by principal component analysis. The scale correlation filter is obtained from the color features of 33 rectangular areas, which are set by the scale factor around the central location, and its dimensionality is reduced by orthogonal-triangular (QR) decomposition. Finally, the location and size of the target are updated according to the maximum of the response. Tests on 13 challenging video sequences taken by the drone show that the algorithm adapts to changes in the target scale and that its robustness and many other performance indicators are better than those of most state-of-the-art methods under illumination variation, fast motion, motion blur, and other complex situations.
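The scale search over 33 rectangular areas can be sketched as follows: candidate sizes are generated by raising a scale factor to integer powers centred on zero, and the size whose scale-filter response is largest is kept. The scale factor of 1.02 and the response values are assumptions for illustration; the abstract gives only the number of scales.

```python
def scale_factors(num_scales=33, step=1.02):
    """Multipliers step**n for n = -(S-1)/2 ... (S-1)/2 (step is an assumed value)."""
    half = (num_scales - 1) // 2
    return [step ** n for n in range(-half, half + 1)]


def candidate_sizes(width, height, num_scales=33, step=1.02):
    """Rectangles around the target centre, one per scale multiplier."""
    return [(width * s, height * s) for s in scale_factors(num_scales, step)]


def best_size(width, height, responses, step=1.02):
    """Size of the candidate whose scale-filter response is largest."""
    sizes = candidate_sizes(width, height, len(responses), step)
    best_index = max(range(len(responses)), key=responses.__getitem__)
    return sizes[best_index]


# Hypothetical responses peaking two steps above the current scale.
responses = [0.0] * 33
responses[18] = 1.0
new_width, new_height = best_size(40, 80, responses)
```

In this toy case the target box grows by a factor of 1.02 squared, from 40 x 80 to about 41.6 x 83.2 pixels.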