Keyword Search Result

[Keyword] transformer (83 hits)

Showing 1-20 of 83 hits

  • D2PT: Density to Point Transformer with Knowledge Distillation for Crowd Counting and Localization Open Access

    Fan LI  Enze YANG  Chao LI  Shuoyan LIU  Haodong WANG  

     
    LETTER-Image Recognition, Computer Vision

    Publicized: 2024/09/17  Vol: E108-D No:2  Page(s): 165-168

    Crowd counting is a crucial computer vision task that poses a significant challenge yet holds vast potential for practical applications in public safety and transportation. Traditional crowd counting approaches typically rely on a single framework to predict density maps or head point distributions. However, these straightforward architectures often fall short in cases of over-counting or omission, particularly in diverse crowded scenes. To address these limitations, we introduce the Density to Point Transformer (D2PT), an innovative approach for effective crowd counting and localization. Specifically, D2PT employs a Transformer-based teacher-student framework that integrates the insights of density-based and head-point-based methods. Furthermore, we introduce feature-aligned knowledge distillation, formulating a collaborative training approach that enhances the performance of both density estimation and point map prediction. Optimized with multiple loss functions, D2PT achieves state-of-the-art performance across five crowd counting datasets, demonstrating its robustness and effectiveness for intricate crowd counting and localization challenges.
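
    A feature-aligned distillation term of the kind described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the tensor shapes, the 1x1 projection used for alignment, and the loss weight in the comment are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignedDistillation(nn.Module):
    """Project student features onto the teacher's width and penalize the gap."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # 1x1 conv aligns student channels with the teacher's channel width
        self.align = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # detach() freezes the teacher; gradients flow only into the student
        return F.mse_loss(self.align(student_feat), teacher_feat.detach())

distill = FeatureAlignedDistillation(128, 256)
s, t = torch.randn(2, 128, 32, 32), torch.randn(2, 256, 32, 32)
# total = density_loss + point_loss + 0.5 * distill(s, t)  # weight is hypothetical
print(distill(s, t))
```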

  • Vision Transformer with Key-Select Routing Attention for Single Image Dehazing Open Access

    Lihan TONG  Weijia LI  Qingxia YANG  Liyuan CHEN  Peng CHEN  

     
    LETTER-Image Recognition, Computer Vision

    Publicized: 2024/07/01  Vol: E107-D No:11  Page(s): 1472-1475

    We present Ksformer, which uses Multi-scale Key-select Routing Attention (MKRA) to intelligently select key areas through multi-channel, multi-scale windows with a top-k operator, and a Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features. In our tests, Ksformer outperforms other dehazing methods.
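
    The top-k operator at the heart of key-select routing can be illustrated with a single-head sketch: each query attends only to its k highest-scoring keys. The shapes and the scoring rule below are assumptions for illustration, not Ksformer's code.

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k: int):
    """Single-head attention restricted to the top-k keys per query.
    q: (Nq, d), k: (Nk, d), v: (Nk, d) -> (Nq, d)
    """
    scores = q @ k.T / q.shape[-1] ** 0.5        # (Nq, Nk) similarity scores
    vals, idx = scores.topk(top_k, dim=-1)       # keep the k best keys per query
    weights = F.softmax(vals, dim=-1)            # softmax only over the kept keys
    return torch.einsum("qk,qkd->qd", weights, v[idx])  # gather and mix values

q, k, v = torch.randn(16, 32), torch.randn(64, 32), torch.randn(64, 32)
print(topk_attention(q, k, v, top_k=8).shape)  # torch.Size([16, 32])
```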

  • Power Peak Load Forecasting Based on Deep Time Series Analysis Method Open Access

    Ying-Chang HUNG  Duen-Ren LIU  

     
    PAPER-Artificial Intelligence, Data Mining

    Publicized: 2024/03/21  Vol: E107-D No:7  Page(s): 845-856

    The prediction of peak power load is a critical factor directly impacting the stability of power supply, characterized significantly by its time series nature and intricate ties to the seasonal patterns in electricity usage. Despite its crucial importance, power peak load forecasting remains a multifaceted challenge in the field. This study contributes to this domain by proposing a method that combines three primary models, the GRU model, the self-attention mechanism, and the Transformer mechanism, to forecast peak power load. To contextualize this research within the ongoing discourse, it is essential to consider the evolving methodologies and advancements in power peak load forecasting. By delving into additional references addressing the complexities and current state of the power peak load forecasting problem, this study builds upon the existing knowledge base and offers insights into contemporary challenges and strategies adopted within the field. Data preprocessing in this study involves comprehensive cleaning, standardization, and the design of relevant functions to ensure robustness in the predictive modeling process. Additionally, recognizing the need to capture temporal changes effectively, this research incorporates features such as the “Weekly Moving Average” and “Monthly Moving Average” into the dataset. To evaluate the proposed methodologies comprehensively, this study conducts comparative analyses with established models such as LSTM, the self-attention network, Transformer, ARIMA, and SVR. The outcomes reveal that the models proposed in this study exhibit superior predictive performance compared to these established models, showcasing their effectiveness in accurately forecasting electricity consumption. The significance of this research lies in two primary contributions. First, it introduces an innovative prediction method combining the GRU model, self-attention mechanism, and Transformer mechanism, aligning with the contemporary evolution of predictive modeling techniques in the field. Second, it introduces and emphasizes the utility of the “Weekly Moving Average” and “Monthly Moving Average” features, which are crucial for effectively capturing and interpreting seasonal variations within the dataset. By incorporating these features, this study enhances the model's ability to account for seasonal influencing factors, thereby significantly improving the accuracy of peak power load forecasting. This contribution aligns with ongoing efforts to refine forecasting methodologies and addresses pertinent challenges within power peak load forecasting.
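
    The “Weekly Moving Average” and “Monthly Moving Average” features described above can be derived with rolling windows. A minimal sketch, assuming a daily peak-load series and a hypothetical column name `peak_load`:

```python
import pandas as pd

# hypothetical daily peak-load series (MW), indexed by date
df = pd.DataFrame(
    {"peak_load": [31200, 30950, 31800, 32500, 31100, 29800, 29500] * 10},
    index=pd.date_range("2023-01-01", periods=70, freq="D"),
)

# rolling means over 7 and 30 days capture weekly and monthly seasonality
df["weekly_ma"] = df["peak_load"].rolling(window=7, min_periods=1).mean()
df["monthly_ma"] = df["peak_load"].rolling(window=30, min_periods=1).mean()
print(df.tail())
```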

  • Analysis of Blood Cell Image Recognition Methods Based on Improved CNN and Vision Transformer Open Access

    Pingping WANG  Xinyi ZHANG  Yuyan ZHAO  Yueti LI  Kaisheng XU  Shuaiyin ZHAO  

     
    PAPER-Neural Networks and Bioengineering

    Publicized: 2023/09/15  Vol: E107-A No:6  Page(s): 899-908

    Leukemia is a common and highly dangerous blood disease that requires early detection and treatment. Currently, the diagnosis of leukemia types mainly relies on the pathologist’s morphological examination of blood cell images, which is a tedious and time-consuming process, and the diagnosis results are highly subjective and prone to misdiagnosis and missed diagnosis. To address these problems, this research proposes a blood cell image recognition technique based on an enhanced Vision Transformer. Firstly, this paper incorporates convolutions into the token embedding to replace the positional encoding, which represents only coarse spatial information. Then, based on the Transformer’s self-attention mechanism, this paper proposes a sparse attention module that can select discriminative regions in the image, further enhancing the model’s fine-grained feature expression capability. Finally, this paper uses a contrastive loss function to further increase the intra-class consistency and inter-class difference of classification features. According to the experimental results, the model achieves an identification accuracy of 92.49% on the Munich single-cell morphological dataset, an improvement of 1.41% over the baseline, and it also outperforms the state-of-the-art Swin Transformer. Our method therefore has the potential to serve as a reference for clinical diagnosis by physicians.
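
    Replacing an explicit positional encoding with a convolutional token embedding, as the abstract describes, can be sketched as follows; the kernel, stride, and channel sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Tokenize an image with an overlapping strided convolution, so each
    token already carries local spatial structure (no positional encoding)."""
    def __init__(self, in_ch=3, dim=64, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch + 3,
                              stride=patch, padding=2)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

tokens = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 3136, 64])
```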

  • Finformer: Fast Incremental and General Time Series Data Prediction Open Access

    Savong BOU  Toshiyuki AMAGASA  Hiroyuki KITAGAWA  

     
    PAPER

    Publicized: 2024/01/09  Vol: E107-D No:5  Page(s): 625-637

    Forecasting time-series data is useful in many fields, such as stock price prediction, autonomous driving, and weather forecasting. Many existing forecasting models tend to work well when forecasting short-sequence time series. However, when working with long-sequence time series, their performance suffers significantly. Recently, there has been more intense research in this direction, and Informer is currently the most efficient predicting model. Informer’s main drawback is that it does not allow for incremental learning. In this paper, we propose a Fast Informer called Finformer, which addresses the above bottleneck by reducing the training/predicting time of Informer. Finformer can efficiently compute the positional/temporal/value embedding and Query/Key/Value of the self-attention incrementally. Theoretically, Finformer improves the speed of both training and predicting over the state-of-the-art model Informer. Extensive experiments show that Finformer is about 26% faster than Informer for both short- and long-sequence time series prediction. In addition, Finformer is about 20% faster than InTrans for the general Conv1d; InTrans is one of our previous works and the predecessor of Finformer.
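
    The incremental idea rests on locality: when one new timestep arrives, a kernel-size-k 1-D convolution changes only the newest outputs, so earlier embeddings can be cached. A minimal sketch of that observation (not Finformer's actual code; the kernel size and channel count are assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
k = 3
conv = nn.Conv1d(1, 8, kernel_size=k)   # stand-in for a value-embedding conv
x = torch.randn(1, 1, 100)              # stream seen so far
cached = conv(x)                        # (1, 8, 98) embeddings kept in a cache

x = torch.cat([x, torch.randn(1, 1, 1)], dim=-1)  # one new point arrives
new_out = conv(x[..., -k:])             # only the newest output is computed
incremental = torch.cat([cached, new_out], dim=-1)

assert torch.allclose(incremental, conv(x), atol=1e-6)  # matches full recompute
```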

  • SimpleViTFi: A Lightweight Vision Transformer Model for Wi-Fi-Based Person Identification Open Access

    Jichen BIAN  Min ZHENG  Hong LIU  Jiahui MAO  Hui LI  Chong TAN  

     
    PAPER-Sensing

    Vol: E107-B No:4  Page(s): 377-386

    Wi-Fi-based person identification (PI) tasks are performed by analyzing the fluctuating characteristics of Channel State Information (CSI) data to determine whether a person's identity is legitimate. This technology can be used for intrusion detection and keyless access to restricted areas. However, the related research rarely considers restricted computing resources and the complexity of real-world environments, resulting in a lack of practicality in some scenarios, such as intrusion detection in remote substations without public network coverage. In this paper, we propose a novel neural network model named SimpleViTFi, a lightweight classification model based on the Vision Transformer (ViT), which adds a downsampling mechanism, a distinctive patch embedding method, and learnable positional embeddings to a cropped ViT architecture. We employ the latest IEEE 802.11ac 80MHz CSI dataset provided by [1]. The CSI matrix is abstracted into a special “image” after pre-processing and fed into the trained SimpleViTFi for classification. The experimental results demonstrate that the proposed SimpleViTFi achieves lower computational resource overhead and better accuracy than traditional classification models, demonstrating robustness on LOS or NLOS CSI data generated by different Tx-Rx devices and acquired by different monitors.

  • An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

    Shiyu TENG  Jiaqing LIU  Yue HUANG  Shurong CHAI  Tomoko TATEYAMA  Xinyin HUANG  Lanfen LIN  Yen-Wei CHEN  

     
    PAPER

    Publicized: 2023/12/15  Vol: E107-D No:3  Page(s): 342-353

    Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus-specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and the results show that it outperforms state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE), and accuracy.

  • Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention

    Peng GAO  Xin-Yue ZHANG  Xiao-Li YANG  Jian-Cheng NI  Fei WANG  

     
    LETTER-Image Recognition, Computer Vision

    Publicized: 2023/10/20  Vol: E107-D No:1  Page(s): 161-164

    Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, they tend to ignore background appearance, which makes them unsuitable for recognizing arbitrary target objects under various appearance variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, where shifted-window multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of our proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.
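
    Shifted-window self-attention confines attention to local windows and cyclically shifts the feature map between layers so information crosses window borders. A minimal partitioning sketch, with assumed feature-map and window sizes:

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) map into (num_windows*B, ws*ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(2, 56, 56, 96)
windows = window_partition(x, ws=7)   # self-attention runs inside each window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # shift before next layer
print(windows.shape, window_partition(shifted, ws=7).shape)
# torch.Size([128, 49, 96]) torch.Size([128, 49, 96])
```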

  • Lightweight and Fast Low-Light Image Enhancement Method Based on PoolFormer

    Xin HU  Jinhua WANG  Sunhan XU  

     
    LETTER-Image Processing and Video Processing

    Publicized: 2023/10/05  Vol: E107-D No:1  Page(s): 157-160

    Images captured in low-light environments have low visibility and high noise, which seriously affect subsequent visual tasks such as target detection and face recognition. Therefore, low-light image enhancement is of great significance in obtaining high-quality images and is a challenging problem in computer vision. LLFormer, a low-light enhancement model based on the Vision Transformer, uses axis-based multi-head self-attention and a cross-layer attention fusion mechanism to reduce complexity and perform feature extraction, and it enhances images well. However, the calculation of its attention mechanism is complex and its number of parameters is large, which limits the application of the model in practice. In response to this problem, a lightweight module, PoolFormer, is used to replace the attention module with spatial pooling, which can increase the parallelism of the network and greatly reduce the number of model parameters. To suppress image noise and improve visual effects, a new loss function is constructed for model optimization. The experimental results show that the proposed method not only reduces the number of parameters by 49%, but also performs better in terms of image detail restoration and noise suppression compared with the baseline model. On the LOL dataset, the PSNR and SSIM were 24.098dB and 0.8575 respectively. On the MIT-Adobe FiveK dataset, the PSNR and SSIM were 27.060dB and 0.9490. The evaluation results on the two datasets are better than those of current mainstream low-light enhancement algorithms.
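
    PoolFormer's token mixer replaces self-attention with plain spatial pooling, which is where the parameter and computation savings come from. A minimal sketch of that mixer, following the published PoolFormer formulation (sizes are assumptions):

```python
import torch
import torch.nn as nn

class PoolingMixer(nn.Module):
    """Mix tokens by average pooling instead of self-attention."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):        # x: (B, C, H, W)
        # subtracting x mirrors PoolFormer's residual form of the mixer
        return self.pool(x) - x

y = PoolingMixer()(torch.randn(1, 64, 32, 32))
print(y.shape)  # same shape out, no attention parameters at all
```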

  • CCTSS: The Combination of CNN and Transformer with Shared Sublayer for Detection and Classification

    Aorui GOU  Jingjing LIU  Xiaoxiang CHEN  Xiaoyang ZENG  Yibo FAN  

     
    PAPER-Image

    Publicized: 2023/07/06  Vol: E107-A No:1  Page(s): 141-156

    Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable performance in detection and classification tasks. Nevertheless, their feature extraction cannot consider both local and global information, so detection and classification performance can be further improved. In addition, deep learning networks are becoming increasingly complex, significantly increasing the computation and storage they require. This paper proposes a combination of CNN and Transformer, and designs a local feature enhancement module and a global context modeling module to enhance the cascade network. While the local feature enhancement module increases the range of feature extraction, the global context modeling module captures the global information of the feature maps. To decrease model complexity, a shared sublayer is designed to share weight parameters between adjacent or cross convolutional layers, thereby reducing the number of convolutional weight parameters. Moreover, to effectively improve the detection performance of neural networks without increasing network parameters, an optimal transport assignment approach is proposed to resolve the problem of label assignment. The classification loss and regression loss are the summations of the costs between demanders and suppliers. The experimental results demonstrate that the proposed Combination of CNN and Transformer with Shared Sublayer (CCTSS) performs better than state-of-the-art methods on various datasets and applications.
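
    Weight sharing between adjacent convolutional layers, as the shared sublayer does, can be expressed by reusing one convolution module in several places. A toy sketch under assumed layer sizes, not the CCTSS architecture itself:

```python
import torch
import torch.nn as nn

class SharedConvBlock(nn.Module):
    """Two sequential conv applications that share one set of weights."""
    def __init__(self, ch=64):
        super().__init__()
        self.shared = nn.Conv2d(ch, ch, 3, padding=1)  # single parameter set
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.shared(x))     # first application
        return self.act(self.shared(x))  # second application reuses the weights

m = SharedConvBlock()
print(sum(p.numel() for p in m.parameters()))  # half of two independent convs
```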

  • Local-to-Global Structure-Aware Transformer for Question Answering over Structured Knowledge

    Yingyao WANG  Han WANG  Chaoqun DUAN  Tiejun ZHAO  

     
    PAPER-Artificial Intelligence, Data Mining

    Publicized: 2023/06/27  Vol: E106-D No:10  Page(s): 1705-1714

    Question-answering tasks over structured knowledge (i.e., tables and graphs) require the ability to encode structural information. Traditional pre-trained language models trained on linear-chain natural language cannot be directly applied to encode tables and graphs. Existing methods adapt pre-trained models to such tasks by flattening structured knowledge into sequences. However, this serialization leads to the loss of the structural information of the knowledge. To better employ pre-trained transformers for structured knowledge representation, we propose a novel structure-aware transformer (SATrans) that injects local-to-global structural information of the knowledge into the masks of different self-attention layers. Specifically, in the lower self-attention layers, SATrans focuses on the local structural information of each knowledge token to learn a more robust representation of it. In the upper self-attention layers, SATrans further injects the global information of the structured knowledge to integrate information among knowledge tokens. In this way, SATrans can effectively learn the semantic representation and the structural information from the knowledge sequence and the attention mask, respectively. We evaluate SATrans on the table fact verification task and the knowledge base question-answering task. Furthermore, we explore two methods of combining symbolic and linguistic reasoning for these tasks, to address the pre-trained models' lack of symbolic reasoning ability. The experimental results reveal that the methods consistently outperform strong baselines on the two benchmarks.
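
    Injecting structure into attention masks, as described above, amounts to adding a mask of 0 (allowed) or -inf (blocked) to the attention scores: lower layers see only structural neighbors, upper layers see everything. A minimal sketch with a hypothetical chain-graph adjacency, not SATrans's actual mask construction:

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    """Self-attention with an additive mask (0 = allowed, -inf = blocked)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores + mask, dim=-1) @ v

n, d = 6, 16
q = k = v = torch.randn(n, d)
# hypothetical adjacency: each token linked to itself and its chain neighbors
adj = torch.eye(n) + torch.diag(torch.ones(n - 1), 1) + torch.diag(torch.ones(n - 1), -1)
local_mask = torch.full((n, n), float("-inf"))
local_mask[adj > 0] = 0.0                    # lower layers: neighbors only
global_mask = torch.zeros(n, n)              # upper layers: all tokens visible
print(masked_attention(q, k, v, local_mask).shape,
      masked_attention(q, k, v, global_mask).shape)
```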

  • Siamese Transformer for Saliency Prediction Based on Multi-Prior Enhancement and Cross-Modal Attention Collaboration

    Fazhan YANG  Xingge GUO  Song LIANG  Peipei ZHAO  Shanhua LI  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2023/06/20  Vol: E106-D No:9  Page(s): 1572-1583

    Visual saliency prediction has improved dramatically since the advent of convolutional neural networks (CNN). Although CNNs achieve excellent performance, they still cannot learn global and long-range contextual information well and lack interpretability due to the locality of convolution operations. We propose a saliency prediction model based on multi-prior enhancement and cross-modal attention collaboration (ME-CAS). Concretely, we design a Transformer-based Siamese network architecture as the backbone for feature extraction. One of the Transformer branches captures the contextual information of the image under the self-attention mechanism to obtain a global saliency map. At the same time, we build a prior learning module to learn the human visual center-bias prior, contrast prior, and frequency prior. The multi-prior maps are input to the other Siamese branch to learn the details of the underlying visual features and obtain a saliency map of local information. Finally, we use an attention calibration module to guide the cross-modal collaborative learning of global and local information and generate the final saliency map. Extensive experimental results demonstrate that our proposed ME-CAS achieves superior results on public benchmarks compared with competing saliency prediction models. Moreover, the multi-prior learning modules enhance the expression of salient details in images and improve model interpretability.

  • A Fusion Deraining Network Based on Swin Transformer and Convolutional Neural Network

    Junhao TANG  Guorui FENG  

     
    LETTER-Image Processing and Video Processing

    Publicized: 2023/04/24  Vol: E106-D No:7  Page(s): 1254-1257

    Single image deraining is an ill-posed problem that has been a long-standing issue. In the past few years, convolutional neural network (CNN) methods have dominated computer vision and achieved considerable success in image deraining. Recently, Swin Transformer-based models have also shown impressive performance, even surpassing CNN-based methods to become the state of the art on high-level vision tasks. Therefore, we attempt to introduce the Swin Transformer to deraining tasks. In this paper, we propose a deraining model with two sub-networks. The first sub-network includes two branches. The Rain Recognition Network is a Unet with Swin Transformer layers, which preliminarily restores the background, especially at locations where rain streaks appear. The Detail Complement Network extracts the background detail beneath the rain streaks. The second sub-network, called Refine-Unet, utilizes the output of the first to further restore the image. Through experiments, our network achieves improvements on single image deraining compared with previous Transformer research.

  • GazeFollowTR: A Method of Gaze Following with Reborn Mechanism

    Jingzhao DAI  Ming LI  Xuejiao HU  Yang LI  Sidan DU  

     
    PAPER-Vision

    Publicized: 2022/11/30  Vol: E106-A No:6  Page(s): 938-946

    Gaze following is the task of estimating where an observer is looking inside a scene. Both the observer and scene information must be learned to determine the gaze directions and gaze points. Many existing works focus only on the scene or the observer, and frameworks that combine both for gaze following remain limited. In this paper, a gaze following method using a hybrid transformer is proposed. Based on the conventional method (GazeFollow), we make three developments. First, a hybrid transformer is applied to learn head images and gaze positions. Second, the pinball loss function is utilized to control the gaze point error. Finally, a novel ReLU layer with a reborn mechanism (reborn ReLU) replaces the traditional ReLU layers in different network stages. To test our developments, we train the developed framework on the DL Gaze dataset and evaluate the model on our collected set. The experimental results show that our framework outperforms the reference methods.
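
    The pinball (quantile) loss mentioned above penalizes under- and over-prediction asymmetrically, which is what lets it control the gaze point error. The standard formulation is below; the quantile value tau is an assumption, not taken from the paper.

```python
import torch

def pinball_loss(pred, target, tau=0.5):
    """Quantile loss: tau weights under-prediction, (1 - tau) over-prediction."""
    err = target - pred
    return torch.mean(torch.maximum(tau * err, (tau - 1) * err))

pred, target = torch.tensor([0.4, 0.7]), torch.tensor([0.5, 0.6])
print(pinball_loss(pred, target))  # tau=0.5 reduces to half the mean abs error
```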

  • Time Series Forecasting Based on Convolution Transformer

    Na WANG  Xianglian ZHAO  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2023/02/15  Vol: E106-D No:5  Page(s): 976-985

    Time series forecasting is essential in many real-life fields. Recent studies have shown that the Transformer has certain advantages when dealing with such problems, especially with long-sequence inputs and long-sequence forecasting. To improve the efficiency and local stability of the Transformer, these studies combine the Transformer and CNN in different structures. However, previous Transformer-based time series forecasting models cannot make full use of CNN, and the two have not been combined to best effect. In response to this problem, we propose a time series forecasting algorithm based on a convolution Transformer. (1) ES attention mechanism: external attention is combined with the traditional self-attention mechanism through a two-branch network, reducing the computational cost of the self-attention mechanism and obtaining higher forecasting accuracy. (2) Frequency enhanced block: a Frequency Enhanced Block is added in front of the ES attention module, which can capture important structures in the time series through frequency-domain mapping. (3) Causal dilated convolution: the self-attention modules are connected by replacing the traditional standard convolution layer with a causal dilated convolution layer, giving an exponentially growing receptive field without increasing the computational cost. (4) Multi-layer feature fusion: the outputs of different self-attention modules are extracted, and convolutional layers are used to adjust the sizes of the feature maps for fusion, so finer-grained feature information is obtained at negligible computational cost. Experiments on real-world datasets show that the proposed forecasting model structure can greatly improve the real-time forecasting performance of the current state-of-the-art Transformer model, with significantly lower computation and memory costs. Compared with previous algorithms, the proposed algorithm achieves a greater improvement in both effectiveness and forecasting accuracy.
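
    Item (3) above relies on a standard construction: padding only the past side of the sequence keeps the convolution causal, and stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially. A minimal sketch (the channel count and dilation are assumptions):

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that never looks at future timesteps."""
    def __init__(self, ch, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # pad the past, not the future
        self.conv = nn.Conv1d(ch, ch, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))  # length preserved

y = CausalDilatedConv1d(8)(torch.randn(1, 8, 50))
print(y.shape)  # torch.Size([1, 8, 50])
```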

  • Image and Model Transformation with Secret Key for Vision Transformer

    Hitoshi KIYA  Ryota IIJIMA  Aprilpyone MAUNGMAUNG  Yuma KINOSHITA  

     
    INVITED PAPER

    Publicized: 2022/11/02  Vol: E106-D No:1  Page(s): 2-11

    In this paper, we propose a combined use of transformed images and vision transformer (ViT) models transformed with a secret key. We show for the first time that models trained with plain images can be directly transformed into models trained with encrypted images on the basis of the ViT architecture, and the performance of the transformed models is the same as that of models trained with plain images when using test images encrypted with the key. In addition, the proposed scheme does not require any specially prepared data for training models or any network modification, so it also allows us to easily update the secret key. In an experiment, the effectiveness of the proposed scheme is evaluated in terms of performance degradation and model protection performance in an image classification task on the CIFAR-10 dataset.
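
    Block-wise image transformation with a secret key, as studied in this line of work, can be illustrated by a key-seeded patch permutation that lines up with ViT's patch tokens. This is a toy sketch only, not the paper's exact transformation; the patch size and key value are placeholders.

```python
import torch

def shuffle_patches(img, key: int, patch=4):
    """Permute non-overlapping patches using a permutation seeded by the key."""
    C, H, W = img.shape
    g = torch.Generator().manual_seed(key)            # secret key fixes the order
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, h, w, ph, pw)
    p = p.reshape(C, -1, patch, patch)
    p = p[:, torch.randperm(p.shape[1], generator=g)]  # key-dependent shuffle
    p = p.reshape(C, H // patch, W // patch, patch, patch).permute(0, 1, 3, 2, 4)
    return p.reshape(C, H, W)

enc = shuffle_patches(torch.rand(3, 32, 32), key=1234)  # same key, same permutation
print(enc.shape)  # torch.Size([3, 32, 32])
```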

  • Vehicle Re-Identification Based on Quadratic Split Architecture and Auxiliary Information Embedding

    Tongwei LU  Hao ZHANG  Feng MIN  Shihai JIA  

     
    LETTER-Image

    Publicized: 2022/05/24  Vol: E105-A No:12  Page(s): 1621-1625

    Convolutional neural network (CNN) based vehicle re-identification (ReID) inevitably has disadvantages such as information loss caused by downsampling operations. We therefore propose a vision transformer (ViT) based vehicle ReID method to solve this problem. To improve the feature representation of the vision transformer and make full use of additional vehicle information, the following methods are presented. (I) We propose a Quadratic Split Architecture (QSA) to learn both global and local features. More precisely, we split an image into many patches as the “global part” and further split them into smaller sub-patches as the “local part”. Features of both the global and local parts are aggregated to enhance the representation ability. (II) Auxiliary Information Embedding (AIE) is proposed to improve the robustness of the model by plugging a learnable camera/viewpoint embedding into the ViT. Experimental results on several benchmarks indicate that our method is superior to many advanced vehicle ReID methods.

  • Analysis and Design of a Linear Ka-Band Power Amplifier in 65-nm CMOS for 5G Applications

    Chongyu YU  Jun FENG  

     
    PAPER-Microwaves, Millimeter-Waves

    Publicized: 2021/12/14  Vol: E105-C No:5  Page(s): 184-193

    A linear and broadband power amplifier (PA) for 5G phased arrays is presented. The design improves linearity by operating the transistors in the deep class-AB region and broadens the bandwidth by applying an inter-stage weakly-coupled transformer. The theory of transformers is illustrated by analyzing the odd- and even-mode model. Based on this, the odd-mode Q factor is used to evaluate the quality of impedance matching. Weakly- and strongly-coupled transformers are compared and analyzed in both the design process and their applicable characteristics. Besides, a well-founded method to achieve transformer-based balanced-unbalanced transformation is proposed. The fully integrated two-stage PA is designed and implemented in a 65-nm CMOS process with a 1-V power supply and provides a maximum small-signal gain of 19dB. A maximum output 1-dB compressed power (P1dB) of 17.4dBm and a saturated output power (PSAT) of 18dBm are measured at 28GHz. The power-added efficiency (PAE) at P1dB is 26.5%. From 23 to 32GHz, the measured P1dB is above 16dBm, covering the potential 5G bands worldwide around 28GHz.

  • A Study on the Bandwidth of the Transformer Matching Circuits

    Satoshi TANAKA  

     
    PAPER

    Publicized: 2021/10/25  Vol: E105-A No:5  Page(s): 844-852

    With the spread of 5th-generation mobile phones, increasing the output power of the PA (power amplifier) has become important, and in recent years differential amplifiers, which can increase the output voltage amplitude for a given power supply voltage, have been examined from the viewpoint of power combining. In the case of a differential PA, in addition to the advantage in voltage amplitude, the load impedance can be set to four times that of a single-ended PA, which makes it possible to reduce the impact of parasitic resistance. With the study of differential PAs, many transformer matching circuits have been studied in addition to the LC matching circuits that have been widely used in the past. The transformer matching circuit easily realizes differential-to-single-ended conversion and is an indispensable technology in the differential PA. As with the LC matching circuit, widening the bandwidth of the transformer matching circuit is an open issue. In this paper, the characteristics of basic transformer matching circuits are analyzed by adding input/output shunt capacitances to the transformer, and the conditions for bandwidth improvement are clarified. In addition, by comparing the FBW (fractional bandwidth) with that of the LC 2-stage matching circuit, it is shown that the FBW can be competitive.
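
    The factor-of-four load impedance claimed above follows from basic power and swing relations; a short derivation (standard circuit reasoning, not reproduced from the paper):

```latex
% Single-ended PA: peak swing V into load R_L delivers P = V^2 / (2 R_L).
% A differential PA doubles the available swing to 2V, so for the same P:
\[
  P_{\mathrm{out}} = \frac{V^2}{2R_L}
  \quad\Longrightarrow\quad
  R_L' = \frac{(2V)^2}{2P_{\mathrm{out}}} = 4R_L .
\]
```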

  • Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

    Jing WANG  Yiyu LUO  Weiming YI  Xiang XIE  

     
    PAPER-Speech and Hearing

    Publicized: 2022/01/11  Vol: E105-D No:4  Page(s): 766-777

    Speech separation is the task of extracting target speech while suppressing background interference components. In applications like video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based mainly on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT), automatic speech recognition (ASR), etc. The Transformer has shown an advantage in modeling audio-visual temporal context by multi-head attention blocks through explicitly assigned attention weights. Besides, the Transformer has no recurrent sub-networks, thus supporting parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio-visual streams, including the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model is made up of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two different encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method for speaker-independent separation in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though only trained on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. Besides, the model still shows an advantage compared with previous audio-visual speech separation works.
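
    Applying a predicted complex time-frequency mask, as described above, is an elementwise complex multiplication on the mixture's STFT followed by an inverse STFT. A minimal sketch with assumed STFT parameters and a random stand-in for the model's predicted mask:

```python
import torch

n_fft, hop = 512, 128
mix = torch.randn(16000)                       # 1 s of noisy audio at 16 kHz
window = torch.hann_window(n_fft)
spec = torch.stft(mix, n_fft, hop, window=window, return_complex=True)

# a trained model would predict this complex mask; random here for illustration
mask = torch.complex(torch.rand_like(spec.real), torch.rand_like(spec.real))

enhanced = torch.istft(spec * mask, n_fft, hop, window=window, length=len(mix))
print(enhanced.shape)  # torch.Size([16000])
```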
