Keyword Search Result

[Keyword] video prediction (2 hits)

Results 1-2 of 2
  • Compensation of Communication Latency in Remote Monitoring Systems by Video Prediction [Open Access]

    Toshio SATO  Yutaka KATSUYAMA  Xin QI  Zheng WEN  Kazuhiko TAMESUE  Wataru KAMEYAMA  Yuichi NAKAMURA  Jiro KATTO  Takuro SATO  

     
    PAPER

      Vol: E107-B No:12
      Page(s): 945-954

    Remote video monitoring over networks inevitably introduces a certain degree of communication latency. Although numerous studies have sought to reduce latency in network systems, achieving “zero latency” is fundamentally impossible for video monitoring. To address this issue, we investigate a practical method that compensates for latency in video monitoring by using video prediction techniques. We apply the lightweight PredNet model to predict future frames and evaluate their image quality through quantitative image quality metrics and subjective assessment. The results suggest that, for simple movements of the robot arm, a prediction horizon of up to 333 ms for generating future frames is tolerable. The video prediction method is integrated into a remote monitoring system, and its processing time is also evaluated. We define the object-to-display latency for video monitoring and explore the potential for realizing a zero-latency remote video monitoring system. An evaluation that simultaneously captures the robot arm’s movement and the display of the remote monitoring system confirms that object-to-display latencies of several hundred milliseconds can be compensated for by video prediction. Experimental results demonstrate that our approach can serve as a new compensation method for communication latency.
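
    A minimal sketch of the compensation idea described in this abstract: choose a prediction horizon that covers the measured object-to-display latency, then display a predicted future frame instead of the delayed one. The frame rate, latency value, and the predict_future function standing in for a PredNet-style predictor are illustrative assumptions, not the paper's implementation.

    ```python
    import numpy as np

    # Illustrative assumptions (not from the paper): 30 fps capture and a
    # measured object-to-display latency of 300 ms.
    FPS = 30.0
    OBJECT_TO_DISPLAY_LATENCY_S = 0.300

    def horizon_in_frames(latency_s: float, fps: float) -> int:
        """Number of future frames needed to cover the latency.

        E.g., 300 ms at 30 fps -> ceil(0.3 * 30) = 9 frames (~300 ms),
        within the 333 ms horizon the paper found tolerable.
        """
        return int(np.ceil(latency_s * fps))

    def predict_future(frames: np.ndarray, n_ahead: int) -> np.ndarray:
        """Stand-in for a PredNet-style predictor (hypothetical).

        This simply extrapolates the last frame difference linearly; a
        real system would run a learned video-prediction model instead.
        """
        last = frames[-1].astype(np.float32)
        prev = frames[-2].astype(np.float32)
        predicted = last + n_ahead * (last - prev)  # linear-motion assumption
        return np.clip(predicted, 0, 255).astype(np.uint8)

    # Usage: given the most recent frames received over the network, show a
    # frame predicted n_ahead steps into the future so that what the
    # operator sees approximates the object's current state.
    recent = np.random.randint(0, 256, size=(4, 120, 160), dtype=np.uint8)
    n_ahead = horizon_in_frames(OBJECT_TO_DISPLAY_LATENCY_S, FPS)
    display_frame = predict_future(recent, n_ahead)
    print(f"predicting {n_ahead} frames ahead (~{n_ahead / FPS * 1000:.0f} ms)")
    ```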

  • Representation Learning of Tongue Dynamics for a Silent Speech Interface

    Hongcui WANG  Pierre ROUSSEL  Bruce DENBY  

     
    PAPER-Speech and Hearing

      Publicized: 2021/08/24
      Vol: E104-D No:12
      Page(s): 2209-2217

    A Silent Speech Interface (SSI) is a sensor-based, Artificial Intelligence (AI) enabled system in which articulation is performed without the use of the vocal cords, resulting in a voice interface that preserves the ambient audio environment, protects private data, and also functions in noisy environments. Though portable SSIs based on ultrasound imaging of the tongue have obtained Word Error Rates rivaling those of acoustic speech recognition, SSIs remain relegated to the laboratory due to stability issues. Indeed, reliable extraction of acoustic features from ultrasound tongue images in real-life situations has proven elusive. Recently, Representation Learning has shown considerable success in learning the underlying structure of noisy, high-dimensional raw data. In its unsupervised form, Representation Learning can reveal structure in unlabeled data, thus greatly simplifying the data preparation task. In the present article, a 3D Convolutional Neural Network (3DCNN) architecture is applied to unlabeled ultrasound images and is shown to reliably predict future tongue configurations. By comparing the 3DCNN to a simple previous-frame predictor, it is possible to recognize tongue trajectories comprising transitions between regions of stability that correlate with formant trajectories in a spectrogram of the signal. Prospects for using the underlying structural representation to provide features for subsequent speech processing tasks are presented.
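
    A rough sketch of the baseline comparison this abstract mentions: a previous-frame predictor scores each frame by how much it differs from the one before, so runs of low error mark regions of stability and error peaks mark transitions between them. The frame shape, threshold choice, and synthetic data are illustrative assumptions; the paper's 3DCNN predictor is not reproduced here.

    ```python
    import numpy as np

    def previous_frame_errors(frames: np.ndarray) -> np.ndarray:
        """Per-frame error of the trivial predictor frame[t] ~ frame[t-1].

        frames: (T, H, W) ultrasound image sequence (illustrative shape).
        Returns a length T-1 array of mean squared prediction errors.
        """
        diffs = frames[1:].astype(np.float32) - frames[:-1].astype(np.float32)
        return (diffs ** 2).mean(axis=(1, 2))

    def stability_mask(errors: np.ndarray, threshold: float) -> np.ndarray:
        """Frames whose previous-frame error falls below the threshold are
        treated as belonging to a stable tongue configuration; the rest
        mark transitions between stable regions."""
        return errors < threshold

    # Usage with synthetic data standing in for ultrasound tongue images.
    rng = np.random.default_rng(0)
    frames = rng.integers(0, 256, size=(100, 64, 64)).astype(np.uint8)
    errors = previous_frame_errors(frames)
    stable = stability_mask(errors, threshold=np.median(errors))
    print(f"{stable.sum()} of {stable.size} frames classified as stable")
    ```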
