Ryosuke ICHIKAWA Takumi WATANABE Hiroki TAKATSUKA Shiro SUYAMA Hirotsugu YAMAMOTO
We introduce a new aquatic display optical system based on aerial imaging by retro-reflection (AIRR). This system places passive optical components (a beam splitter and a retro-reflector) in water to eliminate disturbances due to water motion. To demonstrate the effectiveness of the proposed optical system, we develop a prototype and compensate for the motion of the water surface. We analyze the motion compensation and quantify its effectiveness using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. These results show that the optical system maintains a static image in water even when the water surface is undulating.
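A minimal sketch of how such a PSNR/SSIM evaluation can be computed with scikit-image (the frame variables are hypothetical and this is not the authors' exact pipeline):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_compensation(reference, compensated):
    """Quantify how close a motion-compensated frame is to the static reference.

    Both inputs are HxW grayscale uint8 arrays; higher PSNR/SSIM means the
    aerial image stayed more stable despite water-surface motion.
    """
    psnr = peak_signal_noise_ratio(reference, compensated, data_range=255)
    ssim = structural_similarity(reference, compensated, data_range=255)
    return psnr, ssim
```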
Kazuya KAKIZAKI Kazuto FUKUCHI Jun SAKUMA
This paper develops certified defenses for deep neural network (DNN) based content-based image retrieval (CBIR) against adversarial examples (AXs). Previous work has focused on certified defenses for classification, which guarantee that no AX causing misclassification exists in a neighborhood of a sample. Such certified defenses, however, cannot be applied to CBIR directly because the goals of adversarial attacks against classification and CBIR are completely different. To develop a certified defense for CBIR, we first define a new notion of certified robustness for CBIR, which guarantees that no AX that changes the ranking results of CBIR exists around the input images. Then, we propose computationally tractable verification algorithms that verify whether a given feature extraction DNN satisfies the certified robustness of CBIR at given input images. Our verification algorithms evaluate upper and lower bounds on the distances between feature representations of perturbed and non-perturbed images, in both deterministic and probabilistic manners. Finally, we propose robust training methods that tighten these upper and lower bounds, yielding feature extraction DNNs for which more inputs satisfy the certified robustness of CBIR. We experimentally show that our proposed certified defenses guarantee robustness deterministically and probabilistically on various datasets.
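The core verification idea, bounding how far a perturbation can move the query in feature space and then checking that the gallery ranking cannot flip, can be sketched as follows. This is a simplified deterministic check assuming a known Lipschitz constant for the feature extractor, not the authors' actual algorithm:

```python
import numpy as np

def ranking_is_certified(query_feat, gallery_feats, lipschitz_const, eps):
    """Certify that no perturbation with ||delta||_2 <= eps can change the ranking.

    If the feature extractor f is L-Lipschitz, any perturbed query feature lies
    within r = L * eps of query_feat. By the triangle inequality, each gallery
    distance changes by at most r, so the ranking is provably stable whenever
    consecutive sorted gallery distances are separated by more than 2r.
    """
    r = lipschitz_const * eps
    dists = np.sort(np.linalg.norm(gallery_feats - query_feat, axis=1))
    gaps = np.diff(dists)
    return bool(np.all(gaps > 2.0 * r))
```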
Synthetic aperture radar (SAR) image generation is crucial to SAR image interpretation when sufficient image samples are unavailable. Against this background, a method for SAR image generation of three-dimensional (3D) targets is proposed in this paper. The method consists of three steps. First, according to the system parameters, the echo signal in the two-dimensional (2D) time domain is generated, and a 2D fast Fourier transform (2D FFT) is applied to it. Second, the hybrid method of moments (MoM)-large-element physical optics (LEPO) solver is used to calculate the scattering characteristics at the frequency points and incident angles dictated by the system parameters. Finally, the range-Doppler algorithm (RDA) is adopted to process the signal in the 2D frequency domain with the radar cross section (RCS) exported from the electromagnetic calculations. These procedures combine RCS computation by the FEKO solver with the RDA to simulate the raw echo signal and then generate SAR image samples for different squint angles and targets at reduced computational load, laying a foundation for transmit waveform design, SAR image interpretation, and other SAR-related work.
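A heavily simplified schematic of how RDA turns echo data into an image is given below; real RDA requires the system parameters, range-cell migration correction, and properly designed matched filters, and the handling of FEKO-exported RCS weights is omitted entirely:

```python
import numpy as np

def rda_image(echo_2d, range_ref, azimuth_ref):
    """Schematic range-Doppler algorithm (RCMC omitted for brevity).

    echo_2d:      raw echo, axis 0 = azimuth (slow time), axis 1 = range (fast time)
    range_ref:    reference chirp replica for range compression
    azimuth_ref:  reference azimuth chirp for azimuth compression
    """
    # Range compression: matched filtering in the range-frequency domain.
    rc = np.fft.ifft(np.fft.fft(echo_2d, axis=1) *
                     np.conj(np.fft.fft(range_ref, n=echo_2d.shape[1])), axis=1)
    # Transform along azimuth into the range-Doppler domain.
    rd = np.fft.fft(rc, axis=0)
    # Azimuth compression with the azimuth matched filter.
    ac = rd * np.conj(np.fft.fft(azimuth_ref, n=echo_2d.shape[0]))[:, None]
    return np.abs(np.fft.ifft(ac, axis=0))
```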
Qi QI Zi TENG Hongmei HUO Ming XU Bing BAI
To super-resolve low-resolution (LR) face images degraded by strong noise and blur, we present a novel approach for noisy face super-resolution (SR) based on three-level information representation constraints. To begin with, we develop a feature distillation network that focuses on extracting pertinent face information, incorporating both statistical anti-interference models and latent contrast algorithms. Subsequently, we incorporate a face identity embedding model and a discrete wavelet transform model, which serve as additional supervision mechanisms for the reconstruction process. The face identity embedding model ensures the reconstruction of identity information in a hypersphere identity metric space, while the discrete wavelet transform model operates in the wavelet domain to supervise the restoration of spatial structures. The experimental results clearly demonstrate the efficacy of our proposed method, as evidenced by lower Learned Perceptual Image Patch Similarity (LPIPS) scores and Fréchet Inception Distances (FID), and by the overall practicability of the reconstructed images.
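One common way to supervise identity on a hypersphere is a cosine-distance loss between normalized embeddings; the sketch below is illustrative only, and the embedding network `face_embedder` and loss weighting are assumptions, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def identity_loss(sr_image, hr_image, face_embedder):
    """Penalize identity drift between super-resolved and ground-truth faces.

    Embeddings are L2-normalized onto the unit hypersphere, so the loss is
    the cosine distance between the two identity representations.
    """
    e_sr = F.normalize(face_embedder(sr_image), dim=-1)
    e_hr = F.normalize(face_embedder(hr_image), dim=-1)
    return (1.0 - (e_sr * e_hr).sum(dim=-1)).mean()
```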
Dinesh DAULTANI Masayuki TANAKA Masatoshi OKUTOMI Kazuki ENDO
Image classification is a typical computer vision task widely used in practical applications. The images used to train image classification networks are often clean, i.e., free of image degradation. However, convolutional neural networks trained on clean images perform poorly on degraded or corrupted images in the real world. In this study, we effectively utilize robust data augmentation (DA) with knowledge distillation to improve the classification performance of degraded images. We first categorize robust data augmentations into geometric-and-color and cut-and-delete DAs. Next, we evaluate the effective positioning of cut-and-delete DA when applying knowledge distillation. Moreover, we experimentally demonstrate that combining the RandAugment and Random Erasing approaches for geometric-and-color and cut-and-delete DA, respectively, improves the generalization of the student network during knowledge transfer for the classification of degraded images.
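The reported combination, RandAugment for geometric-and-color DA and Random Erasing for cut-and-delete DA, is available directly in torchvision; a minimal training pipeline might look like this (the normalization statistics shown are the common ImageNet values, an assumption rather than the paper's setting):

```python
from torchvision import transforms

# RandAugment operates on PIL images, so it precedes ToTensor;
# RandomErasing operates on tensors, so it comes last.
train_transform = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),       # geometric-and-color DA
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),                      # cut-and-delete DA
])
```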
Jingjing LIU Chuanyang LIU Yiquan WU Zuo SUN
As one of the electrical components on transmission lines, the vibration damper plays a role in preventing power lines from galloping, and its recognition is an important task for intelligent inspection. However, due to complex background interference in aerial images, current deep learning algorithms for vibration damper detection often lack accuracy and robustness. To detect vibration dampers more accurately, an improved You Only Look Once (YOLO) model is proposed in this study. Firstly, a damper dataset containing 1900 samples with different scenarios was created. Secondly, the backbone network of YOLOv4 was improved by combining the Res2Net module with Dense blocks, reducing computational consumption and improving training speed. Then, an improved path aggregation network (PANet) structure was introduced into YOLOv4, combining top-down and bottom-up feature fusion strategies to achieve feature enhancement. Finally, the proposed YOLO model and comparative models were trained and tested on the damper dataset. The experimental results and analysis indicate that the proposed model is more effective and robust than the comparative models. More importantly, the average precision (AP) of the model reaches 98.8%, which is 6.2% higher than that of the original YOLOv4 model, and its prediction speed is 62 frames per second (FPS), 5 FPS faster than that of the YOLOv4 model.
In recent years, deep convolutional neural networks (CNNs) have been widely used in synthetic aperture radar (SAR) image recognition. However, because SAR image samples are difficult to obtain, training data are relatively scarce and overfitting easily occurs when traditional CNNs designed for optical image recognition are used. In this paper, a CNN-based SAR image recognition algorithm is proposed that effectively reduces network parameters, avoids model overfitting, and improves recognition accuracy. The algorithm first constructs a convolutional feature extractor with small convolution kernels, then builds a classifier based on convolutional layers, and designs a distance-metric-based loss function. The network is trained in two stages: in the first stage, the distance-metric loss function is used to train the feature extraction network; in the second stage, cross-entropy is used to train the whole model. The public benchmark dataset MSTAR is used for experiments. Comparison experiments show that the proposed method achieves higher accuracy than state-of-the-art algorithms and classical image recognition algorithms. The ablation results confirm the effectiveness of each part of the proposed algorithm.
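The two-stage schedule can be outlined as follows; this is a simplified sketch in which the distance-metric loss is a generic contrastive form, since the paper's exact formulation is not reproduced here:

```python
import torch
import torch.nn.functional as F

def metric_loss(features, labels, margin=1.0):
    """Generic distance-based loss: pull same-class features together,
    push different-class features at least `margin` apart."""
    d = torch.cdist(features, features)                  # pairwise distances
    same = (labels[:, None] == labels[None, :]).float()
    pull = same * d.pow(2)
    push = (1 - same) * F.relu(margin - d).pow(2)
    return (pull + push).mean()

def train_two_stage(extractor, classifier, loader, epochs1, epochs2):
    opt1 = torch.optim.Adam(extractor.parameters())
    for _ in range(epochs1):                             # stage 1: metric learning
        for x, y in loader:
            opt1.zero_grad()
            metric_loss(extractor(x), y).backward()
            opt1.step()
    opt2 = torch.optim.Adam(list(extractor.parameters()) +
                            list(classifier.parameters()))
    for _ in range(epochs2):                             # stage 2: cross-entropy
        for x, y in loader:
            opt2.zero_grad()
            F.cross_entropy(classifier(extractor(x)), y).backward()
            opt2.step()
```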
Toshio SATO Yutaka KATSUYAMA Xin QI Zheng WEN Kazuhiko TAMESUE Wataru KAMEYAMA Yuichi NAKAMURA Jiro KATTO Takuro SATO
Remote video monitoring over networks inevitably introduces a certain degree of communication latency. Although numerous studies have been conducted to reduce latency in network systems, achieving “zero latency” is fundamentally impossible for video monitoring. To address this issue, we investigate a practical method for compensating for latency in video monitoring using video prediction techniques. We apply the lightweight PredNet to predict future frames, whose image quality is evaluated through quantitative image quality metrics and subjective assessment. The evaluation results suggest that, for simple movements of the robot arm, future frames can be predicted up to 333 ms ahead while maintaining acceptable quality. The video prediction method is integrated into a remote monitoring system, and its processing time is also evaluated. We define the object-to-display latency for video monitoring and explore the potential for realizing a zero-latency remote video monitoring system. An evaluation involving simultaneous capture of the robot arm’s movement and the display of the remote monitoring system confirms the feasibility of compensating for an object-to-display latency of several hundred milliseconds by using video prediction. Experimental results demonstrate that our approach can serve as a new compensation method for communication latency.
Lihan TONG Weijia LI Qingxia YANG Liyuan CHEN Peng CHEN
We present Ksformer, which uses Multi-scale Key-select Routing Attention (MKRA) to intelligently select key areas through multi-channel, multi-scale windows with a top-k operator, and a Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features. In our tests, Ksformer outperforms other dehazing methods.
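The key-select idea, keeping only the top-k attention scores per query and discarding the rest, can be sketched in a few lines; this is an illustrative single-head version, with the window partitioning and multi-scale channels of MKRA omitted:

```python
import torch

def topk_attention(q, k, v, top_k):
    """Sparse attention: each query attends only to its top_k highest-scoring keys.

    q, k, v: (N, D) tensors for one window/head; top_k selects the key areas.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (N, N)
    idx = scores.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float('-inf'))
    mask.scatter_(-1, idx, 0.0)                               # keep only top-k slots
    attn = torch.softmax(scores + mask, dim=-1)
    return attn @ v
```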
Multi-focus image fusion combines partially focused images of the same scene to create an all-in-focus image. To address two problems of existing multi-focus image fusion algorithms, namely that benchmark (ground-truth) images are difficult to obtain and that convolutional neural networks focus too heavily on local regions, a fusion algorithm that combines local and global feature encoding is proposed. Initially, we devise two self-supervised image reconstruction tasks and train an encoder-decoder network through multi-task learning. Subsequently, within the encoder, we merge a dense connection module with the PS-ViT module, enabling the network to utilize both local and global information during feature extraction. Finally, to enhance the overall efficiency of the model, distinct loss functions are applied to each task. To preserve the more robust features from the original images, spatial frequency is employed during the fusion stage to obtain the feature map of the fused image. Experimental results demonstrate that, in comparison with twelve other prominent algorithms, our method exhibits good fusion performance in objective evaluation: ten of the twelve evaluation metrics show an improvement of more than 0.28%. It also produces superior visual effects subjectively.
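The spatial-frequency rule used at the fusion stage can be illustrated as follows; this is a per-block sketch, and the block size is an assumption (in the paper, SF is applied to feature maps):

```python
import numpy as np

def spatial_frequency(block):
    """SF = sqrt(row-frequency^2 + column-frequency^2), a sharpness measure."""
    rf = np.mean(np.diff(block, axis=1) ** 2)
    cf = np.mean(np.diff(block, axis=0) ** 2)
    return np.sqrt(rf + cf)

def fuse_by_sf(map_a, map_b, block=8):
    """Pick, block by block, whichever source map is sharper (higher SF)."""
    fused = np.empty_like(map_a)
    h, w = map_a.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            a = map_a[i:i+block, j:j+block]
            b = map_b[i:i+block, j:j+block]
            fused[i:i+block, j:j+block] = (
                a if spatial_frequency(a) >= spatial_frequency(b) else b)
    return fused
```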
This article focuses on improving the BiSeNet v2 bilateral-branch image segmentation network, enhancing its ability to learn spatial details and its overall segmentation accuracy. A modified network called “BiConvNet” is proposed. Firstly, to extract shallow spatial details more effectively, a parallel concatenated strip and dilated (PCSD) convolution module is proposed and used to extract local features and surrounding contextual features in the detail branch. Next, the semantic branch is reconstructed using the lightweight properties of depthwise separable convolution and the high performance of ConvNet, enabling more efficient learning of deep, high-level semantic features. Finally, the bilateral guidance aggregation layer of BiSeNet v2 is fine-tuned to better fuse the feature maps output by the detail branch and the semantic branch. The experimental section discusses the contributions of strip convolution and dilated convolutions of different sizes to image segmentation accuracy, comparing them with common convolutions such as Conv2d, CG convolution, and CCA convolution. The experiments show that the proposed PCSD convolution module achieves the highest segmentation accuracy across all categories of the Cityscapes dataset compared with common convolutions. BiConvNet achieves a 9.39% accuracy improvement over the BiSeNet v2 network with only a slight increase of 1.18M in model parameters, reaching an mIoU of 68.75% on the validation set. Furthermore, comparative experiments with autonomous driving image segmentation algorithms commonly used in recent years demonstrate that BiConvNet is strongly competitive in segmentation accuracy on the Cityscapes and BDD100K datasets.
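One plausible reading of the PCSD module, parallel strip convolutions for thin local structures alongside dilated convolutions for surrounding context, concatenated into one output, is sketched below; the kernel sizes, dilation rate, and channel split are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PCSD(nn.Module):
    """Parallel concatenated strip-and-dilated convolution (illustrative)."""
    def __init__(self, c_in, c_out, k=7, dilation=3):
        super().__init__()
        c = c_out // 4
        # Strip convolutions capture long, thin local structures cheaply.
        self.strip_h = nn.Conv2d(c_in, c, (1, k), padding=(0, k // 2))
        self.strip_v = nn.Conv2d(c_in, c, (k, 1), padding=(k // 2, 0))
        # A dilated convolution enlarges the receptive field for context.
        self.dil = nn.Conv2d(c_in, c, 3, padding=dilation, dilation=dilation)
        self.conv = nn.Conv2d(c_in, c_out - 3 * c, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.strip_h(x), self.strip_v(x),
                         self.dil(x), self.conv(x)], dim=1)
        return self.act(out)
```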
Xiangrun LI Qiyu SHENG Guangda ZHOU Jialong WEI Yanmin SHI Zhen ZHAO Yongwei LI Xingfeng LI Yang LIU
Automated tongue segmentation plays a crucial role in computer-aided tongue diagnosis. The challenge lies in developing algorithms that achieve higher segmentation accuracy while requiring less memory and offering swift inference. To address this issue, we propose Pool-unet, a novel model integrating Pool-former and multi-task mask learning for tongue image segmentation. First of all, we collected 756 tongue images taken in various shooting environments and from different angles, and accurately labeled the tongues under the guidance of a medical professional. Second, we propose the Pool-unet model, combining a hierarchical Pool-former module with a U-shaped symmetric encoder-decoder with skip connections, which uses a patch embedding layer for down-sampling and a patch expanding layer for up-sampling to maintain spatial resolution, effectively capturing global and local information with fewer parameters and faster inference. Finally, a multi-task mask learning strategy is designed, which improves the generalization and anti-interference ability of the model through multi-task pre-training and self-supervised fine-tuning stages. Experimental results on the tongue dataset show that, compared with the state-of-the-art method (OET-NET), our method has 25% fewer model parameters, achieves 22% faster inference, and improves Mean Intersection over Union (MIoU) and Mean Pixel Accuracy (MPA) by 0.91% and 0.55%, respectively.
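The Pool-former module referenced here follows the PoolFormer idea of replacing self-attention with pooling as the token mixer; a minimal sketch of that mixer is shown below (normalization and MLP details of the full block are omitted, and whether Pool-unet uses exactly this form is an assumption):

```python
import torch.nn as nn

class PoolMixer(nn.Module):
    """PoolFormer-style token mixer: average pooling in place of self-attention.

    Subtracting the identity makes the operation residual-friendly, so the
    block mixes tokens at far lower cost than attention.
    """
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x
```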
Rina TAGAMI Hiroki KOBAYASHI Shuichi AKIZUKI Manabu HASHIMOTO
Due to the revitalization of the semiconductor industry and efforts to reduce labor and introduce unmanned operations in the retail and food manufacturing industries, the objects to be recognized at production sites are increasingly diverse in color and design. Depending on the target object, it may be more reliable to process only color information, only intensity information, or a combination of the two. However, few conventional methods optimize which color and intensity information to use, and deep learning is too costly for production sites. In this paper, within the framework of template matching, we optimize the combination of color and intensity information for a small number of pixels used in matching, on the basis of the mutual relationship between the target object and surrounding objects, and we propose a fast and reliable matching method using these few pixels. Pixels with a low pixel-pattern frequency are selected from color and grayscale images of the target object, and from these, pixels that are highly discriminative from surrounding objects are carefully chosen. The use of both color and intensity information makes the method highly versatile with respect to object design, while the use of a small number of pixels not shared by the target and surrounding objects provides high robustness to the surroundings and enables fast matching. Experiments using real images confirmed that, when 14 pixels are used for matching, the processing time is 6.3 ms and the recognition success rate is 99.7%. The proposed method also showed better positional accuracy than the comparison method, and the optimized pixels achieved a higher recognition success rate than non-optimized pixels.
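The matching step itself reduces to comparing a handful of pre-selected pixels at each candidate position; a simplified grayscale-only sketch follows (the color/intensity channel selection and the pixel-optimization step are the paper's contribution and are not reproduced here):

```python
import numpy as np

def sparse_template_match(image, pixel_coords, pixel_values, tol=10):
    """Slide a sparse template over the image; score = count of matching pixels.

    pixel_coords: (K, 2) row/col offsets of the K selected template pixels
    pixel_values: (K,)  expected intensities at those offsets
    Because K is tiny (e.g. 14), each candidate position costs only K lookups.
    """
    h, w = image.shape
    th = pixel_coords[:, 0].max() + 1
    tw = pixel_coords[:, 1].max() + 1
    best_score, best_pos = -1, None
    for y in range(h - th + 1):
        for x in range(w - tw + 1):
            vals = image[y + pixel_coords[:, 0], x + pixel_coords[:, 1]]
            score = int(np.sum(np.abs(vals.astype(int) - pixel_values) <= tol))
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```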
Zheqing ZHANG Hao ZHOU Chuan LI Weiwei JIANG
Single-image dehazing is a challenging task in computer vision research. To address the limited representation capability of traditional convolutional neural networks and the high computational overhead of the self-attention mechanism, we propose image attention and design a single-image dehazing network based on it: IAD-Net. The proposed image attention is a plug-and-play module with global modeling ability. IAD-Net is a parallel network structure that combines the global modeling ability of image attention with the local modeling ability of convolution, so that the network can learn both global and local features. The proposed model has strong feature learning and feature expression abilities, incurs low computational overhead, and improves the recovery of detail in hazy images. Experiments verify the effectiveness of the image attention module and the competitiveness of IAD-Net with state-of-the-art methods.
Shi BAO Xiaoyan SONG Xufei ZHUANG Min LU Gao LE
Images with rich color information are an important source of information that people obtain from the objective world. However, it is difficult for people with red-green color vision deficiency to obtain color information from color images. We propose a color correction method for dichromats based on their physiological characteristics that takes hue information into account. First, the hue loss of color pairs under normal color vision is defined. An objective function is then constructed on its basis, and the resulting image is obtained by minimizing it. Finally, the effectiveness of the proposed method is verified through comparison tests. People with red-green color vision deficiency fail to distinguish certain red and green colors. When the line connecting red and green is parallel to the a* axis of CIE L*a*b*, they cannot distinguish the color pair, but they can distinguish pairs whose connecting line is parallel to the b* axis. Therefore, color correction yields the greatest benefit for color pairs whose connecting line is parallel to the a* axis. When a color is corrected, the hue loss between the two colors under normal color vision is supplemented along b* so that individuals with red-green color vision deficiency can distinguish the color pair. The magnitude of the correction is greatest when the connecting line of a color pair is parallel to the a* axis, and no correction is applied when the connecting line is parallel to the b* axis. The objective evaluation results show that the method achieves a higher score, indicating that it can maintain the naturalness of the image while reducing confusing colors.
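The geometric intuition, compensating along b* in proportion to how closely a color pair aligns with the a* axis, can be sketched per pair as follows; this is purely illustrative, whereas the paper's actual method minimizes a hue-loss objective over the whole image:

```python
import numpy as np

def compensate_pair(lab_ref, lab_target, strength=1.0):
    """Illustrative b* compensation for a confusable color pair in CIE L*a*b*.

    The correction is proportional to how closely the line connecting the
    pair aligns with the a* axis: maximal when parallel to a* (invisible to
    red-green dichromats), zero when parallel to b* (already visible).
    """
    da = lab_target[1] - lab_ref[1]
    db = lab_target[2] - lab_ref[2]
    norm = np.hypot(da, db)
    if norm == 0.0:
        return np.asarray(lab_target, dtype=float)
    alignment = abs(da) / norm      # 1.0 if parallel to a*, 0.0 if parallel to b*
    corrected = np.asarray(lab_target, dtype=float)
    corrected[2] += strength * alignment * norm   # supplement the hue loss on b*
    return corrected
```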
Qi LIU Bo WANG Shihan TAN Shurong ZOU Wenyi GE
For flight simulators, it is crucial to create three-dimensional terrain from clear remote sensing images. However, due to haze and other factors, the obtained remote sensing images typically have low contrast and blurry features. To build a flight simulator visual system, we propose a deep learning-based dehazing model for remote sensing images. The proposed encoder-decoder architecture consists of a multiscale fusion module and a gated large-kernel convolutional attention module; it fuses multi-resolution global and local semantic features and can adaptively extract image features over complex terrain. The experimental results demonstrate that the model outperforms existing comparison techniques, generalizes well in application, and achieves high-confidence dehazing of remote sensing images across a variety of haze concentrations, complex terrains, and spatial resolutions.
Yuhao LIU Zhenzhong CHU Lifei WEI
In the realm of Single Image Super-Resolution (SISR), the meticulously crafted Nonlocal Sparse Attention-based block demonstrates its efficacy in noise reduction and in lowering the computational cost of nonlocal (global) features. However, it neglects the traditional convolution-based block, which is proficient at handling local features. Merging the Nonlocal Sparse Attention-based block and the convolution-based block to concurrently manage local and nonlocal features therefore poses a significant challenge. To tackle these issues, this paper introduces the Channel Contrastive Attention-based Local-Nonlocal Mutual block (CCLN) for Super-Resolution (SR). (1) We introduce the CCLN block, encompassing a Local Sparse Convolutional block for local features and a Nonlocal Sparse Attention-based block for nonlocal features. (2) We introduce Channel Contrastive Attention (CCA) blocks, incorporating sparse aggregation into the convolutional blocks; additionally, we introduce a robust framework to fuse the two branches, ensuring that each operates according to its respective strengths. (3) The CCLN block can seamlessly integrate into established network backbones such as the Enhanced Deep Super-Resolution network (EDSR), yielding the Channel Contrastive Attention-based Local-Nonlocal Mutual Network (CCLNN). Experimental results show that CCLNN effectively leverages both local and nonlocal features, outperforming other state-of-the-art algorithms.
Zhichao SHA Ziji MA Kunlai XIONG Liangcheng QIN Xueying WANG
Diagnosis at an early stage is clinically important for the cure of skin cancer. However, because some skin cancers have similar visual characteristics and dermatologists rely on subjective experience to distinguish cancer types, accuracy is often suboptimal. Recently, the introduction of computational methods in the medical field has helped physicians improve recognition rates, but challenges remain. Faced with massive dermoscopic image data, the residual network (ResNet) is well suited to learning feature relationships inside big data because of its network depth. To address ResNet's deficiencies, this paper proposes a multi-region feature extraction and dimension-raising matching method, which further improves the utilization of medical image features. The method first extracts rich and diverse features from multiple regions of the feature map, avoiding the traditional residual module's tendency to repeatedly extract features from a few fixed regions. Then, the fused features are strengthened by raising the dimensionality of the branch-path information and stacking it with the main path, which solves the problem that fusing the two paths yields unsatisfactory information when their dimensionalities differ. The proposed method is evaluated on the International Skin Imaging Collaboration (ISIC) Archive dataset, which contains more than 40,000 images. On this dataset and others, the results improve over networks containing traditional residual modules and over several popular networks.
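The dimension-matching idea, raising the branch path's channel count so it can be stacked with the main path, is commonly realized with a 1x1 convolution; the sketch below is an illustrative residual variant under that assumption, with the multi-region extraction itself omitted:

```python
import torch.nn as nn

class RaisedDimResidual(nn.Module):
    """Residual unit whose branch path is up-dimensioned to match the main path.

    A 1x1 convolution lifts the branch from c_in to c_out channels so that
    the element-wise fusion with the main path is dimensionally consistent.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.branch = nn.Sequential(                 # dimension-raising path
            nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.branch(x))
```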
Qi QI Liuyi MENG Ming XU Bing BAI
In face super-resolution reconstruction, interference from the texture and color of the hair region with the details and contours of the face region can degrade the reconstruction results. This paper proposes a semantic-based, dual-branch face super-resolution algorithm to address the varying reconstruction difficulty of, and mutual interference among, different pixel semantics in face images. The algorithm clusters pixel semantics into a hierarchical representation, distinguishing facial pixel regions from hair pixel regions. Independent image enhancement is then applied to these distinct regions to mitigate their interference, resulting in vivid super-resolved face images.
Hua HUANG Yiwen SHAN Chuan LI Zhi WANG
Image denoising is an indispensable step in many high-level tasks in image processing and computer vision. However, traditional low-rank minimization-based methods suffer from bias because only the noisy observation is used to estimate the underlying clean matrix. To overcome this issue, a new low-rank minimization-based method, called nuclear norm minus Frobenius norm rank residual minimization (NFRRM), is proposed for image denoising. The proposed method transforms the ill-posed image denoising problem into rank residual minimization problems by exploiting the nonlocal self-similarity prior. The NFRRM model can accurately estimate the underlying clean matrix by treating each rank residual component flexibly. More importantly, the global optimum of the NFRRM model can be obtained in closed form. Extensive experiments demonstrate that the proposed NFRRM method outperforms many state-of-the-art image denoising methods.
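For reference, the nuclear norm minus Frobenius norm of a matrix $X$ with singular values $\sigma_i(X)$ takes the form below (with a weight $\alpha > 0$, an assumed parameterization); how NFRRM applies it to each rank residual component is detailed in the paper itself:

```latex
\|X\|_{*} - \alpha\,\|X\|_{F}
  \;=\; \sum_{i} \sigma_i(X) \;-\; \alpha \sqrt{\sum_{i} \sigma_i(X)^2},
  \qquad \alpha > 0.
```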