Wei WU Dazhi ZHANG Jilei HOU Yu WANG Tao LU Huabing ZHOU
In this letter, we propose a semantic-guided infrared and visible image fusion method that trains a network to fuse different semantic objects with different fusion weights according to their own characteristics. First, we design appropriate fusion weights for each semantic object instead of for the whole image. Second, we employ semantic segmentation to obtain the semantic region of each object and generate dedicated weight maps for the infrared and visible images from the pre-designed fusion weights. Third, we feed the weight maps into the loss function to guide the image fusion process. The trained fusion network generates fused images with better visual effect and a more comprehensive scene representation. Moreover, we can enhance the modal features of the various semantic objects, benefiting subsequent tasks and applications. Experimental results demonstrate that our method outperforms the state-of-the-art in terms of both visual effect and quantitative metrics.
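As a rough illustration of the third step, the sketch below (PyTorch) shows one plausible form of a weight-map-guided fusion loss; the tensor names and the L1 form are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def fusion_loss(fused, ir, vis, w_ir, w_vis):
    # w_ir and w_vis are per-pixel weight maps built offline from a semantic
    # segmentation, so each semantic region can favour either modality.
    loss_ir = (w_ir * (fused - ir).abs()).mean()
    loss_vis = (w_vis * (fused - vis).abs()).mean()
    return loss_ir + loss_vis
```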
We propose a new framework for estimating depth information from a single image. Our framework is relatively small and straightforward, employing a two-stage architecture: a residual network and a simple decoder network. The residual network is a remodeled version of the original ResNet-50 architecture, consisting of only thirty-eight convolution layers in the residual blocks followed by a pair of up-sampling layers. The simple decoder network, a stack of five convolution layers, accepts the initial depth map and refines it into the final output depth. During training, we monitor the loss behavior and adjust the learning rate hyperparameter to improve performance. Furthermore, instead of using a single common pixel-wise loss, we also compute losses based on the gradient direction and the structural similarity. This design significantly reduces the number of network parameters while yielding a more accurate depth map. The performance of our approach has been evaluated through both quantitative and qualitative comparisons with several prior related methods on the public NYU and KITTI datasets.
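A minimal sketch of such a combined loss, assuming PyTorch tensors of shape (B, 1, H, W); the gradient term uses finite differences, and the structural-similarity term is only indicated in a comment since its standard windowed implementation is longer.

```python
import torch

def depth_loss(pred, gt, alpha=1.0):
    l_pixel = (pred - gt).abs().mean()            # common pixel-wise L1 term
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]   # horizontal finite differences
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]   # vertical finite differences
    dx_g = gt[..., :, 1:] - gt[..., :, :-1]
    dy_g = gt[..., 1:, :] - gt[..., :-1, :]
    l_grad = (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()
    return l_pixel + alpha * l_grad               # an SSIM term would be added analogously
```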
Thi Thu Thao KHONG Takashi NAKADA Yasuhiko NAKASHIMA
Adversarial attacks are viewed as a danger to Deep Neural Networks (DNNs), revealing a weakness of deep learning models in security-critical applications. Recent studies have presented adversarial training as an outstanding defense against adversaries. Nonetheless, adversarial training is challenging for big datasets and large networks, and it is believed that, without making DNN architectures larger, it is hard to strengthen their robustness to adversarial examples. To avoid iterative adversarial training, we propose Bayes without Bayesian Learning (BwoBL), an algorithm that performs ensemble inference to improve robustness. As an application of transfer learning, we use the learned parameters of pretrained DNNs to build Bayesian Neural Networks (BNNs) and focus on Bayesian inference without the cost of Bayesian learning. Without any adversarial training, our method is more robust than activation functions designed to enhance adversarial robustness. Moreover, BwoBL can easily be integrated into any pretrained DNN, not only Convolutional Neural Networks (CNNs) but also other DNNs such as Self-Attention Networks (SANs), which outperform their convolutional counterparts. BwoBL is also convenient to apply to scaled networks, e.g., ResNet and EfficientNet, with better performance. In particular, our algorithm employs a variety of DNN architectures to construct BNNs against a diversity of adversarial attacks on a large-scale dataset. For example, under an l∞-norm PGD attack with pixel perturbation ε=4/255 and 100 iterations on ImageNet, our proposal built on naturally pretrained ResNets, SANs, and EfficientNets increases top-5 accuracy by 58.18% on average; this enhancement is 62.26% on average under the l2-norm C&W attack. The combination of our proposed method with EfficientNets pretrained on both natural and adversarial images (EfficientNet-ADV) drastically boosts robustness against PGD and C&W attacks without additional training. Our EfficientNet-ADV-B7 achieves cutting-edge top-5 accuracy of 92.14% and 94.20% on adversarial ImageNet generated by powerful PGD and C&W attacks, respectively.
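A minimal sketch of ensemble inference over naturally pretrained networks in the spirit of BwoBL; the model choices and the plain averaging of softmax outputs are illustrative assumptions, and the paper's exact Bayesian construction is not reproduced here.

```python
import torch
import torchvision.models as models

# Build an ensemble from off-the-shelf pretrained networks (no extra training).
nets = [models.resnet50(pretrained=True).eval(),
        models.resnet101(pretrained=True).eval()]

@torch.no_grad()
def ensemble_predict(x):
    # Average the predictive distributions of the ensemble members.
    probs = torch.stack([torch.softmax(net(x), dim=1) for net in nets])
    return probs.mean(dim=0)
```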
Xinran LIU Zhongju WANG Long WANG Chao HUANG Xiong LUO
In this paper, a hybrid Retinex-based image enhancement algorithm is proposed to improve the quality of images captured by unmanned aerial vehicles (UAVs). Hyperparameters of the employed multi-scale Retinex with chromaticity preservation (MSRCP) model are automatically tuned via a two-phase evolutionary computing algorithm. In the two-phase optimization algorithm, the Rao-2 algorithm first performs the global search, obtaining a solution that maximizes the objective function. Next, the Nelder-Mead simplex method improves this solution via local search. Real low-quality UAV-captured images are collected to verify the performance of the proposed algorithm. Four well-known image enhancement algorithms, Multi-Scale Retinex, Multi-Scale Retinex with Color Restoration, Automated Multi-Scale Retinex, and MSRCP, are utilized as benchmarking methods. In addition, two commonly used evolutionary computing algorithms, particle swarm optimization and the flower pollination algorithm, are considered to verify the efficiency of the proposed method in tuning the parameters of the MSRCP model. Experimental results demonstrate that the proposed method achieves the best performance among the benchmarks and is thus applicable to real UAV-based applications.
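A simplified sketch of the two-phase scheme: a Rao-2-style population update for the global search, followed by Nelder-Mead refinement via SciPy. Here f is a placeholder cost (maximizing the quality objective is equivalent to minimizing its negative), and the update rule is a common simplified form, not necessarily the paper's exact one.

```python
import numpy as np
from scipy.optimize import minimize

def two_phase_tune(f, lb, ub, pop=20, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(pop, len(lb)))     # random initial population
    for _ in range(iters):
        scores = np.apply_along_axis(f, 1, X)
        best, worst = X[scores.argmin()], X[scores.argmax()]
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        partner = X[rng.permutation(pop)]            # random interaction partner
        # Rao-2-style move: attraction to best/worst plus a pairwise term.
        X = np.clip(X + r1 * (best - worst) + r2 * (np.abs(X) - np.abs(partner)), lb, ub)
    x0 = X[np.apply_along_axis(f, 1, X).argmin()]    # best candidate from phase 1
    return minimize(f, x0, method='Nelder-Mead').x   # phase 2: local refinement
```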
Yuya KAMATAKI Yusuke KAMEDA Yasuyo KITA Ichiro MATSUDA Susumu ITOH
This paper proposes a lossless coding method for HDR color images stored in a floating-point format called Radiance RGBE. In this method, the three mantissa parts and the common exponent part, each represented at 8-bit depth, are encoded using a block-adaptive prediction technique with some modifications that take the data structure into account.
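For context, Radiance RGBE packs three 8-bit mantissas with a shared 8-bit exponent per pixel; a common decoding convention is sketched below. The proposed coder works on the four 8-bit planes directly rather than on the decoded floats.

```python
import numpy as np

def rgbe_to_float(r, g, b, e):
    # One common convention: channel = mantissa * 2^(e - 128) / 256.
    scale = np.ldexp(1.0, int(e) - (128 + 8))
    return r * scale, g * scale, b * scale
```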
Hao ZHOU Zhuangzhuang ZHANG Yun LIU Meiyan XUAN Weiwei JIANG Hailing XIONG
Single-image dehazing algorithms based on the Dark Channel Prior (DCP) are widely known, and more and more DCP-based dehazing algorithms have been proposed. However, we find that it is more effective to apply the DCP to RAW images before the ISP pipeline. In addition, to address the failure of the DCP in sky areas, we propose an algorithm that segments the sky region and compensates the transmission. Extensive experimental results on both subjective and objective evaluations demonstrate that the performance of the modified DCP (MDCP) is greatly improved and competitive with state-of-the-art methods.
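For reference, the dark channel at the core of DCP/MDCP can be computed as below; the 15-pixel patch is a conventional choice, not necessarily the paper's.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    # img: H x W x 3 float array. Minimum over colour channels,
    # then a local minimum filter over each patch.
    return minimum_filter(img.min(axis=2), size=patch)
```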
Yue LI Xiaosheng YU Haijun CAO Ming XU
An autoencoder is trained to generate the background from a surveillance image by setting the training label to a shuffled version of the input, instead of the input itself as in a traditional autoencoder. Multi-scale features are then extracted by a sparse autoencoder from the surveillance image and the corresponding background to detect the foreground.
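A minimal sketch of one plausible reading of the shuffled-label trick: the reconstruction target is a randomly permuted batch of frames, so only the static background can be reproduced consistently. All names here are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(autoencoder, frames, optimizer):
    target = frames[torch.randperm(frames.size(0))]  # shuffled input used as the label
    loss = F.mse_loss(autoencoder(frames), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```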
A method for detecting the timing of photodiode (PD) saturation without using an in-pixel time-to-digital converter (TDC) is proposed. Detecting the PD saturation time is an approach to extending the dynamic range of a CMOS image sensor (CIS) without multiple exposures: in addition to the charges accumulated in a PD, the PD saturation time can be used as a signal related to light intensity. However, previously reported CISs that detect the PD saturation time use an in-pixel TDC to detect and store it, which lowers the resolution of the CIS because an in-pixel TDC requires a large area. In the proposed pixel circuit, the PD saturation time is detected and stored as a voltage in a capacitor. The voltage is read out and converted to a digital code by a column ADC after the exposure, so no in-pixel TDC is required. A signal-processing and calibration method for linearly combining the two signals, i.e., the saturation time and the accumulated charges, is also proposed. Circuit simulations confirmed that the proposed method extends the dynamic range by 36 dB, to a total of 95 dB, and that the calibration is effective.
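A toy numeric sketch of the resulting two-segment pixel response: below saturation the accumulated charge encodes intensity, while after saturation the intensity is inversely proportional to the stored saturation time. Variable names and the linear alignment are illustrative assumptions.

```python
def pixel_intensity(charge, t_sat, q_sat, t_exp, saturated, gain=1.0):
    if not saturated:
        return gain * charge / t_exp   # ordinary accumulated-charge signal
    # Saturated pixel: intensity ~ full-well charge / saturation time;
    # the calibration step aligns the slopes of the two segments.
    return gain * q_sat / t_sat
```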
Lin CAO Xibao HUO Yanan GUO Kangning DU
Sketch face recognition refers to matching photos with sketches and has been used effectively in applications ranging from law enforcement to digital entertainment. However, due to the large modality gap between photos and sketches, sketch face recognition remains a challenging task. To reduce the domain gap between sketches and photos, this paper proposes a cascaded transformation generation network for cross-modality image generation and sketch face recognition simultaneously. The proposed network is composed of a generation module, a cascaded feature transformation module, and a classifier module. The generation module generates a high-quality cross-modality image; the cascaded feature transformation module extracts high-level semantic features for generation and recognition simultaneously; and the classifier module completes the sketch face recognition. The proposed transformation generation network is trained in an end-to-end manner, and the generated images strengthen the recognition accuracy. The recognition performance is verified on the UoM-SGFSv2, e-PRIP, and CUFSF datasets; experimental results show that the proposed method outperforms other state-of-the-art methods.
Yung-Hui LI Muhammad Saqlain ASLAM Latifa Nabila HARFIYA Ching-Chun CHANG
The recent development of deep learning-based generative models has sharply intensified interest in data synthesis and its applications. Data synthesis is especially important for pattern recognition tasks in which some classes of data are rare and difficult to collect. In an iris dataset, for instance, the minority-class samples include images of eyes with glasses, oversized or undersized pupils, misaligned iris locations, and irises occluded or contaminated by eyelids, eyelashes, or lighting reflections. Such class-imbalanced datasets often result in biased classification performance. Generative adversarial networks (GANs) are one of the most promising frameworks that learn to generate synthetic data through a two-player minimax game between a generator and a discriminator. In this paper, we utilize the state-of-the-art conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP) to generate minority-class iris images, saving the substantial human labor required to collect rare data. With our model, researchers can generate as many iris images of rare cases as needed, which helps in developing deep learning algorithms whenever a large dataset is required.
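For reference, the gradient penalty that CWGAN-GP inherits from WGAN-GP pushes the critic's gradient norm towards 1 at points interpolated between real and generated samples; below is a standard PyTorch sketch, with the conditional-critic signature assumed.

```python
import torch

def gradient_penalty(critic, real, fake, labels, lam=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = critic(x_hat, labels)  # conditional critic also sees the class labels
    grads = torch.autograd.grad(out.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```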
Feature detection and matching account for most of the processing time in image matching, and this time increases dramatically with the number of feature points. The number of features therefore needs to be controlled for specific applications. This paper proposes a feature detection method based on the significancy of local features. The feature significancy is computed for all pixels, and the most significant features are chosen while taking their spatial distribution into account. The method reduces the number of features needed to match two images while maintaining high matching accuracy. Experiments showed that this approach was on average about two times faster than the FAST detector for natural scene images.
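One plausible sketch of significance-based selection under a spatial-distribution constraint: greedy picking in decreasing significance order with a suppression radius, similar in spirit to adaptive non-maximal suppression (the paper's exact criterion may differ).

```python
import numpy as np

def select_features(points, significance, k, radius):
    # points: N x 2 array of (x, y); significance: N scores.
    order = np.argsort(-significance)
    kept = []
    for i in order:
        if all(np.hypot(*(points[i] - points[j])) > radius for j in kept):
            kept.append(i)
            if len(kept) == k:
                break
    return points[kept]
```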
Pengtao JIA Qi ZHAO Boze LI Jing ZHANG
Gait recognition distinguishes one individual from others according to the natural patterns of human gaits. It is a challenging signal processing technology for biometric identification due to the ambiguity of contours and the complex feature extraction procedure. In this work, we propose a new model, the convolutional neural network (CNN) joint attention mechanism (CJAM), to classify gait sequences and conduct person identification using the CASIA-A and CASIA-B gait datasets. The CNN extracts gait features, and the attention mechanism continuously focuses on the most discriminative area to achieve person identification. We present a complete pipeline from gait image preprocessing to final identification. The results from 12 experiments show that the new attention model leads to a lower error rate than others. The CJAM model improves on the 3D-CNN, the CNN-LSTM (long short-term memory), and a simple CNN by 8.44%, 2.94%, and 1.45%, respectively.
Zimin ZHAO Ying KANG Aiqin HOU Daguang GAN
Differentiable neural architecture search (DARTS) is a widely disseminated weight-sharing neural architecture search method consisting of two stages: search and evaluation. However, the original DARTS suffers from some well-known shortcomings. First, the width and depth of the network, as well as the operations, are discontinuous between the two stages, which causes a performance collapse. Second, DARTS has a high computational overhead. In this paper, we propose a synchronous progressive approach to solve the discontinuity problem for network depth and width, and we use a 0-1 loss function to alleviate the discontinuity caused by the discretization of operations. The computational overhead is reduced by using partial channel connections. In addition, we discuss and propose a solution to the aggregation of skip connections during the DARTS search process. We conduct extensive experiments on the CIFAR-10 and WANFANG datasets; our approach significantly reduces the search time (from 1.5 to 0.1 GPU days) and improves image recognition accuracy.
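For reference, partial channel connection (as popularized by PC-DARTS) routes only a 1/K fraction of the channels through the candidate operations while the rest bypass, which cuts memory and search cost; a minimal PyTorch sketch, omitting the channel shuffle that usually follows.

```python
import torch

def partial_channel(x, mixed_op, k=4):
    c = x.size(1) // k                 # only 1/k of the channels are searched
    head, tail = x[:, :c], x[:, c:]
    return torch.cat([mixed_op(head), tail], dim=1)
```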
Seiichi KOJIMA Noriaki SUETAKE
LIME is a method for low-light image enhancement. Although LIME significantly enhances the contrast in dark regions, its contrast enhancement tends to be insufficient in bright regions. In this letter, we propose an improved LIME method in which the contrast in bright regions is improved while the contrast enhancement effect in dark regions is maintained.
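For context, LIME's core Retinex decomposition estimates an illumination map T and recovers the enhanced image as I/T; a crude sketch is given below, where the max-channel initialization and the gamma value are conventional choices and the proposed modification for bright regions is not reproduced.

```python
import numpy as np

def lime_enhance(img, gamma=0.8, eps=1e-3):
    # img: H x W x 3 float array in [0, 1].
    t = img.max(axis=2) ** gamma                 # crude illumination estimate
    return img / np.maximum(t, eps)[..., None]   # Retinex recovery I / T
```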
Ryosuke NISHIHARA Hidehiko MATSUBAYASHI Tomomoto ISHIKAWA Kentaro MORI Yutaka HATA
The frequency of uterine peristalsis is closely related to the success rate of pregnancy, and ultrasonic imaging is almost always employed to measure this frequency. The physician subjectively evaluates the frequency from the ultrasound image by the naked eye. This paper aims to measure the frequency of uterine peristalsis from the ultrasound image. The ultrasound image consists of relative brightness values, and the contour of the uterus is not clear. The frequency cannot be measured using inter-frame differencing or optical flow, the representative motion detection methods, because the uterine peristaltic movement is too small for them. This paper proposes a method for measuring the frequency of uterine peristalsis from ultrasound images in the implantation phase. First, the uterine peristalsis is semi-automatically traced from the images along the location and time axes. Second, a frequency analysis of the uterine peristalsis is performed by applying a Fourier transform over 3 minutes. The frequency of uterine peristalsis is then identified as the dominant frequency, i.e., the component with the maximum value in the frequency spectrum. Thereby, we quantitatively evaluate the frequency of uterine peristalsis from the ultrasound image. Finally, the success rate of pregnancy is calculated from the frequency using fuzzy logic. This enables us to evaluate the success rate of pregnancy by measuring the uterine peristalsis from the ultrasound image.
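A minimal sketch of the frequency-analysis step: the peristalsis frequency is read off as the largest spectral peak of the traced motion signal. Here fs is the frame rate and the names are illustrative.

```python
import numpy as np

def dominant_frequency(signal, fs):
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[spectrum[1:].argmax() + 1]  # skip the DC bin
```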
Shan HE Yuanyao LU Shengnan CHEN
The development of deep learning and neural networks has brought broad prospects to computer vision and natural language processing. The image captioning task combines cutting-edge methods from both fields, and building an end-to-end encoder-decoder model can greatly improve description performance. In this paper, a multi-branch deep convolutional neural network is used as the encoder to extract image features, and a recurrent neural network is used to generate descriptive text matching the input image. We conducted experiments on the Flickr8k, Flickr30k, and MSCOCO datasets. The analysis of the experimental results on the evaluation metrics shows that the proposed model achieves effective image captioning, and its performance is better than that of classic image captioning models such as neural image annotation models.
Xiongfei SHAN Mingyang PAN Depeng ZHAO Deqiang WANG Feng-Jang HWANG Chi-Hua CHEN
During the detection of maritime targets, jitter of the shipborne camera usually causes video instability and false or missed detections of targets. To tackle this problem, a novel algorithm for maritime target detection based on electronic image stabilization technology is proposed in this study. The algorithm comprises three models: the points line model (PLM), the points classification model (PCM), and the image classification model (ICM). The feature points (FPs) are first classified by the PLM, and stable videos as well as target contours are obtained by the PCM. The smallest bounding rectangles of the target contours, generated as candidate bounding boxes (bboxes), are then sent to the ICM for classification. In the experiments, the ICM, which is constructed based on a convolutional neural network (CNN), is trained and its effectiveness is verified. Our experimental results demonstrate that the proposed algorithm outperformed the benchmark models on all the common metrics, reducing the mean square error (MSE) by at least 47.87% and improving the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and mean average precision (mAP) by at least 8.66%, 6.94%, and 5.75%, respectively. The proposed algorithm is superior to the state-of-the-art techniques in both image stabilization and target ship detection, providing reliable technical support for the visual development of unmanned ships.
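A minimal OpenCV sketch of the candidate-bbox step: the smallest (rotated) bounding rectangle of each target contour becomes a candidate box for the ICM. The binary mask input is an assumption about the PCM's output.

```python
import cv2
import numpy as np

def candidate_bboxes(mask):
    # mask: uint8 binary image of target contours from the stabilized video.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boxPoints(cv2.minAreaRect(c)).astype(np.int32) for c in contours]
```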
Thi Diem TRAN Yasuhiko NAKASHIMA
Convolutional neural networks (CNNs) have dominated a range of applications, from advanced manufacturing to autonomous cars. For energy cost-efficiency, developing low-power hardware for CNNs is a research trend. Due to the large input size, the first few convolutional layers generally consume most of the latency and hardware resources in a hardware design. To address these challenges, this paper proposes an innovative architecture named SLIT that extracts feature maps and reconstructs the first few layers of a CNN. With this reconstruction approach, all multiply-accumulate operations are eliminated from the first layers. We evaluate the new topology on the MNIST, CIFAR, SVHN, and ImageNet datasets for image classification. The latency and hardware resources of the inference step are evaluated on a ZC7Z020-1CLG484C FPGA with LeNet-5 and VGG schemes. For the LeNet-5 scheme, our architecture reduces latency by 39% and hardware resources by 70% with a power consumption of 0.456 W compared to previous works. Even though the VGG models achieve only a 10% reduction in hardware resources and latency, we hope our overall results will give a new impetus to future studies pursuing higher optimization in hardware design. Notably, the SLIT architecture merges efficiently with the most popular CNNs at a slight sacrifice in accuracy: 0.27% on MNIST, 0.5% to 1.5% on CIFAR, approximately 2.2% on ImageNet, and no change on SVHN.
Hao ZHOU Hailing XIONG Chuan LI Weiwei JIANG Kezhong LU Nian CHEN Yun LIU
Image dehazing is of great significance in computer vision and other fields. Dehazing performance mainly relies on the precise computation of the transmission map; however, existing transmission computations still do not work well in sky areas and are easily influenced by noise. Hence, in this work the dark channel prior (DCP) and a luminance model are used to estimate the coarse transmission, which addresses the problem of transmission estimation in the sky area. A novel weighted variational regularization model is then proposed to refine the transmission. Specifically, the proposed model simultaneously refines the transmission and restores the clear image, yielding a haze-free result. More importantly, it preserves important image details and suppresses image noise during dehazing. In addition, a new Gaussian adaptive weighting function is defined to smooth the contextual areas while preserving depth-discontinuity edges. Experiments on real-world and synthetic images illustrate that our method is competitive with state-of-the-art algorithms in different hazy environments.
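One plausible form of such a Gaussian-type adaptive weight: large image gradients (likely depth discontinuities) receive small weights and are thus preserved, while smooth contextual areas are regularized strongly; the paper's exact definition may differ.

```python
import numpy as np

def gaussian_adaptive_weight(img, sigma=0.1):
    gy, gx = np.gradient(img)                          # image gradients
    return np.exp(-(gx**2 + gy**2) / (2 * sigma**2))   # small weight at edges
```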
Yu WANG Tao LU Feng YAO Yuntao WU Yanduo ZHANG
In recent years, single face image super-resolution (SR) using deep neural networks has been well developed. However, most face images captured by a camera in a real scene show different views of the same person, and existing traditional multi-frame image SR requires alignment between images. Because multi-view face images contain texture information from different views, which can serve as effective prior information, how to use this prior information from multiple views to reconstruct a frontal face image is challenging. To solve these problems effectively, we propose a novel face SR network based on multi-view face images, which focuses on obtaining more texture information from multi-view face images to help reconstruct frontal face images. In this network, we also propose a texture attention mechanism that transfers high-precision texture compensation information to the frontal face image to obtain better visual effects. We conduct subjective and objective evaluations, and the experimental results show the great potential of multi-view face image SR. The comparison with other state-of-the-art deep learning SR methods proves that the proposed method has excellent performance.