Xiongfei SHAN Mingyang PAN Depeng ZHAO Deqiang WANG Feng-Jang HWANG Chi-Hua CHEN
During the detection of maritime targets, the jitter of the shipborne camera usually causes the video instability and the false or missed detection of targets. Aimed at tackling this problem, a novel algorithm for maritime target detection based on the electronic image stabilization technology is proposed in this study. The algorithm mainly includes three models, namely the points line model (PLM), the points classification model (PCM), and the image classification model (ICM). The feature points (FPs) are firstly classified by the PLM, and stable videos as well as target contours are obtained by the PCM. Then the smallest bounding rectangles of the target contours generated as the candidate bounding boxes (bboxes) are sent to the ICM for classification. In the experiments, the ICM, which is constructed based on the convolutional neural network (CNN), is trained and its effectiveness is verified. Our experimental results demonstrate that the proposed algorithm outperformed the benchmark models in all the common metrics including the mean square error (MSE), peak signal to noise ratio (PSNR), structural similarity index (SSIM), and mean average precision (mAP) by at least -47.87%, 8.66%, 6.94%, and 5.75%, respectively. The proposed algorithm is superior to the state-of-the-art techniques in both the image stabilization and target ship detection, which provides reliable technical support for the visual development of unmanned ships.
Shan HE Yuanyao LU Shengnan CHEN
The development of deep learning and neural networks has brought broad prospects to computer vision and natural language processing. The image captioning task combines cutting-edge methods in two fields. By building an end-to-end encoder-decoder model, its description performance can be greatly improved. In this paper, the multi-branch deep convolutional neural network is used as the encoder to extract image features, and the recurrent neural network is used to generate descriptive text that matches the input image. We conducted experiments on Flickr8k, Flickr30k and MSCOCO datasets. According to the analysis of the experimental results on evaluation metrics, the model proposed in this paper can effectively achieve image caption, and its performance is better than classic image captioning models such as neural image annotation models.
Misaki SHIKAKURA Yusuke KAMEDA Takayuki HAMAMOTO
This paper reports the evolution and application potential of image sensors with high-speed brightness gradient sensors. We propose an adaptive exposure time control method using the apparent motion estimated by this sensor, and evaluate results for the change in illuminance and global / local motion.
Junhao ZHANG Masafumi KAZUNO Mizuki MOTOYOSHI Suguru KAMEDA Noriharu SUEMATSU
In this paper, we propose a direct digital RF transmitter with a 1-bit band-pass delta-sigma modulator (BP-DSM) that uses high order image components of the 7th Nyquist zone in Manchester coding for microwave and milimeter wave application. Compared to the conventional non-return-to-zero (NRZ) coding, in which the high order image components of 1-bit BP-DSM attenuate severely in the form of sinc function, the proposed 1-bit direct digital RF transmitter in Manchester code can improve the output power and signal-to-noise ratio (SNR) of the image components at specific (4n-1)th and (4n-2)th Nyquist Zone, which is confirmed by calculating of the power spectral density. Measurements are made to compare three types of 1-bit digital-to-analog converter (DAC) signal in output power and SNR; NRZ, 50% duty return-to-zero (RZ) and Manchester coding. By using 1 Vpp/8Gbps DAC output, 1-bit signals in Manchester coding show the highest output power of -20.3dBm and SNR of 40.3dB at 7th Nyquist Zone (26GHz) in CW condition. As a result, compared to NRZ and RZ coding, at 7th Nyquist zone, the output power is improved by 8.1dB and 6dB, respectively. Meanwhile, the SNR is improved by 7.6dB and 4.9dB, respectively. In 5Mbps-QPSK condition, 1-bit signals in Manchester code show the lowest error vector magnitude (EVM) of 2.4% and the highest adjacent channel leakage ratio (ACLR) of 38.2dB with the highest output power of -18.5dBm at 7th Nyquist Zone (26GHz), respectively, compared to the NRZ and 50% duty RZ coding. The measurement and simulation results of the image component of 1-bit signals at 7th Nyquist Zone (26GHz) are consistent with the calculation results.
Zhen LI Baojun ZHAO Wenzheng WANG Baoxian WANG
Hyperspectral images (HSIs) are generally susceptible to various noise, such as Gaussian and stripe noise. Recently, numerous denoising algorithms have been proposed to recover the HSIs. However, those approaches cannot use spectral information efficiently and suffer from the weakness of stripe noise removal. Here, we propose a tensor decomposition method with two different constraints to remove the mixed noise from HSIs. For a HSI cube, we first employ the tensor singular value decomposition (t-SVD) to effectively preserve the low-rank information of HSIs. Considering the continuity property of HSIs spectra, we design a simple smoothness constraint by using Tikhonov regularization for tensor decomposition to enhance the denoising performance. Moreover, we also design a new unidirectional total variation (TV) constraint to filter the stripe noise from HSIs. This strategy will achieve better performance for preserving images details than original TV models. The developed method is evaluated on both synthetic and real noisy HSIs, and shows the favorable results.
Rintaro YANAGI Ren TOGO Takahiro OGAWA Miki HASEYAMA
Various cross-modal retrieval methods that can retrieve images related to a query sentence without text annotations have been proposed. Although a high level of retrieval performance is achieved by these methods, they have been developed for a single domain retrieval setting. When retrieval candidate images come from various domains, the retrieval performance of these methods might be decreased. To deal with this problem, we propose a new domain adaptive cross-modal retrieval method. By translating a modality and domains of a query and candidate images, our method can retrieve desired images accurately in a different domain retrieval setting. Experimental results for clipart and painting datasets showed that the proposed method has better retrieval performance than that of other conventional and state-of-the-art methods.
Hiryu KAMOSHITA Daichi KITAHARA Ken'ichi FUJIMOTO Laurent CONDAT Akira HIRABAYASHI
This paper proposes a high-quality computed tomography (CT) image reconstruction method from low-dose X-ray projection data. A state-of-the-art method, proposed by Xu et al., exploits dictionary learning for image patches. This method generates an overcomplete dictionary from patches of standard-dose CT images and reconstructs low-dose CT images by minimizing the sum of a data fidelity and a regularization term based on sparse representations with the dictionary. However, this method does not take characteristics of each patch, such as textures or edges, into account. In this paper, we propose to classify all patches into several classes and utilize an individual dictionary with an individual regularization parameter for each class. Furthermore, for fast computation, we introduce the orthogonality to column vectors of each dictionary. Since similar patches are collected in the same cluster, accuracy degradation by the orthogonality hardly occurs. Our simulations show that the proposed method outperforms the state-of-the-art in terms of both accuracy and speed.
Yih-Cherng LEE Hung-Wei HSU Jian-Jiun DING Wen HOU Lien-Shiang CHOU Ronald Y. CHANG
Automatic tracking and classification are essential for studying the behaviors of wild animals. Owing to dynamic far-shooting photos, the occlusion problem, protective coloration, the background noise is irregular interference for designing a computerized algorithm for reducing human labeling resources. Moreover, wild dolphin images are hard-acquired by on-the-spot investigations, which takes a lot of waiting time and hardly sets the fixed camera to automatic monitoring dolphins on the ocean in several days. It is challenging tasks to detect well and classify a dolphin from polluted photos by a single famous deep learning method in a small dataset. Therefore, in this study, we propose a generic Cascade Small Object Detection (CSOD) algorithm for dolphin detection to handle small object problems and develop visualization to backbone based classification (V2BC) for removing noise, highlighting features of dolphin and classifying the name of dolphin. The architecture of CSOD consists of the P-net and the F-net. The P-net uses the crude Yolov3 detector to be a core network to predict all the regions of interest (ROIs) at lower resolution images. Then, the F-net, which is more robust, is applied to capture the ROIs from high-resolution photos to solve single detector problems. Moreover, a visualization to backbone based classification (V2BC) method focuses on extracting significant regions of occluded dolphin and design significant post-processing by referencing the backbone of dolphins to facilitate for classification. Compared to the state of the art methods, including faster-rcnn, yolov3 detection and Alexnet, the Vgg, and the Resnet classification. All experiments show that the proposed algorithm based on CSOD and V2BC has an excellent performance in dolphin detection and classification. Consequently, compared to the related works of classification, the accuracy of the proposed designation is over 14% higher. Moreover, our proposed CSOD detection system has 42% higher performance than that of the original Yolov3 architecture.
Lei SONG Xue-Cheng SUN Zhe-Ming LU
In this Letter, we propose a blind and robust multiple watermarking scheme using Contourlet transform and singular value decomposition (SVD). The host image is first decomposed by Contourlet transform. Singular values of Contourlet coefficient blocks are adopted to embed watermark information, and a fast calculation method is proposed to avoid the heavy computation of SVD. The watermark is embedded in both low and high frequency Contourlet coefficients to increase the robustness against various attacks. Moreover, the proposed scheme intrinsically exploits the characteristics of human visual system and thus can ensure the invisibility of the watermark. Simulation results show that the proposed scheme outperforms other related methods in terms of both robustness and execution time.
Shinobu KUDO Shota ORIHASHI Ryuichi TANIDA Seishi TAKAMURA Hideaki KIMATA
Recently, image compression systems based on convolutional neural networks that use flexible nonlinear analysis and synthesis transformations have been developed to improve the restoration accuracy of decoded images. Although these methods that use objective metric such as peak signal-to-noise ratio and multi-scale structural similarity for optimization attain high objective results, such metric may not reflect human visual characteristics and thus degrade subjective image quality. A method using a framework called a generative adversarial network (GAN) has been reported as one of the methods aiming to improve the subjective image quality. It optimizes the distribution of restored images to be close to that of natural images; thus it suppresses visual artifacts such as blurring, ringing, and blocking. However, since methods of this type are optimized to focus on whether the restored image is subjectively natural or not, components that are not correlated with the original image are mixed into the restored image during the decoding process. Thus, even though the appearance looks natural, subjective similarity may be degraded. In this paper, we investigated why the conventional GAN-based compression techniques degrade subjective similarity, then tackled this problem by rethinking how to handle image generation in the GAN framework between image sources with different probability distributions. The paper describes a method to maximize mutual information between the coding features and the restored images. Experimental results show that the proposed mutual information amount is clearly correlated with subjective similarity and the method makes it possible to develop image compression systems with high subjective similarity.
Daisuke INOUE Tomomi MIYAKE Mitsuhiro SUGIMOTO
Although transmittance changes like a quadratic function due to the DC offset voltage in FFS mode LCD, its bottom position and flicker minimum DC offset voltage varies depending on the gray level due to the flexoelectric effect. We demonstrated how the influence of the flexoelectric effect changes depending on the electrode width or black matrix position.
Zhaolin LU Ziyan ZHANG Yi WANG Liang DONG Song LIANG
This letter presents an image quality assessment (IQA) metric for scanning electron microscopy (SEM) images based on texture inpainting. Inspired by the observation that the texture information of SEM images is quite sensitive to distortions, a texture inpainting network is first trained to extract texture features. Then the weights of the trained texture inpainting network are transferred to the IQA network to help it learn an effective texture representation of the distorted image. Finally, supervised fine-tuning is conducted on the IQA network to predict the image quality score. Experimental results on the SEM image quality dataset demonstrate the advantages of the presented method.
Sanghoon KANG Hanhoon PARK Jong-Il PARK
Image deformations caused by different steganographic methods are typically extremely small and highly similar, which makes their detection and identification to be a difficult task. Although recent steganalytic methods using deep learning have achieved high accuracy, they have been made to detect stego images to which specific steganographic methods have been applied. In this letter, a staganalytic method is proposed that uses hierarchical residual neural networks (ResNet), allowing detection (i.e. classification between stego and cover images) and identification of four spatial steganographic methods (i.e. LSB, PVD, WOW and S-UNIWARD). Experimental results show that using hierarchical ResNets achieves a classification rate of 79.71% in quinary classification, which is approximately 23% higher compared to using a plain convolutional neural network (CNN).
Longjiao ZHAO Yu WANG Jien KATO
Recently, local features computed using convolutional neural networks (CNNs) show good performance to image retrieval. The local convolutional features obtained by the CNNs (LC features) are designed to be translation invariant, however, they are inherently sensitive to rotation perturbations. This leads to miss-judgements in retrieval tasks. In this work, our objective is to enhance the robustness of LC features against image rotation. To do this, we conduct a thorough experimental evaluation of three candidate anti-rotation strategies (in-model data augmentation, in-model feature augmentation, and post-model feature augmentation), over two kinds of rotation attack (dataset attack and query attack). In the training procedure, we implement a data augmentation protocol and network augmentation method. In the test procedure, we develop a local transformed convolutional (LTC) feature extraction method, and evaluate it over different network configurations. We end up a series of good practices with steady quantitative supports, which lead to the best strategy for computing LC features with high rotation invariance in image retrieval.
Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their performances of retrieval. In particular, when combined with cutting-edge models, TOD-Net gains the state-of-the-art image-caption retrieval model associated with the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets via handling higher levels of diversity than existing models.
Isao ECHIZEN Noboru BABAGUCHI Junichi YAMAGISHI Naoko NITTA Yuta NAKASHIMA Kazuaki NAKAMURA Kazuhiro KONO Fuming FANG Seiko MYOJIN Zhenzhong KUANG Huy H. NGUYEN Ngoc-Dung T. TIEU
With the spread of high-performance sensors and social network services (SNS) and the remarkable advances in machine learning technologies, fake media such as fake videos, spoofed voices, and fake reviews that are generated using high-quality learning data and are very close to the real thing are causing serious social problems. We launched a research project, the Media Clone (MC) project, to protect receivers of replicas of real media called media clones (MCs) skillfully fabricated by means of media processing technologies. Our aim is to achieve a communication system that can defend against MC attacks and help ensure safe and reliable communication. This paper describes the results of research in two of the five themes in the MC project: 1) verification of the capability of generating various types of media clones such as audio, visual, and text derived from fake information and 2) realization of a protection shield for media clones' attacks by recognizing them.
Maximum inner product search (MIPS) problem has gained much attention in a wide range of applications. In order to overcome the curse of dimensionality in high-dimensional spaces, most of existing methods first transform the MIPS problem into another approximate nearest neighbor search (ANNS) problem and then solve it by Locality Sensitive Hashing (LSH). However, due to the error incurred by the transmission and incomprehensive search strategies, these methods suffer from low precision and have loose probability guarantees. In this paper, we propose a novel search method named Adaptive-LSH (AdaLSH) to solve MIPS problem more efficiently and more precisely. AdaLSH examines objects in the descending order of both norms and (the probably correctly estimated) cosine angles with a query object in support of LSH with extendable windows. Such extendable windows bring not only efficiency in searching but also the probability guarantee of finding exact or approximate MIP objects. AdaLSH gives a better probability guarantee of success than those in conventional algorithms, bringing less running times on various datasets compared with them. In addition, AdaLSH can even support exact MIPS with probability guarantee.
Yubo LIU Yangting LAI Jianyong CHEN Lingyu LIANG Qiaoming DENG
Computer aided design (CAD) technology is widely used for architectural design, but current CAD tools still require high-level design specifications from human. It would be significant to construct an intelligent CAD system allowing automatic architectural layout parsing (AutoALP), which generates candidate designs or predicts architectural attributes without much user intervention. To tackle these problems, many learning-based methods were proposed, and benchmark dataset become one of the essential elements for the data-driven AutoALP. This paper proposes a new dataset called SCUT-AutoALP for multi-paradigm applications. It contains two subsets: 1) Subset-I is for floor plan design containing 300 residential floor plan images with layout, boundary and attribute labels; 2) Subset-II is for urban plan design containing 302 campus plan images with layout, boundary and attribute labels. We analyzed the samples and labels statistically, and evaluated SCUT-AutoALP for different layout parsing tasks of floor plan/urban plan based on conditional generative adversarial networks (cGAN) models. The results verify the effectiveness and indicate the potential applications of SCUT-AutoALP. The dataset is available at https://github.com/designfuturelab702/SCUT-AutoALP-Database-Release.
Mami NAGOYA Tomoaki KIMURA Hiroyuki TSUJI
A simple depth-key-based image composition is proposed, which uses two still images with depth information, background and foreground object. The proposed method can place the object at various locations in the background considering the depth in the 3D world coordinate system. The main feature is that a simple algorithm is provided, which enables us to achieve the depthward movement within the camera plane, without being aware of the 3D world coordinate system. Two algorithms are proposed (P-OMDD and O-OMDD), which are based on the pin-hole camera model. As an advantage, camera calibration is not required before applying the algorithm in these methods. Since a single image is used for the object representation, each of the proposed methods has its limitations in terms of fidelity of the composite image. P-OMDD faithfully reproduces the angle at which the object is seen, but the pixels of the hidden surface are missing. On the contrary, O-OMDD can avoid the hidden surface problem, but the angle of the object is fixed, wherever it moves. It is verified through several experiments that, when using O-OMDD, subjectively natural composite images can be obtained under any object movement, in terms of size and position in the camera plane. Future tasks include improving the change in illumination due to positional changes and the partial loss of objects due to noise in depth images.
Kouki SEO Chihiro GO Yuma KINOSHITA Hitoshi KIYA
We propose a novel hue-correction scheme for multi-exposure image fusion (MEF). Various MEF methods have so far been studied to generate higher-quality images. However, there are few MEF methods considering hue distortion unlike other fields of image processing, due to a lack of a reference image that has correct hue. In the proposed scheme, we generate an HDR image as a reference for hue correction, from input multi-exposure images. After that, hue distortion in images fused by an MEF method is removed by using hue information of the HDR one, on the basis of the constant-hue plane in the RGB color space. In simulations, the proposed scheme is demonstrated to be effective to correct hue-distortion caused by conventional MEF methods. Experimental results also show that the proposed scheme can generate high-quality images, regardless of exposure conditions of input multi-exposure images.