Ye PENG Wentao ZHAO Wei CAI Jinshu SU Biao HAN Qiang LIU
Due to the superior performance, deep learning has been widely applied to various applications, including image classification, bioinformatics, and cybersecurity. Nevertheless, the research investigations on deep learning in the adversarial environment are still on their preliminary stage. The emerging adversarial learning methods, e.g., generative adversarial networks, have introduced two vital questions: to what degree the security of deep learning with the presence of adversarial examples is; how to evaluate the performance of deep learning models in adversarial environment, thus, to raise security advice such that the selected application system based on deep learning is resistant to adversarial examples. To see the answers, we leverage image classification as an example application scenario to propose a framework of Evaluating Deep Learning for Image Classification (EDLIC) to conduct comprehensively quantitative analysis. Moreover, we introduce a set of evaluating metrics to measure the performance of different attacking and defensive techniques. After that, we conduct extensive experiments towards the performance of deep learning for image classification under different adversarial environments to validate the scalability of EDLIC. Finally, we give some advice about the selection of deep learning models for image classification based on these comparative results.
Aiming at the complexity of posture recognition with Kinect, a method of posture recognition using distance characteristics is proposed. Firstly, depth image data was collected by Kinect, and three-dimensional coordinate information of 20 skeleton joints was obtained. Secondly, according to the contribution of joints to posture expression, 60 dimensional Kinect skeleton joint data was transformed into a vector of 24-dimensional distance characteristics which were normalized according to the human body structure. Thirdly, a static posture recognition method of the shortest distance and a dynamic posture recognition method of the minimum accumulative distance with dynamic time warping (DTW) were proposed. The experimental results showed that the recognition rates of static postures, non-cross-subject dynamic postures and cross-subject dynamic postures were 95.9%, 93.6% and 89.8% respectively. Finally, posture selection, Kinect placement, and comparisons with literatures were discussed, which provides a reference for Kinect based posture recognition technology and interaction design.
Yuya HOSODA Arata KAWAMURA Youji IIGUNI
In this paper, we propose an image to sound mapping method. This technique treats an image as a spectrogram and maps it to a sound by taking inverse FFT of the spectrogram. Amplitude spectra of a speech signal are embedded to the spectrogram to give speech intelligibility for the mapped sound. Specifically, we hold amplitude spectra of a speech signal with strong power and embed the image brightness in other frequency bands. Holding amplitude spectra of a speech signal with strong power preserves a speech spectral envelope and improves the speech quality of the mapped sound. The amplitude spectra of the mapped sound with weak power represent the image brightness, and then the image is successfully reconstructed from the mapped sound. Simulation results show that the proposed method achieves sufficient speech quality.
Ming DAI Zhiheng ZHOU Tianlei WANG Yongfan GUO
In many real application scenarios of image segmentation problems involving limited and low-quality data, employing prior information can significantly improve the segmentation result. For example, the shape of the object is a kind of common prior information. In this paper, we introduced a new kind of prior information, which is named by prior distribution. On the basis of nonparametric statistical active contour model, we proposed a novel distribution prior model. Unlike traditional shape prior model, our model is not sensitive to the shapes of object boundary. Using the intensity distribution of objects and backgrounds as prior information can simplify the process of establishing and solving the model. The idea of constructing our energy function is as follows. During the contour curve convergence, while maximizing distribution difference between the inside and outside of the active contour, the distribution difference between the inside/outside of contour and the prior object/background is minimized. We present experimental results on a variety of synthetic and natural images. Experimental results demonstrate the potential of the proposed method that with the information of prior distribution, the segmentation effect and speed can be both improved efficaciously.
The conventional linear or tiled address maps can degrade performance and memory utilization when traffic patterns are not matched with an underlying address map. The address map is usually fixed at design time. Accordingly, it is difficult to adapt to given applications. Modern embedded system usually accommodates memory management units (MMUs). As a result, depending on virtual address patterns, the system can suffer from performance overheads due to page table walks. To alleviate this performance overhead, we propose to cluster and rearrange tiles to construct an MMU-aware configurable address map. To construct the clustered tiled map, the generic tile number remapping algorithm is presented. In the presented scheme, an address map is configured based on the adaptive dimensioning algorithm. Considering image processing applications, a design, an analysis, an implementation, and simulations are conducted. The results indicate the proposed method can improve the performance and the memory utilization with moderate hardware costs.
Yuta UKON Koji YAMAZAKI Koyo NITTA
Advanced information-processing services based on cloud computing are in great demand. However, users want to be able to customize cloud services for their own purposes. To provide image-processing services that can be optimized for the purpose of each user, we propose a technique for chaining image-processing functions in a CPU-field programmable gate array (FPGA) coupled server architecture. One of the most important requirements for combining multiple image-processing functions on a network, is low latency in server nodes. However, large delay occurs in the conventional CPU-FPGA architecture due to the overheads of packet reordering for ensuring the correctness of image processing and data transfer between the CPU and FPGA at the application level. This paper presents a CPU-FPGA server architecture with a real-time packet reordering circuit for low-latency image processing. In order to confirm the efficiency of our idea, we evaluated the latency of histogram of oriented gradients (HOG) feature calculation as an offloaded image-processing function. The results show that the latency is about 26 times lower than that of the conventional CPU-FPGA architecture. Moreover, the throughput decreased by less than 3.7% under the worst-case condition where 90 percent of the packets are randomly swapped at a 40-Gbps input rate. Finally, we demonstrated that a real-time video monitoring service can be provided by combining image processing functions using our architecture.
We propose an image identification scheme for double-compressed encrypted JPEG images that aims to identify encrypted JPEG images that are generated from an original JPEG image. To store images without any visual sensitive information on photo sharing services, encrypted JPEG images are generated by using a block-scrambling-based encryption method that has been proposed for Encryption-then-Compression systems with JPEG compression. In addition, feature vectors robust against JPEG compression are extracted from encrypted JPEG images. The use of the image encryption and feature vectors allows us to identify encrypted images recompressed multiple times. Moreover, the proposed scheme is designed to identify images re-encrypted with different keys. The results of a simulation show that the identification performance of the scheme is high even when images are recompressed and re-encrypted.
Shu FUJITA Keita TAKAHASHI Toshiaki FUJII
A light field, which is equivalent to a dense set of multi-view images, has various applications such as depth estimation and 3D display. One of the essential problems in light field applications is light field interpolation, i.e., view interpolation. The interpolation accuracy is enhanced by exploiting an inherent property of a light field. One example is that an epipolar plane image (EPI), which is a 2D subset of the 4D light field, consists of many lines, and these lines have almost the same slope in a local region. This structure induces a sparse representation in the frequency domain, where most of the energy resides on a line passing through the origin. On the basis of this observation, we propose a group sparsity prior suitable for light fields to exploit their line structure fully for interpolation. Specifically, we designed the directional groups in the discrete Fourier transform (DFT) domain so that the groups can represent the concentration of the energy, and we thereby formulated an LF interpolation problem as an overlapping group lasso. We also introduce several techniques to improve the interpolation accuracy such as applying a window function, determining group weights, expanding processing blocks, and merging blocks. Our experimental results show that the proposed method can achieve better or comparable quality as compared to state-of-the-art LF interpolation methods such as convolutional neural network (CNN)-based methods.
(k,n)-visual secret sharing scheme ((k,n)-VSSS) is a method to divide a secret image into n images called shares that enable us to restore the original image by only stacking at least k of them without any complicated computations. In this paper, we consider (2,2)-VSSS to share two secret images at the same time only by two shares, and investigate the methods to improve the quality of decoded images. More precisely, we consider (2,2)-VSSS in which the first secret image is decoded by stacking those two shares in the usual way, while the second one is done by stacking those two shares in the way that one of them is used reversibly. Since the shares must have some subpixels that inconsistently correspond to pixels of the secret images, the decoded pixels do not agree with the corresponding pixels of the secret images, which causes serious degradation of the quality of decoded images. To reduce such degradation, we propose several methods to construct shares that utilize 8-neighbor Laplacian filter and halftoning. Then we show that the proposed methods can effectively improve the quality of decoded images. Moreover, we demonstrate that the proposed methods can be naturally extended to (2,2)-VSSS for RGB images.
Huyen T. T. TRAN Trang H. HOANG Phu N. MINH Nam PHAM NGOC Truong CONG THANG
Thanks to the ability to bring immersive experiences to users, Virtual Reality (VR) technologies have been gaining popularity in recent years. A key component in VR systems is omnidirectional content, which can provide 360-degree views of scenes. However, at a given time, only a portion of the full omnidirectional content, called viewport, is displayed corresponding to the user's current viewing direction. In this work, we first develop Weighted-Viewport PSNR (W-VPSNR), an objective quality metric for quality assessment of omnidirectional content. The proposed metric takes into account the foveation feature of the human visual system. Then, we build a subjective database consisting of 72 stimuli with spatial varying viewport quality. By using this database, an evaluation of the proposed metric and four conventional metrics is conducted. Experiment results show that the W-VPSNR metric well correlates with the mean opinion scores (MOS) and outperforms the conventional metrics. Also, it is found that the conventional metrics do not perform well for omnidirectional content.
Songlin DU Yuhao XU Tingting HU Takeshi IKENAGA
High frame rate and ultra-low delay matching system plays an important role in various human-machine interactive applications, which demands better performance in matching deformable and out-of-plane rotating objects. Although many algorithms have been proposed for deformation tracking and matching, few of them are suitable for hardware implementation due to complicated operations and large time consumption. This paper proposes a hardware-oriented template update and recovery method for high frame rate and ultra-low delay deformation matching system. In the proposed method, the new template is generated in real time by partially updating the template descriptor and adding new keypoints simultaneously with the matching process in pixels (proposal #1), which avoids the large inter-frame delay. The size and shape of region of interest (ROI) are made flexible and the Hamming threshold used for brute-force matching is adjusted according to pixel position and the flexible ROI (proposal #2), which solves the problem of template drift. The template is recovered by the previous one with a relative center-shifting vector when it is judged as lost via region-wise difference check (proposal #3). Evaluation results indicate that the proposed method successfully achieves the real-time processing of 784fps at the resolution of 640×480 on field-programmable gate array (FPGA), with a delay of 0.808ms/frame, as well as achieves satisfactory deformation matching results in comparison with other general methods.
Shoya OOHARA Mitsuji MUNEYASU Soh YOSHIDA Makoto NAKASHIZUKA
For image restoration, an image prior that is obtained from the morphological gradient has been proposed. In the field of mathematical morphology, the optimization of the structuring element (SE) used for this morphological gradient using a genetic algorithm (GA) has also been proposed. In this paper, we introduce a new image prior that is the sum of the morphological gradients and total variation for an image restoration problem to improve the restoration accuracy. The proposed image prior makes it possible to almost match the fitness to a quantitative evaluation such as the mean square error. It also solves the problem of the artifact due to the unsuitability of the SE for the image. An experiment shows the effectiveness of the proposed image restoration method.
Chihiro GO Yuma KINOSHITA Sayaka SHIOTA Hitoshi KIYA
This paper proposes a novel multi-exposure image fusion (MEF) scheme for single-shot high dynamic range imaging with spatially varying exposures (SVE). Single-shot imaging with SVE enables us not only to produce images without color saturation regions from a single-shot image, but also to avoid ghost artifacts in the producing ones. However, the number of exposures is generally limited to two, and moreover it is difficult to decide the optimum exposure values before the photographing. In the proposed scheme, a scene segmentation method is applied to input multi-exposure images, and then the luminance of the input images is adjusted according to both of the number of scenes and the relationship between exposure values and pixel values. The proposed method with the luminance adjustment allows us to improve the above two issues. In this paper, we focus on dual-ISO imaging as one of single-shot imaging. In an experiment, the proposed scheme is demonstrated to be effective for single-shot high dynamic range imaging with SVE, compared with conventional MEF schemes with exposure compensation.
Siaw-Lang WONG Raveendran PARAMESRAN Ibuki YOSHIDA Akira TAGUCHI
Light scattering and absorption of light in water cause underwater images to be poorly contrasted, haze and dominated by a single color cast. A solution to this is to find methods to improve the quality of the image that eventually leads to better visualization. We propose an integrated approach using Adaptive Gray World (AGW) and Differential Gray-Levels Histogram Equalization for Color Images (DHECI) to remove the color cast as well as improve the contrast and colorfulness of the underwater image. The AGW is an adaptive version of the GW method where apart from computing the global mean, the local mean of each channel of an image is taken into consideration and both are weighted before combining them. It is applied to remove the color cast, thereafter the DHECI is used to improve the contrast and colorfulness of the underwater image. The results of the proposed method are compared with seven state-of-the-art methods using qualitative and quantitative measures. The experimental results showed that in most cases the proposed method produced better quantitative scores than the compared methods.
Motohiro TAKAGI Akito SAKURAI Masafumi HAGIWARA
Current image quality assessment (IQA) methods require the original images for evaluation. However, recently, IQA methods that use machine learning have been proposed. These methods learn the relationship between the distorted image and the image quality automatically. In this paper, we propose an IQA method based on deep learning that does not require a reference image. We show that a convolutional neural network with distortion prediction and fixed filters improves the IQA accuracy.
Qing YU Masashi ANZAWA Sosuke AMANO Kiyoharu AIZAWA
Since the development of food diaries could enable people to develop healthy eating habits, food image recognition is in high demand to reduce the effort in food recording. Previous studies have worked on this challenging domain with datasets having fixed numbers of samples and classes. However, in the real-world setting, it is impossible to include all of the foods in the database because the number of classes of foods is large and increases continually. In addition to that, inter-class similarity and intra-class diversity also bring difficulties to the recognition. In this paper, we solve these problems by using deep convolutional neural network features to build a personalized classifier which incrementally learns the user's data and adapts to the user's eating habit. As a result, we achieved the state-of-the-art accuracy of food image recognition by the personalization of 300 food records per user.
Gou HOUBEN Shu FUJITA Keita TAKAHASHI Toshiaki FUJII
Depth (disparity) estimation from a light field (a set of dense multi-view images) is currently attracting much research interest. This paper focuses on how to handle a noisy light field for disparity estimation, because if left as it is, the noise deteriorates the accuracy of estimated disparity maps. Several researchers have worked on this problem, e.g., by introducing disparity cues that are robust to noise. However, it is not easy to break the trade-off between the accuracy and computational speed. To tackle this trade-off, we have integrated a fast denoising scheme in a fast disparity estimation framework that works in the epipolar plane image (EPI) domain. Specifically, we found that a simple 1-D slanted filter is very effective for reducing noise while preserving the underlying structure in an EPI. Moreover, this simple filtering does not require elaborate parameter configurations in accordance with the target noise level. Experimental results including real-world inputs show that our method can achieve good accuracy with much less computational time compared to some state-of-the-art methods.
Kazumi TAKEMURA Toshiyuki YOSHIDA
This paper proposes a novel Depth From Defocus (DFD) technique based on the property that two images having different focus settings coincide if they are reblurred with the opposite focus setting, which is referred to as the “cross reblurring” property in this paper. Based on the property, the proposed technique estimates the block-wise depth profile for a target object by minimizing the mean squared error between the cross-reblurred images. Unlike existing DFD techniques, the proposed technique is free of lens parameters and independent of point spread function models. A compensation technique for a possible pixel disalignment between images is also proposed to improve the depth estimation accuracy. The experimental results and comparisons with the other DFD techniques show the advantages of our technique.
Akira KUBOTA Kazuya KODAMA Asami ITO
A pupil function of aperture in image capturing systems is theoretically derived such that one can perfectly reconstruct all-in-focus image through linear filtering of the focal stack. The perfect reconstruction filters are also designed based on the derived pupil function. The designed filters are space-invariant; hence the presented method does not require region segmentation. Simulation results using synthetic scenes shows effectiveness of the derived pupil function and the filters.
Luis Rafael MARVAL-PÉREZ Koichi ITO Takafumi AOKI
Access control and surveillance applications like walking-through security gates and immigration control points have a great demand for convenient and accurate biometric recognition in unconstrained scenarios with low user cooperation. The periocular region, which is a relatively new biometric trait, has been attracting much attention for recognition of an individual in such scenarios. This paper proposes a periocular recognition method that combines Phase-Based Correspondence Matching (PB-CM) with a texture enhancement technique. PB-CM has demonstrated high recognition performance in other biometric traits, e.g., face, palmprint and finger-knuckle-print. However, a major limitation for periocular region is that the performance of PB-CM degrades when the periocular skin has poor texture. We address this problem by applying texture enhancement and found out that variance normalization of texture significantly improves the performance of periocular recognition using PB-CM. Experimental evaluation using three public databases demonstrates the advantage of the proposed method compared with conventional methods.