Jiafeng MAO Qing YU Kiyoharu AIZAWA
Well-annotated datasets are crucial to the training of object detectors. However, producing finely annotated datasets for object detection is extremely labor-intensive, so crowdsourcing is often used to create them, and the resulting datasets tend to contain incorrect annotations such as inaccurately localized bounding boxes. In this study, we highlight the problem of object detection with noisy bounding box annotations and show that these noisy annotations harm the performance of deep neural networks. To solve this problem, we further propose a framework that allows the network to correct the noisy datasets by alternating refinement. The experimental results demonstrate that our proposed framework can significantly alleviate the influence of noise on model performance.
Nagul COOHAROJANANONE Kiyoharu AIZAWA
In this paper, we present a new color distance measure: the angular distance of cumulative histograms. The proposed measure is robust to light variation. We also apply weight values to DR, DG, and DB according to a hue histogram of the query image. Moreover, we compare the measure with a popular previous measure, the cumulative L1 distance. We show that our method produces more accurate and perceptually relevant results.
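The two measures named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bin count, the function names, and the equal default weights are assumptions, and the abstract does not say how the weights are derived from the hue histogram.

```python
import math

def cumulative_histogram(values, bins=8, max_value=256):
    """Normalized cumulative histogram of integer values in [0, max_value)."""
    counts = [0] * bins
    for v in values:
        counts[v * bins // max_value] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running / len(values))
    return cumulative

def angular_distance(h1, h2):
    """Angle in radians between two histograms viewed as vectors."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def cumulative_l1(h1, h2):
    """Cumulative L1 distance, the measure used for comparison."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def weighted_color_distance(d_r, d_g, d_b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of per-channel distances; the weights are hypothetical."""
    w_r, w_g, w_b = weights
    return w_r * d_r + w_g * d_g + w_b * d_b
```

In use, one cumulative histogram would be computed per color channel for both the query and the database image, the per-channel distances DR, DG, and DB obtained with `angular_distance`, and the three combined with `weighted_color_distance`.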
Koki TSUBOTA Hiroaki AKUTSU Kiyoharu AIZAWA
Image quality assessment (IQA) is a fundamental metric for image processing tasks (e.g., compression). Among full-reference IQAs, traditional metrics such as PSNR and SSIM have long been used. Recently, IQAs based on deep neural networks (deep IQAs), such as LPIPS and DISTS, have also been used. It is known that image scaling is inconsistent among deep IQAs: some perform down-scaling as pre-processing, whereas others use the original image size. In this paper, we show that image scale is an influential factor in deep IQA performance. We comprehensively evaluate four deep IQAs on the same five datasets, and the experimental results show that image scale significantly influences IQA performance. We found that the most appropriate image scale is often neither the default nor the original size, and that the best choice differs depending on the methods and datasets used. We visualized the stability and found that PieAPP is the most stable among the four deep IQAs.
Toshihiko YAMASAKI Takayuki ISHIKAWA Kiyoharu AIZAWA
Recently, cars have been equipped with many sensors for safe driving. We have been trying to store driving-scene video together with such sensor data and to detect changes in street scenery. Detection results can be used for building a historical database of town scenery, automatically updating landmarks on maps, and so forth. In order to compare images to detect changes, retrieval of images taken at nearly identical locations is required as the first step. Since Global Positioning System (GPS) data inherently contain some noise, we cannot rely on GPS data alone for image retrieval. Therefore, we have developed an image retrieval algorithm employing edge-histogram-based image features in conjunction with hierarchical search. By using edge histograms projected onto the vertical and horizontal axes, the retrieval has been made robust to image variation due to weather change, clouds, obstacles, and so on. In addition, the matching cost has been reduced by limiting the matching candidates through hierarchical search. Experimental results demonstrate that the mean retrieval accuracy improved from 65% to 76% for front-view images and from 34% to 53% for side-view images.
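The idea of projecting edge histograms onto the two image axes can be sketched as below. This is an illustrative sketch only: the binary edge map input, the L1 comparison, and the function names are assumptions, and the hierarchical candidate search described in the abstract is not reproduced.

```python
def edge_projection_histograms(edge_map):
    """Project a binary edge map (list of rows of 0/1) onto both axes.

    Returns (row_sums, column_sums): the number of edge pixels in each
    row and in each column.
    """
    row_sums = [sum(row) for row in edge_map]
    column_sums = [sum(col) for col in zip(*edge_map)]
    return row_sums, column_sums

def histogram_distance(features_a, features_b):
    """L1 distance between two projected features (an assumed metric)."""
    rows_a, cols_a = features_a
    rows_b, cols_b = features_b
    return (sum(abs(a - b) for a, b in zip(rows_a, rows_b))
            + sum(abs(a - b) for a, b in zip(cols_a, cols_b)))

def retrieve(query_map, candidate_maps):
    """Index of the candidate whose projected features best match the query."""
    query = edge_projection_histograms(query_map)
    return min(
        range(len(candidate_maps)),
        key=lambda i: histogram_distance(
            query, edge_projection_histograms(candidate_maps[i])),
    )
```

In a pipeline like the one described, `candidate_maps` would already be restricted to images recorded near the query's GPS position, so that only a few comparisons are needed per query.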
Masayuki TANIMOTO Kohichi SAKANIWA Kiyoharu AIZAWA Kazuyoshi OSHIMA Kiyomi KUMOZAKI Shuji TASAKA Yoichi MAEDA Takeshi MIZUIKE Mikio YAMASHITA Hideaki YAMANAKA Koichiro WAKASUGI Masaaki KATAYAMA
Yoshinori HATORI Shuichi MATSUMOTO Hiroshi KOTERA Kiyoharu AIZAWA Fumitaka ONO Hideo KITAJIMA Taizo KINOSHITA Shigeru KUROE Yutaka TANAKA Hideo HASHIMOTO Mitsuharu YANO Toshiaki WATANABE
Vincent van de LAAR Kiyoharu AIZAWA
This paper describes a scheme to capture a wide-view image using a camera setup with uncalibrated cameras. The setup is such that the optical axes are pointed in divergent directions. The direction of view of the resulting image can be chosen freely in any direction between these two optical axes. The scheme uses eight-parameter perspective transformations to warp the images, the parameters of which are obtained by using a relative orientation algorithm. The focal length and scale factor of the two images are estimated by using Powell's multi-dimensional optimization technique. Experiments on real images show the accuracy of the scheme.
Yasuhiro OHTSUKA Takayuki HAMAMOTO Kiyoharu AIZAWA
We propose a new sampling control system for an image sensor array. In contrast to random-access pixel sensors, the proposed sensor can read out spatially variant sampled pixels at high speed without requiring a pixel address for each access. The proposed sensor has a memory array that stores the sampling positions, and the sampling positions can be changed dynamically by rewriting this memory. It can therefore realize any spatially varying sampling pattern. A prototype of 64 × 64 pixels was fabricated in a 0.7 µm CMOS process.
Hiroshi HARASHIMA Kiyoharu AIZAWA Takahiro SAITO
This paper deals with recent trends in research on intelligent image coding technology, focusing on model-based analysis-synthesis coding. By means of intelligent image coding, we will be able to realize epoch-making ultra-low-rate image transmission and/or so-called value-added visual telecommunications. In order to categorize the various image coding systems and examine their potential future applications, an approach to defining generations of image coding technologies is presented. The future-generation coding systems include model-based analysis-synthesis coding and knowledge-based intelligent coding. The latter half of the paper is devoted to the authors' recent work on a model-based analysis-synthesis coding system for facial images.
Masashi ANZAWA Sosuke AMANO Yoko YAMAKATA Keiko MOTONAGA Akiko KAMEI Kiyoharu AIZAWA
We investigate image recognition of multiple food items in a single photo, focusing on a buffet restaurant application, where the menu changes at every meal and only a few images per class are available. After detecting food areas, we perform hierarchical recognition. We evaluate our results by comparing them with two baseline methods.
Jianfeng XU Toshihiko YAMASAKI Kiyoharu AIZAWA
3D video, which consists of a sequence of mesh models, can reproduce dynamic scenes containing 3D information. To summarize 3D video, a key frame extraction method is developed using the rate-distortion (R-D) trade-off. For this purpose, an effective feature vector is extracted for each frame. Shot detection is performed using the feature vectors as a preprocessing step, followed by key frame extraction. Simple but reasonable definitions of rate and distortion are presented. Based on an assumption of linearity, an R-D curve is generated in each shot, where the locations of the key frames are optimized. Finally, the R-D trade-off is achieved by optimizing a cost function using a Lagrange multiplier, where the number of key frames is optimized in each shot. Our system therefore automatically determines the best locations and number of key frames in the sense of the R-D trade-off. Our experimental results show that the extracted key frames are compact and faithful to the original 3D video.
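The final step, choosing the number of key frames per shot with a Lagrange multiplier, can be illustrated with the standard Lagrangian cost J = D + λR. The sketch below assumes that each candidate number of key frames has already been assigned a rate and a distortion value; the abstract does not give the paper's definitions of rate and distortion, so the sample values and the function name are hypothetical.

```python
def best_operating_point(rd_points, lam):
    """Pick the (rate, distortion) pair minimizing J = distortion + lam * rate.

    rd_points: iterable of (rate, distortion) tuples, e.g. one tuple per
    candidate number of key frames in a shot (values here are assumed).
    lam: Lagrange multiplier; larger values favor fewer key frames.
    """
    return min(rd_points, key=lambda point: point[1] + lam * point[0])

# Hypothetical R-D curve for one shot: rate = number of key frames.
shot_curve = [(1, 10.0), (2, 4.0), (3, 3.5)]
chosen_rate, chosen_distortion = best_operating_point(shot_curve, lam=1.0)
```

Sweeping `lam` traces out different trade-offs: a small multiplier tolerates more key frames to reduce distortion, while a large one keeps the summary compact.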
A computational sensor (also called a smart sensor or vision chip) is a very small integrated system in which processing and sensing are unified on a single VLSI chip. It is designed for a specific target application. This paper describes research activities on computational sensors. There have been quite a few proposals and implementations of computational sensors. First, their approaches are summarized from several points of view, such as advantages vs. disadvantages, neural vs. functional, architecture, analog vs. digital, local vs. global processing, imaging vs. processing, and new processing paradigms. Then, several examples are introduced, covering spatial processing, temporal processing, A/D conversion, and programmable computational sensors. Finally, the paper is concluded.
Digital watermarking schemes have been discussed as a way to solve problems associated with copyright enforcement. Previously, we proposed a method using inter-block correlation of DCT coefficients. Its features are that the embedded watermark can be extracted without either the original image or the parameters used in the embedding process, and that the amount of modification, i.e., the strength of the embedded watermark, depends on the local features of the image. This property makes it difficult for a pirate to predict the position in which the watermark signal is embedded. In this paper, we propose a method that can embed and extract a watermark at high speed by applying this watermarking method to the JPEG file format.
Viet-Quoc PHAM Takashi MIYAKI Toshihiko YAMASAKI Kiyoharu AIZAWA
We present a robust object-based watermarking algorithm using the scale-invariant feature transform (SIFT) in conjunction with a data embedding method based on the Discrete Cosine Transform (DCT). The message is embedded in the DCT domain of randomly generated blocks in the selected object region. To recognize the object region after it has been distorted, its SIFT features are registered in advance. In the detection scheme, we extract SIFT features from the distorted image and match them with the registered ones. We then recover the distorted object region based on the transformation parameters obtained from the SIFT matching result, and the watermarked message can be detected. Experimental results demonstrate that our proposed algorithm is very robust to distortions such as JPEG compression, scaling, rotation, shearing, aspect ratio change, and image filtering.
Liyanage C. DE SILVA Kiyoharu AIZAWA Mitsutoshi HATORI
In this paper, face feature detection and tracking are discussed, using methods called edge pixel counting and deformable circular template matching. Instead of utilizing color or gray-scale information of the facial image, the proposed edge pixel counting method utilizes edge information to estimate the positions of face features such as the eyes, nose, and mouth, using a variable-size face feature template, the initial size of which is predetermined by using a facial image database. The method is robust in the sense that detection is possible for facial images with different skin colors and different facial orientations. Subsequently, by using deformable circular template matching, the two iris positions of the face are determined and are used in the edge pixel counting to track the features in the next frame. Although feature tracking using gray-scale template matching often fails when the inter-frame correlation around the feature areas is very low due to facial expression changes (such as talking, smiling, or eye blinking), feature tracking using edge pixel counting can track facial features reliably. Some experimental results are shown to demonstrate the effectiveness of the proposed method.
Cha Keon CHEONG Kiyoharu AIZAWA Takahiro SAITO Mitsutoshi HATORI
In this paper, subband image coding with symmetric biorthogonal wavelet filters is studied. In order to implement the symmetric biorthogonal wavelet basis, we use the Laplacian Pyramid Model (LPM) and the trigonometric polynomial solution method. These symmetric biorthogonal wavelet bases are used to form the filters in each subband. The filter coefficients are also optimized with respect to coding efficiency. From this optimization, we show that the parameter a of the LPM generating kernel gives the best coding efficiency in the range of 0.7 to 0.75. We also present an optimal bit allocation method based on considerations of the reconstruction filter characteristics. The step size of each subband uniform quantizer is determined by using this bit allocation method. The coding efficiency of the symmetric biorthogonal wavelet filter is compared with those of other filters: QMF, SSKF, and the orthonormal wavelet filter. Simulation results demonstrate that the symmetric biorthogonal wavelet filter is useful as a basic means for image analysis/synthesis filters and can give better coding efficiency than the other filters.
Conny GUNADI Hiroyuki SHIMIZU Kazuya KODAMA Kiyoharu AIZAWA
Construction of large-scale virtual environments is gaining more attention for its applications in virtual malls, virtual sightseeing, telepresence, etc. This paper presents a framework for building a realistic virtual environment using a geometry-based approach. We propose an algorithm to construct a realistic 3-D model from multi-view range data and multi-view texture images. The proposed method uses the result of region segmentation of range images in several phases of the modeling process. It is shown that the relations obtained from region segmentation are quite effective in improving the results of registration as well as mesh merging.
Toshihiko YAMASAKI Kiyoharu AIZAWA
This paper presents a non-blind watermarking technique that is robust to non-linear geometric distortion attacks. This is one of the most challenging problems for copyright protection of digital content because it is difficult to estimate the distortion parameters for the embedded blocks. In our proposed scheme, the locations of the blocks are recorded as translation parameters from multiple Scale Invariant Feature Transform (SIFT) feature points. This method is based on two assumptions: SIFT features are robust to non-linear geometric distortion, and even such non-linear distortion can be regarded as “linear” distortion in local regions. We conducted experiments using 149,800 images (7 standard images and 100 images downloaded from Flickr, 10 different messages, 10 different embedding block patterns, and 14 attacks). The results show that the watermark detection performance is drastically improved, whereas the baseline method achieves only chance-level accuracy.
In this paper, we present a novel portrait impression estimation method using nine pairs of semantic impression words: bitter-majestic, clear-pure, elegant-mysterious, gorgeous-mature, modern-intellectual, natural-mild, sporty-agile, sweet-sunny, and vivid-dynamic. In the first part of the study, we analyzed the relationship between the facial features in deformed portraits and the nine semantic impression word pairs over a large dataset, which we collected by a crowdsourcing process. In the second part, we leveraged the knowledge from the results of the analysis to develop a ranking network trained on the collected data and designed to estimate the semantic impression associated with a portrait. Our network demonstrated superior performance in impression estimation compared with current state-of-the-art methods.
This paper gives a detailed presentation of a "vision chip" for very fast detection of motion vectors. The chip's design consists of a parallel pixel array and column-parallel block-matching processors. Each pixel of the pixel array contains a photo detector, an edge detector, and 4 bits of memory. In the detection of motion vectors, the gray-level image is first binarized by the edge detector, and the binary edge data is subsequently used in the block-matching processor. The block matching takes place locally in each pixel and globally in each column. The chip can create a dense motion field in which a vector is assigned to each pixel by overlapping 2 × 2 target blocks. A prototype with 16 × 16 pixels and four block-matching processors has been designed and implemented. Preliminary results obtained with the prototype are shown.
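Block matching on binary edge data with overlapping 2 × 2 target blocks can be sketched in software as follows. This is a plain sequential illustration, not a model of the chip's pixel-parallel and column-parallel hardware: the search range, the mismatch-count cost, and the preference for small displacements on ties are all assumptions.

```python
def block_cost(prev, curr, y, x, dy, dx, size=2):
    """Number of mismatching bits between a block in prev and a shifted block in curr."""
    cost = 0
    for i in range(size):
        for j in range(size):
            cost += prev[y + i][x + j] ^ curr[y + dy + i][x + dx + j]
    return cost

def motion_field(prev, curr, search=1, size=2):
    """Assign a displacement (dy, dx) to every block origin in prev.

    prev, curr: binary edge maps (lists of rows of 0/1) of equal shape.
    Blocks overlap because the origin advances one pixel at a time.
    """
    height, width = len(prev), len(prev[0])
    # Try small displacements first so that ties favor little motion.
    offsets = sorted(
        ((dy, dx) for dy in range(-search, search + 1)
         for dx in range(-search, search + 1)),
        key=lambda d: abs(d[0]) + abs(d[1]),
    )
    field = {}
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            best_cost, best_offset = None, (0, 0)
            for dy, dx in offsets:
                # Skip displacements that push the block outside the frame.
                if not (0 <= y + dy <= height - size and 0 <= x + dx <= width - size):
                    continue
                cost = block_cost(prev, curr, y, x, dy, dx, size)
                if best_cost is None or cost < best_cost:
                    best_cost, best_offset = cost, (dy, dx)
            field[(y, x)] = best_offset
    return field
```

Because the inputs are single-bit edge maps, the matching cost reduces to XOR and counting, which is the kind of operation that is cheap to replicate per pixel or per column in hardware.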