In recent years, deep convolutional neural networks (CNN) have been widely used in synthetic aperture radar (SAR) image recognition. However, due to the difficulty in obtaining SAR image samples, training data is relatively few and overfitting is easy to occur when using traditional CNNS used in optical image recognition. In this paper, a CNN-based SAR image recognition algorithm is proposed, which can effectively reduce network parameters, avoid model overfitting and improve recognition accuracy. The algorithm first constructs a convolutional network feature extractor with a small size convolution kernel, then constructs a classifier based on the convolution layer, and designs a loss function based on distance measurement. The networks are trained in two stages: in the first stage, the distance measurement loss function is used to train the feature extraction network; in the second stage, cross-entropy is used to train the whole model. The public benchmark dataset MSTAR is used for experiments. Comparison experiments prove that the proposed method has higher accuracy than the state-of-the-art algorithms and the classical image recognition algorithms. The ablation experiment results prove the effectiveness of each part of the proposed algorithm.
Rina TAGAMI Hiroki KOBAYASHI Shuichi AKIZUKI Manabu HASHIMOTO
Due to the revitalization of the semiconductor industry and efforts to reduce labor and unmanned operations in the retail and food manufacturing industries, objects to be recognized at production sites are increasingly diversified in color and design. Depending on the target objects, it may be more reliable to process only color information, while intensity information may be better, or a combination of color and intensity information may be better. However, there are not many conventional method for optimizing the color and intensity information to be used, and deep learning is too costly for production sites. In this paper, we optimize the combination of the color and intensity information of a small number of pixels used for matching in the framework of template matching, on the basis of the mutual relationship between the target object and surrounding objects. We propose a fast and reliable matching method using these few pixels. Pixels with a low pixel pattern frequency are selected from color and grayscale images of the target object, and pixels that are highly discriminative from surrounding objects are carefully selected from these pixels. The use of color and intensity information makes the method highly versatile for object design. The use of a small number of pixels that are not shared by the target and surrounding objects provides high robustness to the surrounding objects and enables fast matching. Experiments using real images have confirmed that when 14 pixels are used for matching, the processing time is 6.3 msec and the recognition success rate is 99.7%. The proposed method also showed better positional accuracy than the comparison method, and the optimized pixels had a higher recognition success rate than the non-optimized pixels.
Dichao LIU Yu WANG Kenji MASE Jien KATO
Fine-grained image classification is a difficult problem, and previous studies mainly overcome this problem by locating multiple discriminative regions in different scales and then aggregating complementary information explored from the located regions. However, locating discriminative regions introduces heavy overhead and is not suitable for real-world application. In this paper, we propose the recursive multi-scale channel-spatial attention module (RMCSAM) for addressing this problem. Following the experience of previous research on fine-grained image classification, RMCSAM explores multi-scale attentional information. However, the attentional information is explored by recursively refining the deep feature maps of a convolutional neural network (CNN) to better correspond to multi-scale channel-wise and spatial-wise attention, instead of localizing attention regions. In this way, RMCSAM provides a lightweight module that can be inserted into standard CNNs. Experimental results show that RMCSAM can improve the classification accuracy and attention capturing ability over baselines. Also, RMCSAM performs better than other state-of-the-art attention modules in fine-grained image classification, and is complementary to some state-of-the-art approaches for fine-grained image classification. Code is available at https://github.com/Dichao-Liu/Recursive-Multi-Scale-Channel-Spatial-Attention-Module.
Masayuki HIROMOTO Hisanao AKIMA Teruo ISHIHARA Takuji YAMAMOTO
Zero-shot learning (ZSL) aims to classify images of unseen classes by learning relationship between visual and semantic features. Existing works have been improving recognition accuracy from various approaches, but they employ computationally intensive algorithms that require iterative optimization. In this work, we revisit the primary approach of the pattern recognition, ı.e., nearest neighbor classifiers, to solve the ZSL task by an extremely simple and fast way, called SimpleZSL. Our algorithm consists of the following three simple techniques: (1) just averaging feature vectors to obtain visual prototypes of seen classes, (2) calculating a pseudo-inverse matrix via singular value decomposition to generate visual features of unseen classes, and (3) inferring unseen classes by a nearest neighbor classifier in which cosine similarity is used to measure distance between feature vectors. Through the experiments on common datasets, the proposed method achieves good recognition accuracy with drastically small computational costs. The execution time of the proposed method on a single CPU is more than 100 times faster than those of the GPU implementations of the existing methods with comparable accuracies.
Ryosuke KURAMOCHI Hiroki NAKAHARA
Convolutional neural networks (CNNs) are widely used for image processing tasks in both embedded systems and data centers. In data centers, high accuracy and low latency are desired for various tasks such as image processing of streaming videos. We propose an FPGA-based low-latency CNN inference for randomly wired convolutional neural networks (RWCNNs), whose layer structures are based on random graph models. Because RWCNNs have several convolution layers that have no direct dependencies between them, our architecture can process them efficiently using a pipeline method. At each layer, we need to use the calculation results of multiple layers as the input. We use an FPGA with HBM2 to enable parallel access to the input data with multiple HBM2 channels. We schedule the order of execution of the layers to improve the pipeline efficiency. We build a conflict graph using the scheduling results. Then, we allocate the calculation results of each layer to the HBM2 channels by coloring the graph. Because the pipeline execution needs to be properly controlled, we developed an automatic generation tool for hardware functions. We implemented the proposed architecture on the Alveo U50 FPGA. We investigated a trade-off between latency and recognition accuracy for the ImageNet classification task by comparing the inference performances for different input image sizes. We compared our accelerator with a conventional accelerator for ResNet-50. The results show that our accelerator reduces the latency by 2.21 times. We also obtained 12.6 and 4.93 times better efficiency than CPU and GPU, respectively. Thus, our accelerator for RWCNNs is suitable for low-latency inference.
Qing YU Masashi ANZAWA Sosuke AMANO Kiyoharu AIZAWA
Since the development of food diaries could enable people to develop healthy eating habits, food image recognition is in high demand to reduce the effort in food recording. Previous studies have worked on this challenging domain with datasets having fixed numbers of samples and classes. However, in the real-world setting, it is impossible to include all of the foods in the database because the number of classes of foods is large and increases continually. In addition to that, inter-class similarity and intra-class diversity also bring difficulties to the recognition. In this paper, we solve these problems by using deep convolutional neural network features to build a personalized classifier which incrementally learns the user's data and adapts to the user's eating habit. As a result, we achieved the state-of-the-art accuracy of food image recognition by the personalization of 300 food records per user.
Masashi ANZAWA Sosuke AMANO Yoko YAMAKATA Keiko MOTONAGA Akiko KAMEI Kiyoharu AIZAWA
We investigate image recognition of multiple food items in a single photo, focusing on a buffet restaurant application, where menu changes at every meal, and only a few images per class are available. After detecting food areas, we perform hierarchical recognition. We evaluate our results, comparing to two baseline methods.
Chun-Yu LIU Wei-Hao LIAO Shanq-Jang RUAN
The abnormal crowd behavior detection is an important research topic in computer vision to improve the response time of critical events. In this letter, we introduce a novel method to detect and localize the crowd gathering in surveillance videos. The proposed foreground stillness model is based on the foreground object mask and the dense optical flow to measure the instantaneous crowd stillness level. Further, we obtain the long-term crowd stillness level by the leaky bucket model, and the crowd gathering behavior can be detected by the threshold analysis. Experimental results indicate that our proposed approach can detect and locate crowd gathering events, and it is capable of distinguishing between standing and walking crowd. The experiments in realistic scenes with 88.65% accuracy for detection of gathering frames show that our method is effective for crowd gathering behavior detection.
Recently, mobile applications for recording everyday meals draw much attention for self dietary. However, most of the applications return food calorie values simply associated with the estimated food categories, or need for users to indicate the rough amount of foods manually. In fact, it has not been achieved to estimate food calorie from a food photo with practical accuracy, and it remains an unsolved problem. Then, in this paper, we propose estimating food calorie from a food photo by simultaneous learning of food calories, categories, ingredients and cooking directions using deep learning. Since there exists a strong correlation between food calories and food categories, ingredients and cooking directions information in general, we expect that simultaneous training of them brings performance boosting compared to independent single training. To this end, we use a multi-task CNN. In addition, in this research, we construct two kinds of datasets that is a dataset of calorie-annotated recipe collected from Japanese recipe sites on the Web and a dataset collected from an American recipe site. In the experiments, we trained both multi-task and single-task CNNs, and compared them. As a result, a multi-task CNN achieved the better performance on both food category estimation and food calorie estimation than single-task CNNs. For the Japanese recipe dataset, by introducing a multi-task CNN, 0.039 were improved on the correlation coefficient, while for the American recipe dataset, 0.090 were raised compared to the result by the single-task CNN. In addition, we showed that the proposed multi-task CNN based method outperformed search-based methods proposed before.
Kei SAWADA Akira TAMAMORI Kei HASHIMOTO Yoshihiko NANKAKU Keiichi TOKUDA
This paper proposes a Bayesian approach to image recognition based on separable lattice hidden Markov models (SL-HMMs). The geometric variations of the object to be recognized, e.g., size, location, and rotation, are an essential problem in image recognition. SL-HMMs, which have been proposed to reduce the effect of geometric variations, can perform elastic matching both horizontally and vertically. This makes it possible to model not only invariances to the size and location of the object but also nonlinear warping in both dimensions. The maximum likelihood (ML) method has been used in training SL-HMMs. However, in some image recognition tasks, it is difficult to acquire sufficient training data, and the ML method suffers from the over-fitting problem when there is insufficient training data. This study aims to accurately estimate SL-HMMs using the maximum a posteriori (MAP) and variational Bayesian (VB) methods. The MAP and VB methods can utilize prior distributions representing useful prior information, and the VB method is expected to obtain high generalization ability by marginalization of model parameters. Furthermore, to overcome the local maximum problem in the MAP and VB methods, the deterministic annealing expectation maximization algorithm is applied for training SL-HMMs. Face recognition experiments performed on the XM2VTS database indicated that the proposed method offers significantly improved image recognition performance. Additionally, comparative experiment results showed that the proposed method was more robust to geometric variations than convolutional neural networks.
Recent studies have obtained superior performance in image recognition tasks by using, as an image representation, the fully connected layer activations of Convolutional Neural Networks (CNN) trained with various kinds of images. However, the CNN representation is not very suitable for fine-grained image recognition tasks involving food image recognition. For improving performance of the CNN representation in food image recognition, we propose a novel image representation that is comprised of the covariances of convolutional layer feature maps. In the experiment on the ETHZ Food-101 dataset, our method achieved 58.65% averaged accuracy, which outperforms the previous methods such as the Bag-of-Visual-Words Histogram, the Improved Fisher Vector, and CNN-SVM.
Integral image is the sum of input image pixel values. It is mainly used to speed up the process of a box filter operation, such as Haar-like features. However, large memory capacity for integral image data can be an obstacle in an embedded environment with limited hardware. In a previous research, [5] reduced the size of integral image memory using 2×2 block structure with additional calculations. It can be easily extended to n×n block structure for further reduction, but it requires more additional calculations. In this paper, we propose a new block structure for the integral image by modifying the location of the reference pixel in the block. It results in much less additional calculations by reducing the number of memory accesses, while keeping the same amount of memory as the original block structure.
Kei KINOSHITA Yoshiki YAMAGUCHI Daisuke TAKANO Tomoyuki OKAMURA Tetsuhiko YAO
This paper seeks to improve power-performance efficiency of embedded systems by the use of dynamic reconfiguration. Programmable logic devices (PLDs) have the competence to optimize the power consumption by the use of partial and/or dynamic reconfiguration. It is a non-exclusive approach, which can use other power-reduction techniques simultaneous, and thus it is applicable to a myriad of systems. The power-performance improvement by dynamic reconfiguration was evaluated through an augmented reality system that translates Japanese into English. It is a wearable and mobile system with a head-mounted display (HMD). In the system, the computing core detects a Japanese word from an input video frame and the translated term will be output to the HMD. It includes various image processing approaches such as pattern recognition and object tracking, and these functions run sequentially. The system does not need to prepare all functions simultaneously, which provides a function by reconfiguration only when it is needed. In other words, by dynamic reconfiguration, the spatiotemporal module-based pipeline can introduce the reduction of its circuit amount and power consumption compared to the naive approach. The approach achieved marked improvements; the computational speed was the same but the power consumption was reduced to around $rac{1}{6}$.
Akira TAMAMORI Yoshihiko NANKAKU Keiichi TOKUDA
In this paper, a novel statistical model based on 2-D HMMs for image recognition is proposed. Recently, separable lattice 2-D HMMs (SL2D-HMMs) were proposed to model invariance to size and location deformation. However, their modeling accuracy is still insufficient because of the following two assumptions, which are inherited from 1-D HMMs: i) the stationary statistics within each state and ii) the conditional independent assumption of state output probabilities. To overcome these shortcomings in 1-D HMMs, trajectory HMMs were proposed and successfully applied to speech recognition and speech synthesis. This paper derives 2-D trajectory HMMs by reformulating the likelihood of SL2D-HMMs through the imposition of explicit relationships between static and dynamic features. The proposed model can efficiently capture dependencies between adjacent observations without increasing the number of model parameters. The effectiveness of the proposed model was evaluated in face recognition experiments on the XM2VTS database.
An integral image is the sum of input image pixel values. It is mainly used to speed up the process of a box filter operation, such as Haar-like features. However, large memory for integral image data can be an obstacle in an embedded environment with limited hardware. Therefore, an efficient method to store the integral image is necessary. In this paper, we propose a memory size reduction method for integral image. The method uses four types image information: an integral image, a row integral image, a column integral image, and an input image. Using this method, integral image memory can be reduced by 42.6% on a 640×480 8-bit gray-scale input image. The same idea can be applied for bigger size images.
Akira TAMAMORI Yoshihiko NANKAKU Keiichi TOKUDA
This paper proposes a new generative model which can deal with rotational data variations by extending Separable Lattice 2-D HMMs (SL2D-HMMs). In image recognition, geometrical variations such as size, location and rotation degrade the performance. Therefore, the appropriate normalization processes for such variations are required. SL2D-HMMs can perform an elastic matching in both horizontal and vertical directions; this makes it possible to model invariance to size and location. To deal with rotational variations, we introduce additional HMM states which represent the shifts of the state alignments among the observation lines in a particular direction. Face recognition experiments show that the proposed method improves the performance significantly for rotational variation data.
Kosuke MIZUNO Hiroki NOGUCHI Guangji HE Yosuke TERACHI Tetsuya KAMINO Tsuyoshi FUJINAGA Shintaro IZUMI Yasuo ARIKI Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
This paper describes a SIFT (Scale Invariant Feature Transform) descriptor generation engine which features a VLSI oriented SIFT algorithm, three-stage pipelined architecture and novel systolic array architectures for Gaussian filtering and key-point extraction. The ROI-based scheme has been employed for the VLSI oriented algorithm. The novel systolic array architecture drastically reduces the number of operation cycle and memory access. The cycle counts of Gaussian filtering module is reduced by 82%, compared with the SIMD architecture. The number of memory accesses of the Gaussian filtering module and the key-point extraction module are reduced by 99.8% and 66% respectively, compared with the results obtained assuming the SIMD architecture. The proposed schemes provide processing capability for HDTV resolution video (1920 1080 pixels) at 30 frames per second (fps). The test chip has been fabricated in 65 nm CMOS technology and occupies 4.2 4.2 mm2 containing 1.1 M gates and 1.38 Mbit on-chip memory. The measured data demonstrates 38.2 mW power consumption at 78 MHz and 1.2 V.
Kazuyuki SAKURAI Shorin KYO Shin'ichiro OKAZAKI
This paper describes the real-time implementation of a vision-based overtaking vehicle detection method for driver assistance systems using IMAPCAR, a highly parallel SIMD linear array processor. The implemented overtaking vehicle detection method is based on optical flows detected by block matching using SAD and detection of the flows' vanishing point. The implementation is done efficiently by taking advantage of the parallel SIMD architecture of IMAPCAR. As a result, video-rate (33 frames/s) implementation could be achieved.
Koichi ITO Takafumi AOKI Hiroshi NAKAJIMA Koji KOBAYASHI Tatsuo HIGUCHI
This paper presents a palmprint recognition algorithm using Phase-Only Correlation (POC). The use of phase components in 2D (two-dimensional) discrete Fourier transforms of palmprint images makes it possible to achieve highly robust image registration and matching. In the proposed algorithm, POC is used to align scaling, rotation and translation between two palmprint images, and evaluate similarity between them. Experimental evaluation using a palmprint image database clearly demonstrates efficient matching performance of the proposed algorithm.
This paper presents media processor architectures for automotive applications. Media processing applications with their requirements for LSI implementations are first described for vision based driver assistance as well as graphical user interface for car navigation using 3D graphics. Then, parallel processing architectures for vision and graphics in these applications are reviewed with their performance and cost. After that, future trends of automotive media processing such as integration of vision and 3D graphics functions are shown with their applications and the required performance. Moreover, parallel processing architectures are discussed for the integration of vision and graphics. Finally, an prospect of a next-generation media processing LSI for automotives is provided.