Reliability is an important figure of merit of the system and it must be satisfied in safety-critical applications. This paper considers parallel applications on heterogeneous embedded systems and proposes a two-phase algorithm framework to minimize energy consumption for satisfying applications’ reliability requirement. The first phase is for initial assignment and the second phase is for either satisfying the reliability requirement or improving energy efficiency. Specifically, when the application’s reliability requirement cannot be achieved via the initial assignment, an algorithm for enhancing the reliability of tasks is designed to satisfy the application’s reliability requirement. Considering that the reliability of initial assignment may exceed the application’s reliability requirement, an algorithm for reducing the execution frequency of tasks is designed to improve energy efficiency. The proposed algorithms are compared with existing algorithms by using real parallel applications. Experimental results demonstrate that the proposed algorithms consume less energy while satisfying the application’s reliability requirements.
Masanori TSUJIKAWA Yoshinobu KAJIKAWA
In this paper, we propose a low-complexity and accurate noise suppression based on an a priori SNR (Speech to Noise Ratio) model for greater robustness w.r.t. short-term noise-fluctuation. The a priori SNR, the ratio of speech spectra and noise spectra in the spectral domain, represents the difference between speech features and noise features in the feature domain, including the mel-cepstral domain and the logarithmic power spectral domain. This is because logarithmic operations are used for domain conversions. Therefore, an a priori SNR model can easily be expressed in terms of the difference between the speech model and the noise model, which are modeled by the Gaussian mixture models, and it can be generated with low computational cost. By using a priori SNRs accurately estimated on the basis of an a priori SNR model, it is possible to calculate accurate coefficients of noise suppression filters taking into account the variance of noise, without serious increase in computational cost over that of a conventional model-based Wiener filter (MBW). We have conducted in-car speech recognition evaluation using the CENSREC-2 database, and a comparison of the proposed method with a conventional MBW showed that the recognition error rate for all noise environments was reduced by 9%, and that, notably, that for audio-noise environments was reduced by 11%. We show that the proposed method can be processed with low levels of computational and memory resources through implementation on a digital signal processor.
Lukas NAKAMURA Hiromitsu AWANO
We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
Akira JINGUJI Shimpei SATO Hiroki NAKAHARA
Convolutional neural network (CNN) has a high recognition rate in image recognition and are used in embedded systems such as smartphones, robots and self-driving cars. Low-end FPGAs are candidates for embedded image recognition platforms because they achieve real-time performance at a low cost. However, CNN has significant parameters called weights and internal data called feature maps, which pose a challenge for FPGAs for performance and memory capacity. To solve these problems, we exploit a split-CNN and weight sparseness. The split-CNN reduces the memory footprint by splitting the feature map into smaller patches and allows the feature map to be stored in the FPGA's high-throughput on-chip memory. Weight sparseness reduces computational costs and achieves even higher performance. We designed a dedicated architecture of a sparse CNN and a memory buffering scheduling for a split-CNN and implemented this on the PYNQ-Z1 FPGA board with a low-end FPGA. An experiment on classification using VGG16 shows that our implementation is 3.1 times faster than the GPU, and 5.4 times faster than an existing FPGA implementation.
Emerging byte-addressable non-volatile memory devices attract much attention. A non-volatile main memory (NVMM) built on them enables larger memory size and lower power consumption than a traditional DRAM main memory. To fully utilize an NVMM, both software and hardware must be cooperatively optimized. Simultaneously, even focusing on a memory module, its micro architecture is still being developed though real non-volatile memory modules, such as Intel Optane DC persistent memory (DCPMM), have been on the market. Looking at existing NVMM evaluation environments, software simulators can evaluate various micro architectures with their long simulation time. Emulators can evaluate the whole system fast with less flexibility in their configuration than simulators. Thus, an NVMM emulator that can realize flexible and fast system evaluation still has an important role to explore the optimal system. In this paper, we introduce an NVMM emulator for embedded systems and explore a direction of optimization techniques for NVMMs by using it. It is implemented on an SoC-FPGA board employing three NVMM behaviour models: coarse-grain, fine-grain and DCPMM-based. The coarse and fine models enable NVMM performance evaluations based on extensions of traditional DRAM behaviour. The DCPMM-based model emulates the behaviour of a real DCPMM. Whole evaluation environment is also provided including Linux kernel modifications and several runtime functions. We first validate the developed emulator with an existing NVMM emulator, a cycle-accurate NVMM simulator and a real DCPMM. Then, the program behavior differences among three models are evaluated with SPEC CPU programs. As a result, the fine-grain model reveals the program execution time is affected by the frequency of NVMM memory requests rather than the cache hit ratio. Comparing with the fine-grain model and the coarse-grain model under the condition of the former's longer total write latency than the latter's, the former shows lower execution time for four of fourteen programs than the latter because of the bank-level parallelism and the row-buffer access locality exploited by the former model.
Chia-Yu WANG Chia-Hsin TSAI Sheng-Chung WANG Chih-Yu WEN Robert Chen-Hao CHANG Chih-Peng FAN
In this paper, the effective Long Range (LoRa) based wireless sensor network is designed and implemented to provide the remote data sensing functions for the planned smart agricultural recycling rapid processing factory. The proposed wireless sensor network transmits the sensing data from various sensors, which measure the values of moisture, viscosity, pH, and electrical conductivity of agricultural organic wastes for the production and circulation of organic fertilizers. In the proposed wireless sensor network design, the LoRa transceiver module is used to provide data transmission functions at the sensor node, and the embedded platform by Raspberry Pi module is applied to support the gateway function. To design the cloud data server, the MySQL methodology is applied for the database management system with Apache software. The proposed wireless sensor network for data communication between the sensor node and the gateway supports a simple one-way data transmission scheme and three half-duplex two-way data communication schemes. By experiments, for the one-way data transmission scheme under the condition of sending one packet data every five seconds, the packet data loss rate approaches 0% when 1000 packet data is transmitted. For the proposed two-way data communication schemes, under the condition of sending one packet data every thirty seconds, the average packet data loss rates without and with the data-received confirmation at the gateway side can be 3.7% and 0%, respectively.
For embedded systems, verifying both real-time properties and logical validity are important. The embedded system is not only required to the accurate operation but also required to strictly real-time properties. To verify real-time properties is a key problem in model checking. In order to verify real-time properties of assembly program, we develop the simulator to propose the model checking method for verifying assembly programs. Simultaneously, we propose a timed Kripke structure and implement the simulator of the robot's processor to be verified. We propose the timed Kripke structure including the execution time which extends Kripke structure. For the input assembly program, the simulator generates timed Kripke structure by dynamic program analysis. Also, we implement model checker after generating timed Kripke structure in order to verify whether timed Kripke structure satisfies RTCTL formulas. Finally, to evaluate a proposed method, we conduct experiments with the implementation of the verification system. To solve the real problem, we have experimented with real microcontroller software.
Takashi KONO Yasuhiko TAITO Hideto HIDAKA
Embedded system approaches to edge computing in IoT implementations are proposed and discussed. Rationales of edge computing and essential core capabilities for IoT data supply innovation are identified. Then, innovative roles and development of MCU and embedded flash memory are illustrated by technology and applications, expanding from CPS to big-data and nomadic/autonomous elements of IoT requirements. Conclusively, a technology roadmap construction specific to IoT is proposed.
This letter presents a novel technique to achieve a fast inference of the binarized convolutional neural networks (BCNN). The proposed technique modifies the structure of the constituent blocks of the BCNN model so that the input elements for the max-pooling operation are binary. In this structure, if any of the input elements is +1, the result of the pooling can be produced immediately; the proposed technique eliminates such computations that are involved to obtain the remaining input elements, so as to reduce the inference time effectively. The proposed technique reduces the inference time by up to 34.11%, while maintaining the classification accuracy.
The conventional linear or tiled address maps can degrade performance and memory utilization when traffic patterns are not matched with an underlying address map. The address map is usually fixed at design time. Accordingly, it is difficult to adapt to given applications. Modern embedded system usually accommodates memory management units (MMUs). As a result, depending on virtual address patterns, the system can suffer from performance overheads due to page table walks. To alleviate this performance overhead, we propose to cluster and rearrange tiles to construct an MMU-aware configurable address map. To construct the clustered tiled map, the generic tile number remapping algorithm is presented. In the presented scheme, an address map is configured based on the adaptive dimensioning algorithm. Considering image processing applications, a design, an analysis, an implementation, and simulations are conducted. The results indicate the proposed method can improve the performance and the memory utilization with moderate hardware costs.
Koichi MITSUNARI Jaehoon YU Takao ONOYE Masanori HASHIMOTO
Visual object detection on embedded systems involves a multi-objective optimization problem in the presence of trade-offs between power consumption, processing performance, and detection accuracy. For a new Pareto solution with high processing performance and low power consumption, this paper proposes a hardware architecture for decision tree ensemble using multiple channels of features. For efficient detection, the proposed architecture utilizes the dimensionality of feature channels in addition to parallelism in image space and adopts task scheduling to attain random memory access without conflict. Evaluation results show that an FPGA implementation of the proposed architecture with an aggregated channel features pedestrian detector can process 229 million samples per second at 100MHz operation frequency while it requires a relatively small amount of resources. Consequently, the proposed architecture achieves 350fps processing performance for 1080P Full HD images and outperforms conventional object detection hardware architectures developed for embedded systems.
Naoki FUJIEDA Kiyohiro SATO Ryodai IWAMOTO Shuichi ICHIKAWA
Instruction set randomization (ISR) is a cost-effective obfuscation technique that modifies or enhances the relationship between instructions and machine languages. An Instruction Register File (IRF), a list of frequently used instructions, can be used for ISR by providing the way of indirect access to them. This study examines the IRF that integrates a positional register, which was proposed as a supplementary unit of the IRF, for the sake of tamper resistance. According to our evaluation, with a new design for the contents of the positional register, the measure of tamper resistance was increased by 8.2% at a maximum, which corresponds to a 32.2% increase in the size of the IRF. The number of logic elements increased by the addition of the positional register was 3.5% of its baseline processor.
Yining XU Ittetsu TANIGUCHI Hiroyuki TOMIYAMA
Task mapping is one of the most important design processes in embedded manycore systems. This paper proposes a static task mapping technique for manycore real-time systems. The technique minimizes the number of cores while satisfying deadline constraints of individual tasks.
Jinli RAO Zhangqing HE Shu XU Kui DAI Xuecheng ZOU
Buffer overflow is one of the main approaches to get control of vulnerable programs. This paper presents a protection technique called BFWindow for performance and resource sensitive embedded systems. By coloring data structure in memory with single associate property bit to each byte and extending the target memory block to a BFWindow(2), it validates each memory write by speculatively checking consistency of data properties within the extended buffer window. Property bits are generated by compiler statically and checked by hardware at runtime. They are transparent to users. Experimental results show that the proposed mechanism is effective to prevent sequential memory writes from crossing buffer boundaries which is the common scenario of buffer overflow exploitations. The performance overhead for practical protection mode across embedded system benchmarks is under 1%.
Yining XU Yang LIU Junya KAIDA Ittetsu TANIGUCHI Hiroyuki TOMIYAMA
This paper proposes a static application mapping technique, based on integer linear programming, for non-hierarchical manycore embedded systems. Unlike previous work which was designed for hierarchical manycore SoCs, this work allows more flexible application mapping to achieve higher performance. The experimental results show the effectiveness of this work.
Integral image is the sum of input image pixel values. It is mainly used to speed up the process of a box filter operation, such as Haar-like features. However, large memory capacity for integral image data can be an obstacle in an embedded environment with limited hardware. In a previous research, [5] reduced the size of integral image memory using 2×2 block structure with additional calculations. It can be easily extended to n×n block structure for further reduction, but it requires more additional calculations. In this paper, we propose a new block structure for the integral image by modifying the location of the reference pixel in the block. It results in much less additional calculations by reducing the number of memory accesses, while keeping the same amount of memory as the original block structure.
Hideki TAKASE Gang ZENG Lovic GAUTHIER Hirotaka KAWASHIMA Noritoshi ATSUMI Tomohiro TATEMATSU Yoshitake KOBAYASHI Takenori KOSHIRO Tohru ISHIHARA Hiroyuki TOMIYAMA Hiroaki TAKADA
This paper presents a framework for reducing the energy consumption of embedded real-time systems. We implemented the presented framework as both an optimization toolchain and an energy-aware real-time operating system. The framework consists of the integration of multiple techniques to optimize the energy consumption. The main idea behind our approach is to utilize trade-offs between the energy consumption and the performance of different processor configurations during task checkpoints, and to maintain memory allocation during task context switches. In our framework, a target application is statically analyzed at both intra-task and inter-task levels. Based on these analyzed results, runtime optimization is performed in response to the behavior of the application. A case study shows that our toolchain and real-time operating systems have achieved energy reduction while satisfying the real-time performance. The toolchain has also been successfully applied to a practical application.
Ittetsu TANIGUCHI Junya KAIDA Takuji HIEDA Yuko HARA-AZUMI Hiroyuki TOMIYAMA
This paper studies mapping techniques of multiple applications on embedded many-core SoCs. The mapping techniques proposed in this paper are static which means the mapping is decided at design time. The mapping techniques take into account both inter-application and intra-application parallelism in order to fully utilize the potential parallelism of the many-core architecture. Additionally, the proposed static mapping supports dynamic application switching, which means the applications mapped onto the same cores are switched to each other at runtime. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. Experimental results show the effectiveness of the proposed techniques.
An integral image is the sum of input image pixel values. It is mainly used to speed up the process of a box filter operation, such as Haar-like features. However, large memory for integral image data can be an obstacle in an embedded environment with limited hardware. Therefore, an efficient method to store the integral image is necessary. In this paper, we propose a memory size reduction method for integral image. The method uses four types image information: an integral image, a row integral image, a column integral image, and an input image. Using this method, integral image memory can be reduced by 42.6% on a 640×480 8-bit gray-scale input image. The same idea can be applied for bigger size images.
Chungsoo LIM Soojeong LEE Jae-Hun CHOI Joon-Hyuk CHANG
In this letter, we propose a simple but effective technique that improves statistical model-based voice activity detection (VAD) by both reducing computational complexity and increasing detection accuracy. The improvements are made by applying Taylor series approximations to the exponential and logarithmic functions in the VAD algorithm based on an in-depth analysis of the algorithm. Experiments performed on a smartphone as well as on a desktop computer with various background noises confirm the effectiveness of the proposed technique.