Hoang-Gia VU, Shinya TAKAMAEDA-YAMAZAKI, Takashi NAKADA, Yasuhiko NAKASHIMA
Modern FPGAs have been integrated into computing systems as accelerators for long-running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in CPU-based systems, checkpoint/restart techniques are expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with that of the whole computing system, and how to build the model with a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. Second, we propose a method to capture consistent snapshots of the FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation; for the host CPU, we also provide a software stack that includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool that inserts the checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% degradation in maximum clock frequency, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
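Concretely, the host-side flow could resemble the following minimal Python sketch. The class and method names (FpgaCheckpointManager, pause, read_state, and so on) are hypothetical stand-ins for the paper's API stack, and the FPGA device handle is assumed to exist; this illustrates the consistent-snapshot idea, not the actual implementation.

```python
# Hypothetical sketch of a host-side checkpoint/restart flow. All names
# here are illustrative assumptions, not the paper's actual API.
import pickle
import time

class FpgaCheckpointManager:
    """Coordinates FPGA snapshots with the host-side process state."""

    def __init__(self, device, storage_path):
        self.device = device            # assumed handle to the accelerator
        self.storage_path = storage_path

    def checkpoint(self, host_state):
        # 1. Pause the accelerator at a consistent boundary.
        self.device.pause()
        # 2. Read back the state captured by the inserted checkpointing
        #    infrastructure (registers, BRAM contents, etc.).
        fpga_snapshot = self.device.read_state()
        # 3. Store FPGA and host snapshots together so the pair is consistent.
        with open(self.storage_path, "wb") as f:
            pickle.dump({"time": time.time(),
                         "fpga": fpga_snapshot,
                         "host": host_state}, f)
        self.device.resume()

    def restart(self):
        with open(self.storage_path, "rb") as f:
            snap = pickle.load(f)
        self.device.write_state(snap["fpga"])  # restore accelerator state
        self.device.resume()
        return snap["host"]                    # caller restores host state
```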
Kazuhiro YOSHIMURA, Takuya IWAKAMI, Takashi NAKADA, Jun YAO, Hajime SHIMADA, Yasuhiko NAKASHIMA
Recently, we have proposed the Linear Array Pipeline Processor (LAPP) to improve energy efficiency for various workloads, such as image processing, while maintaining programmability by working on VLIW codes. In this paper, we propose an instruction mapping scheme for LAPP that fully exploits the array execution of functional units (FUs) and the bypass networks by using a mapper to fit the VLIW codes onto the FUs. The mapping can be completed within multiple cycles during the data prefetch that precedes array execution. According to an HDL-based implementation, the hardware required for the mapping scheme is 84% of the cost introduced by a baseline method. In addition, the proposed mapper helps to shrink the array stage: our results show that the combined design is 88% of the baseline model in area.
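The abstract does not detail the mapper's algorithm, so the following is a minimal software sketch assuming a simple scheme: each VLIW word occupies one FU-array row, and values produced in one row and consumed in the next are routed over the bypass network. The data structures are illustrative only.

```python
# Illustrative sketch of fitting VLIW words onto FU-array rows, assuming
# one word per row and bypassing between adjacent rows.
def map_vliw_to_fu_array(vliw_words, fus_per_row):
    """vliw_words: list of VLIW words, each a list of (op, dst, srcs) slots.
    Returns (rows, bypasses): rows[r] holds the ops issued to FU row r."""
    rows = []
    produced_in_row = {}          # register -> row index that produces it
    for word in vliw_words:
        if len(word) > fus_per_row:
            raise ValueError("word does not fit in one FU row")
        for op, dst, srcs in word:
            produced_in_row[dst] = len(rows)
        rows.append(list(word))
    # A source read in row r that was produced in row r-1 can be forwarded
    # over the bypass network instead of going through the register file.
    bypasses = []
    for r in range(1, len(rows)):
        for op, dst, srcs in rows[r]:
            for s in srcs:
                if produced_in_row.get(s) == r - 1:
                    bypasses.append((s, r - 1, r))
    return rows, bypasses

rows, bypasses = map_vliw_to_fu_array(
    [[("mul", "r3", ["r1", "r2"])], [("add", "r4", ["r3", "r0"])]],
    fus_per_row=2)
print(bypasses)   # r3 is forwarded from row 0 to row 1
```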
Ryuta SHINGAI, Yuria HIRAGA, Hisakazu FUKUOKA, Takamasa MITANI, Takashi NAKADA, Yasuhiko NAKASHIMA
Modern deep learning has significantly improved performance and is used in a wide variety of applications. Since the inference process of a neural network requires a large amount of computation, it is usually processed not at the data acquisition location, such as a surveillance camera, but on a server with abundant computing power in a data center. Edge computing is attracting considerable attention as a solution to this problem; however, edge devices can provide only limited computation resources. Therefore, we assumed a divided/distributed neural network model that uses both the edge device and the server. By processing part of the convolutional layers on the edge, the amount of communication becomes smaller than that of the raw sensor data. In this paper, we evaluated AlexNet and eight other models in this distributed environment and estimated FPS values with Wi-Fi, 3G, and 5G communication. To reduce communication costs, we also introduced a compression process before communication, which may degrade the object recognition accuracy. As necessary conditions, we require an FPS of 30 or faster and an object recognition accuracy of 69.7% or higher; the latter value is based on the accuracy of an approximation model that binarizes the activations of the neural network. We constructed performance and energy models to find the optimal configuration that consumes minimum energy while satisfying these conditions. Through a comprehensive evaluation, we found the optimal configurations of all nine models. For small models, such as AlexNet, processing the entire model on the edge was best; for huge models, such as VGG16, processing the entire model on the server was best; and for medium-size models, the distributed models were good candidates. We confirmed that our model finds the most energy-efficient configuration while satisfying the FPS and accuracy requirements, and that the distributed models reduce the energy consumption by up to 48.6% (6.6% on average). We also found that HEVC compression is important before transferring the input data or the feature data between the distributed inference processes.
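The split-point search that such performance and energy models enable can be sketched as follows. All parameter names and the simple communication-bound latency term are illustrative placeholders, not the paper's measured models.

```python
# Hedged sketch of choosing the edge/server split that minimizes energy
# while meeting an FPS constraint. Numbers and model forms are invented.
def best_split(layers, input_bytes, edge_j_per_op, server_j_per_op,
               link_j_per_byte, link_bytes_per_s, fps_required):
    """layers: list of (ops, output_bytes). Splitting after layer k runs
    layers[:k] on the edge and layers[k:] on the server; k == 0 sends the
    raw input (server-only), k == len(layers) sends nothing (edge-only)."""
    best = None
    for k in range(len(layers) + 1):
        edge_ops = sum(ops for ops, _ in layers[:k])
        server_ops = sum(ops for ops, _ in layers[k:])
        if k == 0:
            xfer = input_bytes
        elif k == len(layers):
            xfer = 0
        else:
            xfer = layers[k - 1][1]          # feature map at the split point
        energy = (edge_ops * edge_j_per_op + server_ops * server_j_per_op
                  + xfer * link_j_per_byte)
        if xfer / link_bytes_per_s <= 1.0 / fps_required:  # meets FPS?
            if best is None or energy < best[1]:
                best = (k, energy)
    return best

# Toy example: three layers, 5G-like link.
print(best_split([(1e9, 4e5), (2e9, 1e5), (1e9, 4e3)], input_bytes=6e5,
                 edge_j_per_op=2e-9, server_j_per_op=5e-10,
                 link_j_per_byte=1e-6, link_bytes_per_s=1e7, fps_required=30))
```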
Jun YAO, Yasuhiko NAKASHIMA, Naveen DEVISETTI, Kazuhiro YOSHIMURA, Takashi NAKADA
General-purpose many-core architectures (MCAs), such as GPGPUs, have recently been widely used to continue performance scaling as increases in working frequency approach manufacturing limits. However, both general-purpose MCAs and their building block, the general-purpose processor (GPP), lack the tuning capability to boost energy efficiency for individual applications, especially computation-intensive ones. As an alternative to these MCA platforms, we propose in this paper our LAPP (Linear Array Pipeline) architecture, which adopts a special-purpose reconfigurable structure for optimal MIPS/W while keeping backward binary compatibility, a feature missing from most special-purpose hardware. More specifically, we use a general-purpose VLIW processor interpreting a commercial VLIW ISA as the baseline frontend to provide backward binary compatibility, and we extend the functional unit (FU) stage into an FU array that forms a reconfigurable backend for efficient, parallel execution of program hotspots. The hardware modules in this general-purpose reconfigurable architecture are locally zoned into several groups so that suitable low-power techniques can be applied according to each module's hardware features. Our results show that, at comparable performance, the tightly coupled general/special-purpose hardware, based on a 180nm cell library, achieves 10.8 times the MIPS/W of an MCA with the same technology features. When a 65nm technology node is assumed, a similar 9.4x MIPS/W improvement can be achieved by LAPP without changing program binaries.
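As a rough illustration of this execution model (not the paper's hardware mechanism), the sketch below shows binary-compatible dispatch: ordinary blocks run on the VLIW frontend, while hotspot loops are configured once onto the FU array and then executed there. The threshold and helper functions are stand-ins.

```python
# Illustrative dispatch between the VLIW frontend and the FU-array backend.
def map_onto_fu_array(block):      # stub: one-time configuration cost
    print(f"configure FU array for loop {block['id']}")

def run_on_fu_array(block):        # stub: energy-efficient array execution
    print(f"array-execute loop {block['id']}")

def run_on_vliw_frontend(block):   # stub: unchanged binary, full generality
    print(f"VLIW-execute block {block['id']}")

HOTSPOT_THRESHOLD = 1000           # assumed iteration count for a hotspot

def execute(program, loop_counts):
    mapped = set()                 # loops already configured on the array
    for block in program:
        hot = (block.get("loop")
               and loop_counts.get(block["id"], 0) >= HOTSPOT_THRESHOLD)
        if hot:
            if block["id"] not in mapped:
                map_onto_fu_array(block)
                mapped.add(block["id"])
            run_on_fu_array(block)
        else:
            run_on_vliw_frontend(block)

execute([{"id": "L1", "loop": True}, {"id": "B2"}], {"L1": 5000})
```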
Thi Thu Thao KHONG, Takashi NAKADA, Yasuhiko NAKASHIMA
We introduce a hybrid Bayesian-convolutional neural network (hyBCNN) that improves robustness against adversarial attacks while decreasing the computation time of the Bayesian inference phase. Our hyBCNN models are built from parts of a BNN and a CNN: starting from pre-trained CNNs, we replace only the convolutional layers and activation functions of the initial stage with our Bayesian convolutional (BC) and Bayesian activation (BA) layers, as a form of transfer learning, and keep the remainder of the CNN unchanged. We adopt the Bayes without Bayesian Learning (BwoBL) algorithm for hyBCNN networks to execute Bayesian inference for adversarial robustness. Our proposal outperforms adversarial training and robust activation functions, which are currently the outstanding defense methods for CNNs, in resisting adversarial attacks such as PGD and C&W. Moreover, the proposed architecture with BwoBL can easily be integrated into any pre-trained CNN, especially scaling networks such as ResNet and EfficientNet, with better performance on large-scale datasets. In particular, under an l∞-norm PGD attack with pixel perturbation ε=4/255 and 100 iterations on ImageNet, our best hyBCNN EfficientNet reaches 93.92% top-5 accuracy without additional training.
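A minimal PyTorch sketch of the construction follows: the first convolution of a pretrained CNN is swapped for a Bayesian convolution whose weights are sampled at inference time, and the rest of the network is left untouched. The reparameterized Gaussian layer and the sigma value are generic assumptions, not the paper's exact BC design, and the BA layer is omitted for brevity.

```python
# Hedged sketch: Bayesian initial stage on top of a pretrained CNN.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class BayesConv2d(nn.Module):
    def __init__(self, conv: nn.Conv2d, init_sigma=0.05):
        super().__init__()
        self.mu = nn.Parameter(conv.weight.detach().clone())  # reuse weights
        self.log_sigma = nn.Parameter(
            torch.full_like(conv.weight, math.log(init_sigma)))
        self.stride, self.padding = conv.stride, conv.padding
        self.bias = conv.bias

    def forward(self, x):
        # Reparameterization: w = mu + sigma * eps, sampled per forward pass.
        w = self.mu + torch.exp(self.log_sigma) * torch.randn_like(self.mu)
        return F.conv2d(x, w, self.bias, self.stride, self.padding)

model = models.resnet18(weights="IMAGENET1K_V1")
model.conv1 = BayesConv2d(model.conv1)    # Bayesian initial stage only

def ensemble_predict(x, n_samples=10):
    model.eval()
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)              # averaged predictive distribution
```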
Renyuan ZHANG, Takashi NAKADA, Yasuhiko NAKASHIMA
A programmable analog calculation unit (ACU) is designed for vector computations in continuous time with a compact circuit scale. Our earlier study showed that it is feasible to retrieve arbitrary two-variable functions through support vector regression (SVR) in silicon. In this work, the dimensions of the regression are expanded for vector computations; however, the hardware cost and computing error grow greatly as the dimensions expand. A two-stage architecture is therefore proposed to organize multiple ACUs for high-dimensional regression: the computation of high-dimensional vectors is separated into several computations of lower-dimensional vectors, which are implemented by freely combining several low-cost ACUs. In this manner, both the circuit scale and the regression error are reduced. The proof-of-concept ACU is designed and simulated in a 0.18μm technology. From the circuit simulation results, all the demonstrated calculations with nine operands are executed without iterative clock cycles by 4960 transistors, and the calculation error of example functions is below 8.7%.
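A software analogue of the two-stage decomposition can be sketched with sklearn standing in for the analog SVR circuits. For brevity this example cascades two-variable units over six operands (the paper demonstrates nine); the target function and kernel settings are illustrative choices.

```python
# Hedged sketch of two-stage SVR decomposition: first-stage two-variable
# units feed a second-stage combining unit.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))

# Stage 1: a two-variable unit trained to multiply its inputs.
stage1 = SVR(kernel="rbf", C=10.0).fit(X, X[:, 0] * X[:, 1])

# Stage 2: a unit combining three intermediate values (here: their sum).
Z = rng.uniform(-1, 1, size=(2000, 3))
stage2 = SVR(kernel="rbf", C=10.0).fit(Z, Z.sum(axis=1))

def compute6(v):
    """f(v) = v0*v1 + v2*v3 + v4*v5 via cascaded low-dimensional units."""
    p = [stage1.predict([[v[2 * i], v[2 * i + 1]]])[0] for i in range(3)]
    return stage2.predict([p])[0]

v = rng.uniform(-1, 1, 6)
print(compute6(v), "vs exact", v[0] * v[1] + v[2] * v[3] + v[4] * v[5])
```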
Takashi NAKADA Hiroyuki YANAGIHASHI Kunimaro IMAI Hiroshi UEKI Takashi TSUCHIYA Masanori HAYASHIKOSHI Hiroshi NAKAMURA
Near real-time periodic tasks, which are common in multimedia streaming applications, have deadline periods that are longer than their input intervals thanks to buffering. For such applications, conventional frame-based scheduling cannot realize the optimal schedule because of its shortsighted deadline assumptions. To realize globally energy-efficient execution of these applications, we propose a novel task scheduling algorithm that takes advantage of the long deadline period. We confirm that our approach exploits the longer deadline period and reduces the average power consumption by up to 18%.
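A toy calculation, with invented numbers, shows why the long buffered deadline helps: the frame-based scheduler must finish each frame within one input interval, while the buffered scheduler can average the work over the deadline period and run at a lower, more efficient speed (dynamic energy scales roughly with speed squared per unit of work under DVFS, since power ∝ f·V² and V scales with f).

```python
# Toy comparison of frame-based vs. buffered (long-deadline) scheduling.
input_interval = 1.0          # seconds between frames
deadline_period = 4.0         # seconds until a frame's output is due
work = [0.9, 0.3, 0.3, 0.3]   # per-frame work at full speed (seconds)

# Frame-based: each frame must finish within its own input interval.
frame_speeds = [w / input_interval for w in work]
frame_energy = sum(s ** 2 * w for s, w in zip(frame_speeds, work))

# Buffered: one sustained speed suffices if total work meets the deadline.
sustained_speed = sum(work) / deadline_period
buffered_energy = sustained_speed ** 2 * sum(work)

print(f"frame-based {frame_energy:.3f} vs buffered {buffered_energy:.3f}")
```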
Yuan HE, Masaaki KONDO, Takashi NAKADA, Hiroshi SASAKI, Shinobu MIWA, Hiroshi NAKAMURA
Networks-on-Chip (or NoCs, for short) play an important role in modern and future multi-core processors, since they strongly affect both the performance and the power consumption of the entire chip. To date, many optimization techniques have been developed to improve NoC bandwidth, latency, and power consumption. But a clear answer to how these techniques affect energy efficiency has yet to be found, since each comes with its own benefits and overheads, and there are too many of them to assess exhaustively. This raises the problem of when and how such optimization techniques should be applied. To solve it, we build a runtime framework that throttles these optimization techniques based on concise performance and energy models. With the help of this framework, we can adaptively select among multiple optimization techniques to further improve the performance or energy efficiency of the network at runtime.
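The shape of such a runtime selection loop might look like the sketch below: at each epoch, concise models predict the performance and energy of each candidate technique from a few hardware counters, and the best-scoring one is enabled. The model forms, coefficients, and counter names are all placeholders, not the paper's actual models.

```python
# Hedged sketch of model-driven runtime throttling of NoC optimizations.
def predict(technique, counters):
    """Concise linear models with illustrative coefficients."""
    perf = technique["a"] * counters["injection_rate"] + technique["b"]
    energy = technique["c"] * counters["link_util"] + technique["d"]
    return perf, energy

def select_technique(techniques, counters):
    # Score each technique by predicted performance per unit energy.
    def efficiency(t):
        perf, energy = predict(t, counters)
        return perf / energy
    return max(techniques, key=efficiency)

techniques = [
    {"name": "baseline",         "a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0},
    {"name": "dvfs_links",       "a": 0.9, "b": 1.0, "c": 0.5, "d": 0.9},
    {"name": "adaptive_routing", "a": 1.3, "b": 0.8, "c": 1.1, "d": 1.1},
]
counters = {"injection_rate": 0.4, "link_util": 0.6}  # sampled each epoch
print(select_technique(techniques, counters)["name"])
```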
Thi Thu Thao KHONG, Takashi NAKADA, Yasuhiko NAKASHIMA
Adversarial attacks are viewed as a danger to Deep Neural Networks (DNNs), revealing a weakness of deep learning models in security-critical applications. Recent work has presented adversarial training as an outstanding defense method against adversaries. Nonetheless, adversarial training is challenging for big datasets and large networks, and it is believed that, without making DNN architectures larger, it is hard to strengthen their robustness to adversarial examples. To avoid iterative adversarial training, we propose the Bayes without Bayesian Learning (BwoBL) algorithm, which performs ensemble inference to improve robustness. As an application of transfer learning, we use the learned parameters of pretrained DNNs to build Bayesian Neural Networks (BNNs) and focus on Bayesian inference without the cost of Bayesian learning. Without any adversarial training, our method is more robust than activation functions designed to enhance adversarial robustness. Moreover, BwoBL easily integrates into any pretrained DNN, not only Convolutional Neural Networks (CNNs) but also other DNNs, such as Self-Attention Networks (SANs), which outperform their convolutional counterparts, and it is also convenient to apply to scaling networks, e.g., ResNet and EfficientNet, with better performance. In particular, our algorithm employs a variety of DNN architectures to construct BNNs against a diversity of adversarial attacks on a large-scale dataset: under an l∞-norm PGD attack with pixel perturbation ε=4/255 and 100 iterations on ImageNet, our proposal built on naturally pretrained ResNets, SANs, and EfficientNets increases top-5 accuracy by 58.18% on average, and this enhancement is 62.26% on average under an l2-norm C&W attack. Combining our proposed method with EfficientNets pretrained on both natural and adversarial images (EfficientNet-ADV) drastically boosts the robustness against PGD and C&W attacks without additional training. Our EfficientNet-ADV-B7 achieves cutting-edge top-5 accuracy of 92.14% and 94.20% on adversarial ImageNet generated by powerful PGD and C&W attacks, respectively.
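The core idea, as described, can be sketched as follows: treat the pretrained weights as the center of a weight distribution and run ensemble (approximate Bayesian) inference by sampling around them, with no additional training. The Gaussian perturbation mechanism and sigma value below are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch of BwoBL-style ensemble inference over a pretrained DNN.
import copy
import torch
import torch.nn.functional as F
import torchvision.models as models

def bwobl_predict(model, x, n_samples=10, sigma=0.01):
    model.eval()
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            sampled = copy.deepcopy(model)           # one draw of the weights
            for p in sampled.parameters():
                p.add_(sigma * torch.randn_like(p))  # w ~ N(pretrained, s^2)
            probs.append(F.softmax(sampled(x), dim=1))
    return torch.stack(probs).mean(dim=0)            # predictive distribution

# Works with any pretrained backbone, e.g. a ResNet:
model = models.resnet50(weights="IMAGENET1K_V2")
x = torch.randn(1, 3, 224, 224)                      # stand-in input
print(bwobl_predict(model, x).argmax(dim=1))
```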
Yukihiro SASAGAWA, Jun YAO, Takashi NAKADA, Yasuhiko NAKASHIMA
Recently, DVS (Dynamic Voltage Scaling) has been aggressively applied to processors equipped with Razor flip-flops. With Razor FFs detecting setup errors, the supply voltage in these processors is scaled down to a near-critical setup-timing level for maximum reduction of power consumption. However, conventional Razor-and-DVS combinations cannot tolerate the error-rate variations caused by IR drops and environment changes well: at the near-critical setup-timing point, even a small change in error rate results in sharp performance degradation. In this paper, we propose RazorProtector, a DVS application method based on a redundant data-path that uses multi-cycle redundant calculation to shorten the recovery penalty after a setup error occurs. A dynamic redundancy-adapting scheme is also given to use the designed redundant data-path effectively, based on a study of program, device, and error-rate characteristics. Our results show that, compared to the traditional DVS method with Razor FFs under the large setup-error rate caused by a 10% unwanted voltage drop, RazorProtector with the adaptive redundancy architecture reduces EDP by up to 78% at a voltage scaling slope of 100 µs/V and by 88% at 200 µs/V.
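A toy penalty model, with invented numbers, shows why shortening the recovery matters as the error rate climbs: plain Razor pays a full flush-and-replay on every setup error, while a redundant data-path re-executes only the failing operation over a few slow cycles.

```python
# Toy comparison of expected cycles per operation under setup errors.
def avg_cycles_per_op(error_rate, recovery_penalty):
    # 1 normal cycle plus the expected recovery cost per operation.
    return 1 + error_rate * recovery_penalty

for err in (0.001, 0.01, 0.05):
    razor = avg_cycles_per_op(err, recovery_penalty=20)  # flush + replay
    prot = avg_cycles_per_op(err, recovery_penalty=3)    # multi-cycle redo
    print(f"error rate {err:.3f}: Razor {razor:.3f}, "
          f"RazorProtector {prot:.3f}")
```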
Takashi NAKADA Tomoki HATANAKA Hiroshi UEKI Masanori HAYASHIKOSHI Toru SHIMIZU Hiroshi NAKAMURA
Improving energy efficiency is critical for embedded systems in our rapidly evolving information society. Near real-time data processing tasks, such as multimedia streaming applications, share the property that their deadline periods are longer than their input intervals due to buffering. In general, executing tasks at lower performance is more energy efficient; on the other hand, higher performance is necessary for large tasks to meet their deadlines. To minimize energy consumption while strictly meeting deadlines, adaptive task scheduling, including dynamic performance-mode selection, is very important. In this work, we propose an energy-efficient slack-based task scheduling algorithm for such tasks that adapts to task-size variations and applies DVFS with the help of statistical analysis. We confirmed that our proposal further reduces energy consumption compared to oracle frame-based scheduling.
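One way to picture the slack-based mode selection is the sketch below: pick the lowest performance mode that still meets the deadline, given the current slack and a statistical bound on the upcoming task size. The mode table and the mean-plus-three-sigma bound are illustrative assumptions, not the paper's analysis.

```python
# Hedged sketch of slack-based DVFS mode selection with a statistical
# task-size estimate.
import statistics

MODES = [(0.5, 0.3), (0.75, 0.55), (1.0, 1.0)]  # (speed, relative power)

def pick_mode(slack, task_size_history):
    """slack: time until the deadline; history: observed task sizes
    (work measured at full speed). Uses mean + 3*stdev as a conservative
    estimate of the next task's size."""
    est = (statistics.mean(task_size_history)
           + 3 * statistics.stdev(task_size_history))
    for speed, power in MODES:          # try the lowest speed first
        if est / speed <= slack:        # finishes before the deadline?
            return speed, power
    return MODES[-1]                    # fall back to full speed

print(pick_mode(slack=2.0, task_size_history=[0.8, 0.9, 0.85, 1.0]))
```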