Ryo SHIBATA Gou HOSOYA Hiroyuki YASHIMA
In racetrack memories (RM), a position error (insertion or deletion error) results from unstable data reading. For position errors in RM with multiple read-heads (RHs), we propose a protograph-based LDPC coded system specified by a protograph and a protograph-aware permutation. The protograph-aware permutation facilitates the design and analysis of the coded system. By solving a multi-objective optimization problem, the coded system attains fast-converging decoding, a good decoding threshold, and linear minimum distance growth. In addition, the coded system can adapt to varying numbers of RHs without any modification. The asymptotic decoding thresholds with a limited number of iterations verify the good properties of the system. Furthermore, for varying numbers of RHs, simulation results with both small and large numbers of iterations exhibit excellent decoding performance at both short and long block lengths, without error floors.
Kanghee KIM Wooseok LEE Sangbang CHOI
Hardware prefetching involves a sophisticated balance between accuracy, coverage, and timeliness while minimizing hardware cost. Recent prefetchers have achieved these goals, but they still require complex hardware and a significant amount of storage. In this paper, we propose an efficient Per-page Most-Offset Prefetcher (PMOP) that minimizes hardware cost and simultaneously improves accuracy while maintaining coverage and timeliness. We achieve these objectives using an enhanced offset prefetcher that performs well with a reasonable hardware cost. Our approach first addresses coverage and timeliness by allowing multiple Most-Offset predictions. To minimize offset interference between pages, the PMOP leverages a fine-grain per-page offset filter. This filter records the access history with page-IDs, which enables efficient mapping and tracking of multiple offset streams from diverse pages. Analysis results show that PMOP outperforms the state-of-the-art Signature Path Prefetcher while reducing storage overhead by a factor of 3.4.
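The per-page idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the PMOP hardware design: the class name, the 64-blocks-per-page geometry, the history depth, and the simple frequency scoring are all hypothetical choices made for illustration. It shows how keeping offset statistics per page-ID keeps interleaved streams from different pages from interfering, and how several top-scoring offsets can be issued at once.

```python
from collections import Counter, defaultdict, deque

BLOCKS_PER_PAGE = 64  # assumption: 4 KiB pages split into 64 B cache blocks


class PerPageOffsetPrefetcher:
    """Toy model: learns the most frequent block offsets separately for each page."""

    def __init__(self, history=4, degree=2):
        # Recent block indices seen in each page (hypothetical history depth).
        self.recent = defaultdict(lambda: deque(maxlen=history))
        # Per-page score table: offset -> number of times it was observed.
        self.scores = defaultdict(Counter)
        self.degree = degree  # how many top offsets to prefetch with

    def access(self, block_addr):
        """Record an access and return the list of block addresses to prefetch."""
        page, idx = divmod(block_addr, BLOCKS_PER_PAGE)
        for prev in self.recent[page]:
            delta = idx - prev
            if delta != 0:
                self.scores[page][delta] += 1
        self.recent[page].append(idx)

        targets = []
        for delta, _count in self.scores[page].most_common(self.degree):
            target = idx + delta
            if 0 <= target < BLOCKS_PER_PAGE:  # never cross the page boundary
                targets.append(page * BLOCKS_PER_PAGE + target)
        return targets
```

Because each page has its own score table, a stride-2 stream in one page and a stride-3 stream in another are learned independently even when their accesses are interleaved.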
Sang-Su LEE Sung-Hyun YOU Seok-Kyoon KIM
Digital phase-locked loops (DPLLs) have been designed in a number of ways to correctly generate pulse signals in various systems. However, the existing DPLLs have poor acquisition performance or are prone to the divergence phenomenon when modeling and/or round-off errors exist and the noise statistics are incorrect. In this paper, we propose a novel DPLL whose phase estimator is designed in hybrid form that utilizes the advantages of Monte Carlo estimation, which is robust to nonlinear effects such as measurement quantization, and a finite memory estimator, which is robust against incorrect noise information and system model mismatch. The robustness of the proposed hybrid Monte Carlo/finite memory DPLL is demonstrated by comparing its phase estimation performance via a numerical example.
Tao LIU Huaxi GU Yue WANG Wei ZOU
An optimized low-power optical memory access network is proposed to alleviate the cost of microring resonators (MRs) in kilocore systems, such as the pass-by loss and integration difficulty. Compared with traditional electronic bus interconnect, the proposed network reduces power consumption and latency by 80% to 89% and 21% to 24%. Moreover, the new network decreases the number of MRs by 90.6% without an increase in power consumption and latency when making a comparison with Optical Ring Network-on-Chip (ORNoC).
Masanori HAYASHIKOSHI Hiroaki TANIZAKI Yasumitsu MURAI Takaharu TSUJI Kiyoshi KAWABATA Koji NII Hideyuki NODA Hiroyuki KONDO Yoshio MATSUDA Hideto HIDAKA
A 1-Transistor 4-Magnetic Tunnel Junction (1T-4MTJ) memory cell has been proposed for field-type Magnetic Random Access Memory (MRAM). The proposed 1T-4MTJ memory cell array achieves 44% higher density than the conventional 1T-1MTJ array thanks to the common access transistor structure in a 4-bit memory cell. A self-reference sensing scheme that can read out with write-back in four clock cycles has also been proposed. Furthermore, we add an estimation that takes sense amplifier variation into account and show that the 1T-4MTJ cell configuration is the best solution for IoT applications. A 1-Mbit MRAM test chip was designed and successfully fabricated using a 130-nm CMOS process. By applying the 1T-4MTJ high-density cell and partially embedding the wordline driver peripheral into the cell array, the 1-Mbit macro size is 4.04 mm², which is 35.7% smaller than the conventional one. Measured data show that the read access time is 55 ns at a typical supply voltage of 1.5 V and 25°C. Combining conventional high-speed 1T-1MTJ caches with the proposed high-density 1T-4MTJ user memories is an effective on-chip hierarchical non-volatile memory solution for low-power MCUs and SoCs in IoT applications.
Hakbeom JANG Jonghyun BAE Tae Jun HAM Jae W. LEE
This paper introduces e-spill, an eager spill mechanism that dynamically finds the optimal spill-threshold by monitoring the GC time at runtime and thereby prevents expensive GC overhead. Our e-spill adopts a slow-start model to gradually increase the spill-threshold until it reaches the optimal point without substantial GCs. We prototype e-spill as an extension to Spark and evaluate it using six workloads on three different parallel platforms. Our evaluations show that e-spill improves performance by up to 3.80× and saves the cost of cluster operation on Amazon EC2 cloud by up to 51% over the baseline system following Spark Tuning Guidelines.
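The slow-start search described above can be sketched as follows. This is a minimal illustration, not the e-spill implementation: the function name, the doubling step, the megabyte units, the 10% GC-time limit, and the `measure_gc_ratio` callback are all assumptions introduced here, and no Spark API is used.

```python
def tune_spill_threshold(measure_gc_ratio, start_mb=64, max_mb=4096, gc_limit=0.1):
    """Grow the spill threshold until GC time becomes substantial.

    measure_gc_ratio(threshold_mb) is a hypothetical callback that runs for a
    while with the given threshold and returns the fraction of time spent in GC.
    """
    threshold = start_mb
    best = start_mb  # largest threshold observed without substantial GC
    while threshold <= max_mb:
        if measure_gc_ratio(threshold) > gc_limit:
            break  # GC overhead became substantial: keep the last good value
        best = threshold
        threshold *= 2  # slow-start style growth (assumed doubling)
    return best
```

If even the starting threshold triggers substantial GC, the sketch simply returns the starting value, since it has nothing smaller to fall back to.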
Kazuichi OE Mitsuru SATO Takeshi NANRI
The response times of solid state drives (SSDs) have decreased dramatically due to the growing use of non-volatile memory express (NVMe) devices. Such devices have response times of less than 100 microseconds on average. The response times of all-flash-array systems have also decreased dramatically through the use of NVMe SSDs. However, there are applications, particularly virtual desktop infrastructure and in-memory database systems, that require storage systems with even shorter response times. Their workloads tend to contain many input-output (IO) concentrations, which are aggregations of IO accesses. These concentrations target narrow regions of the storage volume and can continue for up to an hour. The narrow regions occupy a few percent of the logical unit number capacity, are the target of most IO accesses, and appear at unpredictable logical block addresses. To drastically reduce the response times for such workloads, we developed an automated tiered storage system called “automated tiered storage with fast memory and slow flash storage” (ATSMF) in which the data in targeted regions are migrated between storage devices depending on the predicted remaining duration of the concentration. The assumed environment is a server with non-volatile memory and directly attached SSDs, with the user applications executed on the server as this reduces the average response time. Our system predicts the effect of migration by using the previously monitored values of the increase in response time during migration and the change in response time after migration. These values are consistent for each type of workload if the system is built using both non-volatile memory and SSDs.
In particular, the system predicts the remaining duration of an IO concentration, calculates the expected response-time increase during migration and the expected response-time decrease after migration, and migrates the data in the targeted regions if the sum of response-time decrease after migration exceeds the sum of response-time increase during migration. Experimental results indicate that ATSMF is at least 20% faster than flash storage only and that its memory access ratio is more than 50%.
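The migration rule stated above compares two summed quantities, which can be written out as a short sketch. The function name, parameters, units, and the assumption of a constant IO rate over the concentration are hypothetical simplifications made here for illustration; the abstract does not specify how ATSMF computes the sums.

```python
def should_migrate(remaining_s, migration_s, increase_us, decrease_us, iops):
    """Decide whether migrating a hot region is expected to pay off.

    remaining_s  : predicted remaining duration of the IO concentration (s)
    migration_s  : time needed to migrate the region (s)
    increase_us  : per-IO response-time increase while migrating (us)
    decrease_us  : per-IO response-time decrease after migration (us)
    iops         : assumed constant IO rate during the concentration
    """
    if remaining_s <= migration_s:
        return False  # the concentration would end before migration finishes
    total_increase = iops * migration_s * increase_us
    total_decrease = iops * (remaining_s - migration_s) * decrease_us
    return total_decrease > total_increase
```

For example, with a 60 s migration that costs 20 us per IO and saves 50 us per IO afterwards, a concentration predicted to last another 600 s is worth migrating, while one predicted to last only 70 s is not.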
Pattaravut MALEEHUAN Yuki CHIBA Toshiaki AOKI
In multiprocessors, memory models are introduced to describe the executions of programs among processors. Relaxed memory models, which relax the order of executions, are used in most modern processors, such as ARM and POWER. Because a relaxed memory model can change the program semantics, the executions of a program might differ from the expected ones that preserve program correctness. In addition to relaxed memory models, the way an instruction is executed is described by an instruction semantics, which varies among processor architectures. Dealing with instruction semantics across a variety of assembly programs is a challenge for program verification. Thus, this paper proposes a way to verify a variety of assembly programs that are executed under a relaxed memory model. The variety of assembly programs can be abstracted as the way to execute the programs by introducing an operation structure. Besides, there are existing frameworks for modeling relaxed memory models, which can realize program executions to be verified against a program property. Our work adopts an SMT solver to automatically reveal the program executions under a memory model and verify whether the executions violate the program property. If the solver returns any execution, program correctness is not preserved under the relaxed memory model. To verify programs, an experimental tool was developed to encode the given programs for a memory model into a first-order formula that expresses a violation of program correctness. The tool adopts a modeling framework to encode the programs into a formula for the SMT solver. The solver then automatically finds a valuation that satisfies the formula. In our experiments, two encoding methods were implemented based on two modeling frameworks. The valuations returned by the solver can be regarded as bugs occurring in the original programs.
Akira YAMAWAKI Hiroshi KAMABE Shan LU
In multilevel flash memory, in general, multiple read thresholds are required to read a single logical page. Random I/O (RIO) code, introduced by Sharon and Alrod, is a coding scheme that enables the reading of one logical page using a single read threshold. It was shown that the construction of RIO codes is equivalent to the construction of write-once memory (WOM) codes. Yaakobi and Motwani proposed a family of RIO codes, called parallel RIO (P-RIO) code, in which all logical pages are encoded in parallel. In this paper, we utilize coset coding with Hamming codes in order to construct P-RIO codes. Coset coding is a technique to construct WOM codes using linear binary codes. We leverage information on the data of all pages to encode each page. Our P-RIO codes can store more pages than RIO codes constructed via coset coding and have parameters for which RIO codes do not exist.
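The coset-coding building block mentioned above can be illustrated with the [7,4] Hamming code. The sketch below shows only plain WOM coset coding, not the P-RIO construction of the paper: data is stored as the syndrome of the cell vector, and a write may only change cells from 0 to 1. The function names and the brute-force search for the lightest update are choices made here for illustration.

```python
from itertools import combinations

N = 7  # cells; column j (1-based) of the parity-check matrix is j in binary


def syndrome(cells):
    """3-bit syndrome of the cell vector, returned as an integer 0..7."""
    s = 0
    for j, bit in enumerate(cells, start=1):
        if bit:
            s ^= j
    return s


def write(cells, data):
    """Return a new cell state with syndrome == data, only raising 0 -> 1.

    Tries updates with the fewest raised cells first; returns None when no
    valid update exists for the current state.
    """
    need = syndrome(cells) ^ data
    free = [j for j in range(N) if cells[j] == 0]
    for weight in range(len(free) + 1):
        for flips in combinations(free, weight):
            s = 0
            for j in flips:
                s ^= j + 1
            if s == need:
                new = list(cells)
                for j in flips:
                    new[j] = 1
                return new
    return None


def read(cells):
    """Stored data is simply the syndrome of the current cell state."""
    return syndrome(cells)
```

Starting from the all-zero state, any 3-bit value is written by raising at most one cell, because every nonzero syndrome is a column of the Hamming parity-check matrix; later writes reuse the remaining zero cells.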
In 1973, Arimoto proved the strong converse theorem for discrete memoryless channels, stating that when the transmission rate R is above the channel capacity C, the error probability of decoding goes to one as the block length n of the codewords tends to infinity. He proved the theorem by deriving an exponent function of the probability of correct decoding that is positive if and only if R > C. Subsequently, in 1979, Dueck and Körner determined the optimal exponent of correct decoding. Recently, the author determined an optimal exponent of the probability of correct decoding that has a form similar to the one determined by Dueck and Körner. In this paper, we give a rigorous proof of the equivalence of the above exponent function of Dueck and Körner to an exponent function that can be regarded as an extension of Arimoto's bound to the case with a cost constraint on the channel input.
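For orientation, the two exponent functions are commonly stated as follows for a discrete memoryless channel W(y|x) without a cost constraint. The notation (input distribution P, test channel V, parameter ρ, and the subscripts AR and DK) is assumed here and is not taken from the abstract, and the cost-constrained versions treated in the paper are not reproduced.

```latex
\begin{align}
  P_{\mathrm{c}}^{(n)} &\le \exp\{-n\, G_{\mathrm{AR}}(R)\}, \\
  G_{\mathrm{AR}}(R) &= \sup_{-1 < \rho \le 0}
      \Bigl[ -\rho R + \min_{P} E_0(\rho, P) \Bigr], \\
  E_0(\rho, P) &= -\log \sum_{y} \Bigl[ \sum_{x} P(x)\,
      W(y|x)^{\frac{1}{1+\rho}} \Bigr]^{1+\rho}, \\
  G_{\mathrm{DK}}(R) &= \min_{P,\,V} \Bigl\{ D(V \,\|\, W \mid P)
      + \bigl[ R - I(P; V) \bigr]^{+} \Bigr\}.
\end{align}
```

Here P_c^{(n)} is the probability of correct decoding at block length n, D is the conditional relative entropy, I(P; V) is the mutual information of input P through channel V, and [a]^+ = max(a, 0).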
In this letter, we propose a static wear leveling technique, called Recency-based Wear Leveling (RbWL). The basic idea of RbWL is to execute static wear leveling at minimum levels, because the frequent migrations of cold data by static wear leveling cause significant overhead in a NAND flash memory system. RbWL adjusts the execution frequency according to a threshold value that reflects the lifetime difference of the hot/cold blocks and the total lifetime of the NAND flash memory system. The evaluation results show that RbWL improves the lifetime of NAND flash memory systems by 52%, and that it also reduces the overhead of wear leveling by 8% to 42% and by 13% to 51%, in terms of the number of erase operations and the number of migrations of valid pages, respectively, compared with other algorithms.
Joon-Young PAIK Rize JIN Tae-Sun CHUNG
In terms of system reliability, data recovery is a crucial capability. The lack of data recovery leads to the permanent loss of valuable data. This paper aims to improve data recovery in flash-based storage devices, which exhibit extremely poor data recoverability. To this end, we focus on garbage collection, which determines the life span of data that are likely to be the subject of data recovery requests by users. A new garbage collection mechanism with awareness of data recovery is proposed. First, deleted or overwritten data are categorized into shallow invalid data and deep invalid data based on the possibility of data recovery requests. Second, the proposed mechanism selects a victim area for reclamation of free space, considering the shallow invalid data that have a high possibility of data recovery requests. Our proposal prevents more shallow invalid data from being eliminated during garbage collection. The experimental results show that our garbage collection mechanism can improve data recovery with minor performance degradation.
Yu ZHANG Pengyuan ZHANG Qingwei ZHAO
In this letter, we explore the use of spatio-temporal information in one unified framework to improve the performance of multichannel speech recognition. Generalized cross correlation (GCC) serves as spatial feature compensation, and an attention mechanism across time is embedded within long short-term memory (LSTM) neural networks. Experiments on the AMI meeting corpus show that the proposed method provides an 8.2% relative improvement in word error rate (WER) over the model trained directly on the concatenation of multiple microphone outputs.
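As a rough illustration of what a GCC spatial feature measures, the sketch below computes a circular GCC with phase-transform (PHAT) weighting between two microphone frames and picks the lag of the peak. The PHAT weighting, the pure-Python O(n²) DFT, and all function names are assumptions made for this example; the abstract does not state which GCC variant or framing the authors use.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def gcc_phat(x, y):
    """Circular GCC-PHAT of two equal-length frames; index k is the lag of x vs y."""
    X, Y = dft(x), dft(y)
    cross = [a * b.conjugate() for a, b in zip(X, Y)]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def estimate_delay(x, y):
    """Lag (in samples) at which x best matches a delayed copy of y."""
    cc = gcc_phat(x, y)
    n = len(cc)
    k = max(range(n), key=lambda i: cc[i])
    return k if k <= n // 2 else k - n  # map upper half to negative lags
```

In a recognizer, the vector returned by `gcc_phat` for each microphone pair (usually restricted to a small range of lags) would be appended to the acoustic features rather than reduced to a single delay.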
Tatsuro KOJO Masashi TAWADA Masao YANAGISAWA Nozomu TOGAWA
Non-volatile memories are a promising alternative for memory design, but the data stored in them may still be destroyed by crosstalk and radiation. The stored data can be restored by using error-correcting codes, but these require extra bits to correct bit errors. One of the largest problems in non-volatile memories is that they consume ten to a hundred times more energy than normal memories in bit-writing. It is therefore necessary to reduce the number of written bits. Recently, a REC code (bit-write-reducing and error-correcting code) was proposed for non-volatile memories; it can reduce the number of written bits and has error-correction capability. The REC code is generated from a linear systematic error-correcting code, but it must include the codeword of all 1's, i.e., 11…1. The codeword bit length must be longer in order to satisfy this condition. In this letter, we propose a method to generate a relaxed REC code from a relaxed error-correcting code, which does not necessarily include the codeword of all 1's, so that its codeword bit length can be shorter. We prove that the maximum number of flipped bits of the relaxed REC code is still theoretically bounded. Experimental results show that the relaxed REC code efficiently reduces the number of written bits.
Automatic speech recognition (ASR) and keyword search (KWS) have increasingly found their way into our everyday lives, and their success can be attributed to many factors. Among these factors, the large scale of speech data used for acoustic modeling is the key one. However, it is difficult and time-consuming to acquire large-scale transcribed speech data for some languages, especially low-resource languages. Thus, under low-resource conditions, it becomes important to decide which data to transcribe and use for acoustic modeling in order to improve the performance of ASR and KWS. There are two different ways of using acoustic data for acoustic modeling. One is to use the target language data, and the other is to use large-scale data from other source languages for cross-lingual transfer. In this paper, we propose several approaches for efficiently selecting acoustic data for acoustic modeling. For target language data, a submodular-based unsupervised data selection approach is proposed. It can select more informative and representative utterances for manual transcription for acoustic modeling. For data from other source languages, a submodular multilingual data selection approach based on utterances highly misclassified as the target language and a knowledge-based group multilingual data selection approach are proposed. Using the selected multilingual data to train a multilingual deep neural network for cross-lingual transfer can improve the ASR and KWS performance of the target language. Compared with a language-identification-based multilingual data selection approach, our proposed approach also performs better. In this paper, we also analyze and compare the influence of the language factor and the acoustic factor on the performance of ASR and KWS.
The influence of different scales of target language data on the performance of ASR and KWS under monolingual and cross-lingual conditions is also compared and analyzed, and some significant conclusions are drawn.
Ichraf LAHOULI Robby HAELTERMAN Joris DEGROOTE Michal SHIMONI Geert DE CUBBER Rabah ATTIA
Video surveillance from airborne platforms can suffer from many sources of blur, like vibration, low-end optics, uneven lighting conditions, etc. Many different algorithms have been developed in the past that aim to recover the deblurred image but often incur substantial CPU-time, which is not always available on-board. This paper shows how a “strap-on” quasi-Newton method can accelerate the convergence of existing iterative methods with little extra overhead while keeping the performance of the original algorithm, thus paving the way for (near) real-time applications using on-board processing.
Koki ISHIDA Masamitsu TANAKA Takatsugu ONO Koji INOUE
CMOS microprocessors are limited in their capacity for clock speed improvement because of increasing computing power, i.e., they face a power-wall problem. Single-flux-quantum (SFQ) circuits offer a solution with their ultra-fast-speed and ultra-low-power natures. This paper introduces our contributions towards ultra-high-speed cryogenic SFQ computing. The first step is to design SFQ microprocessors. From qualitatively and quantitatively evaluating past-designed SFQ microprocessors, we have found that revisiting the architecture of SFQ microprocessors and on-chip caches is the first critical challenge. On the basis of cross-layer discussions and analysis, we came to the conclusion that a bit-parallel gate-level pipeline architecture is the best solution for SFQ designs. This paper summarizes our current research results targeting SFQ microprocessors and on-chip cache architectures.
Hirofumi TAKISHITA Yutaka ADACHI Chihiro MATSUI Ken TAKEUCHI
NAND flash memories used in solid-state drives (SSDs) will be replaced with storage-class memories (SCMs), which are comparable with NAND flash in cost and with DRAM in speed. This paper describes the performance difference between sector-unit read (512 Byte) and page-unit read (16 KByte, the NAND flash page size) for the SCM/NAND flash hybrid SSD and the SCM-based SSD, using synthetic and real workloads. The effect of the SCM read-unit size on SSD performance is also analyzed. When the SCM write/read latency is 0.1 µs, the performance difference of the SCM/NAND flash hybrid SSD between page- and sector-unit read is at most about 1% and 6% for the write-intensive and read-intensive workloads, respectively. However, the performance of the SCM-based SSD is significantly improved when sector-unit read is used because extra read latency does not occur. In particular, the SCM-based SSD IOPS is improved by 131% for proj_3 (read-hot-random), because its read request size is small but its read request ratio is large. This paper also shows that the write/read IOPS of the SCM-based SSD with sector-unit read can be predicted from the average write/read request size of the workloads.
Jingwei YAN Wenming ZHENG Zhen CUI Peng SONG
Facial expressions are generated by the actions of the facial muscles located at different facial regions. The spatial dependencies between different facial regions are worth exploring and can improve the performance of facial expression recognition. In this letter, we propose a joint convolutional bidirectional long short-term memory (JCBLSTM) framework to jointly model the discriminative facial textures and the spatial relations between different regions. We treat each row or column of the feature maps output by the CNN as an individual ordered sequence and employ LSTM to model the spatial dependencies within it. Moreover, a shortcut connection for convolutional feature maps is introduced for joint feature representation. We conduct experiments on two databases to evaluate the proposed JCBLSTM method. The experimental results demonstrate that the JCBLSTM method achieves state-of-the-art performance on Multi-PIE and very competitive results on FER-2013.
Yusuke YAMAGA Chihiro MATSUI Yukiya SAKAKI Ken TAKEUCHI
To reduce memory cell errors in the real usage of NAND flash-based SSDs, a precise reliability test for the NAND flash of SSDs based on real usage has been proposed. The reliability of the NAND flash memories of SSDs is seriously degraded as memory cells are scaled. However, conventional simple reliability tests of read-disturb and data-retention cannot give the same result as the real-life VTH shift and memory cell errors. To solve this problem, the proposed reliability test precisely reproduces real memory cell failures by emulating the complicated read, write, and data-retention patterns with an SSD emulator. In this paper, the real-life VTH shift and memory cell errors of two generations of NAND flash memory under real workloads with different characteristics are provided. Using the proposed test method, a 1.6-times BER difference is observed when a write-cold and read-hot workload (hm_1) and a write-hot and read-hot workload (prxy_1) are compared in 1Ynm MLC NAND flash. In addition, as NAND flash memory scales from the 1Xnm to the 1Ynm generation, the discrepancy in error numbers between the conventional reliability test result and the actual reliability measured by the proposed reliability test increases by 6.3 times. Finally, guidelines for read reference voltage shifts and ECC strength are given to achieve high memory cell reliability for various workloads.