Masafumi KINOSHITA Osamu TAKADA Izumi MIZUTANI Takafumi KOIKE Kenji LEIBNITZ Masayuki MURATA
In the big data era, messaging systems are required to process large volumes of message traffic with high scalability and availability. However, conventional systems have two issues regarding availability. The first issue is that failover processing itself has a risk of failure. The second issue is to find a trade-off between consistency and availability. We propose a resilient messaging system based on a distributed in-memory key-value store (KVS). Its servers are interconnected with each other and messages are distributed to multiple servers in normal processing state. This architecture can continue messaging services wherever in the messaging system server/process failures occur without using failover processing. Furthermore, we propose two methods for improved resilience: the round-robin method with a slowdown KVS exclusion and the two logical KVS counter-rotating rings to provide short-term-availability in the messaging system. Evaluation results demonstrate that the proposed system can continue service without failover processing. Compared with the conventional method, our proposed distribution method reduced 92% of error responses to clients caused by server failures.
Shuhei TANAKAMARU Masafumi DOI Ken TAKEUCHI
A design strategy (the required ECC strength and the judgment method of the dominant error mode) of error-prediction low-density parity-check (EP-LDPC) error-correcting code (ECC) and error-recovery schemes for scaled NAND flash memories is discussed in this paper. The reliability characteristics of NAND flash memories are investigated with 1X, 2X and 3Xnm NAND flash memories. Moreover, the system-level reliability of SSDs is analyzed from the acceptable data-retention time of the SSD. The reliability of the NAND flash memory is continuously degrading as the design rule shrinks due to various problems. As a result, future SSDs will not be able to maintain system-level reliability unless advanced ECCs with signal processing are adopted. Therefore, EP-LDPC and error-recovery (ER) schemes are previously proposed to improve the reliability. The reliability characteristics such as the bit-error rate (BER) versus the data-retention time and the effect of the cell-to-cell interference on the BER are measured. These reliability characteristics obtained in this paper are stored in an SSD as a reliability table, which plays a principal role in EP-LDPC scheme. The effectiveness of the EP-LDPC scheme with the scaling of the NAND flash memory is also discussed by analyzing the cell-to-cell interference. An interference factor $alpha$ is proposed to discuss the impact of the cell-to-cell coupling. As a result, the EP-LDPC scheme is assumed to be effective down to 1Xnm NAND flash memory. On the other hand, the ER scheme applies different voltage pulses to memory cells, according to the dominant error mode: program-disturb or data-retention error dominant mode. This paper examines when the error mode changes, corresponding to which pulse should be applied. Additionally, the estimation methods of the dominant error mode by ER scheme are also discussed. Finally, as a result of the system-level reliability analysis, it is concluded that the use of the EP-LDPC scheme can maintain the reliability of the NAND flash memory in 1Xnm technology node.
Takeki NINOMIYA Zhiqiang WEI Shinichi YONEDA Kenji SHIRAISHI
We considered the oxygen diffusivity around a conductive filament of resistive switching oxides, with the aim of designing material appropriate for highly reliable non-volatile memory. Both theoretical and experimental analyses were performed for this consideration. The theoretically obtained oxygen chemical potential difference, which works as a driving force for diffusion, significantly depends on a material. Then, we experimentally confirmed that the oxygen diffusion behaviors vary greatly depending on the chemical potential differences.
Yuya KORA Kyohei YAMAGUCHI Hideki ANDO
Single-thread performance has not improved much over the past few years, despite an ever increasing transistor budget. One of the reasons for this is that there is a speed gap between the processor and main memory, known as the memory wall. A promising method to overcome this memory wall is aggressive out-of-order execution by extensively enlarging the instruction window resources to exploit memory-level parallelism (MLP). However, simply enlarging the window resources lengthens the clock cycle time. Although pipelining the resources solves this problem, it in turn prevents instruction-level parallelism (ILP) from being exploited because issuing instructions requires multiple clock cycles. This paper proposed a dynamic scheme that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP. Specifically, if the scheme predicts that MLP is available during execution, the instruction window is enlarged and the window resources are pipelined, thereby exploiting MLP. Conversely, if the scheme predicts that less MLP is available, that is, ILP is exploitable for improved performance, the instruction window is shrunk and the window resources are de-pipelined, thereby exploiting ILP. Our evaluation results using the SPEC2006 benchmark programs show that the proposed scheme achieves nearly the best performance possible with fixed-size resources. On average, our scheme realizes a performance improvement of 21% over that of a conventional processor, with additional cost of only 6% of the area of the conventional processor core or 3% of that of the entire processor chip. The evaluation results also show 8% better energy efficiency in terms of 1/EDP (energy-delay product).
Akihiko KASAGI Koji NAKANO Yasuaki ITO
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is a task to copy numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm can complete the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. We first present that the conventional algorithm runs $D_w(P)+2{nover w}+3L-3$ time units using n threads on the HMM with width w and latency L, where Dw(P) is the distribution of P. We next show that important regular permutations including transpose, shuffle, and bit-reversal permutations run $2{nover w}+2{nover kw}+2L-2$ time units on the HMM with k DMMs. We have implemented permutation algorithms for these regular permutations on GeForce GTX 680 GPU. The experimental results show that these algorithms run much faster than the conventional algorithm. We also present an offline permutation algorithm for any permutation running in $16{nover w}+16{nover kw}+16L-16$ time units on the HMM with k DMMs. Quite surprisingly, our offline permutation algorithm on the GPU achieves better performance than the conventional algorithm in random permutation, although the running time has a large constant factor. We can say that the experimental results provide a good example of GPU computation showing that a complicated but ingenious implementation with a larger constant factor in computing time can outperform a much simpler conventional algorithm.
Duhu MAN Koji NAKANO Yasuaki ITO
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The approximate string matching (ASM) for two strings X and Y of length m and n is a task to find a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and implement it on GeForce GTX 580 GPU. Our algorithm runs in $O({nover w}+{mnover dw}+{nLover p}+{mnlover p})$ time units on the HMM with p threads, d streaming processors, memory band width w, global memory access latency L, and shared memory access latency l. We also show that the lower bound of the computing time is $Omega({nover w}+{mnover dw}+{nLover p}+{mnlover p})$ time units. Thus, our algorithm for the approximate string matching is time optimal. Further, we implemented our algorithm on GeForce GTX 580 GPU and evaluated the performance. The experimental results show that the ASM of two strings of 1024 and 4M (=222) characters can be done in 419.6ms, while the sequential algorithm can compute it in 27720ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single CPU implementation.
Ye GAO Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
Vector processors have significant advantages for next generation multimedia applications (MMAs). One of the advantages is that vector processors can achieve high data transfer performance by using a high bandwidth memory sub-system, resulting in a high sustained computing performance. However, the high bandwidth memory sub-system usually leads to enormous costs in terms of chip area, power and energy consumption. These costs are too expensive for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems called MVP-cache in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array in a sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache realizes less static power consumption on its tag arrays. MVP-cache can also achieve high efficiency on short vector data transfers because the flexibility of data transfers can be improved by independently controlling the data transfers of each data array.
Toshiyuki DOBASHI Tatsuya MUROFUSHI Masahiro IWAHASHI Hitoshi KIYA
A global tone mapping operation (TMO) for high dynamic range (HDR) images with fixed-point arithmetic is proposed and evaluated in this paper. A TMO generates a low dynamic range (LDR) image from an HDR image by compressing its dynamic range. Since an HDR image is generally expressed in a floating-point data format, a TMO also deals with floating-point data even though a resultant LDR image is integer data. The proposed method treats a floating-point number as two 8-bit integer numbers which correspond to an exponent part and a mantissa part, and applies tone mapping to these integer numbers separately. Moreover, the method conducts all calculations in the tone mapping with only fixed-point arithmetic. As a result, the method reduces a memory cost and a computational cost. The evaluation shows that the proposed method reduces 81.25% of memory usage. The experimental results show that the processing speed of the proposed method with fixed-point arithmetic is 23.1 times faster than the conventional method with floating-point arithmetic. Furthermore, they also show the PSNR of LDR images obtained by the proposed method are comparable to those of the conventional method, though reducing computational and memory cost.
A global tree local X-net network (GTLX) is introduced to realize high-performance data transfer in a multiple-valued fine-grain reconfigurable VLSI (MVFG-RVLSI). A global pipelined tree network is utilized to realize high-performance long-distance bit-parallel data transfer. Moreover, a logic-in-memory architecture is employed for solving data transfer bottleneck between a block data memory and a cell. A local X-net network is utilized to realize simple interconnections and compact switch blocks for eight-near neighborhood data transfer. Moreover, multiple-valued signaling is utilized to improve the utilization of the X-net network, where two binary data can be transferred from two adjacent cells to one common adjacent cell simultaneously at each “X” intersection. To evaluate the MVFG-RVLSI, a fast Fourier transform (FFT) operation is mapped onto a previous MVFG-RVLSI using only the X-net network and the MVFG-RVLSI using the GTLX. As a result, the computation time, the power consumption and the transistor count of the MVFG-RVLSI using the GTLX are reduced by 25%, 36% and 56%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network.
Jun ISHII Hiroyuki MAEOMICHI Akihiro TSUTSUI Ikuo YODA
This paper propose a novel method for obtaining statistical results such as averages, variances, and correlations without leaking any raw data values from data-holders by using multiple pseudonyms. At present, to obtain statistical results using a large amount of data, we need to collect all data in the same storage device. However, gathering real-world data that were generated by different people is not easy because they often contain private information. The authors split the roles of servers into publishing pseudonyms and collecting answers. Splitting these roles, different entities can more easily join as pseudonym servers than in previous secure multi-party computation methods and there is less chance of collusion between servers. Thus, our method enables data holders to protect themselves against malicious attacks from data users. We also estimated a typical problem that occurred with our method and added a pseudonym availability confirmation protocol to prevent the problem. We report our evaluation of the effectiveness of our method through implementation and experimentation and discuss how we incorporated the WebSocket protocol and MySQL Memoty Storage Engine to remove the bottleneck and improve the implementation style. Finally, we explain how our method can obtain averages, variances, and correlation from 5000 data holders within 50 seconds.
Complex-valued Hopfield associative memory (CHAM) is one of the most promising neural network models to deal with multilevel information. CHAM has an inherent property of rotational invariance. Rotational invariance is a factor that reduces a network's robustness to noise, which is a critical problem. Here, we proposed complex-valued bipartite auto-associative memory (CBAAM) to solve this reduction in noise robustness. CBAAM consists of two layers, a visible complex-valued layer and an invisible real-valued layer. The invisible real-valued layer prevents rotational invariance and the resulting reduction in noise robustness. In addition, CBAAM has high parallelism, unlike CHAM. By computer simulations, we show that CBAAM is superior to CHAM in noise robustness. The noise robustness of CHAM decreased as the resolution factor increased. On the other hand, CBAAM provided high noise robustness independent of the resolution factor.
Ju Hee CHOI Jong Wook KWAK Chu Shik JHON
Non-Volatile Memories (NVMs) are considered as promising memory technologies for Last-Level Cache (LLC) due to their low leakage and high density. However, NVMs have some drawbacks such as high dynamic energy in modifying NVM cells, long latency for write operation, and limited write endurance. A number of approaches have been proposed to overcome these drawbacks. But very little attention is paid to consider the cache coherency issue. In this letter, we suggest a new cache coherence protocol to reduce the write operations of the LLC. In our protocol, the block data of the LLC is updated only if the cache block is written-back from a private cache, which leads to avoiding useless write operations in the LLC. The simulation results show that our protocol provides 27.1% energy savings and 26.3% lifetime improvements in STT-RAM at maximum.
Woo Hyun AHN Joon-Woo CHOI Jaewon OH Seung-Ho LIM Kyungbaek KIM
Virtual memory systems page out a cluster of contiguous modified pages in virtual memory to a swap disk at one disk I/O but cannot find large clusters in applications mainly changing non-contiguous pages. Our proposal stores small clusters at one disk I/O. This decreases disk writes for paging out small clusters, thus improving page-out performance.
Koh JOHGUCHI Kasuaki YOSHIOKA Ken TAKEUCHI
In this paper, we propose an optimum access method for a phase change memory (PCM) with NAND strings. A PCM with a block erase interface is proposed. The method, which has a SET block erase operation and fast RESET programming, is proposed since the SET operation causes a slow access time for conventional PCM;. From the results of measurement, the SET-ERASE operation is successfully completed while the RESET-ERASE operation is incomplete owing to serial connection. As a result, the block erase interface with the SET-ERASE and RESET program method realizes a 7.7 times faster write speed compared than a conventional RAM interface owing to the long SET time. We also give pass-transistor design guidelines for PCM with NAND strings. In addition, the write-capability and write-disturb problems are investigated. The ERASE operation for the proposed device structure can be realized with the same current as that for the SET operation of a single cell. For the pass transistor, about 4.4 times larger on-current is needed to carry out the RESET operation and to avoid the write-disturb problem than the minimum RESET current of a single cell. In this paper, the SET programming method is also verified for a conventional RAM interface. The experimental results show that the write-capability and write-disturb problems are negligible.
Takayuki NOZAKI Kenta KASAI Kohichi SAKANIWA
In this paper, we propose a message passing decoding algorithm which lowers decoding error rates in the error floor regions for non-binary low-density parity-check (LDPC) codes transmitted over the binary erasure channel (BEC) and the memoryless binary-input output-symmetric (MBIOS) channels. In the case for the BEC, this decoding algorithm is a combination with belief propagation (BP) decoding and maximum a posteriori (MAP) decoding on zigzag cycles, which cause decoding errors in the error floor region. We show that MAP decoding on the zigzag cycles is realized by means of a message passing algorithm. Moreover, we extend this decoding algorithm to the MBIOS channels. Simulation results demonstrate that the decoding error rates in the error floor regions by the proposed decoding algorithm are lower than those by the BP decoder.
Koh JOHGUCHI Toru EGAMI Kousuke MIYAJI Ken TAKEUCHI
This paper gives a write voltage and read reference current generator considering temperature characteristics for multi-level Ge2Sb2Te5-based phase change memories. Since the optimum SET and RESET voltages linearly changes by the temperature, the voltage supply circuit must track this characteristic. In addition, the measurement results show that the read current depends on both read temperature and the write temperature and has exponential dependence on the read temperature. Thus, the binning technique is applied for each read and write temperature regions. The proposed variable TC generator can achieve below ±0.5 LSB precision from the measured differential non-linearity and integral non-linearity. As a result, the temperature characteristics of both the linear write voltage and the exponential read current can be tracked with the proposed variation tolerant linear temperature coefficient current generator.
Eisuke FUKUDA Yasuyuki OISHI Takeshi TAKANO Daisuke TAKAGO Yoshimasa DAIDO Hiroyuki MORIKAWA
This paper describes the details of the iteration process used to determine the transfer functions of linear time-invariant (LTI) circuits causing the memory effect of power amplifier (PA). An outline of the method is reported in our work presented at ICCS2012. The accuracy of the method is improved by using cross-correlation spectra at three signal levels, and its validity is confirmed by a computer simulation. The method can be applied to online updating of PAs operating in mobile communication systems. The updating is realized separately from the fast varying nonlinear coefficients. The possibility of updating with a short interval is indirectly shown for the nonlinear coefficients using a procedure similar to that of memoryless PAs. For PAs characterized by the method, this paper also describes the inverses that cancel the nonlinear distortion with minimum complexity. The validity of the inverse is confirmed by a computer simulation on the power spectrum of the PA for orthogonal frequency-division multiplexing (OFDM) signals with 500 subcarriers. The simulated spectra show that the fifth order or higher inverses are effective in keeping adjacent channel leakage power ratio (ACLR) lower than -60dB at the practical signal level. Improvements in the error vector magnitude (EVM) due to the inverse were also confirmed by reductions of gain and phase variations under varying envelope conditions.
The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It is assumed that warps (or groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in real GPUs, in the sense that they are randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce asynchronous versions of these models in which warps are arbitrarily dispatched. In addition, we assume that threads can execute the “syncthreads” instruction for barrier synchronization. Since the barrier synchronization operation may be costly, we should evaluate and minimize the number of barrier synchronization operations executed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm to the sum of n numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in O(n/w+llog n) time units and O(log l/log w+log log w) barrier synchronization steps using wl threads on the asynchronous UMM with width w and latency l. Since the computation of the sum takes at least Ω(n/w+llog n) time units, this algorithm is time optimal. Finally, we show that the prefix-sums of n numbers can also be computed in O(n/w+llog n) time units and O(log l/log w+log log w) barrier synchronization steps using wl threads.
Goichi ONO Yuki MORI Michiaki NAKAYAMA Yusuke KANNO
In order to analyze an impact of threshold voltage (Vth) fluctuation induced by random telegraph noise (RTN) on LSI circuit design, we measured a 40-nm 6-Tr-SRAM TEG which enables to evaluate individual bit-line current. RTN phenomenon was successfully measured and we also identified that the transfer MOSFET in an SRAM bit-cell was the most sensitive MOSFET. The proposed word line boosting technique, which applies slightly extra stress to the transfer MOSFET, improves about 30% of detecting probability of fail-bit cells caused by RTN.
Traditionally, in computer systems, file I/O has been a big performance bottleneck for I/O intensive applications. The recent advent of non-volatile byte-addressable memory (NVM) technologies such as STT-MRAM and PCM, provides a chance to store persistent data with a high performance close to DRAM's. However, as the location of the persistent storage device gets closer to the CPU, the system software layers overheads for accessing the data such as file system layer including virtual file system layer and device driver are no longer negligible. In this paper, we propose a light-weight user-level persistent storage, called UStore, which is physically allocated on the NVM and is mapped directly into the virtual address space of an application. UStore makes it possible for the application to fast access the persistent data without the system software overheads and extra data copy between the user space and kernel space. We show how UStore is easily applied to existing applications with little elaboration and evaluate its performance enhancement through several benchmark tests.