Keyword Search Result

[Keyword] memory (654 hits)

Showing 161-180 of 654 hits

  • Improved Resilience through Extended KVS-Based Messaging System

    Masafumi KINOSHITA  Osamu TAKADA  Izumi MIZUTANI  Takafumi KOIKE  Kenji LEIBNITZ  Masayuki MURATA  

     
    PAPER-Internet Operation and Management

      Publicized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    578-587

    In the big data era, messaging systems are required to process large volumes of message traffic with high scalability and availability. However, conventional systems have two availability issues. The first is that failover processing itself carries a risk of failure. The second is the trade-off between consistency and availability. We propose a resilient messaging system based on a distributed in-memory key-value store (KVS). Its servers are interconnected with each other, and messages are distributed to multiple servers during normal operation. This architecture can continue messaging services without failover processing, wherever in the messaging system server or process failures occur. Furthermore, we propose two methods for improved resilience: a round-robin method with slowdown-KVS exclusion, and two logical KVS counter-rotating rings that provide short-term availability in the messaging system. Evaluation results demonstrate that the proposed system can continue service without failover processing. Compared with the conventional method, our proposed distribution method reduced error responses to clients caused by server failures by 92%.
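
    For illustration, the round-robin distribution with slowdown-KVS exclusion described above can be sketched roughly as follows. This is only an assumed, minimal Python sketch: the class name, the latency threshold, and the put() client call are hypothetical and not taken from the paper.

      import itertools
      import time

      class RoundRobinDistributor:
          """Distribute messages across KVS servers, skipping servers judged to
          be slowing down (hypothetical sketch of the distribution idea)."""

          def __init__(self, servers, slow_threshold_sec=0.5):
              self.servers = list(servers)
              self.slow_threshold = slow_threshold_sec
              self.excluded = set()                 # servers excluded as "slowdown KVS"
              self._cycle = itertools.cycle(self.servers)

          def pick_server(self):
              # Round-robin over servers, excluding those marked as slow.
              for _ in range(len(self.servers)):
                  server = next(self._cycle)
                  if server not in self.excluded:
                      return server
              raise RuntimeError("no healthy KVS server available")

          def store(self, key, value):
              server = self.pick_server()
              start = time.monotonic()
              server.put(key, value)                # assumed KVS client API
              if time.monotonic() - start > self.slow_threshold:
                  self.excluded.add(server)         # exclude the slow server from rotation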

  • A Design Strategy of Error-Prediction Low-Density Parity-Check (EP-LDPC) Error-Correcting Code (ECC) and Error-Recovery Schemes for Scaled NAND Flash Memories

    Shuhei TANAKAMARU  Masafumi DOI  Ken TAKEUCHI  

     
    PAPER-Integrated Electronics

      Vol:
    E98-C No:1
      Page(s):
    53-61

    A design strategy (the required ECC strength and a method for judging the dominant error mode) for error-prediction low-density parity-check (EP-LDPC) error-correcting code (ECC) and error-recovery schemes for scaled NAND flash memories is discussed in this paper. The reliability characteristics of NAND flash memories are investigated with 1X, 2X and 3Xnm NAND flash memories. Moreover, the system-level reliability of SSDs is analyzed from the acceptable data-retention time of the SSD. The reliability of NAND flash memory continuously degrades as the design rule shrinks, due to various problems. As a result, future SSDs will not be able to maintain system-level reliability unless advanced ECCs with signal processing are adopted. Therefore, EP-LDPC and error-recovery (ER) schemes were previously proposed to improve the reliability. Reliability characteristics such as the bit-error rate (BER) versus the data-retention time and the effect of cell-to-cell interference on the BER are measured. These reliability characteristics are stored in an SSD as a reliability table, which plays a principal role in the EP-LDPC scheme. The effectiveness of the EP-LDPC scheme with the scaling of the NAND flash memory is also discussed by analyzing the cell-to-cell interference. An interference factor $\alpha$ is introduced to discuss the impact of the cell-to-cell coupling. As a result, the EP-LDPC scheme is expected to be effective down to 1Xnm NAND flash memory. On the other hand, the ER scheme applies different voltage pulses to memory cells according to the dominant error mode: program-disturb-dominant or data-retention-error-dominant. This paper examines when the error mode changes and, correspondingly, which pulse should be applied. Additionally, methods for estimating the dominant error mode in the ER scheme are also discussed. Finally, as a result of the system-level reliability analysis, it is concluded that the use of the EP-LDPC scheme can maintain the reliability of NAND flash memory at the 1Xnm technology node.

  • Theoretical and Experimental Approaches to Select Resistive Switching Material

    Takeki NINOMIYA  Zhiqiang WEI  Shinichi YONEDA  Kenji SHIRAISHI  

     
    BRIEF PAPER-Electronic Materials

      Vol:
    E98-C No:1
      Page(s):
    62-64

    We considered the oxygen diffusivity around a conductive filament of resistive switching oxides, with the aim of designing materials appropriate for highly reliable non-volatile memory. Both theoretical and experimental analyses were performed. The theoretically obtained oxygen chemical potential difference, which acts as a driving force for diffusion, depends significantly on the material. We then experimentally confirmed that the oxygen diffusion behaviors vary greatly depending on these chemical potential differences.

  • MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

    Yuya KORA  Kyohei YAMAGUCHI  Hideki ANDO  

     
    PAPER-Computer System

      Publicized:
    2014/09/22
      Vol:
    E97-D No:12
      Page(s):
    3110-3123

    Single-thread performance has not improved much over the past few years, despite an ever-increasing transistor budget. One of the reasons for this is the speed gap between the processor and main memory, known as the memory wall. A promising method to overcome this memory wall is aggressive out-of-order execution that extensively enlarges the instruction window resources to exploit memory-level parallelism (MLP). However, simply enlarging the window resources lengthens the clock cycle time. Although pipelining the resources solves this problem, it in turn prevents instruction-level parallelism (ILP) from being exploited, because issuing instructions then requires multiple clock cycles. This paper proposes a dynamic scheme that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP. Specifically, if the scheme predicts that MLP is available during execution, the instruction window is enlarged and the window resources are pipelined, thereby exploiting MLP. Conversely, if the scheme predicts that less MLP is available, that is, that ILP is exploitable for improved performance, the instruction window is shrunk and the window resources are de-pipelined, thereby exploiting ILP. Our evaluation results using the SPEC2006 benchmark programs show that the proposed scheme achieves nearly the best performance possible with fixed-size resources. On average, our scheme realizes a performance improvement of 21% over that of a conventional processor, at an additional cost of only 6% of the area of the conventional processor core, or 3% of that of the entire processor chip. The evaluation results also show 8% better energy efficiency in terms of 1/EDP (energy-delay product).
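
    For illustration, the adaptive resizing decision can be sketched as a small piece of control logic. The window sizes and the form of the MLP prediction below are assumptions for illustration only, not the parameters of the proposed scheme.

      SMALL_WINDOW = 128    # baseline, non-pipelined window size (assumed)
      LARGE_WINDOW = 1024   # enlarged, pipelined window size (assumed)

      def resize_window(predicted_mlp_available):
          """Return (window_size, pipelined) for the next execution interval."""
          if predicted_mlp_available:
              # MLP expected (e.g. clustered long-latency cache misses):
              # enlarge the window and pipeline its resources.
              return LARGE_WINDOW, True
          # Little MLP expected, so ILP matters more: shrink the window and
          # de-pipeline the resources so instructions can issue in one cycle.
          return SMALL_WINDOW, False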

  • Offline Permutation on the CUDA-enabled GPU

    Akihiko KASAGI  Koji NAKANO  Yasuaki ITO  

     
    PAPER-GPU

      Vol:
    E97-D No:12
      Page(s):
    3052-3062

    The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is the task of copying numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm can complete the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. We first show that the conventional algorithm runs in $D_w(P)+2\frac{n}{w}+3L-3$ time units using n threads on the HMM with width w and latency L, where $D_w(P)$ is the distribution of P. We next show that important regular permutations, including the transpose, shuffle, and bit-reversal permutations, run in $2\frac{n}{w}+2\frac{n}{kw}+2L-2$ time units on the HMM with k DMMs. We have implemented permutation algorithms for these regular permutations on a GeForce GTX 680 GPU. The experimental results show that these algorithms run much faster than the conventional algorithm. We also present an offline permutation algorithm for any permutation running in $16\frac{n}{w}+16\frac{n}{kw}+16L-16$ time units on the HMM with k DMMs. Quite surprisingly, our offline permutation algorithm on the GPU achieves better performance than the conventional algorithm for random permutations, although its running time has a large constant factor. We can say that the experimental results provide a good example of GPU computation showing that a complicated but ingenious implementation with a larger constant factor in computing time can outperform a much simpler conventional algorithm.
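
    For illustration, the conventional offline permutation algorithm referred to above can be written directly; the following Python/NumPy sketch shows b[p[i]] ← a[i] and a bit-reversal permutation as an example of a regular permutation. It is a sequential reference only and does not reproduce the HMM or GPU implementations.

      import numpy as np

      def conventional_permutation(a, p):
          """Offline permutation: copy a[i] to b[p[i]] for all i
          (the conventional algorithm; GPU-specific details omitted)."""
          b = np.empty_like(a)
          b[p] = a                       # scatter along the permutation P
          return b

      # Example: the bit-reversal permutation for n = 8, one of the regular
      # permutations treated in the paper.
      n = 8
      bits = int(np.log2(n))
      p = np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])
      a = np.arange(n)
      print(conventional_permutation(a, p))   # -> [0 4 2 6 1 5 3 7]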

  • An Optimal Implementation of the Approximate String Matching on the Hierarchical Memory Machine, with Performance Evaluation on the GPU

    Duhu MAN  Koji NAKANO  Yasuaki ITO  

     
    PAPER-GPU

      Vol:
    E97-D No:12
      Page(s):
    3063-3071

    The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The approximate string matching (ASM) for two strings X and Y of length m and n is the task of finding a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and to implement it on a GeForce GTX 580 GPU. Our algorithm runs in $O(\frac{n}{w}+\frac{mn}{dw}+\frac{nL}{p}+\frac{mnl}{p})$ time units on the HMM with p threads, d streaming processors, memory bandwidth w, global memory access latency L, and shared memory access latency l. We also show that the lower bound of the computing time is $\Omega(\frac{n}{w}+\frac{mn}{dw}+\frac{nL}{p}+\frac{mnl}{p})$ time units. Thus, our algorithm for the approximate string matching is time optimal. Further, we implemented our algorithm on a GeForce GTX 580 GPU and evaluated the performance. The experimental results show that the ASM of two strings of 1024 and 4M (=$2^{22}$) characters can be done in 419.6 ms, while the sequential algorithm computes it in 27720 ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single-CPU implementation.
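
    For illustration, a sequential reference for the ASM task (finding the substring of Y most similar to X) is the standard semi-global edit-distance dynamic programming shown below. This sketch assumes that similarity is measured by edit distance and corresponds only to the sequential baseline, not to the parallel HMM/GPU algorithm of the paper.

      def approximate_string_matching(x, y):
          """Minimum edit distance between X and any substring of Y
          (semi-global alignment), computed in O(mn) time and O(n) space."""
          m, n = len(x), len(y)
          # prev[j]: cost of matching x[:i-1] against a substring of y ending at j.
          prev = [0] * (n + 1)           # the match may start anywhere in Y for free
          for i in range(1, m + 1):
              curr = [i] + [0] * n
              for j in range(1, n + 1):
                  cost = 0 if x[i - 1] == y[j - 1] else 1
                  curr[j] = min(prev[j - 1] + cost,   # match / substitute
                                prev[j] + 1,          # delete from X
                                curr[j - 1] + 1)      # insert into X
              prev = curr
          return min(prev)               # best ending position anywhere in Y

      print(approximate_string_matching("kitten", "a sitting cat"))   # -> 2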

  • MVP-Cache: A Multi-Banked Cache Memory for Energy-Efficient Vector Processing of Multimedia Applications

    Ye GAO  Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Publicized:
    2014/08/22
      Vol:
    E97-D No:11
      Page(s):
    2835-2843

    Vector processors have significant advantages for next-generation multimedia applications (MMAs). One of the advantages is that vector processors can achieve high data transfer performance by using a high-bandwidth memory sub-system, resulting in high sustained computing performance. However, the high-bandwidth memory sub-system usually incurs enormous costs in terms of chip area, power, and energy consumption. These costs are too high for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems, called MVP-cache, in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array per sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache achieves lower static power consumption in its tag arrays. MVP-cache also achieves high efficiency on short vector data transfers because the flexibility of data transfers is improved by independently controlling the data transfers of each data array.
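
    For illustration, the organizational difference can be pictured as a data structure in which one tag entry is shared by several small, independently controlled data arrays. The sketch below is only a structural illustration with assumed sizes; it is not the MVP-cache design itself.

      class MVPCacheLine:
          """One tag entry shared by several small, independent data arrays
          (structural illustration only; sizes are assumed, not from the paper)."""

          def __init__(self, num_data_arrays=4, sub_line_bytes=16):
              self.tag = None
              self.valid = [False] * num_data_arrays   # each sub-array tracked separately
              self.data = [bytearray(sub_line_bytes) for _ in range(num_data_arrays)]

          def fill(self, tag, index, payload):
              """Fill only the sub-array touched by a short vector transfer."""
              self.tag = tag
              self.valid[index] = True
              self.data[index][:len(payload)] = payload

      line = MVPCacheLine()
      line.fill(tag=0x1A2B, index=2, payload=b"\x01\x02\x03\x04")
      print(line.valid)                  # -> [False, False, True, False]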

  • A Fixed-Point Global Tone Mapping Operation for HDR Images in the RGBE Format

    Toshiyuki DOBASHI  Tatsuya MUROFUSHI  Masahiro IWAHASHI  Hitoshi KIYA  

     
    PAPER

      Vol:
    E97-A No:11
      Page(s):
    2147-2153

    A global tone mapping operation (TMO) for high dynamic range (HDR) images with fixed-point arithmetic is proposed and evaluated in this paper. A TMO generates a low dynamic range (LDR) image from an HDR image by compressing its dynamic range. Since an HDR image is generally expressed in a floating-point data format, a TMO also deals with floating-point data even though the resultant LDR image is integer data. The proposed method treats a floating-point number as two 8-bit integers corresponding to the exponent part and the mantissa part, and applies tone mapping to these integers separately. Moreover, the method conducts all calculations in the tone mapping with fixed-point arithmetic only. As a result, the method reduces both memory and computational costs. The evaluation shows that the proposed method reduces memory usage by 81.25%. The experimental results show that the processing speed of the proposed method with fixed-point arithmetic is 23.1 times faster than that of the conventional method with floating-point arithmetic. Furthermore, they also show that the PSNR of LDR images obtained by the proposed method is comparable to that of the conventional method, despite the reduced computational and memory costs.
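
    For illustration, the core idea of handling a floating-point HDR value as an 8-bit exponent and an 8-bit mantissa can be sketched as below. The decomposition follows the spirit of the RGBE representation (value ≈ mantissa × 2^(E-128) / 256); the paper's actual tone-mapping curve and fixed-point scaling are not reproduced, and the function names are hypothetical.

      import math

      def float_to_bytes(x):
          """Split a positive float into an 8-bit mantissa and an 8-bit biased
          exponent, so later processing can stay in integer arithmetic."""
          if x <= 0.0:
              return 0, 0
          frac, exp = math.frexp(x)        # x = frac * 2**exp with 0.5 <= frac < 1
          mantissa = int(frac * 256)       # 8-bit mantissa (128..255)
          return mantissa, exp + 128       # 8-bit biased exponent

      def bytes_to_float(mantissa, e_byte):
          """Reference reconstruction of the approximate value (for checking)."""
          return mantissa * 2.0 ** (e_byte - 128) / 256.0

      # Example: 3.14 splits into two small integers and reconstructs approximately.
      m, e = float_to_bytes(3.14)
      print(m, e, bytes_to_float(m, e))    # -> 200 130 3.125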

  • Multiple-Valued Fine-Grain Reconfigurable VLSI Using a Global Tree Local X-Net Network

    Xu BAI  Michitaka KAMEYAMA  

     
    PAPER-VLSI Architecture

      Vol:
    E97-D No:9
      Page(s):
    2278-2285

    A global tree local X-net network (GTLX) is introduced to realize high-performance data transfer in a multiple-valued fine-grain reconfigurable VLSI (MVFG-RVLSI). A global pipelined tree network is utilized to realize high-performance long-distance bit-parallel data transfer. Moreover, a logic-in-memory architecture is employed to resolve the data transfer bottleneck between a block data memory and a cell. A local X-net network is utilized to realize simple interconnections and compact switch blocks for data transfer among the eight nearest neighbors. Moreover, multiple-valued signaling is utilized to improve the utilization of the X-net network, where two binary data can be transferred from two adjacent cells to one common adjacent cell simultaneously at each “X” intersection. To evaluate the MVFG-RVLSI, a fast Fourier transform (FFT) operation is mapped onto a previous MVFG-RVLSI using only the X-net network and onto the MVFG-RVLSI using the GTLX. As a result, the computation time, power consumption, and transistor count of the MVFG-RVLSI using the GTLX are reduced by 25%, 36%, and 56%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network.

  • Privacy-Preserving Statistical Analysis Method by Splitting Server Roles for Distributed Real-World Data

    Jun ISHII  Hiroyuki MAEOMICHI  Akihiro TSUTSUI  Ikuo YODA  

     
    PAPER

      Vol:
    E97-B No:9
      Page(s):
    1779-1789

    This paper proposes a novel method for obtaining statistical results such as averages, variances, and correlations without leaking any raw data values from data holders, by using multiple pseudonyms. At present, to obtain statistical results from a large amount of data, we need to collect all the data in the same storage device. However, gathering real-world data generated by different people is not easy because such data often contain private information. The authors split the roles of servers into publishing pseudonyms and collecting answers. By splitting these roles, different entities can join as pseudonym servers more easily than in previous secure multi-party computation methods, and there is less chance of collusion between servers. Thus, our method enables data holders to protect themselves against malicious attacks from data users. We also identified a typical problem that can occur with our method and added a pseudonym availability confirmation protocol to prevent it. We report our evaluation of the effectiveness of our method through implementation and experimentation, and discuss how we incorporated the WebSocket protocol and the MySQL Memory Storage Engine to remove the bottleneck and improve the implementation. Finally, we explain how our method can obtain averages, variances, and correlations from 5000 data holders within 50 seconds.
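
    For illustration, once answers arrive under unlinkable pseudonyms, the statistics themselves reduce to simple aggregation over the submitted values. The sketch below shows only that aggregation arithmetic; the pseudonym-publishing server, the answer-collecting server, and the availability confirmation protocol of the paper are not modeled, and all names are hypothetical.

      import math

      def aggregate_statistics(pseudonymous_answers):
          """Compute mean, variance, and correlation from (x, y) pairs submitted
          under pseudonyms, so the aggregator never links values to identities."""
          n = len(pseudonymous_answers)
          sx = sum(x for x, _ in pseudonymous_answers)
          sy = sum(y for _, y in pseudonymous_answers)
          sxx = sum(x * x for x, _ in pseudonymous_answers)
          syy = sum(y * y for _, y in pseudonymous_answers)
          sxy = sum(x * y for x, y in pseudonymous_answers)
          mean_x, mean_y = sx / n, sy / n
          var_x = sxx / n - mean_x ** 2
          var_y = syy / n - mean_y ** 2
          cov = sxy / n - mean_x * mean_y
          return mean_x, var_x, cov / math.sqrt(var_x * var_y)

      print(aggregate_statistics([(1, 2), (2, 4), (3, 7)]))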

  • Complex-Valued Bipartite Auto-Associative Memory

    Yozo SUZUKI  Masaki KOBAYASHI  

     
    PAPER-Nonlinear Problems

      Vol:
    E97-A No:8
      Page(s):
    1680-1687

    Complex-valued Hopfield associative memory (CHAM) is one of the most promising neural network models for dealing with multilevel information. CHAM has an inherent property of rotational invariance. Rotational invariance is a factor that reduces a network's robustness to noise, which is a critical problem. Here, we propose complex-valued bipartite auto-associative memory (CBAAM) to solve this reduction in noise robustness. CBAAM consists of two layers, a visible complex-valued layer and an invisible real-valued layer. The invisible real-valued layer prevents rotational invariance and the resulting reduction in noise robustness. In addition, CBAAM has high parallelism, unlike CHAM. By computer simulations, we show that CBAAM is superior to CHAM in noise robustness. The noise robustness of CHAM decreases as the resolution factor increases. In contrast, CBAAM provides high noise robustness independent of the resolution factor.

  • Write Avoidance Cache Coherence Protocol for Non-volatile Memory as Last-Level Cache in Chip-Multiprocessor

    Ju Hee CHOI  Jong Wook KWAK  Chu Shik JHON  

     
    LETTER-Computer System

      Vol:
    E97-D No:8
      Page(s):
    2166-2169

    Non-Volatile Memories (NVMs) are considered promising memory technologies for the Last-Level Cache (LLC) due to their low leakage and high density. However, NVMs have drawbacks such as high dynamic energy for modifying NVM cells, long write latency, and limited write endurance. A number of approaches have been proposed to overcome these drawbacks, but little attention has been paid to the cache coherence issue. In this letter, we suggest a new cache coherence protocol to reduce the write operations of the LLC. In our protocol, the block data in the LLC is updated only when the cache block is written back from a private cache, which avoids useless write operations in the LLC. The simulation results show that our protocol provides up to 27.1% energy savings and 26.3% lifetime improvement for STT-RAM.
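
    For illustration, the write-avoidance idea (updating the NVM LLC copy of a block only when a dirty block is written back from a private cache) can be sketched as a toy model. The state handling below is an assumption for illustration and is not the protocol defined in the letter.

      class WriteAvoidanceLLC:
          """Toy model of an NVM last-level cache that counts NVM writes."""

          def __init__(self):
              self.blocks = {}              # address -> data held in the NVM LLC
              self.nvm_writes = 0

          def on_private_eviction(self, addr, data, dirty):
              if dirty:
                  # Written-back dirty block: the LLC copy must be updated.
                  self.blocks[addr] = data
                  self.nvm_writes += 1
              # Clean eviction: the LLC copy is already valid, so the costly
              # NVM write is avoided entirely.

      llc = WriteAvoidanceLLC()
      llc.on_private_eviction(0x100, b"new", dirty=True)
      llc.on_private_eviction(0x200, b"old", dirty=False)
      print(llc.nvm_writes)                 # -> 1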

  • Paging out Multiple Clusters to Improve Virtual Memory System Performance

    Woo Hyun AHN  Joon-Woo CHOI  Jaewon OH  Seung-Ho LIM  Kyungbaek KIM  

     
    LETTER-Software System

      Vol:
    E97-D No:7
      Page(s):
    1905-1909

    Virtual memory systems page out a cluster of contiguous modified pages in virtual memory to a swap disk in one disk I/O, but cannot find large clusters in applications that mainly modify non-contiguous pages. Our proposal stores multiple small clusters in one disk I/O. This decreases the number of disk writes required to page out small clusters, thus improving page-out performance.

  • NAND Phase Change Memory with Block Erase Architecture and Pass-Transistor Design Requirements for Write and Disturbance

    Koh JOHGUCHI  Kasuaki YOSHIOKA  Ken TAKEUCHI  

     
    PAPER

      Vol:
    E97-C No:4
      Page(s):
    351-359

    In this paper, we propose an optimum access method for a phase change memory (PCM) with NAND strings. A PCM with a block erase interface is proposed. The method, which has a SET block erase operation and fast RESET programming, is proposed because the SET operation causes a slow access time in conventional PCM. The measurement results show that the SET-ERASE operation completes successfully, whereas the RESET-ERASE operation is incomplete owing to the serial connection. As a result, the block erase interface with the SET-ERASE and RESET program method realizes a 7.7 times faster write speed than a conventional RAM interface, owing to the long SET time. We also give pass-transistor design guidelines for PCM with NAND strings. In addition, the write-capability and write-disturb problems are investigated. The ERASE operation for the proposed device structure can be realized with the same current as the SET operation of a single cell. For the pass transistor, an on-current about 4.4 times larger than the minimum RESET current of a single cell is needed to carry out the RESET operation and to avoid the write-disturb problem. In this paper, the SET programming method is also verified for a conventional RAM interface. The experimental results show that the write-capability and write-disturb problems are negligible.

  • Message Passing Decoder with Decoding on Zigzag Cycles for Non-binary LDPC Codes

    Takayuki NOZAKI  Kenta KASAI  Kohichi SAKANIWA  

     
    PAPER-Coding Theory

      Vol:
    E97-A No:4
      Page(s):
    975-984

    In this paper, we propose a message passing decoding algorithm which lowers decoding error rates in the error floor regions for non-binary low-density parity-check (LDPC) codes transmitted over the binary erasure channel (BEC) and the memoryless binary-input output-symmetric (MBIOS) channels. For the BEC, this decoding algorithm combines belief propagation (BP) decoding with maximum a posteriori (MAP) decoding on zigzag cycles, which cause decoding errors in the error floor region. We show that MAP decoding on the zigzag cycles can be realized by means of a message passing algorithm. Moreover, we extend this decoding algorithm to the MBIOS channels. Simulation results demonstrate that the decoding error rates in the error floor regions achieved by the proposed decoding algorithm are lower than those of the BP decoder.

  • A Temperature Tracking Read Reference Current and Write Voltage Generator for Multi-Level Phase Change Memories

    Koh JOHGUCHI  Toru EGAMI  Kousuke MIYAJI  Ken TAKEUCHI  

     
    PAPER

      Vol:
    E97-C No:4
      Page(s):
    342-350

    This paper presents a write voltage and read reference current generator that accounts for temperature characteristics in multi-level Ge2Sb2Te5-based phase change memories. Since the optimum SET and RESET voltages change linearly with temperature, the voltage supply circuit must track this characteristic. In addition, the measurement results show that the read current depends on both the read temperature and the write temperature, and depends exponentially on the read temperature. Thus, a binning technique is applied to each read and write temperature region. The proposed variable TC generator achieves better than ±0.5 LSB precision in the measured differential non-linearity and integral non-linearity. As a result, the temperature characteristics of both the linear write voltage and the exponential read current can be tracked with the proposed variation-tolerant linear temperature coefficient current generator.

  • Method of Determining Delay Dependence of the Memory Effect in Power Amplifiers and Derivation of Inverse to Cancel the Nonlinear Distortions

    Eisuke FUKUDA  Yasuyuki OISHI  Takeshi TAKANO  Daisuke TAKAGO  Yoshimasa DAIDO  Hiroyuki MORIKAWA  

     
    PAPER

      Vol:
    E97-A No:3
      Page(s):
    749-758

    This paper describes the details of the iteration process used to determine the transfer functions of linear time-invariant (LTI) circuits causing the memory effect of a power amplifier (PA). An outline of the method was reported in our work presented at ICCS2012. The accuracy of the method is improved by using cross-correlation spectra at three signal levels, and its validity is confirmed by computer simulation. The method can be applied to online updating of PAs operating in mobile communication systems. The updating is realized separately from the fast-varying nonlinear coefficients. The possibility of updating at short intervals is indirectly shown for the nonlinear coefficients using a procedure similar to that for memoryless PAs. For PAs characterized by the method, this paper also describes the inverses that cancel the nonlinear distortion with minimum complexity. The validity of the inverse is confirmed by a computer simulation of the power spectrum of the PA for orthogonal frequency-division multiplexing (OFDM) signals with 500 subcarriers. The simulated spectra show that fifth-order or higher inverses are effective in keeping the adjacent channel leakage power ratio (ACLR) below -60 dB at the practical signal level. Improvements in the error vector magnitude (EVM) due to the inverse were also confirmed by reductions in gain and phase variations under varying envelope conditions.

  • Asynchronous Memory Machine Models with Barrier Synchronization

    Koji NAKANO  

     
    PAPER-Parallel and Distributed Computing

      Vol:
    E97-D No:3
      Page(s):
    431-441

    The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It is assumed that warps (or groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in real GPUs, in the sense that they are randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce asynchronous versions of these models in which warps are arbitrarily dispatched. In addition, we assume that threads can execute the “syncthreads” instruction for barrier synchronization. Since the barrier synchronization operation may be costly, we should evaluate and minimize the number of barrier synchronization operations executed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm that computes the sum of n numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in $O(n/w+l\log n)$ time units and $O(\log l/\log w+\log\log w)$ barrier synchronization steps using wl threads on the asynchronous UMM with width w and latency l. Since the computation of the sum takes at least $\Omega(n/w+l\log n)$ time units, this algorithm is time optimal. Finally, we show that the prefix-sums of n numbers can also be computed in $O(n/w+l\log n)$ time units and $O(\log l/\log w+\log\log w)$ barrier synchronization steps using wl threads.
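
    For illustration, the role of barrier synchronization in such a summation can be seen in a conventional tree reduction, where each round halves the number of partial sums and must be separated from the next round by a barrier. The sketch below uses Python threads and threading.Barrier purely as an analogy for the "syncthreads" instruction; it does not model the UMM, its width w, or its latency l.

      import threading

      def parallel_sum(values, num_threads=4):
          """Tree reduction with an explicit barrier between rounds
          (an analogy for syncthreads, not the paper's UMM algorithm)."""
          data = list(values)
          n = len(data)
          barrier = threading.Barrier(num_threads)

          def worker(tid):
              stride = 1
              while stride < n:
                  # Each thread combines its share of the pairs at this stride.
                  for i in range(2 * stride * tid, n, 2 * stride * num_threads):
                      if i + stride < n:
                          data[i] += data[i + stride]
                  barrier.wait()        # every thread finishes the round first
                  stride *= 2

          threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()
          return data[0]

      print(parallel_sum(range(1, 101)))    # -> 5050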

  • A Method for Measuring of RTN by Boosting Word-Line Voltage in 6-Tr-SRAMs

    Goichi ONO  Yuki MORI  Michiaki NAKAYAMA  Yusuke KANNO  

     
    PAPER-Integrated Electronics

      Vol:
    E97-C No:3
      Page(s):
    215-221

    In order to analyze the impact of threshold voltage (Vth) fluctuation induced by random telegraph noise (RTN) on LSI circuit design, we measured a 40-nm 6-Tr-SRAM TEG that enables evaluation of individual bit-line currents. The RTN phenomenon was successfully measured, and we identified the transfer MOSFET in an SRAM bit-cell as the most sensitive MOSFET. The proposed word-line boosting technique, which applies slightly extra stress to the transfer MOSFET, improves the detection probability of fail-bit cells caused by RTN by about 30%.

  • UStore: STT-MRAM Based Light-Weight User-Level Storage for Enhancing Performance of Accessing Persistent Data

    Yong SONG  Kyuho PARK  

     
    PAPER-Data Engineering, Web Information Systems

      Vol:
    E97-D No:3
      Page(s):
    497-509

    Traditionally, in computer systems, file I/O has been a big performance bottleneck for I/O-intensive applications. The recent advent of non-volatile byte-addressable memory (NVM) technologies such as STT-MRAM and PCM provides a chance to store persistent data with performance close to DRAM's. However, as the location of the persistent storage device gets closer to the CPU, the overheads of the system software layers for accessing the data, such as the file system layer (including the virtual file system layer) and the device driver, are no longer negligible. In this paper, we propose a light-weight user-level persistent storage, called UStore, which is physically allocated on the NVM and is mapped directly into the virtual address space of an application. UStore makes it possible for an application to access persistent data quickly, without the system software overheads and without extra data copies between user space and kernel space. We show how UStore can be applied to existing applications with little modification and evaluate its performance enhancement through several benchmark tests.
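
    For illustration, the idea of mapping a persistent region directly into an application's address space, so that persistent data is accessed with ordinary loads and stores instead of file I/O system calls, can be sketched with a memory-mapped file. UStore itself targets byte-addressable NVM and adds allocation and consistency machinery that this sketch omits; the file name and size are hypothetical.

      import mmap
      import os

      # Stand-in for an NVM-backed region: an ordinary file mapped into the
      # process address space (UStore's STT-MRAM allocation is not modeled).
      PATH, SIZE = "ustore_demo.bin", 4096

      fd = os.open(PATH, os.O_CREAT | os.O_RDWR)
      os.ftruncate(fd, SIZE)
      region = mmap.mmap(fd, SIZE)          # persistent data accessed like memory

      region[0:12] = b"hello ustore"        # a store becomes a memory write
      region.flush()                        # push the update to the backing store

      region.close()
      os.close(fd)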

