Yasuyuki OISHI Shigekazu KIMURA Eisuke FUKUDA Takeshi TAKANO Yoshimasa DAIDO Kiyomichi ARAKI
To reduce laborious tasks of the phase determination, our previous paper has proposed a method to derive phase reference for two-tone intermodulation distortion (IMD) measurement of a power amplifier (PA) by using small-signal S-parameters. Since the method is applicable to low output power level, this paper proposes an iterative process to extend the applicable power level up to 1-dB compression. The iterative process is based on extraction of linear response: the principle of the extraction is described theoretically by using an accurate model of the PA with memory effect. Measurement of two-tone IMD is made for a GaN FET PA. Validity of the iteration is confirmed as convergence of the extracted linear response to that given by the product of S21 and input signal. Measured results also show validity of the physical model of the memory effect provided by Vuolevi et al. because beat frequency dependences of IMD's are accurately fitted by bias impedances at even order harmonics of envelope frequency. The PA is characterized by using measured results and the third and fifth order inverses of the PA are designed. Improvement of IMD is theoretically confirmed by using the inverses as predistorters.
Hideki TAKASE Hiroyuki TOMIYAMA Hiroaki TAKADA
Energy minimization has become one of the primary goals in the embedded real-time domain. Consequently, scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. However, most previous approaches were not applicable to a preemptive multi-task environment. We propose three methods of partitioning and allocation of scratch-pad memory for fixed-priority-based preemptive multi-task systems. The three methods, i.e., spatial, temporal, and hybrid methods, achieve energy reduction in the instruction memory subsystems. With the spatial method, each task occupies its exclusive space in scratch-pad memory. With the temporal method, the running task uses the entire scratch-pad space. The content of scratch-pad memory is swapped out as a task executes or gets preempted. The hybrid method is based on the spatial one, but a higher priority task can temporarily use the space of a lower priority task. The amount of space is prioritized for higher priority tasks. We formulate each method as an integer programming problem that simultaneously determines (1) partitioning of scratch-pad memory space for the tasks, and (2) allocation of program code to scratch-pad memory space for each task. Our methods not only support real-time task scheduling but also aggressively consider the periods and priorities of tasks for energy minimization. Additionally, we implement an RTOS-hardware cooperative support mechanism for runtime code allocation to the scratch-pad memory space. We conducted experiments with a fully functional real-time operating system. The experimental results demonstrate the effectiveness of our techniques. Up to 73% energy reduction compared to a conventional method was achieved.
Je-Hoon LEE Young-Jun SONG Sang-Choon KIM
This paper presents a self-timed SRAM system employing a new memory segmentation technique that divides memory cell arrays into multiple regions based on their latency, not the size of the memory cell array. This is the main difference between the proposed memory segmentation technique and the conventional method. Consequently, the proposed method provides a more efficient way to reduce the memory access time. We also propose an architecture of a dummy cell and completion signal generator for the handshaking protocol. We synthesized an 8 MB SRAM system consisting of 16 512K memory blocks using the Hynix 0.35-µm CMOS process. Our implementation shows 15% higher performance compared to the other systems. Our implementation results show a trade-off between the area overhead and the performance depending on the number of memory segments.
Masaki KOBAYASHI Hirofumi YAMADA Michimasa KITAHARA
Complex-valued Associative Memory (CAM) is an advanced model of Hopfield Associative Memory. The CAM is based on multi-state neurons and has a high representational ability. Lee proposed gradient descent learning for the CAM to improve the storage capacity. It is based only on the phases of input signals. In this paper, we propose another type of gradient descent learning based on both the phases and the amplitudes. The proposed learning method improves the noise robustness and accelerates learning.
This paper presents a content-addressable memory (CAM) using a phase-change device. A hierarchical match-line structure and a one-hot-spot block code are indispensable to suppress the resistance ratio of the phase-change device and the area overhead of match detectors. As a result, an 8-nsec 72-bit-parallel-search CAM is implemented using a phase-change-device/MOS-hybrid circuitry, where high and low resistances are higher than 2.3 MΩ and lower than 97 kΩ, respectively, while maintaining one-day retention.
Won-young CHUNG Ha-young JEONG Won Woo RO Yong-surk LEE
In this paper, we propose a novel low-cost Message Passing Interface (MPI) unit between processor nodes, which supports message passing in multiprocessor systems using a distributed memory architecture. Our MPI unit operates in the standard mode, using the buffered mode for small data transactions and the synchronous mode for large data transactions. This results in increased performance by reducing the control message transmission time for small amounts of data. We verified the performance with a simulator designed in SystemC. Additionally, we designed the MPI unit using Verilog HDL, and we synthesized it with Synopsys Design Compiler. The proposed standard mode MPI unit shows high performance even though the MPI unit occupies less than 1% of the whole chip. Thus, with respect to low-cost design and scalability, this MPI hardware unit is useful for increasing the overall performance of an embedded Multiprocessor System on a Chip (MPSoC).
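The size-based choice between buffered and synchronous sending described above can be sketched as follows. This is only an illustration, not the authors' hardware design: the threshold `EAGER_LIMIT_BYTES`, the function names, and the control-message counts (no handshake for buffered sends, a request and a ready reply for synchronous sends) are all assumptions, since the abstract gives none of these details.

```python
# Hypothetical threshold: the abstract does not say where "small" ends.
EAGER_LIMIT_BYTES = 1024


def choose_send_mode(message_size_bytes, limit=EAGER_LIMIT_BYTES):
    """Pick 'buffered' for small messages and 'synchronous' for large ones."""
    if message_size_bytes < 0:
        raise ValueError("message size must be non-negative")
    return "buffered" if message_size_bytes <= limit else "synchronous"


def control_messages(mode):
    """Control messages per send in a simplified model (assumed counts).

    Buffered: the payload is copied out immediately, no handshake.
    Synchronous: a request and a ready reply precede the payload.
    """
    return {"buffered": 0, "synchronous": 2}[mode]


if __name__ == "__main__":
    for size in (64, 4096):
        mode = choose_send_mode(size)
        print(size, mode, control_messages(mode))
```

Under this simplified model, small messages skip the handshake entirely, which is where a saving in control message transmission time would come from.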
Bo AI Zhang-Dui ZHONG Bo LI Lin-hua MA
In this paper, a robust fractional order memory polynomial pre-distorter with two novel schemes to conduct digital base-band power amplifier pre-distortion is proposed. For the first scheme, fractional order terms are included in the conventional memory polynomial containing the odd and even order polynomial terms, which is called Scheme One. The second scheme, called Scheme Two, simply replaces even order polynomial terms with fractional order polynomial terms to improve the linear performance of power amplifiers. The mathematical expressions for these two schemes are derived. The computer simulations and numerical analysis show that, compared with the conventional pre-distortion methods, 11 dB and 8.5 dB more out-of-band suppression gain can be obtained by Scheme One and Scheme Two, respectively. Corresponding FPGA realization shows that the two schemes are cost-effective in terms of hardware resources.
Woong-Kee LOH Yang-Sae MOON Wookey LEE
Since the release of human genome sequences, one of the most important research issues has been indexing the genome sequences, and the suffix tree is the most widely adopted structure for that purpose. The traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also provide limited performance improvement due to random disk accesses. Moreover, they do not fully utilize recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on a 'divide-and-conquer' strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in units of contiguous chunks. In addition, our algorithm fully utilizes multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms the previous fastest algorithm, DIGEST, by up to 10.5 times.
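The divide-and-conquer idea of splitting the suffixes into partitions and handing each partition to a separate worker can be illustrated with a small sketch. This is not the authors' algorithm: it builds a suffix array rather than a suffix tree, works entirely in memory, and partitions by a leading prefix, which is an assumed partitioning rule. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_suffixes(text, prefix_len=1):
    """Group suffix start positions by their first prefix_len characters."""
    groups = {}
    for i in range(len(text)):
        groups.setdefault(text[i:i + prefix_len], []).append(i)
    return groups


def sort_partition(text, positions):
    """Sort one partition's suffixes lexicographically (naive comparison)."""
    return sorted(positions, key=lambda i: text[i:])


def build_suffix_array(text, prefix_len=1, workers=4):
    """Sort each partition independently, then concatenate in prefix order."""
    groups = partition_suffixes(text, prefix_len)
    keys = sorted(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda k: sort_partition(text, groups[k]), keys))
    result = []
    for part in parts:
        result.extend(part)
    return result


if __name__ == "__main__":
    print(build_suffix_array("banana"))
```

Because suffixes in different partitions differ within their leading prefix, the partitions never need to be merged, only concatenated. Threads are used here for brevity; a process pool would likely be needed for a real multi-core speedup in CPython.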
Recently, 3-dimensional (3-D) vertical Floating Gate (FG) type NAND flash memory cell arrays with the Extended Sidewall Control Gate (ESCG) were proposed [7]. Using this novel structure, we successfully achieved superior program speed, read current, and reduced interference characteristics, owing to the high Control Gate (CG) coupling ratio with less interference capacitance and the highly electrical inverted S/D technique. However, the process stability of the ESCG structure, such as its sensitivity to variations of the physical dimensions, has not been sufficiently confirmed. In this paper, we intensively investigate the electrical dependency on the physical dimensions of the ESCG, such as the line and spacing of the ESCG and the thickness of the barrier oxide. Using 2-dimensional (2-D) TCAD simulations, we compare the basic characteristics of the FG type flash cell operation in terms of program speed, read current, and interference effect. Finally, we check the process window and suggest the optimum target of the ESCG structure for reliable flash cell operation. From all of the above, we confirm that the 3-dimensional vertical FG NAND flash memory cell array using the ESCG structure is the most attractive candidate for terabit 3-D vertical NAND flash cell arrays.
Jong-Dae LEE Hyun-Min SEUNG Kyoung-Cheol KWON Jea-Gun PARK
In summary, we successfully developed a polymer nonvolatile 4F² memory cell. It is based on nonvolatile memory characteristics, such as memory margin and retention time, which were observed in a memory cell embedded with Ag nanocrystals in a PVK layer. The nonvolatile memory characteristics depend on the shape, distribution, and isolation of the Ag nanocrystals. Accordingly, the thickness of the Ag film plays an important role in optimizing the Ag nanocrystals. Therefore, the polymer nonvolatile memory cell should be fabricated with an appropriate film thickness, and the interface between the Ag nanocrystals and PVK needs to be improved to obtain sufficient nonvolatile memory characteristics.
Several kinds of capacitor-less DRAM cells based on planar SOI-MOSFET technology have been proposed and researched to overcome the integration limit of the conventional DRAM. In this paper, we propose a Floating Body type DRAM cell array architecture with the Vertical MOSFET and discuss its basic operation using a 3-D device simulator. In contrast to previous planar SOI-MOSFET technology, the Floating Body type DRAM with the Vertical MOSFET achieves a cell area of 4F² and obtains its floating body cell by isolating the body from the substrate vertically by the bottom electrode. Therefore, the necessity for an SOI substrate is eliminated. In this paper, the cell array architecture of the Floating Body type 1T-DRAM is proposed, and furthermore, the basic memory operations of read, write, and erase for the Vertical type 1-transistor (1T) DRAM in the 45 nm technology node are shown. In addition, the retention and disturb characteristics of the Vertical type 1T-DRAM are discussed.
Akira OTAKE Keita YAMAGUCHI Katsumasa KAMIYA Yasuteru SHIGETA Kenji SHIRAISHI
Due to the aggressive scaling of non-volatile memories, “charge-trap memories” such as MONOS-type memories have become one of the most important targets. One of the merits of such MONOS-type memories is that they can trap charges inside atomic-scale defect sites in SiN layers. At the same time, however, atomic-scale charge traps tend to induce additional large structural changes. Hydrogen has attracted great attention as an important heteroatom in MONOS-type memories. We theoretically investigate the basic characteristics of hydrogen defects in the SiN layer of MONOS-type memories on the basis of first-principles calculations. We find that SiN structures with a hydrogen impurity tend to show reversible structural change during program/erase operations.
In this paper, we propose a memory-efficient structure for a pulse Doppler radar in order to reduce the hardware complexity. The conventional pulse Doppler radar computes a fast Fourier transform (FFT) for all range cells in order to extract the velocity of targets. We observed that this method requires a huge amount of memory to perform the FFT processes for all of the range cells. Therefore, instead of detecting the velocity in all range cells, the proposed architecture extracts the velocity of the targets by using only the cells related to the moving targets. According to our simulations and experiments, the detection performance of the proposed architecture is 93.5%, and the proposed structure can reduce the hardware complexity by up to 66.2% compared with the conventional structure.
Sang-Hyeon LEE Moonkyung KIM Byung-ki CHEONG Jooyeon KIM Jo-Won LEE Sandip TIWARI
We report a fast single-element nonvolatile memory that employs an amorphous-to-crystalline phase change. A temperature change is induced within a single electronic element in confined-geometry transistors to cause the phase change. This novel phase change memory (PCM) operates without the need for charge transport through insulator films for charge storage in a floating gate. GeSbTe (GST) was employed as the phase change material, undergoing a transition below 200 °C. The phase change, which alters the conductivity and permittivity of the film, results in the threshold voltage shift observed in transistors and capacitors.
Jaesun KIM Younghoon KIM Hyuk-Jae LEE
The excessive memory access required to perform motion compensation when decoding compressed video is one of the main limitations in improving the performance of an H.264/AVC decoder. This paper proposes an H.264/AVC decoder that employs three techniques to reduce external memory access events: efficient distribution of reference frame data, on-chip cache memory, and frame memory recompression. The distribution of reference frame data is optimized to reduce the number of row activations during SDRAM access. A novel cache organization is proposed to simplify tag comparisons and ease the access to consecutive 4×4 blocks. A recompression algorithm is modified to improve compression efficiency by using unused storage space in neighboring blocks as well as the correlation with the neighboring pixels stored in the cache. Experimental results show that the three techniques together reduce external memory access time by an average of 90%, which is 16% better than the improvements achieved by previous work. The efficiency of the frame memory recompression algorithm is improved with a 32×32 cache, resulting in a PSNR improvement of 0.371 dB. The H.264/AVC decoder with the three techniques is implemented and fabricated as an ASIC using 0.18 µm technology.
Yuji KUNITAKE Toshinori SATO Hiroto YASUURA
Negative Bias Temperature Instability (NBTI) is one of the major reliability problems in advanced technologies. NBTI causes a threshold voltage shift in a PMOS transistor. When the PMOS transistor is negatively biased, the threshold voltage shifts negatively. On the other hand, the threshold voltage recovers if the PMOS transistor is positively biased. In an SRAM cell, due to NBTI, the threshold voltage degrades in the load PMOS transistors. The degradation has an impact on the Static Noise Margin (SNM), which is a measure of the read stability of a 6-T SRAM cell. In this paper, we discuss the relationship between NBTI degradation in an SRAM cell and the dynamic stress and recovery condition. There are two important characteristics. One is the stress probability, which is defined as the fraction of time that the PMOS transistor is negatively biased. The other is the stress and recovery cycle, which is defined as the switching interval of an SRAM value. In our observations, in order to mitigate the NBTI degradation, the stress probability should be small and the stress and recovery cycle should be shorter than 10 msec. Based on the observations, we propose a novel cell-flipping technique, which makes the stress probability close to 50%. In addition, we show results of case studies that apply the cell-flipping technique to register files and cache memories.
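The effect of cell flipping on the stress probability can be illustrated with a toy model. This sketch rests on assumptions that are not in the abstract: time is discretized into steps, one load PMOS is taken to be stressed while the cell stores 1 and the other while it stores 0, and the stored bit is inverted during every other interval of a hypothetical length `flip_interval`. The actual flipping mechanism is not described in the abstract.

```python
def stress_probabilities(stored_bits):
    """Return (p_a, p_b): fraction of time each load PMOS is stressed.

    Assumed model: PMOS A is stressed while the cell holds 1,
    PMOS B is stressed while the cell holds 0.
    """
    if not stored_bits:
        raise ValueError("need at least one time step")
    ones = sum(stored_bits)
    total = len(stored_bits)
    return ones / total, (total - ones) / total


def apply_cell_flipping(logical_bits, flip_interval):
    """Invert the physically stored bit during every other interval."""
    if flip_interval <= 0:
        raise ValueError("flip_interval must be positive")
    physical = []
    for step, bit in enumerate(logical_bits):
        inverted = (step // flip_interval) % 2 == 1
        physical.append(bit ^ 1 if inverted else bit)
    return physical


if __name__ == "__main__":
    always_zero = [0] * 1000  # a cell that logically holds 0 the whole time
    print(stress_probabilities(always_zero))
    print(stress_probabilities(apply_cell_flipping(always_zero, 10)))
```

In this model a cell that logically never changes leaves one transistor stressed all the time, whereas periodic flipping splits the stress evenly between the two transistors.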
Teruyoshi HATANAKA Mitsue TAKAHASHI Shigeki SAKAI Ken TAKEUCHI
This paper presents an improvement of the memory cell reliability by the memory cell VTH optimization of the ferroelectric (Fe)-NAND flash memory. The effects of the memory cell VTH on the reliability of the Fe-NAND flash memory are experimentally analyzed for the first time. The reliability is evaluated by the measured VTH shift due to the read disturb, program disturb and data retention. Three types of Fe-NAND flash memory cells, a positive, zero and negative VTH memory cell, are defined on the basis of the memory cell VTH. The midpoint of the VTH of the programmed and erased states is 1 V, 0 V and -0.3 V in a positive, zero and negative VTH memory cell, respectively. The VTH shifts of the positive, zero and negative VTH memory cells show similar characteristics in the program/erase and the VPASS and VPGM disturbs because the external electric field is so high that the internal depolarization field does not affect the VTH shift. On the other hand, in the data retention, the VTH shifts of the three types of VTH memory cells show different characteristics. The reliability of the Fe-NAND flash memory is best optimized in the zero VTH memory cell. In the proposed zero VTH Fe-NAND flash memory cell scheme, the measured VTH shift due to the read disturb, program disturb and data retention decreases by 32%, 24% and 10%, respectively, compared with the conventional positive VTH Fe-NAND flash memory cell scheme. In contrast, in the negative VTH memory cell, the VTH shift during the data retention is 0.49 V, which is unacceptably large because of the depolarization field. The conventional positive VTH memory cell suffers from severe read and program disturbs. The measured results are drastically different from those of the conventional floating-gate NAND flash memory cell, where the negative VTH memory cell is most suitable in terms of reliability.
Ning DENG Weixing JI Jiaxin LI Qi ZUO Feng SHI
Many state-of-the-art embedded systems adopt scratch-pad memory (SPM) as the main on-chip memory due to its advantages in terms of energy consumption and on-chip area. The cache is automatically managed by the hardware, while SPM is generally manipulated by the software. Traditional compiler-based SPM allocation methods commonly use static analysis and profiling knowledge to identify the frequently used data during runtime. The data transfer is determined at the compiling stage. However, these methods are fragile when the access pattern is unpredictable at compile time. Also, as embedded devices diversify, we expect a novel SPM management that can support embedded application portability over platforms. This paper proposes a novel runtime SPM management method based on the core working set (CWS) theory. A counting-based CWS identification algorithm is adopted to heuristically determine those data blocks in the program's working set with high reference frequency, and then these promising blocks are allocated to SPM. The novelty of this SPM management method lies in its dependence on the program's dynamic access pattern as the main cue to conduct SPM allocation at runtime, thus offloading SPM management from the compiler. Furthermore, the proposed method needs the assistance of MMU to complete address redirection after data transfers. We evaluate the new approach by comparing it with the cache system and a classical profiling-driven method, and the results indicate that the CWS-based SPM management method can achieve a considerable energy reduction compared with the two reference systems without notable degradation on performance.
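A counting-based selection of frequently referenced blocks, followed by a redirection table of the kind an MMU could consult, might look like the sketch below. It is a simplified illustration rather than the paper's CWS identification algorithm: it counts references over a complete trace instead of tracking a working set at runtime, and the function names, the tie-breaking rule, and the slot numbering are assumptions.

```python
from collections import Counter


def identify_core_blocks(block_trace, spm_capacity_blocks):
    """Pick the most frequently referenced blocks that fit in the SPM.

    Ties are broken by block number so the result is deterministic
    (an assumed rule, chosen only for reproducibility).
    """
    if spm_capacity_blocks < 0:
        raise ValueError("capacity must be non-negative")
    counts = Counter(block_trace)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [block for block, _ in ranked[:spm_capacity_blocks]]


def build_redirect_table(core_blocks):
    """Map each selected block to an SPM slot (hypothetical numbering)."""
    return {block: slot for slot, block in enumerate(core_blocks)}


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 1, 2, 4]  # block numbers touched by the program
    core = identify_core_blocks(trace, spm_capacity_blocks=2)
    print(core, build_redirect_table(core))
```

Accesses to blocks present in the table would be redirected to the SPM, and all other accesses would go to main memory as before.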
Hasitha Muthumala WAIDYASOORIYA Masanori HARIYAMA Michitaka KAMEYAMA
Accelerator cores in low-power embedded processors have on-chip multiple memory modules to increase the data access speed and to enable parallel data access. When large functional units such as multipliers and dividers are used for addressing, a large power and chip area are consumed. Therefore, recent low-power processors use small functional units such as adders and counters to reduce the power and area. Such small functional units make it difficult to implement complex addressing patterns without duplicating data among multiple memory modules. The data duplication wastes the memory capacity and increases the data transfer time significantly. This paper proposes a method to reduce the memory duplication for window-based image processing, which is widely used in many applications. Evaluations using an accelerator core show that the proposed method reduces the data amount and data transfer time by more than 50%.
Masayuki ARAI Tatsuro ENDO Kazuhiko IWASAKI Michinobu NAKAO Iwao SUZUKI
To reduce the manufacturing cost of SoCs with many embedded SRAMs, we propose a scheme to reduce the area per good die for the SoC memory built-in self-test (MBIST). We first propose BIST hardware overhead reduction by application of an encoder-based comparator. For the repair of a faulty SRAM module with 2-D redundancy, we propose a spare assignment algorithm. Based on an existing range-checking-first algorithm (RCFA), we propose assign-all-row-RCFA (A-RCFA), which assigns unused spare rows to faulty ones in order to suppress the degradation of the repair rate due to the compressed fail location information output from the encoder-based comparator. Then, considering that an SoC has many SRAM modules, we propose a heuristic algorithm based on the iterative improvement algorithm (IIA), which determines whether each SRAM should have a spare row or not, in order to minimize the area per good die. Experimental results on practical-scale benchmark SoCs with more than 1,000 SRAM modules indicate that encoder-based comparators reduce hardware overhead by about 50% compared to traditional ones, and that combining the IIA-based algorithm for determining the redundancy architecture with the encoder-based comparator effectively reduces the area per good die.