A reduced-sample-rate (RSR) sigma-delta-pipeline (SDP) analog-to-digital converter architecture suitable for high-resolution and high-speed applications with low oversampling ratios (OSR) is presented. The proposed architecture employs a class of high-order noise transfer functions (NTFs) with novel pole-zero locations. A design methodology is developed to reach the optimum NTF. The optimum NTF determines the location of the non-zero poles, simultaneously improving the stability of the loop and implementing the reduced-sample-rate structure. The main features of the proposed NTF and ADC architecture are a unity-gain signal transfer function that mitigates analog circuit imperfections, a simplified analog implementation with a reduced number of operational transconductance amplifiers (OTAs), and a novel, aggressive yet stable NTF with high out-of-band gain that achieves a larger peak signal-to-noise ratio (SNR). To verify the usefulness of the proposed architecture, NTF, and design methodology, two different cases are investigated. Simulation results show that a 4th-order modulator designed with the proposed approach can achieve a maximum SNDR of 115 dB and 124.1 dB with an OSR of only 8 and 16, respectively.
Yasuhiro SUGIMOTO Yuji GOHDA Shigeto TANAKA
The possibility of realizing a CMOS pipelined current-mode A-D converter (ADC) for video applications has been examined. Twice the input current is obtained at the output of a bit-block of the pipelined ADC by subtracting the negative output current from the positive output current in the pseudo-differential configuration. Subtraction of the sub-DAC (D-to-A converter) current from twice the input current is performed under the control of the current comparator, which compares the positive and the negative input currents. A prototype chip has been implemented using 0.35 µm CMOS devices. It operates at 28 MS/s and shows a 42 dB signal-to-noise ratio from a 2 V supply voltage.
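The bit-block operation described above (double the input, then subtract the sub-DAC value chosen by the comparator) can be illustrated with an idealized behavioral model. This is only a sketch under stated assumptions: one bit per stage, a noiseless comparator with its threshold at zero, and a bipolar input range of [-i_ref, +i_ref). The names `pipeline_adc`, `i_ref` and `n_bits` are hypothetical and do not come from the paper, and the model does not represent the pseudo-differential current circuitry itself.

```python
def pipeline_adc(i_in, i_ref=1.0, n_bits=8):
    """Idealized 1-bit-per-stage pipeline conversion (behavioral sketch).

    Each stage decides one bit, then forms the residue
    2 * input - (sub-DAC value) and hands it to the next stage.
    """
    bits = []
    residue = i_in
    for _ in range(n_bits):
        bit = 1 if residue >= 0 else 0      # comparator decision
        dac = i_ref if bit else -i_ref      # sub-DAC output selected by the bit
        residue = 2 * residue - dac         # doubled input minus sub-DAC value
        bits.append(bit)
    return bits


def bits_to_code(bits):
    """Pack the per-stage bits (MSB first) into an integer output code."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code


if __name__ == "__main__":
    for x in (-0.5, 0.0, 0.3, 0.75):
        print(x, bits_to_code(pipeline_adc(x)))
```

With these assumptions the output code equals the offset-binary quantization of the input, floor((x + i_ref) / (2 * i_ref) * 2^n_bits), as long as the input is not within rounding error of a code boundary.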
Byoung-Han MIN Young-Jae CHO Hee-Sung CHAE Hee-Won PARK Seung-Hoon LEE
This work proposes a 10b 100 MS/s 1.4 mm² CMOS ADC for low-power multimedia applications. The proposed two-step pipeline ADC minimizes chip area and power dissipation at the target resolution and sampling rate. The wide-band SHA employs a gate-bootstrapping circuit to handle both single-ended and differential inputs of 1.2 Vp-p at 10b accuracy while the second-stage flash ADC employs open-loop offset sampling techniques to achieve 6b resolution. A 3-D fully symmetrical layout reduces the capacitor and device mismatch of the first-stage MDAC. The low-noise references are integrated on chip with optional off-chip voltage references. The prototype 10b ADC implemented in a 0.18 µm CMOS process shows a maximum measured DNL and INL of 0.59 LSB and 0.77 LSB, respectively. The ADC demonstrates an SNDR of 53.7 dB, an SFDR of 61.5 dB, and a power dissipation of 56 mW at 100 MS/s.
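As a reading aid, the reported SNDR can be converted to an effective number of bits (ENOB) with the standard ideal-quantizer relation ENOB = (SNDR - 1.76 dB) / 6.02 dB. The abstract does not quote an ENOB itself, so the figure below is derived here, not reported by the authors; the function name `enob` is a hypothetical helper.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02


if __name__ == "__main__":
    # SNDR of 53.7 dB is the measured value quoted in the abstract.
    print(round(enob(53.7), 2))  # → 8.63
```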
Hui QIN Tsutomu SASAO Yukihiro IGUCHI
This paper presents a pipelined partial rolling (PPR) architecture for AES encryption. With the proposed architecture on the Altera Stratix FPGA, two PPR implementations achieve throughputs of 6.45 Gbps and 12.78 Gbps, respectively. Compared with the unrolling implementation, which achieves a throughput of 22.75 Gbps on the same FPGA, the two PPR implementations improve the memory efficiency (i.e., throughput divided by the size of memory for the core) by 13.4% and 12.3%, respectively, and reduce the amount of memory by 75% and 50%, respectively. Also, the PPR implementation has up to 9.83% higher memory efficiency than the fastest previous FPGA implementation known to date. In terms of resource efficiency (i.e., throughput divided by the equivalent number of logic elements or slices), one PPR implementation offers almost the same value as the rolling implementation, and the other offers a value between those of the rolling implementation and the unrolling implementation, which has the highest resource efficiency. However, the two PPR implementations can be implemented on the minimum-sized Stratix FPGA while the unrolling implementation cannot. The PPR architecture fills the gap between unrolling and rolling architectures and is suitable for small and medium-sized FPGAs.
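The memory-efficiency figures can be cross-checked from the other numbers in the abstract. The sketch below assumes the pairing implied by "respectively" (6.45 Gbps with 75% less memory, 12.78 Gbps with 50% less memory) and treats the unrolled design as the baseline; `memory_efficiency_gain` is a hypothetical helper name. Because the quoted throughputs are rounded, the results agree with the quoted 13.4% and 12.3% only to within about 0.1 percentage point.

```python
UNROLLED_GBPS = 22.75  # throughput of the unrolling implementation (from the abstract)


def memory_efficiency_gain(throughput_gbps, memory_fraction):
    """Relative gain in throughput-per-memory versus the unrolled baseline.

    memory_fraction is the memory used relative to the unrolled design,
    e.g. 0.25 for a 75% reduction.
    """
    ratio = (throughput_gbps / UNROLLED_GBPS) / memory_fraction
    return ratio - 1.0


if __name__ == "__main__":
    for gbps, frac in ((6.45, 0.25), (12.78, 0.50)):
        print(f"{gbps} Gbps, {frac:.0%} memory: "
              f"{memory_efficiency_gain(gbps, frac) * 100:.2f}% gain")
```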
An energy-efficient power-aware design is highly desirable for DSP functions that encounter a wide diversity of operating scenarios in battery-powered wireless sensor network systems. Addressing this issue, this letter presents a low-power, power-aware, scalable pipelined Booth multiplier for DSP applications that makes use of a dynamic-range detection unit, shared common functional units, an ensemble of optimized Wallace trees, and a 4-bit array-based adder tree.
This paper proposes a new modified radix-2^4 FFT algorithm and an efficient pipeline FFT architecture based on this algorithm for OFDM systems. This pipeline FFT architecture has the same number of multipliers as that of the radix-2^2 algorithm. However, the multiplication complexity can be reduced by more than 30% by replacing one half of the programmable multipliers with the newly proposed CSD constant multipliers. In synthesis simulations with a standard 0.35 µm CMOS SAMSUNG process, the proposed CSD constant complex multiplier achieved more than 60% area efficiency when compared to the conventional programmable complex multiplier. This improved efficiency could be applied to the design of long-length FFT processors in wireless OFDM applications, which demand greater power and area efficiency.
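To make the CSD idea concrete: a canonical signed digit (CSD) representation writes a constant with digits from {-1, 0, +1} such that no two adjacent digits are nonzero, which generally lowers the number of shift-and-add terms a constant multiplier needs. The sketch below is the textbook integer-to-CSD conversion, not the multiplier design from the paper; the 10-bit scaling of cos(π/8) is a hypothetical twiddle-factor constant chosen only for illustration.

```python
import math


def to_csd(n):
    """Return the CSD digits of a non-negative integer, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are nonzero.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # n mod 4 == 1 gives +1, n mod 4 == 3 gives -1
            n -= d            # clears the low bit (may carry upward)
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits):
    """Evaluate CSD digits (least significant first) back to an integer."""
    return sum(d << i for i, d in enumerate(digits))


if __name__ == "__main__":
    # Hypothetical constant: cos(pi/8) quantized to 10 fractional bits.
    c = round(math.cos(math.pi / 8) * 2 ** 10)
    csd = to_csd(c)
    print("binary nonzero bits:", bin(c).count("1"))
    print("CSD nonzero digits: ", sum(1 for d in csd if d))
```

Each nonzero digit corresponds to one shifted add or subtract in a hard-wired multiplier, so fewer nonzero digits suggests less adder hardware for that constant.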
Sang-Hyun PARK Sungwook YU Jung-Wan CHO
This paper proposes an effective branch folding technique which combines branch instructions with predicted instructions. This technique can be implemented using an instruction queue, which buffers prefetched instructions. Most of the instructions in the instruction queue are forwarded to the execution unit in sequence. Branch instructions, however, are combined with predicted instructions in the instruction queue and these folded instructions are forwarded to the execution unit. A misprediction can be recovered from by flushing the folded instructions, without processor state recovery, and restarting from the other path. Simulation and implementation results show that both performance and power consumption are significantly improved with little additional hardware cost.
Makoto ISHIKAWA Tatsuya KAMEI Yuki KONDO Masanao YAMAOKA Yasuhisa SHIMAZAKI Motokazu OZAWA Saneaki TAMAKI Mikio FURUYAMA Tadashi HOSHI Fumio ARAKAWA Osamu NISHII Kenji HIROSE Shinichi YOSHIOKA Toshihiro HATTORI
We have developed an application processor optimized for 3G cellular phones. It provides high energy efficiency by using various low-power techniques. For low active power consumption, we use a hierarchical clock gating technique with a static clock gating controlled by software and a two-level dynamic clock gating controlled by hardware. This technique reduces clock power consumption by 35%. We also apply a pointer-based pipeline in the CPU core, which reduces the pipeline latch power by 25%. This processor contains 256 kB of on-chip user RAM (URAM) to reduce the external memory access power. The URAM read buffer (URB) enables high-throughput, low-latency access to the URAM while keeping the CPU clock frequency high because the URAM read data is transferred to the URB in 256-bit widths at half the frequency of the CPU. The average miss penalty is 3.5 cycles at the CPU clock frequency, the hit rate is 89%, and the energy used for URAM reads is 8% less than what it would be for URAM without a URB. These techniques reduce the power consumption of the CPU core and achieve 4500 MIPS/W at a 1.0 V power supply (Dhrystone 2.1). To meet the low-leakage requirements, we use internal power switches and provide resume-standby (R-standby) and ultra-standby (U-standby) modes. Signals across a power boundary are transmitted through µI/O circuits to prevent invalid signal transmission. In the R-standby mode, the power supply to almost all the CPU core area except for the URAM is cut off, and the URAM is set to a retention mode. In the U-standby mode, the power supply to the URAM is also turned off to further reduce leakage current. The leakage currents in the R-standby and U-standby modes are only 98 and 12 µA, respectively. For quick recovery from the R-standby mode, the boot address register (BAR) and control register contents needed immediately after wake-up are saved by hardware into backup latches. The other contents are saved by software into URAM. It takes 2.8 ms to fully recover from R-standby.
Toshinori SATO Akihiro CHIYONOBU
Power consumption is a major concern in embedded microprocessor design. Reducing power has also been a critical design goal for general-purpose microprocessors. Since they require high performance as well as low power, power reduction at the cost of performance cannot be accepted. There are many device-level techniques that reduce power while maintaining performance. They select non-critical paths as candidates for low-power design, and performance-oriented design is used only in speed-critical paths. The same philosophy can be applied to architectural-level design. We evaluate a technique that exploits dynamic information regarding instruction criticality in order to reduce power. Specifically, we evaluate an instruction steering policy for a clustered microarchitecture, which is based on instruction criticality, and find that it is substantially energy-efficient, although it suffers some performance degradation.
Shoji KAWAHITO Kazutaka HONDA Masanori FURUTA Nobuhiro KAWAI Daisuke MIYAZAKI
In this paper, low-power design techniques for high-speed A/D converters are reviewed and discussed. Pipeline and parallel-pipeline architectures are treated because they are the dominant architectures when a high sampling rate and high resolution are required with reasonable power dissipation. A systematic approach to the power optimization of pipeline and parallel-pipeline ADCs is introduced, based on models of the noise and response time of a building block in the multiple-stage pipeline ADC. Finally, the theoretical minimum of the required power as a function of the sampling rate, resolution and SNR is discussed. The analysis shows that, with the development of new circuits and systems that approach this minimum, the power can be further reduced to less than 1/10 without changing the basic architectures.
Young-Jae CHO Hyuen-Hee BAE Seung-Hoon LEE
This work proposes an 8b 220 MS/s 230 mW 3-stage pipeline CMOS ADC with on-chip filters for temperature- and power-supply-insensitive voltage references. The proposed RC low-pass filters reduce reference settling time at heavy R&C loads and improve switching noise performance without conventional off-chip bypass capacitors. The prototype ADC fabricated in a 0.25 µm CMOS process occupies an active die area of 2.25 mm² and shows a maximum measured DNL and INL of 0.43 LSB and 0.82 LSB, respectively. The ADC maintains an SNDR of 43 dB and 41 dB up to a 110 MHz input at 200 MS/s and 220 MS/s, respectively, while the SNDR at a 500 MHz input is degraded by only 3 dB relative to the SNDR at the 110 MHz input.
Jeong-Gun LEE Suk-Jin KIM Jeong-A LEE Kiseon KIM
This paper presents a new asynchronous FIFO design to reduce forward latency in a linear structure. The operation mode of each cell can be reconfigured dynamically as either of two schemes, wave pipelining or handshaking, according to the data flow in the FIFO. Applying wave pipelining to the conventional self-timed FIFO can reduce the overhead of handshaking as well as of latching control in each stage. Initial pre-layout simulations indicate about a twofold improvement in latency over a state-of-the-art asynchronous FIFO, while retaining its throughput.
Sang Gu KANG Doo Hyung WOO Hee Chul LEE
Transferring image information in analog form between the focal plane array (FPA) and the external electronics exposes the signal to external noise. Integrating an on-chip analog-to-digital (A/D) converter into the readout integrated circuit (ROIC) can eliminate the possibility of noise cross-talk. Also, the information can be transported more power-efficiently in the digital domain than in the analog domain. In designing an on-chip A/D converter for a cooled-type high-density infrared detector array, the most stringent requirements are power dissipation, number of bits, die area and throughput. In this study, a pipelined A/D converter was adopted because it offers high operating speed with medium power consumption. A capacitor averaging technique and digital error correction were used to achieve high resolution by eliminating the errors caused by device mismatch. The readout circuit was fabricated using a 0.6 µm CMOS process for a 128 × 128 mid-wavelength infrared (MWIR) HgCdTe detector array. The fabricated circuit used a direct injection input stage, so the S/N ratio could be maximized by increasing the integration capacitor. The measured performance of the 14 b A/D converter exhibited 0.2 LSB differential non-linearity (DNL) and 4 LSB integral non-linearity (INL). The A/D converter operated at 1 MHz with 75 mW power dissipation at 5 V and occupied a die area of 5.6 mm². This performance is suitable for cooled-type high-density infrared detector arrays.
Fumio ARAKAWA Motokazu OZAWA Osamu NISHII Toshihiro HATTORI Takeshi YOSHINAGA Tomoichi HAYASHI Yoshikazu KIYOSHIGE Takashi OKADA Masakazu NISHIBORI Tomoyuki KODAMA Tatsuya KAMEI Makoto ISHIKAWA
A SuperH™ embedded processor core implemented in a 130-nm CMOS process running at 400 MHz achieved 720 MIPS and 2.8 GFLOPS at a power of 250 mW in worst-case conditions. It has a dual-issue seven-stage pipeline architecture but maintains the 1.8 MIPS/MHz of the previous five-stage processor. The processor meets the requirements of a wide range of applications, and is suitable for digital appliances aimed at the consumer market, such as cellular phones, digital still/video cameras, and car navigation systems.
Mohammad TAHERZADEH-SANI Reza LOTFI Omid SHOAEI
Dynamic non-linearities are of particular importance in highly-linear high-speed applications such as software radios. In this paper, a fully-analytical approach to estimate the statistics of dynamic non-linearity parameters of pipeline analog-to-digital converters (ADCs) in the presence of circuit non-idealities is presented. These imperfections include the capacitor mismatches and the non-idealities in the operational amplifiers (op-amps). The two most important ADC dynamic non-linearity parameters, the spurious-free dynamic range (SFDR) and the signal-to-noise-and-distortion ratio (SNDR), are quantified here and closed-form formulas are presented. These formulas are useful for design automation as well as hand calculations of highly-linear pipeline ADCs. Behavioral simulations are presented to show the accuracy of the proposed equations.
Je-Hoon LEE YoungHwan KIM Kyoung-Rok CHO
In this paper, we design and implement a fast asynchronous embedded CISC microprocessor, A8051, introducing a well-tuned pipeline architecture and enhanced control schemes. This work shows an asynchronous design methodology for a CISC-type processor, handling the complicated control structure and various instructions. We tuned the proposed architecture to a 5-stage pipeline, reducing the number of idle stages. To do so, we regrouped the instructions based on the number of machine cycles identified. A8051 has three enhanced control features to improve the system performance: multi-looping control of the pipeline stage, a variable-length instruction register that fetches a multiple-word instruction at one time, and accelerated branch prediction. The proposed A8051 was synthesized to a gate-level design using a 0.35 µm CMOS standard cell library. Simulation results indicate that A8051 provides about 18 times higher speed than the traditional Intel 8051 and about 5 times higher speed than the previously designed asynchronous 8051. In terms of power consumption, the A8051 core shows 15 times higher MIPS/Watt than the synchronous H8051.
Jin-Hyeok CHOI Yong-Ju KIM Jae-Kyung WEE Seongsoo LEE
Block-wise shutdown of idle functional blocks in VLSI systems is a promising approach to reduce power consumption. In particular, multi-threshold voltage CMOS (MTCMOS) is widely accepted as a way to save leakage power during idle time. As the operating frequency increases, a short wake-up time is required so that the shut-down block is available in time. However, a short wake-up time for a large block causes a large current surge during the wake-up process. This often leads to system malfunction due to severe power line noise, which is one of the serious problems for practical implementation of MTCMOS block-wise shutdown. This letter proposes an effective wake-up scheme for block-wise shutdown of low-power VLSI systems. It exploits a pipelined wake-up strategy that reduces the current surge during the wake-up process. The proposed scheme was analyzed and simulated from the viewpoint of the power distribution network. To verify its validity, it was applied to a multiplier block in a Compact Flash controller chip on a test board. According to the simulation results of equivalent R, L, and C modeling, the proposed scheme achieved significant improvement over conventional concurrent shutdown schemes.
Naihua YUAN Anh DINH Ha H. NGUYEN
A time-domain equalization (TEQ) algorithm is presented to shorten the effective channel impulse response and thereby increase the transmission efficiency of the 54 Mbps IEEE 802.11a orthogonal frequency division multiplexing (OFDM) system. In solving the linear equation Aw = B for the optimum TEQ coefficients, A is shown to be Hermitian and positive definite. The LDL^T and LU decompositions are used to factorize A to reduce the computational complexity. Simulation results show high performance gains at a data rate of 54 Mbps with moderate orders of the TEQ finite impulse response (FIR) filter. The design and implementation of the algorithm in a field programmable gate array (FPGA) are also presented. The regularities among the elements of A are exploited to reduce hardware complexity. The LDL^T and LU decompositions are combined in the hardware design to find the TEQ coefficients in less than 4 µs. To compensate for the effective channel impulse response, a radix-4 pipeline fast Fourier transform (FFT) is implemented to perform zero-forcing equalization. The hardware implementation information is provided and simulation results are compared to mathematical values to verify the functionality of the chips running at 54 Mbps.
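The following sketch shows how a Hermitian positive-definite system Aw = B can be solved with an LDL-type factorization, which is the property the abstract relies on. It is a generic textbook routine in plain Python, not the authors' combined LDL^T/LU hardware scheme. For complex Hermitian matrices the factorization takes the form A = L D L^H (conjugate transpose), which is assumed here; the function names and the 2 × 2 test matrix are hypothetical.

```python
def ldl_decompose(A):
    """Factor a Hermitian positive-definite matrix as A = L * diag(D) * L^H.

    L is unit lower triangular; D holds the (real, positive) pivots.
    """
    n = len(A)
    L = [[0j] * n for _ in range(n)]
    D = [0.0] * n
    for j in range(n):
        s = sum(abs(L[j][k]) ** 2 * D[k] for k in range(j))
        D[j] = (A[j][j] - s).real          # pivot is real for Hermitian A
        L[j][j] = 1 + 0j
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k].conjugate() * D[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / D[j]
    return L, D


def ldl_solve(A, b):
    """Solve A w = b using the factorization above."""
    L, D = ldl_decompose(A)
    n = len(b)
    y = [0j] * n
    for i in range(n):                      # forward substitution: L y = b
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    z = [y[i] / D[i] for i in range(n)]     # diagonal scaling: D z = y
    w = [0j] * n
    for i in reversed(range(n)):            # back substitution: L^H w = z
        w[i] = z[i] - sum(L[k][i].conjugate() * w[k] for k in range(i + 1, n))
    return w


if __name__ == "__main__":
    A = [[4, 1 + 1j], [1 - 1j, 3]]          # hypothetical Hermitian PD matrix
    b = [1, 2j]
    print(ldl_solve(A, b))
```

Compared with a general LU factorization, the LDL form exploits the Hermitian symmetry and needs no square roots, which is one reason it is attractive for fixed hardware.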
Jun ZHANG JeoungChill SHIM Hiroyuki KURINO Mitsumasa KOYANAGI
The IP routing lookup problem is equivalent to finding the longest prefix of a packet's destination address in a routing table. Designing a high-performance IP routing lookup architecture is challenging because of increasing traffic, higher link speeds, frequent updates and increasing routing table size. First, increasing traffic and higher link speeds require that IP routing be executed at wire speed. Second, frequent routing table updates require that insertion and deletion operations be simple and have low delay. Finally, increasing routing table size requires that less memory be used in order to reduce cost. Although many schemes for fast lookup exist, less attention has been paid to the latter two factors. This paper proposes a novel pipelined IP routing lookup architecture using selective binary search on hash tables organized by prefix length. The evaluation results show that it can perform IP lookup operations at a maximum rate of one lookup per cycle. The hash operation ratio for one lookup can be reduced to about 1%, fewer than two hash operations are needed for one table update, and only 512 kbytes of SRAM is needed for a routing table with about 43000 prefixes. The proposed architecture is shown to outperform the existing schemes.
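For readers unfamiliar with the underlying idea, the sketch below illustrates plain binary search over hash tables organized by prefix length, using marker entries that carry a precomputed best match so the search never has to backtrack. It is the basic scheme only: the paper's "selective" variant and its pipelining are not described in the abstract and are not modeled here. Addresses and prefixes are represented as bit strings, and all names (`build`, `lookup`, the sample routes) are hypothetical.

```python
def build(routes):
    """Build per-length hash tables from {bit-string prefix: next hop}."""
    lengths = sorted({len(p) for p in routes})
    tables = {length: {} for length in lengths}

    def best_match(bits):
        # Longest real prefix of `bits`; naive scan, used only at build time.
        for n in range(len(bits), -1, -1):
            if bits[:n] in routes:
                return routes[bits[:n]]
        return None

    for prefix in routes:
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:                     # follow the search path to this prefix
            mid = (lo + hi) // 2
            length = lengths[mid]
            if length < len(prefix):
                key = prefix[:length]       # marker: tells lookups to go longer
                tables[length].setdefault(key, best_match(key))
                lo = mid + 1
            elif length == len(prefix):
                tables[length][prefix] = routes[prefix]
                break
            else:
                hi = mid - 1
    return lengths, tables


def lookup(lengths, tables, addr):
    """Longest-prefix match with O(log(number of lengths)) hash probes."""
    best = None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        length = lengths[mid]
        key = addr[:length]
        if key in tables[length]:
            if tables[length][key] is not None:
                best = tables[length][key]  # remember best match seen so far
            lo = mid + 1                    # hit: try longer prefixes
        else:
            hi = mid - 1                    # miss: try shorter prefixes
    return best


if __name__ == "__main__":
    routes = {"": "default", "1": "A", "10": "B", "1011": "C", "0110": "E"}
    lengths, tables = build(routes)
    print(lookup(lengths, tables, "10110001"))
```

The sketch assumes no prefix is longer than the addresses being looked up. Markers cost extra memory, which is presumably part of what a selective search strategy trades off, but that is an inference and not a claim from the abstract.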
Ronny VELJANOVSKI Aleksandar STOJCEVSKI Jugdutt SINGH Aladin ZAYEGH Michael FAULKNER
A novel reconfigurable architecture has been proposed for a mobile terminal receiver that can drastically reduce power dissipation depending on the adjacent channel interference. The proposed design automatically scales the number of filter coefficients and the word length by monitoring the in-band and out-of-band powers. The performance of the new architecture was evaluated in a simulated UTRA-TDD environment because of the large near-far problem caused by adjacent channel interference from adjacent mobiles and base stations. The UTRA-TDD downlink mode was examined statistically, and the results show that the reconfigurable architecture can save an average of up to 75% of the power dissipation when compared to a fixed filter length of 57 and word length of 16 bits. This power saving applies only to the filter and ADC, not the whole receiver. This will prolong talk and standby time in a mobile terminal. The average numbers of taps and bits were calculated to be 14.98 and 10, respectively, for an outage of 97%.