Ronny VELJANOVSKI Aleksandar STOJCEVSKI Jugdutt SINGH Aladin ZAYEGH Michael FAULKNER
A novel reconfigurable architecture has been proposed for a mobile terminal receiver that can drastically reduce power dissipation dependant on adjacent channel interference. The proposed design can automatically scale the number of filter coefficients and word length respectively by monitoring the in-band and out-of-band powers. The new architecture performance was evaluated in a simulation UTRA-TDD environment because of the large near far problem caused by adjacent channel interference from adjacent mobiles and base stations. The UTRA-TDD downlink mode was examined statistically and results show that the reconfigurable architectures can save an average of up to 75% power dissipation respectively when compared to a fixed filter length of 57 and word length of 16 bits. This power saving only applies to the filter and ADC, not the whole receiver. This will prolong talk and standby time in a mobile terminal. The average number of taps and bits were calculated to be 14.98 and 10 respectively, for an outage of 97%.
In this letter, we present a practical method of implementing the parallel interference cancellation (PIC) algorithm in an asynchronous CDMA system. A novel pipelined structure is employed in this method in order to reduce the processing delay and the memory space comparing to the conventional PIC processing scheme.
Masashi MIZUNO James OKELLO Hiroshi OCHI
In this paper, we propose a pipelined architecture for an equalizer based on the Multilevel Modified Constant Modulus Algorithm (MMCMA). We also provide the correction factor that mathematically converts the proposed pipelined adaptive equalizer into an equivalent non-pipelined conventional MMCMA based equalizer. The proposed method of pipelining uses modules with 6 filter coefficients, resulting in an overall latency of a single sampling period, along the main transmission line. The basic concept of the proposed architecture is to implement the Finite Impulse Response (FIR) filter and the algorithm portion of the adaptive equalizer, such that the critical path of the whole circuit has a maximum of three complex multipliers and three adders.
Tung-Chou CHEN Che-Ho WEI Shyue-Win WEI
Based on a modified step-by-step decoding procedure, a high-speed pipelined Reed-Solomon decoder is presented. The decoder requires only the delay time of three 2-input XOR gates for decoding each coded symbol. The decoder can be operated in a bit rate of Gbits/sec order and thus suitable for the very high speed data transmission systems.
Hiroyasu OBATA Kenji ISHIDA Junichi FUNASAKA Kitsutaro AMANO
Asymmetric networks, which provide asymmetric bandwidth or delay for upstream and downstream transfer, have recently gained much attention since they support popular applications such as the World Wide Web (WWW). HTTP (Hypertext Transfer Protocol) is the basis of most WWW services so, evaluating the performance of HTTP on asymmetric networks is increasingly important, particularly real-world networks. However, the performance of HTTP on the asymmetric networks composed of satellite and terrestrial links has not sufficiently evaluated. This paper proposes new formulas to evaluate the performance of both HTTP1.0 and HTTP1.1 on asymmetric networks. Using these formulas, we calculate the time taken to transfer web data by HTTP1.0/1.1. The calculation results are compared to the results of an existing theoretical formula and experimental results gained from a system that combines a VSAT (Very Small Aperture Terminal) satellite communication system for satellite links (downstream) and the Internet for terrestrial links (upstream). The comparison shows that the proposed formulas yield more accurate results (compared to the measured values) than the existing formula. Furthermore, this paper proposes an evaluation formula for pipelined HTTP1.1, and shows that the values output by the proposed formula agree with those obtained by experiments (on the VSAT system) and simulations.
Ing-Jer HUANG Li-Rong WANG Yu-Min WANG Tai-An LU
This paper presents a case study of synthesis of the industrial embedded microcontroller HT48100 and analysis of performance, cost and software compatibility for its implementation alternatives, using the hardware/software co-design system for microcontrollers/microprocessors PIPER-II. The synthesis tool accepts as input the instruction set architecture (behavioral) specification, and produces as outputs the pipelined RTL designs with their simulators, and the reordering constraints which guide the compiler backend to optimize the code for the synthesized designs. A compiler backend is provided to optimize the application software according to the reordering constraints. The study shows that the co-design approach was able to help the original design team to analyze the architectural properties, identify inefficient architecture features, and explore possible architectural improvements and their impacts in both hardware and software. Feasible future upgrades for the microcontroller family have been identified by the study.
We present pipelined simple matching, called PSM, for an input buffered switch to relax the scheduling timing constraint by modifying pipelined maximal-sized matching (PMM). Like the pipelined manner of PMM, to produce the matching results in every time slot, PSM employs multiple subschedulers which take more than one time slot to complete matching. Using only head-of-line information of input buffers, PSM successively sends each request to all subschedulers to provide a better matching opportunity. To obtain better performance, PSM uses unique starting points of scheduling pointers in which the difference between the starting points is equal for any two adjacent subschedulers for a same output. Using computer simulations under a uniform traffic, we show PSM is more appropriate than PMM for pipelined scheduling of an input buffered switch.
Masanori FURUTA Shoji KAWAHITO Daisuke MIYAZAKI
A digital calibration technique, which corrects errors due to capacitor mismatch in pipelined ADC and directly measures the error coefficients using the ADC INL plot, is described. The proposed technique can be applied for various types of pipelined ADC architectures. Test results using an implemented 10-bit pipelined ADC show that the ADC achieves a peak signal-to-noise-and-distortion ratio of 56.5 dB, a peak integral non-linearity of 0.3 LSB, and a peak differential non-linearity of 0.3 LSB using the digital calibration.
Tatsuji MATSUURA Junya KUDOH Eiki IMAIZUMI
A low-power-consumption 6-bit pipelined analog-to-digital converter for use in a BluetoothTM RF transceiver has been developed. The RF transceiver chip was fabricated using a 0.35-µm BiCMOS process, and the A/D converter is based on CMOS technology for digital logic. To reduce the power consumption of the converter, we used a look-ahead pipeline architecture to reduce the required settling time of an amplifier in the critical path of the converter. We show that through this reduction, amplifier power consumption of 600 µA can be reduced to 250 µA to achieve a 13-MHz conversion rate. We have also developed a low-power two-capacitor switched-capacitor common-mode feedback circuit which enables an offset cancellation of an amplifier during the reset phase. Offset cancellation is used in each stage of the S/H amplifier to reduce the overall offset of the converter. It achieves an effective number of bits of 5.7 at a conversion rate of 13 Msps and 5.0 at 26 Msps. The residual offset of the converter is only 4 mV. It has a low total current consumption of 3.2 mA at 13 Msps and a supply voltage of 2.8 V.
Eiji OKI Roberto ROJAS-CESSA H. Jonathan CHAO
This paper proposes an innovative Pipeline-based Maximal-sized Matching scheduling approach, called PMM, for input-buffered switches. It dramatically relaxes the limitation of a single time slot for completing a maximal matching into any number of time slots. In the PMM approach, arbitration is operated in a pipelined manner, where K subschedulers are used. Each subscheduler is allowed to take more than one time slot for its matching. Every time slot, one of the subschedulers provides the matching result. We adopt an extended version of Dual Round-Robin Matching (DRRM), called iterative DRRM (iDRRM), as a maximal matching algorithm in a subscheduler. PMM maximizes the efficiency of the adopted arbitration scheme by allowing sufficient time for the number of iterations. We show that PMM preserves 100% throughput under uniform traffic and fairness for best-effort traffic of the non-pipelined adopted algorithm, while ensuring that cells from the same virtual output queue (VOQ) are transmitted in sequence. In addition, we confirm that the delay performance of PMM is not significantly degraded by increasing the pipeline degree, or the number of subschedulers, when the number of outstanding requests for each subscheduler from a VOQ is limited to 1.
Toshiyuki YOROZUYA Koji OHASHI Mineo KANEKO
In this paper, we study loop pipeline scheduling problem under given resource assignment (operation to functional unit assignments and data to register assignments), which is one of the key tasks in data-path synthesis based on the assignment solution space exploration. We show an approach using a precedence constraint graph with parametric disjunctive arcs generated from the specified assignment information, and derive a scheduling method using branch-and-bound exploration of the parameter space. As an application of the proposed scheduling method, it is incorporated with Simulated-Annealing (SA) based exploration of assignment solution space, and it is demonstrated that data-paths of the fifth-order elliptic wave filter are successfully synthesized.
Akira AKAHORI Akito SEKIYA Takahiro YAMADA Akira FUJIMAKI Hisao HAYAKAWA
We have designed the Half Adder (HA) circuit and the Carry Save Serial Adder (CSSA) circuit based on pipeline architecture. Our HA has the structure of a two-stage pipeline and consists of 160 Josephson Junctions (JJs). Our CSSA has the structure of a four-stage pipeline with a feedback loop and consists of 360 JJs. These circuits were fabricated by the NEC standard process. There are two issues which should be considered in the design. One is parameter spreads generated by the fabrication process and the other is leakage currents between the gates. We have introduced a parameter optimization method to deal with the parameter spreads. We have also inserted three stages of JTLs to reduce leakage currents. We have experimentally confirmed the correct operations of these circuits. The obtained bias margins were 33.1% for the HA and 24.6% for the CSSA.
Yoshio KAMEDA Shinichi YOROZU Shuichi TAHARA
We describe the logic design of a single-flux-quantum (SFQ) 22 unit switch. It is the main component of the SFQ Banyan packet switch we are developing that enables a switching capacity of over 1 Tbit/s. In this paper, we focus on the design of the controller in the unit switch. The controller does not have a simple "off-the-shelf" conventional circuit, like those used in shift registers or adders. To design such a complicated random logic circuit, we need to adopt a systematic top-down design approach. Using a graphical technique, we first obtained logic functions. Next, to use the deep pipeline architecture, we broke down the functions into one-level logic operations that can be executed within one clock cycle. Finally, we mapped the functions on to the physical circuits using pre-designed SFQ standard cells. The 22 unit switch consists of 59 logic gates and needs about 600 Josephson junctions without gate interconnections. We tested the gate-level circuit by logic simulation and found that it operates correctly at a throughput of 40 GHz.
Masayuki DAITO Kazumasa SUZUKI Ken-ichi UEHIGASHI Hiroshi MORITA Hitoshi SONODA Nobuhito MORIKAWA Masatoshi MORIYAMA Shoichiro SATO Terumi FUKUDA Saori NAKAMURA
A MIPS-architecture-based embedded out-of-order superscalar microprocessor targeting broadband applications has been developed. Aggressive microarchitectures, such as superpipelining and out-of-order execution, have been applied to realize better performance scalability in order to fit with next-generation broadband applications. The chip includes a 32 K-Byte instruction cache, a 32 K-Byte data cache, 6 independent execution units, and has been designed using an ASIC-style design methodology on a 0.13-µm CMOS 5-layer aluminum technology. It can operate up to 500 MHz and achieves 1005 MIPS (Dhrystone 2.1) at 500-MHz operation.
Maria-Cristina MARINESCU Martin RINARD
This paper describes a novel approach to high-level synthesis of complex pipelined circuits, including pipelined circuits with feedback. This approach combines a high-level, modular specification language with an efficient implementation. In our system, the designer specifies the circuit as a set of independent modules connected by conceptually unbounded queues. Our synthesis algorithm automatically transforms this modular, asynchronous specification into a tightly coupled, fully synchronous implementation in synthesizable Verilog.
Takahiro ASAI Tadashi MATSUMOTO
This paper presents the outline of the systolic array recursive least-squares (RLS) processor prototyped primarily with the aim of broadband mobile communication applications. To execute the RLS algorithm effectively, this processor uses an orthogonal triangularization technique known in matrix algebra as QR decomposition for parallel pipelined processing. The processor board comprises 19 application-specific integrated circuit chips, each with approximately one million gates. Thirty-two bit fixed-point signal processing takes place in the processor, with which one cycle of internal cell signal processing requires approximately 500 nsec, and boundary cell signal processing requires approximately 80 nsec. The processor board can estimate up to 10 parameters. It takes approximately 35 µs to estimate 10 parameters using 41 known symbols. To evaluate signal processing performance of the prototyped systolic array processor board, processing time required to estimate a certain number of parameters using the prototyped board was comapred with using a digital signal processing (DSP) board. The DSP board performed a standard form of the RLS algorithm. Additionally, we conducted minimum mean-squared error adaptive array in-lab experiments using a complex baseband fading/array response simulator. In terms of parameter estimation accuracy, the processor is found to produce virtually the same results as a conventional software engine using floating-point operations.
A combination of a software and a systolic hardware implementation for the Quasi Arithmetic compression algorithm is presented. The hardware is implemented as a pipeline hardware implementation. The implementation doesn't change the the algorithm. It just split it into two parts. The combination of parallel software and pipeline hardware can give very fast compression without decline of the compression efficiency.
Motokazu OZAWA Masashi IMAI Yoichiro UENO Hiroshi NAKAMURA Takashi NANYA
Wire delays, instead of gate delays, are moving into dominance in modern VLSI design. Current synchronous processors have the critical path not in the ALU function but in the cache access. Since the cache performance enhancement is limited by the memory access delay which mainly consists of wire delays, a reduction in gate delays may no longer imply any enhancement in processor performance. To solve this problem, this paper presents a novel architecture, called the Cascade ALU. The Cascade ALU allows super-scalar processors with future technologies to move the critical path into the ALU part. Therefore the Cascade ALU can enjoy the expected progress in future device speed. Since the delay of the Cascade ALU varies depending on the executed instructions, an asynchronous system is shown to be suitable for implementing the Cascade ALU. However an asynchronous system may have a large handshake overhead, this paper also presents an asynchronous Fine Grain Pipeline technique that hides the handshake overhead. Finally, this paper presents results of performance and area evaluation for an asynchronous implementation of the cascade ALU. The results show that the cascade ALU architecture has a good performance scalability on the reduction of the ALU latency and imposes little area penalty compared with current synchronous processors.
Hidehisa NAGANO Akihiro MATSUURA Akira NAGOYA
This paper proposes a method for implementing a metric computation accelerator for fractal image compression using reconfigurable hardware. The most time-consuming part in the encoding of this compression is computation of metrics among image blocks. In our method, each processing element (PE) configured for an image block accelerates these computations by pipeline processing. Furthermore, by configuring the PE for a specific image block, we can reduce the number of adders, which are the main computing elements, by a half even in the worst case.
Kiyoshi NISHIKAWA Hitoshi KIYA
This paper proposes a fast implementation technique for RLS adaptive filters. The technique has an adjustable parameter to trade the throughput and the rate of convergence of the filter according to the applications. The conventional methods for improving the throughput do not have this kind of adjustability so that the proposed technique will expand the area of applications for the RLS algorithm. We show that the improvement of the throughput can be easily achieved by rearranging the formula of the RLS algorithm and that there are no need for faster PEs for the improvement.