1-16hit |
This paper presents a low-latency, low-cost architecture for computing square and cube roots in the fixed-point format. The proposed architecture is designed based on a non-iterative root calculation scheme to achieve fast computations. While previous non-iterative root calculators are restricted to a square-root operation due to the limitation of their mathematical property, the root computation is generalized in this paper to apply an approximation method to the non-iterative scheme. On top of that, a recurrent method is proposed to select parameters, which enables us to reduce the table size while keeping the maximum relative error value low. Consequently, the proposed root calculator can support both square and cube roots at the expense of small delay and low area overheads. This extension can be generalized to compute the nth roots, where n is a positive integer.
Square-related functions such as square, inverse square, square-root and inverse square-root operations are widely used in digital signal processing and digital communication algorithms, and their efficient realizations are commonly required to reduce the hardware complexity. In the implementation point of view, approximate realizations are often desired if they do not degrade performance significantly. In this paper, we propose new linear approximations for the square-related functions. The traditional linear approximations need multipliers to calculate slope offsets and tables to store initial offset values and slope values, whereas the proposed approximations exploit the inherent properties of square-related functions to linearly interpolate with only simple operations, such as shift, concatenation and addition, which are usually supported in modern VLSI systems. Regardless of the bit-width of the number system, more importantly, the maximum relative errors of the proposed approximations are bounded to 6.25% and 3.13% for square and square-root functions, respectively. For inverse square and inverse square-root functions, the maximum relative errors are bounded to 12.5% and 6.25% if the input operands are represented in 20 bits, respectively.
Amir Sabbagh MOLAHOSSEINI Chitra DADKHAH Keivan NAVI Mohammad ESHGHI
In this paper, the new residue number system (RNS) moduli sets {22n, 2n -1, 2n+1 -1} and {22n, 2n -1, 2n-1 -1} are introduced. These moduli sets have 4n-bit dynamic range and well-formed moduli which can result in high-performance residue to binary converters as well as efficient RNS arithmetic unit. Next, efficient residue to binary converters for the proposed moduli sets based on mixed-radix conversion (MRC) algorithm are presented. The converters are ROM-free and they are realized using carry-save adders and modulo adders. Comparison with the other residue to binary converters for 4n-bit dynamic range moduli sets shown that the presented designs based on new moduli sets {22n, 2n -1, 2n+1 -1} and {22n, 2n -1, 2n-1 -1} are improved the conversion delay and result in hardware savings. Also, the proposed moduli sets can lead to efficient binary to residue converters, and they can speed-up internal RNS arithmetic processing, compared with the other 4n-bit dynamic range moduli sets.
Naofumi TAKAGI Shunsuke KADOWAKI Kazuyoshi TAKAGI
A hardware algorithm for integer division is proposed. It is based on the radix-2 non-restoring division algorithm. Fast computation is achieved by the use of the radix-2 signed-digit (SD2) representation. The algorithm does not require normalization of the divisor, and hence, does not require an area-consuming leading-one (or zero) detection nor shifts of variable-amount. Combinational (unfolded) implementation of the algorithm yields a regularly structured array divider, and sequential implementation yields compact dividers.
A hardware algorithm for computing the reciprocal of the Euclidean norm of a 3-dimensional (3-D) vector which appears frequently in 3-D computer graphics is proposed. It is based on a digit-recurrence algorithm for computing the Euclidean norm and an on-line division (on-line reciprocal computation) algorithm. These algorithms are modified, so that the reciprocal of the Euclidean norm is computed by performing on-line division where the divisor is the partial result of Euclidean norm computation. Division, square-rooting, and reciprocal square-root computation, which are important operations in 3-D graphics, can also be performed using a circuit based on the proposed algorithm.
Jun SAKIYAMA Naofumi HOMMA Takafumi AOKI Tatsuo HIGUCHI
This paper presents a unified representation of fast addition algorithms based on Counter Tree Diagrams (CTDs). By using CTDs, we can describe and analyze various adder architectures in a systematic way without using specific knowledge about underlying arithmetic algorithms. Examples of adder architectures that can be handled by CTDs include Redundant-Binary (RB) adders, Signed-Digit (SD) adders, Positive-Digit (PD) adders, carry-save adders, parallel counters (e.g., 3-2 counters and 4-2 counters) and networks of such basic adders/counters. This paper also discusses the CTD-based analysis of carry-propagation-free adders using various number representations.
Naofumi TAKAGI Daisuke MATSUOKA Kazuyoshi TAKAGI
A digit-recurrence algorithm for computing reciprocal square-root which appears frequently in multimedia and graphics applications is proposed. The reciprocal square-root is computed by iteration of carry-propagation-free additions, shifts, and multiplications by one digit. Different specific versions of the algorithm are possible, depending on the radix, the redundancy factor of the digit set, and etc. Details of a radix-2 version and a radix-4 version and designs of a floating-point reciprocal square-root circuit based on them are shown.
Woo-Chan PARK Cheol-Ho JEONG Tack-Don HAN
The format conversion operations between a floating-point number and an integer number and a round operation are the important standard floating-point operations. In most cases, these operations are implemented by adding additional hardware to the floating-point adder. The SR (simultaneous rounding) method, one of the techniques used to improve the performance of the floating-point adder, can perform addition and rounding operations at the same stage and is an efficient method with respect to the silicon area and its performance. In this paper, a hardware model to execute CRops (conversion and rounding operations) for the SR floating-point adder is presented and CRops are analyzed on the proposed hardware model. Implementation details are also discussed. The proposed scheme can maintain the advantages of the SR method and can perform each CRop with three pipeline stages.
Takafumi AOKI Kimihiko NAKAZAWA Tatsuo HIGUCHI
In this paper, we propose a unified high-radix division algorithm for high-speed signal and data processing applications, and present the design and evaluation of high-radix parallel dividers based on the proposed algorithm. By prescaling the input operands and converting some significant digits of a partial remainder into non-redundant representation, the quotient digit can be obtained directly from the partial remainder without using quotient digit selection tables. Performance evaluation shows that the proposed radix-4 and radix-8 divider architectures achieve faster computation with the same level of hardware complexity than the binary counterparts. We also show an experimental fabrication of a radix-4 divider chip in 0.35 µm CMOS technology.
A digit-recurrence algorithm for cube rooting is proposed. In cube rooting, the digit-recurrence equation of the residual includes the square of the partial result of the cube root. In the proposed algorithm, the square of the partial result is kept, and the square, as well as the residual, is updated by addition/subtraction, shift, and multiplication by one or two digits. Different specific versions of the algorithm are possible, depending on the radix, the digit set of the cube root, and etc. Any version of the algorithm can be implemented as a sequential (folded) circuit or a combinational (unfolded) circuit, which is suitable for VLSI realization.
Naofumi HOMMA Takafumi AOKI Tatsuo HIGUCHI
This paper presents an efficient graph-based evolutionary optimization technique called Evolutionary Graph Generation (EGG), and its application to the design of fast constant-coefficient multipliers using parallel counter-tree architecture. An important feature of EGG is its capability to handle the general graph structures directly in evolution process instead of encoding the graph structures into indirect representations, such as bit strings and trees. This paper also addresses the major problem of EGG regarding the significant computation time required for verifying the function of generated circuits. To solve this problem, a new functional verification technique for arithmetic circuits is proposed. It is demonstrated that the EGG system can create efficient multiplier structures which are comparable or superior to the known conventional designs.
Takafumi AOKI Ichiro KITAORI Tatsuo HIGUCHI
This paper presents a constant-scale-factor radix-2-4-8 CORDIC algorithm for fast vector rotation and sine/cosine computation. The CORDIC algorithm is a well-known hardware algorithm for computing various elementary functions. Due to its sequential nature of computation, however, significant reduction in processing latency is required for real-time signal processing applications. The proposed radix-2-4-8 CORDIC algorithm dynamically changes the radix of computation during operation, and makes possible the reduction in the number of iterations by 37% for 64-bit precision. This paper also describes the hardware implementation of radix-2-4-8 CORDIC unit that can be installed into practical digital signal processors.
Takafumi AOKI Yoshiki SAWADA Tatsuo HIGUCHI
This paper presents a new number representation called the Signed-Weight (SW) number system, which is useful for designing configurable counter-tree architectures for digital signal processing applications. The SW number system allows the unified manipulation of positive and negative numbers in arithmetic circuits by adjusting the signs assigned to individual digit positions. This makes possible the construction of highly regular arithmetic circuits without introducing irregular arithmetic operations, such as negation and sign extension in the two's complement representation. This paper also presents the design of a Field-Programmable Digital Filter (FPDF) architecture--a special-purpose FPGA architecture for high-speed FIR filtering--using the proposed SW arithmetic system.
Takafumi AOKI Naofumi HOMMA Tatsuo HIGUCHI
This paper presents a new approach to designing arithmetic circuits by using a graph-based evolutionary optimization technique called Evolutionary Graph Generation (EGG). The key idea of the proposed method is to introduce a higher level of abstraction for arithmetic algorithms, in which arithmetic circuit structures are modeled as data-flow graphs associated with specific number representation systems. The EGG system employs evolutionary operations to transform the structure of graphs directly, which makes it possible to generate the desired circuit structure efficiently. The potential capability of EGG is demonstrated through an experiment of generating constant-coefficient multipliers.
Takashi AMISAKI Umpei NAGASHIMA Kazutoshi TANABE
Three multiplicative algorithms for the floating-point divide operation are compared: the Newton-Raphson method, Goldschmidt's algorithm, and a naive method that simply calculates a form of the Taylor series expansion of a reciprocal. The series also provides a theoretical basis for Goldschmidt's algorithm. It is well known that, of the Newton-Raphson method and Goldschmidt's algorithm, the former is the more accurate while the latter is the faster on a pipelined unit. However, little is reported about the naive method. In this report, we analyze the speed and accuracy of each method and present the results of numerical tests, which we conducted to confirm the validity of the accuracy analysis. Basically, the comparison are made in the context of software implementation (e. g. , a macro library) and compliance with the IEEE Standard 754 rounding is not considered. It is shown that the naive method is useful in a realistic setting where the number of iterations is small and the method is implemented on a pipelined floating-point unit with a multiply-accumulate configuration. In such a situation, the naive method gives a more accurate result with a slightly lower latency, as compared with Goldschmidt's algorithm, and is much faster than but slightly inferior in accuracy to the Newton-Raphson method.
A hardware algorithm for modular inversion is proposed. It is based on the extended Euclidean algorithm. All intermediate results are represented in a redundant binary representation with a digit set {0, 1,1}. All addition/subtractions are performed without carry propagation. A modular inversion is carried out in O (n) clock cycles where n is the word length of the modulus. The length of each clock cycle is constant independent of n. A modular inverter based on the algorithm has a regular cellular array structure with a bit slice feature and is very suitable for VLSI implementation. Its amount of hardware is proportional to n.