Keyword Search Result

[Keyword] dynamically(23hit)

1-20hit(23hit)

  • Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip

    Hung K. NGUYEN  Peng CAO  Xue-Xiang WANG  Jun YANG  Longxing SHI  Min ZHU  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer System

      Vol:
    E96-D No:3
      Page(s):
    601-615

    REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall flow for mapping algorithms onto the REMUS-II platform and illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered to fully exploit the abundant computing resources of REMUS-II, thus increasing total computing throughput and addressing the high computational complexity of the H.264 encoder. In addition, some data-reuse schemes are used to increase the data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduling is also responsible for synchronizing the data communication between tasks and handling conflicts between hardware resources. Experimental results show that the REMUS-MB (REMUS-II version for mobile applications) system can perform real-time H.264/AVC baseline profile encoding. The encoder can encode CIF@30 fps video sequences with two reference frames and a maximum search range of [-16,15]. The implementation can therefore be applied to handheld devices targeted at mobile multimedia applications. The platform of the REMUS-MB system is designed and synthesized using TSMC 65 nm low power technology. The die size of REMUS-MB is 13.97 mm². REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To the best of our knowledge, this is the first implementation of the H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system reported in the literature.

  • Acceleration of Block Matching on a Low-Power Heterogeneous Multi-Core Processor Based on DTU Data-Transfer with Data Re-Allocation

    Yoshitaka HIRAMATSU  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Toru NOJIRI  Kunio UCHIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Integrated Electronics

      Vol:
    E95-C No:12
      Page(s):
    1872-1882

    The large data-transfer time among different cores is a big problem in heterogeneous multi-core processors. This paper presents a method to accelerate the data transfers exploiting data-transfer-units together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared to the earlier works that use CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 16×16 pixel blocks.

  • Fast AdaBoost-Based Face Detection System on a Dynamically Coarse Grain Reconfigurable Architecture

    Jian XIAO  Jinguo ZHANG  Min ZHU  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    392-402

    An AdaBoost-based face detection system is proposed on a Coarse Grain Reconfigurable Architecture (CGRA) named “REMUS-II”. Our work differs from previous ones in three aspects. First, a new hardware-software partition method is proposed and the whole face detection system is divided into several parallel tasks implemented on two Reconfigurable Processing Units (RPU) and one micro Processor Unit (µPU) according to their relationships. These tasks communicate with each other by a mailbox mechanism. Second, a strong classifier is treated as the smallest phase of the detection system, and every phase needs to be executed by these tasks in order. A phase of the Haar classifier is dynamically mapped onto a Reconfigurable Cell Array (RCA) only when needed, which is quite different from traditional Field Programmable Gate Array (FPGA) methods in which all the classifiers are fabricated statically. Third, optimized data and configuration word pre-fetch mechanisms are employed to improve the whole system performance. Implementation results show that our approach under a 200 MHz clock rate can process up to 17 frames per second on VGA size images, and the detection rate is over 95%. Our system consumes 194 mW, and the die size of the fabricated chip is 23 mm² using TSMC 65 nm standard cell based technology. To the best of our knowledge, this work is the first implementation of the cascade Haar classifier algorithm on a dynamically CGRA platform presented in the literature.

  • Iterative Synthesis Methods Estimating Programmable-Wire Congestion in a Dynamically Reconfigurable Processor

    Takao TOI  Takumi OKAMOTO  Toru AWASHIMA  Kazutoshi WAKABAYASHI  Hideharu AMANO  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E94-A No:12
      Page(s):
    2619-2627

    Congestion-aware iterative synthesis methods are proposed for a multi-context dynamically reconfigurable processor (DRP) with a large number of processing elements (PEs) and programmable-wire connections. Although complex data-paths can be synthesized using the programmable-wire, its delay is long, especially when wire connections are congested. We propose two iterative synthesis techniques between a high-level synthesizer (HLS) and the place & route tool to shorten the prolonged wire delay. First, we feed back wire delays for each context to a scheduler in the HLS. The experimental results showed that the critical-path delay was shortened by 21% on average for applications with timing closure problems. Second, we skip the routing and estimate wire delays based on the congestion. The synthesis time was shortened to 1/3, at the cost of a degradation of two points on average in the delay improvement rate.

  • A Switch Block Architecture for Multi-Context FPGAs Based on a Ferroelectric-Capacitor Functional Pass-Gate Using Multiple/Binary Valued Hybrid Signals

    Shota ISHIHARA  Noriaki IDOBATA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Application of Multiple-Valued VLSI

      Vol:
    E93-D No:8
      Page(s):
    2134-2144

    Dynamically Programmable Gate Arrays (DPGAs) provide more area-efficient implementations than conventional Field Programmable Gate Arrays (FPGAs). One typical DPGA architecture is the multi-context architecture. A DPGA based on the multi-context architecture is the Multi-Context FPGA (MC-FPGA), which achieves fast switching between contexts. The problem of the conventional SRAM-based MC-FPGA is its large area and standby power dissipation because of the large number of configuration memory bits. Moreover, since SRAM is volatile, it is difficult to implement power gating for standby power reduction in the SRAM-based multi-context FPGA. This paper presents an area-efficient and nonvolatile multi-context switch block architecture for MC-FPGAs based on a ferroelectric-capacitor functional pass-gate which merges a multiple-valued threshold function and a nonvolatile multiple-valued storage. The test chip for four contexts is fabricated in a 0.35 µm-CMOS/0.60 µm-ferroelectric-capacitor process. The transistor count of the proposed multi-context switch block is reduced to 63% in comparison with that of the SRAM-based one.

  • Resource Minimization Method Satisfying Delay Constraint for Replicating Large Contents

    Sho SHIMIZU  Hiroyuki ISHIKAWA  Yutaka ARAKAWA  Naoaki YAMANAKA  Kosuke SHIBA  

     
    PAPER-Fundamental Theories for Communications

      Vol:
    E92-B No:10
      Page(s):
    3102-3110

    How to minimize the number of mirroring resources under a QoS constraint (resource minimization problem) is an important issue in content delivery networks. This paper proposes a novel approach that takes advantage of the parallelism of dynamically reconfigurable processors (DRPs) to solve the resource minimization problem, which is NP-hard. Our proposal obtains the optimal solution by running an exhaustive search algorithm suitable for DRP. Greedy algorithms, which have been widely studied for tackling the resource minimization problem, cannot always obtain the optimal solution. The proposed method is implemented on an actual DRP and in experiments reduces the execution time by a factor of 40 compared to the conventional exhaustive search algorithm on a Pentium 4 (2.8 GHz).

  • A Preemption Algorithm for a Multitasking Environment on Dynamically Reconfigurable Processors

    Vu Manh TUAN  Hideharu AMANO  

     
    PAPER-Computer Systems

      Vol:
    E91-D No:12
      Page(s):
    2793-2803

    Task preemption is a critical mechanism for building an effective multi-tasking environment on dynamically reconfigurable processors. When a task is preempted, its necessary state information must be correctly preserved in order for the task to be resumed later. Not only do coarse-grained Dynamically Reconfigurable Processing Array (DRPAs) devices have different architectures using a variety of development tools, but the great amount of state data of hardware tasks executing on such devices are usually distributed on many different storage elements. To address these difficulties, this paper aims at studying a general method for capturing the state data of hardware tasks targeting coarse-grained DRPAs. Based on resource usage, algorithms for identifying preemption points and inserting preemption states subject to user-specified preemption latency are proposed. Moreover, a modification to automatically incorporate proposed steps into the system design flow is also discussed. The performance degradation caused by additional preemption states is minimized by allowing preemption only at predefined points where demanded resources are small. The evaluation result using a model based on NEC Electronics' DRP-1 shows that the proposed method can produce preemption points satisfying a given preemption latency with reasonable hardware overhead (from 6% to 15%).

  • A Retargetable Compiler Based on Graph Representation for Dynamically Reconfigurable Processor Arrays

    Vasutan TUNBUNHENG  Hideharu AMANO  

     
    PAPER-VLSI Systems

      Vol:
    E91-D No:11
      Page(s):
    2655-2665

    To develop design environments for various Dynamically Reconfigurable Processor Arrays (DRPAs), the Graph with Configuration Information (GCI) is proposed to represent configurable resources in the target dynamically reconfigurable architecture. The functional unit, constant unit, register, and routing resource can be represented in the graph as well as the configuration information. Hardware restrictions are also added to the graph by limiting the possible configurations at a node controlled by another node. A prototype compiler called Black-Diamond with GCI is now available for three different DRPAs. It translates a data-flow graph from a C-like front-end description, applies placement and routing by using the GCI, and generates configuration data for each element of the DRPA. Evaluation results of simple applications show that Black-Diamond can generate reasonable designs for all three different architectures. Other target architectures can be easily treated by representing many aspects of architectural property in a GCI.

  • A Mapping Method for Multi-Process Execution on Dynamically Reconfigurable Processors

    Vu MANH TUAN  Hideharu AMANO  

     
    PAPER-Computer Systems

      Vol:
    E91-D No:9
      Page(s):
    2312-2322

    The multi-process execution in dynamically reconfigurable processors is a technique to enhance throughput by trying to exploit more inherent parallelism of applications. Basically, a total process for an application is divided into small processes, assigned into limited areas of a reconfigurable array, and concurrently executed in a pipelined manner. In order to improve the efficiency of the multi-process execution, a systematic method for mapping processes onto a reconfigurable array consisting of multiple hardware execution units is essential. This paper proposes and investigates a systematic method for mapping an application modeled as a Kahn Process Network onto a dynamically reconfigurable processing array. In order to execute streaming applications in a pipelined manner, the size of Tiles, which is a unit area of dynamically reconfigurable array, and the grouping of processes are adjusted. Using real applications such as DCT, JPEG encoder and Turbo encoder, the impact of different versions mapped onto the NEC Dynamically Reconfigurable Processor on performance is evaluated. Evaluation results show that our proposed mapping algorithm achieves the best performance in terms of the throughput and the execution time.

  • Multi-Context FPGA Using Fine-Grained Interconnection Blocks and Its CAD Environment

    Hasitha Muthumala WAIDYASOORIYA  Weisheng CHONG  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    517-525

    Dynamically-programmable gate arrays (DPGAs) promise lower-cost implementations than conventional field-programmable gate arrays (FPGAs) since they efficiently reuse limited hardware resources in time. One of the typical DPGA architectures is a multi-context FPGA (MC-FPGA) that requires multiple memory bits per configuration bit to realize fast context switching. However, these additional memory bits cause significant overhead in area and power consumption. This paper presents a novel architecture of a switch element to reduce the required capacity of configuration memory. Our main idea is to exploit redundancy between different contexts by using a fine-grained switch element. The proposed MC-FPGA is designed in a 0.18 µm CMOS technology. Its maximum clock frequency and the context switching frequency are measured to be 310 MHz and 272 MHz, respectively. Moreover, a novel CAD process that exploits the redundancy in configuration data is proposed to support the MC-FPGA architecture.

  • A Self-Test of Dynamically Reconfigurable Processors with Test Frames

    Tomoo INOUE  Takashi FUJII  Hideyuki ICHIHARA  

     
    PAPER-High-Level Testing

      Vol:
    E91-D No:3
      Page(s):
    756-762

    This paper proposes a self-test method for coarse grain dynamically reconfigurable processors (DRPs) without hardware overhead. In the method, processor elements (PEs) compose a test frame, which consists of test pattern generators (TPGs), processor elements under test (PEUTs) and response analyzers (RAs), and test one another by changing test frames appropriately. We design several test frames with different structures, and discuss the relationship of the structures to the numbers of contexts and test frames for testing all the functions of PEs. A case study shows that there exists an optimal test frame which minimizes the test application time under a constraint.

  • A 0.8-V Syllabic-Companding Log Domain Filter with 78-dB Dynamic Range in 0.35-µm CMOS

    Ippei AKITA  Kazuyuki WADA  Yoshiaki TADOKORO  

     
    PAPER-Electronic Circuits

      Vol:
    E91-C No:1
      Page(s):
    87-95

    A scheme for a low-voltage CMOS syllabic-companding log domain filter with wide dynamic range is proposed and its prototype is presented. A nodal voltage which is fixed in a conventional filter based on the dynamically adjustable biasing (DAB) technique is adapted for change of input envelope to achieve wide dynamic range. Externally linear and time invariant (ELTI) relation between an input and an output is guaranteed by a state variable correction (SVC) circuit which is also proposed for low-voltage operation. To demonstrate the proposed scheme, a fifth-order Chebychev low-pass filter with 100-kHz cutoff frequency is designed and fabricated in a standard 0.35-µm CMOS process. The filter has a 78-dB dynamic range and consumes 200-µW power from a 0.8-V power supply.

  • A Novel Technique to Design Energy-Efficient Contexts for Reconfigurable Logic Devices

    Hiroshi SHINOHARA  Hideaki MONJI  Masahiro IIDA  Toshinori SUEYOSHI  

     
    LETTER

      Vol:
    E90-D No:12
      Page(s):
    1986-1989

    High power consumption is a constraining factor for the growth of programmable logic devices. We propose two techniques in order to reduce power consumption. The first is a technique for creating contexts. This technique uses data-dependent circuits and wire sharing between contexts. The second is a technique for switching the contexts. In this paper, we evaluate the capability of the two techniques to reduce power consumption using a multi-context logic device. As a result, compared with the original circuit, our multi-context circuits reduced the power consumption by 9.1% on average and by a maximum of 19.0%. Furthermore, applying our resource sharing technique to these circuits, we achieved a reduction of 10.6% on average and a maximum reduction of 18.8%.

  • Data Multicasting Procedure for Increasing Configuration Speed of Coarse Grain Reconfigurable Devices

    Vasutan TUNBUNHENG  Masayasu SUZUKI  Hideharu AMANO  

     
    PAPER-Computer Systems

      Vol:
    E90-D No:2
      Page(s):
    473-481

    A novel configuration method called Row Multicast Configuration (RoMultiC) is proposed for high speed configuration of coarse grain reconfigurable systems. The same configuration data can be transferred in multicast fashion to configure many Processing Elements (PEs) by using a multicast bit-map provided in the row and column directions of the PE array. Evaluation results using practical applications show that a model reconfigurable system that incorporates this scheme can reduce configuration clock cycles by up to 73.1% compared with a traditional configuration delivery scheme. The amount of external memory required to store the configuration data is also reduced by omitting the duplicated configuration data.

  • A Survey on Dynamically Reconfigurable Processors (Open Access)

    Hideharu AMANO  

     
    INVITED PAPER

      Vol:
    E89-B No:12
      Page(s):
    3179-3187

    Dynamically reconfigurable processors consist of an array of processing elements whose functions and interconnections can be dynamically changed. Nine commercial systems are selected, and their array structures, processing elements and interconnection architectures are classified.

  • A Multi-Context FPGA Using Floating-Gate-MOS Functional Pass-Gates

    Masanori HARIYAMA  Sho OGATA  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E89-C No:11
      Page(s):
    1655-1661

    Multi-context FPGAs (MC-FPGAs) have multiple memory bits per configuration bit forming configuration planes for fast switching between contexts. The additional memory planes cause a large overhead in area when a number of contexts are used. To overcome the overhead, a fine-grained MC-FPGA architecture using a floating-gate-MOS functional pass gate (FGFP) is presented which merges threshold operation and storage function on a single floating-gate MOS transistor. The test chip is designed using a 0.35 µm CMOS-EPROM technology. The transistor count of the proposed multi-context switch (MC-switch) is reduced to 13% in comparison with that of the SRAM-based one. The total area of the proposed MC-FPGA is reduced to about 56% of that of a conventional SRAM-based MC-FPGA.

  • Scheduling of Periodic Tasks on a Dynamically Reconfigurable Device Using Timed Discrete Event Systems

    Kenji ONOGI  Toshimitsu USHIO  

     
    PAPER-Concurrent Systems

      Vol:
    E89-A No:11
      Page(s):
    3227-3234

    A dynamically reconfigurable device is a device that can change its hardware configuration arbitrarily often in order to achieve the desired performance and functions. Since several tasks are executed on the device concurrently, scheduling of both task execution and reconfiguration is an important problem. In our model, the dynamically reconfigurable device is represented by a two-level hierarchical automaton, and execution of each periodic task is represented by a timed discrete event system. We propose a composition rule to get an automaton, which represents non-preemptive execution of periodic tasks on the dynamically reconfigurable device. We introduce a method to get a feasible execution sequence of tasks by using state feedback control.

  • Dynamically Reconfigurable Logic LSI: PCA-2

    Hideyuki ITO  Ryusuke KONISHI  Hiroshi NAKADA  Hideyuki TSUBOI  Yuichi OKUYAMA  Akira NAGOYA  

     
    PAPER-Reconfigurable Systems

      Vol:
    E87-D No:8
      Page(s):
    2011-2020

    Design points and the results seen in the development of a dynamically reconfigurable logic LSI, PCA-2, are described. PCA-2 enables the realization of flexible parallel processing based on the autonomous reconfiguration of logic circuits. To realize this feature, we introduce an asynchronous circuit design and a homogeneous cell array structure. PCA-2 represents an advance on the earlier LSI, PCA-1. Cutting edge CMOS technology is used to realize the structural merits of PCA hardware. Compared to PCA-1, PCA-2 offers 16 times greater integration level for programmable logic. Due to miniaturization and design refinement, PCA-2 provides a 6-fold increase in the circuit frequency of the configuration controller and a 3-fold increase in the operating frequency of the programmable logic. The results gained confirm the effects of refinement and the suitability of our architecture for device miniaturization.

  • Dynamically Reconfigurable Processor Implemented with IPFlex's DAPDNA Technology

    Takayuki SUGAWARA  Keisuke IDE  Tomoyoshi SATO  

     
    INVITED PAPER

      Vol:
    E87-D No:8
      Page(s):
    1997-2003

    The DAPDNA®-2 is the world's first general purpose dynamically reconfigurable processor for commercial usage. It is a dual-core processor consisting of a custom RISC core called the Digital Application Processor (DAP), and a two dimensional array of dynamically reconfigurable processing elements referred to as the Distributed Network Architecture (DNA). The DAP has a 32 bit instruction set architecture with an 8 KB instruction cache and 8 KB data cache that can be accessed in one clock cycle. It has an interrupt control function to detect data processing completion in the DNA-Matrix. The DNA-Matrix has different types of data processing elements such as ALU, delay, and memory elements to process fully parallel computations. The DNA-Matrix includes 32 independent 16 KB high speed SRAM elements (in total 512 KB). The DNA-Matrix, even with its parallel computational capability, can be synchronized and co-work at the same clock frequency as the DAP. The processor operates at a 166 MHz working frequency and is fabricated with a 0.11 µm CMOS process. The DAPDNA-2 device can be connected directly with up to 16 units with linear scalability in processing performance, provided the bandwidth requirement is within the maximum communication speed between DNAs, which is 32 Gbps. The DAPDNA-2 performs at a level that is two orders of magnitude higher than conventional high performance processors.

  • A Dynamically Adaptive Hardware on Dynamically Reconfigurable Processor

    Hideharu AMANO  Akiya JOURAKU  Kenichiro ANJO  

     
    INVITED PAPER

      Vol:
    E86-B No:12
      Page(s):
    3385-3391

    A framework of dynamically adaptive hardware mechanism on multicontext reconfigurable devices is proposed, and as an example, an adaptive switching fabric is implemented on NEC's novel reconfigurable device DRP (Dynamically Reconfigurable Processor). In this switch, contexts for the full crossbar and alternative hardware modules, which provide larger bandwidth but can treat only a limited pattern of packet inputs, are prepared. Using the quick context switching functionality, a context for the full crossbar is replaced by alternative contexts according to the packet input pattern. If the context corresponding to the requested alternative hardware modules is not inside the chip, it is loaded from outside the chip into currently unused context memory, and then replaces the full size crossbar. If the traffic includes a lot of packets for specific destinations, a set of contexts frequently used in the traffic is gathered inside the chip like a working set stored in a cache. A 4×4 mesh network connected with the proposed adaptive switches is simulated, and the results indicate that the latency between nodes is improved by a factor of three when the traffic between four neighboring nodes is dominant.

