Keyword Search Result

[Keyword] chip multiprocessor(5hit)

1-5hit
  • A Two-Level Cache Aware Adaptive Data Replication Mechanism for Shared LLC

    Qianqian WU  Zhenzhou JI  

     
    LETTER-Computer System

      Pubricized:
    2022/03/25
      Vol:
    E105-D No:7
      Page(s):
    1320-1324

    The shared last level cache (SLLC) in tile chip multiprocessors (TCMP) provides a low off-chip miss rate, but it causes a long on-chip access latency. In the two-level cache hierarchy, data replication stores replicas of L1 victims in the local LLC (L2 cache) to obtain a short local LLC access latency on the next accesses. Many data replication mechanisms have been proposed, but they do not consider both L1 victim reuse behaviors and LLC replica reception capability. They either produce many useless replicas or increase LLC pressure, which limits the improvement of system performance. In this paper, we propose a two-level cache aware adaptive data replication mechanism (TCDR), which controls replication based on both L1 victim reuse behaviors prediction and LLC replica reception capability monitoring. TCDR not only increases the accuracy of L1 replica selection, but also avoids the pressure of replication on LLC. The results show that TCDR improves the system performance with reasonable hardware overhead.

  • A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Vol:
    E96-D No:9
      Page(s):
    2047-2054

    Chip multiprocessors (CMPs) improve performance by simultaneously executing multiple threads using integrated multiple cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement by multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In shared caches of CMPs, cached data fetched by one thread are frequently evicted by another thread. Such an eviction, called inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage that occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete to use the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are decreased by cache partitioning and thread scheduling, respectively. The proposed scheduling method estimates the capacity demand of each thread with an estimation method used in the cache partitioning mechanism. Based on the estimation used for cache partitioning, the thread scheduler decides thread combinations sharing one cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable to avoid both ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce the inter-thread cache conflicts and hence improve performance.

  • Power-Aware Compiler Controllable Chip Multiprocessor

    Hiroaki SHIKANO  Jun SHIRAKO  Yasutaka WADA  Keiji KIMURA  Hironori KASAHARA  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    432-439

    A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.

  • Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture

    Keiji KIMURA  Takeshi KODAKA  Motoki OBATA  Hironori KASAHARA  

     
    PAPER-Architecture and Algorithms

      Vol:
    E86-C No:4
      Page(s):
    570-579

    This paper describes multigrain parallel processing on OSCAR (Optimally SCheduled Advanced multiprocessoR) chip multiprocessor architecture. OSCAR compiler cooperative chip multiprocessor architecture aims at development of scalable, high effective performance and cost effective chip multiprocessor with ease of use by compiler supports. OSCAR chip multiprocessor architecture integrates simple single issue processors having distributed shared data memory for optimal use of data locality over different loops and fine grain data transfer and synchronization, local data memory for private data recognized by compiler, and compiler controllable data transfer unit for overlapping data transfer to hide data transfer overhead. This OSCAR chip multiprocessor and OSCAR multigrain parallelizing compiler have been developed simultaneously. Performance of multigrain parallel processing on OSCAR chip multiprocessor architecture is evaluated using SPEC fp 2000/95 benchmark suite. When microSPARC like single issue core is used, OSCAR chip multiprocessor architecture gives us 2.36 times speedup in fpppp, 2.64 times in su2cor, 2.88 times in turb3d, 2.98 times in hydro2d, 3.84 times in tomcatv, 3.84 times in mgrid and 3.97 times in swim respectively for four processors against single processor.

  • An Area-Effective Datapath Architecture for Embedded Microprocessors and Scalable Systems

    Toshiaki INOUE  Takashi MANABE  Sunao TORII  Satoshi MATSUSHITA  Masato EDAHIRO  Naoki NISHI  Masakazu YAMASHINA  

     
    INVITED PAPER

      Vol:
    E84-C No:8
      Page(s):
    1014-1020

    We have proposed area-reduction techniques for superscalar datapath architectures with 34 SIMD instructions and have developed an integer-media unit based on these techniques. The unit's design is both functionally asymmetrical and integer-SIMD unified, and the resulting savings in area are 27%-48% as compared to other, functionally equivalent mid-level microprocessor designs, with performance that is, at most, only 7.2% lower. Further, in 2-D IDCT processing, the unit outperforms embedded microprocessor designs without SIMD functions by 49%-118%. Specifically, effective area reduction of adders, shifters, and multiply-and-adders has been achieved by using the unified design. These area-effective techniques are useful for embedded microprocessors and scalable systems that employ highly parallel superscalar and on-chip parallel architectures. The integer-media unit has been implemented in an evaluation chip fabricated with 0.15-µm 5-metal CMOS technology.

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.