A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

Xun HE; Xin JIN; Minghui WANG; Dajiang ZHOU; Satoshi GOTO

doi:10.1587/transfun.E94.A.2609

Summary:

This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16-bit SIMD MAC instructions and vertical vector access. Eight cores and a 4-port L2 cache are connected by a CIB bus to form a cluster. Four clusters are connected by a mesh network. This hierarchical network provides more than 192 GB/s of low-latency inter-core bandwidth on average. The 4-port L2 cache architecture is likewise designed to provide 192 GB/s of L2 cache bandwidth. To reduce coherence operations in large-scale SMP, an application-specific protocol is proposed. Compared with MOESI, it saves 67.8% of L1 cache energy in the 32-core case. The whole system, including 32 vector cores, 256 KB of L2 cache, a 64-bit DDRII PHY, and two PLL units, occupies 25 mm² in 65 nm CMOS. It achieves a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.
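
The headline figures in the summary can be cross-checked with a little arithmetic. The sketch below is not from the paper: it assumes the 98 GMACs/W efficiency is quoted at the 375 GMACs peak (the summary lists both at 1.2 V but does not state this explicitly) and that throughput is spread evenly over the 32 cores. The constant and function names are illustrative only.

```python
# Figures quoted in the summary.
PEAK_GMACS = 375.0               # peak throughput, GMAC/s
EFFICIENCY_GMACS_PER_W = 98.0    # energy efficiency at 1.2 V
NUM_CORES = 32
DIE_AREA_MM2 = 25.0


def derived_figures():
    """Return quantities implied by the quoted figures.

    Assumption (not stated in the source): efficiency is measured at
    peak throughput, so power = peak / efficiency.
    """
    return {
        "implied_power_w": PEAK_GMACS / EFFICIENCY_GMACS_PER_W,
        "gmacs_per_core": PEAK_GMACS / NUM_CORES,
        "gmacs_per_mm2": PEAK_GMACS / DIE_AREA_MM2,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the chip would draw roughly 3.8 W at peak, with about 11.7 GMACs per core and 15 GMACs per mm². These are estimates derived from the summary, not values reported by the authors.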

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E94-A No.12 pp.2609-2618

Publication Date: 2011/12/01

Online ISSN: 1745-1337

DOI: 10.1587/transfun.E94.A.2609

Type of Manuscript: Special Section PAPER (Special Section on VLSI Design and CAD Algorithms)

Category: High-Level Synthesis and System-Level Design

Xun HE, Xin JIN, Minghui WANG, Dajiang ZHOU, Satoshi GOTO, "A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS" in IEICE TRANSACTIONS on Fundamentals, vol. E94-A, no. 12, pp. 2609-2618, December 2011, doi: 10.1587/transfun.E94.A.2609.
URL: https://globals.ieice.org/en_transactions/fundamentals/10.1587/transfun.E94.A.2609/_p

@ARTICLE{e94-a_12_2609,
author={Xun HE and Xin JIN and Minghui WANG and Dajiang ZHOU and Satoshi GOTO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS},
year={2011},
volume={E94-A},
number={12},
pages={2609--2618},
abstract={This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16-bit SIMD MAC instructions and vertical vector access. Eight cores and a 4-port L2 cache are connected by a CIB bus to form a cluster. Four clusters are connected by a mesh network. This hierarchical network provides more than 192 GB/s of low-latency inter-core bandwidth on average. The 4-port L2 cache architecture is likewise designed to provide 192 GB/s of L2 cache bandwidth. To reduce coherence operations in large-scale SMP, an application-specific protocol is proposed. Compared with MOESI, it saves 67.8% of L1 cache energy in the 32-core case. The whole system, including 32 vector cores, 256 KB of L2 cache, a 64-bit DDRII PHY, and two PLL units, occupies 25 mm² in 65 nm CMOS. It achieves a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.},
keywords={},
doi={10.1587/transfun.E94.A.2609},
ISSN={1745-1337},
month={December},
}

TY  - JOUR
TI  - A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS
T2  - IEICE TRANSACTIONS on Fundamentals
SP  - 2609
EP  - 2618
AU  - Xun HE
AU  - Xin JIN
AU  - Minghui WANG
AU  - Dajiang ZHOU
AU  - Satoshi GOTO
PY  - 2011
DO  - 10.1587/transfun.E94.A.2609
JO  - IEICE TRANSACTIONS on Fundamentals
SN  - 1745-1337
VL  - E94-A
IS  - 12
JA  - IEICE TRANSACTIONS on Fundamentals
Y1  - December 2011
AB  - This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16-bit SIMD MAC instructions and vertical vector access. Eight cores and a 4-port L2 cache are connected by a CIB bus to form a cluster. Four clusters are connected by a mesh network. This hierarchical network provides more than 192 GB/s of low-latency inter-core bandwidth on average. The 4-port L2 cache architecture is likewise designed to provide 192 GB/s of L2 cache bandwidth. To reduce coherence operations in large-scale SMP, an application-specific protocol is proposed. Compared with MOESI, it saves 67.8% of L1 cache energy in the 32-core case. The whole system, including 32 vector cores, 256 KB of L2 cache, a 64-bit DDRII PHY, and two PLL units, occupies 25 mm² in 65 nm CMOS. It achieves a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.
ER  - 