This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Xun HE, Xin JIN, Minghui WANG, Dajiang ZHOU, Satoshi GOTO, "A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS" in IEICE TRANSACTIONS on Fundamentals,
vol. E94-A, no. 12, pp. 2609-2618, December 2011, doi: 10.1587/transfun.E94.A.2609.
Abstract: This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.
URL: https://globals.ieice.org/en_transactions/fundamentals/10.1587/transfun.E94.A.2609/_p
Copy
@ARTICLE{e94-a_12_2609,
author={Xun HE, Xin JIN, Minghui WANG, Dajiang ZHOU, Satoshi GOTO, },
journal={IEICE TRANSACTIONS on Fundamentals},
title={A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS},
year={2011},
volume={E94-A},
number={12},
pages={2609-2618},
abstract={This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.},
keywords={},
doi={10.1587/transfun.E94.A.2609},
ISSN={1745-1337},
month={December},}
Copy
TY - JOUR
TI - A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 2609
EP - 2618
AU - Xun HE
AU - Xin JIN
AU - Minghui WANG
AU - Dajiang ZHOU
AU - Satoshi GOTO
PY - 2011
DO - 10.1587/transfun.E94.A.2609
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E94-A
IS - 12
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - December 2011
AB - This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.
ER -