Antoniette MONDIGO Tomohiro UENO Kentaro SANO Hiroyuki TAKIZAWA
Because the hardware resources of a single FPGA are limited, one way to scale the performance of FPGA-based HPC applications is to expand the design space across multiple FPGAs. This paper presents a scalable architecture for a deeply pipelined stream computing platform and investigates the available parallelism and inter-FPGA link characteristics needed to achieve scaled performance. To make exploration of this vast design space practical, a performance model is presented and verified by evaluating a tsunami simulation application implemented on Intel Arria 10 FPGAs. Finally, a scalability analysis is performed in which the computing pipeline is extended over multiple FPGAs while the problem size is held constant, and speedup is achieved. Performance scales with the number of FPGAs; however, it degrades when the available bandwidth is insufficient or when the data stream is too small, which makes the pipeline overhead large. Tsunami simulation results show that the highest scaled performance for 8 cascaded Arria 10 FPGAs is achieved with a single pipeline of 5 stream processing elements (SPEs), which reaches 2.5 TFlops at a parallel efficiency of 98%, indicating the strong scalability of the multi-FPGA stream computing platform.
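The relationship between the reported figures can be illustrated with a short calculation. The sketch below assumes the conventional definition of parallel efficiency (speedup divided by the number of devices); the abstract does not state which definition the paper uses, and the single-FPGA throughput printed here is only what that assumption implies, not a number reported by the authors. The function names are hypothetical.

```python
def speedup_from_efficiency(efficiency: float, n_devices: int) -> float:
    """Speedup implied by a parallel efficiency, assuming efficiency = speedup / N."""
    return efficiency * n_devices


def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Conventional parallel efficiency: speedup divided by device count."""
    return speedup / n_devices


# Figures quoted in the abstract.
n_fpgas = 8
efficiency = 0.98
scaled_tflops = 2.5

# Derived values (assumption-dependent, not reported in the abstract).
speedup = speedup_from_efficiency(efficiency, n_fpgas)
implied_single_tflops = scaled_tflops / speedup

print(f"implied speedup over one FPGA: {speedup:.2f}x")
print(f"implied single-FPGA throughput: {implied_single_tflops:.3f} TFlops")
```

Under this assumption, 98% efficiency on 8 devices corresponds to a speedup just under the ideal factor of 8.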
Dongsheng YANG Tomohiro UENO Wei DENG Yuki TERASHIMA Kengo NAKATA Aravind Tharayil NARAYANAN Rui WU Kenichi OKADA Akira MATSUZAWA
A fully synthesizable all-digital phase-locked loop (AD-PLL) with a stochastic time-to-digital converter (STDC) is proposed in this paper. The whole AD-PLL circuit is built only from standard cells in a digital library, so its layout can be synthesized automatically by a commercial place-and-route (P&R) tool with a foundry-provided standard-cell library. No manual layout or process modification is required anywhere in the AD-PLL design. To solve the delay mismatch issue in delay-line-based time-to-digital converters (TDCs), an STDC employing only standard D flip-flops (DFFs) is presented to mitigate the sensitivity to layout mismatch resulting from automatic P&R. The key idea of the stochastic TDC is to exploit the layout uncertainty introduced by automatic P&R, which is taken to follow a Gaussian distribution. Moreover, the fully synthesized STDC can achieve a finer resolution than the conventional TDC. Implemented in a 28 nm fully depleted silicon-on-insulator (FDSOI) technology, the fully synthesized PLL consumes only 480 µW from a 1.0 V power supply while operating at 0.9 GHz. It achieves a figure of merit (FoM) of -231.1 dB with 4.0 ps RMS jitter while occupying a chip area of only 0.0055 mm².
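The reported FoM can be cross-checked from the jitter and power figures. The sketch below assumes the widely used jitter-power figure of merit for PLLs, FoM = 10·log10[(σ / 1 s)² · (P / 1 mW)]; the abstract does not spell out the formula, so this definition is an assumption, and the function name is hypothetical.

```python
import math


def pll_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """Jitter-power FoM in dB, assuming FoM = 10*log10((sigma/1s)^2 * (P/1mW))."""
    jitter_term = (rms_jitter_s / 1.0) ** 2   # jitter normalized to 1 s, squared
    power_term = power_w / 1e-3               # power normalized to 1 mW
    return 10.0 * math.log10(jitter_term * power_term)


# Figures quoted in the abstract: 4.0 ps RMS jitter, 480 uW power.
fom = pll_fom_db(rms_jitter_s=4.0e-12, power_w=480e-6)
print(f"FoM = {fom:.1f} dB")  # prints "FoM = -231.1 dB"
```

With this definition, the quoted jitter and power reproduce the stated -231.1 dB, which suggests the abstract uses the same convention.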