Network Interface Architecture with Scalable Low-Latency Message Receiving Mechanism

Noboru TANABE; Atsushi OHTA

doi:10.1587/transinf.E96.D.2536

Summary:

Most scientists outside computer science do not want to spend effort on performance tuning by rewriting their MPI applications. In addition, the number of processing elements available to them is increasing year by year. On large-scale parallel systems, the number of messages accumulated in a message buffer tends to increase in some of their applications. Since searching the message queue in MPI is time-consuming, scalable acceleration on the system side is needed for those systems. In this paper, a support function named LHS (Limited-length Head Separation) is proposed, and its message-buffer search performance and hardware cost are evaluated. LHS accelerates message-buffer searching by switching the location in which the limited-length heads of messages are stored. It exploits effects such as an increased cache hit rate on the host, with partial off-loading to hardware. With LHS on an FPGA-based network interface card (NIC) named DIMMnet-2, searching the message buffer when the order of message reception differs from the receiver's expectation is accelerated 14.3 times. This absolute performance is 38.5 times higher than that of IBM BlueGene/P, although the frequency is 8.5 times lower than that of BlueGene/P. LHS has higher scalability than ALPU in performance per frequency. Since these results were obtained with partially on-loaded linear searching on an old Pentium® 4, the performance gap will increase with a state-of-the-art CPU. Therefore, LHS is more suitable for larger parallel systems. Issues in adopting the proposed method on state-of-the-art processors and systems are also discussed.
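The summary does not give the actual data layout or the hardware interface of LHS, so the following is only a minimal sketch of the general idea it names: keeping the short matching heads of buffered messages in a store separate from the message bodies, so that a linear search touches only the compact heads. The class `HeadSeparatedBuffer`, the `Head` fields, and the `ANY` wildcard are hypothetical names chosen for illustration. Python cannot reproduce the cache-hit or off-loading effects; the sketch shows only the separation and the in-order matching logic.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY = -1  # hypothetical wildcard, in the role of MPI_ANY_SOURCE / MPI_ANY_TAG


@dataclass(frozen=True)
class Head:
    """Fixed-size matching fields of one buffered message (assumed layout)."""
    source: int
    tag: int
    comm: int


class HeadSeparatedBuffer:
    """Receive buffer that stores heads and bodies in separate lists."""

    def __init__(self) -> None:
        self.heads: List[Head] = []    # compact store, scanned linearly
        self.bodies: List[bytes] = []  # touched only after a head matches

    def deliver(self, source: int, tag: int, comm: int, body: bytes) -> None:
        """Append an arriving message, splitting its head from its body."""
        self.heads.append(Head(source, tag, comm))
        self.bodies.append(body)

    def receive(self, source: int, tag: int, comm: int) -> Optional[bytes]:
        """Return the body of the oldest matching message, or None."""
        for slot, head in enumerate(self.heads):
            if head.comm != comm:
                continue
            if source not in (ANY, head.source):
                continue
            if tag not in (ANY, head.tag):
                continue
            # Match found: remove the head and fetch the body from the
            # separate store. Arrival order of the rest is preserved.
            del self.heads[slot]
            return self.bodies.pop(slot)
        return None


if __name__ == "__main__":
    buf = HeadSeparatedBuffer()
    buf.deliver(source=0, tag=7, comm=0, body=b"first")
    buf.deliver(source=1, tag=9, comm=0, body=b"second")
    # The receiver expects the later message first, so the search must
    # skip over the head of the earlier one.
    print(buf.receive(source=1, tag=9, comm=0))
```

The out-of-order case in the usage example is the one the summary measures: the more unmatched heads precede the wanted message, the longer the scan, which is why keeping the scanned data small matters.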

Publication: IEICE TRANSACTIONS on Information Vol.E96-D No.12 pp.2536-2544

Publication Date: 2013/12/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E96.D.2536

Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)

Authors

Noboru TANABE
Toshiba Corporation
Atsushi OHTA
Hitachi Information and Communication Engineering, Ltd.

Keyword

network interface, MPI, message passing, queue management, low latency communication, scalability

Noboru TANABE, Atsushi OHTA, "Network Interface Architecture with Scalable Low-Latency Message Receiving Mechanism" in IEICE TRANSACTIONS on Information, vol. E96-D, no. 12, pp. 2536-2544, December 2013, doi: 10.1587/transinf.E96.D.2536.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2536/_p

@ARTICLE{e96-d_12_2536,
author={Noboru TANABE and Atsushi OHTA},
journal={IEICE TRANSACTIONS on Information},
title={Network Interface Architecture with Scalable Low-Latency Message Receiving Mechanism},
year={2013},
volume={E96-D},
number={12},
pages={2536-2544},
keywords={},
doi={10.1587/transinf.E96.D.2536},
ISSN={1745-1361},
month={December},}

TY - JOUR
TI - Network Interface Architecture with Scalable Low-Latency Message Receiving Mechanism
T2 - IEICE TRANSACTIONS on Information
SP - 2536
EP - 2544
AU - Noboru TANABE
AU - Atsushi OHTA
PY - 2013
DO - 10.1587/transinf.E96.D.2536
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2013
ER -