heyingzhang@nudt.edu.cn
wkf.working@gmail.com
Received 10 December 2012
Accepted 27 August 2013
Published 28 November 2013
High-radix routers based on the tile structure require a large amount of buffer resources. To reduce the buffer space requirement without degrading the throughput of the router, shared buffer management schemes such as dynamically allocated multi-queue (DAMQ) can be used to improve buffer utilization. Unfortunately, DAMQ is commonly regarded as slow in write and read. To address this issue, we propose a fast and fair DAMQ structure called F2DAMQ for high-radix routers in this paper. It uses a fast FIFO structure in the implementation of the idle address list as well as the data buffer, and achieves critical performance improvements such as continuous and concurrent write and read with zero delay. Besides, F2DAMQ uses a novel credit management mechanism which is efficient in preventing any one virtual channel (VC) from monopolizing the shared part of the buffer and in achieving fairness among competing VCs sharing the buffer. Analyses and simulations show that F2DAMQ performs well in achieving low latency, high throughput and good fairness under different traffic patterns.
Keywords: High-radix router; dynamic allocation; buffer management.
1. Introduction
In recent years, the peak performance of supercomputers has increased to more than 20 Petaflops, and it will reach 100 Petaflops in the near future.1 The interconnection network plays an increasingly critical role in a supercomputer by determining the latency, throughput and stability of the whole system. It is commonly regarded that routers with many ports are more efficient in reducing hop count and latency. With the increase of pin bandwidth and advances in signaling technology,
H. Zhang et al.
enhance the drive strength of the output signals. SRAM-R delays the output data by an additional clock cycle compared to SRAM without registered output. That is, SRAM-R receives the read address and read enable signal on the first clock cycle and issues the data out on the third clock cycle. How to hide the read delay and realize continuous reads from the shared buffer is the most challenging problem in designing a DAMQ based on SRAM-R. Even when SRAM without registered output is used, a DAMQ implemented as a linked list is too slow in write and read to be used in a high-radix router. To speed up the access of DAMQ, we propose a novel fast and fair DAMQ structure in this paper called F2DAMQ. It realizes many significant performance advantages such as continuous and concurrent read or write, low delay and high throughput. Analyses and tests show that F2DAMQ satisfies the performance and area requirements of high-radix routers to a great extent.
To summarize, this paper makes the following contributions:
(1) Design a fast FIFO structure. The idle address list and shared data buffer in F2DAMQ are organized in this structure.
(2) Read the idle address list with zero delay on data arrival, which reduces the write delay to zero.
(3) Read data from the shared buffer to the prefetch buffer before the read request is received, which reduces the read delay to zero.
(4) Propose a fair credit management mechanism to prevent one VC from monopolizing the shared buffer.
The rest of the paper is organized as follows. Section 2 introduces the use of DAMQ in the tile structure. Section 3 designs a fast FIFO structure and uses it to construct the F2DAMQ buffer. Section 4 analyzes and evaluates the performance of F2DAMQ. Section 5 discusses the related work in detail. Finally, the conclusion is given in Sec. 6.
disadvantage is the large buffer requirement from the input buffers, row buffers and column buffers. In the first high-radix router YARC,5 64 ports are organized in 64 tiles arranged as an 8 × 8 array, including 64 input buffers, 512 row buffers and 512 column buffers. The input buffer size is 256 flits, and the row buffer and column buffer are 16 flits each. A flit is the smallest message unit for transmission and flow control in the network. Packets of various lengths can be segmented into several flits of fixed length. In the following description, data and flit are used interchangeably without confusion.
In our design, there are four VCs sharing a physical link. The VC ID is assigned by the network interface chip according to the application of the packet. It is marked in each flit of a packet and remains unchanged while the flit traverses the network. In Fig. 1, the 1:5 distributor should first schedule among data from the four VCs in the input buffer, then distribute it to other tiles. Similarly, the 5 × 5 crossbar also includes two main steps. First, each row buffer in the tile arbitrates among the four VCs. Then, the data of the selected VC is sent to the 5:1 arbiter. At the last stage, the 5:1 multiplexer selects among five column buffers after data from one VC is scheduled in each column buffer.
In the high-radix router, we can statically allocate separate buffer space to each VC or dynamically allocate arbitrary buffer space to the input data in a shared buffer. Take the input buffer as an example. Figure 2 compares static buffer allocation among N VCs with dynamic allocation. For static allocation, data from each VC is
Fig. 2. Different buffer allocation: (a) static and (b) dynamic.
placed in a specific buffer space. Even if some VC has no data to store, other VCs cannot use its idle space, thus resulting in poor buffer utilization. To overcome this problem, buffer space can be allocated to VCs on demand as shown in Fig. 2(b). Consequently, each VC can use more buffer blocks as long as there is idle space in the buffer. If each VC is guaranteed to use the same amount of buffer space, the total buffer space required by dynamic allocation is less than that of static allocation. Hence, dynamic allocation is more appropriate for high-radix routers, which are usually buffer-resource limited.
SAMQ and DAMQ are typical mechanisms for static allocation and dynamic allocation, respectively. Even though the control logic of DAMQ is complex compared to SAMQ, exploring its usage in high-radix routers is meaningful in consideration of its efficiency in improving buffer utilization. Moreover, in the ASIC floor-plan of a high-radix router, many long wires are needed to connect tiles in the same row or the same column. Therefore, the chip space that can be used to place so many buffers is very limited. The additional control logic caused by DAMQ is acceptable considering the benefit of the decrease in buffer resources. DAMQ can be used in the input buffer, row buffer and column buffer in the tile structure. Each buffer is shared by multiple VCs. The originally proposed DAMQ is efficient in buffer management but slow in write and read operations. With the increase of working frequency, more and more memory elements register the output data for one additional clock cycle before it is used by the following logic. Otherwise, it is difficult to meet the setup time requirement of the signal. Unfortunately, this additional delay makes DAMQ even slower.
The components of a DAMQ buffer commonly include the data buffer, address buffer, idle address list, and write and read pointer management. Among them, the data buffer and idle address list are usually implemented in SRAM. If SRAM-R is used to implement the idle address list, upon data arrival DAMQ needs one more cycle of delay to get the idle address for accommodating the arriving data. Moreover, if SRAM-R is used to implement the data buffer, when the scheduler or arbiter sends a read request to the DAMQ buffer, there is also one more cycle of delay to get the output data. To speed up access to the DAMQ buffer, we design a fast FIFO structure and use it to implement the data buffer and the idle address list. We also design a flow control scheme based on credit management to fairly allocate the buffer slots among multiple VCs. The new DAMQ scheme is called F2DAMQ. Its details are described in the following sections.
3. F2DAMQ Buffer
Without loss of generality, we introduce the implementation of an F2DAMQ buffer shared by four VCs. Note that a buffer shared by fewer or more VCs has a similar structure. Figure 3 shows an overview of F2DAMQ. The main components are:
(1) Data buffer: stores input data from the VCs.
(2) Address buffer: stores the address of the next data.
Fig. 3. Overview of the F2DAMQ buffer.
(3) First-in first-out (FIFO) TOP: stores data read in advance from the data buffer.
(4) Head and tail management: maintains the read pointer and write pointer of the data buffer and address buffer for each VC.
(5) Idle address list: contains the idle addresses of the data buffer, organized in FIFO order.
(6) Flow control: initializes, decreases and increases the sending credit of each VC.
Fig. 4. Linked organization of the data buffer and address buffer: the entries Addr_i(j+1), Addr_i(j+2), ... chain the data of VC_i, with the list terminated by NULL.
Fig. 5. Timing of read operation: (a) SRAM-R and (b) SRAM.
Fig. 6. Timing of read operation from the data buffer: (a) data buffer and address buffer combined in one SRAM-R memory; (b) data buffer and address buffer implemented separately.
Fig. 7. Desired read timing: data is output on the same clock cycle as the read request.
In Fig. 6(a), the first read request to the data buffer is issued on c1. The data and the address of the next data, addr_j, are available on c3. Therefore, the next read can only start on c3. It is evident that there are bubbles on the output data line of the buffer. To address this issue, we implement the address buffer in fast register arrays or SRAM without registered output, as shown in Fig. 6(b). The first read request to the data buffer is also issued on c1. The data is available on c3, but the address of the next data, addr_j, is available on c2. So the second read request can be issued on c2. As a result, reading from the data buffer can be continuous without intervals.
In the realization of an input-queued switch, the read request to the data buffer is usually the grant of an arbiter. To decrease the switch latency, the data buffer is expected to output data immediately when it receives the grant signal, as illustrated in Fig. 7. When the read request is valid, the data is read out on the same clock cycle. Unfortunately, the timing shown in Fig. 6(b) cannot meet this requirement. To achieve this goal, we design a fast FIFO structure for each VC by reading data from the shared data buffer to its private FIFO TOP in advance, before the read request is received.
3.2. FIFO TOP
As shown in Fig. 3, each VC has its own private buffer called FIFO TOP, which stores the top three data of the VC as illustrated in Fig. 8. The read request from the arbiter to the F2DAMQ buffer is sent to FIFO TOP rather than to the shared data buffer, and the required data is also output from the FIFO TOP.
There are two sources writing data to FIFO TOP. First, input data to the shared data buffer is written to FIFO TOP directly on the condition that this VC has no data in the shared data buffer and its FIFO TOP is not full. This process is called bypass write. Otherwise, the input data is written to the shared data buffer. Second, data is read from the shared data buffer if a VC has data queuing in it and its FIFO TOP is not full. More than one VC may want to read data from the shared data buffer on the same clock cycle, so an arbiter among these requests should be used. Here,
the read request to the F2DAMQ buffer is also used to read data from the shared data buffer to FIFO TOP.

Fig. 8. The top three data of each VC are stored in its FIFO_TOP; the remaining data stay in the shared data buffer.

The result of the arbiter in the crossbar is used as the read
request, and it is valid for at most one VC in each round of arbitration, i.e., in each cycle. Thus, a dedicated arbiter for read requests to the data buffer is not needed. If one data is read from the FIFO TOP of a specific VC, on the same clock cycle another data of the same VC can be read from the shared data buffer and written to the FIFO TOP. Data is always read from the first memory block of FIFO TOP. Once the first data in FIFO TOP is read, the other data is moved forward as shown in Fig. 9.
The FIFO TOP should be implemented with simple and fast memories such as register arrays, so that access to it is fast enough. Furthermore, the depth of FIFO TOP should exceed the read delay it is used to hide. We define the read delay as the clock cycles elapsed from the validation of the read enable signal to the output of the data. If the read delay of the data buffer is n clock cycles, the FIFO TOP should include at least n + 1 memory entries. The deeper the FIFO TOP, the lower the probability that it becomes empty, but the more control logic and area resources are required. Considering that the read delay of the data buffer implemented in SRAM-R is two clock cycles, we set the depth of FIFO TOP to three, making a trade-off between performance and resource overhead.
The data buffer and FIFO TOP construct a fast FIFO structure. This structure realizes reads with zero latency because the read delay of the shared data buffer is hidden by reading data into FIFO TOP before the read request from the crossbar is received. In contrast, existing DAMQs read the shared buffer when they receive the read request. Consequently, their read delay is determined by the timing of the shared buffer.
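The behavior of this fast FIFO structure can be sketched in software. The following is a minimal behavioral model, not the paper's Verilog implementation; names such as FastFIFO are illustrative. It models one VC's bypass write and prefetch refill:

```python
from collections import deque

class FastFIFO:
    """Behavioral model of one VC's fast FIFO: a private FIFO_TOP
    backed by that VC's queue in the shared data buffer."""

    TOP_DEPTH = 3  # hides a two-cycle SRAM-R read delay (depth >= n + 1)

    def __init__(self):
        self.top = deque()     # FIFO_TOP: top three data of the VC
        self.shared = deque()  # this VC's data queued in the shared buffer

    def write(self, flit):
        # Bypass write: go straight to FIFO_TOP when the VC has no data
        # in the shared buffer and FIFO_TOP is not full.
        if not self.shared and len(self.top) < self.TOP_DEPTH:
            self.top.append(flit)
        else:
            self.shared.append(flit)

    def read(self):
        # The arbiter's grant reads from FIFO_TOP, so data is available
        # immediately; the shared buffer refills FIFO_TOP behind it.
        if not self.top:
            return None  # VC empty
        flit = self.top.popleft()
        if self.shared:
            self.top.append(self.shared.popleft())
        return flit

f = FastFIFO()
for i in range(5):
    f.write(i)
# 0, 1, 2 went to FIFO_TOP by bypass write; 3, 4 wait in the shared buffer
print([f.read() for _ in range(5)])  # [0, 1, 2, 3, 4]: FIFO order preserved
```

The model shows why the read side sees zero delay: the grant is always served from the register-array FIFO_TOP, never directly from the SRAM.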
Fig. 9. After data1 of VC1 is read, the remaining data in VC1's FIFO_TOP move forward and data4 is refilled from the shared data buffer.
Fig. 10. Head and tail pointers: (a) change of the tail pointer after a write and (b) change of the head pointer after a read.
Similarly, assume reading from VCi. The steps of reading from the data buffer are:
(1) Read the data that the current read pointer of VCi points to from the data buffer.
(2) Write the current read pointer to the idle address list. Meanwhile, if M > 1, read the address of the next data from the address buffer entry that the current read pointer points to.
(3) Update the read pointer with the address of the next data if M > 1, or with null if M = 1.
The changes of the pointers during the run of F2DAMQ are illustrated in Fig. 10, where Hi (i = 0, 1, 2, 3) denotes the read pointer of VCi and Ti (i = 0, 1, 2, 3) denotes the write pointer of VCi. After reset of the F2DAMQ circuit, both the write pointer and the read pointer are initialized to null. In Fig. 10(a), Ti points to the location of the last data from VCi. In Fig. 10(b), H0 moves to the location of the next data after a flit of VC0 is read from the data buffer.
The description above implies that reading the data buffer runs without delay because the read pointer is ready before the read operation occurs. In contrast, whether writing to the data buffer incurs delay is determined mainly by the read delay of the idle address list.
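The three read steps above can be sketched as follows. This is a minimal model, with data_buf and addr_buf as arrays indexed by buffer address and NULL represented as None; the names are illustrative, not taken from the paper's RTL:

```python
NULL = None

def read_flit(vc, data_buf, addr_buf, head, count, idle_list):
    """Read one flit of VC `vc` by following the per-VC linked list.
    head[vc] is the current read pointer H_vc; count[vc] is the number
    of flits (M) this VC has in the buffer."""
    ptr = head[vc]
    flit = data_buf[ptr]           # (1) read data at the read pointer
    idle_list.append(ptr)          # (2) return the slot to the idle list
    nxt = addr_buf[ptr] if count[vc] > 1 else NULL
    head[vc] = nxt                 # (3) update the read pointer
    count[vc] -= 1
    return flit

# Two flits of VC0 linked as slot 5 -> slot 2.
data_buf = {5: "A", 2: "B"}
addr_buf = {5: 2, 2: NULL}
head, count, idle = {0: 5}, {0: 2}, []
print(read_flit(0, data_buf, addr_buf, head, count, idle))  # A
print(read_flit(0, data_buf, addr_buf, head, count, idle))  # B
print(idle)  # [5, 2]: both slots returned to the idle address list
```

Because head[vc] is updated as part of the previous read, the next read needs no extra lookup, matching the zero-delay claim above.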
3.4. Idle address list
The idle address list contains the addresses of unoccupied space in the data buffer, organized as a FIFO and implemented in SRAM-R. To hide the read delay of SRAM-R, the idle address list is also organized in the fast FIFO structure with a FIFO TOP, as shown in Fig. 3. The idle address list includes a main part and a FIFO TOP. When the main part is empty and the FIFO TOP is not full, input data is written to FIFO TOP directly. Otherwise, data is written to the main part. When data is read from FIFO TOP and the main part is not empty, data will be moved from the main part
to FIFO TOP. As a result, if there are no more than three data in the idle address list, they all reside in FIFO TOP; otherwise, the top three reside in FIFO TOP. The movement of data from the main part to FIFO TOP is transparent to the user, who can read the idle address list without delay. This feature is crucial for implementing fast, back-to-back writes to the data buffer.
The idle address list is initialized during reset of the F2DAMQ circuit, when all the addresses of the data buffer are written to the idle address list. When a flit is written to the data buffer, an idle address is read from the idle address list. Conversely, when a flit is read from the data buffer, its address is written to the idle address list. The structure of the idle address list with FIFO TOP improves the throughput and reduces the write delay of the data buffer through its zero-delay access feature.
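At the interface level, the idle address list behaves as a FIFO of free slot numbers, initialized with every data-buffer address at reset. A short sketch, assuming a 128-slot buffer as in the later experiments (the function names are illustrative):

```python
from collections import deque

DEPTH = 128                      # data buffer holds 128 flits
idle_list = deque(range(DEPTH))  # reset: every address starts free

def alloc():
    """Writing a flit consumes one idle address; the fast FIFO structure
    makes this a zero-delay read because the top entries sit in the idle
    list's FIFO_TOP."""
    return idle_list.popleft()

def free(addr):
    """Reading a flit returns its address to the idle list."""
    idle_list.append(addr)

a = alloc()            # first free slot
free(a)                # slot recycled after the flit is read out
print(len(idle_list))  # 128: all addresses free again
```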
3.5. Flow control
Flow control is used to prevent buer overow by limiting the amount of data that a
sender can write to the buer. Flow control based on credit is widely used between
communication peers in networks. It realizes two main functions, credit increase and
decrease. The sender can only send data when its credit is larger than zero. Once data
is sent, the credit will be decreased by one. When the receiver reads data from the
buer, it returns credit release signal to the sender. Then the credit of the sender will
be increased by one. In F2DAMQ, the credit management not only prevents the
buer of receiver from overow but also determines the allocation of shared data
buer among competing VCs. We propose a fair credit management (FCM) scheme
to avoid one VC monopolizing the shared data buer.
FCM partitions the data buffer (DB) into two parts, a shared buffer (SB) and private buffers (PB). The data buffer is assumed to be shared by N VCs. We have DB_depth = SB_depth + N × PB_depth, where DB_depth is the depth of DB, and SB_depth and PB_depth are the depths of SB and PB, respectively. SB can store data of all VCs, while a PB can only store data of a specific VC. Note that the locations of SB and PB are random and dynamic rather than fixed and static. Accordingly, the credit is divided into shared credit (SC) and private credit (PC), corresponding to the partition of the data buffer. SC and PC are initialized to the depths of SB and PB, respectively.
In F2DAMQ, VCi can send data to DB when it has credit. The rules for sending data and changing credit are:
(1) If SC > 0, which means SB is not full, VCi can send data to SB. Then SC = SC - 1, meaning a shared buffer block is occupied after the data is sent.
(2) If SC = 0 and PCi > 0, which means SB is full while the PB of VCi is not, VCi can send data to its private buffer. Then PCi = PCi - 1, meaning a private buffer block of VCi is occupied after the data is sent.
If neither condition is satisfied, VCi cannot send data for lack of credit.
In FCM, the receiver is DB. Upon data arrival, it allocates a free space to accommodate the input data. When data of a VC is read by other switch logic, DB sends a credit release signal to inform the sender that the receiver's buffer has gained a free space, together with the VC ID to which the output data belongs. Once the sender receives this signal, it decides whether to add the released credit to SC or PC. This decision affects the fairness of buffer sharing among VCs. FCM uses a fair credit increase method to prevent one VC from monopolizing the shared part of the data buffer.
The definition of fairness here is different from equal partition. We define fairness as each VC getting its share of the buffer on demand as much as possible. Although the setting of private buffers contributes greatly to achieving fairness, it is far from enough. To further improve fairness, we set a variable PSC for each VC to record the amount of SB occupied by it. We also set a threshold PSC_threshold for PSC. If the PSC of VCi is greater than the threshold, that is, PSCi > PSC_threshold, the released credit of VCi is added to SC. Otherwise, it is added to PCi. PSC_threshold can be set to SC/N, where N is the number of VCs sharing the buffer. The details of FCM are described below.
VCi has data to send:
if (SC > 0) {
    VC_send();
    SC = SC - 1;
    PSCi = PSCi + 1;
}
else if (PCi > 0) {
    VC_send();
    PCi = PCi - 1;
}
else
    VCi cannot send;

VCi receives the credit release signal:
if (PSCi > PSC_threshold) {
    SC = SC + 1;
    PSCi = PSCi - 1;
}
else if (PCi < PCmax)
    PCi = PCi + 1;
else {
    SC = SC + 1;
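The FCM rules above can be captured in a small runnable model of the sender side; FCMSender and its defaults (four VCs, SC = 64, PCmax = 16, PSC_threshold = 16, matching the experimental configuration in Sec. 4) are illustrative names, not the paper's RTL:

```python
class FCMSender:
    """Fair credit management for one sender sharing a buffer among N VCs."""

    def __init__(self, n_vc=4, sc=64, pc_max=16, psc_threshold=16):
        self.SC = sc                   # shared credit
        self.PC = [pc_max] * n_vc      # private credit per VC
        self.PC_MAX = pc_max
        self.PSC = [0] * n_vc          # shared blocks occupied per VC
        self.PSC_THRESHOLD = psc_threshold

    def try_send(self, i):
        """Sending rules for VC i; returns True if a flit may be sent."""
        if self.SC > 0:                # shared buffer not full
            self.SC -= 1
            self.PSC[i] += 1
            return True
        if self.PC[i] > 0:             # fall back to VC i's private buffer
            self.PC[i] -= 1
            return True
        return False                   # no credit: cannot send

    def credit_release(self, i):
        """A flit of VC i was read at the receiver: route the credit."""
        if self.PSC[i] > self.PSC_THRESHOLD:
            self.SC += 1               # return credit to the shared pool
            self.PSC[i] -= 1
        elif self.PC[i] < self.PC_MAX:
            self.PC[i] += 1            # refill VC i's private credit
        else:
            self.SC += 1

s = FCMSender()
sent = sum(s.try_send(3) for _ in range(200))
print(sent)  # 80: VC3 alone can occupy at most SC + PC3 = 64 + 16 slots
```

With no credit releases, an aggressive VC saturates at SC + PCi flits in flight, which is exactly the monopolization bound FCM is designed to enforce.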
4. Performance Analyses
The performance of F2DAMQ is compared with SAMQ by analyzing the buffer utilization and resource requirements. We also evaluate the performance of F2DAMQ by implementing it in Verilog and testing it in a cycle-accurate simulator. The simulation results indicate that F2DAMQ can achieve high throughput and low latency by supporting continuous, concurrent and fast reads and writes.
We assume the number of VCs sharing a single buffer is N, which is usually greater than one in most networks. SAMQ allocates M buffer blocks to each VC. F2DAMQ allocates P buffer blocks to the shared part and Q blocks to each VC as a private buffer. For normal DAMQ, all the buffer blocks are shared, denoted as W. The total buffer blocks required by SAMQ, F2DAMQ and DAMQ are N × M, P + N × Q and W, respectively. The maximum number of flits sent by each VC in SAMQ, F2DAMQ and DAMQ is M, P + Q and W, respectively.
4.1. Resource requirement
Suppose SAMQ, F2DAMQ and DAMQ have the same amount of buffer space. That is, N × M = P + N × Q = W, which can be rewritten as M = P/N + Q. If N > 1, we have P/N + Q < P + Q, that is, M < P + Q. It is evident that M < P + Q < W, which indicates that DAMQ allows each VC to send more flits than F2DAMQ and SAMQ. Note that for F2DAMQ, P and Q are configurable parameters. If Q is set to zero, F2DAMQ allows each VC to send the same number of flits as DAMQ. On the other hand, if P is set to zero, F2DAMQ performs the same as SAMQ. In the following analyses and tests, we observed similar results in many cases. So, F2DAMQ makes a trade-off between high throughput and fairness among VCs in buffer allocation by properly setting the parameters P and Q.
On the other hand, suppose each VC can occupy the same number of buffer blocks. That is, M = P + Q = W. The total buffer blocks required by SAMQ are N × M = N × (P + Q). For F2DAMQ, it is P + N × Q. For DAMQ, it is W. If N > 1, we have N × (P + Q) > P + N × Q > W, which means SAMQ requires more buffer blocks than F2DAMQ and DAMQ. If Q is set to zero, F2DAMQ requires the same number of buffer blocks as DAMQ. We can further calculate the buffer blocks saved by F2DAMQ. Let k denote the ratio of memory blocks saved by F2DAMQ to the blocks of SAMQ. Then, we have

k = [N(P + Q) - (P + N × Q)] / [N(P + Q)] = (1 - 1/N) × 1/(1 + Q/P).   (1)
The relationship of k with N and Q/P is shown in Fig. 11. We find that the more VCs share a buffer and the less buffer space is allocated to the private part, the more space F2DAMQ saves in comparison to SAMQ. In reality, k is also affected by the traffic pattern besides N and Q/P. If the traffic of all VCs is even, each VC will occupy fewer blocks in the shared part P. Correspondingly, the memory blocks saved by F2DAMQ will be fewer than k. So, it is more reasonable to view k as the upper limit of the saved memory blocks. If Q is set to 0 in (1), k = 1 - 1/N. This is the ratio of memory blocks saved by DAMQ relative to SAMQ.
The analyses above show that, on one hand, F2DAMQ and DAMQ can accommodate more flits from a VC than SAMQ when they have an equal amount of buffer space. This feature is critical for accepting bursty traffic or forwarding short messages
quickly. On the other hand, F2DAMQ and DAMQ require fewer buffer resources than SAMQ when they can send the same number of flits. This makes F2DAMQ and DAMQ more suitable for high-radix routers. By appropriately setting the value of Q, F2DAMQ can perform similarly to DAMQ.
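The saving ratio k of Eq. (1) can be evaluated directly; a quick check with hypothetical values N = 4, P = 96, Q = 8 (so Q/P = 1/12):

```python
def k_saved(N, P, Q):
    """Fraction of buffer blocks F2DAMQ saves relative to SAMQ when
    each VC may occupy the same M = P + Q blocks (Eq. (1))."""
    samq = N * (P + Q)   # SAMQ total: N x M
    f2damq = P + N * Q   # F2DAMQ total
    return (samq - f2damq) / samq

print(round(k_saved(4, 96, 8), 3))  # 0.692
print(k_saved(4, 96, 0))            # Q = 0: k = 1 - 1/N = 0.75
```

The direct count agrees with the closed form (1 - 1/N)/(1 + Q/P), and the Q = 0 case reproduces the DAMQ-versus-SAMQ saving 1 - 1/N.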
where 0 < n ≤ N; if n = 0, the buffer utilization is zero. Figure 12 shows the relationship of buffer utilization with n and Q/P, where N = 8, n changes from one to eight and Q/P changes from 0.125 to 1. We find that the buffer utilization increases with the increase of n and the decrease of Q/P.
We further rewrite (3) as

n(P/n + Q) / [N(P/N + Q)].
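Since n(P/n + Q) = P + nQ and N(P/N + Q) = P + NQ, the utilization expression can be checked with a short sketch (N = 8 as in Fig. 12; P and Q values are hypothetical):

```python
def utilization(n, N, P, Q):
    """Buffer utilization with n of N VCs active: the active VCs can
    fill the whole shared part P plus their own private parts n*Q."""
    return (P + n * Q) / (P + N * Q) if n > 0 else 0.0

N, P, Q = 8, 64, 8
print(utilization(8, N, P, Q))  # 1.0: all VCs active, whole buffer usable
print(utilization(1, N, P, Q))  # 0.5625: one active VC, (64+8)/128
```

As the text notes, utilization rises with n and falls as Q/P grows, because a larger private share strands more blocks when VCs go inactive.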
cycle-accurate simulator. The data buffer and idle address list of F2DAMQ and DAMQ are implemented in SRAM-R, as is the data buffer of SAMQ. SAMQ allocates an equal number of memory blocks, located in a fixed region, to each VC. For normal DAMQ, any VC can send data as long as there are idle buffer blocks. For any of the three mechanisms, only one VC is allowed to write to the data buffer on each clock cycle; similarly, only one VC is allowed to read from the data buffer. Writing to the buffer and reading from it can happen on the same clock cycle. The read arbiter of the buffer changes the scheduling priority following round-robin rules. If some VC has data in the buffer, it will be read out according to the grant signal generated by the read arbiter. The data buffer can accommodate 128 flits and the number of VCs is four. For F2DAMQ, PCmax = 16, the initial value of SC is 64, and PSC_threshold = 16. For SAMQ, the initial credit of each VC is 32. For DAMQ, there is only a public credit, which is initialized to the depth of the data buffer.
4.3. Latency
In this experiment, we test the latency of a flit travelling through the buffer. The latency is defined as the clock cycles required by a flit to traverse the buffer. For F2DAMQ, DAMQ and SAMQ, the minimum latency is one cycle, five cycles and three cycles, respectively. F2DAMQ achieves the shortest latency due to its zero-delay write and read. Moreover, we also find that, for F2DAMQ, writing to and reading from the buffer can run continuously and concurrently, which is the fastest access rate that a buffer can achieve. For DAMQ, the write delay and read delay are both two cycles. For SAMQ, the write delay is zero and the read delay is two cycles.
The pipeline stages that a flit experiences in the buffer are shown in Fig. 14. To achieve the minimum latency, the flit is read immediately after it is written to the buffer. Otherwise, the flit will queue in the buffer until the arrival of a read request.
Fig. 14. The pipeline of a flit flowing through the buffer.
There are five stages in total in the pipeline for F2DAMQ and DAMQ: WR, RI, GI, RR and GF. SAMQ needs only three stages, without access to the idle address list. The meanings of the five pipeline stages are:
(1) WR: the write request arrives together with the input flit.
(2) RI: read the idle address list for a free address.
(3) GI: get the idle address and store the flit in the data buffer.
(4) RR: receive the read request from the arbiter.
(5) GF: get the flit from the buffer and output it.
It is evident that F2DAMQ has the shortest pipeline and DAMQ has the longest one. F2DAMQ can finish WR, RI and GI in a single clock cycle. It can also implement RR and GF in a single cycle. This is mainly attributed to the use of the fast FIFO structure in the idle address list and the data path. DAMQ implements WR and RI in the same cycle, then must wait a cycle to get the idle address. To get the flit, it must wait another cycle after receiving the read request. The read timing of SAMQ is the same as that of DAMQ.
4.4. Throughput
In this experiment, we test the maximum throughput of the buffers. Here, throughput is defined as the ratio of the total input flits to the elapsed clock cycles. A flit is written to and read from the buffer on each clock cycle. Only one VC is allowed to send a flit each time. The ID of the VC which is allowed to send is generated according to a Poisson distribution. The aggregate throughputs of F2DAMQ, DAMQ and SAMQ are shown in Fig. 15. The test results show that the throughput of DAMQ is lower than that of F2DAMQ and SAMQ. The main reason is that F2DAMQ and SAMQ can write and read flits on each clock cycle, while DAMQ experiences multi-cycle latency in both write and read.
In reality, the read operation can be stopped for some reason, such as contention on an output port, or some VCs may stop sending data for a period of time. These VCs are called inactive VCs. Those that have data to send are called active VCs. If the active VCs have the same amount of data to send, they generate uniform traffic. Otherwise, if some active VC has more data to send than the others, they generate non-uniform traffic.
In the second experiment, we randomly stop the read operation at run-time. A flit is written to the buffer as long as there are credits. The throughput of the three buffer management mechanisms with different numbers of active VCs generating uniform traffic is shown in Fig. 16(a). We find that the throughput of F2DAMQ is higher than that of DAMQ and SAMQ. Moreover, the throughputs of DAMQ and SAMQ increase dramatically with the increase of active VCs. In contrast, the throughput of F2DAMQ stays high and changes slowly with the increase of active VCs. The results can be explained as follows. For F2DAMQ, an active VC can occupy the shared
buffer and its private buffer. Only the private buffers allocated to the inactive VCs cannot be used. So, most of the buffer blocks can be used, which guarantees F2DAMQ high throughput. For SAMQ, the buffer blocks are equally allocated to each VC. The private buffers of the inactive VCs cannot be used, which makes the available buffer small. As a result, the throughput is limited by the small number of available buffer blocks. For DAMQ, even though an active VC can use the whole buffer, the large write delay and read delay make it difficult to achieve high throughput.
The analyses in Sec. 4.1 indicate that F2DAMQ requires fewer buffer blocks to achieve the same throughput as SAMQ. To further verify the analyses, we test the throughput of the three buffer management schemes under different buffer depths when all four VCs are active and generating uniform traffic. For F2DAMQ, the private buffer allocated to each VC is fixed at eight flits and the rest of the buffer blocks are allocated to the shared part. The results are shown in Fig. 16(b). F2DAMQ achieves higher throughput than DAMQ and SAMQ. In other words, F2DAMQ with a small buffer depth achieves the same throughput as SAMQ with a large buffer depth. The high throughput of F2DAMQ is mainly owed to the fast write and read as well as the efficient credit management.
Under the same configuration, we also test the throughput under non-uniform traffic, where the four VCs are numbered VC0 to VC3 and VC3 sends four times as much data as the other three VCs. The results are shown in Fig. 16(c). The throughput of F2DAMQ is much higher than that of SAMQ and DAMQ. The reason is that F2DAMQ can accommodate bursty traffic better than SAMQ through its flexible credit management.
Fig. 16. Throughput of the three buffer management mechanisms: (a) throughput under different numbers of active VCs; (b) throughput under different buffer depths and uniform traffic; (c) throughput under different buffer depths and non-uniform traffic.
J = (Σ_{i=1}^{n} x_i)² / (n × Σ_{i=1}^{n} x_i²),

where x_i is the share achieved by user i and n is the number of users sharing the resource. x_i ≥ 0, but not all x_i may equal zero at the same time, so 0 < J ≤ 1. The closer J approaches one, the better the fairness.
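Jain's index as used here can be transcribed in a few lines (a direct rendering of the formula, not the authors' test harness); the second example uses the DAMQ flit counts at cycle 1100 from Table 1:

```python
def jain_index(x):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

print(jain_index([138, 138, 138, 138]))         # 1.0: equal shares
print(round(jain_index([79, 72, 67, 384]), 3))  # 0.555: one aggressive VC
```

An index of 1 means perfectly equal shares; a single dominant sender drags it toward 1/n.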
In our test, x_i is the number of flits sent by VCi from the start of the test to the observed clock cycle, and n is the number of VCs sharing the buffer, which is four here. In the first test, we generate a variable k from zero to three in a sequential, cyclic manner. Correspondingly, VC_k sends data. The fairness indexes of F2DAMQ, DAMQ and SAMQ are shown in Fig. 17.
SAMQ achieves the best fairness, and F2DAMQ achieves better fairness than DAMQ. This can be explained as follows. SAMQ allocates equal buffer blocks to each VC, so the four VCs get the same share of buffer blocks. F2DAMQ uses FCM credit management to allocate an equal private buffer to each VC and to prevent one VC from monopolizing the shared buffer blocks. By doing this, it also performs well in fairness guarantee. DAMQ allocates the shared data buffer among VCs according to their demand; in this test, the four VCs have the same demand on buffer blocks, so DAMQ achieves good fairness, too. In fact, the four VCs can hardly generate equal traffic in real interconnection networks. It is therefore meaningful and necessary to test fairness under non-uniform traffic.
Table 1. The number of flits sent by each VC.

                       Clock cycle
Mechanism    VC     100  300  500  700  900  1100
F2DAMQ       VC0     13   38   63   88  113   138
             VC1     13   38   63   88  113   138
             VC2     13   38   63   88  113   138
             VC3     61  124  148  178  201   222
DAMQ         VC0      9   25   42   59   70    79
             VC1      9   25   42   55   62    72
             VC2      9   25   42   55   61    67
             VC3     41  125  208  276  331   384
SAMQ         VC0     13   38   63   88  113   138
             VC1     13   38   63   88  113   138
             VC2     13   38   63   88  113   138
             VC3     50   79  104  133  158   182
In the second test, the amount of data that VC3 wants to send is five times that of the
other VCs. The test results are shown in Table 1 and Fig. 18. The number of flits sent by
each VC from the start of the test to some specific clock cycle is shown in Table 1. For
F2DAMQ and SAMQ, at the start of the test, the number of flits sent by VC3 is about
four times that of the other VCs, while, as the test progresses, it drops to only about
1.5 times. However, for DAMQ, the number of flits sent by VC3 is always about five
times that of the others.
The fairness index is shown in Fig. 18. At the beginning of the experiment, the
fairness indexes of F2DAMQ, SAMQ and DAMQ are low because there is a great
Fig. 19. Performance under different settings of private credit: (a) throughput and (b) fairness.
difference among the numbers of flits sent by the four VCs. As the experiment
progresses, the fairness indexes of F2DAMQ and SAMQ increase gradually to one,
while the fairness index of DAMQ stays low. The main reason is that DAMQ does
not limit the number of flits sent by an aggressive VC, while SAMQ and F2DAMQ do.
The experiments above indicate that, with a specific credit configuration,
F2DAMQ can achieve high throughput and good fairness under variable traffic
patterns. In the third experiment, we evaluate the throughput and fairness under
different settings of the private credit. The buffer depth is set to 128 flits. The private
credit for each VC is Q flits and the shared credit is P = 128 - 4Q, so
PSC_threshold = P/4, or equivalently PSC_threshold = 32 - Q. When Q = 0, F2DAMQ behaves like
DAMQ, and when Q = 32, F2DAMQ behaves like SAMQ. The fairness index and
throughput under uniform and non-uniform traffic with different settings of private
credit are shown in Figs. 19(a) and 19(b), respectively. From Fig. 19(a), we can find
that the throughput under non-uniform traffic is slightly lower than that under
uniform traffic. Moreover, the throughput decreases slightly with the increase of
private credit under non-uniform traffic. The reason is that the shared credit is
small when the private credit is large: if a VC wants to send more data and has used up
its private credit, it will frequently be prevented from sending data when the
shared credit is small. In other words, F2DAMQ can accommodate more flits with a
small private credit, especially under non-uniform traffic.
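The relationship between the private and shared credit in this experiment can be sketched as follows (a hypothetical helper reflecting only the arithmetic stated above, not the paper's RTL):

```python
def credit_config(q, depth=128, n_vcs=4):
    """Derive the shared credit P and PSC_threshold from the private
    credit Q per VC, for a buffer of `depth` flits shared by `n_vcs` VCs.
    P = depth - n_vcs * Q, and PSC_threshold = P / n_vcs (= 32 - Q here)."""
    p = depth - n_vcs * q        # shared credit left after private allocation
    threshold = p // n_vcs       # PSC_threshold = P/4, equivalently 32 - Q
    return p, threshold

for q in (0, 8, 16, 32):
    p, t = credit_config(q)
    print(f"Q={q:2d}  shared P={p:3d}  PSC_threshold={t:2d}")
```

The two extremes recover the baselines discussed in the text: Q = 0 leaves the whole 128-flit buffer shared (DAMQ-like), while Q = 32 leaves no shared credit at all (SAMQ-like).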
Figure 19(b) shows the fairness of F2DAMQ under uniform and non-uniform
traffic with different settings of private credit. The fairness index remains unchanged
under uniform traffic for different values of private credit. Under non-uniform
traffic, the fairness index increases dramatically with the increase of private credit.
This can be explained by the fact that aggressive VCs are prevented from occupying
excessive shared buffer blocks when the private credit is large, so aggressive and
unaggressive VCs occupy similar numbers of buffer blocks.
According to the experimental results, the setting of the private credit should make
a tradeoff between high throughput and good fairness. It would be better to adjust the
private credit according to the traffic pattern; this is a direction of future work.
(Table fragment; the caption and the first mechanism label were lost in extraction.)

Mechanism   Buffer depth (flits)   Cost
…           128                    39732
…           256                    75035
…           512                    145639
DAMQ        128                    66323
F2DAMQ      128                    76693
largest leakage power consumers in a NoC router, consuming about 64% of the total
router leakage power. Therefore, F2DAMQ, with its complex control logic and small
buffer, will not incur an unacceptable power consumption overhead.
The analyses and tests described above show that F2DAMQ outperforms DAMQ
and SAMQ in many aspects, except for the additional control logic introduced by the FIFO
TOP. For high-radix routers, it is more important to decrease the area and power
overhead caused by buffers than that caused by control logic. From this point of view, F2DAMQ
can satisfy the requirements of high-radix routers in performance and buffer resource
consumption. Moreover, F2DAMQ can also be used in NoC routers, where buffers are
more expensive than wires.
5. Related Works
The primary concept of DAMQ was proposed in Ref. 6 and implemented with linked lists.
The basic idea of this approach is to maintain (k + 1) linked lists in each buffer:
one list of packets for each of the (k - 1) output ports, one list of packets for the
end-node interface and one list of free buffer blocks, where k is the number of output
ports. Similarly, the F2DAMQ proposed here is also implemented with (N + 1) linked lists:
one list of packets for each of the N VCs and one list of free buffer blocks. It is
commonly regarded that the original DAMQ suffers from high latency in write and
read operations. The prefetch structure proposed in this paper eliminates this problem
effectively.
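The linked-list organization above can be illustrated with a small behavioral sketch: one queue of occupied blocks per VC plus one free-block list, all drawing from a single shared pool. This is only a software analogy of the data structure (class and method names are ours), not the paper's hardware implementation:

```python
from collections import deque

class LinkedListDAMQ:
    """Behavioral sketch of a linked-list DAMQ buffer: (N + 1) lists over
    one shared block pool -- one list of occupied blocks per VC plus one
    list of free blocks."""

    def __init__(self, n_vcs, n_blocks):
        self.free = deque(range(n_blocks))             # free-block list
        self.queues = [deque() for _ in range(n_vcs)]  # one list per VC
        self.data = [None] * n_blocks                  # shared block pool

    def write(self, vc, flit):
        if not self.free:
            return False                 # pool exhausted: back-pressure
        block = self.free.popleft()      # take any free block
        self.data[block] = flit
        self.queues[vc].append(block)    # link it onto this VC's list
        return True

    def read(self, vc):
        if not self.queues[vc]:
            return None
        block = self.queues[vc].popleft()
        flit, self.data[block] = self.data[block], None
        self.free.append(block)          # recycle the block for any VC
        return flit

buf = LinkedListDAMQ(n_vcs=4, n_blocks=8)
buf.write(0, "A"); buf.write(2, "B")
print(buf.read(2))  # "B" -- each VC reads only from its own list
```

The sketch also makes the fairness hazard of plain DAMQ visible: nothing stops one VC from draining the entire free list, which is exactly what the FCM credit mechanism of F2DAMQ guards against.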
The other performance penalty faced by traditional DAMQ is complexity. The SCB
buffer is an important circuit proposed to reduce the hardware complexity of
DAMQ as well as to speed up the read and write operations.7 The SCB system can
perform a read, a write, or a simultaneous read/write operation per
cycle due to its pipelined architecture. F2DAMQ performs as well as SCB in these
operations. Unfortunately, SCB requires customized CMOS circuits to facilitate data
insertion and block data movement in the buffer. However, for most ASIC designs,
standard memory IPs provided by manufacturers are usually preferred by logic
designers for cost and availability reasons. F2DAMQ is aimed at such
designs. Its additional prefetch structure decreases the latency of write and read
operations at the cost of an acceptable increase in complexity. To lower the hardware
complexity of F2DAMQ further, we are considering removing the idle address list,
based on the observation that the write and read operations of the idle address list
are the opposites of the corresponding operations on the data buffer. This work is now
underway.
For traditional DAMQ and SCB, there is no reserved space dedicated to each
output channel, so the packets destined to one specific output port may occupy the
whole buffer space and the packets destined to other output ports have no chance to
get into the buffer. To overcome this shortcoming, a new buffer scheme named
DAMQall is proposed in Ref. 12, which is based on SCB and reserves space for all
6. Conclusion
High-radix routers based on the tile structure have become popular in the interconnection
networks of supercomputers. The dramatically increased buffer requirement poses
a great challenge to the backend floor-plan of the ASIC chip. One possible
solution is to decrease the number of buffers, but this is difficult to realize. Another
solution is to decrease the buffer depth by dynamically allocating buffer entries
among competing VCs. DAMQ is such a mechanism: it is efficient in dynamic buffer
management, but not so efficient in achieving low-latency writes and reads. To
overcome this problem, a fast and fair multi-VC shared buffer structure named
F2DAMQ is proposed in this paper. It uses a fast FIFO structure to hide the read
delay of high-speed SRAM-R memory by always moving the top three data items into
the FIFO TOP in advance. Both the idle address list and the data buffer use this fast
FIFO structure to speed up writes and reads. Moreover, a fair credit management method is
proposed to allocate buffer space among VCs fairly and to prevent one VC from monopolizing
the shared part of the buffer. Analyses and tests indicate that F2DAMQ performs
well in latency, throughput and fairness. How to further simplify the control logic of
F2DAMQ is the main direction of future work.
Acknowledgments
This work was supported by the National High-Tech Research and Development
Plan of China under Grant No. 2012AA01A301.
References
1. D. Chen et al., The IBM Blue Gene/Q interconnection network and message unit, SC'11,
Seattle, Washington, USA, 12-18 November 2011.
2. W. Oed, The Cray Gemini interconnect: More than just a router, ISC'10, Hamburg,
Germany, June 2010.
3. J. Kim, W. J. Dally, B. Towles and A. K. Gupta, Microarchitecture of a high-radix router,
Proc. 32nd Int. Symp. Computer Architecture, Madison, WI, USA (2005), pp. 420-431.
4. S. Scott et al., The BlackWidow high-radix Clos network, Proc. 33rd Int. Symp. Computer
Architecture, Boston, MA, June 2006.
5. S. Li, L. S. Peh and N. K. Jha, Dynamic voltage scaling with links for power optimization
of interconnection networks, Proc. 9th Int. Symp. High-Performance Computer Architecture (HPCA) (2003), pp. 91-102.
6. Y. Tamir and G. L. Frazier, Dynamically-allocated multi-queue buffers for VLSI communication switches, IEEE Trans. Comput. 41 (1992) 725-737.
7. J. Park, B. W. O'Krafka, S. Vassiliadis and J. G. Delgado-Frias, Design and evaluation of
a DAMQ multiprocessor network with self-compacting buffers, IEEE Supercomputing'94,
Washington D.C., November 1994, pp. 713-722.
8. M. Jamali and A. Khademzadeh, Improving the performance of interconnection networks
using DAMQ buffer schemes, IJCSNS Int. J. Comput. Sci. Network Security 9 (2009)
7-13.
9. M. Jamali and A. Khademzadeh, DAMQ-based schemes for efficiently using the buffer
spaces of a NoC router, IJCSI Int. J. Comput. Sci. Issues 4 (2009) 36-41.
10. Y. Choi and T. M. Pinkston, Evaluation of queue designs for true fully adaptive routers,
J. Parallel Distributed Comput. 9 (2003) 606-616.
11. A. Kodi, A. Sarathy and A. Louri, Adaptive channel buffers in on-chip interconnection
networks: A power and performance analysis, IEEE Trans. Comput. 57 (2008) 1169-1181.
12. J. Liu and J. G. Delgado-Frias, DAMQ self-compacting buffer schemes for systems with
network-on-chip, Proc. Int. Conf. Computer Design, Las Vegas, June 2005, pp. 97-103.
13. J. Liu and J. G. Delgado-Frias, A DAMQ shared buffer scheme for network-on-chip,
Proc. 5th IASTED Int. Conf. Circuits, Signals, and Systems, Alberta, Canada, July 2007.
14. R. Jain, D.-M. Chiu and W. R. Hawe, A quantitative measure of fairness
and discrimination for resource allocation in shared computer systems, Technical Report
301, Digital Equipment Corporation, 1984.
15. T. T. Ye, L. Benini and G. De Micheli, Analysis of power consumption on switch fabrics in
network routers, Proc. 39th Design Automation Conf. (DAC) (2002), pp. 795-800.