heyingzhang@nudt.edu.cn
wkf.working@gmail.com
Received 10 December 2012
Accepted 27 August 2013
Published 28 November 2013
High-radix routers based on the tile structure require a large amount of buffer resources. To reduce the buffer space requirement without degrading the throughput of the router, shared buffer management schemes such as dynamically allocated multi-queue (DAMQ) can be used to improve buffer utilization. Unfortunately, DAMQ is commonly regarded as slow in write and read. To address this issue, we propose a fast and fair DAMQ structure called F2DAMQ for high-radix routers in this paper. It uses a fast FIFO structure in the implementation of the idle address list as well as the data buffer, and achieves critical performance improvements such as continuous and concurrent write and read with zero delay. Besides, F2DAMQ uses a novel credit management mechanism which is efficient in preventing any one virtual channel (VC) from monopolizing the shared part of the buffer and in achieving fairness among competing VCs sharing the buffer. Analyses and simulations show that F2DAMQ performs well in achieving low latency, high throughput and good fairness under different traffic patterns.
Keywords: High-radix router; dynamic allocation; buffer management.
1. Introduction
In recent years, the peak performance of supercomputers has increased to more than 20 Petaflops, and it will reach 100 Petaflops in the near future.1 The interconnection network plays an increasingly critical role in a supercomputer by determining the latency, throughput and stability of the whole system. It is commonly regarded that routers with many ports are more efficient in reducing hop count and latency. With the increase of pin bandwidth and advances in signaling technology,
H. Zhang et al.
enhance the drive strength of the output signals. SRAM-R delays the output data by an additional clock cycle compared to SRAM without registered output. That is, SRAM-R receives the read address and read enable signal on the first clock cycle and issues the data out on the third clock cycle. How to hide the read delay and realize continuous reads from the shared buffer is the most challenging problem in designing a DAMQ based on SRAM-R. Even when SRAM without registered output is used, a DAMQ implemented as a linked list is too slow in write and read to be used in a high-radix router. To speed up the access of DAMQ, we propose a novel fast and fair DAMQ structure in this paper called F2DAMQ. It realizes many significant performance advantages such as continuous and concurrent read or write, low delay and high throughput. Analyses and tests show that F2DAMQ satisfies the performance and area requirements of high-radix routers to a great extent.
To summarize, this paper makes the following contributions:
(1) Design a fast FIFO structure. The idle address list and shared data buffer in F2DAMQ are organized in this structure.
(2) Read the idle address list with zero delay on data arrival, which reduces the write delay to zero.
(3) Read data from the shared buffer to the prefetch buffer before the read request is received, which reduces the read delay to zero.
(4) Propose a fair credit management mechanism to prevent one VC from monopolizing the shared buffer.
The rest of the paper is organized as follows. Section 2 introduces the use of DAMQ in the tile structure. Section 3 designs a fast FIFO structure and uses it to construct the F2DAMQ buffer. Section 4 analyzes and evaluates the performance of F2DAMQ. Section 5 discusses the related work in detail. Finally, the conclusion is given in Sec. 6.
disadvantage is the large buffer requirement from the input buffers, row buffers and column buffers. In the first high-radix router YARC,5 64 ports are organized in 64 tiles arranged as an 8 × 8 array, including 64 input buffers, 512 row buffers and 512 column buffers. The input buffer size is 256 flits, and the row buffer and column buffer are 16 flits each. A flit is the smallest message unit for transmission and flow control in the network. Packets of various lengths can be segmented into several flits of fixed length. In the following description, data and flit are used interchangeably without confusion.
In our design, there are four VCs sharing a physical link. The VC ID is assigned by the network interface chip according to the application of the packet. It is marked in each flit of a packet and remains unchanged while the flit traverses the network. In Fig. 1, the 1:5 distributor should first schedule among data from the four VCs in the input buffer, then distribute it to other tiles. Similarly, the 5 × 5 crossbar also includes two main steps. First, each row buffer in the tile arbitrates among the four VCs. Then, the data of the selected VC is sent to the 5:1 arbiter. At the last stage, the 5:1 multiplexer selects among five column buffers after data from one VC is scheduled in each column buffer.
In the high-radix router, we can statically allocate separate buffer space to each VC or dynamically allocate arbitrary buffer space to the input data in a shared buffer. Take the input buffer as an example. Figure 2 compares static buffer allocation among N VCs with dynamic allocation. For static allocation, data from each VC is
Fig. 2. Different buffer allocation: (a) static and (b) dynamic.
placed in a specific buffer space. Even if some VC has no data to store, other VCs cannot use its idle space, thus resulting in poor buffer utilization. To overcome this problem, buffer space can be allocated to VCs on demand as shown in Fig. 2(b). Consequently, each VC can use more buffer blocks as long as there is idle space in the buffer. If each VC is guaranteed to use the same amount of buffer space, the total buffer space required by dynamic allocation is less than that of static allocation. Hence, dynamic allocation is more appropriate for high-radix routers, which are usually buffer-resource limited.
SAMQ and DAMQ are typical mechanisms for static allocation and dynamic allocation, respectively. Even though the control logic of DAMQ is complex compared to SAMQ, exploring its usage in high-radix routers is meaningful in consideration of its efficiency in improving buffer utilization. Moreover, in the ASIC floor-plan of a high-radix router, many long wires are needed to connect tiles in the same row or the same column. Therefore, the chip space that can be used to place so many buffers is very limited. The additional control logic caused by DAMQ is acceptable considering the benefit of the decrease in buffer resources. DAMQ can be used in the input buffer, row buffer and column buffer in the tile structure. Each buffer is shared by multiple VCs. The originally proposed DAMQ is efficient in buffer management but slow in write and read operations. With the increase of working frequency, more and more memory elements register the output data for one additional clock cycle before it is used by the following logic. Otherwise, it is difficult to meet the setup time requirement of the signal. Unfortunately, this additional delay makes DAMQ even slower.
The components of a DAMQ buffer commonly include the data buffer, address buffer, idle address list, and write and read pointer management. Among them, the data buffer and idle address list are usually implemented in SRAM. If SRAM-R is used to implement the idle address list, upon data arrival DAMQ needs one more cycle of delay to get the idle address for accommodating the arriving data. Moreover, if SRAM-R is used to implement the data buffer, when the scheduler or arbiter sends a read request to the DAMQ buffer, there is also one more cycle of delay to get the output data. To speed up access to the DAMQ buffer, we design a fast FIFO structure and use it to implement the data buffer and the idle address list. We also design a flow control scheme based on credit management to fairly allocate the buffer slots among multiple VCs. The new DAMQ scheme is called F2DAMQ. Its details are described in the following sections.
3. F2DAMQ Buffer
Without loss of generality, we introduce the implementation of an F2DAMQ buffer shared by four VCs. Note that a buffer shared by fewer or more VCs has a similar structure. Figure 3 shows an overview of F2DAMQ. The main components are:
(1) Data buffer: stores input data from the VCs.
(2) Address buffer: stores the address of the next data.
Fig. 3. Overview of the F2DAMQ buffer.
(3) First-in first-out (FIFO) TOP: stores data read in advance from the data buffer.
(4) Head and tail management: maintains the read pointer and write pointer of the data buffer and address buffer for each VC.
(5) Idle address list: contains the idle addresses of the data buffer, organized in FIFO order.
(6) Flow control: initializes, decreases and increases the sending credit of each VC.
Fig. 4. Linked organization of the data buffer and address buffer: the entries Addr_i(j+1), Addr_i(j+2), ... chain the data of VC_i, with the list terminated by NULL.
Fig. 5. Timing of read operation: (a) SRAM-R and (b) SRAM.
Fig. 6. Timing of read operation from the data buffer: (a) data buffer and address buffer combined in one SRAM-R memory; (b) data buffer and address buffer implemented separately.
Fig. 7. Desired read timing: data is output on the same clock cycle as the read request.
In Fig. 6(a), the first read request to the data buffer is issued on c1. The data and the address of the next data, addr_j, are available on c3. Therefore, the next read can only start on c3. It is evident that there are bubbles on the output data line of the buffer. To address this issue, we implement the address buffer in fast register arrays or SRAM without registered output, as shown in Fig. 6(b). The first read request to the data buffer is also issued on c1. The data is available on c3, but the address of the next data, addr_j, is available on c2. So the second read request can be issued on c2. As a result, reading from the data buffer can be continuous without intervals.
In the realization of an input-queued switch, the read request to the data buffer is usually the grant of an arbiter. To decrease the switch latency, the data buffer is expected to output data immediately when it receives the grant signal, as illustrated in Fig. 7. When the read request is valid, the data is read out on the same clock cycle. Unfortunately, the timing shown in Fig. 6(b) cannot meet this requirement. To achieve this goal, we design a fast FIFO structure for each VC by reading data from the shared data buffer to its private FIFO TOP in advance, before the read request is received.
3.2. FIFO TOP
As shown in Fig. 3, each VC has its own private buffer called FIFO TOP, which stores the top three data of the VC as illustrated in Fig. 8. The read request from the arbiter to the F2DAMQ buffer is sent to FIFO TOP rather than to the shared data buffer, and the required data is also output from the FIFO TOP.
There are two sources writing data to FIFO TOP. First, input data to the shared data buffer is written to FIFO TOP directly on the condition that this VC has no data in the shared data buffer and its FIFO TOP is not full. This process is called bypass write. Otherwise, the input data is written to the shared data buffer. Second, data is read from the shared data buffer if a VC has data queuing in it and its FIFO TOP is not full. More than one VC may want to read data from the shared data buffer on the same clock cycle, so an arbiter among these requests should be used. Here,
the read request to the F2DAMQ buffer is also used to read data from the shared data buffer to FIFO TOP.

Fig. 8. The top three data of each VC are stored in its FIFO_TOP; the remaining data stay in the shared data buffer.

The result of the arbiter in the crossbar is used as the read
request, and it is valid for at most one VC in each round of arbitration, i.e., in each cycle. Thus, a dedicated arbiter for read requests to the data buffer is not needed. If one data is read from the FIFO TOP of a specific VC, on the same clock cycle another data of the same VC can be read from the shared data buffer and written to the FIFO TOP. Data is always read from the first memory block of FIFO TOP. Once the first data in FIFO TOP is read, the other data is moved forward as shown in Fig. 9.
The FIFO TOP should be implemented with simple and fast memories such as register arrays, so that access to it is fast enough. Furthermore, the depth of FIFO TOP should exceed the read delay it is used to hide. We define the read delay as the clock cycles elapsed from the validation of the read enable signal to the output of the data. If the read delay of the data buffer is n clock cycles, the FIFO TOP should include at least n + 1 memory entries. The deeper the FIFO TOP, the lower the probability that it becomes empty, but the more control logic and area resources are required. Considering that the read delay of the data buffer implemented in SRAM-R is two clock cycles, we set the depth of FIFO TOP to three, making a trade-off between performance and resource overhead.
The data buffer and FIFO TOP construct a fast FIFO structure. This structure realizes reads with zero latency because the read delay of the shared data buffer is hidden by reading data into FIFO TOP before the read request from the crossbar is received. In contrast, existing DAMQs read the shared buffer when they receive the read request. Consequently, their read delay is determined by the timing of the shared buffer.
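The behavior of this fast FIFO structure can be sketched in software. The following is a minimal behavioral model, not the paper's Verilog implementation; names such as FastFIFO are illustrative. It models one VC's bypass write and prefetch refill:

```python
from collections import deque

class FastFIFO:
    """Behavioral model of one VC's fast FIFO: a private FIFO_TOP
    backed by that VC's queue in the shared data buffer."""

    TOP_DEPTH = 3  # hides a two-cycle SRAM-R read delay (depth >= n + 1)

    def __init__(self):
        self.top = deque()     # FIFO_TOP: top three data of the VC
        self.shared = deque()  # this VC's data queued in the shared buffer

    def write(self, flit):
        # Bypass write: go straight to FIFO_TOP when the VC has no data
        # in the shared buffer and FIFO_TOP is not full.
        if not self.shared and len(self.top) < self.TOP_DEPTH:
            self.top.append(flit)
        else:
            self.shared.append(flit)

    def read(self):
        # The arbiter's grant reads from FIFO_TOP, so data is available
        # immediately; the shared buffer refills FIFO_TOP behind it.
        if not self.top:
            return None  # VC empty
        flit = self.top.popleft()
        if self.shared:
            self.top.append(self.shared.popleft())
        return flit

f = FastFIFO()
for i in range(5):
    f.write(i)
# 0, 1, 2 went to FIFO_TOP by bypass write; 3, 4 wait in the shared buffer
print([f.read() for _ in range(5)])  # [0, 1, 2, 3, 4]: FIFO order preserved
```

The model shows why the read side sees zero delay: the grant is always served from the register-array FIFO_TOP, never directly from the SRAM.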
Fig. 9. After data1 of VC1 is read, the remaining data in VC1's FIFO_TOP move forward and data4 is refilled from the shared data buffer.
Fig. 10. Head and tail pointers: (a) change of the tail pointer after a write and (b) change of the head pointer after a read.
Similarly, assume reading from VCi. The steps of reading from the data buffer are:
(1) Read the data that the current read pointer of VCi points to from the data buffer.
(2) Write the current read pointer to the idle address list. Meanwhile, if M > 1, read the address of the next data from the address buffer entry that the current read pointer points to.
(3) Update the read pointer with the address of the next data if M > 1, or with null if M = 1.
The changes of the pointers during the run of F2DAMQ are illustrated in Fig. 10, where Hi (i = 0, 1, 2, 3) denotes the read pointer of VCi and Ti (i = 0, 1, 2, 3) denotes the write pointer of VCi. After reset of the F2DAMQ circuit, both the write pointer and the read pointer are initialized to null. In Fig. 10(a), Ti points to the location of the last data from VCi. In Fig. 10(b), H0 moves to the location of the next data after a flit of VC0 is read from the data buffer.
The description above implies that reading the data buffer runs without delay because the read pointer is ready before the read operation occurs. In contrast, whether writing to the data buffer incurs delay is determined mainly by the read delay of the idle address list.
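The three read steps above can be sketched as follows. This is a minimal model, with data_buf and addr_buf as arrays indexed by buffer address and NULL represented as None; the names are illustrative, not taken from the paper's RTL:

```python
NULL = None

def read_flit(vc, data_buf, addr_buf, head, count, idle_list):
    """Read one flit of VC `vc` by following the per-VC linked list.
    head[vc] is the current read pointer H_vc; count[vc] is the number
    of flits (M) this VC has in the buffer."""
    ptr = head[vc]
    flit = data_buf[ptr]           # (1) read data at the read pointer
    idle_list.append(ptr)          # (2) return the slot to the idle list
    nxt = addr_buf[ptr] if count[vc] > 1 else NULL
    head[vc] = nxt                 # (3) update the read pointer
    count[vc] -= 1
    return flit

# Two flits of VC0 linked as slot 5 -> slot 2.
data_buf = {5: "A", 2: "B"}
addr_buf = {5: 2, 2: NULL}
head, count, idle = {0: 5}, {0: 2}, []
print(read_flit(0, data_buf, addr_buf, head, count, idle))  # A
print(read_flit(0, data_buf, addr_buf, head, count, idle))  # B
print(idle)  # [5, 2]: both slots returned to the idle address list
```

Because head[vc] is updated as part of the previous read, the next read needs no extra lookup, matching the zero-delay claim above.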
3.4. Idle address list
The idle address list contains the addresses of unoccupied space in the data buffer, organized as a FIFO and implemented in SRAM-R. To hide the read delay of SRAM-R, the idle address list is also organized in the fast FIFO structure with a FIFO TOP, as shown in Fig. 3. The idle address list includes a main part and a FIFO TOP. When the main part is empty and the FIFO TOP is not full, input data is written to FIFO TOP directly. Otherwise, data is written to the main part. When data is read from FIFO TOP and the main part is not empty, data will be moved from the main part
to FIFO TOP. As a result, if there are no more than three data in the idle address list, they all reside in FIFO TOP; otherwise, the top three reside in FIFO TOP. The movement of data from the main part to FIFO TOP is transparent to the user, who can read the idle address list without delay. This feature is crucial for implementing fast, back-to-back writes to the data buffer.
The idle address list is initialized during reset of the F2DAMQ circuit, when all the addresses of the data buffer are written to the idle address list. When a flit is written to the data buffer, an idle address is read from the idle address list. Conversely, when a flit is read from the data buffer, its address is written to the idle address list. The structure of the idle address list with FIFO TOP improves the throughput and reduces the write delay of the data buffer through its zero-delay access feature.
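At the interface level, the idle address list behaves as a FIFO of free slot numbers, initialized with every data-buffer address at reset. A short sketch, assuming a 128-slot buffer as in the later experiments (the function names are illustrative):

```python
from collections import deque

DEPTH = 128                      # data buffer holds 128 flits
idle_list = deque(range(DEPTH))  # reset: every address starts free

def alloc():
    """Writing a flit consumes one idle address; the fast FIFO structure
    makes this a zero-delay read because the top entries sit in the idle
    list's FIFO_TOP."""
    return idle_list.popleft()

def free(addr):
    """Reading a flit returns its address to the idle list."""
    idle_list.append(addr)

a = alloc()            # first free slot
free(a)                # slot recycled after the flit is read out
print(len(idle_list))  # 128: all addresses free again
```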
3.5. Flow control
Flow control is used to prevent buer overow by limiting the amount of data that a
sender can write to the buer. Flow control based on credit is widely used between
communication peers in networks. It realizes two main functions, credit increase and
decrease. The sender can only send data when its credit is larger than zero. Once data
is sent, the credit will be decreased by one. When the receiver reads data from the
buer, it returns credit release signal to the sender. Then the credit of the sender will
be increased by one. In F2DAMQ, the credit management not only prevents the
buer of receiver from overow but also determines the allocation of shared data
buer among competing VCs. We propose a fair credit management (FCM) scheme
to avoid one VC monopolizing the shared data buer.
FCM partitions the data buffer (DB) into two parts, a shared buffer (SB) and private buffers (PB). The data buffer is assumed to be shared by N VCs. We have DB_depth = SB_depth + N × PB_depth, where DB_depth is the depth of DB, and SB_depth and PB_depth are the depths of SB and PB, respectively. SB can store data of all VCs, while a PB can only store data of a specific VC. Note that the locations of SB and PB are random and dynamic rather than fixed and static. Accordingly, the credit is divided into shared credit (SC) and private credit (PC), corresponding to the partition of the data buffer. SC and PC are initialized to the depths of SB and PB, respectively.
In F2DAMQ, VCi can send data to DB when it has credit. The rules for sending data and changing credit are:
(1) If SC > 0, which means SB is not full, VCi can send data to SB. Then SC = SC - 1, meaning a shared buffer block is occupied after the data is sent.
(2) If SC = 0 and PCi > 0, which means SB is full while the PB of VCi is not, VCi can send data to its private buffer. Then PCi = PCi - 1, meaning a private buffer block of VCi is occupied after the data is sent.
If neither condition is satisfied, VCi cannot send data for lack of credit.
In FCM, the receiver is DB. Upon data arrival, it allocates a free space to accommodate the input data. When data of a VC is read by other switch logic, DB sends a credit release signal to inform the sender that the receiver's buffer has gained a free space, together with the VC ID to which the output data belongs. Once the sender receives this signal, it decides whether to add the released credit to SC or PC. This decision affects the fairness of buffer sharing among VCs. FCM uses a fair credit increase method to prevent one VC from monopolizing the shared part of the data buffer.
The definition of fairness here is different from equal partition. We define fairness as each VC getting its share of the buffer on demand as much as possible. Although the setting of private buffers contributes greatly to achieving fairness, it is far from enough. To further improve fairness, we set a variable PSC for each VC to record the amount of SB occupied by it. We also set a threshold PSC_threshold for PSC. If the PSC of VCi is greater than the threshold, that is, PSCi > PSC_threshold, the released credit of VCi is added to SC. Otherwise, it is added to PCi. PSC_threshold can be set to SC/N, where N is the number of VCs sharing the buffer. The details of FCM are described below.
VCi has data to send:
if (SC > 0) {
    VC_send();
    SC = SC - 1;
    PSCi = PSCi + 1;
}
else if (PCi > 0) {
    VC_send();
    PCi = PCi - 1;
}
else
    VCi cannot send;

VCi receives the credit release signal:
if (PSCi > PSC_threshold) {
    SC = SC + 1;
    PSCi = PSCi - 1;
}
else if (PCi < PCmax)
    PCi = PCi + 1;
else {
    SC = SC + 1;
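The FCM rules above can be captured in a small runnable model of the sender side; FCMSender and its defaults (four VCs, SC = 64, PCmax = 16, PSC_threshold = 16, matching the experimental configuration in Sec. 4) are illustrative names, not the paper's RTL:

```python
class FCMSender:
    """Fair credit management for one sender sharing a buffer among N VCs."""

    def __init__(self, n_vc=4, sc=64, pc_max=16, psc_threshold=16):
        self.SC = sc                   # shared credit
        self.PC = [pc_max] * n_vc      # private credit per VC
        self.PC_MAX = pc_max
        self.PSC = [0] * n_vc          # shared blocks occupied per VC
        self.PSC_THRESHOLD = psc_threshold

    def try_send(self, i):
        """Sending rules for VC i; returns True if a flit may be sent."""
        if self.SC > 0:                # shared buffer not full
            self.SC -= 1
            self.PSC[i] += 1
            return True
        if self.PC[i] > 0:             # fall back to VC i's private buffer
            self.PC[i] -= 1
            return True
        return False                   # no credit: cannot send

    def credit_release(self, i):
        """A flit of VC i was read at the receiver: route the credit."""
        if self.PSC[i] > self.PSC_THRESHOLD:
            self.SC += 1               # return credit to the shared pool
            self.PSC[i] -= 1
        elif self.PC[i] < self.PC_MAX:
            self.PC[i] += 1            # refill VC i's private credit
        else:
            self.SC += 1

s = FCMSender()
sent = sum(s.try_send(3) for _ in range(200))
print(sent)  # 80: VC3 alone can occupy at most SC + PC3 = 64 + 16 slots
```

With no credit releases, an aggressive VC saturates at SC + PCi flits in flight, which is exactly the monopolization bound FCM is designed to enforce.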
4. Performance Analyses
The performance of F2DAMQ is compared with SAMQ by analyzing the buffer utilization and resource requirements. We also evaluate the performance of F2DAMQ by implementing it in Verilog and testing it in a cycle-accurate simulator. The simulation results indicate that F2DAMQ can achieve high throughput and low latency by supporting continuous, concurrent and fast reads and writes.
We assume the number of VCs sharing a single buffer is N, which is usually greater than one in most networks. SAMQ allocates M buffer blocks to each VC. F2DAMQ allocates P buffer blocks to the shared part and Q blocks to each VC as a private buffer. For normal DAMQ, all the buffer blocks are shared, denoted as W. The total buffer blocks required by SAMQ, F2DAMQ and DAMQ are N × M, P + N × Q and W, respectively. The maximum number of flits sent by each VC in SAMQ, F2DAMQ and DAMQ is M, P + Q and W, respectively.
4.1. Resource requirement
Suppose SAMQ, F2DAMQ and DAMQ have the same amount of buffer space. That is, N × M = P + N × Q = W, which can be rewritten as M = P/N + Q. If N > 1, we have P/N + Q < P + Q, that is, M < P + Q. It is evident that M < P + Q < W, which indicates that DAMQ allows each VC to send more flits than F2DAMQ and SAMQ. Note that for F2DAMQ, P and Q are configurable parameters. If Q is set to zero, F2DAMQ allows each VC to send the same number of flits as DAMQ. On the other hand, if P is set to zero, F2DAMQ performs the same as SAMQ. In the following analyses and tests, we observed similar results in many cases. So, F2DAMQ makes a trade-off between high throughput and fairness among VCs in buffer allocation by properly setting the parameters P and Q.
On the other hand, suppose each VC can occupy the same number of buffer blocks. That is, M = P + Q = W. The total buffer blocks required by SAMQ are N × M = N × (P + Q). For F2DAMQ, it is P + N × Q. For DAMQ, it is W. If N > 1, we have N × (P + Q) > P + N × Q > W, which means SAMQ requires more buffer blocks than F2DAMQ and DAMQ. If Q is set to zero, F2DAMQ requires the same number of buffer blocks as DAMQ. We can further calculate the buffer blocks saved by F2DAMQ. Let k denote the ratio of memory blocks saved by F2DAMQ to the blocks of SAMQ. Then, we have

k = [N(P + Q) - (P + N × Q)] / [N(P + Q)] = (1 - 1/N) × 1/(1 + Q/P).   (1)
The relationship of k with N and Q/P is shown in Fig. 11. We find that the more VCs share a buffer and the less buffer space is allocated to the private part, the more space F2DAMQ saves in comparison to SAMQ. In reality, k is also affected by the traffic pattern besides N and Q/P. If the traffic of all VCs is even, each VC will occupy fewer blocks in the shared part P. Correspondingly, the memory blocks saved by F2DAMQ will be fewer than k. So, it is more reasonable to view k as the upper limit of the saved memory blocks. If Q is set to 0 in (1), k = 1 - 1/N. This is the ratio of memory blocks saved by DAMQ relative to SAMQ.
The analyses above show that, on one hand, F2DAMQ and DAMQ can accommodate more flits from a VC than SAMQ when they have an equal amount of buffer space. This feature is critical for accepting bursty traffic or forwarding short messages
quickly. On the other hand, F2DAMQ and DAMQ require fewer buffer resources than SAMQ when they can send the same number of flits. This makes F2DAMQ and DAMQ more suitable for high-radix routers. By appropriately setting the value of Q, F2DAMQ can perform similarly to DAMQ.
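The saving ratio k of Eq. (1) can be evaluated directly; a quick check with hypothetical values N = 4, P = 96, Q = 8 (so Q/P = 1/12):

```python
def k_saved(N, P, Q):
    """Fraction of buffer blocks F2DAMQ saves relative to SAMQ when
    each VC may occupy the same M = P + Q blocks (Eq. (1))."""
    samq = N * (P + Q)   # SAMQ total: N x M
    f2damq = P + N * Q   # F2DAMQ total
    return (samq - f2damq) / samq

print(round(k_saved(4, 96, 8), 3))  # 0.692
print(k_saved(4, 96, 0))            # Q = 0: k = 1 - 1/N = 0.75
```

The direct count agrees with the closed form (1 - 1/N)/(1 + Q/P), and the Q = 0 case reproduces the DAMQ-versus-SAMQ saving 1 - 1/N.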
where 0 < n ≤ N; if n = 0, the buffer utilization is zero. Figure 12 shows the relationship of buffer utilization with n and Q/P, where N = 8, n changes from one to eight and Q/P changes from 0.125 to 1. We find that the buffer utilization increases with the increase of n and the decrease of Q/P.
We further rewrite (3) as

n(P/n + Q) / [N(P/N + Q)].
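Since n(P/n + Q) = P + nQ and N(P/N + Q) = P + NQ, the utilization expression can be checked with a short sketch (N = 8 as in Fig. 12; P and Q values are hypothetical):

```python
def utilization(n, N, P, Q):
    """Buffer utilization with n of N VCs active: the active VCs can
    fill the whole shared part P plus their own private parts n*Q."""
    return (P + n * Q) / (P + N * Q) if n > 0 else 0.0

N, P, Q = 8, 64, 8
print(utilization(8, N, P, Q))  # 1.0: all VCs active, whole buffer usable
print(utilization(1, N, P, Q))  # 0.5625: one active VC, (64+8)/128
```

As the text notes, utilization rises with n and falls as Q/P grows, because a larger private share strands more blocks when VCs go inactive.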
cycle-accurate simulator. The data buffer and idle address list of F2DAMQ and DAMQ are implemented in SRAM-R, as is the data buffer of SAMQ. SAMQ allocates an equal number of memory blocks, located in a fixed region, to each VC. For normal DAMQ, any VC can send data as long as there are idle buffer blocks. For any of the three mechanisms, only one VC is allowed to write to the data buffer on each clock cycle; similarly, only one VC is allowed to read from the data buffer. Writing to the buffer and reading from it can happen on the same clock cycle. The read arbiter of the buffer changes the scheduling priority following round-robin rules. If some VC has data in the buffer, it will be read out according to the grant signal generated by the read arbiter. The data buffer can accommodate 128 flits and the number of VCs is four. For F2DAMQ, PCmax = 16, the initial value of SC is 64, and PSC_threshold = 16. For SAMQ, the initial credit of each VC is 32. For DAMQ, there is only a public credit, which is initialized to the depth of the data buffer.
4.3. Latency
In this experiment, we test the latency of a flit travelling through the buffer. The latency is defined as the clock cycles required by a flit to traverse the buffer. For F2DAMQ, DAMQ and SAMQ, the minimum latency is one cycle, five cycles and three cycles, respectively. F2DAMQ achieves the shortest latency due to its zero-delay write and read. Moreover, we also find that, for F2DAMQ, writing to and reading from the buffer can run continuously and concurrently, which is the fastest access rate that a buffer can achieve. For DAMQ, the write delay and read delay are both two cycles. For SAMQ, the write delay is zero and the read delay is two cycles.
The pipeline stages that a flit experiences in the buffer are shown in Fig. 14. To achieve the minimum latency, the flit is read immediately after it is written to the buffer. Otherwise, the flit will queue in the buffer until the arrival of a read request.
Fig. 14. The pipeline of a flit flowing through the buffer.
There are five stages in total in the pipeline for F2DAMQ and DAMQ: WR, RI, GI, RR and GF. SAMQ needs only three stages, without access to the idle address list. The meanings of the five pipeline stages are:
(1) WR: the write request arrives together with the input flit.
(2) RI: read the idle address list for a free address.
(3) GI: get the idle address and store the flit in the data buffer.
(4) RR: receive the read request from the arbiter.
(5) GF: get the flit from the buffer and output it.
It is evident that F2DAMQ has the shortest pipeline and DAMQ has the longest one. F2DAMQ can finish WR, RI and GI in a single clock cycle. It can also implement RR and GF in a single cycle. This is mainly attributed to the use of the fast FIFO structure in the idle address list and the data path. DAMQ implements WR and RI in the same cycle, then must wait a cycle to get the idle address. To get the flit, it must wait another cycle after receiving the read request. The read timing of SAMQ is the same as that of DAMQ.
4.4. Throughput
In this experiment, we test the maximum throughput of the buffers. Here, throughput is defined as the ratio of the total input flits to the elapsed clock cycles. A flit is written to and read from the buffer on each clock cycle. Only one VC is allowed to send a flit each time. The ID of the VC which is allowed to send is generated according to a Poisson distribution. The aggregate throughputs of F2DAMQ, DAMQ and SAMQ are shown in Fig. 15. The test results show that the throughput of DAMQ is lower than that of F2DAMQ and SAMQ. The main reason is that F2DAMQ and SAMQ can write and read flits on each clock cycle, while DAMQ experiences multi-cycle latency in both write and read.
In reality, the read operation can be stopped for some reason, such as contention on an output port, or some VCs may stop sending data for a period of time. These VCs are called inactive VCs. Those that have data to send are called active VCs. If the active VCs have the same amount of data to send, they generate uniform traffic. Otherwise, if some active VC has more data to send than the others, they generate non-uniform traffic.
In the second experiment, we randomly stop the read operation at run-time. A flit is written to the buffer as long as there are credits. The throughput of the three buffer management mechanisms with different numbers of active VCs generating uniform traffic is shown in Fig. 16(a). We find that the throughput of F2DAMQ is higher than that of DAMQ and SAMQ. Moreover, the throughputs of DAMQ and SAMQ increase dramatically with the increase of active VCs. In contrast, the throughput of F2DAMQ stays high and changes slowly with the increase of active VCs. The results can be explained as follows. For F2DAMQ, an active VC can occupy the shared
buffer and its private buffer. Only the private buffers allocated to the inactive VCs cannot be used. So, most of the buffer blocks can be used, which guarantees F2DAMQ high throughput. For SAMQ, the buffer blocks are equally allocated to each VC. The private buffers of the inactive VCs cannot be used, which makes the available buffer small. As a result, the throughput is limited by the small number of available buffer blocks. For DAMQ, even though an active VC can use the whole buffer, the large write delay and read delay make it difficult to achieve high throughput.
The analyses in Sec. 4.1 indicate that F2DAMQ requires fewer buffer blocks to achieve the same throughput as SAMQ. To further verify the analyses, we test the throughput of the three buffer management schemes under different buffer depths when all four VCs are active and generating uniform traffic. For F2DAMQ, the private buffer allocated to each VC is fixed at eight flits and the rest of the buffer blocks are allocated to the shared part. The results are shown in Fig. 16(b). F2DAMQ achieves higher throughput than DAMQ and SAMQ. In other words, F2DAMQ with a small buffer depth achieves the same throughput as SAMQ with a large buffer depth. The high throughput of F2DAMQ is mainly owed to the fast write and read as well as the efficient credit management.
Under the same configuration, we also test the throughput under non-uniform traffic, where the four VCs are numbered VC0 to VC3 and VC3 sends four times as much data as the other three VCs. The results are shown in Fig. 16(c). The throughput of F2DAMQ is much higher than that of SAMQ and DAMQ. The reason is that F2DAMQ can accommodate bursty traffic better than SAMQ through its flexible credit management.
Fig. 16. Throughput of the three buffer management mechanisms: (a) throughput under different numbers of active VCs; (b) throughput under different buffer depths and uniform traffic; (c) throughput under different buffer depths and non-uniform traffic.
J = (Σ_{i=1}^{n} x_i)² / (n × Σ_{i=1}^{n} x_i²),

where x_i is the share achieved by user i and n is the number of users sharing the resource. x_i ≥ 0, but not all x_i may equal zero at the same time, so 0 < J ≤ 1. The closer J approaches one, the better the fairness.
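Jain's index as used here can be transcribed in a few lines (a direct rendering of the formula, not the authors' test harness); the second example uses the DAMQ flit counts at cycle 1100 from Table 1:

```python
def jain_index(x):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

print(jain_index([138, 138, 138, 138]))         # 1.0: equal shares
print(round(jain_index([79, 72, 67, 384]), 3))  # 0.555: one aggressive VC
```

An index of 1 means perfectly equal shares; a single dominant sender drags it toward 1/n.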
In our test, x_i is the number of flits sent by VCi from the start of the test to the observed clock cycle, and n is the number of VCs sharing the buffer, which is four here. In the first test, we generate a variable k from zero to three in a sequential, cyclic manner. Correspondingly, VC_k sends data. The fairness indexes of F2DAMQ, DAMQ and SAMQ are shown in Fig. 17.
SAMQ achieves the best fairness, and F2DAMQ achieves better fairness than DAMQ. This can be explained as follows. SAMQ allocates equal buffer blocks to each VC, so the four VCs get the same share of buffer blocks. F2DAMQ uses FCM credit management to allocate an equal private buffer to each VC and to prevent one VC from monopolizing the shared buffer blocks. By doing this, it also performs well in fairness guarantee. DAMQ allocates the shared data buffer among VCs according to their demand; in this test, the four VCs have the same demand on buffer blocks, so DAMQ achieves good fairness, too. In fact, the four VCs can hardly generate equal traffic in real interconnection networks. It is therefore meaningful and necessary to test fairness under non-uniform traffic.
Table 1. The number of flits sent by each VC.

                       Clock cycle
Mechanism    VC     100  300  500  700  900  1100
F2DAMQ       VC0     13   38   63   88  113   138
             VC1     13   38   63   88  113   138
             VC2     13   38   63   88  113   138
             VC3     61  124  148  178  201   222
DAMQ         VC0      9   25   42   59   70    79
             VC1      9   25   42   55   62    72
             VC2      9   25   42   55   61    67
             VC3     41  125  208  276  331   384
SAMQ         VC0     13   38   63   88  113   138
             VC1     13   38   63   88  113   138
             VC2     13   38   63   88  113   138
             VC3     50   79  104  133  158   182
In the second test, the amount of data that VC3 wants to send is five times that of the
other VCs. The test results are shown in Table 1 and Fig. 18. The number of flits sent by
each VC from the start of the test to some specific clock cycle is shown in Table 1. For
F2DAMQ and SAMQ, at the start of the test, the number of flits sent by VC3 is about
four times that of the other VCs, while, as the test progresses, it drops to only about
1.5 times. However, for DAMQ, the number of flits sent by VC3 is always about five
times that of the others.
The fairness index is shown in Fig. 18. At the beginning of the experiment, the
fairness indexes of F2DAMQ, SAMQ and DAMQ are low because there is a great
Fig. 19. Performance under different settings of private credit: (a) throughput and (b) fairness.
difference among the numbers of flits sent by the four VCs. As the experiment
progresses, the fairness indexes of F2DAMQ and SAMQ increase gradually to one,
while the fairness index of DAMQ stays low. The main reason is that DAMQ does
not limit the number of flits sent by an aggressive VC, while SAMQ and F2DAMQ do.
The experiments above indicate that, with a specific credit configuration,
F2DAMQ can achieve high throughput and good fairness under variable traffic
patterns. In the third experiment, we evaluate the throughput and fairness under
different settings of the private credit. The buffer depth is set to 128 flits. The private
credit for each VC is Q flits and the shared credit is P = 128 - 4Q, so
PSC_threshold = P/4, or equivalently PSC_threshold = 32 - Q. When Q = 0, F2DAMQ behaves like
DAMQ, and when Q = 32, F2DAMQ behaves like SAMQ. The fairness index and
throughput under uniform and non-uniform traffic with different settings of private
credit are shown in Figs. 19(a) and 19(b), respectively. From Fig. 19(a), we can find
that the throughput under non-uniform traffic is slightly lower than that under
uniform traffic. Moreover, the throughput decreases slightly with the increase of
private credit under non-uniform traffic. The reason is that the shared credit is
small when the private credit is large: if a VC wants to send more data and has used up
its private credit, it will frequently be prevented from sending data when the
shared credit is small. In other words, F2DAMQ can accommodate more flits with a
small private credit, especially under non-uniform traffic.
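The relationship between the private and shared credit in this experiment can be sketched as follows (a hypothetical helper reflecting only the arithmetic stated above, not the paper's RTL):

```python
def credit_config(q, depth=128, n_vcs=4):
    """Derive the shared credit P and PSC_threshold from the private
    credit Q per VC, for a buffer of `depth` flits shared by `n_vcs` VCs.
    P = depth - n_vcs * Q, and PSC_threshold = P / n_vcs (= 32 - Q here)."""
    p = depth - n_vcs * q        # shared credit left after private allocation
    threshold = p // n_vcs       # PSC_threshold = P/4, equivalently 32 - Q
    return p, threshold

for q in (0, 8, 16, 32):
    p, t = credit_config(q)
    print(f"Q={q:2d}  shared P={p:3d}  PSC_threshold={t:2d}")
```

The two extremes recover the baselines discussed in the text: Q = 0 leaves the whole 128-flit buffer shared (DAMQ-like), while Q = 32 leaves no shared credit at all (SAMQ-like).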
Figure 19(b) shows the fairness of F2DAMQ under uniform and non-uniform
traffic with different settings of private credit. The fairness index remains unchanged
under uniform traffic for different values of private credit. Under non-uniform
traffic, the fairness index increases dramatically with the increase of private credit.
This can be explained by the fact that aggressive VCs are prevented from occupying
excessive shared buffer blocks when the private credit is large, so aggressive and
unaggressive VCs occupy similar numbers of buffer blocks.
According to the experimental results, the setting of the private credit should make
a tradeoff between high throughput and good fairness. It would be better to adjust the
private credit according to the traffic pattern; this is a direction of future work.
(Table fragment; the caption and the first mechanism label were lost in extraction.)

Mechanism   Buffer depth (flits)   Cost
…           128                    39732
…           256                    75035
…           512                    145639
DAMQ        128                    66323
F2DAMQ      128                    76693
largest leakage power consumers in a NoC router, consuming about 64% of the total
router leakage power. Therefore, F2DAMQ, with its complex control logic and small
buffer, will not incur an unacceptable power consumption overhead.
The analyses and tests described above show that F2DAMQ outperforms DAMQ
and SAMQ in many aspects, except for the additional control logic introduced by the FIFO
TOP. For high-radix routers, it is more important to decrease the area and power
overhead caused by buffers than that caused by control logic. From this point of view, F2DAMQ
can satisfy the requirements of high-radix routers in performance and buffer resource
consumption. Moreover, F2DAMQ can also be used in NoC routers, where buffers are
more expensive than wires.
5. Related Works
The primary concept of DAMQ was proposed in Ref. 6 and implemented with linked lists.
The basic idea of this approach is to maintain (k + 1) linked lists in each buffer:
one list of packets for each of the (k - 1) output ports, one list of packets for the
end-node interface and one list of free buffer blocks, where k is the number of output
ports. Similarly, the F2DAMQ proposed here is also implemented with (N + 1) linked lists:
one list of packets for each of the N VCs and one list of free buffer blocks. It is
commonly regarded that the original DAMQ suffers from high latency in write and
read operations. The prefetch structure proposed in this paper eliminates this problem
effectively.
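The linked-list organization above can be illustrated with a small behavioral sketch: one queue of occupied blocks per VC plus one free-block list, all drawing from a single shared pool. This is only a software analogy of the data structure (class and method names are ours), not the paper's hardware implementation:

```python
from collections import deque

class LinkedListDAMQ:
    """Behavioral sketch of a linked-list DAMQ buffer: (N + 1) lists over
    one shared block pool -- one list of occupied blocks per VC plus one
    list of free blocks."""

    def __init__(self, n_vcs, n_blocks):
        self.free = deque(range(n_blocks))             # free-block list
        self.queues = [deque() for _ in range(n_vcs)]  # one list per VC
        self.data = [None] * n_blocks                  # shared block pool

    def write(self, vc, flit):
        if not self.free:
            return False                 # pool exhausted: back-pressure
        block = self.free.popleft()      # take any free block
        self.data[block] = flit
        self.queues[vc].append(block)    # link it onto this VC's list
        return True

    def read(self, vc):
        if not self.queues[vc]:
            return None
        block = self.queues[vc].popleft()
        flit, self.data[block] = self.data[block], None
        self.free.append(block)          # recycle the block for any VC
        return flit

buf = LinkedListDAMQ(n_vcs=4, n_blocks=8)
buf.write(0, "A"); buf.write(2, "B")
print(buf.read(2))  # "B" -- each VC reads only from its own list
```

The sketch also makes the fairness hazard of plain DAMQ visible: nothing stops one VC from draining the entire free list, which is exactly what the FCM credit mechanism of F2DAMQ guards against.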
The other performance penalty faced by traditional DAMQ is complexity. The SCB
buffer is an important circuit proposed to reduce the hardware complexity of
DAMQ as well as to speed up the read and write operations.7 The SCB system can
perform a read, a write, or a simultaneous read/write operation per
cycle due to its pipelined architecture. F2DAMQ performs as well as SCB in these
operations. Unfortunately, SCB requires customized CMOS circuits to facilitate data
insertion and block data movement in the buffer. However, for most ASIC designs,
standard memory IPs provided by manufacturers are usually preferred by logic
designers for cost and availability reasons. F2DAMQ is aimed at such
designs. Its additional prefetch structure decreases the latency of write and read
operations at the cost of an acceptable increase in complexity. To lower the hardware
complexity of F2DAMQ further, we are considering removing the idle address list,
based on the observation that the write and read operations of the idle address list
are the opposites of the corresponding operations on the data buffer. This work is now
underway.
For traditional DAMQ and SCB, there is no reserved space dedicated to each
output channel, so the packets destined to one specific output port may occupy the
whole buffer space and the packets destined to other output ports have no chance to
get into the buffer. To overcome this shortcoming, a new buffer scheme named
DAMQall is proposed in Ref. 12, which is based on SCB and reserves space for all
6. Conclusion
High-radix routers based on the tile structure have become popular in the interconnection
networks of supercomputers. The dramatically increased buffer requirement poses
a great challenge to the backend floor-plan of the ASIC chip. One possible
solution is to decrease the number of buffers, but this is difficult to realize. Another
solution is to decrease the buffer depth by dynamically allocating buffer entries
among competing VCs. DAMQ is such a mechanism: it is efficient in dynamic buffer
management, but not so efficient in achieving low-latency writes and reads. To
overcome this problem, a fast and fair multi-VC shared buffer structure named
F2DAMQ is proposed in this paper. It uses a fast FIFO structure to hide the read
delay of high-speed SRAM-R memory by always moving the top three data items into
the FIFO TOP in advance. Both the idle address list and the data buffer use this fast
FIFO structure to speed up writes and reads. Moreover, a fair credit management method is
proposed to allocate buffer space among VCs fairly and to prevent one VC from monopolizing
the shared part of the buffer. Analyses and tests indicate that F2DAMQ performs
well in latency, throughput and fairness. How to further simplify the control logic of
F2DAMQ is the main direction of future work.
Acknowledgments
This work was supported by the National High-Tech Research and Development
Plan of China under Grant No. 2012AA01A301.
References
1. D. Chen et al., The IBM Blue Gene/Q interconnection network and message unit, SC'11,
Seattle, Washington, USA, 12-18 November 2011.
2. W. Oed, The Cray Gemini interconnect: More than just a router, ISC'10, Hamburg,
Germany, June 2010.
3. J. Kim, W. J. Dally, B. Towles and A. K. Gupta, Microarchitecture of a high-radix router,
Proc. 32nd Int. Symp. Computer Architecture, Madison, WI, USA (2005), pp. 420-431.
4. S. Scott et al., The BlackWidow high-radix Clos network, Proc. 33rd Int. Symp. Computer
Architecture, Boston, MA, June 2006.
5. S. Li, L. S. Peh and N. K. Jha, Dynamic voltage scaling with links for power optimization
of interconnection networks, Proc. 9th Int. Symp. High-Performance Computer Architecture (HPCA) (2003), pp. 91-102.
6. Y. Tamir and G. L. Frazier, Dynamically-allocated multi-queue buffers for VLSI communication switches, IEEE Trans. Comput. 41 (1992) 725-737.
7. J. Park, B. W. O'Krafka, S. Vassiliadis and J. G. Delgado-Frias, Design and evaluation of
a DAMQ multiprocessor network with self-compacting buffers, IEEE Supercomputing'94,
Washington D.C., November 1994, pp. 713-722.
8. M. Jamali and A. Khademzadeh, Improving the performance of interconnection networks
using DAMQ buffer schemes, IJCSNS Int. J. Comput. Sci. Network Security 9 (2009)
7-13.
9. M. Jamali and A. Khademzadeh, DAMQ-based schemes for efficiently using the buffer
spaces of a NoC router, IJCSI Int. J. Comput. Sci. Issues 4 (2009) 36-41.
10. Y. Choi and T. M. Pinkston, Evaluation of queue designs for true fully adaptive routers,
J. Parallel Distributed Comput. 9 (2003) 606-616.
11. A. Kodi, A. Sarathy and A. Louri, Adaptive channel buffers in on-chip interconnection
networks: A power and performance analysis, IEEE Trans. Comput. 57 (2008) 1169-1181.
12. J. Liu and J. G. Delgado-Frias, DAMQ self-compacting buffer schemes for systems with
network-on-chip, Proc. Int. Conf. Computer Design, Las Vegas, June 2005, pp. 97-103.
13. J. Liu and J. G. Delgado-Frias, A DAMQ shared buffer scheme for network-on-chip,
Proc. 5th IASTED Int. Conf. Circuits, Signals, and Systems, Alberta, Canada, July 2007.
14. R. Jain, D.-M. Chiu and W. R. Hawe, A quantitative measure of fairness
and discrimination for resource allocation in shared computer systems, Technical Report
301, Digital Equipment Corporation, 1984.
15. T. T. Ye, L. Benini and G. De Micheli, Analysis of power consumption on switch fabrics in
network routers, Proc. 39th Design Automation Conf. (DAC) (2002), pp. 795-800.