Elastic Fifo

Elastic-Buffer Flow-Control
for On-Chip Networks

George Michelogiannakis,
James Balfour, William J. Dally
Computer Systems Laboratory
Stanford University
Edited by: Abhay Bhopat
Background
Buffer
Elastic Buffer
Elastic Buffer design
Introduction
Elastic-buffer (EB) flow-control uses the channels
as distributed FIFOs
Input buffers at routers are not needed
Can provide 12% more throughput per unit power

Reduces router cycle time by 18%
Compared to VC routers
Outline
Building elastic-buffered channels
By using what is already there
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation
The Idea
Use the network channels as distributed FIFOs
Use that storage instead of input buffers at
routers
To remove input buffer area and power costs
Pipelined channel
Channel as FIFO
Building an Elastic Buffer

To build an EB in a pipelined channel with master-slave flip-flops (FFs):
Use latches for storage by driving their enables independently
Elastic buffer
Master-slave FF
Expanded view of EB control logic
How Elastic Buffer Channels Work

Ready/valid handshake between elastic buffers
Ready: At least one free storage slot
Valid: Non-empty (driving valid data)
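The handshake above can be sketched as a small cycle-based model. This is a minimal sketch, not the authors' logic: it assumes each EB stores up to two flits (one per latch), and the names `ElasticBuffer` and `step_channel` are hypothetical.

```python
from collections import deque


class ElasticBuffer:
    """Hypothetical EB model: two storage slots, one per latch (assumption)."""

    def __init__(self, slots=2):
        self.slots = slots
        self.fifo = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.fifo) < self.slots

    @property
    def valid(self):
        # Valid: non-empty, so the EB is driving valid data.
        return len(self.fifo) > 0


def step_channel(ebs, source, sink_ready):
    """Advance a chain of EBs by one cycle.

    Ready/valid are sampled at the start of the cycle; a flit moves across a
    link only when the sender is valid and the receiver is ready.
    Returns the flit delivered to the sink this cycle, or None.
    """
    valid = [eb.valid for eb in ebs]
    ready = [eb.ready for eb in ebs]
    delivered = None
    # Last EB hands a flit to the sink.
    if valid[-1] and sink_ready:
        delivered = ebs[-1].fifo.popleft()
    # Transfers between neighbouring EBs, downstream links first.
    for i in range(len(ebs) - 2, -1, -1):
        if valid[i] and ready[i + 1]:
            ebs[i + 1].fifo.append(ebs[i].fifo.popleft())
    # Source injects into the first EB.
    if source and ready[0]:
        ebs[0].fifo.append(source.popleft())
    return delivered
```

When the sink stalls, flits pile up in the channel itself until every EB is full, which is the sense in which the channel acts as a distributed FIFO.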
[Figure: animation of flits advancing through the EB channel, cycles 1 to 6]
Control Logic Area Overhead

Control logic is implemented as a four-state FSM
with 10 gates and 2 FFs
Cost is amortized over channel width
Example: control logic increases area of a 64-bit channel by 5%
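To illustrate how the cost amortizes over channel width, here is a rough estimate calibrated to the 5% figure at 64 bits. It assumes (this is not stated in the slides) that control logic area is constant and datapath area grows linearly with width; `control_overhead` is a hypothetical helper.

```python
def control_overhead(width_bits, ref_width=64, ref_overhead=0.05):
    """Estimated fractional area increase from EB control logic.

    Assumption: fixed control logic area, datapath area linear in width,
    calibrated to 5% at a 64-bit channel.
    """
    return ref_overhead * ref_width / width_bits


for w in (32, 64, 128):
    print(f"{w:3d}-bit channel: {control_overhead(w):.1%} overhead")
```

Under this model the relative overhead halves each time the channel width doubles.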
Outline
Use EB flow-control through the router
Deadlock avoidance
Evaluation
Use EB Flow-Control Through the Router
VC input-buffered router versus EB router:
Input buffer replaced by input EB
VC & SW allocators removed; per-output arbiters instead
Three-slot output EB to cover for arbitration done one cycle in advance
Look-ahead (LA) routing also applicable to EB networks
Topology
2D 4x4 FBFly
Separate routers for networks
Outline
Deadlock avoidance
How to provide isolation without VCs

Evaluation
Deadlock Avoidance: Duplicate Channels

No input buffers, so no virtual channels
Three types of possible deadlocks:

1. Protocol deadlock
2. Cyclic flit dependency in network
Solution: Duplicate physical channels
Deadlock Avoidance: No Interleaving

3. Interleaving deadlock
New head flits require destination registers
Occupied destination registers depend on tail flits
Tail flits cannot bypass the new head flit
Solution: Disallow packet interleaving
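One way to disallow interleaving is to have each output arbiter lock onto the winning input from head flit to tail flit. The sketch below assumes round-robin selection among head flits and a flit represented as a dict with `head` and `tail` flags; `PacketLockArbiter` is a hypothetical name, not the router's actual arbiter.

```python
class PacketLockArbiter:
    """Hypothetical per-output arbiter that never interleaves packets."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.owner = None     # input that currently holds this output
        self.next_start = 0   # round-robin pointer for new packets

    def grant(self, requests):
        """requests[i] is None or a flit dict {'head': bool, 'tail': bool}.

        Returns the granted input index, or None if nothing is granted.
        """
        if self.owner is not None:
            flit = requests[self.owner]
            if flit is None:
                return None  # owner has a bubble; output stays reserved
            winner = self.owner
        else:
            winner = None
            for k in range(self.num_inputs):
                i = (self.next_start + k) % self.num_inputs
                if requests[i] is not None and requests[i]["head"]:
                    winner = i
                    break
            if winner is None:
                return None
            self.owner = winner
            self.next_start = (winner + 1) % self.num_inputs
            flit = requests[winner]
        if flit["tail"]:
            self.owner = None  # release the output after the tail flit
        return winner
```

Because a new head flit can win only when no packet owns the output, a tail flit is never stuck behind another packet's head flit on that output.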
Duplicating Channels Between Routers

Duplicate channels with neckdown
Small improvement (still one switch port), large cost
Duplicate channels with duplicate switch ports

Excessive cost (switch quadratic cost)
Dividing Into Sub-Networks More Efficient

Divide into sub-networks
Double bandwidth, double the cost
However, when the datapath is narrowed to normalize for throughput or power,
sub-networks are more beneficial, again due to the switch's quadratic cost
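The quadratic-cost argument can be made concrete with a toy model. It assumes crossbar cost is proportional to (ports x width) squared, which is a common first-order model rather than the cost function used in the evaluation; the port count and width below are illustrative only.

```python
def switch_cost(ports, width):
    """Relative crossbar cost, assuming cost ~ (ports * width) ** 2."""
    return (ports * width) ** 2


P, W = 5, 64  # illustrative port count and datapath width (assumptions)
base = switch_cost(P, W)
options = {
    "single network": base,
    "duplicate switch ports": switch_cost(2 * P, W),
    "two sub-networks, full width": 2 * switch_cost(P, W),
    "two sub-networks, half width": 2 * switch_cost(P, W // 2),
}
relative = {name: cost / base for name, cost in options.items()}
for name, ratio in relative.items():
    print(f"{name:30s} {ratio:.1f}x")
```

In this model, duplicating switch ports quadruples switch cost, two full-width sub-networks double it, and two half-width sub-networks carry the baseline aggregate bandwidth at half the switch cost.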
Outline
Deadlock avoidance
Propose a load metric for EB networks
Evaluation
Congestion metrics
Blocked Cycles
Blocked Ratio
Output Occupancy
Channel Occupancy
Channel Delay
Output Channel Occupancy Load Metric

Flit-buffered networks use credit count
EB networks measure output channel occupancy
At a certain segment of the output channel (shown in red)
Occupancy decremented when flits leave that segment
Incremented by a packet's length when the routing decision is made.
Packets see other decisions made in the same cycle
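The bookkeeping described above can be sketched as a set of counters, one per output. This is an illustrative model only: `OutputOccupancy` and its method names are hypothetical, and the monitored channel segment is abstracted into a single "flit left the segment" event.

```python
class OutputOccupancy:
    """Hypothetical occupancy counters, one per output channel."""

    def __init__(self, num_outputs):
        self.count = [0] * num_outputs

    def choose_and_commit(self, candidates, packet_len_flits):
        """Pick the least-occupied candidate output and charge the packet to it.

        The counter is updated immediately, so later routing decisions in the
        same cycle already see this one.
        """
        best = min(candidates, key=lambda out: self.count[out])
        self.count[best] += packet_len_flits
        return best

    def flit_left_segment(self, output):
        """Called when a flit leaves the monitored segment of the channel."""
        if self.count[output] == 0:
            raise ValueError("no flits accounted for on this output")
        self.count[output] -= 1
```

With two equally loaded outputs, two packets routed in the same cycle end up on different outputs because the second decision sees the first packet's length already added.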
Outline
Deadlock avoidance
Evaluation
Compare throughput, power, area, latency, cycle time
Evaluation Methodology
Used a modified version
Area/power estimations from a 65nm library
Input buffers modeled as SRAM cells
Throughput/power optimal # of VCs and buffer depth
Two sub-networks: request and reply
Averaged over a set of 6 traffic patterns

Constant packet size (512 bits)
Swept channel width from 28 to 192 bits
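With a constant 512-bit packet, the channel width sets the number of flits per packet. A small sketch, assuming one flit per channel width and no per-flit header overhead (both are simplifications, not stated in the slides):

```python
PACKET_BITS = 512  # constant packet size from the methodology


def flits_per_packet(channel_width_bits, packet_bits=PACKET_BITS):
    """Ceiling division: flits needed to carry one packet."""
    return -(-packet_bits // channel_width_bits)


for width in (28, 64, 128, 192):
    print(f"{width:3d}-bit channel: {flits_per_packet(width)} flits per packet")
```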
Throughput-Power Gains in 2D Mesh
EB network improvement:
Same power: 10% increased throughput
Same throughput: 12% reduced power
Throughput-Area Gains in 2D Mesh
2% improvement for EB networks
Latency-Throughput in 2D Mesh
Zero-load latency equal
Power Breakdown: No Input Buffer Power
Area Breakdown: No Input Buffer Area
Router RTL Implementation

No buffers, VCs, allocators, credits
VC router had look-ahead routing
Buffers: FF arrays. 2 VCs, 8 slots each

45nm, LP-CMOS, worst-case
Mesh 5x5 routers. DOR. 64-bit datapath
Aspect        VC router   EB router   Savings
Area (µm²)    63,515      14,730      77%
Clock (ns)    3.3         2.7         18%
Power (mW)    2.59        0.12        95%
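The savings column follows directly from the VC and EB figures. A quick check, using only the numbers in the table (`savings` is a hypothetical helper):

```python
# (VC router, EB router) values from the RTL implementation table
results = {
    "Area": (63515, 14730),
    "Clock": (3.3, 2.7),
    "Power": (2.59, 0.12),
}


def savings(vc, eb):
    """Fractional reduction of the EB router relative to the VC router."""
    return 1 - eb / vc


for aspect, (vc, eb) in results.items():
    print(f"{aspect}: {savings(vc, eb):.0%} savings")
```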
Conclusions
EB flow-control uses channels as distributed FIFOs
Removes input buffers from routers
Uses duplicate physical channels instead of VCs
Increases throughput per unit power by up to 12% for low-swing
Depends on what fraction of the overall cost input buffers constitute
Reduces router cycle time by 18%

Flow-control choice depends on design parameters
and priorities
Thanks for your attention
Questions?
