
NoC-Centric Partitioning and Reconfiguration Technologies for the Efficient Sharing of Multi-Core Programmable Accelerators
Marco Balboni
MPSoC Research Group - Engineering Department
University of Ferrara - ITALY
email: marco.balboni@unife.it
Dissertation Advisor:
Davide Bertozzi, MPSoC Research Group - Engineering Department, University of Ferrara - ITALY.

DOCTORAL DISSERTATION COLLOQUIUM


EXTENDED ABSTRACT

Abstract: Today, multi- and many-core architectures are gaining momentum as a potential source of hardware acceleration, bringing new challenges for system designers related to both system virtualization and runtime testing. My research activity tackles these challenges by exploiting and optimizing the capability of reconfiguring the routing function at runtime.

Keywords: Reconfigurable Computing & FPGA Based Architectures, Multi-Core Architectures and Support.

I. INTRODUCTION

Driven by the flexibility, performance and cost constraints of demanding modern applications, heterogeneous Systems-on-Chip (SoCs) are the dominant design paradigm in the embedded computing domain [2]. SoC architecture and heterogeneity clearly provide wider power/performance scaling, combining host CPUs with massively parallel accelerator fabrics, with the potential to bridge the gap between the energy efficiency (GOPS/W) of hardwired hardware accelerators and the computational power delivered by throughput computing. As a potential source of hardware acceleration for many different algorithms [1], such multi- and many-core architectures are gaining momentum today. At the same time, modern embedded systems need to integrate more and more complex functionalities, requiring the concurrent execution of several applications on the same hardware platform, possibly with heterogeneous and time-varying performance/reliability/power requirements. They thus cope with the rapidly growing demand for a new type of interaction between the user and the device, based on an understanding of the environment sensed in multiple ways (image, motion, sound, etc.) and striving to create friendlier user interfaces (augmented reality, virtual reality, haptics, etc.). This new scenario brings new challenges for system designers.

One of the main challenges is related to system virtualization, which strengthens the need for an optimized usage of parallel hardware resources to cope with this increased level of resource contention and dynamic application behavior. Partitioning of array fabrics of homogeneous processor cores and isolation of the derived partitions are gaining momentum as a means of integrating functionality from separate users/devices onto NoC-based many-core processors, while meeting their potentially heterogeneous requirements. Following this trend, the traditional time and space partitioning concept is being extended to parallel hardware platforms to overcome the challenge of using shared (yet modular) resources in applications that are executed concurrently. However, a static partitioning scheme cannot keep up with the increased levels of adaptivity of modern embedded systems; therefore flexible partitioning should be the target. In practice, partitions should be set up or torn down with few or no restrictions, and their size and shape potentially changed at runtime [3]. Whether such a usage paradigm will be feasible or not depends to a large extent on the capability of reconfiguring at runtime the routing function of the on-chip network (NoC), which serves as the global communication fabric as well as the system integration framework.

In the on-chip domain, runtime reconfiguration of the routing function can be achieved either by non-reconfigurable fault-tolerant routing strategies, which tolerate a limited number of faults [5]–[8], or by reconfigurable routing mechanisms that allow unlimited changes to the network. Focusing on schemes of the second category, both static and dynamic reconfiguration methods are presented in the literature. The former consist of draining the network of ongoing packets, modifying the routing tables to configure the new routing paths, and finally resuming traffic injection [9], [10], but at the cost of large performance penalties. In contrast, dynamic reconfiguration techniques succeed in updating routing tables without stopping user traffic, but typically result in unacceptable implementation overheads for an on-chip setting [4], [11]–[17]. Although runtime performance is more likely to be preserved, such approaches end up materializing architectures with lower operating speeds (or higher latencies) by construction. Furthermore, current approaches to runtime network configuration suffer from large hardware/software overhead and/or lack of scalability. In general, centralized approaches have the disadvantage that some reconfiguration tasks (e.g., the computation of the new routing function) are performed in software. In contrast, distributed reconfiguration mechanisms, like [25], suffer from sub-optimality of emergency routing solutions and overly high implementation cost and complexity.

An intensive research effort is currently underway in an attempt to find a suitable design point for chip implementations, including Vicis [18], Immunet [19], Ariadne [20] and other reconfigurable routing frameworks [21]–[23]. With respect to these works, the recent adaptation of the Overlapped Static Reconfiguration (OSR) [4] methodology to the tight resource budgets of embedded systems (OSRLite [24]) provided an appealing trade-off between reconfiguration performance and implementation complexity. OSR relies on the principle that if packets with the old routing function are prevented from following packets using the new one, deadlock cannot occur.
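
The ordering principle stated above can be illustrated with a minimal sketch. This is not the OSR or OSRLite implementation: the `Channel` class, the `try_send` method and the `"old"`/`"new"` epoch labels are hypothetical names introduced here only to show the rule that, on a given channel, a packet routed with the old function must never follow a packet routed with the new one.

```python
from dataclasses import dataclass, field
from typing import List

# Epoch labels are illustrative assumptions, not identifiers from OSR.
OLD, NEW = "old", "new"


@dataclass
class Channel:
    """Toy model of one network channel enforcing the ordering rule:
    no old-epoch packet may be sent after a new-epoch packet."""
    seen_new: bool = False
    accepted: List[str] = field(default_factory=list)

    def try_send(self, epoch: str) -> bool:
        # An old packet arriving after a new one would violate the rule,
        # so it is refused (a real switch would block or reroute it).
        if epoch == OLD and self.seen_new:
            return False
        if epoch == NEW:
            self.seen_new = True
        self.accepted.append(epoch)
        return True


def respects_ordering(epochs: List[str]) -> bool:
    """Return True if the whole sequence can traverse one channel
    without an old packet ever following a new one."""
    channel = Channel()
    return all(channel.try_send(epoch) for epoch in epochs)
```

Under these assumptions, a sequence such as old, old, new, new passes the check, while old, new, old is rejected at its third packet. The sketch covers only the per-channel ordering rule and says nothing about how a real network enforces it.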

978-1-4673-7813-0/15/$31.00 ©2015 IEEE


Enforcing this ordering mechanism is possible even without draining the network of ongoing packets, by propagating a separation token between old and new packets throughout the network. Notwithstanding that, several performance inefficiencies still affect the OSR mechanism. They can be fundamentally identified as the temporary suspension of traffic injection during the reconfiguration transient, the blocking of packets behind the self-propagating epoch separation boundary, and the network-wide nature of each reconfiguration event. Furthermore, the OSR mechanism is centralized (so the manager is on the critical path of the reconfiguration process, and also needs a separate control network or virtual channel for communications), and can be triggered only from the root node of the network, not from any node.

The aim and starting point of my research activity is to overcome the limitations and issues of OSRLite and realize its potential, thus designing an interconnection fabric with features suitable for a virtualized and dynamic environment.

II. PROPOSED METHOD AND SOLUTIONS

On one side, in order to tackle the inefficiencies of the mechanism that impact the performance of ongoing communication flows while a reconfiguration takes place, I propose a set of performance optimizations ranging from simple to more aggressive ones, trading performance speedups (approaching latency-insensitive reconfiguration) for a higher implementation cost, and at the same time aiming at speeding up the reconfiguration transient itself [26]. To achieve these goals, I first changed the logic that controls and manages token propagation, through careful engineering of the switching hardware to avoid critical races and/or inconsistent states. This makes it feasible to restrict the reconfiguration procedure to only the affected partitions of the network, cutting the duration of the reconfiguration transient by limiting the involved area. The feasibility of local reconfiguration also means having the possibility to set up boundaries between different regions of the NoC, thus enabling more flexible scenarios, i.e., partition scheduling and reshaping (restriction and expansion of an existing partition, merging of two partitions and splitting into two partitions), avoiding faulty links or switches, powering off unused or overheated regions, and setting up or tearing down reserved paths (QoS), in this way creating the support for a highly dynamic environment. Furthermore, to overcome the performance penalties, I propose two possible solutions. On one hand, I prevent the blocking of traffic injection at switch local ports caused by the unsynchronized arrival of new routing functions and of the tokens at switch input ports. This is done by implementing control logic inside the routers that synchronizes those events by postponing the notification of the new routing functions, thus letting the local ports continue injecting packets. On the other hand, I rely on the fact that during reconfiguration the old routing paths are still available and can be used by the packets whenever a duplication of the routing function registers is affordable (an area overhead must be taken into account in this case). In practice, I perform an epoch conversion of the packets that would be using the old routing function instead of the new one, hence not breaking the OSR assumption, while at the same time crossing the token propagation barrier. This latter optimization is not restricted to switch local ports (Fig. 1, blue plot), but can in principle be applied to all input ports, and in this case it provides a transparent reconfiguration (Fig. 1, yellow plot), as the experimental results prove.

[Plot not reproduced. Y axis: maximum latency (cycles); X axis: execution time (cycles), at medium injection rate. Legend labels: OSR-Lite, OSR-Lite-opt1, OSR-Lite-opt2, OSR-opt, OSR-aggressive-opt. Annotated values: 276, 256 and 227 cycles.]
Figure 1. Maximum message latency in the NoC with medium injection rate on an 8x8 mesh. At cycle 150000 a reconfiguration of the routing function is triggered.

[Plot not reproduced. Slide title: Experimental Evaluation. Reconfiguration latency for a scaled NoC: comparison with SOA; average for an 8x8 mesh network. Y axis: average reconfiguration latency (cycles) on an 8x8 mesh; bars: OSR, Global TUNNEL, Local TUNNEL, BLINC, ARIADNE, with an inset for Local TUNNEL and BLINC. Slide notes: under the same operating conditions the optimized T-OSR mechanism achieves a 35% speedup with respect to the closest competitor (BLINC); T-OSR worst-case latency (region of 8 switches) stays constant independently of the network size.]
Figure 2. Reconfiguration latency for a NoC: comparison with SOA considering an 8x8 mesh.

Another contribution of my work consists of tackling the hardware/software overhead of a centralized mechanism, starting from the following perspective: the synergistic exploitation of routing resources that are already there in many NoC implementations. In a sense, there is an overhead that is increasingly accepted in NoC design and justified by other design goals, namely the use of multiple physical networks instead of logical ones. Although this seems to run contrary to much previous work [28], [29], it is actually motivated by how the relative costs of network design change for implementation on a single die [27]. Given this, the solution I proposed is to exploit the existing multiple physical networks to spatially separate resource allocations that may close dependency cycles: whenever a switch port processing old traffic has a routing dependency with a port already migrated to the new epoch, an escape path is set up into another network plane and taken by the packets, in this way avoiding deadlock. I therefore developed a reconfiguration methodology around this basic idea, called Tunneled-OSR [31], which, through an engineered protocol of tunnel request, opening and propagation and of token management, lets the system perform a distributed and fast reconfiguration. All routers are able to trigger the token propagation (Global T-OSR), as well as a reconfiguration process confined to an affected region, which is a further optimization of the mechanism (Local T-OSR). During the reconfiguration transient, tunnels are opened, creating a boundary around the area of the network that needs to reconfigure the routing function: the packets that want to cross it are rerouted and tunneled into the escape network, which is the key requirement of the proposed mechanism. So, for instance, the mechanism could be suitable for networks carrying intra-partition or inter-core traffic in a many-core processor, reconfigured on top of a global network (for I/O or memory controller communication). Alternatively, a network carrying one message type could be reconfigured on top of a network carrying a different message type, provided the two message types do not form a dependency chain raising the risk of message-dependent deadlock. For instance, memory request messages cannot be tunneled into a response network, and vice versa. Nonetheless, another case falls within reach of this work, that is, multiple networks with multiple virtual
channels each. Finally, we also demonstrated a substantial improvement over the state of the art in terms of reconfiguration latency (Fig. 2 shows up to 35% speedup compared to the best competitor, BLINC [25]), area overhead, impact on the performance of running traffic, and scalability to large networks. The counterpart of opening tunnels is a perturbation of the traffic of the escape network but, since the transient is very fast, this perturbation dies out quickly.

III. CURRENT STATUS

Summarizing, with my research activity I have optimized the OSRLite reconfiguration mechanism to make it suitable for highly dynamic and shared execution environments, based on the principle of flexible network partitioning. Reconfigurations do not require draining the network of ongoing traffic, and are local to the affected partitions. I have proposed different optimization strategies for network injectors to match increasing resource budgets. In the limit, I prove that fully transparent network reconfiguration is feasible. Secondly, I showed that the synergistic exploitation of multiple physical networks can lead to a fast, low-impact and scalable dynamic reconfiguration of the routing function at runtime. I bound the area affected by a reconfiguration and devised a mechanism for the fast yet controlled switching of the routing function to the new epoch within it. I rely on concurrent token and tunnel propagation, and I showed minimum perturbation of the escape NoC, and only for a very short amount of time with respect to the reconfiguration latencies of competing approaches. Finally, the mechanism can scale to a large number of cores, thus coping with the scalability requirements of embedded systems. Furthermore, the optimizations implemented in my research work pave the way for the frequent and fast partition reconfigurations that future applications will require to handle workload adaptivity, fault tolerance and quality of service.

All the experimental results were collected, and the proposed optimizations were modeled and simulated, with cycle accuracy in RTL-equivalent SystemC by augmenting the baseline VirtualSoC [32] simulation environment, and also through the synthesis of the xpipesLite switch [30] after a conversion into synthesizable Verilog code. I also realized demonstrators of the optimized mechanism at work, using a Xilinx Virtex7 FPGA to validate the real feasibility of the proposed systems and architecture, which were also tested with several real applications.

IV. PLANS TO COMPLETE THE RESEARCH

As future work, I want to consistently develop the implications of such routing reconfiguration capability for the upper layers of the design hierarchy. The first and foremost implication is on the concept of space partition, that is, on the grouping of neighboring computation units in a homogeneous parallel computing fabric to accommodate a single (parallel) application. Thanks to the reconfiguration property of the interconnect fabric, I will be able to introduce the concept of a space partition that is flexible in shape and size, thus opening up unprecedented opportunities for resource utilization and power efficiency. In turn, this poses requirements on the runtime manager of the system, which should be able to support such flexibility by implementing some kind of application versioning. Thus I can contribute to evolving programmable accelerators towards unprecedented levels of runtime reconfiguration through a cross-layer approach to design, optimization and programming.

REFERENCES

[1] A. Majumdar, S. Cadambi, M. Becchi, S. T. Chakradhar and P. Graf, "A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification," ACM TACO, March 2012.
[2] D. Melpignano, L. Benini, et al., "Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications," DAC, 2012.
[3] R. Hilbrich and J. Reinier van Kampenhout, "Partitioning and Task Transfer on NoC-based Many-Core Processors in the Avionics Domain," Workshop Entwicklung zuverlässiger Software-Systeme, 2011.
[4] O. Lysne, J. Montanana, J. Flich, J. Duato, T. Pinkston, and T. Skeie, "An efficient and deadlock-free network reconfiguration protocol," IEEE Transactions on Computers, vol. 57, no. 6, pp. 762–779, 2008.
[5] W. Dally, L. Dennison, D. Harris, K. Kan, and T. Xanthopoulos, "The reliable router: A reliable and high-performance communication substrate for parallel computers," in Proceedings of the Workshop on Parallel Computer Routing and Communication (PCRCW), May 1994.
[6] C. Glass and L. Ni, "Fault-tolerant wormhole routing in meshes without virtual channels," IEEE T. Parallel and Distributed Systems, 1996.
[7] M. Gomez, J. Duato, J. Flich, P. Lopez, A. Robles, N. Nordbotten, O. Lysne, and T. Skeie, "An efficient fault-tolerant routing methodology for meshes and tori," Computer Architecture Letters, vol. 3, no. 1, 2004.
[8] C.T. Ho and L. Stockmeyer, "A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers," IEEE Transactions on Computers, vol. 53, no. 4, pp. 427–439, 2004.
[9] K. M. et al., "Fibre channel switch fabric-2 (fc-sw-2)," NCITS 321-200x T11/Project1305-D/Rev 4.3 Specification, Tech. Rep., March 2000.
[10] M. Schroeder, et al., "Autonet: a high-speed, self-configuring local area network using point-to-point links," IEEE Journal on Selected Areas in Communications, vol. 9, no. 8, pp. 1318–1335, October 1991.
[11] R. Casado, A. Bermudez, J. Duato, F. Quiles, and J. Sanchez, "A protocol for deadlock-free dynamic reconfiguration in high-speed local area networks," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 2, pp. 115–132, February 2001.
[12] O. Lysne and J. Duato, "Fast dynamic reconfiguration in irregular networks," in Proceedings of the 2000 International Conference on Parallel Processing (ICPP).
[13] T. Pinkston, R. Pang, and J. Duato, "Deadlock-free dynamic reconfiguration schemes for increased network dependability," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 8, pp. 780–794, 2003.
[14] J. Duato, O. Lysne, R. Pang, and T. Pinkston, "Part I: A theory for deadlock-free dynamic network reconfiguration," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 5, pp. 412–427, May 2005.
[15] O. Lysne, T. Pinkston, and J. Duato, "Part II: A methodology for developing deadlock-free dynamic network reconfiguration processes," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 5, pp. 428–443, May 2005.
[16] D. Avresky and N. Natchev, "Dynamic reconfiguration in computer clusters with irregular topologies in the presence of multiple node and link failures," IEEE Transactions on Computers, 2005.
[17] J. Acosta and D. Avresky, "Intelligent dynamic network reconfiguration," in Proceedings of the 21st IPDPS.
[18] D. Fick, A. DeOrio, J.H., V. Bertacco, D. Blaauw, and D. Sylvester, "Vicis: A reliable network for unreliable silicon," in DAC 2009.
[19] V. Puente, J. Gregorio, F. Vallejo, and R. Beivide, "Immunet: A cheap and robust fault-tolerant packet routing mechanism," in Proceedings of the 31st Annual International Symposium on Computer Architecture.
[20] K. Aisopos, A. DeOrio, L.-S. Peh, and V. Bertacco, "Ariadne: Agnostic reconfiguration in a disconnected network environment," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011.
[21] J. Flich, A. Mejia, P. Lopez, and J. Duato, "Region-based routing: An efficient routing mechanism to tackle unreliable hardware in network on chips," in Proceedings of NOCS07.
[22] C. Feng, Z. Lu, A. Jantsch, J. Li, and M. Zhang, "A reconfigurable fault-tolerant deflection routing algorithm based on reinforcement learning for Network-on-Chip," in Proceedings of NocArc, 2010.
[23] Z. Zhang, A. Greiner, and S. Taktak, "A reconfigurable routing algorithm for a fault-tolerant 2D-mesh Network-on-Chip," in Proceedings of the 46th Design Automation Conference (DAC).
[24] A. Strano, D. Bertozzi, et al., "OSR-Lite: Fast and deadlock-free NoC reconfiguration framework," SAMOS 2012.
[25] D. Lee, R. Parikh, V. Bertacco, "Brisk and Limited-Impact NoC Routing Reconfiguration," in DATE 2014.
[26] M. Balboni, F. Trivino, J. Flich, D. Bertozzi, "Optimizing the Overhead for Network-on-Chip Routing Reconfiguration in Massively Parallel Multi-Core Platforms," Int. SoC Symposium, 2013.
[27] D. Wentzlaff et al., "On-Chip Interconnection Network Architecture of the Tile Processor," IEEE Micro, vol. 27, no. 5, pp. 15–31, 2007.
[28] F. Gilabert, M.E. Gomez, S. Medardoni, D. Bertozzi, "Improved utilization of NoC channel bandwidth by switch replication for cost-effective multi-processor systems-on-chip," pp. 165–172, NOCS 2010.
[29] Y.-J. Yoon, N. Concer, M. Petracca, L. P. Carloni, "Virtual Channels and Multiple Physical Networks: Two Alternatives to Improve NoC Performance," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 32, no. 12, pp. 1906–1919, 2013.
[30] S. Stergiou, F. Angiolini, S. Carta, L. Raffo, D. Bertozzi, G. De Micheli, "xpipes Lite: A Synthesis Oriented Design Library For Networks on Chips," DATE 2005, pp. 1188–1193.
[31] M. Balboni, J. Flich, D. Bertozzi, "Synergistic Use of Multiple On-Chip Networks for Ultra-Low Latency and Scalable Distributed Routing Reconfiguration," DATE 2015.
[32] D. Bortolotti, C. Pinto, A. Marongiu, M. Ruggiero and L. Benini, "VirtualSoC: a Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip," IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013.
