
Seeking (the right) Problems for the Solutions of Reconfigurable Computing

Bernardo Kastrup, Jef van Meerbergen, and Katarzyna Nowak

Philips Research Laboratories, Prof. Holstlaan 4 (WL11), 5656 AA Eindhoven, The Netherlands.
Tel.: +31 40 274 4421. Fax: +31 40 274 4004.
{kastrup, meerberg, knowak}@natlab.research.philips.com

Abstract. After a decade of active research, Reconfigurable Computing (RC) has yet to prove itself competitive enough to establish a commercial presence. Why? This paper looks for reasons from a pragmatic perspective based on cost-effectiveness. It illustrates a think-model of four dimensions of reasoning that can help evaluate the efficiency of RC approaches. It contributes to augmenting an RC classification system with choice criteria for each category. RC is found cost-effective for some embedded applications, provided certain guidelines are observed. We try to point out practical ways to identify those cases.

1 Introduction

This paper discusses issues facing Reconfigurable Computing (RC) as a promising paradigm for the high-volume computing market. We will address the use of Field-Programmable Logic (FPL) as a core feature of computing systems, not as a convenience for prototyping.
The motivation for the RC paradigm is the promise to combine the best of the worlds of ASICs and programmable processors. Lower power dissipation is also a driving force behind it. In spite of amazing progress in the last 10 years, however, RC has not yet found its way into the computing market. The lack of a clear commercial application is acknowledged in the community [1]. On the eve of the new millennium, in a time for review and contemplation, we ask ourselves: why? Why has such a promising approach not yet proven sufficiently competitive to become a viable commercial option in the high-volume computing mainstream?
While not pretending to come up with answers to these questions, this paper tries to contribute to the discussion from a pragmatic and applications-oriented perspective. We try to survey well-known notions regarding the utilisation of RC and organise them into a structured framework.

2 Drawbacks of FPL

General-purpose FPL architectures are fine-grained or (at most) medium-grained. The drawbacks of these architectures are well known: (1) FPL implementations of digital circuits represent a huge area overhead compared to their dedicated-logic counterparts. This is particularly true for heavy arithmetic circuits like multipliers; (2) in terms of performance, FPL is ill suited for regular arithmetic computations like full-word additions and multiplications, or variable-length shifts, with respect to ASIC equivalents.
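To make drawback (2) concrete, the contrast below is an illustrative sketch of ours (not taken from the paper), written in C: a full-word multiply-accumulate, for which a dedicated multiplier is far smaller and faster than a LUT-based FPL fabric, versus an irregular bit-level manipulation, which fine-grained FPL implements with a handful of look-up tables while a fixed-word processor needs a long shift/mask instruction sequence.

#include <stdint.h>

/* Regular, word-level arithmetic: a multiply-accumulate.
   A hard-wired multiplier handles this far more cheaply and faster
   than general-purpose, LUT-based FPL. */
uint32_t mac(uint32_t acc, uint16_t a, uint16_t b)
{
    return acc + (uint32_t)a * b;
}

/* Irregular, bit-level logic: interleave the bits of two bytes.
   In FPL this is essentially wiring plus a few LUTs; on a fixed-word
   CPU it becomes the loop of shifts, masks and ORs below. */
uint16_t interleave_bits(uint8_t x, uint8_t y)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++) {
        r |= (uint16_t)(((x >> i) & 1u) << (2 * i));
        r |= (uint16_t)(((y >> i) & 1u) << (2 * i + 1));
    }
    return r;
}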

3 A Simplified RC Classification System


Although [11] proposes a more detailed classification system, for the purposes of this paper we can classify hybrid RC devices (FPL+CPU) simply into: (1)
Loosely coupled. Devices featuring FPL resources as a co-processor. Typically, co-
processor and host-processor are connected via a bus and can operate concurrently, in
an asynchronous fashion. To exploit concurrent computation, typically large segments
of the application are mapped onto the co-processor, which must have a relatively
large amount of FPL resources available; (2) Tightly coupled. The FPL is integrated
within the datapath of the host-processor (compile-time scheduling). The tight
integration eliminates the problem of synchronisation and communication latency
between host and FPL unit. There are two main lines of usage: (2.1) Using the FPL
unit for building deep custom pipelines (e.g. [10]), with relatively long execution
latency; (2.2) Using the FPL unit as a Reconfigurable Functional Unit (RFU) (e.g.
[8]), in a way analogous to the use of an ALU or a multiplier. The execution latency is
typically low. RFUs are considerably smaller than reconfigurable co-processors.
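As an informal sketch of what the two coupling styles look like from the programmer's point of view (all register addresses, names and the ASI intrinsic below are hypothetical, not taken from any of the cited platforms): a loosely-coupled co-processor is typically driven through memory-mapped registers and synchronised explicitly, whereas a tightly-coupled RFU shows up as one extra instruction scheduled by the compiler.

#include <stdint.h>

/* (1) Loosely coupled: FPL co-processor on a bus, seen through
   hypothetical memory-mapped registers. */
#define COPRO_ARG    (*(volatile uint32_t *)0xB0000000u)
#define COPRO_CTRL   (*(volatile uint32_t *)0xB0000004u)
#define COPRO_RESULT (*(volatile uint32_t *)0xB0000008u)
#define COPRO_BUSY   0x1u

uint32_t copro_run(uint32_t arg)
{
    COPRO_ARG  = arg;               /* pass data to the mapped kernel       */
    COPRO_CTRL = 1u;                /* start it; host may now work on       */
    while (COPRO_CTRL & COPRO_BUSY) /* ...something else, here we just poll */
        ;
    return COPRO_RESULT;
}

/* (2) Tightly coupled: the FPL is an RFU inside the datapath.
   A hypothetical compiler intrinsic issues one Application-Specific
   Instruction; scheduling is done at compile time, so no explicit
   synchronisation code is needed. */
extern uint32_t __rfu_asi(uint32_t a, uint32_t b);  /* assumed intrinsic */

uint32_t rfu_run(uint32_t a, uint32_t b)
{
    return __rfu_asi(a, b);         /* executes like an ALU operation */
}

The point of this sketch is only the difference in integration and synchronisation effort; the actual interfaces differ from platform to platform.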

4 Big, But Still Cost-Effective?

The area overhead of FPL is largely inherent to its flexibility. The counter-argument
is that FPL is re-usable, in the context of reconfiguration. The same piece of silicon,
re-used repeatedly for different circuit implementations, can justify the area penalty it
implies for a single implementation. Re-usability through reconfiguration is the only
justification for the silicon overhead of FPL implementations, to the extent that the
FPL resources are used to implement a number of different digital circuits throughout
the device's lifetime. For high-volume electronics, however, hardware reconfigurability offers no advantage when: (1) the algorithms a computing device must run during its operating lifetime are known at design-time; in this case, ASICs are more cost-effective (see the discussion in [3]); (2) standard programmable architectures can fulfil the performance requirements. These architectures can be customised before fabrication [6], which is claimed to be viable even for low-volume production [7]. The stability of the hardware platform makes application programming tools much easier to develop for standard programmable platforms. Reconfigurable Computing platforms are far behind programmable architectures (and their well-developed compiler technology counterpart) in terms of programming friendliness.
The closer an RC platform is to the standard programmable architectures, the
greater are the possibilities of adapting standard compiler technology to make
application programming less of a problem. Tightly-coupled platforms are suggested as good candidates in this respect (Section 3).
The less a device is targeted at a specific application, the more unknown-at-design-
time algorithms it must run in operation, and the better it can benefit from hardware
reconfigurability. See Figure 1. However, the performance deficiency aspect of FPL
(Section 2) must also be taken into account.


Fig. 1. Cost-effectiveness of the RC paradigm, and efficiency of the FPL, in perspective.

5 The Wrong Problem for the Right Solution

Once the usefulness of reconfigurability has been verified, there are yet other issues to
look at. As the enabling technology of the RC paradigm, FPL is a promising solution
for a wide range of problems. A solution, however, that comes in different flavours.
The choice of the right flavour for the right problem is not always obvious.
Mapping a multiply-rich application segment onto a general-purpose FPL
architecture is like using a hammer to tighten a screw. If the target applications are
known to be biased towards a certain kind of computation, a suitable FPGA
architecture can be chosen that performs best (and with the least silicon overhead) for
that particular kind of computation. For instance, “island-style” FPGAs [4], like the
Xilinx XC4000 family, have arbitrary long-distance communication lines suitable for
complex, irregular random logic. In contrast, fine-grained “cellular-style” FPGAs [4],
like Atmel's AT6K family, are better suited for highly local, pipelined circuits
such as systolic arrays. Hauck [5] discusses those issues thoroughly.
Architectural optimisations that improve FPL performance for regular DSP arithmetic have been developed more extensively in academia, in the form of coarse-grained FPGAs [9][10] (or "chunky functional units", as in [1]). Hard-wired computing cores such as ALUs or multipliers are embedded into the framework of a
reconfigurable interconnect matrix. This allows for a boost in performance and a
reduction in the area overhead for the target applications. The loss in flexibility, in
turn, renders chunky units inefficient for irregular bit-wise computations. Another
limitation is that reduction of order is no longer possible.
Generally speaking, the FPL architecture can be fine-tuned towards a specific set
of applications (a domain) by varying the degree of flexibility of the interconnect and
the logic blocks, and by specific performance-enhancing features. This fine-tuning, in
turn, usually renders the FPL inefficient for other application domains.
Returning to our point regarding reconfigurable general-purpose processors
(Section 4), it is likely that any such device would be required to run as many DSP-like computing kernels as anything else, due to its broad application nature. General-
purpose computing, however, is one extreme of a spectrum whose opposite end is occupied by ASIC devices. A fundamental dilemma in RC then
becomes clear in Figure 1. In our view, the essential design challenge is to find an
application domain wide enough to justify hardware reconfigurability, while specific
enough to allow for proper fine-tuning of the FPL resources. Different domains may
require different FPL flavours and different integration methods (see Section 3).

6 There Is No Free Lunch

Impressive speed-ups achieved with the use of FPL in different computing applications have been reported (for instance, [10]). However, as with any other
promising technology out there, the use of FPL for computations involves a trade-off.
FPL is not cost-effective for computing if the benefits it offers are not necessary or
desired in the context of the trade-off it implies.
Brebner [2] has discussed issues involving the use of control-flow (“computing in
time”) and data-flow (“computing in space”) approaches in the framework of RC. He
notes that RC platforms support both control and data-flow programming. The speed-
ups FPL allows for, when compared to programmable processors, are related to the
fact that the intrinsic parallelism of an application can be fully exploited in hardware,
in a data-flow computing fashion. Computing units (at whatever level, from logic
gates to full multipliers) can be replicated as necessary for parallel data manipulation
(limited only by the amount of programmable logic available in the device). In this
context, coprocessor platforms (see Section 3) are suggested as the best candidates,
due to the large amounts of FPL resources they deploy. Small RFUs typically cannot
achieve the same level of speed-up, but represent a modest investment in silicon. On
the other hand, standard programmable architectures use the control-flow computing
paradigm, and process data in a sequential way. A defined and limited number of
computing units (functional units), executing pre-defined instructions, is utilised for
all data manipulations. Hardware is not spatially replicated, but cyclically re-used
over a period of time (without reconfiguration). This typically leads to a more
compact and cheaper hardware implementation. The orders-of-magnitude speed-ups reported for FPL are therefore the consequence of resource replication.
There is no panacea here. High levels of parallelism and hardware replication (i.e., FPL utilisation) are only justifiable if the resulting area/performance trade-off is cost-effective. In spite of all the real benefits Reconfigurable Computing allows for, it is then not surprising that the trade-off often renders FPL not competitive at all.
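To make the space/time contrast tangible, the following sketch (our own illustrative example, not drawn from [2] or [10]) shows a small 8-tap FIR sum in both styles: the loop is the control-flow form, in which one multiply-accumulate unit is re-used over eight cycles, while the unrolled expression is the data-flow form that, mapped onto FPL, could be implemented with eight replicated multipliers feeding an adder tree.

#include <stdint.h>

#define TAPS 8

/* Control-flow ("computing in time"): a single functional unit is
   cyclically re-used, one tap per iteration. Compact hardware,
   TAPS steps of latency. */
int32_t fir_sequential(const int16_t x[TAPS], const int16_t h[TAPS])
{
    int32_t acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Data-flow ("computing in space"): the same computation as one
   expression. On FPL, each product can get its own multiplier and
   the sum becomes an adder tree, at the cost of replicated silicon. */
int32_t fir_parallel(const int16_t x[TAPS], const int16_t h[TAPS])
{
    return (int32_t)x[0]*h[0] + (int32_t)x[1]*h[1]
         + (int32_t)x[2]*h[2] + (int32_t)x[3]*h[3]
         + (int32_t)x[4]*h[4] + (int32_t)x[5]*h[5]
         + (int32_t)x[6]*h[6] + (int32_t)x[7]*h[7];
}

The speed-up of the second form comes entirely from resource replication, which is exactly the silicon cost the trade-off discussed above has to justify.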

7 Wrap-Up

A cost-effective implementation of the RC paradigm will depend upon four main dimensions of reasoning, which define a think-model:
Dimension 1. Big, but cost-effective. The question the designer should ask himself is: Is there demand for hardware reconfigurability?
Dimension 2. The right problem for the right solution (or the tale of the hammer and the screw). The question is: Is it the right FPL flavour?
Dimension 3. There is no free lunch (or the data-flow versus control-flow contest). The question is: To what extent do I need a 'computing in space' approach?
Dimension 4. Specifics. Particular requirements must be taken into account, like the
need for low-power, user-friendly programming tools, device testability, etc.
Table 1 below is derived from the application of this think-model.

Table 1. Mapping application requirements to the classification categories of RC platforms.


General-purpose, fine-grained FPL
- Loosely-coupled: high-throughput concurrent processing; bit-level computations; real-time; dynamic rate (e.g. networking).
- RFUs: modest investment in FPL; multi-threaded platform; programmers have no hardware background; gradual transition to RC; alternative to bigger caches and faster clock (e.g. embedded crypto).

Domain-specific, fine-grained, cellular-style FPL
- Loosely-coupled: order-reducibility; high-throughput concurrent processing; systolic computations; real-time; dynamic rate (e.g. radar apps).
- Deep pipelines: order-reducibility; high-throughput and low-power; mix of word- and bit-level computations; fixed rate (e.g. finite-field computations).

Domain-specific, coarse-grained FPL
- Loosely-coupled: high-throughput concurrent processing; word-level computations; real-time; dynamic rate (e.g. video I/O processing).
- Deep pipelines: high-throughput; alternative to superscalarity; fixed rate (e.g. filter sections).
- RFUs: modest investment in FPL; multi-threaded platform; programmers have no hardware background; gradual transition to RC; data-parallel processing; alternative to increased superscalarity (e.g. multimedia instruction set extensions).

8 An Example: Philips ConCISe

Philips ConCISe [8] is a tightly-coupled single-RFU approach based on a fine-grained CPLD architecture. The CPLD is placed in parallel with the ALU in the execution
stage of a RISC pipeline, and can execute Application-Specific Instructions (ASIs).
Standard processors are known to be very inefficient for bit-level operations due to
their fixed word-size. The ConCISe RFU can benefit a broad set of applications by easing that limitation. Multiple, small application segments, different
for each application it runs, can be mapped onto the RFU (dimension 1). The CPLD
architecture is optimised for bit-level manipulations, and the compiler makes sure the
RFU is never used for any other sort of operation (dimension 2). This is possible due
to the tightly-coupled RFU integration approach used (see Section 3). The CPLD core
occupies an estimated 4 mm² of silicon surface (yet to be confirmed). This is a modest investment when compared to the 22 mm² of a MIPS PR3930 in the same 0.35 µm process. We expect ConCISe to allow for up to a factor of 2 speed-up in some critical
applications (namely in the cryptography domain). The small investment ConCISe
represents makes it cost-effective (dimension 3). A special compilation chain has
been developed, which automatically translates application segments into hardware
descriptions for the RFU. The device is as easy to program as any microprocessor,
therefore representing no extra costs in the programming flow (dimension 4).
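As an illustration of the kind of segment such a compilation chain might extract (a hypothetical kernel of our own choosing, not one reported for ConCISe), consider reversing the bits of a byte: on a fixed-word RISC it is a loop of shifts and masks, while in a CPLD it is mostly routing plus a few product terms, which makes it a natural candidate for a single Application-Specific Instruction.

#include <stdint.h>

/* Bit reversal of a byte: expensive on a fixed-word processor
   (a shift/mask loop), nearly free in CPLD logic (mostly wiring).
   In a ConCISe-like flow, the whole function could in principle be
   replaced by one ASI executing in the RFU. */
uint8_t reverse8(uint8_t v)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}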

9 Conclusions

Reconfigurable Computing (RC) implies a number of trade-offs in terms of: (1) the
demand for hardware reconfigurability, (2) the choice of the correct FPL architecture
for the kind of computations at hand, (3) the silicon overhead of a “computing in
space” approach, and (4) special requirements related to the target applications. These
trade-offs spawn a 4-dimensional think-model that can help evaluate the cost-
effectiveness of RC approaches. The RC paradigm is not a cost-effective solution
where the trade-offs do not lead to a commercial edge.

References
1. W. H. Mangione-Smith et al. "Seeking Solutions in Configurable Computing", Computer, 30(12), pp. 38-43, December 1997.
2. G. Brebner. "Field-Programmable Logic: Catalyst for New Computing Paradigms", Proc. of Field-Programmable Logic and Applications, pp. 49-58, Estonia, 1998.
3. R. Wilson. "Large PLDs face big uncertainties", EE Times, Issue 1050, March 1st, 1999. http://www.techweb.com/se/directlink.cgi?EET19990301S0002
4. S. Trimberger. "Field-Programmable Gate Array Technology", Kluwer, MA, 1994.
5. S. Hauck. "The Roles of FPGA's in Reprogrammable Systems", Proc. of the IEEE, Volume 86, Issue 4, April 1998.
6. D. Bursky. "Tool Suite Enables Designers to Craft Customized Embedded Processors", Electronic Design, pp. 33-38, February 8, 1999.
7. Wolfe. "HP lays foundation for embedded's future", EDTN Network, March 1st, 1999. http://www.edtn.com/story/tech/OEG19990226S0010-R
8. B. Kastrup et al. "ConCISe: A Compiler-Driven CPLD-Based Instruction Set Accelerator", Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, Napa Valley, April 1999.
9. R. W. Hartenstein et al. "Using the KressArray for Configurable Computing", Proc. of SPIE, Vol. 3526, Boston, MA, November 2-3, 1998.
10. C. Ebeling et al. "RaPiD – Reconfigurable Pipelined Datapath", Proc. of Field-Programmable Logic and Applications, 1996.
11. B. Radunović et al. "A Survey of Reconfigurable Computing Architectures", Proc. of Field-Programmable Logic and Applications, pp. 376-385, Estonia, 1998.
