Case Study:
Cryptography Computing Cluster
Blends FPGAs, Quad CPUs
Designed for military cryptography applications, a cluster computing design
marries an array of FPGA computing resources and power control processors tied
together over PCI Express.
Steve McReynolds, Technical Support Manager
Trenton Technology
Mark Hur, Director of Marketing
Pico Computing

Choosing the right architecture can make a huge difference in how complex signal processing computing applications perform. Applications such as weather prediction, bomb blast analysis and financial modeling, for example, require floating-point computations. Meanwhile, other similarly complex applications such as the Smith-Waterman algorithm, military cryptography and image processing do not typically require floating-point computations. For non-floating-point applications, large arrays of FPGAs can dramatically speed up the computation of the overall solution algorithm.

This case study takes a look at how an array of up to 84 FPGAs, managed by a PICMG 1.3 System Host Board (SHB) that communicates with the FPGA boards over PCI Express links on a PICMG 1.3 backplane, enables 10,000:1 speed improvements in typical hardware cryptography applications.

Get Connected with companies mentioned in this article.
www.cotsjournalonline.com/getconnected

FPGA Acceleration with Intel Processors
General-purpose processor arrays, including graphics processors, are effective and easy to use. But many such processors are burdened with extremely high power requirements. In such processors, a very large number of gates must switch when performing even the simplest decisions. In contrast, FPGA modules, such as the Pico Computing E-16 shown in Figure 1, use a minimal set of gates to make basic logic decisions.

Figure 1
Unlike general-purpose processor arrays, in which a very large number of gates must switch to perform even the simplest decisions, FPGA computing modules, such as this Pico Computing E-16, use a minimal set of gates to make basic logic decisions.

The high-performance computing system described here requires approximately 500W of system power. Over half of that power is consumed by the PICMG 1.3 System Host Board (SHB) with its two Quad-Core Intel Xeon Processors E5335. The Trenton MCXT system host board shown in Figure 2 takes full advantage of the capabilities of the Quad-Core Intel Xeon Processors to manage the complex data traffic flow needed by the high-performance computing system.

In cluster computing designs, the flexibility of the interconnect that links the processing elements together is as important as the processing elements themselves. Rarely can any one military application justify its own purpose-built hardware optimized for a particular algorithm. Typically a machine designed for one military application does not have the optimal interconnect strategy for another algorithm. With that in mind, the PCI Express (PCIe)-based interconnect strategy used in this system allows tremendous flexibility in connecting the processing elements (PEs). The PEs may be connected to their nearest neighbors, to other processing elements hosted in the same box, and even across boxes with no change to the FPGA algorithms and minimal impact on speed. Unless a PE is capable of performing significant amounts of processing independently, the potential for high levels of parallelism is limited. The PCIe interconnect strategy used in the Pico Computing SC3 SuperCluster overcomes this inherent limitation.
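
To make the point about independent processing concrete, here is a minimal sketch of how a key search might be divided among processing elements so that each one can work without talking to the others. It is an illustration only, not the SC3's actual software: the function name `partition_keyspace` and the 2**40 keyspace size are hypothetical, and only the figure of 84 cards comes from the article.

```python
from __future__ import annotations


def partition_keyspace(total_keys: int, num_pes: int) -> list[range]:
    """Split [0, total_keys) into num_pes contiguous, non-overlapping ranges.

    Hypothetical helper: each range is the share of candidate keys that one
    processing element (PE) could search on its own, with no traffic to its
    neighbors until it reports a result.
    """
    if total_keys < 0 or num_pes <= 0:
        raise ValueError("total_keys must be >= 0 and num_pes must be > 0")

    # Every PE gets `base` keys; the first `extra` PEs get one more each,
    # so the shares differ in size by at most one key.
    base, extra = divmod(total_keys, num_pes)
    shares = []
    start = 0
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)
        shares.append(range(start, start + size))
        start += size
    return shares


if __name__ == "__main__":
    # 84 is the maximum card count cited in the article.
    # The 2**40 keyspace is an assumed value chosen only for illustration.
    shares = partition_keyspace(2**40, 84)
    print("PEs:", len(shares))
    print("largest share:", max(len(r) for r in shares))
    print("smallest share:", min(len(r) for r in shares))
```

Because each share is defined only by a start and stop value, the host needs to send very little data to each PE, which is the kind of workload where the interconnect carries control traffic rather than bulk data.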
based FPGA clusters can be expanded to include very large arrays where cluster computing is used to satisfy the requirements of complex applications.

The SC3 runs under Linux and naturally exploits the 64-bit architecture of that operating system. For algorithm development, any particular card may be run under Windows (XP or Vista), as would most likely be available on any laptop. Loading the FPGA image is managed entirely through the PCIe bus (in either architecture). The choice of host operating system is, of course, not binding on the operating system that runs on the FPGA. On the SC3 there is no processor on the FPGA card, so an operating system on the card was not an option. That said, the PE can be replaced with the FX part, which includes a PPC 440 processor.

Engineering Challenges and Solutions
As shown in Figure 5, the Pico Computing SC3 SuperCluster has five EC7BP PCI Express backplane cards installed, with seven E-16 FPGA cards plugged into each board. The PCI Express architecture is built on the same logical addressing model developed years ago for the PCI bus and, as such, PCIe is subject to similar limitations on the number of supported device segments. Given the architecture of a typical PICMG 1.3 passive backplane and the BIOS configuration of a system host board, the number of allowable PCI buses was exceeded in the initial system design with the maximum number of FPGA cards installed.

To correct this condition, engineers at Trenton Technology made several modifications to the MCXT board's BIOS to allow up to 256 segments on the PCIe bus. Each Pico Card requires approximately 3 1/7 segments when allowances are made for all of the supporting devices contained in the system. With 256 segments managed by the BIOS, a maximum of 84 Pico Cards can be used effectively in this SC3 SuperCluster.

Figure 4
(Diagram labels: MCXT System Host Board; PCIe-412 Backplane; EC7BP PCIe Backplane Card; E-16 FPGA Card.)

Particular care was paid to cooling the SC3 SuperCluster. The standard fans were replaced with higher-volume fans; however, it was not necessary to deviate from the standard air-cooled chassis design. Each FPGA has a temperature sensor built in, and the chip will shut down if it overheats. The same is true of the Quad-Core Intel Xeon Processors used on the Trenton MCXT system host board. Care must be taken to guard against FPGA overheating, since the internal algorithms are open to the user design and it is possible to drive the FPGA so hard that it will overheat. Design steps are taken to find the point where the increased FPGA clock speed will cause the FPGA to "burn up," and then the clock speed is adjusted downward to avoid this condition. While this additional step in the design process must be taken, it does illustrate the inherent FPGA flexibility and acceleration capability to fine-tune the FPGA clock speed to maximize processing efficiency of a particular algorithm.

Future Development Directions
Having the FPGA card as a separate unit with its own power regulation, memory, clocks and PCIe interface opens up the possibility of using different processing elements. Among the possibilities under active development are an FPGA with a PPC processor, a larger FPGA, and special coprocessors such as the MathStar chip. Single instances, or clusters, of such components can be integrated with no change to the overall system cluster architecture. The Pico E-16 card is a 34 mm wide card. The same footprint will accommodate a card with a 54 mm width, which will in turn permit a larger FPGA to be mounted on the card, such as the Virtex-5 LX85 or possibly the LX110.

Both FPGAs and general-purpose processors have an impressive breadth of available development tools and a long and established track record of delivering superior system performance. General-purpose processors represent 50 years of maturity, and FPGAs have been used in advanced computing applications for over 20 years. Tools, algorithms and applications for both of these technologies have made tremendous strides. Needless to say, there are high levels of engineering and development activity at work advancing the capabilities of both of these product technologies.

An FPGA is an intrinsically parallel device and works well with the kind of algorithm that can be divided into relatively watertight sub-processes. FPGA architectures can be expanded, more or less at will, to incorporate many diverse processing elements. This capability enables very flexible FPGA solutions in a wide variety of computing applications. The SC3 SuperCluster is a great example of how complete off-the-shelf FPGA and quad-core processor technology can be merged together by virtue of PCI Express and the PICMG 1.3 architecture to provide the robust, high-performance computing platform needed for military cryptography and applications like it.

Pico Computing
Seattle, WA.
(206) 283-2178.
[www.picocomputing.com].

Trenton Technology
Gainesville, GA.
(770) 287-3100.
[www.trentontechnology.com].