ibm.com/redbooks
Redpaper
International Technical Support Organization

Roadrunner: Hardware and Software Overview

January 2009
REDP-4477-00
Note: Before using this information and the product it supports, read the information in Notices on page v.
First Edition (January 2009) This edition applies to the Roadrunner computing system.
Copyright International Business Machines Corporation 2009. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Notices . . . . . v
Trademarks . . . . . vi

Preface . . . . . vii
The team that wrote this paper . . . . . vii
Become a published author . . . . . viii
Comments welcome . . . . . viii

Chapter 1. Roadrunner hardware overview . . . . . 1
1.1 What Roadrunner is . . . . . 2
1.1.1 A historical perspective . . . . . 3
1.2 Roadrunner hardware components . . . . . 5
1.2.1 TriBlade: a unique concept . . . . . 5
1.2.2 IBM BladeCenter QS22 . . . . . 6
1.2.3 IBM BladeCenter LS21 . . . . . 7
1.3 Rack configurations . . . . . 9
1.3.1 Compute node rack . . . . . 10
1.3.2 Compute node and I/O rack . . . . . 10
1.3.3 Switch and service rack . . . . . 11
1.4 The Connected Unit . . . . . 12
1.5 Networks . . . . . 13
1.5.1 Networks within a Connected Unit cluster . . . . . 13
1.5.2 Networks between Connected Unit clusters . . . . . 16

Chapter 2. Roadrunner software overview . . . . . 19
2.1 Roadrunner components . . . . . 20
2.1.1 Compute node (TriBlade) . . . . . 20
2.1.2 I/O node . . . . . 20
2.1.3 Service node . . . . . 21
2.1.4 Master (management) node . . . . . 21
2.2 Cluster boot sequence . . . . . 21
2.2.1 Boot scenarios . . . . . 22
2.3 xCAT . . . . . 23
2.4 How applications are written and executed . . . . . 23
2.4.1 Application core . . . . . 23
2.4.2 Offloading logic . . . . . 24

Appendix A. The Cell Broadband Engine (Cell/B.E.) processor . . . . . 27
Background . . . . . 28
The processor elements . . . . . 30
The Element Interconnect Bus . . . . . 30
Memory Flow Controller . . . . . 31

Glossary . . . . . 33

Abbreviations and acronyms . . . . . 35

Related publications . . . . . 37
IBM Redbooks . . . . . 37
Other publications . . . . . 37
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AS/400, BladeCenter, Blue Gene/L, Blue Gene, Domino, GPFS, IBM, PowerXCell, iSeries, PartnerWorld, Power Architecture, POWER3, POWER5, PowerPC, Redbooks, Redbooks (logo), RS/6000, System i, WebSphere
The following terms are trademarks of other companies:

AMD, AMD Opteron, HyperTransport, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

InfiniBand and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.

Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both, and are used under license therefrom.

Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Windows and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Pentium, the Intel logo, the Intel Inside logo, and the Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM Redpaper publication provides an overview of the hardware and software components that constitute a Roadrunner system. This includes the actual chips, cards, and other components that make up a Roadrunner connected unit, as well as the peripheral systems required to run applications. It also includes a brief description of the software used to manage and run the system.
Prashant Manikal
Cornell Wright
IBM Austin

Debbie Landon
Wade Wallace
International Technical Support Organization, Rochester Center
Comments welcome
Your comments are important to us! We want our papers to be as helpful as possible. Send us your comments about this paper or other IBM Redbooks in one of the following ways:
- Use the online Contact us review Redbooks form found at:
  ibm.com/redbooks
- Send your comments in an e-mail to:
  redbooks@us.ibm.com
- Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400
Chapter 1. Roadrunner hardware overview
on the Linpack benchmark. In 2008, Roadrunner demonstrated a thousandfold increase in sustained compute performance.

Note: The name Roadrunner was chosen by Los Alamos National Laboratory and is not a product name of the IBM Corporation. This supercomputer was designed and developed for the Department of Energy and Los Alamos National Laboratory under the project name Roadrunner. The project was named after the state bird of New Mexico.
Use virtual prototyping and modeling to understand how new production processes and materials affect performance, safety, reliability, and aging. This understanding helps define the right configuration of production and testing facilities necessary for managing the stockpile throughout the next several decades.

Throughout the history of this program, the IBM Corporation has been a key partner of the Department of Energy's National Nuclear Security Administration (NNSA) program. Here are several historical examples:

In 1998, IBM delivered the ASCI Blue Pacific system, which consisted of 5,856 PowerPC 604e microprocessors. The theoretical peak performance of this system was 3.8 teraflops.

In 2000, IBM delivered the ASCI White system. This computer system was based on the IBM RS/6000 computer and contained IBM POWER3 nodes running at 375 MHz. The cluster consisted of 512 nodes, each of which had 16 processors, for a total of 8,192 processors. The power requirements for this machine were 3 MW for the computer and an additional 3 MW for cooling. The theoretical peak processing power was 12.3 teraflops, with a Linpack performance of 7.2 teraflops.

In 2005, IBM delivered and installed the ASC Purple system at Lawrence Livermore Laboratories. This system was a 100 teraflop machine and was the successful realization of a goal set a decade earlier (1996): to deliver a 100 teraflop machine within the 2004 to 2005 time frame.

Note: At the time these goals were set, computers were still at the gigaflop level and were still two years away from the realization of the first teraflop machine.

ASC Purple is based on the symmetric shared-memory IBM POWER5 architecture. The combined system contains approximately 12,500 POWER5 processors and requires 7.5 MW of electrical power for both the computer and cooling equipment.

Another machine in the ASC program is the IBM System Blue Gene/L machine delivered by IBM to Lawrence Livermore Laboratories. The Blue Gene architecture is unique in that it allows for a very dense packing of compute nodes; a single Blue Gene rack contains 1024 nodes. On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at Lawrence Livermore Laboratory had achieved a speed of 135 teraflops on a system consisting of 32 racks. On October 27, 2005, Lawrence Livermore Laboratories and IBM announced that Blue Gene/L had produced a Linpack result that exceeded 280 teraflops. This system consisted of 65,536 compute nodes housed in 64 Blue Gene racks.

As with each of the systems described above, the Roadrunner project is a partnership with IBM. The original contract was signed in September 2006 and planned for three phases. In phase 1, a base system consisting of Opteron nodes was delivered. A hybrid node prototype system was projected for phase 2. The delivery of the final hybrid system, one that would achieve a sustained petaflop of Linpack performance, was projected for phase 3. For more information, refer to the Advanced Simulation and Computing Web site at:

http://www.sandia.gov/NNSA/ASC/about.html
The node design of the TriBlade offers a number of important characteristics. Because each node is accelerated by Cell/B.E. processors, by design there is one Cell/B.E. chip for each Opteron core. The TriBlade is populated with 16 GB of Opteron memory and an equal amount of Cell/B.E. memory. Because the new Cell/B.E. eDP processors are capable of delivering 102.4 gigaflops of peak performance, each TriBlade node is capable of approximately 400 gigaflops of double-precision compute power. For additional information about the Cell/B.E. processor, see Appendix A, The Cell Broadband Engine (Cell/B.E.) processor on page 27.

The design of the TriBlade presents the user with a very specific memory hierarchy. The Opteron processors establish a master-subordinate relationship with the Cell/B.E. processors. Each Opteron blade contains 4 GB of memory per core, resulting in 8 GB of shared memory per socket; the Opteron blade thus contains 16 GB of NUMA shared memory per node. Each Cell/B.E. processor contains 4 GB of shared memory, resulting in 8 GB of shared memory per blade; in total, the Cell/B.E. blades contain 16 GB of distributed memory per TriBlade node. It is important to note that not only is there a one-to-one mapping of Opteron cores to Cell/B.E. processors, but each node also distributes equal memory among these components.

To sustain this compute power, the connectivity within each node consists of four PCI Express x8 links, each capable of 2 GB/s transfer rates with approximately 2 microseconds of latency. The expansion slot also contains the InfiniBand interconnect, which allows communication with the rest of the cluster. The InfiniBand 4x DDR interconnect is likewise rated at 2 GB/s with approximately 2 microseconds of latency.
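As a back-of-the-envelope check, the headline numbers above follow directly from the per-chip figures quoted in this section. The short C program below reproduces the arithmetic; the link-rate derivation assumes 2.5 Gb/s lanes with 8b/10b encoding (an assumption about the signaling generation, not stated in the text).

```c
/* Sketch: back-of-the-envelope check of the TriBlade figures quoted
 * above. Constants are the ones given in this section. */
#include <stdio.h>

int main(void) {
    double cell_dp_gflops = 102.4;        /* peak DP gigaflops per Cell/B.E. eDP chip */
    int cells_per_triblade = 4;           /* one Cell chip per Opteron core */
    double node_gflops = cell_dp_gflops * cells_per_triblade;

    int opteron_mem_gb = 4 * 4;           /* 4 cores x 4 GB/core = 16 GB NUMA */
    int cell_mem_gb    = 4 * 4;           /* 4 Cell chips x 4 GB = 16 GB distributed */

    /* Assumed: 2.5 Gb/s lanes, 8b/10b encoding (80% efficiency). Both
     * PCIe x8 and InfiniBand 4x DDR then work out to 2 GB/s per direction. */
    double pcie_gbs = 8 * 2.5 * 0.8 / 8.0;      /* 8 lanes               -> 2.0 GB/s */
    double ib_gbs   = 4 * 2.5 * 2 * 0.8 / 8.0;  /* 4 lanes, double rate  -> 2.0 GB/s */

    printf("TriBlade peak: %.1f DP gigaflops\n", node_gflops);   /* 409.6 */
    printf("Memory: %d GB Opteron + %d GB Cell\n", opteron_mem_gb, cell_mem_gb);
    printf("PCIe x8: %.1f GB/s, IB 4x DDR: %.1f GB/s\n", pcie_gbs, ib_gbs);
    return 0;
}
```

Running the sketch prints a node peak of 409.6 gigaflops, which is the "approximately 400 gigaflops" figure quoted above.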
Important: The implementation chosen for the Roadrunner system consists of the standard blade populated with 16 GB of DDR2 memory. As with the Opteron blades, all of the Cell/B.E. based blades are diskless. For additional information about the Cell/B.E. processor, see Appendix A, The Cell Broadband Engine (Cell/B.E.) processor on page 27.
For more information about the QS22, see the IBM BladeCenter QS22 Web page at: http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html
For each installed microprocessor, a set of four DIMM sockets is enabled. The processors used in these blades are standard low-power parts: where standard AMD Opteron processors draw a maximum of 95 W, these specially manufactured low-power processors operate at 68 W or less without any performance trade-offs. This savings in power at the processor level, combined with the smarter power solution that IBM BladeCenter delivers, makes these blades very attractive for installations that are limited by power and cooling resources. The blade is also designed with power management capability to provide the maximum uptime possible: in extended thermal conditions, rather than shut down completely or fail, the LS21 automatically reduces the processor frequency to maintain acceptable thermal levels.

A standard LS21 blade server offers these features:
- Up to two high-performance AMD dual-core Opteron processors.
- A system board containing eight DIMM connectors, supporting 512 MB, 1 GB, 2 GB, or 4 GB DIMMs. Up to 32 GB of system memory is supported with 4 GB DIMMs.
- A SAS controller, supporting one internal SAS drive (36 or 73 GB) and up to three additional SAS drives with an optional SIO blade.
- Two TCP/IP Offload Engine enabled Gigabit Ethernet controllers (Broadcom 5706S) as standard, with load balancing and failover features.
- Support for concurrent KVM (cKVM) and concurrent USB/DVD (cMedia) through the Advanced Management Module and an optional daughter card.
- Support for a Storage and I/O Expansion (SIO) unit.

Dual Gigabit Ethernet controllers are standard, providing high-speed data transfers and offering TCP/IP Offload Engine support, load balancing, and failover capabilities. The version used for Roadrunner adds optional InfiniBand expansion cards, allowing high-speed communication between nodes; the InfiniBand fabric installed with Roadrunner provides 4x DDR connections that have a theoretical peak of 2 GB per second. Finally, the LS21 supports both the Windows and Linux operating systems. The Roadrunner implementation uses the Fedora version of Linux. Figure 1-3 on page 9 shows a schematic of the planar of an LS21.
For more information about the LS21, see the IBM BladeCenter LS21 Web page at: http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html
A switch and service rack looks similar to the picture shown in Figure 1-6.
1.4 The Connected Unit

A CU can be thought of as a base cluster unit. The racks that make up a CU are connected to each other through first-stage switches. CUs are then tied together through second-stage switches to create a larger grid. The size of a CU is largely determined by the capabilities of the first-stage switch. There are 180 TriBlades in a CU. This number of TriBlades means that a Connected Unit contains 180 AMD Opteron LS21s and 360 IBM BladeCenter QS22s, as the sketch below confirms. See Figure 1-7 on page 13.
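The per-CU and whole-system blade counts follow from simple multiplication. A minimal sketch (the 17-CU total comes from the note later in this section):

```c
/* Sketch: counts implied by the Connected Unit description above. */
#include <stdio.h>

int main(void) {
    int triblades_per_cu = 180;
    int cus = 17;                            /* full Roadrunner system  */
    int ls21_per_cu = triblades_per_cu;      /* one LS21 per TriBlade   */
    int qs22_per_cu = 2 * triblades_per_cu;  /* two QS22s per TriBlade  */

    printf("Per CU: %d LS21s, %d QS22s\n", ls21_per_cu, qs22_per_cu);
    printf("System: %d TriBlades, %d QS22s\n",
           cus * triblades_per_cu, cus * qs22_per_cu);   /* 3060, 6120 */
    return 0;
}
```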
Note: As previously discussed in this chapter, the entire Roadrunner system or cluster comprises a total of 17 CUs.
1.5 Networks
Given the high number of racks and nodes in the Roadrunner system, it should come as no surprise that there are several different networks used to tie the system together. This section provides an overview of the different networks involved as well as their functional purpose.
Figure 1-9 on page 15, on the other hand, shows a fat tree. Notice how the number of links between nodes increases as you get closer to the tree's root. The number of links shown is just one example of a fat tree configuration; the actual number between any two nodes may be higher or lower, depending on the requirements.
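To make the "fatter toward the root" idea concrete, the sketch below counts links level by level in an idealized binary fat tree that preserves full bisection bandwidth. The leaf count and the doubling rule are illustrative assumptions, not Roadrunner's actual switch configuration.

```c
/* Sketch: link counts per level in an idealized binary fat tree with
 * full bisection bandwidth. Halving the switch count while doubling
 * the links per child keeps the aggregate link count constant, which
 * is what "links get fatter toward the root" means in practice. */
#include <stdio.h>

int main(void) {
    int leaves = 16;                /* hypothetical leaf-node count (power of two) */
    int switches = leaves / 2;      /* binary tree: two children per switch */
    int links_per_child = 1;

    for (int level = 1; switches >= 1; level++) {
        int total = switches * 2 * links_per_child;
        printf("level %d: %d switches x 2 children x %d links = %d total\n",
               level, switches, links_per_child, total);
        switches /= 2;              /* half as many switches per level up... */
        links_per_child *= 2;       /* ...each with twice the link count     */
    }
    return 0;
}
```

Every level prints the same total (16 here), so the bandwidth toward the root never narrows.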
Fat tree topologies are becoming quite popular in InfiniBand clusters. For more information about fat trees and their usage with InfiniBand, see the article Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM, which is available at the following Web site: http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
The link between the AMD Opteron and its associated Cell/B.E. processor is a direct, point-to-point PCI Express connection. As discussed in 1.2.3, IBM BladeCenter LS21 on page 7, each LS21 has an expansion card installed, a Broadcom HT-2100, which allows PCI Express (PCIe) communication with a Cell/B.E. processor. The Cell/B.E. blades have PCIe functionality built into them directly, so no extra expansion card is needed. Low-level device drivers have been written to enable communication across the PCIe link, and traffic from higher-level APIs, such as Data Communication and Synchronization (DaCS) and the Accelerated Library Framework (ALF), flows across this PCIe connection to enable Opteron-to-Cell/B.E. communication. For more information about MPI, DaCS, and ALF, see Chapter 2, Roadrunner software overview on page 19.
Figure 1-10 Role of second-stage InfiniBand switches in Roadrunner (CU A through CU Q, 96 links each)
Chapter 2. Roadrunner software overview
Each I/O node runs a GPFS or Panasas PanFS client to communicate with the external file system, depending on what file system software is running there.
I/O nodes
Once successfully booted, the service nodes begin transferring the required boot images down the CVLAN. The I/O nodes are standard Opteron Linux servers and are booted diskless with the required image, which each I/O node receives from its local service node through the CVLAN network. The I/O nodes are connected to the 10 Gb global file system (GFS) network to service the compute nodes' file access requests.
2.3 xCAT
Setting up the installation and management of a cluster is a complex task, and doing everything manually quickly becomes impractical. The development of xCAT grew out of the desire to automate the many repetitive steps involved in installing and configuring a Linux cluster, and its development is driven by customer requirements. Because xCAT itself is written entirely in scripting languages such as Korn shell, Perl, and Expect, an administrator can easily modify the scripts should the need arise. The main functions of xCAT are grouped as follows:
- Automated installation
- Hardware management and monitoring
- Software administration
- Remote console support for text and graphics

For more information about xCAT, refer to the xCAT Web site at:

http://xcat.sourceforge.net
DaCS
The Data Communication and Synchronization (DaCS) library provides a set of services that ease the development of applications and application frameworks in a heterogeneous multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E. processor systems). The DaCS services are implemented as a set of APIs providing an architecturally neutral layer for application developers on a variety of multi-core systems. One of the key abstractions that further differentiates DaCS from other programming frameworks is a hierarchical topology of processing elements, each referred to as a DaCS Element (DE). Within the hierarchy, each DE can serve one or both of the following roles: A general purpose processing element, acting as a supervisor, control, or master processor. This type of element usually runs a full operating system and manages jobs running on other DEs. This is referred to as a Host Element (HE). A general or special purpose processing element running tasks assigned by an HE. This is referred to as an Accelerator Element (AE). DaCS for Hybrid (DaCSH) is an implementation of the DaCS API specification that supports the connection of an HE on an x86_64 system to one or more AEs on Cell/B.E. processors. In SDK 3.0, DaCSH only supports the use of sockets to connect the HE with the AEs. Direct access to the Synergistic Processor Elements (SPEs) on the Cell/B.E. processor is not provided. Instead, DaCSH provides access to the PowerPC Processor Element (PPE), allowing a PPE program to be started and stopped and allowing data transfer between the x86_64 system and the PPE. The SPEs can only be used by the program running on the PPE. For more information about DaCS, see IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference, SC33-8408.
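A rough sketch of that flow is shown below. The function names are hypothetical placeholders, not the real DaCS API (see the Programmer's Guide referenced above for the actual calls); the point is only the shape of the HE/AE interaction: reserve an accelerator, start a PPE-side program, move data, collect results, release.

```c
/* Hypothetical sketch of the HE/AE flow described above. These names
 * are placeholders standing in for the real DaCS calls (SC33-8408);
 * stub implementations are provided so the sketch runs anywhere. */
#include <stdio.h>
#include <string.h>

typedef int ae_handle_t;                 /* stand-in accelerator handle */

static ae_handle_t he_reserve_ae(void) { puts("reserve Cell/B.E. AE"); return 0; }
static void he_start_program(ae_handle_t ae, const char *img) {
    printf("start PPE program '%s' on AE %d\n", img, ae);  /* PPE manages the SPEs */
}
static void he_send(ae_handle_t ae, const void *p, size_t n) {
    printf("send %zu bytes to AE %d (sockets in SDK 3.0)\n", n, ae); (void)p;
}
static void he_recv(ae_handle_t ae, void *p, size_t n) {
    printf("recv %zu bytes from AE %d\n", n, ae); memset(p, 0, n);
}
static void he_release_ae(ae_handle_t ae) { printf("release AE %d\n", ae); }

int main(void) {
    double in[1024] = {0}, out[1024];

    ae_handle_t ae = he_reserve_ae();     /* the x86_64 host acts as the HE    */
    he_start_program(ae, "ppe_kernel");   /* hypothetical PPE-side image name  */
    he_send(ae, in, sizeof in);           /* input data flows HE -> AE         */
    he_recv(ae, out, sizeof out);         /* results flow back AE -> HE        */
    he_release_ae(ae);
    return 0;
}
```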
ALF
The Accelerated Library Framework (ALF) provides a programming environment for data and task parallel applications and libraries. The ALF API provides you with a set of interfaces to simplify library development on heterogeneous multi-core systems. You can use the provided framework to offload computationally intensive work to the accelerators. More complex applications can be developed by combining several function-offload libraries. You can also choose to implement applications directly to the ALF interface. ALF supports the multiple-program-multiple-data (MPMD) programming model, where multiple programs can be scheduled to run on multiple accelerator elements at the same time.
The ALF functionality includes:
- Data transfer management
- Parallel task management
- Double buffering
- Dynamic load balancing for data parallel tasks

With the provided API, you can also create descriptions for multiple compute tasks and define their execution orders by defining task dependency. Task parallelism is accomplished by having tasks without direct or indirect dependencies between them. The ALF run time provides an optimal parallel scheduling scheme for the tasks based on the given dependencies; the sketch that follows illustrates the underlying work-block idea. For more information about ALF, see IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406.
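The work-block model can be illustrated without the ALF API at all. In the plain-C sketch below (names illustrative, not ALF calls), the input is partitioned into fixed-size blocks and a stateless kernel processes each block independently; because no block depends on another, a runtime such as ALF is free to schedule them across accelerator instances in any order.

```c
/* Plain-C illustration of the work-block idea described above.
 * Not the ALF API itself; names here are illustrative. */
#include <stdio.h>

#define N     1024
#define BLOCK 128                    /* one work block = 128 elements */

/* The compute kernel: stateless, sees one block of input and output. */
static void compute_kernel(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

int main(void) {
    static float in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (float)i;

    /* Each iteration is one work block; with no dependencies between
     * blocks, a runtime could run them on any accelerator, in any order. */
    for (int b = 0; b < N / BLOCK; b++)
        compute_kernel(in + b * BLOCK, out + b * BLOCK, BLOCK);

    printf("out[5] = %.1f\n", out[5]);   /* prints 10.0 */
    return 0;
}
```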
Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
Background
The Cell/B.E. architecture is designed to support a very broad range of applications. The first implementation is a single-chip multiprocessor with nine processor elements operating on a shared memory model, as shown in Figure A-1. In this respect, the Cell/B.E. processor extends current trends in PC and server processors. The most distinguishing feature of the Cell/B.E. processor is that, although all processor elements can share or access all available memory, their function is specialized into two types: the Power Processor Element (PPE) and the Synergistic Processor Element (SPE). The Cell/B.E. processor has one PPE and eight SPEs.

The architectural definition of the Cell/B.E. architecture-compliant processor is much more general than this initial implementation. A Cell/B.E. architecture-compliant processor can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules on a system board or other second-level package. The design depends on the technology used and the performance characteristics of the intended design.

Logically, the Cell/B.E. architecture defines four separate types of functional components: the PowerPC Processor Element (PPE), the Synergistic Processor Unit (SPU), the Memory Flow Controller (MFC), and the Internal Interrupt Controller (IIC). The computational units in the Cell/B.E. architecture-compliant processor are the PPEs and the SPUs. Each SPU must have a dedicated local storage, a dedicated MFC with its associated memory management unit (MMU), and a replacement management table (RMT). The combination of these components is called a Synergistic Processor Element (SPE).
The first type of processor element, the PPE, contains a 64-bit PowerPC architecture core. It complies with the 64-bit PowerPC architecture and can run 32-bit and 64-bit applications. The second type of processor element, the SPE, is designed to run computationally intensive single-instruction multiple-data (SIMD)/vector applications. It is not intended to run a full-featured operating system. The SPEs are independent processor elements, each running its own individual application programs or threads. Each SPE has full access to shared memory, including the memory-mapped I/O space implemented by multiple DMA units. There is a mutual dependence between the PPE and the SPEs: the SPEs depend on the PPE to run the operating system and, in many cases, the top-level thread control for a user code, while the PPE depends on the SPEs to provide the bulk of the compute power.
The SPEs are designed to be programmed in high-level languages. They support a rich instruction set that includes extensive SIMD functionality. However, as with conventional processors with SIMD extensions, use of SIMD data is preferred but not mandatory. For programming convenience, the PPE also supports the standard PowerPC architecture instruction set and the SIMD/vector multimedia extensions. To an application programmer, the Cell/B.E. processor looks like a single-core, dual-threaded processor with eight additional cores, each having its own local store.

The PPE is more adept than the SPEs at control-intensive tasks and quicker at task switching. The SPEs are more adept at compute-intensive tasks and slower than the PPE at task switching. Either processor element is capable of both types of functions. This specialization is a significant factor in accounting for the order-of-magnitude improvement in peak computational performance and power efficiency that the Cell/B.E. processor achieves over conventional processors.

The most significant difference between the SPE and PPE lies in how they access memory. The PPE accesses memory with load and store instructions that move data between main storage and a set of registers, the contents of which may be cached; PPE memory access is like that of a conventional processor. The SPEs, in contrast, access main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory called a local store (LS). An SPE's instruction fetches and load/store instructions access its private local store rather than shared main memory. This three-level organization of storage (registers, LS, and main memory), with asynchronous DMA transfers between LS and main memory, is a radical break from conventional architecture and programming models. It explicitly overlaps computation with the transfer of data and instructions that feed the computation, and stores the results of computation in main memory.

A primary motivation for this new memory model is the realization that over the past twenty-five years, memory latency, as measured in processor cycles, has increased by almost three orders of magnitude. The result is that application performance is, in most cases, limited by memory latency rather than peak compute capability, as measured by processor clock speeds. When a sequential program performs a load instruction that encounters a cache miss, program execution comes to a halt for several hundred cycles (techniques such as hardware threading attempt to hide these stalls, but they do not help single-threaded applications). Compared to this penalty, the few cycles that it takes to set up a DMA transfer for an SPE is a much better trade-off, especially considering that each of the eight SPEs' DMA controllers can maintain up to 16 DMA transfers in flight simultaneously. Anticipating DMA needs efficiently can provide just-in-time delivery of data, which may reduce this stall or eliminate it entirely. Conventional processors, even with deep and costly speculation, manage to get, at best, a handful of independent memory accesses in flight. One of the SPE's DMA transfer methods supports a list (such as a scatter-gather list) of DMA transfers that is constructed in an SPE's local store, so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data.
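The double-buffering pattern that makes this overlap possible can be sketched in SPE-side C. The MFC intrinsic names below (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) come from spu_mfcio.h in the Cell SDK; the buffer size, tag assignment, and the empty compute kernel are illustrative assumptions, and the code builds only with an SPU toolchain such as spu-gcc.

```c
/* SPE-side double-buffering sketch using the MFC intrinsics from
 * spu_mfcio.h. While the kernel processes one buffer, the MFC fills
 * the other, overlapping computation with data transfer. */
#include <spu_mfcio.h>

#define CHUNK 4096                                /* bytes per DMA transfer */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

static void process(volatile char *data, int n) { (void)data; (void)n; } /* stub kernel */

void stream(unsigned long long ea, int nchunks) {
    int cur = 0;

    /* Prime the pipeline: start fetching the first chunk (tag = buffer index). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* Kick off the next transfer before touching the current data. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        /* Block only on the current buffer's tag; the other DMA stays in flight. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);                 /* compute overlaps the next DMA */
        cur = next;
    }
}
```

While the compute kernel works on one buffer, the MFC is already filling the other, which is exactly the overlap of computation and data movement described above.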
In several cases, this approach to accessing memory has improved application performance by almost two orders of magnitude when compared to the performance of conventional processors. This is significantly more than one would expect from the peak performance ratio (approximately 10x) between the Cell/B.E. processor and conventional PC processors.
The Element Interconnect Bus

The figure on page 28 shows each of these elements and the order in which the elements are connected to the EIB. The connection order is important to programmers seeking to minimize the latency of transfers on the EIB: latency is a function of the number of connection hops, so transfers between adjacent elements have the shortest latencies, while transfers between elements separated by multiple hops have the longest latencies. The EIB's internal maximum bandwidth is 96 bytes per processor clock cycle. Multiple transfers can be in process concurrently on each ring, including more than 100 outstanding DMA memory transfer requests between main storage and the SPEs in either direction. These requests may also include SPE memory to and from the I/O space. The EIB does not support any particular quality of service (QoS) behavior other than to guarantee forward progress. However, a resource allocation management (RAM) facility resides in the EIB. Privileged software can use it to regulate the rate at which resource requesters (the PPE, SPEs, and I/O devices) can use memory and I/O resources.
Glossary
Accelerator  General or special purpose processing element in a hybrid system. An accelerator might have a multi-level architecture with both host elements and accelerator elements. An accelerator, as defined here, is a hierarchy with potentially multiple layers of hosts and accelerators. An accelerator element is always associated with one host. Aside from its direct host, an accelerator cannot communicate with other processing elements in the system. The memory subsystem of the accelerator can be viewed as distinct and independent from a host. This is referred to as the subordinate in a cluster collective.

All-reduce operation  Output from multiple accelerators is reduced and combined into one output.

API  Application Programming Interface. An application programming interface defines the syntax and semantics for invoking services from within an executing application. All APIs are targeted to be available to both FORTRAN and C programs, although implementation issues (such as whether the FORTRAN routines are simply wrappers for calling C routines) are up to the supplier.

ASCI  The name commonly used for the Advanced Simulation and Computing program administered by the Department of Energy (DOE)/National Nuclear Security Agency (NNSA).

ASIC  Application Specific Integrated Circuit.

B/U  Bring up.

CEC  Central electronic complex.

cluster  A collection of nodes.

compute kernel  Part of the accelerator code that does stateless computation tasks on one piece of input data and generates the corresponding output results.

compute task  An accelerator execution image that consists of a compute kernel linked with the accelerated library framework accelerator runtime library.

DaCS element  A general or special purpose processing element in a topology. This refers specifically to the physical unit in the topology. A DaCS element can serve as a host or an accelerator.

DDR  Double Data Rate. DDR is a technique for doubling the switching rate of a circuit by triggering on both the rising edge and falling edge of a clock signal.

DE  See DaCS element.

de_id  A unique number assigned by the DaCS application at run time to a physical processing element in a topology.

EDRAM  Enhanced dynamic random access memory: dynamic random access memory that includes a small amount of static RAM (SRAM) inside a larger amount of DRAM. Performance is enhanced by making sure that many of the memory accesses will be to the faster SRAM.

EMC  Electromagnetic compatibility.

ESD  Electrostatic discharge.

ETH  Ethernet, as in adapter or interface.

FLOP  Floating Point OPeration. A measure of computation speed frequently used with supercomputers.

FLOP/s  FLOPs per second.

FPU  Floating-point unit.

FRU  Field replaceable unit.

GFLOP  GigaFLOP. A gigaFLOP/s is a billion (10^9 = 1,000,000,000) floating point operations per second.

group  A group construct specifies a collection of DEs and processes in a system.

handle  A handle is an abstraction of a data object, usually a pointer to a structure.

HBCT  Hardware-based cycle time.

host  A general purpose processing element in a hybrid system. A host can have multiple accelerators attached to it. This is often referred to as the master node in a cluster collective.

hybrid  A 64-bit x86 system using a Cell Broadband Engine (Cell/B.E.) architecture as an accelerator.

I/O  I/O (input/output) describes any operation, program, or device that transfers data to or from a computer.

I/O node  The I/O nodes (ION) are responsible, in part, for providing I/O services to compute nodes.

job  A job is a cluster-wide abstraction similar to a POSIX session, with certain characteristics and attributes. Commands are targeted to be available to manipulate a job as a single entity (including kill, modify, query characteristics, and query state).

LANL  Los Alamos National Laboratory.

LINPACK  LINPACK is a collection of FORTRAN subroutines that analyze and solve linear equations and linear least-squares problems.

main thread  The main thread of the application. In many cases, Cell/B.E. architecture programs are multi-threaded using multiple SPEs running concurrently. A typical scenario is that the application consists of a main thread that creates as many SPE threads as needed and the application organizes them.

MFLOP  MegaFLOP/s. A megaFLOP/s is a million (10^6 = 1,000,000) floating point operations per second.

MPI  Message passing interface.

MPICH2  An implementation of the MPI standard available from Argonne National Laboratory.

node  A node is a functional unit in the system topology, consisting of one host together with all the accelerators connected as children in the topology (this includes any children of accelerators).

parent  The parent of a DE is the DE that resides immediately above it in the topology tree.

PPE  PowerPC Processor Element (also Power Processor Element): the general-purpose 64-bit Power Architecture unit within the Cell/B.E. processor, optimized for running operating systems and applications. The PPE depends on the SPEs to provide the bulk of the application performance.

process  A process is a standard UNIX-type process with a separate address space.

RAS  Reliability, availability, and serviceability.

service node  The service node is responsible, in part, for management and control of Roadrunner.

SIMD  Single Instruction Multiple Data. Processing in which a single instruction operates on multiple data elements that make up a vector data type. Also known as vector processing. This style of programming implements data-level parallelism.

SN  See service node.

SPE  Synergistic Processor Element. Eight of these exist within each Cell/B.E. processor, extending the PowerPC 64 architecture by acting as cooperative offload processors (synergistic processors), with direct memory access (DMA) and synchronization mechanisms to communicate with them (memory flow control), and with enhancements for real-time management. The SPEs are independent processors, each running its own individual application programs; they are optimized for running compute-intensive applications, not an operating system.

SPMD  Single Program Multiple Data. A common style of parallel computing. All processes use the same program, but each has its own data.

SPU  Synergistic Processor Unit. The part of an SPE that executes instructions from its local store (LS).

SWL  Synthetic workload.

thread  A sequence of instructions executed within the global context (shared memory space and other global resources) of a process that has created (spawned) the thread. Multiple threads (including multiple instances of the same sequence of instructions) can run simultaneously if each thread has its own architectural state (registers, program counter, flags, and other program-visible state). Each SPE can support only a single thread at any one time. Multiple SPEs can simultaneously support multiple threads. The PPE supports two threads at any one time, without the need for software to create the threads; it does this by duplicating the architectural state. A thread is typically created by the pthreads library.

topology  A topology is a configuration of DaCS elements in a system. The topology specifies how the different processing elements in a system are related to each other. DaCS assumes a tree topology: each DE has at most one parent.

Tri-Lab  The Tri-Lab includes Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and Sandia National Laboratories.

VPD  Vital product data.

work block  A basic unit of data to be managed by the framework. It consists of one piece of the partitioned data, the corresponding output buffer, and related parameters. A work block is associated with a task. A task can have as many work blocks as necessary.

work queue  An internal data structure of the accelerated library framework that holds the lists of work blocks to be processed by the active instances of the compute task.
Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this paper.
IBM Redbooks
For information about ordering these publications, see How to get Redbooks on page 38. Note that some of the documents referenced here may be available in softcopy only.
- Building a Linux HPC Cluster with xCAT, SG24-6623
- IBM BladeCenter Products and Technology, SG24-7523
- Programming the Cell Broadband Engine Architecture: Examples and Best Practices, SG24-7575
Other publications
These publications are also relevant as further information sources:
- IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Cell/B.E. Programmer's Guide and API Reference, SC33-8333
- IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406
- IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Cell/B.E. Programmer's Guide and API Reference, SC33-8407
- IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference, SC33-8408
- Software Development Kit for Multicore Acceleration Version 3.0 Programmer's Guide, SC33-8325
- Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial, SC33-8410
- Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM, found at:
  http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
Online resources
These Web sites are also relevant as further information sources:
- Advanced Simulation and Computing
  http://www.sandia.gov/NNSA/ASC/about.html
- The Cell Broadband Engine (Cell/B.E.) project at IBM Research
  http://www.research.ibm.com/cell/
- Cell/B.E. resource center
  http://www.ibm.com/developerworks/power/cell/
- IBM BladeCenter QS22
  http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html
- IBM BladeCenter LS21
  http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html
- Open MPI: Open Source High Performance Computing
  http://www.open-mpi.org/
- xCAT
  http://xcat.sourceforge.net