ISCA
2000
Rajkumar Buyya
Objectives
Architecture
System Software
Programming Environments and Tools
Applications
http://www.buyya.com/cluster/
Agenda
Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions
Computing Elements
[Figure: layered view of a multi-processor computing system: applications, programming paradigms, threads interface, microkernel operating system, and hardware (processors, processes, threads)]
[Figure: timeline, 1940-2030, of the sequential and parallel computing eras, each moving through R&D, commercialization, and commodity phases]
Life Sciences
CAD/CAM
Aerospace
Digital Biology
E-commerce/anything
Military Applications
1. Work Harder
2. Work Smarter
3. Get Help
Computer analogy: 1. use faster hardware; 2. use optimized algorithms and techniques; 3. use multiple computers to solve a particular task.
Millions of Customers
(Millions) of Partners
[Figure: racks of clustered servers serving customers and partners]
PAPIA PC Cluster
Pentiums
Myrinet
NetBSD/Linux
PM
SCore-D
MPC++
Sequential Architecture Limitations
[Figure: performance (CPI) of a multiprocessor versus a uniprocessor as the number of processors grows]
[Figure: human growth analogy: vertical growth versus horizontal growth plotted against age (10-45 years)]
Significant developments in networking technology are paving the way for heterogeneous computing.
Reliability / Speed
Motivating Factors
Taxonomy of Architectures (Flynn)
SISD - conventional sequential machines
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches; the current focus of most parallel systems
Technology Trend
Scalable Parallel Computers
Towards Inexpensive Supercomputing
It is: Cluster Computing, the commodity supercomputing!
Cycle Stealing
[Figure: platform evolution: mainframes to minis (1970), minis to PCs (1980), PCs to network computing and clusters (1995)]
Mainframe
Mini Computer
Workstation
PC
Vector Supercomputer
Mini Computer
Workstation
(future is bleak)
PC
Mainframe
Vector Supercomputer
MPP
What is a cluster?
A cluster is a type of parallel or distributed processing system that consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.
Architectural Drivers
Clustering of Computers for Collective Computing: Trends
[Figure: timeline of clustering trends: 1960, 1990, 1995+, 2000, ?]
Example Clusters: Berkeley NOW
100 Sun UltraSparcs
200 disks
Myrinet SAN, 160 MB/s
Fast communication: AM, MPI, ...
Ether/ATM switched external net
Global OS
Self Config
Basic Components
[Figure: node: processor (P) with cache ($) attached via the I/O bus to a Myricom Myrinet NIC (160 MB/s)]
Basic unit: 2 PCs double-ending four SCSI chains of 8 disks each
8 processors
4 Myricom NICs each
Millennium PC Clumps
Inexpensive, easy-to-manage cluster
Replicated in many departments
Prototype for very large PC cluster
So What's So Different?
Commodity parts?
Communications Packaging?
Incremental Scalability?
Independent Failure?
Intelligent Network Interfaces?
Complete System on every node
virtual memory
scheduler
files
...
OPPORTUNITIES & CHALLENGES
Opportunity of Large-scale Computing on NOW
Shared pool of computing resources: processors, memory, disks, interconnect
Windows of opportunities:
MPP/DSM
Network RAM
Software RAID
Multi-path Communication
Parallel Processing
Network RAM
I/O Bottleneck
Clustering Today
1. Very high-performance microprocessors: workstation performance = yesterday's supercomputers

Cluster Computer Architecture
Cluster Components...1a
Nodes
Multiple:
PCs
Workstations
SMPs (CLUMPS)
Distributed HPC systems leading to Metacomputing
They can be based on different architectures and run different OSes.
Cluster Components...1b
Processors
There are many (CISC/RISC/VLIW/Vector...)
Intel: Pentiums, Xeon, Merced
Sun: SPARC, UltraSPARC
HP PA
IBM RS6000/PowerPC
SGI MIPS
Digital Alphas
Integrate memory, processing, and networking into a single chip
Cluster Components...2
OS
State-of-the-art OS:
Linux (Beowulf)
Microsoft NT (Illinois HPVM)
Sun Solaris (Berkeley NOW)
IBM AIX (IBM SP2)
HP UX (Illinois PANDA)
Mach, a microkernel-based OS (CMU)
Cluster operating systems: Solaris MC, SCO Unixware, MOSIX (academic project)
OS gluing layers: Glunix (Berkeley)
Cluster Components...3
High Performance Networks
Ethernet (10 Mbps)
Fast Ethernet (100 Mbps)
Gigabit Ethernet (1 Gbps)
SCI (Dolphin; ~12 microsecond MPI latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI
Cluster Components...4
Network Interfaces
Network Interface Card (NIC)

Cluster Components...5
Communication Software
Cluster Components...6a
Cluster Middleware
Resides between the OS and applications, offering infrastructure to support Single System Image (SSI) and System Availability (SA).

Cluster Components...6b
Middleware Components
Hardware: DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
OS / gluing layers and subsystems
Cluster Components...7a
Programming Environments

Cluster Components...7b
Development Tools
Compilers: C/C++/Java
Parallel programming with C++ (MIT Press book)
RAD tools
Cluster Components...8
Applications
Sequential
Parallel / Distributed (cluster-aware applications)
Classification of Cluster Computers

Clusters Classification..1
Based on application target:
High Performance (HP) clusters
High Availability (HA) clusters
Clusters Classification..2
Based on workstation/PC ownership:
Dedicated clusters
Non-dedicated clusters
Adaptive parallel computing, also called communal multiprocessing
Clusters Classification..3
Based on node architecture.

Clusters Classification..4
Based on node OS type.
Clusters Classification..5
Based on node components architecture & configuration (processor arch, node type: PC/workstation, OS: Linux/NT):
Homogeneous clusters: all nodes have similar configurations
Heterogeneous clusters: nodes based on different processors and running different OSes
Clusters Classification..6
Levels of Clustering
[Figure: levels of clustering by (1) platform (uniprocessor, SMP, cluster, MPP), (2) ownership scope (workgroup, department, campus, enterprise), and (3) network scope, up to public metacomputing (GRID)]
Cluster Middleware and Single System Image

[Figure: layered stack: Application; PVM / MPI / RSH; ??? (the missing middleware layer); Hardware/OS]

CC should support

[Figure: the same stack with the gap filled: Application; PVM / MPI / RSH; Middleware or "Underware"; Hardware/OS]
telnet cluster.my_institute.edu
telnet node1.cluster.institute.edu
Availability Support Functions
SSI Levels: How do we implement SSI?
Hardware Level
SSI at application and subsystem level (examples / boundary / importance):
- application: an application / what a user wants
- subsystem: distributed DB, OSF DME, Lotus Notes, MPI, PVM / a subsystem
- file system
- toolkit

SSI at operating system kernel level (examples / boundary / importance):
- kernel: Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLunix
- kernel interfaces: UNIX (Sun) vnode, Locus (IBM) vproc
- virtual memory: none supporting the operating system kernel / each distributed virtual memory space / may simplify implementation of kernel objects
- microkernel: Mach, PARAS, Chorus, OSF/1AD, Amoeba / each service outside the microkernel / implicit SSI for all system services

(c) In Search of Clusters
SSI at hardware level (examples / boundary / importance):
- memory: SCI, DASH / memory space
- memory and I/O

SSI Characteristics
SSI Boundaries: an application's SSI boundary
[Figure: a batch system defining the SSI boundary]
(c) In Search of Clusters
Relationship Among Middleware Modules

OS-level SSI: SCO NSC UnixWare, Solaris-MC, MOSIX, ...
Middleware-level SSI: PVM, TreadMarks (DSM), Glunix, Condor, Codine, Nimrod, ...
Application-level SSI: PARMON, Parallel Oracle, ...
[Figure: SCO UnixWare cluster node: UP or SMP nodes run standard SCO UnixWare with clustering hooks; modular kernel extensions sit above standard OS kernel calls; local devices attach to each node; ServerNet links to other nodes]
Solaris MC
[Figure: Solaris MC architecture: applications run over a C++ object framework with object invocations, providing a global file system, globalized process management, and globalized networking and I/O across nodes]
http://www.sun.com/research/solaris-mc/
Solaris MC Components
[Figure: Solaris MC components: system call interface; PXFS global distributed file system; process management; networking; object and communication support; high availability support; all built on a C++ object framework with object invocations to other nodes]
MOSIX
[Figure: software stack: Application; PVM / MPI / RSH; MOSIX; Hardware/OS]
Main tool: preemptive process migration that can migrate any process, anywhere, anytime.
http://www.mosix.cs.huji.ac.il/
NOW @ Berkeley
Design & Implementation of higher-level system
Global OS (Glunix)
Parallel File Systems (xFS)
Fast Communication (HW for Active Messages)
Application Support
Overcoming technology shortcomings
Fault tolerance
System Management
NOW Goal: Faster for Parallel AND Sequential
http://now.cs.berkeley.edu/
[Figure: Berkeley Active Messages: Unix (Solaris) workstations, each with a virtual network (VN) segment driver and an AM L.C.P., communicating via Active Messages]
Cluster Programming Environments
DSM
Threads/OpenMP (enabled for clusters)
Java threads (HKU JESSICA, IBM cJVM)
PVM
MPI
Parametric computations: Nimrod/Clustor
Levels of Parallelism
[Figure: tasks i-1, i, i+1, each running its own function (func1, func2, func3) on its own data (a(0), b(0); a(1), b(1); a(2), b(2)), with load handled at the PVM/MPI, threads, compiler, and CPU levels]

Code granularity / code item:
Large grain (task level): program, handled via PVM/MPI
Medium grain (control level): function (thread), handled via threads
Fine grain (data level): loop, handled via compilers
Very fine grain (multiple issue): handled with hardware (CPU)
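The granularity levels above map naturally onto ordinary code. As a rough illustration (a Python stand-in, not cluster code; on a real cluster the large-grain tasks would be separate programs coordinated via PVM/MPI), the software-visible grains look like:

```python
from concurrent.futures import ThreadPoolExecutor

def func1(x): return x + 1   # task i-1
def func2(x): return x * 2   # task i
def func3(x): return x ** 2  # task i+1

# Large grain (task level): independent tasks run concurrently
# (threads stand in here for what PVM/MPI processes would do)
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(f, 3) for f in (func1, func2, func3)]
    task_results = [fu.result() for fu in futures]

# Medium grain (control level): one function applied concurrently by threads
with ThreadPoolExecutor() as pool:
    thread_results = list(pool.map(func2, [0, 1, 2]))

# Fine grain (data level): a loop a parallelizing compiler could split
a = [i + 10 for i in range(3)]
```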
MPI
http://www.mpi-forum.org/
[Figure: a master process collects "Hello, ..." messages from worker processes]
Execution
% cc -o hello hello.c -lmpi
% mpirun -p2 hello
Hello, I am process 1!
% mpirun -p4 hello
Hello, I am process 1!
Hello, I am process 2!
Hello, I am process 3!
% mpirun hello
(no output: a single process means no workers, hence no greetings)
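The hello.c source itself is not shown in the slides; as a hedged sketch of the logic the transcript implies (every nonzero-rank process greets the master, rank 0, which prints the messages), here is a Python stand-in using threads and a queue in place of MPI processes and messages:

```python
import queue
import threading

def hello(rank, inbox):
    # every process except the master (rank 0) sends a greeting
    if rank != 0:
        inbox.put("Hello, I am process %d!" % rank)

def run(nprocs):
    # stand-in for `mpirun -p<nprocs> hello`
    inbox = queue.Queue()  # plays the role of messages sent to rank 0
    procs = [threading.Thread(target=hello, args=(r, inbox))
             for r in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # the master gathers the greetings (sorted, since arrival order is
    # nondeterministic, just as with real MPI processes)
    return sorted(inbox.get() for _ in range(nprocs - 1))

print("\n".join(run(4)))
```

With `run(1)` the list is empty, matching the transcript's "no workers, no greetings" case.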
PARMON: A Cluster Monitoring Tool
[Figure: PARMON client (parmon) on a JVM talking to a PARMON server (parmond) on each node over a high-speed switch]
http://www.buyya.com/parmon/
Resource Utilization at a Glance
[Figure: cluster storage address spaces: sequential addresses striped as data blocks D11..Dnt across local disks LD1..LDn (RADD space), blocks B11..Bmk across shared RAIDs SD1..SDm (NASD space), and peripherals P1..Ph (NAP space)]
[Figure: user applications reach storage through a name agent, I/O agents, and a disk/RAID/NAP mapper; a block mover moves data among RADD, NASD, and NAP spaces; implemented as user-level middleware plus some modified OS system calls]
[Figure: block mover at work: a user application on Node 1 requests data block A; the I/O agent on Node 2 retrieves it from LD2 or SDi of the NASD, and the block mover delivers it to LD1 on Node 1]
What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??
[Figure: hyperclusters over a LAN/WAN: each cluster runs a master daemon, a scheduler, and execution daemons; clients submit and steer work through graphical control and submit interfaces]
What is a Grid?
http://www.sun.com/hpc/

Grid Application Drivers
(Distributed) supercomputing
Collaborative engineering
High-throughput computing
Large-scale simulation & parameter studies
Grid Components
Grid Apps: applications and portals (scientific, engineering, collaboration, Web tools)
Grid Tools: languages, libraries, debuggers, monitoring, resource brokers
Grid Middleware: communication, information, process, data access, QoS services
Grid Fabric: operating systems, queuing systems, computers, clusters, storage systems, data sources, scientific instruments
PUBLIC FORUMS: Computing Portals, Grid Forum, European Grid Forum, IEEE TFCC, GRID2000, and more
Australia: Nimrod/G, EcoGrid and GRACE, DISCWorld
Europe: UNICORE, MOL, METODIS, Globe, Poznan Metacomputing, CERN Data Grid, MetaMPI, DAS, JaWS, and many more
Japan: Ninf, Bricks, and many more
USA: Globus, Legion, JAVELIN, AppLes, NASA IPG, Condor, Harness, NetSolve, NCSA Workbench, WebFlow, EveryWhere, and many more
Cycle-stealing initiatives: Distributed.net, SETI@Home, Compute Power Grid
http://www.gridcomputing.com/
NetSolve: Client/Server/Agent-Based Computing
Easy-to-use tool providing efficient and uniform access to a variety of scientific packages on UNIX platforms
Client-server design; network-enabled solvers
Seamless access to network resources; non-hierarchical system
Load balancing; fault tolerance
Interfaces to Fortran, C, Java, Matlab, and more
Software repository
[Figure: the client sends a request to the NetSolve agent, which returns a choice of server; the chosen server computes and sends the reply]
Software is available: www.cs.utk.edu/netsolve/
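The agent's load-balancing step can be sketched roughly as picking the least-loaded server that offers the requested solver. The server names, solver names, and load fields below are invented for illustration and are not NetSolve's actual data structures:

```python
# Hypothetical agent-style load balancing: choose the least-loaded
# server that provides the requested solver (all names invented).
servers = [
    {"name": "hostA", "solvers": {"linsolve", "fft"}, "load": 0.7},
    {"name": "hostB", "solvers": {"linsolve"},        "load": 0.2},
    {"name": "hostC", "solvers": {"fft"},             "load": 0.1},
]

def choose_server(request, servers):
    # keep only servers that offer the requested solver
    candidates = [s for s in servers if request in s["solvers"]]
    if not candidates:
        return None  # fault tolerance: the caller would retry or fail over
    # pick the candidate with the lowest reported load
    return min(candidates, key=lambda s: s["load"])["name"]
```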
HARNESS
[Figure: HARNESS virtual machines spanning hosts (e.g., Host B, Host C); a component-based daemon on each host provides process control and user features; customization and extension by dynamically adding plug-ins]
http://www.epm.ornl.gov/harness/
http://www.dgs.monash.edu.au/~davida/nimrod.html
Nimrod/G Architecture
[Figure: Nimrod/G clients drive the Nimrod engine, which uses a schedule advisor, trading manager, persistent store, dispatcher, and grid explorer over middleware services (GIS, resource managers, trade servers) to run jobs on the GUSTO test bed]
[Figure: computational economy: the user's trade manager and schedule advisor negotiate with a trade server (charging algorithms, accounting) in each resource domain; deployment and job-control agents handle resource allocation and reservation on resources R1..Rn]
Pointers to Literature on Cluster Computing

Reading Resources..1a Internet & WWW
Computer Architecture: http://www.cs.wisc.edu/~arch/www/
DSMs: http://www.cs.umd.edu/~keleher/dsm.html
Reading Resources..1b Internet & WWW
Solaris-MC: http://www.sunlabs.com/research/solaris-mc
Beowulf: http://www.beowulf.org
Metacomputing: http://www.sis.port.ac.uk/~mab/Metacomputing/
Reading Resources..2 Books
In Search of Clusters, by G. Pfister, Prentice Hall (2nd ed.), 1998

Reading Resources..3 Journals
"A Case for NOW (Networks of Workstations)", by Anderson, Culler, and Patterson, IEEE Micro, Feb. 1995
http://www.csse.monash.edu.au/~rajkumar/cluster/
http://www.ieeetfcc.org
TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing

TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
TFCC Regional Activities
TFCC Activities...
Clusters Revisited
Summary
Conclusions

Backup Slides...
MISD Architecture
[Figure: a single data input stream feeding processors A, B, and C, each producing its own data output stream]
SIMD Architecture
[Figure: one instruction stream driving processors A, B, and C, each with its own data input stream (A, B, C) and data output stream (A, B, C)]
Ci <= Ai * Bi
Examples: CRAY vector-processing machines, Thinking Machines CM*
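The lockstep operation Ci <= Ai * Bi can be illustrated with a plain sequential stand-in; this is only a model of what the SIMD hardware does across all its processing elements in a single step:

```python
# Sketch of the SIMD operation Ci <= Ai * Bi: each virtual "processing
# element" i applies the same instruction (multiply) to its own data.
A = [1, 2, 3, 4]
B = [10, 20, 30, 40]

# one instruction, applied in lockstep across all elements
C = [a * b for a, b in zip(A, B)]
```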
MIMD Architecture
[Figure: independent instruction streams A, B, and C driving processors A, B, and C, each with its own data input and output streams]