Anda di halaman 1dari 99

Network Based Computing

Dhabaleswar K. (DK) Panda


Department of Computer Science and Engineering
The Ohio State University

E-mail: panda@cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Presentation Outline

• Network-Based Computing (NBC) Trend


• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Current and Next Generation
Applications and Computing Systems
• Big demand for
– High Performance Computing (HPC)
– Servers for file systems, web, multimedia,
database, visualization, ...
– Internet data centers (Multi-tier Datacenters)
• Processor speed continues to grow
– Doubling every 18 months
– Multicore chips are emerging
• Commodity networking also continues to grow
– Increase in speed
– Many features
– Affordable pricing
• Clusters are increasingly becoming popular to
design next generation computing systems
– scalability
– modularity
– can be easily upgraded as computing and
networking technologies improve
Growth in Commodity Networking
Technologies
• Representative commodity networking technologies and their
entries into the market
– Ethernet (1979- ) 10 Mbit/sec
– Fast Ethernet (1993- ) 100 Mbit/sec
– Gigabit Ethernet (1995- ) 1000 Mbit/sec
– ATM (1995- ) 155/622/1024 Mbit/sec
– Myrinet (1993- ) 1 Gbit/sec
– Fibre Channel (1994- ) 1 Gbit/sec
– InfiniBand (2001-) 2.5 Gbit/sec (1X) -> 2 Gbit/sec
– InfiniBand (2003-) 10 Gbit/sec (4X) -> 8 Gbit/sec
– 10 Gigabit Ethernet (2004-) 10 Gbit/sec
– InfiniBand (2005-) 20 Gbit/sec (4X DDR)
-> 16 Gbit/sec
30 Gbit/sec (12X SDR)
-> 24 Gbit/sec
12 times in the last 5 years
Network-Based Computing

Computing Systems Web Technology

•powerful •powerful
•cost- •ability to
effective integrate
•commodity computation and
communication

Network-
Networking
Based Emerging Applications
Computing
• collaborative/
interactive
•high bandwidth
•cost-effective •wide range of
computation and
•commodity communication
characteristics
Presentation Outline

• Network-Based Computing (NBC) Trend


• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Different Types of NBC Systems

• Five major categories


– Dedicated Cluster over a LAN
– Clusters with Interactive clients
– Globally-Interconnected Systems
– Interconnected Clusters over WAN
– Multi-tier Data Centers
– Integrated environment of multiple clusters
Dedicated Cluster over a SAN/LAN

• Packaged inside a rack/room interconnected with


SAN/LAN

comp comp

comp Network(s) comp

comp comp
Trends for Computing Clusters
in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
– June 2001: 33/500 (6.6%)
– Nov 2001: 43/500 (8.6%)
– June 2002: 80/500 (16%)
– Nov 2002: 93/500 (18.6%)
– June 2003: 149/500 (29.8%)
– Nov 2003: 208/500 (41.6%)
– June 2004: 291/500 (58.2%)
– Nov 2004: 294/500 (58.8%)
– June 2005: 304/500 (60.8%)
– Nov 2005: 360/500 (72.0%)
– June 2006: 364/500 (72.8%)
Cluster with Interactive Clients
LAN/WAN client

cluster
Globally-Interconnected Systems
SMP system WAN

SMP system
Interconnected Clusters over WAN

WAN

Workstation
cluster
Generic Three-Tier Model for
Data Centers
Tier 1 Tier 2 Tier 3
Storage

Routers/Servers Application Server Database Server

Routers/Servers Application Server Database Server

Routers/Servers Switch Application Server Switch Database Server Switch

. . .
. . .
Routers/Servers Application Server Database Server

• All major search engines and e-commerce companies


are using clusters for multi-tier datacenter
• Google, Amazon, Financial institutions, …..
Integrated Environment with
Multiple Clusters
Compute cluster
Storage Cluster
Front end

LAN
LAN

LAN/WAN

LAN Datacenter for


Visualization and
Data Mining
Presentation Outline

• Network-Based Computing (NBC) Trend


• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Our Vision
• Network-Based Computing Group to take a lead in
– Proposing new designs for high performance NBC
systems by taking advantages of modern networking
technologies and computing systems
– Developing better middleware/API/programming
environments so that modern NBC applications can be
developed and implemented in a scalable fashion
• Carry out research in an integrated manner
(systems, networking, and applications)
Group Members
• Currently 12 PhD students and one MS student
– Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar,
Ping Lai, Amith Mamidala, Sundeep Narravul, Ranjit
Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik
Vaidyanathan, and Abhinav Vishnu
• All current students are supported as RAs
• Programmers
– Shaun Rowland
– Jonathan Perkins
• System Administrator
– Keith Stewart
External Collaboration
• Industry
– IBM TJ Watson and IBM
– Intel
– Dell
– SUN
– Apple
– Linux Networx
– NetApp
– Mellanox and other InfiniBand companies
• Research Labs
– Sandia National Lab
– Los Alamos National Lab
– Pacific Northwest National Lab
– Argonne National Lab
• Other units in campus
– Bio-Medical Informatics (BMI)
– Ohio Supercomputer Center (OSC)
– Internet 2 Technology Evaluation Centers (ITEC-Ohio)
Research Funding
• National Science Foundation
– Multiple grants
• Department of Energy
– Multiple grants
• Sandia National Lab
• Los Alamos National Lab
• Pacific Northwest National Lab
• Ohio Board of Regents
• IBM Research
• Intel
• SUN
• Mellanox
• Cisco
• Linux Networx
• Network Appliance
Research Funding (Cont’d)
• Three Major Large-Scale Grants
– Started in Sept ‘06
– DOE Grant for Next Generation Programming Models
• Collaboration with Argonne National Laboratory, Rice
University, Univ. of Berkeley, and Pacific Northwest
National Laboratory
– DOE Grant for Scalable Fault Tolerance in Large-Scale
Clusters
• Collaboration with Argonne National Laboratory, Indiana
University, Lawrence Berkeley National Laboratory, and
Oakridge National Laboratory
– AVETEC Grant for Center for Performance Evaluation of
Cluster Interconnects
• Collaboration with major DOE Labs and NASA
• Several continuation grants from other funding
agencies
Presentation Outline

• Systems Area in the Department


• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Requirements for Modern Clusters,
Cluster-based Servers, and
Datacenters
• Requires
– good Systems Area Network with excellent performance
(low latency and high bandwidth) for interprocessor
communication (IPC) and I/O
– good Storage Area Networks high performance I/O
– good WAN connectivity in addition to intra-cluster
SAN/LAN connectivity
– Quality of Service (QoS) for supporting interactive
applications
– RAS (Reliability, Availability, and Serviceability)
Research Challenges

Applications
(Scientific, Commercial, Servers, Datacenters)

Programming Models
(Message Passing, Sockets, Shared
Memory w/o Coherency)

Low Overhead Substrates


Point-to-point Collective Synchronization I/O & File Fault
Communication Communication & Locks Systems
QoS Tolerance

Networking Technologies Commodity


(Myrinet, Gigabit Ethernet, Computing System Architectures
InfiniBand, Quadrics, 10GigE, (single, dual, quad, ..)
RNICs) & Intelligent NICs Multi-core architecture
Research Challenges and Their
Dependencies

Applications

System Software/Middleware
Support

Networking and
Communication Support

Trends in Networking/Computing
Technologies
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ Æ Projects
Why InfiniBand?

• Traditionally HPC clusters have used


– Myrinet or Quadrics
• Proprietary interconnects
– Fast Ethernet or Gigabit Ethernet for low-end clusters
• Datacenters have used
– Ethernet (Gigabit Ethernet is common)
– 10.0 Gigabit Ethernet is emerging and not yet available
with low cost
– QoS and RAS capabilities are not there in Ethernet
• Storage and File Systems have used
– Ethernet with IP
– Fibre Channel
– Does not support high bandwidth and QoS, etc.
Need for A New Generation
Computing, Communication & I/O
Architecture
• Started around late 90s
• Several initiatives were working to address high
performance I/O issues
– Next Generation I/O
– Future I/O
• Virtual Interface Architecture (VIA) consortium
was working towards high performance
interprocessor communication
• An attempt was made to consolidate all these
efforts as an open standard
• The idea behind InfiniBand Architecture (IBA)
was conceived ...
• The original target was for Data Centers
IBA Trade Organization

• IBA Trade Organization was formed with seven


industry leaders (Compaq, Dell, HP, IBM, Intel,
Microsoft, and Sun)
• Goal: To design a scalable and high performance
communication and I/O architecture by taking an
integrated view of computing, networking, and
storage technologies
• Many other industry participated in the effort to
define the IBA architecture specification
• InfiniBand Architecture (Volume 1, Version 1.0)
was released to public on Oct 24, 2000
• www.infinibandta.org
Rich Set of Features of IBA
• High Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (~1.0-3.0 microsec), High bandwidth (~1-2
GigaBytes/sec), and low CPU utilization (5-10%)
• Flexibility for WAN communication
• Multiple Transport Services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable
Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (very unique)
• high performance and scalable implementations of distributed
locks, semaphores, collective communication operations
Principles behind RDMA
Mechanism
Node Node

Memory Memory

P0 PCI/PCI-EX
P0 PCI/PCI-EX

IBA IBA
P1 P1

• No involvement by the processor at the receiver side (RDMA Write/Put)


• No involvement by the processor at the sender side (RDMA Read/get)
• 1-2 microsec latency (for short data)
• 1.5 Gbytes/sec bandwidth (for large data)
• 3-5 microsec for atomic operation
Rich Set of Features of IBA
(Cont’d)
• Range of Network Features and QoS Mechanisms
– Service Levels (priorities)
– Virtual lanes
– Partitioning
– Multicast
• allows to design a new generation of scalable communication and I/O
subsystem with QoS
• Protected Operations
– Keys
– Protection Domains
• Flexibility for supporting Reliability, Availability, and Serviceability
(RAS) in next Generation Systems with IBA features
– Multiple CRC fields
• error detection (per-hop, end-to-end)
– Fail-over
• unmanaged and managed
– Path Migration
– Built-in Management Services
Subnet Manager

Inactive
Link

Multicast Active
Inactive
Setup Links

Switch Multicast Join


Compute Node

Multicast
Setup
Multicast Join

Subnet
Manager
Designing MPI Using InfiniBand
Features
MPI Design Components
Protocol Flow Communication Multirail
Mapping Control Progress Support

Buffer Connection Collective One-sided


Management Management Communication Active/Passive

Substrate

Communication Atomic Completion & End-to-End


Semantics Operations Event Flow Control

Transport Communication Multicast Quality of


Services Management Service

InfiniBand Features
High Performance MPI over IBA –
Research Agenda at OSU
• Point-to-point communication
– RDMA-based design for both small and large messages
• Collective communication
– Taking advantage of IBA hardware multicast for Broadcast
– RDMA-based designs for barrier, all-to-all, all-gather
• Flow control
– Static vs. dynamic
• Connection Management
– Static vs. dynamic
– On-demand
• Multi-rail designs
– Multiple ports/HCAs
– Different schemes (striping, binding, adaptive)
• MPI Datatype Communication
– Taking advantage of scatter/gather semantics of IBA
• Fault-tolerance Support
– End-to-end reliability to tolerate I/O errors
– Network fault-tolerance with Automatic Path Migration (APM)
– Applications transparent checkpoint and restart
• Currently extending our designs for iWARP adapters
Overview of MVAPICH and MVAPICH2
Projects (OSU MPI for InfiniBand)
• Focusing on
– MPI-1 (MVAPICH)
– MPI-2 (MVAPICH2)
• Open Source (BSD licensing) with anonymous SVN
access
• Directly downloaded and being used by more than
440 organizations worldwide (in 30 countries)
• Available in the software stacks of
– Many IBA and server vendors
– OFED (OpenFabrics Enterprise Distribution)
– Linux Distributors
• Empowers multiple InfiniBand clusters in the TOP
500 list
• URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Larger IBA Clusters using MVAPICH
and Top500 Rankings (June ’06)
• 6th : 4000-node dual Intel Xeon 3.6 GHz cluster at Sandia
• 28th: 1100-node dual Apple Xserve 2.3 GHz cluster at Virginia Tech
• 66th: 576-node dual Intel Xeon EM64T 3.6 GHz cluster at Univ. of
Sherbrooke (Canada)
• 367th : 356-node dual Opteron 2.4 GHz cluster at Trinity Center
for High Performance Computing (TCHPC), Trinity College Dublin
(Ireland)
• 436th: 272-node dual Intel Xeon EM64T 3.4 GHz cluster at SARA
(The Netherlands)
• 460th: 200-node dual Intel Xeon EM64T 3.2 GHz cluster at Texas
Advanced Computing Center/Univ. of Texas
• 465th: 315-node dual Opteron 2.2 GHz cluster at NERSC/LBNL
• More are getting installed ….
MPI-level Latency (One-way):
IBA (Mellanox and PathScale) vs. Myrinet vs.
Quadrics
Small message latency
Large message latency
14
700
Latency (us)

12 MVAPICH-X
600
MVAPICH-Ex-1p
10 500
MVAPICH-Ex-2p
8 400
MPICH/MX
4.9 6 300
MPICH/QsNet-X
4.0 4 200
MVAPICH-Gen2-Ex-
3.3 DDR-1p
2.8 2 100
2.0
0 0
0 4 8 16 32 64 128 256 512 1024 2K 4K 8K 16K 32K 64K 128K 256K

Msg size (Bytes) Msg size (Bytes)

• SC ’03
• Hot Interconnect ’04 09/26/05
• IEEE Micro (Jan-Feb) ’05, one of the best papers from HotI ‘04
MPI-level Bandwidth (Uni-directional):
IBA (Mellanox and PathScale) vs. Myrinet vs.
Quadrics

1500 1492
1350
MVAPICH-X
1472
1200 MVAPICH-Ex-1p
Bandwidth (MillionBytes/Sec)

1050 MVAPICH-Ex-2p 970


900 MPICH/MX
910
750
MPICH/QsNet-X 891
MVAPICH-Gen2-DDR-Ex-1p
600

450 494
300

150

0
4 8 16 32 64 128 256 512 1024 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M
09/26/05
Mesg size (Bytes)
MPI-level Bandwidth (Bi-directional):
IBA (Mellanox and PathScale)) vs. Myrinet vs.
Quadrics
3000

2700
2724
MVAPICH-X
2628
2400 MVAPICH-Ex-1p
Bandwidth (MillionBytes/Sec)

2100
MVAPICH-Ex-2p
MPICH/MX-X
1800
MPICH/QsNet-X
1841
1500
MVAPICH-Gen2-DDR-Ex-1p

1200
943
900 908
600
901
300

0
4 8 16 32 64 128 256 512 1024 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M

Mesg size (Bytes) 08/15/05


MPI over InfiniBand Performance
(Dual-core Intel Bensley Systems with Dual-Rail
DDR InfiniBand)
HCA 0 HCA 1 Uni-Directional Bandwidth

HCA 0 HCA 1
3000
P2 P0 P0 P2
2500
4 processes
2000

P3 P1 P1 P3 1500

1000
2808 MB/sec
500
4-processes on each node concurrently
communicating over Dual-rail InfiniBand DDR (Mellanox) 0
1 4 16 64 256 1K 4K 16K 64K 256K 1M
Message Size (bytes)

Bi-Directional Bandwidth Messaging Rate


3.5
5000
4500 3
4000 4 processes
2.5
4 processes
3500

3.16 Million
3000 2
2500

4553 MB/sec Msgs/sec


1.5
2000
1500 1

1000
0.5
(16 Bytes)
500
0 0
1 4 16 64 256 1K 4K 16K 64K 256K 1M 1 4 16 64 256 1K 4K 16K 64K 256K
Message Size (bytes)
Message Size (bytes)

M. J. Koop, W. Huang, A. Vishnu and D. K. Panda, Memory Scalability Evaluation of Next Generation Intel
Bensley Platform with InfiniBand, to be presented at Hot Interconnect Symposium (Aug. 2006).
High Performance and Scalable
Collectives
• Reliable MPI Broadcast using IB hardware
multicast
– Capability to support broadcast of 1K bytes
message to 1024 nodes in less than 40 microsec
• RDMA-based designs for
– MPI_Barrier
– MPI_All_to_All
J. Liu, A. Mamidala and D. K. Panda, Fast and Scalable MPI-Level Broadcast
using InfiniBand’s Hardware Multicast Support, Int’l Parallel and Distributed
Processing Symposium (IPDPS ’04), April 2004

S. Sur and D. K. Panda, Efficient and Scalable All-to-all Exchange for


InfiniBand-based Clusters, Int’l Conference on Parallel Processing (ICPP ’04),
Aug. 2004
Analytical Model –
Extrapolation for 1024 nodes
80 180
Actual, no h/w mcst
70 160
60
Estimated, no h/w mcst
140 no h/w mcst, 1024
Actual, h/w mcst h/w mcst, 1024
50 120
latency

Estimated, h/w mcst


100
40

la t e n c y
80
30
60
20 40
10 20
0 0
16
32
64

8
6
2
1
2
4
8

24
48
96

1
2
4
8
12
25
51

16
32
64
8
6
10 2
20 4
40 8
96
10
20
40

12
25
51
2
4
size
size

• Can broadcast a 1Kbyte message to 1024 nodes in 40.0


microsec
Fault Tolerance
• Component failures are the norm in large-
scale clusters
• Imposes need on reliability and fault
tolerance
• Working along the following three angles
– End-to-end Reliability with memory-to-memory
CRC
• Already available with the latest release
– Reliable Networking with Automatic Path
Migration (APM) utilizing Redundant
Communication Paths
• Will be available in future release
– Process Fault Tolerance with Efficient
Checkpoint and Restart
• Already available with the latest release
Checkpoint/Restart Support for
MVAPICH2
• Process-level Fault Tolerance
– User-transparent, system-level checkpointing
– Based on BLCR from LBNL to take coordinated
checkpoints of entire program, including front
end and individual processes
– Designed novel schemes to
• Coordinate all MPI processes to drain all in flight
messages in IB connections
• Store communication state and buffers, etc. while
taking checkpoint
• Restarting from the checkpoint
A Running Example
LU Restarted
is
Now,
not
Now,
Start
LU Restart
Get
affected,
is
Take
running
still
LU
its PID
checkpoint
finish
from
running
Stop
LU
the
running.
it using
checkpoint
CTRL-C
• Show how to checkpoint/restart LU
from NAS benchmark
• There are two terminals:
– Left one for normal run
– Right one for checkpoint/restart
NAS Execution Time and
HPL Performance
lu.C.8 bt.C.9 sp.C. 30

P e rfo rm a n c e (G F L O P S )
500 25

400 20
15
300
10
200
5
100 0
1 min 2 min 4 min None 2 min (6) 4 min (2) 8 min (1) None
Checkpointing In Checkpointing Interval & No. of checkpoints
• The execution time of NAS benchmarks increased slightly with frequent
checkpointing
• Frequent checkpointing also has some impact on HPL performance metrics

Q. Gao, W. Yu, W. Huang and D.K. Panda, Application-Transparent Checkpoint/Restart for


MPI over InfiniBand, International Conference on Parallel Processing, Columbus, Ohio, 2006
Continued Research on several
components of MPI
• Collective communication
– All-to-all
– All-gather
– Barrier
– All-reduce
• Intra-node shared memory support
– User-level, kernel-level
– NUMA support
– Multi-core support
• MPI-2
– One-sided (get, put, accumulate)
– Synchronization (active, passive)
– Message ordering
• Datatype
Continued Research on several
components of MPI (Cont’d)
• Multi-rail support
– Multiple NICs
– Multiple ports/NIC
• uDAPL support
– Portability across different networks (Myrinet,
Quadrics, GigE, 10GigE)
• Fault-tolerance
– Check-point restart
– Automatic Path Migration
• MPI-IO
– High performance I/O support
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ Æ Projects
Can we enhance NFS Performance
with RDMA and InfiniBand?
• Many enterprise environments use file
servers and NFS
• Most of the current systems use Ethernet
• Can such environments use InfiniBand?
• NFS over RDMA standard has been
proposed
• Designed and implemented this on
InfiniBand in Open Solaris
– Taking advantage of RDMA mechanisms
• Joint work with funding from Sun and
Network Appliance
NFS over RDMA –
New Design
• NFS over RDMA in Open Solaris
– New designs with server-based RDMA read and write
operations
– Allows different registration mode, including IB FMR
(Fast Memory Registration)
– Enables multiple chunk support
– Supports multiple concurrent outstanding RDMA
operations
– Interoperable with Linux NFSv3 and v4
– Currently tested on InfiniBand only
• Advantages
– Improved performance
– Reduced CPU utilization and efficiency
– Eliminates risk to NFS servers
• The code will be available with Solaris and Open
Solaris in 2007
IOZone Read Performance
Read-Write Client Utlization

800 25

700

20

600
Bandwidth (MB/s)

%/(MB/s)
500
15

400

10
300

200
5

100

0 0
1 2 3 4 5 6 7 8 9 10 11 12

Number of IOzone Threads


• Sun x2100’s (dual 2.2 GHz Opteron CPU’s with x8 PCI-Express Adaptors)
• IOzone Read Bandwidth up to 713 MB/s
IOZone Write Performance
800
700
600
WRITE 500
Throughput 400
(MB/s) 300 tmpfs
200
sinkfs
100
0
1 2 3 4 5 6 7 8

Threads
System: Sun x2100’s (Dual 2.2 GHz Opteron CPU’s with x8
PCI-Express Adaptors)
File size: 128MB for tmpfs and 1GB for sinkfs
IO size: 1MB
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ Æ Projects
Why Targeting Virtualization?
• Ease of management
– Virtualized clusters
– VM migration – deal with system upgrade/failures
• Customized OS
– Light-weight OS: No widely adoption due to
management difficulties
– VM makes those techniques possible
• System security & productivity
– Users can do ‘anything’ in VM, in the worst case
crash a VM, not the whole system
Challenges
• Performance overhead of I/O virtualization
• Management framework to take advantages of VM
technology for HPC
• Migration of modern OS-bypass network devices
• File system support for VM based cluster
Our Initial Studies
• J. Liu*, W. Huang, B. Abali*, D. K. Panda. High Performance VMM-
Bypass I/O in Virtual Machines, USENIX’06

• W. Huang, J. Liu*, B. Abali*, D. K. Panda. A Case for High


Performance Computing with Virtual Machines, ICS ’06

External collaborators from IBM T.J. Watson Research Center


From OS-bypass to VMM-bypass
VM VM • Guest modules in guest VMs handle
Application Application
setup and management operations
(privileged access)
– Guest modules communicate with
OS Guest Module OS
backend modules in VMM to get
jobs done
Backend Module – The original privileged module can
VMM be reused
Privileged Module • Once things are setup properly,
devices can be accessed directly from
guest VMs (VMM-bypass access)
– Either from the OS kernel or
Device

Privileged Access
applications
VMM-bypass Access
• Backend and privileged modules can
also reside in a special VM
Xen-IB:Xen-IB: InfiniBand
an InfiniBand virtualization driver
Virtualization for Driver
Xen for Xen
• Follows Xen split driver model
• Presents virtual HCAs to guest domains
– Para-virtualization
• Two modes of access:
– Privileged access
• OS involved
• Setup, resource management and memory
management
– OS/VMM-bypass access
• Directly done in user space/guest VM
• Maintains high performance of InfiniBand
hardware
MPI Latency and Bandwidth
(MVAPICH)
Latency Bandwidth
30 1000
25 xen xen
800
20 native native

MillionBytes/s
600
Latency (us)

15
400
10
200
5
0 0

16

64

1k

4k

1M

4M
1

k
k
6k
32

2k

8k
0

25

16

64
12

51

25
Msg size (Bytes) Msg size (Bytes)

• Only VMM Bypass operations are used


• Xen-IB performs similar to native InfiniBand
• Numbers taken with MVAPICH, high performance MPI-1 and
MPI-2 (MVAPICH2) implementations over InfiniBand
– Currently used by more than 405 organizations across the world
HPC Benchmarks (NAS)

Dom0 VMM DomU


1.2
Normalized Execution Time

VM Native
BT 0.4% 0.2% 99.4%
1
CG 0.6% 0.3% 99.0%
0.8
EP 0.6% 0.3% 99.3%
0.6
FT 1.6% 0.5% 97.9%
0.4
IS 3.6% 1.9% 94.5%
0.2
LU 0.6% 0.3% 99.0%
0
BT CG EP FT IS LU MG SP MG 1.8% 1.0% 97.3%
SP 0.3% 0.1% 99.6%

• NAS Parallel Benchmarks achieve similar performance in


VM and native environment
Future work
• System-level support for better
virtualization
– Fully compatible implementation with
OpenIB/Gen2 interface (including SDP, MAD
service, etc. besides user verbs)
– Migration of OS-bypass interconnects
– Enhancement to file systems to support
effective image management in VM-based
environment
– Scalability studies
Future Work (Cont’d)
• Comparing with Native Check-point Restart in MPI
over InfiniBand
– MVAPICH2 – BLCR-based approach
• Service virtualization
– Migration of web services (e.g. virtualized http servers,
application servers, etc.)
– Designing service management protocols to enable
efficient coordination of services in a virtualized system
• Designing higher level services with virtualization
– Providing QoS and prioritization at all levels: network,
application and system
– Enabling fine-grained application scheduling for better
system utilization
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ Æ Projects
Data Centers – Issues and
Challenges
Tier 1 Tier 2 Tier 3
Storage
Routers/Servers Application Server Database Server

Routers/Servers Application Server Database Server

Routers/Servers Switch Application Server Switch Database Server Switch

. . .
. . .
Routers/Servers Application Server Database Server

• Client requests come over TCP (WAN)


• Traditionally TCP requests have been forwarded through multi-tiers
• Higher response time and lower throughput
• Can performance be improved with IBA?
• High Performance TCP-like communication over IBA
• TCP Termination
Multi-Tier Commercial Data-Centers:
Can they be Re-architected?
Web-server
Proxy (Apache)
Clients
Server
More Complexity and Overhead
Storage

WAN
WAN

More Number of Requests !


Database
Application Server
Server (PHP) (MySQL)

• Proxy: Caching, load balancing, resource monitoring, etc.


• Application server performs the business logic
– Retrieves data from the database to process the requests
Data-Center Communication
• Data Center Services require efficient
Communications
– Resource Management
– Cache and Shared State Management
– Support for Data-Centric Application like Apache
– Load Monitoring
• High performance and feature requirements
– Utilizing Capabilities of networks like InfiniBand
• Providing effective mechanisms - A challenge!!
SDP Architectural Model
Traditional Model Possible SDP Model
SDP
Sockets App Sockets Application OS
Modules
Sockets API Sockets API Infiniband
Hardware
Sockets

User User
Kernel
TCP/IP Sockets Kernel
TCP/IP Sockets Sockets Direct
Provider Provider Protocol

TCP/IP Transport TCP/IP Transport


Kernel
Driver Driver
Bypass
RDMA
Driver Driver
Semantics

InfiniBand CA InfiniBand CA
Source: InfinibandSM Trade Association 2002
Our Objectives
• To study the importance of the communication
layer in the context of a multi-tier data center
– Sockets Direct Protocol (SDP) vs. IPoIB
• Explore whether InfiniBand mechanisms can help
designing various components of datacenter
efficiently
– Web Caching/Coherency
– Re-configurability and QoS
– Load Balancing, I/O and File systems, etc.
• Studying workload characteristics
• In memory databases
3-Tier Datacenter Testbed at OSU
Caching Web Apache
Servers

Tier 3
Tier 1
Database
Clients Proxy Nodes Servers
MySQL/

Tier 2 DB2

Application
Servers

File System
Generate requests for TCP Termination evaluation
both web servers and Load Balancing Caching Schemes
database servers. Caching

PHP Dynamic Content


Caching
Persistent Connections
SDP vs. IPoIB: Latency and
Bandwidth (3.4 GHz PCI-Express)
Latency Bandwidth

100 7000
90
6000
80
5000

Bandwidth (Mbps)
70
Latency (us)

60
4000
50
40 3000

30 2000
20
1000
10
0 0
1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K
Message Size (bytes) Message Size (bytes)

IPoIB SDP IPoIB SDP

SDP enables high bandwidth (up to 750 MBytes/sec or 6000


Mbps), low latency (21 µs) message passing
Cache Coherency and Consistency
with Dynamic Data

User Request
Proxy Node Back-End
Cache Data
Update

Proxy Nodes
Back-End Nodes

Update
User Requests
Active Caching Performance
Throughput: ZipF Distribution Throughput: Worldcup Trace
800 2500

700
Transactions per Second (TPS)

Transactions per Second (TPS)


2000
600
No Cache
500 1500 TCP/IP
No Cache
TCP/IP RDMA
400
RDMA SDP
300 SDP 1000

200
500
100

0
0

0
10
20
30
40
50
60
70
80
90
100
200
0
0
10
20
30
40
50
60
70
80
90
0

10
20

Number of Threads Number of Threads

“Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-tier Data-centers over InfiniBand”, S. Narravula,
P. Balaji, K. Vaidyanathan and D. K. Panda. IEEE International Conference on Cluster Computing and the Grid (CCGrid) ’05.

“Supporting Strong Coherency for Active Caches in Multi-tier Data-Centers over InfiniBand”, S. Narravula, P. Balaji, K.
Vaidyanathan, K. Savitha, J. Wu and D. K. Panda. Workshop on System Area Networks (SAN); with HPCA ’03.
Dynamic Re-configurability
and QoS
• More datacenters are using dynamic data
• How to decide the number of proxy nodes vs. application
servers
• Current approach
– Use a fixed distribution
– Incorporate over-provisioning to handle dynamic data
• Can we design a dynamic re-configurability scheme with shared
state using RDMA operations?
– To allow a proxy node work like app server and vice versa as needed
– Allocates resources as needed
– Over-provisioning need not be used
• Scheme can be used for multi web-site hosting servers
Dynamic Reconfigurability in
Shared Multi-tier Data-Centers
Load Balancing
Cluster (Site A) Website A

Servers
Clients

Load Balancing
Cluster (Site B) Website B

WAN Servers

Clients
Load Balancing
Cluster (Site C) Website C

Servers

Nodes reconfigure themselves to highly loaded websites at run-time


Dynamic Re-configurability
with Shared State using RDMA Operations
60000 5

N um ber of busy nodes


50000 4
40000
3
TPS

30000
2
20000

10000 1

0 0
512 1024 2048 4096 8192 16384 0 7491 14962 22423 29901 37345
Burst Length Iterations
Rigid Reconf Over-Provisioning Reconf Node Utilization Rigid Over-provisioning
Performance of dynamic reconfiguration For large burst of requests, dynamic
scheme largely depends on the burst length reconfiguration scheme utilizes all idle
of requests nodes in the system
P. Balaji, S. Narravula, K. Vaidyanathan, S. Narravula, H. -W. Jin, K. Savitha and D. K.
Panda, Exploiting Remote Memory Operations to Design Efficient Reconfigurations for Shared
Data-Centers over InfiniBand, RAIT ’04
QoS meeting capabilities
Hard QoS Meeting Capability (High Priority Requests) Hard QoS Meeting Capability (Low Priority Requests)

100% 100%
90% 90%
80% 80%
% o f t im e s Q o S m e t

% o f t im e s Q o S m e t
70% 70%
60% 60%
50% 50%
40% 40%
30% 30%
20% 20%
10% 10%
0% 0%
Case 1 Case 2 Case 3 Case 1 Case 2 Case 3

Reconf Reconf-P Reconf-PQ Reconf Reconf-P Reconf-PQ

Re-configurability with QoS and Prioritization handles the QoS meeting


capabilities of both high and low priority requests for all cases
P. Balaji, S. Narravula, K. Vaidyanathan, H. -W. Jin and D. K. Panda, On the Provision of
Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data Centers over InfiniBand,
Int’l Symposium on Performance Analysis of Systems and Software (ISPASS 05)
Other Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ Æ Projects
Presentation Outline

• Systems Area in the Department


• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
NBCLab Infrastructure
• Multiple Clusters (around 500 processors in total)
– 128-processor
• 32 dual Intel EM64T processor
• 32 dual AMD Opteron
• InfiniBand DDR on all nodes (32 nodes with dual rail)
– 128-processor (Donation by Intel for 99 years)
• 64 dual Intel Pentium 2.6 GHz processor
• InfiniBand SDR on all nodes
– 32-processor (Donation by Appro, Mellanox and Intel)
• 16 dual Intel EM64T blades
• InfiniBand SDR and DDR on all nodes
– 28-processor (Donation by Sun)
• 12 dual AMD Opteron
• 1 quad AMD Opteron
– 32-processor
• 16 dual-Pentium 300 MHz nodes
• Myrinet, Gigabit Ethernet, and GigaNet cLAN
– 96-processor
• 16 quad-Pentium 700 MHz nodes
• 16 dual-Pentium 1.0 GHz nodes
• Myrinet and GigaNet cLAN
– 32-processor
• 8 dual-Xeon 2.4 GHz nodes
• 8 dual-Xeon 3.0 GHz nodes
• InfiniBand, Quadrics, Ammasso (1GigE), Level 5(1GigE), Chelsio (10GigE), Neteffect (10GigE)
– Many other donated systems from Intel, AMD, Apple, IBM, SUN, Microway and QLogic
• Other test systems and desktops
– 18 dual GHz nodes
NBCLab Clusters
High-End Computing and Networking
Research Testbed for Next Generation
Data Driven, Interactive Applications

PI: D. K. Panda
Co-PIs: G. Agrawal, P. Sadayappan, J. Saltz
and H.-W. Shen

Other Investigators: S. Ahalt, U. Catalyurek, H. Ferhatosmanoglu,


H.-W. Jin, T. Kurc, M. Lauria, D. Lee, R. Machiraju, S. Parthasarathy,
P. Sinha, D. Stredney, A. E. Stutz, and P. Wyckoff

Dept. of Computer Science and Engineering, Dept. of Biomedical


Informatics, and Ohio Supercomputer Center

The Ohio State University

Total funding: $3.01M ($1.53M from NSF + $1.48 from Ohio State
Board of Regents and Various units in OSU)
Experimental Testbed Installed
BMI OSC
10.0 GigE 2x10=20 GigE Mass Storage System
switch 40 GigE (Yr4) 500 TBytes
(Existing)
70-node Memory Cluster with
512 GBytes memory, 24 TB disk 10.0 GigE
GigE, and InfiniBand SDR switch
Upgrade (Yr4)

2x10=20 GigE
40 GigE (Yr4)

CSE 10.0 GigE


switch
64-node Compute Cluster with
InfiniBand DDR and 4TB disk
10 GigE on some (to be added) 20 Wireless Clients &
Upgrade (Yr4) 3 Access Points
Upgrade (Yr4)
Graphics Adapters Video wall
and Haptic Devices
Collaboration among the
Components and Investigators
Saltz, Stredney, Sadayappan
Data Intensive Machiraju, Parthasarathy,
Applications Catalyurek, and Other OSU
collaborators

Data Intensive Shen, Agrawal, Machiraju,


Algorithms and Parthasarathy,

Programming Systems Saltz, Agrawal, Sadayappan,


and Scheduling Kurc, Catalyurek, Ahalt, and
Hakan

Networking, Panda, Jin, Lee, Lauria,


Communication, Sinha, Wyckoff, and
QoS, and I/O Kurc
Wright Center for Innovation
(WCI)
• A new funding to install a larger cluster
with 64 nodes with dual dual-core
processors (up to 256 processors)
• Storage nodes with 40 TBytes of space
• Connected with InfiniBand DDR
• Focuses on Advanced Data Management
Presentation Outline

• Systems Area in the Department


• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
788.08P course
• Network-Based Computing
• Covers all the issues in detail and develops the
foundation in this research area
• Last offered in Wi ‘06
– http://www.cis.ohio-state.edu/~panda/788/
– Contains all papers and presentations
• Will be offered in Wi ’08 again
888.08P course
• Research Seminar on Network-Based Computing
• Offered every quarter
• This quarter (Au ’06) offering
– Mondays: 4:30-6:00pm
– Thursdays: 3:30-4:30pm
• Project-based course
• 3-credits
• Two-quarter rule
Presentation Outline

• Systems Area in the Department


• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Group Members
• Currently 12 PhD students and One MS student
– Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar,
Ping Lai, Amith Mamidala, Sundeep Narravul, Ranjit
Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik
Vaidyanathan and Abhinav Vishnu
• All current students are supported as RAs
• Programmers
– Shaun Rowland
– Jonathan Perkins
• System Administrator
– Keith Stewart
Publications from the Group

• Since 1995
– 11 PhD Dissertations
– 15 MS Theses
– 180+ papers in prestigious international conferences and
Journals
– 5 journals and 24 conference/workshop papers in 2005
– 2 journals and 21 conference in 2006 (so far)
– Electronically available from NBC web page
• http://nowlab.cse.ohio-state.edu -> Publications
Student Quality and Accomplishments

• First employment of graduated students


– IBM TJ Watson, IBM Research, Argonne National Laboratory,
OakRidge National Laboratory, SGI, Compaq/Tandem, Pacific
Northwest National Lab, Argonne National Lab, ASK Jeeves, Fore
Systems, Microsoft, Lucent, Citrix, Dell
• Many employments are by invitation
• Several of the past and current students
– OSU Graduate Fellowship
– OSU Presidential Fellowship (’95, ’98, ’99, ’03 and ’04)
– IBM Co-operative Fellowship (’96, ’97, ’02 and ’06)
– CIS annual research award (’99, ’02, ’04 (2), ’05 (2), ’06)
– Best Papers at conferences (HPCA ’95, IPDPS ’03, Cluster ’03, Euro
PVM ’05, Hot Interconnect ’04, Hot Interconnect ’05, ..)
• Big demand for summer internships every year
– IBM TJ Watson, IBM Almaden, Argonne, Sandia, Los Alamos, Pacific
Northwest National Lab, Intel, Dell, ..
– 7 internships this past summer (Argonne (2), IBM TJ Watson, IBM,
Sun, LLNL, HP)
In News

• Research in our group gets quoted in press


releases and articles in a steady manner
• HPCwire, Grid Today, Supercomputing Online, AVM
News
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
-> In News
Visit to Capitol Hill (June 2005)

• Invited to Capitol Hill (CNSF)


to demonstrate the benefits
of InfiniBand networking
technology to
– Congressmen
– Senators
– White House Director
(Office of Science and
Technology)

More details at http://www.cra.org/govaffairs/blog/archives/000101.htm


Looking For ….
• Motivated students with strong interest in different aspects of
network-based computing
– architecture
– networks
– operating systems
– algorithms
– applications
• Should have good knowledge in Architecture (775, 875), High
Performance/Parallel Computing (621, 721), networking (677, 678),
operating systems (662)
• Should have or should be willing to develop skills in
– architecture design
– networking
– parallel algorithms
– parallel programming
– experimentation and evaluation
Presentation Outline

• Systems Area in the Department


• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Conclusions
• Network-Based computing is an emerging trend
• Small to medium scale network-based computing systems are
getting ready to be used
• Many open research issues need to be solved before building
large-scale network-based computing systems
• Will be able to provide affordable, portable, scalable, and
..able computing paradigm for the next 10-15 years
• Exciting area of research
Web Pointers

NBC home page

http://www.cse.ohio-state.edu/~panda/
http://nowlab.cse.ohio-state.edu/
E-mail: panda@cse.ohio-state.edu
Acknowledgements

Our research is supported by the following organizations


• Current Funding support by

• Current Equipment support by

99