
Development of the Grid Computing Infrastructure
Daniel J. Harvey

Acknowledgements
NASA Ames Research Center
Sunnyvale, California
Presentation Overview
 What is grid computing?
 Historical preliminaries
 Grid potential
 Technological challenges
 Current developments
 Future outlook
 Personal research
What is Grid Computing?
The grid is a hardware/software infrastructure
that enables heterogeneous geographically
separated clusters of processors to be connected
in a virtual environment.
 Heterogeneous: The computer configurations are
highly varied.
 Clusters: Generally high-performance computer configurations consisting of tens or hundreds of processors that interact through an interconnection network.
 Virtual: The users operate as if the entire grid were a single integrated system.
Motivation for Grids
 The need
 Solutions to many large-scale computational problems are not
feasible on existing supercomputer systems.
 Expensive equipment is not always available locally.
 There is a growing demand for applications such as virtual reality, data mining, and remote collaboration.
 Possible solutions
 Many very powerful systems are not fully utilized.
 Pooling resources is cost effective.
 Communication technology is progressing rapidly.
 The Internet has shown global interconnections to be effective.
Historical Preliminaries
 Existing infrastructures in society
 Power grid
 Transportation system
 Railroad system
 Requirements
 Universal access
 Standardized use
 Dependable
 Cost Effective
Early Testbed Project
Corporation for National Research Initiatives

 When: Early 1990s
 Where: Cross continent
 By whom: National Science Foundation (NSF) and Defense Advanced Research Projects Agency (DARPA)
 Funding: $15.8 million
 Purpose: Improved supercomputer utilization and investigation of gigabit network technologies
Early Testbed Examples
 AURORA – Link MIT and University of Pennsylvania at 622
Mbps using fiber optics
 BLANCA – Link AT&T Bell Labs, Berkeley, University of
Wisconsin, LBL, Cray Research, University of Illinois with the first
nationwide testbed.
 CASA – Link Los Alamos, Cal Tech, Jet Propulsion Laboratory, and the San Diego Supercomputer Center. High-speed point-to-point network.
 NECTAR – Link the Pittsburgh Supercomputing Center and Carnegie Mellon University. Interface LANs to high-speed backbones.
 VISTANET – Link the University of North Carolina, North Carolina State University, and MCNC. Prototype for a public switched gigabit network.
Early Results with CASA
 Chemical reaction modeling
 Achieved super linear speedup
 3.3 times faster on two supercomputers than the speed achieved on either supercomputer alone
 Matched application to supercomputer
characteristics
 Modeling of ocean currents
 Calculations processed at UCLA, real time
visualization at Intel, displayed at the SDSC
Lessons Learned
 Synchronization
Difficult, requiring tedious manual intervention
 Heterogeneity
Varied hardware, software, and operational procedures
 Fault Tolerance
 Critical for dependable service
 Poor performance
 Latency tolerance, redundant communication, load-
balance, communication protocol, contention, routing
 Security issues
The I-WAY Project
(IEEE Supercomputing ’95)

 Purpose
 Connect the individual testbeds
 Exploit available resources
 NSF, DOE, NASA networking infrastructure
 Supercomputers at dozens of laboratories
 Broadband national networks
 High-end graphics workstations
 Virtual Reality (VR) and collaborative environments
 Demonstrate innovative approaches to scientific
applications
Examples of I-WAY Demonstrations
 N-body galaxy simulation
 Coupled Cray, SGI, Thinking Machines, IBM
 Virtual reality results displayed in the CAVE at
Supercomputing 95.
 The CAVE is a room-sized virtual environment.
 Industrial emission control system
 Coupled a Chicago supercomputer with CAVEs in San Diego and Washington, D.C.
 Teleimmersion, collaborative environment
Grid Potential
Applications identified by I-WAY
 Large-scale scientific problems
 Simulations, data analysis
 Data Intensive Applications
 Data mining, distributed data storage
 Teleimmersion applications
 Virtual Reality, collaboration
 Remote instrument control
 Particle accelerators, medical imaging, feedback
control
Current Grid Networks
 National Technology Grid
 NASA Information Power Grid
 Globus Ubiquitous Testbed Organization (GUSTO)
 DataGrid Testbed
 Atmospheric ozone observation – large-scale data collection
 Fundamental particle research – remote instrumentation
 Genome research – data mining, code management, remote GUI interfaces
 Electronic Visualization Laboratory, University of Illinois at Chicago
 Teleimmersive applications
 Cactus Computational Toolkit, University of Illinois at Urbana-Champaign
 Supercomputer simulations – parallel execution and visualization
 Argonne National Laboratory
 Coupling the photon source to supercomputers – advanced visualization and computation
Technological Challenges
Operation
 Authentication
 Authorization
 Resource discovery
 Resource allocation
 Fault recovery
 Staging of executables
 Heterogeneity
 Instrumentation
 Auditing and accounting
Application Software
 Data storage
 Load balancing
 Latency tolerance
 Redundant communication
 Existing software
Connectivity
 Latency and bandwidth
 Communication protocol
Operation
 Authentication and Authorization
 Verify who is attempting to use available resources
 Verify that access to a particular resource is acceptable
 Resource discovery and allocation
 Published list of resources
 Updated resource status
 Fault recovery, synchronization, and staging of executables
 Dynamic reallocation
 Run time dependencies
 Heterogeneity and portability
 Different policies and resources on different systems
Application Software
 Data storage
 Access to distributed databases
 Load balancing
 Workload needs to be balanced across all systems
 Latency tolerance
 Streaming, overlap processing and communication
 Redundant communication
 Implementation of staged communication (see the sketch after this list)
 Existing software
 Preserving existing software as much as possible
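The staged-communication item above can be illustrated with a small sketch: instead of every processor in a cluster sending its own message across the slow wide-area link, one representative combines the local messages and sends them once. This is a simplified illustration only, not code from any grid toolkit; the payloads and the slow_link_send callback are assumptions made for the example.

```python
# Simplified sketch of staged (aggregated) communication.
# Hypothetical helper names; payloads are illustrative byte strings.
def stage_messages(local_msgs, is_representative, slow_link_send):
    """Combine a cluster's outgoing messages and send them once over the
    slow wide-area link, instead of one send per processor."""
    if is_representative:
        combined = b"".join(local_msgs)   # aggregate into a single message
        slow_link_send(combined)          # one wide-area transfer per cluster

# Toy usage: the representative forwards two local messages in one transfer.
stage_messages([b"halo-from-p0", b"halo-from-p1"],
               is_representative=True,
               slow_link_send=lambda payload: print(len(payload), "bytes sent once"))
```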
Connectivity
 Latency
 Latencies of Internet-based connections vary by as much as 1,000%
 Bandwidth
 Mixture of high-speed gigabit connections and slow local connections
 Protocol
 Minimize communication hops
 “guarantee of service” instead of “best effort”
 Adapted to application requirements
 Maintenance and cost
The Grid environment
 Transparent
 Present the look and feel of a single coupled system
 Incremental
 Minimal initial programming modifications required
 Additional grid benefits achieved over time
 Portable
 Available and easy to install on all popular platforms
 Grid based tools
 Minimize effort to implement grid aware applications
Current Developments
 Models for a grid computing
infrastructure
 CORBA (extending “off-the-shelf” technology)
 Legion (object oriented)
 Globus (toolkit approach)
 Commercial interest
 IBM has invested four billion dollars to utilize Globus
for implementation of grid-based computer farms
around the world.
 NASA is developing its Information Power Grid using
Globus for the framework.
 The European Union is developing the DataGrid.
CORBA
A set of facilities linked through “off-the-shelf” packages
 Client-server model
 Web-based technology
 Wrap existing code in Java objects
 Exploit object-level and thread-level parallelism
 Utilize current public-key security techniques
Legion
Vision: A single unified virtual machine that ties together millions of processors and objects

 Object Oriented
 Each object defines the rules for access
 Well defined object interfaces
 A core set of objects provide basic services
 Users can define and create their own objects
 Users and executing objects manipulate remote objects
 High Performance
 Users select hosts based on load and job affinity
 Object wrapping to support parallel programming.
 User Autonomy
 Users choose scheduling policies and security arrangements.
Globus
An open-architecture integrated “bag” of basic grid services

 Layered middleware architecture
 Builds global services on top of core local services
 Translucent interfaces to core Globus
services
 Well-defined interfaces that can be accessed directly
by applications.
 Interfaces that are indirectly accessed through
augmented software tools provided by developers
 Coexistence with existing applications
 System Evolution
 Incremental implementation
 Existing tools can be enhanced or replaced as needed
DataGrid Project
Based on Globus
Funded by the European Union
Globus Core Services
 Meta-computing directory service (MDS)
 Status of global resources
 Global security infrastructure (GSI)
 Authentication and authorization
 Global resource allocation manager (GRAM)
 Allocation of processors, instrumentation, memory, etc.
 Global execution management (GEM) - Staging executables
 Heartbeat monitor (HBM) - Fault detection and recovery
 Global access to secondary storage (GASS)
 Access to distributed data storage facilities
 Portable heterogeneous communication library (Nexus)
 Communication between processors, Parallel programming
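A purely conceptual sketch of how an application might chain these services follows. Every function and data structure here is an illustrative stand-in invented for this deck; the real services are C libraries and daemons in the Globus toolkit, and their actual APIs differ.

```python
# Conceptual stand-ins for the MDS, GRAM, and GEM roles described above.
# None of these functions are real Globus APIs; they only sketch the flow.

def mds_query(resources, min_cpus):
    """MDS role: return published resources whose status meets the request."""
    return [r for r in resources if r["up"] and r["free_cpus"] >= min_cpus]

def gram_allocate(resource, cpus):
    """GRAM role: reserve processors on the chosen resource."""
    resource["free_cpus"] -= cpus
    return {"site": resource["name"], "cpus": cpus}

def gem_stage(executable, allocation):
    """GEM role: stage the executable to the allocated site before it runs."""
    print(f"staging {executable} to {allocation['site']}")

# A toy published resource list, standing in for MDS status information.
resources = [{"name": "cluster-a", "free_cpus": 16, "up": True},
             {"name": "cluster-b", "free_cpus": 64, "up": True}]

candidates = mds_query(resources, min_cpus=32)      # resource discovery
allocation = gram_allocate(candidates[0], cpus=32)  # resource allocation
gem_stage("/path/to/solver", allocation)            # staging (hypothetical path)
```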
Future Outlook
 Successes
 The basic Globus approach has proven successful
 Acceptable bandwidths are within reach
 Challenges
 “speed of light” latency limitations
 Improved protocol schemes
 Less overhead, fewer hops, less susceptible to load
 Application specific
 Additional grid-based tools are needed
 Open Areas
 Which applications are suitable for grid implementation?
Personal Research
 What applications are suitable for grid solutions
 MinEX: A latency-tolerant dynamic partitioner for grid computing
applications, Special Issue Journal of the FGCS, 2002
 A Latency-tolerant partitioner for distributed computing on the
Information Power Grid, IPDPS proceedings, 2001
 Latency hiding in partitioning and load balancing of grid
computing applications, IEEE International Symposium CCGrid,
2001, Finalist for best paper
 Partitioning scheme to maximize grid-based
performance
 Performance of a heterogeneous grid partitioner for N-body
Applications, ICPP proceedings, 2002, under review
Adaptive Mesh Experiments

Helicopter rotor test
Three adaptations
13,967 - 137,474 vertices
60,968 - 137,414 tetrahedra
74,343 - 913,412 edges

Time-dependent shock wave
Nine adaptations
50,000 - 1,833,730 elements
Experimental Results
Simulated grids of 32 processors with varying clusters and interconnect speeds
1) Latency tolerance is more critical as clusters increase
2) New grid-based approaches are needed

Expected runtimes with no latency tolerance (thousands of units)
           INTERCONNECT SLOWDOWNS
Clusters      3      10     100    1000
   1        473     473     473     473
   2        728     863    1228    4102
   3        755    1168    2783   18512
   4        791    1361    3667   25040
   5        854    1649    5677   53912
   6        915    1717    8512   76169
   7        956    1915   10958   80568
   8        968    2178   11492   93566

Expected runtimes with maximum latency tolerance (thousands of units)
           INTERCONNECT SLOWDOWNS
Clusters      3      10     100    1000
   1        287     287     287     287
   2        298     469     763    3941
   3        322     548    2386   12705
   4        328     680    3297   21888
   5        336     768    4369   33092
   6        345     856    5044   52668
   7        352     893    5480   61079
   8        357    1048    5721   61321
Partitioning
 Motivation
 Avoid the possibility that some processors are overloaded while others are idle.
 Implementation
 Suspend the application
 Model the problem using a dual graph approach
 Utilize an available partitioner to balance load
 Move data sets between processors
 Resume the application
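As a rough sketch, the five implementation steps above form a repartitioning cycle. All names here (suspend, build_dual_graph, migrate_data, and so on) are hypothetical placeholders, not the interface of any particular partitioner or solver.

```python
# Hypothetical sketch of the repartitioning cycle listed above; the solver,
# partitioner, and helper callables are caller-supplied stand-ins.
def rebalance(solver, build_dual_graph, partitioner, migrate_data, processors):
    """One repartitioning cycle, driven entirely by caller-supplied pieces."""
    state = solver.suspend()                       # 1. suspend the application
    graph = build_dual_graph(state)                # 2. model the problem as a dual graph
    parts = partitioner(graph, len(processors))    # 3. let the partitioner balance the load
    migrate_data(state, parts, processors)         # 4. move data sets between processors
    solver.resume(state)                           # 5. resume the application
```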
Latency Tolerance
Minimize the effects of high grid latencies and limited bandwidth
 Overlap processing, communication, and data movement
 Minimize communication costs
 Eliminate redundancies
 Regionalize
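One common way to overlap processing and communication is sketched below with mpi4py's non-blocking sends and receives: the halo exchange is started, independent interior work is done while the messages are in flight, and only then does the code wait for the communication to finish. The array sizes and the ring-style neighbour choice are assumptions made for the example, not anything prescribed by the presentation.

```python
# Minimal latency-hiding sketch using non-blocking MPI (mpi4py).
# Run with, e.g., `mpiexec -n 4 python overlap.py`; sizes are illustrative.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size   # simple ring of neighbours

send_halo = np.full(1000, rank, dtype='d')    # boundary data for the neighbour
recv_halo = np.empty(1000, dtype='d')
interior = np.random.rand(100_000)            # data with no remote dependency

# Start the halo exchange, then compute on the interior while it is in flight.
requests = [comm.Isend(send_halo, dest=right, tag=0),
            comm.Irecv(recv_halo, source=left, tag=0)]
interior_sum = np.sqrt(interior).sum()        # latency-hiding computation

MPI.Request.Waitall(requests)                 # communication should be done by now
boundary_sum = recv_halo.sum()                # work that needed the remote data
print(f"rank {rank}: interior={interior_sum:.1f} boundary={boundary_sum:.1f}")
```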
Dual Graph Modeling
Example: Adaptive Meshes
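In a dual graph of a tetrahedral mesh, each element becomes a graph vertex and two elements that share a face are joined by an edge; the partitioner then works on this graph rather than on the mesh itself. The short sketch below builds such a graph from a list of tetrahedra; it is a generic illustration, and the input format is an assumption, not the data structure used by any particular partitioner.

```python
# Sketch of dual-graph construction: mesh elements -> graph vertices,
# shared faces -> graph edges.
from collections import defaultdict
from itertools import combinations

def dual_graph(tets):
    """tets: list of 4-tuples of mesh vertex ids. Returns adjacency sets."""
    face_to_elems = defaultdict(list)
    for elem, tet in enumerate(tets):
        for face in combinations(sorted(tet), 3):   # the 4 triangular faces
            face_to_elems[face].append(elem)
    adjacency = defaultdict(set)
    for elems in face_to_elems.values():
        if len(elems) == 2:                         # interior face shared by two elements
            a, b = elems
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

# Two tetrahedra sharing the face (1, 2, 3) become two connected graph vertices.
print(dual_graph([(0, 1, 2, 3), (1, 2, 3, 4)]))    # adjacency: 0 <-> 1
```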
Partitioner Deficiencies
 Partitioning and data set redistribution are
executed in separate steps.
 It is possible to achieve load balance and still incur high communication costs
 Data set redistribution is not minimized during
partitioning
 Application latency tolerance algorithms are
not considered
 The grid configuration is not considered
Research Goals
 Create a partitioning tool for the Grid
 Objective: minimize runtime instead of merely achieving a balanced workload (see the sketch after this list)
 Anticipate latency tolerance techniques
employed by solvers
 Map the partitioning graph onto the grid
configuration
 Combine partitioning and data-set
movement into one partitioning step
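To make the first goal concrete, a simplified, hypothetical cost model is sketched below: the quantity to minimize is the estimated runtime of the slowest processor (compute plus communication plus data movement), not merely the spread in workload. This illustrates the idea only; it is not the actual MinEX metric.

```python
# Hypothetical cost model illustrating "minimize runtime" as the objective.
# All parameters (speeds, bandwidths, latencies) are illustrative assumptions.
def estimated_runtime(parts, speed, bandwidth, latency):
    """parts[p] holds 'work', 'comm_bytes', 'moved_bytes' for processor p."""
    times = []
    for p, info in parts.items():
        compute = info['work'] / speed[p]
        comm = latency[p] + info['comm_bytes'] / bandwidth[p]
        migration = info['moved_bytes'] / bandwidth[p]
        times.append(compute + comm + migration)
    return max(times)          # the slowest processor determines the runtime

# A perfectly "balanced" workload can still be the slower choice when the
# communication or data movement it implies is expensive.
parts = {0: {'work': 100, 'comm_bytes': 8e6, 'moved_bytes': 0},
         1: {'work': 100, 'comm_bytes': 8e6, 'moved_bytes': 4e6}}
print(estimated_runtime(parts, speed={0: 1.0, 1: 1.0},
                        bandwidth={0: 1e6, 1: 1e6},
                        latency={0: 0.5, 1: 0.5}))
```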
The N-body problem
 Simulating the movement of N bodies
 Based on gravitational or electrostatic forces
 Applicable to many scientific problems
 Implementation of Solver
 Modified Barnes & Hut (Treat far clusters as single bodies)
 Present the dual graph to the MinEX and MeTiS partitioners at each time step
 Simulated Grid environment
 More practical than actually building grids
 Various configurations simulated by changing parameters
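Since the solver is based on a modified Barnes and Hut method, a compact 2-D sketch of the standard Barnes-Hut idea is given below: bodies are inserted into a quadtree, and a distant cell is treated as a single body located at its centre of mass whenever it passes an opening-angle test. The code is a generic textbook-style illustration (with an assumed opening angle of 0.5 and arbitrary units), not the modified solver used in these experiments.

```python
# Generic 2-D Barnes-Hut sketch: far quadtree cells are treated as single
# bodies at their centre of mass. Parameters (THETA, G) are arbitrary choices.
import numpy as np

THETA = 0.5          # opening-angle threshold: smaller = more accurate
G = 1.0              # gravitational constant in arbitrary units

class Cell:
    """Quadtree cell covering the square [cx-half, cx+half] x [cy-half, cy+half]."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass, self.com = 0.0, np.zeros(2)
        self.body = None          # index of the single body held by a leaf
        self.children = None      # four sub-cells once the leaf is split

    def _child_for(self, pos):
        return self.children[(pos[0] > self.cx) * 2 + (pos[1] > self.cy)]

    def insert(self, pos, m, idx):
        if self.children is None and self.body is None and self.mass == 0.0:
            self.body, self.mass, self.com = idx, m, pos.copy()   # empty leaf
            return
        if self.children is None:                                 # split the leaf
            h = self.half / 2
            self.children = [Cell(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]
            old_idx, old_pos, self.body = self.body, self.com.copy(), None
            self._child_for(old_pos).insert(old_pos, self.mass, old_idx)
        self.com = (self.com * self.mass + pos * m) / (self.mass + m)
        self.mass += m
        self._child_for(pos).insert(pos, m, idx)                  # assumes distinct positions

    def force_on(self, pos, idx):
        """Approximate force at pos; a far cell acts as one aggregated body."""
        if self.mass == 0.0 or (self.children is None and self.body == idx):
            return np.zeros(2)
        d = self.com - pos
        r = np.linalg.norm(d) + 1e-9
        if self.children is None or (2 * self.half) / r < THETA:
            return G * self.mass * d / r**3                       # far cluster as one body
        return sum((c.force_on(pos, idx) for c in self.children), np.zeros(2))

# Build a tree for 200 random unit-mass bodies and evaluate the forces once.
rng = np.random.default_rng(0)
positions = rng.uniform(-1.0, 1.0, size=(200, 2))
root = Cell(0.0, 0.0, 1.0)
for i, p in enumerate(positions):
    root.insert(p, 1.0, i)
forces = np.array([root.force_on(p, i) for i, p in enumerate(positions)])
print(forces.shape)
```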
Experimental Results
Comparisons: MinEX vs. MeTiS
Conclusions
- Small advantage at fast interconnects
- Effect of slow bandwidth completely hidden
- Huge advantage at the slower speeds
- Improved load balance achieved (Not Shown)
- Increased MinEX advantage on heterogeneous configurations
MeTiS runtimes (thousands of units)
           INTERCONNECT SLOWDOWNS
Clusters      3      10     100    1000
   1         16      16      16      16
   2         16      23      91     825
   3         16      23     109    1017
   4         16      23     119    1115
   5         16      23     123    1161
   6         16      23     128    1205
   7         16      23     131    1244
   8         16      23     132    1253

MinEX runtimes (thousands of units)
           INTERCONNECT SLOWDOWNS
Clusters      3      10     100    1000
   1         15      15      15      15
   2         15      15      40     304
   3         15      15      38     331
   4         16      15      51     372
   5         15      15      52     396
   6         16      16      74     391
   7         15      16      46     393
   8         16      15      70     405
Additional Information
 The Grid Blueprint for a new Computing Infrastructure
 Ian Foster and Carl Kesselman, ISBN 1-55860-475-8
 Morgan Kaufmann Publishers, 1999
 Semi-technical overview of grid computing
 589 references to publications

 http://www.globus.org
 Downloadable software
 Detailed description of Globus
 Personal Web Site (http://www.sou.edu/cs/harvey)
