
Eco-Computing

Ralf Gruber, Vincent Keller EPFL, Lausanne

RG, March 7, 2006


Types of parallel applications

High communication needs: Multicast dominant ones (FFT, ...)
Medium communication needs: Point-to-point dominant ones (EF, ...)
Low communication needs: Master-slave ones (Data server, ...)


Characteristic parameters of an application component*

O: Number of operations per node [Flops]
W: Number of main memory accesses per node [Words]
Z: Number of messages to be sent per node
S: Number of words sent by one node [Words]

Va = O/W: Number of operations per memory access [Flops/Word]
γa = O/S: Number of operations per word sent [Flops/Word]

*Suppose the parallel components are well balanced


Characteristic parameters of a computational node

R: Peak performance of a node [Flops/s]
M: Peak main memory bandwidth of a node [Words/s]

VM = R/M: Number of operations per memory access [Flops/Word]
ra = min(R, M*Va): Peak performance of an application component [Flops/s]
tc = O/ra: Total computation time [s]
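The node model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the operation count O = 1e12 and the value Va = 20 in the usage lines are assumed example values.

```python
def component_peak(R, M, Va):
    """ra = min(R, M*Va): peak performance of an application component [Flops/s]."""
    return min(R, M * Va)


def computation_time(O, R, M, Va):
    """tc = O/ra: total computation time [s] for O operations per node."""
    return O / component_peak(R, M, Va)


# Example node: R = 5.6 GF/s, M = 0.8 GW/s, hence VM = R/M = 7
R, M = 5.6e9, 0.8e9

ra_sparse = component_peak(R, M, Va=1)   # Va < VM: limited by memory bandwidth
ra_dense = component_peak(R, M, Va=20)   # Va > VM: limited by processor peak (Va = 20 is assumed)
tc = computation_time(O=1e12, R=R, M=M, Va=1)  # O = 1e12 is an assumed workload

print(ra_sparse, ra_dense, tc)
```

Whenever Va is smaller than VM, the memory bandwidth M and not the peak performance R fixes the achievable rate.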


Efficiency of sparse matrix*vector operation (Va=O/W=1)

Dominant cpu time operation for explicit time evolution codes
cpu time dominated by memory access
70% of all EPFL applications are of this type

Machine             P4      Xeon/64  Xeon dual/64  Laptop  Laptop
Frequency f [GHz]   2.8     2.8      3.6           0.8     2.0
R [GF/s]            5.6     5.6      15.2          1.6     4.0
M [GW/s]            0.8     0.8      0.8           0.533   0.533
VM = R/M            7       7        19            3       7.5
ra = Va*M [GF/s]    0.8     0.8      0.8           0.533   0.533
r (real) [GF/s]     0.410   0.380    0.368         0.230   0.256
em = r/ra           0.51    0.47     0.46          0.43    0.48
ep = r/R            0.073   0.068    0.024         0.144   0.064
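The two efficiency rows can be recomputed from R, M and the measured rate r. The sketch below is only a consistency check under the slide's assumption Va = 1; the machine labels for the two laptops (by clock frequency) and the function name are chosen here for illustration.

```python
# (machine, R [GF/s], M [GW/s], measured r [GF/s]) taken from the table
machines = [
    ("P4", 5.6, 0.8, 0.410),
    ("Xeon/64", 5.6, 0.8, 0.380),
    ("Xeon dual/64", 15.2, 0.8, 0.368),
    ("Laptop 0.8 GHz", 1.6, 0.533, 0.230),
    ("Laptop 2.0 GHz", 4.0, 0.533, 0.256),
]


def efficiencies(R, M, r, Va=1.0):
    """Return (em, ep): efficiency relative to ra = min(R, M*Va) and to the peak R."""
    ra = min(R, M * Va)
    return r / ra, r / R


results = {name: efficiencies(R, M, r) for name, R, M, r in machines}

for name, (em, ep) in results.items():
    print(f"{name:15s} em = {em:.2f}  ep = {ep:.3f}")
```

All machines reach roughly half of the memory-bound rate ra, while the fraction of the processor peak R varies by a factor of six.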



Total costs for Matmult operation


Machine                          Xeon/64        Xeon/64 (1)    Xeon/64      P4      PM
Node                             server single  server single  server dual  server  Laptop (2)
P (3) [Watt]                     2*200          2*140          2*400        2*200   2*45 (5)
Em [kWh/y]                       3500           2440           7000         3500    790
Energy costs over 4 y (4) [CHF]  1400           986            2800         1400    316
List price [CHF]                 2139           2709           3809         1800    1440 (2)
r (real) [GF/s]                  0.380          0.380          0.368        0.410   0.256
Total [CHF per GF/s]             9200           9700           18000        7800    6850

(1) low voltage
(2) incl. screen, Microsoft, keyboard, battery, etc.
(3) electric consumption + cooling
(4) 0.10 CHF/kWh
(5) estimated using P ~ f^2
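The last row of the table can be reproduced, to within rounding, from the other rows. The sketch below assumes continuous operation (8760 hours per year), 4 years of use and 0.10 CHF/kWh, and reads the laptop entries as 2*45 W and CHF 1440 with the footnote markers separated from the numbers. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760   # assumes the machine runs all year
YEARS = 4               # period used for the energy costs
CHF_PER_KWH = 0.10      # electricity price from the footnote


def total_cost_per_gflops(power_watt, list_price_chf, r_gflops):
    """(list price + energy cost over YEARS) divided by the measured rate r."""
    energy_kwh_per_year = power_watt * HOURS_PER_YEAR / 1000.0
    energy_cost_chf = energy_kwh_per_year * YEARS * CHF_PER_KWH
    return (list_price_chf + energy_cost_chf) / r_gflops


p4 = total_cost_per_gflops(2 * 200, 1800, 0.410)      # P4 server
laptop = total_cost_per_gflops(2 * 45, 1440, 0.256)   # Pentium M laptop

print(round(p4), round(laptop))
```

The factor 2 in the power entries accounts for cooling, so the energy bill of a server over four years is comparable to its list price.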


First conclusions

Green is cheap

fm-computing (fm = frequency modulation) can reduce energy consumption


Today: 700,000,000 PCs installed worldwide

Suppose: 200,000,000 PCs switched on = 40 GW = 40 nuclear power plants (!!)
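The estimate implies two round numbers that the slide does not state explicitly: the power drawn per PC and the output of one power plant. A short check, with variable names chosen here for illustration:

```python
pcs_switched_on = 200_000_000
total_power_watt = 40e9        # 40 GW, as stated on the slide
power_plants = 40

watt_per_pc = total_power_watt / pcs_switched_on     # implied draw per PC
watt_per_plant = total_power_watt / power_plants     # implied output per plant

print(watt_per_pc, watt_per_plant)
```

The figures correspond to 200 W per PC and 1 GW per plant.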


Power consumption for Linpack

[Figure: peak Gflops/Watt (axis from 0 to 0.8) for the Top10 machines of June 2005, data from Jack Dongarra. Machine labels with processor counts: Livermore 65536, IBM 40960, Groningen 12288, Tokyo 8192, Lausanne 8192 (BlueBrain@EPFL), NASA 10160, Barcelona 4800, Livermore 4096, Sandia 5000, ES 5120. Architecture labels: BlueGene, JS20, Itanium, Opteron. A laptop based supercomputer (1 GHz, 3.2 GB/s, VM=5) is shown for comparison.]


Power consumption for Matmult

[Figure: Gflops/Watt for Matmult (axis from 0 to 0.06). Systems shown: BlueGene, laptop based supercomputer (1 GHz, 3.2 GB/s), Xeon low voltage, Xeon.]


Characteristic parameters of the internode network

P: Number of nodes in a machine
C: Total network bandwidth of a machine [Words/s]
L: Latency of the network [s]
<d>: Average distance

VC = P*R/C: Number of operations per sent word [Flops/Word]
b = C/(P*<d>): Inter-node communication bandwidth per node [Words/s]
tb = S/b: Time needed to send S words through the network [s]
tL = L*Z: Latency time [s]
T = tc + tb + tL: Total turn around time of an application component*
γM = (ra/b)*(1 + L*Z/S): Number of operations per word sent [Flops/Word]
B = 4*L*C/P: Message size taking L

*for simplicity: I/O is not considered and communication cannot be hidden behind computation
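The turn around time T = tc + tb + tL can be evaluated directly from these definitions. The sketch below is illustrative only: the function name is hypothetical, the average distance <d> = 1 is an assumption, and the application numbers (O, S, Z) are made up. The machine numbers correspond to a GbE cluster with P = 120 nodes, C = 3.75 GW/s and a latency read as 60 microseconds.

```python
def turnaround_time(O, S, Z, ra, C, P, L, d=1.0):
    """Return (tc, tb, tL, T) in seconds for one application component.

    d is the average distance <d>; d = 1.0 is an assumption of this sketch.
    """
    b = C / (P * d)   # inter-node communication bandwidth per node [Words/s]
    tc = O / ra       # computation time
    tb = S / b        # time to send S words through the network
    tL = L * Z        # latency time for Z messages
    return tc, tb, tL, tc + tb + tL


# Assumed application: 1e9 operations, 1e6 words sent in 100 messages per node,
# running at ra = 0.8 GF/s (memory bound, Va = 1).
tc, tb, tL, T = turnaround_time(O=1e9, S=1e6, Z=100, ra=0.8e9,
                                C=3.75e9, P=120, L=60e-6)

print(f"tc = {tc:.3f} s, tb = {tb:.3f} s, tL = {tL:.3f} s, T = {T:.3f} s")
```

Because communication is assumed not to overlap with computation, the three contributions simply add.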

Cluster    Site      Vendor  Node       Procs/node  Network 1
NoW        LIN-EPFL  Logics  Pentium 4  1           FE Bus
Pleiades1  STI-EPFL  Logics  Pentium 4  1           FE switch
Pleiades2  STI-EPFL  DELL    Xeon 64    1           GbE switch
Mizar      DIT-EPFL  Dalco   Opteron    2           Myrinet
BlueGene   DIT-EPFL  IBM     Power 4    2           Torus
Horizon    CSCS      Cray    Opteron    1           Torus
SX-5       CSCS      NEC     vector     1           SMP
Regatta    CSCS      IBM     Power 4    1           Colony

Network 2: Fat Tree

Cluster    P     R [GF/s]  P*R [GF/s]  M [GW/s]  VM [F/W]  C [GW/s]  VC [F/W]  L [μs]  B
NoW        10    6         60          0.8       7.5       0.0032    19200     60      750
Pleiades1  132   5.6       739         0.8       7         0.4       1792      60      750
Pleiades2  120   5.6       672         0.8       7         3.75      179       60      7500
Mizar      160   9.6       1536        1.6       6         10        154       10      2500
BlueGene   4096  5.6       22937       0.7       8         1065      22        2.5     4800
Horizon    1100  5.2       5720        0.8       6.5       1760      3.3       6.8     52000
SX-5       16    8         128         8         1         128                 10*     640*
Regatta    256   5         1300        0.4       12        16        80                12


Tailoring clusters to applications through parameterisation

Γ model

$\Gamma = \gamma_a / \gamma_M$

Task/application: $\gamma_a = O / S$ [flops/64bit word]
Machine (if $LZ/S \ll 1$): $\gamma_M = r_a / b$ [flops/64bit word]

Speedup: $A = \frac{P}{1 + 1/\Gamma}$
Efficiency: $E = \frac{A}{P} = \frac{1}{1 + 1/\Gamma}$

$\Gamma > 1 \Rightarrow A > P/2,\ E > 50\%$
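Reading the speedup as A = P/(1 + 1/Γ) and the efficiency as E = A/P = 1/(1 + 1/Γ), the threshold Γ = 1 is exactly the point where half of the time goes into communication. A minimal sketch with hypothetical function names:

```python
def speedup(P, gamma):
    """A = P / (1 + 1/Gamma) for P nodes."""
    return P / (1.0 + 1.0 / gamma)


def efficiency(gamma):
    """E = A / P = 1 / (1 + 1/Gamma)."""
    return 1.0 / (1.0 + 1.0 / gamma)


for gamma in (0.25, 1.0, 3.0, 10.0):
    print(f"Gamma = {gamma:5.2f}  E = {efficiency(gamma):.2f}  "
          f"A(P=8) = {speedup(8, gamma):.2f}")
```

For Γ = 1 the efficiency is 50% and the speedup on 8 nodes is 4. Larger Γ pushes E towards 1, smaller Γ means that adding nodes mostly adds communication.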

LAUTREC: Car-Parrinello molecular dynamics (M. Stengel)

Swiss-T1 + Tnet:
Γ-model: γM = 100, γa = 330, Γ = 3.3
Measured: Γ = 3.6 -> 25% of overall time is due to communication, 75% is due to computation
Latency does not play a role

Swiss-T1 + Fast Ethernet:
Γ-model: γM = 1333, γa = 330, Γ = 0.25
Measured: Γ = 0.25 -> 20% of overall time is due to computation, 80% is due to communication
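The two LAUTREC cases follow from the ratio of the application parameter (330) to the machine parameter (100 for Tnet, 1333 for Fast Ethernet). The sketch below assumes that this ratio Γ equals computation time divided by communication time, so that the computation share of the total time is Γ/(1 + Γ). The function names are hypothetical.

```python
def gamma(gamma_a, gamma_m):
    """Gamma = gamma_a / gamma_M (application over machine parameter)."""
    return gamma_a / gamma_m


def time_split(g):
    """Return (computation share, communication share) of the total time."""
    computation = g / (1.0 + g)
    return computation, 1.0 - computation


tnet = gamma(330, 100)            # Swiss-T1 + Tnet
fast_ethernet = gamma(330, 1333)  # Swiss-T1 + Fast Ethernet

for label, g in (("Tnet", tnet), ("Fast Ethernet", fast_ethernet)):
    comp, comm = time_split(g)
    print(f"{label:14s} Gamma = {g:.2f}  computation = {comp:.0%}  communication = {comm:.0%}")
```

With the measured Γ = 3.6 on Tnet the same split gives a communication share of a little over 20%, which the slide rounds to 25%.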

DS: Turbulence simulation (2/3 parallelised, E. Leriche)

[Figure: speedup (axis from 0.8 to 1.8) versus number of nodes P (1 to 7), measured and model curves for the GbE Xeon cluster (Pleiades 2), the Myrinet Opteron cluster (Mizar) and the Cray XT3.]


DS: Scalability/strategy model (totally parallelised)

Machine   Pentium4  Xeon  Itanium  Opteron  SX-5  SX8
t [sec]   4.37      3.42  3.26     4.13     1.50  0.75

Machine labels: Itanium2/8, SX-8, SX-5, Cray XT3, Altix, Myrinet Opteron cluster, GbE Xeon cluster, FE P4 cluster

Get result in              4 months      1 month                              1 week          1 day
Capability: expensive?     all machines  SX-8 (1), Itanium2 (4), Horizon (8)  SX-8 (4 procs)  no machine
Capacity: cheap?           one node      Itanium2 (4)                         SX-8 (4 procs)
Computer investment costs  CHF 200.-     CHF 1000.-                           CHF 5000.-


Automatic parameterisation through monitoring


Pleiades: SAR user profile statistics 1.1.-31.3.2005 (Ganglia)

Γ<1: E<50%
Γ>1: E>50%
<E>=0.64, <E>=0.82
I/O or MPI?
To machine with faster communication

Pleiades: Monitoring of single jobs

Average message size = 300 B
Dominated by latency
GbE does not help
Need remote store MPI

CFD: Γ=1
Plasma physics: Γ=7


Intelligent (Grid) Scheduling System (ISS)

Goal: Submit an application to the most suited computer architecture

Database: All data on machines and applications delivered by Ganglia and the metascheduler

Cost function: Includes all strategic components on machine status and application behaviour to decide where to submit

Job submission: Through Unicore/metascheduler
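The slides do not specify the ISS cost function. As a toy illustration of the idea only, the sketch below ranks machines by a predicted turn around time T = tc + tb + tL plus a queue waiting time standing in for "machine status". Everything here is hypothetical: the class and function names, the `queue_wait` field, the average distance d = 1 and the application numbers. The machine parameters are the Pleiades2 and Mizar rows of the cluster table, with latencies read in microseconds.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    R: float                 # peak performance per node [Flops/s]
    M: float                 # memory bandwidth per node [Words/s]
    C: float                 # total network bandwidth [Words/s]
    P: int                   # number of nodes
    L: float                 # network latency [s]
    d: float = 1.0           # average distance <d> (assumed)
    queue_wait: float = 0.0  # hypothetical machine status: expected wait [s]


@dataclass
class Application:
    O: float  # operations per node
    W: float  # memory accesses per node
    S: float  # words sent per node
    Z: float  # messages sent per node


def cost(app, m):
    """Toy cost: queue wait plus predicted T = tc + tb + tL (not the real ISS function)."""
    ra = min(m.R, m.M * app.O / app.W)
    b = m.C / (m.P * m.d)
    return m.queue_wait + app.O / ra + app.S / b + m.L * app.Z


def best_machine(app, machines):
    """Pick the machine with the lowest toy cost."""
    return min(machines, key=lambda m: cost(app, m))


app = Application(O=1e9, W=1e9, S=1e7, Z=1e4)  # assumed example job
pleiades2 = Machine("Pleiades2", R=5.6e9, M=0.8e9, C=3.75e9, P=120, L=60e-6)
mizar_idle = Machine("Mizar", R=9.6e9, M=1.6e9, C=10e9, P=160, L=10e-6)
mizar_busy = Machine("Mizar", R=9.6e9, M=1.6e9, C=10e9, P=160, L=10e-6, queue_wait=2.0)

print(best_machine(app, [pleiades2, mizar_idle]).name)
print(best_machine(app, [pleiades2, mizar_busy]).name)
```

In this toy setting the faster machine wins when it is idle, and the slower one is chosen once the waiting time on the faster machine outweighs its shorter run time. A real cost function would also have to weigh in the other strategic components the slide mentions.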

Intelligent (Grid) Scheduling System (ISS)


Team: EPFL, CSCS, EIF, Jülich, Fraunhofer

Intelligent Grid Scheduling System (ISS): The planned testbeds

EPFL: SMP/NUMA, high γM cluster
ETHZ: SMP/NUMA, high γM cluster
CSCS: SMP/vector, low γM cluster
CERN: EGEE
EIF: NoW

[Diagram: the sites above drawn around the labels ISS and Switch]

SwissGrid initiative

Data intensive computing? Model has to be extended


Conclusions

Green is cheap
fm-computing can reduce energy consumption
Grid should include different types of parallel machines
Optimal Grid scheduling possible thanks to ISS
ISS enables application adapted Grids that have
  . economical advantages
  . ecological advantages
Capability/capacity -> parameterization
Top500 list: R, VM, VC

Needs to be done: Laptop based supercomputer
