Eco-Computing: Ralf Gruber, Vincent Keller EPFL, Lausanne

RG, March 7, 2006
Types of parallel applications
High communication needs: multicast-dominant applications (FFT, ...)
Medium communication needs: point-to-point-dominant applications (finite elements (EF), ...)
Low communication needs: master-slave applications (data server, ...)
Characteristic parameters of an application component*
O: Number of operations per node [Flops]
W: Number of main memory accesses per node [Words]
Z: Number of messages to be sent per node
S: Number of words sent by one node [Words]
Va = O/W: Number of operations per memory access [Flops/Word]
γa = O/S: Number of operations per word sent [Flops/Word]
*Suppose the parallel components are well balanced.
Characteristic parameters of a computational node
R: Peak performance of a node [Flops/s]
M: Peak main memory bandwidth of a node [Words/s]
VM = R/M: Number of operations per memory access [Flops/Word]
ra = min(R, M*Va): Peak performance of an application component [Flops/s]
tc = O/ra: Total computation time [s]
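The node parameters above combine directly into a computation-time estimate. The following is a minimal Python sketch; the function names are illustrative and not from the slides, and the operation count O = 1.6e9 is an assumed value. R = 5.6 GF/s, M = 0.8 GW/s and Va = 1 are the P4 figures quoted for the sparse matrix*vector case.

```python
def peak_app_performance(R, M, Va):
    """ra = min(R, M * Va): peak performance of an application component [Flops/s]."""
    return min(R, M * Va)


def computation_time(O, R, M, Va):
    """tc = O / ra: total computation time [s]."""
    return O / peak_app_performance(R, M, Va)


# P4 node: R = 5.6 GF/s, M = 0.8 GW/s; application with Va = 1 Flop/Word.
R, M, Va = 5.6e9, 0.8e9, 1.0
ra = peak_app_performance(R, M, Va)      # memory bound, because M * Va < R
tc = computation_time(1.6e9, R, M, Va)   # O = 1.6e9 operations (assumed)
print(f"ra = {ra:.3g} Flops/s, tc = {tc:.3g} s")
```

Because M*Va is smaller than R here, the memory bandwidth rather than the processor peak sets the attainable performance.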
Efficiency of sparse matrix*vector operation (Va=O/W=1)
Dominant CPU time operation for explicit time evolution codes
CPU time dominated by memory access
70% of all EPFL applications are of this type

Machine             P4      Xeon/64  Xeon dual/64  Laptop  Laptop
Frequency f [GHz]   2.8     2.8      3.6           0.8     2.0
R [GF/s]            5.6     5.6      15.2          1.6     4.0
M [GW/s]            0.8     0.8      0.8           0.533   0.533
VM = R/M            7       7        19            3       7.5
ra = Va*M [GF/s]    0.8     0.8      0.8           0.533   0.533
r (real) [GF/s]     0.410   0.380    0.368         0.230   0.256
em = r/ra           0.51    0.47     0.46          0.43    0.48
ep = r/R            0.073   0.068    0.024         0.144   0.064
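The two efficiency rows follow from the measured rate r, the application peak ra and the node peak R. As a rough check, the sketch below recomputes em and ep from the R, M and r values of the table; the dictionary layout and the function name `efficiencies` are assumptions made for illustration, and the two laptops are labelled by their clock frequency.

```python
# name: (R [GF/s], M [GW/s], measured r [GF/s]), copied from the table
machines = {
    "P4":             (5.6, 0.8, 0.410),
    "Xeon/64":        (5.6, 0.8, 0.380),
    "Xeon dual/64":   (15.2, 0.8, 0.368),
    "Laptop 0.8 GHz": (1.6, 0.533, 0.230),
    "Laptop 2.0 GHz": (4.0, 0.533, 0.256),
}
Va = 1.0  # sparse matrix*vector: one operation per memory access


def efficiencies(R, M, r, Va):
    """Return (em, ep): efficiency relative to ra and relative to R."""
    ra = min(R, M * Va)
    return r / ra, r / R


for name, (R, M, r) in machines.items():
    em, ep = efficiencies(R, M, r, Va)
    print(f"{name:15s} em = {em:.2f}  ep = {ep:.3f}")
```

Measured against the memory-bound peak ra, all machines reach between 40% and 50%, while ep varies much more because R differs widely between the machines.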

Total costs for Matmult operation

Machine                          Xeon/64        Xeon/64 (1)    Xeon/64      P4      PM
node                             single server  single server  dual server  server  Laptop (2)
P (3) [Watt]                     2*200          2*140          2*400        2*200   2*45 (5)
Em [kWh/y]                       3500           2440           7000         3500    790
energy costs (4), 4 y [CHF]      1400           986            2800         1400    316
List price [CHF]                 2139           2709           3809         1800    1440 (2)
r (real) [GF/s]                  0.380          0.380          0.368        0.410   0.256
total CHF/(GF/s)                 9200           9700           18000        7800    6850

(1) low voltage
(2) incl. screen, Microsoft, keyboard, battery, etc.
(3) electric consumption + cooling
(4) 0.10 CHF/kWh
(5) estimated using P ~ f^2
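The last row of the cost table can be approximated from the other rows. The sketch below assumes continuous operation (8760 hours per year), a period of 4 years, a tariff of 0.10 CHF/kWh and a factor of 2 for cooling, as suggested by the footnotes. It also reads the laptop column as 2*45 W and CHF 1440, with the trailing digits taken as footnote marks. The function name `cost_per_gflops` is illustrative.

```python
HOURS_PER_YEAR = 8760   # assumption: the machine runs all year
YEARS = 4               # period used for the energy costs
CHF_PER_KWH = 0.10      # electricity tariff from the footnote


def cost_per_gflops(node_watt, list_price_chf, r_gflops):
    """Energy cost over YEARS plus list price, per sustained GF/s."""
    watt = 2 * node_watt                          # factor 2: consumption + cooling
    kwh_per_year = watt * HOURS_PER_YEAR / 1000
    energy_chf = kwh_per_year * YEARS * CHF_PER_KWH
    return (energy_chf + list_price_chf) / r_gflops


# P4 column: 200 W, CHF 1800, r = 0.410 GF/s
print(round(cost_per_gflops(200, 1800, 0.410)))
# Laptop column: 45 W, CHF 1440, r = 0.256 GF/s
print(round(cost_per_gflops(45, 1440, 0.256)))
```

Both results land close to the rounded totals in the table (7800 and 6850 CHF per GF/s), which supports this reading of the columns.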
First conclusions
Green is cheap
fm-computing (fm = frequency modulation) can reduce energy consumption
Today: 700,000,000 PCs installed worldwide
Suppose 200,000,000 PCs are switched on: at about 200 W each this is 40 GW, the output of about 40 nuclear power plants (!!)
Power consumption for Linpack
[Figure: peak Gflops/Watt (Jack Dongarra) for the Top10 systems of June 2005, numbered 1 to 10, vertical axis from 0 to 0.8. Labels with processor counts: Livermore 65536, IBM 40960, Groningen 12288, Tokyo 8192, Lausanne 8192 (BlueBrain@EPFL), NASA 10160, ES 5120, Barcelona 4800, Livermore 4096, Sandia 5000. Architecture labels: BlueGene, JS20, Itanium, Opteron. Added for comparison: laptop based supercomputer, 1 GHz, 3.2 GB/s, VM=5.]
Power consumption for Matmult
[Figure: Gflops/Watt for Matmult, vertical axis from 0 to 0.06. Entries: BlueGene; laptop based supercomputer (1 GHz, 3.2 GB/s); Xeon low voltage; Xeon.]
Characteristic parameters of the internode network
P: Number of nodes in a machine
C: Total network bandwidth of a machine [Words/s]
L: Latency of the network [s]
<d>: Average distance
VC = P R / C: Number of operations per sent word [Flops/Word]
b = C/(P*<d>): Inter-node communication bandwidth per node [Words/s]
tb = S/b: Time needed to send S words through the network [s]
tL = L Z: Latency time [s]
T = tc + tb + tL: Total turn around time of an application component*
γM = ra/b (1+LZ/S): Number of operations per word sent [Flops/Word]
B = 4 L C / P: Message size taking L
*For simplicity, I/O is not considered and communication cannot be hidden behind computation.
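The turn around time T = tc + tb + tL can be evaluated directly from the application and network parameters. The following Python sketch is illustrative only: the function name and every numerical input in the example call are assumptions, chosen in the range of a GbE cluster, and not measurements from the slides.

```python
def turnaround_time(O, S, Z, ra, C, P, d, L):
    """Return (tc, tb, tL, T) for one application component.

    O: operations per node, S: words sent per node, Z: messages per node,
    ra: application peak [Flops/s], C: total network bandwidth [Words/s],
    P: number of nodes, d: average distance <d>, L: latency [s].
    """
    b = C / (P * d)   # inter-node bandwidth per node [Words/s]
    tc = O / ra       # computation time
    tb = S / b        # time to send S words
    tL = L * Z        # latency time
    return tc, tb, tL, tc + tb + tL


# Hypothetical component on a 120-node machine with 60 microseconds latency.
tc, tb, tL, T = turnaround_time(O=1e9, S=1e7, Z=1000, ra=0.8e9,
                                C=3.75e9, P=120, d=1.0, L=60e-6)
print(f"tc = {tc:.2f} s, tb = {tb:.2f} s, tL = {tL:.2f} s, T = {T:.2f} s")
```

As on the slide, the sketch simply adds the three terms, that is, it assumes that communication is not overlapped with computation and that I/O is ignored.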
Cluster      NoW        Pleiades1  Pleiades2   Mizar     BlueGene  Horizon  SX-5    Regatta
Site         LIN-EPFL   STI-EPFL   STI-EPFL    DIT-EPFL  DIT-EPFL  CSCS     CSCS    CSCS
Vendor       Logics     Logics     DELL        Dalco     IBM       Cray     NEC     IBM
node         Pentium 4  Pentium 4  Xeon 64     Opteron   Power 4   Opteron  vector  Power 4
procs/node   1          1          1           2         2         1        1       1
network 1    FE Bus     FE switch  GbE switch  Myrinet   Torus     Torus    SMP     Colony
network 2    Fat Tree (single entry; its column is not recoverable)

Cluster      NoW        Pleiades1  Pleiades2   Mizar     BlueGene  Horizon  SX-5    Regatta
P            10         132        120         160       4096      1100     16      256
R [GF/s]     6          5.6        5.6         9.6       5.6       5.2      8       5
P R [GF/s]   60         739        672         1536      22937     5720     128     1300
M [GW/s]     0.8        0.8        0.8         1.6       0.7       0.8      8       0.4
VM [F/W]     7.5        7          7           6         8         6.5      1       12
C [GW/s]     0.0032     0.4        3.75        10        1065      1760     128     16
VC [F/W]     19200      1792       179         154       22        3.3      -       80
L [µs]       60         60         60          10        2.5       6.8      10*     -
B            750        750        7500        2500      4800      52000    640*    12
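The derived rows VM and VC follow from P, R, M and C. The sketch below recomputes them for three of the clusters as a sanity check of the table; the dictionary and the function name `node_and_network_ratios` are illustrative choices.

```python
# name: (P, R [GF/s], M [GW/s], C [GW/s]), taken from the table
clusters = {
    "Pleiades2": (120, 5.6, 0.8, 3.75),
    "Mizar":     (160, 9.6, 1.6, 10.0),
    "Horizon":   (1100, 5.2, 0.8, 1760.0),
}


def node_and_network_ratios(P, R, M, C):
    """Return (VM, VC) with VM = R / M and VC = P * R / C, both in Flops/Word."""
    return R / M, P * R / C


for name, params in clusters.items():
    vm, vc = node_and_network_ratios(*params)
    print(f"{name:10s} VM = {vm:.1f}  VC = {vc:.1f}")
```

A small VC means the network can feed the processors with many words per operation; among these three, Horizon is the best balanced for communication-heavy applications.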
Tailoring clusters to applications through parameterisation
Γ-model: $\Gamma = \gamma_a / \gamma_M$
Task/application: $\gamma_a = O / S$ [flops/64bit word]
Machine (if LZ/S << 1): $\gamma_M = r_a / b$ [flops/64bit word]
Speedup: $A = \frac{P}{1 + 1/\Gamma}$
Efficiency: $E = \frac{A}{P} = \frac{1}{1 + 1/\Gamma}$
$\Gamma > 1 \Rightarrow A > P/2$ and $E > 50\%$
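The speedup and efficiency relations of this model are easy to tabulate. The sketch below writes Γ for the ratio of the application parameter (O/S) to the machine parameter (ra/b); the function names and the sample values of Γ and P are assumptions for illustration.

```python
def gamma(gamma_a, gamma_m):
    """Ratio of the application parameter O/S to the machine parameter ra/b."""
    return gamma_a / gamma_m


def speedup(P, G):
    """A = P / (1 + 1/G)."""
    return P / (1 + 1 / G)


def efficiency(G):
    """E = A / P = 1 / (1 + 1/G)."""
    return 1 / (1 + 1 / G)


for G in (0.25, 1.0, 3.0, 10.0):
    print(f"Gamma = {G:5.2f}  A(P=8) = {speedup(8, G):.2f}  E = {efficiency(G):.2f}")
```

At Γ = 1 the efficiency is exactly 50%, which is why Γ > 1 serves as the threshold for a machine that suits the application.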

LAUTREC: Car-Parrinello molecular dynamics (M. Stengel)
Swiss-T1 + Tnet: Γ-model: γm = 100, γa = 330, Γ = 3.3. Measured: Γ = 3.6 -> 25% of overall time is due to communication, 75% is due to computation
Latency does not play a role
Swiss-T1 + Fast Ethernet: Γ-model: γm = 1333, γa = 330, Γ = 0.25. Measured: Γ = 0.25 -> 20% of overall time is due to computation, 80% is due to communication
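The percentages quoted for LAUTREC can be related to the model ratio. If that ratio is read as computation time divided by communication time, which follows from (O/S)/(ra/b) = (O/ra)/(S/b) when latency is negligible, the time shares are as sketched below. The function name `time_shares` is hypothetical.

```python
def time_shares(G):
    """Return (computation, communication) fractions of total time,
    assuming G = computation time / communication time."""
    computation = G / (1 + G)
    return computation, 1 - computation


# Model values from the slide: machine 100 or 1333, application 330.
for network, gamma_m in (("Tnet", 100.0), ("Fast Ethernet", 1333.0)):
    G = 330.0 / gamma_m
    comp, comm = time_shares(G)
    print(f"{network:13s} Gamma = {G:.2f}  computation {comp:.0%}, communication {comm:.0%}")
```

For Fast Ethernet this gives about 20% computation and 80% communication, and for Tnet roughly three quarters computation, in line with the figures on the slide.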
DS: Turbulence simulation (2/3 parallelised, E. Leriche)
[Figure: speedup (vertical axis, 0.8 to 1.8) versus number of nodes P (1 to 7), measured and model curves for the Cray XT3, the Myrinet Opteron cluster and the GbE Xeon cluster. Further labels: Pleiades 2, XT3, Mizar.]
DS: Scalability/strategy model (totally parallelised)
[Figure: machines shown: Itanium2/8, SX-8, SX-5, Cray XT3, Altix, Myrinet Opteron cluster, GbE Xeon cluster, FE P4 cluster.]

Machine   Pentium4  Xeon  Itanium  Opteron  SX-5  SX8
t [sec]   4.37      3.42  3.26     4.13     1.50  0.75

Get result in              4 months      1 month                              1 week          1 day
Capability: expensive?     all machines  SX-8 (1), Itanium2 (4), Horizon (8)  SX-8 (4 procs)  no machine
Capacity: cheap?           one node      Itanium2 (4)                         SX-8 (4 procs)
Computer investment costs  CHF 200.-     CHF 1000.-                           CHF 5000.-
Automatic parameterisation through monitoring

Pleiades: SAR user profile statistics 1.1.-31.3.2005 (Ganglia)
Γ < 1: E < 50%
Γ > 1: E > 50%
<E> = 0.64, <E> = 0.82
I/O or MPI?
To machine with faster communication
Pleiades: Monitoring of single jobs

Average message size = 300 B
Dominated by latency
GbE does not help
Need remote store MPI
CFD: Γ = 1
Plasma physics: Γ = 7
Intelligent (Grid) Scheduling System (ISS)
Goal: Submit an application to the most suited computer architecture
Database: All data on machines and applications, delivered by Ganglia and the metascheduler
Cost function: Includes all strategic components on machine status and application behaviour to decide where to submit
Job submission: Through Unicore/metascheduler
Team: EPFL, CSCS, EIF, Jülich, Fraunhofer
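How a cost function might use the application and machine parameters to pick a target can be illustrated with a toy example. This is not the ISS cost function, which the slides do not specify: the `Machine` class, the machine names, the cost figures and the rule "cheapest machine with predicted efficiency of at least 50%" are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    gamma_m: float        # machine parameter ra/b [Flops/Word]
    cost_per_hour: float  # hypothetical cost in CHF


def predicted_efficiency(gamma_a, machine):
    """E = 1 / (1 + 1/G) with G = gamma_a / gamma_m."""
    G = gamma_a / machine.gamma_m
    return 1 / (1 + 1 / G)


def choose_machine(gamma_a, machines, min_efficiency=0.5):
    """Pick the cheapest machine whose predicted efficiency is acceptable.

    If none qualifies, fall back to the machine with the highest efficiency.
    """
    suitable = [m for m in machines
                if predicted_efficiency(gamma_a, m) >= min_efficiency]
    if not suitable:
        return max(machines, key=lambda m: predicted_efficiency(gamma_a, m))
    return min(suitable, key=lambda m: m.cost_per_hour)


# Two hypothetical machines: a fast and a slow network.
machines = [Machine("fast-net", 100.0, 10.0), Machine("slow-net", 1333.0, 2.0)]
print(choose_machine(330.0, machines).name)    # communication-heavy application
print(choose_machine(5000.0, machines).name)   # computation-heavy application
```

With these invented numbers, the communication-heavy application goes to the machine with the fast network, while the computation-heavy one is sent to the cheaper machine because its efficiency there is still above the threshold.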
Intelligent Grid Scheduling System (ISS): The planned testbeds
EPFL: SMP/NUMA, High γM cluster
ETHZ: SMP/NUMA, High γM cluster
CERN: EGEE
EIF: NoW
CSCS: SMP/vector, Low γM cluster
[Diagram: the testbed sites, ISS and a switch.]
SwissGrid initiative
Data intensive computing? Model has to be extended
Conclusions
Green is cheap
fm-computing can reduce energy consumption
Grid should include different types of parallel machines
Optimal Grid scheduling possible thanks to ISS
ISS enables application adapted Grids that have
. economical advantages
. ecological advantages
Capability/capacity -> parameterization
Top500 list: R, VM, VC
To be done: Laptop based supercomputer