A user's perspective
Eduardo M. Bringa & Emmanuel Millán
CONICET / Instituto de Ciencias Básicas, Universidad Nacional de Cuyo, Mendoza
(ebringa@yahoo.com)
Outline
Introduction
HOOMD-BLUE
LAMMPS
GPULAMMPS
Performance:
a) single/mixed/double precision
b) CPU/GPU neighbor lists
c) Load balancing: static & dynamic
Examples.
Summary
Many more features!
http://codeblue.umich.edu/hoomd-blue/index.html
Benchmarks (64K particles):
HOOMD-Blue 0.9.2
http://www.nvidia.com/object/hoomd_on_tesla.html
Host CPU                      GPU                 LJ liquid TPS   Polymer TPS
AMD Athlon II X4 2.8 GHz      GTX 480             602.28          706.35
AMD Opteron 2356 2.3 GHz      S2050 (1/4)         496.07          560.74
Intel Core2 Q9300 2.50 GHz    GTX 460             383.65          432.13
AMD Opteron 2356 2.3 GHz      Tesla S1070 (1/4)   264.06          301.89
(TPS = timesteps per second; higher is better.)
http://codeblue.umich.edu/hoomd-blue/index.html
Runs with CUDA on the GPU or on the host CPU.
OpenMP for multiple cores or multiple GPUs.
Single/double precision.
Integrators: NVE, NPT, NVT, Brownian dynamics (NVT).
Energy minimization: FIRE.
Other features
Supports Linux, Windows, and Mac OS X.
Simple and powerful Python script interface for defining
simulations (see the sketch below).
Performs 2D and 3D simulations.
Advanced built-in initial configuration generators.
Human readable XML input files.
Space-filling curve particle reordering to increase
performance.
Extensible object-oriented design. Additional features may
be added in new classes contained in plugins.
Simulations can be visualized in real-time using VMD's IMD
interface.
Real time analysis can be run at a non-linear rate if desired
Quantities such as temperature, pressure, and box size can be
varied smoothly over a run
Flexible selection of particles for integration allows freezing
some particles in place and many other use-cases.
Only reduced units.
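The Python interface in action: a minimal sketch reconstructed from the commands echoed in the sample output below. The 'from hoomd_script import *' preamble is the standard idiom of the 0.9 series (an assumption here; it does not appear in the log). Everything is in reduced LJ units.

from hoomd_script import *

# 100 random LJ particles at low packing fraction
init.create_random(N=100, phi_p=0.01, name='A')

# Lennard-Jones pair force with a 3.0 sigma cutoff
lj = pair.lj(r_cut=3.0)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

# integrate all particles with Nose-Hoover NVT
all = group.all()
integrate.mode_standard(dt=0.005)
integrate.nvt(group=all, T=1.2, tau=0.5)

run(10e3)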
Output:
HOOMD-blue 0.9.0
Compiled: Wed Oct 28 06:58:46 EDT 2009
Copyright 2008, 2009 Ames Laboratory Iowa State University and the Regents of the University of Michigan
http://codeblue.umich.edu/hoomd-blue/
This code is the implementation of the algorithms discussed in:
Joshua A. Anderson, Chris D. Lorenz, and Alex Travesset, 'General Purpose Molecular Dynamics Fully Implemented on Graphics Processing Units', Journal of Computational Physics 227 (2008) 5342-5359
test.hoomd:004 | init.create_random(N=100, phi_p=0.01, name='A')
test.hoomd:007 | lj = pair.lj(r_cut=3.0)
test.hoomd:008 | lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)
test.hoomd:011 | all = group.all()
Group "all" created containing 100 particles
test.hoomd:012 | integrate.mode_standard(dt=0.005)
test.hoomd:013 | integrate.nvt(group=all, T=1.2, tau=0.5)
test.hoomd:016 | run(10e3)
** starting run **
Time 00:00:00 | Step 10000 / 10000 | TPS 35417.9 | ETA 00:00:00
Average TPS: 35405
-- Neighborlist stats:
370 normal updates / 100 forced updates / 0 dangerous updates
n_neigh_min: 0 / n_neigh_max: 10 / n_neigh_avg: 2.41
bins_min: 0 / bins_max: 6 / bins_avg: 1.5625
** run complete **
Outline
Introduction
HOOMD-BLUE
LAMMPS
GPULAMMPS
Performance:
a) single/mixed/double precision
b) CPU/GPU neighbor lists
c) Load balancing: static & dynamic
Examples.
Summary
GPULAMMPS
http://code.google.com/p/gpulammps/
http://code.google.com/p/gpulammps/wiki/Lammps_cuda
http://code.google.com/p/gpulammps/wiki/Lammps_cuda_UI_Features
LAMMPS/GPULAMMPS GPU packages:
/src/GPU (uses the cudpp_mini lib)
/src/USER-CUDA (in GPULAMMPS)
GPULAMMPS-CUDA Compilation
http://code.google.com/p/gpulammps/wiki/Installation
USER-CUDA
Provides (more details about the features of LAMMPS CUDA):
26 pair forces
long-range Coulomb with pppm/cuda
nve/cuda, nvt/cuda, npt/cuda, nve/sphere/cuda
several more important fixes and computes
Installation (more details about the installation of LAMMPS CUDA)
Make sure you can compile LAMMPS without packages with
'make YOUR-Machinefile'
Insert your path to CUDA in 'src/USER-CUDA/Makefile.common'
Install the standard packages with 'make yes-standard'
Install USER-CUDA with 'make yes-USER-CUDA'
Go to src/USER-CUDA and compile the USER-CUDA library with 'make Options'
Go to src and compile LAMMPS with 'make YOUR-Machinefile Options'
IMPORTANT: use the same Options for the library and for LAMMPS (see the sketch below)
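For reference, the same sequence can be scripted. A minimal sketch, assumed to run from the LAMMPS top-level directory; 'your-machine' and the 'precision=2' option are placeholders for whatever machine file and Options you actually use.

import subprocess

def step(cmd, cwd):
    print('$', ' '.join(cmd), '(in ' + cwd + ')')
    subprocess.run(cmd, cwd=cwd, check=True)   # abort on the first failure

step(['make', 'yes-standard'], 'src')                 # standard packages
step(['make', 'yes-USER-CUDA'], 'src')                # the USER-CUDA package
step(['make', 'precision=2'], 'src/USER-CUDA')        # library, with your Options
step(['make', 'your-machine', 'precision=2'], 'src')  # LAMMPS, with the SAME Options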
Features in GPULAMMPS-CUDA
http://code.google.com/p/gpulammps/wiki/Lammps_cuda_DI_Features
Run styles:
verlet/cuda
Forces:
born/coul/long/cuda, buck/coul/cut/cuda, buck/coul/long/cuda, buck/cuda, cg/cmm/coul/cut/cuda,
cg/cmm/coul/debye/cuda, cg/cmm/coul/long/cuda, cg/cmm/cuda, eam/cuda, eam/alloy/cuda, eam/fs/cuda,
gran/hooke/cuda, lj/charmm/coul/charmm/implicit/cuda, lj/charmm/coul/charmm/cuda, lj/charmm/coul/long/cuda,
lj/cut/coul/cut/cuda, lj/cut/coul/debye/cuda, lj/cut/coul/long/cuda, lj/cut/cuda, lj/expand/cuda,
lj/gromacs/coul/gromacs/cuda, lj/gromacs/cuda, lj/smooth/cuda, lj96/cut/cuda, morse/cuda, morse/coul/long/cuda,
pppm/cuda
Fixes:
npt/cuda, nve/cuda, nvt/cuda, nve/sphere/cuda, enforce2d/cuda, temp/berendsen/cuda, temp/rescale/cuda,
addforce/cuda, setforce/cuda, aveforce/cuda, shake/cuda, gravity/cuda, freeze/cuda
Computes:
temp/cuda, temp/partial/cuda, pressure/cuda, pe/cuda
Atom styles:
atomic/cuda, charge/cuda, full/cuda, granular/cuda (a usage sketch follows below)
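Putting a few of these styles together: a sketch of an LJ melt driven through the LAMMPS Python wrapper. It assumes a GPULAMMPS build with USER-CUDA and the Python interface enabled; depending on the version you may also need a cuda package/command-line switch, and the lattice and run settings here are illustrative only.

from lammps import lammps

script = """
units           lj
atom_style      atomic/cuda
lattice         fcc 0.8442
region          box block 0 20 0 20 0 20
create_box      1 box
create_atoms    1 box
mass            1 1.0
velocity        all create 1.44 87287
pair_style      lj/cut/cuda 2.5
pair_coeff      1 1 1.0 1.0 2.5
fix             1 all nve/cuda
run_style       verlet/cuda
run             1000
"""

lmp = lammps()                         # wraps liblammps
for line in script.strip().splitlines():
    lmp.command(line)                  # feed the input one command at a time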
Outline
Introduction
HOOMD-BLUE
LAMMPS
GPULAMMPS
Performance:
a) single/mixed/double precision
b) CPU/GPU neighbor lists
c) Load balancing: static & dynamic
Examples.
Summary
GB: Gay-Berne.
LJ: Lennard-Jones.
Benchmarks: http://users.nccs.gov/~wb8/gpu/yona.htm
Also: http://sites.google.com/site/akohlmey/software/lammps-benchmarks
Gay-Berne ellipsoids
125K atoms, NVE, rcut = 7, 1000 steps.
Speedup ~ 9-11.
Yona cluster:
15 nodes: 2x 6-core AMD Opteron 2435 (2.6 GHz) & 2 Tesla C2050 GPUs
(3 GB GDDR5 memory, 448 cores at 1.15 GHz, memory bandwidth 144 GB/s).
GPUs are connected on PCIe x16 gen 2.0 slots. ECC support enabled.
Mellanox MT26428 QDR InfiniBand interconnect.
Implementing molecular dynamics on hybrid high performance computers - short range forces.
W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898-911.
Single precision OK for many runs, but use at your peril! Colberg & Höfling, Comp. Phys. Comm. 182 (2011) 1120-1129.
Mixed precision (single for positions, double for forces) is nearly as fast as single precision!
Double precision is still cheaper than the CPU.
http://users.nccs.gov/~wb8/gpu/kid_precision.htm
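A toy illustration (plain NumPy, not an MD code) of why naive single precision bites: adding a small per-step contribution to a large energy-like running total. At 1e4 the float32 spacing is ~1.2e-3, so a 1e-4 increment is lost entirely; float64 keeps it.

import numpy as np

total32 = np.float32(1.0e4)    # float32 accumulator
total64 = np.float64(1.0e4)    # float64 reference
for _ in range(1_000_000):
    total32 += np.float32(1.0e-4)   # below float32 resolution at 1e4: vanishes
    total64 += 1.0e-4

print('float32:', total32)     # still 10000.0 - every increment was rounded away
print('float64:', total64)     # ~10100.0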
Fixed load balancing: set the CPU core to accelerator ratio and the
fraction of particles whose forces are calculated by the accelerator.
Consider a job run with 4 MPI processes on a node with 2 accelerator
devices and the fraction set to 0.7. At each timestep, each MPI process
places the position transfer, the force-kernel execution, and the force
transfer for 70% of its particles into the device (GPU) queue.
While the GPU transfers data and computes those forces, the MPI process
performs force calculations on the CPU for the remaining 30%.
Ideal fraction: CPU time = GPU time for data transfer plus kernel
execution. Dynamic balancing recomputes the optimal fraction from
CPU/GPU timings at some timestep interval (see the sketch below).
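A sketch of that split/rebalance logic (the function names are mine, not the LAMMPS GPU package API): each step a fraction f of the particles goes into the GPU queue, and f is periodically reset so that the measured CPU and GPU times coincide.

def split_work(n_particles, f):
    """Return (n_gpu, n_cpu) particle counts for accelerator fraction f."""
    n_gpu = int(f * n_particles)        # transferred to + computed on the GPU
    return n_gpu, n_particles - n_gpu   # computed on the CPU in the meantime

def rebalance(f, t_cpu, t_gpu):
    """New fraction so that CPU time == GPU time (transfer + kernel).

    Per-particle rates from the last measurement:
        cpu_rate = t_cpu / (1 - f),  gpu_rate = t_gpu / f
    The ideal f* solves f* * gpu_rate == (1 - f*) * cpu_rate.
    """
    cpu_rate = t_cpu / (1.0 - f)
    gpu_rate = t_gpu / f
    return cpu_rate / (cpu_rate + gpu_rate)

# Example: start at f = 0.7; the GPU finishes its 70% in 2 ms while the CPU
# needs 4 ms for its 30% -> shift more work to the GPU.
f = 0.7
n_gpu, n_cpu = split_work(864_000, f)     # N from the benchmark below
f = rebalance(f, t_cpu=4.0, t_gpu=2.0)    # hypothetical timings (ms)
print(n_gpu, n_cpu, round(f, 3))          # 604800 259200 0.824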
LJ benchmark, N = 864,000.
More benchmarks
Implementing molecular dynamics on hybrid high performance computers - short range forces.
W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898-911.
Single node. Code compiled with CUDA and OpenCL. N = 256K, with neighbor
list builds performed on the GPU. Times normalized by the time required to
complete the simulation loop with CUDA.
Outline
Introduction
HOOMD-BLUE
LAMMPS
GPULAMMPS
Performance:
a) single/mixed/double precision
b) CPU/GPU neighbor lists
c) Load balancing: static & dynamic
Examples.
Summary
GPU: NVIDIA GeForce 9500 GT, 1 GB DDR2, 32 processing units (~US$100), running
with CUDA 3.0.
Single CPU: AMD Athlon 64 X2 Dual Core Processor 4600+, 2.4 GHz per core, Slackware Linux 13.0.
Storm: master node + 12 slave stations, each with an Intel P4 processor at 3.0 GHz, 1 GiB of DDR-1 RAM,
Gigabit Ethernet card, and Rocks Linux 5.3.
Twister: 16 nodes, each with an Intel Core 2 Duo processor at 3.0 GHz, 4 GiB of DDR-2 RAM, Gigabit Ethernet
card, and Rocks Linux 5.0.