Supercomputing 1969-2018
1969: MFlops, 1985: GFlops, 1997: TFlops, 2008: PFlops, 2018: EFlops?
[Figure: supercomputer peak performance on a log scale (1.E+03 to 1.E+18), 1969 to 2011. Machines shown: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, IBM Roadrunner, Tianhe I. Year ticks: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011.]
Trendline: MFLOPS(y) = 1.72^(y-1969)
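The trendline can be evaluated directly. The following is a minimal sketch in C, assuming the formula is read as an exponential with annual growth factor 1.72 starting from 1 MFLOPS in 1969; the function names are illustrative and not from the slides.

```c
#include <assert.h>

#define GROWTH_PER_YEAR 1.72   /* annual growth factor from the trendline */
#define BASE_YEAR       1969   /* trendline equals 1 MFLOPS in this year  */

/* Trendline estimate in MFLOPS for a given year: 1.72^(year - 1969). */
double trend_mflops(int year)
{
    assert(year >= BASE_YEAR);
    double value = 1.0;
    for (int y = BASE_YEAR; y < year; y++)
        value *= GROWTH_PER_YEAR;
    return value;
}

/* First year in which the trendline reaches target_mflops. */
int year_reaching(double target_mflops)
{
    assert(target_mflops >= 1.0);
    int year = BASE_YEAR;
    double value = 1.0;
    while (value < target_mflops) {
        value *= GROWTH_PER_YEAR;
        year++;
    }
    return year;
}
```

Under this reading, year_reaching(1e9) gives 2008 for 1 PFlops and year_reaching(1e12) gives 2020 for 1 EFlops. A factor of 1.72 per year corresponds to a doubling time of ln 2 / ln 1.72, about 1.28 years.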
Trendlines
Supercomputing FLOPS > Moore's law
Memory speed increase << Moore's law
[Figure: log10(MFlops) from 0 to 18 versus year, 1960 to 2020. Series: MFlops, Trendline (R = 0.97), Moore's law.]
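To put a number on the gap between the two growth rates, here is a small sketch in C. It assumes Moore's law means a doubling every 2 years (an annual factor of sqrt(2)); the slides do not say which form of Moore's law was plotted, so that constant and the function names are assumptions.

```c
#include <assert.h>

#define TREND_PER_YEAR 1.72                /* supercomputing FLOPS trendline */
#define MOORE_PER_YEAR 1.4142135623730951  /* ASSUMED: doubling every 2 years */

/* Cumulative growth after a number of years at a fixed annual factor. */
double growth_over(double per_year, int years)
{
    assert(years >= 0);
    double g = 1.0;
    for (int i = 0; i < years; i++)
        g *= per_year;
    return g;
}

/* How far the FLOPS trendline is ahead of the assumed Moore's law curve. */
double trend_lead_over_moore(int years)
{
    return growth_over(TREND_PER_YEAR, years) /
           growth_over(MOORE_PER_YEAR, years);
}
```

With these assumptions the trendline is ahead by a factor of roughly 2500 after 40 years. A shorter assumed doubling period, such as 18 months, would narrow that gap.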
Programming language: C
GPU: CUDA, OpenCL
C -> PTX (Parallel Thread Execution)
ROCCC
optimizations
  low level: arithmetic balancing
  high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
  data optimizations: stream with smart buffer
output
  VHDL design + testbench
  PCore (Xilinx)

AutoESL
optimizations
  code: loop unroll, fusion, pipeline, inline
  data: remap, partition, arrays, reshape, resource, stream
  interface selection: handshake, fifo, bus, register,
output
  VHDL design
  performance report: timing, design and loops latency, utilization, area, power, interface
  design viewer with timeline, regs and interfaces, with feedback to source code
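Two of the loop optimizations named for these tools, fusion and unrolling, can be shown by hand in plain C. This is only an illustrative sketch of what such a transformation does to the source; it is not output from either tool, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: two passes over the data with an intermediate array. */
void scale_then_offset(const int *in, int *tmp, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = in[i] * 3;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + 1;
}

/* Loop fusion: both loop bodies merged, the intermediate array disappears. */
void fused(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 3 + 1;
}

/* Fusion plus unrolling by 4: four independent statements per iteration,
 * which a hardware compiler can map to four parallel datapaths. */
void fused_unrolled4(const int *in, int *out, size_t n)
{
    assert(n % 4 == 0);  /* this sketch has no remainder loop */
    for (size_t i = 0; i < n; i += 4) {
        out[i]     = in[i]     * 3 + 1;
        out[i + 1] = in[i + 1] * 3 + 1;
        out[i + 2] = in[i + 2] * 3 + 1;
        out[i + 3] = in[i + 3] * 3 + 1;
    }
}
```

All three functions produce the same output array. The unrolled version needs four reads and four writes per iteration, which is why data optimizations such as streaming and partitioning matter alongside the loop transformations.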
Compiler optimizations
Optimization
  Software pipelining
  Arithmetic balancing
  Loop unrolling
  Loop flatten hierarchy
  Loop fusion (merge)
  Function inlining
  Array map (combine arrays H or V)
  Array partition (into smaller, // arrays)
  Array reshape (cyclic, block)
  Array resource (e.g. single or DP RAM)
  Array streaming (FIFOs instead of RAMs)
  Smart Buffer
  Interface (handshake, none, stream, )
AutoESL: x x x x x x x x x x x x
ROCCC: x x x
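Array partitioning and arithmetic balancing from the list above can be sketched in plain C. The example assumes a cyclic partition into 4 banks and a simple sum reduction; the bank count, memory layout and function names are choices made for illustration, not taken from AutoESL or ROCCC.

```c
#include <assert.h>
#include <stddef.h>

#define BANKS 4  /* hypothetical number of partitions */

/* Cyclic partition: element i goes to bank i % BANKS, slot i / BANKS.
 * banks is laid out as BANKS contiguous regions of n / BANKS elements. */
void partition_cyclic(const int *src, size_t n, int *banks)
{
    assert(n % BANKS == 0);
    size_t depth = n / BANKS;
    for (size_t i = 0; i < n; i++)
        banks[(i % BANKS) * depth + i / BANKS] = src[i];
}

/* Sum over the partitioned data with one accumulator per bank. */
int sum_partitioned(const int *banks, size_t n)
{
    assert(n % BANKS == 0);
    size_t depth = n / BANKS;
    int acc[BANKS] = {0};
    for (size_t j = 0; j < depth; j++)
        for (size_t b = 0; b < BANKS; b++)  /* inner loop is a candidate for full unrolling */
            acc[b] += banks[b * depth + j];
    /* Arithmetic balancing: a two-level adder tree
     * instead of the three-level chain ((a0 + a1) + a2) + a3. */
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

In hardware each bank would be a separate memory, so the four reads in the inner loop could happen in the same cycle. On a CPU the code only shows the data layout and the order of operations.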
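[Figure: # cycles (axis ticks 1.E+08, 2.E+08) for three configurations: 2 ports only; Partition=2, 4 // streams (DP); Partition=4, 8 // streams (DP). Regions labelled I/O bound and Resource bound.]
[Figure: # cycles (axis ticks 1.E+08, 2.E+08) on Spartan3e and Virtex 6.]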
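The distinction between I/O bound and resource bound designs can be illustrated with a toy cycle model. The model below is an assumption made for this sketch and does not come from the slides or from either tool: it takes the cycle count as the larger of the time needed to move the data through the available parallel streams and the time needed to execute the operations on the available compute units.

```c
#include <assert.h>

typedef enum { IO_BOUND, RESOURCE_BOUND } bound_t;

/* Integer division rounded up. */
static unsigned long ceil_div(unsigned long a, unsigned long b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Hypothetical model: cycles = max(elements / streams, ops / units).
 * *bound reports which of the two terms dominates (ties count as I/O). */
unsigned long estimate_cycles(unsigned long elements, unsigned long streams,
                              unsigned long ops, unsigned long units,
                              bound_t *bound)
{
    unsigned long io_cycles      = ceil_div(elements, streams);
    unsigned long compute_cycles = ceil_div(ops, units);
    if (io_cycles >= compute_cycles) {
        *bound = IO_BOUND;
        return io_cycles;
    }
    *bound = RESOURCE_BOUND;
    return compute_cycles;
}
```

In this model, adding streams through partitioning reduces the cycle count only until the compute term dominates. After that point a larger device with more compute units is needed to go faster.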