Sparse Matrices
Evaluate y=Ax
A is a sparse matrix; x and y are dense vectors
Challenges
Difficult to exploit ILP and DLP
Irregular memory access to the source vector
Often difficult to load balance
Very low computational intensity (often >6 bytes/flop)
Likely memory bound
(Figure: spy plot and CSR footprint of an example sparse matrix.)
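To make the CSR footprint and the irregular access to x concrete, here is a minimal C sketch of the baseline kernel, assuming a standard CSR layout with 32-bit column indices; the function and argument names are illustrative, not any particular library's API.

```c
/* Baseline y = A*x with A in Compressed Sparse Row (CSR) format.
 * CSR stores, per nonzero, one 8-byte value and one 4-byte column index,
 * plus one row pointer per row. */
void spmv_csr(int nrows,
              const int    *rowptr,  /* nrows+1 offsets into cols[]/vals[] */
              const int    *cols,    /* column index of each nonzero       */
              const double *vals,    /* value of each nonzero              */
              const double *x,       /* dense source vector                */
              double       *y)       /* dense destination vector           */
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += vals[k] * x[cols[k]];  /* irregular, data-dependent access to x */
        y[r] = sum;
    }
}
```

Per nonzero this kernel moves at least 12 bytes (an 8-byte value plus a 4-byte column index) for 2 flops, before counting traffic on x, y, and the row pointers, which is where the >6 bytes/flop figure above comes from.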
What is Autotuning?
Idea
Hand-optimizing each architecture/dataset combination is not feasible
An optimization for one combination may slow down another
Need an automatic, general solution for each combination
Design
(Diagram: the design splits work between Library Install-Time (offline) and Application Run-Time; the Matrix is a run-time input.)
Code Generators
A kernel-specific Perl script generates 1000s of code variations; the autotuner searches them (either exhaustively or heuristically) to find the optimal configuration
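The search itself can be pictured as below: a simplified sketch assuming all generated variants share one SpMV signature. The function-pointer table and timing harness are hypothetical stand-ins for the Perl-generated driver, and a heuristic search would prune this space with a model rather than timing everything.

```c
#include <float.h>
#include <time.h>

/* Hypothetical common signature shared by every generated SpMV variant. */
typedef void (*spmv_fn)(int nrows, const int *rowptr, const int *cols,
                        const double *vals, const double *x, double *y);

/* Exhaustive search: time each generated variant on the user's matrix and
 * return the index of the fastest one.  A heuristic search would instead
 * prune the candidate set with a performance model before timing. */
int autotune(spmv_fn variants[], int nvariants,
             int nrows, const int *rowptr, const int *cols,
             const double *vals, const double *x, double *y)
{
    int best = 0;
    double best_time = DBL_MAX;
    for (int v = 0; v < nvariants; v++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        variants[v](nrows, rowptr, cols, vals, x, y);  /* run the candidate kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        if (dt < best_time) { best_time = dt; best = v; }
    }
    return best;
}
```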
Autotuning
(Charts: SpMV performance on Clovertown, Opteron, Niagara2, and the Cell Blade as optimizations are stacked: Naïve Serial, Naïve Parallel, +NUMA, +Prefetching, +Matrix Compression, +Cache/TLB Blocking.)
On the x86 machines, naïve parallelization didn't even double performance, but it delivered 40x on Niagara2
Prefetching (exhaustive search) / DMA was helpful on both x86 machines (more so after compression)
NUMA awareness is essential on Opteron/Cell
Matrix Compression (heuristic) often delivered a performance gain better than the reduction in memory traffic; see the register-blocking sketch below
There is no naïve Cell SPE implementation
Although local store blocking is essential on Cell for correctness, it is rarely beneficial on cache-based machines
Machines running a large number of threads need large numbers of DIMMs (not just bandwidth)
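As one concrete illustration of the matrix-compression family, a 2x2 register-blocked (BCSR) kernel stores a single column index per block rather than per nonzero, cutting index traffic and exposing a little ILP. The sketch below assumes a fixed 2x2 blocking and row-major block storage, which is just one point in the space the autotuner searches; it is not the tuned code itself.

```c
/* Illustrative 2x2 register-blocked (BCSR) SpMV: one column index is stored
 * per 2x2 block of nonzeros, and each block contributes to two rows of y. */
void spmv_bcsr_2x2(int nblockrows,
                   const int    *browptr,  /* nblockrows+1 offsets into bcols[] */
                   const int    *bcols,    /* leftmost column of each 2x2 block */
                   const double *bvals,    /* 4 values per block, row-major     */
                   const double *x,
                   double       *y)
{
    for (int br = 0; br < nblockrows; br++) {
        double y0 = 0.0, y1 = 0.0;          /* two independent accumulators -> some ILP */
        for (int k = browptr[br]; k < browptr[br + 1]; k++) {
            const double *b = &bvals[4 * k];
            double x0 = x[bcols[k]], x1 = x[bcols[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```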
P-OSKI
(Diagram: P-OSKI tuning flow. The matrix is split into Submatrices; for each, Build for Target Arch., run the Parallel Benchmark to produce benchmark data, and Evaluate Parallel Models for Load Balance; then Accumulate Handles; To User: a P_OSKI_Matrix_Handle for kernel calls.)
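A minimal sketch of the load-balancing step implied by the flow above, assuming rows of a CSR matrix are split into per-thread submatrices with roughly equal nonzero counts before each submatrix is tuned. The function name and interface are hypothetical, not P-OSKI's actual API.

```c
/* Nonzero-balanced row partitioning: thread t is assigned the row range
 * [row_start[t], row_start[t+1]).  rowptr is the CSR row-pointer array,
 * so rowptr[r] is the number of nonzeros in rows 0..r-1. */
void partition_rows(int nrows, const int *rowptr, int nthreads,
                    int *row_start /* nthreads+1 entries */)
{
    int nnz = rowptr[nrows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads;   /* ideal nonzero-count boundary */
        while (r < nrows && rowptr[r] < target)   /* advance to the first row past it */
            r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}
```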
Architectures Evaluated
(Diagrams: block diagrams of the evaluated machines; legible labels include HT links at 4GB/s in each direction and 667MHz FBDIMMs.)
OSKI*
(Diagram: OSKI tuning components: Benchmark, Heuristic models, History.)