
EECS

Electrical Engineering and Computer Sciences

Autotuning Sparse Matrix-Vector Multiplication


Sam Williams, Ankit Jain
{ samw, ankit } @eecs.berkeley.edu
BERKELEY PAR LAB

Sparse Matrices


Sparse Matrix-Vector Multiplication


Most entries are 0.0
Significant performance advantage in only storing/operating on the nonzeros

Dataset: the matrices


Pruned the original SPARSITY suite down to 14 matrices of interest
None should fit in cache: CSR footprints of 12-135MB (a memory-intensive benchmark)
4 categories
Rank ranging from 2K to 1M
Name | Dimension | Nonzeros (nonzeros/row) | CSR footprint
Dense | 2K | 4.0M (2K) | 48MB
Protein | 36K | 4.3M (119) | 52MB
FEM / Spheres | 83K | 6.0M (72) | 72MB
FEM / Cantilever | 62K | 4.0M (65) | 48MB
Wind Tunnel | 218K | 11.6M (53) | 140MB
FEM / Harbor | 47K | 2.37M (50) | 28MB
QCD | 49K | 1.90M (39) | 23MB
FEM / Ship | 141K | 3.98M (28) | 48MB
Economics | 207K | 1.27M (6) | 16MB
Epidem | 526K | 2.1M (4) | 27MB
FEM / Accel | 121K | 2.62M (22) | 32MB
Circuit | 171K | 959K (6) | 12MB
webbase | 1M | 3.1M (3) | 41MB
LP | 4K x 1M | 11.3M (2825) | 135MB

P-OSKI : A Library to Parallelize OSKI*


Goals
Provide both a serial and a parallel interface to exploit the parallelism in sparse kernels (focus on SpMV for now)
Hide the complex process of parallel tuning
Expose the cost of tuning
Allow for user inspection and control of the tuning process
Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g. ParMETIS)

Evaluate y=Ax
A is a sparse matrix; x and y are dense vectors
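
For reference, a minimal (untuned) CSR SpMV kernel is sketched below, assuming double-precision values, 32-bit column indices, and a standard row-pointer array (variable names are illustrative):

/* y = A*x with A in CSR: row_ptr[0..m], col_idx/val hold the nonzeros.
   Each nonzero is one multiply-add (2 flops) but streams at least 12 bytes
   of matrix data (8B value + 4B index), hence ~6 bytes/flop before counting
   any vector traffic. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular access to x */
        y[i] = sum;
    }
}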

Challenges

Difficult to exploit ILP and DLP
Irregular memory access to the source vector
Often difficult to load balance
Very low computational intensity (often >6 bytes/flop); likely memory bound

Where the Optimizations Occur

Optimization: handled by
Load Balancing / NUMA: P-OSKI
Register Blocking: OSKI


Initial Multicore SpMV Experimentation


Paper reference
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.


What is Autotuning?
Idea
Hand-optimizing each architecture/dataset combination is not feasible
An optimization for one combination may slow another
Need an automatic/general solution for each combination

Design
[OSKI tuning flow: at library install time (offline), the kernels are built for the target architecture and benchmarked, producing generated code variants and benchmark data; at application run time, the user's matrix is evaluated against heuristic models (together with the benchmark data and tuning history), a data structure and code variant are selected, and a tuned matrix handle is returned for subsequent kernel calls.]

Two threaded implementations:


Cache-based (Pthreads) and Cell local-store based
1D parallelization by rows (see the sketch below)
Progressively expanded the autotuner (constant levels of productivity)
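
A minimal sketch of the 1D row decomposition used by the cache-based Pthreads variant follows; it splits rows into contiguous blocks (the actual implementation also load balances by nonzero count and applies the optimizations listed elsewhere on this poster; names here are illustrative):

#include <pthread.h>

typedef struct {
    int row_begin, row_end;              /* contiguous block of rows */
    const int *row_ptr, *col_idx;
    const double *val, *x;
    double *y;
} spmv_task_t;

static void *spmv_worker(void *arg)
{
    spmv_task_t *t = (spmv_task_t *)arg;
    for (int i = t->row_begin; i < t->row_end; i++) {
        double sum = 0.0;
        for (int k = t->row_ptr[i]; k < t->row_ptr[i+1]; k++)
            sum += t->val[k] * t->x[t->col_idx[k]];
        t->y[i] = sum;                   /* rows are disjoint: no reduction needed */
    }
    return NULL;
}

/* Launch: split m rows into nthreads contiguous blocks (assumes nthreads <= 64). */
void spmv_parallel(int m, int nthreads, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    pthread_t tid[64];
    spmv_task_t task[64];
    for (int t = 0; t < nthreads; t++) {
        task[t] = (spmv_task_t){ m * t / nthreads, m * (t + 1) / nthreads,
                                 row_ptr, col_idx, val, x, y };
        pthread_create(&tid[t], NULL, spmv_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}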

Code Generators
A kernel-specific Perl script generates 1000s of code variations; the autotuner searches them (either exhaustively or heuristically) to find the optimal configuration
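
A sketch of what the exhaustive search might look like, assuming the generated variants are exposed as a table of kernel function pointers indexed by register-block shape (the table, conversion, and timing helpers are hypothetical stand-ins, not the actual generated interface):

#include <float.h>

typedef void (*spmv_fn)(const void *A_blocked, const double *x, double *y);

/* Hypothetical hooks: one generated kernel per r x c block shape (r,c in {1,2,4,8}),
   plus helpers to convert the matrix and time one SpMV. */
extern spmv_fn kernel_table[4][4];
extern void  *convert_to_blocked(const void *A_csr, int r, int c);
extern void   free_blocked(void *A_blocked);
extern double time_spmv(spmv_fn f, const void *A_blocked, const double *x, double *y);

/* Exhaustive search: time every variant on the user's matrix and keep the fastest.
   A heuristic search would instead combine install-time benchmark data with an
   estimate of the fill introduced by each block shape. */
spmv_fn pick_best_kernel(const void *A_csr, const double *x, double *y,
                         int *best_r, int *best_c)
{
    static const int shape[4] = { 1, 2, 4, 8 };
    double best_t = DBL_MAX;
    spmv_fn best = NULL;
    for (int ri = 0; ri < 4; ri++)
        for (int ci = 0; ci < 4; ci++) {
            void *A_rc = convert_to_blocked(A_csr, shape[ri], shape[ci]);
            double t = time_spmv(kernel_table[ri][ci], A_rc, x, y);
            if (t < best_t) {
                best_t = t;
                best = kernel_table[ri][ci];
                *best_r = shape[ri];
                *best_c = shape[ci];
            }
            free_blocked(A_rc);
        }
    return best;
}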

Autotuning
Naive parallelization didn't even double performance on the x86 machines, but delivered 40x on Niagara2
Prefetching (exhaustive search) / DMA was helpful on both x86 machines (more so after compression)
NUMA is essential on Opteron/Cell
Matrix compression (heuristic) often delivered performance better than the reduction in memory traffic
There is no naive Cell SPE implementation; although local-store blocking is essential on Cell for correctness, it is rarely beneficial on cache-based machines
Machines running a large number of threads need large numbers of DIMMs (not just bandwidth)
[Chart: SpMV performance on Clovertown, Opteron, Niagara2, and the Cell Blade; bars show Naive Serial, Naive Parallel, +NUMA, +Prefetching, +Matrix Compression, +Cache/TLB Blocking, and +DIMMs/Firmware/Array Padding.]

[P-OSKI tuning flow: the input matrix is partitioned into submatrices and load balanced; each submatrix is built for the target architecture, benchmarked in parallel, and evaluated against parallel heuristic models (via OSKI); the resulting handles are accumulated into a P_OSKI_Matrix_Handle returned to the user for kernel calls.]

Optimizations included in our autotuner:


NUMA-aware allocation: exploits NUMA memory systems
SW prefetching: attempts to hide L2 and DRAM latency
Matrix compression: maximizes performance by minimizing memory traffic; this includes power-of-2 register blocking, CSR/COO selection, index size reduction, etc. (a register-blocked kernel is sketched below)
Cache blocking: reorganizes the matrix to maximize source-vector locality in the L2
TLB blocking: reorganizes the matrix to maximize source-vector TLB locality
SIMDization: ensures optimal in-cache performance
Array padding: avoids inter-thread conflicts in the L1/L2
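
To illustrate register blocking, a 2x2 BCSR (block CSR) kernel is sketched below: it keeps a 2-element block of y in registers and reuses each loaded pair of x values twice (a minimal sketch; partial blocks are assumed to be filled with explicit zeros, which is why the autotuner must weigh the extra fill against the reduced index overhead):

/* y = A*x with A in 2x2 BCSR: brow_ptr indexes blocks per block-row,
   bcol_idx gives the block column, val holds 4 values per block (row-major). */
void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;             /* register-resident block of y */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib+1]; k++) {
            const double *b = &val[4*k];
            double x0 = x[2*bcol_idx[k]], x1 = x[2*bcol_idx[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;           /* each loaded x value reused twice */
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*ib]     = y0;
        y[2*ib + 1] = y1;
    }
}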


Comparison with MPI


Compared the autotuned Pthreads implementation with an autotuned shared-memory MPI implementation (MPICH, PETSc+OSKI)
For many matrices, MPI was only slightly faster than serial on the x86 machines
MPI rarely scaled beyond 16 threads on Niagara2 (still under investigation)
The autotuned Pthreads implementation was often 2-3x faster than autotuned MPI

[Chart: Naive Serial vs. autotuned shared-memory MPI (PETSc+OSKI) vs. autotuned Pthreads on Clovertown, Opteron, and Niagara2.]


Architectures Evaluated

2.33GHz Intel Xeon (Clovertown)


[Diagram: two quad-core sockets; each pair of Core2 cores shares a 4MB L2; each socket connects over a 10.6 GB/s FSB to a chipset with 4x64b memory controllers and 667MHz FBDIMMs; aggregate memory bandwidth is 21.3 GB/s read / 10.6 GB/s write.]

2.2GHz AMD Opteron


[Diagram: two dual-core Opteron sockets; each core has a 1MB victim cache; the cores share an SRI/crossbar to a 128b memory controller (10.66 GB/s to 667MHz DDR2 DIMMs per socket); the sockets are linked by HyperTransport at 4GB/s in each direction.]

Performance and Scalability


All machines showed good multisocket scaling (bandwidth per socket was the limiting factor)
Clovertown showed very poor multicore scaling (low data utilization on the FSB)
Machines with simpler cores and more bandwidth delivered better performance



Status and Future Work


Version 1 of parallel SpMV with load-balancing and tree-based reduction implemented
Currently working on a Parallel Benchmark to better pick register block sizes during the autotuning process
Implement Cache Blocking and TLB Blocking heuristics within OSKI
Extend work to other sparse linear algebra kernels

1.4GHz Sun Niagara2 (Huron)


[Diagram: 8 multithreaded SPARC cores with 8KB L1s on a crossbar switch (90 GB/s write-through, 179 GB/s fill) to a 4MB 16-way shared L2 (address-interleaved across 8x64B banks); 4x128b memory controllers (2 banks each) to 667MHz FBDIMMs: 42.66 GB/s read / 21.33 GB/s write.]

3.2GHz IBM Cell Blade (QS20)


[Diagram: two Cell chips, each with a PPE (512KB L2) and 8 SPEs (256KB local store with an MFC each) on an EIB ring network; each chip has 25.6 GB/s to 512MB of XDR DRAM; the chips are coupled via the BIF at <20 GB/s in each direction.]

System Power Efficiency


Used a digital power meter to measure sustained power under load
16 FBDIMMs on Niagara2 drove sustained power to 450W, while the others required around 300W
Clovertown delivered lower performance and required more power than Opteron (thus substantially lower power efficiency)
Cell delivered good performance at moderate power


*OSKI: Optimized Sparse Kernel Interface: http://bebop.cs.berkeley.edu/oski
