
EECS

Electrical Engineering and Computer Sciences

Autotuning Sparse Matrix-Vector Multiplication


Sam Williams, Ankit Jain
{ samw, ankit } @eecs.berkeley.edu
BERKELEY PAR LAB

Sparse Matrices


Sparse Matrix-Vector Multiplication


Most entries are 0.0
Significant performance advantage in only storing/operating on the nonzeros

Dataset: the matrices


Pruned the original SPARSITY suite down to 14 matrices of interest
None should fit in cache: CSR footprints of 12-135MB (a memory-intensive benchmark)
4 categories
Rank ranging from 2K to 1M
Name | Dimension | Nonzeros (nonzeros/row) | CSR footprint
Dense | 2K | 4.0M (2K) | 48MB
Protein | 36K | 4.3M (119) | 52MB
FEM / Spheres | 83K | 6.0M (72) | 72MB
FEM / Cantilever | 62K | 4.0M (65) | 48MB
Wind Tunnel | 218K | 11.6M (53) | 140MB
FEM / Harbor | 47K | 2.37M (50) | 28MB
QCD | 49K | 1.90M (39) | 23MB
FEM / Ship | 141K | 3.98M (28) | 48MB
Economics | 207K | 1.27M (6) | 16MB
Epidem | 526K | 2.1M (4) | 27MB
FEM / Accel | 121K | 2.62M (22) | 32MB
Circuit | 171K | 959K (6) | 12MB
webbase | 1M | 3.1M (3) | 41MB
LP | 4K x 1M | 11.3M (2825) | 135MB

P-OSKI : A Library to Parallelize OSKI*


Goals
Provide both a serial and a parallel interface to exploit the parallelism in sparse kernels (focus on SpMV for now)
Hide the complex process of parallel tuning
Expose the cost of tuning
Allow for user inspection and control of the tuning process
Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g. ParMETIS)

Evaluate y=Ax
A is a sparse matrix; x and y are dense vectors
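
For reference, a minimal (untuned) CSR SpMV kernel is sketched below, assuming double-precision values, 32-bit column indices, and a standard row-pointer array (variable names are illustrative):

/* y = A*x with A in CSR: row_ptr[0..m], col_idx/val hold the nonzeros.
   Each nonzero is one multiply-add (2 flops) but streams at least 12 bytes
   of matrix data (8B value + 4B index), hence ~6 bytes/flop before counting
   any vector traffic. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular access to x */
        y[i] = sum;
    }
}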

Challenges

Difficult to exploit ILP and DLP
Irregular memory access to the source vector
Often difficult to load balance
Very low computational intensity (often >6 bytes/flop); likely memory bound

Where the Optimizations Occur

Optimization: handled by
Load Balancing / NUMA: P-OSKI
Register Blocking: OSKI


Initial Multicore SpMV Experimentation


Paper reference
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.


What is Autotuning?
Idea
Hand-optimizing each architecture/dataset combination is not feasible
An optimization for one combination may slow another
Need an automatic/general solution for each combination

Design
[OSKI tuning flow: at library install time (offline), the kernels are built for the target architecture and benchmarked, producing generated code variants and benchmark data; at application run time, the user's matrix is evaluated against heuristic models (together with the benchmark data and tuning history), a data structure and code variant are selected, and a tuned matrix handle is returned for subsequent kernel calls.]

Two threaded implementations:


Cache-based (Pthreads) and Cell local-store based
1D parallelization by rows (see the sketch below)
Progressively expanded the autotuner (constant levels of productivity)
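
A minimal sketch of the 1D row decomposition used by the cache-based Pthreads variant follows; it splits rows into contiguous blocks (the actual implementation also load balances by nonzero count and applies the optimizations listed elsewhere on this poster; names here are illustrative):

#include <pthread.h>

typedef struct {
    int row_begin, row_end;              /* contiguous block of rows */
    const int *row_ptr, *col_idx;
    const double *val, *x;
    double *y;
} spmv_task_t;

static void *spmv_worker(void *arg)
{
    spmv_task_t *t = (spmv_task_t *)arg;
    for (int i = t->row_begin; i < t->row_end; i++) {
        double sum = 0.0;
        for (int k = t->row_ptr[i]; k < t->row_ptr[i+1]; k++)
            sum += t->val[k] * t->x[t->col_idx[k]];
        t->y[i] = sum;                   /* rows are disjoint: no reduction needed */
    }
    return NULL;
}

/* Launch: split m rows into nthreads contiguous blocks (assumes nthreads <= 64). */
void spmv_parallel(int m, int nthreads, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    pthread_t tid[64];
    spmv_task_t task[64];
    for (int t = 0; t < nthreads; t++) {
        task[t] = (spmv_task_t){ m * t / nthreads, m * (t + 1) / nthreads,
                                 row_ptr, col_idx, val, x, y };
        pthread_create(&tid[t], NULL, spmv_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}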

Code Generators
A kernel-specific Perl script generates 1000s of code variations; the autotuner searches them (either exhaustively or heuristically) to find the optimal configuration
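
A sketch of what the exhaustive search might look like, assuming the generated variants are exposed as a table of kernel function pointers indexed by register-block shape (the table, conversion, and timing helpers are hypothetical stand-ins, not the actual generated interface):

#include <float.h>

typedef void (*spmv_fn)(const void *A_blocked, const double *x, double *y);

/* Hypothetical hooks: one generated kernel per r x c block shape (r,c in {1,2,4,8}),
   plus helpers to convert the matrix and time one SpMV. */
extern spmv_fn kernel_table[4][4];
extern void  *convert_to_blocked(const void *A_csr, int r, int c);
extern void   free_blocked(void *A_blocked);
extern double time_spmv(spmv_fn f, const void *A_blocked, const double *x, double *y);

/* Exhaustive search: time every variant on the user's matrix and keep the fastest.
   A heuristic search would instead combine install-time benchmark data with an
   estimate of the fill introduced by each block shape. */
spmv_fn pick_best_kernel(const void *A_csr, const double *x, double *y,
                         int *best_r, int *best_c)
{
    static const int shape[4] = { 1, 2, 4, 8 };
    double best_t = DBL_MAX;
    spmv_fn best = NULL;
    for (int ri = 0; ri < 4; ri++)
        for (int ci = 0; ci < 4; ci++) {
            void *A_rc = convert_to_blocked(A_csr, shape[ri], shape[ci]);
            double t = time_spmv(kernel_table[ri][ci], A_rc, x, y);
            if (t < best_t) {
                best_t = t;
                best = kernel_table[ri][ci];
                *best_r = shape[ri];
                *best_c = shape[ci];
            }
            free_blocked(A_rc);
        }
    return best;
}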

Autotuning
Naive parallelization didn't even double performance on the x86 machines, but delivered 40x on Niagara2
Prefetching (exhaustive search) / DMA was helpful on both x86 machines (more so after compression)
NUMA is essential on Opteron/Cell
Matrix compression (heuristic) often delivered performance better than the reduction in memory traffic
There is no naive Cell SPE implementation; although local-store blocking is essential on Cell for correctness, it is rarely beneficial on cache-based machines
Machines running a large number of threads need large numbers of DIMMs (not just bandwidth)
[Chart: SpMV performance on Clovertown, Opteron, Niagara2, and the Cell Blade; bars show Naive Serial, Naive Parallel, +NUMA, +Prefetching, +Matrix Compression, +Cache/TLB Blocking, and +DIMMs/Firmware/Array Padding.]

[P-OSKI tuning flow: the input matrix is partitioned into submatrices and load balanced; each submatrix is built for the target architecture, benchmarked in parallel, and evaluated against parallel heuristic models (via OSKI); the resulting handles are accumulated into a P_OSKI_Matrix_Handle returned to the user for kernel calls.]

Optimizations included in our autotuner:


NUMA-aware allocation: exploits NUMA memory systems
SW prefetching: attempts to hide L2 and DRAM latency
Matrix compression: maximizes performance by minimizing memory traffic; this includes power-of-2 register blocking, CSR/COO selection, index size reduction, etc. (a register-blocked kernel is sketched below)
Cache blocking: reorganizes the matrix to maximize source-vector locality in the L2
TLB blocking: reorganizes the matrix to maximize source-vector TLB locality
SIMDization: ensures optimal in-cache performance
Array padding: avoids inter-thread conflicts in the L1/L2
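
To illustrate register blocking, a 2x2 BCSR (block CSR) kernel is sketched below: it keeps a 2-element block of y in registers and reuses each loaded pair of x values twice (a minimal sketch; partial blocks are assumed to be filled with explicit zeros, which is why the autotuner must weigh the extra fill against the reduced index overhead):

/* y = A*x with A in 2x2 BCSR: brow_ptr indexes blocks per block-row,
   bcol_idx gives the block column, val holds 4 values per block (row-major). */
void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;             /* register-resident block of y */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib+1]; k++) {
            const double *b = &val[4*k];
            double x0 = x[2*bcol_idx[k]], x1 = x[2*bcol_idx[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;           /* each loaded x value reused twice */
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*ib]     = y0;
        y[2*ib + 1] = y1;
    }
}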


Comparison with MPI


Compared the autotuned Pthreads implementation with an autotuned shared-memory MPI implementation (MPICH, PETSc+OSKI)
For many matrices, MPI was only slightly faster than serial on the x86 machines
MPI rarely scaled beyond 16 threads on Niagara2 (still under investigation)
The autotuned Pthreads implementation was often 2-3x faster than autotuned MPI

[Chart: Naive Serial vs. autotuned shared-memory MPI (PETSc+OSKI) vs. autotuned Pthreads on Clovertown, Opteron, and Niagara2.]


Architectures Evaluated

2.33GHz Intel Xeon (Clovertown)


[Diagram: two quad-core sockets; each pair of Core2 cores shares a 4MB L2; each socket connects over a 10.6 GB/s FSB to a chipset with 4x64b memory controllers and 667MHz FBDIMMs; aggregate memory bandwidth is 21.3 GB/s read / 10.6 GB/s write.]

2.2GHz AMD Opteron


[Diagram: two dual-core Opteron sockets; each core has a 1MB victim cache; the cores share an SRI/crossbar to a 128b memory controller (10.66 GB/s to 667MHz DDR2 DIMMs per socket); the sockets are linked by HyperTransport at 4GB/s in each direction.]

Performance and Scalability


All machines showed good multisocket scaling (bandwidth per socket was the limiting factor)
Clovertown showed very poor multicore scaling (low data utilization on the FSB)
Machines with simpler cores and more bandwidth delivered better performance



Status and Future Work


Version 1 of parallel SpMV with load-balancing and tree-based reduction implemented
Currently working on a Parallel Benchmark to better pick register block sizes during the autotuning process
Implement Cache Blocking and TLB Blocking heuristics within OSKI
Extend work to other sparse linear algebra kernels

1.4GHz Sun Niagara2 (Huron)


[Diagram: 8 multithreaded SPARC cores with 8KB L1s on a crossbar switch (90 GB/s write-through, 179 GB/s fill) to a 4MB 16-way shared L2 (address-interleaved across 8x64B banks); 4x128b memory controllers (2 banks each) to 667MHz FBDIMMs: 42.66 GB/s read / 21.33 GB/s write.]

3.2GHz IBM Cell Blade (QS20)


[Diagram: two Cell chips, each with a PPE (512KB L2) and 8 SPEs (256KB local store with an MFC each) on an EIB ring network; each chip has 25.6 GB/s to 512MB of XDR DRAM; the chips are coupled via the BIF at <20 GB/s in each direction.]

System Power Efficiency


Used a digital power meter to measure sustained power under load
16 FBDIMMs on Niagara2 drove sustained power to 450W, while the others required around 300W
Clovertown delivered lower performance and required more power than Opteron (thus substantially lower power efficiency)
Cell delivered good performance at moderate power


*OSKI: Optimized Sparse Kernel Interface: http://bebop.cs.berkeley.edu/oski
