Abstract—We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore processors and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously, and it takes 33% fewer multiplications compared to GR. For custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In the PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.
Index Terms—Parallel computing, orthogonal transforms, dense linear algebra, multiprocessor system-on-chip, instruction level
parallelism
1 INTRODUCTION

QR factorization of an m × n matrix A is given by

$A = QR$ (1)

where $Q_{m \times m}$ is an orthogonal matrix and $R_{m \times n}$ is an upper triangular matrix. There are mainly three methods to compute QR factorization: 1) Householder Transform (HT), 2) Givens Rotation (GR), and 3) Modified Gram-Schmidt (MGS). MGS is used in embedded systems where numerical accuracy of the final solution is not critical, while HT is employed in High Performance Computing (HPC) applications since it is a numerically stable operation. Furthermore, GR is applied in applications pertaining to embedded systems where numerical stability of the end solution is critical [3]. We sense here an opportunity in GR for applicability in the HPC domain.
For implementations on multicore processors and General Purpose Graphics Processing Units (GPGPUs), a library based approach is followed for Dense Linear Algebra (DLA) computations, where highly tuned packages based on specifications given in Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) are employed. In dgeqr2, the Double Precision General Matrix-vector Multiplication (dgemv) operation is dominant, while in dgeqrf, Double Precision Matrix Multiplication (dgemm) is dominant, as depicted in figure 1. dgemv can attain up to 10% of the theoretical peak on multicore platforms and 15-20% on GPGPUs, while dgemm can attain up to 40-45% on multicore platforms and 55-60% on GPGPUs at ≈ 65W and ≈ 260W respectively [5]. Considering the low performance of multicores and GPGPUs for critical DLA computations, it could be a prudent approach to move away from the traditional BLAS/LAPACK based strategy in software and accelerate these computations on a customizable platform that can attain an order of magnitude higher performance than state-of-the-art realizations of these software packages. Special care has to be taken in the design of the accelerator so that it is capable of attaining the desired performance while maintaining generality in supporting other operations in the domain.

Fig. 1: dgeqr2 and dgeqrf Operations

• Farhad Merchant and Rainer Leupers are with the Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Germany. E-mail: {farhad.merchant, leupers}@ice.rwth-aachen.de
• Tarun Vatwani and Anupam Chattopadhyay are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: {tvatwani,anupam}@ntu.edu.sg
• Soumyendu Raha and S. K. Nandy are with the Indian Institute of Science, Bangalore.
• Ranjani Narayan is with Morphing Machines Pvt. Ltd.
Manuscript received October 20, 2016;
Coarse-grained Reconfigurable Architectures (CGRAs) are a good candidate for the domain of DLA computations since they are capable of attaining the performance of Application Specific Integrated Circuits (ASICs) while providing the flexibility of Field Programmable Gate Arrays (FPGAs) [6][7][8]. Recently, there have been several proposals in the literature for developing BLAS and LAPACK on customizable CGRA platforms through algorithm-architecture co-design, where macro operations in the operations pertaining to DLA are identified and realized on a Reconfigurable Data-path (RDP) that is tightly coupled to the processor pipeline [9][10]. In this paper, we focus on acceleration of GR based QR factorization, where classical GR is generalized to achieve Generalized Givens Rotation (GGR), which takes 33% fewer multiplications compared to GR. Several macro operations in GGR are identified and realized on the RDP to achieve superior performance compared to the Modified Householder Transform (MHT) presented in [5]. Major contributions in this paper to achieve efficient realization of GR based QR factorization are as follows:

• We improvise over Column-wise Givens Rotation (CGR) presented in [11] and present GGR. While CGR is capable of simultaneous annihilation of multiple elements of a column of the input matrix, GGR can annihilate multiple elements of rows and columns simultaneously
• Several macro operations in GGR are identified and implemented on an RDP tightly coupled to the pipeline of a Processing Element (PE), resulting in 81% of the theoretical peak of the PE. This implementation outperforms the Modified Householder Transform (MHT) based QR factorization (dgeqr2ht) implementation presented in [5] by 10% in the PE. GGR based QR factorization also outperforms dgemm in the PE by 10%, which is counter-intuitive
• Arguably, moving away from BLAS for realization of GGR based QR factorization attains 10% higher performance than the classical way of implementation where Level-3 BLAS is used as a dominant operation in the state-of-the-art software packages for multicore and GPGPUs. This claim is validated by several case studies on multicore and GPGPUs
• For parallel realization in REDEFINE, we attach the PE in the REDEFINE framework, where 10% higher performance is attained over the dgeqr2ht implementation presented in [5]. We show that sequential realization in the PE and parallel realization of GGR based QR factorization in REDEFINE are scalable. Furthermore, it is shown that the speed-up in parallel realization in REDEFINE over sequential realization in the PE is commensurate with the hardware resources employed in REDEFINE and the speed-up asymptotically approaches the theoretical peak of REDEFINE CGRA

GGR and its implementation are discussed in section 4. Parallel realization of GGR in REDEFINE CGRA is discussed in section 5. We conclude our work in section 6.

Abbreviations/Nomenclature:

AVX: Advanced Vector Extension
BLAS: Basic Linear Algebra Subprograms
CE: Compute Element
CFU: Custom Function Unit
CGRA: Coarse-grained Reconfigurable Architecture
CPI: Cycles-per-Instruction
CUDA: Compute Unified Device Architecture
DAG: Directed Acyclic Graph
FMA: Fused Multiply-Add
FPGA: Field Programmable Gate Array
FPS: Floating Point Sequencer
FPU: Floating Point Unit
GPGPU: General Purpose Graphics Processing Unit
GR: Givens Rotation
GGR: Generalized Givens Rotation
HT: Householder Transform
ICC: Intel C Compiler
IFORT: Intel Fortran Compiler
ILP: Instruction Level Parallelism
KF: Kalman Filter
LAPACK: Linear Algebra Package
MAGMA: Matrix Algebra on GPU and Multicore Architectures
MGS: Modified Gram-Schmidt
MHT: Modified Householder Transform
NoC: Network-on-Chip
PE: Processing Element
PLASMA: Parallel Linear Algebra Software for Multicore Architectures
QUARK: Queuing and Runtime for Kernels
RDP: Reconfigurable Data-path
XGEMV/xgemv: Single/Double Precision General Matrix-vector Multiplication
XGEMM/xgemm: Single/Double Precision General Matrix Multiplication
XGEQR2/xgeqr2: Single/Double Precision QR Factorization based on Householder Transform (with XGEMV)
XGEQRF/xgeqrf: Blocked Single/Double Precision QR Factorization based on Householder Transform (with XGEMM)
XGEQR2HT/xgeqr2ht: Single/Double Precision QR Factorization based on Modified Householder Transform
XGEQR2GGR/xgeqr2ggr: Single/Double Precision QR Factorization based on Generalized Givens Rotation
XGEQRFHT/xgeqrfht: Blocked Single/Double Precision QR Factorization based on Modified Householder Transform (with XGEMM)
XGEQRFGGR/xgeqrfggr: Blocked Single/Double Precision QR Factorization based on Generalized Givens Rotation (with XGEMM)
PACKAGE ROUTINE/package routine: Naming convention followed for routines pertaining to different packages, e.g., BLAS DGEMM/blas dgemm is a Double Precision General Matrix Multiplication routine in BLAS

2 BACKGROUND AND RELATED WORK

CGR presented in [11] is discussed in section 2.1 and REDEFINE CGRA is discussed in section 2.2. A detailed review of yesteryear realizations of QR factorization is presented in section 2.3.
Fig. 4: Performance and performance comparison of PE, where dgeqr2ht is Modified Householder Transform based QR factorization: (a) performance comparison of PE with other platforms for dgemm and dgeqr2ht, in terms of Gflops/watt; (b) performance of PE in terms of theoretical peak performance in PE for dgemm, across architectural enhancements AE0-AE5
... for implementation purposes since it was first proposed in [18]. An alternate ordering of Givens sequences was presented in [19]. According to the scheme presented in [19], the alternate ordering is amenable to parallel implementation of GR, while it does not focus on fusing several Givens sequences to annihilate multiple elements. For an input matrix of n × n, the alternate ordering presented in [19] can annihilate a maximum of n/2 elements in parallel by executing disjoint Givens sequences simultaneously. Pipelining of Givens sequences for computing the QR decomposition is presented in [20], where it is proposed to execute Givens sequences in a pipelined fashion to update the pair of rows partially updated by the previous Givens sequences. Analysis of this strategy is presented for the Exclusive-read Exclusive-write Parallel Random Access Machine (EREW-PRAM), which shows that the pipeline strategy is twice as fast compared to the classical GR. A greedy Givens algorithm is presented in [21] that executes compound disjoint Givens sequences in parallel assuming unlimited parallelism. A high speed tournament GR and a VLSI implementation of tournament GR are presented in [3], where a significant improvement is reported in ASIC over a triangular systolic array. The scheduling scheme presented in [3] is similar to the one presented in [19], where disjoint Givens sequences are applied to compute the QR decomposition. An FPGA implementation of GR is presented in [22], while an ASIC implementation of square-root free GR is presented in [23]. A novel technique to avoid underflow/overflow in computation of the QR decomposition using classical GR is presented in [24], which results in a numerically stable realization of GR in LAPACK. A two-dimensional systolic array implementation of GR is presented in [25], where classical GR is implemented on a two-dimensional systolic array in which the diagonal elements of the array perform complex operations like square root and division while the rest of the array performs the matrix update. Restructuring of tridiagonal and bidiagonal algorithms for QR decomposition is presented in [26]. The restructuring strategy presented in [26] has several advantages: it is capable of exploiting vector instructions in modern architectures, it reduces memory operations, and the matrix updates are in the form of Level-3 BLAS, thus capable of exploiting the cache architecture through reuse of data compared to classical GR, where the update is Level-2 BLAS. Although the scheme presented in [26] re-arranges memory bound operations like Level-2 BLAS in classical GR into compute bound operations like Level-3 BLAS, the scheme does not reduce computations, unlike GGR. In general, works related to GR in the literature focus on different scheduling schemes for parallelism and exploitation of architectural features of the targeted platform. In case of GGR, the total work is reduced compared to the classical GR while also being architecture and platform friendly. For implementation of GR, there has been no work in the literature where macro operations in the routine are identified and realized carefully for performance.

3 CASE STUDIES

For our experiments on multicore and GPGPUs we use the highly tuned software packages PLASMA and MAGMA. Software stacks of these packages are shown in figure 5. Performance of PLASMA and MAGMA depends on the efficiency of the underlying BLAS realization as well as the realization of scheduling schemes for multicore and GPGPUs. For implementation of GGR in PLASMA and MAGMA for multicore and GPGPUs, we add routines to BLAS and LAPACK. Performance in terms of theoretical peak performance in dgemm, dgeqr2, and dgeqrf is depicted in figure 6(a) and performance in terms of Gflops/watt for these routines is depicted in figure 6(b).

3.1 dgemm

dgemm is a Level-3 BLAS routine that has three loops and time complexity of O(n^3) for a matrix of size n × n. Performance of dgemm in LAPACK in terms of theoretical peak of the underlying platform is shown in figure 6(a) and in terms of Gflops/watt in MAGMA is shown in figure 6(b). It can be observed in figure 6(a) that the performance attained by dgemm in Intel Core i7 and Nvidia Tesla C2050 is hardly 25% and 57% respectively. In terms of Gflops/watt it is 1.22 in Nvidia Tesla C2050. Due to the trivial nature of the dgemm algorithm, we do not reproduce the dgemm algorithm here; the standard dgemm algorithm can be found in [27].
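As a point of reference for the discussion above, a minimal triple-loop dgemm sketch is given below. It only illustrates the three-loop O(n^3) structure of the routine; the tuned BLAS implementations discussed in this section additionally block for cache and use SIMD/FMA instructions. The function name and the square row-major layout are illustrative choices, not the BLAS interface itself.

#include <stddef.h>

/* Naive double precision GEMM: C = alpha*A*B + beta*C,
 * with A, B, C stored as n x n row-major arrays.
 * Illustrative only; tuned BLAS implementations block for cache and use SIMD/FMA. */
void dgemm_naive(size_t n, double alpha, const double *A,
                 const double *B, double beta, double *C)
{
    for (size_t i = 0; i < n; i++) {          /* rows of C */
        for (size_t j = 0; j < n; j++) {      /* columns of C */
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)    /* inner product of row i of A and column j of B */
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}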
Fig. 6: Performance in dgemm, dgeqr2, and dgeqrf in LAPACK, PLASMA, and MAGMA: (a) performance in terms of theoretical peak performance of the underlying platform; (b) performance in terms of Gflops/watt (matrix sizes 1000 × 1000 to 10000 × 10000)
Algorithm 1 dgeqr2 (Pseudo code)
Allocate memory for input/output matrices and vectors
for i = 1 to n do
Compute Householder vector v
Compute P where P = I - 2vv^T
Update trailing matrix using dgemv
end for

3.2 dgeqr2

Pseudo code of dgeqr2 is described in algorithm 1. It can be observed in the pseudo code in algorithm 1 that it contains three steps: 1) computation of a Householder vector for each column, 2) computation of the Householder matrix P, and 3) update of the trailing matrix using P = I - 2vv^T, where I is an identity matrix. For our experiments, we use the Intel C Compiler (ICC) and the Intel Fortran Compiler (IFORT). We also use different compiler switches to improve the performance of dgeqr2 in LAPACK on Intel micro-architectures. In an Intel Core i7 4th Gen machine, which is a Haswell micro-architecture, the CPI attained saturates at 1.1 [5]. When the compiler switch -mavx is used, which enables use of Advanced Vector Extensions (AVX) instructions, the Cycles-Per-Instruction (CPI) attained increases. This behavior is due to AVX instructions that use Fused Multiply-Add (FMA). Due to this fact, the CPI reported by VTune™ cannot be considered as a measure of performance for the algorithms, and hence we accordingly double the instruction count reported by Intel VTune™.

In case of GPGPUs, dgeqr2 in MAGMA is able to achieve up to 16 Gflops in Tesla C2050, which is 3.1% of the theoretical peak performance of Tesla C2050, while performance in terms of Gflops/watt is as low as 0.04 Gflops/watt.

3.3 dgeqrf

Algorithm 2 dgeqrf (Pseudo Code)
1: Allocate memory for input/output matrices
2: for i = 1 to n do
3: Compute Householder vectors for block column m × k
4: Compute P matrix where P is computed using Householder vectors
5: Update trailing matrix using dgemm
6: end for

Pseudo code for the dgeqrf routine is shown in algorithm 2.
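To make the trailing-matrix update of Algorithms 1 and 2 concrete, the following is a minimal unblocked Householder QR sketch in C, with one Householder vector per column followed by a dgemv-like column-by-column update of the trailing matrix. It mirrors the structure of Algorithm 1 rather than the actual LAPACK dgeqr2 code (which stores scaled reflectors and tau values separately); the blocked dgeqrf of Algorithm 2 differs in accumulating several reflectors and applying them with dgemm. Function and variable names are illustrative.

#include <math.h>
#include <stddef.h>

/* Unblocked Householder QR on an m x n (m >= n) row-major matrix A.
 * On return, the upper triangle of A holds R; below the diagonal, column k
 * keeps the (unnormalized) tail of the Householder vector v_k. */
void qr_householder_unblocked(size_t m, size_t n, double *A)
{
    for (size_t k = 0; k < n; k++) {
        double norm = 0.0;                               /* 2-norm of A(k:m-1, k) */
        for (size_t i = k; i < m; i++)
            norm += A[i * n + k] * A[i * n + k];
        norm = sqrt(norm);
        if (norm == 0.0)
            continue;

        double alpha = (A[k * n + k] >= 0.0) ? -norm : norm;  /* sign chosen to avoid cancellation */
        A[k * n + k] -= alpha;                           /* v = x - alpha*e1, stored in place */

        double vtv = 0.0;                                /* squared norm of v */
        for (size_t i = k; i < m; i++)
            vtv += A[i * n + k] * A[i * n + k];

        /* Trailing-matrix update: A := (I - 2*v*v^T / v^T v) * A, one column at a time */
        for (size_t j = k + 1; j < n; j++) {
            double vta = 0.0;
            for (size_t i = k; i < m; i++)
                vta += A[i * n + k] * A[i * n + j];
            double tau = 2.0 * vta / vtv;
            for (size_t i = k; i < m; i++)
                A[i * n + j] -= tau * A[i * n + k];
        }
        A[k * n + k] = alpha;  /* diagonal entry of R; overwrites v_k[k], kept simple for illustration */
    }
}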
In terms of computations, there is no difference between algorithms 1 and 2. In a single core implementation, dgeqrf is observed to be 2-3x faster than dgeqr2. The major source of efficiency in dgeqrf is efficient exploitation of the processor memory hierarchy and the dgemm routine, which is a compute bound operation [28][29]. In Nvidia Tesla C2050, dgeqrf in MAGMA is able to achieve up to 265 Gflops, which is 51.4% of the theoretical peak performance of Nvidia Tesla C2050 as shown in figure 6(a), and which is 90.5% of the performance attained by dgemm. For dgeqr2 in MAGMA, the performance attained in terms of Gflops/watt is as low as 0.05 Gflops/watt, while for dgemm and dgeqrf it is 1.23 Gflops/watt and 1.09 Gflops/watt respectively in Nvidia Tesla C2050, as shown in figure 6(b). In case of dgeqrf in PLASMA, the performance attained is 0.39 Gflops/watt while running dgeqrf on four cores.

3.4 dgeqr2ht

Algorithm 3 dgeqr2ht (Pseudo Code)
1: Allocate memory for input/output matrices
2: for i = 1 to n do
3: Compute Householder vectors for block column m × k
4: Compute PA where PA = A - 2vv^T A
5: end for

Pseudo code for dgeqr2ht is shown in algorithm 3. A clear difference between dgeqr2 in algorithm 1, dgeqrf in algorithm 2, and dgeqr2ht in algorithm 3 is in the update of the trailing matrix. dgeqr2 uses the dgemv operation, which is a memory bound operation, and dgeqrf uses the dgemm operation, which is a compute bound operation, while dgeqr2ht uses an operation that is more dense in terms of computations, resulting in a lower θ, where θ is a rough quantification of parallelism through DAG based analysis of the routines dgeqr2, dgeqrf, and dgeqr2ht. In case of no change in the computations of the improved routine after re-arrangement of the computations (in this case fusing of the loops), θ translates into the ratio of the number of levels in the DAG of the improved routine to the number of levels in the DAG of the classical routine. Implementation of dgeqr2ht clearly outperforms implementation of dgeqrf in the PE, as shown in figure 7(a). In figure 7(a), the performance is shown in terms of percentage of theoretical peak performance of the PE and also in terms of percentage of theoretical peak performance normalized to the performance attained by dgemm in the PE. It can be observed in figure 7(a) that the performance attained by dgeqr2ht is 99.3% of the performance attained by dgemm in the PE. Furthermore, it can be observed in figure 7(b) that the performance achieved in terms of Gflops/watt in the PE is 35 Gflops/watt, compared to the performance attained by dgeqrf, which is 25 Gflops/watt. In multicore and GPGPUs, such a fusing does not translate into improvement due to limitations of the platform. Further details of the dgeqr2ht implementation can be found in [5].

Based on the case studies on dgemm, dgeqr2, dgeqrf, and dgeqr2ht, we make the following observations:

• dgeqrf in highly tuned software packages like PLASMA and MAGMA attains 16% and 51% of the theoretical peak respectively. This leaves further scope for improvement in these routines through careful analysis
• Typically, the dgeqrf routine that has a compute bound operation like dgemm achieves 80-85% of the theoretical peak performance attained by dgemm in multicore and GPGPUs
• dgeqr2ht attains similar performance as dgeqrf in multicore and GPGPUs while it clearly outperforms dgeqrf in the PE

We further propose a generalization of the CGR presented in [11] and derive GGR that can outperform dgeqr2ht in the PE and in REDEFINE.

4 GENERALIZED GIVENS ROTATION AND IMPLEMENTATION

From equation 1, for a matrix $A_{n \times n}$, annihilation of the element in the last row and first column, which is (n,1), would require application of one Givens sequence

$G_{n,1} A = \begin{bmatrix} R^{(1)} \\ 0 \end{bmatrix}$ (6)

where matrix $R^{(k)}$ is a matrix with k zero elements in the lower triangle that has undergone k updates. In general, the Givens matrix is given by equation 7.

$G_{i,j} = \mathrm{diag}(I_{i-2}, \tilde{G}_{i,j}, I_{m-i})$ with $\tilde{G} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}$ (7)

where $c = A_{i-1,j}/t$, $s = A_{i,j}/t$ and $t = \sqrt{A_{i-1,j}^2 + A_{i,j}^2}$. It takes n - 1 Givens sequences to annihilate n - 1 elements in the matrix A. There is a possibility to apply multiple Givens sequences to annihilate multiple elements in a column of a matrix. Thus, extending equation 6 to annihilate 2 elements in the first column of the matrix A,

$G_{n-1,1} G_{n,1} A = \begin{bmatrix} R^{(2)} \\ 0 \end{bmatrix}$ (8)

where $(G_{n-1,1} G_{n,1})^T (G_{n-1,1} G_{n,1}) = (G_{n-1,1} G_{n,1}) (G_{n-1,1} G_{n,1})^T = I$. Extending equation 8 further to annihilate n - 1 elements of the first column of the input matrix A,

$G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1} A = \begin{bmatrix} R^{(n-1)} \\ 0 \end{bmatrix}$ (9)

where $(G_{2,1} G_{3,1} \cdots G_{n,1})^T (G_{2,1} G_{3,1} \cdots G_{n,1}) = (G_{2,1} G_{3,1} \cdots G_{n,1}) (G_{2,1} G_{3,1} \cdots G_{n,1})^T = I$. The formulation in equation 9 can annihilate n - 1 elements in the first column of the input matrix A. Furthering annihilation from n - 1 elements to (n - 1) + (n - 2) elements results in matrix $R^{(n-1)+(n-2)}$, where n - 1 elements in the first column and n - 2 elements in the second column of matrix R are zero, as given by equation 10.

$(G_{3,2} G_{4,2} \cdots G_{n-1,2} G_{n,2}) (G_{2,1} G_{3,1} \cdots G_{n-1,1} G_{n,1}) A = \begin{bmatrix} R^{(n-1)+(n-2)} \\ 0 \end{bmatrix}$ (10)

where the product $(G_{3,2} \cdots G_{n,2})(G_{2,1} \cdots G_{n,1})$ is likewise orthogonal, i.e., its product with its transpose in either order is I. To annihilate n(n-1)/2 elements in the lower triangle of the input matrix A, it takes n - 1 sequences, and the Givens matrix shrinks by one row and one column with each column annihilation. Further generalizing equation 10, ...
Fig. 7: Performance of dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht in the PE for matrix sizes 20 × 20 to 120 × 120: (a) in terms of percentage of theoretical peak of the PE and of the peak attained by dgemm in the PE; (b) in terms of Gflops/watt
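As a concrete illustration of equations 7-9, the C sketch below annihilates all sub-diagonal elements of the first column of a matrix by applying the sequence $G_{n,1}, G_{n-1,1}, \ldots, G_{2,1}$ row pair by row pair, i.e., equation 9 realized with classical 2 × 2 rotations. GGR fuses such a sequence using the column 2-norm so that the whole column is annihilated with fewer multiplications; that fused form is not reproduced here. The function name and the row-major in-place update are illustrative choices, not the PE implementation.

#include <math.h>
#include <stddef.h>

/* Annihilate the sub-diagonal elements of the first column of an n x n
 * row-major matrix using the Givens sequence of equation 9, applied right to
 * left: G_{n,1} first, G_{2,1} last. Each rotation acts on rows r-1 and r
 * (0-based) and zeroes A[r][0]. Assumes n >= 2. */
void givens_annihilate_first_column(size_t n, double *A)
{
    for (size_t r = n - 1; r >= 1; r--) {
        double a = A[(r - 1) * n + 0];       /* A_{i-1,1} in the notation of equation 7 */
        double b = A[r * n + 0];             /* A_{i,1} */
        double t = sqrt(a * a + b * b);
        if (t == 0.0)
            continue;
        double c = a / t, s = b / t;         /* c and s as defined after equation 7 */
        for (size_t j = 0; j < n; j++) {     /* apply the 2 x 2 rotation to rows r-1 and r */
            double x = A[(r - 1) * n + j];
            double y = A[r * n + j];
            A[(r - 1) * n + j] =  c * x + s * y;
            A[r * n + j]       = -s * x + c * y;
        }
    }
}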
Algorithm 5 lapack dgeqr2ggr (GGR in LAPACK) (Pseudo code)
Allocate memory for input/output matrices and vectors
for i = 1 to n do
update(L, A(i,i), A, LDA, I, N, M, Tau(i), Beta)
end for
update()
Initialize k, l, and s vectors to 0
Calculate 2-norm of the column vector
Calculate k, l, and s vectors
Update row i of the matrix
Update rows i + 1 to n using k, l, and s vectors

... in HT, MHT, and GGR are different. Furthermore, in our implementation of GGR in PLASMA, we use dgemm for updating the trailing matrix.

4.1.2 GGR in MAGMA

For implementation of magma dgeqr2ggr, we insert routines in MAGMA BLAS. Pseudo code for the inserted routine is shown in algorithm 6. The implementation consists of 3 functions: drnm2, which computes the 2-norm of the column vector; klvec, which computes the k and l vectors; and dtmup, which updates the trailing matrix. It can be observed in algorithm 6 that the most computationally intensive part of the routine is the dtmup function. There are two routines implemented: 1) dgeqr2ggr, where the trailing matrix is updated using the method shown in equation 2, and 2) dgeqrfggr, where the trailing matrix is updated using dgemm. Performance of dgeqr2ggr and dgeqrfggr is shown in figure 9. It can be observed in figure 9 that the performance of dgeqr2ggr is similar to the performance of dgeqr2 in MAGMA, while the performance of dgeqrfggr is similar to the performance of dgeqrf in MAGMA. Despite the abundant parallelism available in dgeqr2ggr, GPGPUs are not capable of exploiting this parallelism due to serialization of the routine, while in dgeqrfggr, GPGPUs perform similar to dgeqrf due to the dominance of dgemm in the routine.

4.2 GGR in PE

For implementation of dgeqr2ggr and dgeqrfggr in the PE, we use a method similar to that presented in [5]. We perform DAG based analysis of GGR and identify macro operations in the DAGs. These macro operations are then realized on the RDP that is tightly coupled to the PE, as shown in figure 10. The PE consists of two modules: 1) a Floating Point Sequencer (FPS), and 2) a Load-Store CFU. The FPS performs double precision floating point computations, while the Load-Store CFU is responsible for loading and storing of data in registers and Local Memory (LM) to/from the Global Memory (GM). Operation in the PE can be defined by the following steps:

• Send a load request to the GM for the input matrix and store input matrix elements in the LM
• Move input matrix elements to registers in the FPS
• Perform computations and store the results back to the LM
• Write results back to the GM if the computed elements are the final output or if they are not to be used immediately

Up to 90% overlap in computation and communication is attained in the PE for dgemm, as presented in [15]. To identify macro operations, consider the example of the 4 × 4 matrix shown in figure 2 and equation 2. It can be observed in figure 2 that computing the Givens Generation (GG) is a square root of an inner product, while computing the update of the first row is similar to an inner product. Furthermore, computing rows 2, 3, and 4 is a determinant of a 2 × 2 matrix. We map these row updates on the RDP as shown in figure 12. In our implementation we ensure that the reconfiguration of the RDP is minimal to improve energy efficiency of the PE. We introduce custom instructions in the PE to implement these macro operations in the RDP, along with an instruction that can reconfigure the RDP. Different routines are realized using these instructions as shown in figure 11. In figure 11, it can be observed that irrespective of the routine for QR factorization implemented in the PE, the communication pattern remains consistent across the routines.

Speed-up in dgeqr2ggr over different routines is shown in figure 13(a). It can be observed in figure 13(a) that the speed-up attained in dgeqr2ggr over the other routines ranges between 1.1x and 2.25x. Performance in terms of the percentage of peak performance normalized to the performance attained by dgemm for different routines for QR factorization is shown in figure 13(b). A counter-intuitive observation that can be made here is that dgeqr2ggr can achieve performance that is higher than the performance attained by dgemm in the PE, while the dgeqr2ht performance reported in [5] is the same as the performance attained by dgemm. dgeqr2ggr can achieve performance that is up to 82% of the theoretical peak of the PE. dgeqr2ggr attains 10% higher Gflops/watt over dgeqr2ht, which is the best performing routine as reported in [5] and [17]. Furthermore, the improvement in dgeqr2ggr over other platforms is shown in figure 13(d). It can be observed in figure 13(d) that the performance improvement in the PE for dgeqr2ggr over dgeqr2ggr in off-the-shelf platforms ranges from 3-100x.

5 PARALLEL IMPLEMENTATION OF GGR IN REDEFINE

An experimental setup for implementation of dgeqr2ggr is shown in figure 14, where the PE that outperforms other off-the-shelf platforms for DLA computations is attached as a CFU in a REDEFINE CE.

For implementation of GGR on REDEFINE, an efficient partitioning and mapping scheme is required that can sustain a computation to communication ratio commensurate with the hardware resources of the platform, and also ensure scalability. We follow a strategy similar to that presented in [5] for realization of dgeqr2ggr in REDEFINE and propose a general scheme that is applicable to an input matrix of any size. For our experiments, we consider 3 different configurations in REDEFINE consisting of 2 × 2, 3 × 3, and 4 × 4 Tile arrays. Assuming that the input matrix is of size N × N and the REDEFINE Tile array is of size K × K, the input matrix can be partitioned into blocks of N/K × N/K sub-matrices. Since the objective of our experiment is to show scalability of our solution, we choose N and K such that N % K = 0. Matrix partitioning and REDEFINE mapping are depicted in figure 15. As shown in figure 15, for a Tile array of size 2 × 2, we follow scheme 1, where the input matrix is partitioned into sub-matrices of size 2 × 2. For a Tile array of size 3 × 3, the input matrix is partitioned into sub-matrices of size 3 × 3 and, as for the 2 × 2 case, scheme 1 is used to sustain the computation to communication ratio. The attained speed-up over sequential realization for different Tile array sizes and matrix sizes is shown in figure 16. Speed-up in dgeqr2ggr, dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht over sequential realization ...
Figure: Performance of dgeqr2, dgeqrf, dgeqr2ht, dgeqrfht, dgeqr2ggr, and dgeqrfggr in LAPACK, PLASMA, and MAGMA, normalized to dgemm performance, for matrix sizes 1000 × 1000 to 10000 × 10000

Fig. 15: Matrix Partitioning and Scheduling on REDEFINE Tile Array
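For the partitioning described in section 5 and depicted in figure 15, a minimal sketch of the block-to-Tile mapping is given below, assuming N % K == 0 as in the experiments. The values of N and K are example values, and the scheduling order of the blocks (scheme 1 in figure 15) is not reproduced here.

#include <stdio.h>

/* Sketch of the section 5 partitioning: an N x N matrix is split into K*K
 * sub-matrices of size (N/K) x (N/K), one per Tile of a K x K REDEFINE Tile
 * array. Prints the element range owned by each Tile (ti, tj). */
int main(void)
{
    const int N = 12;     /* matrix dimension (example value) */
    const int K = 3;      /* Tile array dimension (example value) */

    if (N % K != 0) {
        fprintf(stderr, "N must be divisible by K\n");
        return 1;
    }
    const int B = N / K;  /* block (sub-matrix) dimension */
    for (int ti = 0; ti < K; ti++)
        for (int tj = 0; tj < K; tj++)
            printf("Tile(%d,%d) <- A[%d:%d, %d:%d]\n",
                   ti, tj, ti * B, ti * B + B - 1, tj * B, tj * B + B - 1);
    return 0;
}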
Fig. 13: (a) Speed-up in dgeqr2ggr over dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht in the PE; (b) performance of dgeqr2ggr, dgeqrfggr, dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht in the PE in terms of theoretical peak performance normalized to the performance of dgemm in the PE; remaining panels show Gflops/watt and the performance improvement in terms of Gflops/watt in dgeqr2ht and dgeqr2ggr

6 CONCLUSION

REFERENCES
[4] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen, "LAPACK: A portable linear algebra library for high-performance computers," in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, ser. Supercomputing '90. Los Alamitos, CA, USA: IEEE Computer Society Press, 1990, pp. 2–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=110382.110385
[5] F. A. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Efficient realization of Householder transform through algorithm-architecture co-design for acceleration of QR factorization," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1–1, 2018.
[6] J. York and D. Chiou, "On the asymptotic costs of multiplexer-based reconfigurability," in The 49th Annual Design Automation Conference 2012, DAC '12, San Francisco, CA, USA, June 3-7, 2012, 2012, pp. 790–795. [Online]. Available: http://doi.acm.org/10.1145/2228360.2228503
[7] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Efficient and scalable CGRA-based implementation of column-wise Givens rotation," in ASAP, 2014, pp. 188–189.
[8] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Scalable and energy-efficient reconfigurable accelerator for column-wise Givens rotation," in 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, October 6-8, 2014, 2014, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/VLSI-SoC.2014.7004166
[9] S. Das, K. T. Madhu, M. Krishna, N. Sivanandan, F. Merchant, S. Natarajan, I. Biswas, A. Pulli, S. K. Nandy, and R. Narayan, "A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths," Journal of Systems Architecture - Embedded Systems Design, vol. 60, no. 7, pp. 592–614, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.sysarc.2014.06.002
[10] G. Ansaloni, P. Bonzini, and L. Pozzi, "EGRA: A coarse grained reconfigurable architectural template," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 6, pp. 1062–1074, June 2011.
[11] F. Merchant, A. Chattopadhyay, G. Garga, S. K. Nandy, R. Narayan, and N. Gopalan, "Efficient QR decomposition using low complexity column-wise Givens rotation (CGR)," in 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, January 5-9, 2014, 2014, pp. 258–263. [Online]. Available: https://doi.org/10.1109/VLSID.2014.51
[12] F. Merchant, N. Choudhary, S. K. Nandy, and R. Narayan, "Efficient realization of table look-up based double precision floating point arithmetic," in 29th International Conference on VLSI Design and 15th International Conference on Embedded Systems, VLSID 2016, Kolkata, India, January 4-8, 2016, 2016, pp. 415–420. [Online]. Available: http://dx.doi.org/10.1109/VLSID.2016.113
[13] F. Merchant, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Accelerating BLAS and LAPACK via efficient floating point architecture design," Parallel Processing Letters, vol. 27, no. 3-4, pp. 1–17, 2017. [Online]. Available: https://doi.org/10.1142/S0129626417500062
[14] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596543.1596545
[15] F. Merchant, A. Maity, M. Mahadurkar, K. Vatwani, I. Munje, M. Krishna, S. Nalesh, N. Gopalan, S. Raha, S. Nandy, and R. Narayan, "Micro-architectural enhancements in distributed memory CGRAs for LU and QR factorizations," in VLSI Design (VLSID), 2015 28th International Conference on, Jan 2015, pp. 153–158.
[16] M. Mahadurkar, F. Merchant, A. Maity, K. Vatwani, I. Munje, N. Gopalan, S. K. Nandy, and R. Narayan, "Co-exploration of NLA kernels and specification of compute elements in distributed memory CGRAs," in XIVth International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2014, Agios Konstantinos, Samos, Greece, July 14-17, 2014, 2014, pp. 225–232. [Online]. Available: http://dx.doi.org/10.1109/SAMOS.2014.6893215
[17] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Achieving efficient QR factorization by algorithm-architecture co-design of Householder transformation," in 29th International Conference on VLSI Design and 15th International Conference on Embedded Systems, VLSID 2016, Kolkata, India, January 4-8, 2016, 2016, pp. 98–103. [Online]. Available: http://dx.doi.org/10.1109/VLSID.2016.109
[18] W. Givens, "Computation of plane unitary rotations transforming a general matrix to triangular form," Journal of the Society for Industrial and Applied Mathematics, vol. 6, no. 1, pp. 26–50, 1958. [Online]. Available: http://www.jstor.org/stable/2098861
[19] J. Modi and M. Clarke, "An alternative Givens ordering," Numerische Mathematik, vol. 43, pp. 83–90, 1984. [Online]. Available: http://dx.doi.org/10.1007/BF01389639
[20] M. Hofmann and E. J. Kontoghiorghes, "Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM," Parallel Computing, vol. 32, no. 3, pp. 222–230, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819105001638
[21] E. J. Kontoghiorghes, "Greedy Givens algorithms for computing the rank-k updating of the QR decomposition," Parallel Computing, vol. 28, no. 9, pp. 1257–1273, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819102001321
[22] S. Aslan, S. Niu, and J. Saniie, "FPGA implementation of fast QR decomposition based on Givens rotation," in Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, Aug. 2012, pp. 470–473.
[23] L. Ma, K. Dickson, J. McAllister, and J. McCanny, "Modified Givens rotations and their application to matrix inversion," in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, March 31, 2008 - April 4, 2008, pp. 1437–1440.
[24] D. Bindel, J. Demmel, W. Kahan, and O. Marques, "On computing Givens rotations reliably and efficiently," ACM Trans. Math. Softw., vol. 28, no. 2, pp. 206–238, Jun. 2002. [Online]. Available: http://doi.acm.org/10.1145/567806.567809
[25] X. Wang and M. Leeser, "A truly two-dimensional systolic array FPGA implementation of QR decomposition," ACM Trans. Embed. Comput. Syst., vol. 9, no. 1, pp. 3:1–3:17, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596532.1596535
[26] F. G. V. Zee, R. A. van de Geijn, and G. Quintana-Ortí, "Restructuring the tridiagonal and bidiagonal QR algorithms for performance," ACM Trans. Math. Softw., vol. 40, no. 3, p. 18, 2014. [Online]. Available: http://doi.acm.org/10.1145/2535371
[27] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
[28] J. A. Gunnels, C. Lin, G. Morrow, and R. A. van de Geijn, "A flexible class of parallel matrix multiplication algorithms," in IPPS/SPDP, 1998, pp. 110–116. [Online]. Available: http://dx.doi.org/10.1109/IPPS.1998.669898
[29] J. Wasniewski and J. Dongarra, "High performance linear algebra package for FORTRAN 90," in Applied Parallel Computing, Large Scale Scientific and Industrial Problems, 4th International Workshop, PARA '98, Umeå, Sweden, June 14-17, 1998, Proceedings, 1998, pp. 579–583. [Online]. Available: http://dx.doi.org/10.1007/BFb0095385
Fig. 16: Speed-up over sequential realization and percentage of peak performance for dgeqr2, dgeqrf, dgeqrfht, dgeqr2ht, and dgeqr2ggr on REDEFINE Tile arrays of size 2 × 2, 3 × 3, and 4 × 4
Farhad Merchant is a Postdoctoral Research Fellow at the Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Germany. Previously he worked as a Researcher at the Research and Technology Center, Robert Bosch, and before that he was a Research Fellow at the Hardware and Embedded Systems Lab, School of Computer Science and Engineering, Nanyang Technological University, Singapore. He received his PhD from the Computer Aided Design Laboratory, Indian Institute of Science, Bangalore, India. His research interests are algorithm-architecture co-design, computer architecture, reconfigurable computing, and development and tuning of high performance software packages.

Soumyendu Raha obtained his PhD in Scientific Computation from the University of Minnesota in 2000. Currently he is a Professor of the Computational and Data Sciences Department at the Indian Institute of Science in Bangalore, which he joined in 2003, after having worked for IBM for a couple of years. His research interests are in computational mathematics of dynamical systems, both continuous and combinatorial, and in co-development and application of computing systems for implementation of computational mathematics algorithms.