Table 1. Kalman filter (one iteration at time step k). Matrices are upper case, vectors lower case. During prediction (steps (1) and (2)) a new estimate of the system state x is computed (1) along with an associated covariance matrix P. During update (steps (3) and (4)), the predictions are combined with the current observation z.

$$x_{k|k-1} = F x_{k-1|k-1} + Bu \qquad (1)$$
$$P_{k|k-1} = F P_{k-1|k-1} F^T + Q \qquad (2)$$
$$x_{k|k} = x_{k|k-1} + P_{k|k-1} H^T \left(H P_{k|k-1} H^T + R\right)^{-1} \left(z_k - H x_{k|k-1}\right) \qquad (3)$$
$$P_{k|k} = P_{k|k-1} - P_{k|k-1} H^T \left(H P_{k|k-1} H^T + R\right)^{-1} H P_{k|k-1} \qquad (4)$$
direct solvers. Mathematically, HLAC algorithms (in
Linear algebra particular when blocked for performance) are expres-
• Kalman filter
• L1 analysis convex solver applications sed as loops over sBLACs.
• Gaussian process regression • Linear algebra applications: Finally, to express and sup-
• Cholesky/LU factorization HLACs port entire applications like the Kalman filter, this class
• Triangular solver includes loops and sequences of sBLACs, HLACs, and
• Solver for Sylvester/Lyapunov equation
auxiliary scalar computations.
• S = HPHT + αxxT sBLACs
• V = Lx + Uy
SLinGen supports the latter class (and thus also HLACs),
• A = BC and thus can generate code for a significant class of real-
world linear algebra applications. In this paper we consider
Figure 1. Classes of linear algebra computations used in this paper three case studies: the Kalman filter, Gaussian process re-
with examples. gression, and L1-analysis convex optimization.
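To make the inversion step concrete: (3) and (4) never form (HPH^T + R)^{-1} explicitly; the matrix is factored once with a Cholesky factorization U^T U = HPH^T + R, and each application of the inverse then costs two triangular solves (cf. the LA program kf in Fig. 13a). The following is a minimal scalar sketch of those two solves; it is our illustration, not SLinGen output:

/* Sketch (ours, not SLinGen output): apply M^{-1} to a vector, where
 * M = H P H^T + R has been factored as M = U^T U with U upper triangular.
 * Solving U^T v1 = v0 and then U v2 = v1 yields v2 = M^{-1} v0. */
void apply_inverse(int m, const double U[m][m], const double v0[m], double v2[m]) {
    double v1[m];
    /* forward substitution: U^T v1 = v0 (U^T is lower triangular) */
    for (int i = 0; i < m; i++) {
        double s = v0[i];
        for (int k = 0; k < i; k++)
            s -= U[k][i] * v1[k];
        v1[i] = s / U[i][i];
    }
    /* backward substitution: U v2 = v1 */
    for (int i = m - 1; i >= 0; i--) {
        double s = v1[i];
        for (int k = i + 1; k < m; k++)
            s -= U[i][k] * v2[k];
        v2[i] = s / U[i][i];
    }
}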
Classification of linear algebra computations. Useful for this paper, and for specifying our contribution, is an organization of the types of computations that our generator needs to process to produce single-source code. We consider the following three categories (Fig. 1):

• Basic linear algebra computations, possibly with structured matrices (sBLACs, following [41]): computations on matrices, vectors, and scalars using basic operators: multiplication, addition, and transposition. Mathematically, sBLACs include (as a subset) most of the computations supported by the BLAS interface.
• Higher-level linear algebra computations (HLACs): Cholesky and LU factorization, triangular solve, and other direct solvers. Mathematically, HLAC algorithms (in particular when blocked for performance) are expressed as loops over sBLACs.
• Linear algebra applications: finally, to express and support entire applications like the Kalman filter, this class includes loops and sequences of sBLACs, HLACs, and auxiliary scalar computations.

Linear algebra applications:
• Kalman filter
• L1 analysis convex solver
• Gaussian process regression

HLACs:
• Cholesky/LU factorization
• Triangular solver
• Solver for Sylvester/Lyapunov equation

sBLACs:
• S = HPH^T + αxx^T
• y = Lx + Uy
• A = BC

Figure 1. Classes of linear algebra computations used in this paper with examples.

SLinGen supports the last of these classes (and thus also HLACs) and can therefore generate code for a significant class of real-world linear algebra applications. In this paper we consider three case studies: the Kalman filter, Gaussian process regression, and L1-analysis convex optimization.

Contributions. In this paper, we make the following main contributions.

• The design and implementation of a domain-specific system that generates performant, single-source C code directly from a high-level linear algebra description of an application. This includes the definition of the input language LA, the use of synthesis and DSL (domain-specific language) techniques to optimize at a high, mathematical level of abstraction, and the ability to support vector ISAs. The generated code is single-threaded since the focus is on small-scale computations.
• As a crucial component, we present the first generator for a class of HLACs that produces single-source C (with intrinsics), i.e., does not rely on a BLAS implementation.
• Benchmarks of our generated code for both single HLACs and application-level linear algebra programs, comparing against hand-written code: straightforward C optimized with Intel icc and with clang and the polyhedral optimizer Polly, the template-based Eigen library, and code using libraries (Intel's MKL, ReLAPACK, RECSY).

As we will explain, part of our work builds on, but considerably expands, two prior tools: the sBLAC compiler LGen and Cl1ck. LGen [41, 42] generates vectorized C code for single sBLACs with fixed operand sizes; in this paper we move considerably beyond it in functionality to generate code for an entire linear algebra language that we define and that
includes HLAC algorithms (which themselves involve loops over sequences of sBLACs, some with changing operand size) and entire applications. Cl1ck [9, 10] synthesizes blocked algorithms (but not C code) for HLACs, represented in a DSL assuming an available BLAS library as API. Our goal is to generate single-source, vectorized C code for HLACs and entire linear algebra applications containing them.

2 Background

We provide background on the two prior tools that we use in our work. LGen [25, 41, 42] compiles single sBLACs (see Fig. 1) on fixed-size operands into (optionally vectorized) C code. Cl1ck [9, 10] generates blocked algorithms for (a class of) HLACs, expressed at a high level using the BLAS API. In the following, we use uppercase, lowercase, and Greek letters to denote matrices, vectors, and scalars, respectively.

2.1 LGen

An sBLAC is a computation on scalars, vectors, and (possibly structured) matrices that involves addition, multiplication, scalar multiplication, and transposition. Examples include A = BC, y = Lx + Uy, and S = HPH^T + αxx^T, where S is symmetric and L/U are lower/upper triangular. The output is on the left-hand side and may also appear as input.
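For concreteness, a straightforward scalar rendering of the sBLAC y = Lx + Uy for a fixed size n = 4 could look as follows (our sketch, not LGen output); note that y appears on both sides, so the old value of y must be buffered:

enum { N = 4 };

/* y = L*x + U*y with L lower and U upper triangular (zeros skipped).
 * A temporary t holds the result so that the old y is read throughout. */
void sblac_y_Lx_plus_Uy(const double L[N][N], const double x[N],
                        const double U[N][N], double y[N]) {
    double t[N];
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j <= i; j++)   /* lower-triangular part */
            s += L[i][j] * x[j];
        for (int j = i; j < N; j++)    /* upper-triangular part */
            s += U[i][j] * y[j];
        t[i] = s;
    }
    for (int i = 0; i < N; i++)
        y[i] = t[i];
}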
Overview. LGen translates an input sBLAC on fixed-size operands into C code using two intermediate compilation phases as shown in Fig. 2. During the first phase, the input sBLAC is optimized at the mathematical level, using a DSL that takes possible matrix structures into account. These optimizations include multi-level tiling, loop merging and exchange, and matrix structure propagation. During the second phase, the mathematical expression obtained from the first phase is translated into a C intermediate representation (C-IR). At this level, LGen performs additional optimizations such as loop unrolling and scalar replacement. Finally, since different tiling decisions lead to different code versions of the same computation, LGen uses autotuning to select the fastest version for the target computer.

[Figure 2: pipeline from an input sBLAC (t = Lx, with L 3x3 lower triangular and x 3x1) through C-IR (e.g., t[0,0] = mm256MulPd(L[0,0], x[0,0])) and code-level optimizations to an optimized C function (e.g., t = _mm256_mul_pd(L, x);), selected by performance evaluation.]
Figure 2. The architecture of LGen including a sketched small example.

Explicit vectorization. LGen allows for multiple levels of tiling and arbitrary tile sizes. If vectorization is enabled, the innermost tiles are mapped to ν-BLACs (operations on ν × ν matrices and vectors of length ν), which are pre-implemented once for a given vector ISA together with vectorized data access building blocks, called Loaders and Storers, that handle leftovers and structured matrices.
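As an illustration of these data-access building blocks, a Loader for a leftover of three doubles (ν = 4, AVX) could be realized with a masked load; this is our sketch under those assumptions, not LGen's actual implementation:

#include <immintrin.h>

/* Sketch of a Loader for a leftover of 3 doubles with AVX (nu = 4):
 * the first three elements are loaded, the fourth lane is zeroed. */
static inline __m256d load_leftover3(const double *p) {
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    return _mm256_maskload_pd(p, mask);
}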
2.2 Cl1ck

Cl1ck [9, 10] is an algorithm generator that implements the FLAME methodology [4]. The input to the generator is an HLAC expressed in terms of the standard matrix operators—addition, product, transposition, and inversion—and matrix properties such as orthogonal, symmetric positive definite, and triangular. The output is a family of loop-based algorithms which make use of existing BLAS kernels; while all the algorithms in a family are mathematically equivalent, they provide a range of alternatives in terms of numerical accuracy and performance.

Overview. As illustrated in Fig. 3, the generation takes place in three stages: PME Generation, Loop Invariant Identification, and Algorithm Construction. In the first stage, the input equation (e.g., U^T U = S, where U is the output) is blocked symbolically to obtain one or more recursive formulations, called "Partitioned Matrix Expression(s)" (PMEs); this stage makes use of a linear algebra knowledge base, as well as pattern matching and term rewriting. In our example, blocking yields¹

$$\begin{pmatrix} U_{TL}^T & 0 \\ U_{TR}^T & U_{BR}^T \end{pmatrix} \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix} = \begin{pmatrix} S_{TL} & S_{TR} \\ S_{BL} & S_{BR} \end{pmatrix},$$

from which three dependent equations are generated.
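Written out (our reconstruction from the PME; equating the bottom-left quadrant adds nothing new since S is symmetric), the three equations are

$$U_{TL}^T U_{TL} = S_{TL}, \qquad U_{TL}^T U_{TR} = S_{TR}, \qquad U_{TR}^T U_{TR} + U_{BR}^T U_{BR} = S_{BR}.$$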
In the second stage, Cl1ck identifies possible loop invariants for the yet-to-be-constructed loop-based algorithms. One example here is the predicate $U_{TL}^T U_{TL} = S_{TL}$ that has to be satisfied at the beginning and the end of each loop iteration. Finally, each loop invariant is translated into an algorithm that expresses the computation in terms of recursive calls, and BLAS- and LAPACK-compatible operations. As the process completes, the input equation becomes "known": it is assigned a name, is added to the knowledge base, and can be used as a building block for more complex operations.

Formal correctness. The main idea behind Cl1ck and the underlying FLAME methodology is the simultaneous construction of a loop-based blocked algorithm and its proof of correctness. This is accomplished by identifying the loop invariant that the algorithm will have to satisfy, before the algorithm exists. From the invariant, a template of a proof of correctness is derived, and the algorithm is then constructed to satisfy the proof.

¹ T, B, L, and R stand for Top, Bottom, Left, and Right, respectively.
Figure 5. Example LA program for given constants n and k. Note the explicit specification of input and output matrices.

1 for (j = i+ν; j < m; j += ν) {
2     U_TL^T ∗ U_BR(:, j : j+ν) = S_BR(:, j : j+ν);
3 }

Figure 8. LA code synthesized for the HLAC in line 5 of Fig. 7.

[Figure 9: the same algorithm as in Fig. 7a but with ν = 1 (lines 2–11), followed by its unrolled form after removing zero-sized statements (lines 12–29).]
Figure 9. Code synthesis for the vector-sized HLAC in line 3 of Fig. 7. We denote with i : j the interval [i, j).

The HLAC corresponding to the linear system needs to be further lowered. Since one dimension is already of size ν, the HLAC is partitioned along the other dimension (the columns of submatrices U_BR and S_BR). SLinGen synthesizes the algorithm in Fig. 8, which is inlined in the algorithm in Fig. 7, replacing line 5. Now the computation of the HLAC in (5) is reduced to computations on sBLACs and HLACs of size ν × ν; SLinGen proceeds by generating the corresponding vector-size codelet for the latter.

2) Automatic synthesis of HLAC codelets. The remaining task is to generate code for the vector-sized HLACs; we use line 3 of Fig. 7 as example. Using the same process as before but with block size 1, SLinGen obtains the algorithm in Fig. 9, lines 2–11. This algorithm is then unrolled, as shown in lines 12–29, and then inlined.

No further HLACs remain and SLinGen is ready to proceed with optimizations and code generation.
Algorithm reuse. We have discussed how the iterative process of building basic linear algebra programs out of an initial LA program can require multiple algorithmic synthesis steps per input HLAC. If two HLACs only differ in the sizes of their inputs and outputs but share the same functionality (e.g., both solve a system L^T X = B for X, where L is lower triangular), their LA formulations would be based on the same algorithms. Such reuse can occur even within the same HLAC, as we have shown for the case of (5), where the building block in Fig. 9 was created based on the same algorithm initially derived for the whole computation shown in Fig. 7. For this reason SLinGen stores information about the algorithms required for building basic linear algebra forms of HLACs in a database that is queried before starting a new algorithmic synthesis step (Stage 1a in Fig. 6).

3.2 Stage 2: sBLAC tiling and vectorization

The aim is to lower the basic linear algebra programs produced in the previous stage to C-IR form. To do this, SLinGen decomposes sBLACs into vectorizable codelets (ν-BLACs), and performs optimizations to increase vectorization efficiency when possible.

In a basic linear algebra program, every statement is either an auxiliary scalar computation or an sBLAC. SLinGen proceeds first by tiling all sBLACs and decomposing them into vector-size sBLACs that can be directly mapped onto ν-BLACs, using the LGen approach described in Sec. 2.1. Next it improves the vectorization efficiency of the resulting implementation. For instance, the codelet in Fig. 9 is composed of sBLACs that can be mapped to vectorized ν-BLACs, but also of several scalar computations that could result in a low vectorization efficiency. Thus, SLinGen searches for opportunities to combine multiple scalar operations of the same type into one single sBLAC, with a technique similar to the one used to identify superword-level parallelism [26].

For instance, consider the pair of rules R0 and R1 in Table 2. R0 combines two scalar divisions into an element-wise division of a vector by a scalar, while R1 transforms such an element-wise division into a scalar division followed by the scaling of a vector. The application of rules R0 and R1 to lines 13–15 and 20–21 in Fig. 9 yields two additional sBLACs for the multiplication of a scalar times a vector, as shown in Fig. 10. Similar rules for other basic operators create new sBLACs to improve code vectorization.

The basic linear algebra programs are now mapped onto ν-BLACs and translated into C-IR code.

3.3 Stage 3: Code-level optimization and autotuning

In the final stage, SLinGen performs optimizations on the C-IR code generated by the previous stage. These are similar to, or extended versions of, those done in LGen [42]. We focus on one extension: an improved scalarization of vector accesses enabled by a domain-specific load/store analysis. The goal of the analysis is to replace explicit memory loads and stores by shuffles in registers in the final vectorized C code.
The technique is domain-specific as memory pointers are associated with the mathematical layout. We explain the approach with an example. Consider the access S(i : i+2, i+2) in Fig. 9, line 23. The elements gathered from S were computed in lines 13–15 and 20–21, which were rewritten to the code in Fig. 10 as explained before. Note that in this computation U overwrites S (see the specification in Fig. 5, line 4). The associated C-IR store/load sequence for i = 0 is shown in Fig. 11 and would yield the AVX code in Fig. 12a. However, by analyzing their overlap, we can deduce that element 1 of the first vector (smul9a) goes into element 0 of the result, while element 0 of the second vector (smul9b) goes into element 1. This means the stores/loads can be replaced by a blend instruction in registers, as shown in Fig. 12b.
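The exact instruction sequence of Fig. 12b is not reproduced here; under the mapping just described, one in-register realization is a single AVX shuffle. This is our sketch (variable names taken from Figs. 11 and 12, instruction choice ours):

#include <immintrin.h>

/* Sketch (instruction choice ours): combine the two computed vectors in
 * registers instead of a store/load round trip through S. In lane 0,
 * dst[0] = smul9a[1] and dst[1] = smul9b[0], as the overlap analysis requires. */
static inline __m256d combine_in_registers(__m256d smul9a, __m256d smul9b) {
    return _mm256_shuffle_pd(smul9a, smul9b, 0x1);
}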
Table 2. Rules R0 and R1 for combining scalar operations into sBLACs.

R0: S0 : χ0 = β0/λ, S1 : χ1 = β1/λ  ⇒  x = [χ0 | χ1], b = [β0 | β1], x = b/λ   (6)
R1: op(x) = op(b)/λ, op(·) = (·) or (·)^T  ⇒  τ = 1/λ, op(x) = τ ∗ op(b)   (7)
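The effect of applying R0 and then R1 can be sketched in C as follows (our illustration with hypothetical names, here for two elements with SSE intrinsics): the two divisions by the same λ become one reciprocal and a vector scaling, which maps onto a ν-BLAC.

#include <immintrin.h>

/* Before: statements S0 and S1, one division each. */
void before(double chi[2], const double beta[2], double lambda) {
    chi[0] = beta[0] / lambda;   /* S0 */
    chi[1] = beta[1] / lambda;   /* S1 */
}

/* After R0 + R1: tau = 1/lambda, then x = tau * b as one vector operation. */
void after(double chi[2], const double beta[2], double lambda) {
    double tau = 1.0 / lambda;                      /* R1: scalar reciprocal */
    __m128d b = _mm_loadu_pd(beta);                 /* R0: b = [beta0 | beta1] */
    _mm_storeu_pd(chi, _mm_mul_pd(_mm_set1_pd(tau), b));
}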
Figure 10. Application of rules in Table 2 to (a) lines 13–15 and (b) lines 20–21 in the code in Fig. 9, which yields additional ν-BLACs (second line of both (a) and (b)).
// Store sca. mul. in Fig. 9a. U overwrites S.
Vecstore(S+1, smul9a, [0, 1, 2], hor);
...
// Store sca. mul. in Fig. 9b. U overwrites S.
Vecstore(S+6, smul9b, [0, 1], hor);
...
// Load for Fig. 9, l. 23.
__m256d vS02_vert = Vecload(S+2, [0, 1], vert);

Figure 11. C-IR code snippet for the access S(i : i+2, i+2) with i = 0 on line 23 of Fig. 9. Vecstore(addr, var, [p0, p1, ...], hor/vert) (and the analogous Vecload) is a C-IR vector instruction with the following meaning: store vector variable var, placing the element at position pi at the ith memory location starting from address addr, in horizontal (vertical) direction.

[Figure 12: resulting AVX code; panel (a) includes, e.g., _mm256_maskstore_pd(S+1, mask3, smul9a); and _mm256_maskstore_pd(S+6, mask2, smul9b);]
Figure 12. Resulting AVX code for the C-IR snippet in Fig. 11 without (a) and with (b) load/store analysis. In (a) vS02_vert is obtained by explicitly storing to and loading from memory, while in (b) by shuffling vector variables.
4 Experimental Results

We evaluate SLinGen for two classes of computations: HLACs and linear algebra applications (see Fig. 1).

HLACs. We selected four HLACs common in applications: the Cholesky decomposition (potrf), the solution of triangular, continuous-time Sylvester and Lyapunov equations (trsyl and trlya, respectively), and the inverse of a triangular matrix (trtri). We provide their definitions in Table 3.

Table 3. Selected HLAC benchmarks. All matrices ∈ R^{n×n}. A is symmetric positive definite, S is symmetric, and L and U are lower and upper triangular, respectively. X is the output in all cases.

Name             Label   Computation
Cholesky dec.    potrf   X^T X = A, X upper triangular
Sylvester eq.    trsyl   LX + XU = C, X general matrix
Lyapunov eq.     trlya   LX + XL^T = S, X symmetric
Triangular inv.  trtri   X = L^{-1}, X lower triangular

Applications. We selected three applications from different domains: (a) the Kalman filter (kf) for control systems, which was introduced in Sec. 1; (b) a version of Gaussian process regression [38] (gpr), used in machine learning to compute the predictive mean and variance for noise-free test data; and (c) an L1-analysis convex solver [2] (l1a), used, for example, in image denoising and text analysis. We list the associated LA programs in Fig. 13.
(a) Program kf:
y = F ∗ x + B ∗ u;
Y = F ∗ P ∗ F^T + Q;
v0 = z − H ∗ y;
M1 = H ∗ Y;
M2 = Y ∗ H^T;
M3 = M1 ∗ H^T + R;
U^T ∗ U = M3;
U^T ∗ v1 = v0;
U ∗ v2 = v1;
U^T ∗ M4 = M1;
U ∗ M5 = M4;
x = y + M2 ∗ v2;
P = Y − M2 ∗ M5;

(b) Program gpr:
L ∗ L^T = K;
L ∗ t0 = y;
L^T ∗ t1 = t0;
k = X ∗ x;
ϕ = k^T ∗ t1;
L ∗ v = k;
ψ = x^T ∗ x − v^T ∗ v;
λ = y^T ∗ t1;

(c) Program l1a:
y1 = α ∗ v1 + τ ∗ z1;
y2 = α ∗ v2 + τ ∗ z2;
x1 = W^T ∗ y1 − A^T ∗ y2;
x = x0 + β ∗ x1;
z1 = y1 − W ∗ x;
z2 = y2 − (y − A ∗ x);
v1 = α ∗ v1 + τ ∗ z1;
v2 = α ∗ v2 + τ ∗ z2;

Figure 13. Selected application benchmarks. The declaration of input and output elements is omitted, but we underline output matrices and vectors in all HLACs. All matrices and vectors are of size n × n and n, respectively. Both kf and l1a are iterative algorithms and we limit our LA implementation to a single iteration. The original algorithms of both gpr and l1a additionally contain a small number of min, max, and log operations, which have very minor impact on the overall cost.
Ubuntu 14.04 with Linux kernel v3.13. Turbo Boost is disabled. In the case of potrf, trsyl, trlya, and trtri we compare with: (a) the Intel MKL library v11.3.2, (b) ReLAPACK [33], (c) Eigen v3.3.4 [18], straightforward code compiled (d) with Intel icc v16 and (e) with clang v4 and the polyhedral Polly optimizer [16], and (f) the implementation of the algorithms generated by Cl1ck with MKL. For trsyl we also compare with RECSY [23], a library specifically designed for these solvers. In the case of kf, gpr, and l1a we compare against library-based implementations using MKL and Eigen. Note that starting with v11.2, Intel MKL added specific support for small-scale, double-precision matrix multiplication (dgemm).

The straightforward code is scalar, handwritten, loop-based code with hardcoded matrix sizes. It is included to show the optimizations performed by the compiler. For icc we use the flags -O3 -xHost -fargument-noalias -fno-alias -no-ipo -no-ip. Tests with clang/Polly were compiled with the flags -O3 -mllvm -polly -mllvm -polly-vectorizer=stripmine. Finally, tests with MKL are linked to the sequential version of the library using flags from the Intel MKL Link Line Advisor.² In Eigen we used fixed-size Map interfaces to existing arrays, no-alias assignments, in-place computations of solvers, and enabled AVX code generation. All code is in double precision and the working set fits in the first two levels of cache.

² software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Plot navigation. The plots present performance in flops per cycle (f/c) on the y-axis and the problem size on the x-axis. All matrices and vectors are of size n × n and n, respectively. The peak performance of the CPU is 8 f/c. A bold black line represents the fastest SLinGen-generated code. For HLACs, generated alternatives based on different Cl1ck-generated algorithms are shown using colored, dashed lines without markers. The selection of the fastest is thus a form of algorithmic autotuning.

Every measurement was repeated 30 times on different random inputs. The median is shown and quartile information is reported with whiskers (often too small to be visible). All tests were run with warm cache.
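A minimal sketch of such a measurement loop (ours; randomize_inputs is a hypothetical helper) in C:

#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Repeat the kernel 30 times and report the median in flops/cycle.
 * The first (discarded) call warms the cache. */
double flops_per_cycle(void (*kernel)(void), double flops) {
    enum { REPS = 30 };
    uint64_t t[REPS];
    kernel();                            /* warm-up */
    for (int r = 0; r < REPS; r++) {
        /* randomize_inputs();  -- hypothetical helper */
        uint64_t start = __rdtsc();
        kernel();
        t[r] = __rdtsc() - start;
    }
    qsort(t, REPS, sizeof t[0], cmp_u64);
    return flops / (double)t[REPS / 2];  /* median */
}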
4.2 Results and analysis

HLACs. Figure 14 shows the performance for the HLAC benchmarks. In the left column of Fig. 14 we compare performance results for all competitors except Cl1ck, for which we provide a more detailed comparison in the plots in the right column. Cl1ck generates blocked algorithms with a variable block size nb and uses a library for the occurring BLAS functions (here: MKL BLAS). For each HLAC benchmark we measure performance for nb ∈ {ν, n/2, n}, where ν = 4 is the ISA vector length.

For potrf (Fig. 14a) SLinGen generates code that is on average 2×, 1.8×, and 3.8× faster than MKL, ReLAPACK, and Eigen, respectively. Compared to icc and clang/Polly we obtained larger speedups of 4.2× and 5.6×, showing the limitations of compilers. SLinGen is also about 40% faster than Cl1ck code. As expected, the performance of Cl1ck's implementation is very close to that of the MKL library. For trsyl (Fig. 14b) our system reaches up to 2 f/c and is typically 2.8×, 2.6×, 12×, 4×, 1.5×, and 2× faster than MKL, ReLAPACK, RECSY, Eigen, icc, and clang/Polly, respectively.
[…] than 2× slower than icc. Fig. 14d shows trtri, where SLinGen achieves up to 3.5 f/c, with an average speedup of about 2.5× with respect to MKL and Cl1ck's most performant implementation. When comparing with the other competitors, SLinGen is up to 2.3×, 21×, 4.2×, and 4.6× faster than ReLAPACK, Eigen, icc, and clang/Polly, respectively.
[Figure 14: performance [f/c] vs n [double], n = 4, ..., 124; panels: (a) potrf, cost ≈ n³/3 flops; (b) trsyl, cost ≈ 2n³ flops; (c) trlya; (d) trtri, cost ≈ n³/3 flops; lines: SLinGen, Intel MKL 11.3.2, Eigen 3.3.4, ReLAPACK, RECSY '09, Intel icc 16, clang 4 + Polly 3.9, and Cl1ck + MKL with nb ∈ {4, n/2, n}.]
Figure 14. Performance plots for the HLAC benchmarks: (a) potrf, (b) trsyl, (c) trlya, and (d) trtri. In the left column, colored, dashed lines without markers indicate different SLinGen-generated algorithms.

[…] than MKL, Eigen, and icc. Note that typical sizes for kf tend to lie on the left half, where we observe even larger speedups. Fig. 15b fixes the state size to 28 and varies the observation size between 4 and 28.

On gpr, our generated code performs similarly to MKL, while being 1.7× faster than icc and Eigen. Finally, for the l1a test, SLinGen generated code that is on average 1.6×, 1.3×, and 1.5× faster than MKL, Eigen, and icc, respectively.

To judge the quality of the generated code, we study in more depth the code synthesized for the HLAC benchmarks.

Bottleneck analysis. We examined SLinGen's generated code with ERM [7], a tool that performs a generalized roofline analysis to determine hardware bottlenecks. ERM creates the computation DAG using the LLVM Interpreter [17, 27] and analyzes it for the HLAC benchmarks. Table 4 summarizes, for each routine and three sizes, the hardware bottleneck found with ERM. In general we notice that the generated code is limited by two main factors. For small sizes it is the cost of divisions/square roots, which on Sandy Bridge can only be issued every 44 cycles. The fraction of these is asymptotically small (approximately 1/n² for potrf, and 1/n for trsyl, trlya, and trtri) but matters for small sizes, as they are also sequentially dependent. We believe that this effect also plagues our gpr, which suffers from the divisions in the Cholesky decomposition and triangular solve.

For larger sizes, the number of loads and stores between registers and L1 grows, mostly due to spills. SLinGen tends to fuse the innermost loops of several neighbouring sBLACs, which can result in increased register pressure due to several intermediate computations. This effect could be improved by better scheduling strategies based on analytical models [28].
Library focus on small computations. Recently, the problem of efficiently processing small- to medium-size computations has attracted the attention of hardware and library developers. The LIBXSMM library [21] provides an assembly code generator for small dense and sparse matrix multiplication, specifically for Intel platforms. Intel MKL has introduced fast general matrix multiplication (GEMM) kernels for small matrices, as well as specific functionality for batches of matrices with the same parameters, such as size and leading dimensions (GEMM_BATCH). The very recent [15] provides carefully optimized assembly for small-scale BLAS-like operations. The techniques used in writing these kernels could be incorporated into SLinGen.

DSL-based approaches. PHiPAC [5] and ATLAS [46] are two of the earliest generators, both aiming at high-performance GEMM by parameter tuning. Higher-level generators for linear algebra include the CLAK compiler [11, 12], the DxTer system [29], and BTO [40]. CLAK finds efficient mappings of matrix equations onto building blocks from high-performance libraries such as BLAS and LAPACK. SLinGen could benefit from a CLAK-like high-level front end to perform mathematical reasoning on more involved matrix computations that require manipulation. DxTer transforms blocked algorithms, such as those generated by Cl1ck, and applies transformations and refinements to output high-performance distributed-memory implementations. BTO focuses on memory-bound computations (BLAS 1-2 operations) and relies on a compiler for vectorization. LINVIEW [30] is a framework for incremental maintenance of analytical queries expressed in terms of linear algebra programs. The goal of the system is to propagate within a (large) computation only the changes caused by (small) variations in the input matrices.

The MATLAB Coder [43] supports the generation of C and C++ functions from most of the MATLAB language but produces only scalar code without explicit vectorization. Julia [3] is a high-level dynamic language that targets scientific computing. Linear algebra operations in Julia are mapped to BLAS and LAPACK calls.

Also related are expression-template-based DSLs like the Eigen [18] and uBLAS [45] libraries. In particular, Eigen provides vectorized code generation, supporting a variety of functionalities including Cholesky and LU factorizations. However, libraries based on C++ metaprogramming cannot take advantage of algorithmic or implementation variants. Another approach based on metaprogramming is taken by the Hierarchically Tiled Arrays (HTAs) [20], which offer data types with the ability to dynamically partition matrices and vectors, automatically handling situations of overlapping areas. HTAs' priority, however, is to improve programmability, reducing the amount of code required to handle tiling and data distribution in parallel programs, leaving any optimization to the programmer (or program generator).

Finally, our approach is in concept related to Spiral, a generator for the different domain of linear transforms (like the FFT) [35, 36], and it also uses ideas from the code generation approach of the LGen compiler [25, 41, 42]. Spiral does not target the computations considered in this paper; there was an effort to extend it to linear algebra [14], but it was shown only for matrix-matrix multiplication. The connection between SLinGen and Spiral is conceptual: the translation from math to code and the use of DSLs to apply optimizations at a high level of abstraction.

Optimizing compilers. In Sec. 4 we compare against Polly [16], an optimizer for the clang compiler based on the polyhedral model [13]. This and other techniques reschedule computation and data accesses to enhance locality and expose parallelization and vectorization opportunities [6, 24]. Multi-platform vectorization techniques such as those in [31, 32] use abstract SIMD representations, making optimizations such as alignment detection portable across different architectures. The work in [37] leverages whole-function vectorization at the C level, differently from other frameworks that deviate from the core C language, such as Intel ispc [34]. The scope of these and other optimizing compilers is more general than that of our generator. On the other hand, we take advantage of the specific domain to synthesize vectorized C code of higher performance.

7 Conclusions

This paper pursues the vision that performance-optimized code for well-defined mathematical computations should be generated directly from a mathematical representation. This way, a) complete automation is achieved; b) domain knowledge is available to perform optimizations at a high level of abstraction that are difficult or impossible for compilers; c) porting to new processor architectures may be simplified. In this paper we proposed a prototype of such a system, called SLinGen, for small-scale linear algebra as needed in control, signal processing, and other domains. While limited in scope, it goes considerably beyond prior work on automation for linear algebra, which mainly focused on library functions for BLAS or LAPACK; SLinGen compiles entire linear algebra programs and still obtains performance competitive with or superior to handwritten code. The methods we used are a combination of DSLs, program synthesis and generation, symbolic computation, and compiler techniques. There is active research on these topics, which should also provide ever better tools and language support to facilitate the development of more, and more powerful, generators like SLinGen.

Acknowledgments

Financial support from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through grants GSC 111 and BI 1533/2-1 is gratefully acknowledged.
[40] … In International Parallel and Distributed Processing Symposium (IPDPS). 1–8.
[41] D. G. Spampinato and M. Püschel. 2014. A Basic Linear Algebra Compiler. In International Symposium on Code Generation and Optimization (CGO). 23–32.
[42] D. G. Spampinato and M. Püschel. 2016. A Basic Linear Algebra Compiler for Structured Matrices. In International Symposium on Code Generation and Optimization (CGO). 117–127.
[43] The MathWorks, Inc. 2017. MATLAB Coder. https://www.mathworks.com/products/matlab-coder.html.
[44] F. G. Van Zee and R. A. van de Geijn. 2015. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software (TOMS) 41, 3, Article 14 (2015), 33 pages.
[45] J. Walter, M. Koch, et al. 2012. uBLAS. www.boost.org/libs/numeric/ublas.
[46] R. C. Whaley and J. J. Dongarra. 1998. Automatically Tuned Linear Algebra Software. In Supercomputing (SC). 1–27.
[47] X. Zhang. 2017. The OpenBLAS library. http://www.openblas.net/.