
ExaShark: A Scalable Hybrid Array Kit for Exascale Simulation
Imen Chakroun
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium

Tom Vander Aa
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium

Bruno De Fraine
Vrije Universiteit Brussels,
Brussels, Belgium

Tom Haber
ExaScience Life Lab, Belgium
UHasselt, Belgium

Roel Wuyts
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium
DistriNet, KU Leuven,
Belgium

Wolfgang Demeuter
Vrije Universiteit Brussels,
Brussels, Belgium


ABSTRACT

Many problems in high-performance computing, such as stencil applications in iterative solvers or in particle-based simulations, need regular distributed grids. Libraries offering such n-dimensional regular grids need to take advantage of the latest high-performance computing technologies and scale to exascale systems, but also need to be usable by non-HPC experts. This paper presents ExaShark: a library for handling n-dimensional distributed data structures that strikes a balance between performance and usability. ExaShark is an open source middleware, offered as a library, targeted at reducing the increasing programming burden on heterogeneous current and future exascale architectures. It offers its users a global-array-like usability while its runtime can be configured to use shared memory threading techniques (Pthreads, OpenMP, TBB), inter-node distribution techniques (MPI, GPI), or combinations of both. ExaShark has been used to develop applications such as a Jacobi 2D heat simulation and advanced pipelined conjugate gradient solvers. These applications are used to demonstrate the performance and usability of ExaShark.

Exascale machines will be needed mainly for scientific and industrial simulations, where scientists try to understand more complex phenomena by using simulations that run at ever higher resolutions and on ever longer time scales. One of the primary data structures in many of these scientific simulations is a regular multidimensional array. Indeed, a number of simulations can be modeled as a time-discrete evolution on a structured multidimensional array. Regular arrays are also at the core of many numerical algorithms.
Unfortunately, support for large-scale distributed multidimensional arrays is very limited. Such arrays generally must be manually mapped to one-dimensional arrays, with the user performing all the indexing logic. The problem is even worse in the parallel setting, since efficiently communicating subsets of an array that are not physically contiguous requires significant effort from the programmer. Some existing libraries (Global Array Toolkit [4], LibGeoDecomp [17], etc.) already offer support for n-dimensional regular grids. We show that ExaShark improves on the state of the art by providing a user-friendly, generic API without sacrificing performance. ExaShark takes advantage of the latest high-performance computing technologies, helping to scale to exascale systems.

Author Keywords

N-dimensional grids, Partitioned Global Address Space, Exascale simulation, Middleware library, numerical solvers.

INTRODUCTION

This paper introduces ExaShark: a partitioned global address space (PGAS) library that provides a high-level, abstract interface for handling shared and distributed multidimensional arrays. It offers its users a global-arrays-like usability while its runtime can be configured to use any combination of shared memory threading techniques (Pthreads, OpenMP, TBB) and inter-node distribution techniques (MPI, GPI). Compared to similar general-purpose structured grid-based application programming interfaces, ExaShark is enhanced with advanced features such as expression templates and operator overloading for global arrays, coupling of multiple grids and dynamic redistribution of the grid.

High-performance computing (HPC) architectures are expected to change dramatically in the next decade with the arrival of exascale computing: high-performance computers that offer 1 exaFLOP/s (10^18 floating-point operations per second) of performance. Because of power and cooling constraints, large increases in individual core performance are not possible, and as a result on-chip parallelism is increasing rapidly. The expected hardware for an exascale machine node will therefore need to rely on massive parallelism both on-chip and off-chip, with a complex distributed hierarchy of resources. Programming a machine of such scale and complexity will be very hard unless some appropriate, workable layers of abstraction are introduced to bridge the gap between problem specification and efficient code at the cluster, node and chip levels.

To validate our claims of performance and usability, the performance of ExaShark is evaluated on advanced state-of-the-art pipelined conjugate gradient solvers. We also compare ExaShark's scalability with other widely used n-dimensional libraries on stencil-based applications such as the standard Jacobi 2D heat distribution benchmark.

HPC 2015, April 12-15, 2015, Alexandria, VA.
Copyright © 2015 Society for Modeling & Simulation International (SCS).

The following section describes the design of ExaShark. The focus is put on the communication layer, the data sharing patterns and the advanced features such as boundary conditions and expression templates over global arrays.

The remainder of this paper is organized as follows. In Section 2, existing work dealing with multidimensional arrays is
presented. Section 3 describes the ExaShark design focusing
mainly on the communication layer, the data sharing patterns
and the advanced features such as ghost regions. In Section
4, the experimental results are presented on two benchmarks:
advanced numerical solvers and stencil codes. Some conclusions and perspectives of this work are drawn in Section 5.

FRAMEWORK DESIGN

ExaShark is an open source middleware [6], offered as a library, targeted at reducing the increasing programming burden on heterogeneous current and future exascale architectures. ExaShark handles matrices that are physically distributed blockwise, either regularly or as the Cartesian product of irregular distributions on each axis. Access to the global array is performed through logical indexing.
To define a global array, the programmer defines its coordinates, which are the n dimensions of the structure, and its data. The data can also be redistributed at run time. In addition, the programmer can define halo (ghost) regions in the global array (see Figure 1). Indeed, many applications based on regular grids need support for ghost cells. These regions are boundary data that one process needs in order to compute its inner part of the global array but which are held remotely by other processes and need to be exchanged (updated) at each iteration. Because ghost-cell exchanges are error-prone and require expert knowledge, ExaShark manages them transparently for the programmer. Along with ghost cells, periodic boundary conditions are also supported.
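
The ghost-region update with periodic boundaries described above can be illustrated with a small sequential sketch. It is not ExaShark code: the `LocalBlock` type and the `update_ghosts` function are hypothetical names introduced here, the grid is one-dimensional, and the "processes" are simply entries of a vector rather than MPI ranks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration, not the ExaShark API: each "process" owns a
// block of a 1D periodic grid, padded with `ghost` cells on both sides.
struct LocalBlock {
    std::size_t ghost;         // ghost width on each side
    std::vector<double> data;  // layout: [left ghosts | owned cells | right ghosts]

    LocalBlock(std::size_t owned_cells, std::size_t ghost_width)
        : ghost(ghost_width), data(owned_cells + 2 * ghost_width, 0.0) {}

    std::size_t owned() const { return data.size() - 2 * ghost; }
    double& at(std::size_t i) { return data[ghost + i]; }  // i indexes owned cells
};

// Fill every block's ghost cells from its neighbours' owned cells.
// Periodic boundary conditions: the first and last blocks are neighbours.
void update_ghosts(std::vector<LocalBlock>& blocks) {
    const std::size_t p = blocks.size();
    for (std::size_t r = 0; r < p; ++r) {
        LocalBlock& me = blocks[r];
        LocalBlock& left = blocks[(r + p - 1) % p];
        LocalBlock& right = blocks[(r + 1) % p];
        assert(left.owned() >= me.ghost && right.owned() >= me.ghost);
        for (std::size_t k = 0; k < me.ghost; ++k) {
            // Left ghosts mirror the last owned cells of the left neighbour.
            me.data[k] = left.at(left.owned() - me.ghost + k);
            // Right ghosts mirror the first owned cells of the right neighbour.
            me.data[me.ghost + me.owned() + k] = right.at(k);
        }
    }
}
```

Only ghost cells are written and only owned cells are read, so the order in which blocks are visited does not matter. In a distributed setting each copy would become a message between neighbouring processes.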

RELATED WORK

This section presents existing work dealing with multidimensional arrays. We focus on general-purpose libraries rather than on libraries such as [8] and [16] that are specific to stencil computations, which is only one optimization domain of ExaShark.
Despite the importance of multidimensional arrays to scientific applications, common programming languages such as C++ provide only limited support for such arrays (e.g. through the Boost library).
However, a number of efforts exist that aim at creating reusable libraries to support scientists in implementing grids. The most commonly used PGAS array library is probably Global Arrays (GA) [4]. GA is implemented as a library with C and Fortran bindings, and more recently added Python and C++ interfaces (starting with release 3.2). GA is built on top of the aggregate remote memory copy interface (ARMCI) [1], a low-level, one-sided communication runtime system. ARMCI forms GA's portability layer, and when porting GA to a new platform, ARMCI must be implemented using that platform's native communication and data management primitives. GA works with either the MPI or the TCGMSG message passing library for communication, but it does not seem to support multi-threading. Because variables are always local to an MPI process and sharing them requires explicit communication between processes, the pure MPI approach, without added support for one-sided communication, is unsustainable on future large-scale systems with growing numbers of cores and decreasing amounts of memory per core.
In [17], LibGeoDecomp (Library for Geometric Decomposition codes) is presented. It is a generic C++ library with a high abstraction level that uses grids to express the computation in the presence of high-latency networks. It is limited to 2D and 3D simulations. LibGeoDecomp supports accelerators such as the Intel Xeon Phi coprocessor and GPUs. Even though LibGeoDecomp provides a high-level C++ interface, it does not use specific mechanisms such as expression templates to speed up mathematical operations on the grids.

Figure 1. Ghost region configurations of different sizes using ExaShark

Our major architectural drivers for a scalable structured grid-based library are efficiency, portability and ease of coding. ExaShark is portable since it is built upon widely used technologies such as MPI, with C++ as its programming language. It provides simple coding via a global-arrays-like interface which offers template-based functions (dot products, matrix multiplications, unary expressions). The functionalities offered by ExaShark are efficient since they use asynchronous and specialized communication patterns. For example, let us consider the update operation that fills in ghost cells.
Our library, ExaShark, presented in the following section, is also enhanced with other features such as multiple grids coupling and dynamic redistribution of the grid. A comparison of different features is summarized in Table 1.

Features                                   ExaShark                         Global Array         LibGeoDecomp
Programming language                       C++                              Python/Fortran/C++   C++
Ghost regions                              Yes                              Yes                  Yes
Communication layer                        Hybrid (MPI + OpenMP/Pthreads)   MPI/TCGMSG           MPI/OpenMP
Expression templates for n-dim arrays      Yes                              Yes                  No
Accelerator support (Xeon Phi/GPU)         Xeon Phi                         No                   Yes
Domain redistribution                      Yes                              No                   Yes
Stencil operations                         Yes                              No                   Yes
Vector operations (SPMV, dot products...)  Yes                              Yes                  No
Coupling multiple grids                    Yes                              No                   No

Table 1. Comparing features of ExaShark with similar libraries

The ghost-cell update copies in the visible data residing on neighboring processors, an operation that usually induces latency. In order to hide this latency, ExaShark provides an asynchronous version where the data exchanges are overlapped with the update of the inner region of each process.

Several types of communications are used according to the computation performed. For instance, geometric communication is needed for the update of the ghost regions, while one-sided communication is used for the get and put routines, and collective operations are performed for the reductions.
Expression templates over global arrays

ExaShark offers many high-level functions traditionally associated with arrays, eliminating the need for programmers to write these functions themselves. Examples are basic mathematical operators, unary functions, standard global array operations such as dot products and matrix-vector multiplication, and so on. These functionalities are implemented using expression templates [18].

Data distribution

ExaShark allows the programmer to control regular and irregular data distributions of the global array. Indeed, ExaShark is based on the PGAS parallel programming model, which is convenient for expressing algorithms with large and random data access. Each process is assumed to have fast access to a portion of each distributed matrix, and slower access to the remainder. These access time differences define the data as being either local or remote, respectively. This locality information can be exploited at each iteration of the computation. For example, the user can inquire about what data portion is held by a given process or which process owns a particular array element, identify inner and outer regions of the grid, etc.
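
As an illustration of such locality queries, the sketch below computes, for a regular block distribution along one axis, the index range held by a process and the owner of a given element. The function names `local_range` and `owner_of` are hypothetical and are not ExaShark functions; the rule that the first `n % p` processes receive one extra element is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helpers, not ExaShark functions: regular block distribution of
// n elements over p processes, where the first (n % p) processes get one extra.

// Half-open global index range [begin, end) held by process `rank`.
std::pair<std::size_t, std::size_t> local_range(std::size_t n, std::size_t p,
                                                std::size_t rank) {
    assert(p > 0 && rank < p);
    const std::size_t base = n / p, extra = n % p;
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t size = base + (rank < extra ? 1 : 0);
    return {begin, begin + size};
}

// Rank of the process that owns global index i (requires i < n).
std::size_t owner_of(std::size_t n, std::size_t p, std::size_t i) {
    assert(p > 0 && i < n);
    const std::size_t base = n / p, extra = n % p;
    const std::size_t big = extra * (base + 1);  // elements held by the larger blocks
    return i < big ? i / (base + 1) : extra + (i - big) / base;
}
```

For an n-dimensional array distributed as a Cartesian product, the same two queries can be applied independently on each axis.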

In Figure 2, we show how a basic code (left) of the conjugate gradient solver, which uses mathematical operations over vectors, is rewritten using ExaShark (right). The goal of this algorithm is to give an approximate solution of the equation Ax = b, where A is the coefficient matrix, x is the solution vector that is returned and b is the input vector. The first instruction of the ExaShark code (right) defines the global domain size (in this case n*n). The second instruction defines two global arrays which correspond to the x and b vectors in the pseudo code of CG (left). Here the matrix A is not allocated since it is implicitly assembled using stencils. Indeed, the underlying matrix is available only through matrix-vector products, which are computed without forming and storing the matrix elements; the solver therefore never has access to a fully assembled matrix [15].

ExaShark enables user-configurable wide ghost zones, i.e. ghost zones of width k. It also allows ghost cell widths to be set arbitrarily in each dimension, as sketched in Figure 1, thereby allowing programmers to improve performance by combining multiple fields into one global array and using multiple time steps between ghost cell updates.

Communication layer

As noted in Section 2, ExaShark supports a plethora of lower-level programming models and libraries, with the aim of being suited to exascale systems, which are highly heterogeneous. Application developers may use pure MPI and/or hybrid MPI + OpenMP threads to target coarse- and medium-grained parallelism. Work on integrating GASPI/GPI (Global Address Space Programming Interface) [11] is ongoing and should provide even faster, more efficient, and more scalable communication between processes.

One of the keys to ExaShark's efficiency is that it uses asynchronous communication and overlaps communication with computation whenever possible.

By using expression templates, ExaShark incorporates delayed expressions to reduce temporary memory allocation. For example, consider line 7 from the code sample in Figure 2. A naive implementation of the operation x = x + alpha*p, where alpha is a scalar and x and p are global arrays, would have operator+ and operator* overloaded to return arrays. The expression would then create a temporary for alpha*p, then another temporary for x plus that first temporary, and finally assign the result to x. Even with the return value optimization, this allocates memory at least twice. Expression templates delay evaluation, so the expression essentially generates a new array constructor at compile time. It is as if this constructor takes a scalar and two arrays by reference; it allocates the necessary memory and then performs the computation. Thus only one memory allocation is performed.
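
To make the mechanism concrete, here is a minimal expression-template sketch for a vector update of the form x = x + alpha*p. It is a generic illustration of the technique from [18], not ExaShark's implementation; the namespace `sketch` and the types `Vec`, `Sum` and `Scaled` are hypothetical names chosen for this example, and a plain local vector stands in for a global array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace; keeps the generic operators local

// Expression nodes only store references; nothing is computed until assignment.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename E>
struct Scaled {
    double alpha;
    const E& expr;
    double operator[](std::size_t i) const { return alpha * expr[i]; }
    std::size_t size() const { return expr.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression runs one fused loop and allocates no temporary.
    template <typename E>
    Vec& operator=(const E& e) {
        assert(e.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// The operators build lightweight expression nodes instead of new vectors.
template <typename L, typename R>
Sum<L, R> operator+(const L& a, const R& b) { return Sum<L, R>{a, b}; }

template <typename E>
Scaled<E> operator*(double alpha, const E& e) { return Scaled<E>{alpha, e}; }

}  // namespace sketch
```

With these definitions, a statement such as `x = x + 0.5 * p;` on two `sketch::Vec` objects is compiled into a single loop over the elements. The expression nodes hold references to temporaries, so an expression must be assigned within the same statement in which it is built.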

Our plan is to use the stencil-embedded definition provided by Patus, which makes it possible to exchange only pointers rather than performing data copies. Generally speaking, it is not hard to integrate with external libraries because internally ExaShark stores its data in the common row-wise pattern. It is therefore sufficient in many cases to exchange pointers. If other data formats are used, integration is still possible but may incur the penalty of copying data from and/or to the external representation.

VALIDATION

In this section we present the experimental results obtained with ExaShark on two benchmarks and compare them with the results of the libraries presented in Section 2. We also evaluate the usability of each of the considered libraries using software design metrics.

Hardware platform

The experimental validation was carried out on the Anselm supercomputer cluster [2], which consists of 209 compute nodes, totaling 3344 compute cores with 15TB RAM and giving over 94 Tflop/s theoretical peak performance. Each node is an x86-64 computer, equipped with 16 cores, at least 64GB RAM, and a 500GB hard drive. Nodes are interconnected by a fully non-blocking fat-tree Infiniband network and equipped with Intel Sandy Bridge processors.

Figure 2. Pseudo code (left) vs running code using ExaShark (right) of the conjugate gradient solver

Inter-operability and interfacing with external software

ExaShark can interface with external libraries when needed by the user's application. For example, developers can use the functionality of Intel's Math Kernel Library (MKL) for optimized math routines [13] or PETSc (Portable, Extensible Toolkit for Scientific Computation) [7] for advanced numerical methods for partial differential equations and sparse matrix computations. Work on integrating stencil compilers such as Patus [10] into ExaShark for optimizing stencil code implementations is also ongoing.

Benchmarks

Two applications are considered as benchmarks for measuring the performance of ExaShark. The first one is the conjugate gradient method (CG) [12], which is an algorithm for the numerical solution of particular systems of linear equations. The second one is a heat distribution simulation example, which is a five-point stencil application. The aim of the first experiment is to evaluate the added value of the asynchronous and hybrid communication techniques supported by ExaShark. The aim of the second is to measure the performance of ExaShark in comparison to other similar libraries, namely GA and LibGeoDecomp.

The PETSc solvers, for example, can be called within an ExaShark application to solve PDEs that require solving large-scale, sparse nonlinear systems of equations. One important issue related to inter-operability is how to convert the data structures before calling the PETSc solvers, and how to convert the data structures of PETSc back after calling the PETSc solvers. We believe that one of the most efficient ways to exchange data structures between ExaShark and PETSc without inducing data copying overheads is to use the VecGetArray() and VecRestoreArray() routines of PETSc. VecGetArray() returns a pointer to a contiguous array that contains a processor's portion of the vector data. After calling the PETSc solvers, VecRestoreArray() zeroes out the pointer. If the vector's data are not stored in a contiguous array, this routine also copies the data back into the underlying vector data structure from the array obtained with VecGetArray().

Conjugate Gradient method

Iterative methods are an efficient way to obtain a good numerical approximation to the solution of Ax = b when the matrix A is large and sparse. The CG method is a widely used iterative method for solving such systems when the matrix A is symmetric and positive definite. Generalizations of CG exist for non-symmetric problems. A pseudo code of the conjugate gradient method is given in Figure 2.
We run three experiments for the CG solver using ExaShark, each with a different communication protocol. The first scenario is a CG implementation that relies on asynchronous communication, for which hybrid asynchronous MPI/OpenMP is used. In the second scenario, the same implementation of CG is run with only MPI. The last scenario corresponds to an implementation of CG where no asynchronous communication is considered; for this latter scenario, hybrid synchronous MPI/OpenMP is used.
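
For readers without access to Figure 2, the following is a sequential sketch of the textbook CG iteration with a matrix-free operator. It is not the pipelined, distributed solver benchmarked here: the 1D Laplacian stencil (-1, 2, -1) and the names `apply_stencil`, `dot` and `conjugate_gradient` are assumptions chosen only to keep the example small and self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;

// Matrix-free product y = A*x for the 1D Laplacian stencil (-1, 2, -1),
// a symmetric positive definite operator chosen here purely as an example.
Vector apply_stencil(const Vector& x) {
    const std::size_t n = x.size();
    Vector y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? x[i - 1] : 0.0;
        const double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
    return y;
}

double dot(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Textbook (unpreconditioned, non-pipelined) conjugate gradient for A*x = b.
Vector conjugate_gradient(const Vector& b, double tol, std::size_t max_iter) {
    Vector x(b.size(), 0.0);
    Vector r = b;  // residual r = b - A*x, with x = 0 initially
    Vector p = r;  // first search direction
    double rr = dot(r, r);
    for (std::size_t k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        const Vector Ap = apply_stencil(p);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];   // the x = x + alpha*p update
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

The matrix A never appears explicitly: only `apply_stencil` is needed. In a distributed version the two dot products per iteration become global reductions and the stencil application requires a ghost-cell update, which is where the choice of communication protocol matters.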

For integration with Patus, the idea is to improve the performance of stencil operations at the node level, while ExaShark orchestrates the distribution of data and computation across different nodes and manages the communication between them. Here as well, the data conversion and memory alignment for the structures used by both software packages should be carefully taken into account.

Our simulations with LibGeoDecomp crashed for the 2D heat distribution simulation of a 65536^2 grid because it consumes more than 64GB of memory, which is the maximum allowed memory size on our testing cluster.
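
A back-of-the-envelope check of this memory limit can be made under assumptions that the text does not state: 8-byte double-precision cells and two full copies of the grid, as a Jacobi-style update typically keeps.

```latex
\begin{align*}
65536^2 &= 2^{32} \text{ cells} \\
2^{32} \times 8\,\mathrm{B} &= 2^{35}\,\mathrm{B} = 32\,\mathrm{GiB} \quad \text{(one copy of the grid)} \\
2 \times 32\,\mathrm{GiB} &= 64\,\mathrm{GiB} \quad \text{(two copies)}
\end{align*}
```

Under these assumptions the grid data alone already reaches the 64GB of a single node, before ghost regions or library overhead are counted.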

Figure 3. Results for the conjugate gradient method with ExaShark

The results, reported in Figure 3, show that using both intra- and inter-node asynchronous communication techniques is on average 32.17% and 19.4% better than synchronous communication and MPI-only message passing, respectively. For example, solving the same problem instance using 1024 cores takes 576 seconds with the hybrid asynchronous version of CG, 971 seconds with hybrid synchronous communication and 835 seconds with only asynchronous MPI. The results also show that the speedup with asynchronous hybrid communication protocols scales better than with the other communication techniques. The results are machine- and preconditioner-dependent.
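
As a worked check on the 1024-core timings quoted above, and assuming that "better" means a relative reduction in run time, the gains for that single instance are:

```latex
\begin{align*}
\text{vs. hybrid synchronous:} \quad \frac{971 - 576}{971} &= \frac{395}{971} \approx 40.7\% \\
\text{vs. asynchronous MPI only:} \quad \frac{835 - 576}{835} &= \frac{259}{835} \approx 31.0\%
\end{align*}
```

Both values are higher than the averages of 32.17% and 19.4% reported in the text, which is consistent with the hybrid asynchronous version scaling better at larger core counts.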
Heat distribution using the Jacobi iteration

Many scientific applications use iterative finite-difference techniques that sweep over a spatial grid, performing nearest-neighbor computations called stencils [14]. In a stencil operation, each point in a multidimensional grid is updated with weighted contributions from a subset of its neighbors, locally solving a discretized version of a partial differential equation for that data element.

Figure 4. Comparing the performance of ExaShark, GA and LibGeoDecomp on the 2D heat distribution problem

Jacobi [9] is a popular algorithm for solving Laplace's differential equation on a regularly discretized square domain. The idea is the following: let us consider a 2D array of particles, each with an initial temperature. Each particle is in contact with a fixed number of neighboring particles, which affect its temperature. Laplace's equation is solved for all particles to determine their temperature as the average of the four neighboring particles. To do so, a number of iterations are performed over the data to recompute average temperatures repeatedly, and the values gradually converge to a finer solution until the desired accuracy is reached.
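
The averaging step described above can be sketched sequentially as follows. This is an illustrative single-process version, not the benchmarked code: the function name `jacobi_heat`, the row-wise storage and the stopping rule based on the largest per-cell change are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential sketch of the five-point Jacobi sweep on an n x n grid stored
// row-wise. Boundary cells keep their fixed temperatures.
// Returns the number of sweeps performed.
std::size_t jacobi_heat(std::vector<double>& grid, std::size_t n,
                        double tol, std::size_t max_sweeps) {
    assert(grid.size() == n * n);
    std::vector<double> next = grid;  // second buffer: Jacobi reads old values only
    std::size_t sweeps = 0;
    double change = tol + 1.0;
    while (sweeps < max_sweeps && change > tol) {
        change = 0.0;
        for (std::size_t i = 1; i + 1 < n; ++i) {
            for (std::size_t j = 1; j + 1 < n; ++j) {
                // New temperature: average of the four neighbours.
                const double avg = 0.25 * (grid[(i - 1) * n + j] + grid[(i + 1) * n + j] +
                                           grid[i * n + j - 1] + grid[i * n + j + 1]);
                change = std::max(change, std::fabs(avg - grid[i * n + j]));
                next[i * n + j] = avg;
            }
        }
        std::swap(grid, next);  // the new values become the current grid
        ++sweeps;
    }
    return sweeps;
}
```

Each interior cell reads four neighbours and writes one value per sweep, so the computation per byte moved is low. In a distributed version the rows and columns adjacent to a process boundary would be read from ghost cells that are refreshed between sweeps.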

As noted in Section 2, GA does not seem to support multi-threading. Because variables are always local to an MPI process and sharing them requires explicit communication between processes, the pure MPI approach is unsustainable on future large-scale systems with growing numbers of cores and decreasing amounts of memory per core.

Figure 5 shows the results of weak scaling for all considered libraries. The results show that LibGeoDecomp scales very well compared to ExaShark and GA as long as the grid size is smaller than 65536^2 on our 64GB compute node. The scalability obtained with GA and ExaShark can be explained by the nature of the benchmark itself. Indeed, like most stencil codes, the heat dissipation problem performs repeated sweeps through data structures that are typically much larger than the data caches of modern microprocessors. As a result, these computations generally produce high memory traffic for relatively little computation, causing performance to be bound by memory throughput rather than floating-point operations.

The results described in Figure 4 show that for the 2D heat distribution simulation, ExaShark is on average 59% faster than the Global Array Toolkit for the 32768^2 grid and 52% faster for the 65536^2 grid. We can also see that ExaShark scales better than GA. On the other hand, LibGeoDecomp is on average 39% faster than ExaShark for the 32768^2 grid, but we could not get it to work for bigger simulations, namely for sizes larger than 65536^2. This constraint is a significant problem for exascale numerical simulations.

Design Flaws

ExaShark Global Array LibGeoDecomp

Internal Duplication

280

External Duplication

729

Feature Envy

122

Data Clumps

1153

67

Data Class

249

22

Tradition Breaker

Schizophrenic Class

Table 3. Design flaws for Exashark, GA and LibGeoDecomp

From the design point of view, InCode detects the design anti-patterns that are most commonly encountered in software projects. These anti-patterns cover all the essential design aspects: complexity, encapsulation, coupling, cohesion and inheritance. We ran InCode on GA, ExaShark and LibGeoDecomp and report the corresponding design flaws in Table 3. InCode shows that ExaShark has no design flaws whereas both other libraries do. Some of these design flaws, such as data clumps and feature envy, affect the understanding of the operations provided by a library and consequently its usability. An explanation of the different design flaws reported in Table 3 is given below:
- Data Class: refers to a class with an interface that exposes data members instead of providing any substantial functionality. This means that related data and behavior are not in the same scope, which indicates poor data-functionality proximity.

Figure 5. Analyzing weak scaling of ExaShark, GA and LibGeoDecomp on the 2D heat distribution problem.

- Code Duplication (internal/external): refers to groups of operations which contain identical or slightly adapted code fragments. Duplicated code multiplies the maintenance effort, including the management of changes and bug fixes. Moreover, the code base gets bloated.

Usability

Because simple code is always better than complex code, we tried to evaluate the usability of each of the three libraries considered in this work: ExaShark, GA and LibGeoDecomp. To do so, we used the inCode Helium [5] software, which is a quality assessment tool for Java, C++ and C programs. InCode detects design flaws and helps to understand the causes of quality problems on the level of code and design.

- Data Clumps: are large groups of parameters that appear together in the signature of many operations. This hampers the understanding of operations with a data clump.
- Feature Envy: refers to an operation that manipulates a lot of data external to its definition scope. This is a strong indication that the affected operation was probably misplaced and that it could be moved to the scope in which the envied data resides.

From the code point of view, InCode computes software metrics for a given program, ranging from basic size and complexity metrics to more advanced metrics. In Table 2, we report two metrics: the number of classes and the number of lines of code. According to these metrics, ExaShark has the fewest lines of code and the fewest classes, which is an indication that the library is easier to use.
- Schizophrenic Class: describes a class with a large and non-cohesive interface. Such classes represent more than a single key abstraction, and this affects the ability to understand and change in isolation the various abstractions embedded in the class.
- Tradition Breaker: refers to a class that breaks the interface inherited from a base class or an interface. A class can do this by reducing the visibility of the services published by the base class by means of private/protected inheritance (in C++).

Metrics                    ExaShark   Global Array   LibGeoDecomp
Number of lines of code    8881       501553         65925
Number of classes          12         464            535

Table 2. Number of lines and classes of ExaShark, GA and LibGeoDecomp

CONCLUSION AND FUTURE WORK

In this paper we introduced ExaShark: a PGAS-based library that provides a high-level interface for handling shared and distributed multidimensional arrays. ExaShark supports hybrid shared-memory threading technologies such as Pthreads and OpenMP as well as distributed-memory technologies such as MPI. Compared to similar general-purpose structured grid-based libraries, ExaShark is enhanced with advanced features such as expression templates and operator overloading for global arrays, multiple grids coupling and dynamic redistribution of the grid. Two benchmarks have been used for the experimental validation of ExaShark: advanced pipelined conjugate gradient solvers and stencil-based applications such as a 2D heat simulation. The first application was used to highlight the benefits of using asynchronous and hybrid communication techniques within the library, the second to demonstrate the performance of ExaShark compared to other similar libraries. The results show that using both intra- and inter-node asynchronous communication techniques is 32.17% and 19.4% better than synchronous communication and MPI-only message passing, respectively. Compared to GA, ExaShark is on average 55% faster and scales better for the 2D heat distribution simulation. LibGeoDecomp is faster than ExaShark, but we could not get it to work for bigger simulations, namely for sizes greater than 65536^2. This constraint is a significant problem for exascale numerical simulations. We also tried to evaluate the usability of ExaShark using the inCode Helium tool, which identified design flaws in GA and LibGeoDecomp but not in ExaShark. Some of these design flaws, such as data clumps and feature envy, affect the understanding of the operations provided by a library and consequently its usability.

As future work, we are working on a GASPI/GPI (Global Address Space Programming Interface) integration within ExaShark [11]. For optimizing stencil code implementations, stencil compilers like PATUS [10] will be included into ExaShark.

ACKNOWLEDGMENT

This work is part of the European project EXA2CT [3]. EXA2CT aims at discovering solver algorithms that can scale to the huge numbers of nodes at exascale, developing an exascale programming model that is usable by application developers and offering these developments to the wider community in open-source proto-applications as a basis to develop real exascale applications.

We thank Dr. Pascal Costanza from the ExaScience Lab Belgium and Intel Health & Life Sciences for his help.

REFERENCES

1. Aggregate remote memory copy interface. http://hpc.pnl.gov/armci/documentation.htm.
2. Anselm supercomputer cluster. https://docs.it4i.cz/anselm-cluster-documentation.
3. Exascale algorithms and advanced computational techniques. http://www.exa2ct.eu/.
4. The global array toolkit. http://hpc.pnl.gov/globalarrays.
5. Incode helium. https://www.intooitus.com/products/incode.
6. Scalable hybrid array kit. https://github.com/ExaScience/shark.
7. Balay, S., Abhyankar, S., Adams, M. F., Brown, J., Brune, P., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Rupp, K., Smith, B. F., and Zhang, H. PETSc users manual. Tech. Rep. ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.
8. Bianco, M., and Varetto, U. A generic library for stencil computations. CoRR abs/1207.1746 (2012).
9. Cecilia, J. M., García, J. M., and Ujaldon, M. CUDA 2D Stencil Computations for the Jacobi Method. In PARA (1), K. Jonasson, Ed., vol. 7133 of Lecture Notes in Computer Science, Springer (2010), 173-183.
10. Christen, M., Schenk, O., and Burkhart, H. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS '11, IEEE Computer Society (Washington, DC, USA, 2011), 676-687.
11. Grunewald, D., and Simmendinger, C. The GASPI API specification and its implementation GPI 2.0.
12. Hestenes, M. R., and Stiefel, E. Methods of conjugate gradients for solving linear systems.
13. Intel. Using intel math kernel library for matrix multiplication. https://software.intel.com/en-us/mkl_11.2_tut_c_pdf.
14. Kamil, S., Chan, C., Oliker, L., Shalf, J., and Williams, S. An auto-tuning framework for parallel multicore stencil computations. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, IEEE (2010), 1-12.
15. Knoll, D. A., and Keyes, D. E. Jacobian-free Newton-Krylov methods: a survey of approaches and applications. Journal of Computational Physics 193, 2 (2004), 357-397.
16. Lengauer, C., Apel, S., Bolten, M., Grolinger, A., Hannig, F., Köstler, H., Rüde, U., Teich, J., Grebhahn, A., Kronawitter, S., Kuckuk, S., Rittich, H., and Schmitt, C. ExaStencils: Advanced Stencil-Code Engineering.
17. Schäfer, A., and Fey, D. LibGeoDecomp: A Grid-Enabled Library for Geometric Decomposition Codes. In PVM/MPI, A. L. Lastovetsky, M. T. Kechadi, and J. Dongarra, Eds., vol. 5205 of Lecture Notes in Computer Science, Springer (2008), 285-294.
18. Veldhuizen, T. Expression templates. In C++ Report, vol. 7 (June 1995), 26-31.