Blog entry: Improving the performance of Nektar++ with SIMD

In modern computer architectures, the gap between processor clock speed and memory bandwidth is constantly increasing, meaning that to attain optimal performance, algorithms with a high degree of arithmetic intensity – i.e the ratio between computations performed and the amount of data transferred from the memory (DRAM) – are required. In this context, high-order finite element methods are particularly attractive due to their high (and tunable) arithmetic intensity. Nektar++ is a finite element package designed to allow one to construct efficient classical low polynomial order h-type solvers (where h is the size of the finite element) as well as higher p-order piecewise polynomial solvers. The Nektar++ library comes with a number of solvers and also allows one to construct a variety of new ones. The main solvers provided are a continuous Galerkin incompressible Navier-Stokes solver and a discontinuous Galerkin compressible Navier-Stokes solver. These solvers are routinely applied to industrially relevant simulations; typical applications encompass external aerodynamics (for instance the flow around cars) as well as internal aerodynamics (for instance the flow in aircraft engines). The use single-instruction multiple-data (SIMD) vectorization, that is prevalent on modern hardware, is a well-studied solution for the efficient implementation of high-order operators. We are in the process of integrating within the Nektar++ library some operators (which are based on the work of Moxey et al., 2020) that take advantage SIMD hardware. The intent is to improve the efficiency of the Nektar++ library with the specific end goal of accelerating the compressible flow solver. Figure Caption: Roofline analysis for the vectorized Helmholtz operator (Moxey et al., 2020) on Broadwell CPU using deformed hexadral (triangle) and undeformed hexahedral (squares) elements with varying polynomial order (1-20).

Figure Caption: Roofline analysis for the vectorized Helmholtz operator (Moxey et al., 2020) on Broadwell CPU using deformed hexadral (triangle) and undeformed hexahedral (squares) elements with varying polynomial order (1-20). The roofline analysis is a visual performance model that offers insight to improve algorithms on modern architectures, for instance it can indicate if the computation is memory or compute bound.

This image has an empty alt attribute; its file name is image.png

Partners