The National Energy Research Scientific Computing Center (NERSC) is the primary scientific computing facility for the U.S. Department of Energy’s Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines.
To meet its goals, NERSC needs to optimize its diverse applications for peak performance on Intel® Xeon Phi™ processors. To do that, it uses a roofline analysis model based on Intel® Advisor, a tool in the Intel® Parallel Studio software suite. The roofline model was originally developed by Sam Williams, a computer scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Using the model increased application performance by up to 35%.
“Optimizing complex applications demands a sense of absolute performance,” explained Dr. Tuomas Koskela, postdoctoral fellow at NERSC. “There are many potential optimization directions. It’s essential to know which direction to take, what factors are limiting performance, and when to stop.”
Roofline analysis helps to determine the gap between applications and peak performance of a computing platform. This visually intuitive performance model bounds the performance of various numerical methods and operations running on multi-core, many-core, or accelerator processor architectures.
Instead of simply using percent-of-peak estimates, the model can be used to assess the quality of performance by combining locality, bandwidth, and different parallelization paradigms into a single performance figure. The roofline figure helps determine both the implementation and inherent performance limitations (Figure 1).
A classic roofline model includes three measurements:
- The number of floating-point operations performed (FLOPs)
- The number of bytes from DRAM
- Computation time
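These three measurements combine into the classic roofline bound: attainable performance is capped either by the machine's peak compute rate or by memory bandwidth times arithmetic intensity, whichever is lower. A minimal sketch (function names and the peak/bandwidth figures are illustrative, not from Intel Advisor):

```python
def arithmetic_intensity(flops, dram_bytes):
    """Arithmetic intensity (AI): FLOPs performed per byte moved from DRAM."""
    return flops / dram_bytes

def roofline_bound(ai, peak_gflops, peak_bw_gbs):
    """Attainable GFLOP/s under the classic roofline model:
    min(peak compute, memory bandwidth * AI)."""
    return min(peak_gflops, peak_bw_gbs * ai)

# Example: a kernel doing 2 FLOPs per 24 bytes moved (AI ~ 0.083) on a
# machine with 2,662 GFLOP/s peak and 485 GB/s bandwidth (illustrative
# numbers) lands under the bandwidth roof, i.e., it is memory-bound.
ai = arithmetic_intensity(2, 24)
print(roofline_bound(ai, 2662.0, 485.0))  # ~40.4 GFLOP/s
```

Kernels whose AI places them under the sloped bandwidth roof benefit most from memory optimizations; kernels under the flat compute roof benefit from vectorization and other compute-side work.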
The Intel Advisor roofline implementation provides even more insights than a standard roofline analysis by plotting more rooflines:
- Cache rooflines illustrate performance if all the data fits into the respective cache.
- Vector usage rooflines show the maximum achievable performance levels if vectorization is used effectively.
In a classic roofline model, bytes are measured out of a given level of the memory hierarchy. Arithmetic intensity (AI) therefore depends on the problem size and on the platform, and memory optimizations will change the AI.
Intel Advisor is based on a cache-aware roofline model, where bytes are measured into the CPU from all levels of memory hierarchy. AI is independent of problem size and platform and consistent for a given algorithm.
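The difference is easiest to see on a simple kernel. For a triad-style loop `a[i] = b[i] + s*c[i]` over double-precision arrays, the CPU issues two loads and one store per iteration, so the cache-aware AI is a fixed ratio regardless of how large the arrays are (a worked sketch; the kernel and byte counts are illustrative assumptions, not Advisor output):

```python
def triad_cache_aware_ai(n):
    """Cache-aware AI of a[i] = b[i] + s*c[i] over n doubles:
    bytes are counted into the CPU, not out of DRAM."""
    flops = 2 * n           # one multiply + one add per iteration
    bytes_moved = 24 * n    # two 8-byte loads + one 8-byte store per iteration
    return flops / bytes_moved

# The ratio is constant in n, so the point does not move on the roofline
# chart as the problem size changes.
print(triad_cache_aware_ai(10) == triad_cache_aware_ai(10**6))  # True
```

Under the classic model, by contrast, a problem that shrinks to fit in cache moves fewer bytes out of DRAM, so its measured AI rises with no change to the algorithm.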
Figure 2 shows how a point moves in a classic versus a cache-aware roofline model.
Figure 3 shows Intel Advisor’s cache-aware roofline model report. The red loop is the most time-consuming, while the green loops are insignificant in terms of computing time. The larger loops will have more impact if optimized. The loops furthest away from a roof have the most potential for improvement.
Cache-Aware Roofline Model in Action
NERSC used Intel Advisor’s cache-aware roofline model to optimize two of its key applications:
- PICSAR*, a high-performance particle-in-cell (PIC) library for many integrated core (MIC) architectures
- XGC1*, a PIC code for fusion plasmas
The PICSAR application was designed to interface with the existing PIC code WARP*. It is planned for release as an open source project, providing high-performance PIC routines to the community.
The application is used for projects in plasma physics, laser-matter interaction, and conventional particle accelerators. Its optimizations include:
- L2 field cache blocking, where the MPI domain decomposes into tiles
- Hybrid parallelization, with OpenMP* handling tiles (inner-node parallelism)
- New data structures to enable efficient vectorization (current/charge deposition)
- An efficient parallel particle exchange algorithm between tiles
- An optimized parallel pseudo-spectral Maxwell solver
- A particle sorting algorithm to improve memory locality
NERSC applied the roofline model to three configurations:
- No tiling and no vectorization
- Tiling (L2 cache blocking) and no vectorization
- Tiling (L2 cache blocking) and vectorization
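The tiling idea behind these configurations is standard cache blocking: operate on one small block at a time so its working set stays resident in L2 while it is being touched. A minimal sketch using a blocked transpose (the function, tile size, and NumPy formulation are illustrative assumptions; PICSAR itself tiles field arrays across an MPI domain):

```python
import numpy as np

def transpose_tiled(a, tile=64):
    """Cache-blocked transpose: copy one tile-sized block at a time so both
    the source and destination blocks fit in L2 while being accessed."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # NumPy slices clamp at the array edge, so ragged border
            # tiles need no special-case code.
            out[j0:j0 + tile, i0:i0 + tile] = a[i0:i0 + tile, j0:j0 + tile].T
    return out
```

On the roofline chart, a successful blocking optimization shows up as the loop's point rising toward the L2 roof, since more of its traffic is served from cache.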
Figure 1. Roofline visual performance model
Figure 2. How a point moves in classic versus cache-aware roofline models
Figure 3. Intel Advisor report

The XGC1 application is a PIC code for simulating plasma turbulence in tokamak edge fusion plasmas. Its complicated geometry includes:
- An unstructured mesh in 2D (poloidal) planes
- A nontrivial, field-following (toroidal) mapping between meshes
- Typical simulations with 10,000 particles per cell, 1,000,000 cells per domain, and 64 toroidal domains

Most of the computation time in XGC1 is spent in electron subcycling. Bottlenecks included:
- Field interpolation to the particle position in the field gather
- Element search on the unstructured mesh after the push
- Computation of high-order terms in the gyrokinetic equations of motion in the push
With a single Intel Advisor survey, NERSC was able to discover most of the bottlenecks (Figure 4).
The optimizations included:
- Enabling vectorization by inserting loops over blocks of particles inside short-trip-count loops
- Data structure reordering to store field and particle data in SoA format, which is best for accessing multiple components with a gather instruction
- Algorithmic improvements, including reducing unnecessary calls to the search routine and sorting particles by element index instead of by local coordinates
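The SoA reordering can be sketched as follows. In array-of-structures (AoS) form, reading all of one component strides over the others; in structure-of-arrays (SoA) form, each component is contiguous, which suits unit-stride vector loads and gathers. The field names and update below are illustrative assumptions, not XGC1's actual data layout:

```python
import numpy as np

n = 8  # illustrative particle count

# Array of Structures: one record per particle. Reading every "x" skips
# over the interleaved "y" and "z" fields (24-byte stride for doubles).
aos = np.zeros(n, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])

# Structure of Arrays: one contiguous array per component.
soa = {
    "x": np.zeros(n),
    "y": np.arange(n, dtype=float),
    "z": np.zeros(n),
}

# A component-wise particle update touches only contiguous memory in SoA form.
soa["x"] += 0.1 * soa["y"]
```

The contiguity difference is visible directly: `soa["x"]` is a contiguous array, while the field view `aos["x"]` is strided by the full record size.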
Figure 4. Discovering XGC1 bottlenecks
Intel Advisor’s roofline analysis helped NERSC measure its applications against the peak performance of its computing platform, providing an all-in-one tool for cache-aware roofline analysis.