Parallel STL: Boosting Performance of C++ STL Code

Vladimir Polin, Application Engineer, and Mikhail Dvorskiy, Senior Software Development Engineer, Intel Corporation

C++ and the Evolution Toward Parallelism

Computing systems have evolved rapidly from single-threaded SISD architectures to modern multi- and many-core SIMD architectures, which appear in a wide range of fields and form factors. C++ is a general-purpose, performance-oriented language widely used on these systems. Until recently, however, it provided no standardized mechanisms to fully utilize them, and even the latest version of the standard offers only limited features for extracting parallelism. Over time, vendors invented a variety of specifications, techniques, and software to support parallelism [1] (Figure 1). The upcoming version of the C++ standard (C++17) introduces Parallel STL, which makes it possible to transform existing sequential C++ code into parallel code that takes advantage of hardware capabilities like threading and vectorization.

Figure 1. Landscape of parallelism in C++

Enter Parallel STL

Parallel STL extends the C++ Standard Template Library with an execution policy argument. An execution policy is a C++ class used as a unique type to disambiguate function overloading for STL algorithms; for convenience, the C++ library also defines an object of each class that can be passed as the policy argument. Policies can be used with well-known algorithms (e.g., transform, for_each, copy_if) as well as new algorithms (e.g., reduce, transform_reduce, and variations of scan, i.e., prefix sum). Support for parallel execution policies was developed over several years as the Technical Specification for C++ Extensions for Parallelism (Parallelism TS) and has since been adopted into the current C++17 standard draft (document N4640 [2]). Support for vectorization policies has been proposed for the second version of the Parallelism TS (documents P0075 [3] and P0076 [4]). Overall, these documents describe five execution policies (Figure 2):

  • The class sequenced_policy (seq) requires that an algorithm’s execution not be parallelized [2].
  • The class parallel_policy (par) indicates that an algorithm’s execution may be parallelized [2]. Any user-specified functions invoked during the execution must not contain data races.
  • The class parallel_unsequenced_policy (par_unseq) indicates that execution may be both parallelized and vectorized [2].
  • The class unsequenced_policy (unseq), proposed in Parallelism TS v2 [4], indicates that an algorithm’s execution may be vectorized but not parallelized. This policy requires that all provided functions be SIMD-safe.
  • The class vector_policy (vec), also proposed in [4], indicates that execution may be vectorized in a way that preserves forward dependence between elements.

Figure 2. Execution policies for the C++ Standard Template Library

Figure 2 shows the relationships among these execution policies. The higher a policy sits in the lattice, the more execution freedom it allows, but also the more requirements it places on the user code. An implementation of Parallel STL is allowed to substitute an execution policy with a more restrictive one that is lower in the lattice.
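
For example, the new transform_reduce algorithm mentioned above takes the same policy argument as the first parameter. Here is a minimal sketch (the function and data are illustrative); note how the requirements of par show up in practice: under parallel execution the combination order is unspecified, so the operations must be associative and commutative and must not introduce data races:

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Sum of squares, computed in parallel. transform_reduce applies the
// unary operation to each element and combines the results with the
// binary operation; with std::execution::par both operations must be
// safe to run concurrently and reorder freely.
float SumOfSquares( const std::vector<float>& v ) {
    return std::transform_reduce( std::execution::par,
                                  v.cbegin(), v.cend(), 0.0f,
                                  std::plus<float>(),
                                  []( float x ) { return x * x; } );
}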

Simplified equivalents of the STL and Parallel STL algorithms can be written as follows:

#include <execution>
#include <algorithm>

void increment_seq( float *in, float *out, int N ) {
    using namespace std;
    transform( in, in + N, out, []( float f ) { return f+1; });
}

void increment_unseq( float *in, float *out, int N ) {
    using namespace std;
    using namespace std::execution;
    transform( unseq, in, in + N, out, []( float f ) { return f+1; });
}

void increment_par( float *in, float *out, int N ) {
    using namespace std;
    using namespace std::execution;
    transform( par, in, in + N, out, []( float f ) { return f+1; });
}

Where

std::transform( in, in + N, out, foo );

would be as simple as the following loop:

for (float *x = in; x < in+N; ++x) *(out+(x-in)) = foo(*x);

and

std::transform( unseq, in, in + N, out, foo );

would be as simple as the following loop (our implementation uses #pragma omp simd at the innermost level; other Parallel STL implementations might implement the unseq policy differently):

#pragma omp simd
for (float *x = in; x < in+N; ++x) *(out+(x-in)) = foo(*x);

and

std::transform( par, in, in + N, out, foo );

would be as simple as the following parallel loop:

tbb::parallel_for( in, in+N, [=]( float *x ) { *(out+(x-in)) = foo(*x); });

Overview of Parallel STL Implementation in Intel® Parallel Studio XE 2018 Beta

The Parallel STL implementation is part of Intel® Parallel Studio XE 2018 Beta. It offers efficient support for both parallel and vectorized execution of algorithms on Intel® processors. Under the hood, it uses an available implementation of the C++ standard library for sequential execution, Intel® Threading Building Blocks (Intel® TBB) for parallelism with the par and par_unseq execution policies, and OpenMP* vectorization for the unseq and par_unseq policies.
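
To illustrate how these pieces might fit together, here is a conceptual sketch of a par_unseq transform over a float array. This is only an illustration of the approach described above, not the library’s actual code:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Conceptual sketch of a par_unseq transform: Intel TBB splits the
// iteration space across threads, and each thread's chunk is
// vectorized with an OpenMP SIMD loop.
template <typename F>
void transform_par_unseq( const float* in, float* out, int n, F f ) {
    tbb::parallel_for( tbb::blocked_range<int>( 0, n ),
        [=]( const tbb::blocked_range<int>& r ) {
            const int b = r.begin(), e = r.end();
            #pragma omp simd
            for ( int i = b; i < e; ++i )
                out[i] = f( in[i] );
        } );
}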

The Parallel STL implementation in Intel Parallel Studio XE 2018 beta is prerelease code, which may not be fully functional and which Intel may substantially modify in future versions.

After installing Parallel STL, set up the environment by following the instructions in the “Getting Started with Parallel STL” document provided with the package. That document also contains the up-to-date list of algorithms that have parallel and vector implementations. For all other algorithms, execution policies are accepted but execution falls back to the sequential implementation. We plan to enable parallelism in more algorithms in future releases based on feedback and demand.

To achieve best results with Parallel STL, we recommend using Intel® C++ Compiler 2018. Other compilers can also be used, provided they support C++11. For vectorization policies to be effective, the compiler should also support OpenMP 4.0 SIMD constructs. Use of parallel execution policies requires Intel TBB.

Follow these steps to add Parallel STL to your application:

  1. Add the #include "pstl/execution" line to your code. Then add one or more of the following lines, depending on the algorithms you intend to use:
    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"

    Note that “pstl” is intentionally part of the header name to avoid conflicts with the C++ standard library header files.

  2. When using algorithms and execution policies, specify the namespaces std and std::execution, respectively.
  3. Compile the code as C++11 or later. Use a proper compiler option to enable OpenMP vectorization; e.g., for the Intel® C++ Compiler, use -qopenmp-simd (/Qopenmp-simd for Windows*).
  4. To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are -xHost, -xCORE-AVX2, and -xMIC-AVX512 for Linux*, or /QxHost, /QxCORE-AVX2, and /QxMIC-AVX512 for Windows.
  5. Link with Intel TBB, if required. On Windows, this is done automatically; on other platforms, add -ltbb to the linker options. (A minimal example combining these steps follows.)
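
Putting these steps together, a minimal program might look like the following sketch (the compile line in the comment is illustrative):

#include "pstl/execution"
#include "pstl/algorithm"
#include <vector>

// Minimal example following the steps above. Illustrative compile
// line for Linux with the Intel C++ Compiler:
//   icpc -std=c++11 -qopenmp-simd -xHost example.cpp -ltbb
int main() {
    std::vector<float> v( 1024, 1.0f );
    std::for_each( std::execution::par_unseq, v.begin(), v.end(),
                   []( float& x ) { x += 1.0f; } );
    return 0;
}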

Intel Parallel Studio XE 2018 beta contains the gamma correction example that can be used to try Parallel STL.

Efficient Vectorization, Parallelization, and Composability using Parallel STL

In theory, Parallel STL offers C++ developers a highly intuitive way to program a parallel random-access machine (PRAM). Let’s consider how the theory correlates with best practices for parallelizing loop hierarchies, such as the “vectorize innermost, parallelize outermost” (VIPO) approach [5]. (Keep in mind that we are using prerelease software, so results may differ in later versions.) Consider the classic example of image gamma correction (or simply gamma), a nonlinear operation used to encode and decode the luminance of each image pixel: each pixel value is raised to a power, out = in^γ, computed below with pow(v, g). Note that we disabled compiler auto-vectorization for the sequential cases to show the difference between sequential and unsequenced execution. (Otherwise, this difference would be visible only with compilers that do not support automatic vectorization but do support OpenMP-based vectorization.)

Here is a simple serial implementation:

#include <algorithm>
#include <cmath>

void ApplyGamma( Image& rows, float g ) {
    using namespace std;
    for_each( rows.begin(), rows.end(), [g]( Row &r ) {
        transform( r.cbegin(), r.cend(), r.begin(),
                   [g]( float v ) { return pow( v, g ); });
    });
}

The function ApplyGamma gets an image passed by reference as a set of rows and uses std::for_each to iterate over them. The lambda function called for each row iterates over the pixels in the row with std::transform to modify the luminance of each pixel.

As described previously, Parallel STL provides parallelized and vectorized versions of the for_each and transform algorithms enabled with the execution policies. In other words, an execution policy passed as the first argument into a standard algorithm leads to execution of the parallelized and/or vectorized version of the algorithm.

Returning to the example above, you may notice that all calculations are in the lambda function called from the transform algorithm. So, let’s try killing two birds with one stone and rewrite the example using the par_unseq policy as follows:

void ApplyGamma( Image& rows, float g ) {
    using namespace std::execution;
    std::for_each( rows.begin(), rows.end(), [g]( Row &r ) {
        std::transform( par_unseq, r.cbegin(), r.cend(), r.begin(),
                        [g]( float v ) { return pow( v, g ); });
    });
}

Figure 3. Inner parallelization and vectorization

Surprisingly, no miracle occurred (Figure 3): performance with par_unseq is worse than serial execution. This is a great example of how not to use Parallel STL. If you profile the code with a tool like Intel® VTune™ Amplifier XE, you might see a lot of cache misses caused by threads on different cores writing to the same cache lines, as well as high thread synchronization and scheduling overhead. [Editor’s note: This is known as false sharing. For more information, see “Avoiding and Identifying False Sharing Among Threads.”]

As noted earlier, Parallel STL helps us express the middle- and innermost-level parallel patterns: the middle level means parallelization with system threads, and the innermost level means vectorization with SIMD instructions. In general, to get the best speedup, estimate the execution time of an algorithm and compare it with the parallelization and vectorization overheads; we recommend that serial execution time be at least twice the overhead at each parallelization level (a simple timing sketch follows the list below). Additionally:

  • Parallelize at the outermost level; seek the maximum amount of work to execute in parallel.
  • If that provides sufficient parallelism, stop. Don’t oversubscribe. Otherwise, parallelize an additional inner level.
  • Make sure the algorithm is cache efficient.
  • Try to vectorize the innermost level. Ensure minimal control flow divergence and memory access uniformity.
  • More recommendations can be found in [5].
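
As a quick way to compare serial execution time against parallelization overheads, a minimal timing sketch could look like this (the helper name is illustrative; the commented usage assumes the increment functions shown earlier):

#include <chrono>

// Hypothetical helper: wall-clock time of one call, in milliseconds.
template <typename F>
double TimeMs( F&& f ) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>( t1 - t0 ).count();
}

// Usage: measure realistic data sizes before choosing a policy, e.g.,
//   double t_seq = TimeMs( [&]{ increment_seq( in, out, N ); } );
//   double t_par = TimeMs( [&]{ increment_par( in, out, N ); } );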

These recommendations suggest that specifying the parallel and vector policies at different levels may provide better performance, i.e.,

void ApplyGamma( Image& rows, float g ) {
    using namespace std::execution;
    std::for_each( par, rows.begin(), rows.end(), [g]( Row &r ) {
        std::transform( unseq, r.cbegin(), r.cend(), r.begin(),
                        [g]( float v ) { return pow( v, g ); });
    });
}

Figure 4. Outer parallelization and inner vectorization

We now have efficient parallel processing for one image (Figure 4), but real applications typically process multiple images or apply several corrections at once (Figure 5). Parallelism at that higher level might not map well onto the standard algorithms. In this case, we can use Intel TBB together with Parallel STL in a composable way.

Figure 5. Composability cases

Composability means you can express the topmost level of parallelism with Intel TBB parallel constructs (e.g., flow graph, pipeline) or tasks, and call Parallel STL algorithms at the inner levels, all without worrying about oversubscribing your system, i.e.,

void Function() {
    Image img1, img2;
    tbb::parallel_invoke(
        [&img1] { img1.ApplyGamma( gamma1 ); },
        [&img2] { img2.ApplyGamma( gamma2 ); }
    );
}

Figure 6. Composability between Intel® TBB and Parallel STL

As shown in Figure 6, processing two images simultaneously with Intel TBB doesn’t reduce performance, and even increases it slightly. This indicates that expressing the inner and innermost levels of parallelism already fully utilizes the CPU cores.

Now let’s consider the situation where we have a larger set of images to process and more CPU cores available:

tbb::parallel_for( images.begin(), images.end(), []( Image* img ) {
    ApplyGamma( img->rows(), 1.1 );
});

Figure 7. Composability between Intel® TBB and Parallel STL

As shown in Figure 7, processing a set of images simultaneously with Intel TBB (parallel_for) drastically increases the speedup. Look at the first bar, where we iterate over the images sequentially and each image is processed with the inner and innermost parallelism patterns. Adding just the topmost level of parallelism (parallel_for) without the inner parallelism (par), as in the second bar, significantly improves performance, but is not enough to fully utilize all of the CPU cores. The third bar shows that expressing all levels of parallelism boosts performance further still. This illustrates the excellent composability between Intel TBB and our Parallel STL implementation.

Summary

Parallel STL is a significant step in the evolution of C++ toward parallel execution, making it easy to apply to STL algorithms during code modernization and the development of new applications. It adds vectorization and parallelization capabilities to C++ without resorting to nonstandard or proprietary extensions, and its execution policies provide control over the use of these capabilities while abstracting hardware details. Parallel STL lets developers focus on expressing the parallelism in their applications without worrying about the low-level details of managing it. In addition to efficient, high-performing implementations of the most commonly used high-level parallel algorithms, the Parallel STL implementation in Intel Parallel Studio XE 2018 showcases great composability with Intel Threading Building Blocks parallel patterns. However, Parallel STL is not a silver bullet; it must be used wisely. To achieve great performance, follow best-known practices when modernizing your code [5].


References

  1. Alex Katranov, Mikhail Dvorskiy, Parallelizing C++ Standard Algorithms for Intel® Processors, Intel Software Development Conference, London 2016.
  2. ISO/IEC N4640, working draft, Standard for Programming Language C++.
  3. P0075r1, Template Library for Parallel For Loops, Programming Language C++ (WG21).
  4. P0076r3, Vector and Wavefront Policies, Programming Language C++ (WG21).
  5. Robert Geva, Code Modernization Best Practices: Multi-level Parallelism for Intel® Xeon® and Intel® Xeon Phi™ Processors, IDF15 – Webcast.

