Rob Farber, Global Technology Consultant, TechEnablement
In October 1997, the OpenMP* Architecture Review Board (ARB) published version 1.0 of the OpenMP Fortran specification, with the C/C++ specification following nearly a year later. At that time, the fastest supercomputer in the world, ASCI Red, was based on computational nodes containing two 200 MHz Intel® Pentium® Pro processors. (Yes, leadership-class supercomputing at that time considered a single-core 200 MHz Intel Pentium Pro processor to be fast.) Built at a cost of USD 46 million (roughly USD 68 million in today’s dollars), ASCI Red was the first supercomputer to deliver a trillion floating-point operations per second on the TOP500 LINPACK benchmark. It was also the first supercomputer to consume a megawatt of power, foretelling a trend to come. In contrast, a modern dual-socket system built on the Intel® Xeon® processor v4 family is positively a steal, in terms of both teraflop computing capability and power consumption.
Dr. Thomas Sterling, director of the Center for Research in Extreme Scale Technologies, observes, “OpenMP has provided a vision of single-system programming and execution that emphasizes simplicity and uniformity. It challenges producers of system software to address asynchrony, latency, and overhead of control while encouraging future hardware system designers to achieve user productivity and performance portability in the era of exascale. The last two decades have seen remarkable accomplishment that will lead the next 20 years of scalable computing.”
OpenMP: A Forward-Thinking, Developer-Motivated Effort
The OpenMP initiative was motivated by the developer community. At that time, there was increasing interest in a standard that programmers could reliably use to move code between different parallel, shared-memory platforms.
Before OpenMP, programmers had to explicitly use a threading model such as pthreads, or a distributed framework such as MPI, to create parallel codes. (The first MPI standard was completed in 1994.) Simply adding an OpenMP pragma to exploit parallelism in a shared-memory model was revolutionary in its convenience. But, at that time, thread-based computing models were of limited interest, since clusters of single-threaded processors dominated the high-performance computing world. It was possible on some hardware platforms to purchase extra plug-in CPUs that could provide hardware-based multithreaded performance. But, generally, threads were considered more of a software trick to emulate asynchronous behavior using OS time slices rather than a route to scalable parallel performance. At that time, the thread debate centered more on the use of heavyweight threads (e.g., processes created with fork/join) rather than lightweight threads that shared memory. Hardware parallelism inside a node was limited to dual- or quad-processor systems, so OpenMP scaling was a nonissue.
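To make the "just add a pragma" point concrete, here is a minimal sketch in C. The function name sum_to is hypothetical and chosen only for illustration. The single directive asks an OpenMP-enabled compiler to divide the loop iterations among threads and to combine the per-thread partial sums; a compiler without OpenMP support ignores the pragma and runs the same loop serially.

```c
/* Sum the integers 1..n.
   The pragma is the only OpenMP-specific line: it requests that the loop
   iterations be shared among threads, with each thread keeping a private
   partial sum that is added into `sum` when the loop finishes. */
long sum_to(int n)
{
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
```

Calling sum_to(1000) returns 500500 whether the loop runs on one thread or many. With GCC or Clang, the pragma typically takes effect when the code is built with the -fopenmp flag.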
Thus, the 1997 OpenMP specification was very forward thinking, since distributed-memory MPI computing was “the” route to parallelism. Basically, it was cheaper and easier to connect lots of machines via a network. In a world where Dennard scaling laws applied, faster application performance could be achieved by either adding MPI nodes or purchasing machines containing a higher-clock-rate processor that could run serial software faster. Thus, the big advances around that time came from using commodity off-the-shelf (COTS) hardware to build clusters, which dominated the parallel computing world (Figure 1). For example, the original 1998 Beowulf how-to explains that, “Beowulf is a technology of clustering computers to form a parallel, virtual supercomputer,” which “behaves more like a single machine rather than many workstations.” There really was no mass scientific or commercial demand for multicore processors. Hence, multithreaded parallel computing was more of an interesting HPC project than a mainstream programming model. The brief, massively parallel single instruction, multiple data (SIMD) interlude shown in Figure 1 was short-lived and basically disappeared with the demise of Thinking Machines Corporation, the company that manufactured the SIMD architecture CM-2 supercomputer and later the CM-5 MIMD (multiple instruction, multiple data) massively parallel processor (MPP) supercomputer. The SGI Challenge is an example of an SMP (shared-memory multiprocessor in this context) from that era.
OpenMP Moves into the Spotlight: Dennard Scaling Breaks and the Rise of Multicore
Between 2005 and 2007, it became clear that Dennard scaling had broken down and we started to see the first modern multicore processors. Since it was no longer possible to achieve significant performance increases by boosting the clock rate, manufacturers had to start adding processor cores to deliver faster hardware (and a reason to upgrade). This broke the comfortable status quo where codes would automatically run faster on the next generation of hardware due to clock rate increases. As a result, people started to seriously investigate using thread-based computing as a means to increase application performance. Even so, most applications exploited parallelism by simply running one serial MPI rank per core on the multiprocessor.
In the 2007 to 2008 timeframe, multicore processors began to dominate the performance landscape as illustrated by Figure 2, a performance share graph from the TOP500 organization. You can clearly see that the trend since then has been toward increasing core counts.
Code Modernization with OpenMP
Increasing core counts benefited both OpenMP and MPI programs through greater parallelism. But the phoenix-like rise of vector parallelism, coupled with higher-core-count processors, has really turned OpenMP into a first-class citizen.
Many legacy applications used the one-MPI-rank-per-processor-core model because parallelism was the path to performance on COTS hardware when they were written. This is not to say that vectorization was not utilized, especially in HPC codes, but rather to highlight that small vector widths in the processors used for COTS clusters bounded the performance benefits. Also, programming the vector units was difficult. As a result, many programmers continued to rely on increased MPI parallelism to achieve higher application performance. Any speedup from vectorized loops in the code that ran inside each MPI rank was a nice bonus.
A resurgence in SIMD and data parallel programming, starting around 2006, showed that rewriting legacy codes to exploit hardware thread parallelism could deliver significant performance increases across a wide variety of applications and computational domains.
This trend accelerated as it was realized that power efficiency was a key stumbling block on the road to petascale―and eventually exascale―computers. The Green500 list debuted in 2007, marking the end of the “performance at any cost” era in large-scale computing.
OpenMP was suddenly well positioned to exploit the focus on energy-efficient computing and data parallelism. Succinctly, CPUs are general-purpose MIMD devices that can run SIMD codes efficiently. Even better, SIMD codes map very nicely onto hardware vector units. Meanwhile, MIMD-based task parallelism was simply a loop construct away.
To increase both performance and power efficiency, ever-wider vector instructions have been added to the x86 ISA (instruction set architecture). Similar efforts are underway for other ISAs. Succinctly, hardware vector units consume relatively small amounts of space on the silicon of the chip, yet they can deliver very power-efficient floating-point performance. As a result, the floating-point capability of general-purpose processors increased dramatically and the era of high-core-count (or many-core) vector parallel processors was born. Examples include the many-core Intel Xeon and Intel® Xeon Phi™ processors.
Code modernization became a buzzword as people realized that programming one MPI rank per core was an inefficient model because it didn’t fully exploit the performance benefits of SIMD, data parallel, and vector programming. Making efficient use of the AVX-512 vector instructions on the latest generation of hardware, for example, can increase application performance over scalar code by up to 8x for double-precision codes and by up to 16x for single-precision codes, since a 512-bit vector register holds eight 64-bit or sixteen 32-bit values. Many programming projects have switched, or are in the process of switching, to a combined OpenMP/MPI hybrid model to fully exploit the benefits of both MPI and OpenMP. The resulting performance increase can approach the product of the number of cores and the per-core vector speedup, as shown in Figure 3. In fact, the latest Intel Xeon Phi processors have two AVX-512 vector units per core.
Figure 3. The highest performance is in the top right quadrant where programmers exploit both vector and parallel hardware (Image courtesy of Morgan Kaufmann, imprint of Elsevier, all rights reserved)
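The arithmetic behind the 8x and 16x figures, and the "both vector and parallel" quadrant of Figure 3, can be sketched in a few lines of C. The helper names simd_lanes and daxpy are assumptions made for this illustration and do not come from any library. The combined parallel for simd directive (available since OpenMP 4.0) asks the compiler to split the loop across threads and to vectorize the chunk each thread receives.

```c
/* Number of values that fit in one vector register:
   512-bit vectors hold 512 / 64 = 8 doubles or 512 / 32 = 16 floats. */
int simd_lanes(int vector_bits, int element_bits)
{
    return vector_bits / element_bits;
}

/* y[i] = a * x[i] + y[i] for i in [0, n).
   `parallel for` distributes iterations over the cores; `simd` tells the
   compiler the iterations are independent so each thread's chunk can use
   the vector units. The caller must not pass partially overlapping arrays. */
void daxpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

For example, simd_lanes(512, 64) evaluates to 8 and simd_lanes(512, 32) to 16. These are upper bounds on the vector speedup; memory bandwidth and loop structure usually keep real codes below them.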
OpenMP: State of the Art
The OpenMP ARB recognized the importance of SIMD programming, and the simd construct was added in the OpenMP 4.0 standard, released in July of 2013. Device constructs such as target were also added in OpenMP 4.0, so the offload-mode programming of coprocessors and accelerators like GPUs is now supported as well. OpenMP continues to grow and adapt to the changing hardware landscape.
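As a rough illustration of these two OpenMP 4.0 additions, the sketch below uses hypothetical function names (dot_simd and scale_add_offload) that were picked for this example. The first function requests vectorization of a reduction loop. The second asks the runtime to run the loop on an attached device, copying x to the device and y both ways; when no device is available, an OpenMP runtime normally falls back to executing the region on the host.

```c
/* Dot product with an explicit request to vectorize the loop.
   reduction(+:sum) lets each vector lane accumulate its own partial sum. */
float dot_simd(int n, const float *x, const float *y)
{
    float sum = 0.0f;

    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* y[i] += a * x[i], offloaded to a device when one is present.
   map(to: ...) copies x to the device before the region runs;
   map(tofrom: ...) copies y to the device and back again afterward. */
void scale_add_offload(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Whether the target region actually runs on a GPU or coprocessor depends on how the compiler and runtime were built and configured, so treat this as a starting point rather than a tuned offload kernel.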
OpenMP for the Exascale Era
As we look to an exascale future, power consumption is king. The trend for exascale computing architectures is to link power-efficient serial cores with parallel hardware: essentially, a hardware instantiation of Amdahl’s Law. NERSC notes that the latest Cori supercomputer represents the first time users will run on a leadership-class supercomputer where their programs will run slower if they don’t do anything to the code. Such is the inescapable consequence of increased power efficiency, since power-efficient serial cores for exascale supercomputers simply require more time to run sequential code. This trend will likely spill over to the data center, where power consumption is crucial to the bottom line, even as 5G is expected to increase data volumes by up to 1,000x (source: Forbes).
Happily, OpenMP is now a tried-and-true veteran that gives us performance while still meeting the original design goal of a standard that programmers can reliably use to move code between different parallel, shared-memory platforms. Performance plus portability: what a lovely combination.
Rob Farber is a global technology consultant and author with an extensive background in HPC. He is an active advocate for portable parallel performant programming. Reach him at firstname.lastname@example.org.