Big Changes are Finally On the Horizon for Supercomputers

Looking back at this week’s ISC 17 supercomputing conference, the supercomputing world seems set for some big upgrades in the next couple of years, even though the update to the twice-yearly Top 500 list of the world’s fastest supercomputers wasn’t very different from the previous version.

The fastest computers in the world continue to be the two massive Chinese machines that have topped the list for a few years: the Sunway TaihuLight computer from China’s National Supercomputing Center in Wuxi, with sustained Linpack performance of more than 93 petaflops (93 thousand trillion floating point operations per second); and the Tianhe-2 computer from China’s National Super Computer Center in Guangzhou, with sustained performance of more than 33.8 petaflops. These remain the fastest machines by a huge margin.

The new number three is the Piz Daint system from the Swiss National Supercomputing Centre, a Cray system that uses Intel Xeons and Nvidia Tesla P100s, which was recently upgraded to give it a Linpack sustained performance of 19.6 petaflops, twice its previous total. That moved it up from number eight on the list.

This drops the top US system—the Titan system at the Oak Ridge National Laboratory—down to fourth place, making this the first time in twenty years that there is no US system in the top three. The rest of the list remains unchanged, with the US still accounting for five of the top 10 overall, and Japan for two.

Even if the fastest computer list hasn’t changed much, there are big changes elsewhere. On the Green 500 list of the most power-efficient systems, nine of the top ten changed. On top is the Tsubame 3.0 system, a modified HPE ICE XA system at the Tokyo Institute of Technology based on the 14-core Xeon E5-2680v4, the Omni-Path interconnect, and Nvidia’s Tesla P100, a combination that delivers 14.1 gigaflops per watt. This is a huge jump from Nvidia’s DGX Saturn V, based on the firm’s DGX-1 platform and P100 chips, which was number one on the November list but is number ten this time, at 9.5 gigaflops per watt. The P100 is in nine of the top ten Green 500 systems.

Breaking 10 gigaflops/watt is a big deal because it means that a hypothetical exaflop system built using today’s technology would consume under 100 megawatts (MW). That’s still too much—the target is 20-30 MW for an exaflop system, which researchers hope to see in the next five years or so—but it’s a huge step forward.
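
The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example, and it assumes efficiency stays constant as a system scales up to an exaflop, which real machines will not do exactly. The efficiency figures are the ones quoted for the Green 500 systems above.

```python
EXAFLOP = 1e18  # floating point operations per second


def exaflop_power_mw(gigaflops_per_watt):
    """Megawatts needed to sustain one exaflop at the given efficiency."""
    flops_per_watt = gigaflops_per_watt * 1e9
    watts = EXAFLOP / flops_per_watt
    return watts / 1e6


def required_efficiency(target_mw):
    """Gigaflops per watt needed to fit one exaflop into target_mw megawatts."""
    flops_per_watt = EXAFLOP / (target_mw * 1e6)
    return flops_per_watt / 1e9


if __name__ == "__main__":
    # Efficiency figures (gigaflops per watt) taken from the article.
    for label, efficiency in [
        ("DGX Saturn V", 9.5),
        ("10 gigaflops/watt threshold", 10.0),
        ("Tsubame 3.0", 14.1),
    ]:
        print(f"{label}: {exaflop_power_mw(efficiency):.1f} MW per exaflop")

    # Efficiency implied by the 20-30 MW target.
    for target_mw in (20, 30):
        print(f"{target_mw} MW target: {required_efficiency(target_mw):.1f} gigaflops/watt")
```

At Tsubame 3.0's efficiency, the hypothetical exaflop system works out to roughly 71 MW, while the 20-30 MW target implies something like 33 to 50 gigaflops per watt.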

As with the Top 500 list, there were only minor changes on similar lists that use different benchmarks. One is the High Performance Conjugate Gradients (HPCG) benchmark, where machines tend to see only 1-10 percent of their theoretical peak performance, and where the top system (in this case, the Riken K machine) still delivers less than 1 petaflop. Both the TaihuLight and the Piz Daint systems moved up on this list. When researchers talk about an exaflop machine, they tend to mean the Linpack benchmark, but HPCG may be more realistic in terms of real-world performance.

The emergence of GPU computing as an accelerator, almost always using Nvidia GPU processors such as the P100, has been the most visible change on these lists in recent years, followed by the introduction of Intel’s own accelerator, the many-core Xeon Phi (including the most recent Knights Landing version). The current Top 500 list includes 91 systems that use accelerators or coprocessors: 74 with Nvidia GPUs and 17 with Xeon Phi (three of those systems use both), one with an AMD Radeon GPU, and two with a many-core processor from PEZY Computing, a Japanese supplier. An additional 13 systems now use the Xeon Phi (Knights Landing) as the main processing unit.

But many of the bigger changes to supercomputers are still on the horizon, as we start to see larger systems designed with these concepts in mind. One example is the new MareNostrum 4 at the Barcelona Supercomputing Center, which entered the Top 500 list at number 13. As installed so far, this is a Lenovo system based on the upcoming Skylake-SP version of Xeon (officially the Xeon Platinum 8160 24-core processor). What’s interesting here are the three new clusters of “emerging technology” planned for the next couple of years: one cluster with IBM Power 9 processors and Nvidia GPUs, designed to have a peak processing capability of over 1.5 petaflops; a second based on the Knights Hill version of Xeon Phi; and a third based on 64-bit ARMv8 processors designed by Fujitsu.

These concepts are being used in a number of other major supercomputing projects, notably several sponsored by the US Department of Energy as part of its CORAL Collaboration at the Oak Ridge, Argonne, and Lawrence Livermore National Labs. First up should be Summit at Oak Ridge, which will use IBM Power 9 processors and Nvidia Volta GPUs and is slated to deliver 150 to 300 peak petaflops; it should be followed by Sierra at Lawrence Livermore, slated to deliver over 100 peak petaflops.

We should then see the Aurora supercomputer at the Argonne National Laboratory, based on the Knights Hill version of Xeon Phi and built by Cray, which is slated to deliver 180 peak petaflops. The CORAL systems should be up and running next year.

Meanwhile, the Chinese and Japanese groups have planned upgrades as well, mostly using unique architectures. It should be interesting to watch.

An even bigger shift seems to be just a little farther off: the shift toward machine learning, typically on massively parallel processing units within the processor itself. While the Linpack number refers to 64-bit or double-precision performance, there are classes of applications—including many deep neural network-based applications—that work better with single- or even half-precision calculations. New processors are taking advantage of this, such as Nvidia’s recent Volta V100 announcement and the upcoming Knights Mill version of Xeon Phi. At the show, Intel said that version, which is due to be in production in the fourth quarter, would have new instruction sets for “low-precision computing” called Quad Fused Multiply Add (QFMA) and Quad Virtual Neural Network Instruction (QVNNI).
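
To make the precision trade-off concrete, here is a minimal sketch using only Python's standard `struct` module, which can store a number in IEEE 754 double, single, or half precision. The helper name `round_trip` is invented for this example, and the sketch says nothing about how Volta or Knights Mill actually implement low-precision arithmetic; it only shows that narrower formats take fewer bytes per value and keep fewer digits.

```python
import struct


def round_trip(value, fmt):
    """Store value in the given IEEE 754 format and read it back as a float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# struct format codes: d = 64-bit double, f = 32-bit single, e = 16-bit half
FORMATS = {
    "double (64-bit)": "<d",
    "single (32-bit)": "<f",
    "half (16-bit)": "<e",
}

if __name__ == "__main__":
    value = 0.1
    for name, fmt in FORMATS.items():
        stored = round_trip(value, fmt)
        size = struct.calcsize(fmt)
        error = abs(stored - value)
        print(f"{name}: {size} bytes, stored as {stored!r}, error {error:.3e}")
```

A half-precision value takes a quarter of the memory and bandwidth of a double, which is a large part of the appeal for workloads that can tolerate the lost digits.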

I assume that these concepts could be applied to other architectures as well, such as Google’s TPUs or Intel’s FPGAs and Nervana chips.

Even if we aren’t seeing big changes this year, next year we should expect to see more. The concept of an exascale (1,000 petaflops) machine is still in sight, though it will likely involve a number of even larger changes.
