At this month’s supercomputing show, two trends stood out. The first is the appearance of Intel’s latest Xeon Phi (Knights Landing) and Nvidia’s latest Tesla (the Pascal-based P100) on the Top500 list of the fastest computers in the world; both systems landed in the top 20. The second is a big emphasis on how chip and system makers are taking concepts from modern machine learning systems and applying them to supercomputers.
On the current revision of the Top500 list, which gets updated twice yearly, the top of the chart is still firmly in the hands of the leading system from China’s National Supercomputing Center in Wuxi and Tianhe-2 from China’s National Super Computer Center in Guangzhou, as it has been since June’s ISC16 show. No other computers are close in total performance: the third- and fourth-ranked systems, still the Titan supercomputer at Oak Ridge and the Sequoia system at Lawrence Livermore, each deliver about half the performance of Tianhe-2.
The first of these is based on a unique Chinese processor, the 1.45GHz SW26010, which uses a 64-bit RISC core. The system has an unmatched 10,649,600 cores delivering 125.4 petaflops of theoretical peak throughput and 93 petaflops of maximum measured performance on the Linpack benchmark, using 15.4 megawatts of power. While this machine tops the charts in Linpack performance by a huge margin, it doesn’t fare quite as well in other tests. On some other benchmarks, machines tend to see only 1 to 10 percent of their theoretical peak performance, and the top system (in this case, the Riken K machine) still delivers less than 1 petaflop.
But the Linpack tests are the standard for talking about high-performance computing (HPC) and are what is used to create the Top500 list. By that measure, the No. 2 machine, Tianhe-2, was No. 1 on the chart for the past few years; it uses Xeon E5 processors and older Xeon Phi (Knights Corner) accelerators. It offers 54.9 petaflops of theoretical peak performance, and benchmarks at 33.8 petaflops in Linpack. Many observers believe that a ban on the export of the newer versions of Xeon Phi (Knights Landing) led the Chinese to create their own supercomputer processor.
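The gap between theoretical peak and measured Linpack performance is easy to quantify from the figures quoted above. The short Python sketch below is only an illustration: the function name and the dictionary labels are made up for this example, and the inputs are the rounded petaflop numbers given in the text.

```python
# Peak and measured Linpack figures in petaflops, as quoted in the article.
# The labels are illustrative, not official system names.
systems = {
    "No. 1 (SW26010-based)": (125.4, 93.0),
    "No. 2 (Tianhe-2)": (54.9, 33.8),
}


def linpack_efficiency(peak_pflops, measured_pflops):
    """Return measured Linpack performance as a fraction of theoretical peak."""
    return measured_pflops / peak_pflops


for name, (peak, measured) in systems.items():
    print(f"{name}: {linpack_efficiency(peak, measured):.1%} of peak")
```

With these rounded inputs, the top machine reaches roughly 74 percent of its theoretical peak on Linpack and Tianhe-2 roughly 62 percent, far above the 1 to 10 percent that machines typically see on the harder benchmarks.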
Knights Landing, formally Xeon Phi 7250, played a big role in the new systems on the list, starting with the Cori supercomputer at Lawrence Berkeley National Laboratory, which came in at fifth place with a peak performance of 27.8 petaflops and a measured performance of 14 petaflops. This is a Cray XC40 system, using the Aries interconnect. Note that Knights Landing can act as a main processor, with 68 cores per processor delivering 3 peak teraflops. (Intel lists another version of the chip with 72 cores at 3.46 teraflops of peak theoretical double-precision performance on its price list, but none of the machines on the list use this version, perhaps because it is pricier and uses more energy.) Earlier Xeon Phis could only run as accelerators in systems that were controlled by traditional Xeon processors.
In sixth place was the Oakforest-PACS system of Japan’s Joint Center for Advanced High Performance Computing, scoring 24.9 peak petaflops. This is built by Fujitsu, using Knights Landing and Intel’s Omni-Path interconnect. Knights Landing is also used in the No. 12 system (the Marconi computer at Italy’s CINECA, built by Lenovo and using Omni-Path) and the No. 33 system (the Camphor 2 at Japan’s Kyoto University, built by Cray and using the Aries interconnect).
Nvidia was well represented on the new list as well. The No. 8 system, Piz Daint at the Swiss National Supercomputing Center, was upgraded to a Cray XC50 with Xeons and the Nvidia Tesla P100, and now offers just under 16 petaflops of theoretical peak performance and 9.8 petaflops of Linpack performance. That is a big upgrade from the 7.8 petaflops of peak performance and 6.3 petaflops of Linpack performance in its earlier iteration, based on the Cray XC30 with Nvidia K20x accelerators.
The other P100-based system on the list was Nvidia’s own DGX Saturn V, based on the company’s own DGX-1 systems and an InfiniBand interconnect, which came in at No. 28 on the list. Note that Nvidia is now selling both the processors and the DGX-1 appliance, which includes software and eight Tesla P100s. The DGX Saturn V system, which Nvidia uses for internal AI research, scores nearly 4.9 peak petaflops and 3.3 Linpack petaflops. But what Nvidia points out is that it only uses 350 kilowatts of power, making it much more energy efficient. As a result, this system tops the list of the most energy-efficient systems. Nvidia points out that this is considerably less energy than is used by the Xeon Phi-based Camphor 2 system, which has similar performance (nearly 5.5 petaflops peak and 3.1 Linpack petaflops).
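The energy-efficiency claim can be checked with simple arithmetic: divide measured Linpack performance by power draw. The Python sketch below is only an illustration; the function name is made up for this example, and the inputs are the rounded figures quoted in this article (3.3 petaflops at 350 kilowatts for DGX Saturn V, and 93 petaflops at 15.4 megawatts for the top-ranked system), so the results are approximate.

```python
def gflops_per_watt(linpack_pflops, power_kw):
    """Convert Linpack petaflops and kilowatts into gigaflops per watt."""
    gflops = linpack_pflops * 1e6  # 1 petaflop = 1,000,000 gigaflops
    watts = power_kw * 1e3         # 1 kilowatt = 1,000 watts
    return gflops / watts


# Rounded figures from the article; 15.4 megawatts = 15,400 kilowatts.
saturn_v = gflops_per_watt(3.3, 350)
top_system = gflops_per_watt(93.0, 15_400)

print(f"DGX Saturn V: {saturn_v:.2f} gigaflops per watt")
print(f"No. 1 system: {top_system:.2f} gigaflops per watt")
```

On these numbers, DGX Saturn V delivers about 9.4 gigaflops per watt against about 6.0 for the much larger top-ranked machine. The article does not give a power figure for Camphor 2, so it is left out of the comparison.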
It’s an interesting comparison, with Nvidia touting better energy efficiency on GPUs and Intel touting a more familiar programming model. I’m sure we’ll see more competition in the years to come, as the different architectures compete to see which of them will be the first to reach “exascale computing” or whether the Chinese home-grown approach will get there instead. Currently, the US Department of Energy’s Exascale Computing Project expects the first exascale machines to be installed in 2022 and go live the following year.
I find it interesting to note that despite the emphasis on many-core accelerators like the Nvidia Tesla and Intel Xeon Phi solutions, only 96 systems use such accelerators (including those that use Xeon Phi alone), as opposed to 104 systems a year ago. Intel continues to be the largest chip provider, with its chips in 462 of the top 500 systems, followed by IBM Power processors in 22. Hewlett Packard Enterprise created 140 systems (including those built by Silicon Graphics, which HPE acquired), Lenovo built 92, and Cray 56.
Machine Learning Competition
There were a number of announcements at or around the show, most of which dealt with some form of artificial intelligence or machine learning. Nvidia announced a partnership with IBM on a new deep-learning software toolkit called IBM PowerAI that runs on IBM Power servers using Nvidia’s NVLink interconnect.
AMD, which has been an afterthought in both HPC and machine-learning environments, is working to change that. In this area, the company focused on its own Radeon GPUs, pushed its FirePro S9300 x2 server GPUs, and announced a partnership with Google Cloud Platform to make them available over the cloud. But AMD hasn’t invested as much in software for programming GPUs, as it has been emphasizing OpenCL over Nvidia’s more proprietary approach. At the show, AMD introduced a new version of its Radeon Open Compute Platform (ROCm), and touted plans to support its GPUs in heterogeneous computing scenarios with multiple CPUs, including its forthcoming “Zen” x86 CPUs, ARM architectures starting with Cavium’s ThunderX, and IBM Power 8 CPUs.
At the show, Intel talked about a new version of its current Xeon E5 v4 (Broadwell) chip tuned for floating-point workloads, and how the next version based on the Skylake platform is due out next year. But in a later event that week, Intel made a series of announcements designed to position its chips in the artificial intelligence or machine-learning space. Much of this has implications for high-performance computing, but is mostly separate. To begin with, in addition to the standard Xeon processors, the company also is promoting FPGAs for doing much of the inferencing in neural networks. That’s one big reason the company recently purchased Altera, and such FPGAs are now used by companies such as Microsoft.
But the focus on AI last week dealt with some newer chips. First, there is Xeon Phi, where Intel has indicated that the current Knights Landing version will be supplemented next year with a new version called Knights Mill, aimed at the “deep learning” market. This is another 14nm version but with support for half-precision calculations, which are frequently used in training neural networks. Indeed, one of the big advantages of the current Nvidia chips in deep learning is their support for half-precision calculations and 8-bit integer operations. Intel has said Knights Mill will deliver up to four times the performance of Knights Landing for deep learning. (This chip is still slated to be followed later by a 10nm version called Knights Hill, probably aimed more at the traditional high-performance computing market.)
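To make the half-precision trade-off concrete, the Python sketch below rounds a value to 16-bit and 32-bit IEEE 754 formats using only the standard library’s `struct` module. It is an illustration of the number format, not of how any Intel or Nvidia chip implements it, and the helper names `to_half` and `to_single` are made up for this example.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage per value in bytes: half, single, and double precision.
print(struct.calcsize("<e"), struct.calcsize("<f"), struct.calcsize("<d"))  # 2 4 8

# Half precision keeps only about three decimal digits.
print(to_half(0.1))    # 0.0999755859375
print(to_single(0.1))  # 0.10000000149011612

# Above 2048, half precision cannot represent every integer.
print(to_half(2049.0))  # 2048.0
```

Each half-precision value takes a quarter of the memory and bandwidth of a double-precision one, at the cost of accuracy that neural network training can often tolerate but traditional HPC simulations usually cannot.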
Most interesting for next year is a design from Nervana, which Intel recently acquired; it uses an array of processing clusters designed to do simple math operations, connected to high-bandwidth memory (HBM). First up in this family will be Lake Crest, which was designed before Intel bought the company and is manufactured on a 28nm TSMC process. It is due out in test versions in the first half of next year, and Intel says it will deliver more raw compute performance than a GPU. This will eventually be followed by Knights Crest, which somehow implements Nervana’s technology alongside Xeon, with details still unannounced.
“We expect Nervana’s technologies to produce a breakthrough 100-fold increase in performance in the next three years to train complex neural networks, enabling data scientists to solve their biggest AI challenges faster,” wrote Intel CEO Brian Krzanich.
Intel also recently announced plans to acquire Movidius, which makes DSP-based chips particularly suited for computer vision inferencing—again, making decisions based on previously trained models.
It’s a complicated and evolving story, and certainly not as straightforward as Nvidia’s push for its GPUs everywhere. But what it makes clear is just how quickly machine learning is taking off, and the many different ways that companies are planning to address the problem, from GPUs like those from Nvidia and AMD, to many-core x86 processors such as Xeon Phi, to FPGAs, to specialized products for training such as Nervana and IBM’s TrueNorth, to custom DSP-like inferencing engines such as those from Movidius. It will be very interesting to see whether the market has room for all of these approaches.