Intel Habana overtakes Nvidia in latest MLPerf results


Intel’s Habana has overtaken Nvidia in the latest results from MLPerf, which has become the industry-standard set of benchmarks for comparing AI accelerators. Although Nvidia has already announced its next-gen GPU, the results indicate that competition in deep learning training hardware is intensifying.

Intel acquired startup Habana in late 2019 for $2 billion, and late last year its first-generation 16nm Gaudi NPU (neural processing unit) went live in Amazon’s AWS cloud, claiming a 40% higher performance per dollar than Nvidia-based instances. However, since it was competing against the 7nm A100 from Nvidia, Habana mostly achieved its value by charging a lower price, not by beating Nvidia on performance.

This changed in May when Habana announced Gaudi2 on 7nm, which increases the number of tensor processing cores by 3x and offers up to 96GB of HBM2e. Habana claimed that it outperformed the A100, Nvidia’s leading two-year-old data center GPU, by a comfortable margin. The launch came just in time to be included in the latest results from MLPerf, the industry’s attempt at standardizing deep learning benchmarking.

Performance results 

Habana said it had just 10 days since the launch to submit its results, so it wasn’t able to perform all eight tests, and only focused on the two most widely known benchmarks: ResNet-50 (image recognition) and BERT (natural language processing). MLPerf submissions go through a month-long peer review process.

Habana also said the short time meant it hadn’t yet had the time for thorough software optimizations. For example, Gaudi2 added support for a new lower-precision FP8 format, which wasn’t used in the submission. Instead, Habana opted to submit results based on the same software that is available to all Habana customers, whereas Nvidia purportedly uses optimizations not available in its customer-available software. 

This means that the performance difference in non-optimized cases is larger. In Habana’s own tests using public repositories on Azure instances, Habana measured that Gaudi2 was at least 2x faster than the A100 on both ResNet-50 and BERT. Habana argues that these results are more representative of the out-of-the-box performance that customers will see using publicly available software.

In the MLPerf results, compared to Nvidia’s submission, Gaudi2 was able to train ResNet-50 in 36% less time, which translates to 56% higher performance. Nevertheless, deep learning startup MosaicML’s MLPerf results, which used PyTorch, delivered a training time of 23.8 minutes that beat Nvidia’s own submission, although it was still slower than Gaudi2. On the other hand, further software optimizations may also reduce Gaudi2’s time in future submissions.
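The step from "36% less time" to "56% higher performance" follows from performance being the inverse of training time. A minimal sketch of that arithmetic is shown here; the function name is hypothetical and chosen only for illustration, and the input is the rounded percentage quoted above rather than the raw MLPerf timings.

```python
def gain_from_time_reduction(reduction: float) -> float:
    """Convert a fractional cut in training time into a fractional
    gain in performance (work completed per unit of time).

    reduction: fraction of time saved, e.g. 0.36 for "36% less time".
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    # Finishing in (1 - reduction) of the time means running
    # 1 / (1 - reduction) times as fast.
    return 1.0 / (1.0 - reduction) - 1.0


# ResNet-50 figure quoted in the article (rounded input).
resnet_gain = gain_from_time_reduction(0.36)
print(f"ResNet-50: {resnet_gain:.0%} higher performance")  # 56%
```

Because the two percentages are not symmetric, a time reduction always corresponds to a larger-looking performance gain, and the gap widens as the reduction grows.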

In BERT, the victory was smaller, with Gaudi2 taking 7% less time than the A100. Compared to the first-generation Gaudi, Gaudi2 was 3x and 4.7x faster in ResNet-50 and BERT, respectively. The results for all accelerators are based on 8-card servers. Habana further showed results for a system with 256 accelerators, which delivers nearly 25x higher performance than the 8-card server against a theoretical scaling limit of 32x, showing that most of the performance is maintained in the scale-out configurations in which these chips are often deployed.
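The scale-out figures can be read as a scaling efficiency: measured speedup divided by the ideal, linear speedup. The sketch below assumes the 256 figure counts accelerators, which is what a 32x limit relative to an 8-card server implies, and it treats "nearly 25x" as exactly 25 for illustration. The function name is hypothetical.

```python
def scaling_efficiency(base_units: int, scaled_units: int,
                       measured_speedup: float) -> float:
    """Return measured speedup as a fraction of ideal linear scaling."""
    if base_units <= 0 or scaled_units <= 0:
        raise ValueError("unit counts must be positive")
    # Ideal scaling: performance grows in proportion to unit count.
    ideal_speedup = scaled_units / base_units
    return measured_speedup / ideal_speedup


# 8-card baseline, 256-unit system, roughly 25x measured (assumed exact here).
efficiency = scaling_efficiency(8, 256, 25.0)
print(f"Scaling efficiency: {efficiency:.0%}")  # 78%
```

Under these assumptions the large system retains a little under four-fifths of its ideal throughput, with the remainder lost to communication and synchronization overhead.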

What’s next

The thesis of most AI startups was that they could beat Nvidia by throwing out all of the GPU-specific hardware and focusing only on AI. Despite having had just a handful of days since the official launch to submit its results, Habana’s Gaudi2 has beaten Nvidia’s A100, with both chips manufactured on 7nm process technology, using out-of-the-box hardware and commercially available software. Habana further claims that the performance difference on non-optimized code, outside of MLPerf, can be over 2x. Since Habana is likely to price Gaudi2 lower than Nvidia’s A100, and each Gaudi2 chip also has 24 integrated 100G Ethernet ports, the difference in total cost of ownership may be even larger, as Habana and AWS already claim is the case for the first-generation Gaudi.

While Habana may have taken the performance crown this round, Nvidia has already announced its next-generation H100, with availability later this year. Habana also has not yet announced any cloud instances for Gaudi2.