Credit: Nvidia
Meta recently published a study detailing its training of the Llama 3 405B model on a cluster containing 16,384 Nvidia H100 80GB GPUs. The training took place over 54 days, and the cluster experienced 419 unexpected component failures during that time, or one failure every three hours on average. Half of the failures were caused by the GPUs or their onboard HBM3 memory.
As the old supercomputer adage goes, the only certainty in large-scale systems is failure. Supercomputers are extremely complex devices that use tens of thousands of processors, hundreds of thousands of other chips, and hundreds of miles of cables. In a sophisticated supercomputer, it is normal for something to fail every few hours, and the main challenge for developers is to ensure that the system remains operational despite these local failures.
The scale and synchronous nature of training across 16,384 GPUs make the job prone to failures. If failures are not mitigated properly, a single GPU failure can disrupt the entire training job and force a restart. Despite this, the Llama 3 team maintained an effective training time of over 90%.
Over the 54-day training run, 466 job interruptions were recorded, of which 47 were planned and 419 were unplanned. Planned interruptions were due to automated maintenance, while unplanned interruptions stemmed primarily from hardware issues. GPU problems were the largest category, accounting for 58.7% of unplanned interruptions. Only three incidents required significant manual intervention; the rest were handled by automation.
Out of 419 unexpected outages, 148 (30.1%) were caused by various GPU failures (including NVLink failures), while 72 (17.2%) were caused by HBM3 memory failures, which isn’t too surprising given that Nvidia’s H100 GPUs draw around 700W and are under a lot of thermal stress. Interestingly, only two CPUs failed in 54 days.
But while GPUs are both the most important and the most fragile components, a further 41.3% of unexpected outages were attributed to a mix of other factors, including software bugs, network cables, and network adapters.
To improve efficiency, the Meta team reduced startup and task monitoring times and developed proprietary diagnostic tools. PyTorch’s NCCL flight recorder was used extensively to quickly diagnose and resolve crashes and performance issues, particularly those related to NCCLX. The tool captures collective metadata and stack traces, enabling rapid troubleshooting.
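For readers who want to try the flight recorder themselves, recent PyTorch 2.x releases expose it through environment variables (names per PyTorch's distributed debugging documentation; check them against your installed version, as the details of Meta's internal setup are not public):

```shell
# Keep an in-memory ring buffer of the last N collective operations per rank.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
# Dump the recorded collectives (metadata + stack traces) when the
# NCCL watchdog detects a timed-out or hung collective.
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
# Then launch the training job as usual, e.g.:
# torchrun --nnodes=<N> --nproc-per-node=8 train.py
```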
NCCLX played a crucial role in fault detection and localization, especially for NVLink and RoCE issues. Integration with PyTorch made it possible to monitor and automatically interrupt communications blocked due to NVLink failures.
Lagging GPUs, which can slow down thousands of other GPUs, were identified using specialized tools. These tools prioritized problematic communications, enabling efficient detection and rapid resolution of stragglers, minimizing slowdowns and maintaining overall training effectiveness.
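Meta's straggler-detection tooling is proprietary, but the underlying idea can be sketched simply: collect per-rank step timings and flag any rank that consistently lags the rest of the cluster. A minimal illustration (the function name, data layout, and 15% threshold here are all hypothetical, not Meta's):

```python
from statistics import median

def find_stragglers(step_times_by_rank, threshold=1.15):
    """Flag ranks whose average step time exceeds the cluster-wide
    median by more than `threshold` (default: 15% slower).

    step_times_by_rank: dict mapping rank id -> list of step durations (s).
    Returns a sorted list of suspect rank ids.
    """
    avg = {rank: sum(t) / len(t) for rank, t in step_times_by_rank.items()}
    med = median(avg.values())
    return sorted(r for r, a in avg.items() if a > med * threshold)

# Example: rank 2 is consistently ~30% slower than its peers.
timings = {
    0: [1.00, 1.02, 0.99],
    1: [1.01, 1.00, 1.03],
    2: [1.30, 1.32, 1.29],
    3: [0.98, 1.01, 1.00],
}
print(find_stragglers(timings))  # → [2]
```

In a synchronous training job, every collective waits for its slowest participant, which is why a single lagging GPU can throttle thousands of healthy ones.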
Environmental factors, such as mid-day temperature fluctuations, impacted training performance by causing a 1-2% variation in throughput. Dynamic voltage and frequency scaling of GPUs was affected by these temperature changes, although it was not a major issue.
Another challenge the Llama 3 405B training team faced was simultaneous fluctuations in the power consumption of tens of thousands of GPUs, sometimes on the order of tens of megawatts, which stretched the limits of the data center’s power grid and meant Meta had to ensure its data centers could supply enough power.
Considering that a cluster of 16,384 GPUs experienced 419 failures in 54 days (7.76 times per 24 hours, or one failure every three hours), one can only wonder how often xAI’s cluster containing 100,000 H100 GPUs, a six-fold increase in the number of components susceptible to failure, will experience a failure.
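The back-of-the-envelope math behind these figures, extended with the naive assumption that failures scale linearly with GPU count (real clusters may behave differently):

```python
# Reported figures for the Llama 3 405B run.
failures = 419
days = 54
gpus = 16_384

rate_per_day = failures / days     # ≈ 7.76 failures per 24 hours
hours_between = 24 / rate_per_day  # ≈ 3.1 hours between failures
print(f"{rate_per_day:.2f} failures/day, one every {hours_between:.1f} h")

# Naive linear extrapolation to a 100,000-GPU cluster, assuming the
# same per-GPU failure rate as Meta's H100 cluster.
xai_gpus = 100_000
scaled_rate = rate_per_day * xai_gpus / gpus  # ≈ 47 failures per day
print(f"extrapolated: one failure every {24 * 60 / scaled_rate:.0f} minutes")
```

Under that assumption, a 100,000-GPU cluster would see a component failure roughly every half hour.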