Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet that scales model training to hundreds of GPUs with minimal code changes.
Originally developed at Uber, Horovod uses the ring-allreduce algorithm, running over NVIDIA NCCL or MPI, to exchange gradients between workers. Because each worker communicates only with its neighbors in the ring, there is no centralized parameter server to become a bottleneck, and a single-GPU training script can be scaled to a large cluster by adding a few lines of Python. Organizations such as Alibaba and Amazon use it to achieve nearly linear scaling efficiency (often exceeding 90%) across large clusters of accelerators.
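To make the ring-allreduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Horovod's implementation and does not use NCCL or MPI; the function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made for illustration only. Each simulated worker passes one chunk of its vector to the next worker in the ring per step, first summing chunks (scatter-reduce) and then circulating the finished sums (allgather).

```python
def ring_allreduce(worker_data):
    """Simulate ring-allreduce (sum) over a list of equal-length vectors.

    worker_data[r] is the local vector held by simulated worker r.
    Returns a new list in which every worker holds the element-wise sum.
    Illustrative sketch only: real systems send chunks over a network.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of the same length")

    bufs = [list(v) for v in worker_data]  # copy so inputs are not mutated
    if n == 1:
        return bufs

    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]

    def exchange(chunk_of, combine):
        # Snapshot outgoing chunks first so all sends happen "simultaneously".
        msgs = []
        for r in range(n):
            c = chunk_of(r)
            lo, hi = bounds[c]
            msgs.append((c, bufs[r][lo:hi]))
        # Worker r sends to its right-hand neighbor (r + 1) mod n.
        for r in range(n):
            dst = (r + 1) % n
            c, payload = msgs[r]
            lo, _ = bounds[c]
            for i, x in enumerate(payload):
                bufs[dst][lo + i] = combine(bufs[dst][lo + i], x)

    # Phase 1, scatter-reduce: after n - 1 steps, worker r holds the
    # fully summed chunk (r + 1) mod n.
    for s in range(n - 1):
        exchange(lambda r: (r - s) % n, lambda old, new: old + new)

    # Phase 2, allgather: circulate the finished chunks around the ring,
    # overwriting stale values.
    for s in range(n - 1):
        exchange(lambda r: (r + 1 - s) % n, lambda old, new: new)

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, summed in enumerate(ring_allreduce(grads)):
        print(rank, summed)
```

In this sketch each worker sends and receives 2 × (N − 1) chunks, each about 1/N of the vector, so the per-worker traffic stays roughly constant as workers are added. That property is the usual argument for why a ring avoids the parameter-server bottleneck described above.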