Horovod (machine learning)
From Wikipedia, the free encyclopedia
Horovod is a free and open-source distributed deep learning training framework for TensorFlow, Keras, PyTorch and Apache MXNet.[3][4]
It is designed to scale existing single-GPU training scripts to run efficiently on multiple GPUs and compute nodes with minimal code changes, using synchronous data-parallel training based on the ring-allreduce communication pattern.[5] Horovod was initially developed at Uber and released as an open-source project in 2017, and is now hosted by the LF AI & Data Foundation, a project of the Linux Foundation.[1]
History
Horovod was created at Uber as part of the company's internal machine learning platform Michelangelo to simplify scaling TensorFlow models across many GPUs.[1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache 2.0 licence.[2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit.[1]
In February 2018 Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod's design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models when compared with single-GPU baselines.[1]
In December 2018 Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project.[6][7][8] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020.[9]
Since its initial release the project has expanded beyond TensorFlow to provide APIs for PyTorch, Keras and Apache MXNet, as well as integrations with frameworks such as Apache Spark and Ray, support for elastic training, and tooling for automated performance tuning and profiling.[10][11]
Design and features
Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data.[1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters.[1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training.[12]
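The resulting user-facing API is deliberately small. The following sketch, written against Horovod's PyTorch binding, shows the handful of additions a single-GPU training script typically needs; the toy model, random data and hyperparameters are illustrative placeholders rather than a real workload.

    import torch
    import horovod.torch as hvd

    hvd.init()                               # one worker process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its local GPU

    model = torch.nn.Linear(10, 1).cuda()    # stand-in for a real network
    # Scaling the learning rate by the worker count is the convention
    # recommended in Horovod's documentation for synchronous training.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all workers
    # with allreduce before each update.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from identical model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = torch.nn.MSELoss()
    for step in range(100):
        # In practice each worker reads a different shard of the dataset,
        # e.g. via torch.utils.data.distributed.DistributedSampler.
        x, y = torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()  # allreduce overlaps with backprop
        optimizer.step()                 # applies the averaged gradients

Because every worker applies the same averaged gradients at each step, the model replicas remain synchronized without any central parameter server.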
In the benchmark experiments reported in the original paper, Horovod achieved around 90% scaling efficiency on 512 GPUs for the ResNet-101 and Inception v3 convolutional neural networks, and around 68% scaling efficiency for the VGG-16 model.[1]
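Scaling efficiency is conventionally defined relative to ideal linear scaling; assuming that definition here,

    \text{scaling efficiency} = \frac{T_N}{N \cdot T_1}

where T_N is the aggregate training throughput (for example, images per second) on N GPUs and T_1 is the throughput of a single GPU. Under this definition, 90% efficiency on 512 GPUs corresponds to an aggregate throughput of roughly 0.90 × 512 ≈ 461 times that of one GPU, against an ideal of 512.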
Horovod can be deployed on-premises or in cloud environments and is distributed as a Python package with optional GPU support via CUDA.[11][13] The official documentation provides guides for running Horovod with Docker, Kubernetes (including via Kubeflow and the MPI Operator), commercial platforms such as Databricks, and cluster schedulers such as LSF.[11]
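As a sketch of that workflow (the package extras, host names and process counts below are illustrative, not prescriptive), a two-node, eight-GPU job might be installed and launched as follows:

    # Build with NCCL-based GPU allreduce; requires CUDA and NCCL to be present.
    HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

    # Run 8 worker processes, 4 per host, across two machines.
    horovodrun -np 8 -H node1:4,node2:4 python train.py

The horovodrun launcher wraps the underlying MPI or Gloo machinery, so the same script can alternatively be started directly with mpirun.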
Adoption and use cases
Within Uber, Horovod has been used for applications including autonomous driving research, fraud detection and trip forecasting.[14][8]
Major cloud providers have integrated Horovod into their managed machine learning offerings. Amazon Web Services supports distributed training with Horovod in services such as Amazon SageMaker and AWS Deep Learning Containers,[15][16] while Microsoft Azure documents Horovod-based training workflows for Azure Synapse Analytics.[17]
Technical guides from academic and research computing centres, including Purdue University and the NASA Advanced Supercomputing programme, describe Horovod-based workflows for multi-GPU training on supercomputers and clusters.[18]
Horovod is also used in conjunction with Apache Spark and dedicated storage systems as part of end-to-end data processing and model-training pipelines.[19] Industry blogs and technical tutorials describe deployments of Horovod on Kubernetes, on-premises clusters and cloud-managed Kubernetes services such as Amazon EKS.[19][16]