Horovod (machine learning)
From Wikipedia, the free encyclopedia
Horovod is a free and open-source distributed deep learning training framework for TensorFlow, Keras, PyTorch and Apache MXNet.[3][4]
It is designed to scale existing single-GPU training scripts to run efficiently on multiple GPUs and compute nodes with minimal code changes, using synchronous data-parallel training based on the ring-allreduce communication pattern.[5] Horovod was initially developed at Uber and released as an open-source project in 2017, and is now hosted by the LF AI & Data Foundation, a project of the Linux Foundation.[1]
History
Horovod was created at Uber as part of the company's internal machine learning platform Michelangelo to simplify scaling TensorFlow models across many GPUs.[1] The first public release of the library, version 0.9.0, was tagged on GitHub in August 2017 under the Apache 2.0 licence.[2] In October 2017, Uber Engineering publicly introduced Horovod as an open-source component of its deep learning toolkit.[1]
In February 2018 Alexander Sergeev and Mike Del Balso published a technical paper describing Horovod's design and benchmarking its performance on up to 512 GPUs, showing near-linear scaling for several image-classification models when compared with single-GPU baselines.[1]
In December 2018 Uber contributed Horovod to the LF Deep Learning Foundation (later LF AI & Data), making it a Linux Foundation project.[6][7][8] Horovod entered incubation under LF AI & Data and graduated as a full foundation project in 2020.[9]
Since its initial release the project has expanded beyond TensorFlow to provide APIs for PyTorch, Keras and Apache MXNet, as well as integrations with frameworks such as Apache Spark and Ray, support for elastic training, and tooling for automated performance tuning and profiling.[10][11]
Design and features
Horovod implements synchronous data-parallel training, in which each worker process maintains a replica of the model and computes gradients on different mini-batches of data.[1] The gradients are aggregated across workers using the ring-allreduce communication pattern rather than a central parameter server, which reduces communication bottlenecks and can improve scaling on multi-GPU clusters.[1] Communication is built on top of collective-communication libraries such as MPI, NCCL, Gloo and Intel oneCCL, and supports both GPU and CPU training.[12]
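The resulting user-facing API is deliberately small. The following sketch, written against Horovod's PyTorch binding, shows the handful of additions a single-GPU training script typically needs; the toy model, random data and hyperparameters are illustrative placeholders rather than a real workload.

    import torch
    import horovod.torch as hvd

    hvd.init()                               # one worker process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its local GPU

    model = torch.nn.Linear(10, 1).cuda()    # stand-in for a real network
    # Scaling the learning rate by the worker count is the convention
    # recommended in Horovod's documentation for synchronous training.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all workers
    # with allreduce before each update.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from identical model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = torch.nn.MSELoss()
    for step in range(100):
        # In practice each worker reads a different shard of the dataset,
        # e.g. via torch.utils.data.distributed.DistributedSampler.
        x, y = torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()  # allreduce overlaps with backprop
        optimizer.step()                 # applies the averaged gradients

Because every worker applies the same averaged gradients at each step, the model replicas remain synchronized without any central parameter server.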
In the benchmark experiments reported in the original paper, Horovod achieved around 90% scaling efficiency on 512 GPUs for the ResNet-101 and Inception v3 convolutional neural networks, and around 68% scaling efficiency for the VGG-16 model.[1]
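Scaling efficiency is conventionally defined relative to ideal linear scaling; assuming that definition here,

    \text{scaling efficiency} = \frac{T_N}{N \cdot T_1}

where T_N is the aggregate training throughput (for example, images per second) on N GPUs and T_1 is the throughput of a single GPU. Under this definition, 90% efficiency on 512 GPUs corresponds to an aggregate throughput of roughly 0.90 × 512 ≈ 461 times that of one GPU, against an ideal of 512.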
Horovod can be deployed on-premises or in cloud environments and is distributed as a Python package with optional GPU support via CUDA.[11][13] The official documentation provides guides for running Horovod with Docker, Kubernetes (including via Kubeflow and the MPI Operator), commercial platforms such as Databricks, and cluster schedulers such as LSF.[11]
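As a sketch of that workflow (the package extras, host names and process counts below are illustrative, not prescriptive), a two-node, eight-GPU job might be installed and launched as follows:

    # Build with NCCL-based GPU allreduce; requires CUDA and NCCL to be present.
    HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

    # Run 8 worker processes, 4 per host, across two machines.
    horovodrun -np 8 -H node1:4,node2:4 python train.py

The horovodrun launcher wraps the underlying MPI or Gloo machinery, so the same script can alternatively be started directly with mpirun.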
Adoption and use cases
Within Uber, Horovod has been used for applications including autonomous driving research, fraud detection and trip forecasting.[14][8]
Major cloud providers have integrated Horovod into their managed machine learning offerings. Amazon Web Services supports distributed training with Horovod in services such as Amazon SageMaker and AWS Deep Learning Containers,[15][16] while Microsoft Azure documents Horovod-based training workflows for Azure Synapse Analytics.[17]
Technical guides from academic and research computing centres, including Purdue University and the NASA Advanced Supercomputing programme, describe Horovod-based workflows for multi-GPU training on supercomputers and clusters.[18]
Horovod is also used in conjunction with Apache Spark and dedicated storage systems as part of end-to-end data processing and model-training pipelines.[19] Industry blogs and technical tutorials describe deployments of Horovod on Kubernetes, on-premises clusters and cloud-managed Kubernetes services such as Amazon EKS.[19][16]