Microsoft and Google have been actively researching new frameworks for training deep neural networks, and each has recently open-sourced the result of that work: Microsoft’s PipeDream and Google’s GPipe.
Both follow similar principles for training deep learning models, and both are described in detail in their respective research papers (PipeDream, GPipe), which this article summarizes.
The open-source repositories are on GitHub:
Microsoft:
https://github.com/msr-fiddle/pipedream
Google:
https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py
As is well known, training a baseline model is relatively straightforward, but complexity grows with the size and accuracy of the model. For example, the winner of the 2014 ImageNet visual recognition challenge was GoogLeNet, which achieved a top-1 accuracy of 74.8% with 4 million parameters. Just three years later, the winner of the 2017 ImageNet challenge used 145.8 million parameters (more than 36 times as many) to reach a top-1 accuracy of 82.7%. Over the same period, however, GPU memory only increased by a factor of about three.
As models scale up in pursuit of higher accuracy, training them becomes increasingly challenging. The preceding example also shows that relying on improvements in GPU infrastructure to achieve better training is not sustainable. We need distributed computing methods that can parallelize the training workload across different nodes in order to scale up training. The concept of distributed training sounds trivial, but it is in fact extremely complicated.
Google’s GPipe
GPipe focuses on scaling the training workloads of deep learning programs. From an infrastructure standpoint, the complexity of the training process is an often-overlooked aspect of deep learning models. Training datasets keep getting larger and more complex; in the healthcare field, for example, it is not uncommon for models to be trained on millions of high-resolution images. As a result, the training process often takes a long time to complete and consumes large amounts of memory and CPU.
An effective way to think about distributing deep learning models is to divide the problem into data parallelism and model parallelism. Data parallelism uses large clusters of machines and splits the input data across them. Model parallelism instead moves parts of the model onto dedicated hardware accelerators, such as GPUs or TPUs, to speed up training.
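To make the distinction concrete, here is a minimal PyTorch sketch of the two approaches (PyTorch is used purely for illustration; the layer sizes and device names are hypothetical, and the model-parallel variant assumes two GPUs are available):

```python
import torch
import torch.nn as nn

# Data parallelism: replicate the same model on every GPU; each replica
# processes a different slice of the input batch.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
data_parallel = nn.DataParallel(model.cuda())  # splits each batch across GPUs

# Model parallelism: place different layers on different GPUs and move
# activations between devices during the forward pass.
class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(512, 512).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))  # activation crosses the device boundary
```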
Conceptually, almost any training dataset can be partitioned for distributed training according to some logic, but the same cannot be said of models. For example, some deep learning models consist of parallel branches that can be trained independently; in that case, the classic strategy is to divide the computation into multiple partitions and assign different partitions to different branches. However, this strategy falls short for deep learning models that stack layers sequentially.
GPipe combines data and model parallelism by using a technique called pipelining. Conceptually, GPipe is a distributed machine learning library that trains with synchronous stochastic gradient descent and pipeline parallelism, and it is applicable to any DNN composed of multiple consecutive layers.
GPipe partitions the model across different accelerators and automatically splits each mini-batch of training samples into smaller micro-batches. This design lets GPipe’s accelerators work in parallel, maximizing the scalability of the training process.
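The micro-batch splitting itself is simple to sketch. The snippet below is purely sequential, so it shows only the data flow, not the parallel schedule; the batch sizes and the single stand-in stage are hypothetical:

```python
import torch
import torch.nn as nn

stage = nn.Linear(512, 512)                 # stand-in for one pipeline stage

mini_batch = torch.randn(32, 512)           # one mini-batch of 32 samples
micro_batches = mini_batch.chunk(4, dim=0)  # 4 micro-batches of 8 samples each

# In GPipe, each micro-batch enters the pipeline as soon as the first stage
# is free, so downstream accelerators overlap with upstream ones.
outputs = [stage(mb) for mb in micro_batches]
result = torch.cat(outputs, dim=0)          # reassemble the mini-batch output
```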
The figure below illustrates how GPipe distributes a neural network with sequential layers across four accelerators. Fk is the composite forward computation function of the k-th partition, and Bk is the corresponding backpropagation function; Bk depends on both Bk+1 from the layer above and the intermediate activations of Fk. In the top diagram, we can see how the sequential nature of the network leads to under-utilization of resources. The bottom diagram shows the GPipe approach, in which the input mini-batch is divided into smaller micro-batches that the accelerators can process simultaneously.
Image Source:
https://arxiv.org/pdf/1811.06965.pdf
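Google’s reference implementation lives in Lingvo (TensorFlow, linked above). For a quick feel of the API, there is also torchgpipe, a third-party PyTorch reimplementation of the GPipe paper. A minimal sketch, assuming the torchgpipe package is installed and four GPUs are available (the model and balance are illustrative):

```python
import torch.nn as nn
from torchgpipe import GPipe

# GPipe-style libraries require nn.Sequential so that consecutive layers
# can be grouped into pipeline partitions.
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# balance: how many consecutive layers each of the 4 partitions receives.
# chunks: how many micro-batches each mini-batch is split into.
model = GPipe(model, balance=[2, 2, 2, 1], chunks=4)
```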
Microsoft’s PipeDream
A few months ago, Microsoft Research announced the creation of Project Fiddle, a series of research projects aimed at simplifying distributed deep learning. PipeDream is one of the first releases from the Fiddle project, and it focuses on parallelizing the training of deep learning models.
PipeDream takes a different approach to scaling the training of deep learning models, using a technique called pipeline parallelism. This approach attempts to solve some of the challenges of data- and model-parallel techniques, such as those used in GPipe.
Generally, when training on cloud infrastructure, data-parallel methods incur high communication costs at scale, and those costs grow as GPU compute speeds increase, since communication becomes the bottleneck. Similarly, model-parallel techniques usually use hardware resources inefficiently, and they force programmers to decide how to split their specific model across a given hardware deployment, which places an unnecessary burden on them.
Image Source:
http://www.microsoft.com/zh-cn/research/uploads/prod/2019/08/fiddle_pipedream_sosp19.pdf
PipeDream attempts to overcome some of the challenges of the data- and model-parallel approaches with a technique called pipeline parallelism.
Conceptually, pipeline-parallel computation divides the layers of a DNN model into multiple stages, where each stage consists of a set of consecutive layers in the model. Each stage is mapped to a separate GPU, which performs the forward pass (and backward pass) for all the layers in that stage.
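As a concrete illustration of that mapping, here is a minimal PyTorch sketch (the layers, stage boundary, and device names are hypothetical, and unlike PipeDream the stages here run back to back rather than overlapped in time):

```python
import torch
import torch.nn as nn

# Two stages of consecutive layers, each pinned to its own GPU.
stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

def forward(x):
    # Stage 0 runs the forward pass for its layers on cuda:0, then hands
    # the activation to stage 1 on cuda:1.
    a = stage0(x.to("cuda:0"))
    return stage1(a.to("cuda:1"))

# The backward pass retraces the stages in reverse; autograd moves the
# gradients back across the device boundary automatically.
loss = forward(torch.randn(8, 512)).sum()
loss.backward()
```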
Given a specific deep neural network, PipeDream automatically determines how to partition the DNN’s operators based on a brief profiling run performed on a single GPU, balancing the computational load among the different stages while minimizing communication for the target platform. PipeDream load-balances effectively even in the face of model diversity (computation and communication) and platform diversity (interconnect topology and hierarchical bandwidth). Its pipeline-parallel approach to distributed training has several advantages over data- and model-parallel methods.
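PipeDream’s real partitioner runs an optimization over the profile; the sketch below is a deliberately simplified stand-in that conveys only the core idea: time each layer once on a single device, then cut the layer sequence into contiguous stages of roughly equal total cost. All names and numbers are hypothetical.

```python
import time
import torch
import torch.nn as nn

layers = [nn.Linear(512, 512) for _ in range(8)]
x = torch.randn(64, 512)

# Step 1: brief profiling -- time one forward pass through each layer.
costs = []
for layer in layers:
    start = time.perf_counter()
    x = layer(x)
    costs.append(time.perf_counter() - start)

# Step 2: greedily cut the layer sequence into contiguous stages whose
# summed cost stays close to the per-stage budget.
def partition(costs, num_stages):
    budget = sum(costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        if acc >= budget and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages  # e.g. [[0, 1, 2], [3, 4], [5, 6, 7]]

print(partition(costs, num_stages=3))
```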
For starters, PipeDream requires less communication between worker nodes: each worker in the pipelined execution only has to communicate a subset of the gradients and output activations, and only to a single other worker.
Image Source:
https://www.microsoft.com/zh-cn/research/uploads/prod/2019/08/fiddle_pipedream_sosp19.pdf
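That communication pattern maps naturally onto point-to-point primitives. Here is a minimal sketch using torch.distributed (the tensor shape, ranks, and process-group setup are illustrative; PipeDream’s own runtime manages this internally):

```python
import torch
import torch.distributed as dist

# Assume dist.init_process_group(...) has already been called and this
# process is one stage in a pipeline of world_size stages.
rank = dist.get_rank()
activation = torch.zeros(8, 512)  # shape agreed on by adjacent stages

if rank > 0:
    dist.recv(activation, src=rank - 1)   # receive from the previous stage only

output = activation * 2  # stand-in for this stage's forward computation

if rank < dist.get_world_size() - 1:
    dist.send(output, dst=rank + 1)       # send to the next stage only
```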
Distributed training is one of the key challenges in building larger and more accurate deep learning models. It is an active research area in the deep learning community, one that must combine effective concurrent-programming techniques with the nature of deep learning models. Although still in their early stages, Google’s GPipe and Microsoft’s PipeDream are already excellent in their own right, and they are two of the most creative distributed training methods available to deep learning developers.