Towards Scalable Deep Learning with Communication Optimizations

PhD Thesis Proposal Defence


Title: "Towards Scalable Deep Learning with Communication Optimizations"

by

Mr. Lin ZHANG


Abstract:

With the explosive growth of data and model sizes, it has become prevalent to
parallelize the training of large deep neural networks (DNNs) across clusters
of distributed devices. While distributed training schemes enable large-scale
deep learning applications, they introduce extensive communication over the
network. This communication overhead often consumes a significant portion of
the training time, resulting in a severe performance bottleneck.

Communication optimization has therefore attracted much attention from both
academia and industry as a means to improve system scalability. We notice that
state-of-the-art data-parallel training frameworks, such as PyTorch-DDP and
Horovod, only optimize the all-reduce communication for gradient aggregation.
They neither consider alternative collective communication primitives nor
support novel training algorithms such as second-order methods. To address
these limitations, our research objective is to optimize the various forms of
communication overhead in data-parallel training systems.

First, we present DeAR, a novel distributed training mechanism that decouples
the all-reduce primitive into two operators to enable fine-grained
communication scheduling. By doing so, we can overlap the first operation with
the back-propagation computation and the second operation with the
feed-forward computation, hiding more of the gradient aggregation
communication. Moreover, we propose a dynamic tensor fusion algorithm in DeAR
that uses Bayesian optimization to judiciously determine which tensors should
be fused to improve training efficiency. Extensive experiments show that DeAR
achieves up to 83% speedup over state-of-the-art solutions on a 64-GPU cluster
connected by 10Gb/s Ethernet.
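To make the decoupling idea concrete, the following minimal sketch (our own
illustration, not the DeAR implementation) splits an all-reduce into a
reduce-scatter followed by an all-gather using PyTorch's collective API. The
assumption that these are the two operators, the function name
decoupled_allreduce, and the padding logic are ours for illustration; DeAR's
actual scheduling issues the two steps asynchronously so they overlap with
back-propagation and feed-forward, respectively.

import torch
import torch.distributed as dist

def decoupled_allreduce(grad: torch.Tensor):
    """Average `grad` across ranks using reduce-scatter + all-gather instead
    of a single all-reduce. In DeAR-style scheduling, the reduce-scatter would
    be issued during back-propagation and the all-gather during the next
    feed-forward pass; here both are shown back-to-back for clarity."""
    world = dist.get_world_size()
    flat = grad.flatten()
    pad = (-flat.numel()) % world                   # pad so it splits evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    chunks = list(flat.chunk(world))

    # Step 1: reduce-scatter -- each rank receives one fully reduced shard.
    # (Overlapped with back-propagation in DeAR.)
    shard = torch.empty_like(chunks[0])
    dist.reduce_scatter(shard, chunks, op=dist.ReduceOp.SUM)

    # Step 2: all-gather -- every rank reassembles the full reduced gradient.
    # (Overlapped with the feed-forward pass in DeAR.)
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)

    full = torch.cat(gathered)[: grad.numel()].view_as(grad)
    grad.copy_(full / world)                        # average across ranks
    return grad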

Second, we extend existing distributed training systems to support
second-order methods, notably distributed K-FAC (D-KFAC) algorithms. We find
that D-KFAC algorithms require computing and communicating a large volume of
second-order information in the form of Kronecker factors (KFs), posing new
challenges for communication optimization. To address this, we present smart
parallel D-KFAC (SPD-KFAC), which pipelines the computation and communication
of KFs and applies a load-balancing scheme to the workloads of inverting the
KFs. Next, we propose placement-aware D-KFAC (PAD-KFAC), with efficient
communication and optimal tensor placement scheduling, to eliminate the
redundant communication in the prior SPD-KFAC design. Our experimental results
show that PAD-KFAC achieves up to 36% speedup over state-of-the-art D-KFAC
algorithms and outperforms its SGD counterpart in end-to-end training time on
a 64-GPU cluster.
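As a rough illustration of the load-balancing idea, the sketch below (our own,
not the SPD-KFAC code) distributes per-layer KF inversions across ranks with a
greedy least-loaded assignment and then broadcasts the inverses so every rank
can precondition its gradients. The cubic cost model for an n x n inverse, the
damping value, and the helper names are assumptions made for illustration, not
necessarily the exact policy used in our systems.

import heapq
import torch
import torch.distributed as dist

def assign_inversions(kf_sizes, world_size):
    """Greedily assign KFs (keyed by size, a proxy for inversion cost) to the
    currently least-loaded rank. Returns {kf_index: rank}."""
    heap = [(0.0, rank) for rank in range(world_size)]   # (load, rank)
    heapq.heapify(heap)
    owner = {}
    # Largest factors first: an n x n inverse costs roughly O(n^3).
    for idx, n in sorted(enumerate(kf_sizes), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)
        owner[idx] = rank
        heapq.heappush(heap, (load + float(n) ** 3, rank))
    return owner

def invert_and_share(kfs, damping=1e-3):
    """Each rank inverts only the (damped) KFs it owns, then broadcasts the
    inverses so every rank holds all of them for preconditioning."""
    rank, world = dist.get_rank(), dist.get_world_size()
    owner = assign_inversions([kf.shape[0] for kf in kfs], world)
    inverses = []
    for idx, kf in enumerate(kfs):
        inv = torch.empty_like(kf)
        if owner[idx] == rank:
            eye = torch.eye(kf.shape[0], device=kf.device, dtype=kf.dtype)
            inv = torch.linalg.inv(kf + damping * eye)   # damped inverse
        dist.broadcast(inv, src=owner[idx])
        inverses.append(inv)
    return inverses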


Date:                   Friday, 2 June 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        lifts 25/26

Committee Members:      Prof. Bo Li (Supervisor)
                        Prof. Qian Zhang (Chairperson)
                        Prof. Kai Chen
                        Dr. Wei Wang


**** ALL are Welcome ****