Resource Management in Distributed Deep Learning Optical Clusters (Ongoing)


The servers of traditional computer clusters are interconnected by electronic packet-switched networks. These electronic networks scale poorly and suffer from limited bandwidth, high latency, and high power consumption. Data centres now consume 2% of the world's electricity, and the slowing of Moore's Law, coinciding with new data-hungry demands, means electronic switches cannot keep up with emerging data-heavy applications such as machine learning, genome processing, and the internet of things. Although server compute power has increased 65x over the last 18 years, the bandwidth of the network facilitating communication between these servers has increased by only 4.8x, resulting in a factor-of-8 decrease in bytes communicated per FLOP. This shifts the performance bottleneck away from the servers themselves and into the network connecting them.

Low-latency, high-bandwidth, ultra-scalable optical circuit-switched networks can address these challenges and enable the deployment of next-generation high-performance clusters and data centres. Machine learning workloads in particular present a unique opportunity to develop specialised circuit-switched clusters because they are predictable, periodic, and consist mostly of large network flows. Furthermore, learning models with trillions of parameters are being developed whose final test performance is capped primarily by model size, which in the 'strong scaling' case is limited by the bandwidth of the network.
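
To see why bandwidth caps model size, consider a rough back-of-envelope sketch of the per-iteration gradient synchronisation time under synchronous data parallelism with a ring all-reduce. All of the figures below (worker count, link bandwidths, fp32 gradients) are illustrative assumptions rather than measurements:

```python
# Back-of-envelope estimate of per-iteration gradient synchronisation
# time for synchronous data-parallel training with a ring all-reduce.
# All numbers below are illustrative assumptions, not measurements.

def allreduce_time_s(n_params: float, n_workers: int, bandwidth_gbps: float) -> float:
    """Seconds to ring-all-reduce fp32 gradients, ignoring latency terms."""
    grad_bytes = n_params * 4                              # fp32 gradients
    # Each worker sends/receives ~2(n-1)/n of the gradient volume in a ring.
    bytes_per_worker = 2 * (n_workers - 1) / n_workers * grad_bytes
    return bytes_per_worker / (bandwidth_gbps * 1e9 / 8)   # Gb/s -> bytes/s

# A hypothetical 1-trillion-parameter model across 64 workers:
print(f"100 Gb/s electronic link:  {allreduce_time_s(1e12, 64, 100):7.1f} s/iteration")
print(f"25.6 Tb/s optical circuit: {allreduce_time_s(1e12, 64, 25_600):7.2f} s/iteration")
```

Under these illustrative assumptions, synchronising a trillion-parameter model takes minutes per iteration on the electronic link but seconds on the optical circuit, which is the sense in which network bandwidth caps the feasible model size.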

In this work, we aim to address the challenge of how to make resource management decisions (from computation graph partitioning and placement to server allocation and scheduling) when training massive models via distributed deep learning on an optical cluster. By framing the problem as a Markov decision process, where sequential actions must be taken to maximise some reward (such as minimising the overall job completion time), a graph neural network can be trained end-to-end to reason about how to allocate the cluster's resources optimally. We are developing a suite of cluster environments, graph neural network models, and reinforcement learning algorithms to achieve this, and we hope to demonstrate both strong performance and the ability to scale to large networks and jobs.
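
As a concrete illustration of this framing, the sketch below pairs a one-layer message-passing policy with a REINFORCE-style update. The environment interface (`reset`, `job_completion_time`) and all dimensions are hypothetical placeholders standing in for the cluster environments we are developing, not a finished implementation:

```python
# A minimal sketch of the MDP framing, assuming a hypothetical cluster
# simulator exposing `reset()` (returns op features and an adjacency
# matrix for the computation graph) and `job_completion_time(placement)`.
# All names and dimensions here are illustrative, not our final design.
import torch
import torch.nn as nn

class GNNPolicy(nn.Module):
    """One round of mean-aggregation message passing, then a per-node
    head scoring each candidate server placement for that op."""
    def __init__(self, n_features: int, n_servers: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_features, hidden)
        self.message = nn.Linear(hidden, hidden)
        self.head = nn.Linear(2 * hidden, n_servers)

    def forward(self, x, adj):
        # x: (n_ops, n_features) op features; adj: (n_ops, n_ops) 0/1 edges.
        h = torch.relu(self.embed(x))
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        neighbours = (adj @ self.message(h)) / deg   # mean over neighbours
        logits = self.head(torch.cat([h, neighbours], dim=-1))
        return torch.distributions.Categorical(logits=logits)

def train_step(policy, env, optimiser):
    """REINFORCE-style update: place every op, run the simulated job,
    and reinforce placements by the negative job completion time."""
    x, adj = env.reset()
    dist = policy(x, adj)
    placement = dist.sample()                 # one server index per op
    reward = -env.job_completion_time(placement)
    loss = -dist.log_prob(placement).sum() * reward
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return reward
```

This sketch handles only placement; the fuller action space described above (partitioning, server allocation, scheduling) would extend the same pattern with richer actions and environment dynamics.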