Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

Ying Mao*, Yuqi Fu, Wenjia Zheng, Long Cheng, Qingzhi Liu, Dingwen Tao

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

12 Citations (Scopus)

Abstract

In the past decade, we have witnessed a dramatic increase in the volume of data collected from various sources. To make the most of this data, various machine learning and deep learning models have been developed to study it. While data-driven applications improve countless products, hyperparameter tuning for these models is still a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for the training of deep learning applications. Cloud service providers create isolated virtual environments for clients, who share physical resources such as CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost system-wide performance. However, general scheduling approaches, such as spread priority and balanced resource schedulers, do not work well with deep learning workloads. In this article, we propose SpeCon, a novel container scheduler optimized for short-lived deep learning applications. Built on container platforms such as Kubernetes and Docker, SpeCon analyzes the typical characteristics of training processes. We design a suite of algorithms to monitor the training’s progress and speculatively migrate slow-growing models to release resources for fast-growing ones. Extensive experiments demonstrate that SpeCon improves completion time by up to 41.5% for an individual job, 14.8% system-wide, and 24.7% in terms of makespan.
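The monitor-and-migrate idea described in the abstract can be illustrated with a minimal sketch. This is not SpeCon's actual algorithm, which the abstract does not specify. The names `Job`, `growth_rate`, and `plan_migrations`, the three-interval window, and the 0.01 threshold are all hypothetical choices made for illustration, and the sketch only produces a migration plan rather than calling any Kubernetes API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """A training job with the evaluation scores sampled while it runs."""
    name: str
    node: str
    scores: List[float] = field(default_factory=list)  # e.g. accuracy per interval


def growth_rate(scores: List[float], window: int = 3) -> float:
    """Average per-interval improvement over the last `window` intervals."""
    if len(scores) < 2:
        return float("inf")  # too little history: treat the job as fast-growing
    recent = scores[-(window + 1):]
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def plan_migrations(jobs: List[Job], nodes: List[str],
                    threshold: float = 0.01) -> Dict[str, str]:
    """Map each slow-growing job that shares a node with a fast-growing job
    to another node that hosts fewer fast-growing jobs."""
    fast_per_node = {n: 0 for n in nodes}
    slow: List[Job] = []
    for job in jobs:
        if growth_rate(job.scores) >= threshold:
            fast_per_node[job.node] += 1
        else:
            slow.append(job)

    plan: Dict[str, str] = {}
    for job in slow:
        if fast_per_node[job.node] == 0:
            continue  # no fast-growing job competes for this node
        others = [n for n in nodes if n != job.node]
        if not others:
            continue  # nowhere to migrate to
        # Assumed policy: prefer the node with the fewest fast-growing jobs.
        target = min(others, key=lambda n: (fast_per_node[n], n))
        if fast_per_node[target] < fast_per_node[job.node]:
            plan[job.name] = target
    return plan
```

Under these assumptions, a job whose score has nearly plateaued is moved away from a node where another job is still improving quickly, while plateaued jobs on nodes without such a competitor stay where they are.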

Original language: English
Pages (from-to): 3770-3781
Journal: IEEE Systems Journal
Volume: 16
Issue number: 3
Early online date: 10 Dec 2021
Publication status: Published - Sept 2022

Keywords

  • Apache YARN and Spark
  • Biological system modeling
  • Cloud computing
  • Computational modeling
  • container management
  • Containers
  • Data models
  • Docker Swarm
  • Kubernetes
  • Monitoring
  • PyTorch
  • TensorFlow
  • Training
