Autoscaling Deep Learning Training with Kubernetes

  • November 21, 2017


We recently partnered with Litbit, a San Jose-based startup, on a project to autoscale deep learning training. Litbit enables its customers to turn their “Internet of Things” into conscious personas that can learn, think, and do helpful things. In order to accomplish this goal, customers train their AI-empowered personas using sight, sound, and touch sensors (among others) to recognize specific situations.

Since different customers may be training different AI personas at different times, the training load tends to be bursty and unpredictable. Some of these training jobs (e.g., Spark ML) make heavy use of CPUs, while others (e.g., TensorFlow) make heavy use of GPUs. In the latter case, some jobs retrain a single layer of the neural net and finish very quickly, while others need to train an entire new neural net and can take several hours to days.

To meet the diverse requirements for training in a cost-efficient manner, Litbit needed a system that could scale different types of VM pools (CPU only, light GPU, heavy GPU) up and down based on demand. In this code story, we have generalized lessons learned from this scenario and will explain how to use the acs-engine-autoscaler to scale different types of VMs up and down based on demand.


While there are many options for running containerized distributed deep learning at scale, we have selected Kubernetes due to its superior cluster management technology and the huge developer community. To start, we need to create a Kubernetes cluster with GPU support on Azure to run different types of machine learning loads. Then we need to add autoscaling capability to the Kubernetes cluster to meet bursty demands in a cost-efficient manner.

Creating a Kubernetes cluster with GPU support using ACS-engine

To create a Kubernetes cluster that supports GPUs, we will use acs-engine, an open source tool that will generate the ARM template we need to deploy our cluster with everything already configured.

NOTE: You might be wondering why we are using acs-engine and not AKS (Azure Container Service, the managed Kubernetes service on Azure). To use the acs-engine-autoscaler, the Kubernetes cluster must be created by acs-engine as the autoscaler requires metadata information about the agent pools to scale the nodes up and down, which is information only exposed by acs-engine. Therefore, the acs-engine-autoscaler does not work with AKS.


Binary downloads for the latest version of acs-engine are available. Download acs-engine for your operating system. Extract the binary and copy it to your $PATH.

Generate Templates

acs-engine reads a JSON cluster definition which describes the size, shape, and configuration of your cluster.

First, update example\kubernetes.json to create a cluster that satisfies our requirements. To create different types of VM pools (CPU only, light GPU, heavy GPU), we can create different agent pools by adding additional sections under agentPoolProfiles. Each pool can have a different VM size and can scale up to 100 nodes (as set by the MaxAgentCount constant in acs-engine). For our scenario, we want a cluster for training with GPUs and inference with CPUs only. Here we are defining two pools as we don’t want to pay for GPU unless needed. The number of agents isn’t really important because we are going to enable autoscaling later, so we will keep everything as 1.

At the time of this writing, Azure has 6 different VM sizes with GPU support. For more details on generating templates, refer to the ACS engine guide.

Here is an example of our Kubernetes cluster definition with multiple agent pools:
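As a rough sketch of such a definition, the Python snippet below builds a two-pool cluster definition and prints it as JSON. The pool names, the NC6 GPU size, and the mymlcluster DNS prefix come from this article; the field names follow the layout of the acs-engine example files as we remember them (some acs-engine versions name the service principal fields differently), and the CPU VM size, admin user name, and placeholder credentials are assumptions to replace with your own.

```python
import json


def agent_pool(name, vm_size, count=1):
    """Return one agentPoolProfiles entry (field names assumed from acs-engine examples)."""
    return {
        "name": name,
        "count": count,
        "vmSize": vm_size,
        "availabilityProfile": "AvailabilitySet",
    }


def cluster_definition(dns_prefix, pools):
    """Build a cluster definition dict with one master and the given agent pools."""
    return {
        "apiVersion": "vlabs",
        "properties": {
            "orchestratorProfile": {"orchestratorType": "Kubernetes"},
            "masterProfile": {
                "count": 1,
                "dnsPrefix": dns_prefix,
                "vmSize": "Standard_D2_v2",  # assumed master size
            },
            "agentPoolProfiles": pools,
            "linuxProfile": {
                "adminUsername": "azureuser",  # assumed user name
                "ssh": {"publicKeys": [{"keyData": "<your SSH public key>"}]},
            },
            "servicePrincipalProfile": {
                "clientId": "<service principal app ID>",
                "secret": "<service principal secret>",
            },
        },
    }


definition = cluster_definition(
    "mymlcluster",
    [
        agent_pool("agentpool1", "Standard_D2_v2"),  # CPU-only pool (size is an assumption)
        agent_pool("agentpool2", "Standard_NC6"),    # GPU pool, one GPU per NC6 VM
    ],
)
cluster_json = json.dumps(definition, indent=2)
print(cluster_json)
```

Each pool starts with a single agent, since the autoscaler will adjust the counts later.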

Now with the cluster definition JSON, let’s generate the templates by running:

This step generates a bunch of files under the _output/mymlcluster directory, including the ARM template and parameters that we want.

Deploy Templates

With the new azuredeploy.json and azuredeploy.parameters.json generated in the previous step, we can now deploy the templates using the Azure CLI.

Note: make sure you choose a region that has N-series VMs available. For example, eastus and southcentralus are two regions with N-series SKUs available. Also, make sure your subscription has enough cores to run those VM types.

This step will take between 5 and 10 minutes to deploy. We will keep the generated azuredeploy.json and azuredeploy.parameters.json around, as we will need them later to set up autoscaling.

Once the deployment is completed, copy the cluster's Kubernetes config file to your local machine to allow kubectl to communicate with the cluster. If you do not already have the kubectl CLI, follow these instructions to install kubectl.

Verifying the Cluster

To ensure everything is working as intended, run:

You should see the correct number of GPUs reported (in this example, it shows 1 GPU for an NC6 VM):

If the GPU count is shown as 0, wait a bit longer. The driver installation takes about 12 minutes, and the node might join the cluster before the installation is completed. After a few minutes, the node should restart and report the correct number of GPUs.
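To check the GPU count across every node at once, a small script can read the output of `kubectl get nodes -o json`. The sketch below is one way to do that; the GPU resource names it looks for are assumptions (the first is the name used by Kubernetes releases of this era, the second by newer clusters), and the sample node names are hypothetical.

```python
import json

# Resource names under which NVIDIA GPUs may be reported (assumption; check
# the capacity section of `kubectl describe node` on your own cluster).
GPU_RESOURCES = ("alpha.kubernetes.io/nvidia-gpu", "nvidia.com/gpu")


def gpu_capacity(nodes_json):
    """Map node name -> GPU count from the output of `kubectl get nodes -o json`."""
    result = {}
    for node in json.loads(nodes_json)["items"]:
        capacity = node["status"].get("capacity", {})
        count = 0
        for key in GPU_RESOURCES:
            if key in capacity:
                count = int(capacity[key])
                break
        result[node["metadata"]["name"]] = count
    return result


# Hypothetical sample, trimmed to the fields the function reads.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "agentpool1-node-0"},
         "status": {"capacity": {"cpu": "2"}}},
        {"metadata": {"name": "agentpool2-node-0"},
         "status": {"capacity": {"cpu": "6", "alpha.kubernetes.io/nvidia-gpu": "1"}}},
    ]
})
report = gpu_capacity(sample)
print(report)
```

A GPU node that still reports 0 here is most likely in the middle of its driver installation.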

Scheduling a GPU Container

Now that we have a GPU-enabled Kubernetes cluster, we can run a container that requires GPU resources. Below is an example GPU container running TensorFlow. To request GPU resources, we have to specify how many GPUs the container needs; Kubernetes will then map the devices into the container. To use the drivers, we need to mount the driver directory from the Kubernetes agent host into the container.

Note: the drivers are installed under /usr/lib/nvidia-384 (or another version number depending on the driver’s version).
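The sketch below builds a pod spec with these pieces in Python and prints it as JSON, which kubectl accepts as well as YAML. It is an illustration under assumptions, not the exact manifest we deployed: the GPU resource name is the one used by Kubernetes releases of this era, the image tag is the public TensorFlow GPU image, and the mount path inside the container is hypothetical.

```python
import json

DRIVER_DIR = "/usr/lib/nvidia-384"  # host path from the note above; version may differ
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"  # assumed resource name for this era
MOUNT_PATH = "/usr/lib/nvidia"  # hypothetical location inside the container


def gpu_pod_spec(image, gpus=1, driver_dir=DRIVER_DIR):
    """Return a pod spec that requests GPUs and mounts the host's NVIDIA driver."""
    return {
        "containers": [{
            "name": "tensorflow",
            "image": image,
            # GPUs are declared under limits; the scheduler uses this to pick a node.
            "resources": {"limits": {GPU_RESOURCE: gpus}},
            # This replaces any LD_LIBRARY_PATH set by the image; append other
            # directories if the image needs them.
            "env": [{"name": "LD_LIBRARY_PATH", "value": MOUNT_PATH}],
            "volumeMounts": [{"name": "nvidia-driver", "mountPath": MOUNT_PATH}],
        }],
        "volumes": [{"name": "nvidia-driver", "hostPath": {"path": driver_dir}}],
    }


spec = gpu_pod_spec("tensorflow/tensorflow:latest-gpu")
print(json.dumps(spec, indent=2))
```

The same spec can be placed under the pod template of a deployment.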

Note: we have requested 1 GPU in the container's resource requests and mounted the drivers from the host into the container. We also modified the LD_LIBRARY_PATH environment variable to let Python know where to find the driver's libraries.

Some libraries are installed under /usr/lib/x86_64-linux-gnu on the host. Depending on your requirements, you might need to mount them separately, in the same way as the drivers.

Schedule the deployment with the following command:

Autoscaling Kubernetes Cluster to Meet Bursty Demands

Now that we have a Kubernetes cluster that can run both CPU and GPU workloads, we need to be able to scale the VM pools up and down based on demand. Kubernetes-acs-engine-autoscaler, a fork of OpenAI’s Kubernetes-ec2-autoscaler, can autoscale an acs-engine Kubernetes cluster based on demand.

The Kubernetes-acs-engine-autoscaler runs inside the cluster and monitors the pods that get scheduled. Whenever a pod is pending because of a lack of resources, the autoscaler creates enough new VMs to support it. When VMs become idle, the autoscaler deletes them. As a result, we get the flexibility we want while still keeping costs down.
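To make the scale-up decision concrete, here is a simplified model of it in Python. This is our own illustration, not the autoscaler's actual algorithm: it assumes existing nodes are full (which is why the pods are pending), places each pod in the first pool whose VM size can hold it, and packs pods onto new nodes in arrival order. The Pool and Pod types, the pod names, and the per-VM CPU and GPU figures are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    cpus_per_node: int
    gpus_per_node: int
    max_nodes: int = 100  # per-pool upper bound mentioned earlier


@dataclass
class Pod:
    name: str
    cpus: int = 1
    gpus: int = 0


def nodes_to_add(pending, pools, current_counts):
    """Return {pool name: new nodes} needed to place the pending pods."""
    new_nodes = {p.name: [] for p in pools}  # free [cpus, gpus] per new node
    for pod in pending:
        for pool in pools:
            fits = pod.cpus <= pool.cpus_per_node and pod.gpus <= pool.gpus_per_node
            # Keep CPU-only pods off GPU pools so GPU VMs are only paid for when needed.
            if not fits or (pod.gpus == 0 and pool.gpus_per_node > 0):
                continue
            free = new_nodes[pool.name]
            for slot in free:
                if pod.cpus <= slot[0] and pod.gpus <= slot[1]:
                    slot[0] -= pod.cpus
                    slot[1] -= pod.gpus
                    break
            else:
                # No room on the nodes planned so far: add one, unless the pool is full.
                if current_counts.get(pool.name, 0) + len(free) >= pool.max_nodes:
                    continue  # try the next pool
                free.append([pool.cpus_per_node - pod.cpus,
                             pool.gpus_per_node - pod.gpus])
            break
    return {name: len(slots) for name, slots in new_nodes.items()}


pools = [Pool("agentpool1", cpus_per_node=2, gpus_per_node=0),
         Pool("agentpool2", cpus_per_node=6, gpus_per_node=1)]
pending = [Pod("tftrain-2", cpus=1, gpus=1), Pod("spark-1", cpus=2), Pod("spark-2", cpus=1)]
plan = nodes_to_add(pending, pools, {"agentpool1": 1, "agentpool2": 1})
print(plan)
```

In this example the GPU job adds one node to agentpool2, while the two CPU jobs need two more nodes in agentpool1 because the first job takes a whole VM.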

Setting up the Autoscaler

The acs-engine-autoscaler can be installed with a Helm chart. Helm is a Kubernetes package manager that helps us package, install, and manage our Kubernetes applications. Using the stable/acs-engine-autoscaler Helm chart, we can install the autoscaler in our cluster.

First, locate your azuredeploy.parameters.json file generated with acs-engine from the previous step.

Next, find the values.yaml file from the acs-engine-autoscaler Helm chart. Update the following parameters in the file.

  • resourcegroup: Name of the resource group containing the cluster
  • azurespappid: An Azure service principal ID
  • azurespsecret: An Azure service principal secret
  • azuresptenantid: An Azure service principal tenant ID
  • kubeconfigprivatekey: The key passed to the kubeConfigPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine
  • clientprivatekey: The key passed to the clientPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine
  • caprivatekey: The key passed to the caPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine
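Copying the three keys by hand is error-prone, so a short script can pull them out of the parameters file. The sketch below assumes each parameter is stored as an object with a "value" field, either at the top level or under a "parameters" wrapper, and it returns the values as a flat dictionary; where these names sit inside values.yaml depends on the chart, so check the chart's own file. The sample data is made up.

```python
import json

# Parameter names in azuredeploy.parameters.json mapped to the chart value names above.
KEY_PARAMS = {
    "kubeConfigPrivateKey": "kubeconfigprivatekey",
    "clientPrivateKey": "clientprivatekey",
    "caPrivateKey": "caprivatekey",
}


def chart_values(parameters_json, resource_group, sp_app_id, sp_secret, sp_tenant_id):
    """Collect the autoscaler chart values from the generated parameters file."""
    doc = json.loads(parameters_json)
    # Accept both a bare parameter map and one wrapped in a "parameters" object.
    params = doc.get("parameters", doc)
    values = {
        "resourcegroup": resource_group,
        "azurespappid": sp_app_id,
        "azurespsecret": sp_secret,
        "azuresptenantid": sp_tenant_id,
    }
    for param_name, value_name in KEY_PARAMS.items():
        values[value_name] = params[param_name]["value"]
    return values


# Made-up sample standing in for the real file contents.
sample_parameters = json.dumps({
    "kubeConfigPrivateKey": {"value": "kube-key"},
    "clientPrivateKey": {"value": "client-key"},
    "caPrivateKey": {"value": "ca-key"},
    "agentpool1Count": {"value": 1},
})
values = chart_values(sample_parameters, "my-resource-group",
                      "<app ID>", "<secret>", "<tenant ID>")
print(json.dumps(values, indent=2))
```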

Finally, after you have updated the values.yaml file in the chart, run the following to install the chart with the release name my-release:

Verifying Installation

To verify that the acs-engine-autoscaler is configured properly, find the pod that the deployment created and look at its logs. The result will look similar to the following:

You should see something like the following in the logs of the autoscaler pod.

Autoscaling the Cluster

Recall from the previous section that our Kubernetes cluster has 2 agent pools, each with a single agent. The job we ran in the previous section requires 1 GPU, and only agentpool2 has a VM with a GPU. To test our autoscaler, let’s schedule a second job with a GPU request, similar to the tftrain.yaml deployment we ran earlier in our cluster.

In the autoscaler pod’s log, we should see agentpool2 scaling up to meet the demand:

After a few minutes, the new VM with GPU will be created, and our second job starts running. Once the jobs are completed, the pods are terminated. The autoscaler will notice one or more nodes are now idle and will adjust the cluster size accordingly.

First, idle VMs will be cordoned and drained:

Then after some time, the cordoned node will get deleted:

Voilà! Now we have a Kubernetes cluster that can autoscale as new pods are scheduled and resources are requested.

Horizontal Pod Autoscaling

In some scenarios, you might want to scale up and down based on some metrics, for example, CPU or memory usage. Kubernetes Horizontal Pod Autoscaling (HPA) allows us to specify a metric and target to track on a deployment.

For example, for a given deployment, you might configure HPA to keep the average CPU usage across its pods at or below 50%. Once the average CPU usage across the running pods exceeds 50%, HPA will increase the number of replicas in the deployment and spread the load across the cluster. Eventually, though, the existing VMs in the cluster will not be able to support more replicas, and new pods created by HPA will be stuck in the Pending state. At that point, the acs-engine-autoscaler will notice the pending pods and create new VMs to support them, then delete the idle VMs once the jobs are completed.
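The replica count HPA aims for follows the formula given in the Kubernetes documentation: the current replica count multiplied by the ratio of the current metric to the target, rounded up. The Python sketch below applies that formula with a clamp to a minimum and maximum replica count; it leaves out details such as the tolerance band around the target, and the default bounds are assumptions for the example.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """desired = ceil(current * currentMetric / targetMetric), clamped to the bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 80% CPU against a 50% target.
print(desired_replicas(4, 80, 50))
# The same 4 replicas averaging 20% CPU.
print(desired_replicas(4, 20, 50))
```

With a 50% target, 4 replicas at 80% average CPU grow to 7, and 4 replicas at 20% shrink to 2. If the cluster has no room for the extra replicas, they stay pending until the node autoscaler adds VMs.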

To understand how to configure Horizontal Pod Autoscaling, check out the official documentation.


With this solution, we helped Litbit scale up to 40 nodes at a time and then scale back down as planned. Litbit has been using it successfully for the past 4 months. This solution is ideal for use cases where you need to scale different types of VMs up and down based on demand. To test the Azure autoscaler for your own use case, check out this GitHub repo.


Cover image by Markus Spiske on Unsplash
