Setting up a Machine Learning Farm in the Cloud with Spot Instances + Auto Scaling

Artist rendition of The Grid. May or may not be what Amazon’s servers actually look like.
Artist rendition of The Grid. May or may not be what Amazon’s servers actually look like.

Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads.

In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with a simple CPU monitoring rule, create a spot-instance bidding policy, and attach that rule to the bidding policy.

But first, let’s talk about how to frame the machine learning problem as a distributed system.

Parallelize the Problem

The kind of machine learning problem you have will determine the structure of the machine workload. Is the primary purpose of the machine learning to train a model on a large dataset, or perform evaluation in a production setting? Will the data be batched or streaming? These are some considerations I’ll have to think about when designing how nodes will share information to coordinate a big machine learning task.

For many problems, you’ll have to figure out how to split up the computation of a training matrix across the machines in the cluster. Two main ways of dividing the matrix are row-wise (by instance) or column-wise (by feature). If you have a problem like this, you’re already familiar with some of the details, such as how much RAM is needed to compute a feature, whether the machine learning algorithm chosen can vectorize each instance independently, or whether it needs to process multiple instances. I won’t dive into the details of the task-specific code here, since it is a bit beyond the scope of this post, but I can cover it in the topic of a future post for those that are interested.

So for now, let’s say we have a problem where each instance can be vectorized independently from all the other instances (and all of the features can be computed within the RAM limitations of a single machine).  We’ll choose a typical master-worker coordination pattern where a master node queues instances and worker nodes that pull from that queue and vectorize instances.

Before I start setting up auto-scaling, I’ll create AWS images for the master and worker nodes. Then I’ll test out the application on the image and configure the worker node so that on boot it pulls a unit of work from the master on boot.

Creating an auto-scaled spot instance cluster

Alright, we’ve figured out how we plan to distribute the computation of our problem over a cluster of machines.  We’ve chosen an instance type that meets our price/performance requirements, tested out code on it, and created Amazon Machine Images for each of the roles.  We’ll boot up our master server, and the idea is that the workers will be auto-scaled up.

1. Configure the AWS command line tools

In this step, we’re going to:

  1. Download the latest AWS command line tools for AutoScaling and CloudWatch,
  2. Set the environment variables JAVA_HOME, AWS_AUTO_SCALING_HOME, and AWS_CLOUDWATCH_HOME to where we downloaded those tools, and
  3. Set our account credentials

On your local computer:

$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/jre # replace with path to your Java
$ wget
$ wget
$ unzip
$ cd AutoScaling-
$ export AWS_AUTO_SCALING_HOME=`pwd` # set the AWS_AUTO_SCALING_HOME variable to this directory
$ cd ../CloudWatch-
$ cd ..
$ cp AutoScaling- credential-file # copy the credential file template
$ nano credential-file # edit credential file

This will open an editor to configure AWS access credentials. Log-in to the AWS security credentials website. In the Access Keys tab, copy the Access Key Id and Secret Access Key and replace the blanks in the file.  Save the file.

Now, let’s set permissions on the file and the corresponding environment variable:

$ chmod 600 credential-file # user can read-write
$ export AWS_CREDENTIAL_FILE=`pwd`/credential-file # set the AWS_CREDENTIAL_FILE

2. Create a launch configuration for our worker node, which is a virtual machine image and instance type.

I typically set up the init.d scripts on these workers so they will auto-update their code and start pulling in work when they boot.

$ AutoScaling- spot-lc --image-id ami-e565ba8c --instance-type c1.xlarge –-region us-west-2 --spot-price "0.66"

Replace the name, image-id, instance type and region with your values. Here’s the important part: the “–spot-price” parameter indicates we want to launch a spot instance. Without it, an on-demand instance will be launched. (More on how much to bid for the spot price later.) For now, let’s bid at the price of an on-demand instance.

3. Name the auto-scaling group

Give it a name and assign it to the launch configuration we just created:

$ AutoScaling- ml-asg --launch-configuration SpotLC --max-size 50 --min-size 1

4. Create scaling policies.

These are the actions to take in response to an alarm (i.e. trigger). Let’s create two policies: one to scale up and one to scale down in response to load.

$ AutoScaling- my-scaleout-policy -–auto-scaling-group ml-asg --adjustment=50 --type PercentChangeInCapacity
$ AutoScaling- my-scalein-policy –auto-scaling-group ml-asg --adjustment=-1  --type ChangeInCapacity

Each of these commands will return back a lengthy string like this, which is Amazon’s id for this scaling policy. Save this–you’ll need these in the next step.


Amazon lets you specify policies that scale in absolute number of machines or relative (percentage of current cluster size).  How fast you add/subtract machines from the cluster will be determined by what you choose in the next step.

5. Create alarms and associate them with the scaling policies.

Alarms are events that trigger an action. There’s a single command to both define the alarm condition and associate it with the scaling policy we just created (using the super long policy ID from the previous step). These parameters determine how quickly our farm will respond to load.

Those of you running machine learning tasks that are offline batch processed — where latency is less important than throughput — will be able to set more conservative (slower) timeouts. If you’re running an online learning system and want to reduce system latency, then you want to scale up quickly and scale down slowly (and pay the cost of over-provisioning) to reduce the queue size. The lower bound on responsiveness to load increases is going to be the boot time of the system — how long it takes from startup to when the machine can start taking some of the load off the queue. (If boot to work was instantaneous, we wouldn’t have to make this tradeoff.)

Amazon gives you lots of knobs to twiddle with in this call. Instead of monitoring for average CPUUtilization, you can monitor disk I/O, depending on what your bottleneck is. If your base load is spikey, you may want to play with what period to perform the averaging over, or how many consecutive evaluation-periods over which to monitor. The commands will end up looking something like this:

$ CloudWatch- --alarm-name AddCapacity  --metric-name CPUUtilization --namespace “AWS/EC2” --statistic Average --period 120 --threshold 80 --comparison-operator GreaterThanOrEqualToThreshold --dimensions “AutoScalingGroupName=ml-asg” --evaluation-periods 2 --alarm-actions arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:4ee9e543-86b5-4121-b53b-aa4c23b5bbcc:autoScalingGroupName/ml-asg:policyName/my-scaleout-policy
$ CloudWatch- --alarm-name SubtractCapacity  --metric-name CPUUtilization --namespace “AWS/EC2” --statistic Average --period 120 --threshold 20 --comparison-operator LessThanOrEqualToThreshold --dimensions “AutoScalingGroupName=ml-asg” --evaluation-periods 2 --alarm-actions arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:4ee9e543-86b5-4121-b53b-aa4c23b5bbcc:autoScalingGroupName/ml-asg:policyName/my-scalein-policy

6. Fire it up.

We’ve now got a machine learning farm in the cloud tapping extra cycles in Amazon’s grid. Start sending load to your system and watch as your alarms trigger and your farm automatically right-scales to load.

Final Bits: Dealing With Spikes

What happens when spot prices exceed on-demand prices or Amazon runs out of spot instance inventory? (This happens rarely, but if you can’t be interrupted, it’s something you have to plan for.) A simple script can monitor the spot price and, if it’s above the on-demand price, replace your launch configuration with on-demand instances.

Another way is to have a second auto-scaling cluster that uses an on-demand bidding policy. This can have a longer timeout so that it will always scale up after the spot cluster has scaled up and scale down before the spot cluster scales down. This way the on-demand cluster can always scale up to cover the load, even if the spot instances are unavailable.

I’ve shown you some of the capabilities of the Amazon Auto Scaling and how to take advantages of lower Spot Instance prices to keep your distributed machine learning system operating smoothly. Without too much work, you can set up a system to scale from 2 to 200 nodes, seemlessly absorbing whatever load you throw at it, while keeping your costs at a minimum.

Questions? Continue the discussion on Hacker News.

Mike Tung

Mike Tung is the CEO at Diffbot.