Training a Donkeycar on an AWS GPU instance

Training a neural network is quite an intensive task for a computer. Modern deep learning platforms, like Tensorflow, can utilize specialized hardware to speed things up. On a basic consumer computer this usually means an NVIDIA GPU card of the kind used for gaming. If no such hardware is available and you have the option to use a money cannon, it’s possible to use a cloud-based service. There are more sophisticated options available too, but in this post I will show you how to set up a raw Linux instance in the Amazon cloud. The benefit of doing that is that you can use the exact same codebase and commands as you would on your local computer.

Early disclaimer: a fresh AWS account doesn’t have the rights to create GPU instances until you file a support ticket to raise the instance limit. The following instructions assume you have an account and an instance limit above zero.

If you’ve ever set up a local computer to run Tensorflow on a GPU, you know it’s not exactly straightforward. That’s why it’s nice to know that AWS EC2 offers ready-made images with all the drivers and libraries installed.

Create the instance

Navigate to EC2 Dashboard in AWS Console and select Launch Instance

[Screenshot: Create instance]

1. Choose AMI

Find and select “Deep Learning AMI (Ubuntu)” as the base image for the instance. I chose Ubuntu over Amazon Linux simply because it’s more familiar to me.

[Screenshot: Choose AMI]

2. Choose instance type

P2 and P3 types are general-purpose GPU-enabled instances. P3 gets really expensive really fast, so going with P2 will get you started and still won’t swallow your credit card instantly. The difference between p2.xlarge/p2.8xlarge/p2.16xlarge is the number of GPUs available (1/8/16). More GPUs cost more, so you might want to start with p2.xlarge and a single GPU. Getting your training to run on multiple GPUs might also require additional steps.

[Screenshot: Choose instance type]

3. Configure Instance Details

If you have no idea what you’re doing, you can just skip this step and continue with default values.

4. Add Storage

My experience with the Deep Learning AMI was that the default 75GB of storage filled up really quickly. Your options are to remove extra libraries included in the image or to make the storage a bit bigger.

[Screenshot: Add storage]

5. Add Tags

Add tags if you have plenty of other resources on AWS and you want to attach metadata to them. Otherwise it’s ok to skip this step.

6. Configure Security Group

This is an important step. Too strict rules and you have no access to the services; too open rules and you’re welcoming the whole world into your server. As a bad but simple example, I’m using Anywhere as the source. If you have the possibility to limit access to your own IP or IP range, do it. You will need SSH access, which should be added by default. If you want to use Tensorboard for monitoring the training process, you must also add a Custom TCP Rule for port 6006.

[Screenshot: Configure access]
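The same rule can also be added from the command line if you prefer. Here’s a minimal sketch with the AWS CLI, assuming you have it configured; the security group ID and the IP below are placeholders you need to replace with your own values:

$ # Allow Tensorboard traffic on port 6006 from a single IP only
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 6006 \
    --cidr 203.0.113.7/32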

7. Review and Launch

Read through the settings and tips shown, then launch the instance.

8. Select a key pair

Create a new key pair for SSH access and download it. And finally: don’t lose it!

[Screenshot: Select a key pair]

9. Wait for the instance to be created

You can monitor the EC2 Instances list for the Instance State to change from pending to running.
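If you end up creating instances repeatedly, the whole wizard can also be scripted. Here is a rough AWS CLI sketch of the same launch; the AMI ID, key pair name and security group ID are placeholders, and you’d look up the real Deep Learning AMI ID for your own region:

$ # Launch a single-GPU p2.xlarge with a 100GB root volume
$ aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type p2.xlarge \
    --key-name donkeycar_aws_key \
    --security-group-ids sg-0123456789abcdef0 \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'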

Access the instance

1. Set up the key

This varies a bit depending on the OS your local computer is running. These instructions are for Ubuntu.

$ cp Downloads/donkeycar_aws_key.pem ~/.ssh/
$ chmod 600 ~/.ssh/donkeycar_aws_key.pem
$ ssh-add ~/.ssh/donkeycar_aws_key.pem
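Optionally, once you know the public DNS name (next step), a plain OpenSSH host alias saves you from typing the full address on every login. This is standard SSH configuration, nothing AWS-specific, and the alias name is made up:

# Add to ~/.ssh/config
Host donkey-aws
    HostName ec2-34-243-176-63.eu-west-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/donkeycar_aws_key.pem

# After that, logging in is just:
$ ssh donkey-aws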
2. Check the public DNS name for the instance

[Screenshot: Public DNS name]

3. Log in to the server with SSH

The default username for the Deep Learning AMI (Ubuntu) is ubuntu. For the Amazon Linux variant it should be ec2-user.

$ ssh ubuntu@ec2-34-243-176-63.eu-west-1.compute.amazonaws.com
4. Fix a few things

The server will be missing two environment variables: one for the locale and one for the Cuda library path.

# Select the simplest locale
$ echo 'export LC_ALL=C' >> ~/.bashrc
# Add Cuda 9.0 libraries to library path
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/' >> ~/.bashrc
# Run .bashrc again to include the envs immediately
$ source ~/.bashrc
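
Now is a good moment to verify the basics before going further. Both tools below ship with the AMI:

$ # The GPU (a Tesla K80 on p2.xlarge) should be listed here
$ nvidia-smi
$ # Check how much of the root volume is free; the Deep Learning AMI fills up fast
$ df -h /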

Setup Donkeycar and dependencies

The steps are roughly the same as in the Donkeycar docs.

Disclaimer: These instructions are for Donkeycar version 2.5.1, as there is something fishy about the Python dependencies in the current 2.5.8.

$ sudo apt-get install virtualenv build-essential python3-dev gfortran libhdf5-dev
$ virtualenv env -p python3

# Activate the virtualenv on every login
$ echo 'source env/bin/activate' >> ~/.bashrc
$ source ~/.bashrc

$ git clone https://github.com/autorope/donkeycar.git
$ cd donkeycar
# Revert to 2.5.1 by checking out the tag
$ git checkout tags/2.5.1
# Install Donkeycar dependencies
$ pip install -e .
# Install GPU-enabled Tensorflow. Version 1.12.0 still uses Cuda 9.0; going with TF 1.13.1 and Cuda 10 will crash because of a wrong cuDNN version.
$ pip install tensorflow_gpu==1.12.0

# Create car instance
$ donkey createcar ~/car --template donkey2
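
Before moving any data over, it doesn’t hurt to confirm that this Tensorflow build actually sees the GPU. With the TF 1.x API, a one-liner like this should print True:

$ python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"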

Train Donkeycar using the fresh instance

First, move some data from the Donkeycar Raspberry Pi to the instance:

$ rsync -r ~/car/tubs/tub1 ubuntu@ec2-34-243-176-63.eu-west-1.compute.amazonaws.com:~/donkeydata

Then train a model:

$ cd ~/car
$ python manage.py train --tub ../donkeydata/tub1 --model models/ec2-first
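
If you have recorded more than one tub, the --tub flag should also take a comma-separated list (at least in the 2.5.x series, if memory serves), something like:

$ python manage.py train --tub ../donkeydata/tub1,../donkeydata/tub2 --model models/ec2-first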

Finally, move the trained model from the cloud back to the car:

$ rsync ubuntu@ec2-34-243-176-63.eu-west-1.compute.amazonaws.com:~/car/models/ec2-first ~/car/models/

Performance

I measured training a really small dataset of 2000 records on a Dell XPS 9560 laptop with a GTX 1050 GPU versus a p2.xlarge instance with a Tesla K80 GPU. Somewhat surprisingly, the winner was the laptop when its GPU was used. The p2.xlarge instance came second and the laptop CPU last. Results below:

Contestant                | Result for 10 epochs / 2000 records
Dell XPS GPU (GTX 1050)   | 36 seconds
AWS p2.xlarge (Tesla K80) | 49 seconds
Dell XPS CPU              | 1 minute 27 seconds

Extra: Use Tensorboard as GUI for Tensorflow

This should probably be a separate post, but I’ll cover it briefly here anyway.

Tensorboard is a graphical interface for Tensorflow. It uses saved log files to show what Tensorflow has done. You can monitor the training in near real time or just inspect the model architectures. Going beyond the default setup, you can also save all kinds of debug data during training, including images, videos, histograms and other visualizations.

1. Register the Tensorboard callback

Tensorboard is activated by adding a callback to the training process. In the Donkeycar context there are already a couple of callbacks in the Keras part (KerasPilot.train); you just need to add one more.

# First add TensorBoard to the callback imports
from tensorflow.python.keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard # <- TensorBoard added
# Import date helpers for log directory timestamps
import datetime

## OMITTED CODE until KerasPilot.train

# Build the log_dir path from a timestamp (or another unique identifier) so that
# every training run shows up as its own entry in the Tensorboard UI
date = datetime.datetime.now().strftime('%y-%m-%d-%H-%M')
tbCallBack = TensorBoard(log_dir='./tensorboard_logs/%s' % date,
                         histogram_freq=0, write_graph=True, write_images=True)

callbacks_list = [save_best, tbCallBack]

hist = self.model.fit_generator(callbacks=callbacks_list,
### OMITTED REST OF THE CODE
2. Start the server

After some training has been done, or is currently running, you can start the Tensorboard server. Note that the log directory is created relative to the directory the training was run from, which in this setup is ~/car.

$ cd ~/car
$ tensorboard --logdir tensorboard_logs
# Tensorboard should now be listening on port 6006
3. Use

Browse to your EC2 instance and port 6006 with your chosen web browser.

For example http://ec2-34-243-176-63.eu-west-1.compute.amazonaws.com:6006
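
If you’d rather not open port 6006 to the whole world in the security group, a standard SSH port forward over the existing SSH access does the same job:

$ # Forward local port 6006 to the instance, then browse to http://localhost:6006
$ ssh -L 6006:localhost:6006 ubuntu@ec2-34-243-176-63.eu-west-1.compute.amazonaws.com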

Fiddle around and enjoy.

Further tips about what Tensorboard is capable of can be found, for example, in this video.