Training a neural network is a computationally intensive task. Modern deep learning frameworks, like Tensorflow, can use specialized hardware to speed things up. On a consumer computer, that typically means an NVIDIA GPU card of the kind usually used for gaming. If no such hardware is available and you have the option to use a money cannon, you can use a cloud-based service. More sophisticated options exist as well, but in this post I will show you how to set up a raw Linux instance in the Amazon cloud. The benefit of doing that is that you can use the exact same codebase and commands that you would use on your local computer.
Early disclaimer: a fresh AWS account doesn’t have the rights to create GPU instances until you file a support ticket asking for the instance limit to be raised. The following instructions assume you have an account and an instance limit above zero.
If you’ve ever set up a local computer to run Tensorflow on a GPU, you know it’s not exactly straightforward. That’s why it’s nice to know that AWS EC2 offers ready-made images with all the drivers and libraries installed.
Create the instance
Navigate to EC2 Dashboard in AWS Console and select Launch Instance
1. Choose AMI
Find and select “Deep Learning AMI (Ubuntu)” as the base image for the instance. I chose Ubuntu over Amazon Linux simply because it’s more familiar to me.
2. Choose instance type
P2 and P3 types are general-purpose GPU-enabled instances. P3 gets really expensive really fast, so going with P2 will get you started without instantly swallowing your credit card. The difference between p2.xlarge/p2.8xlarge/p2.16xlarge is the number of GPUs available (1/8/16). More GPUs cost more, so you might want to start with p2.xlarge and a single GPU. Getting your training to run on multiple GPUs may also require additional steps.
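To get a rough feel for the price difference between the instance sizes, here is a back-of-the-envelope cost sketch in Python. The hourly rates below are illustrative assumptions only, not authoritative prices; check the current EC2 pricing page for real numbers:

```python
# Back-of-the-envelope EC2 GPU cost estimate.
# NOTE: hourly rates are illustrative placeholders, not current AWS prices.
HOURLY_RATE_USD = {
    "p2.xlarge": 0.90,     # 1x K80 (assumed example rate)
    "p2.8xlarge": 7.20,    # 8x K80
    "p2.16xlarge": 14.40,  # 16x K80
}

def training_cost(instance_type: str, hours: float) -> float:
    """Estimated cost of keeping an instance running for `hours`."""
    return HOURLY_RATE_USD[instance_type] * hours

for itype in HOURLY_RATE_USD:
    print(f"{itype}: ~${training_cost(itype, 3):.2f} for a 3 hour session")
```

Remember that you pay for the whole time the instance is running, not just while training, so stop the instance when you are done.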
3. Configure Instance Details
If you have no idea what you’re doing, you can just skip this step and continue with default values.
4. Add Storage
My experience with the Deep Learning AMI was that the default 75 GB of storage filled up really quickly. Your options are to remove extra libraries included in the image or to make the storage a bit bigger.
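Since the disk tends to fill up, it can be handy to check free space programmatically, for example before saving model checkpoints. A minimal standard-library sketch, assuming the root volume is mounted at "/" as it is on the Deep Learning AMI:

```python
import shutil

def free_gb(path: str = "/") -> float:
    """Return free disk space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

# Warn well before the disk is actually full.
if free_gb("/") < 10:
    print("Warning: less than 10 GiB left on the root volume")
```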
5. Add Tags
Add tags if you have plenty of other resources on AWS and want to attach metadata to them. Otherwise it’s fine to skip this step.
6. Configure Security Group
This is an important step. Rules that are too strict leave you with no access to your services; rules that are too open welcome the whole world into your server. As a bad but simple example, I’m using Anywhere as the source. If you can limit access to your own IP or IP range, do it. You will need SSH access, which should be included by default. If you want to use Tensorboard for monitoring the training process, you must also add a Custom TCP Rule for port 6006.
7. Review and Launch
Check through the shown settings and tips and launch the instance.
8. Select a key pair
Create a new key pair for SSH access and download it. And finally: don’t lose it!
9. Wait for the instance to be created
You can monitor the EC2 Instances list for the Instance State to change from pending to running.
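Instead of refreshing the console, you could also poll the state yourself. The helper below is a generic sketch; in practice `get_state` would wrap a boto3 `describe_instances` call (boto3 and the instance id are assumptions, not part of the setup above), so here it takes any callable that returns the current state string:

```python
import time

def wait_until_running(get_state, poll_seconds=5, timeout=300):
    """Poll get_state() until it returns 'running' or the timeout expires.

    get_state is any zero-argument callable returning the instance state
    string, e.g. a wrapper around boto3's describe_instances.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() == "running":
            return True
        time.sleep(poll_seconds)
    return False

# Example with a stand-in for the real AWS call:
states = iter(["pending", "pending", "running"])
print(wait_until_running(lambda: next(states), poll_seconds=0))  # True
```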
Access the instance
1. Take the key into use
This varies a bit depending on the OS your local computer is running. These instructions are for Ubuntu.
$ cp Downloads/donkeycar_aws_key.pem ~/.ssh/
$ sudo chmod 600 ~/.ssh/donkeycar_aws_key.pem
$ ssh-add ~/.ssh/donkeycar_aws_key.pem
2. Check the public DNS name for the instance
3. Login into the server with SSH
The default username for the Deep Learning AMI (Ubuntu) is ubuntu. For the Amazon Linux variant it should be ec2-user.
$ ssh ubuntu@&lt;instance-public-dns&gt;
4. Fix a few things
The server will be missing two environment variables: one for the locale and one for the CUDA library path.
# Select the simplest locale
$ echo 'export LC_ALL=C' >> ~/.bashrc
# Add CUDA 9.0 libraries to the library path
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/' >> ~/.bashrc
# Source .bashrc again to load the new variables immediately
$ source ~/.bashrc
Setup Donkeycar and dependencies
Steps are roughly the same as in Donkeycar docs.
Disclaimer: these instructions are for Donkeycar version 2.5.1, as there is something fishy about the Python dependencies in the current 2.5.8.
$ sudo apt-get install virtualenv build-essential python3-dev gfortran libhdf5-dev
$ virtualenv env -p python3
# Activate the virtualenv on every login
$ echo 'source ~/env/bin/activate' >> ~/.bashrc
$ source ~/.bashrc
$ git clone https://github.com/autorope/donkeycar.git
$ cd donkeycar
# Check out the 2.5.1 tag into a new branch
$ git checkout -b 2.5.1 tags/2.5.1
# Install Donkeycar dependencies
$ pip install -e .
# Install GPU-enabled Tensorflow. Version 1.12.0 still uses CUDA 9.0;
# TF 1.13.1 with CUDA 10 will crash because of a wrong cuDNN version.
$ pip install tensorflow_gpu==1.12.0
# Create a car instance
$ donkey createcar ~/car --template donkey2
Train Donkeycar using the fresh instance
First, move some data to the instance from the Donkeycar Raspberry Pi (run this on the Pi):
$ rsync -r ~/car/tubs/tub1 ubuntu@&lt;instance-public-dns&gt;:~/donkeydata
Then train a model
$ cd ~/car
$ python manage.py train --tub ../donkeydata/tub1 --model models/ec2-first
Finally, move the trained model from the cloud back to the car:
$ rsync ubuntu@&lt;instance-public-dns&gt;:~/car/models/ec2-first ~/car/models/
I benchmarked training a really small dataset of 2000 records on a Dell XPS 9560 laptop with a GTX 1050 GPU versus a p2.xlarge instance with a Tesla K80 GPU. Somewhat surprisingly, the winner was the laptop when its GPU was used; the p2.xlarge instance came second and the laptop CPU last. Results below:
| Contestant | Result for 10 epochs / 2000 records |
|---|---|
| Dell XPS GPU (GTX 1050) | 36 seconds |
| AWS p2.xlarge | 49 seconds |
| Dell XPS CPU | 1 minute 27 seconds |
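The measurements can be turned into relative speedups with a few lines of Python, which makes the comparison concrete (times in seconds, taken from the results above):

```python
# Training times for 10 epochs / 2000 records, from the benchmark above.
times_s = {
    "Dell XPS GPU (GTX 1050)": 36,
    "AWS p2.xlarge (Tesla K80)": 49,
    "Dell XPS CPU": 87,  # 1 minute 27 seconds
}

baseline = times_s["Dell XPS CPU"]
for name, t in times_s.items():
    print(f"{name}: {baseline / t:.2f}x the speed of the laptop CPU")
```

So the laptop GPU was roughly 2.4x faster than the laptop CPU, and the p2.xlarge roughly 1.8x faster.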
Extra: Use Tensorboard as GUI for Tensorflow
This should probably be a separate post, but I’ll cover it briefly here anyway.
Tensorboard is a graphical interface for Tensorflow. It uses saved log files to show what Tensorflow has done. You can monitor training in near real time or just inspect model architectures. Going beyond the default setup, you can also save all kinds of debug data during training, including images, videos, histograms and other visualizations.
1. Register the Tensorboard callback
Tensorboard is activated by adding a callback to the training process. In this Donkeycar context there are already a couple of callbacks in the Keras part; you just need to add one more.
# First add the callback to the imports
from tensorflow.python.keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard  # <- TensorBoard added
# Import date helpers for log directory timestamps
import datetime

## OMITTED CODE until KerasPilot.train

# Build the log_dir path using a timestamp or another unique identifier
# to get the full potential of the Tensorboard UI
date = datetime.datetime.now().strftime('%y-%m-%d-%H-%M')
tbCallBack = TensorBoard(log_dir=('./tensorboard_logs/%s' % date),
                         histogram_freq=0, write_graph=True, write_images=True)
callbacks_list = [save_best, tbCallBack]
hist = self.model.fit_generator(callbacks=callbacks_list,
### OMITTED REST OF THE CODE
2. Start the server
After some training has been done, or while it is still running, you can start the Tensorboard server.
$ cd ~/donkeycar
$ tensorboard --logdir tensorboard_logs
# Should now be running on port 6006
Browse to your EC2 instance’s public DNS name on port 6006 with your web browser.
Fiddle around and enjoy.
Further tips about what Tensorboard is capable of can be found, for example, in this video.