Deep Learning on Rescale

Deep learning is a subfield of machine learning that focuses on predictive models with large numbers of parameters, typically organized as a layered computational graph. It is fast becoming the preferred modeling approach for large datasets whose samples have many features.

Rescale provides GPU-based HPC nodes and clusters for training deep learning models in the cloud. Rescale supports batch training of models as well as interactive data analysis through Rescale Desktops. A wide variety of GPU configurations are available, from lower-cost previous-generation K80s to the latest multi-GPU P100s with NVLink interconnect. Clusters can be preconfigured with your choice of the most popular deep learning frameworks.

This page presents Rescale job examples for four different applications. Click the Import Job Setup button to clone an example job into your account, which you can then submit. Click the Get Job Results button to review the full setup and results of a completed example.

For more information on how to set up and submit a Basic Job, please refer to the tutorial here.

For more information on how to set up and launch a Desktop Session, please refer to the tutorial here.

Supported Frameworks and Applications

TensorFlow is a popular open-source C++ and Python framework for high-performance computation over dataflow graphs. It is typically used to train deep neural network models and then run inference with them.

Here are some benchmark results comparing Rescale’s NVLinked P100 system (Rescale Amethyst) with the high-performance DGX-1 deep learning server. We see that Rescale can achieve comparable performance to high-end on-premises GPU servers.

TensorFlow Amethyst InceptionV3 Benchmark Results

MNIST Softmax Regression Example

The first example trains a classification model on the classic MNIST handwritten digit dataset. We train a simple softmax regression model; the input training script is found here.
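As a rough illustration of what softmax regression computes, here is a minimal sketch. It is not the training script used by the Rescale job: it assumes NumPy instead of TensorFlow, and it uses small synthetic data (three 2-D Gaussian blobs) as a hypothetical stand-in for the MNIST images. The function names, learning rate, and epoch count are all illustrative choices.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_softmax_regression(x, y, num_classes, lr=0.1, epochs=200):
    """Fit weights w and bias b by full-batch gradient descent on cross-entropy."""
    n_samples, n_features = x.shape
    w = np.zeros((n_features, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[y]
    for _ in range(epochs):
        probs = softmax(x @ w + b)
        # Gradient of the mean cross-entropy loss with respect to the logits.
        grad_logits = (probs - one_hot) / n_samples
        w -= lr * (x.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return w, b


def predict(x, w, b):
    """Return the most likely class index for each row of x."""
    return np.argmax(x @ w + b, axis=1)


def make_toy_data(n_per_class=100, seed=0):
    """Synthetic stand-in for MNIST: three well-separated 2-D Gaussian blobs."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])
    x = np.vstack([c + rng.normal(size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return x, y


if __name__ == "__main__":
    x, y = make_toy_data()
    w, b = train_softmax_regression(x, y, num_classes=3)
    accuracy = (predict(x, w, b) == y).mean()
    print(f"training accuracy: {accuracy:.3f}")
```

For MNIST the same structure would apply with 784 input features (flattened 28x28 pixels) and 10 classes, typically trained on mini-batches rather than the full dataset at once.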

Inception V3 Example

The second example is TensorFlow’s image recognition model, Inception V3. This job corresponds to the benchmark results in the table above using four P100s.

Keras is a high-level neural network Python framework built on top of TensorFlow, CNTK, and Theano. It supports higher-order primitives specifically for convolutional and recurrent neural networks.

Here is an example of training a classification model on the classic MNIST handwritten digit dataset. We will train a simple multi-layer perceptron model using this input training script.

PyTorch is the Python port of the Torch deep learning framework. PyTorch is known for great support for building dynamic neural networks and doing reinforcement learning.

Super Resolution Example

The first example trains a model and then uses it to perform “super-resolution”: scaling up an image while minimizing noise.

Original image

Super Resolution image

DCGAN Example

The next example trains a Deep Convolutional Generative Adversarial Network (DCGAN), which generates new, realistic fake images that are similar to the input training images. This example is trained on the LSUN bedroom image dataset using Rescale’s Amethyst NVLink-connected P100 systems.

Original images

Synthesized fake images

LSTM DOE Example

So far, the examples have focused on training a single model for a task, with a set of parameters selected by the user building the model. This example is instead a Design of Experiments (DOE), which is a sensitivity analysis over one or more parameters. It uses the Rescale DOE framework to build many models, randomly sampling the hyper-parameters that define each model. This case builds an LSTM model to perform word-level modeling.

In this case, we perform Monte Carlo sampling of the following LSTM model parameters:

LSTM DOE Parallel Settings

The embed_size and n_hidden parameters determine how many nodes are in the network. The dropout parameter is used to control overfitting to the input data. Finally, batch_size determines how many examples we train on at a time.
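To make the Monte Carlo sampling concrete, the sketch below draws random hyper-parameter sets for the four parameters named above. It is only an illustration of the idea, not the Rescale DOE framework itself. The numeric ranges, the uniform distributions, and the function names are hypothetical; the ranges used in the actual example are defined in its DOE settings.

```python
import random

# Hypothetical search ranges, for illustration only. The ranges used in the
# Rescale example come from its DOE settings and are not reproduced here.
SEARCH_SPACE = {
    "embed_size": (64, 512),   # integer range, inclusive
    "n_hidden": (64, 512),     # integer range, inclusive
    "dropout": (0.0, 0.7),     # float range
    "batch_size": (16, 128),   # integer range, inclusive
}


def sample_case(rng):
    """Draw one hyper-parameter set uniformly at random from SEARCH_SPACE."""
    case = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if isinstance(low, int):
            case[name] = rng.randint(low, high)
        else:
            case[name] = round(rng.uniform(low, high), 3)
    return case


def sample_cases(n_cases, seed=0):
    """Return n_cases independent samples; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [sample_case(rng) for _ in range(n_cases)]


if __name__ == "__main__":
    for i, case in enumerate(sample_cases(5), start=1):
        print(i, case)
```

Each sampled dictionary corresponds to one model to train. Because the cases are independent, they can be trained in parallel and their results compared afterwards.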

LSTM DOE Chart Results

Singularity containers are a tool for packaging up applications and running them on various host systems reproducibly. Singularity can import most Docker containers without issue and can be easily deployed as a user application that can run without any administrative privileges.

As of version 2.3, Singularity supports running containers that also use GPUs running CUDA applications, making it a useful choice for running packaged deep learning jobs.

Singularity Software Settings

The “--nv” flag in the command line above instructs Singularity to pass through the host GPU interface to the container, enabling CUDA applications to run inside. This particular example runs the TensorFlow CNN benchmarks in a container on one or more GPUs.
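For readers who want to see where the flag sits in a command, here is a small sketch that assembles a `singularity exec` invocation with the GPU flag. The helper function is hypothetical, and the image name, script name, and script argument are illustrative assumptions rather than the values used in the Rescale job.

```python
import shlex


def singularity_gpu_command(image, command, use_gpu=True):
    """Build an argument list for `singularity exec`, optionally with --nv."""
    argv = ["singularity", "exec"]
    if use_gpu:
        # --nv exposes the host NVIDIA driver and GPU devices to the container.
        argv.append("--nv")
    argv.append(image)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    # Hypothetical image and script names, for illustration only.
    argv = singularity_gpu_command(
        "tensorflow_gpu.simg",
        ["python", "tf_cnn_benchmarks.py", "--num_gpus=1"],
    )
    print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` on a machine where Singularity is installed.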