AI-Enabled R&D with NVIDIA Modulus on Rescale

Overview

In this tutorial, we show you how to get started with Modulus by solving a lid-driven cavity flow problem, a common benchmark for validating computational methods. This problem can be solved with traditional CFD methods, but Modulus is unique in that it uses AI-assisted methods that can accelerate initial design discovery in R&D and engineering scenarios.

NVIDIA Modulus is a neural network framework that blends the power of physics, in the form of governing partial differential equations (PDEs), with data to build parameterized surrogate models with near-real-time response. NVIDIA Modulus can support your work on AI-driven physics problems, designing digital twin models for complex non-linear, multi-physics systems, and solving parameterized geometries and inverse problems. Digital twins have emerged as powerful tools for tackling problems ranging from the molecular level, like drug discovery, up to global challenges like climate change. NVIDIA Modulus gives scientists a framework to build highly accurate digital reproductions of complex and dynamic systems, enabling the next generation of breakthroughs across a vast range of industries. Because these problems usually require large computational resources, Rescale can provide the best hardware for your work and all the assets you need to build your AI workflow in one place.

Modulus can be run either as a batch job using the Rescale command line or interactively using Rescale Workstations. See the tutorial steps below.

Batch job (Multi-node)

You can access and directly launch the sample job (Lid Driven Cavity Flow) by clicking the Import Job Setup button or view the results by clicking the Get Job Results button below.

You can access other use case examples on Rescale (using Modulus v21.06) with this link, which include blood flow in an intracranial aneurysm, multi-physics simulations of conjugate heat transfer, parameterized simulations and design optimization, a 3D heat sink, and many other possibilities.

Steps to run a batch job on Rescale

Select input file

Upload your job files, which in this case are Modulus Python scripts. These are loaded automatically when you select Import Job Setup above.

Select software

For this tutorial, we will use Singularity (Apptainer) to load the Modulus container and run the batch job on a single node or multiple nodes.

For a typical batch job, you can directly modify the pre-populated command line. This command will automatically use all the GPUs selected in the Hardware settings.

mpirun -np $[$RESCALE_GPUS_PER_NODE*$RESCALE_NODES_PER_SLOT] -N $RESCALE_GPUS_PER_NODE singularity exec --nv /usr/bin/Modulus_v21.06.sif python <your-code.py>
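
To illustrate the arithmetic (the node and GPU counts below are just an example, not a requirement of the tutorial): on two nodes with four GPUs each, the command expands to one MPI rank per GPU, four ranks per node:

mpirun -np 8 -N 4 singularity exec --nv /usr/bin/Modulus_v21.06.sif python <your-code.py>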

Notes

  1. You can append --xla=True to the end of the command above to enable Accelerated Linear Algebra (XLA), which accelerates training (an example follows the multi-node command below).
  2. For a multi-node batch job, the “network_checkpoint” folder sometimes cannot be detected the first time, so first create a folder named “network_checkpoint_XXX”.

For example (multi-node):

mkdir network_checkpoint_ldc_2d
mpirun -np $[$RESCALE_GPUS_PER_NODE*$RESCALE_NODES_PER_SLOT] -N $RESCALE_GPUS_PER_NODE singularity exec --nv /usr/bin/Modulus_v21.06.sif python ldc_2d.py
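
To enable XLA as described in note 1 above, append the flag to the end of the same command (shown here with the tutorial's ldc_2d.py example):

mpirun -np $[$RESCALE_GPUS_PER_NODE*$RESCALE_NODES_PER_SLOT] -N $RESCALE_GPUS_PER_NODE singularity exec --nv /usr/bin/Modulus_v21.06.sif python ldc_2d.py --xla=True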

Hardware

Recommended Rescale Coretypes:

  • Ankerite, Celestine (NVIDIA A100)
  • Dolomite, Aquamarine V3 (NVIDIA V100)
  • Aquamarine V2 (NVIDIA P100)

Check process_output.log for the output.

You can also open a terminal by clicking the Open in New Window button and run nvidia-smi to monitor GPU usage.
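
For example, to take a snapshot of GPU utilization, or to refresh it every two seconds (the interval here is just an illustration):

$ nvidia-smi
$ watch -n 2 nvidia-smi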

To access all the results files, including the trained model, go to the Shared directory by running this command:

$ cd ~/work/shared 
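
For example, to list the training outputs written by the ldc_2d example (the folder name comes from the multi-node example above; your run may write to a different checkpoint folder):

$ ls ~/work/shared/network_checkpoint_ldc_2d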

The surrogate model can now be queried for lid-driven cavity flow fields under different boundary conditions.

As you can see, with just a few mouse clicks, users can run Modulus jobs with multi-GPU and multi-node scaling through batch jobs on Rescale. Although Jupyter Notebook and an SSH console are available in the batch job workflow, visualization while the job is running is limited. In the next section, we show the steps to run a Modulus job interactively.

Interactive workflow (Multi-node)

You can access and directly launch the sample job (Lid Driven Cavity Flow) by clicking the Import Job Setup button or view the results by clicking the Get Job Results button below.

Select the End to End Desktop job type. The input file, software selection, and hardware settings are the same as in the batch job workflow.

Once the job is running, click the Connect button at the top to enter the desktop and open a terminal.

Similar to the batch job, the MPI command to run a Modulus simulation is:

$ cd ~/work/shared
$ mpirun -np $[$RESCALE_GPUS_PER_NODE*$RESCALE_NODES_PER_SLOT] -N $RESCALE_GPUS_PER_NODE singularity exec --nv /usr/bin/Modulus_v21.06.sif python <your-code.py>

Jupyter notebook

Open a new SSH terminal, cd to the working directory, and load the Modulus container:

$ cd ~/work/shared
$ singularity shell --nv /usr/bin/Modulus_v21.06.sif

This opens an interactive shell inside the container. From this shell, list any running notebook servers:

$ jupyter notebook list

This shows the token and URL of any running notebook server; copy the URL. If no server is running yet, start one with:

$ jupyter notebook

Paste the URL into the DCV web browser and accept any warnings.
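
If the server attempts to open a browser on the compute node, or you want it on a specific port, you can start it with standard Jupyter options instead (the port number below is just an illustration):

$ jupyter notebook --no-browser --port=8888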

Tensorboard

You can also start TensorBoard from within the Singularity container:

$ tensorboard --logdir=./

Paste https://localhost:6006/ into the DCV web browser to check the convergence.
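
If the training event files are written under the checkpoint folder, as with the ldc_2d example above, you can point TensorBoard at that folder directly (adjust the path to wherever your run writes its logs; the port flag is optional):

$ tensorboard --logdir=./network_checkpoint_ldc_2d --port 6006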

Paraview

The ParaView IndeX container is available from the NVIDIA NGC catalog: https://catalog.ngc.nvidia.com/orgs/nvidia-hpcvis/containers/paraview-index

Open a new terminal and pull the image:

$ singularity pull docker://nvcr.io/nvidia-hpcvis/paraview-index:5.7.0-egl-pvw

Start the image, binding the results data folder. For example, to start ParaView in the directory ~/work/shared:

$ singularity run --nv -B ${PWD}/network_checkpoint_ldc_2d:/data paraview-index_5.7.0-egl-pvw.sif

Here, we visualize the .vtu files under val_domain/results/. Open ParaView in the DCV web browser at https://localhost:8080/ and check your results.

Summary of running Modulus as an interactive job on Rescale

Through interactive jobs, users can develop code, monitor the neural network training process, and do post-processing within one job. With an interactive job, you can use visualization and other tools while developing and prototyping your model on a virtual workstation with the latest NVIDIA GPU hardware on Rescale. This allows you to develop and test your model on a workstation that you may not have on-premises access to. Furthermore, if needed, you can scale out your development workflow to multiple nodes using the Singularity (Apptainer) container runtime, while still having an interactive GUI environment.

Summary and additional resources

We hope that this tutorial helped you get started with running Modulus on Rescale. As you can see, you can run your Modulus workflow across multiple GPU nodes through a Rescale batch job, or you can run Modulus and other tools interactively on a Rescale End-to-End Desktop job to develop code, monitor training, and post-process results on a virtual workstation with multiple GPUs. Our platform provides various GPU architectures for you to choose from and all the assets you need to build your AI workflow.

You can use Modulus to produce surrogate models for your rapid design exploration studies. It can be applied to almost any engineering design problem that can be expressed in the form of partial differential equations (PDEs), ranging from fluid dynamics to structural dynamics and multidisciplinary optimization. For more information, please refer to the NVIDIA Modulus docs and Rescale's Modulus presentation at NVIDIA GTC 2022.