GPU


Job Submission Example

To submit a job to a GPU server, use the gpu queue and specify the number of GPU cards you wish to use. The following is an example job script for running GPU-accelerated GROMACS:

#!/bin/bash

#$ -M netid@nd.edu   # Email address for job notification
#$ -m abe            # Send mail when job begins, ends and aborts
#$ -pe smp 1         # Specify parallel environment and legal core size
#$ -q gpu            # Run on the GPU cluster
#$ -l gpu_card=1     # Run on 1 GPU card
#$ -N job_name       # Specify job name

module load gromacs  # Required modules

export OMP_NUM_THREADS=$NSLOTS
gmx mdrun -ntomp $OMP_NUM_THREADS -nb gpu -pin on -v -s input.tpr # Run with $NSLOTS OpenMP threads and 1 GPU device

Note

  • If the -pe parallel environment is not defined in the job script, the default value is smp 1. Please always make sure to request enough cores for your GPU jobs.

  • Please note that the runtime limit for GPU systems is 7 days.

  • Each job must have at least 1 GPU and 1 core to run.
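
Once the script is ready, it can be submitted from a front-end machine with qsub and monitored with qstat. A minimal sketch, assuming the script above was saved as gromacs_gpu.job (a hypothetical file name) and netid is your own NetID:

qsub gromacs_gpu.job   # Submit the job script to the scheduler
qstat -u netid         # Check the status of your submitted jobs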


Installing Software on a GPU machine

In some cases, it is necessary to use a GPU server to install the software you wish to use for your GPU jobs.

Note

  • Please note that the CRC does not provide any front end machines with GPUs.

To install software, you need an interactive session on a GPU node. The following example starts an interactive session on a GPU system with 1 GPU card and 1 core:

qrsh -q gpu -l gpu_card=1 -pe smp 1

Once the connection is established, the required software may be installed.
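
As a quick sanity check before installing, you can confirm that a GPU is visible from the interactive session. This assumes nvidia-smi is on the PATH of the GPU node, which is typically the case where the NVIDIA driver is installed:

nvidia-smi   # Lists the visible GPU(s), driver version and supported CUDA version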

Note

  • If your research lab or faculty advisor has purchased one or more machines, there is most likely a host group you can target. For installation, you can target GPUs in a specific host group by using the gpu@@hostgroupname queue (see the example after these notes).

  • Before installing and using software for GPU jobs, please make sure that the software can actually take advantage of GPUs.
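
For example, to start an interactive session on a node from a specific host group (hostgroupname below is a placeholder; substitute your lab's actual host group name):

qrsh -q gpu@@hostgroupname -l gpu_card=1 -pe smp 1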

CUDA And cuDNN Module Availability

Installation usually requires the CUDA (Compute Unified Device Architecture) toolkit and, in some cases, the cuDNN (CUDA Deep Neural Network) library as well. Several versions of these libraries are available on the CRC system:

$ module avail cuda
----------------------------------- /afs/crc.nd.edu/x86_64_linux/Modules/modules/development_tools_and_libraries ------------------------------------
cuda/10.0  cuda/10.2  cuda/11.0  cuda/11.2  cuda/11.6
$ module avail cudnn
----------------------------------- /afs/crc.nd.edu/x86_64_linux/Modules/modules/development_tools_and_libraries ------------------------------------
cudnn/7.4  cudnn/8.0.4  cudnn/v7.0
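
To make one of these versions available in your environment (in an interactive session or at the top of a job script), load the corresponding modules. The versions below are taken from the listing above; check that the cuDNN version you load is compatible with your chosen CUDA version:

module load cuda/11.6     # Load a CUDA toolkit version from the list above
module load cudnn/8.0.4   # Load a cuDNN version compatible with that CUDA release
nvcc --version            # Confirm which CUDA compiler is now on your PATH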

Note

  • If you wish to use a CUDA/cuDNN version that is not installed on the CRC system, you may install other versions with Conda (see the sketch after this note).
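
A minimal sketch of installing CUDA/cuDNN into a Conda environment, assuming Conda is already available in your session and that the conda-forge packages shown (cudatoolkit, cudnn) provide the versions you need; the environment name and version numbers are only illustrative:

conda create -n gpu-env python=3.10                   # Create a new environment (name is arbitrary)
conda activate gpu-env
conda install -c conda-forge cudatoolkit=11.8 cudnn   # Install a CUDA runtime plus cuDNN into the environment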


Available Hardware For General Access

You can find a list of the CRC owned GPU systems on Available Hardware.

Note

  • These machines typically have 24 cores and 4 GPUs per node.

  • Each GPU has an ID within the machine; this ID can be 0, 1, 2, or 3.


Resource and Job Monitoring

You can monitor the status and availability of GPU resources with the free_gpus.sh script.

free_gpus.sh @crc_gpu        # For general access
free_gpus.sh @crc_1080ti     # You can target host groups

The Xymon monitoring system can be used to analyze the behavior of processes on a given GPU machine.

You can also check your job’s GPU usage. To do so, you need to know the GPU ID assigned to your job, which you can find with the following command:

qstat -j jobID

The ID appears in the resource map field of the output. In the following example, the GPU ID is the number in parentheses:

resource map          1:    gpu_card=qa-1080ti-004.crc.nd.edu=(1)
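
With the host name and GPU ID from the resource map, you can then inspect that specific card's utilization with nvidia-smi. This sketch assumes you are permitted to log in to the node while your job is running there; the host name is taken from the example output above:

ssh qa-1080ti-004.crc.nd.edu    # Log in to the node listed in the resource map
nvidia-smi -i 1                 # Show utilization and memory usage for GPU ID 1 only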

Other Resources For GPU Jobs

If you wish to run a large number of GPU jobs, you may want to consider submitting them via Condor. You can find detailed documentation and examples on HTCondor.

If you wish to use GPUs for Machine Learning, you may want to consider using CAML ND. You can find detailed documentation and examples on CAML.