.. _gpu:

#######################
GPU
#######################

............................

Job Submission Example
=========================

In order to submit a job to a GPU server, you need to use the ``gpu`` queue and specify the number of GPU cards you wish to use. The following is a job script example for running GROMACS accelerated with GPU:

.. code-block:: shell

	#!/bin/bash

	#$ -M netid@nd.edu   # Email address for job notification
	#$ -m abe            # Send mail when job begins, ends and aborts
	#$ -pe smp 1         # Specify parallel environment and legal core size
	#$ -q gpu            # Run on the GPU cluster
	#$ -l gpu_card=1     # Run on 1 GPU card
	#$ -N job_name       # Specify job name

	module load gromacs  # Required modules

	export OMP_NUM_THREADS=$NSLOTS
	gmx mdrun -ntomp $OMP_NUM_THREADS -nb gpu -pin on -v -s input.tpr # Run with 16 MPI tasks and 1 GPU devices
	

.. note::
        * If the ``-pe`` parallel environment is not defined in the job script, the default value is ``smp 1``. Please always make sure to request enough cores for your GPU jobs.
        * Please note that the runtime limit for GPU systems is 7 days.
        * **Each job must have at least 1 GPU and 1 core to run.**

............................

Installing Software on a GPU machine
=========================================
In some cases, it is necessary to use a GPU server to install the software you wish to use for your GPU jobs.

.. note::
        * Please note that the CRC does not provide any front end machines with GPUs. 

For the installation an interactive session is necessary on a GPU node. The following is an example for starting an interactive session on a GPU system with 1 GPU card and 1 core: 

.. code-block:: shell

	qrsh -q gpu -l gpu_card=1 -pe smp 1

Once the connection is established, the required software may be installed.

.. note::
        * If your research lab or faculty advisor has purchased a machine(s), there is most likely a host group you can target. For the installation you can target GPUs in a specific host group by using the ``gpu@@hostgroupname`` queue.
        * Before installing and using a software for GPU jobs, please make sure that the software can take advantage of GPUs.

CUDA And cuDNN Modules' Avalability
======================================
Usually, for the installation the **CUDA** (Compute Unified Device Architecture) library is necessary and in some cases the **cuDNN** (CUDA for Deep Neural Network) library too. Many versions of these libraries are available on the CRC system:

.. code-block:: shell

	$ module avail cuda
	----------------------------------- /afs/crc.nd.edu/x86_64_linux/Modules/modules/development_tools_and_libraries ------------------------------------
	cuda/10.0  cuda/10.2  cuda/11.0  cuda/11.2  cuda/11.6
	$ module avail cudnn
	----------------------------------- /afs/crc.nd.edu/x86_64_linux/Modules/modules/development_tools_and_libraries ------------------------------------
	cudnn/7.4  cudnn/8.0.4  cudnn/v7.0

.. note::
        * If you wish to use a CUDA/cuDNN version which is not installed on the CRC system, you may install other versions with :ref:`conda`.

............................

Available Hardware For General Access
======================================

You can find a list of the CRC owned GPU systems on :ref:`available_hardware`.

.. note::
        * These machines have typically 24 cores and 4 GPUs per node.
        * Each GPU has an ID within the machine, this ID can be 0, 1, 2 or 3.

............................

Resource and Job Monitoring
=============================

You can monitor the status and availability of GPU resources with the ``free_gpus.sh`` script.

.. code-block:: shell

	free_gpus.sh @crc_gpu        # For general access
	free_gpus.sh @crc_1080ti     # You can target host groups

The `Xymon monitoring system <https://mon.crc.nd.edu/xymon/>`_ can be used to analyze the behavior of processes on a given GPU machine.

You can check your job's GPU usage. In order to do that, knowing the GPU ID is necessary. You can check the ID with the following command:

.. code-block:: shell

	qstat -j jobID
	
You can find it under the ``resource map``. Here is an example, the GPU ID is the number in brackets:

.. code-block:: shell

	resource map          1:    gpu_card=qa-1080ti-004.crc.nd.edu=(1)
	
............................

Other Resources For GPU Jobs
===============================
If you wish to run large number of GPU jobs, you may want to consider submitting your jobs via Condor. You can find detailed documentation and examples on :ref:`condor`.

If you wish to use GPUs for Machine Learning, you may want to consider using CAML ND. You can find detailed documentation and examples on :ref:`caml`.