GPU Jobs

This guide covers running GPU workloads on the Odin cluster.

Available GPUs

Partition          GPU           VRAM   Count        Use Case
gpu-inferencing    NVIDIA A10G   24GB   1 per node   Inference, small training
odin               NVIDIA H100   80GB   8 per node   Large training
albus              NVIDIA H100   80GB   8 per node   Large training
bali               NVIDIA H100   80GB   8 per node   Large training
genius             NVIDIA H100   80GB   8 per node   Large training

Requesting GPUs

Always request GPUs explicitly with --gres=gpu:N, where N is the number of GPUs per node:

# Single A10G GPU
sbatch --partition=gpu-inferencing --gres=gpu:1 job.sh

# Single H100 GPU
sbatch --partition=odin --gres=gpu:1 job.sh

# All 8 H100 GPUs on one node
sbatch --partition=odin --gres=gpu:8 job.sh

# 16 H100 GPUs across 2 nodes (--gres=gpu:8 applies per node)
sbatch --partition=odin --nodes=2 --gres=gpu:8 job.sh

GPU Inference Job

For model inference using the A10G GPU:

#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=inference-%j.out

# Verify GPU
nvidia-smi

# Run inference
python inference.py \
    --model /mnt/odin/models/model.pt \
    --input /mnt/qcs/qcs-odin-dev-ingest/data/ \
    --output /mnt/qcs/qcs-odin-dev-output/results/
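
The contents of inference.py are site-specific; as a rough sketch, a script honoring the --model, --input and --output flags above might look like the following. The checkpoint format (TorchScript) and the input layout (one .pt tensor file per batch) are assumptions, not part of this guide.

# Hypothetical sketch of inference.py -- checkpoint format and data layout are assumptions
import argparse
from pathlib import Path

import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Assumes the checkpoint was saved as a TorchScript module; adjust for state dicts
    model = torch.jit.load(args.model, map_location=device)
    model.eval()

    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

    with torch.no_grad():
        # Assumes each input file is a pre-batched tensor saved with torch.save
        for path in sorted(Path(args.input).glob("*.pt")):
            batch = torch.load(path, map_location=device)
            preds = model(batch)
            torch.save(preds.cpu(), out_dir / path.name)

if __name__ == "__main__":
    main()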

Single-GPU Training

For training on a single H100:

#!/bin/bash
#SBATCH --job-name=single-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=train-%j.out

nvidia-smi

python train.py \
    --epochs 100 \
    --batch-size 64 \
    --data /mnt/odin/datasets/

Multi-GPU Training (Single Node)

Using all 8 H100s on one node:

#!/bin/bash
#SBATCH --job-name=multi-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=multi-gpu-%j.out

echo "GPUs available: $(nvidia-smi -L | wc -l)"
nvidia-smi

# PyTorch distributed training
torchrun --nproc_per_node=8 train.py \
    --epochs 100 \
    --batch-size 512 \
    --data /mnt/odin/datasets/

Multi-Node Distributed Training

Using 16 H100s across 2 nodes. srun launches one torchrun agent per node (hence --ntasks-per-node=1 below), and each agent starts eight workers, one per GPU:

#!/bin/bash
#SBATCH --job-name=distributed-train
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

# Setup distributed environment
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

echo "Master: $MASTER_ADDR:$MASTER_PORT"
echo "World size: $WORLD_SIZE"

# Launch with srun
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py
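
torchrun spawns one worker process per GPU and exports RANK, LOCAL_RANK and WORLD_SIZE (plus the rendezvous variables) to each worker. A minimal sketch of what train_distributed.py could look like follows; the linear model and random data are stand-ins for real training code, not part of this guide.

# Hypothetical skeleton of train_distributed.py -- model and data are toy stand-ins
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the standard backend for NVIDIA GPUs; rendezvous info comes from torchrun
    dist.init_process_group(backend="nccl")

    # Toy model and synthetic data stand in for the real training setup
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)  # shards the data across all ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # changes the shuffle each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()   # DDP averages gradients across all workers here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()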

CUDA Environment

SLURM automatically sets CUDA_VISIBLE_DEVICES based on allocated GPUs. You can verify:

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
nvidia-smi
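
Frameworks only see the allocated devices. For example, with PyTorch (assuming it is installed in your environment):

import torch

# Should equal the N requested with --gres=gpu:N
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))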

Monitoring GPU Usage

In your job script:

# Run nvidia-smi in background
nvidia-smi dmon -s um -d 10 > gpu_metrics.log &
GPU_MONITOR_PID=$!

# Your training code here
python train.py

# Stop monitoring
kill $GPU_MONITOR_PID

Check running job:

# SSH to compute node
squeue -u $USER  # Get node name
ssh <node-name>
nvidia-smi

GPU Memory Tips

  1. Batch size: Usually the largest factor in GPU memory usage; reduce it first
  2. Mixed precision: Use FP16/BF16 to roughly halve activation memory
  3. Gradient checkpointing: Trade extra compute for lower activation memory
  4. Gradient accumulation: Simulate larger effective batches without extra memory (see the sketch after the mixed-precision example)

Example with mixed precision:

# PyTorch automatic mixed precision (FP16 autocast on CUDA)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)           # forward pass runs in mixed precision
    scaler.scale(loss).backward()     # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()
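
Gradient accumulation (tip 4) follows the same shorthand; the accumulation factor of 4 is an arbitrary illustration, and model, optimizer and dataloader are assumed to exist as in the example above.

# PyTorch gradient accumulation (illustrative sketch)
accum_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch) / accum_steps   # scale so the accumulated gradient matches one large batch
    loss.backward()                     # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()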

Common Issues

CUDA Out of Memory

  1. Reduce batch size
  2. Enable gradient checkpointing (see the sketch below)
  3. Use mixed precision training
  4. Request more GPUs and use data parallelism
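
Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backward. A minimal sketch with torch.utils.checkpoint; the two-block model is purely illustrative.

# Hypothetical sketch: recompute block activations during backward to save memory
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
block2 = torch.nn.Linear(1024, 10).cuda()

x = torch.randn(64, 1024, device="cuda")
hidden = checkpoint(block1, x, use_reentrant=False)  # block1 activations are not stored
out = block2(hidden)
out.sum().backward()                                 # block1 activations recomputed here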

GPU Not Found

  1. Verify --gres=gpu:N in job script
  2. Check nvidia-smi output
  3. Ensure the correct partition is selected

Slow Multi-GPU Scaling

  1. Increase the global batch size in proportion to the number of GPUs
  2. Use the NCCL backend for PyTorch distributed training
  3. Consider gradient accumulation