GPU Jobs

This guide covers running GPU workloads on the Odin cluster.

Available GPUs

Partition          GPU           VRAM   Count        Use Case
gpu-inferencing    NVIDIA A10G   24GB   1 per node   Inference, small training
odin               NVIDIA H100   80GB   8 per node   Large training
albus              NVIDIA H100   80GB   8 per node   Large training
bali               NVIDIA H100   80GB   8 per node   Large training
genius             NVIDIA H100   80GB   8 per node   Large training

Requesting GPUs

Always request GPUs explicitly with --gres=gpu:N, where N is the number of GPUs per node:

# Single A10G GPU
sbatch --partition=gpu-inferencing --gres=gpu:1 job.sh

# Single H100 GPU
sbatch --partition=odin --gres=gpu:1 job.sh

# All 8 H100 GPUs on one node
sbatch --partition=odin --gres=gpu:8 job.sh

# 16 H100 GPUs across 2 nodes (--gres=gpu:8 applies per node)
sbatch --partition=odin --nodes=2 --gres=gpu:8 job.sh

GPU Inference Job

For model inference using the A10G GPU:

#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=inference-%j.out

# Verify GPU
nvidia-smi

# Run inference
python inference.py \
    --model /mnt/odin/models/model.pt \
    --input /mnt/qcs/qcs-odin-dev-ingest/data/ \
    --output /mnt/qcs/qcs-odin-dev-output/results/
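
The contents of inference.py are site-specific; as a rough sketch, a script honoring the --model, --input and --output flags above might look like the following. The checkpoint format (TorchScript) and the input layout (one .pt tensor file per batch) are assumptions, not part of this guide.

# Hypothetical sketch of inference.py -- checkpoint format and data layout are assumptions
import argparse
from pathlib import Path

import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Assumes the checkpoint was saved as a TorchScript module; adjust for state dicts
    model = torch.jit.load(args.model, map_location=device)
    model.eval()

    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

    with torch.no_grad():
        # Assumes each input file is a pre-batched tensor saved with torch.save
        for path in sorted(Path(args.input).glob("*.pt")):
            batch = torch.load(path, map_location=device)
            preds = model(batch)
            torch.save(preds.cpu(), out_dir / path.name)

if __name__ == "__main__":
    main()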

Single-GPU Training

For training on a single H100:

#!/bin/bash
#SBATCH --job-name=single-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=train-%j.out

nvidia-smi

python train.py \
    --epochs 100 \
    --batch-size 64 \
    --data /mnt/odin/datasets/

Multi-GPU Training (Single Node)

Using all 8 H100s on one node:

#!/bin/bash
#SBATCH --job-name=multi-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=multi-gpu-%j.out

echo "GPUs available: $(nvidia-smi -L | wc -l)"
nvidia-smi

# PyTorch distributed training
torchrun --nproc_per_node=8 train.py \
    --epochs 100 \
    --batch-size 512 \
    --data /mnt/odin/datasets/

Multi-Node Distributed Training

Using 16 H100s across 2 nodes. srun launches one torchrun agent per node (hence --ntasks-per-node=1 below), and each agent starts eight workers, one per GPU:

#!/bin/bash
#SBATCH --job-name=distributed-train
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

# Setup distributed environment
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

echo "Master: $MASTER_ADDR:$MASTER_PORT"
echo "World size: $WORLD_SIZE"

# Launch with srun
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py
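
torchrun spawns one worker process per GPU and exports RANK, LOCAL_RANK and WORLD_SIZE (plus the rendezvous variables) to each worker. A minimal sketch of what train_distributed.py could look like follows; the linear model and random data are stand-ins for real training code, not part of this guide.

# Hypothetical skeleton of train_distributed.py -- model and data are toy stand-ins
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the standard backend for NVIDIA GPUs; rendezvous info comes from torchrun
    dist.init_process_group(backend="nccl")

    # Toy model and synthetic data stand in for the real training setup
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)  # shards the data across all ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # changes the shuffle each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()   # DDP averages gradients across all workers here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()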

CUDA Environment

SLURM automatically sets CUDA_VISIBLE_DEVICES based on allocated GPUs. You can verify:

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
nvidia-smi
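
Frameworks only see the allocated devices. For example, with PyTorch (assuming it is installed in your environment):

import torch

# Should equal the N requested with --gres=gpu:N
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))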

Monitoring GPU Usage

In your job script:

# Run nvidia-smi in background
nvidia-smi dmon -s um -d 10 > gpu_metrics.log &
GPU_MONITOR_PID=$!

# Your training code here
python train.py

# Stop monitoring
kill $GPU_MONITOR_PID

Check running job:

# SSH to compute node
squeue -u $USER  # Get node name
ssh <node-name>
nvidia-smi

GPU Memory Tips

  1. Batch size: Usually the largest factor in GPU memory usage; reduce it first
  2. Mixed precision: Use FP16/BF16 to roughly halve activation memory
  3. Gradient checkpointing: Trade extra compute for lower activation memory
  4. Gradient accumulation: Simulate larger effective batches without extra memory (see the sketch after the mixed-precision example)

Example with mixed precision:

# PyTorch automatic mixed precision (FP16 autocast on CUDA)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)           # forward pass runs in mixed precision
    scaler.scale(loss).backward()     # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()
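
Gradient accumulation (tip 4) follows the same shorthand; the accumulation factor of 4 is an arbitrary illustration, and model, optimizer and dataloader are assumed to exist as in the example above.

# PyTorch gradient accumulation (illustrative sketch)
accum_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch) / accum_steps   # scale so the accumulated gradient matches one large batch
    loss.backward()                     # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()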

Common Issues

CUDA Out of Memory

  1. Reduce batch size
  2. Enable gradient checkpointing (see the sketch below)
  3. Use mixed precision training
  4. Request more GPUs and use data parallelism
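
Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during backward. A minimal sketch with torch.utils.checkpoint; the two-block model is purely illustrative.

# Hypothetical sketch: recompute block activations during backward to save memory
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
block2 = torch.nn.Linear(1024, 10).cuda()

x = torch.randn(64, 1024, device="cuda")
hidden = checkpoint(block1, x, use_reentrant=False)  # block1 activations are not stored
out = block2(hidden)
out.sum().backward()                                 # block1 activations recomputed here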

GPU Not Found

  1. Verify --gres=gpu:N in job script
  2. Check nvidia-smi output
  3. Ensure the correct partition is selected

Slow Multi-GPU Scaling

  1. Increase the global batch size in proportion to the number of GPUs
  2. Use the NCCL backend for PyTorch distributed training
  3. Consider gradient accumulation