# GPU Jobs
This guide covers running GPU workloads on the Odin cluster.

## Available GPUs

| Partition | GPU | VRAM | Count | Use Case |
|---|---|---|---|---|
| gpu-inferencing | NVIDIA A10G | 24GB | 1 per node | Inference, small training |
| odin | NVIDIA H100 | 80GB | 8 per node | Large training |
| albus | NVIDIA H100 | 80GB | 8 per node | Large training |
| bali | NVIDIA H100 | 80GB | 8 per node | Large training |
| genius | NVIDIA H100 | 80GB | 8 per node | Large training |

## Requesting GPUs

Always use `--gres=gpu:N` to request GPUs, where N is the number of GPUs per node:
```bash
# Single A10G GPU
sbatch --partition=gpu-inferencing --gres=gpu:1 job.sh

# Single H100 GPU
sbatch --partition=odin --gres=gpu:1 job.sh

# All 8 H100 GPUs on one node
sbatch --partition=odin --gres=gpu:8 job.sh

# 16 H100 GPUs across 2 nodes
sbatch --partition=odin --nodes=2 --gres=gpu:8 job.sh
```

## GPU Inference Job

For model inference using the A10G GPU:
```bash
#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=inference-%j.out

# Verify GPU
nvidia-smi

# Run inference
python inference.py \
    --model /mnt/odin/models/model.pt \
    --input /mnt/qcs/qcs-odin-dev-ingest/data/ \
    --output /mnt/qcs/qcs-odin-dev-output/results/
```
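
The job script assumes an inference.py that you provide. As a rough sketch of what such a script might look like given the flags above (the whole-model checkpoint and the .pt input layout are assumptions, not cluster conventions):

```python
# Hypothetical inference.py matching the --model/--input/--output flags above.
import argparse
from pathlib import Path

import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.load(args.model, map_location=device)  # assumes a whole-model checkpoint
    model.eval()

    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

    with torch.no_grad():
        for path in sorted(Path(args.input).glob("*.pt")):  # assumed input layout
            batch = torch.load(path, map_location=device)
            torch.save(model(batch).cpu(), out_dir / path.name)

if __name__ == "__main__":
    main()
```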

## Single-GPU Training

For training on a single H100:
```bash
#!/bin/bash
#SBATCH --job-name=single-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=train-%j.out

nvidia-smi

python train.py \
    --epochs 100 \
    --batch-size 64 \
    --data /mnt/odin/datasets/
```

## Multi-GPU Training (Single Node)

Using all 8 H100s on one node:
```bash
#!/bin/bash
#SBATCH --job-name=multi-gpu-train
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=multi-gpu-%j.out

echo "GPUs available: $(nvidia-smi -L | wc -l)"
nvidia-smi

# PyTorch distributed training
torchrun --nproc_per_node=8 train.py \
    --epochs 100 \
    --batch-size 512 \
    --data /mnt/odin/datasets/
```
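
torchrun starts 8 worker processes and exports RANK, LOCAL_RANK, and WORLD_SIZE to each of them; train.py has to pick these up and initialize distributed training itself. A minimal sketch of that setup (the model, optimizer, and training loop are placeholders, not part of the cluster configuration):

```python
# Minimal DistributedDataParallel setup for a script launched by torchrun.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")  # NCCL is the backend to use for GPUs
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):  # placeholder training loop
        x = torch.randn(64, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across all workers
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script works unchanged for the multi-node job in the next section, because torchrun assigns global ranks across nodes.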

## Multi-Node Distributed Training

Using 16 H100s across 2 nodes:
```bash
#!/bin/bash
#SBATCH --job-name=distributed-train
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
# One launcher task per node; torchrun spawns the 8 GPU workers on each node
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

# Set up the distributed environment
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

echo "Master: $MASTER_ADDR:$MASTER_PORT"
echo "World size: $WORLD_SIZE"

# Launch one torchrun per node with srun
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py
```

## CUDA Environment

SLURM automatically sets `CUDA_VISIBLE_DEVICES` based on allocated GPUs. You can verify:
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
nvidia-smi
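
The same check can be done from Python; a quick sketch using PyTorch (any CUDA-aware framework exposes something equivalent):

```python
import os

import torch

# SLURM restricts the job to its allocated GPUs via CUDA_VISIBLE_DEVICES.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```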

## Monitoring GPU Usage

In your job script:
```bash
# Sample GPU utilization and memory every 10 seconds in the background
nvidia-smi dmon -s um -d 10 > gpu_metrics.log &
GPU_MONITOR_PID=$!

# Your training code here
python train.py

# Stop monitoring
kill $GPU_MONITOR_PID
```
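
You can also log GPU memory from inside your training code; a small sketch using PyTorch's built-in memory counters (the function name and call sites are up to you):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Current and peak memory held by PyTorch tensors on the default GPU, in GB.
    allocated = torch.cuda.memory_allocated() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"[{tag}] GPU memory: {allocated:.2f} GB allocated, {peak:.2f} GB peak")

# Example: log_gpu_memory("after epoch 1")
```
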
To check a running job, find its node and run nvidia-smi there:

```bash
# Find the node your job is running on, then SSH to it
squeue -u $USER
ssh <node-name>
nvidia-smi
```

## GPU Memory Tips

- Batch size: Largest factor in GPU memory usage
- Mixed precision: Use FP16/BF16 to reduce memory by ~50%
- Gradient checkpointing: Trade compute for memory
- Gradient accumulation: Simulate larger batches (see the sketch after the mixed-precision example below)
Example with mixed precision:
```python
# PyTorch automatic mixed precision
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()
```
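
Gradient accumulation combines naturally with mixed precision; a minimal sketch reusing the same model, optimizer, and dataloader names, where accumulation_steps is a value you choose:

```python
from torch.cuda.amp import autocast, GradScaler

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps
scaler = GradScaler()
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    with autocast():
        loss = model(batch) / accumulation_steps  # average over the virtual batch
    scaler.scale(loss).backward()  # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```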

## Common Issues

### CUDA Out of Memory

- Reduce batch size
- Enable gradient checkpointing
- Use mixed precision training
- Request more GPUs and use data parallelism
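
For the gradient checkpointing option, a minimal sketch using torch.utils.checkpoint (the sequential block structure is illustrative; apply it to your own model's submodules):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Recompute each block's activations during backward instead of storing them,
    # trading extra compute for lower peak memory.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```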

### GPU Not Found

- Verify `--gres=gpu:N` is present in the job script
- Check `nvidia-smi` output
- Ensure the correct partition is selected

### Slow Multi-GPU Scaling

- Increase batch size proportionally
- Use NCCL backend for PyTorch
- Consider gradient accumulation