SLURM Job Scripts

This guide covers writing and submitting SLURM batch job scripts.

Basic Job Script Structure

#!/bin/bash
#SBATCH --job-name=my-job          # Job name
#SBATCH --partition=cpu            # Queue/partition
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks=8                 # Number of tasks
#SBATCH --time=01:00:00           # Time limit (HH:MM:SS)
#SBATCH --output=my-job-%j.out    # Output file (%j = job ID)
#SBATCH --error=my-job-%j.err     # Error file

# Your commands here
echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"

python my_script.py

Common SBATCH Options

Option              Description         Example
--job-name          Job name            --job-name=training
--partition         Queue to use        --partition=gpu-inferencing
--nodes             Number of nodes     --nodes=2
--ntasks            Total tasks         --ntasks=16
--ntasks-per-node   Tasks per node      --ntasks-per-node=8
--cpus-per-task     CPUs per task       --cpus-per-task=4
--gres              Generic resources   --gres=gpu:1
--time              Time limit          --time=24:00:00
--mem               Memory per node     --mem=32G
--output            Stdout file         --output=job-%j.out
--error             Stderr file         --error=job-%j.err
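
The --cpus-per-task and --mem options are not shown in the examples below; a minimal sketch combining them for a multi-threaded job (the partition and script names are placeholders) looks like this:

#!/bin/bash
#SBATCH --job-name=threaded-job
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1                 # One task...
#SBATCH --cpus-per-task=4          # ...with 4 CPU cores for its threads
#SBATCH --mem=16G                  # Memory for the job on the node
#SBATCH --time=00:30:00
#SBATCH --output=threaded-job-%j.out

# Match the thread count to the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

python my_threaded_script.py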

Example: CPU Job

#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu-job-%j.out

echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
echo "CPUs allocated: $SLURM_CPUS_ON_NODE"

# Load any required modules
module load python/3.10

# Run your script
python my_cpu_script.py

Example: GPU Inference Job

#!/bin/bash
#SBATCH --job-name=gpu-inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=gpu-inference-%j.out

echo "Running on $(hostname)"
echo "GPU Info:"
nvidia-smi

# SLURM sets CUDA_VISIBLE_DEVICES for jobs that request GPUs with --gres,
# so it does not need to be exported manually
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# Run inference
python inference.py --model /mnt/odin/models/my_model.pt

Example: H100 Training Job

#!/bin/bash
#SBATCH --job-name=h100-training
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=h100-training-%j.out

echo "Running on $(hostname)"
echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Activate environment
source /mnt/shared/envs/pytorch/bin/activate

# Multi-GPU training with torchrun
torchrun --nproc_per_node=8 train.py \
    --data /mnt/odin/datasets/training_data \
    --output /mnt/qcs/qcs-odin-dev-output/results

Example: Multi-Node Distributed Training

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1        # One torchrun launcher per node; torchrun spawns the 8 GPU workers
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

echo "Master node: $SLURM_NODELIST"
echo "Total nodes: $SLURM_NNODES"
echo "Total GPUs: $((SLURM_NNODES * 8))"

# Get master address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Run distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py

Job Submission Commands

Submit a job:

sbatch my_job.sh
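
On success, sbatch prints the assigned job ID (the number here is a placeholder):

Submitted batch job 12345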

Submit with overrides:

sbatch --partition=gpu-inferencing --time=04:00:00 my_job.sh

Submit with job array:

sbatch --array=1-10 array_job.sh
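
The corresponding array_job.sh can read SLURM_ARRAY_TASK_ID to select its input; the sketch below assumes a placeholder input layout and script name:

#!/bin/bash
#SBATCH --job-name=array-job
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=array-job-%A_%a.out   # %A = array job ID, %a = task index

# Each task in the array receives its own index
echo "Array task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID"

# Use the index to pick an input file (placeholder path)
python process.py --input data/input_${SLURM_ARRAY_TASK_ID}.csv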

Environment Variables

SLURM sets these variables in your job:

Variable              Description
SLURM_JOB_ID          Job ID
SLURM_JOB_NAME        Job name
SLURM_NODELIST        Allocated nodes
SLURM_NNODES          Number of nodes
SLURM_NTASKS          Number of tasks
SLURM_CPUS_ON_NODE    CPUs on this node
SLURM_JOB_GPUS        GPU indices allocated on the node
CUDA_VISIBLE_DEVICES  GPU IDs visible to CUDA code
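
For example, a short preamble at the top of any job script can log the allocation using these variables:

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME)"
echo "Nodes: $SLURM_NODELIST ($SLURM_NNODES node(s), $SLURM_NTASKS task(s))"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"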

Output Files

Use filename patterns such as %j (job ID) and %x (job name) in output and error filenames:

#SBATCH --output=logs/%x-%j.out   # %x = job name
#SBATCH --error=logs/%x-%j.err

Common patterns:

  • %j - Job ID
  • %x - Job name
  • %N - Node name
  • %a - Array task ID
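
For array jobs, these patterns are typically combined so each task writes its own log; %A (the array's master job ID) is also available in sbatch filename patterns:

#SBATCH --output=logs/%x-%A_%a.out   # job name, array job ID, array task index
#SBATCH --error=logs/%x-%A_%a.err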