SLURM Job Scripts

This guide covers writing and submitting SLURM batch job scripts.

Basic Job Script Structure

#!/bin/bash
#SBATCH --job-name=my-job          # Job name
#SBATCH --partition=cpu            # Queue/partition
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks=8                 # Number of tasks
#SBATCH --time=01:00:00           # Time limit (HH:MM:SS)
#SBATCH --output=my-job-%j.out    # Output file (%j = job ID)
#SBATCH --error=my-job-%j.err     # Error file

# Your commands here
echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"

python my_script.py

Common SBATCH Options

Option              Description         Example
--job-name          Job name            --job-name=training
--partition         Queue to use        --partition=gpu-inferencing
--nodes             Number of nodes     --nodes=2
--ntasks            Total tasks         --ntasks=16
--ntasks-per-node   Tasks per node      --ntasks-per-node=8
--cpus-per-task     CPUs per task       --cpus-per-task=4
--gres              Generic resources   --gres=gpu:1
--time              Time limit          --time=24:00:00
--mem               Memory per node     --mem=32G
--output            Stdout file         --output=job-%j.out
--error             Stderr file         --error=job-%j.err
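
The --cpus-per-task and --mem options are not shown in the examples below; a minimal sketch combining them for a multi-threaded job (the partition and script names are placeholders) looks like this:

#!/bin/bash
#SBATCH --job-name=threaded-job
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1                 # One task...
#SBATCH --cpus-per-task=4          # ...with 4 CPU cores for its threads
#SBATCH --mem=16G                  # Memory for the job on the node
#SBATCH --time=00:30:00
#SBATCH --output=threaded-job-%j.out

# Match the thread count to the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

python my_threaded_script.py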

Example: CPU Job

#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu-job-%j.out

echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
echo "CPUs allocated: $SLURM_CPUS_ON_NODE"

# Load any required modules
module load python/3.10

# Run your script
python my_cpu_script.py

Example: GPU Inference Job

#!/bin/bash
#SBATCH --job-name=gpu-inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=gpu-inference-%j.out

echo "Running on $(hostname)"
echo "GPU Info:"
nvidia-smi

# SLURM sets CUDA_VISIBLE_DEVICES for jobs that request GPUs with --gres,
# so it does not need to be exported manually
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# Run inference
python inference.py --model /mnt/odin/models/my_model.pt

Example: H100 Training Job

#!/bin/bash
#SBATCH --job-name=h100-training
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=h100-training-%j.out

echo "Running on $(hostname)"
echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Activate environment
source /mnt/shared/envs/pytorch/bin/activate

# Multi-GPU training with torchrun
torchrun --nproc_per_node=8 train.py \
    --data /mnt/odin/datasets/training_data \
    --output /mnt/qcs/qcs-odin-dev-output/results

Example: Multi-Node Distributed Training

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1        # One torchrun launcher per node; torchrun spawns the 8 GPU workers
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

echo "Master node: $SLURM_NODELIST"
echo "Total nodes: $SLURM_NNODES"
echo "Total GPUs: $((SLURM_NNODES * 8))"

# Get master address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Run distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py

Job Submission Commands

Submit a job:

sbatch my_job.sh
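
On success, sbatch prints the assigned job ID (the number here is a placeholder):

Submitted batch job 12345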

Submit with overrides:

sbatch --partition=gpu-inferencing --time=04:00:00 my_job.sh

Submit with job array:

sbatch --array=1-10 array_job.sh
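
The corresponding array_job.sh can read SLURM_ARRAY_TASK_ID to select its input; the sketch below assumes a placeholder input layout and script name:

#!/bin/bash
#SBATCH --job-name=array-job
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=array-job-%A_%a.out   # %A = array job ID, %a = task index

# Each task in the array receives its own index
echo "Array task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID"

# Use the index to pick an input file (placeholder path)
python process.py --input data/input_${SLURM_ARRAY_TASK_ID}.csv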

Environment Variables

SLURM sets these variables in your job:

Variable              Description
SLURM_JOB_ID          Job ID
SLURM_JOB_NAME        Job name
SLURM_NODELIST        Allocated nodes
SLURM_NNODES          Number of nodes
SLURM_NTASKS          Number of tasks
SLURM_CPUS_ON_NODE    CPUs on this node
SLURM_JOB_GPUS        GPU indices allocated on the node
CUDA_VISIBLE_DEVICES  GPU IDs visible to CUDA code
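
For example, a short preamble at the top of any job script can log the allocation using these variables:

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME)"
echo "Nodes: $SLURM_NODELIST ($SLURM_NNODES node(s), $SLURM_NTASKS task(s))"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"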

Output Files

Use filename patterns such as %j (job ID) and %x (job name) in output and error filenames:

#SBATCH --output=logs/%x-%j.out   # %x = job name
#SBATCH --error=logs/%x-%j.err

Common patterns:

  • %j - Job ID
  • %x - Job name
  • %N - Node name
  • %a - Array task ID
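
For array jobs, these patterns are typically combined so each task writes its own log; %A (the array's master job ID) is also available in sbatch filename patterns:

#SBATCH --output=logs/%x-%A_%a.out   # job name, array job ID, array task index
#SBATCH --error=logs/%x-%A_%a.err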