SLURM Job Scripts
This guide covers writing and submitting SLURM batch job scripts.
Basic Job Script Structure
```bash
#!/bin/bash
#SBATCH --job-name=my-job        # Job name
#SBATCH --partition=cpu          # Queue/partition
#SBATCH --nodes=1                # Number of nodes
#SBATCH --ntasks=8               # Number of tasks
#SBATCH --time=01:00:00          # Time limit (HH:MM:SS)
#SBATCH --output=my-job-%j.out   # Output file (%j = job ID)
#SBATCH --error=my-job-%j.err    # Error file

# Your commands here
echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
python my_script.py
```
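If the job actually needs all eight requested tasks (for example an MPI program), the work is usually launched through `srun` inside the script so that one copy runs per task; a minimal sketch, where `my_mpi_program` is a hypothetical binary:

```bash
# srun launches one copy per allocated task (8 here, matching --ntasks above)
srun ./my_mpi_program

# A plain command without srun runs once, on the first node of the allocation
python my_script.py
```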
Common SBATCH Options
| Option | Description | Example |
|---|---|---|
| `--job-name` | Job name | `--job-name=training` |
| `--partition` | Queue to use | `--partition=gpu-inferencing` |
| `--nodes` | Number of nodes | `--nodes=2` |
| `--ntasks` | Total tasks | `--ntasks=16` |
| `--ntasks-per-node` | Tasks per node | `--ntasks-per-node=8` |
| `--cpus-per-task` | CPUs per task | `--cpus-per-task=4` |
| `--gres` | Generic resources | `--gres=gpu:1` |
| `--time` | Time limit | `--time=24:00:00` |
| `--mem` | Memory per node | `--mem=32G` |
| `--output` | Stdout file | `--output=job-%j.out` |
| `--error` | Stderr file | `--error=job-%j.err` |
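As a hedged illustration of how these options combine (the partition name reuses the `cpu` partition from the examples below, and `my_parallel_program` is a placeholder), the header below requests 4 tasks with 4 CPUs each and 32 GB of memory on a single node:

```bash
#!/bin/bash
#SBATCH --job-name=combined-example
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=combined-%j.out
#SBATCH --error=combined-%j.err

# 4 tasks x 4 CPUs per task = 16 cores, sharing 32 GB on one node
srun ./my_parallel_program
```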
Example: CPU Job
```bash
#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu-job-%j.out

echo "Running on $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
echo "CPUs allocated: $SLURM_CPUS_ON_NODE"

# Load any required modules
module load python/3.10

# Run your script
python my_cpu_script.py
```
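If the script relies on threaded libraries such as OpenMP or MKL-backed NumPy, it is common to pin their thread counts to the allocation so the job does not oversubscribe its cores; a sketch, assuming the libraries honour these standard environment variables:

```bash
# Limit threaded libraries to the CPUs SLURM actually allocated
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
export MKL_NUM_THREADS=$SLURM_CPUS_ON_NODE

python my_cpu_script.py
```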
Example: GPU Inference Job
```bash
#!/bin/bash
#SBATCH --job-name=gpu-inference
#SBATCH --partition=gpu-inferencing
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=gpu-inference-%j.out

echo "Running on $(hostname)"
echo "GPU Info:"
nvidia-smi

# Set CUDA environment
export CUDA_VISIBLE_DEVICES=$SLURM_JOB_GPUS

# Run inference
python inference.py --model /mnt/odin/models/my_model.pt
```
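A quick framework-level check before the real inference run can catch environment problems early; a sketch, assuming PyTorch is available in the job's Python environment:

```bash
# Confirm the framework can see the allocated GPU before starting inference
python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '- devices:', torch.cuda.device_count())"
```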
Example: H100 Training Job
```bash
#!/bin/bash
#SBATCH --job-name=h100-training
#SBATCH --partition=odin
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=h100-training-%j.out

echo "Running on $(hostname)"
echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Activate environment
source /mnt/shared/envs/pytorch/bin/activate

# Multi-GPU training with torchrun
torchrun --nproc_per_node=8 train.py \
    --data /mnt/odin/datasets/training_data \
    --output /mnt/qcs/qcs-odin-dev-output/results
```
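Tuning knobs such as NCCL logging and per-process thread counts are cluster- and workload-specific; as an illustrative sketch only, settings like these are often exported before the `torchrun` line:

```bash
# Optional, illustrative settings placed before torchrun
export NCCL_DEBUG=WARN     # surface NCCL warnings/errors in the job log
export OMP_NUM_THREADS=4   # threads per worker process; tune to your CPU allocation
```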
Example: Multi-Node Distributed Training
```bash
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=odin
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --output=distributed-%j.out

# Note: --ntasks-per-node=1 gives one torchrun launcher per node;
# torchrun then spawns the 8 worker processes on each node.

echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Total nodes: $SLURM_NNODES"
echo "Total GPUs: $((SLURM_NNODES * 8))"

# Get master address (first node in the allocation)
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Run distributed training (srun starts one torchrun launcher per node)
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_distributed.py
```
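When a multi-node job misbehaves, a short pre-flight check placed before the `torchrun` launch can confirm that every node is reachable and sees its GPUs; a sketch:

```bash
# One task per node: print each node's hostname and visible GPU count
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs"'
```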
Job Submission Commands
Submit a job:

```bash
sbatch my_job.sh
```

Submit with overrides (command-line options take precedence over `#SBATCH` directives in the script):

```bash
sbatch --partition=gpu-inferencing --time=04:00:00 my_job.sh
```

Submit a job array:

```bash
sbatch --array=1-10 array_job.sh
```
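An array job runs the same script once per index, with `SLURM_ARRAY_TASK_ID` distinguishing the runs; a minimal sketch of what `array_job.sh` might contain (the per-task input naming is hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=array-%A_%a.out   # %A = array job ID, %a = array task ID

# Each array task (1-10) processes a different input file
echo "Array task: $SLURM_ARRAY_TASK_ID"
python my_script.py --input "input_${SLURM_ARRAY_TASK_ID}.txt"
```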
Environment Variables
SLURM sets these variables in your job:
| Variable | Description |
|---|---|
| `SLURM_JOB_ID` | Job ID |
| `SLURM_JOB_NAME` | Job name |
| `SLURM_NODELIST` | Allocated nodes |
| `SLURM_NNODES` | Number of nodes |
| `SLURM_NTASKS` | Number of tasks |
| `SLURM_CPUS_ON_NODE` | CPUs on this node |
| `SLURM_JOB_GPUS` | Allocated GPUs |
| `CUDA_VISIBLE_DEVICES` | CUDA GPU IDs |
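Echoing a few of these at the top of a script makes failed jobs easier to diagnose from the log alone; a sketch:

```bash
# Record the allocation details at the start of the job log
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) on nodes: $SLURM_NODELIST"
echo "Nodes: $SLURM_NNODES, tasks: $SLURM_NTASKS, CPUs on this node: $SLURM_CPUS_ON_NODE"
echo "GPUs: ${SLURM_JOB_GPUS:-none}, CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-unset}"
```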
Output Files
Use `%j` for job ID substitution in output filenames:

```bash
#SBATCH --output=logs/%x-%j.out   # %x = job name
#SBATCH --error=logs/%x-%j.err
```

Common patterns:

- `%j` - Job ID
- `%x` - Job name
- `%N` - Node name
- `%a` - Array task ID
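One caveat: SLURM does not create missing directories for output files, so a `logs/` prefix like the one above must exist before the job starts; a sketch:

```bash
# Create the log directory before submitting, or the output file cannot be written
mkdir -p logs
sbatch my_job.sh
```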