SLURM Job Management

SLURM (Simple Linux Utility for Resource Management) is the job scheduler for the Odin HPC cluster. All job-related commands should be run from login nodes.

Quick Start

# SSH to a login node first
ssh login1

# Submit a job
sbatch myjob.sh

# Check queue status
squeue

# View cluster info
sinfo

Important: Always submit jobs from login nodes, not the head node. The head node has limited memory (4GB) and is reserved for scheduler operations.
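A minimal batch script might look like the sketch below. The job name, resource requests, and time limit are illustrative placeholders; adjust them to your workload (partition names are listed under Key Concepts).

#!/bin/bash
#SBATCH --job-name=example          # name shown in squeue
#SBATCH --partition=cpu             # default CPU partition
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # CPU cores per task
#SBATCH --time=01:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x-%j.out          # log file (%x = job name, %j = job ID)

# Your workload goes here
srun hostname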

Workflow Overview

graph LR
    User[User] -->|SSH| Login[Login Node]
    Login -->|sbatch| SLURM[SLURM Controller]
    SLURM -->|Schedule| Queue[Job Queue]
    Queue -->|Dispatch| Compute[Compute Nodes]
    Compute -->|Results| Storage[FSx Storage]
    Compute -->|Notify| Slack[Slack #qcs-infra-notification]

Key Concepts

Partitions (Queues)

The cluster has multiple partitions for different workload types:

Partition         Instance      GPUs      Max Nodes   Use Case
cpu (default)     c7i.8xlarge   None      10          CPU workloads
gpu-inferencing   g5.8xlarge    1× A10G   5           Inference
odin              p5.48xlarge   8× H100   2           Large training
albus             p5.48xlarge   8× H100   2           Large training
bali              p5.48xlarge   8× H100   2           Large training
genius            p5.48xlarge   8× H100   2           Large training
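
For example, a job can target a specific partition and request GPUs at submit time. The --gres syntax below assumes GPUs are exposed under the standard gpu GRES name; confirm the cluster's GRES configuration with sinfo -o "%P %G".

# One A10G on the inferencing partition
sbatch --partition=gpu-inferencing --gres=gpu:1 myjob.sh

# All 8 H100s on a training partition
sbatch --partition=odin --gres=gpu:8 myjob.sh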

Dynamic Scaling

  • Compute nodes are automatically started when jobs are submitted
  • Nodes shut down when idle to save costs
  • The first job submitted to an idle partition may take a few minutes to start while nodes power up (see the check below)
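
A quick way to check whether a pending job is still waiting on node start-up is to look at squeue's REASON column (the format codes below are standard squeue fields):

# %t = job state (PD = pending), %R = reason the job is not yet running
squeue -u $USER -o "%.10i %.9P %.8j %.2t %.10M %R"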

Job Notifications

The cluster sends automatic Slack notifications for job events:

  • Job Started: When your job begins execution
  • Job Completed: When your job finishes successfully
  • Job Failed: When your job fails with an error

Join #qcs-infra-notification on Slack to receive notifications.

Essential Commands

Command                   Description
sbatch script.sh          Submit a batch job
squeue                    View job queue
squeue -u $USER           View your jobs
sinfo                     View partition status
scancel <job-id>          Cancel a job
scontrol show job <id>    Job details
sacct -j <id>             Job accounting
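
As a usage sketch, sacct accepts a --format list of standard accounting columns, which is handy for reviewing finished jobs:

# State, exit code, runtime, and peak memory for a completed job
sacct -j <id> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS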

Next Steps