SLURM Job Management

SLURM (Simple Linux Utility for Resource Management) is the job scheduler for the Odin HPC cluster. All job-related commands should be run from login nodes.

Quick Start

# SSH to a login node first
ssh login1

# Submit a job
sbatch myjob.sh

# Check queue status
squeue

# View cluster info
sinfo

Important: Always submit jobs from login nodes, not the head node. The head node has limited memory (4GB) and is reserved for scheduler operations.
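A minimal batch script might look like the sketch below. The job name, resource requests, and time limit are illustrative placeholders; adjust them to your workload (partition names are listed under Key Concepts).

#!/bin/bash
#SBATCH --job-name=example          # name shown in squeue
#SBATCH --partition=cpu             # default CPU partition
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # CPU cores per task
#SBATCH --time=01:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x-%j.out          # log file (%x = job name, %j = job ID)

# Your workload goes here
srun hostname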

Workflow Overview

graph LR
    User[User] -->|SSH| Login[Login Node]
    Login -->|sbatch| SLURM[SLURM Controller]
    SLURM -->|Schedule| Queue[Job Queue]
    Queue -->|Dispatch| Compute[Compute Nodes]
    Compute -->|Results| Storage[FSx Storage]
    Compute -->|Notify| Slack[Slack #qcs-infra-notification]

Key Concepts

Partitions (Queues)

The cluster has multiple partitions for different workload types:

Partition         Instance      GPUs      Max Nodes   Use Case
cpu (default)     c7i.8xlarge   None      10          CPU workloads
gpu-inferencing   g5.8xlarge    1× A10G   5           Inference
odin              p5.48xlarge   8× H100   2           Large training
albus             p5.48xlarge   8× H100   2           Large training
bali              p5.48xlarge   8× H100   2           Large training
genius            p5.48xlarge   8× H100   2           Large training
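
For example, a job can target a specific partition and request GPUs at submit time. The --gres syntax below assumes GPUs are exposed under the standard gpu GRES name; confirm the cluster's GRES configuration with sinfo -o "%P %G".

# One A10G on the inferencing partition
sbatch --partition=gpu-inferencing --gres=gpu:1 myjob.sh

# All 8 H100s on a training partition
sbatch --partition=odin --gres=gpu:8 myjob.sh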

Dynamic Scaling

  • Compute nodes are automatically started when jobs are submitted
  • Nodes shut down when idle to save costs
  • The first job submitted to an idle partition may take a few minutes to start while nodes power up (see the check below)
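
A quick way to check whether a pending job is still waiting on node start-up is to look at squeue's REASON column (the format codes below are standard squeue fields):

# %t = job state (PD = pending), %R = reason the job is not yet running
squeue -u $USER -o "%.10i %.9P %.8j %.2t %.10M %R"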

Job Notifications

The cluster sends automatic Slack notifications for job events:

  • Job Started: When your job begins execution
  • Job Completed: When your job finishes successfully
  • Job Failed: When your job fails with an error

Join #qcs-infra-notification on Slack to receive notifications.

Essential Commands

Command                   Description
sbatch script.sh          Submit a batch job
squeue                    View job queue
squeue -u $USER           View your jobs
sinfo                     View partition status
scancel <job-id>          Cancel a job
scontrol show job <id>    Job details
sacct -j <id>             Job accounting
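
As a usage sketch, sacct accepts a --format list of standard accounting columns, which is handy for reviewing finished jobs:

# State, exit code, runtime, and peak memory for a completed job
sacct -j <id> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS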

Next Steps