SLURM Job Management
SLURM (Simple Linux Utility for Resource Management) is the job scheduler for the Odin HPC cluster. All job-related commands should be run from login nodes.
Quick Start
# SSH to a login node first
ssh login1
# Submit a job
sbatch myjob.sh
# Check queue status
squeue
# View cluster info
sinfo
Important: Always submit jobs from a login node, not the head node. The head node has limited memory (4 GB) and is reserved for scheduler operations.
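As a minimal sketch of what `myjob.sh` might contain, the script below requests one task on the default `cpu` partition and prints where it ran. The job name, time limit, and output file name are illustrative assumptions, not values prescribed for this cluster.

```bash
#!/bin/bash
# Illustrative job name (assumption).
#SBATCH --job-name=hello
# Default partition on this cluster.
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Illustrative wall-clock limit (assumption).
#SBATCH --time=00:05:00
# %j expands to the job ID.
#SBATCH --output=hello-%j.out

# SLURM_JOB_ID is set by SLURM inside a job; fall back when run by hand.
job_id="${SLURM_JOB_ID:-none}"
node="$(uname -n)"

echo "Job ${job_id} running on ${node}"
```

Because the `#SBATCH` lines are ordinary comments to the shell, the same file can be run directly with `bash myjob.sh` as a quick syntax check before submitting it.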
Workflow Overview
graph LR
User[User] -->|SSH| Login[Login Node]
Login -->|sbatch| SLURM[SLURM Controller]
SLURM -->|Schedule| Queue[Job Queue]
Queue -->|Dispatch| Compute[Compute Nodes]
Compute -->|Results| Storage[FSx Storage]
Compute -->|Notify| Slack[Slack #qcs-infra-notification]
Key Concepts
Partitions (Queues)
The cluster has multiple partitions for different workload types:
| Partition | Instance Type | GPUs per Node | Max Nodes | Use Case |
|---|---|---|---|---|
| cpu (default) | c7i.8xlarge | None | 10 | CPU workloads |
| gpu-inferencing | g5.8xlarge | 1× A10G | 5 | Inference |
| odin | p5.48xlarge | 8× H100 | 2 | Large training |
| albus | p5.48xlarge | 8× H100 | 2 | Large training |
| bali | p5.48xlarge | 8× H100 | 2 | Large training |
| genius | p5.48xlarge | 8× H100 | 2 | Large training |
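To target a partition other than the default, pass `--partition` to `sbatch`; options given on the command line take precedence over `#SBATCH` directives in the script. The sketch below assumes the `gpu-inferencing` partition from the table and a script named `myjob.sh`. The `--gres=gpu:1` request assumes GPUs are exposed as a generic resource named `gpu`, which may not match how this cluster is configured.

```bash
# Partition name taken from the table above.
partition="gpu-inferencing"

if command -v sbatch >/dev/null 2>&1; then
  # Request one node and one GPU (the GRES name "gpu" is an assumption).
  sbatch --partition="$partition" --nodes=1 --gres=gpu:1 myjob.sh
else
  echo "sbatch not found; run this from a login node"
fi
```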
Dynamic Scaling
- Compute nodes start automatically when jobs are submitted.
- Idle nodes shut down to save costs.
- The first job may take a few minutes to start while nodes boot.
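While nodes are starting, a job typically sits in the `PENDING` or `CONFIGURING` state. The helper below is a hypothetical sketch (the name `wait_for_start` and the 30-second default poll interval are assumptions) that polls `squeue` until the job leaves those states.

```bash
# Poll squeue until the given job is no longer waiting for nodes.
# Usage: wait_for_start <job-id> [poll-interval-seconds]
wait_for_start() {
  job_id="$1"
  interval="${2:-30}"
  while true; do
    # -h drops the header; %T prints the full job state name.
    state="$(squeue -h -j "$job_id" -o %T 2>/dev/null)"
    case "$state" in
      PENDING|CONFIGURING)
        echo "Job $job_id is $state; waiting..."
        sleep "$interval"
        ;;
      "")
        echo "Job $job_id is no longer in the queue"
        return 1
        ;;
      *)
        echo "Job $job_id is $state"
        return 0
        ;;
    esac
  done
}

# Example (run on a login node):
#   job_id="$(sbatch --parsable myjob.sh)"
#   wait_for_start "$job_id"
```

An empty result is treated as "not in the queue", which covers both an unknown job ID and a job that has already finished; `sacct` is the better tool for inspecting the latter.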
Job Notifications
The cluster sends automatic Slack notifications for job events:
- Job Started: When your job begins execution
- Job Completed: When your job finishes successfully
- Job Failed: When your job fails with an error
Join #qcs-infra-notification on Slack to receive notifications.
Essential Commands
| Command | Description |
|---|---|
| `sbatch script.sh` | Submit a batch job |
| `squeue` | View job queue |
| `squeue -u $USER` | View your jobs |
| `sinfo` | View partition status |
| `scancel <job-id>` | Cancel a job |
| `scontrol show job <id>` | Job details |
| `sacct -j <id>` | Job accounting |
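The two wrappers below are a hedged sketch of how these commands can be combined for day-to-day use. The function names `my_jobs` and `job_summary` are hypothetical, and the chosen output columns are only one reasonable selection.

```bash
# List your own jobs with ID, partition, name, state, elapsed time,
# and the pending reason or allocated node list.
my_jobs() {
  squeue -u "${USER:-$(id -un)}" -o "%.10i %.16P %.20j %.12T %.10M %R"
}

# Show accounting data for one job, including its final state and exit code.
# Usage: job_summary <job-id>
job_summary() {
  sacct -j "$1" --format=JobID,JobName,Partition,State,ExitCode,Elapsed
}

# Example (run on a login node):
#   my_jobs
#   job_summary 12345
```

`squeue` only shows jobs that are still pending or running, so `job_summary` is the one to reach for after a failure notification.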
Next Steps
- Queues & Partitions - Detailed partition information
- Job Scripts - Writing SLURM scripts
- GPU Jobs - Running GPU workloads