Queues & Partitions

The Odin cluster has multiple SLURM partitions (queues) configured with dynamic scaling: compute nodes are launched on demand when jobs are queued and released when idle.

Partition Summary

Queue            Instance Type  vCPUs  Memory   GPUs      Max Nodes
cpu              c7i.8xlarge    32     ~61 GB   None      10
gpu-inferencing  g5.8xlarge     32     ~122 GB  1× A10G   5
odin             p5.48xlarge    192    ~1.9 TB  8× H100   2
albus            p5.48xlarge    192    ~1.9 TB  8× H100   2
bali             p5.48xlarge    192    ~1.9 TB  8× H100   2
genius           p5.48xlarge    192    ~1.9 TB  8× H100   2

CPU Partition (Default)

Optimized for compute-intensive CPU workloads.

Property         Value
Instance         c7i.8xlarge
vCPUs per node   32
Memory per node  ~61 GB
Max nodes        10 (320 total vCPUs)
Use cases        Data preprocessing, CPU training, batch processing

Submit a CPU job:

sbatch --partition=cpu --nodes=1 --ntasks=8 my-cpu-job.sh
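
For reference, the batch script itself might look like the sketch below; my-cpu-job.sh is shown with placeholder contents (preprocess.py stands in for your actual workload):

#!/bin/bash
#SBATCH --job-name=cpu-preprocess   # name shown in squeue
#SBATCH --partition=cpu             # c7i.8xlarge CPU nodes
#SBATCH --nodes=1
#SBATCH --ntasks=8                  # 8 tasks on one 32-vCPU node
#SBATCH --time=02:00:00             # wall-clock limit; adjust as needed

# srun launches one copy of the command per task (8 copies here);
# replace preprocess.py with your own preprocessing script.
srun python preprocess.py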

GPU Inferencing Partition

Cost-effective GPU option for inference and smaller training jobs.

Property         Value
Instance         g5.8xlarge
vCPUs per node   32
Memory per node  ~122 GB
GPU              1× NVIDIA A10G (24 GB VRAM)
Max nodes        5 (5 GPUs total)
Use cases        Inference, small training jobs

Submit a GPU inference job:

sbatch --partition=gpu-inferencing --nodes=1 --gres=gpu:1 my-inference-job.sh
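
A minimal sketch of the corresponding batch script, assuming a placeholder infer.py as the inference workload:

#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu-inferencing   # g5.8xlarge, one A10G per node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1                  # request the node's single A10G
#SBATCH --time=01:00:00

# Confirm the GPU is visible inside the allocation, then run the
# placeholder inference command.
nvidia-smi
python infer.py --batch-size 32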

H100 Partitions (odin, albus, bali, genius)

High-performance partitions for large-model and multi-node distributed training.

Property         Value
Instance         p5.48xlarge
vCPUs per node   192
Memory per node  ~1.9 TB
GPUs             8× NVIDIA H100 (80 GB VRAM each)
Interconnect     High-bandwidth NVLink
Max nodes        2 per partition (16 H100s each)
Use cases        Large-scale training, distributed training

Submit an H100 job:

sbatch --partition=odin --nodes=1 --gres=gpu:8 my-training-job.sh

Multi-node H100 job:

sbatch --partition=odin --nodes=2 --gres=gpu:8 --ntasks-per-node=8 distributed-training.sh
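
As a sketch, the batch script behind that multi-node command could look like the following; train.py is a placeholder and is assumed to read SLURM environment variables (e.g. SLURM_PROCID and SLURM_LOCALID) to set up its distributed process group:

#!/bin/bash
#SBATCH --job-name=dist-train
#SBATCH --partition=odin            # or albus, bali, genius
#SBATCH --nodes=2
#SBATCH --gres=gpu:8                # all 8 H100s on each node
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --cpus-per-task=24          # 192 vCPUs / 8 tasks per node
#SBATCH --time=12:00:00

# srun starts 16 tasks (2 nodes × 8 per node); each task should bind to
# one GPU, e.g. via SLURM_LOCALID inside the placeholder training script.
srun python train.py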

Viewing Queue Status

View all partitions:

sinfo

Detailed partition info (partition, node count, state, time limit, CPUs per node, GRES, memory, and node list):

sinfo -o '%P %.5D %.6t %.10l %.6c %.6G %.8m %N'

View specific partition:

sinfo -p odin
sinfo -p gpu-inferencing
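
To see the jobs running or waiting in a partition (as opposed to the node view from sinfo):

squeue -p odin
squeue -u $USER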

Queue Selection Guide

Workload Type                   Recommended Partition
Data preprocessing              cpu
Small model training            cpu or gpu-inferencing
Model inference                 gpu-inferencing
Large model training            odin, albus, bali, or genius
Multi-GPU distributed training  odin, albus, bali, or genius
Hyperparameter search           cpu (parallel runs; see the array example below)
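
The "parallel runs" for a hyperparameter search are typically expressed as a SLURM job array, sketched below; sweep.py, the 0-19 index range, and the resource numbers are placeholders:

#!/bin/bash
#SBATCH --job-name=hp-search
#SBATCH --partition=cpu
#SBATCH --array=0-19               # 20 independent trials, indexed 0..19
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=04:00:00

# Each array task runs one trial; sweep.py is a placeholder that maps
# SLURM_ARRAY_TASK_ID to a hyperparameter configuration.
python sweep.py --trial "$SLURM_ARRAY_TASK_ID"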

Notes

  • Default Queue: Jobs without --partition use the cpu queue
  • GPU Resources: Always request GPUs with --gres=gpu:N
  • Job Priority: All partitions have equal priority (PriorityJobFactor=1)
  • Time Limits: No explicit default is configured, so it is good practice to set one with --time (see the example below)
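
For example, to cap a CPU job at four hours (the time value is illustrative):

sbatch --partition=cpu --time=04:00:00 --ntasks=8 my-cpu-job.sh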