# SLURM Usage Tracking & Reporting
## Understanding Allocated vs Actual Usage

SLURM accounting records the resources a job *allocated*, not what it actually utilized.
### What Is Tracked

| Resource | Tracked? | Notes |
|---|---|---|
| CPUs | ✅ Yes | Allocated count, not actual % utilization |
| Memory | ✅ Yes | Requested amount, not actual consumption |
| GPUs | ✅ Yes | Requested type and count |
| Walltime | ✅ Yes | Elapsed run time |
| CPU % | ❌ No | Use CloudWatch or job monitoring |
| Memory % | ❌ No | Use CloudWatch or job monitoring |
| GPU % | ❌ No | Use NVIDIA tools or job monitoring |
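To make the allocation-vs-utilization distinction concrete, here is a minimal sketch (plain shell arithmetic, no cluster access needed; the numbers are illustrative):

```shell
# Allocation-based accounting: a job that allocates 16 CPUs for 2 hours
# is recorded as 32 CPU-hours, even if only one core was ever busy.
alloc_cpus=16
elapsed_hours=2
echo "$((alloc_cpus * elapsed_hours)) CPU-hours charged"
```

This is why a chronically over-requesting job looks expensive in `sacct` even when CloudWatch shows the instance nearly idle.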
## Tools for Usage Reporting

### sacct - Direct Access (Recommended for Recent Data)

- Reads accounting logs directly on the headnode
- Available immediately after job completion
- Per-job details
- No database sync delay

```shell
sacct [options]
```

### sreport - Aggregated Reports (Better for Historical Data)

- Queries the accounting database
- ~5 minute sync delay from job completion
- Provides pre-aggregated summaries
- Better for trend analysis

```shell
sreport [report] [options]
```
## Quick Commands

### Last 30 Days by Account (CPU-Hours)

```shell
ssh headnode
# -X: one row per job (no .batch/.step rows that would double-count)
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --accounts=qcs,albus,bali,genius,odin \
  --format="Account%12,State%12,CPUTime%15,AllocCPUS%10" \
  --parsable2 | awk -F'|' '
NR>1 {
  account=$1; cputime=$3; cpus=$4;
  n = split(cputime, t, ":");
  # Handle the optional day prefix in [DD-]HH:MM:SS
  if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
  if (n==2) secs = t[1]*60 + t[2];
  else      secs = t[1]*3600 + t[2]*60 + t[3];
  cpu_hours = secs/3600;
  total_jobs[account]++;
  total_cpu_hours[account] += cpu_hours;
  total_cpus[account] += cpus;
}
END {
  printf "%-12s %-12s %-15s %-10s\n", "Account", "Jobs", "CPU-Hours", "CPU-Count";
  printf "%s\n", "-----------------------------------------------";
  for (a in total_jobs)
    printf "%-12s %-12d %-15.1f %-10d\n", a, total_jobs[a], total_cpu_hours[a], total_cpus[a];
}' | sort -k3 -rn
```
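The time parsing embedded in the command above recurs in every report in this page, so it can help to keep it as a small standalone helper. A sketch (the function name is ours, not a SLURM tool); note that `sacct` prints `CPUTime` as `[DD-]HH:MM:SS`, so the day prefix must be folded in:

```shell
# Convert sacct's [DD-]HH:MM:SS (or MM:SS) into decimal hours.
cputime_to_hours() {
  echo "$1" | awk '{
    n = split($0, t, ":");
    # "1-02" in the first field means 1 day + 2 hours
    if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
    if (n == 2) secs = t[1]*60 + t[2];
    else        secs = t[1]*3600 + t[2]*60 + t[3];
    printf "%.2f\n", secs/3600;
  }'
}

cputime_to_hours "02:30:00"    # 2.50
cputime_to_hours "1-02:00:00"  # 26.00
```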
### By User in Last 7 Days

```shell
sacct -X --starttime "$(date -d '7 days ago' '+%Y-%m-%d')" \
  --format="User%12,Account%10,JobID%8,CPUTime%15" \
  --parsable2 | awk -F'|' '
NR>1 {
  user=$1; acct=$2; cputime=$4;
  n = split(cputime, t, ":");
  if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
  if (n==2) secs = t[1]*60 + t[2];
  else      secs = t[1]*3600 + t[2]*60 + t[3];
  total_hours[user,acct] += secs/3600;
}
END {
  for (key in total_hours) {
    split(key, a, SUBSEP);
    printf "%-12s %-10s %.1f CPU-hours\n", a[1], a[2], total_hours[key];
  }
}' | sort -k3 -rn
```
## Using sreport for Aggregated Views

### Account Utilization Summary

```shell
sreport cluster UserUtilizationByAccount \
  start=2026-01-01 \
  end=2026-01-31 \
  accounts=qcs,albus,bali,genius,odin
```

Output shows:

- CPU-minutes used per account
- Job count
- User breakdown
### Top Users by Account

```shell
sreport user TopUsage start=2026-01-01 end=now accounts=odin
```
### Account Statistics

```shell
sreport cluster AccountUtilizationByAccount \
  start=2026-01-01 \
  end=2026-01-31
```
## Cost Estimation

### CPU-Hour Based Pricing

Assuming $0.40/CPU-hour:

```shell
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --format="Account%12,CPUTime%15" \
  --accounts=qcs,albus,bali,genius,odin \
  --parsable2 | awk -F'|' '
NR>1 {
  account=$1; cputime=$2;
  n = split(cputime, t, ":");
  if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
  if (n==2) secs = t[1]*60 + t[2];
  else      secs = t[1]*3600 + t[2]*60 + t[3];
  cpu_hours = secs/3600;
  total_hours[account] += cpu_hours;
  total_cost[account] += cpu_hours * 0.40;
}
END {
  printf "%-12s %-15s %-15s\n", "Account", "CPU-Hours", "Est. Cost";
  printf "%s\n", "-------------------------------------------";
  for (a in total_hours)
    printf "%-12s %-15.1f $%-14.2f\n", a, total_hours[a], total_cost[a];
}' | sort -k2 -rn
```
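As a sanity check on the rate arithmetic (the $0.40 figure is only an example rate; substitute your own):

```shell
# 1234.5 CPU-hours at $0.40/CPU-hour
awk 'BEGIN { printf "$%.2f\n", 1234.5 * 0.40 }'   # $493.80
```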
### Node-Hour Based Pricing

If tracking by node-hours (better for reserved capacity):

```shell
# Approximation: node-hours = CPUTime / CPUs-per-node
# (p5.48xlarge = 192 CPUs, c7i.8xlarge = 32; matching node names by
#  substring is an assumption -- adjust the pattern to your cluster)
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --format="Account%12,NodeList%30,CPUTime%15" \
  --parsable2 | awk -F'|' '
NR>1 {
  n = split($3, t, ":");
  if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
  secs = (n==2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
  cpus = ($2 ~ /p5/) ? 192 : 32;
  node_hours[$1] += (secs/3600) / cpus;
}
END { for (a in node_hours) printf "%s: %.1f node-hours\n", a, node_hours[a] }'
```
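A worked example of the approximation, checkable without cluster access:

```shell
# 384 CPU-hours billed on p5.48xlarge nodes (192 CPUs each)
# => 384 / 192 = 2 node-hours
awk 'BEGIN { printf "%.1f node-hours\n", 384 / 192 }'   # 2.0 node-hours
```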
## GPU Usage Reports

### GPU Jobs in Last Month

```shell
# Note: AllocGRES was replaced by AllocTRES in newer SLURM releases
sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --accounts=albus,bali,genius,odin \
  --format="Account%12,User%12,JobName%20,AllocGRES%25,CPUTime" \
  --parsable2 | grep -i gpu | head -20
```
### GPU Hours by Account (P5 Nodes)

```shell
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --accounts=albus,bali,genius,odin \
  --format="Account%12,CPUTime%15,AllocGRES%25" \
  --parsable2 | awk -F'|' '
$3 ~ /gpu/ {
  account=$1; cputime=$2;
  n = split(cputime, t, ":");
  if (t[1] ~ /-/) { split(t[1], d, "-"); t[1] = d[1]*24 + d[2]; }
  if (n==2) secs = t[1]*60 + t[2];
  else      secs = t[1]*3600 + t[2]*60 + t[3];
  # p5.48xlarge has 192 CPUs and 8 GPUs: GPU-hours = (CPU-hours / 192) * 8
  gpu_total[account] += (secs/3600) / 192 * 8;
}
END {
  for (a in gpu_total) printf "%s: %.1f GPU-hours\n", a, gpu_total[a];
}' | sort
```
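To verify the formula on a known case: a job that holds a full P5 node for one hour is billed 192 CPU-hours, which should come out as exactly 8 GPU-hours:

```shell
# 192 CPU-hours on a p5.48xlarge (192 CPUs, 8 GPUs) = 1 node-hour = 8 GPU-hours
awk 'BEGIN { secs = 192 * 3600; printf "%.1f GPU-hours\n", (secs/3600) / 192 * 8 }'
```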
## Failed Job Analysis

### Failures by Account (Last 30 Days)

```shell
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --state=FAILED,TIMEOUT,CANCELLED \
  --format="Account%12,User%12,JobName%20,State%15,ExitCode%8" \
  --parsable2 | awk -F'|' '
NR>1 { failed[$1,$4]++ }
END {
  for (key in failed) {
    split(key, a, SUBSEP);
    printf "%s (%s): %d failures\n", a[1], a[2], failed[key];
  }
}' | sort
```
### Failed Jobs by User

```shell
sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --state=FAILED \
  --format="User%12,Account%10,JobID%8,ExitCode" \
  --parsable2 | sort -u
```
## Export for Analysis

### To CSV (Excel/Sheets)

```shell
sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
  --format="Account,User,JobID,JobName,Start,State,CPUTime,AllocCPUS,ReqMem" \
  --parsable2 --delimiter=',' > usage_$(date +%Y%m%d).csv

# Then download and open in Excel
scp headnode:usage_*.csv ./
```
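If a file was already exported with the default `|` delimiter, it can be converted afterwards; the `printf` below stands in for the file contents. This simple `tr` pass is only safe when no field contains a comma itself (JobName can, so check first):

```shell
# Convert an existing pipe-delimited export to comma-separated
printf 'Account|User|State\nqcs|alice|COMPLETED\n' | tr '|' ','
```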
### Database Direct Query

```shell
ssh headnode

# Connect to the accounting database. Table and column names vary by
# SLURM version; the job table is typically named <cluster>_job_table.
mysql -u root slurm_acct_db -e "
SELECT
  account,
  user,
  COUNT(*) AS job_count,
  SUM(cpu_count) AS total_cpus,
  SUM(mem_alloc) AS total_mem_mb
FROM job_table
WHERE time_end > DATE_SUB(NOW(), INTERVAL 30 DAY)
  AND account IN ('qcs', 'albus', 'bali', 'genius', 'odin')
GROUP BY account, user
ORDER BY account, total_cpus DESC;
"
```
## Tracking Actual Utilization

For actual (not allocated) resource metrics:

### CloudWatch Metrics

```shell
# EC2 instance CPU utilization (statistics are space-separated)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-xxxx \
  --start-time 2026-01-01T00:00:00Z \
  --end-time 2026-01-31T23:59:59Z \
  --period 3600 \
  --statistics Average Maximum
```
### NVIDIA GPU Monitoring

```shell
# Check GPU utilization on p5 nodes (from the headnode)
ssh p5-node-1 "nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader"
```
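The `csv,noheader` output is easy to aggregate with awk. A sketch that averages utilization across GPUs; the `printf` lines stand in for real query output from the node:

```shell
# Average GPU utilization from nvidia-smi CSV output
printf '0, NVIDIA H100, 1024 MiB, 81559 MiB, 97 %%\n1, NVIDIA H100, 512 MiB, 81559 MiB, 3 %%\n' |
awk -F', ' '{ gsub(/ %/, "", $5); sum += $5; n++ }
            END { printf "avg GPU util: %.1f%%\n", sum/n }'
```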
### Job Profiling

Use SLURM epilog scripts to capture actual usage at job completion:

```shell
# In the epilog script ($SLURM_JOB_ID is set in the epilog environment):
ps aux | grep "$SLURM_JOB_ID"   # Check memory of any surviving processes
nvidia-smi                      # Check GPU activity
cat /proc/stat                  # Check cumulative CPU counters
```
## Key Fields Reference

| Field | Format | Notes |
|---|---|---|
| Account | Text | SLURM account/project name |
| User | Text | Username |
| JobID | Integer | SLURM job ID |
| JobName | Text | Job name from sbatch/srun |
| Start | Timestamp | Job start time |
| State | Text | COMPLETED, FAILED, CANCELLED, TIMEOUT, RUNNING, etc. |
| CPUTime | [DD-]HH:MM:SS | Allocated CPUs × Elapsed time |
| AllocCPUS | Integer | Number of CPUs allocated |
| ReqMem | Memory | Requested memory (MB by default) |
| AllocGRES | Text | Allocated generic resources (GPUs, etc.) |
| Elapsed | [DD-]HH:MM:SS | Actual job runtime |
| ExitCode | code:signal | 0:0 = success; non-zero code or signal = error |
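The two halves of `ExitCode` are worth splitting when triaging failures, since a non-zero signal (e.g. 9 from the OOM killer or `scancel`) means the job was killed rather than exiting with an error:

```shell
# Split sacct's "code:signal" ExitCode into its parts
echo "0:9" | awk -F: '{ printf "exit=%s signal=%s\n", $1, $2 }'   # exit=0 signal=9
```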
## Troubleshooting

### sacct shows no data

```shell
# Wait 5+ minutes for the initial sync from slurmctld to slurmdbd
sleep 300

# Then try again
sacct --brief | head -5

# Check whether slurmdbd received records
journalctl -u slurmdbd | tail -20

# Verify database connectivity (table name varies by cluster)
mysql -u root slurm_acct_db -e "SELECT COUNT(*) FROM job_table;"
```
### sreport not showing all accounts

```shell
# Accounts may not yet be synced to the database;
# try sacct instead for immediate results
sacct --accounts=qcs,albus,bali,genius,odin --brief

# If still missing, verify the account was created
/opt/slurm/bin/sacctmgr show account
```
### Missing a specific user's jobs

```shell
# Verify the user exists in SLURM
/opt/slurm/bin/sacctmgr show user <username>

# Check whether the user has an association on the account
/opt/slurm/bin/sacctmgr show user <username> WithAssoc
```
## Next Steps

- Account Management - Manage accounts and limits
- User Management - Manage users and permissions
- Setup & Architecture - Understand the system architecture