SLURM Usage Tracking & Reporting

Understanding Allocated vs Actual Usage

SLURM accounting tracks the resources requested and allocated to each job, not how much of them the job actually used.

What Is Tracked

Resource Tracked? Notes
CPUs ✅ Yes Allocated, not actual % utilization
Memory ✅ Yes Requested, not actual consumption
GPUs ✅ Yes Requested type and count
Walltime ✅ Yes Elapsed run time (not remaining time)
CPU % ❌ No Use CloudWatch or job monitoring
Memory % ❌ No Use CloudWatch or job monitoring
GPU % ❌ No Use NVIDIA tools or job monitoring
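
For example, the allocation-oriented fields for a single job can be checked with sacct (the job ID is a placeholder):

sacct -j <jobid> --format=JobID,AllocCPUS,ReqMem,CPUTime,Elapsed,State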

Tools for Usage Reporting

sacct - Per-Job Details (Better for Recent)

  • Reads accounting records directly on the headnode
  • Available immediately after job completion
  • Per-job details
  • No database sync delay
sacct [options]
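
For example, a quick snapshot of today's jobs (the field list is just one reasonable choice):

sacct --starttime today --format=JobID,User,Account,State,Elapsed,CPUTime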

sreport - Aggregated Reports (Better for Historical)

  • Queries accounting database
  • ~5 minute sync delay from job completion
  • Provides pre-aggregated summaries
  • Better for trend analysis
sreport [report] [options]

Quick Commands

Last 30 Days by Account (CPU-Hours)

ssh headnode

# -X counts only job allocations (not individual steps, which would double-count CPU time)
sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --accounts=qcs,albus,bali,genius,odin \
      --format="Account,State,CPUTime,AllocCPUS" \
      --parsable2 | awk -F'|' '
      NR>1 {
        account=$1; cputime=$3; cpus=$4;
        # CPUTime is [DD-]HH:MM:SS (or MM:SS); fold any day prefix into hours
        n = split(cputime, t, ":");
        if (n == 3 && split(t[1], d, "-") == 2) t[1] = d[1]*24 + d[2];
        secs = (n == 2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
        cpu_hours = secs/3600;

        total_jobs[account]++;
        total_cpu_hours[account] += cpu_hours;
        total_cpus[account] += cpus;
      }
      END {
        printf "%-12s %-12s %-15s %-10s\n", "Account", "Jobs", "CPU-Hours", "CPU-Count";
        printf "%s\n", "-----------------------------------------------";
        for (a in total_jobs)
          printf "%-12s %-12d %-15.1f %-10d\n", a, total_jobs[a], total_cpu_hours[a], total_cpus[a];
      }' | sort -k3 -rn

By User in Last 7 Days

sacct -X --starttime "$(date -d '7 days ago' '+%Y-%m-%d')" \
      --format="User,Account,CPUTime" \
      --parsable2 | awk -F'|' '
      NR>1 {
        user=$1; acct=$2; cputime=$3;
        # CPUTime is [DD-]HH:MM:SS (or MM:SS); fold any day prefix into hours
        n = split(cputime, t, ":");
        if (n == 3 && split(t[1], d, "-") == 2) t[1] = d[1]*24 + d[2];
        secs = (n == 2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
        cpu_hours = secs/3600;

        total_hours[user,acct] += cpu_hours;
      }
      END {
        for (key in total_hours) {
          split(key, a, SUBSEP);
          printf "%-12s %-10s %.1f CPU-hours\n", a[1], a[2], total_hours[key];
        }
      }' | sort -k3 -rn

Using sreport for Aggregated Views

Account Utilization Summary

sreport cluster UserUtilizationByAccount \
  start=2026-01-01 \
  end=2026-01-31 \
  accounts=qcs,albus,bali,genius,odin

Output shows:

  • CPU-minutes used per account (minutes are the default unit; see the hours example below)
  • Job count
  • User breakdown
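
sreport reports time in minutes by default; the -t flag switches the unit, for example:

sreport -t Hours cluster UserUtilizationByAccount \
  start=2026-01-01 \
  end=2026-01-31 \
  accounts=qcs,albus,bali,genius,odin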

Top Users by Account

sreport user TopUsage start=2026-01-01 end=now accounts=odin

Account Statistics

sreport cluster AccountUtilizationByAccount \
  start=2026-01-01 \
  end=2026-01-31

Cost Estimation

CPU-Hour Based Pricing

Assuming $0.40/CPU-hour:

sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --format="Account,CPUTime" \
      --accounts=qcs,albus,bali,genius,odin \
      --parsable2 | awk -F'|' '
      NR>1 {
        account=$1; cputime=$2;
        n = split(cputime, t, ":");
        if (n == 3 && split(t[1], d, "-") == 2) t[1] = d[1]*24 + d[2];
        secs = (n == 2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
        cpu_hours = secs/3600;

        total_hours[account] += cpu_hours;
        total_cost[account] += cpu_hours * 0.40;
      }
      END {
        printf "%-12s %-15s %-15s\n", "Account", "CPU-Hours", "Est. Cost";
        printf "%s\n", "-------------------------------------------";
        for (a in total_hours)
          printf "%-12s %-15.1f $%-14.2f\n", a, total_hours[a], total_cost[a];
      }' | sort -k2 -rn   # sort on CPU-hours; the cost column carries a "$" prefix

Node-Hour Based Pricing

If tracking by node-hours (better for reserved capacity), a rough sketch (this assumes node names contain the instance family, e.g. p5-* vs c7i-*):

sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --format="Account,NodeList,CPUTime" \
      --parsable2 | awk -F'|' '
      NR>1 {
        # Approximation: node-hours = CPU-hours / CPUs per node
        # p5.48xlarge = 192 vCPUs, c7i.8xlarge = 32 vCPUs
        n = split($3, t, ":");
        if (n == 3 && split(t[1], d, "-") == 2) t[1] = d[1]*24 + d[2];
        secs = (n == 2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
        cpus_per_node = ($2 ~ /p5/) ? 192 : 32;
        node_hours[$1] += (secs/3600) / cpus_per_node;
      }
      END { for (a in node_hours) printf "%s: %.1f node-hours\n", a, node_hours[a]; }'

GPU Usage Reports

GPU Jobs in Last Month

# Note: AllocGRES has been folded into AllocTRES on newer SLURM releases
sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --accounts=albus,bali,genius,odin \
      --format="Account,User,JobName,AllocGRES,CPUTime" \
      --parsable2 | grep -i gpu | head -20

GPU Hours by Account (P5 Nodes)

sacct -X --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --accounts=albus,bali,genius,odin \
      --format="Account,CPUTime,AllocGRES" \
      --parsable2 | awk -F'|' '
      $3 ~ /gpu/ {
        account=$1; cputime=$2;
        n = split(cputime, t, ":");
        if (n == 3 && split(t[1], d, "-") == 2) t[1] = d[1]*24 + d[2];
        secs = (n == 2) ? t[1]*60 + t[2] : t[1]*3600 + t[2]*60 + t[3];
        # P5 has 8 GPUs and 192 vCPUs, so GPU-hours = (CPU-hours / 192) * 8
        gpu_hours = (secs/3600) / 192 * 8;

        gpu_total[account] += gpu_hours;
      }
      END {
        for (a in gpu_total)
          printf "%s: %.1f GPU-hours\n", a, gpu_total[a];
      }' | sort
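
If GPU TRES accounting is enabled (AccountingStorageTRES includes gres/gpu), sreport can report GPU time directly instead of approximating it from CPUTime:

sreport -t Hours cluster AccountUtilizationByAccount \
  start=2026-01-01 end=2026-01-31 \
  --tres=gres/gpu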

Failed Job Analysis

Failures by Account (Last 30 Days)

sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --state=FAILED,TIMEOUT,CANCELLED \
      --format="Account%12,User%12,JobName%20,State%15,ExitCode%8" \
      --parsable2 | awk -F'|' '
      NR>1 {
        account=$1; state=$4;
        failed[account,state]++;
        total[account]++;
      }
      END {
        for (key in failed) {
          split(key, a, SUBSEP);
          printf "%s (%s): %d failures\n", a[1], a[2], failed[key];
        }
      }' | sort

Failed Jobs by User

sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --state=FAILED \
      --format="User,Account,JobID,ExitCode" \
      --parsable2 | sort

Export for Analysis

To CSV (Excel/Sheets)

sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" \
      --format="Account,User,JobID,JobName,Start,State,CPUTime,AllocCPUS,ReqMem" \
      --parsable2 --delimiter=',' > usage_$(date +%Y%m%d).csv

# Then download and open in Excel
scp headnode:usage_*.csv ./
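
Newer SLURM releases (21.08 and later) can also emit JSON, which is handy for scripted analysis:

sacct --starttime "$(date -d '30 days ago' '+%Y-%m-%d')" --json > usage_$(date +%Y%m%d).json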

Database Direct Query

ssh headnode

# Connect to the accounting database.
# Note: slurmdbd stores jobs in a per-cluster table named <clustername>_job_table,
# the user is recorded as a numeric uid in id_user, time_end is a Unix timestamp,
# and CPU/memory allocations live in the tres_alloc string on newer schemas.
# Adjust the table and column names to match your SLURM version.
mysql -u root slurm_acct_db -e "
  SELECT
    account,
    id_user,
    COUNT(*) AS job_count
  FROM <clustername>_job_table
  WHERE time_end > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 30 DAY))
    AND account IN ('qcs', 'albus', 'bali', 'genius', 'odin')
  GROUP BY account, id_user
  ORDER BY account, job_count DESC;
"

Tracking Actual Utilization

For actual (not allocated) resource metrics:

CloudWatch Metrics

# EC2 instance CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-xxxx \
  --start-time 2026-01-01T00:00:00Z \
  --end-time 2026-01-31T23:59:59Z \
  --period 3600 \
  --statistics Average,Maximum

NVIDIA GPU Monitoring

# Check GPU utilization on p5 nodes (from headnode)
ssh p5-node-1 "nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader"

Job Profiling

Use SLURM epilog scripts to capture actual usage at job completion:

# In the epilog script (SLURM exports SLURM_JOB_ID, SLURM_JOB_USER, etc.):
ps aux | grep "$SLURM_JOB_USER"   # processes still owned by the job's user
nvidia-smi                        # snapshot of GPU activity on the node
cat /proc/stat                    # raw CPU counters
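
A minimal epilog sketch along these lines (the script path and log location below are assumptions; the cluster would point the Epilog parameter in slurm.conf at it):

#!/bin/bash
# usage_epilog.sh - hypothetical epilog; SLURM exports SLURM_JOB_ID and SLURM_JOB_USER
LOG=/var/log/slurm/job_usage.log   # assumed log location
{
  echo "=== job ${SLURM_JOB_ID} (${SLURM_JOB_USER}) ended $(date) ==="
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader 2>/dev/null
  grep '^cpu ' /proc/stat
} >> "$LOG"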

Key Fields Reference

Field Format Notes
Account Text SLURM account/project name
User Text Username
JobID Integer SLURM job ID
JobName Text Job name from sbatch/srun
Start Timestamp Job start time
State Text COMPLETED, FAILED, CANCELLED, TIMEOUT, RUNNING, etc.
CPUTime [DD-]HH:MM:SS Allocated CPUs × Elapsed time
AllocCPUS Integer Number of CPUs allocated
ReqMem Memory Requested memory (reported with a unit suffix)
AllocGRES Text Allocated generic resources (GPUs, etc.); AllocTRES on newer releases
Elapsed [DD-]HH:MM:SS Actual job runtime
ExitCode code:signal 0:0 = success, non-zero = error

Troubleshooting

sacct shows no data

# Wait 5+ minutes for initial sync from slurmctld to slurmdbd
sleep 300

# Then try again
sacct --brief | head -5

# Check if slurmdbd received records
journalctl -u slurmdbd | tail -20

# Verify database connectivity
mysql -u root slurm_acct_db -e "SHOW TABLES LIKE '%_job_table';"

sreport not showing all accounts

# Accounts may not yet be synced to database
# Try sacct instead for immediate results
sacct --accounts=qcs,albus,bali,genius,odin --brief

# If still missing, verify account was created
/opt/slurm/bin/sacctmgr show account

Missing a specific user's jobs

# Verify user exists in SLURM
/opt/slurm/bin/sacctmgr show user <username>

# Check if user has permission on account
/opt/slurm/bin/sacctmgr show user <username> WithAssoc

Next Steps