SLURM Accounting Setup & Architecture

System Architecture

Components

1. SLURM Controller (slurmctld)

  • Role: Main cluster controller, runs on headnode
  • Responsibility: Schedules jobs, allocates resources, writes accounting logs
  • Log File: /var/log/slurm/accounting.log (or configured path)
  • Status Check:
    systemctl is-active slurmctld
    

2. SLURM Database Daemon (slurmdbd)

  • Role: Accounting database interface
  • Port: 6819 (localhost)
  • Database: MariaDB backend (slurm_acct_db)
  • Function:
    • Receives accounting data from slurmctld
    • Syncs logs to database (~5 minute interval)
    • Responds to queries from sacct and sreport
  • Status Check:
    systemctl is-active slurmdbd
    netstat -tulpn | grep 6819  # Verify port is listening
    

3. MariaDB Database

  • Database Name: slurm_acct_db
  • User: slurm_acct_db (or configured user)
  • Key Tables (per-cluster tables are prefixed with the cluster name):
    • acct_table - Account definitions
    • <cluster>_assoc_table - User to account mappings
    • <cluster>_job_table - Job records
    • <cluster>_event_table - System events
  • Status Check:
    systemctl is-active mariadb
    mysql -u root -e "SHOW DATABASES LIKE 'slurm%';"
    

4. Compute Nodes (slurmd)

  • Role: Job execution agents
  • Responsibility: Report job resource usage back to controller
  • Status Check:
    sinfo -Ne  # Check all nodes
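
The per-component status checks above can be combined into a single pass. The following is a minimal sketch, assuming the systemd unit names slurmctld, slurmdbd and mariadb used in this section; the function name check_services is illustrative and not part of SLURM.

```shell
#!/bin/sh
# Sketch: report the state of each accounting-related service in one pass.
# Assumes the systemd unit names used in this section.

check_services() {
  failed=0
  for svc in "$@"; do
    # is-active exits non-zero for stopped units, so ignore its exit status
    state=$(systemctl is-active "$svc" 2>/dev/null) || true
    if [ "$state" = "active" ]; then
      echo "OK   $svc"
    else
      echo "FAIL $svc (${state:-unknown})"
      failed=1
    fi
  done
  return "$failed"
}

check_services slurmctld slurmdbd mariadb || echo "one or more services are not active"
```

The function returns non-zero if any unit is not active, so it can also gate a deployment step or a cron alert.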
    

Data Flow

Job Submitted via sbatch
         │
         ▼
┌─────────────────────────┐
│   slurmctld (scheduler) │
│  - Allocates resources  │
│  - Starts job           │
│  - Writes to accounting │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────────┐
│   slurmd (compute nodes)    │
│  - Executes job             │
│  - Reports resource usage   │
└────────────┬────────────────┘
             │
             ▼
┌──────────────────────────────────┐
│   slurmdbd (database daemon)     │
│  - Receives job completion data  │
│  - Syncs to MariaDB (~5 min)     │
└────────────┬─────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│   MariaDB (accounting database)    │
│  - Stores job history              │
│  - Tracks resource usage by account│
│  - Enables reporting               │
└────────────────────────────────────┘

Terraform Configuration

The accounting system is configured through input variables of the victoria-compute-utilities-module.

Key Variables

variable "slurm_projects" {
  description = "List of SLURM projects/accounts to create"
  type        = list(string)
  default     = []  # Empty by default (project-specific)
}

variable "slurm_accounts" {
  description = "Account configuration with CPU limits"
  type = map(object({
    name               = string
    description        = string
    max_cpus_per_user  = number
    max_jobs_per_user  = number
  }))
}

variable "slurm_users" {
  description = "Users and their account assignments"
  type = map(object({
    username    = string
    accounts    = list(string)
    default_acc = string
  }))
}

ODIN Configuration

In odin/terraform/main.tf:

module "victoria_compute" {
  # ... other config
  
  slurm_projects = ["qcs", "albus", "bali", "genius", "odin"]
  
  slurm_accounts = {
    qcs = {
      name              = "qcs"
      description       = "General purpose CPU work"
      max_cpus_per_user = 320
      max_jobs_per_user = 100
    }
    albus = {
      name              = "albus"
      description       = "GPU research account"
      max_cpus_per_user = 384
      max_jobs_per_user = 50
    }
    # ... other accounts
  }
}
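
On the SLURM side, accounts and user associations are normally created with sacctmgr. Whether the module runs these exact commands is an assumption. The sketch below only prints the equivalent commands for two of the ODIN accounts so they can be reviewed first. The username alice and the helper names account_cmd and user_cmd are hypothetical. Per-user limits are left out, because how the module maps max_cpus_per_user and max_jobs_per_user onto SLURM limits is not shown here.

```shell
#!/bin/sh
# Sketch: print (do not run) sacctmgr commands matching the Terraform input.
# -i answers sacctmgr's confirmation prompt automatically.

account_cmd() {
  # $1 = account name, $2 = description
  printf 'sacctmgr -i add account %s Description="%s"\n' "$1" "$2"
}

user_cmd() {
  # $1 = username, $2 = account, $3 = default account
  printf 'sacctmgr -i add user %s Account=%s DefaultAccount=%s\n' "$1" "$2" "$3"
}

account_cmd qcs "General purpose CPU work"
account_cmd albus "GPU research account"
user_cmd alice qcs qcs   # "alice" is a hypothetical username
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without SLURM installed.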

Installation Checklist

  • MariaDB installed and running
  • slurmdbd service enabled and running
  • slurm_acct_db database created with proper permissions
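  • SLURM accounting log file path configured in slurm.conf
  • Accounting storage plugin configured: AccountingStorageType=accounting_storage/slurmdbd in slurm.conf, StorageType=accounting_storage/mysql in slurmdbd.conf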
  • Default accounts created (qcs, albus, bali, genius, odin)
  • Users provisioned and assigned to accounts
  • Port 6819 open for local connections
  • slurmd configured to report job accounting to slurmctld
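
One checklist item that is easy to verify mechanically is the presence of the default accounts. A minimal sketch, assuming sacctmgr is on the PATH and can reach slurmdbd; the function name missing_accounts is illustrative.

```shell
#!/bin/sh
# Sketch: list expected accounts that are absent from the accounting database.

missing_accounts() {
  # -n drops the header, -P gives unpadded output, one account per line
  existing=$(sacctmgr -nP show account format=Account 2>/dev/null) || existing=""
  for acct in "$@"; do
    # -x matches whole lines, -F treats the name as a literal string
    echo "$existing" | grep -qxF "$acct" || echo "$acct"
  done
}

missing_accounts qcs albus bali genius odin
```

An empty result means every expected account exists. If sacctmgr cannot be reached, all five names are printed.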

Troubleshooting

slurmdbd won't start

# Check logs
journalctl -u slurmdbd -n 50
tail -100 /var/log/slurm/slurmdbd.log

# Verify database exists and is accessible
mysql -u slurm_acct_db -p -e "USE slurm_acct_db; SHOW TABLES;"

# Try manual start with debug output
sudo /opt/slurm/sbin/slurmdbd -Dvvvv

# Ensure the PID file directory exists and is owned by the slurm user
sudo mkdir -p /run/slurm
sudo chown slurm:slurm /run/slurm
sudo chmod 755 /run/slurm

Connection refused on 6819

# Verify slurmdbd is running and listening
netstat -tulpn | grep 6819
ss -tulpn | grep 6819

# Restart the service
sudo systemctl restart slurmdbd

# Check firewall (if applicable)
sudo iptables -L -n | grep 6819

sacct showing no data

# Wait 5+ minutes for initial sync
sleep 300
sacct --brief

# Check if slurmdbd received job records
journalctl -u slurmdbd | grep -i "job received"

# Verify database has data (job tables are prefixed with the cluster name)
mysql -u slurm_acct_db -p slurm_acct_db -e "SELECT COUNT(*) FROM <cluster>_job_table;"
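
Instead of a fixed sleep, the wait can be expressed as a polling loop that returns as soon as the record is visible. A minimal sketch; the function name wait_for_job_record and its argument order are illustrative.

```shell
#!/bin/sh
# Sketch: poll sacct until a job record is visible, instead of a fixed sleep.

wait_for_job_record() {
  # $1 = job ID, $2 = timeout in seconds, $3 = poll interval in seconds
  waited=0
  while :; do
    # -X limits output to the job allocation, -n drops the header
    if [ -n "$(sacct -j "$1" -X -n --format=JobID 2>/dev/null)" ]; then
      echo "job $1 visible after ${waited}s"
      return 0
    fi
    [ "$waited" -ge "$2" ] && break
    sleep "$3"
    waited=$((waited + $3))
  done
  echo "job $1 not in accounting after ${waited}s" >&2
  return 1
}
```

For example, wait_for_job_record 12345 600 30 (with 12345 standing in for a real job ID) checks every 30 seconds for up to 10 minutes.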

Performance Notes

  • Initial accounting database sync: ~5 minutes after job completion
  • sacct: queries slurmdbd, so a new job appears only once its record has reached the database
  • sreport (database): 5+ minute delay for new records
  • For real-time monitoring, use sstat, which reports statistics for jobs that are still running

Next Steps