SLURM Accounting Setup & Architecture
System Architecture
Components
1. SLURM Controller (slurmctld)
- Role: Main cluster controller, runs on headnode
- Responsibility: Job scheduling, resource allocation, writes accounting logs
- Log File: /var/log/slurm/accounting.log (or configured path)
- Status Check:
systemctl is-active slurmctld
2. SLURM Database Daemon (slurmdbd)
- Role: Accounting database interface
- Port: 6819 (localhost)
- Database: MariaDB backend (slurm_acct_db)
- Function:
- Receives accounting data from slurmctld
- Syncs logs to database (~5 minute interval)
- Responds to queries from sacct and sreport
- Status Check:
systemctl is-active slurmdbd
netstat -tulpn | grep 6819 # Verify port is listening
3. MariaDB Database
- Database Name: slurm_acct_db
- User: slurm_acct_db (or configured user)
- Key Tables:
- accounts_table - Account definitions
- users_table - User to account mappings
- jobs_table - Job records
- event_table - System events
- Status Check:
systemctl is-active mariadb
mysql -u root -e "SHOW DATABASES LIKE 'slurm%';"
4. Compute Nodes (slurmd)
- Role: Job execution agents
- Responsibility: Report job resource usage back to controller
- Status Check:
sinfo -Ne # Check all nodes
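The status checks above can be run in one pass. The following is a minimal sketch, not part of the original setup: `report_status` and `check_accounting_stack` are hypothetical helper names, and the unit names and port number are the ones listed in this section.

```shell
# report_status LABEL CMD [ARGS...]
# Runs CMD quietly and prints "LABEL: OK" or "LABEL: FAIL".
report_status() {
  rs_label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf '%s: OK\n' "$rs_label"
  else
    printf '%s: FAIL\n' "$rs_label"
    return 1
  fi
}

# Hypothetical wrapper: runs each component check from this section and
# returns non-zero if any of them failed.
check_accounting_stack() {
  cas_rc=0
  report_status slurmctld systemctl is-active slurmctld || cas_rc=1
  report_status slurmdbd systemctl is-active slurmdbd || cas_rc=1
  report_status mariadb systemctl is-active mariadb || cas_rc=1
  report_status "port 6819" sh -c 'ss -tln | grep -q ":6819[[:space:]]"' || cas_rc=1
  # sinfo succeeding only shows the controller answered; it does not
  # prove that every node is healthy.
  report_status "sinfo" sinfo -Ne || cas_rc=1
  return "$cas_rc"
}
```

Calling `check_accounting_stack` on the headnode prints one line per component, which makes it easy to see which layer to look at first.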
Data Flow
Job Submitted via sbatch
                   │
                   ▼
┌──────────────────────────────────────┐
│ slurmctld (scheduler)                │
│ - Allocates resources                │
│ - Starts job                         │
│ - Writes to accounting               │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ slurmd (compute nodes)               │
│ - Executes job                       │
│ - Reports resource usage             │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ slurmdbd (database daemon)           │
│ - Receives job completion data       │
│ - Syncs to MariaDB (~5 min)          │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ MariaDB (accounting database)        │
│ - Stores job history                 │
│ - Tracks resource usage by account   │
│ - Enables reporting                  │
└──────────────────────────────────────┘
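To follow a single job through this pipeline, capture its job ID at submission and query accounting for it afterwards. A minimal sketch, relying on the `sbatch --parsable` output format (the job ID, optionally followed by `;` and the cluster name); `job_id_from_sbatch` is a hypothetical helper name and `job.sh` is a placeholder script.

```shell
# job_id_from_sbatch OUTPUT
# `sbatch --parsable` prints "JOBID" or "JOBID;CLUSTER"; keep only the job ID.
job_id_from_sbatch() {
  printf '%s\n' "${1%%;*}"
}

# Usage (needs a working cluster, so left commented out):
#   jobid=$(job_id_from_sbatch "$(sbatch --parsable job.sh)")
#   sacct -j "$jobid" --format=JobID,Account,State,Elapsed
```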
Terraform Configuration
The accounting system is configured in the victoria-compute-utilities-module with:
Key Variables
variable "slurm_projects" {
description = "List of SLURM projects/accounts to create"
type = list(string)
default = [] # Empty by default (project-specific)
}
variable "slurm_accounts" {
description = "Account configuration with CPU limits"
type = map(object({
name = string
description = string
max_cpus_per_user = number
max_jobs_per_user = number
}))
}
variable "slurm_users" {
description = "Users and their account assignments"
type = map(object({
username = string
accounts = list(string)
default_acc = string
}))
}
ODIN Configuration
In odin/terraform/main.tf:
module "victoria_compute" {
# ... other config
slurm_projects = ["qcs", "albus", "bali", "genius", "odin"]
slurm_accounts = {
qcs = {
name = "qcs"
description = "General purpose CPU work"
max_cpus_per_user = 320
max_jobs_per_user = 100
}
albus = {
name = "albus"
description = "GPU research account"
max_cpus_per_user = 384
max_jobs_per_user = 50
}
# ... other accounts
}
}
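How the module turns these values into accounting records is not shown here. As a rough manual equivalent, the sketch below prints (does not run) `sacctmgr` commands for one account and one user. The function names, the user `alice`, and the mapping of `max_jobs_per_user` to the association limit `MaxJobs` are all assumptions; the CPU limit is left out because whether it belongs on an association or a QOS depends on site policy.

```shell
# account_cmd NAME DESCRIPTION
# Prints the sacctmgr command that would create the account.
account_cmd() {
  printf 'sacctmgr -i add account %s Description="%s"\n' "$1" "$2"
}

# user_cmd USER ACCOUNT MAX_JOBS
# Prints the sacctmgr command that would add USER to ACCOUNT with a
# running-job limit (assumed mapping of max_jobs_per_user).
user_cmd() {
  printf 'sacctmgr -i add user %s Account=%s MaxJobs=%s\n' "$1" "$2" "$3"
}

# Values copied from the qcs entry above; "alice" is a made-up user.
account_cmd qcs "General purpose CPU work"
user_cmd alice qcs 100
```

Printing the commands first gives a chance to review them before piping them to a shell on the headnode.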
Installation Checklist
- MariaDB installed and running
- slurmdbd service enabled and running
- slurm_acct_db database created with proper permissions
- slurm accounting log file configured in slurm.conf
- Accounting storage plugin configured in slurmdbd.conf
- Default accounts created (qcs, albus, bali, genius, odin)
- Users provisioned and assigned to accounts
- Port 6819 open for local connections
- slurmd configured to report job accounting to slurmctld
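Several checklist items come down to a setting being present in a config file. A minimal sketch for reading one `Key=Value` setting; `conf_value` is a hypothetical helper, and the `/etc/slurm/` paths in the usage comment are assumed locations that vary by install.

```shell
# conf_value KEY FILE
# Prints the value of the first uncommented KEY=VALUE line in FILE.
# Keys are compared in lower case because Slurm treats them as
# case-insensitive. Prints nothing if the key is absent.
conf_value() {
  awk -F= -v key="$1" '
    /^[ \t]*#/ { next }                 # skip comment lines
    NF < 2     { next }                 # skip lines without "="
    {
      k = $1
      gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim whitespace around the key
      if (tolower(k) == tolower(key)) {
        sub(/^[^=]*=[ \t]*/, "")        # drop the "KEY=" prefix
        print
        exit
      }
    }
  ' "$2"
}

# Usage (paths are assumptions):
#   conf_value AccountingStorageType /etc/slurm/slurm.conf
#   conf_value StorageType /etc/slurm/slurmdbd.conf
```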
Troubleshooting
slurmdbd won't start
# Check logs
journalctl -u slurmdbd -n 50
tail -100 /var/log/slurm/slurmdbd.log
# Verify database exists and is accessible
mysql -u slurm_acct_db -p -e "USE slurm_acct_db; SHOW TABLES;"
# Try manual start with debug output
sudo /opt/slurm/sbin/slurmdbd -Dvvvv
# Check for PID file directory
sudo mkdir -p /run/slurm
sudo chown slurm:slurm /run/slurm
sudo chmod 755 /run/slurm
Connection refused on 6819
# Verify slurmdbd is running and listening
netstat -tulpn | grep 6819
ss -tulpn | grep 6819
# Restart the service
sudo systemctl restart slurmdbd
# Check firewall (if applicable)
sudo iptables -L -n | grep 6819
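When scripting this check, matching on the address column is more reliable than a bare `grep 6819`, which also matches PIDs or other ports that contain those digits. A minimal sketch; `port_is_listening` is a hypothetical helper that reads `ss -tln` or `ss -tuln` output on stdin.

```shell
# port_is_listening PORT
# Reads ss output on stdin and succeeds if some LISTEN line has an
# address field ending in ":PORT".
port_is_listening() {
  awk -v port="$1" '
    /LISTEN/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ (":" port "$")) found = 1
    }
    END { exit !found }
  '
}

# Usage:
#   ss -tln | port_is_listening 6819 && echo "slurmdbd port is open"
```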
sacct showing no data
# Wait 5+ minutes for initial sync
sleep 300
sacct --brief
# Check if slurmdbd received job records
journalctl -u slurmdbd | grep -i "job received"
# Verify database has data (table names can differ by Slurm version and cluster; list them with SHOW TABLES)
mysql -u slurm_acct_db -p slurm_acct_db -e "SELECT COUNT(*) FROM jobs_table;"
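Rather than a fixed `sleep 300`, a script can poll until the record shows up. A minimal sketch; `wait_for_record` is a hypothetical helper that reruns any command until it prints something or the attempts run out.

```shell
# wait_for_record TRIES DELAY CMD [ARGS...]
# Reruns CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Prints CMD's output and returns 0 as soon as that output is non-empty;
# returns 1 if every attempt printed nothing.
wait_for_record() {
  wfr_tries=$1
  wfr_delay=$2
  shift 2
  while [ "$wfr_tries" -gt 0 ]; do
    wfr_out=$("$@" 2>/dev/null)
    if [ -n "$wfr_out" ]; then
      printf '%s\n' "$wfr_out"
      return 0
    fi
    wfr_tries=$((wfr_tries - 1))
    if [ "$wfr_tries" -gt 0 ]; then
      sleep "$wfr_delay"
    fi
  done
  return 1
}

# Usage (10 attempts, 30 seconds apart, roughly the 5 minute window):
#   wait_for_record 10 30 sacct -j "$jobid" --noheader --parsable2 --format=JobID,State
```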
Performance Notes
- Initial accounting database sync: ~5 minutes after job completion
- sacct (log-based): Immediate (no database delay)
- sreport (database): 5+ minute delay for new records
- For real-time monitoring use sstat (job statistics while running)
Next Steps
- Account Management - Create and manage new accounts
- User Management - Provision and manage users
- Usage Tracking - Query accounting data