Operations · November 2025 · 14 min read
Reliability Playbook for Busy Slurm Queues
When thousands of researchers share the same cluster, your scheduler is the beating heart of the program. This playbook distills what high-performing labs do to keep Slurm 24.05/24.11 running smoothly even as workloads mix MPI, AI, and data pipelines.
Architectural north star
- Isolate control plane tiers—run slurmctld, slurmdbd, and their backing databases on redundant, protected nodes (a quick redundancy check is sketched after this list).
- Federate by intent—use multi-cluster Slurm Federation to separate internal research from external collaborations.
- Treat login/submit nodes as cattle with immutable images and automated patching.
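To back the first bullet with something checkable, here is a minimal sketch that verifies both controllers answer scontrol ping. The parsing assumes the usual "Slurmctld(primary) at host is UP" output, which can vary slightly between Slurm versions, and the exit codes are a local convention, not anything Slurm defines.

    #!/usr/bin/env python3
    """Control-plane redundancy check (illustrative sketch)."""
    import subprocess
    import sys

    def controller_status():
        # `scontrol ping` reports the primary and any backup slurmctld hosts.
        out = subprocess.run(["scontrol", "ping"], capture_output=True,
                             text=True, check=True).stdout
        status = {}
        for line in out.splitlines():
            # Expected shape (can vary by version): Slurmctld(primary) at ctl1 is UP
            parts = line.split()
            if len(parts) >= 5 and parts[0].startswith("Slurmctld"):
                role = parts[0].split("(")[1].rstrip(")")
                status[role] = (parts[2], parts[-1])
        return status

    if __name__ == "__main__":
        status = controller_status()
        if "backup" not in status:
            print("WARNING: no backup slurmctld configured")
            sys.exit(1)
        down = [f"{role}@{host}" for role, (host, state) in status.items() if state != "UP"]
        if down:
            print("CRITICAL: controller(s) down:", ", ".join(down))
            sys.exit(2)
        print("OK:", status)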
Queue & QoS design
- Pair QoS tiers (Enterprise, Research, Rapid) with explicit share targets and per-QoS Priority values, weighted by a budget-backed PriorityWeightQOS.
- Set PriorityFlags=MAX_TRES so each job's TRES factor tracks its dominant resource and single users cannot quietly dominate GPUs or CPUs.
- Use heterogeneous jobs or partition-specific QoS to keep AI and MPI needs separate.
Slurm's multifactor priority plugin blends these signals into an effective priority, roughly:
EffectivePriority = (FairshareWeight × AccountShare) + (AgeWeight × WaitTimeHours) + (QOSWeight × BusinessValue) + (TRESWeight × RequestedGPU/CPU)
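To make the tiering concrete, here is a hedged sketch that seeds the three QoS tiers through sacctmgr. It assumes slurmdbd accounting is already wired up, and every priority, GPU cap, and wall limit below is a placeholder rather than a recommendation.

    #!/usr/bin/env python3
    """Seed the Enterprise/Research/Rapid QoS tiers (values are placeholders)."""
    import subprocess

    # name -> (priority, per-user GPU cap, wall-clock limit); all illustrative.
    TIERS = {
        "enterprise": ("10000", "gres/gpu=64", "7-00:00:00"),
        "research":   ("5000",  "gres/gpu=16", "2-00:00:00"),
        "rapid":      ("2000",  "gres/gpu=4",  "04:00:00"),
    }

    def sacctmgr(*args):
        # -i answers the commit prompt automatically.
        subprocess.run(["sacctmgr", "-i", *args], check=True)

    for name, (priority, gpu_cap, max_wall) in TIERS.items():
        # Adding a QoS that already exists is harmless, so don't fail on it.
        subprocess.run(["sacctmgr", "-i", "add", "qos", name], check=False)
        sacctmgr("modify", "qos", name, "set",
                 f"Priority={priority}",
                 f"MaxTRESPerUser={gpu_cap}",
                 f"MaxWall={max_wall}")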
Telemetry & alerting
- Prometheus exporters: slurm_exporter, node_exporter, DCGM and IPMI exporters, plus fabric-specific metrics.
- Event bus: stream job events into Kafka/Pulsar for chargeback and ChatOps.
- Alert policy: node health score < 70, queue wait-time P95 > SLA, GPU memory pressure > 90% with idle MIG slices.
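The wait-time alert needs a number to fire on. A minimal exporter sketch, assuming prometheus_client is installed; the metric name, port, and one-minute cadence are local choices, not anything slurm_exporter ships.

    #!/usr/bin/env python3
    """Expose per-partition P95 pending wait time for Prometheus (sketch)."""
    import subprocess
    import time
    from collections import defaultdict
    from datetime import datetime
    from statistics import quantiles

    from prometheus_client import Gauge, start_http_server

    WAIT_P95 = Gauge("slurm_pending_wait_seconds_p95",
                     "95th percentile wait of pending jobs", ["partition"])

    def sample():
        # %P=partition, %V=submit time, %T=state; --noheader keeps parsing simple.
        out = subprocess.run(["squeue", "--noheader", "-o", "%P|%V|%T"],
                             capture_output=True, text=True, check=True).stdout
        now = datetime.now()
        waits = defaultdict(list)
        for line in out.splitlines():
            partition, submit, state = line.strip().split("|")
            if state != "PENDING":
                continue
            waits[partition].append((now - datetime.fromisoformat(submit)).total_seconds())
        for partition, values in waits.items():
            p95 = quantiles(values, n=20)[-1] if len(values) > 1 else values[0]
            WAIT_P95.labels(partition=partition).set(p95)

    if __name__ == "__main__":
        start_http_server(9101)   # scrape target; pick any free port
        while True:
            sample()
            time.sleep(60)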
Runbook essentials
Admission control
Use sdiag for controller health, sprio snapshots for top requesters, and scontrol show config diffs to detect drift.
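For the drift check, a small sketch that diffs the live output of scontrol show config against a golden copy kept in the ops repo; the golden-copy path and the list of volatile lines are assumptions to adjust locally.

    #!/usr/bin/env python3
    """Detect scheduler config drift against a golden copy (sketch)."""
    import difflib
    import pathlib
    import subprocess
    import sys

    GOLDEN = pathlib.Path("/srv/ops/golden/scontrol-show-config.txt")  # hypothetical path
    VOLATILE = ("Configuration data as of", "BOOT_TIME", "NEXT_JOB_ID")

    def normalize(text):
        # Drop lines that legitimately change between snapshots.
        return [line for line in text.splitlines()
                if not line.strip().startswith(VOLATILE)]

    live = subprocess.run(["scontrol", "show", "config"],
                          capture_output=True, text=True, check=True).stdout
    diff = list(difflib.unified_diff(normalize(GOLDEN.read_text()), normalize(live),
                                     fromfile="golden", tofile="live", lineterm=""))
    if diff:
        print("\n".join(diff))
        sys.exit(1)   # non-zero exit wires neatly into cron or CI alerting
    print("No drift detected")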
Node quarantine
- Auto-drain when the health score dips below threshold (a minimal drain sketch follows this list).
- Run BMC diagnostics + firmware verification.
- Re-image using the golden OS/driver stack before returning to service.
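A minimal auto-drain sketch for the first step above; fetch_health_scores() stands in for whatever produces your 0-100 node score, while the drain itself uses standard scontrol syntax.

    #!/usr/bin/env python3
    """Fence nodes whose health score drops below the alerting threshold (sketch)."""
    import subprocess

    THRESHOLD = 70  # matches the alert policy in the telemetry section

    def fetch_health_scores():
        # Placeholder: pull per-node scores from your monitoring stack.
        return {"gpu-node-042": 55, "cpu-node-007": 93}

    def drain(node, score):
        subprocess.run([
            "scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason=health_score={score}<{THRESHOLD} auto-quarantine",
        ], check=True)

    for node, score in fetch_health_scores().items():
        if score < THRESHOLD:
            drain(node, score)
            print(f"Drained {node} (score {score}); open the quarantine runbook")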
Incident comms
Every major queue incident gets a <200-word update outlining impact, mitigation, and preventative action, always linking back to the relevant playbook entry here.
Dashboards we ship
- Utilization board: CPU, GPU, memory, network, and power budget, with alerts when a partition stays above 90% utilization for 6+ hours (a scripted version of that check follows this list).
- Wait-time board: P50/P90 per QoS with maintenance overlays.
- Health & tickets: node score histogram plus open tickets from ServiceNow/Jira.
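The utilization alert can also run as a scheduled check. A sketch against the Prometheus HTTP API, assuming requests is installed and treating the endpoint and the slurm_partition_cpu_utilization metric as placeholders for whatever your exporters actually expose.

    #!/usr/bin/env python3
    """Flag partitions that stayed above 90% utilization for 6+ hours (sketch)."""
    import requests

    PROM = "http://prometheus.internal:9090"   # hypothetical endpoint
    QUERY = 'min_over_time(slurm_partition_cpu_utilization[6h]) > 0.9'

    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        partition = result["metric"].get("partition", "unknown")
        value = float(result["value"][1])
        print(f"ALERT: partition {partition} above 90% for 6h (now {value:.0%})")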
Automation ideas
- Slow-drain windows before outages via maintenance reservations + job holds (see the reservation sketch after this list).
- Chargeback previews generated from slurmdbd accounting + energy metrics.
- Chaos drills that intentionally fence login nodes or racks.
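For the slow-drain idea, the mechanics are a single maintenance reservation; the name, node list, and times below are placeholders, and the flags follow standard scontrol create reservation syntax.

    #!/usr/bin/env python3
    """Lay a maintenance reservation over a rack ahead of an outage (sketch)."""
    import subprocess

    subprocess.run([
        "scontrol", "create", "reservation",
        "ReservationName=rack12_maint",    # placeholder name
        "StartTime=2025-12-01T06:00:00",
        "Duration=08:00:00",
        "Nodes=rack12-[001-032]",
        "Users=root",
        "Flags=MAINT,IGNORE_JOBS",
    ], check=True)

    # Jobs that cannot finish before StartTime are no longer backfilled onto
    # these nodes, so the rack drains itself without killing running work.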
KPI targets
- Controller failover <60 seconds.
- Job success >98.5% excluding user cancellations.
- Time-to-detect node failures <2 minutes.
- Documentation freshness: every runbook references a git commit from the last 30 days.
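The freshness KPI is easy to script if you read it as "each runbook was touched by a commit in the last 30 days". A sketch, assuming runbooks live as Markdown files under runbooks/ in the ops repo and the script runs from the repo root:

    #!/usr/bin/env python3
    """Flag runbooks whose last commit is older than 30 days (sketch)."""
    import pathlib
    import subprocess
    import time

    MAX_AGE = 30 * 24 * 3600
    stale = []
    for path in pathlib.Path("runbooks").glob("**/*.md"):
        # %ct = committer timestamp of the most recent commit touching this file.
        out = subprocess.run(["git", "log", "-1", "--format=%ct", "--", str(path)],
                             capture_output=True, text=True, check=True).stdout.strip()
        if not out or time.time() - int(out) > MAX_AGE:
            stale.append(str(path))

    if stale:
        print("Stale runbooks:", ", ".join(stale))
        raise SystemExit(1)
    print("All runbooks fresh")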
References
- Slurm 24.05 & 24.11 release notes.
- NVIDIA DCGM 3.x / AMD ROCm 6 monitoring guides.
- SC24 / ISC scheduler Birds-of-a-Feather sessions.
- TOP500 + MLPerf 2024 submissions.