Operations · November 2025 · 14 min read
Reliability Playbook for Busy Slurm Queues
When thousands of researchers share the same cluster, your scheduler is the beating heart of the program. This playbook distills what high-performing labs do to keep Slurm 24.05/24.11 running smoothly even as workloads mix MPI, AI, and data pipelines.
Architectural north star
- Isolate control plane tiers—run slurmctld, slurmdbd, and their backing databases on redundant, protected nodes (a quick redundancy check is sketched after this list).
- Federate by intent—use multi-cluster Slurm Federation to separate internal research from external collaborations.
- Treat login/submit nodes as cattle with immutable images and automated patching.
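To back the first bullet with something checkable, here is a minimal sketch that verifies both controllers answer scontrol ping. The parsing assumes the usual "Slurmctld(primary) at host is UP" output, which can vary slightly between Slurm versions, and the exit codes are a local convention, not anything Slurm defines.

    #!/usr/bin/env python3
    """Control-plane redundancy check (illustrative sketch)."""
    import subprocess
    import sys

    def controller_status():
        # `scontrol ping` reports the primary and any backup slurmctld hosts.
        out = subprocess.run(["scontrol", "ping"], capture_output=True,
                             text=True, check=True).stdout
        status = {}
        for line in out.splitlines():
            # Expected shape (can vary by version): Slurmctld(primary) at ctl1 is UP
            parts = line.split()
            if len(parts) >= 5 and parts[0].startswith("Slurmctld"):
                role = parts[0].split("(")[1].rstrip(")")
                status[role] = (parts[2], parts[-1])
        return status

    if __name__ == "__main__":
        status = controller_status()
        if "backup" not in status:
            print("WARNING: no backup slurmctld configured")
            sys.exit(1)
        down = [f"{role}@{host}" for role, (host, state) in status.items() if state != "UP"]
        if down:
            print("CRITICAL: controller(s) down:", ", ".join(down))
            sys.exit(2)
        print("OK:", status)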
Queue & QoS design
- Pair QoS tiers (Enterprise, Research, Rapid) with explicit share targets and per-QoS Priority values, weighted by a budget-backed PriorityWeightQOS.
- Set PriorityFlags=MAX_TRES so each job's TRES factor tracks its dominant resource and single users cannot quietly dominate GPUs or CPUs.
- Use heterogeneous jobs or partition-specific QoS to keep AI and MPI needs separate.
Slurm's multifactor priority plugin blends these signals into an effective priority, roughly:
EffectivePriority = (FairshareWeight × AccountShare) + (AgeWeight × WaitTimeHours) + (QOSWeight × BusinessValue) + (TRESWeight × RequestedGPU/CPU)
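To make the tiering concrete, here is a hedged sketch that seeds the three QoS tiers through sacctmgr. It assumes slurmdbd accounting is already wired up, and every priority, GPU cap, and wall limit below is a placeholder rather than a recommendation.

    #!/usr/bin/env python3
    """Seed the Enterprise/Research/Rapid QoS tiers (values are placeholders)."""
    import subprocess

    # name -> (priority, per-user GPU cap, wall-clock limit); all illustrative.
    TIERS = {
        "enterprise": ("10000", "gres/gpu=64", "7-00:00:00"),
        "research":   ("5000",  "gres/gpu=16", "2-00:00:00"),
        "rapid":      ("2000",  "gres/gpu=4",  "04:00:00"),
    }

    def sacctmgr(*args):
        # -i answers the commit prompt automatically.
        subprocess.run(["sacctmgr", "-i", *args], check=True)

    for name, (priority, gpu_cap, max_wall) in TIERS.items():
        # Adding a QoS that already exists is harmless, so don't fail on it.
        subprocess.run(["sacctmgr", "-i", "add", "qos", name], check=False)
        sacctmgr("modify", "qos", name, "set",
                 f"Priority={priority}",
                 f"MaxTRESPerUser={gpu_cap}",
                 f"MaxWall={max_wall}")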
Telemetry & alerting
- Prometheus exporters: slurm_exporter, node_exporter, DCGM and IPMI exporters, plus fabric-specific metrics.
- Event bus: stream job events into Kafka/Pulsar for chargeback and ChatOps.
- Alert policy: node health score < 70, queue wait-time P95 > SLA, GPU memory pressure > 90% with idle MIG slices.
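The wait-time alert needs a number to fire on. A minimal exporter sketch, assuming prometheus_client is installed; the metric name, port, and one-minute cadence are local choices, not anything slurm_exporter ships.

    #!/usr/bin/env python3
    """Expose per-partition P95 pending wait time for Prometheus (sketch)."""
    import subprocess
    import time
    from collections import defaultdict
    from datetime import datetime
    from statistics import quantiles

    from prometheus_client import Gauge, start_http_server

    WAIT_P95 = Gauge("slurm_pending_wait_seconds_p95",
                     "95th percentile wait of pending jobs", ["partition"])

    def sample():
        # %P=partition, %V=submit time, %T=state; --noheader keeps parsing simple.
        out = subprocess.run(["squeue", "--noheader", "-o", "%P|%V|%T"],
                             capture_output=True, text=True, check=True).stdout
        now = datetime.now()
        waits = defaultdict(list)
        for line in out.splitlines():
            partition, submit, state = line.strip().split("|")
            if state != "PENDING":
                continue
            waits[partition].append((now - datetime.fromisoformat(submit)).total_seconds())
        for partition, values in waits.items():
            p95 = quantiles(values, n=20)[-1] if len(values) > 1 else values[0]
            WAIT_P95.labels(partition=partition).set(p95)

    if __name__ == "__main__":
        start_http_server(9101)   # scrape target; pick any free port
        while True:
            sample()
            time.sleep(60)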
Runbook essentials
Admission control
Use sdiag for controller health, sprio snapshots for top requesters, and scontrol show config diffs to detect drift.
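For the drift check, a small sketch that diffs the live output of scontrol show config against a golden copy kept in the ops repo; the golden-copy path and the list of volatile lines are assumptions to adjust locally.

    #!/usr/bin/env python3
    """Detect scheduler config drift against a golden copy (sketch)."""
    import difflib
    import pathlib
    import subprocess
    import sys

    GOLDEN = pathlib.Path("/srv/ops/golden/scontrol-show-config.txt")  # hypothetical path
    VOLATILE = ("Configuration data as of", "BOOT_TIME", "NEXT_JOB_ID")

    def normalize(text):
        # Drop lines that legitimately change between snapshots.
        return [line for line in text.splitlines()
                if not line.strip().startswith(VOLATILE)]

    live = subprocess.run(["scontrol", "show", "config"],
                          capture_output=True, text=True, check=True).stdout
    diff = list(difflib.unified_diff(normalize(GOLDEN.read_text()), normalize(live),
                                     fromfile="golden", tofile="live", lineterm=""))
    if diff:
        print("\n".join(diff))
        sys.exit(1)   # non-zero exit wires neatly into cron or CI alerting
    print("No drift detected")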
Node quarantine
- Auto-drain when the health score dips below threshold (a minimal drain sketch follows this list).
- Run BMC diagnostics + firmware verification.
- Re-image using the golden OS/driver stack before returning to service.
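A minimal auto-drain sketch for the first step above; fetch_health_scores() stands in for whatever produces your 0-100 node score, while the drain itself uses standard scontrol syntax.

    #!/usr/bin/env python3
    """Fence nodes whose health score drops below the alerting threshold (sketch)."""
    import subprocess

    THRESHOLD = 70  # matches the alert policy in the telemetry section

    def fetch_health_scores():
        # Placeholder: pull per-node scores from your monitoring stack.
        return {"gpu-node-042": 55, "cpu-node-007": 93}

    def drain(node, score):
        subprocess.run([
            "scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason=health_score={score}<{THRESHOLD} auto-quarantine",
        ], check=True)

    for node, score in fetch_health_scores().items():
        if score < THRESHOLD:
            drain(node, score)
            print(f"Drained {node} (score {score}); open the quarantine runbook")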
Incident comms
Every major queue incident gets a <200-word update outlining impact, mitigation, and preventative action, always linking back to the relevant playbook entry here.
Dashboards we ship
- Utilization board: CPU, GPU, memory, network, and power budget, with alerts when a partition stays above 90% utilization for 6+ hours (a scripted version of that check follows this list).
- Wait-time board: P50/P90 per QoS with maintenance overlays.
- Health & tickets: node score histogram plus open tickets from ServiceNow/Jira.
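The utilization alert can also run as a scheduled check. A sketch against the Prometheus HTTP API, assuming requests is installed and treating the endpoint and the slurm_partition_cpu_utilization metric as placeholders for whatever your exporters actually expose.

    #!/usr/bin/env python3
    """Flag partitions that stayed above 90% utilization for 6+ hours (sketch)."""
    import requests

    PROM = "http://prometheus.internal:9090"   # hypothetical endpoint
    QUERY = 'min_over_time(slurm_partition_cpu_utilization[6h]) > 0.9'

    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        partition = result["metric"].get("partition", "unknown")
        value = float(result["value"][1])
        print(f"ALERT: partition {partition} above 90% for 6h (now {value:.0%})")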
Automation ideas
- Slow-drain windows before outages via maintenance reservations + job holds (see the reservation sketch after this list).
- Chargeback previews generated from slurmdbd accounting + energy metrics.
- Chaos drills that intentionally fence login nodes or racks.
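For the slow-drain idea, the mechanics are a single maintenance reservation; the name, node list, and times below are placeholders, and the flags follow standard scontrol create reservation syntax.

    #!/usr/bin/env python3
    """Lay a maintenance reservation over a rack ahead of an outage (sketch)."""
    import subprocess

    subprocess.run([
        "scontrol", "create", "reservation",
        "ReservationName=rack12_maint",    # placeholder name
        "StartTime=2025-12-01T06:00:00",
        "Duration=08:00:00",
        "Nodes=rack12-[001-032]",
        "Users=root",
        "Flags=MAINT,IGNORE_JOBS",
    ], check=True)

    # Jobs that cannot finish before StartTime are no longer backfilled onto
    # these nodes, so the rack drains itself without killing running work.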
KPI targets
- Controller failover <60 seconds.
- Job success >98.5% excluding user cancellations.
- Time-to-detect node failures <2 minutes.
- Documentation freshness: every runbook references a git commit from the last 30 days.
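The freshness KPI is easy to script if you read it as "each runbook was touched by a commit in the last 30 days". A sketch, assuming runbooks live as Markdown files under runbooks/ in the ops repo and the script runs from the repo root:

    #!/usr/bin/env python3
    """Flag runbooks whose last commit is older than 30 days (sketch)."""
    import pathlib
    import subprocess
    import time

    MAX_AGE = 30 * 24 * 3600
    stale = []
    for path in pathlib.Path("runbooks").glob("**/*.md"):
        # %ct = committer timestamp of the most recent commit touching this file.
        out = subprocess.run(["git", "log", "-1", "--format=%ct", "--", str(path)],
                             capture_output=True, text=True, check=True).stdout.strip()
        if not out or time.time() - int(out) > MAX_AGE:
            stale.append(str(path))

    if stale:
        print("Stale runbooks:", ", ".join(stale))
        raise SystemExit(1)
    print("All runbooks fresh")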
References
- Slurm 24.05 & 24.11 release notes.
- NVIDIA DCGM 3.x / AMD ROCm 6 monitoring guides.
- SC24 / ISC scheduler Birds-of-a-Feather sessions.
- TOP500 + MLPerf 2024 submissions.