Operations · November 2025 · 14 min read

Reliability Playbook for Busy Slurm Queues

When thousands of researchers share the same cluster, your scheduler is the beating heart of the program. This playbook distills what high-performing labs do to keep Slurm 24.05/24.11 running smoothly even as workloads mix MPI, AI, and data pipelines.

Architectural north star

  1. Isolate control-plane tiers: run slurmctld, slurmdbd, and the accounting database on redundant, protected nodes (a config sketch follows this list).
  2. Federate by intent: use multi-cluster Slurm federation to separate internal research from external collaborations.
  3. Treat login/submit nodes as cattle with immutable images and automated patching.
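
For the control plane, a minimal slurm.conf sketch, assuming a primary/backup controller pair on hosts ctl1/ctl2 and a shared state directory (all hostnames and paths here are placeholders):

    # First SlurmctldHost entry is the primary controller, the second the backup.
    SlurmctldHost=ctl1
    SlurmctldHost=ctl2
    # State must live on storage both controllers can reach.
    StateSaveLocation=/shared/slurm/state
    # Accounting runs through slurmdbd on its own protected node.
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd1

With this layout the backup controller takes over from shared state if the primary dies, and a slurmdbd outage degrades accounting without stalling scheduling.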

Queue & QoS design

Priority should be explicit and auditable. We compute an effective priority as

  EffectivePriority = (FairshareWeight × AccountShare)
                    + (AgeWeight × WaitTimeHours)
                    + (QOSWeight × BusinessValue)
                    + (TRESWeight × RequestedGPU/CPU)
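
These weights map onto Slurm's multifactor priority plugin in slurm.conf. A sketch with purely illustrative values, to be tuned per site:

    PriorityType=priority/multifactor
    PriorityDecayHalfLife=7-0            # how quickly historical usage decays
    PriorityWeightFairshare=100000       # FairshareWeight
    PriorityWeightAge=20000              # AgeWeight
    PriorityMaxAge=7-0                   # wait-time credit saturates after 7 days
    PriorityWeightQOS=50000              # QOSWeight
    PriorityWeightTRES=CPU=1000,GRES/gpu=5000   # TRESWeight

BusinessValue enters through each QoS's Priority value, set via sacctmgr (for example, sacctmgr modify qos where name=urgent set Priority=100, with "urgent" a hypothetical QoS name).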

Telemetry & alerting

Runbook essentials

Admission control

Use sdiag for controller health, sprio snapshots for top requesters, and scontrol show config diffs to detect drift.
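
A cron-able sketch of those three checks, assuming a snapshot directory, a previously approved config baseline, and syslog as the alert hook (all of these are placeholders):

    #!/usr/bin/env bash
    # Snapshot controller health and flag configuration drift.
    set -euo pipefail
    snapdir=/var/log/slurm-health
    ts=$(date +%F-%H%M)
    mkdir -p "$snapdir"

    sdiag > "$snapdir/sdiag-$ts.txt"      # RPC load, scheduler cycle stats
    sprio -l > "$snapdir/sprio-$ts.txt"   # per-job priority breakdown
    # Drop the volatile timestamp header so diffs only flag real changes.
    scontrol show config | grep -v '^Configuration data' > "$snapdir/config-$ts.txt"

    # Alert if the live config no longer matches the approved baseline.
    if ! diff -q "$snapdir/config-baseline.txt" "$snapdir/config-$ts.txt" >/dev/null; then
        logger -p user.warning "slurm: running config drifted from baseline"
    fi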

Node quarantine

  1. Auto-drain when a node's health score dips below threshold (see the sketch after this list).
  2. Run BMC diagnostics + firmware verification.
  3. Re-image using the golden OS/driver stack before returning to service.
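
A sketch of the drain step; the threshold, reason text, and score source are all illustrative, with a hypothetical health check passing its result in as arguments:

    #!/usr/bin/env bash
    # Quarantine a node whose health score fell below threshold.
    # Usage: quarantine.sh <node> <score>
    set -euo pipefail
    node=$1 score=$2 threshold=70
    if (( score < threshold )); then
        scontrol update NodeName="$node" State=DRAIN \
            Reason="health score $score < $threshold; quarantined"
    fi
    # After re-imaging (step 3), return the node to service:
    #   scontrol update NodeName=<node> State=RESUME

Draining lets running jobs finish while blocking new placements; hard failures warrant State=DOWN instead.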

Incident comms

Every major queue incident gets a sub-200-word update outlining impact, mitigation, and preventive action, always linking back to the relevant playbook entry here.

Dashboards we ship

Automation ideas

KPI targets
