Tutorial · 11 min read
Designing Resilient Slurm Jobs
Slurm 24.05/24.11 introduced better federation, burst buffer integration, and cgroup v2 accounting. The flip side: as sites tighten those policies, sloppy jobs fail faster. Use this checklist to design scripts that thrive on modern clusters.
1. Start with intent
| Question | Why it matters | Example |
|---|---|---|
| Business value? | Maps to QoS + PriorityWeightQOS | “Urgent experiment for launch readiness.” |
| Non-negotiable resources? | Helps avoid over-allocating | 4 GH200 GPUs + 1 TB RAM |
| Failure handling? | Drives checkpointing + notifications | Retry 3x then page operator |
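Before hard-coding a QoS or partition, it helps to confirm what the cluster actually offers. A minimal sketch using standard accounting commands (the partition name below is the one from the template in the next section; exact field names can vary slightly by Slurm version):

```bash
# List the QoS levels you can request, with their priorities and limits.
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRES

# Inspect the target partition's limits before committing to it in a script.
scontrol show partition gh-accelerated
```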
2. Reference template
```bash
#!/bin/bash
#SBATCH --job-name=climate-run
#SBATCH --account=climate_ops
#SBATCH --partition=gh-accelerated
#SBATCH --nodes=4
#SBATCH --gpus-per-node=4
#SBATCH --time=06:00:00
#SBATCH --qos=enterprise
#SBATCH --signal=B:USR1@120     # warn the batch shell 120 s before the limit
#SBATCH --requeue

module purge
module load gcc/13.2 cuda/12.4 openmpi/5.0

# Node-local scratch; the exact variable name is site-specific.
export SCRATCH_DIR="${SLURM_JOBTMP}/climate_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH_DIR"

# Copy scratch contents to durable storage; invoked on USR1/TERM.
checkpoint() {
    rsync -a --delete "$SCRATCH_DIR/" "/lustre/project/checkpoints/${SLURM_JOB_ID}/"
}
trap checkpoint USR1 TERM

srun --cpu-bind=cores --gpu-bind=closest python run.py \
    --config configs/frontier-2024.yaml \
    --output "$SCRATCH_DIR"
```
Highlights: `--signal=B:USR1@120` gives the batch script two minutes to checkpoint before preemption or timeout; `$SLURM_JOBTMP` points at node-local tmpfs/burst-buffer scratch (the variable is site-specific; many clusters expose `$TMPDIR` or `$SLURM_TMPDIR` instead); `trap checkpoint USR1 TERM` ensures a graceful exit.
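One gotcha with `B:`-scoped signals: bash only runs a trap after the current foreground command returns, so a foreground `srun` can swallow the two-minute warning. A minimal variant (same invocation as above, with the trap redefined to also stop the step) launches the step in the background so the handler fires promptly:

```bash
# Launch the step in the background so bash can run the trap immediately;
# a foreground srun would delay the trap until the step had already ended.
srun --cpu-bind=cores --gpu-bind=closest python run.py \
    --config configs/frontier-2024.yaml \
    --output "$SCRATCH_DIR" &
SRUN_PID=$!

# On the warning signal: checkpoint, then ask the step to shut down.
trap 'checkpoint; kill "$SRUN_PID" 2>/dev/null' USR1 TERM

# wait returns when srun exits, or early (status > 128) when a trapped
# signal arrives; wait once more so srun is not left behind.
wait "$SRUN_PID"
wait "$SRUN_PID" 2>/dev/null || true
```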
3. Heterogeneous jobs
```bash
#SBATCH --nodes=1 --cpus-per-task=8     # component 0: CPU-only prep
#SBATCH hetjob
#SBATCH --nodes=4 --gpus-per-node=4     # component 1: GPU training

# Het-group indices start at 0; --het-group is an srun option, while the
# batch-script components are separated by the "#SBATCH hetjob" line above.
srun --het-group=0 prep-data.sh
srun --het-group=1 train-model.sh
```
Benefits: one queue wait, consistent environment variables, easy dependency tracking.
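After submission, both components can be inspected as one allocation. A quick sketch (the script filename is hypothetical; het components are reported as `<jobid>+0`, `<jobid>+1`):

```bash
sbatch hetero-train.sbatch   # submits both components in a single request
squeue --me                  # components appear as <jobid>+0 and <jobid>+1
scontrol show job <jobid>    # per-component node and GPU details
```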
4. Job arrays + dependency graph
#SBATCH --array=0-999%50
srun ./simulate_case.sh $SLURM_ARRAY_TASK_ID
if [[ $SLURM_ARRAY_TASK_ID -eq 999 ]]; then
sbatch --dependency=afterok:${SLURM_ARRAY_JOB_ID} ./aggregate.sh
fi
Cap concurrency with `%50` to protect filesystems. Use `afterok`/`afterany` for reduce steps.
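The dependency can also be wired from the submit side, which avoids relying on task 999 surviving. A sketch assuming the array directives live in a hypothetical `simulate_array.sbatch`:

```bash
# --parsable makes sbatch print just the job ID, suitable for scripting.
ARRAY_ID=$(sbatch --parsable simulate_array.sbatch)

# aggregate.sh starts only after every array task exits 0.
sbatch --dependency=afterok:"${ARRAY_ID}" ./aggregate.sh
```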
5. Observability hooks
```bash
START=$(date +%s)
...
END=$(date +%s)

# Report wall-clock duration to the internal metrics endpoint.
curl -s -X POST https://observability.internal/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\":\"$SLURM_JOB_ID\",\"duration\":$((END-START))}"
```
Push metrics (duration, retry count, energy) to ClickHouse or Timescale so dashboards stay accurate.
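Energy and memory figures can come straight from accounting rather than hand instrumentation. A sketch (assumes the cluster has an `acct_gather_energy` plugin configured; the field is empty otherwise):

```bash
# ConsumedEnergyRaw reports joules when energy accounting is enabled.
ENERGY=$(sacct -j "$SLURM_JOB_ID" -X -n -P --format=ConsumedEnergyRaw)

curl -s -X POST https://observability.internal/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\":\"$SLURM_JOB_ID\",\"energy_joules\":${ENERGY:-0}}"
```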
6. Hardening checklist
- Validate modules/containers at the start of each job.
- Set `umask 027` for regulated data.
- Monitor `sacct -j $SLURM_JOB_ID --format=jobid,state,elapsed,maxrss` in CI.
- Store scripts in git and link the commit SHA in job output (a combined sketch follows).
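A minimal prologue covering those points, as a sketch (the recorded file paths are illustrative):

```bash
set -euo pipefail
umask 027                                   # regulated-data default

# Record exactly which modules were loaded alongside the job output.
module list 2>&1 | tee "${SLURM_SUBMIT_DIR}/modules-${SLURM_JOB_ID}.txt"

# Link the running job back to the exact script revision.
echo "commit: $(git -C "$SLURM_SUBMIT_DIR" rev-parse HEAD 2>/dev/null || echo unknown)"
```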
7. References
- Slurm 24.05 release notes (signal handling, cgroup v2, elastic compute).
- HPE Cray EX burst-buffer guides.
- Argonne + ORNL SC24 BoFs on job arrays and MIG scheduling.