Tutorial · 11 min read

Designing Resilient Slurm Jobs

Slurm 24.05/24.11 introduced better federation, burst-buffer integration, and cgroup v2 accounting. The flip side: jobs fail faster when policies tighten. Use this checklist to design scripts that thrive on modern clusters.

1. Start with intent

Question | Why it matters | Example
---------|----------------|--------
Business value? | Maps to QoS + PriorityWeightQOS | "Urgent experiment for launch readiness."
Non-negotiable resources? | Helps avoid over-allocating | 4 GH200 GPUs + 1 TB RAM
Failure handling? | Drives checkpointing + notifications | Retry 3x, then page the operator
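
Before writing a single directive, check what the cluster will actually let you request. A quick sketch using standard sacctmgr queries (field availability may vary with your accounting setup):

# QoS levels with their priority weight and per-user limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU

# Which accounts and QoS your user is associated with
sacctmgr show assoc user=$USER format=Account,Partition,QOS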

2. Reference template

#!/bin/bash
#SBATCH --job-name=climate-run
#SBATCH --account=climate_ops
#SBATCH --partition=gh-accelerated
#SBATCH --nodes=4
#SBATCH --gpus-per-node=4
#SBATCH --time=06:00:00
#SBATCH --qos=enterprise
#SBATCH --signal=B:USR1@120
#SBATCH --requeue

module purge
module load gcc/13.2 cuda/12.4 openmpi/5.0

export SCRATCH_DIR=${SLURM_TMPDIR:-/tmp}/climate_${SLURM_JOB_ID}  # SLURM_TMPDIR is typically site-defined; substitute your cluster's per-job scratch variable
mkdir -p "$SCRATCH_DIR"

checkpoint() {
  rsync -a --delete "$SCRATCH_DIR"/ "/lustre/project/checkpoints/${SLURM_JOB_ID}/"
}

trap checkpoint USR1 TERM

# Run the step in the background: bash only delivers trapped signals between
# foreground commands, so a foreground srun would postpone the checkpoint
# until after the step exits.
srun --cpu-bind=cores --gpu-bind=closest python run.py \
  --config configs/frontier-2024.yaml \
  --output "$SCRATCH_DIR" &
SRUN_PID=$!
wait "$SRUN_PID"   # returns early if USR1/TERM arrives and the trap runs
wait "$SRUN_PID"   # collect the step's real exit status after checkpointing

Highlights: --signal=B:USR1@120 sends SIGUSR1 to the batch shell (the B: prefix) 120 seconds before the time limit expires, leaving two minutes to checkpoint; per-job scratch keeps hot I/O on local tmpfs or burst buffers (the variable name is site-specific); the trap guarantees a checkpoint whenever the job is signaled; --requeue lets Slurm restart the job after preemption or node failure.
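
Because --requeue restarts the script from the top, make the run resumable. A minimal sketch, assuming a hypothetical --resume-from flag on run.py and the checkpoint path used by the trap above:

CKPT="/lustre/project/checkpoints/${SLURM_JOB_ID}"   # requeued jobs keep their job ID
RESUME_ARGS=()
# Slurm sets SLURM_RESTART_COUNT each time it requeues the job
if [[ ${SLURM_RESTART_COUNT:-0} -gt 0 && -d "$CKPT" ]]; then
  RESUME_ARGS=(--resume-from "$CKPT")   # hypothetical flag; adapt to your app
fi
srun python run.py "${RESUME_ARGS[@]}" --output "$SCRATCH_DIR"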

3. Heterogeneous jobs

# Component 0: CPU-only preprocessing
#SBATCH --nodes=1 --cpus-per-task=8
# A line reading "#SBATCH hetjob" separates components; groups number from 0
#SBATCH hetjob
# Component 1: GPU training
#SBATCH --nodes=4 --gpus-per-node=4

srun --het-group=0 prep-data.sh
srun --het-group=1 train-model.sh

Benefits: one queue wait, consistent environment variables, easy dependency tracking.
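
If the components should run side by side (for example, a CPU-side data service feeding the GPU trainer), launch both steps in the background inside the same allocation. A sketch with a hypothetical serve-data.sh:

srun --het-group=0 serve-data.sh &     # long-running CPU component
SERVE_PID=$!
srun --het-group=1 train-model.sh &    # GPU training component
TRAIN_PID=$!
wait "$TRAIN_PID"                      # completion is driven by training
kill "$SERVE_PID" 2>/dev/null || true  # stop the service once training ends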

4. Job arrays + dependency graph

#SBATCH --array=0-999%50
srun ./simulate_case.sh "$SLURM_ARRAY_TASK_ID"

# afterok on the array job ID waits for every task, not just this one
if [[ $SLURM_ARRAY_TASK_ID -eq 999 ]]; then
  sbatch --dependency="afterok:${SLURM_ARRAY_JOB_ID}" ./aggregate.sh
fi

Cap concurrency with %50 to protect shared filesystems. Use afterok for reduce steps that must see every task succeed, and afterany when partial results are acceptable.
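
When only a handful of tasks fail, rerun those indices rather than the whole array. A sketch that scrapes failed indices from sacct (assumes job accounting is enabled; $ARRAY_JOB_ID is the ID printed at submission, and ./array_job.sh stands in for your original submission script):

# Build a comma-separated list of failed task indices, e.g. "12,87,403"
FAILED=$(sacct -j "$ARRAY_JOB_ID" --state=FAILED -n -X -P --format=JobID \
  | awk -F'_' '/_/ {print $2}' | paste -sd, -)
[[ -n $FAILED ]] && sbatch --array="$FAILED" ./array_job.sh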

5. Observability hooks

START=$(date +%s)
...
END=$(date +%s)
curl -s -X POST https://observability.internal/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\":\"$SLURM_JOB_ID\",\"duration\":$((END-START))}"

Push metrics (duration, retry count, energy) to ClickHouse or TimescaleDB so dashboards stay accurate.
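
Slurm's accounting database already records several of these fields, so the payload can be enriched without extra instrumentation. A sketch (ConsumedEnergyRaw requires an AcctGatherEnergy plugin, and values for still-running steps may be incomplete):

# Elapsed time, peak RSS, and energy per step from accounting
sacct -j "$SLURM_JOB_ID" -n -P --format=JobID,Elapsed,MaxRSS,ConsumedEnergyRaw

# SLURM_RESTART_COUNT doubles as a retry counter for requeued jobs
echo "retries=${SLURM_RESTART_COUNT:-0}"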

6. Hardening checklist

- Set --time realistically and pair it with --signal so the job gets a checkpoint window.
- Make requeue behavior explicit: --requeue for resumable jobs, --no-requeue otherwise.
- Write hot I/O to per-job scratch and rsync results to durable storage on exit.
- Throttle arrays with %N and gate reduce steps on afterok/afterany dependencies.
- Emit metrics on every exit path, including failures and requeues.
- Smoke-test submissions before queueing for real (see the sketch below).
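
For the smoke test, sbatch can validate a script and estimate its start time without actually queueing it (assuming the template above is saved as climate-run.sh):

# Parse directives and report an expected start time, then discard the job
sbatch --test-only climate-run.sh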
