Tutorial · 10 min read

MPI Launch Patterns That Scale

MPI still powers climate, CFD, and nuclear workloads—but the launch sequence now needs to respect heterogeneous nodes, multiple NICs, and accelerator libraries. Use these patterns to keep scaling past 10,000 ranks without babysitting runs.

1. Pick the right MPI stack

Update modules monthly, and keep them consistent: an MPI library built against one UCX or libfabric version but loaded alongside another is a top culprit for hangs at scale.
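A quick sanity check before a large run, assuming an Open MPI build on a UCX/libfabric system (the grep pattern is illustrative):

module list 2>&1 | grep -iE 'mpi|ucx|libfabric'   # what is actually loaded
mpirun --version                                  # MPI implementation and version
ucx_info -v                                       # UCX build version and configure flags
fi_info --version                                 # libfabric version

If the versions reported here disagree with what the MPI module was built against, fix that before touching any tuning knobs.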

2. Multi-NIC and fabric tuning

export UCX_TLS=rc,ud,sm,self                     # UCX transports: RC/UD verbs, shared memory, loopback
export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1         # limit UCX to these HCA ports
export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=1        # MPICH/OFI builds only: use provider-native atomics
export NCCL_COLLNET_ENABLE=1                     # NCCL: allow CollNet/SHARP offload for GPU collectives

Bind each rank to the NIC nearest its NUMA domain with UCX_NET_DEVICES rather than letting every rank open every port, and enable SHARPv3 or UCC for collective offload.
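One way to get per-rank NIC binding is a thin wrapper keyed off the launcher's local-rank variable. The sketch below is illustrative, assuming Slurm and the two Mellanox HCAs named in the exports above:

#!/bin/bash
# nic-bind.sh (hypothetical wrapper): even local ranks use mlx5_0, odd ranks use mlx5_2
local_rank=${SLURM_LOCALID:-0}
if (( local_rank % 2 == 0 )); then
  export UCX_NET_DEVICES=mlx5_0:1
else
  export UCX_NET_DEVICES=mlx5_2:1
fi
exec "$@"

Launched as srun ./nic-bind.sh ./solver ..., each rank then talks only through its nearer port.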

3. Launch templates

Slurm + Open MPI

#!/bin/bash
#SBATCH --job-name=mpi-sim
#SBATCH --nodes=256
#SBATCH --ntasks-per-node=8
#SBATCH --time=03:00:00

module load openmpi/5.0

# Launch via Slurm's PMIx integration; do not nest mpirun under srun
srun --mpi=pmix --cpu-bind=cores --kill-on-bad-exit=1 \
  ./solver --config configs/frontier.json
# Alternative: mpirun --map-by ppr:8:node --bind-to core --mca btl ^tcp ./solver --config configs/frontier.json

Use --kill-on-bad-exit so Slurm tears down the remaining tasks when any rank dies; --cpu-bind=cores (or mpirun's --map-by and --bind-to) keeps ranks evenly pinned instead of piling onto one socket.
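Before committing 256 nodes, confirm the binding on a small allocation. A minimal check, assuming the same modules are loaded:

# Print the CPU binding each task receives on two nodes
srun -N 2 --ntasks-per-node=8 --cpu-bind=verbose,cores hostname
# Or with Open MPI's own launcher:
# mpirun -n 16 --map-by ppr:8:node --bind-to core --report-bindings hostname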

Flux/Balsam orchestration

flux run -n 2048 -g 1 \
  --env UCX_TLS=rc_x,sm \
  --setattr=system.cwd=/lustre/project/run42 \
  ./solver.py --input case42.yaml
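Once the job is in, a couple of Flux commands are handy for keeping an eye on it (the job ID below is a placeholder):

flux jobs -a                 # list jobs in this Flux instance with state and runtime
flux job attach <jobid>      # stream stdout/stderr from a running or completed job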

4. Debug + profiling workflow

  1. Quick triage: scontrol show job, sacct, seff (see the sketch after this list).
  2. Network focus: ibstat, slingshot-adm, ethtool -S.
  3. Profilers: Nsight Systems, Omnitrace/ROC Profiler, Arm MAP.
  4. Replay: capture module lists + git commits via Spack environments or containers.
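A minimal triage pass for a job that looks hung, assuming Slurm; the job ID and sacct field list are illustrative:

JOBID=1234567                                    # hypothetical job ID
scontrol show job "$JOBID"                       # state, node list, reason codes
sacct -j "$JOBID" --format=JobID,State,Elapsed,NodeList,MaxRSS
seff "$JOBID"                                    # CPU/memory efficiency after the job ends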

5. Handling failures at scale

6. Compliance & reproducibility

References