Tutorial · 10 min read

MPI Launch Patterns That Scale

MPI still powers climate, CFD, and nuclear workloads—but the launch sequence now needs to respect heterogeneous nodes, multiple NICs, and accelerator libraries. Use these patterns to keep scaling past 10,000 ranks without babysitting runs.

1. Pick the right MPI stack

Update modules monthly, and keep them consistent: an MPI library built against one UCX or libfabric version but loaded alongside another is a top culprit for hangs at scale.
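A quick sanity check before a large run, assuming an Open MPI build on a UCX/libfabric system (the grep pattern is illustrative):

module list 2>&1 | grep -iE 'mpi|ucx|libfabric'   # what is actually loaded
mpirun --version                                  # MPI implementation and version
ucx_info -v                                       # UCX build version and configure flags
fi_info --version                                 # libfabric version

If the versions reported here disagree with what the MPI module was built against, fix that before touching any tuning knobs.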

2. Multi-NIC and fabric tuning

export UCX_TLS=rc,ud,sm,self                     # UCX transports: RC/UD verbs, shared memory, loopback
export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1         # limit UCX to these HCA ports
export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=1        # MPICH/OFI builds only: use provider-native atomics
export NCCL_COLLNET_ENABLE=1                     # NCCL: allow CollNet/SHARP offload for GPU collectives

Bind each rank to the NIC nearest its NUMA domain with UCX_NET_DEVICES rather than letting every rank open every port, and enable SHARPv3 or UCC for collective offload.
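One way to get per-rank NIC binding is a thin wrapper keyed off the launcher's local-rank variable. The sketch below is illustrative, assuming Slurm and the two Mellanox HCAs named in the exports above:

#!/bin/bash
# nic-bind.sh (hypothetical wrapper): even local ranks use mlx5_0, odd ranks use mlx5_2
local_rank=${SLURM_LOCALID:-0}
if (( local_rank % 2 == 0 )); then
  export UCX_NET_DEVICES=mlx5_0:1
else
  export UCX_NET_DEVICES=mlx5_2:1
fi
exec "$@"

Launched as srun ./nic-bind.sh ./solver ..., each rank then talks only through its nearer port.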

3. Launch templates

Slurm + Open MPI

#!/bin/bash
#SBATCH --job-name=mpi-sim
#SBATCH --nodes=256
#SBATCH --ntasks-per-node=8
#SBATCH --time=03:00:00

module load openmpi/5.0

# Launch via Slurm's PMIx integration; do not nest mpirun under srun
srun --mpi=pmix --cpu-bind=cores --kill-on-bad-exit=1 \
  ./solver --config configs/frontier.json
# Alternative: mpirun --map-by ppr:8:node --bind-to core --mca btl ^tcp ./solver --config configs/frontier.json

Use --kill-on-bad-exit so Slurm tears down the remaining tasks when any rank dies; --cpu-bind=cores (or mpirun's --map-by and --bind-to) keeps ranks evenly pinned instead of piling onto one socket.
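Before committing 256 nodes, confirm the binding on a small allocation. A minimal check, assuming the same modules are loaded:

# Print the CPU binding each task receives on two nodes
srun -N 2 --ntasks-per-node=8 --cpu-bind=verbose,cores hostname
# Or with Open MPI's own launcher:
# mpirun -n 16 --map-by ppr:8:node --bind-to core --report-bindings hostname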

Flux/Balsam orchestration

flux run -n 2048 -g 1 \
  --env UCX_TLS=rc_x,sm \
  --setattr=system.cwd=/lustre/project/run42 \
  ./solver.py --input case42.yaml
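Once the job is in, a couple of Flux commands are handy for keeping an eye on it (the job ID below is a placeholder):

flux jobs -a                 # list jobs in this Flux instance with state and runtime
flux job attach <jobid>      # stream stdout/stderr from a running or completed job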

4. Debug + profiling workflow

  1. Quick triage: scontrol show job, sacct, seff (see the sketch after this list).
  2. Network focus: ibstat, slingshot-adm, ethtool -S.
  3. Profilers: Nsight Systems, Omnitrace/ROC Profiler, Arm MAP.
  4. Replay: capture module lists + git commits via Spack environments or containers.
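A minimal triage pass for a job that looks hung, assuming Slurm; the job ID and sacct field list are illustrative:

JOBID=1234567                                    # hypothetical job ID
scontrol show job "$JOBID"                       # state, node list, reason codes
sacct -j "$JOBID" --format=JobID,State,Elapsed,NodeList,MaxRSS
seff "$JOBID"                                    # CPU/memory efficiency after the job ends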

5. Handling failures at scale

6. Compliance & reproducibility

References