Tutorial · 10 min read
MPI Launch Patterns That Scale
MPI still powers climate, CFD, and nuclear workloads—but the launch sequence now needs to respect heterogeneous nodes, multiple NICs, and accelerator libraries. Use these patterns to keep scaling past 10,000 ranks without babysitting runs.
1. Pick the right MPI stack
- Open MPI 5.0: General-purpose clusters, GPUs, UCX transports.
- MPICH 4.2: NEC/Cray builds, applications targeting MPI-4.
- HPE MPT / Intel MPI: HPE Cray EX, Slingshot fabrics, oneAPI toolchains.
Update modules monthly; mismatched UCX/libfabric pairs are a top culprit for hangs.
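Before a big run, it's worth a quick pre-flight check that the pieces of the stack agree. A minimal sketch, assuming the openmpi/5.0 module used in the template below and that ucx_info and fi_info are on PATH:
module load openmpi/5.0
ucx_info -v | head -n 1               # UCX library version on this node
fi_info --version                     # libfabric version
ompi_info --parsable | grep -i ucx    # confirm Open MPI was built with UCX support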
2. Multi-NIC and fabric tuning
export UCX_TLS=rc,ud,sm,self                     # RC + UD fabric transports, shared memory, loopback
export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1         # restrict UCX to the two fabric HCAs
export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=1        # MPICH/OFI: use native fabric atomics
export NCCL_COLLNET_ENABLE=1                     # NCCL: allow in-network (SHARP) collectives
Bind each rank to the NIC nearest its NUMA domain with UCX_NET_DEVICES, and enable SHARPv3 or UCC where the fabric supports collective offload. A per-rank binding wrapper is sketched below.
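A minimal sketch of such a wrapper; nic-bind.sh is a name of our choosing, and it assumes the two-HCA layout from the exports above plus a launcher that sets SLURM_LOCALID or OMPI_COMM_WORLD_LOCAL_RANK:
#!/bin/bash
# nic-bind.sh (hypothetical): give each local rank its own HCA so
# traffic spreads across both ports instead of saturating one.
NICS=(mlx5_0:1 mlx5_2:1)
LOCAL_RANK=${SLURM_LOCALID:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}
export UCX_NET_DEVICES=${NICS[$(( LOCAL_RANK % ${#NICS[@]} ))]}
exec "$@"
Launch it as srun ./nic-bind.sh ./solver ... so the binding happens after each rank lands on its node.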
3. Launch templates
Slurm + Open MPI
#!/bin/bash
#SBATCH --job-name=mpi-sim
#SBATCH --nodes=256
#SBATCH --ntasks-per-node=8
#SBATCH --time=03:00:00
module load openmpi/5.0
srun --mpi=pmix --cpu-bind=cores --kill-on-bad-exit=1 \
    ./solver --config configs/frontier.json
Use --kill-on-bad-exit so Slurm cleans up hung tasks, and launch through srun's PMIx support rather than chaining srun and mpirun, which double-launches the job. If you prefer Open MPI's own launcher, --map-by gives the same even distribution across nodes and sockets, as shown below.
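A rough mpirun equivalent inside the same allocation; the placement flags mirror --ntasks-per-node=8 and --cpu-bind=cores from the batch script:
mpirun --map-by ppr:8:node --bind-to core --mca btl ^tcp \
    ./solver --config configs/frontier.json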
Flux/Balsam orchestration
flux run -n 2048 -g 1 \
--env UCX_TLS=rc_x,sm \
--setattr=system.cwd=/lustre/project/run42 \
./solver.py --input case42.yaml
Note that flux run's -g flag counts GPUs per task, so -n 2048 -g 1 requests 2,048 tasks with one GPU each; -g 2048 would ask for 2,048 GPUs for every task.
4. Debug + profiling workflow
- Quick triage: scontrol show job, sacct, seff (a helper is sketched after this list).
- Network focus: ibstat, slingshot-adm, ethtool -S.
- Profilers: Nsight Systems, Omnitrace/ROC Profiler, Arm MAP.
- Replay: capture module lists + git commits via Spack environments or containers.
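For the quick-triage step, a one-shot helper saves retyping the same three commands. A minimal sketch, assuming Slurm accounting is enabled; triage.sh is a name of our choosing and $1 is the job ID:
#!/bin/bash
# triage.sh (hypothetical): one-pass health check for a suspect job.
jobid=$1
scontrol show job "$jobid"       # live state, node list, pending reasons
sacct -j "$jobid" --format=JobID,State,Elapsed,MaxRSS,NodeList
seff "$jobid"                    # CPU and memory efficiency summary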
5. Handling failures at scale
- Staggered launches to avoid metadata storms (a sketch follows this list).
- Coordinator pattern that publishes job status to Redis/etcd for dashboards.
- Checkpointing via ULFM or application-level checkpoints staged to burst buffers.
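The stagger from the first bullet can live at the top of the per-rank script. A minimal sketch, assuming a Slurm launch; the 16-second window is illustrative, not a tuned value:
# Spread startup over a short window so thousands of ranks don't
# hit the metadata servers at the same instant.
sleep $(( SLURM_NODEID % 16 ))
exec ./solver --config configs/frontier.json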
6. Compliance & reproducibility
- Wrap apps in Apptainer/OCI images signed via Cosign.
- Record MPI_VERSION, module list, and git SHA in job output; a provenance stanza is sketched below.
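A minimal provenance stanza for the top of a batch script, assuming Open MPI's mpirun and a git checkout in the submission directory:
echo "MPI: $(mpirun --version | head -n 1)"                  # runtime MPI version
module list 2>&1                                             # 'module list' writes to stderr
echo "GIT_SHA: $(git -C "$SLURM_SUBMIT_DIR" rev-parse HEAD)" # exact source revision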
References
- Open MPI 5.0 release notes.
- Argonne’s ReFrame/Flux tutorials from SC24.
- MLPerf HPC submissions detailing launch parameters on Frontier, Aurora, Eagle.