Frontier · ORNL
1.353 EF/s LINPACK · #2 on the November 2024 TOP500 (behind El Capitan) with HPE Cray EX + AMD MI250X nodes.
Updated February 2025
hpctutorials tracks how top labs—from ORNL to Argonne—run clusters with thousands of nodes, GPUs, and impatient scientists. We distill their patterns into actionable runbooks so your team can stand up reliable, policy-compliant compute without the guesswork.
Every guide mirrors the thought-leader design system behind pranavkulkarni.org: single-column focus, ruthless clarity, and data pulled from the latest TOP500, MLPerf, and Slurm releases.
Maintained by Mandar Gurav & Pranav Kulkarni — operators who live inside Slurm, Flux, and exascale programs daily.
Long-form notes published weekly so operators, research software engineers, and leadership stay aligned.
From bastion policies to module stacks so new researchers ship jobs in under 30 minutes.
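A taste of the onboarding runbook: a minimal sketch of loading a module stack programmatically, assuming an Lmod site that exports LMOD_CMD; the module names here are hypothetical placeholders for whatever stack your site ships.

```python
import os
import subprocess

def module(*args):
    # Lmod's "python" shell mode prints Python statements that mutate
    # os.environ; exec-ing them applies the load to the current process.
    proc = subprocess.run(
        [os.environ["LMOD_CMD"], "python", *args],
        capture_output=True, text=True, check=True,
    )
    exec(proc.stdout)

module("load", "gcc", "openmpi")  # hypothetical stack for a first job
print(os.environ.get("LOADEDMODULES", "(nothing loaded)"))
```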
Modern Slurm patterns, job arrays, heterogeneous allocations, and QoS dashboards.
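As a preview, a minimal sketch of the job-array pattern, assuming a cluster with sbatch on the PATH; the QoS name and the ./simulate binary are placeholders.

```python
import subprocess
import textwrap

# sbatch reads the batch script from stdin when no file name is given.
script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=sweep
    #SBATCH --array=0-9          # ten tasks, one per parameter index
    #SBATCH --time=00:10:00
    #SBATCH --qos=normal
    srun ./simulate --index "$SLURM_ARRAY_TASK_ID"
    """)

result = subprocess.run(["sbatch"], input=script, text=True,
                        capture_output=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 123456"
```

One array submission like this replaces ten near-identical scripts, and Slurm schedules, throttles, and accounts for the tasks as a unit.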
How labs fuse MPI, CUDA, and inference with profiling and governance guardrails.
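And a glimpse of the hybrid-compute guide: a toy mpi4py reduction standing in for the MPI side of an MPI + CUDA pipeline. This assumes mpi4py and NumPy are installed and the script is launched with mpirun; the partial results here are placeholders for what GPU kernels would produce.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a partial result, as a CUDA kernel's output would;
# Allreduce fuses them so every rank sees the global sum.
local = np.array([float(rank + 1)])
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print(f"global sum across {comm.Get_size()} ranks: {total[0]}")
```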
Frontier · ORNL: 1.353 EF/s LINPACK · #2 on the November 2024 TOP500 with HPE Cray EX + AMD MI250X nodes.
Aurora · Argonne: 1.012 EF/s debut · Ponte Vecchio + Sapphire Rapids system entering production science.
El Capitan · LLNL: 1.742 EF/s debut · New #1 on the November 2024 TOP500, an MI300A-powered system with strict zero-trust data transfer baked in.
Source: November 2024 TOP500 plus SC24/ISC field briefings. We monitor MLPerf releases because every lab now spans simulation + AI.