Which H100 to train Nanochat | bluenotebook.io
H100 GPU variants — PCIe, SXM, and NVL configurations compared

Training to GPT-2 level performance on the CORE metric (https://github.com/karpathy/nanochat/discussions/481) has dropped from $43K in 2019 to $73 in 2026. I wanted to train Nanochat ("The best ChatGPT that $100 can buy", per the Nanochat repository) on spot instances, where Karpathy mentions the cost can drop as low as $20 on 8xH100 GPUs. But on Runpod I was confronted with a choice: H100 PCIe, SXM, or NVL, each at a different price point.

Choice of H100 - PCIe, SXM, NVL
Runpod offers H100 in multiple configurations

I knew these were different network interconnect options from the CS336 course (lecture 5: https://www.youtube.com/watch?v=6OBtO9niT00), and that NVLink 4.0 was supposed to be fast.

In the first lecture of CS336, Prof. Percy Liang says the mindset while training LLMs is to squeeze the most performance out of the hardware; every decision has an effect. This led me to examine what each of these interconnect variants has to offer.

If the goal is to train the model cheaply, is the cheapest instance per hour also the cheapest way to complete the training run? I decided to benchmark all three.

Why care about the network interconnect?

While training on multiple GPUs, most parallelism techniques use the interconnect to transfer gradients at every step.

The current implementation of Nanochat takes about 3 hours to train on an 8xH100. The optimizer is the only distributed component.

Nanochat uses a combined Muon + AdamW optimizer (DistMuonAdamW in optim.py). Muon handles all the large 2D matrices: the attention projections (Q, K, V, O) and MLP weights (c_fc, c_proj), plus the tiny value-embedding gates. AdamW handles the rest: input token embeddings (wte), the LM head, value embeddings, and two small residual addition scaling parameters (x0_params and resid_params). The optimizer runs in two stages: phase 1 performs reduce ops and phase 2 performs gather ops.

Phase 1 averages gradients across devices (think of a host with multiple GPUs; each GPU has a rank, its ID). The all_reduce and reduce_scatter primitives are used for this. In Nanochat, all_reduce is used for tiny parameters (under 1024 elements): each rank receives the full averaged gradient in a single collective, and since these are just a few KB, the overhead of sending them to all ranks is negligible. reduce_scatter, the sharded alternative, handles the rest of the parameters: each GPU receives 1/8 of the averaged gradient.

Phase 1
reduce_scatter(grads)
  GPU 0 → avg_grad[0:N/8]
  GPU 1 → avg_grad[N/8:2N/8]
  ...
NCCL collective operation diagrams: AllGather and ReduceScatter.

Then in phase 2, each rank runs the optimizer on its shard in isolation, producing updated parameters for that slice. After this, all_gather lets every rank collect all the shards, so each rank has the full updated parameter tensor for the next forward pass.

Phase 2
optimizer(shard) → updated params
all_gather(params)
  GPU 0 → params[0:N/8]
  GPU 1 → params[N/8:2N/8]
  ...
  → all ranks get full params[0:N]
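The two phases above can be simulated on a single machine to check the semantics. This is a toy sketch, assuming plain SGD in place of Nanochat's Muon/AdamW update; the point is only that reduce_scatter + per-shard update + all_gather reproduces the full-tensor update.

```python
import random

WORLD_SIZE = 8            # ranks, matching the 8xH100 node
N = 64                    # toy parameter count, divisible by WORLD_SIZE
SHARD = N // WORLD_SIZE
LR = 0.1

random.seed(0)
params = [random.gauss(0, 1) for _ in range(N)]   # replicated on every rank
grads = [[random.gauss(0, 1) for _ in range(N)] for _ in range(WORLD_SIZE)]

# Phase 1: reduce_scatter semantics: average gradients across ranks,
# rank r keeps only shard r of the result.
avg_grad = [sum(g[i] for g in grads) / WORLD_SIZE for i in range(N)]

# Phase 2: each rank updates only its own shard (SGD stands in for
# Muon/AdamW), then all_gather reassembles the full parameter vector.
updated_shards = [
    [p - LR * g
     for p, g in zip(params[r * SHARD:(r + 1) * SHARD],
                     avg_grad[r * SHARD:(r + 1) * SHARD])]
    for r in range(WORLD_SIZE)
]
full_params = [x for shard in updated_shards for x in shard]

# Sanity check: identical to a single-device step on the full tensor.
reference = [p - LR * g for p, g in zip(params, avg_grad)]
assert all(abs(a - b) < 1e-12 for a, b in zip(full_params, reference))
```

Each rank only ever materializes optimizer state for its own 1/8 slice, which is where the ZeRO-2 memory saving comes from.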

This is the ZeRO-2 pattern (ZeRO Stage 2 paper). Each GPU only needs optimizer state (momentum and variance buffers) for its shard, cutting that memory to 1/world_size (world_size = total number of GPUs). I strongly recommend watching Lecture 7 of CS336 for a deeper treatment.

All figures above are from Nvidia's NCCL documentation, which has great visualisations of these collectives.

Analytical estimates of the data transfer for d26 Nanochat

From the Nanochat model architecture, we can estimate the data transfer required for each parameter group. The optimizer moves data in two phases: ReduceScatter to average gradients, then AllGather to distribute updated parameters.

Each optimizer step transfers roughly 7.1 GB across the interconnect: ~3.6 GB in AllGather (bf16), ~3.6 GB in ReduceScatter (split between bf16 and f32), and a negligible AllReduce for the two small lambda parameters. We get this value by adding the tensor sizes across all parameter groups: lm_head, wte, value_embeds, and the Muon-managed transformer blocks.
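The tally can be reproduced from the model shapes. This is a back-of-envelope sketch: DIM and VOCAB_PADDED are inferred from the per-group element counts, and it counts both directions at bf16 (2 bytes/element), which lands near the article's ~7.1 GB figure (the exact number depends on how the f32 portion of ReduceScatter is counted).

```python
DIM = 1664                 # d26 width: 26 layers x head_dim 64
VOCAB_PADDED = 32768       # inferred: 54,525,952 elements / 1664 rows

groups = {
    "lm_head":      1 * VOCAB_PADDED * DIM,
    "wte":          1 * VOCAB_PADDED * DIM,
    "value_embeds": 13 * VOCAB_PADDED * DIM,
    "attn QKVO":    104 * DIM * DIM,           # 26 layers x 4 projections
    "mlp c_fc":     32 * DIM * (4 * DIM),      # 26 matrices padded to 32
    "mlp c_proj":   32 * (4 * DIM) * DIM,
}

total_elements = sum(groups.values())          # 1,814,691,840
gb_one_way = total_elements * 2 / 1e9          # bf16: 2 bytes/element
total_gb = 2 * gb_one_way                      # AllGather + ReduceScatter
print(f"{gb_one_way:.2f} GB each way, {total_gb:.2f} GB per step")
```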

Per-group communication volume & NCCL op summary (per optimizer step)
| group | kind | num_params | padded_count | elements_per_param | total_elements | RS (MB) | AG (MB) | AR (MB) |
|---|---|---|---|---|---|---|---|---|
| lm_head | adamw | 1 | 1 | 54,525,952 | 54,525,952 | 109.1 | 109.1 | 0 |
| wte | adamw | 1 | 1 | 54,525,952 | 54,525,952 | 109.1 | 109.1 | 0 |
| value_embeds | adamw | 13 | 13 | 54,525,952 | 708,837,376 | 1417.7 | 1417.7 | 0 |
| resid_lambdas | adamw | 1 | 1 | 26 | 26 | 0 | 0 | 0 |
| x0_lambdas | adamw | 1 | 1 | 26 | 26 | 0 | 0 | 0 |
| muon (13, 32) | muon | 13 | 16 | 416 | 6,656 | 0 | 0 | 0 |
| muon (1664, 1664) | muon | 104 | 104 | 2,768,896 | 287,965,184 | 575.9 | 575.9 | 0 |
| muon (1664, 6656) | muon | 26 | 32 | 11,075,584 | 354,418,688 | 708.8 | 708.8 | 0 |
| muon (6656, 1664) | muon | 26 | 32 | 11,075,584 | 354,418,688 | 708.8 | 708.8 | 0 |

NCCL op summary per step (compare with nsys CUDA GPU Kernel Summary)

| nccl_op | dtype | calls_per_step | total_MB | avg_MB_per_call | min_MB_per_call | max_MB_per_call |
|---|---|---|---|---|---|---|
| AllGather | bf16 | 19 | 3629.4 | 191 | 0 | 708.8 |
| AllReduce | bf16 | 2 | 0 | 0 | 0 | 0 |
| ReduceScatter | bf16 | 15 | 1635.8 | 109.1 | 109.1 | 109.1 |
| ReduceScatter | f32 | 4 | 1993.6 | 498.4 | 0 | 708.8 |

That 7.1 GB is the tax every single training step pays.

Choice of H100

Back to our first question: which H100 instance to choose? Most providers offer the H100 in two form factors, SXM and NVL (Nvidia H100 specification).

The SXM variant is a custom baseboard from Nvidia, whereas NVL installs into a dual-slot PCIe socket. FP8 throughput is 3958 TFLOPS on SXM vs 3341 on NVL. The GPUs interconnect through NVLink or PCIe: NVLink offers 900 GB/s on SXM vs 600 GB/s on NVL, while plain PCIe manages just 128 GB/s. One important note: on NVL instances, only pairs of GPUs are connected through NVLink. Within a pair, NVLink on NVL gives 300 GB/s per direction, and cross-pair traffic falls back to PCIe. On SXM instances, NVSwitch connects all GPUs in a mesh, providing 450 GB/s per direction.

H100 SXM Systems View
H100 SXM Systems View showing the GPU interconnect. Figure from Hopper white paper
H100 NVL Systems View and Network Topology
H100 NVL Systems View and Network Topology. Figure from Hopper white paper and H100 NVL product brief

They also differ in maximum thermal design power: SXM can go up to 700W while NVL peaks at 400W. Horace He wrote a fun blog post (https://www.thonking.ai/p/strangely-matrix-multiplications) about how data values affect power draw: predictable data (all zeros or ones) flips fewer transistors, drawing less dynamic power and allowing better clock speeds. The main takeaway is that a higher power budget unlocks higher clocks and, in turn, more FLOPs per dollar.

Runpod is one of the few providers to offer all three: SXM, NVL, and PCIe. SXM costs more than PCIe but less than NVL. Vast.ai also offers SXM at varied price points for both on-demand and spot instances, in many cases cheaper than Runpod's SXM.

| | Runpod PCIe | Runpod NVL | Vast.ai SXM | Runpod SXM |
|---|---|---|---|---|
| 8-GPU node/hr (on-demand) | $19.12 | $21.52 | $12.85 | $21.50 |
| 8-GPU node/hr (spot) | $10 | $13.20 | $7-$10 | $14 |

The theoretical bandwidth ratio between NVLink (~450 GB/s per direction on SXM) and PCIe 5.0 (~64 GB/s per direction) is roughly 7x. If that ratio holds in practice, SXM should recoup its price premium on interconnect savings alone.
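The ratio, and a naive lower bound on per-step optimizer communication time, is one line of arithmetic. This sketch assumes the ~7.26 GB per-step volume estimated earlier moves at line rate; measured optimizer times are higher because of ring-protocol overhead and multiple serialized collectives.

```python
# Spec-sheet per-direction bandwidths, GB/s
NVLINK_SXM = 450    # NVSwitch, all-to-all
PCIE5 = 64

ratio = NVLINK_SXM / PCIE5
print(f"ratio: {ratio:.1f}x")                            # 7.0x

# Naive lower bound: total per-step collective volume at line rate,
# ignoring protocol overhead and overlap with compute.
VOLUME_GB = 7.26                                         # AG + RS, bf16
print(f"SXM:  {VOLUME_GB / NVLINK_SXM * 1e3:.0f} ms")    # 16 ms
print(f"PCIe: {VOLUME_GB / PCIE5 * 1e3:.0f} ms")         # 113 ms
```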

Benchmarks

My initial hypothesis: SXM instances are more expensive per hour, but cheaper for the full training run. From the nanochat leaderboard (d26 + FP8), the d26 GPT-2 record uses --target-param-data-ratio=8.5 with FP8, training on ~7.8B tokens at batch size 524,288 for 14,889 steps to reach CORE 0.2578 (original GPT-2: 0.2565). Each step runs the model forward, backward, and the optimizer step.

The first two runs were on SXM and PCIe on Runpod: the SXM instance had 160 vCPUs, the PCIe one 252. The results contradicted my hypothesis: PCIe did better on overall training budget. Then I found a 256-vCPU SXM offering on Vast.ai, which outperformed PCIe. To check whether this improvement was purely due to vCPU sizing, I ran a 128-vCPU SXM on Vast.ai and found it matched the 256-vCPU SXM. In later sections, I detail this disparity and the possible causes of the apparent regression of the 160-vCPU SXM on Runpod. Finally, I benchmarked the NVL configuration for completeness.

The results below cover three variants: SXM 128 vCPU (on Vast.ai), PCIe 252 vCPU, and NVL 128 vCPU (on Runpod). The experiments that failed are in the final section.

I wrote a profiling script (profile_comms.py) which performs a warmup of 3 steps and then profiles 10 steps, using torch.cuda.Event to time each step. It also isolates the optimizer's average time, revealing network overhead. I measured compute and network time separately, even though they overlap during normal training, so the measured times are slightly higher than actual Nanochat training.
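The warmup-then-measure structure looks roughly like this. A minimal sketch: the real script brackets each step with torch.cuda.Event pairs (and synchronizes before reading elapsed times); time.perf_counter is a CPU-side stand-in here so the sketch runs without a GPU.

```python
import time

def profile_steps(step_fn, warmup=3, steps=10):
    """Warm up, then time `steps` iterations; returns avg ms/step."""
    for _ in range(warmup):
        step_fn()                      # untimed: warm caches, NCCL channels
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps * 1e3

# Usage with a dummy 10 ms "step":
avg_ms = profile_steps(lambda: time.sleep(0.01))
```

Warmup matters: the first few steps pay one-time costs (allocator growth, NCCL channel setup, kernel compilation) that would otherwise inflate the average.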

Nvidia's nsys tool annotates specific parts of the script: through torch.cuda.nvtx.range_push we break down each operation's timing. The nvtx ranges and CUDA events are split into three phases: Phase 1 (Reduces), Phase 2 (Compute+Gather), and Phase 3 (WaitGathers). Phases 1 and 2 are GPU-intensive, performing network collectives and fused optimizer kernels respectively; Phase 3 is a synchronization step where the CPU waits on network completion.

Measured Step Times for d26

Profiled at device_batch_size=32, total_batch_size=524,288 (no gradient accumulation). The d26 GPT-2 record (Run 2) uses the same batch size with device_batch_size=16 and grad_accum=2, which produces equivalent step times.

SXM completes each step in ~702ms — nearly half the time of PCIe, and a third of NVL.

| Platform | vCPUs | Avg Step Time | Optimizer Step | Comm Overhead | Relative | Training Time |
|---|---|---|---|---|---|---|
| SXM (NVSwitch) | 128 | 701.9 ms | 57.8 ms | 8.2% | 1.00x | 2.90 hours |
| PCIe | 252 | 1411.6 ms | 375.0 ms | 26.6% | 2.01x | 5.84 hours |
| NVL | 128 | 2031.5 ms | 395.6 ms | 19.5% | 2.89x | 8.40 hours |

SXM’s NVSwitch mesh gives every GPU full bandwidth to every other GPU. PCIe is limited to ~64 GB/s per direction, and NVL only has NVLink within pairs — cross-pair traffic falls back to PCIe.

Average step time by H100 platform (lower is better; dashed line marks the SXM baseline): SXM (NVSwitch) 701.9 ms (1.00x), PCIe 1411.6 ms (2.01x), NVL 2031.5 ms (2.89x).

NCCL Communication

Measured GPU kernel execution times from Nsight Systems. All three runs produced the same total kernel call counts, enabling direct comparison of total times.

NCCL kernel execution times by operation type (10 steps): AllGather (1520 calls) SXM 1.5s / PCIe 12.9s / NVL 13.3s; RS bf16 (1120 calls) 0.8s / 3.6s / 4.5s; RS f32 (400 calls) 1.6s / 11.9s / 12.4s. Total NCCL: SXM 3.9s vs PCIe 28.5s vs NVL 30.1s. NVL performs nearly identically to PCIe.

NVLink delivers a 7.3x reduction in total NCCL kernel time, matching the spec sheet’s 7x bandwidth ratio.

Per-Kernel Average Latency

From Nsight, I exported the NCCL calls in the CUDA GPU Kernel Summary across all configurations.

Average latency per NCCL kernel call: AllGather SXM 1.0 ms / PCIe 8.5 ms / NVL 8.7 ms; RS bf16 0.68 / 3.3 / 4.0 ms; RS f32 4.1 / 30 / 31 ms; AllReduce f32 0.03 / 0.23 / 0.05 ms. NVL tracks PCIe closely, confirming cross-pair traffic uses PCIe.

SXM performs best here, and NVL's NCCL kernel times are nearly identical to PCIe's. Yet NVL's step time (2031 ms) is 44% worse than PCIe's (1412 ms) even though the NCCL kernel times match. The difference lies beyond NCCL: on NVL, inter-pair traffic shares the PCIe bus with host-to-device transfers, starving both.

Model Size Sensitivity (d12 vs d26)

Does a smaller model show the same interconnect sensitivity? I profiled d12 (286M params, device_batch_size=32, grad_accum=1) alongside d26 for two configurations.

d12 average step time, SXM vs NVL: SXM (NVSwitch) 179 ms (optimizer 41 ms, 23.0% overhead, 1.00x) vs NVL 366 ms (optimizer 62 ms, 17.0% overhead, 2.04x).

Smaller models are more communication-sensitive: d12 spends 23% of its step in communication on SXM vs d26's 8.2%. For small-model workloads, interconnect choice matters even more. Interestingly, NVL's Phase 1 is faster than SXM's for d12, likely because the small reduce volume fits within a single NVLink pair's bandwidth, avoiding NVSwitch overhead.

d12 Optimizer Phase Breakdown
| Platform | Phase 1 | Phase 2 | Phase 3 | Total Optimizer |
|---|---|---|---|---|
| SXM 128 vCPU | 4.2 ms | 25.2 ms | 11.7 ms | 41.2 ms |
| NVL 128 vCPU | 2.7 ms | 36.3 ms | 23.4 ms | 62.3 ms |

Takeaways

SXM completes the training in nearly half the time of PCIe and a third of NVL. At $12.85/hr on Vast.ai, it’s also the cheapest per-hour option.

| Provider | Config | $/hr (8-GPU) | vCPUs | Step Time | Training Cost |
|---|---|---|---|---|---|
| Vast.ai | SXM | $12.85 | 128 | 701.9 ms | $37.27 |
| Runpod | PCIe | $19.12 | 252 | 1411.6 ms | $111.66 |
| Runpod | NVL | $21.52 | 128 | 2031.5 ms | $180.77 |
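The training-cost column is just steps x step time x hourly rate. A quick sanity-check sketch (costs land within a few cents of the table depending on where you round the hours):

```python
STEPS = 14_889            # d26 GPT-2 record run

configs = {               # ($/hr, measured avg step time in seconds)
    "Vast.ai SXM": (12.85, 0.7019),
    "Runpod PCIe": (19.12, 1.4116),
    "Runpod NVL":  (21.52, 2.0315),
}

for name, (rate, step_s) in configs.items():
    hours = STEPS * step_s / 3600
    print(f"{name}: {hours:.2f} h -> ${hours * rate:.2f}")
```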

SXM seems to be the norm across most providers. Runpod was the only provider with all three configurations, offered as spot instances. Vast.ai is a bit of a lucky draw: since it's a marketplace, not all configurations are consistently available, but for shorter training runs like Nanochat it is the best fit. Lambda.ai offers only SXM, at 208 vCPUs; Modal also offers only SXM, with a configurable vCPU count since it is serverless.

Through this exercise, I now have better intuition for training on spot instances. The runpod_profile_comms.sh script runs device checks early, for CUDA drivers and NCCL communication, so failures are fast.

Mistakes I made and issues I ran into

1. CPU starvation on the SXM run and NUMA socket pinning

My benchmark on the 160-vCPU SXM on Runpod clocked 1295 ms per step, barely faster than PCIe's 1412 ms with 252 vCPUs. With higher FLOPS and a faster interconnect, SXM should have been a massive step up, not a minor improvement.

Suspecting CPU starvation, I found a 256-vCPU instance on Vast.ai and got 702 ms, a 2x improvement. Through Nsight Systems, I found the GPU kernels themselves were fast, but they spent long stretches idle, waiting for the CPU to signal the next chunk in NCCL's ring protocol. The pthread_cond_signal count was 1.58 million in a 10-step profile on the 160-vCPU instance, vs ~4,000 on a healthy 256-vCPU instance. Running more experiments on Runpod with NVL and PCIe, I ran into multiple issues: slow internet on the VM, CUDA driver problems, and NCCL misconfigurations.

SXM 160 vCPU NUMA-split Nsight timeline
SXM 160 vCPU (NUMA-split). The OS runtime row is dense with syscalls.
SXM 256 vCPU Nsight timeline
SXM 256 vCPU. The OS runtime row is empty.
H100 PCIe Nsight Systems timeline
PCIe instance.

To ensure I was not fitting data to my narrative, I re-ran on Vast.ai with 128 vCPUs and got 701.9 ms, identical to the 256-vCPU SXM. So vCPU count alone was not the bottleneck. Dumping the machine topology revealed one clear difference: Runpod split the GPUs 4+4 across two NUMA nodes (Non-Uniform Memory Access, a memory layout used in data-center machines), while Vast.ai placed all 8 on NUMA node 0.

Multi-socket servers (a CPU socket is the physical connector on the motherboard holding one CPU chip; a dual-socket server has two CPUs, each with its own local memory and PCIe lanes) have a NUMA architecture: each CPU socket has its own local memory. Accessing local memory takes ~10 ns, but reaching memory on the other socket crosses the UPI (Ultra Path Interconnect) at ~100 ns. When GPUs are split across NUMA nodes, NCCL's CPU-side coordination threads, the ones issuing pthread_cond_signal and holding mutexes, pay this cross-socket penalty on every ring-protocol step. The OS scheduler makes it worse: without explicit pinning, it can schedule a thread managing GPU 5 (socket 1) onto a core on socket 0, turning every memory access and signal delivery into a cross-UPI hop.

numactl --cpunodebind=N --membind=N pins processes to a specific socket, but NCCL spawns its own internal threads which may not respect this. The clean fix is what Vast.ai had: all 8 GPUs on a single NUMA node, so cross-socket latency never enters the picture. GPU-to-GPU NVLink communication is unaffected by NUMA since the bits travel over NVSwitch (the data plane), never touching the CPU; but NCCL's control threads, which orchestrate these transfers, run on the CPU (the control plane). There is active discussion on PyTorch about adding NUMA pinning to torchrun - Link.

There were also CUDA driver version differences (560 vs 570) and different kernel configs. I haven't isolated which factor dominates; I'll cover it in a follow-up post.

Run nvidia-smi topo -m on every new instance before benchmarking. If GPUs span multiple NUMA nodes, expect extra NCCL overhead. And always profile before trusting step times: a bad instance can masquerade as "SXM isn't worth it."
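A quick check like this can be scripted. This is a heuristic sketch, assuming the output of `nvidia-smi topo -m` ends each GPU row with a "NUMA Affinity" column (present on recent drivers); the sample string below is hypothetical and abbreviated, not real driver output.

```python
def gpus_span_numa(topo: str) -> bool:
    """True if GPU rows in `nvidia-smi topo -m` output report more
    than one NUMA affinity (assumes a trailing NUMA Affinity column)."""
    nodes = {
        line.split()[-1]
        for line in topo.splitlines()
        if line.startswith("GPU") and line.split()[-1].isdigit()
    }
    return len(nodes) > 1

# Hypothetical, abbreviated output for a 4+4 NUMA-split node:
split_sample = (
    "\tGPU0\tGPU1\tCPU Affinity\tNUMA Affinity\n"
    "GPU0\t X \tNV18\t0-47\t0\n"
    "GPU1\tNV18\t X \t48-95\t1\n"
)
print(gpus_span_numa(split_sample))   # True
```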

2. Spot instances being preempted mid-profile

Spot instances are 30-50% cheaper than on-demand. The trade-off is they can be shut down at any point with a 5-second notice. Since the profiling takes roughly 12 minutes including installation, environment setup, and the profile itself, I was confident I could get the work done on spot instances, but I did run into shutdowns a couple of times.

3. Broken Nodes throwing CUDA errors

I ran into this a few times, where the host was not configured correctly: likely a CUDA driver mismatch left the driver or GPU state broken. Debugging this on a pod billed by the second is expensive; shutting it down and retrying later is the best alternative.

>>> import torch, sys, os
>>>
>>> print(f'PyTorch {torch.__version__}, built with CUDA {torch.version.cuda}')
PyTorch 2.8.0+cu128, built with CUDA 12.8
>>>
>>> torch.cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 379, in init
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 412, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

4. NCCL connection issues on NVL

On one of Runpod's community instances of 8xH100 NVL, there was an NCCL communication issue: the instance was unable to use SHM (shared memory), a fast inter-process transport using /dev/shm. I tried benchmarking anyway by disabling SHM with NCCL_SHM_DISABLE=1.

NCCL uses SHM for peer-to-peer (P2P) communication. On an 8-GPU NVL node, only 4 of the 28 GPU pairs share NVLink; the remaining 24 pairs rely on PCIe via SHM, so disabling it cripples 6 of every 7 pairwise paths. With SHM disabled, NCCL falls back to IP sockets, which are orders of magnitude slower.
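The pair arithmetic is easy to verify. A small sketch, assuming the adjacent-pair NVLink layout (0-1, 2-3, 4-5, 6-7); the actual pairing depends on the board.

```python
from itertools import combinations

GPUS = range(8)
pairs = list(combinations(GPUS, 2))               # 28 unordered GPU pairs
nvlink_pairs = {(0, 1), (2, 3), (4, 5), (6, 7)}   # assumed NVL pairing

pcie_pairs = [p for p in pairs if p not in nvlink_pairs]
print(len(pairs), len(pcie_pairs))                # 28 24
print(f"{len(pcie_pairs) / len(pairs):.3f}")      # 0.857, i.e. 6/7
```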

So I had to do another run on Runpod's secure cloud, which also had an NVL instance; this performed much better. This paper does a great job of explaining NCCL internals (Paper).

| Metric | NVL (128 vCPU) | NVL no SHM (152 vCPU) | Degradation |
|---|---|---|---|
| Step time | 2031.5 ms | 6495.1 ms | 3.2x |
| Optimizer step | 395.6 ms | 5402.9 ms | 13.7x |
| Comm overhead | 19.5% | 83.2% | |
| Total NCCL (10 steps) | 30.08 s | 430.58 s | 14.3x |
| AllGather avg | 8.715 ms | 142.656 ms | 16.4x |
| RS f32 avg | 30.937 ms | 393.702 ms | 12.7x |
| AllGather max | | 1040 ms | |
NVL degradation from disabling SHM (1x = no degradation): step time 3.2x, RS f32 avg 12.7x, optimizer step 13.7x, total NCCL 14.3x, AllGather avg 16.4x; comm overhead goes from 19.5% to 83.2%. Communication goes from manageable to dominant.

NCCL degrades by 14.3x when SHM is disabled. Note: the no-SHM instance had slightly more vCPUs (152 vs 128), which should have helped, making the SHM effect even more dramatic than the raw numbers suggest.

Citation

BibTeX citation:

@online{kasukurthi2026},
  author = {Nikhil Kasukurthi},
  title = {Which H100 to train Nanochat},
  date = {2026-03-04},
  url = {https://bluenotebook.io/blog/h100-nanochat-training/},
  langid = {en}
}

For attribution, please cite this work as:

Nikhil Kasukurthi. 2026. "Which H100 to train Nanochat" March 4, 2026. https://bluenotebook.io/blog/h100-nanochat-training/.