Which H100 to train Nanochat

Training to GPT-2 level performance on the CORE metric¹ (https://github.com/karpathy/nanochat/discussions/481) has dropped from $43K in 2019 to $73 in 2026. I wanted to train Nanochat² (“The best ChatGPT that $100 can buy.” Nanochat Repository) on spot instances, where Karpathy mentions the cost can go even lower, to $20 on 8xH100 GPUs. But on Runpod I was confronted with a choice: H100 PCIe, SXM or NVL, each at a different price point.

Choice of H100 - PCIe, SXM, NVL — Runpod offers H100 in multiple configurations

I knew these were different network interconnect options from the CS336 course³ (lecture 5, https://www.youtube.com/watch?v=6OBtO9niT00) and that NVLink 4.0 was supposed to be fast.

Prof. Percy Liang mentions in the first lecture of CS336 that the mindset while training LLMs is to squeeze the most performance out of the hardware. Every decision has an effect. This led me to examine what each of these interconnect variants has to offer.

To train the model cheaply, is the cheapest instance the best choice to complete the training run? I decided to benchmark all three.

Why care about the network interconnect?

While training on multiple GPUs, most parallelism techniques use the interconnect to transfer gradients at every step.

The current implementation of Nanochat takes about 3 hours to train on an 8xH100. The optimizer is the only distributed component.

Nanochat uses a combined Muon + AdamW optimizer⁴ (DistMuonAdamW, optim.py). Muon handles all the large 2D matrices: the attention projections (Q, K, V, O) and MLP weights (c_fc, c_proj), plus the tiny Value Embedding Gates. AdamW handles the rest: input token embeddings (wte), the LM head, value embeddings, and two small residual addition scaling parameters (x0_params and resid_params). The optimizer runs in two stages: phase 1 is for reduce ops and phase 2 is for gather ops.

Phase 1 averages gradients across devices⁵ (think of devices as a host having multiple GPUs; each GPU has a rank, its ID). The all_reduce and reduce_scatter primitives are used for this. In Nanochat, all_reduce is used for tiny parameters (under 1024 elements): each rank receives the full averaged gradient in a single collective. Since these are just a few KB, the overhead of sending them to all ranks is negligible. reduce_scatter, the sharded alternative, handles the rest of the parameters: each GPU receives 1/8 of the averaged gradient.

Phase 1
reduce_scatter(grads)
  GPU 0 → avg_grad[0:N/8]
  GPU 1 → avg_grad[N/8:2N/8]
  ...
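
To make the two reduce collectives concrete, here is a minimal pure-Python simulation (no GPUs, no NCCL). It is a sketch of the semantics only: the function names, the list-of-lists layout and the demo sizes are illustrative assumptions. Only the 1024-element threshold and the 8-way split come from the text above.

```python
# Simulate Phase 1 on 8 "ranks". Each rank holds its own local gradient
# (a flat list of floats) for the same parameter tensor.

SMALL_PARAM_THRESHOLD = 1024  # from the text: tensors under 1024 elements use all_reduce


def _average(grads_per_rank):
    """Element-wise mean across ranks."""
    world_size = len(grads_per_rank)
    return [sum(vals) / world_size for vals in zip(*grads_per_rank)]


def all_reduce_mean(grads_per_rank):
    """Every rank ends up with the full averaged gradient."""
    avg = _average(grads_per_rank)
    return [list(avg) for _ in grads_per_rank]


def reduce_scatter_mean(grads_per_rank):
    """Rank r ends up with only slice r of the averaged gradient."""
    world_size = len(grads_per_rank)
    n = len(grads_per_rank[0])
    assert n % world_size == 0, "sketch assumes N divides evenly across ranks"
    shard = n // world_size
    avg = _average(grads_per_rank)
    return [avg[r * shard:(r + 1) * shard] for r in range(world_size)]


def phase1(grads_per_rank):
    """Pick the collective by tensor size, as described in the text."""
    if len(grads_per_rank[0]) < SMALL_PARAM_THRESHOLD:
        return "all_reduce", all_reduce_mean(grads_per_rank)
    return "reduce_scatter", reduce_scatter_mean(grads_per_rank)


world_size = 8
# Rank r's local gradient is filled with the value r, so the mean is 3.5.
tiny = [[float(rank)] * 26 for rank in range(world_size)]
big = [[float(rank)] * 2048 for rank in range(world_size)]

tiny_op, tiny_out = phase1(tiny)
big_op, big_out = phase1(big)
print(tiny_op, len(tiny_out[0]))  # full tensor on every rank
print(big_op, len(big_out[0]))    # one eighth of the tensor per rank
```

The design point is visible in the output sizes: the tiny tensor is replicated in full on every rank, while the large one leaves each rank holding only its own slice of the average.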

NCCL AllGather collective operation diagram — All gather

NCCL ReduceScatter collective operation diagram — Reduce Scatter

NCCL Gather collective operation diagram — All Gather

Then in phase 2, each rank runs the optimizer on its shard in isolation, producing updated parameters for that slice. After this, all_gather lets every rank collect all the shards, so each rank has the full updated parameter tensor for the next forward pass.

Phase 2
optimizer(shard) → updated params
all_gather(params)
  GPU 0 → params[0:N/8]
  GPU 1 → params[N/8:2N/8]
  ...
  → all ranks get full params[0:N]
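
The phase 2 flow can be checked with a small pure-Python simulation. This is a sketch under stated assumptions: a plain SGD-style update stands in for Muon/AdamW (it is not Nanochat's optimizer), and the helper names and tensor sizes are made up for illustration. The point is only that updating shards independently and then gathering gives the same result as updating the full tensor.

```python
def sgd_update(params, grads, lr):
    """Stand-in optimizer step (plain SGD), NOT Muon or AdamW."""
    return [p - lr * g for p, g in zip(params, grads)]


def all_gather(shards_per_rank):
    """Every rank receives the concatenation of all ranks' shards."""
    full = [x for shard in shards_per_rank for x in shard]
    return [list(full) for _ in shards_per_rank]


world_size = 8
n = 16                      # toy tensor size, divisible by world_size
lr = 0.5
params = [float(i) for i in range(n)]
avg_grad = [1.0] * n        # pretend this came out of phase 1
shard = n // world_size

# Each rank updates only its own slice of the parameters.
updated_shards = [
    sgd_update(params[r * shard:(r + 1) * shard],
               avg_grad[r * shard:(r + 1) * shard],
               lr)
    for r in range(world_size)
]

# all_gather gives every rank the full updated tensor.
gathered = all_gather(updated_shards)

# Reference: the same update applied to the whole tensor on one device.
reference = sgd_update(params, avg_grad, lr)
print(all(rank_view == reference for rank_view in gathered))
```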

This is the ZeRO-2⁶ (ZeRO Stage 2 paper) pattern. Each GPU only needs optimizer state (momentum, variance buffers) for its shard, cutting optimizer-state memory to 1/world_size⁷ (world_size is the total number of GPUs). I strongly recommend watching Lecture 7⁸ (Lecture 7, CS336) in CS336 to get a deeper idea.
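
As a rough illustration of the 1/world_size saving, the sketch below estimates AdamW state memory for the AdamW-managed tensors. The element counts are taken from the per-group table in this post; the 4 bytes per state element (fp32 momentum and variance) is my assumption, not something confirmed from Nanochat's code.

```python
# Element counts for the AdamW-managed tensors, from the per-group table.
ADAMW_ELEMENTS = {
    "lm_head": 54_525_952,
    "wte": 54_525_952,
    "value_embeds": 708_837_376,
}

STATE_BUFFERS = 2            # momentum and variance
BYTES_PER_STATE_ELEMENT = 4  # ASSUMPTION: optimizer state kept in fp32
WORLD_SIZE = 8


def adamw_state_bytes(elements, world_size=1):
    """Bytes of AdamW state held per rank when state is sharded evenly."""
    return elements * STATE_BUFFERS * BYTES_PER_STATE_ELEMENT / world_size


total_elements = sum(ADAMW_ELEMENTS.values())
full_gb = adamw_state_bytes(total_elements) / 1e9
per_rank_gb = adamw_state_bytes(total_elements, WORLD_SIZE) / 1e9

print(f"unsharded AdamW state: {full_gb:.2f} GB")
print(f"per rank with {WORLD_SIZE} GPUs: {per_rank_gb:.2f} GB")
```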

All figures above are from Nvidia’s NCCL documentation, which has great visualisations of these collectives.

Analytical estimates of the data transfer for d26 Nanochat

From the Nanochat model architecture, we can estimate the data transfer required for each parameter group. The optimizer moves data in two phases: ReduceScatter to average gradients, then AllGather to distribute updated parameters.

Each optimizer step transfers roughly 7.1 GB across the interconnect: ~3.6 GB in AllGather (bf16), ~3.6 GB in ReduceScatter (split between bf16 and f32), and a negligible AllReduce for the two small lambda parameters. We get this value by adding the tensor sizes across all parameter groups: lm_head, wte, value_embeds, and the Muon-managed transformer blocks. (The tables below sum to 7,258.8 MB, which is 7.1 GB when 1 GB is taken as 1,024 MB.)

Per-group communication volume & NCCL op summary (per optimizer step)

group	kind	num_params	padded_count	elements_per_param	total_elements	RS (MB)	AG (MB)
lm_head	adamw	1	1	54,525,952	54,525,952	109.1	109.1
wte	adamw	1	1	54,525,952	54,525,952	109.1	109.1
value_embeds	adamw	13	13	54,525,952	708,837,376	1417.7	1417.7
resid_lambdas	adamw	1	1	26	26	0	0
x0_lambdas	adamw	1	1	26	26	0	0
muon (13, 32)	muon	13	16	416	6,656	0	0
muon (1664, 1664)	muon	104	104	2,768,896	287,965,184	575.9	575.9
muon (1664, 6656)	muon	26	32	11,075,584	354,418,688	708.8	708.8
muon (6656, 1664)	muon	26	32	11,075,584	354,418,688	708.8	708.8

NCCL op summary per step (compare with nsys CUDA GPU Kernel Summary)

nccl_op	dtype	calls_per_step	total_MB	avg_MB_per_call	min_MB_per_call	max_MB_per_call
AllGather	bf16	19	3629.4	191	0	708.8
AllReduce	bf16	2	0	0	0	0
ReduceScatter	bf16	15	1635.8	109.1	109.1	109.1
ReduceScatter	f32	4	1993.6	498.4	0	708.8

That 7.1 GB is the tax every single training step pays.
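
The totals in the tables can be reproduced from the padded counts and per-parameter element counts. This is a sketch: the 2 bytes per element is an assumption chosen because it reproduces the RS and AG columns above, and the two 26-element lambda tensors are left out because they go through AllReduce.

```python
BYTES_PER_ELEMENT = 2  # ASSUMPTION: reproduces the MB columns in the table

# (group, padded_count, elements_per_param), copied from the per-group table.
GROUPS = [
    ("lm_head", 1, 54_525_952),
    ("wte", 1, 54_525_952),
    ("value_embeds", 13, 54_525_952),
    ("muon (13, 32)", 16, 416),
    ("muon (1664, 1664)", 104, 2_768_896),
    ("muon (1664, 6656)", 32, 11_075_584),
    ("muon (6656, 1664)", 32, 11_075_584),
]


def megabytes(padded_count, elements_per_param):
    """Size of one group's tensors in decimal MB."""
    return padded_count * elements_per_param * BYTES_PER_ELEMENT / 1e6


per_group = {name: megabytes(count, elems) for name, count, elems in GROUPS}

one_direction_mb = sum(per_group.values())  # ReduceScatter, and again for AllGather
total_mb = 2 * one_direction_mb

for name, mb in per_group.items():
    print(f"{name:20s} {mb:8.1f} MB")
print(f"ReduceScatter total: {one_direction_mb:.1f} MB")
print(f"AllGather total:     {one_direction_mb:.1f} MB")
print(f"Per step:            {total_mb:.1f} MB")
```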

Choice of H100

Back to our first question: which H100 instance to choose? Most providers offer the H100 in two form factors, SXM and NVL⁹ (Nvidia H100 specification).

The SXM variant sits on a custom baseboard from Nvidia, whereas NVL is a dual-slot PCIe card. Peak FP8 throughput is 3958 TFLOPS on SXM vs 3341 TFLOPS on NVL (spec-sheet figures, quoted with sparsity). You can interconnect the GPUs through NVLink or PCIe. NVLink offers 900GB/s on SXM vs 600GB/s on NVL, and PCIe offers just 128GB/s. One important note is that on NVL instances, only two GPUs can be connected through NVLink. Within a pair, NVLink on NVL gives 300GB/s per direction, and cross-pair traffic falls back to PCIe. On SXM instances, NVSwitch connects all GPUs in a mesh providing 450GB/s per direction.

H100 SXM Systems View showing the GPU interconnect. Figure from Hopper white paper

H100 NVL Systems View and Network Topology. Figure from Hopper white paper and H100 NVL product brief

They also differ in max thermal design power. SXM can go up to 700W while NVL peaks at 400W. Horace He wrote a fun blog¹⁰ (https://www.thonking.ai/p/strangely-matrix-multiplications) about how data values affect power draw. Predictable data, such as all zeros or all ones, flips fewer transistors, leading to less dynamic power and in turn better clock speeds. The main takeaway is that a higher power limit unlocks better clock speeds and, in turn, better FLOPs per dollar.

Runpod is one of the few providers to offer all three: SXM, NVL and PCIe. On-demand, Runpod's SXM costs more than PCIe and is priced almost identically to NVL; on spot, SXM is the most expensive of the three. Vast.ai also has SXM configurations at varied price points for both on-demand and spot instances, in many cases cheaper than Runpod’s SXM.

	Runpod PCIe	Runpod NVL	Vast.ai SXM	Runpod SXM
8-GPU node/hr (on-demand)	$19.12	$21.52	$12.85	$21.5
8-GPU node/hr (spot)	$10	$13.2	$7-$10	$14

The theoretical bandwidth ratio between NVLink (~450 GB/s per direction on SXM) and PCIe 5.0 (~64 GB/s per direction) is roughly 7x ⁹. If that ratio holds in practice, SXM should recoup its price premium on interconnect savings alone.
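
To put rough numbers on that ratio, here is a back-of-the-envelope sketch using the standard ring cost model, in which each rank moves (n-1)/n of the tensor bytes for a reduce-scatter or all-gather. It is a bandwidth-only lower bound under my assumptions: it ignores latency, protocol overhead, shared PCIe links and any overlap with compute, so it will not match measured optimizer times.

```python
def ring_collective_seconds(total_bytes, world_size, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound for a ring reduce-scatter or all-gather.

    Each rank sends (world_size - 1) / world_size of the tensor bytes.
    """
    return total_bytes * (world_size - 1) / world_size / bandwidth_bytes_per_s


STEP_BYTES = 7258.8e6   # per-step volume from the tables (decimal MB)
WORLD_SIZE = 8
NVLINK_PER_DIRECTION = 450e9  # bytes/s, SXM figure quoted in the text
PCIE5_PER_DIRECTION = 64e9    # bytes/s, PCIe 5.0 figure quoted in the text

nvlink_ms = ring_collective_seconds(STEP_BYTES, WORLD_SIZE, NVLINK_PER_DIRECTION) * 1e3
pcie_ms = ring_collective_seconds(STEP_BYTES, WORLD_SIZE, PCIE5_PER_DIRECTION) * 1e3
ratio = pcie_ms / nvlink_ms

print(f"NVLink lower bound: {nvlink_ms:.1f} ms per step")
print(f"PCIe lower bound:   {pcie_ms:.1f} ms per step")
print(f"ratio:              {ratio:.2f}x")
```

The ratio is just 450/64, so the useful part of this sketch is the absolute scale: even the ideal PCIe case adds a noticeable per-step cost at this data volume.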

Benchmarks

My initial hypothesis was that although SXM instances were more expensive per hour, they would be cheaper for completing the training run. From the nanochat leaderboard¹¹ (d26 + FP8 link), the d26 GPT-2 record uses --target-param-data-ratio=8.5 with FP8, training on ~7.8B tokens at batch size 524,288 for 14,889 steps to reach CORE 0.2578 (original GPT-2: 0.2565). Each step consists of the model forward pass, the backward pass and the optimizer step.

The first two runs I did were on SXM and PCIe on Runpod. SXM had 160 vCPUs whereas PCIe had 252 vCPUs. However, the results were starkly different from my hypothesis: PCIe did better on overall training budget. Then I found a Vast.ai offering of SXM with 256 vCPUs, which outperformed PCIe. To check whether this improvement was purely due to vCPU sizing, I ran a 128 vCPU SXM on Vast.ai and found it matched the 256 vCPU SXM. In later sections, I detail this disparity and the possible causes of the apparent regression of the 160 vCPU SXM on Runpod. And finally, I benchmarked the NVL configuration for completeness.

In the results reported, I only talk about the three variants - SXM 128 vCPU (on vast.ai), PCIe 252 vCPU and NVL 128 vCPU on Runpod. The experiments that failed are in the final section.

I wrote a profiling script¹² (profile_comms.py) which performs a warmup of 3 steps and then profiles 10 steps. I use torch.cuda.Event to time each step. The script also isolates the optimizer’s average time, revealing network overhead. I measured compute and network time separately, even though they overlap during normal training, so the measured times will be slightly higher than in actual Nanochat training.
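
The warmup-then-measure structure looks roughly like the sketch below. It is not the author's profile_comms.py: to keep it runnable on any machine it times a dummy CPU step with time.perf_counter, which is a swap for the torch.cuda.Event timing described above. On a GPU, wall-clock timers need an explicit synchronize (or CUDA events) because kernel launches are asynchronous. The names profile and fake_step are hypothetical.

```python
import time
from statistics import mean

WARMUP_STEPS = 3    # from the text
PROFILE_STEPS = 10  # from the text


def profile(step_fn, warmup=WARMUP_STEPS, steps=PROFILE_STEPS):
    """Run untimed warmup steps, then return per-step wall time in ms."""
    for _ in range(warmup):
        step_fn()  # warmup: caches, allocators and autotuning settle here

    timings_ms = []
    for _ in range(steps):
        start = time.perf_counter()
        step_fn()
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return timings_ms


calls = []


def fake_step():
    """Hypothetical stand-in for forward + backward + optimizer step."""
    calls.append(1)
    sum(range(10_000))


timings = profile(fake_step)
print(f"avg step: {mean(timings):.3f} ms over {len(timings)} steps")
```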

Nvidia’s nsys tool records annotated regions of the script. Through torch.cuda.nvtx.range_push we break down each operation’s timing. The nvtx ranges and cuda events are split into three phases: Phase 1-Reduces, Phase 2-Compute+Gather, Phase 3-WaitGathers. Phases 1 and 2 are GPU-intensive, performing network collectives and fused optimizer kernels respectively. Phase 3 is a synchronization step where the CPU waits on network completion.

Measured Step Times for d26

Profiled at device_batch_size=32, total_batch_size=524,288 (no gradient accumulation). The d26 GPT-2 record (Run 2) uses the same batch size with device_batch_size=16 and grad_accum=2, which produces equivalent step times.

SXM completes each step in ~702ms — nearly half the time of PCIe, and a third of NVL.

Platform	vCPUs	Avg Step Time	Optimizer Step	Comm Overhead	Relative	Training Time
SXM (NVSwitch)	128	701.9 ms	57.8 ms	8.2%	1.00x	2.90 hours
PCIe	252	1411.6 ms	375 ms	26.6%	2.01x	5.84 hours
NVL	128	2031.5 ms	395.6 ms	19.5%	2.89x	8.40 hours
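
The derived columns in this table follow from the step times and the 14,889-step run length quoted earlier. A short sketch that recomputes them (the dictionary layout and function names are mine):

```python
TOTAL_STEPS = 14_889  # d26 GPT-2 record run length quoted in the text

# platform: (avg step time in ms, optimizer step in ms), from the table
RUNS = {
    "SXM": (701.9, 57.8),
    "PCIe": (1411.6, 375.0),
    "NVL": (2031.5, 395.6),
}


def training_hours(step_ms, steps=TOTAL_STEPS):
    """Projected wall-clock hours for the full run."""
    return step_ms * steps / 1000 / 3600


def comm_overhead_pct(step_ms, optimizer_ms):
    """Share of the step spent in the optimizer (communication) phase."""
    return 100 * optimizer_ms / step_ms


baseline_ms = RUNS["SXM"][0]
for name, (step_ms, opt_ms) in RUNS.items():
    print(
        f"{name:5s} overhead={comm_overhead_pct(step_ms, opt_ms):5.1f}% "
        f"relative={step_ms / baseline_ms:.2f}x "
        f"hours={training_hours(step_ms):.2f}"
    )
```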

SXM’s NVSwitch mesh gives every GPU full bandwidth to every other GPU. PCIe is limited to ~64 GB/s per direction, and NVL only has NVLink within pairs — cross-pair traffic falls back to PCIe.

Average step time by H100 platform — lower is better. Dashed line marks the SXM baseline.

NCCL Communication

Measured GPU kernel execution times from Nsight Systems. All three runs produced the same total kernel call counts, enabling direct comparison of total times.

NCCL kernel execution times by operation type. NVL performs nearly identically to PCIe.

NVLink delivers a 7.3x reduction in total NCCL kernel time, matching the spec sheet’s 7x bandwidth ratio.

Per-Kernel Average Latency

From nsight, I exported the NCCL calls from the CUDA GPU Kernel Summary across all configurations.

Average latency per NCCL kernel call. NVL tracks PCIe closely, confirming cross-pair traffic uses PCIe.

SXM performs the best here. NVL's NCCL kernel times are nearly identical to PCIe's, yet its step time (2031 ms) is 44% worse than PCIe's (1412 ms). A likely explanation is that on NVL, inter-pair traffic shares the PCIe bus with host-to-device transfers, starving both.

Model Size Sensitivity (d12 vs d26)

Does a smaller model show the same interconnect sensitivity? I profiled d12 (286M params, device_batch_size=32, grad_accum=1) alongside d26 for two configurations.

d12 average step time: SXM vs NVL. Optimizer accounts for 23% on SXM, 17% on NVL.

Smaller models are more communication-sensitive. d12 spends 23% of each step in the optimizer's communication phases on SXM vs d26’s 8.2%. For small-model workloads, interconnect choice matters even more. Interestingly, Phase 1 on NVL is faster than on SXM for d12, likely because the small reduce volume fits within a single NVLink pair’s bandwidth, avoiding NVSwitch overhead.

d12 Optimizer Phase Breakdown

Platform	Phase 1	Phase 2	Phase 3	Total Optimizer
SXM 128 vCPU	4.2 ms	25.2 ms	11.7 ms	41.2 ms
NVL 128 vCPU	2.7 ms	36.3 ms	23.4 ms	62.3 ms

Takeaways

SXM completes the training in nearly half the time of PCIe and a third of NVL. At $12.85/hr on Vast.ai, it’s also the cheapest per-hour option.

Provider	Config	$/hr (8-GPU)	vCPUs	Step Time	Training Cost
Vast.ai	SXM	$12.85	128	701.9 ms	$37.27
Runpod	PCIe	$19.12	252	1411.6 ms	$111.66
Runpod	NVL	$21.52	128	2031.5 ms	$180.77
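
The training cost column is hourly rate times projected hours. The sketch below recomputes it; rounding the hours to two decimals before multiplying is an assumption on my part, made because it matches the table to the cent.

```python
TOTAL_STEPS = 14_889  # d26 GPT-2 record run length quoted in the text

# (label, on-demand $/hr for 8 GPUs, avg step time in ms), from the table
RUNS = [
    ("Vast.ai SXM", 12.85, 701.9),
    ("Runpod PCIe", 19.12, 1411.6),
    ("Runpod NVL", 21.52, 2031.5),
]


def training_cost(rate_per_hour, step_ms, steps=TOTAL_STEPS):
    """Projected cost in dollars for the full run."""
    # ASSUMPTION: hours are rounded to 2 decimals first, as the table seems to do.
    hours = round(step_ms * steps / 3.6e6, 2)
    return rate_per_hour * hours


costs = {label: training_cost(rate, step_ms) for label, rate, step_ms in RUNS}
for label, cost in costs.items():
    print(f"{label:12s} ${cost:7.2f}")

pcie_vs_sxm = costs["Runpod PCIe"] / costs["Vast.ai SXM"]
print(f"PCIe costs {pcie_vs_sxm:.1f}x as much as SXM for the same run")
```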

SXM configurations seem to be the norm from most providers. Runpod was the only provider to have all three configurations and to offer them as spot instances. Vast.ai is a bit of a lucky draw: since it’s a marketplace, not all configurations are available consistently. For shorter training runs like Nanochat, it is the best fit. Lambda.ai has 208 vCPUs and offers only SXM. Modal also offers only SXM and has a configurable CPU count since it is serverless.

Through this exercise, I now have a better intuition for how to train with spot instances. The runpod_profile_comms.sh script runs device checks early for CUDA drivers and NCCL communication, so a bad node fails fast.

Mistakes I made and issues I ran into

1. CPU starvation on the SXM run and NUMA socket pinning

My benchmark on SXM with 160vCPUs on Runpod clocked 1295ms per step, barely faster than PCIe’s 1412ms with 252 vCPUs. With higher FLOPs and faster interconnect SXM should have been a massive step-up, not a minor improvement.

Assuming it was CPU starvation, I found a 256 vCPU instance on Vast.ai and got 702ms, roughly a 1.8x improvement. Through Nsight Systems, I found the GPU kernels themselves were fast, but they spent long stretches idle, waiting for the CPU to signal the next chunk in NCCL’s ring protocol. The pthread_cond_signal count was 1.58 million in a 10-step profile on the 160 vCPU instance, vs ~4,000 on a healthy instance with 256 vCPUs. Running more experiments on Runpod with NVL and PCIe, I ran into multiple issues: slow internet on the VM, CUDA driver issues and NCCL misconfigurations.

SXM 160 vCPU NUMA-split Nsight timeline — SXM 160 vCPU (NUMA-split). The OS runtime row is dense with syscalls.

SXM 256 vCPU Nsight timeline — SXM 256 vCPU. The OS runtime row is empty.

H100 PCIe Nsight Systems timeline — PCIe instance.

To ensure I was not fitting data to my narrative, I re-ran on Vast.ai with 128 vCPUs and got 701.9ms, identical to the SXM with 256 vCPUs. So vCPU count alone did not explain the gap. Dumping the machine topology revealed one clear difference: Runpod split GPUs 4+4 across two NUMA¹³ (Non-Uniform Memory Access is a memory layout design used in data center machines. Link) nodes, while Vast.ai placed all 8 on NUMA node 0.

Multi-socket¹⁴ (a CPU socket is the physical connector on the motherboard that holds one CPU chip; a dual-socket server has two CPUs, each with its own local memory and PCIe lanes) servers have a NUMA (Non-Uniform Memory Access) architecture: each CPU socket has its own local memory. Accessing local memory is on the order of 100ns, while reaching memory on the other socket crosses the UPI (Ultra Path Interconnect) and typically costs noticeably more, often 1.5x to 2x as much. When GPUs are split across NUMA nodes, NCCL’s CPU-side coordination threads, the ones calling pthread_cond_signal and holding mutexes, pay this cross-socket penalty on every ring protocol step. The OS scheduler makes it worse: without explicit pinning, it can schedule a thread managing GPU 5 (socket 1) onto a core on socket 0, turning every memory access and signal delivery into a cross-UPI hop.

numactl --cpunodebind=N --membind=N pins processes to a specific socket, but NCCL spawns its own internal threads which may not respect this. The clean fix is what Vast.ai had: all 8 GPUs on a single NUMA node, so cross-socket latency never enters the picture. GPU-to-GPU NVLink communication is unaffected by NUMA since the bits travel over NVSwitch (data plane), never touching the CPU. But NCCL’s control threads, which orchestrate these transfers, run on the CPU (control plane). There is an active discussion in PyTorch about adding NUMA pinning to torchrun (Link).

There were also CUDA driver version differences (560 vs 570) and different kernel configs. I haven’t isolated which factor dominates; I’ll cover that in a follow-up post.

Run nvidia-smi topo -m on every new instance before benchmarking. If GPUs span multiple NUMA nodes, expect NCCL overhead. And always profile before trusting step times: a bad instance can masquerade as “SXM isn’t worth it.”
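
A scripted version of this check could look like the sketch below. It is an illustration under assumptions, not part of the author's scripts: it relies on nvidia-smi printing PCI bus IDs with an 8-digit domain and on Linux exposing each device's node at /sys/bus/pci/devices/<id>/numa_node (which reads -1 on machines without NUMA information). All function names are hypothetical, and the demo uses hard-coded mappings that mirror the 4+4 and all-on-node-0 layouts described above, so it runs without a GPU.

```python
import subprocess
from collections import defaultdict
from pathlib import Path


def gpu_pci_bus_ids():
    """Ask nvidia-smi for each GPU's PCI bus ID (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def numa_node_of(bus_id):
    """Read the NUMA node of a PCI device from Linux sysfs."""
    # ASSUMPTION: nvidia-smi prints an 8-digit domain (00000000:1B:00.0),
    # while sysfs uses a 4-digit lowercase form (0000:1b:00.0).
    sysfs_id = bus_id[-12:].lower()
    return int(Path(f"/sys/bus/pci/devices/{sysfs_id}/numa_node").read_text())


def group_by_numa(gpu_to_node):
    """Map each NUMA node to the list of GPU indices attached to it."""
    groups = defaultdict(list)
    for gpu, node in sorted(gpu_to_node.items()):
        groups[node].append(gpu)
    return dict(groups)


def spans_multiple_nodes(gpu_to_node):
    """True when the GPUs are attached to more than one NUMA node."""
    return len(set(gpu_to_node.values())) > 1


# Hard-coded examples mirroring the two layouts described in this post.
split_layout = {gpu: 0 if gpu < 4 else 1 for gpu in range(8)}
single_layout = {gpu: 0 for gpu in range(8)}

for name, layout in [("4+4 split", split_layout), ("single node", single_layout)]:
    verdict = "expect NCCL overhead" if spans_multiple_nodes(layout) else "ok"
    print(f"{name}: {group_by_numa(layout)} -> {verdict}")
```

On a real instance, the mapping would be built with something like {i: numa_node_of(b) for i, b in enumerate(gpu_pci_bus_ids())} before any benchmarking starts.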

2. Spot instances being preempted mid-profile

Spot instances are 30 to 50% cheaper than on-demand instances. But the trade-off is they can be shut down at any point with a 5-second notice. Since the profiling takes roughly 12 minutes including installation, env setup and actual profiling, I was confident I could get the work done on spot instances. But I did run into shutdowns a couple of times.

3. Broken Nodes throwing CUDA errors

I ran into this issue a few times, where the host had not been configured correctly. Likely the CUDA driver or GPU state was broken due to a driver mismatch. Fixing this on a pod that is billed by the second is expensive; shutting it down and trying again later is the best alternative.

>>> import torch, sys, os
>>>
>>> print(f'PyTorch {torch.__version__}, built with CUDA {torch.version.cuda}')
PyTorch 2.8.0+cu128, built with CUDA 12.8
>>>
>>> torch.cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 379, in init
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 412, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

4. NCCL connection issues on NVL

On one of the community instances of 8xH100 NVL on Runpod, there was an NCCL communication issue. The instance was unable to use SHM (Shared Memory), a fast inter-process transport using /dev/shm. I tried benchmarking anyway by disabling SHM through NCCL_SHM_DISABLE=1.

NCCL uses SHM for peer-to-peer (P2P) communication. On an 8-GPU NVL node, only 4 of the 28 GPU pairs share NVLink. The remaining 24 pairs rely on PCIe via SHM, so disabling it cripples 6 out of every 7 GPU pairs. When SHM is disabled, NCCL falls back to IP sockets, which are orders of magnitude slower.
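
The pair counts are easy to verify by enumeration. In this sketch the specific bridging (GPU 0 with 1, 2 with 3, and so on) is an assumption for illustration; only the fact that NVLink connects GPUs in pairs comes from the text.

```python
from fractions import Fraction
from itertools import combinations

NUM_GPUS = 8
# ASSUMPTION: adjacent GPUs are the NVLink-bridged pairs.
NVLINK_PAIRS = {(0, 1), (2, 3), (4, 5), (6, 7)}

all_pairs = list(combinations(range(NUM_GPUS), 2))
pcie_pairs = [pair for pair in all_pairs if pair not in NVLINK_PAIRS]
pcie_fraction = Fraction(len(pcie_pairs), len(all_pairs))

print(f"total pairs:  {len(all_pairs)}")
print(f"NVLink pairs: {len(NVLINK_PAIRS)}")
print(f"PCIe pairs:   {len(pcie_pairs)} ({pcie_fraction} of all pairs)")
```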

So I had to do another run on Runpod's secure cloud, which also had an NVL instance. This performed much better. This paper¹⁵ does a great job of explaining NCCL.

Metric	NVL (128 vCPU)	NVL no SHM (152 vCPU)	Degradation
Step time	2031.5 ms	6495.1 ms	3.2x
Optimizer step	395.6 ms	5402.9 ms	13.7x
Comm overhead	19.5%	83.2%	—
Total NCCL (10 steps)	30.08s	430.58s	14.3x
AllGather avg	8.715 ms	142.656 ms	16.4x
RS f32 avg	30.937 ms	393.702 ms	12.7x
AllGather max	—	1040 ms	—

NVL degradation from disabling SHM. Dashed line marks 1x (no degradation). Communication goes from manageable to dominant.

NCCL degrades by 14.3x when SHM is disabled. Note: the no-SHM instance had slightly more vCPUs (152 vs 128), which should have helped, making the SHM effect even more dramatic than the raw numbers suggest.