Which H100 to train Nanochat on?
Benchmarking H100 PCIe vs SXM vs NVL on training cost, step times, and NCCL profiling to find the cheapest GPU configuration for Nanochat
Ruminations of a data scientist turned engineer - on LLMs, deep learning, and building AI systems.
I'm a data scientist on sabbatical. I recently built clarifyit.ai.
Previously, I led an AI and data engineering team at Eka.care, training LLMs, building retrieval systems, managing vLLM deployments, and owning the product roadmap for our AI tools in healthcare. Before that, I worked on search ranking and data pipelines at Udaan, and was a visiting researcher at the National Centre for Biological Sciences (NCBS).
I write here about what I learn along the way.