Complete Architecture
The AI Hardware Stack
From silicon atoms to generated tokens — every layer that makes modern AI inference possible, and where the bottlenecks hide.
+--------------------------------------------------------------------------+
| THE COMPLETE AI HARDWARE STACK |
+==========================================================================+
| |
| L1 INFERENCE & SERVING Prefill --> KV Transfer --> Decode |
| Disaggregated pools | TTFT / ITL | PagedAttention | vLLM |
| |
+--------------------------------------------------------------------------+
| |
| L2 SOFTWARE & KERNELS CUDA --> PTX --> SASS | CUDA Graphs |
| DeepGEMM (1550 TF) | FlashMLA | FlashAttention-3 | DeepEP |
| |
+--------------------------------------------------------------------------+
| |
| L3 COMPUTE HARDWARE B200: 208B T | TPU v7 | WSE-3 |
| Tensor Cores (FP4) | Systolic Arrays | Groq LPU | Trainium |
| |
+--------------------------------------------------------------------------+
| |
| L4 MEMORY HIERARCHY Regs --> SRAM --> L2 --> HBM --> NVMe |
| The Memory Wall: compute 3x/2yr vs bandwidth 1.6x/2yr |
| |
+--------------------------------------------------------------------------+
| |
| L5 INTRA-NODE INTERCONNECT NVLink 5.0 (1.8 TB/s) | NVSwitch |
| 72 GPUs all-to-all at 130 TB/s | NVLink-C2C (900 GB/s) |
| |
+--------------------------------------------------------------------------+
| |
| L6 INTER-NODE NETWORKING IB 800G (115 Tb/s) | SHARP v4 |
| Co-Packaged Optics (65% power cut) | RoCE v2 | GPUDirect |
| |
+--------------------------------------------------------------------------+
| |
| L7 SILICON & PACKAGING CoWoS-L (4x reticle) | TSMC 3nm |
| HBM4 (2048-bit) | Hybrid Bonding | Cadence + Synopsys EDA |
| |
+--------------------------------------------------------------------------+
Plain English
Think of the whole AI system like a factory with 7 floors. Each floor does one job, and they all depend on each other from bottom to top. The bottom floor is the raw silicon — the physical atoms. The top floor is where your prompt gets answered. Everything in between is about moving data as fast as possible between the chips that do the math.
How Data Flows Through the Stack
A user prompt enters at L1 and cascades down to silicon, then back up:
User Prompt Generated Token
| ^
v |
[L1 Prefill] tokenize, build KV cache ----RDMA---> [L1 Decode] autoregress
| ^
v |
[L2 Kernel] FlashAttn tiles Q/K/V in SRAM, fused GEMM [L2 Kernel]
| ^
v |
[L3 Tensor Cores] wgmma: 128 threads x FP8 multiply [L3 Tensor Cores]
| ^
v |
[L4 HBM3e] 8 TB/s reads weights + KV cache [L4 HBM3e]
| ^
v |
[L5 NVLink] all-reduce across 72 GPUs at 1.8 TB/s [L5 NVLink]
| ^
v |
[L6 IB/RDMA] cross-node gradient sync, SHARP in-switch [L6 IB/RDMA]
| ^
v |
[L7 Silicon] all of this on a CoWoS-L interposer, 3nm [L7 Silicon]
Key Insight
Every layer is shaped by the Memory Wall. Compute scales 3x every 2 years; memory bandwidth only 1.6x. This single constraint drives HBM investment, FlashAttention algorithms, the prefill/decode split, disaggregated serving, and the entire NVLink/NVSwitch topology.
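A minimal roofline sketch of that claim, in Python. The peak-compute and bandwidth figures are the B200-class numbers quoted on this page; the exact ridge point shifts with precision (FP8 peaks give roughly the ~300 FLOPs/byte figure used in the serving section below).

# Roofline sketch: how many FLOPs per byte a kernel needs before the GPU
# stops being memory-bound. Numbers are the B200-class figures quoted here.
PEAK_FLOPS = 4.5e15          # ~4.5 PFLOPS dense FP4 (quoted above)
HBM_BW     = 8e12            # 8 TB/s HBM3e (quoted above)

ridge = PEAK_FLOPS / HBM_BW  # ~562 FLOPs/byte under these assumptions

def attainable(intensity_flops_per_byte):
    """Attainable throughput = min(peak compute, bandwidth x intensity)."""
    return min(PEAK_FLOPS, HBM_BW * intensity_flops_per_byte)

for name, intensity in [("decode (matrix-vector)", 2),
                        ("prefill (matrix-matrix)", 1000)]:
    frac = attainable(intensity) / PEAK_FLOPS
    print(f"{name:25s} {intensity:5d} FLOPs/byte -> {frac:6.1%} of peak")
# decode sits at a few FLOPs/byte, far below the ridge point: memory-bound.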
Layer 7 — Foundation
Silicon & Packaging
The physical substrate: advanced packaging places GPU dies next to HBM stacks on shared interposers, while EDA toolchains turn RTL into silicon. This is the #1 supply bottleneck for all of AI.
Analogy
This is literally the physical chip. TSMC takes a digital blueprint and etches circuits onto silicon wafers — like printing a microscopic city onto a fingernail-sized piece of glass. The tricky part is "packaging": gluing the brain chip (GPU) next to its memory chips (HBM) on a shared platform called CoWoS. It's like building a motherboard at the scale of atoms. One bug in the design = $100M+ and 6 months lost.
CoWoS-L Package Cross-Section (B200)
+------------------------------------------------------------------------+
| CoWoS-L Package (GB200) |
| |
| +------+ +------+ +------+ +--------+ +--------+ +------+ +------+ |
| |HBM3e | |HBM3e | |HBM3e | |GPU Die | |GPU Die | |HBM3e | |HBM3e | |
| | 24GB | | 24GB | | 24GB | | 104B T | | 104B T | | 24GB | | 24GB | |
| +--||--+ +--||--+ +--||--+ +---||---+ +---||---+ +--||--+ +--||--+ |
| || || || 10 TB/s || || || |
| +==||=======||=======||=======||=========||=======||=======||==+ |
| | LSIC Chiplets (Local Silicon Interconnect) | |
| +-------------------------------------------------------------+ |
| +-------------------------------------------------------------+ |
| | Organic Substrate (BGA Package) | |
| +-------------------------------------------------------------+ |
| |
+------------------------------------------------------------------------+
Analogy
The CoWoS package is like a custom apartment building. The GPU dies are the penthouse suites (doing the actual work), and the HBM stacks are the closets (storing everything). They're all connected by hallways etched into the building's foundation (the silicon interposer). The building can only be so big (reticle limit = ~858 mm²), so TSMC invented LSIC chiplets to stitch multiple foundations together — like connecting buildings with sky-bridges to make a mega-complex.
EDA Toolchain: RTL to Silicon
Cadence Flow: Synopsys Flow:
RTL (Verilog/VHDL) RTL (Verilog/VHDL)
| |
v v
[Genus] Logic Synthesis [Design Compiler]
| netlist | netlist
v v
[Innovus] Place & Route [IC Compiler II]
| layout | layout
v v
[Xcelium] Verification [VCS] Verification
| clean DRC/LVS | clean DRC/LVS
v v
[Tempus] Timing Signoff [PrimeTime] Timing Signoff
| |
v v
[Pegasus] Physical Signoff [IC Validator] Phys Signoff
| |
v v
GDSII Tapeout GDSII Tapeout
Feb 2026: Cadence ChipStack AI "Super Agent" claims 10x 3D-IC design speed
NVIDIA invested $2B in Synopsys for GPU-accelerated simulation
CoWoS Packaging Variants
Variant   Interposer       Pitch    Max Area    Used By
--------  ---------------  -------  ----------  --------------
CoWoS-S   Silicon          0.4 um   858 mm2     H100
CoWoS-L   LSIC + Organic   2-5 um   3400 mm2+   B200, R100
CoWoS-R   RDL Organic      5+ um    Large       Network ASICs
Reticle limit: ~858 mm2 (26mm x 33mm single exposure)
CoWoS-L bypasses this via LSIC chiplet stitching
Rubin (R100) targets 4x reticle = ~3,400 mm2 interposer area
TSMC CoWoS Capacity
130K
wafers/month target by end 2026
NVIDIA CoWoS Share
>50%
of total TSMC advanced packaging
B200 Transistors
208B
dual-die on TSMC 4NP
NV-HBI Bandwidth
10 TB/s
die-to-die on B200
Critical Bottleneck
CoWoS is the #1 supply constraint for AI. Every H100, B200, and future Rubin chip requires advanced packaging. TSMC is scaling from ~75K to 130K wafers/month, but demand far exceeds supply. SK Hynix has sold out its entire 2026 HBM supply. The transition from micro-bumps (25-40um pitch) to hybrid bonding (3-9um pitch) will unlock finer interconnect density for Rubin.
Layer 6 — Network Fabric
Inter-Node Networking
How GPUs in different machines communicate. RDMA lets GPUs read/write each other's memory directly — bypassing CPUs and OS kernels entirely. InfiniBand's SHARP performs collective operations inside the switch fabric itself.
Analogy
When you need thousands of GPUs across many servers, they talk over InfiniBand. Data goes directly from one GPU's memory to another GPU's memory, skipping the CPU entirely — like sending a package straight to someone's desk instead of going through the mailroom. This "RDMA" trick cuts latency dramatically. And SHARP v4 is even crazier: the network switch itself does math on your data while it's in transit, like a postal sorting facility that also fills out your tax forms as your envelope passes through.
GPUDirect RDMA: Zero-Copy GPU-to-GPU Transfer
Source GPU Source NIC IB Switch Remote NIC Remote GPU
| | | | |
| 1. Write HBM | | | |
|~~~~~~~~~~~> | | | |
| | | | |
| 2. Post SQ | | | |
|-------------->| | | |
| | | | |
| 3. DMA Read | | | |
|<~~~~~~~~~~~~~~| | | |
| | | | |
| | 4. Transmit | | |
| |-------------->| | |
| | | | |
| | | 5. SHARP v4 | |
| | | (in-network | |
| | | all-reduce) | |
| | |-------------->| |
| | | | |
| | | | 6. DMA Write |
| | | |~~~~~~~~~~~~~~>|
| | | | |
| | | | 7. CQE Done |
| | | |<~~~~~~~~~~~~~~|
| | | | |
Zero CPU involvement after setup. Kernel-bypass, zero-copy, polling-based.
NIC accesses GPU HBM directly via BAR1 PCIe memory-mapped regions.
Analogy — RDMA Step by Step
Steps 1-2: The source GPU writes data and posts a "send" request to the network card. Like writing a letter and dropping it in the outbox.
Steps 3-4: The NIC reads directly from GPU memory (no CPU involved!) and transmits over the wire. The mailroom picks up directly from your desk.
Step 5: SHARP v4 does math on the data inside the switch. The post office adds up all the numbers while sorting.
Steps 6-7: Remote NIC writes directly into the destination GPU's memory. Letter arrives directly on their desk, no unpacking needed.
SHARP v4: In-Network Compute
Traditional All-Reduce (ring):

    GPU 0 ---data---> GPU 1 ---> GPU 2
      ^                            |
      |         O(N) steps         |
      +-------- GPU 3 <------------+

    Each GPU sends + receives N-1 times around the ring.
    Latency: O(N) hops

SHARP v4 All-Reduce:

    GPU 0 ---data---> Switch
    GPU 1 ---data---> Switch   <-- 14.4 TFLOPS in-switch ALU
    GPU 2 ---data---> Switch
    GPU 3 ---data---> Switch

    Switch computes SUM, broadcasts result to all GPUs
    Latency: O(log N) hops
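A toy comparison of the two shapes above, as a sketch; the 2 us per hop is an assumed illustrative figure, not a measured one.

import math

HOP_US = 2.0  # assumed per-hop latency for illustration

def ring_steps(n):
    return 2 * (n - 1)                    # reduce-scatter + all-gather around the ring

def tree_steps(n):
    return 2 * math.ceil(math.log2(n))    # up the switch tree, then broadcast back down

for n in (8, 72, 1024):
    print(f"N={n:5d}  ring: {ring_steps(n):5d} hops (~{ring_steps(n)*HOP_US:7.0f} us)"
          f"   in-switch: {tree_steps(n):3d} hops (~{tree_steps(n)*HOP_US:5.0f} us)")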
InfiniBand vs Ethernet Comparison
Feature              InfiniBand (IB)            Ethernet / RoCE v2
-------------------  -------------------------  --------------------------
Switch               Quantum-X800               Commodity ASICs (e.g. Tomahawk)
Port Speed           800 Gb/s                   400-800 Gb/s
Total BW             115.2 Tb/s (144 ports)     25.6-51.2 Tb/s typical
Latency              sub-100 ns (cut-through)   1-5 us (store-and-forward)
In-Network Compute   SHARP v4 (14.4 TFLOPS)     None (endpoint only)
RDMA                 Native IB verbs            RoCE v2 over UDP/IP
Cost                 Premium (~2x)              Standard (~1x)
Best For             Training 10K+ GPUs         Inference, <10K-GPU training
Deployments          Stargate, Oracle 131K      Meta inference, LinkedIn
Ultra Ethernet Consortium (AMD, Broadcom, Cisco, Intel, Meta, Microsoft)
pushing AI-optimized Ethernet: ordered delivery, packet spraying, congestion control
Quantum-X800 BW
115.2 Tb/s
144 ports x 800 Gb/s
Port Latency
<100 ns
cut-through switching
SHARP Compute
14.4 TF
in-network all-reduce
CPO Power Save
65%
vs pluggable optics
Co-Packaged Optics (CPO)
NVIDIA's COUPE engines place optical transceivers directly into switch packages — converting electrical to optical at the package boundary instead of front-panel modules. This eliminates ~15cm of lossy copper PCB trace, cutting SerDes power 65% and extending reach 100x (km vs meters). Broadcom's Tomahawk 6 Davisson: 102.4 Tb/s with CPO. 2026 is the volume deployment year.
Layer 5 — Rack-Scale Fabric
Intra-Node Interconnect
NVLink and NVSwitch create a fabric where every GPU in a rack can talk to every other GPU at full bandwidth — no hotspots. This is what makes tensor parallelism across 72 GPUs possible.
Analogy
Inside a single server rack, GPUs are connected by NVLink — a super-fast private highway that's 14x faster than the normal connection (PCIe). A special traffic controller called NVSwitch makes sure any GPU can talk to any other GPU equally fast. Think of it as 72 people in a room who can all whisper to each other at the same speed — no one is farther away than anyone else. This "all-to-all" topology is what makes it possible to split a single AI model across all 72 GPUs.
GB200 NVL72: Full Crossbar Topology
GB200 NVL72 Rack
36 Grace-Blackwell Superchips = 72 Blackwell GPUs
GPU 0 GPU 1 GPU 2 GPU 3 GPU 70 GPU 71
| | | | ... | |
| | | | | |
==|=======|=======|=======|=============|=======|==
|| NVSwitch 0 NVSwitch 1 ... NVSwitch 17 ||
|| ||
|| Full Crossbar: 130 TB/s aggregate ||
|| Every GPU equidistant: 1 hop, ~2 us ||
|| Each NVSwitch: 64 ports, 6.4 TB/s ||
==|=======|=======|=======|=============|=======|==
| | | | | |
v v v v v v
Tensor Parallelism across ALL 72 GPUs in a single domain
All-reduce after every transformer layer at full 1.8 TB/s per GPU
Analogy — Why Equidistant Matters
Without NVSwitch, GPUs would be like houses on a street — neighbors talk fast, but sending a message to the other end of town takes many hops. NVSwitch makes it like everyone's in the same conference room. Every message is one hop. This matters because tensor parallelism requires all 72 GPUs to synchronize after every single layer of the neural network. If even one GPU is slow, everyone waits.
Grace-Blackwell Superchip: NVLink-C2C
+----------------------------+ +----------------------------+
| Grace CPU | | Blackwell GPU |
| | | |
| 72 ARM Neoverse V2 cores | | 576 Tensor Cores (5th gen)|
| 512 GB LPDDR5X @ 546 GB/s | | 192 GB HBM3e @ 8 TB/s |
| | | |
| CPU DRAM | | GPU HBM |
| (spill target) | | (primary pool) |
| | | |
+-------------||-------------+ +-------------||-------------+
|| ||
|| NVLink-C2C: 900 GB/s coherent ||
|| 7x PCIe Gen5 bandwidth ||
|| Unified virtual address space ||
|| No explicit cudaMemcpy needed ||
|| ||
++=================================++
Bandwidth Comparison
Interconnect Bandwidth Relative Role
----------------- -------------- --------- -------------------------
NVLink 5.0 1,800 GB/s 14x GPU-to-GPU (tensor parallel)
NVLink-C2C 900 GB/s 7x CPU-to-GPU (coherent memory)
PCIe Gen5 x16 128 GB/s 1x CPU-GPU, NVMe, NICs
PCIe Gen6 x16 256 GB/s 2x Next-gen (PAM-4 signaling)
NVLink 4 (H100): 18 links x 50 GB/s = 900 GB/s
NVLink 5 (B200): 18 links x 100 GB/s = 1,800 GB/s (2x improvement)
Physical: 112 Gbps PAM-4 SerDes per lane (doubled from NVLink 4)
NVLink 5 per GPU
1.8 TB/s
18 links x 100 GB/s each
NVSwitch Fabric
130 TB/s
72 GPUs all-to-all
NVLink-C2C
900 GB/s
coherent CPU-GPU link
PCIe Gap
14x
slower than NVLink
Why NVLink Matters
Tensor parallelism requires an all-reduce after every transformer layer. For a 405B model at BF16, that's gigabytes synchronized across GPUs at each step. PCIe's 128 GB/s would make even 2-GPU TP bandwidth-bound. NVLink's 1.8 TB/s keeps tensor cores fed. The 14x gap between PCIe and NVLink is the reason NVSwitch exists.
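A back-of-envelope sketch of that all-reduce volume. The hidden size, tokens in flight, and two-all-reduces-per-layer pattern are illustrative assumptions (a Megatron-style tensor-parallel layout); only the link bandwidths come from the table above.

HIDDEN   = 16_384        # assumed hidden dim for a ~405B dense model
TOKENS   = 8 * 4_096     # assumed 8 sequences x 4K tokens in flight
BYTES    = 2             # BF16 activations
PER_LAYER_ALLREDUCES = 2 # attention output + MLP output (assumed TP scheme)

volume = PER_LAYER_ALLREDUCES * TOKENS * HIDDEN * BYTES
print(f"activation all-reduce volume per layer: {volume/1e9:.1f} GB")

for link, gbs in [("NVLink 5 (1,800 GB/s)", 1_800), ("PCIe Gen5 x16 (128 GB/s)", 128)]:
    ms = volume / (gbs * 1e9) * 1e3
    print(f"  {link:25s} ~{ms:5.2f} ms per layer, x100+ layers per forward pass")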
Layer 4 — The Memory Wall
Memory Hierarchy
The defining constraint of AI hardware. Compute scales 3x every 2 years; memory bandwidth only 1.6x. This widening gap drives every major architectural decision — from FlashAttention to disaggregated serving.
Analogy — The Memory Ladder
GPUs have a memory ladder, each rung trading speed for size:
• Registers = the numbers you hold in your head (instant, tiny)
• Shared memory / L1 = a notepad on your desk (very fast, small)
• L2 cache = a filing cabinet in the room (fast, medium)
• HBM = a warehouse down the street (slower, big)
• SSD/Flash = a storage unit across town (slowest, huge)
The big problem: compute power grows ~3x every 2 years, but memory bandwidth only 1.6x. This "memory wall" is why so much engineering goes into keeping data close to where it's being processed. FlashAttention exists because an HBM access costs 600x more time than reading a register.
Memory Hierarchy: Bandwidth vs Capacity vs Latency
BANDWIDTH CAPACITY LATENCY
+----------+
|Registers | 256 KB/SM ~1 cycle
| ~37 TB/s |
+----||----+
+-----||------+
| L1/Shared | 228 KB/SM ~28-39 cycles
| ~19 TB/s | <-- FlashAttention lives here
+-----||------+
+---------||--------+
| L2 Cache | 126 MB ~200 cycles
| ~12 TB/s |
+---------||--------+
+-------------||-------------+
| HBM3e (DRAM) | 192 GB ~600 cycles
| 8 TB/s |
+-------------||-------------+
+------------------||-------------------+
| NVMe (NAND Flash) | TB-scale ~100K cycles
| 7-14 GB/s |
+----------------------------------------+
Each level down: ~10-100x more capacity, ~10-100x more latency
HBM access (600 cyc) vs register (1 cyc) = 600x penalty
Analogy — FlashAttention
Standard attention is like a student who reads the entire textbook (writes it to HBM), highlights it (reads from HBM), then writes a summary (reads from HBM again) — three trips to the warehouse for every page. FlashAttention is like a student who reads one chapter at a time, keeping it on their desk notepad, highlighting and summarizing before moving to the next chapter. Same result, but the warehouse trip only happens once per chapter. This is why FlashAttention turned a memory-bound operation into a compute-bound one.
FlashAttention: Why SRAM Tiling Matters
Standard Attention:
  Q (full) x K^T (full)
    -> S = Q*K^T (N x N matrix)  -- materialized in HBM!
    -> softmax(S)                -- read back from HBM
    -> P * V                     -- another HBM round-trip
  HBM reads: O(N^2). Bandwidth-bound.

FlashAttention (tiled in SRAM):
  Q_tile (fits SRAM) x K_tile^T
    -> S_tile (small, in SRAM)   -- never touches HBM!
    -> online softmax (rescaling)
    -> P_tile * V_tile (still in SRAM)
    -> accumulate output
  HBM reads: O(N^2 * d^2 / M), M = SRAM size -- IO-optimal up to constants
FlashAttention-3 (Hopper): ping-pong warpgroups, 1.3 PFLOPS = 75% peak
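A back-of-envelope version of the traffic difference, assuming FP16 and one attention head; the sequence length, head dim, and SRAM budget are illustrative (the 228 KB figure is the per-SM shared memory from the hierarchy above).

N, D, BYTES = 32_768, 128, 2          # assumed seq length, head dim, FP16
SRAM = 228 * 1024                     # per-SM shared memory budget (bytes)

# Standard attention: write + re-read the N x N matrices S and P, plus Q/K/V.
standard = 2 * 2 * N * N * BYTES + 3 * N * D * BYTES
# FlashAttention: O(N^2 * d^2 / M) HBM accesses, constants dropped.
flash = N * N * D * D * BYTES // SRAM

print(f"standard attention HBM traffic: ~{standard/1e9:.1f} GB per head")
print(f"FlashAttention HBM traffic:     ~{flash/1e9:.2f} GB per head "
      f"(~{standard/flash:.0f}x less)")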
HBM Evolution
Generation  Capacity  Bandwidth   Stack        Interface  GPU
----------  --------  ----------  -----------  ---------  ------
HBM3        80 GB     3.35 TB/s   8-high       1024-bit   H100
HBM3e       192 GB    8-9 TB/s    12-high      1024-bit   B200
HBM4        384 GB+   2+ TB/s*    12-16-high   2048-bit   Rubin
* per stack; multiple stacks per GPU
HBM4: logic-in-base die, JEDEC spec April 2025, volume late 2026
Market: SK Hynix ~50%, Samsung ~40%, Micron ~10%
SK Hynix 2026 HBM supply: entirely sold out to NVIDIA
B200 HBM BW
8 TB/s
HBM3e, 192 GB capacity
B200 L2 Cache
126 MB
2.5x H100's 50 MB
SRAM vs HBM
600x
latency penalty for HBM
Memory Wall Gap
1.9x/2yr
compute vs bandwidth scaling
The Memory Wall
The memory wall is the central constraint shaping every level of AI systems design, from transistor-level packaging (HBM4 logic-in-base) through algorithms (FlashAttention) to cluster architecture (disaggregated serving). A 405B model checkpoint with Adam states is ~2.4 TB. In 100K+ GPU clusters, mean time between failures is measured in hours — checkpointing to NVMe (7-14 GB/s, 1000x slower than HBM) is the fault-tolerance mechanism.
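A quick sketch of the checkpointing arithmetic behind that last point. The ~6 bytes/parameter is implied by the ~2.4 TB figure; the node count and per-node NVMe write speed are illustrative assumptions.

PARAMS          = 405e9
BYTES_PER_PARAM = 6          # consistent with the ~2.4 TB checkpoint quoted above
NVME_GBS        = 10         # assumed per-node NVMe write bandwidth (7-14 GB/s range)
NODES           = 512        # assumed nodes sharing the sharded write

ckpt_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"checkpoint size: ~{ckpt_gb/1e3:.1f} TB")
print(f"single writer:   ~{ckpt_gb / NVME_GBS:.0f} s per checkpoint")
print(f"sharded over {NODES} nodes: ~{ckpt_gb / NODES:.1f} GB/node, "
      f"~{ckpt_gb / NODES / NVME_GBS:.1f} s")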
Layer 3 — The Engines
Compute Hardware
GPUs, TPUs, wafer-scale chips, and deterministic dataflow processors. The frontier is bifurcating: massive FP4/FP8 throughput for model execution, and dense ARM CPU fabrics for agentic orchestration.
Analogy — Three Types of Compute
• GPU (NVIDIA) = Swiss Army knife. Flexible, massive ecosystem, good at everything. The general-purpose workhorse of AI.
• Custom ASICs (TPU, Cerebras, Groq) = purpose-built industrial machines. Faster at specific tasks but less flexible. A Cerebras chip is literally an entire silicon wafer — like using the whole factory floor for one machine.
• CPU (ARM/x86) = the orchestra conductor. For AI agents like Claude Code, the CPU does 60-90% of the work: parsing JSON, calling tools, managing state. The GPU only fires for the actual "thinking" bursts. This is why the agentic era needs 4x more CPU capacity.
NVIDIA Blackwell SM: Warp-Group Architecture
+================================================================+
| Streaming Multiprocessor (SM) |
| B200: 576 SMs total (dual-die) |
| |
| +--------+ +--------+ +--------+ +--------+ |
| | Warp 0 | | Warp 1 | | Warp 2 | | Warp 3 | = 128 thr |
| |32 thrd | |32 thrd | |32 thrd | |32 thrd | (1 wgrp) |
| +---||---+ +---||---+ +---||---+ +---||---+ |
| || || || || |
| +===||==========||==========||==========||===+ |
| | Warp-Group MMA (wgmma) Instruction | |
| | 128 threads --> single matrix multiply | |
| | FP4 / FP6 / FP8 / FP16 / BF16 | |
| +=============================================+ |
| || |
| +-----------+ +----||----+ +-------------+ |
| | Reg File | | Tensor | | L1 / Shared | |
| | 256 KB | | Core | | 228 KB | |
| +-----------+ +----------+ +-------------+ |
| |
| TMA: async 1D-5D tensor moves (global <-> shared) |
| No register file involvement, frees warps for compute |
+================================================================+
Analogy — Warp Groups
A GPU SM is like a factory floor with 4 work crews of 32 people each (warps). In older GPUs, each crew did its own small job. With warp-group MMA (wgmma), all 4 crews (128 workers) cooperate on one massive matrix multiplication together — like 4 construction crews building one wall instead of 4 separate small walls. The Tensor Memory Accelerator (TMA) is like a forklift that automatically delivers materials to the work area without any worker needing to stop and fetch things.
AI Accelerator Comparison
Chip Transistors Peak FLOPS HBM Architecture
--------------- ----------- ----------- ---------- ------------------
NVIDIA B200 208B (2die) 4.5 PF FP4 192GB 8T/s SIMT + Tensor Cores
Google TPU v7 N/A 4.6 PF 192GB 7.4 256x256 Systolic
Cerebras WSE-3 4T (wafer) 125 PF FP16 44GB SRAM 900K cores, no HBM
Groq LPU N/A ~1 PF 230MB SRAM Deterministic flow
AWS Trainium2 N/A 1.3 PF FP8 N/A 500K+ for Anthropic
Architecture styles:
NVIDIA: SIMT warps (32 threads) + Tensor Cores (wgmma)
Google: Data flows through 256x256 grid of MACs (systolic)
Cerebras: Entire wafer, eliminates HBM entirely (44GB SRAM)
Groq: Software-scheduled, no caches, deterministic latency
CPU's Growing Role in Agentic AI
Workload Type GPU Util CPU Util Why CPU Matters
-------------------- -------- -------- ---------------------------
Simple chatbot Q&A 90-95% 5-10% One forward pass, minimal CPU
RAG pipeline 70-80% 20-30% Retrieval, embedding, ranking
Multi-agent system 30-40% 60-70% Tool calls, API routing, JSON
Agentic orchestration 10-30% 60-90% State mgmt, memory, planning
Grace ARM CPU: 72 cores, NVLink-C2C at 900 GB/s to Blackwell
Implication: data centers need 4x current CPU capacity per GW
for agentic scale. ARM efficiency (2x perf/watt vs x86) matters.
B200 Peak (FP4)
4.5 PF
per GPU, dense
NVL72 Aggregate
~1.4 EF
FP4 sparse, 72 GPUs
TPU v7 Peak
4.6 PF
256x256 systolic arrays
WSE-3 Peak
125 PF
entire 300mm wafer
The Agentic Shift
Simple chatbot inference is 90-95% GPU. But agentic AI — tool calls, API routing, JSON parsing, memory management — is 60-90% CPU. This is reshaping datacenter architecture: it's no longer just about GPU FLOPS, but about CPU density, memory bandwidth per core, and power efficiency. ARM's 2x perf/watt advantage over x86 positions Grace as the agentic-era CPU.
Layer 2 — The Code
Software & Kernels
From CUDA C++ to the actual hardware instructions. DeepSeek's open-source kernels achieved cuBLAS-matching performance through PTX-level binary patching — flipping single bits in compiled GPU code for 10%+ throughput gains.
Analogy
This is the code that runs on the chips. CUDA is NVIDIA's programming language for GPUs. Your code gets compiled down through layers — like translating a book from English to Spanish to a regional dialect: CUDA C++ (human-readable) → PTX (a "universal GPU language" that works on any generation) → SASS (the actual machine code for a specific chip). DeepSeek wrote custom kernels that squeeze dramatically more performance from the same hardware — like a racecar driver who knows exactly how to take every corner, even finding a single-bit tweak in compiled code that gives 10%+ speedup.
CUDA Compilation Pipeline
CUDA C++ nvcc PTX ptxas SASS
(source) (compiler) (virtual ISA) (assembler) (device binary)
| | | | |
| split host | | | |
| and device | translate | | |
|-------------->| to virtual | | |
| | instrs | | |
| |-------------->| compile to | |
| | | target arch | |
| | |-------------->| actual GPU |
| | | | machine code |
| | | |-------------->|
| | | | |
Key PTX instructions:
mma.sync -- Volta/Turing tensor ops
wgmma.mma_async -- Hopper/Blackwell warp-group MMA
cp.async.bulk -- TMA-driven async global->shared copy
cp.async.bulk.tensor -- Tensor Memory Accelerator descriptor loads
JIT: PTX built for sm_80 is JIT-compiled to SASS at runtime
via the CUDA driver's ptxas (~100-500 ms, cached in ~/.nv/)
Analogy — PTX vs SASS
PTX is like sheet music — any orchestra (GPU generation) can read it and play the piece. SASS is like a recording for a specific concert hall — optimized for that exact venue's acoustics. When you distribute PTX, it gets "performed" (JIT compiled) on whatever GPU runs it. This forward-compatibility is why CUDA code written for H100 can run on future Rubin GPUs without recompilation.
DeepGEMM: Warp-Specialized FP8 GEMM
Producer Warps Consumer Warps
(async data loading) (matrix compute)
| |
| 1. TMA cp.async.bulk |
| (Global HBM --> Shared Mem) |
| |
| Activation tiles: 1 x 128 scaling |
| Weight tiles: 128 x 128 scaling |
| |
| 2. Signal barrier |
|----------[arrive barrier]---------------->|
| |
| 3. wgmma |
| FP8 GEMM |
| on buffer |
| |
| 4. Reclaim buffer |
|<---------[release barrier]-----------------|
| |
| 5. FFMA |
| FP32 accum |
| yield bit |
| flipped! |
| |
THE BREAKTHROUGH: PTX binary patching flips bit 12 of FFMA SASS
encoding (the yield bit), hinting scheduler to context-switch during
FP32 accumulation. This single-bit patch = 10%+ throughput gain.
Result: 1,550 TFLOPS FP8 on H800 -- matching cuBLAS, fully open-source.
Analogy — DeepGEMM's Yield Bit Trick
Imagine an assembly line where some workers (producer warps) load parts onto the conveyor belt while others (consumer warps) assemble them. DeepSeek discovered that by telling the assembler workers to take a micro-nap during a specific slow step (flipping the yield bit in FFMA instructions), the factory scheduler could keep the conveyor belt fully loaded. This one-bit post-compilation patch — like a mechanic tweaking a single screw after the car left the factory — delivered 10%+ more throughput. The result: open-source code matching NVIDIA's proprietary cuBLAS performance.
FlashMLA + DeepEP
FlashMLA (Multi-Head Latent Attention)
Traditional MHA KV Cache:
  K: [layers x heads x seq x d]
  V: [layers x heads x seq x d]
  Size: 100%

FlashMLA KV Cache:
  K+V: [layers x seq x d_latent]   (low-rank joint compression)
  d_latent = 512 (vs ~7680 original)
  Size: 6.7% of traditional
Fused kernel: projection + attention + output in single launch
HBM bandwidth utilization: 3,000 GB/s (near H800 peak of 3,350 GB/s)
--------------------------------------------------------------------------
DeepEP (Expert-Parallel MoE Dispatch)
Token --> Router --> Expert assignment --> All-to-All dispatch
|
+---------------+---------------+
| | |
Local GPU Remote GPU 1 Remote GPU N
(same node) (IB GPUDirect) (IB GPUDirect)
Low-latency mode: ~50 us per dispatch (inference)
High-throughput mode: overlapped with compute (training)
Direct GPU-to-GPU via InfiniBand RDMA, no CPU involvement
CUDA Execution Hierarchy
Grid (entire kernel launch)
|
+--> Thread Block (max 1024 threads, maps to 1 SM)
| |
| +--> Warp (32 threads, SIMT lockstep)
| | |
| | +--> Thread (single execution unit)
| |
| +--> Warp Group (4 warps = 128 threads, Hopper+)
| |
| +--> wgmma: single matrix-multiply-accumulate
|
+--> Kernel launch overhead: 20-200 us (driver path)
+--> CUDA Graphs: capture DAG, replay at ~10 us (5x speedup)
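The CUDA Graphs line above is the one most application code can act on directly. A hedged sketch using PyTorch's graph-capture API (the tiny model and shapes are illustrative assumptions): capture a fixed kernel sequence once, then replay the whole DAG with a single launch.

import torch

device = "cuda"
# Illustrative stand-in for a fixed-shape inference step.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).to(device).eval()
static_input = torch.randn(8, 1024, device=device)

# Warm up on a side stream before capture (required by the capture API).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launched inside this block is recorded into the graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then one launch
# replays the whole DAG instead of one driver round-trip per kernel.
static_input.copy_(torch.randn(8, 1024, device=device))
graph.replay()
print(static_output.shape)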
DeepGEMM
1,550 TF
FP8 on H800, open-source
FlashMLA KV
6.7%
of traditional cache size
FlashAttn-3
1.3 PF
75% of H100 peak
CUDA Graphs
5x
launch overhead reduction
DeepSeek's PTX Binary Patching
The most remarkable kernel optimization of 2025-2026: DeepSeek discovered that flipping a single bit (bit 12, the yield hint) in FFMA SASS instructions causes the warp scheduler to context-switch during FP32 accumulation, hiding latency. This post-compilation binary patch — applied to NVIDIA's own ISA — delivered 10%+ throughput gains. It required reverse-engineering undocumented SASS encoding. The result: 1,550 TFLOPS FP8, matching proprietary cuBLAS, in ~300 lines of open-source code.
Layer 1 — The Application
Inference & Serving
The 2025-2026 standard: disaggregated serving separates prefill (compute-bound) from decode (memory-bound) into distinct GPU pools connected by RDMA. This eliminates phase interference and lets each pool be independently optimized.
Analogy
This is where your prompt actually gets answered, in two phases:
• Prefill = reading the whole question at once. Like a student reading an exam prompt. Bottleneck: how fast you can read (raw compute).
• Decode = writing the answer one word at a time, each word depending on all previous words. Like writing an essay where you re-read everything you've written before adding each new sentence. Bottleneck: how fast you can re-read (memory speed).
Smart companies split these onto different GPU pools — some GPUs optimized for reading (prefill), others for writing (decode). The "memory" of what's been read (KV cache) gets transferred between them via RDMA.
Disaggregated Serving Architecture
User Request
|
v
+-----------+ RDMA KV Cache Transfer +-----------+
| PREFILL |====================================>| DECODE |
| POOL | KV cache over IB / RoCE | POOL |
| | | |
| 3 nodes | | 9 nodes |
| 24 GPUs | | 72 GPUs |
| | | |
| Compute- | | Memory- |
| bound | | bound |
| 640 TFLOPS| | 8 TB/s BW |
| saturated | | per GPU |
+-----------+ +-----------+
| |
v v
TTFT: Time to ITL: Inter-Token
First Token Latency (10-50ms)
(prompt processing) (token generation)
LMSYS benchmark: DeepSeek-R1 on 96 H100s
3-node prefill + 9-node decode
52,300 input tok/s + 22,300 output tok/s per node
Prefill vs Decode: Why They Must Be Separated
PREFILL PHASE DECODE PHASE
--------------------------------- ---------------------------------
Process: entire prompt at once Process: one token at a time
Operation: matrix-matrix multiply Operation: matrix-vector multiply
Bottleneck: raw FLOPS Bottleneck: HBM bandwidth
Tensor cores: saturated Tensor cores: mostly idle
Arithmetic intensity: HIGH Arithmetic intensity: LOW
(many ops per byte read) (1-10 ops/byte vs 300 capacity)
Reads: weights (once) Reads: weights + FULL KV cache
Builds: KV cache Updates: KV cache (+1 token)
Metric: TTFT Metric: ITL
(time to first token) (inter-token latency, 10-50ms)
Optimal HW: max TFLOPS Optimal HW: max HBM bandwidth
(B200: 4.5 PF FP4) (B200: 8 TB/s HBM3e)
Running both on same GPUs wastes resources:
- Prefill starves decode of memory bandwidth
- Decode wastes prefill's compute capacity
- Disaggregation eliminates this interference
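A small sketch of the arithmetic-intensity gap in that table, for one FFN-style GEMM; the layer dims are illustrative assumptions, weights in FP16.

HIDDEN, FFN, BYTES = 8_192, 28_672, 2   # assumed layer dims, FP16

def intensity(tokens):
    flops = 2 * tokens * HIDDEN * FFN                         # one GEMM
    moved = (HIDDEN * FFN + tokens * (HIDDEN + FFN)) * BYTES   # weights + activations
    return flops / moved

print(f"prefill (4096 tokens): ~{intensity(4096):6.0f} FLOPs/byte -> compute-bound")
print(f"decode  (1 token):     ~{intensity(1):6.1f} FLOPs/byte -> memory-bound")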
Analogy — Why Disaggregate?
Imagine a restaurant where some chefs are fast at reading orders (prefill) and others are fast at plating dishes (decode). If you make every chef do both, the fast readers are stuck waiting while plating, and the fast platers are stuck waiting while reading.
Disaggregation puts the readers in one kitchen and the platers in another, connected by a fast conveyor belt (RDMA). Each kitchen runs at peak efficiency. The 3:1 decode-to-prefill ratio (9 decode nodes vs 3 prefill) reflects that decoding is the sustained bottleneck — you read the order once but plate for much longer.
KV Cache Management at Scale
70B model, 128K context, batch of concurrent requests:
Single request KV cache: ~40 GB (half of an H100's 80 GB HBM, before any weights)
100 concurrent requests: ~4 TB KV cache needed
PagedAttention (vLLM):
+------+------+------+-------+------+
|Page 0|Page 1|Page 2| Free |Page 3| Non-contiguous pages
+------+------+------+-------+------+ like OS virtual memory
| Req A| Req A| Req B| | Req A| Near-100% utilization
+------+------+------+-------+------+ No fragmentation
Optimization techniques:
+--------------------+----------------+---------------------------+
| Technique | Reduction | Mechanism |
+--------------------+----------------+---------------------------+
| GQA (grouped attn) | 4-8x fewer KV | Share KV across heads |
| FP8/INT4 quantize | 2-4x smaller | Compress KV values |
| Prefix caching | Shared prompts | Reuse system prompt KV |
| CPU/SSD offload | Infinite* | Page cold KV (+0.1-1ms) |
| FlashMLA (DeepSeek)| 15x (to 6.7%) | Low-rank KV decomposition |
+--------------------+----------------+---------------------------+
Frameworks: NVIDIA Dynamo (orchestration), vLLM (--disaggregated-prefill)
Production: Meta, LinkedIn, Mistral, Hugging Face TGI v2
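A sizing sketch behind the "~40 GB per request" figure. The 70B-class shape below (80 layers, 8 KV heads via GQA, head_dim 128, FP16) is an assumption chosen to land near that number.

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # assumed 70B-class shape with GQA
BYTES, CONTEXT = 2, 128 * 1024            # FP16 KV, 128K context

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES       # K and V
per_request_gb = per_token * CONTEXT / 1e9
print(f"KV per token:        {per_token/1024:.0f} KiB")
print(f"KV per request:      {per_request_gb:.1f} GB at 128K context")
print(f"100 concurrent reqs: {per_request_gb * 100 / 1e3:.1f} TB")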
LMSYS Input
52.3K
tok/sec per node (prefill)
LMSYS Output
22.3K
tok/sec per node (decode)
KV Cache (70B)
~40 GB
per request at 128K ctx
FlashMLA
6.7%
KV cache vs traditional
The Decode Bottleneck
Decode is memory-bound: each token generation reads the entire KV cache from HBM but performs trivially little compute. Arithmetic intensity is 1-10 ops/byte versus tensor core capacity of ~300 ops/byte. This means 97-99% of tensor core capability is wasted during decode. The solution is disaggregation: give decode its own GPU pool optimized for bandwidth (more HBM, fewer FLOPS), and give prefill a pool optimized for compute.
End-to-End Example
A Claude Code Session Through the Stack
What actually happens — layer by layer — when you type a message in Claude Code, from keystroke to streamed token to tool execution and back.
The Full Journey: You Type "fix the auth bug"
YOUR TERMINAL ANTHROPIC CLOUD
| |
| 1. Keystroke |
| "fix the auth bug" |
| + system prompt |
| + conversation history |
| + CLAUDE.md context |
| |
|--------- HTTPS POST /messages ------->|
| |
| LOAD BALANCER
| |
| v
| PREFILL POOL (L1)
| Process entire prompt
| ~50K tokens of context
| 640 TFLOPS saturated
| |
| RDMA KV cache xfer
| |
| v
| DECODE POOL (L1)
| Generate tokens
| one at a time
| 8 TB/s HBM reads
| |
|<-------- SSE token stream ----------|
| |
| 2. Claude decides to use a tool: |
| {"type":"tool_use", |
| "name":"Grep", |
| "input":{"pattern":"auth"}} |
| |
LOCAL CPU |
3. Your machine |
executes the tool |
(grep, bash, read) |
| |
| 4. Tool result sent back |
|--------- HTTPS POST /messages ------->|
| |
| PREFILL again (new ctx)
| |
| RDMA KV xfer
| v
| DECODE response
| |
|<-------- SSE token stream ----------|
| |
| 5. Repeat until task complete |
| (may loop 10-50+ times) |
v v
Bug fixed, code edited, tests passing
Plain English
Every time you hit Enter in Claude Code, your message (plus the entire conversation history) flies over HTTPS to Anthropic's data center. There, the prefill pool reads your whole context in one parallel burst — like speed-reading 50,000 words at once. Then the KV cache (the "working memory" of what was read) gets shipped via RDMA to the decode pool, which generates the response one token at a time. When Claude decides to use a tool (Grep, Bash, Read, Edit), the token stream includes a JSON tool_use block.
Your local machine executes that tool — the cloud never touches your files. The result goes back as a new message, triggering another prefill → decode cycle. A typical Claude Code session might do this loop 10-50+ times.
What Happens Inside One Forward Pass
Layer-by-layer trace of a single token generation:
Token "fix" (from your prompt)
|
v
[L7 SILICON] B200 GPU (208B transistors, TSMC 4NP)
| sitting on CoWoS-L interposer next to HBM3e
v
[L4 HBM3e] Load embedding weights from HBM (8 TB/s)
| token "fix" --> 8192-dim vector
v
[L2 KERNEL] FlashAttention tiles Q/K/V into SRAM (228 KB)
| avoids 600-cycle HBM penalty per access
v
[L3 TENSOR CORE] wgmma: 128 threads execute FP8 matrix multiply
| Q * K^T --> attention scores (in SRAM!)
| softmax --> P * V --> output (still in SRAM!)
v
[L5 NVLink] All-reduce across 8 GPUs (tensor parallelism)
| 1.8 TB/s per GPU, via NVSwitch
| this happens EVERY LAYER (100+ times)
v
[L6 InfiniBand] If model spans nodes: cross-node gradient sync
| SHARP v4 does all-reduce in-switch
| GPUDirect RDMA, zero CPU involvement
v
[L4 HBM3e] Write updated KV cache entry for this token
| KV cache grows by ~512 bytes per layer
v
[L1 DECODE] Sample next token from logits
| argmax/top-p/temperature
v
"Let me look at the auth middleware..."
|
v
(stream to your terminal via SSE)
Analogy — One Token's Journey
For every single word Claude generates, here's the journey: The token starts as a number looked up from a huge dictionary in HBM (the warehouse). FlashAttention loads small tiles of the attention matrices onto the GPU's notepad (SRAM) to avoid constant warehouse trips. The Tensor Cores — 128 workers cooperating on one giant matrix multiply — compute attention scores entirely on the notepad. Then NVLink broadcasts the result to all other GPUs in the rack (the conference room PA system). If the model is too big for one rack, InfiniBand carries data across the building. The result: one new token, streamed back to your terminal. This whole process takes 10-50 milliseconds and repeats for every single word.
The Tool-Use Loop: Where CPU Meets GPU
A typical Claude Code "fix the auth bug" session:
Turn 1: You --> "fix the auth bug"
|
v
[GPU] Prefill (50K tokens) --> Decode --> tool_use: Grep "auth"
|
v
[YOUR CPU] Execute grep locally --> 15 matching files
|
v
Turn 2: Tool result --> API
|
v
[GPU] Prefill (52K tokens) --> Decode --> tool_use: Read auth.ts
|
v
[YOUR CPU] Read file locally --> 200 lines
|
v
Turn 3: Tool result --> API
|
v
[GPU] Prefill (55K tokens) --> Decode --> tool_use: Edit auth.ts
|
v
[YOUR CPU] Apply edit locally --> file modified
|
v
Turn 4: Tool result --> API
|
v
[GPU] Prefill (56K tokens) --> Decode --> tool_use: Bash "npm test"
|
v
[YOUR CPU] Run tests locally --> all passing
|
v
Turn 5: Tool result --> API
|
v
[GPU] Prefill (58K tokens) --> Decode --> "Done! Fixed the bug."
+--------------------------------------------------------------+
| Notice: context grows every turn. By turn 5, the prefill |
| pool is processing 58K tokens — your entire conversation |
| history + all tool results. This is why prompt caching and |
| KV cache reuse matter so much for Claude Code performance. |
+--------------------------------------------------------------+
Analogy — The Loop
Claude Code works like a chess player who can see the board but can't touch the pieces. The "thinking" (GPU) happens in Anthropic's cloud. But every time Claude wants to interact with your codebase — read a file, run a command, edit code — it has to ask you (via tool_use JSON), and your local CPU does the actual work. The result goes back, and Claude thinks again. Each loop costs a full prefill + decode cycle. With prompt caching, the unchanged prefix of your conversation can skip re-computation, saving significant time and cost.
Where Time Is Spent
Component Time per turn Bottleneck Layer
--------------------- ------------- --------------------- -----
Network round-trip 50-200 ms Your ISP / distance N/A
Prefill (50K tokens) 200-800 ms GPU compute (TFLOPS) L1,L3
KV cache transfer 10-50 ms RDMA bandwidth L6
Decode (first token) 50-200 ms HBM bandwidth L1,L4
Decode (per token) 10-50 ms HBM bandwidth L1,L4
Tool execution 10-5000 ms Your local machine N/A
Context compression ~100 ms API processing N/A
For a 5-turn session generating ~2000 tokens with 4 tool calls:
Prefill: 5 x ~400ms = ~2s (GPU compute-bound)
Decode: 2000 x ~30ms = ~60s (HBM bandwidth-bound)
Tools: 4 x ~500ms = ~2s (your local CPU)
Network: 10 x ~100ms = ~1s (round trips)
Total: ~65s. Decode dominates — the memory wall in action.
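The same arithmetic as a sketch you can re-run with your own numbers; all per-component latencies are the assumed midpoints from the table above.

TURNS, TOOL_CALLS, TOKENS_OUT = 5, 4, 2000
PREFILL_S, PER_TOKEN_S, TOOL_S, RTT_S = 0.4, 0.030, 0.5, 0.1

parts = {
    "prefill": TURNS * PREFILL_S,
    "decode":  TOKENS_OUT * PER_TOKEN_S,
    "tools":   TOOL_CALLS * TOOL_S,
    "network": 2 * TURNS * RTT_S,     # two round trips per turn
}
total = sum(parts.values())
for name, t in parts.items():
    print(f"{name:8s} {t:6.1f} s  ({t/total:5.1%})")
print(f"total    {total:6.1f} s   -- decode dominates")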
Key Insight
In a Claude Code session, decode dominates. The majority of wall-clock time is the GPU reading KV cache from HBM one token at a time — this is the memory wall made visible. Prefill is fast (parallel), tools are fast (local). But generating 2000 tokens at 30ms each = 60 seconds of memory-bandwidth-limited decode. This is why HBM bandwidth (8 TB/s on B200) matters more than raw TFLOPS for inference, and why disaggregated serving puts decode on bandwidth-optimized GPUs.
Prompt Caching
Claude Code's prompt caching means the first 90%+ of your context (system prompt, CLAUDE.md, conversation history) doesn't need re-prefilling every turn — the KV cache from the previous turn is reused. Only the new tool result needs to be prefilled. This turns a 50K-token prefill into a ~2K-token incremental prefill, cutting prefill time by ~25x. The 5-minute cache TTL means rapid back-and-forth tool loops stay fast, but a long pause between turns forces a cold re-prefill.
2026 Industry Shift
The SRAM vs HBM Inference Split
The decode bottleneck is HBM bandwidth. Cerebras and Groq attacked it by replacing HBM with on-chip SRAM entirely — trading capacity for 5,000-7,000x more bandwidth. In 2026, OpenAI shipped GPT-5.3-Codex-Spark on Cerebras chips, and NVIDIA integrated Groq into Vera Rubin. The industry is splitting: train on GPUs, serve on SRAM.
The Core Tradeoff: Why SRAM Wins at Decode
THE DECODE PROBLEM (from Layer 1):
Each token reads the full KV cache + weights from memory.
Arithmetic intensity: 1-10 ops/byte. Tensor cores 97% idle.
Bottleneck is PURE MEMORY BANDWIDTH.
Architecture      Memory Type     Bandwidth    vs B200 HBM
----------------  --------------  -----------  -----------
NVIDIA H100       HBM3            3.35 TB/s    0.4x
NVIDIA B200       HBM3e           8 TB/s       1x
Cerebras WSE-3    On-wafer SRAM   ~56 PB/s     ~7,000x
Groq 3 LPU        On-chip SRAM    40 PB/s      ~5,000x
The catch: SRAM is 100-1000x more expensive per bit than DRAM.
Cerebras: 44 GB SRAM (one wafer). Groq: 500 MB SRAM (one chip).
NVIDIA B200: 192 GB HBM3e. Capacity vs bandwidth tradeoff.
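A sketch of why decode speed tracks bandwidth: every generated token has to stream the active weights past the compute units at least once. The ~70 GB model size (a 70B-class model at FP8) and the batch-of-1 framing are illustrative assumptions; real serving numbers also depend on batching, KV-cache reads, and MoE sparsity.

MODEL_GB = 70   # assumed ~70B params at FP8
CHIPS = {
    "H100 HBM3  (3.35 TB/s)": 3.35e3,
    "B200 HBM3e (8 TB/s)":    8e3,
    "Groq 3 SRAM (40 PB/s)":  40e6,
    "WSE-3 SRAM (~56 PB/s)":  56e6,
}
for name, gb_per_s in CHIPS.items():
    ceiling = gb_per_s / MODEL_GB   # single-stream upper bound: weight reads only
    print(f"{name:24s} <= {ceiling:12,.0f} tok/s bandwidth ceiling (batch=1)")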
Analogy
HBM is like a huge warehouse with a narrow loading dock — tons of storage, but you can only move stuff in and out so fast. SRAM is like having everything on your desk — tiny workspace, but instant access. During decode, the GPU barely computes — it spends 97% of its time waiting for the next piece of data from the warehouse. Cerebras and Groq said: "forget the warehouse, put everything on the desk." The desk is expensive and small, but for inference, you don't need a warehouse — you need speed.
How the Three Architectures Work
NVIDIA B200 (GPU) "The General-Purpose Workhorse"
+-------------------+ +-------+
| 576 Tensor Cores | | HBM3e |
| 4.5 PF FP4 |<--->| 192GB | 8 TB/s bandwidth
| 126 MB L2 cache | | (off- | Weights in HBM
| CUDA + wgmma | | chip) | Decode reads HBM every token
+-------------------+ +-------+
Strengths: Training, huge models, CUDA ecosystem, flexibility
Weakness: Decode is HBM-bandwidth-bound (97% tensor core idle)
--------------------------------------------------------------------------
CEREBRAS WSE-3 "The Entire Wafer"
+==============================================+
| 300mm Silicon Wafer (entire thing is 1 chip)|
| |
| 900,000 AI Cores 44 GB SRAM (on-wafer) |
| 4 Trillion transistors |
| ~56 PB/s aggregate memory bandwidth |
| |
| No HBM. No off-chip memory bottleneck. |
| Weights live ON the wafer. Zero bus latency. |
+==============================================+
| |
MemoryX SwarmX
(off-chip weight (multi-wafer
store for >44GB interconnect
models) for clusters)
Strengths: Fastest inference for models ≤44GB on-chip, zero TTFT
Weakness: 44GB SRAM = ~70B dense model max without MemoryX
--------------------------------------------------------------------------
GROQ 3 LPU "The Deterministic Engine"
+------------------------------------------+
| 500 MB SRAM (on-chip, no HBM, no cache) |
| 1.2 PF INT8 |
| 40 PB/s on-chip SRAM bandwidth |
| |
| Compiler pre-computes ENTIRE execution |
| graph down to individual clock cycles. |
| Zero runtime scheduling. Zero variance. |
| Every token takes exactly the same time. |
+------------------------------------------+
| |
640 TB/s Rack-scale
chip-to-chip deterministic
interconnect fabric
Strengths: Lowest latency, zero variance, 0.2s TTFT, 300+ tok/s
Weakness: 500MB = small models only, or distributed across many chips
Analogy — Three Restaurants
• NVIDIA GPU = a massive restaurant with a huge kitchen (HBM) down the hall. Can cook any dish (flexible), but waiters spend most of their time running back and forth to the kitchen. Great for banquets (training). Slow for individual orders (decode).
• Cerebras WSE-3 = the chef has every ingredient on the counter in front of them (on-wafer SRAM). No walking to the pantry. Blazing fast for any dish that fits on the counter. But the counter is only 44 GB — for a 10-course banquet, you need the MemoryX pantry nearby.
• Groq LPU = a sushi bar where the chef pre-plans every movement before service starts (compiler-scheduled). Zero improvisation, zero wasted motion. Every piece of sushi takes exactly 3 seconds. Incredibly fast and predictable, but can only make sushi (inference, not training).
The 2026 Industry Split
How AI companies deploy in 2026:
TRAINING (learn the model):
    NVIDIA GPUs still dominate. No real alternative for training at scale.
    Needs HBM capacity + NVLink + IB.

INFERENCE (serve the model) splits four ways:
    NVIDIA GPUs      -->  general-purpose serving
    Cerebras WSE-3   -->  speed-critical coding agents
    Groq 3 LPU       -->  ultra-low-latency real-time apps
    AWS Trainium     -->  captive (Anthropic): Claude training
KEY DEALS (2026):
OpenAI + Cerebras: $10B+, 750 MW, GPT-5.3-Codex-Spark
NVIDIA + Groq: Groq 3 LPU integrated into Vera Rubin
AWS + Cerebras: WSE-3 on Amazon Bedrock, 5x token capacity
Anthropic + AWS: 500K+ Trainium2 chips for Claude training
The pattern: NVIDIA for training, SRAM chips for latency-critical inference.
OpenAI's Codex-Spark on Cerebras = the proof point of this split.
Plain English — Why This Split Happened
For years, NVIDIA GPUs did everything — training and inference. But as models got deployed to millions of users, companies realized they were wasting money: during decode (generating each word), 97% of the GPU's math power sits idle, waiting for data from memory. It's like hiring a team of 500 mathematicians but only giving them one page of problems at a time. Cerebras and Groq said: "What if we made the memory the chip?" Instead of a separate memory warehouse (HBM), they put massive SRAM directly on the compute die. The data is already there — no waiting. OpenAI proved this works in production by running Codex-Spark on Cerebras. NVIDIA's response: buy Groq's technology and integrate it into their next platform (Vera Rubin), so they offer both options.
Token Economics: Why Speed = Money
Why agentic coding cares about tok/s:
A Claude Code session fixing a bug:
~5 turns x ~400 tokens decoded per turn = 2,000 tokens
On NVIDIA B200: 2000 tok / ~100 tok/s = ~20 sec decode time
On Cerebras WSE-3: 2000 tok / ~2500 tok/s = ~0.8 sec decode time
On Groq 3 LPU: 2000 tok / ~300 tok/s = ~6.7 sec decode time
For a developer doing 50 Claude Code sessions/day:
B200: 50 x 20s = 16.7 min/day waiting for decode
Cerebras: 50 x 0.8s = 0.7 min/day waiting for decode
Groq: 50 x 6.7s = 5.6 min/day waiting for decode
This is why OpenAI put their CODING model on Cerebras specifically.
For agentic AI, inference latency = developer productivity.
The Convergence
The lines are blurring. NVIDIA's Vera Rubin platform now includes Groq LPUs alongside GPUs — one rack, two chip types, automatically routing workloads to the right hardware. Cerebras is adding MemoryX (off-chip weight storage) to handle models bigger than 44GB SRAM. And NVIDIA keeps growing on-chip SRAM (B200 has 126 MB L2, up from H100's 50 MB). Everyone is converging on the same insight: the memory wall is the bottleneck, and the fix is more SRAM closer to compute. The only question is how much SRAM and where to put it.
OpenAI + Cerebras
$10B+
750 MW, multi-year deal
Cerebras vs B200
2.5x
faster tok/s on Llama 4 Maverick
Groq SRAM BW
40 PB/s
5,000x more than B200 HBM
GroqCloud Devs
3.5M+
as of Feb 2026
Deep Dive — Optics
Co-Packaged Optics (CPO)
Replacing lossy electrical traces with silicon photonics integrated directly onto the switch package. Cuts power 65%, extends reach from meters to kilometers, and enables 100+ Tb/s switch bandwidth. 2026 is the volume deployment year.
Analogy
Currently, data inside a datacenter travels as electricity over copper wires, then gets converted to light for longer distances using a bulky pluggable module on the front panel. CPO moves the light converter directly onto the chip package — like moving the translator from the lobby into your office. No more signal loss walking down the hallway. The result: 65% less power, 100x more reach, and 2-3x more ports per switch.
Traditional Pluggable vs Co-Packaged Optics
TRADITIONAL (Pluggable Optics):
Switch ASIC Pluggable Module
+----------+ electrical traces +-------------+ optical fiber
| | -------- ~15 cm -----> | Transceiver | ===== fiber ====> remote
| Digital | (lossy, hot) | QSFP-DD | (long reach)
| Logic | ~15-25W per | E/O + O/E |
+----------+ 800G port +-------------+
Problems:
- Signal degrades over 15cm of board traces (SerDes power)
- E/O conversion at package boundary is power-hungry
- Front-panel space limits port density
- Cannot scale beyond ~51.2 Tb/s per switch
--------------------------------------------------------------------------
CO-PACKAGED OPTICS:
+------------------------------------------+ optical fiber
| Switch Package | ===== fiber ====> remote
| +----------+ +---------------------+ | (100x reach)
| | Switch |--->| Optical Engine | |
| | ASIC | | (Silicon Photonic | |
| | | | die ON package) | |
| | | | | |
| | Broadcom | | Laser + Modulator | |
| | or NVIDIA| | + Photodetector | |
| +----------+ +---------------------+ |
+------------------------------------------+
Benefits:
- No board traces for data (mm-scale, not cm)
- E/O conversion on-package (minimal loss)
- 65% power reduction vs pluggable
- 3.5x more energy efficient per bit
- 2-3x higher port density
Data Flow Through a CPO Engine
TRANSMIT PATH (Electrical to Optical):
Switch Die Silicon Photonic Engine
+----------+ +----------------------------+
| | electrical | +---------+ +----------+ |
| Digital | --short path---> | | Driver |-->|Modulator | |--> Fiber out
| Logic | (on-package, | | Circuit | |(Mach- | | (light)
| | mm-scale) | | | | Zehnder) | |
+----------+ | +---------+ +----------+ |
+----------------------------+
RECEIVE PATH (Optical to Electrical):
Switch Die Silicon Photonic Engine
+----------+ +----------------------------+
| | electrical | +---------+ +----------+ |
| Digital | <--short path--- | | TIA |<--| Photo- | |<-- Fiber in
| Logic | (on-package, | | (Trans- | | detector | | (light)
| | mm-scale) | | imped.)| | | |
+----------+ | +---------+ +----------+ |
+----------------------------+
Mach-Zehnder modulator: encodes data by varying light phase
TIA: converts photocurrent to voltage (trans-impedance amplifier)
Key: all conversion happens ON the package, not on a separate module
Analogy — The Two Paths
Sending data with CPO is like a translator sitting right next to you instead of in a booth across the room. When you speak (electrical signal), the translator instantly converts to the other language (light). When someone speaks back (light arrives), the translator instantly converts back. Because the translator is inches away instead of across the room, there's no time wasted, no signal lost, and the translator uses much less energy.
CPO Key Players and Timeline
Company Product Bandwidth Timeline
---------------- ---------------------------- ----------- ----------
NVIDIA Quantum-X (InfiniBand + CPO) 1.6 Tb/s 2H 2025
NVIDIA Spectrum-X (Ethernet + CPO) 3.2 Tb/s 2H 2026
Broadcom TH6-Davisson (Ethernet) 102.4 Tb/s 2025-2026
Marvell CPO for AI datacenters TBD 2026+
Ayar Labs In-package optical I/O TBD 2025+
Market projection: >$20B by 2036 (37% CAGR), inflection point in 2026
Pluggable vs CPO: Head-to-Head
Metric Pluggable CPO
------------------------ ----------------- -----------------
Switch bandwidth 51.2 Tb/s 100+ Tb/s
Power per 800G port ~15-25W ~5-9W
Reach (copper) <10m N/A
Reach (optical) 10-2000m >1000m
Port density Constrained 2-3x higher
Cooling challenge Major Reduced
Serviceability Hot-swappable Non-removable
Maturity Production Early production
Analogy — Why CPO Matters for AI
AI training clusters have thousands of GPUs that need to communicate constantly. With pluggable optics, each switch wastes 15-25 watts per port just on signal conversion. At 100,000+ GPU scale, that's megawatts of wasted power. CPO cuts that by 65%. It's like upgrading from incandescent bulbs to LEDs — same brightness, fraction of the electricity. And because CPO supports 2-3x more ports, you can build flatter network topologies with fewer hops between any two GPUs.
Power Savings
65%
vs pluggable per port
TH6-Davisson
102.4 Tb/s
Broadcom CPO switch
Market Size
$20B+
by 2036, 37% CAGR
Energy Efficiency
3.5x
better per bit vs pluggable
The Serviceability Tradeoff
The one downside of CPO: pluggable transceivers can be hot-swapped in seconds if one fails. CPO optical engines are permanently attached to the switch package — if one fails, the entire switch must be replaced. This is why some hyperscalers are cautious: at 100K+ switch scale, transceiver failures are daily events. The industry is betting CPO reliability will offset the serviceability loss, but this remains an active engineering challenge.
Deep Dive — Chip Design
EDA Chip Design Pipeline
Cadence and Synopsys control ~70% of the EDA market. Their tools take a chip from idea to silicon through 7 stages. One bug at tapeout = $100M+ and 6 months lost. AI is now designing AI chips.
Analogy
Designing a chip is like building a city from scratch. You start with a master plan (architecture), draw detailed blueprints for every building (RTL), inspect every blueprint for errors (verification), figure out which materials to use (synthesis), decide where every building goes and how roads connect them (place & route), do final inspections (signoff), then hand it to the construction company (TSMC). The "city" has 208 billion buildings (transistors), 15+ layers of roads (metal layers), and if you find one structural flaw after construction starts, you tear everything down — $100M gone, 6 months lost.
The 7-Stage Pipeline
1. ARCHITECTURE Define what the chip does
| Define ISA, SM count, tensor core specs, memory hierarchy
| Power budget, NVLink ports, die area targets
| Tools: SystemC/TLM modeling, Arm FastModels
|
v
2. RTL DESIGN Write the logic in code
| Verilog / SystemVerilog for every block:
| SMs, Tensor Cores, L2 controllers, NVLink PHYs
| H100: 80B transistors B200: 208B transistors
|
v
3. VERIFICATION Prove it actually works
| RTL simulation (cycle-accurate)
| UVM (Universal Verification Methodology)
| Formal verification (mathematical proof)
| Emulation on Palladium / ZeBu hardware
| 208B transistors = weeks of simulation
| One bug = $100M+ respin + 6 month delay
|
v
4. LOGIC SYNTHESIS Turn code into circuits
| RTL --> Gate-level netlist (AND, OR, FF gates)
| Optimizes for timing, area, power
| Targets TSMC 3nm/4nm standard cell libraries
|
v
5. PLACE & ROUTE Physical layout on silicon
| Floorplan: assign blocks to regions
| Place: position billions of standard cells
| Clock tree: distribute clock to all flip-flops
| Route: connect with 15+ metal layers
|
v
6. SIGNOFF Final checks before tapeout
| Timing: verify all paths meet frequency target
| Power: verify grid, IR drop, electromigration
| Physical: DRC (fab rules) + LVS (layout = schematic)
|
v
7. FABRICATION + PACKAGING Build the physical chip
GDSII file --> TSMC 3nm/4nm fab
Wafer processing: ~3-4 months
CoWoS packaging: mount GPU + HBM on interposer
Total: 6-12 months from tapeout to volume
Analogy — Each Stage
Stage 1: The mayor decides the city needs 576 factories and 18 highways.
Stage 2: Architects draw detailed blueprints for every building.
Stage 3: Inspectors check every blueprint — does the plumbing work? Do the elevators reach every floor?
Stage 4: Engineers choose specific bricks and wiring from the catalog.
Stage 5: Urban planners decide where each building goes and route all roads.
Stage 6: Final inspection — fire code, structural integrity, electrical safety.
Stage 7: TSMC builds it. If inspectors missed a flaw? Demolish and rebuild. $100M+ gone.
Cadence vs Synopsys: Tool-for-Tool
Cadence Flow: Synopsys Flow:
RTL (Verilog/VHDL) RTL (Verilog/VHDL)
| |
v v
[Genus] Logic Synthesis [Design Compiler]
| netlist | netlist
v v
[Innovus] Place & Route [IC Compiler II]
| layout | layout
v v
[Xcelium] Verification [VCS] Verification
| clean DRC/LVS | clean DRC/LVS
v v
[Tempus] Timing Signoff [PrimeTime] Timing Signoff
| |
v v
[Pegasus] Physical Signoff [IC Validator] Phys Signoff
| |
v v
GDSII Tapeout GDSII Tapeout
Place & Route: What It Actually Looks Like
Floorplan of a GPU die (simplified):
+================================================================+
| GPU Die (~814 mm2) |
| |
| +------+ +------+ +------+ +------+ +------+ +------+ |
| | SM 0 | | SM 1 | | SM 2 | | SM 3 | | SM 4 | | SM 5 | |
| +------+ +------+ +------+ +------+ +------+ +------+ |
| +------+ +------+ +------+ +------+ +------+ +------+ |
| | SM 6 | | SM 7 | | ... | | ... | | ... | |SM 113| |
| +------+ +------+ +------+ +------+ +------+ +------+ |
| |
| +========================+ +========================+ |
| | L2 Cache (50 MB) | | Memory Controllers | |
| +========================+ +========================+ |
| |
| +--------+ +--------+ +--------+ +--------+ +--------+ |
| |NVLink | |NVLink | |NVLink | | PCIe | | PCIe | |
| |PHY 0-5 | |PHY 6-11| |PHY12-17| | Gen5 | | Gen5 | |
| +--------+ +--------+ +--------+ +--------+ +--------+ |
+================================================================+
The P&R tool positions every cell and routes 15+ metal layers
of interconnect between them. For 80B transistors, this is an
NP-hard optimization problem — which is why AI is being used.
Analogy — Place & Route
Place & route is like solving the world's hardest jigsaw puzzle while also designing the road system. You have 80 billion pieces (transistors) that need exact positions on a surface the size of a thumbnail. Then you connect them all with 15 layers of microscopic wiring — imagine 15 transparent sheets stacked on top of each other, each with its own road network. The wiring must ensure every signal arrives on time (timing closure), no two wires short-circuit (DRC), and the power grid can deliver electricity to every corner (IR drop).
AI Designing AI Chips
The Meta-Loop: AI chips are now designed by AI
Synopsys DSO.ai (Design Space Optimization)
+----------------------------------------------------------+
| Reinforcement Learning agent explores floorplan space |
| |
| State: current layout metrics (timing, area, power) |
| Action: adjust block placement, routing priorities |
| Reward: timing slack * area efficiency * power budget |
| |
| Explores 1000s of options humans would never try |
| Result: 5-10x faster design closure |
+----------------------------------------------------------+
Cadence ChipStack Super Agent (Agentic AI, Feb 2026)
+----------------------------------------------------------+
| Multi-agent system for 3D-IC verification |
| |
| Agent 1: identifies critical timing paths |
| Agent 2: checks power integrity across chiplets |
| Agent 3: validates thermal profiles |
| Agent 4: cross-references DRC with layout intent |
| |
| Claims 10x speedup for 3D-IC design verification |
+----------------------------------------------------------+
NVIDIA GPU-Accelerated EDA
+----------------------------------------------------------+
| NVIDIA invested $2B in Synopsys |
| |
| PrimeSim on GH200: 15x SPICE simulation speedup |
| cuLitho: GPU-accelerated computational lithography |
| Result: mask generation 40x faster on H100 vs CPU |
+----------------------------------------------------------+
"Chips designed for AI, by AI" — the recursive loop
Analogy — AI Designing AI Chips
This is like an architect using a building they designed to design better buildings. Synopsys DSO.ai is an RL agent (trained on GPUs) that explores thousands of chip layout options humans would never try. Cadence's ChipStack deploys a team of AI agents that each specialize in a different aspect of chip verification. And NVIDIA invested $2B in Synopsys to accelerate the simulation tools themselves on GPUs. The meta-irony: the GPU designs the next GPU.
EDA Market Dynamics
Metric Value
------------------------ ----------------------------------------
Market duopoly Synopsys + Cadence = ~70% of EDA market
Cadence Q1 2025 rev $1.24B (+23% YoY)
Total addressable market $34.71B by 2035 (15-20% CAGR)
NVIDIA EDA investment $2B into Synopsys (GPU-accelerated EDA)
Key trend: AI workloads drive chip complexity, which drives EDA
spend, which funds AI-powered EDA tools, which design better AI
chips. A self-reinforcing flywheel.
EDA Duopoly
~70%
Synopsys + Cadence market share
Cadence Revenue
$1.24B
Q1 2025, +23% YoY
NVIDIA in Synopsys
$2B
GPU-accelerated simulation
Respin Cost
$100M+
+ 6 months for one tapeout bug
The $100M Bug
At 208 billion transistors, functional verification is the longest and most expensive phase of chip design. A single missed bug that makes it to tapeout means: scrapping the mask set ($5-10M), re-running fabrication (~4 months at TSMC), re-doing packaging and testing (~2 months), and lost revenue from delayed product. Total cost: $100M+ and 6+ months. This is why verification consumes ~60% of the total chip design effort, and why Cadence and Synopsys are pouring resources into AI-assisted verification.