Complete Architecture

The AI Hardware Stack

From silicon atoms to generated tokens — every layer that makes modern AI inference possible, and where the bottlenecks hide.

==========================================================================
 THE COMPLETE AI HARDWARE STACK
==========================================================================
 L1  INFERENCE & SERVING      Prefill --> KV Transfer --> Decode
     Disaggregated pools | TTFT / ITL | PagedAttention | vLLM
--------------------------------------------------------------------------
 L2  SOFTWARE & KERNELS       CUDA --> PTX --> SASS | CUDA Graphs
     DeepGEMM (1550 TF) | FlashMLA | FlashAttention-3 | DeepEP
--------------------------------------------------------------------------
 L3  COMPUTE HARDWARE         B200: 208B T | TPU v7 | WSE-3
     Tensor Cores (FP4) | Systolic Arrays | Groq LPU | Trainium
--------------------------------------------------------------------------
 L4  MEMORY HIERARCHY         Regs --> SRAM --> L2 --> HBM --> NVMe
     The Memory Wall: compute 3x/2yr vs bandwidth 1.6x/2yr
--------------------------------------------------------------------------
 L5  INTRA-NODE INTERCONNECT  NVLink 5.0 (1.8 TB/s) | NVSwitch
     72 GPUs all-to-all at 130 TB/s | NVLink-C2C (900 GB/s)
--------------------------------------------------------------------------
 L6  INTER-NODE NETWORKING    IB 800G (115 Tb/s) | SHARP v4
     Co-Packaged Optics (65% power cut) | RoCE v2 | GPUDirect
--------------------------------------------------------------------------
 L7  SILICON & PACKAGING      CoWoS-L (4x reticle) | TSMC 3nm
     HBM4 (2048-bit) | Hybrid Bonding | Cadence + Synopsys EDA
==========================================================================
Plain English
Think of the whole AI system like a factory with 7 floors. Each floor does one job, and they all depend on each other from bottom to top. The bottom floor is the raw silicon — the physical atoms. The top floor is where your prompt gets answered. Everything in between is about moving data as fast as possible between the chips that do the math.

How Data Flows Through the Stack

A user prompt enters at L1 and cascades down to silicon, then back up:

User Prompt
    |
    v
[L1 Prefill]       tokenize, build KV cache  ----RDMA---> [L1 Decode] autoregress
    |
    v
[L2 Kernel]        FlashAttention tiles Q/K/V in SRAM, fused GEMM
    |
    v
[L3 Tensor Cores]  wgmma: 128 threads x FP8 multiply
    |
    v
[L4 HBM3e]         8 TB/s reads of weights + KV cache
    |
    v
[L5 NVLink]        all-reduce across 72 GPUs at 1.8 TB/s
    |
    v
[L6 IB/RDMA]       cross-node sync, SHARP in-switch
    |
    v
[L7 Silicon]       all of it on a CoWoS-L interposer at 3nm
    |
    v  (then back up the stack)
Generated Token
Key Insight
Every layer is shaped by the Memory Wall. Compute scales 3x every 2 years; memory bandwidth only 1.6x. This single constraint drives HBM investment, FlashAttention algorithms, the prefill/decode split, disaggregated serving, and the entire NVLink/NVSwitch topology.
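The compounding matters more than the per-generation numbers. A quick back-of-envelope sketch, assuming the 3x and 1.6x per-2-year figures above hold steady:

```python
# How the Memory Wall widens over a decade if compute grows 3x per
# 2 years while memory bandwidth grows only 1.6x per 2 years.
def scaling(factor_per_2yr: float, years: int) -> float:
    """Cumulative growth after `years`, compounding every 2 years."""
    return factor_per_2yr ** (years / 2)

compute_10yr = scaling(3.0, 10)    # ~243x
bandwidth_10yr = scaling(1.6, 10)  # ~10.5x
gap = compute_10yr / bandwidth_10yr

print(f"compute: {compute_10yr:.0f}x, bandwidth: {bandwidth_10yr:.1f}x, gap: {gap:.0f}x")
```

Over ten years the gap between what the chips can compute and what memory can feed them grows roughly 20x, which is why every layer below keeps reinventing ways to avoid touching DRAM.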
Layer 7 — Foundation

Silicon & Packaging

The physical substrate: advanced packaging places GPU dies next to HBM stacks on shared interposers, while EDA toolchains turn RTL into silicon. This is the #1 supply bottleneck for all of AI.

Analogy
This is literally the physical chip. TSMC takes a digital blueprint and etches circuits onto silicon wafers — like printing a microscopic city onto a fingernail-sized piece of glass. The tricky part is "packaging": gluing the brain chip (GPU) next to its memory chips (HBM) on a shared platform called CoWoS. It's like building a motherboard at the scale of atoms. One bug in the design = $100M+ and 6 months lost.

CoWoS-L Package Cross-Section (B200)

CoWoS-L Package (GB200)

 +------+ +------+ +------+ +---------+ +---------+ +------+ +------+
 |HBM3e | |HBM3e | |HBM3e | | GPU Die | | GPU Die | |HBM3e | |HBM3e |
 | 24GB | | 24GB | | 24GB | | 104B T  | | 104B T  | | 24GB | | 24GB |
 +--||--+ +--||--+ +--||--+ +----||---+ +----||---+ +--||--+ +--||--+
    ||       ||       ||  10 TB/s ||         ||        ||       ||
 =================================================================
            LSIC Chiplets (Local Silicon Interconnect)
 -----------------------------------------------------------------
            Organic Substrate (BGA Package)
 -----------------------------------------------------------------
Analogy
The CoWoS package is like a custom apartment building. The GPU dies are the penthouse suites (doing the actual work), and the HBM stacks are the closets (storing everything). They're all connected by hallways etched into the building's foundation (the silicon interposer). The building can only be so big (reticle limit = ~858 mm²), so TSMC invented LSIC chiplets to stitch multiple foundations together — like connecting buildings with sky-bridges to make a mega-complex.

EDA Toolchain: RTL to Silicon

Both flows take RTL (Verilog/VHDL) through the same stages:

Stage              Output          Cadence Tool   Synopsys Tool
-----------------  --------------  -------------  ---------------
Logic Synthesis    netlist         Genus          Design Compiler
Place & Route      layout          Innovus        IC Compiler II
Verification       clean DRC/LVS   Xcelium        VCS
Timing Signoff     --              Tempus         PrimeTime
Physical Signoff   --              Pegasus        IC Validator
Tapeout            GDSII           --             --

Feb 2026: Cadence ChipStack AI "Super Agent" claims 10x 3D-IC design speed.
NVIDIA invested $2B in Synopsys for GPU-accelerated simulation.

CoWoS Packaging Variants

Variant   Interposer       Pitch    Max Area     Used By
--------  ---------------  -------  -----------  --------------
CoWoS-S   Silicon          0.4 um   858 mm2      H100
CoWoS-L   LSIC + Organic   2-5 um   3,400 mm2+   B200, R100
CoWoS-R   RDL Organic      5+ um    Large        Network ASICs

Reticle limit: ~858 mm2 (26mm x 33mm single exposure)
CoWoS-L bypasses this via LSIC chiplet stitching
Rubin (R100) targets 4x reticle = ~3,400 mm2 interposer area
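The reticle numbers above are simple arithmetic, worth making explicit:

```python
# Reticle-limit arithmetic: a single lithography exposure is 26 mm x 33 mm,
# and CoWoS-L stitches multiple reticle-sized pieces together via LSIC.
reticle_mm2 = 26 * 33            # 858 mm^2 single-exposure limit
rubin_target = 4 * reticle_mm2   # 3,432 mm^2, the "~3,400 mm2+" figure

print(reticle_mm2, rubin_target)
```

No lens can expose a larger area in one shot, so any interposer bigger than ~858 mm2 must be assembled from stitched pieces rather than printed whole.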
TSMC CoWoS Capacity
130K
wafers/month target by end 2026
NVIDIA CoWoS Share
>50%
of total TSMC advanced packaging
B200 Transistors
208B
dual-die on TSMC 4NP
NV-HBI Bandwidth
10 TB/s
die-to-die on B200
Critical Bottleneck
CoWoS is the #1 supply constraint for AI. Every H100, B200, and future Rubin chip requires advanced packaging. TSMC is scaling from ~75K to 130K wafers/month, but demand far exceeds supply. SK Hynix has sold out its entire 2026 HBM supply. The transition from micro-bumps (25-40um pitch) to hybrid bonding (3-9um pitch) will unlock finer interconnect density for Rubin.
Layer 6 — Network Fabric

Inter-Node Networking

How GPUs in different machines communicate. RDMA lets GPUs read/write each other's memory directly — bypassing CPUs and OS kernels entirely. InfiniBand's SHARP performs collective operations inside the switch fabric itself.

Analogy
When you need thousands of GPUs across many servers, they talk over InfiniBand. Data goes directly from one GPU's memory to another GPU's memory, skipping the CPU entirely — like sending a package straight to someone's desk instead of going through the mailroom. This "RDMA" trick cuts latency dramatically. And SHARP v4 is even crazier: the network switch itself does math on your data while it's in transit, like a postal sorting facility that also fills out your tax forms as your envelope passes through.

GPUDirect RDMA: Zero-Copy GPU-to-GPU Transfer

Source GPU --> Source NIC --> IB Switch --> Remote NIC --> Remote GPU

1. Source GPU writes the data into its own HBM
2. Source GPU posts a send work request to the NIC's send queue (SQ)
3. NIC DMA-reads the payload directly from GPU HBM
4. NIC transmits over the wire to the IB switch
5. SHARP v4 performs the all-reduce in-network, inside the switch
6. Remote NIC DMA-writes the result directly into the remote GPU's HBM
7. A completion queue entry (CQE) signals done

Zero CPU involvement after setup. Kernel-bypass, zero-copy, polling-based.
NIC accesses GPU HBM directly via BAR1 PCIe memory-mapped regions.
Analogy — RDMA Step by Step
Steps 1-2: The source GPU writes data and posts a "send" request to the network card. Like writing a letter and dropping it in the outbox. Steps 3-4: The NIC reads directly from GPU memory (no CPU involved!) and transmits over the wire. The mailroom picks up directly from your desk. Step 5: SHARP v4 does math on the data inside the switch. The post office adds up all the numbers while sorting. Steps 6-7: Remote NIC writes directly into the destination GPU's memory. Letter arrives directly on their desk, no unpacking needed.

SHARP v4: In-Network Compute

Traditional All-Reduce (ring):

  GPU 0 ---> GPU 1 ---> GPU 2 ---> GPU 3
    ^                                |
    +--------------------------------+

  Each GPU sends and receives N-1 times around the ring.
  Latency: O(N) hops.

SHARP v4 All-Reduce:

  GPU 0 ---data---> +--------+
  GPU 1 ---data---> | Switch |   14.4 TFLOPS in-switch ALU
  GPU 2 ---data---> |  SUM   |   computes the reduction, then
  GPU 3 ---data---> +--------+   broadcasts the result to all GPUs

  Latency: O(log N) hops.
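The step counts behind those O(N) and O(log N) claims can be sketched directly. This is a simplified model: a ring all-reduce takes 2(N-1) communication steps (reduce-scatter plus all-gather), while an in-network tree reduction takes on the order of the tree depth down and back up:

```python
# Communication steps: ring all-reduce vs in-switch tree reduction.
def ring_steps(n_gpus: int) -> int:
    """2*(N-1): reduce-scatter then all-gather around the ring."""
    return 2 * (n_gpus - 1)

def tree_steps(n_gpus: int) -> int:
    """Down the binary tree to reduce, back up to broadcast."""
    depth = (n_gpus - 1).bit_length()   # ceil(log2(n)) for n >= 2
    return 2 * depth

for n in (8, 72, 1024):
    print(f"{n:5d} GPUs: ring {ring_steps(n):4d} steps, tree {tree_steps(n):3d} steps")
```

At 72 GPUs the ring needs 142 steps while the tree needs 14; at cluster scale the gap becomes two orders of magnitude, which is why pushing the reduction into the switch pays off.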

InfiniBand vs Ethernet Comparison

Feature              InfiniBand (IB)            Ethernet / RoCE v2
-------------------  -------------------------  -------------------------
Switch               Quantum-X800               Commodity silicon
Port Speed           800 Gb/s                   400-800 Gb/s
Total BW             115.2 Tb/s (144 ports)     25.6-51.2 Tb/s typical
Latency              sub-100 ns (cut-through)   1-5 us (store-and-forward)
In-Network Compute   SHARP v4 (14.4 TFLOPS)     None (endpoint only)
RDMA                 Native IB verbs            RoCE v2 over UDP/IP
Cost                 Premium (~2x)              Standard (~1x)
Best For             Training 10K+ GPUs         Inference, <10K training
Deployments          Stargate, Oracle 131K      Meta inference, LinkedIn

Ultra Ethernet Consortium (AMD, Broadcom, Cisco, Intel, Meta, Microsoft) is
pushing AI-optimized Ethernet: ordered delivery, packet spraying, congestion
control.
Quantum-X800 BW
115.2 Tb/s
144 ports x 800 Gb/s
Port Latency
<100 ns
cut-through switching
SHARP Compute
14.4 TF
in-network all-reduce
CPO Power Save
65%
vs pluggable optics
Co-Packaged Optics (CPO)
NVIDIA's COUPE engines place optical transceivers directly into switch packages — converting electrical to optical at the package boundary instead of front-panel modules. This eliminates ~15cm of lossy copper PCB trace, cutting SerDes power 65% and extending reach 100x (km vs meters). Broadcom's Tomahawk 6 Davisson: 102.4 Tb/s with CPO. 2026 is the volume deployment year.
Layer 5 — Rack-Scale Fabric

Intra-Node Interconnect

NVLink and NVSwitch create a fabric where every GPU in a rack can talk to every other GPU at full bandwidth — no hotspots. This is what makes tensor parallelism across 72 GPUs possible.

Analogy
Inside a single server rack, GPUs are connected by NVLink — a super-fast private highway that's 14x faster than the normal connection (PCIe). A special traffic controller called NVSwitch makes sure any GPU can talk to any other GPU equally fast. Think of it as 72 people in a room who can all whisper to each other at the same speed — no one is farther away than anyone else. This "all-to-all" topology is what makes it possible to split a single AI model across all 72 GPUs.

GB200 NVL72: Full Crossbar Topology

GB200 NVL72 Rack -- 36 Grace-Blackwell Superchips = 72 Blackwell GPUs

GPU 0   GPU 1   GPU 2   GPU 3   ...   GPU 70   GPU 71
  |       |       |       |             |        |
====================================================
   NVSwitch 0    NVSwitch 1    ...    NVSwitch 17
====================================================

Full crossbar: 130 TB/s aggregate
Every GPU equidistant: 1 hop, ~2 us
Each NVSwitch: 64 ports, 6.4 TB/s

Tensor parallelism across ALL 72 GPUs in a single domain
All-reduce after every transformer layer at full 1.8 TB/s per GPU
Analogy — Why Equidistant Matters
Without NVSwitch, GPUs would be like houses on a street — neighbors talk fast, but sending a message to the other end of town takes many hops. NVSwitch makes it like everyone's in the same conference room. Every message is one hop. This matters because tensor parallelism requires all 72 GPUs to synchronize after every single layer of the neural network. If even one GPU is slow, everyone waits.

Grace-Blackwell Superchip: NVLink-C2C

Grace CPU                                Blackwell GPU
  72 ARM Neoverse V2 cores                 576 Tensor Cores (5th gen)
  512 GB LPDDR5X @ 546 GB/s                192 GB HBM3e @ 8 TB/s
  CPU DRAM = spill target                  GPU HBM = primary pool
        ||                                       ||
        ++============= NVLink-C2C ==============++
               900 GB/s coherent
               7x PCIe Gen5 bandwidth
               Unified virtual address space
               No explicit cudaMemcpy needed

Bandwidth Comparison

Interconnect       Bandwidth     Relative   Role
-----------------  ------------  ---------  ----------------------------
NVLink 5.0         1,800 GB/s    14x        GPU-to-GPU (tensor parallel)
NVLink-C2C         900 GB/s      7x         CPU-to-GPU (coherent memory)
PCIe Gen5 x16      128 GB/s      1x         CPU-GPU, NVMe, NICs
PCIe Gen6 x16      256 GB/s      2x         Next-gen (PAM-4 signaling)

NVLink 4 (H100): 18 links x 50 GB/s  = 900 GB/s
NVLink 5 (B200): 18 links x 100 GB/s = 1,800 GB/s (2x improvement)
Physical: 112 Gbps PAM-4 SerDes per lane (doubled from NVLink 4)
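The headline numbers in the table all derive from the per-link figures, which makes them easy to sanity-check:

```python
# NVLink 5 arithmetic: 18 links per GPU at 100 GB/s each, 72 GPUs
# in one NVL72 domain, compared against PCIe Gen5 x16.
links, gbs_per_link = 18, 100
per_gpu = links * gbs_per_link            # 1,800 GB/s = 1.8 TB/s
nvl72_aggregate = 72 * per_gpu / 1000     # ~129.6 TB/s (quoted as 130 TB/s)
pcie_gen5 = 128                           # GB/s

print(per_gpu, nvl72_aggregate, per_gpu / pcie_gen5)  # gap is ~14x
```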
NVLink 5 per GPU
1.8 TB/s
18 links x 100 GB/s each
NVSwitch Fabric
130 TB/s
72 GPUs all-to-all
NVLink-C2C
900 GB/s
coherent CPU-GPU link
PCIe Gap
14x
slower than NVLink
Why NVLink Matters
Tensor parallelism requires an all-reduce after every transformer layer. For a 405B model at BF16, that's gigabytes synchronized across GPUs at each step. PCIe's 128 GB/s would make even 2-GPU TP bandwidth-bound. NVLink's 1.8 TB/s keeps tensor cores fed. The 14x gap between PCIe and NVLink is the reason NVSwitch exists.
Layer 4 — The Memory Wall

Memory Hierarchy

The defining constraint of AI hardware. Compute scales 3x every 2 years; memory bandwidth only 1.6x. This widening gap drives every major architectural decision — from FlashAttention to disaggregated serving.

Analogy — The Memory Ladder
GPUs have a memory ladder, each rung trading speed for size:
Registers = the numbers you hold in your head (instant, tiny)
Shared memory / L1 = a notepad on your desk (very fast, small)
L2 cache = a filing cabinet in the room (fast, medium)
HBM = a warehouse down the street (slower, big)
SSD/Flash = a storage unit across town (slowest, huge)

The big problem: compute power grows ~3x every 2 years, but memory bandwidth only ~1.6x. This "memory wall" is why so much engineering goes into keeping data close to where it's being processed. FlashAttention exists because an HBM access costs 600x more time than reading a register.
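To put the 600x penalty in wall-clock terms, here is a tiny conversion, assuming (purely for illustration) a ~1.8 GHz GPU clock:

```python
# Convert cycle counts into nanoseconds at an assumed 1.8 GHz clock.
clock_ghz = 1.8  # illustrative assumption, not a quoted spec

def latency_ns(cycles: int) -> float:
    return cycles / clock_ghz

print(f"register: {latency_ns(1):.2f} ns, HBM: {latency_ns(600):.0f} ns")
```

A third of a microsecond per HBM round-trip may sound small, but a tensor core can retire thousands of FLOPs in that window, so every avoided DRAM access is compute reclaimed.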

Memory Hierarchy: Bandwidth vs Capacity vs Latency

Level            Bandwidth   Capacity    Latency         Note
---------------  ----------  ----------  --------------  -------------------------
Registers        ~37 TB/s    256 KB/SM   ~1 cycle
L1 / Shared Mem  ~19 TB/s    228 KB/SM   ~28-39 cycles   FlashAttention lives here
L2 Cache         ~12 TB/s    126 MB      ~200 cycles
HBM3e (DRAM)     8 TB/s      192 GB      ~600 cycles
NVMe (NAND)      7-14 GB/s   TB-scale    ~100K cycles

Each level down: ~10-100x more capacity, ~10-100x more latency
HBM access (600 cyc) vs register (1 cyc) = 600x penalty
Analogy — FlashAttention
Standard attention is like a student who reads the entire textbook (writes it to HBM), highlights it (reads from HBM), then writes a summary (reads from HBM again) — three trips to the warehouse for every page. FlashAttention is like a student who reads one chapter at a time, keeping it on their desk notepad, highlighting and summarizing before moving to the next chapter. Same result, but the warehouse trip only happens once per chapter. This is why FlashAttention turned a memory-bound operation into a compute-bound one.

FlashAttention: Why SRAM Tiling Matters

Standard Attention:

  Q (full) x K^T (full)
      |
      v
  S = Q*K^T  (N x N matrix)  -- materialized in HBM!
      |
      v
  softmax(S)  -- read back from HBM
      |
      v
  P * V  -- another HBM round-trip

  HBM reads: O(N^2)  -->  bandwidth-bound

FlashAttention (tiled in SRAM):

  Q_tile (fits in SRAM) x K_tile^T
      |
      v
  S_tile (small, stays in SRAM)  -- never touches HBM!
      |
      v
  online softmax (rescaling)
      |
      v
  P_tile * V_tile (still in SRAM), accumulate output

  HBM accesses: O(N^2 * d^2 / M), M = SRAM size  -->  effectively IO-linear

FlashAttention-3 (Hopper): ping-pong warpgroups, 1.3 PFLOPS = 75% of peak
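The IO difference can be made concrete with rough element counts. This is a sketch, not an exact model: it uses the O(N^2 d^2 / M) HBM-access bound from the FlashAttention paper and ignores constant factors:

```python
# Rough HBM-traffic comparison, in matrix elements moved.
def standard_hbm_io(n: int, d: int) -> int:
    # Materialize S = QK^T (N x N), read it back for softmax,
    # round-trip again for P*V: dominated by the N^2 terms.
    return 4 * n * n + 4 * n * d

def flash_hbm_io(n: int, d: int, m: int) -> float:
    # FlashAttention bound: O(N^2 * d^2 / M), M = SRAM size in elements.
    return n * n * d * d / m

n, d = 8192, 128                  # 8K-token sequence, head dim 128
m = 228 * 1024 // 4               # 228 KB of SRAM, 4-byte elements
ratio = standard_hbm_io(n, d) / flash_hbm_io(n, d, m)
print(f"FlashAttention moves ~{ratio:.0f}x fewer elements through HBM")
```

Even this crude count shows an order-of-magnitude reduction in HBM traffic, which is what flips attention from bandwidth-bound to compute-bound.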

HBM Evolution

Generation  Capacity  Bandwidth   Stack        Interface  GPU
----------  --------  ----------  -----------  ---------  ------
HBM3        80 GB     3.35 TB/s   8-high       1024-bit   H100
HBM3e       192 GB    8-9 TB/s    12-high      1024-bit   B200
HBM4        384 GB+   2+ TB/s*    12-16-high   2048-bit   Rubin

* per stack; multiple stacks per GPU
HBM4: logic-in-base die, JEDEC spec April 2025, volume late 2026
Market share: SK Hynix ~50%, Samsung ~40%, Micron ~10%
SK Hynix 2026 HBM supply: entirely sold out to NVIDIA
B200 HBM BW
8 TB/s
HBM3e, 192 GB capacity
B200 L2 Cache
126 MB
2.5x H100's 50 MB
SRAM vs HBM
600x
latency penalty for HBM
Memory Wall Gap
1.9x/2yr
compute vs bandwidth scaling
The Memory Wall
The memory wall is the central constraint shaping every level of AI systems design, from transistor-level packaging (HBM4 logic-in-base) through algorithms (FlashAttention) to cluster architecture (disaggregated serving). A 405B model checkpoint with Adam states is ~2.4 TB. In 100K+ GPU clusters, mean time between failures is measured in hours; checkpointing to NVMe (7-14 GB/s, 1000x slower than HBM) is the fault-tolerance mechanism.
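The checkpoint figure is worth unpacking. The ~2.4 TB quoted above implies roughly 6 bytes per parameter; the exact recipe (weight precision, optimizer-state precision) is an assumption here, but the bandwidth math is what matters:

```python
# Checkpoint sizing and write time for a 405B-parameter model.
params = 405e9
bytes_per_param = 6                        # implied by the quoted ~2.4 TB
ckpt_tb = params * bytes_per_param / 1e12  # checkpoint size in TB

nvme_gbs = 10                              # middle of the 7-14 GB/s range
minutes = ckpt_tb * 1000 / nvme_gbs / 60   # time to flush to NVMe

print(f"{ckpt_tb:.2f} TB checkpoint, ~{minutes:.1f} min at {nvme_gbs} GB/s")
```

Roughly four minutes per checkpoint: with failures arriving every few hours at 100K-GPU scale, checkpoint frequency and NVMe bandwidth directly set how much work a cluster loses per crash.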
Layer 3 — The Engines

Compute Hardware

GPUs, TPUs, wafer-scale chips, and deterministic dataflow processors. The frontier is bifurcating: massive FP4/FP8 throughput for model execution, and dense ARM CPU fabrics for agentic orchestration.

Analogy — Three Types of Compute
GPU (NVIDIA) = Swiss Army knife. Flexible, massive ecosystem, good at everything. The general-purpose workhorse of AI.
Custom ASICs (TPU, Cerebras, Groq) = purpose-built industrial machines. Faster at specific tasks but less flexible. A Cerebras chip is literally an entire silicon wafer — like using the whole factory floor for one machine.
CPU (ARM/x86) = the orchestra conductor. For AI agents like Claude Code, the CPU does 60-90% of the work: parsing JSON, calling tools, managing state. The GPU only fires for the actual "thinking" bursts. This is why the agentic era needs 4x more CPU capacity.

NVIDIA Blackwell SM: Warp-Group Architecture

Streaming Multiprocessor (SM) -- B200: 576 SMs total (dual-die)

  Warp 0      Warp 1      Warp 2      Warp 3
  32 threads  32 threads  32 threads  32 threads  = 128 threads (1 warp group)
     |           |           |           |
     +-----------+-----+-----+-----------+
                       |
        Warp-Group MMA (wgmma) instruction:
        128 threads --> single matrix multiply
        FP4 / FP6 / FP8 / FP16 / BF16
                       |
  Register File      Tensor Core      L1 / Shared Mem
  256 KB                              228 KB

  TMA: async 1D-5D tensor moves (global <-> shared),
  no register-file involvement, frees warps for compute
Analogy — Warp Groups
A GPU SM is like a factory floor with 4 work crews of 32 people each (warps). In older GPUs, each crew did its own small job. With warp-group MMA (wgmma), all 4 crews (128 workers) cooperate on one massive matrix multiplication together — like 4 construction crews building one wall instead of 4 separate small walls. The Tensor Memory Accelerator (TMA) is like a forklift that automatically delivers materials to the work area without any worker needing to stop and fetch things.

AI Accelerator Comparison

Chip             Transistors   Peak FLOPS    Memory              Architecture
---------------  ------------  ------------  ------------------  -------------------
NVIDIA B200      208B (2 die)  4.5 PF FP4    192 GB @ 8 TB/s     SIMT + Tensor Cores
Google TPU v7    N/A           4.6 PF        192 GB @ 7.4 TB/s   256x256 Systolic
Cerebras WSE-3   4T (wafer)    125 PF FP16   44 GB SRAM          900K cores, no HBM
Groq LPU         N/A           ~1 PF         230 MB SRAM         Deterministic flow
AWS Trainium2    N/A           1.3 PF FP8    N/A                 500K+ for Anthropic

Architecture styles:
  NVIDIA:   SIMT warps (32 threads) + Tensor Cores (wgmma)
  Google:   data flows through a 256x256 grid of MACs (systolic)
  Cerebras: entire wafer, eliminates HBM entirely (44 GB on-wafer SRAM)
  Groq:     software-scheduled, no caches, deterministic latency

CPU's Growing Role in Agentic AI

Workload Type          GPU Util   CPU Util   Why CPU Matters
---------------------  ---------  ---------  -----------------------------
Simple chatbot Q&A     90-95%     5-10%      One forward pass, minimal CPU
RAG pipeline           70-80%     20-30%     Retrieval, embedding, ranking
Multi-agent system     30-40%     60-70%     Tool calls, API routing, JSON
Agentic orchestration  10-30%     60-90%     State mgmt, memory, planning

Grace ARM CPU: 72 cores, NVLink-C2C at 900 GB/s to Blackwell
Implication: data centers need 4x current CPU capacity per GW at agentic
scale. ARM efficiency (2x perf/watt vs x86) matters.
B200 Peak (FP4)
4.5 PF
per GPU, dense
NVL72 Aggregate
~40 PF
FP4 sparse, 72 GPUs
TPU v7 Peak
4.6 PF
256x256 systolic arrays
WSE-3 Peak
125 PF
entire 300mm wafer
The Agentic Shift
Simple chatbot inference is 90-95% GPU. But agentic AI — tool calls, API routing, JSON parsing, memory management — is 60-90% CPU. This is reshaping datacenter architecture: it's no longer just about GPU FLOPS, but about CPU density, memory bandwidth per core, and power efficiency. ARM's 2x perf/watt advantage over x86 positions Grace as the agentic-era CPU.
Layer 2 — The Code

Software & Kernels

From CUDA C++ to the actual hardware instructions. DeepSeek's open-source kernels achieved cuBLAS-matching performance through PTX-level binary patching — flipping single bits in compiled GPU code for 10%+ throughput gains.

Analogy
This is the code that runs on the chips. CUDA is NVIDIA's programming language for GPUs. Your code gets compiled down through layers — like translating a book from English to Spanish to a regional dialect: CUDA C++ (human-readable) → PTX (a "universal GPU language" that works on any generation) → SASS (the actual machine code for a specific chip). DeepSeek wrote custom kernels that squeeze dramatically more performance from the same hardware — like a racecar driver who knows exactly how to take every corner, even finding a single-bit tweak in compiled code that gives 10%+ speedup.

CUDA Compilation Pipeline

CUDA C++   --nvcc-->   PTX            --ptxas-->   SASS
(source)               (virtual ISA)               (device binary)

1. nvcc splits the source into host and device code
2. Device code is translated into PTX virtual instructions
3. ptxas compiles PTX to SASS for the target architecture
4. SASS is the actual machine code the GPU executes

Key PTX instructions:
  mma.sync          -- Volta/Turing tensor ops
  wgmma.mma_async   -- Hopper/Blackwell warp-group MMA
  cp.async.bulk     -- TMA-driven async global->shared copy
  tma.load          -- Tensor Memory Accelerator descriptor loads

JIT: PTX compiled for sm_80 is JIT-compiled to SASS at runtime by the CUDA
driver's ptxas (~100-500 ms, cached in ~/.nv/)
Analogy — PTX vs SASS
PTX is like sheet music — any orchestra (GPU generation) can read it and play the piece. SASS is like a recording for a specific concert hall — optimized for that exact venue's acoustics. When you distribute PTX, it gets "performed" (JIT compiled) on whatever GPU runs it. This forward-compatibility is why CUDA code written for H100 can run on future Rubin GPUs without recompilation.

DeepGEMM: Warp-Specialized FP8 GEMM

Producer Warps                            Consumer Warps
(async data loading)                      (matrix compute)

1. TMA cp.async.bulk
   (Global HBM --> Shared Mem)
   Activation tiles: 1 x 128 scaling
   Weight tiles: 128 x 128 scaling

2. Signal barrier ------[arrive barrier]------>

                                          3. wgmma FP8 GEMM on buffer

   <-----[release barrier]--------------- 4. Reclaim buffer

                                          5. FFMA FP32 accumulation
                                             (yield bit flipped!)

THE BREAKTHROUGH: PTX binary patching flips bit 12 of the FFMA SASS
encoding (the yield bit), hinting the scheduler to context-switch during
FP32 accumulation. This single-bit patch = 10%+ throughput gain.
Result: 1,550 TFLOPS FP8 on H800 -- matching cuBLAS, fully open-source.
Analogy — DeepGEMM's Yield Bit Trick
Imagine an assembly line where some workers (producer warps) load parts onto the conveyor belt while others (consumer warps) assemble them. DeepSeek discovered that by telling the assembler workers to take a micro-nap during a specific slow step (flipping the yield bit in FFMA instructions), the factory scheduler could keep the conveyor belt fully loaded. This one-bit post-compilation patch — like a mechanic tweaking a single screw after the car left the factory — delivered 10%+ more throughput. The result: open-source code matching NVIDIA's proprietary cuBLAS performance.

FlashMLA + DeepEP

FlashMLA (Multi-Head Latent Attention)

  Traditional MHA KV cache:           FlashMLA KV cache:
    K: [layers x heads x seq x d]       K+V: [layers x seq x d_latent]
    V: [layers x heads x seq x d]       Low-rank joint compression:
    Size: 100%                          d_latent = 512 (vs ~7680 original)
                                        Size: 6.7% of traditional

  Fused kernel: projection + attention + output in a single launch
  HBM bandwidth utilization: 3,000 GB/s (near H800 peak of 3,350 GB/s)

--------------------------------------------------------------------------

DeepEP (Expert-Parallel MoE Dispatch)

  Token --> Router --> Expert assignment --> All-to-All dispatch
                             |
        +--------------------+--------------------+
        |                    |                    |
    Local GPU           Remote GPU 1          Remote GPU N
    (same node)         (IB GPUDirect)        (IB GPUDirect)

  Low-latency mode:     ~50 us per dispatch (inference)
  High-throughput mode: overlapped with compute (training)
  Direct GPU-to-GPU via InfiniBand RDMA, no CPU involvement

CUDA Execution Hierarchy

Grid (entire kernel launch)
 |
 +--> Thread Block (max 1024 threads, maps to 1 SM)
 |     |
 |     +--> Warp (32 threads, SIMT lockstep)
 |     |     |
 |     |     +--> Thread (single execution unit)
 |     |
 |     +--> Warp Group (4 warps = 128 threads, Hopper+)
 |           |
 |           +--> wgmma: single matrix-multiply-accumulate
 |
 +--> Kernel launch overhead: 20-200 us (driver path)
 +--> CUDA Graphs: capture DAG, replay at ~10 us (5x speedup)
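The hierarchy above reduces to simple integer arithmetic, sketched here with a hypothetical helper:

```python
# Threads -> warps -> warp groups, per the CUDA execution hierarchy.
def block_shape(threads_per_block: int):
    assert threads_per_block <= 1024   # hardware limit per thread block
    warps = threads_per_block // 32    # SIMT lockstep width
    warp_groups = warps // 4           # Hopper+ wgmma granularity
    return warps, warp_groups

print(block_shape(128))    # one warp group: the minimum for one wgmma
print(block_shape(1024))   # a full block: 32 warps, 8 warp groups
```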
DeepGEMM
1,550 TF
FP8 on H800, open-source
FlashMLA KV
6.7%
of traditional cache size
FlashAttn-3
1.3 PF
75% of H100 peak
CUDA Graphs
5x
launch overhead reduction
DeepSeek's PTX Binary Patching
The most remarkable kernel optimization of 2025-2026: DeepSeek discovered that flipping a single bit (bit 12, the yield hint) in FFMA SASS instructions causes the warp scheduler to context-switch during FP32 accumulation, hiding latency. This post-compilation binary patch — applied to NVIDIA's own ISA — delivered 10%+ throughput gains. It required reverse-engineering undocumented SASS encoding. The result: 1,550 TFLOPS FP8, matching proprietary cuBLAS, in ~300 lines of open-source code.
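The mechanics of a post-compilation bit flip are simple to illustrate. This sketch is illustrative only: the 4-byte word size, the offset, and treating bit 12 as the yield hint are stand-ins, not the real (undocumented) SASS encoding:

```python
# Illustrative sketch of a yield-bit-style binary patch: toggle one bit
# of a little-endian instruction word inside a compiled blob.
def flip_yield_bit(binary: bytearray, instr_offset: int) -> None:
    word = int.from_bytes(binary[instr_offset:instr_offset + 4], "little")
    word ^= 1 << 12   # toggle the (assumed) scheduler-hint bit
    binary[instr_offset:instr_offset + 4] = word.to_bytes(4, "little")

code = bytearray(4)        # stand-in for one encoded instruction
flip_yield_bit(code, 0)
print(hex(int.from_bytes(code, "little")))
```

The point is how surgical the intervention is: no recompilation, no source change, a single XOR applied to the finished binary.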
Layer 1 — The Application

Inference & Serving

The 2025-2026 standard: disaggregated serving separates prefill (compute-bound) from decode (memory-bound) into distinct GPU pools connected by RDMA. This eliminates phase interference and lets each pool be independently optimized.

Analogy
This is where your prompt actually gets answered, in two phases:
Prefill = reading the whole question at once. Like a student reading an exam prompt. Bottleneck: how fast you can read (raw compute).
Decode = writing the answer one word at a time, each word depending on all previous words. Like writing an essay where you re-read everything you've written before adding each new sentence. Bottleneck: how fast you can re-read (memory speed).

Smart companies split these onto different GPU pools — some GPUs optimized for reading (prefill), others for writing (decode). The "memory" of what's been read (KV cache) gets transferred between them via RDMA.

Disaggregated Serving Architecture

User Request
     |
     v
+-----------+      RDMA KV Cache Transfer      +-----------+
|  PREFILL  |================================> |  DECODE   |
|   POOL    |      KV cache over IB / RoCE     |   POOL    |
|           |                                  |           |
| 3 nodes   |                                  | 9 nodes   |
| 24 GPUs   |                                  | 72 GPUs   |
|           |                                  |           |
| Compute-  |                                  | Memory-   |
| bound     |                                  | bound     |
| 640 TFLOPS|                                  | 8 TB/s BW |
| saturated |                                  | per GPU   |
+-----------+                                  +-----------+
     |                                              |
     v                                              v
TTFT: Time to First Token                 ITL: Inter-Token Latency
(prompt processing)                       (token generation, 10-50 ms)

LMSYS benchmark: DeepSeek-R1 on 96 H100s
3-node prefill + 9-node decode
52,300 input tok/s + 22,300 output tok/s per node

Prefill vs Decode: Why They Must Be Separated

                      PREFILL PHASE                DECODE PHASE
--------------------  ---------------------------  ---------------------------
Process               entire prompt at once        one token at a time
Operation             matrix-matrix multiply       matrix-vector multiply
Bottleneck            raw FLOPS                    HBM bandwidth
Tensor cores          saturated                    mostly idle
Arithmetic intensity  HIGH (many ops per byte)     LOW (1-10 ops/byte vs ~300)
Reads                 weights (once)               weights + FULL KV cache
KV cache              builds it                    updates it (+1 token)
Metric                TTFT (time to first token)   ITL (inter-token, 10-50 ms)
Optimal HW            max TFLOPS (B200: 4.5 PF)    max HBM BW (B200: 8 TB/s)

Running both on the same GPUs wastes resources:
- Prefill starves decode of memory bandwidth
- Decode wastes prefill's compute capacity
- Disaggregation eliminates this interference
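The arithmetic-intensity gap can be sketched with a deliberately crude model: a GEMM over a whole prompt reuses each weight across every token in the batch, while decode's matrix-vector product reads each weight byte for roughly one multiply-add. FP8 weights (1 byte each) are assumed, and KV-cache traffic is ignored:

```python
# Rough ops-per-byte model for prefill (matrix-matrix) vs decode
# (matrix-vector), assuming FP8 weights at 1 byte per element.
def intensity_prefill(batch_tokens: int) -> float:
    # Each weight byte is reused across all tokens: ~2 FLOPs per token.
    return 2 * batch_tokens

def intensity_decode() -> float:
    # One multiply-add per weight byte, no reuse.
    return 2.0

print(intensity_prefill(4096), intensity_decode())
```

With a 4K-token prompt, prefill runs at thousands of ops per byte while decode sits at ~2, far below the ~300 ops/byte the tensor cores could sustain; hardware optimal for one phase is necessarily mis-provisioned for the other.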
Analogy — Why Disaggregate?
Imagine a restaurant where some chefs are fast at reading orders (prefill) and others are fast at plating dishes (decode). If you make every chef do both, the fast readers are stuck waiting while plating, and the fast platers are stuck waiting while reading. Disaggregation puts the readers in one kitchen and the platers in another, connected by a fast conveyor belt (RDMA). Each kitchen runs at peak efficiency. The 3:1 decode-to-prefill ratio (9 decode nodes vs 3 prefill) reflects that decoding is the sustained bottleneck — you read the order once but plate for much longer.

KV Cache Management at Scale

70B model, 128K context, batch of concurrent requests:
  Single request KV cache: ~40 GB (exceeds H100's 80 GB HBM alone)
  100 concurrent requests: ~4 TB of KV cache needed

PagedAttention (vLLM):

  +------+------+------+------+------+
  |Page 0|Page 1|Page 2| Free |Page 3|   Non-contiguous pages,
  +------+------+------+------+------+   like OS virtual memory
  |Req A |Req A |Req B |      |Req A |   Near-100% utilization,
  +------+------+------+------+------+   no fragmentation

Optimization techniques:

  Technique            Reduction        Mechanism
  -------------------  ---------------  --------------------------
  GQA (grouped attn)   4-8x fewer KV    Share KV across heads
  FP8/INT4 quantize    2-4x smaller     Compress KV values
  Prefix caching       Shared prompts   Reuse system prompt KV
  CPU/SSD offload      Infinite*        Page cold KV (+0.1-1 ms)
  FlashMLA (DeepSeek)  15x (to 6.7%)    Low-rank KV decomposition

Frameworks: NVIDIA Dynamo (orchestration), vLLM (--disaggregated-prefill)
Production: Meta, LinkedIn, Mistral, Hugging Face TGI v2
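A minimal sketch of PagedAttention-style allocation, assuming a heavily simplified model (vLLM's real allocator also handles copy-on-write, prefix sharing, and eviction; the class and method names here are illustrative):

```python
# Toy paged KV-cache allocator: requests map logical token slots onto
# non-contiguous physical pages, like OS virtual memory.
class PagedKVCache:
    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.free = list(range(num_pages))  # pool of free physical pages
        self.tables = {}                    # request id -> list of pages
        self.counts = {}                    # request id -> tokens stored
        self.tokens_per_page = tokens_per_page

    def append_token(self, req: str) -> int:
        """Store one more token's KV; return the physical page used."""
        n = self.counts.get(req, 0)
        table = self.tables.setdefault(req, [])
        if n % self.tokens_per_page == 0:   # last page full: map a new one
            table.append(self.free.pop(0))
        self.counts[req] = n + 1
        return table[-1]

    def release(self, req: str) -> None:
        """Request finished: recycle its pages with zero fragmentation."""
        self.free.extend(self.tables.pop(req, []))
        self.counts.pop(req, None)

kv = PagedKVCache(num_pages=4, tokens_per_page=2)
for _ in range(3):
    kv.append_token("A")      # request A fills pages 0 and 1
kv.append_token("B")          # request B takes page 2, no gap needed
kv.release("A")               # pages 0 and 1 return to the free pool
```

Because pages are allocated on demand and returned on completion, no request reserves contiguous space it never uses, which is where the near-100% utilization comes from.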
LMSYS Input
52.3K
tok/sec per node (prefill)
LMSYS Output
22.3K
tok/sec per node (decode)
KV Cache (70B)
~40 GB
per request at 128K ctx
FlashMLA
6.7%
KV cache vs traditional
The Decode Bottleneck
Decode is memory-bound: each generated token reads the entire KV cache from HBM but performs very little compute. Arithmetic intensity is 1-10 ops/byte versus tensor core capacity of ~300 ops/byte, meaning 97-99% of tensor core capability is wasted during decode. The solution is disaggregation: give decode its own GPU pool optimized for bandwidth (more HBM, fewer FLOPS), and give prefill a pool optimized for compute.
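Memory-boundedness gives decode a hard latency floor: every token must stream the weights (and KV cache) out of HBM, so bytes moved divided by bandwidth bounds the inter-token latency from below. A sketch, assuming a 70B model quantized to FP8 (~70 GB of weights) plus the ~40 GB KV cache quoted above:

```python
# Bandwidth-bound lower bound on inter-token latency (ITL):
# every decode step must read all weights + KV cache from HBM.
def min_itl_ms(weights_gb: float, kv_gb: float, hbm_tbs: float) -> float:
    bytes_gb = weights_gb + kv_gb
    return bytes_gb / (hbm_tbs * 1000) * 1000   # GB / (GB/s) -> ms

# 70B model in FP8 (~70 GB) + ~40 GB KV cache on a B200 (8 TB/s HBM3e):
print(f"~{min_itl_ms(70, 40, 8.0):.1f} ms floor per token")
```

The ~14 ms floor sits squarely inside the 10-50 ms ITL range quoted earlier, and no amount of extra FLOPS moves it: only more bandwidth, smaller KV caches (FlashMLA, GQA, quantization), or batching that amortizes the weight reads.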
End-to-End Example

A Claude Code Session Through the Stack

What actually happens — layer by layer — when you type a message in Claude Code, from keystroke to streamed token to tool execution and back.

The Full Journey: You Type "fix the auth bug"

YOUR TERMINAL                                ANTHROPIC CLOUD
     |                                             |
 1. Keystroke                                      |
    "fix the auth bug"                             |
    + system prompt                                |
    + conversation history                         |
    + CLAUDE.md context                            |
     |                                             |
     |--------- HTTPS POST /messages ------------->|
     |                                             |
     |                                       LOAD BALANCER
     |                                             |
     |                                             v
     |                                  PREFILL POOL (L1)
     |                                  Process entire prompt
     |                                  ~50K tokens of context
     |                                  640 TFLOPS saturated
     |                                             |
     |                                  RDMA KV cache xfer
     |                                             |
     |                                             v
     |                                  DECODE POOL (L1)
     |                                  Generate tokens
     |                                  one at a time
     |                                  8 TB/s HBM reads
     |                                             |
     |<-------- SSE token stream ------------------|
     |                                             |
 2. Claude decides to use a tool:                  |
      {"type":"tool_use",                          |
       "name":"Grep",                              |
       "input":{"pattern":"auth"}}                 |
     |                                             |
 LOCAL CPU                                         |
 3. Your machine                                   |
    executes the tool                              |
    (grep, bash, read)                             |
     |                                             |
 4. Tool result sent back                          |
     |--------- HTTPS POST /messages ------------->|
     |                                             |
     |                                  PREFILL again (new ctx)
     |                                             |
     |                                  RDMA KV xfer
     |                                             v
     |                                  DECODE response
     |                                             |
     |<-------- SSE token stream ------------------|
     |                                             |
 5. Repeat until task complete                     |
    (may loop 10-50+ times)                        |
     v                                             v
         Bug fixed, code edited, tests passing
Plain English
Every time you hit Enter in Claude Code, your message (plus the entire conversation history) flies over HTTPS to Anthropic's data center. There, the prefill pool reads your whole context in one parallel burst — like speed-reading 50,000 words at once. Then the KV cache (the "working memory" of what was read) gets shipped via RDMA to the decode pool, which generates the response one token at a time. When Claude decides to use a tool (Grep, Bash, Read, Edit), the token stream includes a JSON tool_use block. Your local machine executes that tool — the cloud never touches your files. The result goes back as a new message, triggering another prefill → decode cycle. A typical Claude Code session might do this loop 10-50+ times.

What Happens Inside One Forward Pass

Layer-by-layer trace of a single token generation:

Token "fix" (from your prompt)
     |
     v
[L7 SILICON]      B200 GPU (208B transistors, TSMC 4NP)
     |            sitting on CoWoS-L interposer next to HBM3e
     v
[L4 HBM3e]        Load embedding weights from HBM (8 TB/s)
     |            token "fix" --> 8192-dim vector
     v
[L2 KERNEL]       FlashAttention tiles Q/K/V into SRAM (228 KB)
     |            avoids 600-cycle HBM penalty per access
     v
[L3 TENSOR CORE]  wgmma: 128 threads execute FP8 matrix multiply
     |            Q * K^T --> attention scores (in SRAM!)
     |            softmax --> P * V --> output (still in SRAM!)
     v
[L5 NVLink]       All-reduce across 8 GPUs (tensor parallelism)
     |            1.8 TB/s per GPU, via NVSwitch
     |            this happens EVERY LAYER (100+ times)
     v
[L6 InfiniBand]   If model spans nodes: cross-node activation sync
     |            SHARP v4 does all-reduce in-switch
     |            GPUDirect RDMA, zero CPU involvement
     v
[L4 HBM3e]        Write updated KV cache entry for this token
     |            KV cache grows by ~512 bytes per layer
     v
[L1 DECODE]       Sample next token from logits
     |            argmax/top-p/temperature
     v
"Let me look at the auth middleware..."
     |
     v
(stream to your terminal via SSE)
Analogy — One Token's Journey
For every single word Claude generates, here's the journey: The token starts as a number looked up from a huge dictionary in HBM (the warehouse). FlashAttention loads small tiles of the attention matrices onto the GPU's notepad (SRAM) to avoid constant warehouse trips. The Tensor Cores — 128 workers cooperating on one giant matrix multiply — compute attention scores entirely on the notepad. Then NVLink broadcasts the result to all other GPUs in the rack (the conference room PA system). If the model is too big for one rack, InfiniBand carries data across the building. The result: one new token, streamed back to your terminal. This whole process takes 10-50 milliseconds and repeats for every single word.

The Tool-Use Loop: Where CPU Meets GPU

A typical Claude Code "fix the auth bug" session:

Turn 1:  You --> "fix the auth bug"
           |
           v
         [GPU] Prefill (50K tokens) --> Decode --> tool_use: Grep "auth"
           |
           v
         [YOUR CPU] Execute grep locally --> 15 matching files
           |
           v
Turn 2:  Tool result --> API
           |
           v
         [GPU] Prefill (52K tokens) --> Decode --> tool_use: Read auth.ts
           |
           v
         [YOUR CPU] Read file locally --> 200 lines
           |
           v
Turn 3:  Tool result --> API
           |
           v
         [GPU] Prefill (55K tokens) --> Decode --> tool_use: Edit auth.ts
           |
           v
         [YOUR CPU] Apply edit locally --> file modified
           |
           v
Turn 4:  Tool result --> API
           |
           v
         [GPU] Prefill (56K tokens) --> Decode --> tool_use: Bash "npm test"
           |
           v
         [YOUR CPU] Run tests locally --> all passing
           |
           v
Turn 5:  Tool result --> API
           |
           v
         [GPU] Prefill (58K tokens) --> Decode --> "Done! Fixed the bug."

+--------------------------------------------------------------+
| Notice: context grows every turn. By turn 5, the prefill     |
| pool is processing 58K tokens -- your entire conversation    |
| history + all tool results. This is why prompt caching and   |
| KV cache reuse matter so much for Claude Code performance.   |
+--------------------------------------------------------------+
Analogy — The Loop
Claude Code works like a chess player who can see the board but can't touch the pieces. The "thinking" (GPU) happens in Anthropic's cloud. But every time Claude wants to interact with your codebase — read a file, run a command, edit code — it has to ask you (via tool_use JSON), and your local CPU does the actual work. The result goes back, and Claude thinks again. Each loop costs a full prefill + decode cycle. With prompt caching, the unchanged prefix of your conversation can skip re-computation, saving significant time and cost.
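The loop above can be sketched in a few lines. This is a minimal illustration of the tool_use / tool_result exchange described in the text, not the real client: `call_model` and `run_tool` are hypothetical stand-ins for the Messages API and the local tool executor.

```python
# Minimal agent-loop sketch. call_model and run_tool are HYPOTHETICAL
# stand-ins; only the tool_use / tool_result shape comes from the text.
def agent_loop(call_model, run_tool, user_message, max_turns=50):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):                   # may loop 10-50+ times
        reply = call_model(messages)             # prefill + decode in the cloud
        messages.append({"role": "assistant", "content": reply})
        tool_uses = [b for b in reply if b.get("type") == "tool_use"]
        if not tool_uses:                        # no tool call -> task complete
            return messages
        results = [{"type": "tool_result",       # executed on YOUR CPU
                    "tool_use_id": b["id"],
                    "content": run_tool(b["name"], b["input"])}
                   for b in tool_uses]
        messages.append({"role": "user", "content": results})
    return messages
```

Note that the full `messages` list is resent every turn — this is exactly why context grows from 50K to 58K tokens across the session traced above, and why prompt caching pays off.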

Where Time Is Spent

Component               Time per turn   Bottleneck              Layer
---------------------   -------------   ---------------------   -----
Network round-trip      50-200 ms       Your ISP / distance     N/A
Prefill (50K tokens)    200-800 ms      GPU compute (TFLOPS)    L1,L3
KV cache transfer       10-50 ms        RDMA bandwidth          L6
Decode (first token)    50-200 ms       HBM bandwidth           L1,L4
Decode (per token)      10-50 ms        HBM bandwidth           L1,L4
Tool execution          10-5000 ms      Your local machine      N/A
Context compression     ~100 ms         API processing          N/A

For a 5-turn session generating ~2000 tokens with 4 tool calls:

  Prefill:  5 x ~400ms   = ~2s    (GPU compute-bound)
  Decode:   2000 x ~30ms = ~60s   (HBM bandwidth-bound)
  Tools:    4 x ~500ms   = ~2s    (your local CPU)
  Network:  10 x ~100ms  = ~1s    (round trips)

  Total: ~65s. Decode dominates -- the memory wall in action.
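The session budget above is simple arithmetic; here it is as a sketch, using the same illustrative per-step figures:

```python
# Session latency budget using the illustrative figures from the table.
turns, tokens, tool_calls, round_trips = 5, 2000, 4, 10

prefill_s = turns * 0.400        # ~2 s, GPU compute-bound
decode_s  = tokens * 0.030       # ~60 s, HBM bandwidth-bound
tools_s   = tool_calls * 0.500   # ~2 s, local CPU
network_s = round_trips * 0.100  # ~1 s, round trips

total_s = prefill_s + decode_s + tools_s + network_s   # ~65 s
decode_share = decode_s / total_s                      # ~0.92 of wall clock
```

Over 90% of wall-clock time is decode, which is why the per-token decode figure is the one worth optimizing.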
Key Insight
In a Claude Code session, decode dominates. The majority of wall-clock time is the GPU reading KV cache from HBM one token at a time — this is the memory wall made visible. Prefill is fast (parallel), tools are fast (local). But generating 2000 tokens at 30ms each = 60 seconds of memory-bandwidth-limited decode. This is why HBM bandwidth (8 TB/s on B200) matters more than raw TFLOPS for inference, and why disaggregated serving puts decode on bandwidth-optimized GPUs.
Prompt Caching
Claude Code's prompt caching means the first 90%+ of your context (system prompt, CLAUDE.md, conversation history) doesn't need re-prefilling every turn — the KV cache from the previous turn is reused. Only the new tool result needs to be prefilled. This turns a 50K-token prefill into a ~2K-token incremental prefill, cutting prefill time by ~25x. The 5-minute cache TTL means rapid back-and-forth tool loops stay fast, but a long pause between turns forces a cold re-prefill.
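The ~25x figure falls straight out of the token ratio. A sketch, where the per-token prefill cost is a hypothetical constant — only the 50K-to-2K ratio comes from the text:

```python
# Incremental-prefill savings. cost_us is an ASSUMED per-token cost;
# with a constant cost, the speedup is just the token ratio.
full_ctx = 50_000                # tokens re-prefilled on a cold cache
fresh = 2_000                    # only the new tool result, cache warm
cost_us = 8                      # hypothetical prefill cost, us/token

cold_s = full_ctx * cost_us / 1e6    # ~0.40 s
warm_s = fresh * cost_us / 1e6       # ~0.016 s
speedup = cold_s / warm_s            # ~25x, independent of cost_us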
2026 Industry Shift

The SRAM vs HBM Inference Split

The decode bottleneck is HBM bandwidth. Cerebras and Groq attacked it by replacing HBM with on-chip SRAM entirely — trading capacity for 5,000-7,000x more bandwidth. In 2026, OpenAI shipped GPT-5.3-Codex-Spark on Cerebras chips, and NVIDIA integrated Groq into Vera Rubin. The industry is splitting: train on GPUs, serve on SRAM.

The Core Tradeoff: Why SRAM Wins at Decode

THE DECODE PROBLEM (from Layer 1):

  Each token reads the full KV cache + weights from memory.
  Arithmetic intensity: 1-10 ops/byte. Tensor cores 97% idle.
  Bottleneck is PURE MEMORY BANDWIDTH.

Architecture           Memory Type     Bandwidth    vs B200
--------------------   -------------   ----------   --------
NVIDIA H100            HBM3            3.35 TB/s    0.4x
NVIDIA B200            HBM3e           8 TB/s       1x
Cerebras WSE-3         On-wafer SRAM   ~56 PB/s     7,000x
Groq 3 LPU             On-chip SRAM    40 PB/s      5,000x

The catch: SRAM is 100-1000x more expensive per bit than DRAM.

  Cerebras:    44 GB SRAM (one wafer)
  Groq:        500 MB SRAM (one chip)
  NVIDIA B200: 192 GB HBM3e

Capacity vs bandwidth tradeoff.
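Bandwidth translates directly into a per-token latency floor: a decode step cannot finish faster than the time it takes to stream the weights through memory. A sketch with an illustrative 70 GB model (KV cache ignored for simplicity):

```python
# Bandwidth-bound decode floor: time per token >= bytes moved / bandwidth.
# MODEL_BYTES is illustrative (70 GB of weights read per token).
MODEL_BYTES = 70e9

def token_floor_ms(bw_bytes_per_s):
    return MODEL_BYTES / bw_bytes_per_s * 1e3

b200_ms = token_floor_ms(8e12)      # ~8.75 ms  -> ~115 tok/s ceiling
wse3_ms = token_floor_ms(56e15)     # ~0.00125 ms
groq_ms = token_floor_ms(40e15)     # ~0.00175 ms
```

The SRAM architectures move the floor so low that, if the model fits, decode speed is no longer set by memory at all — which is the entire bet Cerebras and Groq are making.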
Analogy
HBM is like a huge warehouse with a narrow loading dock — tons of storage, but you can only move stuff in and out so fast. SRAM is like having everything on your desk — tiny workspace, but instant access. During decode, the GPU barely computes — it spends 97% of its time waiting for the next piece of data from the warehouse. Cerebras and Groq said: "forget the warehouse, put everything on the desk." The desk is expensive and small, but for inference, you don't need a warehouse — you need speed.

How the Three Architectures Work

NVIDIA B200 (GPU)                 "The General-Purpose Workhorse"

  +-------------------+     +-------+
  | 576 Tensor Cores  |     | HBM3e |
  | 4.5 PF FP4        |<--->| 192GB |   8 TB/s bandwidth
  | 126 MB L2 cache   |     | (off- |   Weights in HBM
  | CUDA + wgmma      |     | chip) |   Decode reads HBM every token
  +-------------------+     +-------+

  Strengths: Training, huge models, CUDA ecosystem, flexibility
  Weakness:  Decode is HBM-bandwidth-bound (97% tensor core idle)

--------------------------------------------------------------------------

CEREBRAS WSE-3                    "The Entire Wafer"

  +==============================================+
  | 300mm Silicon Wafer (entire thing is 1 chip) |
  |                                              |
  | 900,000 AI Cores    44 GB SRAM (on-wafer)    |
  | 4 Trillion transistors                       |
  | ~56 PB/s aggregate memory bandwidth          |
  |                                              |
  | No HBM. No off-chip memory bottleneck.       |
  | Weights live ON the wafer. Zero bus latency. |
  +==============================================+
        |                      |
     MemoryX                SwarmX
  (off-chip weight       (multi-wafer
   store for >44GB        interconnect
   models)                for clusters)

  Strengths: Fastest inference for models ≤44GB on-chip, zero TTFT
  Weakness:  44GB SRAM = ~70B dense model max without MemoryX

--------------------------------------------------------------------------

GROQ 3 LPU                        "The Deterministic Engine"

  +------------------------------------------+
  | 500 MB SRAM (on-chip, no HBM, no cache)  |
  | 1.2 PF INT8                              |
  | 40 PB/s on-chip SRAM bandwidth           |
  |                                          |
  | Compiler pre-computes ENTIRE execution   |
  | graph down to individual clock cycles.   |
  | Zero runtime scheduling. Zero variance.  |
  | Every token takes exactly the same time. |
  +------------------------------------------+
        |                      |
   640 TB/s                Rack-scale
   chip-to-chip            deterministic
   interconnect            fabric

  Strengths: Lowest latency, zero variance, 0.2s TTFT, 300+ tok/s
  Weakness:  500MB = small models only, or distributed across many chips
Analogy — Three Restaurants
NVIDIA GPU = a massive restaurant with a huge kitchen (HBM) down the hall. Can cook any dish (flexible), but waiters spend most of their time running back and forth to the kitchen. Great for banquets (training). Slow for individual orders (decode).
Cerebras WSE-3 = the chef has every ingredient on the counter in front of them (on-wafer SRAM). No walking to the pantry. Blazing fast for any dish that fits on the counter. But the counter is only 44 GB — for a 10-course banquet, you need the MemoryX pantry nearby.
Groq LPU = a sushi bar where the chef pre-plans every movement before service starts (compiler-scheduled). Zero improvisation, zero wasted motion. Every piece of sushi takes exactly 3 seconds. Incredibly fast and predictable, but can only make sushi (inference, not training).

The 2026 Industry Split

How AI companies deploy in 2026:

        TRAINING                        INFERENCE
    (learn the model)               (serve the model)
           |                                |
           v               +--------+-------+--------+--------+
      NVIDIA GPUs          |        |       |        |        |
   Still dominates.        v        v       v        v        v
   No real alternative   NVIDIA  Cerebras  Groq   Trainium
   for training at scale. GPUs    WSE-3    LPU 3   (AWS)
   Needs HBM capacity      |        |       |        |
   + NVLink + IB.          v        v       v        v
                        General   Speed-  Ultra-   Captive
                        purpose  critical low-lat  (Anthropic)
                        serving   coding  real-time Claude
                                  agents   apps    training

KEY DEALS (2026):

  OpenAI + Cerebras:  $10B+, 750 MW, GPT-5.3-Codex-Spark
  NVIDIA + Groq:      Groq 3 LPU integrated into Vera Rubin
  AWS + Cerebras:     WSE-3 on Amazon Bedrock, 5x token capacity
  Anthropic + AWS:    500K+ Trainium2 chips for Claude training

The pattern: NVIDIA for training, SRAM chips for latency-critical
inference. OpenAI's Codex-Spark on Cerebras = the proof point of
this split.
Plain English — Why This Split Happened
For years, NVIDIA GPUs did everything — training and inference. But as models got deployed to millions of users, companies realized they were wasting money: during decode (generating each word), 97% of the GPU's math power sits idle, waiting for data from memory. It's like hiring a team of 500 mathematicians but only giving them one page of problems at a time. Cerebras and Groq said: "What if we made the memory the chip?" Instead of a separate memory warehouse (HBM), they put massive SRAM directly on the compute die. The data is already there — no waiting. OpenAI proved this works in production by running Codex-Spark on Cerebras. NVIDIA's response: buy Groq's technology and integrate it into their next platform (Vera Rubin), so they offer both options.

Token Economics: Why Speed = Money

Why agentic coding cares about tok/s:

A Claude Code session fixing a bug:
  ~5 turns x ~400 tokens decoded per turn = 2,000 tokens

  On NVIDIA B200:     2000 tok / ~100 tok/s  = ~20 sec decode time
  On Cerebras WSE-3:  2000 tok / ~2500 tok/s = ~0.8 sec decode time
  On Groq 3 LPU:      2000 tok / ~300 tok/s  = ~6.7 sec decode time

For a developer doing 50 Claude Code sessions/day:

  B200:     50 x 20s  = 16.7 min/day waiting for decode
  Cerebras: 50 x 0.8s = 0.7 min/day waiting for decode
  Groq:     50 x 6.7s = 5.6 min/day waiting for decode

This is why OpenAI put their CODING model on Cerebras specifically.
For agentic AI, inference latency = developer productivity.
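The per-day figures above come from one formula — tokens / rate * sessions. A sketch using the same illustrative throughput numbers:

```python
# Daily decode-wait from the illustrative tok/s figures above.
tokens_per_session = 2000
sessions_per_day = 50
tok_per_s = {"B200": 100, "WSE-3": 2500, "Groq 3": 300}

wait_min = {name: tokens_per_session / rate * sessions_per_day / 60
            for name, rate in tok_per_s.items()}
# -> B200 ~16.7 min/day, WSE-3 ~0.7 min/day, Groq 3 ~5.6 min/day
```

The spread is linear in tok/s, so a 25x throughput gap becomes a 25x gap in time a developer spends watching tokens stream.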
The Convergence
The lines are blurring. NVIDIA's Vera Rubin platform now includes Groq LPUs alongside GPUs — one rack, two chip types, automatically routing workloads to the right hardware. Cerebras is adding MemoryX (off-chip weight storage) to handle models bigger than 44GB SRAM. And NVIDIA keeps growing on-chip SRAM (B200 has 126 MB L2, up from H100's 50 MB). Everyone is converging on the same insight: the memory wall is the bottleneck, and the fix is more SRAM closer to compute. The only question is how much SRAM and where to put it.
OpenAI + Cerebras
$10B+
750 MW, multi-year deal
Cerebras vs B200
2.5x
faster tok/s on Llama 4 Maverick
Groq SRAM BW
40 PB/s
5,000x more than B200 HBM
GroqCloud Devs
3.5M+
as of Feb 2026
Deep Dive — Optics

Co-Packaged Optics (CPO)

Replacing lossy electrical traces with silicon photonics integrated directly onto the switch package. Cuts power 65%, extends reach from meters to kilometers, and enables 100+ Tb/s switch bandwidth. 2026 is the volume deployment year.

Analogy
Currently, data inside a datacenter travels as electricity over copper wires, then gets converted to light for longer distances using a bulky pluggable module on the front panel. CPO moves the light converter directly onto the chip package — like moving the translator from the lobby into your office. No more signal loss walking down the hallway. The result: 65% less power, 100x more reach, and 2-3x more ports per switch.

Traditional Pluggable vs Co-Packaged Optics

TRADITIONAL (Pluggable Optics):

  Switch ASIC                            Pluggable Module
  +----------+    electrical traces      +-------------+    optical fiber
  |          | ------- ~15 cm ---------> | Transceiver | ===== fiber ====> remote
  | Digital  |    (lossy, hot)           | QSFP-DD     |    (long reach)
  | Logic    |    ~15-25W per            | E/O + O/E   |
  +----------+    800G port              +-------------+

  Problems:
  - Signal degrades over 15cm of board traces (SerDes power)
  - E/O conversion at package boundary is power-hungry
  - Front-panel space limits port density
  - Cannot scale beyond ~51.2 Tb/s per switch

--------------------------------------------------------------------------

CO-PACKAGED OPTICS:

  +------------------------------------------+    optical fiber
  | Switch Package                           | ===== fiber ====> remote
  | +----------+    +---------------------+  |    (100x reach)
  | | Switch   |--->| Optical Engine      |  |
  | | ASIC     |    | (Silicon Photonic   |  |
  | |          |    |  die ON package)    |  |
  | |          |    |                     |  |
  | | Broadcom |    | Laser + Modulator   |  |
  | | or NVIDIA|    | + Photodetector     |  |
  | +----------+    +---------------------+  |
  +------------------------------------------+

  Benefits:
  - No board traces for data (mm-scale, not cm)
  - E/O conversion on-package (minimal loss)
  - 65% power reduction vs pluggable
  - 3.5x more energy efficient per bit
  - 2-3x higher port density

Data Flow Through a CPO Engine

TRANSMIT PATH (Electrical to Optical):

  Switch Die                      Silicon Photonic Engine
  +----------+                    +----------------------------+
  |          |    electrical      | +---------+  +----------+  |
  | Digital  | --short path-----> | | Driver  |->|Modulator |  | --> Fiber out
  | Logic    |   (on-package,     | | Circuit |  |(Mach-    |  |     (light)
  |          |    mm-scale)       | |         |  | Zehnder) |  |
  +----------+                    | +---------+  +----------+  |
                                  +----------------------------+

RECEIVE PATH (Optical to Electrical):

  Switch Die                      Silicon Photonic Engine
  +----------+                    +----------------------------+
  |          |    electrical      | +---------+  +----------+  |
  | Digital  | <--short path----- | | TIA     |<-| Photo-   |  | <-- Fiber in
  | Logic    |   (on-package,     | | (Trans- |  | detector |  |     (light)
  |          |    mm-scale)       | |  imped.)|  |          |  |
  +----------+                    | +---------+  +----------+  |
                                  +----------------------------+

Mach-Zehnder modulator: encodes data by varying light phase
TIA: converts photocurrent to voltage (trans-impedance amplifier)

Key: all conversion happens ON the package, not on a separate module
Analogy — The Two Paths
Sending data with CPO is like a translator sitting right next to you instead of in a booth across the room. When you speak (electrical signal), the translator instantly converts to the other language (light). When someone speaks back (light arrives), the translator instantly converts back. Because the translator is inches away instead of across the room, there's no time wasted, no signal lost, and the translator uses much less energy.

CPO Key Players and Timeline

Company      Product                        Bandwidth    Timeline
----------   ----------------------------   ----------   ----------
NVIDIA       Quantum-X (InfiniBand + CPO)   1.6 Tb/s     2H 2025
NVIDIA       Spectrum-X (Ethernet + CPO)    3.2 Tb/s     2H 2026
Broadcom     TH6-Davisson (Ethernet)        102.4 Tb/s   2025-2026
Marvell      CPO for AI datacenters         TBD          2026+
Ayar Labs    In-package optical I/O         TBD          2025+

Market projection: >$20B by 2036 (37% CAGR), inflection point in 2026

Pluggable vs CPO: Head-to-Head

Metric                     Pluggable           CPO
------------------------   -----------------   -----------------
Switch bandwidth           51.2 Tb/s           100+ Tb/s
Power per 800G port        ~15-25W             ~5-9W
Reach (copper)             <10m                N/A
Reach (optical)            10-2000m            >1000m
Port density               Constrained         2-3x higher
Cooling challenge          Major               Reduced
Serviceability             Hot-swappable       Non-removable
Maturity                   Production          Early production
Analogy — Why CPO Matters for AI
AI training clusters have thousands of GPUs that need to communicate constantly. With pluggable optics, each switch wastes 15-25 watts per port just on signal conversion. At 100,000+ GPU scale, that's megawatts of wasted power. CPO cuts that by 65%. It's like upgrading from incandescent bulbs to LEDs — same brightness, fraction of the electricity. And because CPO supports 2-3x more ports, you can build flatter network topologies with fewer hops between any two GPUs.
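The "megawatts of wasted power" claim is easy to check. A sketch with an assumed port count — per-port wattage comes from the comparison table above; the 100,000-port fleet size is illustrative:

```python
# Fleet-power sketch for the 65% figure. Port count is an ASSUMED scale.
ports = 100_000                     # 800G ports across a large AI fabric
pluggable_w = 20.0                  # midpoint of the ~15-25 W range
cpo_w = pluggable_w * (1 - 0.65)    # 65% cut -> 7 W, inside the 5-9 W range

saved_mw = ports * (pluggable_w - cpo_w) / 1e6   # ~1.3 MW saved
```

Roughly 1.3 MW of continuous savings at that scale — before counting the cooling load that no longer has to remove that heat.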
Power Savings
65%
vs pluggable per port
TH6-Davisson
102.4 Tb/s
Broadcom CPO switch
Market Size
$20B+
by 2036, 37% CAGR
Energy Efficiency
3.5x
better per bit vs pluggable
The Serviceability Tradeoff
The one downside of CPO: pluggable transceivers can be hot-swapped in seconds if one fails. CPO optical engines are permanently attached to the switch package — if one fails, the entire switch must be replaced. This is why some hyperscalers are cautious: at 100K+ switch scale, transceiver failures are daily events. The industry is betting CPO reliability will offset the serviceability loss, but this remains an active engineering challenge.
Deep Dive — Chip Design

EDA Chip Design Pipeline

Cadence and Synopsys control ~70% of the EDA market. Their tools take a chip from idea to silicon through 7 stages. One bug at tapeout = $100M+ and 6 months lost. AI is now designing AI chips.

Analogy
Designing a chip is like building a city from scratch. You start with a master plan (architecture), draw detailed blueprints for every building (RTL), inspect every blueprint for errors (verification), figure out which materials to use (synthesis), decide where every building goes and how roads connect them (place & route), do final inspections (signoff), then hand it to the construction company (TSMC). The "city" has 208 billion buildings (transistors), 15+ layers of roads (metal layers), and if you find one structural flaw after construction starts, you tear everything down — $100M gone, 6 months lost.

The 7-Stage Pipeline

1. ARCHITECTURE          Define what the chip does
   |  Define ISA, SM count, tensor core specs, memory hierarchy
   |  Power budget, NVLink ports, die area targets
   |  Tools: SystemC/TLM modeling, Arm FastModels
   v
2. RTL DESIGN            Write the logic in code
   |  Verilog / SystemVerilog for every block:
   |  SMs, Tensor Cores, L2 controllers, NVLink PHYs
   |  H100: 80B transistors    B200: 208B transistors
   v
3. VERIFICATION          Prove it actually works
   |  RTL simulation (cycle-accurate)
   |  UVM (Universal Verification Methodology)
   |  Formal verification (mathematical proof)
   |  Emulation on Palladium / ZeBu hardware
   |  208B transistors = weeks of simulation
   |  One bug = $100M+ respin + 6 month delay
   v
4. LOGIC SYNTHESIS       Turn code into circuits
   |  RTL --> Gate-level netlist (AND, OR, FF gates)
   |  Optimizes for timing, area, power
   |  Targets TSMC 3nm/4nm standard cell libraries
   v
5. PLACE & ROUTE         Physical layout on silicon
   |  Floorplan: assign blocks to regions
   |  Place: position billions of standard cells
   |  Clock tree: distribute clock to all flip-flops
   |  Route: connect with 15+ metal layers
   v
6. SIGNOFF               Final checks before tapeout
   |  Timing: verify all paths meet frequency target
   |  Power: verify grid, IR drop, electromigration
   |  Physical: DRC (fab rules) + LVS (layout = schematic)
   v
7. FABRICATION + PACKAGING   Build the physical chip
      GDSII file --> TSMC 3nm/4nm fab
      Wafer processing: ~3-4 months
      CoWoS packaging: mount GPU + HBM on interposer
      Total: 6-12 months from tapeout to volume
Analogy — Each Stage
Stage 1: The mayor decides the city needs 576 factories and 18 highways. Stage 2: Architects draw detailed blueprints for every building. Stage 3: Inspectors check every blueprint — does the plumbing work? Do the elevators reach every floor? Stage 4: Engineers choose specific bricks and wiring from the catalog. Stage 5: Urban planners decide where each building goes and route all roads. Stage 6: Final inspection — fire code, structural integrity, electrical safety. Stage 7: TSMC builds it. If inspectors missed a flaw? Demolish and rebuild. $100M+ gone.

Cadence vs Synopsys: Tool-for-Tool

Cadence Flow:                      Synopsys Flow:

RTL (Verilog/VHDL)                 RTL (Verilog/VHDL)
   |                                  |
   v                                  v
[Genus] Logic Synthesis            [Design Compiler]
   | netlist                          | netlist
   v                                  v
[Innovus] Place & Route            [IC Compiler II]
   | layout                           | layout
   v                                  v
[Xcelium] Verification             [VCS] Verification
   | clean DRC/LVS                    | clean DRC/LVS
   v                                  v
[Tempus] Timing Signoff            [PrimeTime] Timing Signoff
   |                                  |
   v                                  v
[Pegasus] Physical Signoff         [IC Validator] Phys Signoff
   |                                  |
   v                                  v
GDSII Tapeout                      GDSII Tapeout

Place & Route: What It Actually Looks Like

Floorplan of a GPU die (simplified):

+================================================================+
|                       GPU Die (~814 mm2)                       |
|                                                                |
|  +------+ +------+ +------+ +------+ +------+ +------+         |
|  | SM 0 | | SM 1 | | SM 2 | | SM 3 | | SM 4 | | SM 5 |         |
|  +------+ +------+ +------+ +------+ +------+ +------+         |
|  +------+ +------+ +------+ +------+ +------+ +------+         |
|  | SM 6 | | SM 7 | | ...  | | ...  | | ...  | |SM 113|         |
|  +------+ +------+ +------+ +------+ +------+ +------+         |
|                                                                |
|  +========================+  +========================+        |
|  |    L2 Cache (50 MB)    |  |   Memory Controllers   |        |
|  +========================+  +========================+        |
|                                                                |
|  +--------+ +--------+ +--------+ +--------+ +--------+        |
|  |NVLink  | |NVLink  | |NVLink  | | PCIe   | | PCIe   |        |
|  |PHY 0-5 | |PHY 6-11| |PHY12-17| | Gen5   | | Gen5   |        |
|  +--------+ +--------+ +--------+ +--------+ +--------+        |
+================================================================+

The P&R tool positions every cell and routes 15+ metal layers of
interconnect between them. For 80B transistors, this is an NP-hard
optimization problem -- which is why AI is being used.
Analogy — Place & Route
Place & route is like solving the world's hardest jigsaw puzzle while also designing the road system. You have 80 billion pieces (transistors) that need exact positions on a surface the size of a thumbnail. Then you connect them all with 15 layers of microscopic wiring — imagine 15 transparent sheets stacked on top of each other, each with its own road network. The wiring must ensure every signal arrives on time (timing closure), no two wires short-circuit (DRC), and the power grid can deliver electricity to every corner (IR drop).
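To make the optimization framing concrete, here is a toy placement problem: four cells, four grid slots, minimize total Manhattan wirelength. Everything in it (cells, nets, grid) is invented for illustration — real tools like Innovus and IC Compiler II face this at billion-cell scale and cannot enumerate anything:

```python
# Toy placement: exhaustively find the cell-to-slot assignment that
# minimizes total Manhattan wirelength over a small netlist.
import itertools

nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]   # 2-pin nets
slots = [(0, 0), (0, 1), (1, 0), (1, 1)]                  # 2x2 grid

def wirelength(placement):
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in nets)

best = min(
    (dict(zip(perm, slots)) for perm in itertools.permutations("abcd")),
    key=wirelength,
)
# Best layout places the a-b-c-d cycle around the grid: total length 4.
```

Exhaustive search works for 4 cells (24 permutations); at 80 billion transistors the search space is astronomically large, which is why production tools rely on heuristics — and, increasingly, RL agents like DSO.ai.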

AI Designing AI Chips

The Meta-Loop: AI chips are now designed by AI

Synopsys DSO.ai (Design Space Optimization)
+----------------------------------------------------------+
| Reinforcement Learning agent explores floorplan space    |
|                                                          |
| State:  current layout metrics (timing, area, power)     |
| Action: adjust block placement, routing priorities       |
| Reward: timing slack * area efficiency * power budget    |
|                                                          |
| Explores 1000s of options humans would never try         |
| Result: 5-10x faster design closure                      |
+----------------------------------------------------------+

Cadence ChipStack Super Agent (Agentic AI, Feb 2026)
+----------------------------------------------------------+
| Multi-agent system for 3D-IC verification                |
|                                                          |
| Agent 1: identifies critical timing paths                |
| Agent 2: checks power integrity across chiplets          |
| Agent 3: validates thermal profiles                      |
| Agent 4: cross-references DRC with layout intent         |
|                                                          |
| Claims 10x speedup for 3D-IC design verification         |
+----------------------------------------------------------+

NVIDIA GPU-Accelerated EDA
+----------------------------------------------------------+
| NVIDIA invested $2B in Synopsys                          |
|                                                          |
| PrimeSim on GH200: 15x SPICE simulation speedup          |
| cuLitho: GPU-accelerated computational lithography       |
| Result: mask generation 40x faster on H100 vs CPU        |
+----------------------------------------------------------+

"Chips designed for AI, by AI" -- the recursive loop
Analogy — AI Designing AI Chips
This is like an architect using a building they designed to design better buildings. Synopsys DSO.ai is an RL agent (trained on GPUs) that explores thousands of chip layout options humans would never try. Cadence's ChipStack deploys a team of AI agents that each specialize in a different aspect of chip verification. And NVIDIA invested $2B in Synopsys to accelerate the simulation tools themselves on GPUs. The meta-irony: the GPU designs the next GPU.

EDA Market Dynamics

Metric                     Value
------------------------   ----------------------------------------
Market duopoly             Synopsys + Cadence = ~70% of EDA market
Cadence Q1 2025 rev        $1.24B (+23% YoY)
Total addressable market   $34.71B by 2035 (15-20% CAGR)
NVIDIA EDA investment      $2B into Synopsys (GPU-accelerated EDA)

Key trend: AI workloads drive chip complexity, which drives EDA spend,
which funds AI-powered EDA tools, which design better AI chips.
A self-reinforcing flywheel.
EDA Duopoly
~70%
Synopsys + Cadence market share
Cadence Revenue
$1.24B
Q1 2025, +23% YoY
NVIDIA in Synopsys
$2B
GPU-accelerated simulation
Respin Cost
$100M+
+ 6 months for one tapeout bug
The $100M Bug
At 208 billion transistors, functional verification is the longest and most expensive phase of chip design. A single missed bug that makes it to tapeout means: scrapping the mask set ($5-10M), re-running fabrication (~4 months at TSMC), re-doing packaging and testing (~2 months), and lost revenue from delayed product. Total cost: $100M+ and 6+ months. This is why verification consumes ~60% of the total chip design effort, and why Cadence and Synopsys are pouring resources into AI-assisted verification.