GPU Architecture Reference — A Deep Dive for LLM Inference

Author: Karim El Mernissi | Date: April 2026

This guide provides a concise, pedagogical overview of GPU architecture, specifically tailored to explain the performance characteristics of Large Language Model (LLM) inference.


1. The Core Bottleneck: Compute vs. Memory

To understand LLM inference, one must grasp the fundamental difference between Compute-Bound and Memory-Bandwidth-Bound workloads.

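For LLM inference, the two phases fall on opposite sides of this divide: prefill (processing the prompt) is largely compute-bound, while decode (generating tokens one at a time) is memory-bandwidth-bound.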


2. Core GPU Components

2.1 CUDA Cores & Tensor Cores

2.2 VRAM (Video RAM)

VRAM stores model weights, the KV Cache, and runtime activations. If these do not fit in VRAM, part of the model must be offloaded to system RAM, and performance falls off a cliff because every offloaded weight then has to cross the PCIe bus, which is far slower than VRAM.
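
A rough way to check whether a model fits is to add up the three components named above. The sketch below is a minimal estimate, not a precise accounting: the function names, the flat 10% overhead allowance, and the example model shape (a hypothetical 8B-parameter model with 32 layers, 8 KV heads, and head dimension 128) are assumptions for illustration. The KV cache term uses the usual per-token count of 2 (keys and values) × layers × KV heads × head dimension × bytes per element.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB: keys and values (factor 2) for every layer and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * batch_size / 1e9


def fits_in_vram(vram_gb: float, n_params: float, bytes_per_param: float,
                 kv_gb: float, overhead_frac: float = 0.10):
    """Return (fits, required_gb).

    overhead_frac is an assumed allowance for activations and runtime
    buffers, not a measured value.
    """
    required = (weights_gb(n_params, bytes_per_param) + kv_gb) * (1 + overhead_frac)
    return required <= vram_gb, required


# Hypothetical 8B-parameter model in FP16 with an 8,192-token context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
ok, need = fits_in_vram(vram_gb=24, n_params=8e9, bytes_per_param=2.0, kv_gb=kv)
print(f"KV cache: {kv:.2f} GB, required: {need:.1f} GB, fits in 24 GB: {ok}")
```

Longer contexts or larger batches grow the KV cache term linearly, so a model that fits at batch size 1 may not fit when serving many requests at once.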

2.3 Memory Bus & Bandwidth

Bandwidth defines how fast data flows from VRAM to the compute cores. It is the single most critical specification governing LLM token generation speed.

$$ \text{Bandwidth (GB/s)} = \frac{\text{Data Rate (MT/s)} \times \text{Bus Width (bits)}}{8 \times 1000} $$

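Where Data Rate is the effective per-pin transfer rate in megatransfers per second, Bus Width is the total width of the memory interface in bits, the factor 8 converts bits to bytes, and the factor 1000 converts MB/s to GB/s.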
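
The bandwidth formula can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the two sets of input figures are assumed example values chosen to resemble a GDDR6X-class card and an HBM-class accelerator, not quoted specifications.

```python
def memory_bandwidth_gbs(data_rate_mts: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s from per-pin data rate (MT/s) and bus width (bits)."""
    megabits_per_second = data_rate_mts * bus_width_bits
    return megabits_per_second / (8 * 1000)  # /8 -> MB/s, /1000 -> GB/s


# Assumed example: 21,000 MT/s memory on a 384-bit bus (GDDR6X-class).
gddr = memory_bandwidth_gbs(21_000, 384)
# Assumed example: 5,200 MT/s per pin on a 5,120-bit interface (HBM-class).
hbm = memory_bandwidth_gbs(5_200, 5_120)
print(f"GDDR example: {gddr:.0f} GB/s, HBM example: {hbm:.0f} GB/s")
```

Note that the HBM-style example reaches higher bandwidth with a much lower per-pin data rate; the very wide bus does the work.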


3. GPU Memory Technologies

| Technology | Characteristics | Example GPU | Max Bandwidth |
|---|---|---|---|
| GDDR6 / GDDR6X | Standard VRAM. Narrower bus (256-384 bit). High speed but limited capacity per GPU. | RTX 4090 | ~1,000 GB/s |
| HBM (High Bandwidth Memory) | Premium data center tech. 3D-stacked dies on a silicon interposer. Ultra-wide bus (5000+ bits). | H100, MI300X | 3,300 - 5,300 GB/s |

Insight: HBM is what allows data center GPUs to generate tokens roughly 3-5x faster than consumer GPUs at small batch sizes, in line with the bandwidth ratio in the table above, despite having similar or sometimes lower raw TFLOPS.


4. The Roofline Model for LLMs

The Roofline Model predicts performance limits based on Arithmetic Intensity (AI):

$$ AI_{\text{ridge}} = \frac{\text{TFLOPS}}{\text{Memory Bandwidth (TB/s)}} $$

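Where TFLOPS is the GPU's peak compute throughput at the precision in use and Memory Bandwidth is its peak VRAM bandwidth; the ratio has units of FLOP/byte. A workload whose arithmetic intensity (FLOPs performed per byte read from memory) falls below the ridge point is bandwidth-bound; above it, the workload is compute-bound.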
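
As a small illustration of the ridge-point formula, the sketch below computes the ridge for a hypothetical accelerator and classifies a workload by its arithmetic intensity. The function names and the figures (1,000 TFLOPS, 3.35 TB/s) are assumptions chosen for round numbers, not the specification of any particular GPU.

```python
def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the compute and memory limits meet."""
    return peak_tflops / bandwidth_tbs  # (TFLOP/s) / (TB/s) = FLOP/byte


def limiting_factor(ai: float, peak_tflops: float, bandwidth_tbs: float) -> str:
    """Name the resource that caps a workload with arithmetic intensity `ai`."""
    if ai >= ridge_point(peak_tflops, bandwidth_tbs):
        return "compute"
    return "memory bandwidth"


# Hypothetical accelerator: 1,000 TFLOPS peak, 3.35 TB/s memory bandwidth.
ridge = ridge_point(1000.0, 3.35)
print(f"ridge point: {ridge:.0f} FLOP/byte")
print("AI = 1   ->", limiting_factor(1.0, 1000.0, 3.35))
print("AI = 500 ->", limiting_factor(500.0, 1000.0, 3.35))
```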

Why LLM Decode is Bandwidth-Bound:

During token generation (at batch size 1), every weight parameter is read exactly once from memory to perform one multiply-accumulate operation (which counts as 2 FLOPs).

$$ AI_{\text{decode}} = \frac{2 \times P_{\text{active}}}{P_{\text{active}} \times b_{\text{param}}} = \frac{2}{b_{\text{param}}} $$

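Where $P_{\text{active}}$ is the number of parameters used per generated token and $b_{\text{param}}$ is the number of bytes stored per parameter. For FP16 weights ($b_{\text{param}} = 2$), this gives $AI_{\text{decode}} = 1 \text{ FLOP/byte}$.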

Because a typical modern GPU’s ridge point is $>100 \text{ FLOP/byte}$, LLM decode operates far below the compute ceiling. The compute units spend the majority of their time starved for data.
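
To put a number on how starved the compute units are, the following sketch evaluates the decode arithmetic intensity and the resulting roofline ceiling, taken as the lower of the compute peak and bandwidth × intensity. The peak and bandwidth constants describe a hypothetical accelerator and are assumptions, as are the function names.

```python
def decode_intensity(bytes_per_param: float) -> float:
    """FLOP per byte for batch-1 decode: 2 FLOPs per parameter, each read once."""
    return 2.0 / bytes_per_param


def attainable_tflops(ai: float, peak_tflops: float, bandwidth_tbs: float) -> float:
    """Roofline ceiling: the lower of the compute peak and bandwidth x intensity."""
    return min(peak_tflops, ai * bandwidth_tbs)


PEAK_TFLOPS = 1000.0   # assumed peak compute of a hypothetical accelerator
BANDWIDTH_TBS = 3.35   # assumed memory bandwidth in TB/s

ai = decode_intensity(2.0)  # FP16 weights: 2 bytes per parameter
reached = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_TBS)
utilization = reached / PEAK_TFLOPS
print(f"AI = {ai:.1f} FLOP/byte, attainable = {reached:.2f} TFLOPS "
      f"({utilization:.2%} of peak)")
```

Under these assumed figures, batch-1 FP16 decode can use well under 1% of the peak compute, which is why adding TFLOPS does little for token generation speed.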


5. Impact of Quantization

Since decode is bottlenecked by memory bandwidth, reducing the physical byte-size of the model weights directly and proportionally increases token generation speed.

$$ \text{Theoretical Speedup} \approx \frac{b_{\text{FP16}}}{b_{\text{quantized}}} $$

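Where $b_{\text{FP16}} = 2$ bytes is the baseline size of one parameter and $b_{\text{quantized}}$ is the size of one parameter in the quantized format.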

| Precision | Bytes ($b$) | Relative Decode Speed | Quality Impact |
|---|---|---|---|
| FP16 | 2.0 | $1.0\times$ (Baseline) | None |
| INT8 | 1.0 | $\approx 2.0\times$ | Minimal |
| Q4 (INT4) | 0.5 | $\approx 4.0\times$ | Moderate |
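
The speedup column follows from a simple bound: if every active weight is read once per token, the decode rate cannot exceed bandwidth divided by the bytes read per token. The sketch below applies that bound to the byte sizes in the table. It is an idealized upper limit that ignores KV cache reads and dequantization cost, and the model size (8B parameters) and bandwidth (1,000 GB/s) are hypothetical inputs.

```python
# Bytes per parameter, taken from the precision table.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "Q4": 0.5}


def max_decode_tokens_per_s(bandwidth_gbs: float, active_params: float,
                            bytes_per_param: float) -> float:
    """Upper bound on tokens/sec when each active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token


def theoretical_speedup(precision: str) -> float:
    """Ideal decode speedup of `precision` relative to FP16."""
    return BYTES_PER_PARAM["FP16"] / BYTES_PER_PARAM[precision]


# Hypothetical 8B-parameter dense model on a GPU with 1,000 GB/s of bandwidth.
for name, b in BYTES_PER_PARAM.items():
    tps = max_decode_tokens_per_s(1000.0, 8e9, b)
    print(f"{name}: <= {tps:.1f} tokens/s ({theoretical_speedup(name):.1f}x FP16)")
```

Measured throughput will fall below these ceilings; the value of the estimate is that it shows which direction each hardware or precision change pushes the limit.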

6. Multi-GPU Interconnects

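When a model is too large for a single GPU, Tensor Parallelism splits the matrix multiplications across multiple cards. Each card then holds only a partial result, and these partial results must be combined across GPUs inside every layer, which requires very high inter-GPU bandwidth.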
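
The need for synchronization can be seen in a toy example. The sketch below simulates one way of splitting a matrix-vector product (a row-parallel split along the input dimension) in plain Python, with lists standing in for devices. It is an illustration of the idea under that assumption, not how any specific inference framework implements it, and all names are hypothetical.

```python
def matvec(x, w):
    """Compute y = x @ w for a vector x (length k) and a matrix w (k rows, n columns)."""
    n_cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_cols)]


def tensor_parallel_matvec(x, w, n_devices):
    """Row-parallel split: each simulated device owns a slice of the input
    dimension, computes a partial output, and the partials are summed."""
    k = len(x)
    assert k % n_devices == 0, "this sketch needs an evenly divisible input dimension"
    shard = k // n_devices
    partials = []
    for d in range(n_devices):
        lo, hi = d * shard, (d + 1) * shard
        partials.append(matvec(x[lo:hi], w[lo:hi]))  # local work on device d
    # All-reduce step: the full output is the element-wise sum of all partials,
    # so every device must exchange data before the next layer can start.
    return [sum(values) for values in zip(*partials)]


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
full = matvec(x, w)
split = tensor_parallel_matvec(x, w, n_devices=2)
print("single device:", full)
print("two devices:  ", split)
```

The final summation is the part that crosses the interconnect. It happens for every layer and every generated token, so the speed of the link between GPUs adds directly to per-token latency.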


7. Hardware Selection Priority

When configuring hardware for LLM inference, follow this strict hierarchy of constraints:

  1. VRAM Capacity: Does the model (weights + KV cache + overhead) mathematically fit in memory?
  2. Memory Bandwidth: Defines your maximum token generation speed (tokens/sec).
  3. Interconnects: A high-bandwidth link such as NVLink is effectively required when splitting a model across multiple GPUs for latency-sensitive applications; PCIe alone works but adds synchronization delay to every token.
  4. TFLOPS: Primarily affects time-to-first-token (Prefill) and intensive training workloads, but rarely improves decoding.