Local LLM Resource Estimator — Technical Documentation & Formula Reference
Author: Karim EL MERNISSI
Date: April 2026
1. Introduction
The Local LLM Resource Estimator is a self-contained, browser-based tool designed to estimate the memory requirements, inference performance, and operational costs of deploying Large Language Models (LLMs) on GPU hardware. It integrates directly with the HuggingFace model hub to retrieve real model metadata, including parameter counts, tensor types, quantization configurations, and available quantized variants.
This document provides the complete mathematical foundation, algorithmic descriptions, and pedagogical references for every calculation performed by the tool. Each formula is derived from first principles and annotated with its assumptions, limitations, and practical implications.
1.1 Design Philosophy
The calculator follows three core principles: rigor (every estimate is grounded in a well-defined mathematical model), transparency (all formulas are displayed to the user with live substitution of their chosen parameters), and pedagogy (every input and output is accompanied by an on-demand explanation accessible via the “?” buttons throughout the interface).
1.2 Scope
The calculator addresses two primary deployment scenarios:
- Inference: Estimating VRAM requirements for model serving, including weight storage, KV cache, framework overhead, and performance metrics such as tokens-per-second and time-to-first-token.
- Fine-tuning: Estimating additional memory for gradient storage, optimizer states (AdamW), and activation memory for both LoRA and full fine-tuning regimes.
2. VRAM Estimation Model
Total GPU memory consumption is modeled as the sum of three primary components:
$$ V_{total} = V_{weights} + V_{kv} + V_{overhead} \quad \text{(Equation 1)} $$
where V_weights is the memory for model parameters, V_kv is the KV cache memory, and V_overhead accounts for framework and runtime overhead.
2.1 Model Weight Memory
The dominant memory component is the storage of model weights. Its size is determined by the total parameter count and the precision (bytes per parameter):
$$ V_{weights} = P_{total} \times b_{param} \quad \text{(Equation 2)} $$
Definition — Total Parameters:P_total is the total number of learnable parameters in the model, including all experts for Mixture-of-Experts (MoE) architectures. For dense models, P_total = P_active. For MoE models, P_total ≫ P_active since only a subset of experts is activated per token.
Definition — Bytes per Parameter:b_param denotes the number of bytes used to store each weight parameter. This depends on the chosen precision or quantization level. The table below lists common values.
| Precision | Bits | B/param | Notes |
|---|---|---|---|
| FP32 | 32 | 4.00 | Full precision, rarely used for inference |
| BF16 / FP16 | 16 | 2.00 | Standard training & inference precision |
| FP8 / INT8 | 8 | 1.00 | 8-bit quantization |
| Q8_0 | 8 | 1.06 | GGUF 8-bit with overhead |
| Q6_K | 6 | 0.83 | 6-bit K-quant |
| Q5_K_M | 5 | 0.71 | 5-bit medium K-quant |
| Q4_K_M | 4 | 0.60 | 4-bit medium K-quant |
| Q4 / NF4 | 4 | 0.50 | 4-bit quantization, QLoRA default |
| Q3_K_M | 3 | 0.43 | 3-bit medium K-quant |
| Q2_K | 2 | 0.32 | 2-bit K-quant (aggressive) |
2.2 KV Cache Memory
During autoregressive generation, the model caches past Key and Value states from attention layers to avoid recomputation. The KV cache size is:
$$ V_{kv} = 2 \times L \times n_{kv} \times d_{h} \times C \times B \times U \times b_{kv} \quad \text{(Equation 3)} $$
where the factor of 2 accounts for separate Key and Value tensors.
KV Cache Parameters:
- L: Number of transformer layers (blocks)
- n_kv: Number of key-value heads (depends on attention architecture)
- d_h: Head dimension, typically
d_h = h / n_headsor explicitly defined in config - C: Context length (maximum sequence length in tokens)
- B: Batch size (sequences processed simultaneously)
- U: Number of concurrent users (each requires separate KV cache)
- b_kv: Bytes per cached element (2 for FP16, 1 for FP8/INT8, 0.5 for Q4)
Attention Architecture Variants
The number of KV heads n_kv varies significantly across attention architectures, directly impacting KV cache memory:
| Architecture | KV Heads | Description |
|---|---|---|
| Multi-Head (MHA) | n_kv = n_q | Each query head has its own KV pair. Largest KV cache. |
| Grouped-Query (GQA) | 1 < n_kv < n_q | Query heads share KV heads in groups. Good balance. |
| Multi-Query (MQA) | n_kv = 1 | All query heads share a single KV head. Smallest cache. |
For example, Llama 3.1 70B uses GQA with n_q = 64 query heads but only n_kv = 8 KV heads, reducing KV cache memory by 8× compared to MHA.
Fallback Estimation
When architecture details (L, n_kv, d_h) are unavailable, the calculator falls back to a rule-of-thumb estimate:
$$ V_{kv} \approx P_{total} \times 0.12 \times (C / 4096) \times B \times U \quad \text{(Equation 4)} $$
This assumes KV cache is approximately 12% of weight memory at 4096-token context for a typical GQA model.
2.3 Framework Overhead
Framework overhead includes CUDA kernels, temporary buffers, activation memory, and communication buffers. It is estimated as a fraction of weight memory:
$$ V_{overhead} = V_{weights} \times f_{overhead} \quad \text{(Equation 5)} $$
where f_overhead varies by operation mode:
| Mode | f_overhead |
|---|---|
| Inference | 0.12 (12%) |
| LoRA / QLoRA fine-tuning | 0.40 (40%) |
| Full fine-tuning | 2.00 (200%) |
For full fine-tuning, the overhead factor of 2.0 accounts for gradients (1× weights in training precision) and AdamW optimizer states (8 bytes per parameter in FP32), plus activation memory.
3. RAM Overflow / Offloading
⚠️ Performance Warning
RAM offloading allows models that exceed VRAM to run by spilling layers to system RAM, but at a severe performance cost: 10–50× slower inference for the offloaded portion.
When the total required memory exceeds available VRAM, the calculator supports an optional RAM offloading mode:
$$ V_{overflow} = \max(0, V_{total} - V_{VRAM}) \quad \text{(Equation 6)} $$
$$ V_{RAM,usable} = \min(V_{overflow}, V_{RAM,available}) \quad \text{(Equation 7)} $$
The effective available memory becomes:
$$ V_{effective} = V_{VRAM} + V_{RAM,usable} \quad \text{(Equation 8)} $$
Memory Allocation Parameters:
V_overflow: The amount of required memory that exceeds the physical GPU VRAM capacity.V_VRAM: The total physical Video RAM available on the GPU.V_RAM,usable: The actual amount of system RAM that will be used for offloading.V_RAM,available: The total physical System RAM available on the host machine.V_effective: The total combined memory (VRAM + usable RAM) available for inference.
3.1 The Bus Wall (“Le Mur du Bus”)
The fundamental performance limitation of RAM offloading is captured by the Bus Wall concept — the ratio of GPU HBM bandwidth to the effective transfer bandwidth between CPU and GPU:
$$ Bus Wall Ratio = BW_{HBM} / BW_{transfer} \quad \text{(Equation 9)} $$
where BW_transfer is the effective bandwidth of the data path from system RAM to GPU, determined by the bottleneck in the transfer chain:
$$ BW_{transfer} = \min(BW_{PCIe,effective}, BW_{RAM,effective}) \quad \text{(Equation 10)} $$
The Bus Wall ratio tells you how many times slower RAM-offloaded layers are compared to VRAM-resident layers. Typical values range from 30× (RTX 4090 with Gen4 x16) to over 100× (H100 SXM with Gen5 x16).
3.2 Improved RAM Offload Performance Model
The performance degradation from RAM offloading is modeled by decomposing the decode time into two independent data paths:
$$ T_{decode} = W_{VRAM} / BW_{HBM} + W_{RAM} / BW_{transfer} \quad \text{(Equation 11)} $$
where W_VRAM is the absolute amount of weights stored in VRAM (in GB), and W_RAM is the absolute amount of weights offloaded to system RAM (in GB). This two-path model is more accurate than a simple degradation factor because it correctly accounts for the fact that VRAM-resident layers still run at full HBM speed, while only the offloaded portion is slowed by the Bus Wall.
The fraction of weights in RAM is:
$$ f_{RAM} = V_{RAM,usable} / V_{total} \quad \text{(Equation 12)} $$
Substituting, the effective decode speed becomes:
$$ TPS_{offload} = 1 / ((1-f_{RAM}) \times W / BW_{HBM} + f_{RAM} \times W / BW_{transfer}) \quad \text{(Equation 13)} $$
where W (W = P_active × b_param / 10^9) is the total active weight memory in GB.
3.3 VRAM Priority Allocation & Offloading Regimes
The previous offloading model treated all RAM offloading uniformly, applying the Bus Wall penalty to the entire decode process. However, the two primary components that can be offloaded — model weights and KV cache — have fundamentally different performance implications when offloaded:
- Weight offloading: Causes a Bus Wall penalty on every decode token, because each token generation requires reading all weights from memory. This is catastrophic for throughput.
- KV Cache offloading: Only impacts TTFT (one-time swap-in latency per context switch), because the KV cache is read once at the start of generation and then remains resident during decoding. Decode speed is unaffected.
This distinction leads to a priority-based VRAM allocation model and four distinct performance regimes.
N1: VRAM Priority Allocation
When VRAM is scarce, it must be allocated with careful prioritization. Model weights receive VRAM priority over KV cache because weight offloading causes a per-token Bus Wall penalty, while KV offloading only causes a one-time swap cost:
$$ V_{kv,VRAM}^{max} = \max(0, V_{VRAM} - V_{weights} - V_{overhead}) \quad \text{(Equation 13a)} $$
The actual KV cache stored in VRAM is then:
$$ V_{kv,VRAM} = \min(V_{kv}, V_{kv,VRAM}^{max}) \quad \text{(Equation 13b)} $$
And the KV cache that must be stored in RAM:
$$ V_{kv,RAM} = \max(0, V_{kv} - V_{kv,VRAM}^{max}) \quad \text{(Equation 13c)} $$
If weights alone exceed VRAM, then all VRAM is used for (partial) weights and all KV must go to RAM:
$$ V_{weights,RAM} = \max(0, V_{weights} - V_{VRAM} + V_{overhead}) \quad \text{(Equation 13d)} $$
KV Cache & Weight Allocation Parameters:
V_{kv,VRAM}^{max}: The theoretical maximum amount of KV cache that can fit into VRAM after weights and overhead are accounted for.V_{kv,VRAM}: The actual amount of KV cache stored in VRAM.V_{kv,RAM}: The remaining KV cache that must be offloaded to system RAM.V_{weights,RAM}: The amount of model weights offloaded to system RAM (this only occurs if weights alone exceed VRAM capacity).
N2: Regime Classification
Based on the allocation, the deployment falls into one of four performance regimes:
| Regime | Condition | Description |
|---|---|---|
| A | V_total ≤ V_VRAM | All in VRAM. Full HBM performance. |
| B | V_weights ≤ V_VRAM, V_kv > V_kv_VRAM_max | Weights in VRAM, KV in RAM. Full decode speed, TTFT + swap. |
| C | V_weights > V_VRAM, V_kv ≤ V_kv_VRAM_max | Weights in RAM, KV in VRAM. Bus Wall on every token. (Rare) |
| D | V_weights > V_VRAM, V_kv > V_kv_VRAM_max | Both in RAM. Bus Wall + KV swap. Worst case. |
Key Insight: Why Regime B is dramatically better than C/D In Regime B, decode speed equals Regime A (full HBM speed) because weights are still read from VRAM. The KV cache swap penalty only affects TTFT, not per-token throughput. In contrast, Regimes C and D impose the Bus Wall on every single decode token, making decode 30–100× slower. This is why the VRAM priority allocation (weights first) is so important.
N3: Decode Speed in Regime B
In Regime B, all weights reside in VRAM, so decode proceeds at full HBM speed:
$$ TPS_B = TPS_A = BW_{HBM} / (P_{active} \times b_{param}) \quad \text{(Equation 14)} $$
This is the critical result: KV cache offloading does not slow down decode. Only the initial context loading (TTFT) is impacted.
N4: KV Cache Swap-In Latency
When a user’s KV cache is stored in RAM, it must be swapped into VRAM before generation can begin. This is a one-time cost per context switch:
$$ T_{kv,swap} = V_{kv,RAM} / BW_{transfer} \quad \text{(Equation 15)} $$
where BW_transfer = min(BW_PCIe_effective, BW_RAM_effective) is the same transfer bandwidth used in the Bus Wall calculation.
N5: TTFT in Regime B
The time to first token in Regime B adds the KV swap penalty to the standard prefill time:
$$ TTFT_B = TTFT_A + T_{kv,swap} \quad \text{(Equation 16)} $$
This means the first token for a user with offloaded KV cache is delayed by the swap time, but all subsequent tokens generate at full decode speed.
N6: TTFT in Regime C
In Regime C, weight offloading means that the prompt must wait for weights to be loaded from RAM before each layer can compute. The effective TTFT is:
$$ TTFT_C = \max(T_{compute}, T_{weight,load}) \quad \text{(Equation 17)} $$
Time Parameters:
T_{compute}: The theoretical time to compute the forward pass if all data were instantly available in HBM.T_{weight,load}(= V_{weights,RAM} / BW_{transfer}): The time required to load the offloaded weights from system RAM over the PCIe/RAM transfer bus.
In practice, T_{weight,load} \gg T_{compute} because the Bus Wall ratio is typically 30–100×.
N7: TTFT in Regime D
Regime D combines the worst of both worlds:
$$ TTFT_D = \max(T_{compute}, T_{weight,load}) + T_{kv,swap} \quad \text{(Equation 18)} $$
Both the weight loading penalty and the KV swap penalty apply. This regime should be avoided whenever possible.
N8: Concurrency Limits with KV Offloading
When KV cache is partially stored in RAM, the number of simultaneously served users splits into two groups. Active users have their KV cache entirely in VRAM and experience no swap penalty; swapped users have their KV in RAM and pay the swap cost on context switches:
$$ U_{active} = \lfloor V_{kv,VRAM}^{max} / V_{kv,per user} \rfloor \quad \text{(Equation 19)} $$
$$ U_{swapped} = \lfloor V_{RAM,for KV} / V_{kv,per user} \rfloor \quad \text{(Equation 20)} $$
$$ U_{total} = U_{active} + U_{swapped} \quad \text{(Equation 21)} $$
Concurrency Parameters:
U_{active}: The number of concurrent users whose entire KV cache fits within VRAM (no swap penalty).U_{swapped}: The number of concurrent users whose KV cache is stored in RAM (incurs a swap penalty on context switch).V_{kv,per user}(= 2 × L × n_{kv} × d_h × C × b_{kv} / 10^9): The KV cache size per individual user in GB.
N9: Effective Throughput with KV Swapping
Total throughput with KV swapping accounts for the time spent swapping vs. generating:
$$ TPS_{eff} = U_{active} \times TPS + U_{swapped} \times TPS \times \eta_{swap} \quad \text{(Equation 22)} $$
where the swap efficiency η_swap depends on the ratio of swap time to generation time per context:
$$ \eta_{swap} = 1 / (1 + T_{kv,swap} / T_{gen}) \quad \text{(Equation 23)} $$
where T_{gen} is the total time spent generating tokens for a single user’s response before switching context, and η_{swap} represents the resulting swap efficiency penalty.
For long conversations (T_{gen} \gg T_{kv,swap}), η_{swap} \approx 1 and the swap overhead is negligible. For short exchanges, swap overhead is more significant.
N10: Quantization vs. Offload Decision Threshold
When weights do not fit in VRAM, there is a strategic choice: (1) quantize weights more aggressively to fit in VRAM, or (2) keep higher precision but offload to RAM. The Bus Wall makes option 2 almost always inferior:
$$ b_{fit} = (V_{VRAM} - V_{overhead}) / P_{total} \quad \text{(Equation 24)} $$
where b_fit is the maximum bytes per parameter that allows all weights to fit in VRAM. The decision rule is:
Quantize if f_RAM × Bus Wall Ratio > 2 (Equation 25)
Since typical Bus Wall ratios range from 30–100×, even a small fraction of weights in RAM creates a severe performance penalty. For example, a 70B model at Q4 (35 GB) fits in a single 80 GB A100 with room for KV cache. The same model at FP16 (140 GB) requires offloading 60 GB to RAM, resulting in decode that is ~50× slower — far worse than any quality loss from Q4 quantization.
Best Practices for Offloading
- Quantize weights first: Reduce
b_paramuntilV_weights ≤ V_VRAM. This avoids Regimes C/D entirely.- Quantize KV cache second: If KV cache still overflows after weight quantization, reduce
b_kvto FP8 or INT8. This halves or quarters the KV memory.- Use KV offloading for concurrency: Regime B (KV in RAM, weights in VRAM) is acceptable for multi-user serving because decode speed is preserved.
- Hardware matching: DDR5 + PCIe Gen5 makes KV swap faster (57 GB/s vs 28 GB/s for Gen4), reducing the Regime B penalty.
4. Connectivity & Bandwidth Architecture
This section provides the complete theoretical foundation for the data transfer paths that determine inference performance. Understanding these paths is essential for accurate performance estimation, especially when RAM offloading or multi-GPU configurations are involved.
4.1 The Data Path Hierarchy
Data involved in LLM inference traverses a strict hierarchy of interconnects, each with vastly different bandwidth characteristics:
| Path | Interconnect | BW Range | Use Case |
|---|---|---|---|
| GPU HBM | On-die bus | 900–4800 GB/s | VRAM-resident weight reading |
| NVLink/NVSwitch | GPU-GPU link | 400–900 GB/s | Tensor Parallelism all-reduce |
| PCIe | CPU-GPU bus | 7–57 GB/s eff. | RAM offload data transfer |
| System RAM | Memory bus | 43–304 GB/s eff. | Offloaded weight storage |
The key insight is that each level in this hierarchy is roughly 10–100× slower than the one above it. This creates a “bandwidth cliff” when inference must access slower paths.
4.2 PCIe Bus Architecture
PCI Express (PCIe) is the primary data highway between the CPU and GPU. Its bandwidth is determined by three factors: the generation (signaling rate), the number of lanes (bus width), and practical protocol efficiency.
PCIe Bandwidth Calculation
Theoretical PCIe bandwidth is calculated as:
$$ BW_{PCIe,theoretical} = R_{GT}/s \times N_{lanes} \times (128/130 (for PCIe Gen 3.0+)) \div 8 \quad \text{(Equation 26)} $$
where R_GT/s is the transfer rate per lane, N_lanes is the number of lanes (typically 16, 8, or 4), and the 128/130 (for PCIe Gen 3.0+) factor accounts for the encoding overhead introduced in PCIe 3.0+.
| Generation | GT/s/lane | x16 (GB/s) | x8 (GB/s) |
|---|---|---|---|
| PCIe 3.0 | 8 | 15.75 | 7.88 |
| PCIe 4.0 | 16 | 31.50 | 15.75 |
| PCIe 5.0 | 32 | 63.00 | 31.50 |
Practical PCIe Efficiency
Theoretical bandwidth is never fully achieved. Protocol overhead reduces practical throughput:
$$ BW_{PCIe,effective} = BW_{PCIe,theoretical} \times η_{PCIe} \quad \text{(Equation 27)} $$
where η_PCIe ≈ 0.90 for large sequential DMA transfers. The overhead comes from:
- TLP headers: Each Transaction Layer Packet has a 12–16 byte header for a 256–4096 byte payload
- DLLP and ACK/NAK: Data Link Layer packets add protocol overhead
- Root complex latency: CPU-side DMA controller adds a few microseconds of setup latency
- Memory mapping: IOMMU address translation for DMA adds overhead
For small random transfers (e.g., individual parameter updates), efficiency can drop to 40–70%. The calculator uses η_PCIe = 0.90 as a conservative estimate for the large sequential DMA transfers characteristic of weight loading during inference.
PCIe Lane Width Impact
Most desktop and server GPUs connect via x16 (16 lanes). However, some configurations reduce the effective lane width:
- Motherboards that share PCIe lanes between slots may run the GPU at x8 when both slots are populated
- Some budget GPUs are physically x8 or x4
- SXM form factors bypass PCIe entirely for GPU-GPU communication, but still use PCIe for CPU-GPU data transfer
Reducing from x16 to x8 exactly halves the available bandwidth, which significantly impacts RAM offload performance.
4.3 System RAM Bandwidth
Theoretical RAM Bandwidth
System RAM bandwidth is determined by the memory type, transfer rate, and number of channels:
$$ BW_{RAM,theoretical} = MT/s \times N_{channels} \times 8 bytes/transfer \quad \text{(Equation 28)} $$
| Configuration | MT/s | Channels | BW (GB/s) |
|---|---|---|---|
| DDR4-3200 2-ch | 3200 | 2 | 51.2 |
| DDR4-3200 4-ch | 3200 | 4 | 102.4 |
| DDR4-3200 8-ch | 3200 | 8 | 204.8 |
| DDR5-5600 2-ch | 5600 | 2 | 89.6 |
| DDR5-5600 4-ch | 5600 | 4 | 179.2 |
| DDR5-5600 8-ch | 5600 | 8 | 358.4 |
Practical RAM Efficiency
As with PCIe, theoretical RAM bandwidth is not fully achievable:
$$ BW_{RAM,effective} = BW_{RAM,theoretical} \times η_{RAM} \times f_{NUMA} \quad \text{(Equation 29)} $$
where η_RAM ≈ 0.85 accounts for DRAM refresh cycles, row misses, and memory controller scheduling, and f_NUMA is the NUMA efficiency factor.
NUMA Effects on Multi-Socket Systems
On multi-socket servers (AMD EPYC, Intel Xeon), each CPU socket has its own memory controller and attached RAM. Accessing RAM on the local socket is fast, but accessing RAM attached to a remote socket traverses an inter-socket link (AMD Infinity Fabric or Intel UPI) that adds latency and reduces bandwidth by 30–50%:
$$ f_{NUMA} = 1.0 \quad \text{(Equation 30)} $$ if NUMA-aware (weights on local socket)
f_NUMA = 0.65if no NUMA awareness (potential cross-socket access)
The calculator uses f_NUMA = 0.65 when NUMA awareness is disabled, reflecting the worst case where weight data may be allocated on a remote socket. For production deployments, always enable NUMA-aware allocation and pin the GPU to the same NUMA node as the RAM holding the offloaded weights.
4.4 The Transfer Chain: RAM → CPU → PCIe → GPU
When RAM-offloaded layers are accessed during inference, data must traverse the entire transfer chain:
- RAM read: Data is read from DRAM through the CPU memory controller
- CPU processing: The CPU’s IOMMU maps the DMA buffer address
- PCIe DMA transfer: Data is transferred via DMA from the pinned CPU buffer to GPU VRAM
- GPU reception: The GPU’s PCIe controller receives the data into VRAM
The bottleneck in this chain is the slower of PCIe effective bandwidth and RAM effective bandwidth:
$$ BW_{transfer} = \min(BW_{PCIe,effective}, BW_{RAM,effective}) \quad \text{(Equation 31)} $$
In most configurations, PCIe is the bottleneck. Even DDR4 2-channel at 51.2 GB/s theoretical (≈43 GB/s effective) exceeds PCIe Gen4 x8 at ≈14 GB/s effective. However, with fast PCIe Gen5 x16 (≈57 GB/s effective), slower RAM configurations (DDR4 2-channel) can become the bottleneck instead.
4.5 GPU Internal Architecture & HBM
High Bandwidth Memory (HBM)
GPU HBM is the fastest memory in the inference data path, connected to the GPU die via an ultra-wide bus (3072–6144 bits) on an organic or silicon interposer. This provides bandwidth of 900–4800 GB/s, orders of magnitude faster than any external interconnect.
| Memory Type | Example GPU | Bandwidth | Bus Width |
|---|---|---|---|
| GDDR6X | RTX 4090 | 1008 GB/s | 384-bit |
| HBM2e | A100 | 2000 GB/s | 5120-bit |
| HBM3 | H100 SXM | 3350 GB/s | 5120-bit |
| HBM3e | H200 SXM | 4800 GB/s | 6144-bit |
LLM decode is almost always HBM-bandwidth-bound: each token generation requires reading ALL active weights from HBM, but only performs 2/b FLOPs per byte of weight data (where b is bytes per parameter). This arithmetic intensity is far below the GPU’s ridge point, confirming the bandwidth-bound nature of decode.
GPU Clock Speed & Compute Throughput
GPU clock speed (typically 1.5–2.5 GHz) affects compute throughput (TFLOPS) but has limited impact on bandwidth-bound inference. The relationship between clock speed and compute throughput is:
$$ TFLOPS = N_{cores} \times f_{clock} \times 2 \quad \text{FLOP/clock/core (CUDA cores, FMA operation) (Equation 32)} $$
where N_{cores} is the number of CUDA cores and f_{clock} is the GPU clock frequency in GHz.
Note: This formula applies to standard CUDA scalar cores only. Tensor Core throughput is architecture- and precision-specific (e.g., H100 FP16 dense ≈ 989 TFLOPS) and must be taken from vendor specification tables — it cannot be derived from this formula.
For bandwidth-bound decode, increasing clock speed provides no benefit — the bottleneck is HBM read speed, not compute. Clock speed only matters for compute-bound prefill, where higher TFLOPS directly translates to faster prompt processing.
The Roofline Model
The roofline model provides a unified framework for understanding whether a workload is compute-bound or bandwidth-bound:
Definition — Arithmetic Intensity:
Arithmetic intensity is the ratio of FLOPs performed to bytes of data accessed:
$$ AI = FLOPs / Bytes accessed \quad \text{(Equation 33)} $$
For LLM decode with active parameters P_active stored at b bytes per parameter:
$$ AI_{decode} = (2 \times P_{active}) / (P_{active} \times b) = 2/b \quad \text{(Equation 34)} $$
At Q4 (b = 0.5), the arithmetic intensity is only 4 FLOP/byte. At FP16 (b = 2), it drops to 1 FLOP/byte. These values are far below the GPU’s ridge point (the arithmetic intensity at which compute and bandwidth are equally limiting):
$$ AI_{ridge} = TFLOPS / BW_{HBM} \quad \text{(Equation 35)} $$
For H100, using the dense FP16 figure as a conservative ceiling:
AI_ridge = 989 / 3,350 ≈ 295 FLOP/byte (dense, non-sparse)
Using the sparse figure from NVIDIA’s spec sheet:
AI_ridge = 1,979 / 3,350 ≈ 591 FLOP/byte (sparse, 2:4 structured)
Either way, LLM decode arithmetic intensity (1–4 FLOP/byte) is far below both ridge points, confirming bandwidth-bound execution regardless of convention used.
For prefill, the arithmetic intensity is much higher because the same weights are reused across all prompt tokens in a single batched matrix multiplication. Prefill is typically compute-bound.
4.6 Multi-GPU Interconnectivity
NVLink
NVLink is NVIDIA’s proprietary high-speed GPU-to-GPU interconnect. It provides dramatically higher bandwidth than PCIe, enabling efficient Tensor Parallelism (TP) where model weights are split across multiple GPUs.
| Generation | Links | BW per GPU | Example GPU |
|---|---|---|---|
| NVLink 2 | 6 | 300 GB/s | V100 |
| NVLink 3 | 12 | 600 GB/s | A100 |
| NVLink 4 | 18 | 900 GB/s | H100/H200 |
NVLink uses a point-to-point topology: each GPU has direct links to specific other GPUs. In a 4-GPU system, this creates a mesh where each GPU connects to every other GPU. In an 8-GPU system, each GPU typically connects to 4 neighbors, and multi-hop routing is required for non-adjacent communication.
NVSwitch
NVSwitch is NVIDIA’s switching fabric that provides full all-to-all connectivity between all GPUs in a node. Unlike point-to-point NVLink, NVSwitch allows any GPU to communicate with any other GPU at full NVLink speed simultaneously.
Equation 36 — All-reduce with NVSwitch:
All-reduce_NVSwitch = 2 × (h × 2) / BW_NVLink
vs. ring all-reduce on point-to-point NVLink:
$$ All-reduce_{ring} = 2 \times ((N-1)/N) \times (h \times 2) / BW_{NVLink} \quad \text{(Equation 37)} $$
where h is the hidden size and N is the number of GPUs. NVSwitch eliminates the (N−1)/N penalty and reduces the number of hops, providing significantly better TP efficiency for 4+ GPUs.
AMD Infinity Fabric
AMD’s Infinity Fabric serves a similar role to NVLink on MI-series GPUs. The MI300X is a multi-chiplet accelerator integrating 8 compute dies (XCDs) and 4 I/O dies — 12 chiplets in total — connected via AMD Infinity Fabric within the package. It provides 192 GB of HBM3 memory at 5,300 GB/s aggregate bandwidth. For multi-GPU scaling, each discrete MI300X offers a 16-lane PCIe® Gen 5 host interface and seven external AMD Infinity Fabric links (each at 128 GB/s bidirectional), allowing full all-to-all connectivity between eight GPUs in a ring topology.
PCIe Peer-to-Peer (P2P)
Without NVLink or Infinity Fabric, GPUs can communicate via PCIe peer-to-peer transfers. This uses the PCIe bus for direct GPU-to-GPU communication without CPU involvement, but the bandwidth is limited to PCIe speeds (typically 15–57 GB/s effective), making TP inefficient for all but the smallest models.
4.7 Tensor Parallelism vs. Pipeline Parallelism
Tensor Parallelism (TP)
TP splits model weights across N GPUs. Each GPU holds 1/N of the weights and performs the corresponding portion of each matrix multiplication. After each transformer layer, an all-reduce synchronizes the partial results across all GPUs.
The decode time per token with TP is:
$$ T_{TP} = (P_{active} \times b) / (N \times BW_{HBM}) + L \times ((2 \times (N-1)/N \times h \times 2) / BW_{interconnect} + λ_{AR}) \quad \text{(Equation 38)} $$
where λ_AR is the all-reduce latency per layer (5–100 μs depending on interconnect).
TP efficiency is:
$$ η_{TP} = T_{compute} / (T_{compute} + T_{communication}) \quad \text{(Equation 39)} $$
| Interconnect | 2-GPU | 4-GPU | 8-GPU |
|---|---|---|---|
| NVSwitch (H100) | 96% | 93% | 88% |
| NVLink P2P (H100) | 94% | 87% | 75% |
| PCIe Gen4 x16 | 65% | 40% | 20% |
TP is the preferred strategy when NVLink or NVSwitch is available, as it provides near-linear scaling with minimal latency overhead.
Pipeline Parallelism (PP)
PP splits model layers across GPUs sequentially. Each GPU processes a contiguous group of layers and passes the activation tensor to the next GPU. PP requires much less communication than TP — only the activation tensor (hidden size × batch × precision) is sent between stages, not an all-reduce.
However, PP introduces pipeline bubbles — idle time where GPUs wait for preceding stages to complete:
$$ Bubble fraction = (N-1) / (N + M - 1) \quad \text{(Equation 40)} $$
where M is the number of micro-batches. For single-user inference with batch size 1, only 1 micro-batch is possible, giving bubble fraction (N−1)/N.
PP is preferred when NVLink is unavailable (e.g., consumer GPUs connected via PCIe) or for cross-node deployment where network latency makes TP impractical.
Combined TP + PP
For very large models on many GPUs, both strategies can be combined: TP within a node (using NVLink) and PP across nodes (using network). For example, a 405B model on 8 H100s might use TP=4 within two nodes and PP=2 across nodes.
4.8 Combined Performance Model
The complete decode time model accounts for all connectivity factors:
$$ T_{decode} = (1-f_{RAM}) \times W / (η_{TP} \times N \times BW_{HBM}) + f_{RAM} \times W / \min(η_{PCIe} \times BW_{PCIe}, η_{RAM} \times f_{NUMA} \times BW_{RAM}) \quad \text{(Equation 41)} $$
This formula captures the essential physics of LLM inference:
- VRAM-resident layers benefit from both HBM bandwidth and TP parallelism
- RAM-offloaded layers are limited by the slowest link in the transfer chain
- The Bus Wall ratio quantifies the performance gap between these two regimes
- NUMA effects can further degrade RAM-offloaded performance on multi-socket systems
- PCIe lane width and generation directly affect the transfer bottleneck
5. Performance Estimation
5.1 Decode Speed (Tokens per Second)
During autoregressive decoding, each token generation requires loading all model weights from GPU memory. This makes inference bandwidth-bound:
$$ TPS = BW / (P_{active} \times b_{param} \times 1000) \quad \text{(Equation 42)} $$
where BW is the GPU memory bandwidth in GB/s. For MoE models, P_active (the number of parameters active per forward pass) is used instead of P_total, since only the routed experts are loaded during decode.
Definition — Active Parameters:
For dense models, P_active = P_total. For MoE models, P_active represents only the parameters used per forward pass, including the embedding layer, shared experts, and the k routed experts per token. For example, Mixtral 8x7B has P_total = 47B but P_active ≈ 13B.
5.2 Time to First Token (TTFT)
The prefill phase processes the entire prompt in parallel. The time to first token is estimated as:
$$ TTFT = (P_{active} \times C \times 2 \times b_{param}) / BW \times 1000 ms \quad \text{(Equation 43)} $$
where C is the prompt length in tokens. The factor of 2 accounts for the read and write of activations during the forward pass.
5.3 Extended Performance Metrics
The calculator provides additional derived metrics for practical deployment planning:
$$ Latency per token = 1000 / TPS \quad \text{ms (Equation 44)} $$
$$ Time_{100} = TTFT/1000 + 100/TPS \quad \text{seconds (Equation 45)} $$
$$ Time_{1000} = TTFT/1000 + 1000/TPS \quad \text{seconds (Equation 46)} $$
$$ Throughput = TPS \times U \quad \text{tok/s total (Equation 47)} $$
6. Power & Cost Estimation
6.1 Power Model
GPU power consumption is estimated from the Thermal Design Power (TDP) and utilization:
$$ P_{draw} = TDP \times (U_{GPU} / 100) \quad \text{(Equation 48)} $$
where U_{GPU} is the GPU utilization percentage. LLM inference is typically memory-bandwidth-bound rather than compute-bound, resulting in 60–90% utilization during decode.
6.2 Energy and Cost Calculations
$$ E_{hour} = P_{draw} / 1000 \quad \text{kWh (Equation 49)} $$
$$ C_{hour} = E_{hour} \times R_{elec} \quad \text{(Equation 50)} $$
$$ C_{day} = C_{hour} \times H_{day} \quad \text{(Equation 51)} $$
$$ C_{month} = C_{day} \times 30 \quad \text{(Equation 52)} $$
$$ C_{1M} tok = (C_{hour} / (TPS \times 3600)) \times 10⁶ \quad \text{(Equation 53)} $$
where R_elec is the electricity rate ($/kWh), H_day is operating hours per day, and TPS is the decode speed.
6.3 Carbon Emissions
$$ CO_{2,hour} = E_{hour} \times I_{carbon} \quad \text{(Equation 54)} $$
$$ CO_{2,annual} = CO_{2,hour} \times H_{day} \times 365 / 1000 \quad \text{tonnes (Equation 55)} $$
where I_carbon is the grid carbon intensity in kg CO₂/kWh. Default values and regional references:
| Region | kg CO₂/kWh |
|---|---|
| World average | 0.417 |
| European Union | 0.255 |
| United States | 0.387 |
| France (nuclear) | 0.056 |
| Sweden (hydro) | 0.045 |
| Poland (coal) | 0.769 |
| China | 0.555 |
7. Fine-Tuning Memory Model
For fine-tuning, the memory model extends to include gradients and optimizer states:
$$ V_{FT} = V_{weights} + V_{gradients} + V_{optimizer} + V_{activations} \quad \text{(Equation 56)} $$
7.1 LoRA / QLoRA
With LoRA, only low-rank adaptation matrices are trained. The additional memory is:
$$ V_{LoRA} \approx 0.40 \times V_{weights} \quad \text{(Equation 57)} $$
This accounts for LoRA weights (typically <1% of base weights), their gradients (FP32), and 8-bit AdamW states.
7.2 Full Fine-Tuning
Full fine-tuning requires gradients for all parameters and optimizer states:
$$ V_{gradients} = P_{total} \times b_{train} \quad \text{(Equation 58)} $$
$$ V_{optimizer} = P_{total} \times 8 \quad \text{(AdamW FP32: momentum + variance) (Equation 59)} $$
Combined with the overhead factor of 2.0, this yields approximately 3× the weight memory for FP16 training with AdamW.
8. Quantization Reference
8.1 GGUF Quantization Levels
GGUF (GPT-Generated Unified Format) is the binary file format introduced by the llama.cpp project as a successor to the older GGML format. The name is not an official acronym — it is commonly rendered as “GGUF” without expansion. It is the standard format for llama.cpp and compatible inference engines. The following table lists supported quantization levels:
| Level | Label | B/param | Description |
|---|---|---|---|
| Q2_K | 2-bit K-quant | 0.32 | Aggressive 2-bit quantization, K-quants method |
| Q3_K_S | 3-bit small | 0.34 | 3-bit quantization, small variant |
| Q3_K_M | 3-bit medium | 0.43 | 3-bit quantization, medium variant |
| Q3_K_L | 3-bit large | 0.45 | 3-bit quantization, large variant |
| Q4_0 | 4-bit base | 0.56 | Basic 4-bit quantization |
| Q4_K_S | 4-bit small K | 0.58 | 4-bit K-quant, small variant |
| Q4_K_M | 4-bit medium K | 0.60 | 4-bit K-quant, medium variant |
| Q5_0 | 5-bit base | 0.68 | Basic 5-bit quantization |
| Q5_K_S | 5-bit small K | 0.69 | 5-bit K-quant, small variant |
| Q5_K_M | 5-bit medium K | 0.71 | 5-bit K-quant, medium variant |
| Q6_K | 6-bit K-quant | 0.83 | 6-bit K-quant |
| Q8_0 | 8-bit quant | 1.06 | Near-FP16 quality, 8-bit quantization |
8.2 Weighted Quantization Methods
Beyond GGUF, the calculator recognizes several weighted quantization formats commonly found on HuggingFace:
| Method | Typical Bits | Characteristics |
|---|---|---|
| GPTQ | 2–8 bits | Post-training quantization with calibration dataset; group-wise quantization with optional desc_act; commonly 4-bit with group_size=128 |
| AWQ | 4–8 bits | Activation-aware weight quantization; preserves salient weights; group-wise with zero-point |
| EXL2 | 2–8 bits | ExLlamaV2 format; mixed-precision per-layer quantization; optimized for ExLlamaV2 inference engine |
| BNB/NF4 | 4 bits | BitsAndBytes NF4 quantization; default for QLoRA; double quantization support |
9. HuggingFace Integration
9.1 Model Metadata Retrieval
The calculator fetches model metadata from the HuggingFace API using the endpoint:
GET https://huggingface.co/api/models/{model_id}
This returns the model card data including parameter counts (from safetensors.total), tensor types (from safetensors.parameters), siblings (file list), and default branch information.
9.2 Tensor Type Detection
Tensor type detection follows a 4-level priority chain:
- quantization_config in
config.json: Highest priority for explicitly quantized models (GPTQ, AWQ, EXL2, BNB). - safetensors.parameters from the API: Shows actual dtype names of stored tensors (e.g.,
F16,Q8_0). - Safetensors index metadata: dtype fields in
model.safetensors.index.json. - torch_dtype in
config.json: Fallback for unquantized models (typicallybfloat16orfloat16).
9.3 Quantized Variant Discovery
The calculator automatically discovers quantized variants by:
- Scanning current model’s siblings for GGUF files (parsing filenames like
model-Q4_K_M.gguf). - Searching HuggingFace for related repos:
[base-name] GGUF,[base-name] GPTQ,[base-name] AWQ. - Deduplicating and sorting variants by quantization level.
10. Hardware Reference
Supported GPU Hardware
| GPU | VRAM (GB) | HBM BW (GB/s) | TFLOPS FP16 (Dense) | TFLOPS FP16 (Sparse) | PCIe Gen | NVLink (GB/s) | NVS |
|---|---|---|---|---|---|---|---|
| H200 SXM | 141 | 4800 | 989 | 1979 | 5 | 900 | Yes |
| H100 SXM | 80 | 3350 | 989 | 1979 | 5 | 900 | Yes |
| H100 PCIe | 80 | 2000 | 756 | 1513 | 5 | 0 | No |
| A100 80GB | 80 | 2000 | 312 | 624 | 4 | 600 | Yes |
| A100 40GB | 40 | 1555 | 312 | 624 | 4 | 600 | Yes |
| A6000 Ada | 48 | 960 | 182 | 364 | 4 | 0 | No |
| RTX 4090 | 24 | 1008 | 165 | 330 | 4 | 0 | No |
| RTX 3090 | 24 | 936 | 71 | 142 | 4 | 0 | No |
| L40S | 48 | 864 | 366 | 733 | 4 | 0 | No |
| MI300X | 192 | 5300 | 1307 | 2614 | 5 | 400 | No |
| MI250X | 128 | 3276 | 383 | 383 | 4 | 400 | No |
Note: Sparse TFLOPS assume a 2:4 structured sparsity pattern, effectively doubling throughput for supported operations compared to Dense matrices. Older architectures like MI250X (CDNA2) do not feature structured sparsity hardware acceleration.
PCIe Bandwidth Reference
| Config | Theoretical | Effective (η=0.90) | Bus Wall (H200) | Bus Wall (4090) |
|---|---|---|---|---|
| Gen3 x16 | 15.75 GB/s | 14.2 GB/s | 338× | 71× |
| Gen3 x8 | 7.88 GB/s | 7.1 GB/s | 676× | 142× |
| Gen4 x16 | 31.5 GB/s | 28.4 GB/s | 169× | 36× |
| Gen4 x8 | 15.75 GB/s | 14.2 GB/s | 338× | 71× |
| Gen5 x16 | 63.0 GB/s | 56.7 GB/s | 85× | 18× |
| Gen5 x8 | 31.5 GB/s | 28.4 GB/s | 169× | 36× |
Interconnect Latency Reference
| Interconnect | Latency | Topology | Notes |
|---|---|---|---|
| NVSwitch | ~5 μs | All-to-all | Best for 4+ GPUs |
| NVLink P2P | ~10 μs | Ring/mesh | Direct GPU-GPU link |
| PCIe P2P | ~40 μs | Ring | Through root complex |
| Through CPU | ~100 μs | Multi-hop | GPU → CPU → GPU |
11. Limitations & Assumptions
Roofline model: Performance estimates assume bandwidth-bound decode (confirmed by arithmetic intensity analysis). Actual performance may be compute-bound for very small models or very short sequences with high batch sizes.
PCIe efficiency: The 0.90 efficiency factor is a conservative estimate for large DMA transfers. Real efficiency varies by motherboard, IOMMU configuration, and CPU architecture. Measured values range from 0.85 to 0.95.
RAM efficiency: The 0.85 factor for sequential RAM reads is an average. Actual values depend on DRAM type, frequency, and access pattern. Mixed read/write workloads achieve lower efficiency.
NUMA model: The 0.65 factor for non-NUMA-aware allocation is an approximation. Actual cross-socket bandwidth depends on the inter-socket link (AMD Infinity Fabric, Intel UPI), memory topology, and system firmware configuration.
KV cache quantization: Quality impact of aggressive KV cache quantization (Q4) is not modeled; it may degrade output quality for sensitive tasks.
RAM offloading: The regime-aware model (N1–N10) distinguishes weight offloading from KV cache offloading, but still simplifies the real behavior of offloading engines. In practice, llama.cpp and vLLM may use pipelined or overlapped transfers that partially hide latency. The KV swap model assumes sequential swap-in, while real implementations may prefetch or stream KV cache during prefill.
Multi-GPU TP: Estimates use the analytical all-reduce model with fixed latency. Real implementations may use custom all-reduce algorithms (e.g., NVLink SHARP, NCCL topology-aware) that differ from the ring model.
Pipeline Parallelism: The bubble fraction model assumes single micro-batch inference. With continuous batching or multiple concurrent requests, the bubble can be hidden more effectively.
Framework overhead: The 12% inference overhead is a conservative average. vLLM and llama.cpp may use less; PyTorch native may use more.
MoE active parameters: The calculator uses the active parameter count reported by the model card or estimated from config. Actual activation patterns may differ, and expert routing is token-dependent.
Power model: TDP-based power estimation is approximate. Real power varies with clock speed, temperature throttling, and workload characteristics.
GPU clock speed: Clock speed is not explicitly modeled in the calculator since decode is bandwidth-bound. For compute-bound prefill, the TFLOPS value implicitly captures clock speed effects.
12. Glossary
- VRAM — Video Random Access Memory: the dedicated high-bandwidth memory on a GPU, used to store model weights, activations, and KV cache during inference.
- HBM — High Bandwidth Memory: a type of VRAM connected to the GPU die via an ultra-wide bus (3072–6144 bits) on an interposer, providing 900–4800 GB/s bandwidth.
- KV Cache — Key-Value Cache: stored attention key and value tensors from previous positions, enabling efficient autoregressive generation without recomputing the entire context at each step.
- GQA — Grouped-Query Attention: an attention mechanism where query heads are divided into groups, each sharing a single KV head. Balances MHA quality with MQA efficiency.
- MQA — Multi-Query Attention: all query heads share a single KV head, minimizing KV cache size at the cost of some expressiveness.
- MoE — Mixture of Experts: a model architecture where only a subset (typically 2–8) of expert modules is activated per token, reducing compute while maintaining model capacity.
- TPS — Tokens Per Second: the generation speed during autoregressive decoding, the primary throughput metric for LLM inference.
- TTFT — Time to First Token: the latency from receiving a prompt to generating the first output token, dominated by the prefill (prompt processing) phase.
- TDP — Thermal Design Power: the maximum sustained power dissipation a cooling system must handle, used as a proxy for peak GPU power consumption.
- QLoRA — Quantized Low-Rank Adaptation: a parameter-efficient fine-tuning method that quantizes the base model to 4-bit (NF4) and trains small low-rank adapter matrices.
- GGUF — GGUF is the binary file format introduced by the llama.cpp project as a successor to the older GGML format. The name is not an official acronym — it is commonly rendered as “GGUF” without expansion. It is the standard format for llama.cpp and compatible inference engines.
- Bus Wall — Le Mur du Bus: the ratio of GPU HBM bandwidth to the effective transfer bandwidth (PCIe + RAM) for RAM-offloaded layers. Quantifies how many times slower offloaded layers are compared to VRAM-resident layers.
- PCIe — Peripheral Component Interconnect Express: the primary data bus between CPU and GPU, providing 7–57 GB/s effective bandwidth depending on generation and lane count.
- NVLink — NVIDIA’s proprietary high-speed GPU-to-GPU interconnect, providing 300–900 GB/s bandwidth for Tensor Parallelism communication.
- NVSwitch — NVIDIA’s switching fabric providing all-to-all connectivity between GPUs in a node, improving TP efficiency for 4+ GPUs compared to point-to-point NVLink.
- NUMA — Non-Uniform Memory Access: a memory architecture where each CPU socket has local memory; accessing remote socket memory is slower, affecting RAM offload performance on multi-socket servers.
- Arithmetic Intensity — FLOPs performed per byte of data accessed, determining whether a workload is compute-bound or bandwidth-bound according to the roofline model.
- Ridge Point — The arithmetic intensity at which a GPU transitions from bandwidth-bound to compute-bound execution, equal to
TFLOPS / BW_HBM. - TP — Tensor Parallelism: a multi-GPU strategy that splits model weights across GPUs, requiring all-reduce communication after each layer.
- PP — Pipeline Parallelism: a multi-GPU strategy that splits model layers across GPUs sequentially, requiring only activation tensor communication between stages.
- Dense TFLOPS — Compute throughput calculated without assuming structural sparsity. Every operation in the matrix multiply is explicitly computed.
- Sparse TFLOPS — Compute throughput assuming structured sparsity (e.g., 2:4 sparsity where 2 out of every 4 elements are zero). This allows specialized hardware (like Tensor Cores) to skip zero-multiplies, effectively doubling the theoretical throughput for supported operations.
- 2:4 Structured Sparsity — A sparsity pattern where exactly 2 out of every 4 consecutive weight elements are zero, in a fixed pattern. This is the only sparsity format natively accelerated by NVIDIA Tensor Cores (Ampere and later). Pruning a model to 2:4 sparsity typically preserves most of the model’s accuracy while halving the compute required for matrix multiplications.
- Regime A/B/C/D — Offloading regimes that classify how memory is distributed between VRAM and RAM, determining whether the Bus Wall penalty applies per-token (C/D) or only at context switch (B).
- KV Swap — The one-time cost of loading a user’s KV cache from RAM to VRAM before generation can begin. Only applies in Regimes B and D. Does not affect decode speed.
- VRAM Priority Allocation — The rule that weights are allocated to VRAM before KV cache, because weight offloading causes a per-token Bus Wall penalty while KV offloading only causes a one-time swap cost.