Local LLM Resource Estimator — Technical Documentation & Formula Reference

Author: Karim EL MERNISSI
Date: April 2026


1. Introduction

The Local LLM Resource Estimator is a self-contained, browser-based tool designed to estimate the memory requirements, inference performance, and operational costs of deploying Large Language Models (LLMs) on GPU hardware. It integrates directly with the HuggingFace model hub to retrieve real model metadata, including parameter counts, tensor types, quantization configurations, and available quantized variants.

This document provides the complete mathematical foundation, algorithmic descriptions, and pedagogical references for every calculation performed by the tool. Each formula is derived from first principles and annotated with its assumptions, limitations, and practical implications.

1.1 Design Philosophy

The calculator follows three core principles: rigor (every estimate is grounded in a well-defined mathematical model), transparency (all formulas are displayed to the user with live substitution of their chosen parameters), and pedagogy (every input and output is accompanied by an on-demand explanation accessible via the “?” buttons throughout the interface).

1.2 Scope

The calculator addresses two primary deployment scenarios:


2. VRAM Estimation Model

Total GPU memory consumption is modeled as the sum of three primary components:

$$ V_{total} = V_{weights} + V_{kv} + V_{overhead} \quad \text{(Equation 1)} $$

where V_weights is the memory for model parameters, V_kv is the KV cache memory, and V_overhead accounts for framework and runtime overhead.

2.1 Model Weight Memory

The dominant memory component is the storage of model weights. Its size is determined by the total parameter count and the precision (bytes per parameter):

$$ V_{weights} = P_{total} \times b_{param} \quad \text{(Equation 2)} $$

Definition — Total Parameters:
P_total is the total number of learnable parameters in the model, including all experts for Mixture-of-Experts (MoE) architectures. For dense models, P_total = P_active. For MoE models, P_total ≫ P_active since only a subset of experts is activated per token.

Definition — Bytes per Parameter:
b_param denotes the number of bytes used to store each weight parameter. This depends on the chosen precision or quantization level. The table below lists common values.

| Precision | Bits | B/param | Notes |
|---|---|---|---|
| FP32 | 32 | 4.00 | Full precision, rarely used for inference |
| BF16 / FP16 | 16 | 2.00 | Standard training & inference precision |
| FP8 / INT8 | 8 | 1.00 | 8-bit quantization |
| Q8_0 | 8 | 1.06 | GGUF 8-bit with overhead |
| Q6_K | 6 | 0.83 | 6-bit K-quant |
| Q5_K_M | 5 | 0.71 | 5-bit medium K-quant |
| Q4_K_M | 4 | 0.60 | 4-bit medium K-quant |
| Q4 / NF4 | 4 | 0.50 | 4-bit quantization, QLoRA default |
| Q3_K_M | 3 | 0.43 | 3-bit medium K-quant |
| Q2_K | 2 | 0.32 | 2-bit K-quant (aggressive) |
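
As a minimal sketch of Equation 2, the snippet below multiplies a parameter count by a bytes-per-parameter value taken from the table. The function name and the dictionary are illustrative and are not taken from the calculator's source.

```python
# Bytes per parameter for a few precisions, copied from the table above.
BYTES_PER_PARAM = {"FP16": 2.00, "Q8_0": 1.06, "Q4_K_M": 0.60, "Q4": 0.50}

def weight_memory_gb(p_total: float, precision: str) -> float:
    """Equation 2: V_weights = P_total * b_param, returned in GB (10^9 bytes)."""
    return p_total * BYTES_PER_PARAM[precision] / 1e9

# A 70B-parameter dense model at two precisions.
print(weight_memory_gb(70e9, "FP16"))    # about 140 GB
print(weight_memory_gb(70e9, "Q4_K_M"))  # about 42 GB
```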

2.2 KV Cache Memory

During autoregressive generation, the model caches past Key and Value states from attention layers to avoid recomputation. The KV cache size is:

$$ V_{kv} = 2 \times L \times n_{kv} \times d_{h} \times C \times B \times U \times b_{kv} \quad \text{(Equation 3)} $$

where the factor of 2 accounts for separate Key and Value tensors.

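KV Cache Parameters: L is the number of transformer layers, n_kv the number of KV heads, d_h the dimension of each attention head, C the context length in tokens, B the batch size, U the number of concurrent users, and b_kv the number of bytes per cached KV element (2 for FP16, 1 for FP8 or INT8).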

Attention Architecture Variants

The number of KV heads n_kv varies significantly across attention architectures, directly impacting KV cache memory:

| Architecture | KV Heads | Description |
|---|---|---|
| Multi-Head (MHA) | n_kv = n_q | Each query head has its own KV pair. Largest KV cache. |
| Grouped-Query (GQA) | 1 < n_kv < n_q | Query heads share KV heads in groups. Good balance. |
| Multi-Query (MQA) | n_kv = 1 | All query heads share a single KV head. Smallest cache. |

For example, Llama 3.1 70B uses GQA with n_q = 64 query heads but only n_kv = 8 KV heads, reducing KV cache memory by 8× compared to MHA.
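
The 8× reduction can be checked with a short sketch of Equation 3. The layer count (80) and head dimension (128) are assumed values for Llama 3.1 70B that are not stated in this document, and the 4096-token context, single user, and FP16 cache are arbitrary illustrative choices.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch=1, users=1, bytes_per_elem=2.0):
    """Equation 3: 2 (Key and Value) * L * n_kv * d_h * C * B * U * b_kv."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch * users * bytes_per_elem)

# Assumed architecture values: 80 layers, head dimension 128.
gqa = kv_cache_bytes(80, 8, 128, 4096)    # GQA: 8 KV heads
mha = kv_cache_bytes(80, 64, 128, 4096)   # hypothetical MHA: 64 KV heads
print(gqa / 1e9, "GB with GQA;", mha / gqa, "times larger with MHA")
```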

Fallback Estimation

When architecture details (L, n_kv, d_h) are unavailable, the calculator falls back to a rule-of-thumb estimate:

$$ V_{kv} \approx P_{total} \times 0.12 \times (C / 4096) \times B \times U \quad \text{(Equation 4)} $$

This assumes KV cache is approximately 12% of weight memory at 4096-token context for a typical GQA model.

2.3 Framework Overhead

Framework overhead includes CUDA kernels, temporary buffers, activation memory, and communication buffers. It is estimated as a fraction of weight memory:

$$ V_{overhead} = V_{weights} \times f_{overhead} \quad \text{(Equation 5)} $$

where f_overhead varies by operation mode:

| Mode | f_overhead |
|---|---|
| Inference | 0.12 (12%) |
| LoRA / QLoRA fine-tuning | 0.40 (40%) |
| Full fine-tuning | 2.00 (200%) |

For full fine-tuning, the overhead factor of 2.0 accounts for gradients (1× weights in training precision) and AdamW optimizer states (8 bytes per parameter in FP32), plus activation memory.
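
Equations 1 and 5 combine into a one-line total. The sketch below assumes all sizes are already expressed in GB; the mode keys are illustrative names, not identifiers from the tool.

```python
# Overhead fractions from the table above.
OVERHEAD_FACTOR = {"inference": 0.12, "lora": 0.40, "full": 2.00}

def total_vram_gb(v_weights_gb: float, v_kv_gb: float, mode: str = "inference") -> float:
    """Equations 1 and 5: V_total = V_weights + V_kv + V_weights * f_overhead."""
    v_overhead = v_weights_gb * OVERHEAD_FACTOR[mode]
    return v_weights_gb + v_kv_gb + v_overhead

# 42 GB of weights plus 1.34 GB of KV cache in inference mode.
print(total_vram_gb(42.0, 1.34))
```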


3. RAM Overflow / Offloading

⚠️ Performance Warning
RAM offloading allows models that exceed VRAM to run by spilling layers to system RAM, but at a severe performance cost: 10–50× slower inference for the offloaded portion.

When the total required memory exceeds available VRAM, the calculator supports an optional RAM offloading mode:

$$ V_{overflow} = \max(0, V_{total} - V_{VRAM}) \quad \text{(Equation 6)} $$

$$ V_{RAM,usable} = \min(V_{overflow}, V_{RAM,available}) \quad \text{(Equation 7)} $$

The effective available memory becomes:

$$ V_{effective} = V_{VRAM} + V_{RAM,usable} \quad \text{(Equation 8)} $$
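
Equations 6 to 8 can be sketched as follows. The `fits` flag is an added convenience that is not part of the equations; it reports whether the available RAM covers the whole overflow.

```python
def offload_plan(v_total: float, v_vram: float, v_ram_available: float):
    """Equations 6-8, all sizes in GB."""
    v_overflow = max(0.0, v_total - v_vram)            # Equation 6
    v_ram_usable = min(v_overflow, v_ram_available)    # Equation 7
    v_effective = v_vram + v_ram_usable                # Equation 8
    fits = v_overflow <= v_ram_available               # added helper flag
    return v_overflow, v_ram_usable, v_effective, fits

print(offload_plan(48.0, 24.0, 64.0))  # overflow fully absorbed by RAM
print(offload_plan(48.0, 24.0, 16.0))  # RAM too small for the overflow
```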

Memory Allocation Parameters:

3.1 The Bus Wall (“Le Mur du Bus”)

The fundamental performance limitation of RAM offloading is captured by the Bus Wall concept — the ratio of GPU HBM bandwidth to the effective transfer bandwidth between CPU and GPU:

$$ \text{Bus Wall Ratio} = BW_{HBM} / BW_{transfer} \quad \text{(Equation 9)} $$

where BW_transfer is the effective bandwidth of the data path from system RAM to GPU, determined by the bottleneck in the transfer chain:

$$ BW_{transfer} = \min(BW_{PCIe,effective}, BW_{RAM,effective}) \quad \text{(Equation 10)} $$

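The Bus Wall ratio tells you how many times slower RAM-offloaded layers are compared to VRAM-resident layers. With the effective bandwidths used in this document, it is about 36× for an RTX 4090 on PCIe Gen4 x16 (1008 / 28 GB/s) and about 59× for an H100 SXM on PCIe Gen5 x16 (3350 / 57 GB/s); it exceeds 100× when an H100 SXM is restricted to PCIe Gen4 x16 (3350 / 28 GB/s ≈ 118×).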
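
A minimal sketch of Equations 9 and 10 follows. The bandwidth figures are the theoretical values and efficiency factors from the tables in Section 4; pairing each GPU with a particular RAM configuration is a hypothetical choice made only for this example.

```python
def bus_wall_ratio(bw_hbm: float, bw_pcie_eff: float, bw_ram_eff: float) -> float:
    """Equations 9 and 10: HBM bandwidth over the slower of the PCIe and RAM paths."""
    bw_transfer = min(bw_pcie_eff, bw_ram_eff)   # Equation 10
    return bw_hbm / bw_transfer                  # Equation 9

# RTX 4090 (1008 GB/s), PCIe Gen4 x16 (31.5 * 0.90), DDR5-5600 2-ch (89.6 * 0.85)
rtx4090 = bus_wall_ratio(1008, 31.5 * 0.90, 89.6 * 0.85)
# H100 SXM (3350 GB/s), PCIe Gen5 x16 (63.0 * 0.90), DDR5-5600 8-ch (358.4 * 0.85)
h100 = bus_wall_ratio(3350, 63.0 * 0.90, 358.4 * 0.85)
print(round(rtx4090, 1), round(h100, 1))
```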

3.2 Improved RAM Offload Performance Model

The performance degradation from RAM offloading is modeled by decomposing the decode time into two independent data paths:

$$ T_{decode} = W_{VRAM} / BW_{HBM} + W_{RAM} / BW_{transfer} \quad \text{(Equation 11)} $$

where W_VRAM is the absolute amount of weights stored in VRAM (in GB), and W_RAM is the absolute amount of weights offloaded to system RAM (in GB). This two-path model is more accurate than a simple degradation factor because it correctly accounts for the fact that VRAM-resident layers still run at full HBM speed, while only the offloaded portion is slowed by the Bus Wall.

The fraction of weights in RAM is:

$$ f_{RAM} = V_{RAM,usable} / V_{total} \quad \text{(Equation 12)} $$

Substituting, the effective decode speed becomes:

$$ TPS_{offload} = 1 / ((1-f_{RAM}) \times W / BW_{HBM} + f_{RAM} \times W / BW_{transfer}) \quad \text{(Equation 13)} $$

where W (W = P_active × b_param / 10^9) is the total active weight memory in GB.
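
The two-path model of Equation 13 is easy to explore numerically. The sketch below uses illustrative inputs: 35 GB of active weights, 2000 GB/s of HBM bandwidth, and 28 GB/s of transfer bandwidth.

```python
def tps_offload(w_gb: float, f_ram: float, bw_hbm: float, bw_transfer: float) -> float:
    """Equation 13: decode tokens/s when a fraction f_ram of the weights lives in RAM."""
    t_vram = (1.0 - f_ram) * w_gb / bw_hbm     # time reading VRAM-resident weights
    t_ram = f_ram * w_gb / bw_transfer         # time pulling offloaded weights
    return 1.0 / (t_vram + t_ram)

for f_ram in (0.0, 0.1, 0.5):
    print(f_ram, round(tps_offload(35.0, f_ram, 2000.0, 28.0), 2))
```

Even with only 10% of the weights in RAM, the second term dominates the per-token time in this example.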

3.3 VRAM Priority Allocation & Offloading Regimes

The previous offloading model treated all RAM offloading uniformly, applying the Bus Wall penalty to the entire decode process. However, the two primary components that can be offloaded — model weights and KV cache — have fundamentally different performance implications when offloaded:

This distinction leads to a priority-based VRAM allocation model and four distinct performance regimes.

N1: VRAM Priority Allocation

When VRAM is scarce, it must be allocated with careful prioritization. Model weights receive VRAM priority over KV cache because weight offloading causes a per-token Bus Wall penalty, while KV offloading only causes a one-time swap cost:

$$ V_{kv,VRAM}^{max} = \max(0, V_{VRAM} - V_{weights} - V_{overhead}) \quad \text{(Equation 13a)} $$

The actual KV cache stored in VRAM is then:

$$ V_{kv,VRAM} = \min(V_{kv}, V_{kv,VRAM}^{max}) \quad \text{(Equation 13b)} $$

And the KV cache that must be stored in RAM:

$$ V_{kv,RAM} = \max(0, V_{kv} - V_{kv,VRAM}^{max}) \quad \text{(Equation 13c)} $$

If weights alone exceed VRAM, then all VRAM is used for (partial) weights and all KV must go to RAM:

$$ V_{weights,RAM} = \max(0, V_{weights} - V_{VRAM} + V_{overhead}) \quad \text{(Equation 13d)} $$
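
Equations 13a to 13d can be sketched as a single allocation step. The dictionary keys are illustrative names, and all sizes are assumed to be in GB.

```python
def allocate(v_weights: float, v_kv: float, v_overhead: float, v_vram: float) -> dict:
    """Weights-first VRAM allocation, Equations 13a-13d."""
    kv_vram_max = max(0.0, v_vram - v_weights - v_overhead)   # Equation 13a
    return {
        "kv_vram_max": kv_vram_max,
        "kv_vram": min(v_kv, kv_vram_max),                    # Equation 13b
        "kv_ram": max(0.0, v_kv - kv_vram_max),               # Equation 13c
        "weights_ram": max(0.0, v_weights - v_vram + v_overhead),  # Equation 13d
    }

# 40 GB weights, 30 GB KV cache, 5 GB overhead on a 64 GB GPU.
print(allocate(40.0, 30.0, 5.0, 64.0))
```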

KV Cache & Weight Allocation Parameters:

N2: Regime Classification

Based on the allocation, the deployment falls into one of four performance regimes:

| Regime | Condition | Description |
|---|---|---|
| A | V_total ≤ V_VRAM | All in VRAM. Full HBM performance. |
| B | V_weights ≤ V_VRAM, V_kv > V_kv_VRAM_max | Weights in VRAM, KV in RAM. Full decode speed, TTFT + swap. |
| C | V_weights > V_VRAM, V_kv ≤ V_kv_VRAM_max | Weights in RAM, KV in VRAM. Bus Wall on every token. (Rare) |
| D | V_weights > V_VRAM, V_kv > V_kv_VRAM_max | Both in RAM. Bus Wall + KV swap. Worst case. |

Key Insight: Why Regime B is dramatically better than C/D

In Regime B, decode speed equals Regime A (full HBM speed) because weights are still read from VRAM. The KV cache swap penalty only affects TTFT, not per-token throughput. In contrast, Regimes C and D impose the Bus Wall on every single decode token, making decode 30–100× slower. This is why the VRAM priority allocation (weights first) is so important.
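
The regime table translates into a small classifier. This is a sketch that follows the table's conditions literally, with all sizes in GB; the corner case the table does not cover is reported as None rather than guessed.

```python
def classify_regime(v_weights: float, v_kv: float, v_overhead: float, v_vram: float):
    """Return 'A', 'B', 'C' or 'D' following the regime table."""
    kv_vram_max = max(0.0, v_vram - v_weights - v_overhead)   # Equation 13a
    if v_weights + v_kv + v_overhead <= v_vram:
        return "A"
    weights_fit = v_weights <= v_vram
    kv_fits = v_kv <= kv_vram_max
    if weights_fit:
        # The table has no row for "weights fit, KV does not overflow, but
        # overhead does"; this sketch returns None in that case.
        return "B" if not kv_fits else None
    return "C" if kv_fits else "D"

print(classify_regime(42.0, 1.34, 5.04, 80.0))   # everything fits
print(classify_regime(42.0, 40.0, 5.04, 80.0))   # KV cache overflows
print(classify_regime(140.0, 1.34, 16.8, 80.0))  # weights and KV overflow
```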

N3: Decode Speed in Regime B

In Regime B, all weights reside in VRAM, so decode proceeds at full HBM speed:

$$ TPS_B = TPS_A = BW_{HBM} / (P_{active} \times b_{param}) \quad \text{(Equation 14)} $$

This is the critical result: KV cache offloading does not slow down decode. Only the initial context loading (TTFT) is impacted.

N4: KV Cache Swap-In Latency

When a user’s KV cache is stored in RAM, it must be swapped into VRAM before generation can begin. This is a one-time cost per context switch:

$$ T_{kv,swap} = V_{kv,RAM} / BW_{transfer} \quad \text{(Equation 15)} $$

where BW_transfer = min(BW_PCIe_effective, BW_RAM_effective) is the same transfer bandwidth used in the Bus Wall calculation.

N5: TTFT in Regime B

The time to first token in Regime B adds the KV swap penalty to the standard prefill time:

$$ TTFT_B = TTFT_A + T_{kv,swap} \quad \text{(Equation 16)} $$

This means the first token for a user with offloaded KV cache is delayed by the swap time, but all subsequent tokens generate at full decode speed.

N6: TTFT in Regime C

In Regime C, weight offloading means that the prompt must wait for weights to be loaded from RAM before each layer can compute. The effective TTFT is:

$$ TTFT_C = \max(T_{compute}, T_{weight,load}) \quad \text{(Equation 17)} $$

Time Parameters:

In practice, T_weight,load ≫ T_compute because the Bus Wall ratio is typically 30–100×.

N7: TTFT in Regime D

Regime D combines the worst of both worlds:

$$ TTFT_D = \max(T_{compute}, T_{weight,load}) + T_{kv,swap} \quad \text{(Equation 18)} $$

Both the weight loading penalty and the KV swap penalty apply. This regime should be avoided whenever possible.
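
Equations 15 to 18 can be gathered into one helper. The sketch assumes that TTFT_A equals T_compute (the standard prefill time) and takes T_weight,load as a given input, since its formula is not spelled out here. All times are in seconds.

```python
def kv_swap_seconds(v_kv_ram_gb: float, bw_transfer_gbs: float) -> float:
    """Equation 15: one-time KV swap-in latency."""
    return v_kv_ram_gb / bw_transfer_gbs

def ttft_seconds(regime: str, t_compute: float, t_weight_load: float, t_kv_swap: float) -> float:
    """TTFT per regime; assumes TTFT_A == t_compute."""
    if regime == "A":
        return t_compute
    if regime == "B":
        return t_compute + t_kv_swap                       # Equation 16
    if regime == "C":
        return max(t_compute, t_weight_load)               # Equation 17
    if regime == "D":
        return max(t_compute, t_weight_load) + t_kv_swap   # Equation 18
    raise ValueError(f"unknown regime: {regime!r}")

swap = kv_swap_seconds(14.0, 28.0)   # 14 GB of KV cache over a 28 GB/s path
for regime in "ABCD":
    print(regime, ttft_seconds(regime, 0.5, 2.0, swap))
```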

N8: Concurrency Limits with KV Offloading

When KV cache is partially stored in RAM, the number of simultaneously served users splits into two groups. Active users have their KV cache entirely in VRAM and experience no swap penalty; swapped users have their KV in RAM and pay the swap cost on context switches:

$$ U_{active} = \lfloor V_{kv,VRAM}^{max} / V_{kv,\text{per user}} \rfloor \quad \text{(Equation 19)} $$

$$ U_{swapped} = \lfloor V_{RAM,\text{for KV}} / V_{kv,\text{per user}} \rfloor \quad \text{(Equation 20)} $$

$$ U_{total} = U_{active} + U_{swapped} \quad \text{(Equation 21)} $$
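
A minimal sketch of Equations 19 to 21, with illustrative inputs: 19 GB of VRAM left for KV cache, 64 GB of RAM reserved for KV cache, and 1.25 GB of KV cache per user.

```python
import math

def concurrency(kv_vram_max_gb: float, ram_for_kv_gb: float, kv_per_user_gb: float):
    """Equations 19-21: users served from VRAM, from RAM, and in total."""
    u_active = math.floor(kv_vram_max_gb / kv_per_user_gb)    # Equation 19
    u_swapped = math.floor(ram_for_kv_gb / kv_per_user_gb)    # Equation 20
    return u_active, u_swapped, u_active + u_swapped          # Equation 21

print(concurrency(19.0, 64.0, 1.25))
```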

Concurrency Parameters:

N9: Effective Throughput with KV Swapping

Total throughput with KV swapping accounts for the time spent swapping vs. generating:

$$ TPS_{eff} = U_{active} \times TPS + U_{swapped} \times TPS \times \eta_{swap} \quad \text{(Equation 22)} $$

where the swap efficiency η_swap depends on the ratio of swap time to generation time per context:

$$ \eta_{swap} = 1 / (1 + T_{kv,swap} / T_{gen}) \quad \text{(Equation 23)} $$

where T_gen is the total time spent generating tokens for a single user’s response before switching context, and η_swap represents the resulting swap efficiency penalty.

For long conversations (T_gen ≫ T_kv,swap), η_swap ≈ 1 and the swap overhead is negligible. For short exchanges, swap overhead is more significant.
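
Equations 22 and 23 can be sketched as below. The user counts and per-user decode speed are illustrative values.

```python
def swap_efficiency(t_kv_swap: float, t_gen: float) -> float:
    """Equation 23: fraction of a swapped user's time slot spent generating."""
    return 1.0 / (1.0 + t_kv_swap / t_gen)

def effective_tps(u_active: int, u_swapped: int, tps: float, eta_swap: float) -> float:
    """Equation 22: aggregate throughput across active and swapped users."""
    return u_active * tps + u_swapped * tps * eta_swap

eta_short = swap_efficiency(1.0, 1.0)    # swap takes as long as generation
eta_long = swap_efficiency(0.5, 50.0)    # long response, swap nearly free
print(eta_short, round(eta_long, 3))
print(effective_tps(15, 51, 20.0, eta_short))
```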

N10: Quantization vs. Offload Decision Threshold

When weights do not fit in VRAM, there is a strategic choice: (1) quantize weights more aggressively to fit in VRAM, or (2) keep higher precision but offload to RAM. The Bus Wall makes option 2 almost always inferior:

$$ b_{fit} = (V_{VRAM} - V_{overhead}) / P_{total} \quad \text{(Equation 24)} $$

where b_fit is the maximum bytes per parameter that allows all weights to fit in VRAM. The decision rule is:

$$ \text{Quantize if } f_{RAM} \times \text{Bus Wall Ratio} > 2 \quad \text{(Equation 25)} $$

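Since typical Bus Wall ratios range from 30–100×, even a small fraction of weights in RAM creates a severe performance penalty. For example, a 70B model at Q4 (35 GB) fits in a single 80 GB A100 with room for KV cache. The same model at FP16 (140 GB) requires offloading at least 60 GB to RAM. By Equation 11, with BW_HBM = 2000 GB/s and BW_transfer ≈ 28 GB/s (PCIe Gen4 x16), each token then takes about 80/2000 + 60/28 ≈ 2.2 s, versus 35/2000 ≈ 0.018 s for the Q4 model held entirely in VRAM. That is a slowdown of more than 100×, far worse than any quality loss from Q4 quantization.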
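
Equations 24 and 25 can be sketched as two small helpers. VRAM and overhead are assumed to be in GB and P_total is a raw parameter count, so a factor of 10^9 is applied explicitly; the 8 GB overhead is an illustrative value.

```python
def max_bytes_per_param(v_vram_gb: float, v_overhead_gb: float, p_total: float) -> float:
    """Equation 24: largest b_param for which all weights fit in VRAM."""
    return (v_vram_gb - v_overhead_gb) * 1e9 / p_total

def should_quantize(f_ram: float, bus_wall_ratio: float) -> bool:
    """Equation 25: quantize when f_RAM * Bus Wall Ratio exceeds 2."""
    return f_ram * bus_wall_ratio > 2

b_fit = max_bytes_per_param(80.0, 8.0, 70e9)
print(round(b_fit, 3))                             # bytes/param budget for a 70B model
print(should_quantize(60 / 140, 2000 / 28.35))     # FP16 70B on an 80 GB A100
```

In this example the budget is just above 1.00 byte per parameter, so FP8 / INT8 fits while Q8_0 (1.06) does not.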

Best Practices for Offloading

  1. Quantize weights first: Reduce b_param until V_weights ≤ V_VRAM. This avoids Regimes C/D entirely.
  2. Quantize KV cache second: If KV cache still overflows after weight quantization, reduce b_kv to FP8 or INT8. This halves or quarters the KV memory.
  3. Use KV offloading for concurrency: Regime B (KV in RAM, weights in VRAM) is acceptable for multi-user serving because decode speed is preserved.
  4. Hardware matching: DDR5 + PCIe Gen5 makes KV swap faster (57 GB/s vs 28 GB/s for Gen4), reducing the Regime B penalty.

4. Connectivity & Bandwidth Architecture

This section provides the complete theoretical foundation for the data transfer paths that determine inference performance. Understanding these paths is essential for accurate performance estimation, especially when RAM offloading or multi-GPU configurations are involved.

4.1 The Data Path Hierarchy

Data involved in LLM inference traverses a strict hierarchy of interconnects, each with vastly different bandwidth characteristics:

| Path | Interconnect | BW Range | Use Case |
|---|---|---|---|
| GPU HBM | On-die bus | 900–4800 GB/s | VRAM-resident weight reading |
| NVLink/NVSwitch | GPU-GPU link | 400–900 GB/s | Tensor Parallelism all-reduce |
| PCIe | CPU-GPU bus | 7–57 GB/s eff. | RAM offload data transfer |
| System RAM | Memory bus | 43–304 GB/s eff. | Offloaded weight storage |

The key insight is that the off-GPU paths (PCIe and system RAM) are roughly 10–100× slower than GPU HBM. This creates a “bandwidth cliff” whenever inference must read data over those slower paths.

4.2 PCIe Bus Architecture

PCI Express (PCIe) is the primary data highway between the CPU and GPU. Its bandwidth is determined by three factors: the generation (signaling rate), the number of lanes (bus width), and practical protocol efficiency.

PCIe Bandwidth Calculation

Theoretical PCIe bandwidth is calculated as:

$$ BW_{PCIe,theoretical} = R_{GT/s} \times N_{lanes} \times (128/130) \div 8 \quad \text{(Equation 26)} $$

where R_GT/s is the transfer rate per lane in GT/s, N_lanes is the number of lanes (typically 16, 8, or 4), the 128/130 factor accounts for the 128b/130b line encoding used by PCIe 3.0 through 5.0, and the division by 8 converts bits to bytes.

| Generation | GT/s/lane | x16 (GB/s) | x8 (GB/s) |
|---|---|---|---|
| PCIe 3.0 | 8 | 15.75 | 7.88 |
| PCIe 4.0 | 16 | 31.50 | 15.75 |
| PCIe 5.0 | 32 | 63.00 | 31.50 |

Practical PCIe Efficiency

Theoretical bandwidth is never fully achieved. Protocol overhead reduces practical throughput:

$$ BW_{PCIe,effective} = BW_{PCIe,theoretical} \times \eta_{PCIe} \quad \text{(Equation 27)} $$

where η_PCIe ≈ 0.90 for large sequential DMA transfers. The overhead comes from:

For small random transfers (e.g., individual parameter updates), efficiency can drop to 40–70%. The calculator uses η_PCIe = 0.90 as a conservative estimate for the large sequential DMA transfers characteristic of weight loading during inference.
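
Equations 26 and 27 can be sketched in a single function. The optional `efficiency` argument stands for η_PCIe; leaving it at 1.0 gives the theoretical figure.

```python
def pcie_bandwidth_gbs(gt_per_s: float, lanes: int, efficiency: float = 1.0) -> float:
    """Equation 26 (theoretical), scaled by eta_PCIe for Equation 27."""
    return gt_per_s * lanes * (128 / 130) / 8 * efficiency

for gen, rate in (("3.0", 8), ("4.0", 16), ("5.0", 32)):
    theo = pcie_bandwidth_gbs(rate, 16)
    eff = pcie_bandwidth_gbs(rate, 16, efficiency=0.90)
    print(f"PCIe {gen} x16: {theo:.2f} GB/s theoretical, {eff:.2f} GB/s effective")
```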

PCIe Lane Width Impact

Most desktop and server GPUs connect via x16 (16 lanes). However, some configurations reduce the effective lane width:

Reducing from x16 to x8 exactly halves the available bandwidth, which significantly impacts RAM offload performance.

4.3 System RAM Bandwidth

Theoretical RAM Bandwidth

System RAM bandwidth is determined by the memory type, transfer rate, and number of channels:

$$ BW_{RAM,theoretical} = \text{MT/s} \times N_{channels} \times 8 \ \text{bytes/transfer} \quad \text{(Equation 28)} $$

| Configuration | MT/s | Channels | BW (GB/s) |
|---|---|---|---|
| DDR4-3200 2-ch | 3200 | 2 | 51.2 |
| DDR4-3200 4-ch | 3200 | 4 | 102.4 |
| DDR4-3200 8-ch | 3200 | 8 | 204.8 |
| DDR5-5600 2-ch | 5600 | 2 | 89.6 |
| DDR5-5600 4-ch | 5600 | 4 | 179.2 |
| DDR5-5600 8-ch | 5600 | 8 | 358.4 |

Practical RAM Efficiency

As with PCIe, theoretical RAM bandwidth is not fully achievable:

$$ BW_{RAM,effective} = BW_{RAM,theoretical} \times \eta_{RAM} \times f_{NUMA} \quad \text{(Equation 29)} $$

where η_RAM ≈ 0.85 accounts for DRAM refresh cycles, row misses, and memory controller scheduling, and f_NUMA is the NUMA efficiency factor.

NUMA Effects on Multi-Socket Systems

On multi-socket servers (AMD EPYC, Intel Xeon), each CPU socket has its own memory controller and attached RAM. Accessing RAM on the local socket is fast, but accessing RAM attached to a remote socket traverses an inter-socket link (AMD Infinity Fabric or Intel UPI) that adds latency and reduces bandwidth by 30–50%:

$$ f_{NUMA} = \begin{cases} 1.0 & \text{if NUMA-aware (weights on local socket)} \\ 0.65 & \text{if not NUMA-aware (potential cross-socket access)} \end{cases} \quad \text{(Equation 30)} $$

The calculator uses f_NUMA = 0.65 when NUMA awareness is disabled, reflecting the worst case where weight data may be allocated on a remote socket. For production deployments, always enable NUMA-aware allocation and pin the GPU to the same NUMA node as the RAM holding the offloaded weights.
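
Equations 28 to 30 chain together as shown in this sketch. The product MT/s × channels × 8 is in MB/s, so the code divides by 1000 to obtain GB/s as listed in the table.

```python
def ram_bandwidth_gbs(mt_per_s: float, channels: int,
                      eta_ram: float = 0.85, numa_aware: bool = True):
    """Return (theoretical, effective) RAM bandwidth in GB/s."""
    theoretical = mt_per_s * channels * 8 / 1000      # Equation 28, MB/s -> GB/s
    f_numa = 1.0 if numa_aware else 0.65              # Equation 30
    effective = theoretical * eta_ram * f_numa        # Equation 29
    return theoretical, effective

print(ram_bandwidth_gbs(3200, 2))                     # DDR4-3200 2-ch
print(ram_bandwidth_gbs(5600, 8, numa_aware=False))   # DDR5-5600 8-ch, cross-socket risk
```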

4.4 The Transfer Chain: RAM → CPU → PCIe → GPU

When RAM-offloaded layers are accessed during inference, data must traverse the entire transfer chain:

  1. RAM read: Data is read from DRAM through the CPU memory controller
  2. CPU processing: The CPU’s IOMMU maps the DMA buffer address
  3. PCIe DMA transfer: Data is transferred via DMA from the pinned CPU buffer to GPU VRAM
  4. GPU reception: The GPU’s PCIe controller receives the data into VRAM

The bottleneck in this chain is the slower of PCIe effective bandwidth and RAM effective bandwidth:

$$ BW_{transfer} = \min(BW_{PCIe,effective}, BW_{RAM,effective}) \quad \text{(Equation 31)} $$

In most configurations, PCIe is the bottleneck. Even DDR4 2-channel at 51.2 GB/s theoretical (≈43 GB/s effective) exceeds PCIe Gen4 x8 at ≈14 GB/s effective. However, with fast PCIe Gen5 x16 (≈57 GB/s effective), slower RAM configurations (DDR4 2-channel) can become the bottleneck instead.

4.5 GPU Internal Architecture & HBM

High Bandwidth Memory (HBM)

GPU HBM is the fastest memory in the inference data path, connected to the GPU die via an ultra-wide bus (3072–6144 bits) on an organic or silicon interposer. This provides bandwidth of 900–4800 GB/s, one to two orders of magnitude faster than the PCIe link to the host.

| Memory Type | Example GPU | Bandwidth | Bus Width |
|---|---|---|---|
| GDDR6X | RTX 4090 | 1008 GB/s | 384-bit |
| HBM2e | A100 | 2000 GB/s | 5120-bit |
| HBM3 | H100 SXM | 3350 GB/s | 5120-bit |
| HBM3e | H200 SXM | 4800 GB/s | 6144-bit |

LLM decode is almost always HBM-bandwidth-bound: each token generation requires reading ALL active weights from HBM, but only performs 2/b FLOPs per byte of weight data (where b is bytes per parameter). This arithmetic intensity is far below the GPU’s ridge point, confirming the bandwidth-bound nature of decode.

GPU Clock Speed & Compute Throughput

GPU clock speed (typically 1.5–2.5 GHz) affects compute throughput (TFLOPS) but has limited impact on bandwidth-bound inference. The relationship between clock speed and compute throughput is:

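$$ TFLOPS = N_{cores} \times f_{clock} \times 2 / 1000 \quad \text{(Equation 32)} $$

where N_cores is the number of CUDA cores, f_clock is the GPU clock frequency in GHz, and the factor of 2 counts the two floating-point operations of one fused multiply-add (FMA) per core per clock. The division by 1000 converts GFLOPS to TFLOPS.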

Note: This formula applies to standard CUDA scalar cores only. Tensor Core throughput is architecture- and precision-specific (e.g., H100 FP16 dense ≈ 989 TFLOPS) and must be taken from vendor specification tables — it cannot be derived from this formula.

For bandwidth-bound decode, increasing clock speed provides no benefit — the bottleneck is HBM read speed, not compute. Clock speed only matters for compute-bound prefill, where higher TFLOPS directly translates to faster prompt processing.

The Roofline Model

The roofline model provides a unified framework for understanding whether a workload is compute-bound or bandwidth-bound:

Definition — Arithmetic Intensity:
Arithmetic intensity is the ratio of FLOPs performed to bytes of data accessed:

$$ AI = \text{FLOPs} / \text{Bytes accessed} \quad \text{(Equation 33)} $$

For LLM decode with active parameters P_active stored at b bytes per parameter:

$$ AI_{decode} = (2 \times P_{active}) / (P_{active} \times b) = 2/b \quad \text{(Equation 34)} $$

At Q4 (b = 0.5), the arithmetic intensity is only 4 FLOP/byte. At FP16 (b = 2), it drops to 1 FLOP/byte. These values are far below the GPU’s ridge point (the arithmetic intensity at which compute and bandwidth are equally limiting):

$$ AI_{ridge} = TFLOPS / BW_{HBM} \quad \text{(Equation 35)} $$

For H100, using the dense FP16 figure as a conservative ceiling: AI_ridge = (989 × 10^12) / (3,350 × 10^9) ≈ 295 FLOP/byte (dense, non-sparse)

Using the sparse figure from NVIDIA’s spec sheet: AI_ridge = (1,979 × 10^12) / (3,350 × 10^9) ≈ 591 FLOP/byte (sparse, 2:4 structured)

Either way, LLM decode arithmetic intensity (1–4 FLOP/byte) is far below both ridge points, confirming bandwidth-bound execution regardless of convention used.

For prefill, the arithmetic intensity is much higher because the same weights are reused across all prompt tokens in a single batched matrix multiplication. Prefill is typically compute-bound.
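
The roofline comparison for decode can be sketched with Equations 34 and 35. The explicit 10^12 and 10^9 factors convert TFLOPS and GB/s to base units so that the ratio comes out in FLOP/byte.

```python
def decode_intensity(bytes_per_param: float) -> float:
    """Equation 34: arithmetic intensity of decode, in FLOP/byte."""
    return 2.0 / bytes_per_param

def ridge_point(tflops: float, bw_gbs: float) -> float:
    """Equation 35: intensity at which compute and bandwidth limits meet."""
    return (tflops * 1e12) / (bw_gbs * 1e9)

h100_ridge = ridge_point(989, 3350)   # H100 SXM, dense FP16
for name, b in (("Q4", 0.5), ("FP16", 2.0)):
    ai = decode_intensity(b)
    bound = "bandwidth-bound" if ai < h100_ridge else "compute-bound"
    print(f"{name}: {ai} FLOP/byte vs ridge {h100_ridge:.0f} -> {bound}")
```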

4.6 Multi-GPU Interconnectivity

NVLink is NVIDIA’s proprietary high-speed GPU-to-GPU interconnect. It provides dramatically higher bandwidth than PCIe, enabling efficient Tensor Parallelism (TP) where model weights are split across multiple GPUs.

| Generation | Links | BW per GPU | Example GPU |
|---|---|---|---|
| NVLink 2 | 6 | 300 GB/s | V100 |
| NVLink 3 | 12 | 600 GB/s | A100 |
| NVLink 4 | 18 | 900 GB/s | H100/H200 |

NVLink uses a point-to-point topology: each GPU has direct links to specific other GPUs. In a 4-GPU system, this creates a mesh where each GPU connects to every other GPU. In an 8-GPU system, each GPU typically connects to 4 neighbors, and multi-hop routing is required for non-adjacent communication.

NVSwitch

NVSwitch is NVIDIA’s switching fabric that provides full all-to-all connectivity between all GPUs in a node. Unlike point-to-point NVLink, NVSwitch allows any GPU to communicate with any other GPU at full NVLink speed simultaneously.

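All-reduce with NVSwitch:

$$ \text{All-reduce}_{NVSwitch} = 2 \times (h \times 2) / BW_{NVLink} \quad \text{(Equation 36)} $$

vs. ring all-reduce on point-to-point NVLink:

$$ \text{All-reduce}_{ring} = 2 \times ((N-1)/N) \times (h \times 2) / BW_{NVLink} \quad \text{(Equation 37)} $$

where h is the hidden size and N is the number of GPUs. Because (N−1)/N < 1, the ring's bandwidth term is in fact slightly smaller than the NVSwitch term; the ring's real cost is its 2(N−1) sequential communication steps, each of which adds link latency. NVSwitch removes this multi-hop latency, providing significantly better TP efficiency for 4+ GPUs.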
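
The bandwidth terms of Equations 36 and 37 can be evaluated with the sketch below. It assumes that the factor "h × 2" means h elements at 2 bytes each (FP16 activations); the hidden size of 8192 and the 900 GB/s link are illustrative inputs.

```python
def allreduce_seconds(hidden: int, n_gpus: int, bw_gbs: float,
                      ring: bool = True, bytes_per_elem: int = 2) -> float:
    """Bandwidth term of an all-reduce over one hidden-state vector."""
    payload_bytes = hidden * bytes_per_elem          # assumed meaning of "h x 2"
    factor = (n_gpus - 1) / n_gpus if ring else 1.0  # Equation 37 vs Equation 36
    return 2 * factor * payload_bytes / (bw_gbs * 1e9)

t_switch = allreduce_seconds(8192, 8, 900.0, ring=False)
t_ring = allreduce_seconds(8192, 8, 900.0, ring=True)
print(f"NVSwitch: {t_switch * 1e9:.1f} ns, ring: {t_ring * 1e9:.1f} ns")
```

Both values are tens of nanoseconds in this example, far below the microsecond-scale per-layer latency λ_AR used in Equation 38.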

AMD Infinity Fabric

AMD’s Infinity Fabric serves a similar role to NVLink on MI-series GPUs. The MI300X is a multi-chiplet accelerator integrating 8 compute dies (XCDs) and 4 I/O dies, 12 chiplets in total, connected via AMD Infinity Fabric within the package. It provides 192 GB of HBM3 memory at 5,300 GB/s aggregate bandwidth. For multi-GPU scaling, each discrete MI300X offers a 16-lane PCIe Gen 5 host interface and seven external AMD Infinity Fabric links (each at 128 GB/s bidirectional), enough for a direct link to each of the other seven GPUs in an eight-GPU node (a fully connected topology rather than a ring).

PCIe Peer-to-Peer (P2P)

Without NVLink or Infinity Fabric, GPUs can communicate via PCIe peer-to-peer transfers. This uses the PCIe bus for direct GPU-to-GPU communication without CPU involvement, but the bandwidth is limited to PCIe speeds (typically 15–57 GB/s effective), making TP inefficient for all but the smallest models.

4.7 Tensor Parallelism vs. Pipeline Parallelism

Tensor Parallelism (TP)

TP splits model weights across N GPUs. Each GPU holds 1/N of the weights and performs the corresponding portion of each matrix multiplication. After each transformer layer, an all-reduce synchronizes the partial results across all GPUs.

The decode time per token with TP is:

$$ T_{TP} = (P_{active} \times b) / (N \times BW_{HBM}) + L \times ((2 \times (N-1)/N \times h \times 2) / BW_{interconnect} + \lambda_{AR}) \quad \text{(Equation 38)} $$

where λ_AR is the all-reduce latency per layer (5–100 μs depending on interconnect).

TP efficiency is:

$$ \eta_{TP} = T_{compute} / (T_{compute} + T_{communication}) \quad \text{(Equation 39)} $$

| Interconnect | 2-GPU | 4-GPU | 8-GPU |
|---|---|---|---|
| NVSwitch (H100) | 96% | 93% | 88% |
| NVLink P2P (H100) | 94% | 87% | 75% |
| PCIe Gen4 x16 | 65% | 40% | 20% |

TP is the preferred strategy when NVLink or NVSwitch is available, as it provides near-linear scaling with minimal latency overhead.
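
Equations 38 and 39 can be sketched together. The code treats the first term of Equation 38 as T_compute and the per-layer all-reduce sum as T_communication, which is an interpretation rather than something stated above. The inputs (a 70B FP16 model with 80 layers and hidden size 8192, and a 5 μs all-reduce latency) are illustrative, so the result is not expected to reproduce the table exactly.

```python
def tp_decode(p_active: float, b: float, n_gpus: int, bw_hbm_gbs: float,
              n_layers: int, hidden: int, bw_link_gbs: float, latency_s: float):
    """Return (seconds per token, TP efficiency) per Equations 38 and 39."""
    t_compute = p_active * b / (n_gpus * bw_hbm_gbs * 1e9)
    t_allreduce = (2 * (n_gpus - 1) / n_gpus * hidden * 2) / (bw_link_gbs * 1e9) + latency_s
    t_comm = n_layers * t_allreduce
    return t_compute + t_comm, t_compute / (t_compute + t_comm)

t_token, eff = tp_decode(70e9, 2.0, 4, 3350.0, 80, 8192, 900.0, 5e-6)
print(f"{1 / t_token:.1f} tok/s, efficiency {eff:.1%}")
```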

Pipeline Parallelism (PP)

PP splits model layers across GPUs sequentially. Each GPU processes a contiguous group of layers and passes the activation tensor to the next GPU. PP requires much less communication than TP — only the activation tensor (hidden size × batch × precision) is sent between stages, not an all-reduce.

However, PP introduces pipeline bubbles — idle time where GPUs wait for preceding stages to complete:

$$ \text{Bubble fraction} = (N-1) / (N + M - 1) \quad \text{(Equation 40)} $$

where M is the number of micro-batches. For single-user inference with batch size 1, only 1 micro-batch is possible, giving bubble fraction (N−1)/N.
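
A minimal sketch of Equation 40 shows how quickly additional micro-batches shrink the bubble; the stage and micro-batch counts are illustrative.

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Equation 40: idle fraction of a pipeline with N stages and M micro-batches."""
    return (n_stages - 1) / (n_stages + n_microbatches - 1)

print(bubble_fraction(4, 1))    # single-user, batch size 1
print(bubble_fraction(4, 13))   # many micro-batches in flight
```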

PP is preferred when NVLink is unavailable (e.g., consumer GPUs connected via PCIe) or for cross-node deployment where network latency makes TP impractical.

Combined TP + PP

For very large models on many GPUs, both strategies can be combined: TP within a node (using NVLink) and PP across nodes (using network). For example, a 405B model on 8 H100s might use TP=4 within each of two 4-GPU nodes and PP=2 across the nodes.

4.8 Combined Performance Model

The complete decode time model accounts for all connectivity factors:

$$ T_{decode} = (1-f_{RAM}) \times W / (\eta_{TP} \times N \times BW_{HBM}) + f_{RAM} \times W / \min(\eta_{PCIe} \times BW_{PCIe}, \eta_{RAM} \times f_{NUMA} \times BW_{RAM}) \quad \text{(Equation 41)} $$

This formula captures the essential physics of LLM inference:


5. Performance Estimation

5.1 Decode Speed (Tokens per Second)

During autoregressive decoding, each token generation requires loading all model weights from GPU memory. This makes inference bandwidth-bound:

$$ TPS = BW / (P_{active} \times b_{param} / 10^9) \quad \text{(Equation 42)} $$

where BW is the GPU memory bandwidth in GB/s and the division by 10^9 converts the active weight size from bytes to GB (the quantity W of Equation 13), consistent with Equation 14. For MoE models, P_active (the number of parameters active per forward pass) is used instead of P_total, since only the routed experts are loaded during decode.

Definition — Active Parameters:
For dense models, P_active = P_total. For MoE models, P_active represents only the parameters used per forward pass, including the embedding layer, shared experts, and the k routed experts per token. For example, Mixtral 8x7B has P_total = 47B but P_active ≈ 13B.

5.2 Time to First Token (TTFT)

The prefill phase processes the entire prompt in parallel. The time to first token is estimated as:

$$ TTFT = (P_{active} \times C \times 2 \times b_{param}) / BW \times 1000 ms \quad \text{(Equation 43)} $$

where C is the prompt length in tokens. The factor of 2 accounts for the read and write of activations during the forward pass.

5.3 Extended Performance Metrics

The calculator provides additional derived metrics for practical deployment planning:

$$ \text{Latency per token} = 1000 / TPS \quad \text{ms (Equation 44)} $$

$$ Time_{100} = TTFT/1000 + 100/TPS \quad \text{seconds (Equation 45)} $$

$$ Time_{1000} = TTFT/1000 + 1000/TPS \quad \text{seconds (Equation 46)} $$

$$ \text{Throughput} = TPS \times U \quad \text{tok/s total (Equation 47)} $$
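
Equations 44 to 47 can be sketched as one helper. TTFT is taken in milliseconds, as in Equations 45 and 46; the inputs are illustrative.

```python
def extended_metrics(tps: float, ttft_ms: float, users: int = 1) -> dict:
    """Equations 44-47."""
    return {
        "latency_ms_per_token": 1000 / tps,             # Equation 44
        "time_100_s": ttft_ms / 1000 + 100 / tps,       # Equation 45
        "time_1000_s": ttft_ms / 1000 + 1000 / tps,     # Equation 46
        "throughput_tok_s": tps * users,                # Equation 47
    }

print(extended_metrics(tps=50.0, ttft_ms=500.0, users=4))
```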


6. Power & Cost Estimation

6.1 Power Model

GPU power consumption is estimated from the Thermal Design Power (TDP) and utilization:

$$ P_{draw} = TDP \times (U_{GPU} / 100) \quad \text{(Equation 48)} $$

where U_GPU is the GPU utilization percentage. LLM inference is typically memory-bandwidth-bound rather than compute-bound, resulting in 60–90% utilization during decode.

6.2 Energy and Cost Calculations

$$ E_{hour} = P_{draw} / 1000 \quad \text{kWh (Equation 49)} $$

$$ C_{hour} = E_{hour} \times R_{elec} \quad \text{(Equation 50)} $$

$$ C_{day} = C_{hour} \times H_{day} \quad \text{(Equation 51)} $$

$$ C_{month} = C_{day} \times 30 \quad \text{(Equation 52)} $$

$$ C_{1M\,tok} = (C_{hour} / (TPS \times 3600)) \times 10^6 \quad \text{(Equation 53)} $$

where R_elec is the electricity rate ($/kWh), H_day is operating hours per day, and TPS is the decode speed.
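
The cost chain of Equations 48 to 53 can be sketched as follows. The inputs (700 W TDP, 80% utilization, 0.20 $/kWh, 24 hours per day, 50 tok/s) are hypothetical values chosen for illustration.

```python
def cost_model(tdp_w: float, util_pct: float, rate_per_kwh: float,
               hours_per_day: float, tps: float) -> dict:
    """Equations 48-53; TDP in watts, utilization in percent."""
    p_draw = tdp_w * util_pct / 100          # Equation 48, watts
    e_hour = p_draw / 1000                   # Equation 49, kWh per hour
    c_hour = e_hour * rate_per_kwh           # Equation 50
    c_day = c_hour * hours_per_day           # Equation 51
    c_month = c_day * 30                     # Equation 52
    c_1m_tok = c_hour / (tps * 3600) * 1e6   # Equation 53
    return {"p_draw_w": p_draw, "e_hour_kwh": e_hour, "c_hour": c_hour,
            "c_day": c_day, "c_month": c_month, "c_1m_tok": c_1m_tok}

costs = cost_model(700, 80, 0.20, 24, 50)
print({k: round(v, 4) for k, v in costs.items()})
```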

6.3 Carbon Emissions

$$ CO_{2,hour} = E_{hour} \times I_{carbon} \quad \text{(Equation 54)} $$

$$ CO_{2,annual} = CO_{2,hour} \times H_{day} \times 365 / 1000 \quad \text{tonnes (Equation 55)} $$

where I_carbon is the grid carbon intensity in kg CO₂/kWh. Default values and regional references:

| Region | kg CO₂/kWh |
|---|---|
| World average | 0.417 |
| European Union | 0.255 |
| United States | 0.387 |
| France (nuclear) | 0.056 |
| Sweden (hydro) | 0.045 |
| Poland (coal) | 0.769 |
| China | 0.555 |
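
Equations 54 and 55 can be sketched with intensities taken from the table. The 0.56 kWh hourly energy figure is a hypothetical input.

```python
CARBON_INTENSITY = {"World average": 0.417, "France (nuclear)": 0.056, "Poland (coal)": 0.769}

def co2_emissions(e_hour_kwh: float, region: str, hours_per_day: float):
    """Return (kg CO2 per hour, tonnes CO2 per year) per Equations 54 and 55."""
    co2_hour = e_hour_kwh * CARBON_INTENSITY[region]      # Equation 54
    co2_annual = co2_hour * hours_per_day * 365 / 1000    # Equation 55
    return co2_hour, co2_annual

for region in CARBON_INTENSITY:
    hourly, annual = co2_emissions(0.56, region, 24)
    print(f"{region}: {hourly:.3f} kg/h, {annual:.2f} t/year")
```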

7. Fine-Tuning Memory Model

For fine-tuning, the memory model extends to include gradients and optimizer states:

$$ V_{FT} = V_{weights} + V_{gradients} + V_{optimizer} + V_{activations} \quad \text{(Equation 56)} $$

7.1 LoRA / QLoRA

With LoRA, only low-rank adaptation matrices are trained. The additional memory is:

$$ V_{LoRA} \approx 0.40 \times V_{weights} \quad \text{(Equation 57)} $$

This accounts for LoRA weights (typically <1% of base weights), their gradients (FP32), and 8-bit AdamW states.

7.2 Full Fine-Tuning

Full fine-tuning requires gradients for all parameters and optimizer states:

$$ V_{gradients} = P_{total} \times b_{train} \quad \text{(Equation 58)} $$

$$ V_{optimizer} = P_{total} \times 8 \quad \text{(AdamW FP32: momentum + variance) (Equation 59)} $$

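Together with the weights themselves, the calculator's overhead factor of 2.0 yields a total of approximately 3× the weight memory. Evaluated explicitly for FP16 training with FP32 AdamW states, Equations 58 and 59 add 2 + 8 = 10 bytes per parameter on top of 2 bytes of weights, that is, about 6× the FP16 weight memory before activations, so the factor-based estimate should be read as a lower bound.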
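
Equations 56, 58 and 59 can be evaluated term by term, as in this sketch. Activation memory depends on batch size and sequence length and is left as a caller-supplied input; the 7B model is an illustrative choice.

```python
def full_finetune_gb(p_total: float, b_weights: float = 2.0, b_train: float = 2.0,
                     activations_gb: float = 0.0) -> dict:
    """Equation 56 with gradients (Eq. 58) and FP32 AdamW states (Eq. 59), in GB."""
    weights = p_total * b_weights / 1e9
    gradients = p_total * b_train / 1e9      # Equation 58
    optimizer = p_total * 8 / 1e9            # Equation 59: momentum + variance
    total = weights + gradients + optimizer + activations_gb
    return {"weights": weights, "gradients": gradients,
            "optimizer": optimizer, "total": total}

print(full_finetune_gb(7e9))   # 7B model, FP16 weights and gradients
```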


8. Quantization Reference

8.1 GGUF Quantization Levels

GGUF is the binary file format introduced by the llama.cpp project as a successor to the older GGML format. The name is often expanded as “GPT-Generated Unified Format”, but this is not an official acronym, and it is commonly written simply as “GGUF”. It is the standard format for llama.cpp and compatible inference engines. The following table lists supported quantization levels:

| Level | Label | B/param | Description |
|---|---|---|---|
| Q2_K | 2-bit K-quant | 0.32 | Aggressive 2-bit quantization, K-quants method |
| Q3_K_S | 3-bit small | 0.34 | 3-bit quantization, small variant |
| Q3_K_M | 3-bit medium | 0.43 | 3-bit quantization, medium variant |
| Q3_K_L | 3-bit large | 0.45 | 3-bit quantization, large variant |
| Q4_0 | 4-bit base | 0.56 | Basic 4-bit quantization |
| Q4_K_S | 4-bit small K | 0.58 | 4-bit K-quant, small variant |
| Q4_K_M | 4-bit medium K | 0.60 | 4-bit K-quant, medium variant |
| Q5_0 | 5-bit base | 0.68 | Basic 5-bit quantization |
| Q5_K_S | 5-bit small K | 0.69 | 5-bit K-quant, small variant |
| Q5_K_M | 5-bit medium K | 0.71 | 5-bit K-quant, medium variant |
| Q6_K | 6-bit K-quant | 0.83 | 6-bit K-quant |
| Q8_0 | 8-bit quant | 1.06 | Near-FP16 quality, 8-bit quantization |

8.2 Weighted Quantization Methods

Beyond GGUF, the calculator recognizes several weighted quantization formats commonly found on HuggingFace:

| Method | Typical Bits | Characteristics |
|---|---|---|
| GPTQ | 2–8 bits | Post-training quantization with calibration dataset; group-wise quantization with optional desc_act; commonly 4-bit with group_size=128 |
| AWQ | 4–8 bits | Activation-aware weight quantization; preserves salient weights; group-wise with zero-point |
| EXL2 | 2–8 bits | ExLlamaV2 format; mixed-precision per-layer quantization; optimized for ExLlamaV2 inference engine |
| BNB/NF4 | 4 bits | BitsAndBytes NF4 quantization; default for QLoRA; double quantization support |

9. HuggingFace Integration

9.1 Model Metadata Retrieval

The calculator fetches model metadata from the HuggingFace API using the endpoint:

GET https://huggingface.co/api/models/{model_id}

This returns the model card data including parameter counts (from safetensors.total), tensor types (from safetensors.parameters), siblings (file list), and default branch information.
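
A minimal sketch of this retrieval step, using only the Python standard library, might look as follows. The helper names are hypothetical. The `rfilename` key for sibling entries is what the API is understood to return and should be treated as an assumption here. The calculator itself runs in the browser, so this is an illustration rather than its actual code.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://huggingface.co/api/models/"


def model_url(model_id: str) -> str:
    """Build the metadata endpoint URL for a model ID such as 'org/name'."""
    return API_ROOT + model_id


def summarize_metadata(payload: dict) -> dict:
    """Extract the fields named above from a decoded API response."""
    safetensors = payload.get("safetensors") or {}
    return {
        "total_params": safetensors.get("total"),
        "dtype_counts": safetensors.get("parameters", {}),
        # Assumption: each sibling is an object with an 'rfilename' key.
        "files": [s.get("rfilename") for s in payload.get("siblings", [])],
    }


def fetch_metadata(model_id: str, timeout: float = 10.0) -> dict:
    """Perform the GET request and summarize the response (needs network)."""
    with urlopen(model_url(model_id), timeout=timeout) as response:
        return summarize_metadata(json.load(response))
```

Keeping the parsing separate from the network call makes the field extraction easy to check against a saved response.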

9.2 Tensor Type Detection

Tensor type detection follows a 4-level priority chain:

  1. quantization_config in config.json: Highest priority for explicitly quantized models (GPTQ, AWQ, EXL2, BNB).
  2. safetensors.parameters from the API: Shows actual dtype names of stored tensors (e.g., F16, Q8_0).
  3. Safetensors index metadata: dtype fields in model.safetensors.index.json.
  4. torch_dtype in config.json: Fallback for unquantized models (typically bfloat16 or float16).
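
The priority chain above can be sketched as a sequence of early returns. The function below is an illustration under stated assumptions: the argument names are hypothetical, the index metadata is assumed to expose a single `dtype` key, and picking the dtype with the largest parameter count at level 2 is a choice made for this example rather than documented behavior.

```python
from typing import Optional, Tuple


def detect_tensor_type(
    config: Optional[dict],
    api_safetensors: Optional[dict],
    index_metadata: Optional[dict],
) -> Tuple[str, Optional[str]]:
    """Return (source, tensor_type) from the first level that has data."""
    config = config or {}

    # 1. Explicit quantization config (GPTQ, AWQ, EXL2, BNB).
    quant_config = config.get("quantization_config")
    if quant_config:
        return ("quantization_config", quant_config.get("quant_method", "unknown"))

    # 2. Dtype counts from the API; assumed rule: take the most common dtype.
    params = (api_safetensors or {}).get("parameters")
    if params:
        return ("safetensors.parameters", max(params, key=params.get))

    # 3. Safetensors index metadata (assumed to carry a 'dtype' key).
    index_dtype = (index_metadata or {}).get("dtype")
    if index_dtype:
        return ("safetensors index", index_dtype)

    # 4. Fallback for unquantized models.
    torch_dtype = config.get("torch_dtype")
    if torch_dtype:
        return ("torch_dtype", torch_dtype)

    return ("unknown", None)
```

Returning the source alongside the detected type makes it visible which level of the chain produced the answer.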

9.3 Quantized Variant Discovery

The calculator automatically discovers quantized variants by:

  1. Scanning the current model’s siblings for GGUF files (parsing filenames like model-Q4_K_M.gguf).
  2. Searching HuggingFace for related repos: [base-name] GGUF, [base-name] GPTQ, [base-name] AWQ.
  3. Deduplicating and sorting variants by quantization level.
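
Steps 1 and 3 can be sketched with a regular expression over the sibling filenames. The pattern and ordering list below are assumptions for illustration: they cover only the levels listed in Section 8.1, and the calculator's real parser may accept more naming schemes.

```python
import re

# Quantization levels from Section 8.1, in ascending order of bytes/param.
QUANT_ORDER = [
    "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_0", "Q4_K_S", "Q4_K_M",
    "Q5_0", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0",
]

# Matches tags such as Q4_K_M, Q6_K, or Q8_0 anywhere in a filename.
QUANT_PATTERN = re.compile(r"(Q\d_(?:K_[SML]|K|0))", re.IGNORECASE)


def discover_gguf_variants(filenames: list) -> list:
    """Return the distinct known quantization levels found in .gguf names."""
    found = set()  # a set removes duplicates
    for name in filenames:
        if not name.lower().endswith(".gguf"):
            continue
        match = QUANT_PATTERN.search(name)
        if match and match.group(1).upper() in QUANT_ORDER:
            found.add(match.group(1).upper())
    # Sort by position in the reference ordering.
    return sorted(found, key=QUANT_ORDER.index)


variants = discover_gguf_variants([
    "model-Q4_K_M.gguf", "model-Q8_0.gguf", "model-q4_k_m.gguf",
    "README.md", "model-Q2_K.gguf",
])
print(variants)
```

Step 2, the search for related repositories, needs network access and is not covered by this sketch.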

10. Hardware Reference

Supported GPU Hardware

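| GPU | VRAM (GB) | HBM BW (GB/s) | TFLOPS FP16 (Dense) | TFLOPS FP16 (Sparse) | PCIe Gen | NVLink (GB/s) | NVS |
|---|---|---|---|---|---|---|---|
| H200 SXM | 141 | 4800 | 989 | 1979 | 5 | 900 | Yes |
| H100 SXM | 80 | 3350 | 989 | 1979 | 5 | 900 | Yes |
| H100 PCIe | 80 | 2000 | 756 | 1513 | 5 | 0 | No |
| A100 80GB | 80 | 2000 | 312 | 624 | 4 | 600 | Yes |
| A100 40GB | 40 | 1555 | 312 | 624 | 4 | 600 | Yes |
| A6000 Ada | 48 | 960 | 182 | 364 | 4 | 0 | No |
| RTX 4090 | 24 | 1008 | 165 | 330 | 4 | 0 | No |
| RTX 3090 | 24 | 936 | 71 | 142 | 4 | 0 | No |
| L40S | 48 | 864 | 366 | 733 | 4 | 0 | No |
| MI300X | 192 | 5300 | 1307 | 2614 | 5 | 400 | No |
| MI250X | 128 | 3276 | 383 | 383 | 4 | 400 | No |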

Note: Sparse TFLOPS assume a 2:4 structured sparsity pattern, effectively doubling throughput for supported operations compared to Dense matrices. Older architectures like MI250X (CDNA2) do not feature structured sparsity hardware acceleration.

PCIe Bandwidth Reference

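| Config | Theoretical | Effective (η=0.90) | Bus Wall (H200) | Bus Wall (4090) |
|---|---|---|---|---|
| Gen3 x16 | 15.75 GB/s | 14.2 GB/s | 338× | 71× |
| Gen3 x8 | 7.88 GB/s | 7.1 GB/s | 676× | 142× |
| Gen4 x16 | 31.5 GB/s | 28.4 GB/s | 169× | 36× |
| Gen4 x8 | 15.75 GB/s | 14.2 GB/s | 338× | 71× |
| Gen5 x16 | 63.0 GB/s | 56.7 GB/s | 85× | 18× |
| Gen5 x8 | 31.5 GB/s | 28.4 GB/s | 169× | 36× |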
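
The table's columns appear to follow from two simple relations: effective bandwidth is the theoretical figure multiplied by η, and the Bus Wall is the ratio of GPU memory bandwidth to effective PCIe bandwidth. The second relation is inferred from the numbers rather than stated in this section, so the sketch below should be read as an assumption. Names are hypothetical, and results agree with the table only to within rounding.

```python
ETA = 0.90  # PCIe efficiency factor used in the table

# Theoretical PCIe bandwidth in GB/s, copied from the table above.
PCIE_THEORETICAL_GBPS = {
    "Gen3 x16": 15.75, "Gen3 x8": 7.88,
    "Gen4 x16": 31.5, "Gen4 x8": 15.75,
    "Gen5 x16": 63.0, "Gen5 x8": 31.5,
}

# GPU memory bandwidth in GB/s, from the hardware table.
MEMORY_BW_GBPS = {"H200": 4800.0, "RTX 4090": 1008.0}


def effective_bandwidth(config: str, eta: float = ETA) -> float:
    """Effective PCIe bandwidth in GB/s for a link configuration."""
    return PCIE_THEORETICAL_GBPS[config] * eta


def bus_wall(gpu: str, config: str, eta: float = ETA) -> float:
    """Assumed definition: GPU memory bandwidth over effective PCIe bandwidth."""
    return MEMORY_BW_GBPS[gpu] / effective_bandwidth(config, eta)


for config in PCIE_THEORETICAL_GBPS:
    print(config,
          round(effective_bandwidth(config), 1),
          round(bus_wall("H200", config)),
          round(bus_wall("RTX 4090", config)))
```

Read this way, a Bus Wall of 338× means that data crossing a Gen3 x16 link moves roughly 338 times slower than data read from the H200's on-board memory.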

Interconnect Latency Reference

| Interconnect | Latency | Topology | Notes |
|---|---|---|---|
| NVSwitch | ~5 μs | All-to-all | Best for 4+ GPUs |
| NVLink P2P | ~10 μs | Ring/mesh | Direct GPU-GPU link |
| PCIe P2P | ~40 μs | Ring | Through root complex |
| Through CPU | ~100 μs | Multi-hop | GPU → CPU → GPU |

11. Limitations & Assumptions

  1. Roofline model: Performance estimates assume bandwidth-bound decode (confirmed by arithmetic intensity analysis). Actual performance may be compute-bound for very small models or very short sequences with high batch sizes.

  2. PCIe efficiency: The 0.90 efficiency factor is a representative estimate for large DMA transfers, at the midpoint of the measured range of 0.85 to 0.95. Real efficiency varies by motherboard, IOMMU configuration, and CPU architecture.

  3. RAM efficiency: The 0.85 factor for sequential RAM reads is an average. Actual values depend on DRAM type, frequency, and access pattern. Mixed read/write workloads achieve lower efficiency.

  4. NUMA model: The 0.65 factor for non-NUMA-aware allocation is an approximation. Actual cross-socket bandwidth depends on the inter-socket link (AMD Infinity Fabric, Intel UPI), memory topology, and system firmware configuration.

  5. KV cache quantization: Quality impact of aggressive KV cache quantization (Q4) is not modeled; it may degrade output quality for sensitive tasks.

  6. RAM offloading: The regime-aware model (N1–N10) distinguishes weight offloading from KV cache offloading, but still simplifies the real behavior of offloading engines. In practice, llama.cpp and vLLM may use pipelined or overlapped transfers that partially hide latency. The KV swap model assumes sequential swap-in, while real implementations may prefetch or stream KV cache during prefill.

  7. Multi-GPU TP: Estimates use the analytical all-reduce model with fixed latency. Real implementations may use custom all-reduce algorithms (e.g., NVLink SHARP, NCCL topology-aware) that differ from the ring model.

  8. Pipeline Parallelism: The bubble fraction model assumes single micro-batch inference. With continuous batching or multiple concurrent requests, the bubble can be hidden more effectively.

  9. Framework overhead: The 12% inference overhead is a conservative average. vLLM and llama.cpp may use less; PyTorch native may use more.

  10. MoE active parameters: The calculator uses the active parameter count reported by the model card or estimated from config. Actual activation patterns may differ, and expert routing is token-dependent.

  11. Power model: TDP-based power estimation is approximate. Real power varies with clock speed, temperature throttling, and workload characteristics.

  12. GPU clock speed: Clock speed is not explicitly modeled in the calculator since decode is bandwidth-bound. For compute-bound prefill, the TFLOPS value implicitly captures clock speed effects.


12. Glossary