Local LLM Resource Estimator

🖥️ Hardware Cards 🧠 GPU Architecture Reference 📄 Technical Documentation

Hugging Face import

Param source
Task
Downloads
Library
License
Base model
Base relation
Last modified
Total params
Active params
Expert count
Sliding window
Context length
Vocabulary
Tensor / quant type
Modalities
Vision encoder
Languages
Datasets
Tags
Eval metrics

Quantized variants (click to explore)

Manual architecture overrides

Confidence: not imported

Total params → memory sizing. Active params → throughput only (MoE). Missing fields downgrade to rule-of-thumb.

Model

Parameters ?
Architecture ?
Layers ?
KV heads ?

Hardware

1
Total usable VRAM
PCIe BW (effective) ?
GB/s
RAM BW (effective) ?
GB/s
Bus Wall ?
× slower
Transfer bottleneck ?
PCIe lanes
HBM BW
GB/s

Quantization & context

1
4,096
1
Weights ?
GB
KV cache ?
GB
Overhead ?
GB

Memory use

Weights KV cache Overhead

Performances estimation

Prefill speed ?
tok/s
Decode speed ?
tok/s per user
Latency/tok ?
ms
TTFT ?
ms
100 tokens ?
seconds
1000 tokens ?
seconds
Multi-user throughput ?
tok/s all users
HBM bandwidth used ?
GB/s
Bottleneck ?
Arith. intensity ?
FLOP/byte

Roofline model: decode is bandwidth-bound (low arithmetic intensity), prefill is compute-bound. MoE models use active-parameter count. Actual results depend on engine, batch strategy, and kernel optimization.

Power & cost estimation

Power draw
W
Cost / hour
$
Cost / month
$
Cost / 1M tokens
$
Energy / hour
kWh
Cost / day
$
CO2 / hour
kg
Annual CO2
tonnes

Power draw = TDP × utilization%. Energy cost varies significantly by region ($0.05–$0.40/kWh). Carbon intensity: world avg ~0.417 kg CO2/kWh, EU ~0.255, US ~0.387, France ~0.056 (nuclear), Poland ~0.769 (coal).

×