Local LLM Resource Estimator

🖥️ Hardware Cards 🧠 GPU Architecture Reference 📄 Technical Documentation

Hugging Face import

Model ID

Private / gated (requires token)

Param source

—

Task

—

Downloads

—

Library

—

License

—

Base model

—

Base relation

—

Last modified

—

Total params

—

Active params

—

Expert count

—

Sliding window

—

Context length

—

Vocabulary

—

Tensor / quant type

—

Modalities

—

Vision encoder

—

Languages

—

Datasets

—

Quantized variants (click to explore)

Manual architecture overrides

Total params (B) ?

Active params (B) ?

Architecture ?

Layers ?

Hidden size

Attn heads

KV heads ?

Head dim

Task

Confidence: not imported

Total params → memory sizing. Active params → throughput only (MoE). Missing fields downgrade to rule-of-thumb.

Model

Select preset

Parameters ?

—

Architecture ?

—

Layers ?

—

KV heads ?

—

Hardware

GPU device

VRAM (GB)

HBM Bandwidth (GB/s)

TDP (W)

FP16 TFLOPS ?

PCIe Generation ?

PCIe Lanes ?

NVLink BW (GB/s) ?

Has NVSwitch? ?

Number of GPUs

Total usable VRAM

—

Enable RAM offloading

System RAM (GB) ?

RAM type ?

NUMA-aware allocation ?

PCIe BW (effective) ?

—

GB/s

RAM BW (effective) ?

—

GB/s

Bus Wall ?

—

× slower

Transfer bottleneck ?

—

PCIe lanes

—

HBM BW

—

GB/s

Quantization & context

Weight precision ?

KV cache precision ?

Concurrent users ?

Context length (tokens) ?

4,096

Batch size ?

Weights ?

—

KV cache ?

—

Overhead ?

—

Memory use

—

Weights KV cache Overhead

——

Performances estimation

Prefill speed ?

—

tok/s

Decode speed ?

—

tok/s per user

Latency/tok ?

—

TTFT ?

—

100 tokens ?

—

seconds

1000 tokens ?

—

seconds

Multi-user throughput ?

—

tok/s all users

HBM bandwidth used ?

—

GB/s

Bottleneck ?

—

Arith. intensity ?

—

FLOP/byte

Roofline model: decode is bandwidth-bound (low arithmetic intensity), prefill is compute-bound. MoE models use active-parameter count. Actual results depend on engine, batch strategy, and kernel optimization.

Power & cost estimation

GPU TDP (W) ?

Electricity ($/kWh) ?

GPU utilization (%) ?

Hours per day

Carbon intensity (kg CO2/kWh) ?

Power draw

—

Cost / hour

—

Cost / month

—

Cost / 1M tokens

—

Energy / hour

—

kWh

Cost / day

—

CO2 / hour

—

Annual CO2

—

tonnes

Power draw = TDP × utilization%. Energy cost varies significantly by region ($0.05–$0.40/kWh). Carbon intensity: world avg ~0.417 kg CO2/kWh, EU ~0.255, US ~0.387, France ~0.056 (nuclear), Poland ~0.769 (coal).