Local LLM Memory Calculator extended

Configuration

Pick a model, quantization, context and hardware — or load a preset.

Workload profile
max-model-len / server context limit
Context length
Batch size (concurrent sequences held in KV cache)

Result

[Interactive memory diagram: GPU engine and VRAM usage, with unified-RAM examples for M3 Max (GPU · CPU · NPU) and Blackwell. Hover over any section to see VRAM details.]
– tok/s
single-stream decode (batch 1)
Concurrency | Total throughput | Per-stream
Quality impact & estimated cost
Weights ·
Quality retained
Error / hallucination
KV cache ·
Error / hallucination

Local Setup & CLI Command

Copy command and follow steps to run this model in your local runtime.

💡 Tip: Click highlighted command options (like --kv-cache-dtype) to view explanations.
bash — local server
# Command placeholder

Behind the Calculator: How it Works

Theoretical performance modeling using memory bandwidth

tok/s = (Bandwidth × Framework × Quant-Kernel × Attention × Spec-Decode) / Bytes Read per Token
Roofline Performance Model (How Speed is Estimated)
Click or hover any variable in the equation above to see how it affects generation speed. LLM decoding is memory-bound: to generate each token, the GPU must read all active parameters from memory, so reading fewer bytes per token yields faster generation.
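The equation above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name, parameter names, and the choice to treat each factor as a plain multiplier are assumptions made for the example.

```python
def decode_tokens_per_second(
    bandwidth_gb_s: float,
    bytes_per_token_gb: float,
    framework: float = 1.0,
    quant_kernel: float = 1.0,
    attention: float = 1.0,
    spec_decode: float = 1.0,
) -> float:
    """Estimate decode speed as derated bandwidth divided by bytes read per token.

    All factor names mirror the terms in the equation; their default of 1.0
    (no derating, no speculative gain) is an assumption for this sketch.
    """
    if bytes_per_token_gb <= 0:
        raise ValueError("bytes_per_token_gb must be positive")
    # Efficiency factors scale the peak bandwidth down to what is achievable.
    effective_bandwidth = bandwidth_gb_s * framework * quant_kernel * attention
    # Speculative decoding multiplies the tokens produced per pass over memory.
    return effective_bandwidth * spec_decode / bytes_per_token_gb
```

For instance, with an illustrative 400 GB/s of bandwidth, a framework factor of 0.7, and 4 GB read per token, the estimate is about 70 tok/s.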
Prefill vs Decode: where TFLOPS and Memory Bandwidth matter

Prompt tokens

many tokens in parallel

The whole prompt enters the model first.

Prefill

TFLOPS matters most

Runs large matrix multiplications over the whole prompt and writes the KV cache. Larger batches or longer prompts keep the GPU cores better utilized.

KV cache

history stored in VRAM

Prefill creates the conversation memory that decode reads for every new token.

Decode tok/s

Memory Bandwidth matters most

Generates one new token at a time, repeatedly reading weights and KV from VRAM.

Prefill is compute-heavy and fills KV once; decode is memory-traffic-heavy and repeats for every output token. Good serving needs both.
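The contrast between the two phases can be made concrete with a rough estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs per parameter per token for prefill (which ignores attention cost) and assumes decode reads the weights plus the KV cache once per token. The function names are hypothetical and the numbers are illustrative, not taken from the calculator.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int, tflops: float) -> float:
    """Compute-bound estimate: total FLOPs divided by sustained compute rate."""
    # Assumption: ~2 FLOPs per parameter per prompt token (attention ignored).
    total_flops = 2 * params_billion * 1e9 * prompt_tokens
    return total_flops / (tflops * 1e12)


def decode_seconds_per_token(weight_gb: float, kv_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate: bytes read per token divided by memory bandwidth."""
    # Assumption: every output token reads all weights and the full KV cache once.
    return (weight_gb + kv_gb) / bandwidth_gb_s
```

Under these assumptions, an 8B-parameter model with a 1,000-token prompt on 100 TFLOPS of sustained compute takes about 0.16 s to prefill, while reading 4 GB of weights plus 1 GB of KV cache at 500 GB/s costs about 0.01 s per decoded token.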

1. Model Quantization

Speeds up generation by reading fewer bytes from VRAM. A 4-bit model reads ~4× fewer weight bytes than FP16, which yields roughly a 3.5× speedup once dequantization kernel overhead is included.
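The arithmetic behind the 4× and ~3.5× figures can be sketched as follows. The function names and the `kernel_efficiency` parameter are hypothetical; the value 0.875 is simply the factor that turns 4× into 3.5× and is an assumption, not a measured number. Real 4-bit formats also store scales, so their effective bits per weight are somewhat above 4.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


def quant_speedup(baseline_bits: float, quant_bits: float, kernel_efficiency: float) -> float:
    """Ideal byte-reduction ratio, scaled by an assumed dequantization efficiency."""
    return baseline_bits / quant_bits * kernel_efficiency
```

For an 8B-parameter model this gives about 16 GB at FP16 and 4 GB at 4 bits, and `quant_speedup(16, 4, 0.875)` evaluates to 3.5.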

2. KV Cache Bottleneck

At long context lengths, reading the conversation history (KV cache) dominates memory traffic. Generating at 32K context is slower than at short context because every new token requires reading the entire cached history.
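To see why long contexts are expensive, the KV cache size can be estimated from the model shape. The sketch below uses the standard accounting of one key and one value vector per layer, per KV head, per token. The function name and the example shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) are illustrative assumptions, not values from the calculator.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Bytes held in the KV cache for `batch_size` sequences of `context_len` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size
```

With the example shape, each token adds 131,072 bytes, so a single 32K-token sequence holds exactly 4 GiB of KV cache, all of which is read again for every generated token.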

3. Speculative Decoding

Boosts speed by having a draft model guess several future tokens per step, which the main model then verifies in a single pass. However, draft model acceptance rates drop at long contexts or under high user concurrency.
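The sensitivity to acceptance rate can be illustrated with a standard simplification: if each drafted token is accepted independently with the same probability, the expected number of tokens produced per verification step follows a truncated geometric series. This is an idealized model that ignores the cost of running the draft model, and it is not necessarily how the calculator computes its Spec-Decode factor. The function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification step under i.i.d. acceptance.

    Assumes each of `draft_tokens` guesses is accepted with probability
    `acceptance_rate`, and that the verifier always contributes one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # Every guess is accepted, plus the verifier's own token.
        return float(draft_tokens + 1)
    # Sum of a**i for i in 0..draft_tokens, in closed form.
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)
```

With 4 drafted tokens, an acceptance rate of 0.8 gives about 3.36 tokens per step, while 0.5 gives about 1.94, which shows how quickly the gain shrinks as acceptance falls.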

4. Target Calibration

Memory bandwidth utilisation is derated per platform architecture: for example, Apple Silicon M-series chips are derated to reflect compute limits, and server rigs are calibrated against vLLM Spark telemetry.

Memory Units

VRAM capacities are shown in GiB (binary: 1 GiB = 2^30 bytes), matching what GPUs report. Bandwidth figures use decimal GB/s (1 GB = 10^9 bytes) to align with standard memory bandwidth specifications.
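Mixing the two unit systems is a common source of small errors, so a conversion helper can be useful. The sketch below is a minimal illustration with hypothetical function names.

```python
GIB = 2**30  # bytes in one gibibyte (binary)
GB = 10**9   # bytes in one gigabyte (decimal)


def gib_to_gb(gib: float) -> float:
    """Convert binary GiB to decimal GB."""
    return gib * GIB / GB


def gb_to_gib(gb: float) -> float:
    """Convert decimal GB to binary GiB."""
    return gb * GB / GIB
```

A card reported as 24 GiB holds about 25.77 decimal GB, so dividing a GiB capacity directly by a GB/s bandwidth understates the read time by roughly 7%.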

Calibration & Telemetry (calibration against real systems)

Theoretical speeds are derated by architecture limits: for example, compute constraints on Apple M-series chips, or vLLM Spark server deployments calibrated against telemetry at ~70% of peak bandwidth.