Pick a model, quantization, context and hardware — or load a preset.
| Concurrency | Total throughput | Per-stream |
|---|---|---|
Copy the command and follow the steps to run this model in your local runtime.
# Command placeholder
Theoretical performance modeling using memory bandwidth
The whole prompt enters the model first.
Runs large matrix multiplications over the prompt and writes the KV cache. Larger batches or prompts make better use of the GPU's compute cores.
Prefill creates the conversation memory that decode reads every next token.
Generates one new token at a time, repeatedly reading weights and KV from VRAM.
Prefill is compute-heavy and fills KV once; decode is memory-traffic-heavy and repeats for every output token. Good serving needs both.
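The split between the two phases can be sketched with a back-of-the-envelope model. This is a rough illustration, not the calculator's actual formula: it assumes about 2 FLOPs per parameter per prompt token for prefill and one full read of weights plus KV cache per decoded token. The function names and the hardware figures (8B parameters, 100 TFLOP/s, 1 TB/s) are hypothetical.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_sec: float) -> float:
    # Compute-bound estimate: roughly 2 FLOPs per parameter per prompt token.
    return 2.0 * params * prompt_tokens / flops_per_sec


def decode_tokens_per_sec(weight_bytes: float, kv_bytes: float, bytes_per_sec: float) -> float:
    # Memory-bound estimate: each new token rereads all weights plus the KV cache.
    return bytes_per_sec / (weight_bytes + kv_bytes)


# Hypothetical setup: 8B parameters in FP16, 100 TFLOP/s compute, 1 TB/s bandwidth.
PARAMS = 8e9
WEIGHT_BYTES = PARAMS * 2  # 2 bytes per FP16 parameter

ttft = prefill_seconds(PARAMS, prompt_tokens=2048, flops_per_sec=100e12)
tps = decode_tokens_per_sec(WEIGHT_BYTES, kv_bytes=0.0, bytes_per_sec=1e12)
print(f"prefill: {ttft:.3f} s, decode: {tps:.1f} tok/s")
```

Real hardware rarely reaches peak compute or peak bandwidth, so both numbers are best read as upper bounds.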
Speeds up generation purely by reading fewer bytes from VRAM. A 4-bit model reads ~4× fewer bytes than FP16, speeding up generation by ~3.5× (including dequantization kernel overhead).
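The arithmetic behind those figures can be sketched as a byte ratio scaled by a kernel efficiency factor. The 0.875 default is an assumption chosen only so that 4× fewer bytes yields the ~3.5× quoted above; real overhead depends on the kernel and hardware.

```python
def quant_speedup(baseline_bits: int, quant_bits: int, kernel_efficiency: float = 0.875) -> float:
    # Ideal speedup equals the reduction in bytes read per token.
    byte_ratio = baseline_bits / quant_bits
    # Dequantization overhead is modeled as a flat efficiency factor (assumed).
    return byte_ratio * kernel_efficiency


print(quant_speedup(16, 4))  # FP16 -> 4-bit
print(quant_speedup(16, 8))  # FP16 -> 8-bit
```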
At long context lengths, reading the conversation history (KV cache) dominates memory bandwidth. Generating at 32K context is slower because the GPU spends more time scanning history.
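To see why, it helps to estimate the KV cache size. A minimal sketch, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token; the model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) and the hardware numbers are hypothetical.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    # Factor 2 covers keys and values, stored per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


# Hypothetical model shape at 32K context with an FP16 cache.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=32 * 1024)
print(f"KV cache at 32K: {kv / 1024**3:.1f} GiB")

# Bandwidth-bound decode speed with and without the cache in the read path.
weights = 16e9      # assumed FP16 weights of an 8B-parameter model, in bytes
bandwidth = 1e12    # assumed 1 TB/s memory bandwidth
short_ctx_tps = bandwidth / weights
long_ctx_tps = bandwidth / (weights + kv)
print(f"decode: {short_ctx_tps:.1f} tok/s near-empty context, {long_ctx_tps:.1f} tok/s at 32K")
```

Under these assumptions the cache adds about a quarter to the bytes read per token for a single stream; with several concurrent streams each one carries its own cache, so the effect grows.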
Boosts speed by guessing multiple future tokens per step. However, draft model acceptance rates drop at long contexts or under high user concurrency.
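The sensitivity to acceptance rate can be illustrated with the standard expected-length formula for speculative decoding, which assumes each drafted token is accepted independently with the same probability. This is a simplified sketch, not necessarily the model used here, and it ignores the cost of running the draft model. The function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    # Expected tokens emitted per target-model pass when each of the
    # `draft_tokens` guesses is accepted independently with `acceptance_rate`.
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_tokens + 1)
    return (1.0 - acceptance_rate ** (draft_tokens + 1)) / (1.0 - acceptance_rate)


for rate in (0.8, 0.5):
    print(rate, expected_tokens_per_step(rate, draft_tokens=4))
```

With four drafted tokens, dropping the acceptance rate from 0.8 to 0.5 cuts the expected yield from about 3.4 to about 1.9 tokens per step.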
Memory bandwidth utilisation is derated per platform architecture (e.g. Apple Silicon M-series chips are derated to fit compute limits, and server rigs are calibrated against vLLM Spark telemetry).