LLM Inference Benchmark v5 live

⬤ Default · :8000 ⬤ Optimized · :8001

Model

Preset

Model ID

~2.2 GB VRAM · no HF token required

--dtype

Default Server

Endpoint URL

Concurrency

Num Requests

Prompt Type

10 one-sentence prompts · 128 max tokens

Optimized Server Parameters

Endpoint URL

--max-num-seqs

--max-num-batched-tokens

--gpu-memory-utilization (0 – 1)

--enable-chunked-prefill

Enabled

--max-model-len (tokens)

--block-size

--scheduling-policy

Rewrites docker-compose.yml and restarts the optimized container. If the model changes, both containers restart (~4 min). Requires Docker on the server.
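
As a rough illustration of how the fields above could become server arguments, here is a minimal sketch. It assumes each panel field maps one-to-one to a command-line flag of the same name; the helper `build_server_flags` and all example values are hypothetical and are not taken from the tool itself.

```python
def build_server_flags(params):
    """Turn a dict of panel values into a flat list of command-line flags.

    Assumption: keys use underscores and map directly to the flag names
    shown in the panel (e.g. max_num_seqs -> --max-num-seqs).
    """
    flags = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            # Boolean fields are treated as switches: present when enabled.
            if value:
                flags.append(flag)
        elif value is not None:
            if name == "gpu_memory_utilization" and not 0.0 <= value <= 1.0:
                raise ValueError("gpu_memory_utilization must be between 0 and 1")
            flags.extend([flag, str(value)])
    return flags


# Illustrative values only; the panel's real defaults are not known here.
example_params = {
    "max_num_seqs": 64,
    "max_num_batched_tokens": 4096,
    "gpu_memory_utilization": 0.85,
    "enable_chunked_prefill": True,
    "max_model_len": 2048,
    "block_size": 16,
    "scheduling_policy": "fcfs",
}
example_flags = build_server_flags(example_params)
print(" ".join(example_flags))
```

A list like this could then be written into the `command:` entry of the optimized service in docker-compose.yml before the container is restarted; how the tool actually rewrites the file is not shown in the panel.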

Idle — configure above and run

No benchmark data yet

Configure the servers in the left panel and click Run Benchmark. Results update live.

Live Log

Log output will appear here when a benchmark or config apply runs.

End-to-End Latency (ms) — lower is better

Time to First Token / TTFT (ms) — lower is better

Latency Distribution

TTFT Distribution

Latency Percentiles (ms)

Throughput

Scheduler State
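
The Latency Percentiles (ms) panel presumably summarizes per-request latencies. A minimal sketch of one way to compute such a summary follows, using the nearest-rank method; the choice of p50/p90/p95/p99 and the function names are assumptions, not details confirmed by the dashboard.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return a dict of common latency percentiles (assumed set)."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 95, 99)}


# Illustrative latencies, not measured data.
print(summarize([30, 10, 50, 20, 40]))
```

Nearest-rank always returns an observed sample, which keeps tail values such as p99 easy to trace back to a specific request; interpolating methods would give slightly different numbers on small runs.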