Chat Completions
OpenAI-compatible chat completions with our extensions for processors and detailed timing metadata.
Overview
The /v1/chat/completions endpoint accepts any payload that works with the OpenAI Chat Completions API. We add a single optional extension object, inferada, and enrich the response with a timings block and optional processor metadata. All other fields pass straight through to the model.
Create a chat completion
/v1/chat/completions
Create a chat completion. Streaming (SSE) and buffered responses are supported.
Request
model (string, required) - ID of the model to use. Use /v1/models to discover available ones.
messages (array, required) - Array of messages with role (system / user / assistant / tool) and content. The system role is rendered server-side per the model's configured template_render_mode; see Prompt Templating.
stream (boolean) - Stream the response as SSE. Defaults to true. Automatically disabled if any processor requires buffering.
temperature (number) - Sampling temperature. Standard OpenAI semantics.
top_p (number) - Nucleus sampling.
max_tokens (number) - Max tokens to generate in the completion.
stop (string | string[]) - Stop sequences.
frequency_penalty (number) - Standard OpenAI parameter.
presence_penalty (number) - Standard OpenAI parameter.
tools (array) - Tool / function definitions the model can call.
tool_choice (string | object) - How the model picks tools.
response_format (object) - E.g. { type: "json_object" } for structured output.
inferada (object) - Our extension namespace. Carries processors: string[], an optional language string, and processor-specific config. See the Processors guide for the catalogue and per-processor shape.
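As a sketch, a buffered request carrying the inferada extension could be assembled like this. The base URL, token, and the "redact" processor name are placeholders for illustration only; the actual catalogue lives in the Processors guide:

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; use your gateway's URL
TOKEN = "YOUR_TOKEN"                  # placeholder bearer token

def build_payload(model, messages, processors=None, language=None):
    """Assemble an OpenAI-compatible body plus the optional inferada block."""
    payload = {"model": model, "messages": messages, "stream": False}
    if processors:
        inferada = {"processors": processors}
        if language:
            inferada["language"] = language
        payload["inferada"] = inferada
    return payload

def create_completion(payload):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload(
    "qwen3.5-35b",
    [{"role": "user", "content": "Hello!"}],
    processors=["redact"],  # illustrative processor name, not the real catalogue
    language="en",
)
```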
{
"model": "qwen3.5-35b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
],
"stream": false,
"temperature": 0.7,
"max_tokens": 1000
}
Response · 200
Buffered response. Standard OpenAI shape with two additions: timings and inferada.
id (string) - Completion ID; matches the request_id.
choices (array) - Array of completion choices (OpenAI shape).
usage (object) - Token counts: prompt_tokens, completion_tokens, total_tokens.
timings (object) - Our extension. Latency breakdown (total_ms, model time, tokens/second, processor time).
inferada (object) - Per-processor metadata (only present when processors ran). See the Processors guide.
{
"id": "chatcmpl-550e8400-e29b-41d4-a716-446655440000",
"object": "chat.completion",
"created": 1704900000,
"model": "qwen3.5-35b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
},
"timings": {
"total_ms": 870,
"upstream_ms": 820,
    "tokens_per_second": 9.8
}
}
Errors
Possible errors:
400 - invalid body / unknown processor
401 - missing or invalid token
403 - scope / IP restriction
404 - model not found
422 - processor policy rejection (see Processors) or missing/unsupported inferada.language
429 - rate or concurrency limit
500 - service failure
See Errors and Rate Limits.
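Of these, 429 is the one worth retrying automatically. A minimal client-side backoff sketch; the schedule values are arbitrary defaults, not gateway policy:

```python
import time

def backoff_delays(retries, base=0.5, cap=8.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]

def with_retries(send, retries=5, base=0.5):
    """Call send() until it returns a non-429 response or retries run out.

    send is any zero-argument callable returning an object with a
    .status attribute (e.g. an http.client.HTTPResponse).
    """
    for delay in backoff_delays(retries, base=base):
        resp = send()
        if resp.status != 429:
            return resp
        time.sleep(delay)
    return send()  # final attempt after the last wait
```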
Streaming
When stream: true (or omitted; streaming is the default), the response is sent as Server-Sent Events with Content-Type: text/event-stream. Each chunk is an OpenAI-compatible delta frame. The stream ends with a [DONE] marker.
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Some processors require the full response and will cause streaming to be disabled for that request. See the Processors guide for details.
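The delta frames above can be folded back into the full message client-side. A minimal sketch that reassembles a sequence of SSE data: lines:

```python
import json

def collect_stream(lines):
    """Fold SSE 'data:' lines into the assistant's full message text."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

Applied to the example frames above, this yields "Hello!".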
Processors
Opt-in steps that wrap your completion (redaction, spellcheck, PII / injection guards, routing policy). Catalogue, request shape, response metadata, scope requirements, and limitations live on the Processors guide.
Timings
Every buffered completion includes a timings block useful for debugging latency:
total_ms (number) - Total end-to-end latency. Includes queue wait and all processing.
upstream_ms (number) - Time the model itself spent generating.
tokens_per_second (number) - Throughput computed from completion tokens and upstream_ms. Only present when tokens > 0.
processors_ms (number) - Measured aggregate of pre + post processor time. Only present when processors ran; see Processors.
queued_ms (number) - Time spent waiting for a concurrency slot before the request started. Only present when greater than 0; a non-zero value means your workload is hitting a max_concurrent rule. Streaming responses expose the same value in the x-inferada-queued-ms header.
gateway_ms (number) - Residual time spent in the gateway itself (validation, serialization, accounting writes). Only present when non-zero.
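Given the stated definition of tokens_per_second, the relationship between the fields can be sanity-checked client-side; a small sketch under that assumption:

```python
def tokens_per_second(completion_tokens, upstream_ms):
    """Throughput as documented: completion tokens over model generation time."""
    if completion_tokens <= 0 or upstream_ms <= 0:
        return None  # the field is omitted when there are no completion tokens
    return completion_tokens / (upstream_ms / 1000.0)
```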
List available models
/v1/models
List models available to your token. OpenAI-compatible list format with extra display metadata.
Response · 200
OpenAI-compatible list. Each model has standard fields plus extra metadata (display_name, description).
{
"object": "list",
"data": [
{
"id": "qwen3.5-35b",
"object": "model",
"created": 1704900000,
"owned_by": "inferada",
"display_name": "Qwen 3.5 35B",
"description": "General-purpose chat model with 128K context."
}
]
}
If your token has an allowedModels restriction, this list is filtered accordingly.
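Because the list can be filtered per token, a client should not hard-code a model id. A small sketch that picks the first preferred id actually present in the list payload (field names as in the example above):

```python
def pick_model(models_response, preferred):
    """Return the first preferred model id visible to this token, else None."""
    available = {m["id"] for m in models_response["data"]}
    for model_id in preferred:
        if model_id in available:
            return model_id
    return None
```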