Chat Completions
OpenAI-compatible chat completions with our extensions for processors and detailed timing metadata.
Overview
The /v1/chat/completions endpoint accepts any payload that works with the OpenAI Chat Completions API. We add a single optional extension object, inferada, and enrich the response with a timings block and optional processor metadata. All other fields pass straight through to the model.
Create a chat completion
/v1/chat/completions
Create a chat completion. Streaming (SSE) and buffered responses are supported.
Request
model (required, string) - ID of the model to use. Use /v1/models to discover available ones.
messages (required, array) - Array of messages with role (system / user / assistant / tool) and content. The system role is rendered server-side per the model's configured template_render_mode; see Prompt Templating.
stream (boolean) - Stream the response as SSE. Defaults to false (matches the OpenAI spec); set to true to receive incremental chunks. Automatically disabled if any processor requires buffering.
temperature (number) - Sampling temperature. Standard OpenAI semantics.
top_p (number) - Nucleus sampling.
max_tokens (number) - Max tokens to generate in the completion.
stop (string | string[]) - Stop sequences.
frequency_penalty (number) - Standard OpenAI parameter.
presence_penalty (number) - Standard OpenAI parameter.
tools (array) - Tool / function definitions the model can call.
tool_choice (string | object) - How the model picks tools.
response_format (object) - E.g. { type: "json_object" } for structured output.
inferada (object) - Our extension namespace. Carries processors: string[], an optional language string, and processor-specific config. See the Processors guide for the catalogue and per-processor shape.
inferada.haystack (object) - Caller-side structured context: { data, schema?, format?, redact? }. Activate by including "haystack" in inferada.processors. See the Haystack guide for the field-def shape and the published meta-schema.
inferada.system_mode (string) - "merge_default" (default) or "replace_default". Controls whether the model's configured default system prompt is merged into the assembled upstream system prompt or skipped. See the system-prompt assembly guide.
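As a rough illustration of how the request fields fit together, the sketch below assembles a request body in Python. The helper name build_chat_request is hypothetical, and the contents of the haystack data field are an assumption; the Haystack guide defines the real shape. Standard OpenAI fields are passed through unchanged, and the inferada object is only attached when at least one extension field is set.

```python
def build_chat_request(model, messages, processors=None, language=None,
                       system_mode=None, haystack=None, **openai_fields):
    """Assemble a chat-completion body (hypothetical helper, not an official client)."""
    body = {"model": model, "messages": messages, **openai_fields}

    extension = {}
    if processors:
        extension["processors"] = list(processors)
    if language is not None:
        extension["language"] = language
    if system_mode is not None:
        # The two documented values for inferada.system_mode.
        if system_mode not in ("merge_default", "replace_default"):
            raise ValueError(f"unsupported system_mode: {system_mode!r}")
        extension["system_mode"] = system_mode
    if haystack is not None:
        extension["haystack"] = haystack
        # Haystack is activated by listing it in inferada.processors.
        active = extension.setdefault("processors", [])
        if "haystack" not in active:
            active.append("haystack")

    # Omit the extension object entirely when nothing was requested.
    if extension:
        body["inferada"] = extension
    return body
```

Calling build_chat_request("qwen3.5-35b", messages, temperature=0.7) yields a plain OpenAI-style body; adding haystack={"data": ...} also adds the inferada object with "haystack" listed under processors.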
{
"model": "qwen3.5-35b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
],
"stream": false,
"temperature": 0.7,
"max_tokens": 1000
}
Response · 200
Buffered response. Standard OpenAI shape with two additions: timings and inferada.
id (string) - Completion ID; matches the request_id.
choices (array) - Array of completion choices (OpenAI shape).
usage (object) - Token counts: prompt_tokens, completion_tokens, total_tokens.
timings (object) - Our extension. Latency breakdown (total_ms, model time, tokens/second, processor time).
inferada (object) - Per-processor metadata (only present when processors ran). See the Processors guide.
{
"id": "chatcmpl-550e8400-e29b-41d4-a716-446655440000",
"object": "chat.completion",
"created": 1704900000,
"model": "qwen3.5-35b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
},
"timings": {
"total_ms": 870,
"upstream_ms": 820,
"tokens_per_second": 38.5
}
}
Errors
Possible errors: 400 (invalid body / unknown processor), 401 (missing/invalid token), 403 (scope / IP), 404 (model not found), 422 (processor policy rejection — see Processors — or missing/unsupported inferada.language), 429 (rate or concurrency limit), 500 (service failure). See Errors and Rate Limits.
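A client may want to map these status codes to a short description and a retry decision. The sketch below is one possible client-side approach; the function name describe_error is hypothetical, and treating only 429 and 500 as retryable is an assumption made here, not documented gateway behaviour.

```python
# Meanings taken from the status list above.
ERROR_MEANINGS = {
    400: "invalid body or unknown processor",
    401: "missing or invalid token",
    403: "scope or IP restriction",
    404: "model not found",
    422: "processor policy rejection, or missing/unsupported inferada.language",
    429: "rate or concurrency limit",
    500: "service failure",
}

# Assumption: only throttling and service failures are worth retrying.
RETRYABLE = {429, 500}


def describe_error(status):
    """Return a small dict describing an error status code."""
    return {
        "status": status,
        "meaning": ERROR_MEANINGS.get(status, "unexpected status"),
        "retryable": status in RETRYABLE,
    }
```

The 4xx codes other than 429 indicate a problem with the request itself, so this sketch does not retry them.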
Streaming
When stream: true, the response is sent as Server-Sent Events with Content-Type: text/event-stream. Each chunk is an OpenAI-compatible delta frame. The stream ends with a [DONE] marker. Streaming is opt-in — leave stream unset (or pass false) to get a single buffered response.
The stream uses an idle timeout, not a hard wall-clock cap — as long as the upstream keeps sending tokens, the stream stays open. A slow-but-progressing model (e.g. a CPU-served local model at single-digit tokens per second) won't get cut off mid-response. The timer resets on every chunk; if no chunk arrives within the idle window, the stream is aborted with a service-failure error. Buffered (non-streaming) requests keep a single hard deadline since there's no per-chunk signal to reset on.
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Some processors require the full response and will cause streaming to be disabled for that request. See the Processors guide for details.
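To show how the frames above can be consumed, here is a minimal Python sketch that reads already-received SSE lines, concatenates the delta content, and stops at the [DONE] marker. The function name collect_stream_text and the sample id are illustrative; a real client would feed it lines from an HTTP response body rather than a list.

```python
import json


def collect_stream_text(lines):
    """Join delta.content across SSE lines until [DONE]; also return finish_reason."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


# Sample frames mirroring the example stream (the id is a placeholder).
SAMPLE = [
    'data: {"id":"chatcmpl-x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
```

Because streaming can be disabled per request when a processor needs buffering, a robust client would also check the response Content-Type before parsing it as an event stream.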
Processors
Opt-in steps that wrap your completion (redaction, spellcheck, PII / injection guards, routing policy). Catalogue, request shape, response metadata, scope requirements, and limitations live on the Processors guide.
Tools
The standard OpenAI tools and tool_choice request fields are honoured. When you ship your own tool definitions, the gateway forwards them to the model verbatim, streams tool_calls deltas back on the wire, and stays out of the way (no server-side execution, no post-processing of the assistant turn). When you don't, the model uses your org's configured catalogue tools and the gateway runs them in-process. Both paths plus multi-turn examples live on the Tools guide.
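When you supply your own tools and stream the response, the client has to reassemble the tool_calls fragments itself. The sketch below assumes the deltas follow the usual OpenAI streaming shape (each fragment carries an index, an optional id, and partial function name / arguments strings); the helper name merge_tool_call_deltas is hypothetical, and the Tools guide remains the reference for the actual wire format.

```python
def merge_tool_call_deltas(deltas):
    """Fold streamed tool_calls fragments into complete calls, ordered by index."""
    calls = {}
    for delta in deltas:
        for fragment in delta.get("tool_calls") or []:
            call = calls.setdefault(
                fragment["index"], {"id": None, "name": "", "arguments": ""}
            )
            if fragment.get("id"):
                call["id"] = fragment["id"]
            function = fragment.get("function") or {}
            # Name and arguments arrive as string pieces that are concatenated.
            call["name"] += function.get("name") or ""
            call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]
```

The arguments string is only valid JSON once the stream has finished, so parse it after the final chunk rather than per fragment.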
Timings
Every buffered completion includes a timings block useful for debugging latency:
total_ms (number) - Total latency end to end. Includes queue wait and all processing.
upstream_ms (number) - Time the model itself spent generating.
tokens_per_second (number) - Throughput computed from completion tokens and upstream_ms. Only present when tokens > 0.
processors_ms (number) - Measured aggregate of pre + post processor time. Only present when processors ran; see Processors.
queued_ms (number) - Time spent waiting for a concurrency slot before the request started. Only present when greater than 0; a non-zero value means your workload is bumping a max_concurrent rule. Streaming responses expose the same value in the x-inferada-queued-ms header.
gateway_ms (number) - Residual time spent in the gateway itself (validation, serialization, accounting writes). Only present when non-zero.
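For sanity-checking these numbers on the client side, the sketch below recomputes throughput and the time not covered by the reported components. Both helpers are hypothetical, and two things are assumptions: that throughput is completion tokens divided by upstream time in seconds, and that the reported components simply add up to total_ms. The gateway's exact accounting is not specified here.

```python
def tokens_per_second(completion_tokens, upstream_ms):
    """Completion tokens per second of upstream time; None when undefined."""
    if completion_tokens <= 0 or upstream_ms <= 0:
        return None
    return completion_tokens / (upstream_ms / 1000.0)


def unaccounted_ms(timings):
    """total_ms minus every reported component; absent optional fields count as 0."""
    components = ("upstream_ms", "processors_ms", "queued_ms", "gateway_ms")
    return timings["total_ms"] - sum(timings.get(name, 0) for name in components)
```

A large unaccounted value, or a non-zero queued_ms, points at time spent outside the model, which is usually the first thing to rule out when debugging latency.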
Model aliases
Org admins can register stable, suggestive names that resolve to one of their real models. Send model: "chat" instead of model: "qwen3.5-35b" and the gateway picks whichever real model the alias currently maps to. When the underlying model is swapped — newer revision, different provider, deprecation — your code keeps working.
- Each alias maps to an ordered chain of real model_ids. The resolver picks the first entry that's still active and within your token's allowed_models, which is useful for graceful degradation when a primary model is deactivated.
- Resolution-time fallback only. If the chosen model fails mid-call (5xx, timeout, etc.), the gateway does not automatically retry the next entry. Dispatch-time fallback is a planned future enhancement.
- Real model_ids always win over aliases; adding an alias never shadows an existing model.
- Discoverable via /v1/models: each model entry now carries an aliases array listing every alias key that routes to it.
- Aliases are scoped per service. The completions service's chat and the TTS service's chat would be independent.
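The resolution rules above can be sketched as a small function. This is an illustration of the described behaviour, not the gateway's implementation: the function name, the data structures, and the model id "backup-model" are all hypothetical, and returning None for an inactive or disallowed real model_id is an assumption.

```python
def resolve_model(requested, real_models, aliases, allowed_models=None):
    """
    Resolve a requested name to a real model_id, or None if nothing is usable.

    real_models:    dict of model_id -> True if active
    aliases:        dict of alias -> ordered list of real model_ids
    allowed_models: set of permitted model_ids, or None for no restriction
    """
    def usable(model_id):
        active = real_models.get(model_id, False)
        permitted = allowed_models is None or model_id in allowed_models
        return active and permitted

    # Real model_ids always win over aliases.
    if requested in real_models:
        return requested if usable(requested) else None

    # Otherwise walk the alias chain and take the first usable entry.
    for candidate in aliases.get(requested, []):
        if usable(candidate):
            return candidate
    return None
```

With real_models = {"qwen3.5-35b": False, "backup-model": True} and aliases = {"chat": ["qwen3.5-35b", "backup-model"]}, resolving "chat" skips the deactivated primary and returns "backup-model". The choice is made once, before dispatch, matching the resolution-time-only fallback described above.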
List available models
/v1/models
List models available to your token, across every model scope (chat completions and embeddings). OpenAI-compatible list format with extra display metadata. Any valid API token may call this; the result is filtered by your token’s allowedModels.
Request
Optional query parameters:
scope (string) - Narrow to one model scope: "completions" (chat) or "embeddings". Omit to list both. An unrecognised scope returns an empty list.
service_id (string) - Narrow to a single service (see GET /v1/services). Omit to span every service.
# All models
curl https://api.inferada.com/v1/models -H "Authorization: Bearer inf_YOUR_TOKEN"
# Only embedding models
curl "https://api.inferada.com/v1/models?scope=embeddings" -H "Authorization: Bearer inf_YOUR_TOKEN"
Response · 200
OpenAI-compatible list. Each model has standard fields plus extra metadata (scope, display_name, description, aliases).
{
"object": "list",
"data": [
{
"id": "qwen3.5-35b",
"object": "model",
"created": 1704900000,
"owned_by": "inferada",
"scope": "completions",
"display_name": "Qwen 3.5 35B",
"description": "General-purpose chat model with 128K context.",
"aliases": ["chat", "smart"]
}
]
}
The scope field tells a chat model apart from an embedding model. The aliases array lists every alias key that resolves to this model (whether as primary or fallback). See Model aliases. If your token has an allowedModels restriction, this list is filtered accordingly.
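For callers scripting against this endpoint, the sketch below builds the list URL from the optional query parameters and looks up which models an alias routes to in a parsed response. The helper names are hypothetical; the base URL is taken from the curl examples, and the HTTP call itself is left out so the snippet stays self-contained.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.inferada.com"


def models_url(scope=None, service_id=None):
    """Build the /v1/models URL, adding only the query parameters that were given."""
    params = {}
    if scope is not None:
        params["scope"] = scope
    if service_id is not None:
        params["service_id"] = service_id
    query = urlencode(params)
    return f"{BASE_URL}/v1/models" + (f"?{query}" if query else "")


def ids_for_alias(listing, alias):
    """Return ids of models in a parsed list response whose aliases include `alias`."""
    return [
        model["id"]
        for model in listing.get("data", [])
        if alias in model.get("aliases", [])
    ]
```

An alias can appear on more than one entry, since the aliases array covers both primary and fallback targets, so ids_for_alias returns a list rather than a single id.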
Retrieve a model
/v1/models/{model}
Retrieve a single model by id (or by one of its aliases). Same entry shape as the list endpoint. Returns 404 if the id is unknown or not accessible to your token.
Response · 200
{
"id": "qwen3.5-35b",
"object": "model",
"created": 1704900000,
"owned_by": "inferada",
"scope": "completions",
"display_name": "Qwen 3.5 35B",
"description": "General-purpose chat model with 128K context.",
"aliases": ["chat", "smart"]
}
Attaching models in the portal
Models are configured per completions service. From Organisation → Models, an org admin has two ways to add them:
- Add model: opens the full single-model form, where you pick a template, pick an upstream, fill in model_id, override defaults, and configure tools or limits.
- Add from templates: a modal that takes one upstream plus a multi-select of global / org templates and creates them all in one transaction. Each new model uses the template's upstream_model_id as its addressable model_id, and inheritable defaults (system prompt, parameters, processors, inferada namespace) fall through to the template at request time. Templates whose model_id is already attached to the chosen service are disabled in the picker; use Clone on the existing model row to make a variant instead.
Both flows enforce the per-service unique model_id constraint and refresh the catalogue cache after a successful save.