API Reference

Chat Completions

OpenAI-compatible chat completions with our extensions for processors and detailed timing metadata.

Overview

The /v1/chat/completions endpoint accepts any payload that works with the OpenAI Chat Completions API. We add a single optional extension object, inferada, to the request and enrich the response with a timings block and optional processor metadata. All other fields pass straight through to the model.

Create a chat completion

POST /v1/chat/completions
Scope: completions

Creates a chat completion. Both streaming (SSE) and buffered responses are supported.

Request

model · required · string
ID of the model to use. Call /v1/models to discover the available models.
messages · required · array
Array of messages, each with a role (system / user / assistant / tool) and content. The system role is rendered server-side according to the model's configured template_render_mode; see Prompt Templating.
stream · boolean
Stream the response as SSE. Defaults to false, matching the OpenAI spec; set to true to receive incremental chunks. Automatically disabled if any processor requires buffering.
temperature · number
Sampling temperature. Standard OpenAI semantics.
top_p · number
Nucleus sampling.
max_tokens · number
Maximum number of tokens to generate in the completion.
stop · string | string[]
Stop sequences.
frequency_penalty · number
Standard OpenAI parameter.
presence_penalty · number
Standard OpenAI parameter.
tools · array
Tool / function definitions the model can call.
tool_choice · string | object
How the model picks tools.
response_format · object
For example, { type: "json_object" } for structured output.
inferada · object
Our extension namespace. Carries processors (string[]), an optional language string, and processor-specific config. See the Processors guide for the catalogue and per-processor shape.
inferada.haystack · object
Caller-side structured context: { data, schema?, format?, redact? }. Activate it by including "haystack" in inferada.processors. See the Haystack guide for the field-def shape and the published meta-schema.
inferada.system_mode · string
"merge_default" (default) or "replace_default". Controls whether the model's configured default system prompt is merged into the assembled upstream system prompt or skipped. See the system-prompt assembly guide.
json
{
  "model": "qwen3.5-35b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 1000
}
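
The request above can be prepared from Python roughly as follows. This is a minimal sketch using only the standard library: the host and the Bearer header come from the curl examples in this reference, while the helper name build_chat_request and the haystack contents are illustrative assumptions (the real data shape is described in the Haystack guide).

```python
import json
import urllib.request

# Host taken from the curl examples in this reference.
BASE_URL = "https://api.inferada.com"


def build_chat_request(token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /v1/chat/completions."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "qwen3.5-35b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 1000,
    # Optional extension object. The haystack data is placeholder content.
    "inferada": {
        "processors": ["haystack"],
        "haystack": {"data": {"customer": "ACME"}},
        "system_mode": "merge_default",
    },
}

request = build_chat_request("inf_YOUR_TOKEN", payload)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     completion = json.load(response)
```

Dropping the inferada key leaves a plain OpenAI-style payload, which should be accepted unchanged.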

Response · 200

Buffered response. Standard OpenAI shape with two additions: timings and inferada.

id · string
Completion ID. Matches the request_id.
choices · array
Array of completion choices (OpenAI shape).
usage · object
Token counts: prompt_tokens, completion_tokens, total_tokens.
timings · object
Our extension. Latency breakdown (total_ms, model time, tokens/second, processor time).
inferada · object
Per-processor metadata (only present when processors ran). See the Processors guide.
json
{
  "id": "chatcmpl-550e8400-e29b-41d4-a716-446655440000",
  "object": "chat.completion",
  "created": 1704900000,
  "model": "qwen3.5-35b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  },
  "timings": {
    "total_ms": 870,
    "upstream_ms": 820,
    "tokens_per_second": 38.5
  }
}

Errors

Possible errors: 400 (invalid body / unknown processor), 401 (missing/invalid token), 403 (scope / IP), 404 (model not found), 422 (processor policy rejection — see Processors — or missing/unsupported inferada.language), 429 (rate or concurrency limit), 500 (service failure). See Errors and Rate Limits.
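
A client might turn the status codes above into a small lookup. The sketch below copies the meanings from this list; the retry policy (retry only 429 and 500) is an assumption for illustration, not documented gateway behaviour, and classify_error is a hypothetical helper name.

```python
from typing import Tuple

# Status codes and meanings as listed in this reference.
ERROR_MEANINGS = {
    400: "invalid body or unknown processor",
    401: "missing or invalid token",
    403: "scope or IP restriction",
    404: "model not found",
    422: "processor policy rejection, or missing/unsupported inferada.language",
    429: "rate or concurrency limit",
    500: "service failure",
}

# Assumption: only these are worth retrying without changing the request.
RETRYABLE = {429, 500}


def classify_error(status: int) -> Tuple[str, bool]:
    """Return (meaning, should_retry) for an HTTP status code."""
    meaning = ERROR_MEANINGS.get(status, "undocumented status")
    return meaning, status in RETRYABLE
```

The Errors and Rate Limits pages remain the authority on when and how long to back off.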

Streaming

When stream: true, the response is sent as Server-Sent Events with Content-Type: text/event-stream. Each chunk is an OpenAI-compatible delta frame. The stream ends with a [DONE] marker. Streaming is opt-in — leave stream unset (or pass false) to get a single buffered response.

The stream uses an idle timeout, not a hard wall-clock cap — as long as the upstream keeps sending tokens, the stream stays open. A slow-but-progressing model (e.g. a CPU-served local model at single-digit tokens per second) won't get cut off mid-response. The timer resets on every chunk; if no chunk arrives within the idle window, the stream is aborted with a service-failure error. Buffered (non-streaming) requests keep a single hard deadline since there's no per-chunk signal to reset on.
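
The idle-timeout rule can be modelled over recorded chunk arrival times. This is only a sketch of the rule as described above, not the gateway's implementation: the 30-second window is a made-up value, and a real enforcer would fire when the timer expires rather than when the next chunk shows up.

```python
from typing import Iterable, Iterator, Tuple


class IdleTimeout(Exception):
    """Raised when the gap between two chunks exceeds the idle window."""


def enforce_idle_timeout(
    timed_chunks: Iterable[Tuple[float, str]], idle_window_s: float
) -> Iterator[str]:
    """Yield chunks until the gap since the previous chunk exceeds idle_window_s.

    timed_chunks yields (arrival_time_in_seconds, chunk) pairs, with time
    measured from the start of the stream.
    """
    last_activity = 0.0
    for arrived_at, chunk in timed_chunks:
        gap = arrived_at - last_activity
        if gap > idle_window_s:
            raise IdleTimeout(f"no chunk for {gap:.1f}s")
        last_activity = arrived_at  # the timer resets on every chunk
        yield chunk


# A slow stream (one chunk every 10 s, 100 s in total) passes a 30 s window.
slow_stream = [(i * 10.0, f"tok{i}") for i in range(1, 11)]
received = list(enforce_idle_timeout(slow_stream, idle_window_s=30.0))
```

A stream that stalls for longer than the window raises IdleTimeout after the chunks already delivered, regardless of how short the stream has been overall.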

text
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Some processors require the full response and will cause streaming to be disabled for that request. See the Processors guide for details.
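
A client can rebuild the assistant text from the frames shown above by concatenating each delta's content until the [DONE] marker. The following is a minimal sketch that works on already-decoded lines; collect_stream_text is a hypothetical helper, and it assumes a single choice per chunk.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE data lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample))  # prints "Hello!"
```

Because streaming can be disabled for a request, it may be worth checking the response Content-Type before choosing between this parser and a plain JSON decode.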

Processors

Opt-in steps that wrap your completion (redaction, spellcheck, PII / injection guards, routing policy). Catalogue, request shape, response metadata, scope requirements, and limitations live on the Processors guide.

Tools

The standard OpenAI tools and tool_choice request fields are honoured. When you ship your own tool definitions, the gateway forwards them to the model verbatim, streams tool_calls deltas back on the wire, and stays out of the way (no server-side execution, no post-processing of the assistant turn). When you don't, the model uses your org's configured catalogue tools and the gateway runs them in-process. Both paths plus multi-turn examples live on the Tools guide.
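
For the caller-supplied path, a request and the matching response handling might look like this. The sketch assumes the standard OpenAI function-tool shape; get_weather and extract_tool_calls are hypothetical names, and executing the tool stays the caller's job since the gateway does not run caller-supplied tools.

```python
import json
from typing import List, Tuple

# Hypothetical tool definition in the standard OpenAI function-tool shape.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [get_weather_tool],
    "tool_choice": "auto",
}


def extract_tool_calls(completion: dict) -> List[Tuple[str, dict]]:
    """Return (function_name, parsed_arguments) for each tool call in choice 0."""
    message = completion["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI shape, arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

Each extracted call would then be executed locally and its result sent back as a tool-role message on the next turn; the Tools guide covers the multi-turn flow.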

Timings

Every buffered completion includes a timings block useful for debugging latency:

total_ms · number
Total end-to-end latency. Includes queue wait and all processing.
upstream_ms · number
Time the model itself spent generating.
tokens_per_second · number
Throughput computed from completion tokens and upstream_ms. Only present when the token count is greater than 0.
processors_ms · number
Measured aggregate of pre + post processor time. Only present when processors ran; see Processors.
queued_ms · number
Time spent waiting for a concurrency slot before the request started. Only present when greater than 0; a non-zero value means your workload is hitting a max_concurrent rule. Streaming responses expose the same value in the x-inferada-queued-ms header.
gateway_ms · number
Residual time spent in the gateway itself (validation, serialization, accounting writes). Only present when non-zero.
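
Since most timings fields are optional, client code should treat a missing key as zero. The sketch below does that and reports whatever part of total_ms is not covered by the listed components; that the components are meant to sum to roughly total_ms is an assumption, and summarize_timings is a hypothetical helper.

```python
def summarize_timings(timings: dict) -> dict:
    """Fill in absent optional fields with 0 and compute the unattributed remainder."""
    total = timings["total_ms"]
    parts = {
        key: timings.get(key, 0)
        for key in ("upstream_ms", "processors_ms", "queued_ms", "gateway_ms")
    }
    # Assumption: the four components are meant to add up to roughly total_ms.
    parts["unattributed_ms"] = total - sum(parts.values())
    # A non-zero queue wait means a max_concurrent rule was hit.
    parts["queued"] = parts["queued_ms"] > 0
    return parts


example = {"total_ms": 870, "upstream_ms": 820, "tokens_per_second": 38.5}
summary = summarize_timings(example)
print(summary)
```

Tracking the queued flag over time is one way to notice that a workload has outgrown its concurrency limit.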

Model aliases

Org admins can register stable, suggestive names that resolve to one of their real models. Send model: "chat" instead of model: "qwen3.5-35b" and the gateway picks whichever real model the alias currently maps to. When the underlying model is swapped — newer revision, different provider, deprecation — your code keeps working.

  • Each alias maps to an ordered chain of real model_ids. The resolver picks the first entry that's still active and within your token's allowed_models, which is useful for graceful degradation when a primary model is deactivated.
  • Resolution-time fallback only. If the chosen model fails mid-call (5xx, timeout, etc.), the gateway does not automatically retry the next entry. Dispatch-time fallback is a planned future enhancement.
  • Real model_ids always win over aliases — adding an alias never shadows an existing model.
  • Discoverable via /v1/models: each model entry now carries an aliases array listing every alias key that routes to it.
  • Aliases are scoped per service. The completions service's chat and the TTS service's chat would be independent.
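
The resolution rules above can be sketched as a small function. This illustrates the described behaviour and is not the gateway's code: resolve_model and its data structures are hypothetical, "fallback-7b" is a made-up model id, and returning nothing for an inactive real model_id is an assumption.

```python
from typing import Dict, List, Optional, Set


def resolve_model(
    requested: str,
    real_models: Dict[str, bool],        # model_id -> is_active
    aliases: Dict[str, List[str]],       # alias -> ordered chain of model_ids
    allowed_models: Optional[Set[str]],  # None means the token is unrestricted
) -> Optional[str]:
    """Return the real model_id for `requested`, or None if nothing resolves."""

    def usable(model_id: str) -> bool:
        active = real_models.get(model_id, False)
        permitted = allowed_models is None or model_id in allowed_models
        return active and permitted

    # Real model_ids always win over aliases.
    if requested in real_models:
        return requested if usable(requested) else None

    # Otherwise walk the alias chain and take the first usable entry.
    for candidate in aliases.get(requested, []):
        if usable(candidate):
            return candidate
    return None


real_models = {"qwen3.5-35b": False, "fallback-7b": True}  # primary deactivated
aliases = {"chat": ["qwen3.5-35b", "fallback-7b"]}
resolved = resolve_model("chat", real_models, aliases, allowed_models=None)
```

Here the deactivated primary is skipped and "chat" resolves to the second entry in its chain. Resolution happens once per request, so a failure of the chosen model mid-call is not retried against the next entry.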

List available models

GET /v1/models

List models available to your token, across every model scope (chat completions and embeddings). OpenAI-compatible list format with extra display metadata. Any valid API token may call this — the result is filtered by your token’s allowedModels.

Request

Optional query parameters:

scope · string
Narrow to one model scope: "completions" (chat) or "embeddings". Omit to list both. An unrecognised scope returns an empty list.
service_id · string
Narrow to a single service (see GET /v1/services). Omit to span every service.
bash
# All models
curl https://api.inferada.com/v1/models -H "Authorization: Bearer inf_YOUR_TOKEN"

# Only embedding models
curl "https://api.inferada.com/v1/models?scope=embeddings" -H "Authorization: Bearer inf_YOUR_TOKEN"

Response · 200

OpenAI-compatible list. Each model has standard fields plus extra metadata (scope, display_name, description, aliases).

json
{
  "object": "list",
  "data": [
    {
      "id": "qwen3.5-35b",
      "object": "model",
      "created": 1704900000,
      "owned_by": "inferada",
      "scope": "completions",
      "display_name": "Qwen 3.5 35B",
      "description": "General-purpose chat model with 128K context.",
      "aliases": ["chat", "smart"]
    }
  ]
}

The scope field tells a chat model apart from an embedding model. The aliases array lists every alias key that resolves to this model (whether as primary or fallback). See Model aliases. If your token has an allowedModels restriction, this list is filtered accordingly.
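
As a rough illustration, the query parameters and the aliases array can be handled like this. The helper names models_url and alias_index are assumptions; the host comes from the curl examples above.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

BASE_URL = "https://api.inferada.com"


def models_url(scope: Optional[str] = None, service_id: Optional[str] = None) -> str:
    """Build the /v1/models URL, adding only the filters that were given."""
    params = {k: v for k, v in (("scope", scope), ("service_id", service_id)) if v}
    query = urlencode(params)
    return f"{BASE_URL}/v1/models" + (f"?{query}" if query else "")


def alias_index(model_list: dict) -> Dict[str, List[str]]:
    """Map each alias to the model ids that list it, in response order."""
    index: Dict[str, List[str]] = {}
    for model in model_list.get("data", []):
        for alias in model.get("aliases", []):
            index.setdefault(alias, []).append(model["id"])
    return index


listing = {
    "object": "list",
    "data": [
        {"id": "qwen3.5-35b", "object": "model", "scope": "completions", "aliases": ["chat", "smart"]}
    ],
}
index = alias_index(listing)
```

Because an alias is listed on every model in its chain, one alias can map to several ids in this index; the order here is response order, not necessarily chain order.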

Retrieve a model

GET /v1/models/{model}

Retrieve a single model by id (or by one of its aliases). Same entry shape as the list endpoint. Returns 404 if the id is unknown or not accessible to your token.

Response · 200

json
{
  "id": "qwen3.5-35b",
  "object": "model",
  "created": 1704900000,
  "owned_by": "inferada",
  "scope": "completions",
  "display_name": "Qwen 3.5 35B",
  "description": "General-purpose chat model with 128K context.",
  "aliases": ["chat", "smart"]
}

Attaching models in the portal

Models are configured per completions service. From Organisation → Models, an org admin has two ways to add them:

  • Add model: opens the full single-model form, where you pick a template, pick an upstream, fill in model_id, override defaults, and configure tools or limits.
  • Add from templates: a modal that takes one upstream plus a multi-select of global / org templates and creates them all in one transaction. Each new model uses the template's upstream_model_id as its addressable model_id, and inheritable defaults (system prompt, parameters, processors, inferada namespace) fall through to the template at request time. Templates whose model_id is already attached to the chosen service are disabled in the picker; use Clone on the existing model row to make a variant instead.

Both flows enforce the per-service unique model_id constraint and refresh the catalogue cache after a successful save.