API Reference

Chat Completions

OpenAI-compatible chat completions with our extensions for processors and detailed timing metadata.

Overview

The /v1/chat/completions endpoint accepts any payload that works with the OpenAI Chat Completions API. We add a single optional extension object, inferada, to the request, and enrich the response with a timings block and optional processor metadata. All other fields pass straight through to the model.

Create a chat completion

POST /v1/chat/completions

Scope: completions

Create a chat completion. Both streaming (SSE) and buffered responses are supported.

Request

model (required, string): ID of the model to use. Use /v1/models to discover available ones.
messages (required, array): Array of messages with role (system / user / assistant / tool) and content. The system role is rendered server-side per the model's configured template_render_mode; see Prompt Templating.
stream (boolean): Stream the response as SSE. Defaults to true. Automatically disabled if any processor requires buffering.
temperature (number): Sampling temperature. Standard OpenAI semantics.
top_p (number): Nucleus sampling.
max_tokens (number): Max tokens to generate in the completion.
stop (string | string[]): Stop sequences.
frequency_penalty (number): Standard OpenAI parameter.
presence_penalty (number): Standard OpenAI parameter.
tools (array): Tool / function definitions the model can call.
tool_choice (string | object): How the model picks tools.
response_format (object): For example, { "type": "json_object" } for structured output.
inferada (object): Our extension namespace. Carries processors (string[]), an optional language string, and processor-specific config. See the Processors guide for the catalogue and per-processor shape.

json

{
  "model": "qwen3.5-35b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 1000
}
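The request above can be assembled and sent from Python roughly as follows. This is a minimal sketch using only the standard library. The base URL, the token value, bearer-token authentication, and the helper names build_chat_payload and create_chat_completion are assumptions made for illustration and are not defined by this reference.

```python
import json
import urllib.request

# Hypothetical values: substitute your deployment's base URL and token.
BASE_URL = "https://api.example.com"
API_TOKEN = "YOUR_TOKEN"


def build_chat_payload(model, messages, stream=False, processors=None,
                       language=None, **params):
    """Assemble a request body.

    Extra keyword arguments (temperature, max_tokens, ...) are copied into
    the payload unchanged. The inferada object is only added when at least
    one processor is requested.
    """
    payload = {"model": model, "messages": messages, "stream": stream}
    payload.update(params)
    if processors:
        inferada = {"processors": list(processors)}
        if language is not None:
            inferada["language"] = language
        payload["inferada"] = inferada
    return payload


def create_chat_completion(payload):
    """POST the payload and return the decoded JSON body (buffered only)."""
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is sent as a bearer token.
            "Authorization": "Bearer " + API_TOKEN,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The builder sets stream to false explicitly because the server-side default is true and create_chat_completion expects a single JSON body rather than an event stream.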

Response · 200

Buffered response. Standard OpenAI shape with two additions: timings and inferada.

id (string): Completion ID. Matches the request_id.
choices (array): Array of completion choices (OpenAI shape).
usage (object): Token counts: prompt_tokens, completion_tokens, total_tokens.
timings (object): Our extension. Latency breakdown (total_ms, upstream_ms, tokens_per_second, and, when applicable, processors_ms, queued_ms, and gateway_ms). See Timings.
inferada (object): Per-processor metadata (only present when processors ran). See the Processors guide.

json

{
  "id": "chatcmpl-550e8400-e29b-41d4-a716-446655440000",
  "object": "chat.completion",
  "created": 1704900000,
  "model": "qwen3.5-35b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  },
  "timings": {
    "total_ms": 870,
    "upstream_ms": 820,
    "tokens_per_second": 38.5
  }
}
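A client typically needs only a few of these fields. The sketch below pulls them out of a decoded buffered response. The helper name summarize_completion is hypothetical, and the inferada key is read with a default because it is only present when processors ran.

```python
def summarize_completion(body):
    """Extract commonly used fields from a decoded buffered response."""
    choice = body["choices"][0]
    return {
        "id": body["id"],
        "text": choice["message"].get("content"),
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": body.get("usage", {}).get("total_tokens"),
        # None when no processors ran for this request.
        "processor_metadata": body.get("inferada"),
    }


# Abbreviated copy of the example response from this reference.
example = {
    "id": "chatcmpl-550e8400-e29b-41d4-a716-446655440000",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33},
    "timings": {"total_ms": 870, "upstream_ms": 820, "tokens_per_second": 38.5},
}

summary = summarize_completion(example)
```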

Errors

Possible errors: 400 (invalid body / unknown processor), 401 (missing/invalid token), 403 (scope / IP), 404 (model not found), 422 (processor policy rejection — see Processors — or missing/unsupported inferada.language), 429 (rate or concurrency limit), 500 (service failure). See Errors and Rate Limits.
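One way a client might act on these status codes is sketched below. The meanings are copied from the list above. The retry policy (retrying only 429 and 500, with capped exponential backoff) is an assumption of this sketch and not something the API prescribes; the helper names are hypothetical. Consult Errors and Rate Limits for the authoritative guidance.

```python
# Status codes and meanings as listed in this reference.
ERROR_MEANINGS = {
    400: "invalid body or unknown processor",
    401: "missing or invalid token",
    403: "scope or IP restriction",
    404: "model not found",
    422: "processor policy rejection, or missing/unsupported inferada.language",
    429: "rate or concurrency limit",
    500: "service failure",
}

# Assumption: only throttling and server-side failures are worth retrying.
RETRYABLE_STATUSES = frozenset({429, 500})


def classify_error(status):
    """Return (meaning, retryable) for an HTTP status code."""
    meaning = ERROR_MEANINGS.get(status, "unexpected status")
    return meaning, status in RETRYABLE_STATUSES


def backoff_seconds(attempt, base=0.5, cap=8.0):
    """Exponential backoff delay for a zero-based attempt number, capped."""
    return min(cap, base * (2 ** attempt))
```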

Streaming

When stream is true (or omitted, since true is the default), the response is sent as Server-Sent Events with Content-Type: text/event-stream. Each chunk is an OpenAI-compatible delta frame. The stream ends with a [DONE] marker.

text

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Some processors require the full response and will cause streaming to be disabled for that request. See the Processors guide for details.
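The frames shown above can be reassembled into the full message with a small parser. This is a simplified sketch: it handles only single-line data: fields, which is all the example frames use, and it stops at the [DONE] marker. The function names are hypothetical.

```python
import json


def iter_sse_data(lines):
    """Yield decoded JSON chunks from SSE lines until the [DONE] marker."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_text(lines):
    """Concatenate delta content and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in iter_sse_data(lines):
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# The three example frames from this section, followed by the end marker.
frames = [
    'data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text, finish_reason = collect_text(frames)
```

Because a processor can disable streaming for a request, a robust client would likely check the response Content-Type before choosing between this parser and ordinary JSON decoding.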

Processors

Processors are opt-in steps that wrap your completion (redaction, spellcheck, PII / injection guards, routing policy). The catalogue, request shape, response metadata, scope requirements, and limitations are documented in the Processors guide.

Timings

Every buffered completion includes a timings block useful for debugging latency:

total_ms (number): Total end-to-end latency. Includes queue wait and all processing.
upstream_ms (number): Time the model itself spent generating.
tokens_per_second (number): Throughput computed from completion tokens and upstream_ms. Only present when tokens > 0.
processors_ms (number): Measured aggregate of pre + post processor time. Only present when processors ran; see Processors.
queued_ms (number): Time spent waiting for a concurrency slot before the request started. Only present when greater than 0. A non-zero value means your workload is hitting a max_concurrent rule. Streaming responses expose the same value in the x-inferada-queued-ms header.
gateway_ms (number): Residual time spent in the gateway itself (validation, serialization, accounting writes). Only present when non-zero.
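Since several of these fields are omitted when they are zero, client code should read them with defaults. The sketch below normalizes a timings block and reports any time not covered by the listed components. The helper name and the unattributed_ms value are inventions of this sketch; the reference does not state that the components sum exactly to total_ms.

```python
# Fields this reference describes as present only under certain conditions.
OPTIONAL_FIELDS = ("processors_ms", "queued_ms", "gateway_ms")


def timing_breakdown(timings):
    """Return a timings dict with absent optional fields filled in as 0."""
    breakdown = {
        "total_ms": timings["total_ms"],
        "upstream_ms": timings.get("upstream_ms", 0),
    }
    for name in OPTIONAL_FIELDS:
        breakdown[name] = timings.get(name, 0)
    accounted = breakdown["upstream_ms"] + sum(
        breakdown[name] for name in OPTIONAL_FIELDS
    )
    # Not an API field: whatever total_ms includes beyond the known parts.
    breakdown["unattributed_ms"] = breakdown["total_ms"] - accounted
    # Left as None when the server omitted it (no completion tokens).
    breakdown["tokens_per_second"] = timings.get("tokens_per_second")
    return breakdown


# The timings block from the example response above.
example_breakdown = timing_breakdown(
    {"total_ms": 870, "upstream_ms": 820, "tokens_per_second": 38.5}
)
```

A queued_ms value above zero in the result is the signal to look at your concurrency limits rather than at model latency.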

List available models

GET /v1/models

Scope: completions

List models available to your token. OpenAI-compatible list format with extra display metadata.

Response · 200

OpenAI-compatible list. Each model has standard fields plus extra metadata (display_name, description).

json

{
  "object": "list",
  "data": [
    {
      "id": "qwen3.5-35b",
      "object": "model",
      "created": 1704900000,
      "owned_by": "inferada",
      "display_name": "Qwen 3.5 35B",
      "description": "General-purpose chat model with 128K context."
    }
  ]
}

If your token has an allowedModels restriction, this list is filtered accordingly.
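Because the list can be filtered per token, a client may want to confirm that a model is actually available before using it. A minimal sketch, operating on the decoded response body; the helper names and the fallback model ID in the example are hypothetical.

```python
def available_model_ids(listing):
    """Return the IDs of all models in a decoded /v1/models response."""
    return [
        entry["id"]
        for entry in listing.get("data", [])
        if entry.get("object") == "model"
    ]


def pick_model(listing, preferred):
    """Return the first preferred ID visible to this token, or None."""
    ids = set(available_model_ids(listing))
    for candidate in preferred:
        if candidate in ids:
            return candidate
    return None


# The example response from this section.
listing = {
    "object": "list",
    "data": [
        {
            "id": "qwen3.5-35b",
            "object": "model",
            "created": 1704900000,
            "owned_by": "inferada",
            "display_name": "Qwen 3.5 35B",
            "description": "General-purpose chat model with 128K context.",
        }
    ],
}

# "some-other-model" is a made-up ID used only to show the fallback order.
chosen = pick_model(listing, ["some-other-model", "qwen3.5-35b"])
```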