Rate Limits

A reference for how limits are expressed and what a 429 looks like. The limits themselves live in the portal — you set them there, alongside your spend caps and allowances.

Limit hierarchy

Limits are how you keep usage and spend under control for teams, users, and individual API tokens. They're all configured in the portal, right next to the corresponding entity. Every request is checked against up to five levels, in order:

  1. Service — limits attached to the service (e.g. the completions service).
  2. Model — per-model limits on a service model.
  3. Organisation — org-wide limits.
  4. User — limits on the authenticated user.
  5. Token — limits set directly on the API token.

If any level is exceeded, the request is rejected with a 429. The response body tells you which level and which rule blocked it.
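As a rough illustration of the ordering above, the sketch below walks the five levels in sequence and reports the first one whose rule would be exceeded. The function name `first_breach` and the layouts of `rules_by_level` and `usage` are assumptions made for this example, not the gateway's actual implementation. It also assumes that a rule is breached when the current counter plus the request's increment goes above max, and it treats every rule as an aggregated counter for brevity.

```python
# Levels in the order the guide lists them.
LEVELS = ["service", "model", "organisation", "user", "token"]


def first_breach(rules_by_level, usage, requested):
    """Return (level, rule) for the first rule that would be exceeded, else None.

    rules_by_level: {level: [rule, ...]}                 (hypothetical layout)
    usage:          {(level, metric, period): counter}   (hypothetical layout)
    requested:      {metric: amount this request would add}
    """
    for level in LEVELS:
        for rule in rules_by_level.get(level, []):
            metric = rule["metric"]
            current = usage.get((level, metric, rule["period"]), 0)
            # Assumption: breach means current + increment is above max.
            if current + requested.get(metric, 0) > rule["max"]:
                return level, rule
    return None
```

Because the loop stops at the first breach, a request that would exceed both a model rule and a user rule is reported against the model level.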

Limit definition

Limits are expressed as rules with a metric, a period and a max. Your org's admins configure them per entity from the portal — no support ticket needed for day-to-day adjustments.

json
{
  "metric": "tokens",
  "period": "day",
  "max": 100000,
  "per_request": false
}
metric (required, string)
tokens, prompt_tokens, completion_tokens, requests, audio_duration_seconds, or characters_synthesised.
period (required, string)
minute, day, week or month. Defines the window over which usage is counted.
max (required, number)
Threshold above which requests are rejected.
per_request (boolean)
When true, the max applies to a single request (like a hard cap on max_tokens). When false (default), it aggregates over the period.
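If you keep rule definitions in your own tooling before entering them in the portal, a small check against the field table above can catch typos early. This is a minimal sketch: `validate_rule` is a hypothetical helper, not part of any SDK, and it only covers periodic rules of the shape shown here.

```python
# Allowed values copied from the field descriptions above.
METRICS = {
    "tokens", "prompt_tokens", "completion_tokens",
    "requests", "audio_duration_seconds", "characters_synthesised",
}
PERIODS = {"minute", "day", "week", "month"}


def validate_rule(rule):
    """Return a list of problems found in a periodic rule (empty if none)."""
    errors = []
    if rule.get("metric") not in METRICS:
        errors.append("metric must be one of the documented metrics")
    if rule.get("period") not in PERIODS:
        errors.append("period must be minute, day, week or month")
    max_value = rule.get("max")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_value, bool) or not isinstance(max_value, (int, float)):
        errors.append("max must be a number")
    if not isinstance(rule.get("per_request", False), bool):
        errors.append("per_request must be a boolean")
    return errors
```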

429 — Limit exceeded

json
{
  "type": "limit_exceeded",
  "code": 429,
  "request_id": "...",
  "scope": "completions",
  "model_id": "qwen3.5-35b",
  "level": "model",
  "limit": {
    "metric": "tokens",
    "period": "day",
    "max": 100000,
    "per_request": false
  },
  "current": 99500,
  "requested": 1000
}
type (string)
Always "limit_exceeded" for this class of error.
level (string)
Which level breached: service, model, organisation, user or token.
limit (object)
The exact rule that blocked the request.
current (number)
Current counter value in the rule’s period.
requested (number)
Incremental amount the request would have added.

Clients should back off until the period resets. For day limits that's the next midnight UTC; for minute limits it's under a minute.
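A client can turn that guidance into a wait time by reading limit.period from the 429 body. The sketch below is one possible approach, with some assumptions: `seconds_until_reset` is a hypothetical helper, the 60-second value for minute limits is an upper bound rather than an exact reset time, and week and month periods return None because their reset times are not stated here. It also assumes that waiting does not help when the blocking rule is a per-request cap.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_reset(body, now=None):
    """Estimate how long to back off after a limit_exceeded 429.

    Returns None when the wait cannot be derived from the body.
    """
    if body.get("type") != "limit_exceeded":
        return None
    limit = body.get("limit", {})
    if limit.get("per_request"):
        # Assumption: the request itself is too large, so waiting won't help.
        return None
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    period = limit.get("period")
    if period == "day":
        # Day limits reset at the next midnight UTC.
        next_midnight = (now + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        return (next_midnight - now).total_seconds()
    if period == "minute":
        return 60.0  # upper bound: the reset is under a minute away
    return None  # week and month reset times are not documented here
```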

429 — Concurrency limit

Concurrency limits are separate from periodic ones: they cap how many requests can be running at the same time. They can be set at any of the five levels (service, model, organisation, user, token) and each level gets its own bucket — a user-level cap doesn't count against the org-level cap. New requests wait up to wait_timeout_ms for a slot; if none frees, the request is rejected with:

json
{
  "type": "concurrency_limit",
  "code": 429,
  "error": "Concurrency limit reached: 5 concurrent completions requests allowed at user level. Waited 30000ms.",
  "scope": "completions",
  "level": "user",
  "max_concurrent": 5,
  "waited_ms": 30000,
  "request_id": "..."
}

Define one with metric: "max_concurrent" and max: N (no period). The optional wait_timeout_ms controls how long a request queues before it is rejected with a 429 (default 30000).

json
{
  "metric": "max_concurrent",
  "max": 5,
  "wait_timeout_ms": 30000
}

The difference versus a regular limit: your token isn't over quota, the system is simply saturated right now. Retrying after a short delay usually succeeds.
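One way to act on this is to retry only when the 429 body has type "concurrency_limit", with a short, growing delay between attempts. The sketch below assumes a hypothetical `send` callable that returns a (status_code, body) pair; the attempt count, base delay and jitter are illustrative choices, not recommended values from the service.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() and retry only on concurrency_limit 429s.

    send: hypothetical callable returning (status_code, body_dict).
    Any other response, including limit_exceeded, is returned immediately.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        retryable = status == 429 and body.get("type") == "concurrency_limit"
        if not retryable or attempt == max_attempts - 1:
            return status, body
        # Exponential backoff with a little jitter to avoid retrying in lockstep.
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.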

When a request does acquire its slot, the wait time is reported back so you can tell a slow response apart from a queued one:

  • Buffered completions: timings.queued_ms (only present when greater than 0).
  • Streaming completions: x-inferada-queued-ms response header (always present, value in milliseconds). The same header is also set on buffered responses.

timings.upstream_ms remains the pure model-generation time and timings.processors_ms is measured (not a residual), so total_ms - upstream_ms - processors_ms - queued_ms is the gateway's own overhead, reported as gateway_ms when non-zero.
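The arithmetic above can be reproduced on the client side, for example to cross-check timings in logs. This is a sketch under a few assumptions: `total_ms` is taken to live in the same timings object as the other fields, `headers` is assumed to be a plain dict with lowercase keys, and both function names are hypothetical.

```python
def queued_ms(timings=None, headers=None):
    """Read the queue wait from the response header if present, else from timings."""
    if headers and "x-inferada-queued-ms" in headers:
        return float(headers["x-inferada-queued-ms"])
    # timings.queued_ms is only present when greater than 0.
    return (timings or {}).get("queued_ms", 0)


def gateway_overhead_ms(timings):
    """total_ms minus upstream, processors and queue time."""
    return (
        timings["total_ms"]
        - timings["upstream_ms"]
        - timings["processors_ms"]
        - timings.get("queued_ms", 0)
    )
```

For a response with total_ms 1200, upstream_ms 900, processors_ms 50 and queued_ms 200, the remaining overhead is 50 ms.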

Per-request vs period limits

A rule with per_request: true acts as a hard cap on a single call. For example, a per-request tokens rule with a max of 4096 effectively caps max_tokens at that value regardless of what the client sent. A rule with per_request: false is a running total that accumulates over the period and resets when the period rolls over.
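The two modes differ only in what gets compared against max. The sketch below shows that difference; `would_exceed` is a hypothetical helper, and it assumes a period rule is exceeded when the running total plus the new amount goes above max. It only answers whether an amount is over the cap and makes no claim about how the gateway handles the request afterwards.

```python
def would_exceed(rule, current, requested):
    """Return True if `requested` would put the rule over its max.

    current:   counter value so far in the rule's period
    requested: amount this single request would add
    """
    if rule.get("per_request", False):
        # Per-request cap: only this call's own amount matters.
        return requested > rule["max"]
    # Period rule: the running total plus this call's amount matters.
    return current + requested > rule["max"]
```

With the values from the limit_exceeded example (max 100000, current 99500, requested 1000) the period rule is exceeded, while a per-request rule of 4096 would let a 4000-token call through no matter how high the running total is.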

See also