Usage Accounting

What we count per scope, how those counts are exposed, and how to read the request history for detail.

Two views of usage

You can query your own usage in two complementary ways:

  • Live totals — aggregated counters per period, optimised for quick dashboards and budget checks. Exposed via /v1/account/usage.
  • Request history — one entry per request, with full detail of what was sent, how long it took, and what was counted. Exposed via /v1/account/history.

There is also a third, principal-agnostic view: /v1/me lets any token — a portal session, an API token, or an agent token — introspect itself. See Self-introspection below.

Metrics per scope

Each scope counts what matters for that service — token counts for completions, characters for text-to-speech, and so on.

Scope           Metrics
completions     tokens, prompt_tokens, completion_tokens
language        requests, corrections
pii             requests, entities_found, entities_redacted, replacements_made
prompt_shield   requests, detections_count
stt             requests, audio_duration_seconds
tts             requests, characters_synthesised, output_audio_seconds

Live totals: /v1/account/usage

GET /v1/account/usage

Aggregated counters for the authenticated user, grouped per scope and per model.

Request

period (string)
minute, day (default), week or month. Defines the time window.
scope (string)
Optional filter, e.g. completions.
model_dimension (string)
Group the `models` map by the org-level profile id (default `profile`, e.g. `qwen3.5-35b:chat`) or by the upstream base model id used for billing (`base`, e.g. `qwen3.5-35b`). Multiple profiles roll up onto one base in `base` mode.

Response · 200

json
// model_dimension=profile (default)
{
  "period": "day",
  "model_dimension": "profile",
  "scopes": {
    "completions": {
      "tokens": 5200,
      "prompt_tokens": 3100,
      "completion_tokens": 2100
    }
  },
  "models": {
    "qwen3.5-35b:chat": {
      "tokens": 5200,
      "prompt_tokens": 3100,
      "completion_tokens": 2100
    }
  }
}

// model_dimension=base (billing-aligned)
{
  "period": "day",
  "model_dimension": "base",
  "scopes": { /* unchanged */ },
  "models": {
    "qwen3.5-35b": {
      "tokens": 5200,
      "prompt_tokens": 3100,
      "completion_tokens": 2100
    }
  }
}

Reading the response

  • scopes is keyed by scope name. Each value is a metric-count map matching the table above.
  • models is keyed by model ID — the addressable profile id by default (e.g. qwen3.5-35b:chat), or the upstream base model id when model_dimension=base. Only completions traffic contributes, since only completions track per-model counts. The echoed model_dimension field tells you which lens the rollup used.
  • Empty {} objects mean no usage was recorded in that period. Older periods drop off after a while — for hard audit records, query the request history instead.
  • Note: for the completions scope, a requests count is not part of these totals — use the request history for that. For language, pii, stt and tts it is counted here.
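
The points above can be sketched in a few lines of Python. This is an illustration only: the base URL `https://api.example.com` and the helper names `usage_url`, `tokens_by_model` and `has_usage` are placeholders that do not come from the API, and authentication is left out. The payload is the `model_dimension=profile` sample shown earlier.

```python
from urllib.parse import urlencode


def usage_url(base_url, period="day", scope=None, model_dimension="profile"):
    """Build a /v1/account/usage URL from the documented query parameters."""
    params = {"period": period, "model_dimension": model_dimension}
    if scope is not None:
        params["scope"] = scope
    return f"{base_url}/v1/account/usage?{urlencode(params)}"


def tokens_by_model(response):
    """Map each key of the `models` map to its `tokens` count (0 if absent)."""
    return {model: counts.get("tokens", 0)
            for model, counts in response.get("models", {}).items()}


def has_usage(response, scope):
    """A missing or empty {} scope object means nothing was recorded."""
    return bool(response.get("scopes", {}).get(scope))


# Sample payload copied from the response above.
sample = {
    "period": "day",
    "model_dimension": "profile",
    "scopes": {
        "completions": {"tokens": 5200, "prompt_tokens": 3100, "completion_tokens": 2100}
    },
    "models": {
        "qwen3.5-35b:chat": {"tokens": 5200, "prompt_tokens": 3100, "completion_tokens": 2100}
    },
}

# base_url is a placeholder host, not a real endpoint.
print(usage_url("https://api.example.com", scope="completions"))
print(sample["model_dimension"], tokens_by_model(sample))
print(has_usage(sample, "completions"), has_usage(sample, "tts"))
```

Checking the echoed `model_dimension` before reading `models` avoids mixing profile ids and base model ids in the same chart.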

Request history: /v1/account/history

GET /v1/account/history

Paginated per-request audit log for the authenticated user.

Request

page (number)
Page number (1-based). Default 1.
per_page (number)
Page size. Default 50, max 100.
scope (string)
Filter by scope.
model (string)
Filter by model ID.
status (string)
One of: success (completed), rejected (refused by a processor or by the upstream for content/policy reasons; not a system failure), error (network or upstream failure), aborted (client disconnected mid-request).

Response · 200

json
{
  "data": [
    {
      "id": "uuid",
      "user_id": "uuid",
      "scope": "completions",
      "model_id": "qwen3.5-35b",
      "endpoint": "/v1/chat/completions",
      "usage": {
        "prompt_tokens": 200,
        "completion_tokens": 150,
        "processors": {
          "anonymise": { "entities_redacted": 2 }
        }
      },
      "latency_ms": 870,
      "status": "success",
      "stream": false,
      "timestamp": "2026-04-15T14:30:00.000Z"
    }
  ],
  "meta": {
    "total": 200,
    "per_page": 50,
    "current_page": 1,
    "last_page": 4
  }
}
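
A minimal sketch of walking the paginated log by following `meta.last_page`, for example to count completions requests, which the live totals do not report. The `fetch_page` callable is hypothetical: it stands in for whatever HTTP client you use and is assumed to take the documented query parameters as a dict and return the decoded JSON body. `fake_fetch` below is an in-memory stand-in so the example runs without a network.

```python
from collections import Counter


def iter_history(fetch_page, per_page=50, **filters):
    """Yield every history entry, page by page, until meta.last_page."""
    page = 1
    while True:
        body = fetch_page({"page": page, "per_page": per_page, **filters})
        yield from body["data"]
        if page >= body["meta"]["last_page"]:
            break
        page += 1


def count_by_status(entries):
    """Tally entries by their documented `status` value."""
    return Counter(entry["status"] for entry in entries)


# In-memory stand-in for the endpoint (illustration only; ignores filters).
ROWS = ([{"scope": "completions", "status": "success"}] * 3
        + [{"scope": "completions", "status": "rejected"}] * 2)


def fake_fetch(params):
    size = params["per_page"]
    start = (params["page"] - 1) * size
    last_page = max(1, -(-len(ROWS) // size))  # ceiling division
    return {
        "data": ROWS[start:start + size],
        "meta": {"total": len(ROWS), "per_page": size,
                 "current_page": params["page"], "last_page": last_page},
    }


entries = list(iter_history(fake_fetch, per_page=2, scope="completions"))
print(len(entries), dict(count_by_status(entries)))
```

Stopping on `last_page` rather than on an empty `data` array saves one request per walk, at the cost of trusting the `meta` block.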

Timezone. timestamp is the absolute UTC instant. The portal renders it (and chart period boundaries) in your profile timezone — change under Profile → Settings, or override per completion via inferada.timezone.

Self-introspection: /v1/me

/v1/me answers "who am I, what may I do, and what have I spent?" for the calling token itself. Unlike /v1/account (portal- or API-token only), it also accepts an agent token (infa_), so an agent can check its own permissions, the limits imposed on it, and its own usage. The token is read from Authorization; there is no way to introspect another principal, and a token only ever sees its own limits. This is the surface the @inferada/agent-kit SDK calls as inferada.token.info() and inferada.token.usage().

GET /v1/me

Identity, organisation, the calling token (with its own imposed limits), and the permission keys this caller may exercise.

Response · 200

json
{
  "principal": "agent",            // or "user"
  "id": "agt_7f3c…",               // user id or agent id
  "organisation": { "id": "uuid", "name": "Acme", "slug": "acme" },
  "token": {
    "id": 42,
    "name": "prod-agent",
    "abilities": ["completions"],  // the token's coarse scopes
    "expires_at": null,            // ISO-8601 or null
    "limits": [
      { "metric": "requests", "period": "minute", "max": 60 }
    ]
  },
  "permissions": ["completions.create", "models.list"]
}

permissions vs. abilities

  • token.abilities are the coarse scopes the token was minted with (e.g. completions, pii).
  • permissions are the fine-grained keys the caller may actually exercise. For a user that is the role's permissions intersected with the token's scopes; for an agent (which has no role) it is the scopes expanded directly, so completions becomes completions.create + models.list.
  • token.limits is the caller's own imposed limits, each { metric, period?, max, per_request? }. Empty when none are set.
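
As a rough illustration of reading these three fields from a /v1/me response, the sketch below checks a fine-grained permission and looks up the limits for one metric. The helper names `can` and `limits_for` are made up for this example and are not part of any SDK; the payload is a trimmed copy of the sample response above.

```python
def can(me, permission):
    """True if the fine-grained key is in the caller's `permissions` list."""
    return permission in me.get("permissions", [])


def limits_for(me, metric):
    """Return the caller's own limits for one metric (empty list if none)."""
    limits = me.get("token", {}).get("limits", [])
    return [limit for limit in limits if limit.get("metric") == metric]


# Trimmed copy of the sample /v1/me response.
me = {
    "principal": "agent",
    "token": {
        "name": "prod-agent",
        "abilities": ["completions"],
        "expires_at": None,
        "limits": [{"metric": "requests", "period": "minute", "max": 60}],
    },
    "permissions": ["completions.create", "models.list"],
}

print(can(me, "completions.create"))  # coarse ability expanded to this key
print(can(me, "pii.redact"))          # hypothetical key, not granted here
print(limits_for(me, "requests"))
```

Checking `permissions` rather than `token.abilities` matters for user tokens, where the role can narrow what the coarse scopes would otherwise allow.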

GET /v1/me/usage

The calling principal’s own usage for the window — identical shape to /v1/account/usage, attributed by user_id or agent_id depending on who is calling.

Request

period (string)
minute, day (default), week or month. Defines the time window.
from (string)
Explicit ISO-8601 window start.
to (string)
Explicit ISO-8601 window end.
scope (string)
Optional filter, e.g. completions.
model_dimension (string)
Group the `models` map by profile id (default `profile`) or upstream base model id (`base`).

Response · 200

json
{
  "period": "day",
  "from": "2026-06-03T00:00:00.000Z",
  "to": "2026-06-03T14:30:00.000Z",
  "model_dimension": "profile",
  "scopes": { "completions": { "requests": 1, "tokens": 15 } },
  "models": {}
}
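
One way an agent might combine /v1/me and /v1/me/usage is to estimate how much of a limit is left. The sketch below rests on an assumption this page does not confirm: that a limit on a metric is compared against the sum of that metric across all scopes for the same period. The `headroom` helper is hypothetical, and `per_request` limits are ignored.

```python
def headroom(limit, usage):
    """Remaining allowance for one limit, given usage for the same period.

    ASSUMPTION: the limit applies to the metric summed over every scope.
    The gateway's actual enforcement rule may differ.
    """
    if "period" in limit and limit["period"] != usage.get("period"):
        raise ValueError("limit and usage cover different periods")
    metric = limit["metric"]
    used = sum(counts.get(metric, 0)
               for counts in usage.get("scopes", {}).values())
    return limit["max"] - used


# A limit in the shape of token.limits, and a usage response for period=minute.
limit = {"metric": "requests", "period": "minute", "max": 60}
usage = {
    "period": "minute",
    "model_dimension": "profile",
    "scopes": {"completions": {"requests": 1, "tokens": 15}},
    "models": {},
}

print(headroom(limit, usage))
```

Treat the result as advisory: the gateway remains the authority on whether the next request is accepted.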

Live in-flight requests: /v1/live-requests

The portal's organisation request log shows currently-streaming requests on top of the persistent history. These come from a Redis-backed tracker the gateway updates on every scope-bearing call.

GET /v1/live-requests

Currently in-flight gateway requests, sourced from the Redis live tracker.

Request

organisation_id (uuid, super-admin only)
Scope to a specific tenant. Non-super-admins are pinned to their own org regardless.
user_id (uuid)
Filter to one user. Applied server-side after the org index is consulted.
scope (enum)
One of completions, language, pii, stt, tts. Live entries are POST-only scope-bearing calls; reads and admin chatter never appear here.
limit (number)
1–500. Default 100, sorted newest-first by start time.

Response · 200

json
{
  "data": [
    {
      "id": "uuid",
      "organisation_id": "uuid",
      "user_id": "uuid",
      "scope": "completions",
      "model_id": "qwen3.5-35b",
      "endpoint": "/v1/chat/completions",
      "timestamp": "2026-05-05T10:31:16.000Z",
      "elapsed_ms": 4120,
      "input_tokens": 12739,
      "output_tokens": 217,
      "status": "in_progress"
    }
  ]
}

input_tokens is a chars/4 estimate stamped once at request start; output_tokens updates on a debounce (every ~750 ms or 2 KB of generated text). Both are estimates; the authoritative counts land on the persistent history row when the request finishes. The portal renders live rows with a tilde prefix and an italic style to flag the estimate, then refetches the persistent list ~2 s after a request disappears so the row flows naturally into the history.
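
To make the two estimates concrete, here is a small sketch of a chars/4 estimator and a time-or-size debounce. It is not the gateway's code. The rounding (floor division), the reading of "2 KB" as 2048 bytes, and the names `estimate_input_tokens` and `OutputDebounce` are all assumptions made for illustration.

```python
def estimate_input_tokens(text):
    """chars/4 estimate; rounding down is an assumption."""
    return len(text) // 4


class OutputDebounce:
    """Signal a flush every `interval_ms` or every `max_bytes` of new text."""

    def __init__(self, interval_ms=750, max_bytes=2048, start_ms=0):
        self.interval_ms = interval_ms
        self.max_bytes = max_bytes  # "2 KB" read as 2048 bytes (assumption)
        self.last_flush_ms = start_ms
        self.pending_bytes = 0

    def add(self, chunk, now_ms):
        """Record a generated chunk; return True when an update is due."""
        self.pending_bytes += len(chunk.encode("utf-8"))
        due = (now_ms - self.last_flush_ms >= self.interval_ms
               or self.pending_bytes >= self.max_bytes)
        if due:
            self.last_flush_ms = now_ms
            self.pending_bytes = 0
        return due


print(estimate_input_tokens("x" * 50956))  # 50956 chars -> 12739

debounce = OutputDebounce()
print(debounce.add("a" * 100, now_ms=100))    # too soon, too small
print(debounce.add("a" * 2000, now_ms=200))   # size threshold crossed
print(debounce.add("a", now_ms=300))          # counters were reset
print(debounce.add("a", now_ms=1000))         # time threshold crossed
```

Either trigger resets both counters, so a fast stream updates by size and a slow one by time.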

The q (free-text search) and status filters on the request log are applied client-side: there's no Redis index to back them, and live entries are always in_progress by definition. user_id and scope are forwarded server-side.
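
A client that consumes /v1/live-requests directly would have to apply q and status itself. The sketch below shows one way to do that over the `data` array. Which fields the portal actually searches is not documented here, so matching q against every field value is an assumption, and `filter_live_rows` is a hypothetical helper name.

```python
def filter_live_rows(rows, q="", status=None):
    """Client-side filter: exact match on status, substring match on q.

    ASSUMPTION: q is matched case-insensitively against every field value.
    """
    needle = q.strip().lower()
    kept = []
    for row in rows:
        if status is not None and row.get("status") != status:
            continue
        if needle and not any(needle in str(value).lower()
                              for value in row.values()):
            continue
        kept.append(row)
    return kept


rows = [
    {"scope": "completions", "model_id": "qwen3.5-35b",
     "endpoint": "/v1/chat/completions", "status": "in_progress"},
    {"scope": "pii", "model_id": None,
     "endpoint": "/v1/pii/redact", "status": "in_progress"},  # hypothetical path
]

print(len(filter_live_rows(rows, q="qwen")))
print(len(filter_live_rows(rows, status="in_progress")))
print(len(filter_live_rows(rows, status="success")))
```

Because live entries are always in_progress, any other status value filters out every live row, which is why the filter only makes sense once live and persistent rows are merged.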

Shape of the usage object per scope

The usage field on a request history entry reflects what was counted for that specific call. The shape varies by scope:

completions

json
{
  "prompt_tokens": 200,
  "completion_tokens": 150,
  "processors": {
    "anonymise": { "entities_redacted": 2 }
  }
}

processors is only present when processors ran. Each processor adds its own keys under that object — see the Processors guide for the per-processor shape.

Some processors also persist forensic fields at the top level of usage (outside processors) on rejected requests — for example prompt_shield_detections. These are server-side only and never echoed to the client.

language

json
{ "requests": 1, "corrections": 2 }

pii (analyse)

json
{ "requests": 1, "entities_found": 3 }

pii (redact / restore)

json
{ "requests": 1, "entities_redacted": 2, "replacements_made": 2 }

stt

json
{ "requests": 1, "audio_duration_seconds": 4.2 }

tts

json
{ "requests": 1, "characters_synthesised": 45, "output_audio_seconds": 3.2 }
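
Since older periods drop off the live totals, you may want to rebuild per-scope counts from history entries. The sketch below sums the numeric top-level fields of each `usage` object and skips the nested `processors` object. Two assumptions are made: `tokens` is derived as prompt_tokens + completion_tokens (consistent with the 5200 = 3100 + 2100 sample earlier on this page, but not stated as a rule), and any other numeric top-level field is summed as is. `totals_from_history` is a hypothetical helper.

```python
from collections import defaultdict


def totals_from_history(entries):
    """Sum numeric usage fields per scope, mimicking the `scopes` map."""
    totals = defaultdict(lambda: defaultdict(int))
    for entry in entries:
        scope_totals = totals[entry["scope"]]
        for key, value in entry["usage"].items():
            # Skip nested objects such as `processors`, and booleans.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                scope_totals[key] += value
    if "completions" in totals:
        completions = totals["completions"]
        # ASSUMPTION: tokens = prompt_tokens + completion_tokens.
        completions["tokens"] = (completions["prompt_tokens"]
                                 + completions["completion_tokens"])
    return {scope: dict(metrics) for scope, metrics in totals.items()}


entries = [
    {"scope": "completions",
     "usage": {"prompt_tokens": 200, "completion_tokens": 150,
               "processors": {"anonymise": {"entities_redacted": 2}}}},
    {"scope": "tts",
     "usage": {"requests": 1, "characters_synthesised": 45,
               "output_audio_seconds": 3.2}},
    {"scope": "tts",
     "usage": {"requests": 1, "characters_synthesised": 10,
               "output_audio_seconds": 1.0}},
]

print(totals_from_history(entries))
```

The result is an approximation of the live totals, not a replacement: the gateway may apply rules that this page does not describe.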

Latency and tokens-per-second

Two timing values live on every request_logs row and on the per-bucket time-series:

  • Total latency (latency_ms): wall-clock from request entry to finalisation. Includes queue time, pre/post processors, tool execution, the upstream LLM call, and gateway overhead.
  • Upstream latency (upstream_ms): the LLM-only window. For completions this is the model's eval time; for service scopes (audio, PII, languagetool) it's the upstream HTTP call time. Introduced by migration 0064.

Dashboards prefer p50_latency_ms / p95_latency_ms over avg_latency_ms: LLM traffic is tail-heavy and a single slow request can drag a simple average up by 10x. tokens_per_second on every time-series bucket is volume-weighted: SUM(completion_tokens) / SUM(upstream_ms) * 1000. That matches the per-response timings.tokens_per_second we emit on every completion, which covers the LLM-eval window only, not wall-clock; otherwise it would systematically under-report whenever a request also paid PII / queue cost.
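
A worked example with made-up numbers shows why both choices matter. The volume-weighted formula is the one quoted above; the nearest-rank percentile is an assumption, since this page does not say which percentile method the dashboards use.

```python
import math


def weighted_tps(rows):
    """SUM(completion_tokens) / SUM(upstream_ms) * 1000, as quoted above."""
    total_tokens = sum(row["completion_tokens"] for row in rows)
    total_ms = sum(row["upstream_ms"] for row in rows)
    return total_tokens / total_ms * 1000 if total_ms else 0.0


def mean_of_rates(rows):
    """Unweighted mean of per-request rates, shown for contrast."""
    rates = [row["completion_tokens"] / row["upstream_ms"] * 1000
             for row in rows]
    return sum(rates) / len(rates)


def percentile(values, p):
    """Nearest-rank percentile (an assumed method, for illustration)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative rows: a short fast request and a long slow one.
rows = [
    {"completion_tokens": 100, "upstream_ms": 1000},   # 100 tokens/s
    {"completion_tokens": 900, "upstream_ms": 18000},  # 50 tokens/s
]
print(round(weighted_tps(rows), 2), round(mean_of_rates(rows), 2))

# Nine ordinary requests and one very slow outlier.
latencies_ms = [800] * 9 + [80000]
average = sum(latencies_ms) / len(latencies_ms)
print(percentile(latencies_ms, 50), average)
```

The weighted figure (about 52.6 tokens/s) sits close to the long request because that is where most tokens were generated, while the unweighted mean (75) overstates throughput. In the latency sample, the single outlier lifts the average to 8720 ms while the median stays at 800 ms.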

Platform performance (super-admin)

Super-admins have a dedicated /admin/performance page covering rolling-window performance signals: requests/min, error rate, p95 latency, live in-flight requests, per-upstream-template error rollups across every organisation, per-instance upstream status, and a detailed errors panel with top messages + a 24h trend. An organisation scope selector at the top of the page lets a super-admin virtually "step into" any org without leaving the page — the per-template rollup is hidden when narrowed (it loses its meaning) and the org-only "top users" + "burn rate" cards appear in its place.

Org admins get the same shape under /organisation/:orgId/performance, scoped to their own organisation only. KPIs refresh every 30 seconds; the live in-flight requests panel polls every 2 seconds.

The compact "Platform performance" block on /admin/overview shows the four headline KPIs at a glance with a deep-link into the full page.

See also

  • Rate Limits — how these counts are used to enforce limits.
  • Processors — each processor adds to its own scope's usage.