Usage Accounting
What we count per scope, how those counts are exposed, and how to read the request history for detail.
Two views of usage
You can query your own usage in two complementary ways:
- Live totals: aggregated counters per period, optimised for quick dashboards and budget checks. Exposed via /v1/account/usage.
- Request history: one entry per request, with full detail of what was sent, how long it took, and what was counted. Exposed via /v1/account/history.
There is also a third, principal-agnostic view: /v1/me lets any token — a portal session, an API token, or an agent token — introspect itself. See Self-introspection below.
Metrics per scope
Each scope counts what matters for that service — token counts for completions, characters for text-to-speech, and so on.
- completions (scope): tokens, prompt_tokens, completion_tokens
- language (scope): requests, corrections
- pii (scope): requests, entities_found, entities_redacted, replacements_made
- prompt_shield (scope): requests, detections_count
- stt (scope): requests, audio_duration_seconds
- tts (scope): requests, characters_synthesised, output_audio_seconds
Live totals: /v1/account/usage
/v1/account/usage
Aggregated counters for the authenticated user, grouped per scope and per model.
Request
- period (string): minute, day (default), week or month. Defines the time window.
- scope (string): Optional filter, e.g. completions.
- model_dimension (string): Group the `models` map by the org-level profile id (default `profile`, e.g. `qwen3.5-35b:chat`) or by the upstream base model id used for billing (`base`, e.g. `qwen3.5-35b`). Multiple profiles roll up onto one base in `base` mode.
Response · 200
// model_dimension=profile (default)
{
"period": "day",
"model_dimension": "profile",
"scopes": {
"completions": {
"tokens": 5200,
"prompt_tokens": 3100,
"completion_tokens": 2100
}
},
"models": {
"qwen3.5-35b:chat": {
"tokens": 5200,
"prompt_tokens": 3100,
"completion_tokens": 2100
}
}
}
// model_dimension=base (billing-aligned)
{
"period": "day",
"model_dimension": "base",
"scopes": { /* unchanged */ },
"models": {
"qwen3.5-35b": {
"tokens": 5200,
"prompt_tokens": 3100,
"completion_tokens": 2100
}
}
}
Reading the response
- scopes is keyed by scope name. Each value is a metric-count map matching the metrics listed above.
- models is keyed by model ID: the addressable profile id by default (e.g. qwen3.5-35b:chat), or the upstream base model id when model_dimension=base. Only completions traffic contributes, since only completions track per-model counts. The echoed model_dimension field tells you which lens the rollup used.
- Empty {} objects mean no usage was recorded in that period. Older periods drop off after a while; for hard audit records, query the request history instead.
- Note: for the completions scope, a requests count is not part of these totals; use the request history for that. For language, pii, stt and tts it is counted here.
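As a rough illustration of reading these totals, the sketch below works on an already-parsed response body and does no network I/O. The helper names (`scope_metric`, `within_budget`, `model_share`) and the budget figure are hypothetical and not part of the API; the sample payload mirrors the profile-mode response shown earlier.

```python
# Sample body mirroring the model_dimension=profile response in this section.
SAMPLE = {
    "period": "day",
    "model_dimension": "profile",
    "scopes": {
        "completions": {
            "tokens": 5200,
            "prompt_tokens": 3100,
            "completion_tokens": 2100,
        }
    },
    "models": {
        "qwen3.5-35b:chat": {
            "tokens": 5200,
            "prompt_tokens": 3100,
            "completion_tokens": 2100,
        }
    },
}


def scope_metric(usage: dict, scope: str, metric: str) -> int:
    """Return one counter; a missing scope or an empty {} counts as zero."""
    return usage.get("scopes", {}).get(scope, {}).get(metric, 0)


def within_budget(usage: dict, scope: str, metric: str, budget: int) -> bool:
    """Hypothetical budget check: True while the counter has not passed `budget`."""
    return scope_metric(usage, scope, metric) <= budget


def model_share(usage: dict, metric: str = "tokens") -> dict:
    """Fraction of `metric` per key of the models map (empty map gives {})."""
    models = usage.get("models", {})
    total = sum(counts.get(metric, 0) for counts in models.values())
    if total == 0:
        return {}
    return {name: counts.get(metric, 0) / total for name, counts in models.items()}


if __name__ == "__main__":
    print(scope_metric(SAMPLE, "completions", "tokens"))
    print(within_budget(SAMPLE, "completions", "tokens", budget=10_000))
    print(model_share(SAMPLE))
```

Treating a missing scope as zero keeps dashboard code free of special cases for periods with no recorded usage.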
Request history: /v1/account/history
/v1/account/history
Paginated per-request audit log for the authenticated user.
Request
- page (number): Page number (1-based). Default 1.
- per_page (number): Page size. Default 50, max 100.
- scope (string): Filter by scope.
- model (string): Filter by model ID.
- status (string): One of: success (completed), rejected (refused by a processor or by the upstream for content/policy reasons; not a system failure), error (network or upstream failure), aborted (client disconnected mid-request).
Response · 200
{
"data": [
{
"id": "uuid",
"user_id": "uuid",
"scope": "completions",
"model_id": "qwen3.5-35b",
"endpoint": "/v1/chat/completions",
"usage": {
"prompt_tokens": 200,
"completion_tokens": 150,
"processors": {
"anonymise": { "entities_redacted": 2 }
}
},
"latency_ms": 870,
"status": "success",
"stream": false,
"timestamp": "2026-04-15T14:30:00.000Z"
}
],
"meta": {
"total": 200,
"per_page": 50,
"current_page": 1,
"last_page": 4
}
}
Timezone. timestamp is the absolute UTC instant. The portal renders it (and chart period boundaries) in your profile timezone; change it under Profile → Settings, or override per completion via inferada.timezone.
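One way to walk the whole history is to follow meta.last_page until it is reached. The sketch below assumes a caller-supplied `fetch_page(page, per_page)` callable that wraps the request to /v1/account/history and returns the parsed body; that callable, `iter_history`, and the in-memory `fake_fetch` stand-in are all hypothetical names used for illustration only.

```python
from typing import Callable, Iterator


def iter_history(
    fetch_page: Callable[[int, int], dict], per_page: int = 50
) -> Iterator[dict]:
    """Yield every history entry, page by page, until meta.last_page."""
    if not 1 <= per_page <= 100:
        # The endpoint documents a maximum page size of 100.
        raise ValueError("per_page must be between 1 and 100")
    page = 1
    while True:
        body = fetch_page(page, per_page)
        yield from body.get("data", [])
        if page >= body["meta"]["last_page"]:
            break
        page += 1


def fake_fetch(page: int, per_page: int) -> dict:
    """Stand-in for the real HTTP call: serves 120 fabricated rows."""
    rows = [{"id": f"req-{i}", "status": "success"} for i in range(120)]
    start = (page - 1) * per_page
    last_page = max(1, -(-len(rows) // per_page))  # ceiling division
    return {
        "data": rows[start:start + per_page],
        "meta": {
            "total": len(rows),
            "per_page": per_page,
            "current_page": page,
            "last_page": last_page,
        },
    }


if __name__ == "__main__":
    entries = list(iter_history(fake_fetch))
    print(len(entries))
```

Passing the fetcher in keeps the paging logic independent of any particular HTTP client or authentication scheme.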
Self-introspection: /v1/me
/v1/me answers "who am I, what may I do, and what have I spent?" for the calling token itself. Unlike /v1/account (portal- or API-token only), it also accepts an agent token (infa_), so an agent can check its own permissions, the limits imposed on it, and its own usage. The token is read from Authorization; there is no way to introspect another principal, and a token only ever sees its own limits. This is the surface the @inferada/agent-kit SDK calls as inferada.token.info() and inferada.token.usage().
/v1/me
Identity, organisation, the calling token (with its own imposed limits), and the permission keys this caller may exercise.
Response · 200
{
"principal": "agent", // or "user"
"id": "agt_7f3c…", // user id or agent id
"organisation": { "id": "uuid", "name": "Acme", "slug": "acme" },
"token": {
"id": 42,
"name": "prod-agent",
"abilities": ["completions"], // the token's coarse scopes
"expires_at": null, // ISO-8601 or null
"limits": [
{ "metric": "requests", "period": "minute", "max": 60 }
]
},
"permissions": ["completions.create", "models.list"]
}
permissions vs. abilities
- token.abilities are the coarse scopes the token was minted with (e.g. completions, pii).
- permissions are the fine-grained keys the caller may actually exercise. For a user that is the role's permissions intersected with the token's scopes; for an agent (which has no role) it is the scopes expanded directly, so completions becomes completions.create + models.list.
- token.limits is the caller's own imposed limits, each { metric, period?, max, per_request? }. Empty when none are set.
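The relationship between abilities and permissions can be sketched as set operations. This is an interpretation rather than the gateway's actual logic: it assumes "intersected with the token's scopes" means intersecting the role's keys with the expansion of those scopes. Only the completions mapping comes from the text above; the pii keys and the function names are hypothetical.

```python
from typing import Iterable, Optional, Set

# Scope -> fine-grained permission keys.
SCOPE_PERMISSIONS = {
    "completions": {"completions.create", "models.list"},  # stated in the text
    "pii": {"pii.analyse", "pii.redact"},  # hypothetical keys, illustration only
}


def expand_scopes(abilities: Iterable[str]) -> Set[str]:
    """Expand coarse token scopes into fine-grained permission keys."""
    expanded: Set[str] = set()
    for scope in abilities:
        expanded |= SCOPE_PERMISSIONS.get(scope, set())
    return expanded


def effective_permissions(
    principal: str,
    abilities: Iterable[str],
    role_permissions: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Agents get the expanded scopes; users get role keys limited to them."""
    expanded = expand_scopes(abilities)
    if principal == "agent":
        # An agent has no role, so the scopes are expanded directly.
        return expanded
    return expanded & set(role_permissions or ())


if __name__ == "__main__":
    print(sorted(effective_permissions("agent", ["completions"])))
    print(sorted(effective_permissions(
        "user", ["completions"], ["completions.create", "pii.analyse"]
    )))
```

Under these assumptions a user token can never exceed either its role or its scopes, while an agent token is bounded by its scopes alone.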
/v1/me/usage
The calling principal’s own usage for the window, identical in shape to /v1/account/usage and attributed by user_id or agent_id depending on who is calling.
Request
- period (string): minute, day (default), week or month. Defines the time window.
- from (string): Explicit ISO-8601 window start.
- to (string): Explicit ISO-8601 window end.
- scope (string): Optional filter, e.g. completions.
- model_dimension (string): Group the `models` map by profile id (default `profile`) or upstream base model id (`base`).
Response · 200
{
"period": "day",
"from": "2026-06-03T00:00:00.000Z",
"to": "2026-06-03T14:30:00.000Z",
"model_dimension": "profile",
"scopes": { "completions": { "requests": 1, "tokens": 15 } },
"models": {}
}
Live in-flight requests: /v1/live-requests
The portal's organisation request log shows currently-streaming requests on top of the persistent history. These come from a Redis-backed tracker the gateway updates on every scope-bearing call.
/v1/live-requests
Currently in-flight gateway requests, sourced from the Redis live tracker.
Request
- organisation_id (uuid, super-admin only): Scope to a specific tenant. Non-super-admins are pinned to their own org regardless.
- user_id (uuid): Filter to one user. Applied server-side after the org index is consulted.
- scope (enum): One of completions, language, pii, stt, tts. Live entries are POST-only scope-bearing calls; reads and admin chatter never appear here.
- limit (number): 1–500. Default 100, sorted newest-first by start time.
Response · 200
{
"data": [
{
"id": "uuid",
"organisation_id": "uuid",
"user_id": "uuid",
"scope": "completions",
"model_id": "qwen3.5-35b",
"endpoint": "/v1/chat/completions",
"timestamp": "2026-05-05T10:31:16.000Z",
"elapsed_ms": 4120,
"input_tokens": 12739,
"output_tokens": 217,
"status": "in_progress"
}
]
}
input_tokens is a chars/4 estimate stamped once at request start; output_tokens updates on a debounce (every ~750 ms or 2 KB of generated text). Both are estimates; the authoritative counts land on the persistent history row when the request finishes. The portal renders live rows with a tilde prefix and an italic style to flag the estimate, then refetches the persistent list ~2 s after a request disappears so the row flows naturally into the history.
The q (free-text search) and status filters on the request log are applied client-side: there's no Redis index to back them, and live entries are always in_progress by definition. user_id and scope are forwarded server-side.
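The estimate-and-debounce behaviour described above can be sketched as follows. This is a guess at the mechanics, not the gateway's implementation: the class and method names are hypothetical, the chars/4 rule is assumed to apply to output as well as input, rounding down is an assumption, and the clock is passed in explicitly so the logic stays deterministic.

```python
class LiveEstimate:
    """Tracks rough token estimates for one in-flight request."""

    FLUSH_INTERVAL_MS = 750  # "every ~750 ms" from the text
    FLUSH_BYTES = 2048       # "2 KB of generated text" from the text

    def __init__(self, prompt_text: str, now_ms: int) -> None:
        # Stamped once at request start and never revised afterwards.
        self.input_tokens = len(prompt_text) // 4
        # Last published estimate; lags behind the true stream position.
        self.output_tokens = 0
        self._generated_chars = 0
        self._pending_bytes = 0
        self._last_flush_ms = now_ms

    def on_chunk(self, text: str, now_ms: int) -> bool:
        """Record a streamed chunk; return True when the estimate was published."""
        self._generated_chars += len(text)
        self._pending_bytes += len(text.encode("utf-8"))
        interval_due = now_ms - self._last_flush_ms >= self.FLUSH_INTERVAL_MS
        size_due = self._pending_bytes >= self.FLUSH_BYTES
        if not (interval_due or size_due):
            return False
        self.output_tokens = self._generated_chars // 4
        self._pending_bytes = 0
        self._last_flush_ms = now_ms
        return True


if __name__ == "__main__":
    estimate = LiveEstimate("a" * 400, now_ms=0)
    for chunk, at_ms in [("x" * 100, 100), ("x" * 100, 800), ("y" * 2048, 810)]:
        published = estimate.on_chunk(chunk, at_ms)
        print(at_ms, published, estimate.output_tokens)
```

Publishing on whichever threshold trips first bounds both the staleness of the displayed figure and the number of writes to the tracker during fast streams.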
Shape of the usage object per scope
The usage field on a request history entry reflects what was counted for that specific call. The shape varies by scope:
completions
{
"prompt_tokens": 200,
"completion_tokens": 150,
"processors": {
"anonymise": { "entities_redacted": 2 }
}
}
processors is only present when processors ran. Each processor adds its own keys under that object; see the Processors guide for the per-processor shape.
Some processors also persist forensic fields at the top level of usage (outside processors) on rejected requests — for example prompt_shield_detections. These are server-side only and never echoed to the client.
language
{ "requests": 1, "corrections": 2 }
pii (analyse)
{ "requests": 1, "entities_found": 3 }
pii (redact / restore)
{ "requests": 1, "entities_redacted": 2, "replacements_made": 2 }
stt
{ "requests": 1, "audio_duration_seconds": 4.2 }
tts
{ "requests": 1, "characters_synthesised": 45, "output_audio_seconds": 3.2 }
Latency and tokens-per-second
Two timing values live on every request_logs row and on the per-bucket time-series:
- Total latency (latency_ms): wall-clock from request entry to finalisation. Includes queue time, pre/post processors, tool execution, the upstream LLM call, and gateway overhead.
- Upstream latency (upstream_ms): the LLM-only window. For completions this is the model's eval time; for service scopes (audio, PII, languagetool) it's the upstream HTTP call time. Introduced by migration 0064.
Dashboards prefer p50_latency_ms / p95_latency_ms over avg_latency_ms: LLM traffic is tail-heavy and a single slow request can drag a simple average up by 10x. tokens_per_second on every time-series bucket is volume-weighted: SUM(completion_tokens) / SUM(upstream_ms) * 1000. That matches the per-response timings.tokens_per_second we emit on every completion (LLM-eval window only, not wall-clock; otherwise it would systematically under-report whenever a request also paid PII / queue cost).
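A small worked example may make both points concrete. The sketch below uses made-up rows and hypothetical helper names; the nearest-rank percentile is one common definition and is not necessarily the method the dashboards use.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def volume_weighted_tps(rows: List[dict]) -> float:
    """SUM(completion_tokens) / SUM(upstream_ms) * 1000 over one bucket."""
    total_tokens = sum(r["completion_tokens"] for r in rows)
    total_ms = sum(r["upstream_ms"] for r in rows)
    return 1000 * total_tokens / total_ms if total_ms else 0.0


def mean_of_rates(rows: List[dict]) -> float:
    """Unweighted average of per-request rates, shown only for contrast."""
    rates = [
        1000 * r["completion_tokens"] / r["upstream_ms"]
        for r in rows
        if r["upstream_ms"]
    ]
    return sum(rates) / len(rates) if rates else 0.0


# Fabricated bucket: nine fast requests and one very slow one.
LATENCIES_MS = [800] * 9 + [80_000]

# Fabricated bucket: one quick short answer, one long slow generation.
ROWS = [
    {"completion_tokens": 100, "upstream_ms": 1_000},
    {"completion_tokens": 10, "upstream_ms": 10_000},
]

if __name__ == "__main__":
    print("avg", sum(LATENCIES_MS) / len(LATENCIES_MS))
    print("p50", percentile(LATENCIES_MS, 50))
    print("p95", percentile(LATENCIES_MS, 95))
    print("weighted tps", volume_weighted_tps(ROWS))
    print("mean of rates", mean_of_rates(ROWS))
```

In the latency sample the single 80-second request lifts the average to roughly eleven times the median, while p50 stays at the typical value and p95 surfaces the outlier. In the throughput sample the unweighted mean of rates is dominated by the short request, whereas the volume-weighted figure reflects the time the upstream actually spent generating.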
Platform performance (super-admin)
Super-admins have a dedicated /admin/performance page covering rolling-window performance signals: requests/min, error rate, p95 latency, live in-flight requests, per-upstream-template error rollups across every organisation, per-instance upstream status, and a detailed errors panel with top messages + a 24h trend. An organisation scope selector at the top of the page lets a super-admin virtually "step into" any org without leaving the page — the per-template rollup is hidden when narrowed (it loses its meaning) and the org-only "top users" + "burn rate" cards appear in its place.
Org admins get the same shape under /organisation/:orgId/performance, scoped to their own organisation only. KPIs refresh every 30 seconds; the live in-flight requests panel polls every 2 seconds.
The compact "Platform performance" block on /admin/overview shows the four headline KPIs at a glance with a deep-link into the full page.
See also
- Rate Limits — how these counts are used to enforce limits.
- Processors — each processor adds to its own scope's usage.