System prompt assembly

The gateway combines three sources of system-prompt content into one upstream system prompt. The order is deterministic, the cacheable prefix is preserved, and the caller has the last word.

The assembly order

For every chat-completions request the gateway assembles the upstream system prompt as:

text

[ Model default — gateway-configured, cacheable ]
   ---
[ Processor segments — anonymise rules, haystack data, ... ]
   ---
[ Caller's system messages — last word ]

→ user / assistant messages

The three slots are joined with \n\n---\n\n separators into a single string, which the gateway passes to the upstream provider through its native system parameter: the Anthropic system field, a system-role message at the top of messages for OpenAI / openai-compat, or Gemini systemInstruction. A single combined system prompt is the most compatible shape across providers, including local LLMs whose chat templates handle multiple system messages unreliably.
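The join described above can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's implementation: the function name assemble_system_prompt and the choice to skip empty slots are assumptions. Only the slot order and the separator come from this guide.

```python
SEPARATOR = "\n\n---\n\n"  # separator stated in this guide


def assemble_system_prompt(model_default: str, processor_segments: str, caller_system: str) -> str:
    """Join the three slots in the documented order.

    Hypothetical helper: skipping empty slots is an assumption,
    not confirmed gateway behaviour.
    """
    slots = [model_default, processor_segments, caller_system]
    return SEPARATOR.join(slot for slot in slots if slot)


print(assemble_system_prompt("Be safe.", "Haystack rows go here.", "Answer briefly."))
```

Keeping the result as one string is what lets the same output feed any provider's system parameter.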

Why this order

Model default first: operators configure it once in the portal for persona, tooling rules, or a safety floor. It is stable across requests, so it sits at the top of the cacheable prefix for Anthropic and OpenAI prefix caching.
Processor segments next: anonymise placeholder rules, the structured haystack block, etc. Per-request data sits below the stable rules so the cacheable prefix isn't invalidated by varying processor output.
Caller content last: your application gets to apply final-word rules on top of the gateway's additions. Useful for personality, formatting, or request-specific instructions.

Always preserved

Older versions of the gateway dropped the model default whenever a caller supplied a system message. That's no longer the case: the model default is always merged in. If you need the old behaviour for a specific request, opt out explicitly:

json

{
  "model": "claude-sonnet-4-6:chat",
  "messages": [
    { "role": "system", "content": "Your call's system content." },
    { "role": "user", "content": "Hello." }
  ],
  "inferada": {
    "system_mode": "replace_default"
  }
}

With system_mode: "replace_default" the gateway skips the model default; processor segments and your caller content go upstream as-is.

inferada.system_mode (string): 'merge_default' (the default, also applied when the field is absent) preserves the model default. 'replace_default' skips it.
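The effect of the two modes on which slots go upstream can be sketched as follows. The function select_slots is hypothetical, and raising on an unrecognised value is an assumption. This guide does not say how the gateway treats unknown modes.

```python
from typing import List, Optional


def select_slots(mode: Optional[str], model_default: str, processors: str, caller: str) -> List[str]:
    """Return the slots that go upstream, in order, for a given system_mode.

    Hypothetical helper. Rejecting unknown modes is an assumption.
    """
    if mode is None or mode == "merge_default":
        # Default behaviour: the model default is always included.
        return [model_default, processors, caller]
    if mode == "replace_default":
        # Opt-out: the model default is skipped, the rest is unchanged.
        return [processors, caller]
    raise ValueError(f"unknown system_mode: {mode!r}")


print(select_slots("replace_default", "Be safe.", "Haystack rows.", "Answer briefly."))
```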

Multiple caller system messages

You can include more than one role: "system" message in your messages array. The gateway concatenates them in the order you supplied and emits the result as the caller slot of the assembled prompt. The upstream still sees one combined system message.
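A minimal sketch of that collection step, assuming each message's content is a plain string and that caller system messages are joined with a blank line. The joiner between caller messages is not specified in this guide, and collect_caller_system is a hypothetical name.

```python
from typing import Dict, List, Tuple


def collect_caller_system(messages: List[Dict[str, str]]) -> Tuple[str, List[Dict[str, str]]]:
    """Split a messages array into (combined caller system text, remaining messages).

    Assumption: system messages are joined with "\n\n". Order is preserved.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    remaining = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), remaining


messages = [
    {"role": "system", "content": "Use British spelling."},
    {"role": "user", "content": "Hello."},
    {"role": "system", "content": "Answer in one sentence."},
]
caller_system, conversation = collect_caller_system(messages)
print(caller_system)
```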

Cacheable prefix

Anthropic and OpenAI both byte-prefix-cache the system prompt. The assembly order keeps the most stable part at the front:

Stable: the model default, which is identical across requests until an operator changes it.
Variable: per-request processor segments, typically the haystack block, since its data changes per request.

The cache hit covers the prefix up to where the first varying processor segment begins. If the model default is byte-identical between requests, you retain that cached prefix regardless of how the per-request haystack varies. Caller system content sits after the processor segments, so it extends the cached prefix only when the processor output ahead of it is also byte-identical.
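To see where the shared prefix ends, the sketch below compares two assembled prompts byte for byte when only the processor segment differs. The sample strings and the helper shared_prefix_bytes are illustrative assumptions. Real provider caches also apply their own minimum lengths and granularity, which this does not model.

```python
SEPARATOR = "\n\n---\n\n"  # separator stated in this guide


def shared_prefix_bytes(a: str, b: str) -> int:
    """Count the leading UTF-8 bytes that two prompts have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


MODEL_DEFAULT = "You are a helpful assistant."  # illustrative value
CALLER = "Answer briefly."                      # illustrative value

prompt_a = SEPARATOR.join([MODEL_DEFAULT, "alpha rows", CALLER])
prompt_b = SEPARATOR.join([MODEL_DEFAULT, "beta rows", CALLER])

# The prompts agree through the model default and its trailing separator,
# then diverge at the first byte of the processor segment.
print(shared_prefix_bytes(prompt_a, prompt_b))
```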

Inspecting the assembled prompt

The portal's playground exposes an "assembled-system preview" endpoint that runs your request through the same pipeline (templating, processor prepareBag, processor preProcess) and returns the final assembled string with per-segment attribution. Useful for debugging unexpected behaviour or verifying the order:

bash

POST /v1/playground/assembled-system

{
  "token_id": <api-token-id>,
  "model": "claude-sonnet-4-6:chat",
  "messages": [...],
  "inferada": { "processors": ["anonymise", "haystack"], "language": "en", "haystack": { ... } }
}

Response: { assembled_system, segments: { model_default, processors: [{name, content}], caller, mode }, total_bytes, active_processors }.
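Once you have a preview response, a short script can summarise the per-segment attribution. The sketch below assumes the segment fields are strings (or null when a slot is skipped) and uses a made-up sample payload. Only the field names come from the response shape above, and segment_sizes is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def segment_sizes(preview: Dict[str, Any]) -> List[Tuple[str, int]]:
    """List (segment label, UTF-8 byte length) in assembly order."""
    segments = preview["segments"]
    rows = [("model_default", len((segments.get("model_default") or "").encode("utf-8")))]
    for proc in segments.get("processors", []):
        rows.append(("processor:" + proc["name"], len(proc["content"].encode("utf-8"))))
    rows.append(("caller", len((segments.get("caller") or "").encode("utf-8"))))
    return rows


# Illustrative payload only; not captured from a real gateway.
sample = {
    "segments": {
        "model_default": "Be safe.",
        "processors": [{"name": "haystack", "content": "rows"}],
        "caller": "Be brief.",
        "mode": "merge_default",
    },
}

for label, size in segment_sizes(sample):
    print(label, size)
```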