API Reference

Audio

Convert speech to text and text to speech. Request shapes are OpenAI-compatible, so existing clients work out of the box.

Transcribe audio

POST /v1/audio/transcriptions
Scope: stt

Speech-to-text. Send the audio file as a multipart upload and get back the transcribed text.

Request

Content-Type: multipart/form-data

file · required · binary
Audio file. Common formats such as WAV, MP3, FLAC, OGG and M4A are accepted.
language · string
ISO language hint to improve accuracy (e.g. "en", "nl").
prompt · string
Optional text hint to bias recognition toward custom vocabulary, proper nouns, or a specific style. Capped at 896 characters; longer prompts are rejected with a 422.
response_format · string
Output format. JSON by default.
service_id · string
Optional. Route to a specific STT service (see GET /v1/services). Falls back to the org’s default STT service when omitted.

Tip: set language to the correct ISO code and pass a short prompt listing names, jargon or product terms for the best transcription accuracy.

bash
curl https://api.inferada.com/v1/audio/transcriptions \
  -H "Authorization: Bearer inf_YOUR_TOKEN" \
  -F file=@speech.mp3 \
  -F language=en \
  -F prompt='Acme Corp, Inferada'

Response · 200

textstring
Transcribed text.
durationnumber
Audio duration in seconds (when available).
json
{
  "text": "Hello, this is a transcription.",
  "duration": 4.2
}
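
For clients without a multipart helper, the upload can be assembled by hand. The following is a minimal Python sketch using only the standard library; it builds the request without sending it. The function name `build_transcription_request`, the generic `application/octet-stream` part type, and the client-side check against the 896-character prompt cap (counted the way Python's `len()` counts) are illustrative assumptions, not part of the API.

```python
import uuid
import urllib.request

TRANSCRIPTIONS_URL = "https://api.inferada.com/v1/audio/transcriptions"
MAX_PROMPT_CHARS = 896  # documented cap; longer prompts are rejected with a 422


def build_transcription_request(token, filename, audio_bytes, language=None, prompt=None):
    """Build (but do not send) a multipart/form-data transcription request."""
    # Assumption: failing fast locally is preferable to a round trip that ends in a 422.
    if prompt is not None and len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters; the cap is {MAX_PROMPT_CHARS}")

    boundary = uuid.uuid4().hex
    parts = []

    # Optional text fields; a field left as None is simply not sent.
    for name, value in (("language", language), ("prompt", prompt)):
        if value is not None:
            header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            parts.append((header + value + "\r\n").encode("utf-8"))

    # The audio goes in the "file" part as raw bytes.
    # Assumption: filename contains no double quotes or line breaks.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + audio_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        TRANSCRIPTIONS_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; with the default response format, the body is the JSON object shown above.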

Translate audio

POST /v1/audio/translations
Scope: stt

Speech-to-text that always renders the output in English, whatever the source language. Same request shape as transcriptions, minus the language hint (the source is auto-detected).

Request

Content-Type: multipart/form-data

file · required · binary
Audio file. Common formats such as WAV, MP3, FLAC, OGG and M4A are accepted.
prompt · string
Optional text hint to bias recognition toward custom vocabulary or proper nouns. Capped at 896 characters; longer prompts are rejected with a 422.
response_format · string
Output format. JSON by default.
service_id · string
Optional. Route to a specific STT service (see GET /v1/services). Falls back to the org’s default STT service when omitted.
bash
curl https://api.inferada.com/v1/audio/translations \
  -H "Authorization: Bearer inf_YOUR_TOKEN" \
  -F file=@dutch-speech.mp3

Response · 200

textstring
English translation of the speech.
durationnumber
Audio duration in seconds (when available).
json
{
  "text": "Hello, this is a translation.",
  "duration": 4.2
}
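
Transcriptions and translations share the same default JSON response, and `duration` is only present when available, so a client should treat it as optional. A small Python sketch of that handling follows; the `Transcript` type and `parse_transcript` helper are hypothetical names, and the sketch only covers the default JSON output format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    text: str
    duration: Optional[float]  # seconds; None when the service did not report it


def parse_transcript(body):
    """Parse the default JSON response of either STT endpoint."""
    payload = json.loads(body)
    duration = payload.get("duration")  # may be missing
    return Transcript(
        text=payload["text"],
        duration=float(duration) if duration is not None else None,
    )
```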

Synthesise speech

POST /v1/audio/speech
Scope: tts

Generate speech audio from text. Returns a binary audio stream in the requested format.

Request

Content-Type: application/json

voice · required · string
Voice ID. List available voices with GET /v1/audio/voices.
input · required · string
Text to synthesise.
response_format · string
mp3 (default), opus, flac, aac, wav or pcm.
speed · number
Playback speed multiplier. Default 1.0.
service_id · string
Optional. Route to a specific TTS service (see GET /v1/services). Falls back to the org’s default TTS service when omitted.
model · string
Optional model identifier, usually inferred from the voice.
json
{
  "voice": "nl_BE-rdh-medium",
  "input": "Hallo, welkom.",
  "response_format": "mp3",
  "speed": 1.0
}

Response · 200

Binary audio stream. The Content-Type matches the requested format: audio/mpeg for mp3, audio/opus for opus, audio/flac for flac, audio/aac for aac, audio/wav for wav, and application/octet-stream for pcm.

bash
curl https://api.inferada.com/v1/audio/speech \
  -H "Authorization: Bearer inf_YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "voice": "nl_BE-rdh-medium", "input": "Hallo." }' \
  --output out.mp3
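
The same call can be sketched in Python with the standard library. The helper below builds the JSON request without sending it and carries a lookup table of the Content-Type expected for each `response_format`, taken from the list above. The names `build_speech_request` and `CONTENT_TYPES` are illustrative, and rejecting unknown formats locally is an assumption about what a client might want, not documented server behaviour.

```python
import json
import urllib.request

SPEECH_URL = "https://api.inferada.com/v1/audio/speech"

# Expected response Content-Type for each response_format.
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "opus": "audio/opus",
    "flac": "audio/flac",
    "aac": "audio/aac",
    "wav": "audio/wav",
    "pcm": "application/octet-stream",
}


def build_speech_request(token, voice, text, response_format="mp3", speed=1.0):
    """Build (but do not send) a JSON text-to-speech request."""
    if response_format not in CONTENT_TYPES:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    body = {
        "voice": voice,
        "input": text,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with `urllib.request.urlopen` and writing `response.read()` to a file in binary mode gives the same result as `--output` in the curl example.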

Errors

If voice is unknown, the response is a 400 with the list of available voices inlined so your client can recover:

json
{
  "error": "Voice 'xyz' not found. Available voices: [...]",
  "available_voices": [
    { "id": "nl_BE-rdh-medium", "name": "RDH", "language": "nl" }
  ],
  "request_id": "..."
}
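
One way a client might use the inlined list to recover is to retry with another voice in the same language. The sketch below assumes the 400 body has exactly the shape shown above; `pick_fallback_voice` is a hypothetical helper, and taking the first voice that matches the language is an arbitrary illustrative policy.

```python
import json


def pick_fallback_voice(error_body, language):
    """Return the ID of the first available voice in `language`, or None."""
    payload = json.loads(error_body)
    for voice in payload.get("available_voices", []):
        if voice.get("language") == language:
            return voice["id"]
    return None  # nothing suitable; surface the original error instead
```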

Choosing a voice

Voices come from two backends with different strengths. Pick by language:

  • Piper — lightweight and fast. Covers English (US/GB), French, German, Spanish, Dutch (NL/BE), Italian, Portuguese (BR/PT), Chinese (Mandarin) and Hindi. Voices come in low/medium/high quality tiers.
  • Kokoro — neural, more natural-sounding but slower. Use it for Japanese (Piper has no Japanese voices) or when you want a richer-sounding English voice and can afford the extra latency. All Kokoro voices report quality: "natural".

Your organisation can have either a combined TTS service (one record, both engines) or one or more single-backend services (one engine per record). Combined services prefix every voice ID with the engine (e.g. piper:nl_BE-rdh-medium, kokoro:af_bella) and route on that prefix. Single-backend services return the raw upstream ID with no prefix; pass service_id to steer a request to a specific one.
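
A client that needs to know which engine a voice belongs to can split the ID on the prefix. The following Python sketch is illustrative only: `split_voice_id` is a hypothetical helper, and it assumes that `piper` and `kokoro` are the only prefixes and that raw upstream IDs never begin with either of them followed by a colon.

```python
KNOWN_ENGINES = {"piper", "kokoro"}


def split_voice_id(voice_id):
    """Split a voice ID into (engine, raw_id); engine is None when unprefixed."""
    engine, separator, raw_id = voice_id.partition(":")
    if separator and engine in KNOWN_ENGINES:
        return engine, raw_id
    # No recognised prefix: treat the whole string as a raw upstream ID.
    return None, voice_id
```

When calling /v1/audio/speech, pass the voice ID exactly as the voices endpoint returned it, prefix included.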

Voices are auto-discovered from the upstream container at request time, so there is no manual catalogue to maintain. Adding a new voice file upstream is enough; the gateway picks it up within five minutes.

List available voices

GET /v1/audio/voices
Scope: tts

Lists the voices discovered live from the upstream container backing your TTS service. Pass ?service_id=… to query a specific service.

Request

service_id · string
Optional. Discover voices on a specific TTS service. Falls back to the default when omitted.
language · string
Optional. Filter by ISO language code, e.g. "en".

Response · 200

voices[].id · string
Unique voice identifier. Use this in /v1/audio/speech.
voices[].name · string
Human-readable display name.
voices[].language · string
ISO language code (e.g. "nl", "en").
voices[].quality · string | null
Quality tier when available.
json
{
  "voices": [
    {
      "id": "nl_BE-rdh-medium",
      "name": "RDH",
      "language": "nl",
      "quality": "medium"
    }
  ]
}
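
As a sketch of how a client might consume this endpoint, the Python below builds the query URL from the two optional parameters and picks one voice from a parsed response. The helper names are hypothetical, and the ranking in `QUALITY_RANK` (including placing "natural" above "high") is an assumption made for illustration; the API does not define an ordering between tiers.

```python
from urllib.parse import urlencode

VOICES_URL = "https://api.inferada.com/v1/audio/voices"

# Assumed preference order; a null or unknown quality sorts last.
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2, "natural": 3}


def voices_url(language=None, service_id=None):
    """Build the GET URL, including only the parameters that were given."""
    pairs = (("service_id", service_id), ("language", language))
    params = {key: value for key, value in pairs if value is not None}
    return f"{VOICES_URL}?{urlencode(params)}" if params else VOICES_URL


def best_voice(voices, language):
    """Return the ID of the highest-ranked voice in `language`, or None."""
    candidates = [v for v in voices if v.get("language") == language]
    if not candidates:
        return None
    best = max(candidates, key=lambda v: QUALITY_RANK.get(v.get("quality"), -1))
    return best["id"]
```

Here `voices` is the list under the `voices` key of the response shown above.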