API Reference

Audio

Convert speech to text and text to speech. Request shapes are OpenAI-compatible, so existing clients work out of the box.

Transcribe audio

POST /v1/audio/transcriptions

Scope: stt

Speech-to-text. Send the audio file as a multipart upload and get back the transcribed text.

Request

Content-Type: multipart/form-data

file (required, binary): Audio file. Common formats such as WAV, MP3, FLAC, OGG and M4A are accepted.
language (string): ISO language hint to improve accuracy (e.g. "en", "nl").
response_format (string): Output format. JSON by default.
service_id (string): Optional. Route to a specific STT service (see GET /v1/services). Falls back to the org’s default STT service when omitted.

bash

curl https://api.inferada.com/v1/audio/transcriptions \
  -H "Authorization: Bearer inf_YOUR_TOKEN" \
  -F file=@speech.mp3 \
  -F language=en

Response · 200

text (string): Transcribed text.
duration (number): Audio duration in seconds (when available).

json

{
  "text": "Hello, this is a transcription.",
  "duration": 4.2
}
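For clients that cannot shell out to curl, the same multipart upload can be sketched with Python's standard library. This is a minimal sketch, not an official client: the helper names build_transcription_request and transcribe are hypothetical, the base URL is copied from the curl example, and the file part is labelled application/octet-stream as an assumption about what the server accepts.

```python
import json
import os
import urllib.request
import uuid

API_BASE = "https://api.inferada.com"  # base URL taken from the curl example above


def build_transcription_request(token, filename, audio_bytes, language=None, service_id=None):
    """Build (but do not send) a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []

    # Optional text fields, one multipart part each; omitted when None.
    for name, value in (("language", language), ("service_id", service_id)):
        if value is not None:
            parts.append(
                f'--{boundary}\r\n'
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f'{value}\r\n'.encode("utf-8")
            )

    # The audio file itself, sent as raw bytes.
    # Assumption: a generic binary part type is acceptable to the server.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{API_BASE}/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(token, path, language=None):
    """Upload the file at `path` and return the decoded JSON response as a dict."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(token, os.path.basename(path), audio, language)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling transcribe("inf_YOUR_TOKEN", "speech.mp3", language="en") would then return a dict with the text field and, when available, duration. Building the request separately from sending it keeps the multipart encoding easy to inspect.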

Synthesise speech

POST /v1/audio/speech

Scope: tts

Generate speech audio from text. Returns a binary audio stream in the requested format.

Request

voice (required, string): Voice ID. List available voices with GET /v1/audio/voices.
input (required, string): Text to synthesise.
response_format (string): mp3 (default), opus, flac, aac, wav or pcm.
speed (number): Playback speed multiplier. Default 1.0.
model (string): Optional model identifier, usually inferred from the voice.

json

{
  "voice": "nl_NL-rdh-medium",
  "input": "Hallo, welkom.",
  "response_format": "mp3",
  "speed": 1.0
}

Response · 200

Binary audio stream. The Content-Type header matches the requested format: audio/mpeg for mp3, audio/wav for wav, audio/opus for opus, audio/flac for flac, audio/aac for aac, and application/octet-stream for pcm.

bash

curl https://api.inferada.com/v1/audio/speech \
  -H "Authorization: Bearer inf_YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "voice": "nl_NL-rdh-medium", "input": "Hallo." }' \
  --output out.mp3
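The same call can be sketched in Python, using the response Content-Type to choose a file extension rather than hard-coding one. This is an illustrative sketch using only the standard library: build_speech_request, extension_for and synthesise_to_file are hypothetical helper names, and the ".bin" fallback for unrecognised types is an assumption rather than documented behaviour.

```python
import json
import urllib.request

API_BASE = "https://api.inferada.com"  # base URL taken from the curl example above

# Content-Type -> file extension, following the formats listed under "Response · 200".
EXTENSIONS = {
    "audio/mpeg": ".mp3",
    "audio/wav": ".wav",
    "audio/opus": ".opus",
    "audio/flac": ".flac",
    "audio/aac": ".aac",
    "application/octet-stream": ".pcm",
}


def build_speech_request(token, voice, text, response_format=None, speed=None):
    """Build (but do not send) the JSON POST for /v1/audio/speech."""
    payload = {"voice": voice, "input": text}
    # Optional fields are left out entirely so the server applies its defaults.
    if response_format is not None:
        payload["response_format"] = response_format
    if speed is not None:
        payload["speed"] = speed
    return urllib.request.Request(
        f"{API_BASE}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def extension_for(content_type):
    """Map a response Content-Type to a file extension, ignoring any parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Assumption: unknown types are saved with a generic ".bin" extension.
    return EXTENSIONS.get(media_type, ".bin")


def synthesise_to_file(token, voice, text, stem="out", **options):
    """Send the request and write the audio to `stem` plus a matching extension."""
    req = build_speech_request(token, voice, text, **options)
    with urllib.request.urlopen(req) as resp:
        path = stem + extension_for(resp.headers.get("Content-Type", ""))
        with open(path, "wb") as f:
            f.write(resp.read())
    return path
```

With the default format, synthesise_to_file("inf_YOUR_TOKEN", "nl_NL-rdh-medium", "Hallo.") would be expected to write out.mp3.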

Errors

If voice is unknown, the response is a 400 with the list of available voices inlined so your client can recover:

json

{
  "error": "Voice 'xyz' not found. Available voices: [...]",
  "available_voices": [
    { "id": "nl_NL-rdh-medium", "name": "RDH", "language": "nl" }
  ],
  "request_id": "..."
}
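One way a client might use the inlined list to recover is to pick a replacement voice in the language it wanted and retry. The sketch below assumes the 400 body has the shape shown above; pick_fallback_voice is a hypothetical helper, and choosing the first match is only one possible policy.

```python
import json


def pick_fallback_voice(error_body, language):
    """Return the id of the first available voice matching `language`, else None.

    `error_body` is the raw 400 response body (str or bytes).
    """
    data = json.loads(error_body)
    for voice in data.get("available_voices", []):
        if voice.get("language") == language:
            return voice.get("id")
    return None
```

When calling through urllib, the body of a failed request is available from the raised urllib.error.HTTPError via its read() method, so a handler can check that the code is 400 and pass err.read() to this helper before retrying.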

Choosing a voice

Voices come from two backends with different strengths. Pick by language:

Piper — lightweight and fast. Best coverage for Dutch, German, French and Spanish. Voices come in low/medium/high quality tiers.
Kokoro — neural, more natural-sounding. Best for English (American and British), and also covers Japanese and Italian.

The backend is baked into each voice id (e.g. piper:nl_NL-rdh-medium, kokoro:af_bella), so you just pass whatever id you picked from /v1/audio/voices.
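If a client wants to show or log which backend a voice uses, the prefix can be split off locally. This is a small sketch assuming the "backend:name" form shown in the examples above; split_voice_id is a hypothetical helper, and ids without a prefix are returned with a backend of None. The full, unmodified id is still what gets sent to /v1/audio/speech.

```python
def split_voice_id(voice_id):
    """Split "backend:name" into (backend, name).

    Returns (None, voice_id) when the id carries no "backend:" prefix.
    """
    backend, sep, name = voice_id.partition(":")
    if not sep:
        return None, voice_id
    return backend, name
```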

List available voices

GET /v1/audio/voices

Scope: tts

List voices configured for your organisation.

Response · 200

voices[].id (string): Unique voice identifier. Use this in /v1/audio/speech.
voices[].name (string): Human-readable display name.
voices[].language (string): ISO language code (e.g. "nl", "en").
voices[].quality (string | null): Quality tier when available.

json

{
  "voices": [
    {
      "id": "nl_NL-rdh-medium",
      "name": "RDH",
      "language": "nl",
      "quality": "medium"
    }
  ]
}
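A client can filter this response by language and prefer the highest quality tier on offer. The sketch below works on the already-decoded JSON; best_voice is a hypothetical helper, and the ordering assumes the quality values are the strings "low", "medium" and "high", which the example suggests but the reference does not spell out.

```python
# Assumed tier names, lowest to highest; unknown or null tiers rank below all of them.
QUALITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def best_voice(voices_response, language):
    """Return the id of the highest-quality voice for `language`, or None if there is none."""
    candidates = [
        v for v in voices_response.get("voices", [])
        if v.get("language") == language
    ]
    if not candidates:
        return None
    # Voices with a null or unrecognised quality sort below every known tier.
    return max(candidates, key=lambda v: QUALITY_ORDER.get(v.get("quality"), -1))["id"]
```

The returned id can be passed directly as the voice field of a /v1/audio/speech request.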