TokModel provides three audio endpoints: one synthesizes speech from a text string, one transcribes an audio file into text, and one translates an audio file directly into English text. The text-to-speech endpoint returns binary audio, while transcription and translation return text. The text-to-speech endpoint is documented below.

POST /v1/audio/speech

Convert a text string into spoken audio. The response is a binary audio file in the format specified by response_format. You can stream the audio by reading the response body incrementally.
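Because the audio arrives as an ordinary response body, a client can begin playback before the whole file has downloaded. The shell sketch below shows one way to do that under several assumptions: `stream_speech` is a hypothetical helper, the key is read from a hypothetical `TOKMODEL_API_KEY` environment variable, `jq` is installed to JSON-encode the input text, and FFmpeg's `ffplay` is the default player (any player that reads audio from standard input can be passed instead).

```shell
# Hypothetical helper, not part of the TokModel API.
# Usage: stream_speech "text to speak" [player command ...]
# Assumes curl and jq are installed and TOKMODEL_API_KEY holds your API key.
stream_speech() {
  text=$1
  shift
  # Default to ffplay reading MP3 from stdin; callers can pass another player.
  if [ "$#" -eq 0 ]; then
    set -- ffplay -nodisp -autoexit -loglevel quiet -
  fi
  # jq builds the JSON body so quotes and newlines in the text are escaped safely.
  # curl reads that body from stdin (--data-binary @-), and --no-buffer passes
  # each chunk of audio to the player as soon as it arrives.
  jq -n --arg input "$text" \
    '{model: "openai/tts-1", input: $input, voice: "nova", response_format: "mp3"}' |
    curl --silent --show-error --fail --no-buffer \
      --request POST \
      --header "Authorization: Bearer $TOKMODEL_API_KEY" \
      --header "Content-Type: application/json" \
      --data-binary @- \
      https://tokmodel.com/v1/audio/speech |
    "$@"
}
```

For example, `stream_speech "Welcome to TokModel."` plays through ffplay, and `stream_speech "Welcome to TokModel." mpv -` uses mpv instead. The model and voice are hard-coded here only to keep the sketch short.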

Request parameters

model (string, required)
The text-to-speech model to use. Call the list models endpoint to get the available TTS model IDs.

input (string, required)
The text to synthesize into speech. The maximum length depends on the model.

voice (string, required)
The voice to use for synthesis. Available voices depend on the model; common options include "alloy", "echo", "fable", "onyx", "nova", and "shimmer".

response_format (string, optional, default: "mp3")
The audio format of the output. Supported values: "mp3", "opus", "aac", "flac", "wav", and "pcm".

speed (number, optional, default: 1.0)
Playback speed of the generated audio, between 0.25 and 4.0. Values above 1.0 speed up the speech; values below 1.0 slow it down.
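The format list and speed range above can be checked client-side before a request is sent. The POSIX shell sketch below does that; `check_speech_params` is a hypothetical helper, not part of the TokModel API, it treats the 0.25 and 4.0 bounds as inclusive (an assumption), and it does not check model or voice availability, which depend on the model.

```shell
# Hypothetical client-side check, not part of the TokModel API.
# Usage: check_speech_params FORMAT SPEED
# Returns 0 when both values fall within the limits documented on this page.
check_speech_params() {
  fmt=$1
  spd=$2
  # response_format must be one of the documented values.
  case $fmt in
    mp3 | opus | aac | flac | wav | pcm) ;;
    *)
      echo "unsupported response_format: $fmt" >&2
      return 1
      ;;
  esac
  # Reject anything that is not a plain decimal number (digits and at most one dot).
  case $spd in
    '' | . | *[!0-9.]* | *.*.*)
      echo "speed is not a number: $spd" >&2
      return 1
      ;;
  esac
  # Shell arithmetic is integer-only, so awk performs the range check.
  # The bounds are treated as inclusive (an assumption).
  if ! awk -v s="$spd" 'BEGIN { exit !(s + 0 >= 0.25 && s + 0 <= 4.0) }'; then
    echo "speed must be between 0.25 and 4.0: $spd" >&2
    return 1
  fi
  return 0
}
```

For example, `check_speech_params mp3 1.0 && curl ...` sends the request only when the values pass, while `check_speech_params mp3 4.5` prints an error and returns 1.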

Example

The response body is the raw audio binary, so curl's --output flag can write it straight to a file:
curl https://tokmodel.com/v1/audio/speech \
  --request POST \
  --header "Authorization: Bearer YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "openai/tts-1",
    "input": "Welcome to TokModel, your unified LLM API gateway.",
    "voice": "nova",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output speech.mp3
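One caveat with --output: curl saves whatever body the server returns, so a failed request can leave speech.mp3 holding something other than audio. A rough sanity check, sketched below as a hypothetical `looks_like_mp3` shell function, inspects the file's first bytes: MP3 files usually begin either with an ID3v2 tag ("ID3") or with an MPEG audio frame sync (0xFF followed by a byte whose top three bits are set). It is a heuristic only; it does not decode the audio, and some other formats, such as ADTS AAC, share the same sync pattern.

```shell
# Hypothetical helper, not part of the TokModel API.
# Usage: looks_like_mp3 FILE
# Returns 0 if FILE starts like an MP3 stream; this is a heuristic, not a decoder.
looks_like_mp3() {
  # Read up to the first three bytes as unsigned decimal numbers.
  set -- $(od -An -tu1 -N3 "$1" 2>/dev/null)
  # Fewer than two bytes cannot hold either signature.
  [ "$#" -ge 2 ] || return 1
  # "ID3" (bytes 73 68 51) marks an ID3v2 tag in front of the audio frames.
  if [ "$#" -eq 3 ] && [ "$1" -eq 73 ] && [ "$2" -eq 68 ] && [ "$3" -eq 51 ]; then
    return 0
  fi
  # Otherwise expect an MPEG audio frame sync: 0xFF, then a byte with its top three bits set.
  [ "$1" -eq 255 ] && [ $(($2 & 224)) -eq 224 ]
}
```

After running the curl command, `looks_like_mp3 speech.mp3 || echo "speech.mp3 does not look like MP3 audio" >&2` flags a suspicious file. Adding curl's --fail flag to the request also helps: curl then exits with an error on HTTP 4xx and 5xx responses instead of reporting success.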