TokModel provides three audio endpoints: one that synthesizes speech from a text string, one that transcribes an audio file into text, and one that translates an audio file directly into English. The text-to-speech endpoint returns an audio binary, while transcription and translation return text. Select the tab below for the endpoint you need.
- Text-to-speech
- Transcription
- Translation
POST /v1/audio/speech
Convert a text string into spoken audio. The response is a binary audio file in the format specified by response_format. You can stream the audio by reading the response body incrementally.
Request parameters
The text-to-speech model to use. Use the list models endpoint for available TTS model IDs.
The text to synthesize into speech. Maximum length depends on the model.
The voice to use for synthesis. Available voices depend on the model. Common options include "alloy", "echo", "fable", "onyx", "nova", and "shimmer".
The audio format for the output. Supported values: "mp3", "opus", "aac", "flac", "wav", and "pcm".
Playback speed of the generated audio, between 0.25 and 4.0. Values above 1.0 speed up speech; values below 1.0 slow it down.
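As a rough illustration of how the parameters above might fit together, here is a minimal Python sketch that builds a request for POST /v1/audio/speech and streams the response to a file. Several details are assumptions rather than documented facts: the base URL is a placeholder, bearer-token authentication and a JSON request body are assumed, and the field names model, input, voice, and speed are guesses (only response_format is named on this page). The helper names build_speech_request and synthesize_to_file are hypothetical.

```python
import json
import shutil
import urllib.request

# Assumption: placeholder host; replace with the real API base URL.
BASE_URL = "https://api.example.com"

# Output formats listed in the parameter description above.
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(api_key, model, text, voice,
                         response_format="mp3", speed=1.0):
    """Build (but do not send) a POST /v1/audio/speech request."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {
        # Assumption: "model", "input", "voice", and "speed" are guessed
        # field names; only "response_format" appears in the page text.
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token auth and a JSON body.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request, path, chunk_size=8192):
    """Send the request and copy the audio to disk chunk by chunk."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # Reading the body incrementally avoids holding the whole file in memory.
        shutil.copyfileobj(response, out, chunk_size)
```

A call might look like `synthesize_to_file(build_speech_request(key, "your-tts-model-id", "Hello", "alloy"), "hello.mp3")`, where the model ID comes from the list models endpoint. Validating speed and response_format locally is a convenience choice in this sketch; the server remains the authority on what it accepts.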