TokModel provides three audio endpoints that follow the shape of the OpenAI Audio API: one synthesizes speech from text, one transcribes an audio file into text in the language that was spoken, and one translates spoken audio into English text. To switch between audio model providers, change the model parameter.
Authentication
Include your API key in every request:
```text
Authorization: Bearer YOUR_API_KEY
```
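Rather than pasting the key into every command, you can keep it in an environment variable and build the header from it. The sketch below assumes a variable named TOKMODEL_API_KEY and a helper named auth_header; both names are hypothetical and are not defined by TokModel.

```bash
# Hypothetical helper: print the Authorization header built from the
# TOKMODEL_API_KEY environment variable (an assumed name, not part of the API).
# Returns non-zero, with a message on stderr, if the variable is unset or empty.
auth_header() {
  if [ -z "${TOKMODEL_API_KEY:-}" ]; then
    echo "TOKMODEL_API_KEY is not set" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s\n' "$TOKMODEL_API_KEY"
}

# Usage in a script: stop early instead of sending a request with no key.
#   hdr=$(auth_header) || exit 1
#   curl https://tokmodel.com/v1/audio/speech -H "$hdr" ...
```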
Convert text to speech
POST /v1/audio/speech generates an audio file from a text string. The response body is raw audio binary, so write it directly to a file.

```bash
curl https://tokmodel.com/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Welcome to TokModel. Your unified gateway to over thirty AI model providers.",
    "voice": "nova"
  }' \
  --output speech.mp3
```
The --output flag tells curl to save the binary response to speech.mp3 instead of printing it to the terminal.
Key parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | The TTS model to use, e.g. openai/tts-1 or openai/tts-1-hd. |
| input | string | The text to synthesize. Maximum 4096 characters. |
| voice | string | Voice style. Options include alloy, echo, fable, onyx, nova, shimmer. |
| response_format | string | Audio format: mp3 (default), opus, aac, or flac. |
| speed | number | Playback speed from 0.25 to 4.0. Default is 1.0. |
Use openai/tts-1-hd for higher audio fidelity. It costs more per character but produces noticeably cleaner output, especially for longer texts.
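The limits in the table can be checked locally before a request is sent, which avoids a round trip for a request the API would reject. The following is a minimal bash sketch under several assumptions: the helper names tts_validate and tts_speak and the TOKMODEL_API_KEY variable are hypothetical, jq is assumed to be installed for building the JSON body safely, and the voice list copies the options named in the table, which may not be exhaustive. Passing the model as an argument lets the same helper target different providers.

```bash
# Hypothetical helpers for POST /v1/audio/speech (names are illustrative).
# Limits mirror the "Key parameters" table; jq is assumed to be installed.

# tts_validate INPUT VOICE FORMAT SPEED -> returns 0 if every value is allowed.
tts_validate() {
  local input="$1" voice="$2" format="$3" speed="$4"

  # input: at most 4096 characters (bash counts characters in a UTF-8 locale).
  if [ "${#input}" -gt 4096 ]; then
    echo "input exceeds 4096 characters" >&2
    return 1
  fi

  case $voice in
    alloy|echo|fable|onyx|nova|shimmer) ;;
    *) echo "unknown voice: $voice" >&2; return 1 ;;
  esac

  case $format in
    mp3|opus|aac|flac) ;;
    *) echo "unsupported response_format: $format" >&2; return 1 ;;
  esac

  # speed: a plain decimal number from 0.25 to 4.0; awk does the float comparison.
  if ! awk -v s="$speed" 'BEGIN {
         ok = (s ~ /^(0|[1-9][0-9]*)(\.[0-9]+)?$/) && (s >= 0.25) && (s <= 4.0)
         exit !ok
       }'; then
    echo "speed must be a number from 0.25 to 4.0" >&2
    return 1
  fi
}

# tts_speak MODEL INPUT VOICE FORMAT SPEED OUTFILE -> writes audio to OUTFILE.
tts_speak() {
  tts_validate "$2" "$3" "$4" "$5" || return 1
  if [ -z "${TOKMODEL_API_KEY:-}" ]; then
    echo "TOKMODEL_API_KEY is not set" >&2
    return 1
  fi

  # jq escapes the text correctly; --argjson passes speed as a JSON number.
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error body.
  jq -n --arg model "$1" --arg input "$2" --arg voice "$3" \
        --arg format "$4" --argjson speed "$5" \
        '{model: $model, input: $input, voice: $voice,
          response_format: $format, speed: $speed}' |
    curl --fail --silent --show-error https://tokmodel.com/v1/audio/speech \
      -H "Authorization: Bearer $TOKMODEL_API_KEY" \
      -H "Content-Type: application/json" \
      --data-binary @- \
      --output "$6"
}
```

For example, tts_speak openai/tts-1-hd "Hello from TokModel." nova flac 1.25 hello.flac would request high-fidelity FLAC audio at 1.25x speed.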
Transcribe audio to text
POST /v1/audio/transcriptions takes an audio file and returns a transcript of the spoken content in the original language. The request uses multipart/form-data.

```bash
curl https://tokmodel.com/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@recording.mp3"
```
Example response
```json
{
  "text": "The transformer architecture was introduced in 2017 and has since become the foundation for most modern language models."
}
```
Key parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | The transcription model to use, e.g. openai/whisper-1. |
| file | file | The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. Max 25 MB. |
| language | string | ISO-639-1 language code (e.g. en, fr). Providing this improves accuracy. |
| prompt | string | Optional context text to guide the model, e.g. proper nouns or acronyms. |
| response_format | string | json (default), text, srt, verbose_json, or vtt. |
| temperature | number | Sampling temperature between 0 and 1. Lower values are more deterministic. |
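Because uploads can be large, it can help to check the file against the documented format and size limits before sending it. This bash sketch is illustrative only: audio_check, transcribe, and TOKMODEL_API_KEY are hypothetical names, and "25 MB" is interpreted as 25 × 1024 × 1024 bytes, which is an assumption. Optional language and prompt fields are added to the form only when a value is supplied.

```bash
# Hypothetical pre-upload check for the transcription and translation endpoints.
# audio_check FILE -> returns 0 if FILE exists, has a supported extension,
# and is no larger than 25 MB (taken here as 25 * 1024 * 1024 bytes, an assumption).
audio_check() {
  local file="$1" ext size
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi

  # Compare the lower-cased extension against the documented formats.
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case $ext in
    mp3|mp4|mpeg|mpga|m4a|wav|webm) ;;
    *) echo "unsupported format: .$ext" >&2; return 1 ;;
  esac

  # wc -c may pad its output with spaces on some systems, so strip them.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt $((25 * 1024 * 1024)) ]; then
    echo "file exceeds 25 MB: $size bytes" >&2
    return 1
  fi
}

# transcribe FILE [LANGUAGE] [PROMPT] -> prints the JSON response to stdout.
transcribe() {
  local file="$1" language="${2:-}" prompt="${3:-}"
  audio_check "$file" || return 1
  if [ -z "${TOKMODEL_API_KEY:-}" ]; then
    echo "TOKMODEL_API_KEY is not set" >&2
    return 1
  fi

  # Collect form fields; optional ones are added only when a value is given.
  set -- -F "model=openai/whisper-1" -F "file=@$file"
  if [ -n "$language" ]; then set -- "$@" -F "language=$language"; fi
  # --form-string sends the prompt literally, even if it starts with @ or <.
  if [ -n "$prompt" ]; then set -- "$@" --form-string "prompt=$prompt"; fi

  curl --fail --silent --show-error https://tokmodel.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $TOKMODEL_API_KEY" \
    "$@"
}
```

For example, transcribe recording.mp3 en "TokModel, Whisper" would send the language hint and a prompt listing proper nouns the model should spell correctly.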
To get subtitles instead of plain text, set response_format to srt or vtt:

```bash
curl https://tokmodel.com/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@interview.mp3" \
  -F "response_format=srt" \
  > subtitles.srt
```
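When scripting downloads, it is convenient to name the output file after the requested format. The hypothetical helper below maps each documented response_format value to a file extension; using .json for json and verbose_json and .txt for text is a naming convention chosen for this sketch, not something the API specifies. transcribe_to_file and TOKMODEL_API_KEY are likewise illustrative names.

```bash
# Hypothetical helper: map a response_format value to a file extension.
format_extension() {
  case $1 in
    json|verbose_json) echo json ;;
    text) echo txt ;;
    srt) echo srt ;;
    vtt) echo vtt ;;
    *) echo "unknown response_format: $1" >&2; return 1 ;;
  esac
}

# transcribe_to_file FILE FORMAT -> saves the result next to FILE,
# e.g. interview.mp3 with srt is written to interview.srt.
transcribe_to_file() {
  local file="$1" format="$2" ext
  ext=$(format_extension "$format") || return 1
  if [ -z "${TOKMODEL_API_KEY:-}" ]; then
    echo "TOKMODEL_API_KEY is not set" >&2
    return 1
  fi
  curl --fail --silent --show-error https://tokmodel.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $TOKMODEL_API_KEY" \
    -F "model=openai/whisper-1" \
    -F "file=@$file" \
    -F "response_format=$format" \
    --output "${file%.*}.$ext"
}
```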
Translate audio to English
POST /v1/audio/translations transcribes audio and translates the result into English in one step, whatever the spoken language. The request has the same shape as a transcription request, except that it takes no language parameter.

```bash
curl https://tokmodel.com/v1/audio/translations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@french-interview.mp3"
```
Example response
```json
{
  "text": "Good morning, I would like to talk about the latest developments in artificial intelligence."
}
```
Key parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | The translation model to use, e.g. openai/whisper-1. |
| file | file | The audio file to translate. Same format and size limits as transcriptions. |
| prompt | string | Optional English-language context text to guide output style. |
| response_format | string | json (default), text, srt, verbose_json, or vtt. |
| temperature | number | Sampling temperature between 0 and 1. |
The translations endpoint always outputs English, regardless of the source language. If you need a transcript in the original language, use the transcriptions endpoint instead.
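If a script has to support both cases, one option is to choose the endpoint from the output language you want. In this sketch, audio_endpoint, speech_to_text, and TOKMODEL_API_KEY are hypothetical names, and the model is fixed to openai/whisper-1 for brevity.

```bash
# Hypothetical helper: pick the endpoint for the transcript language you want.
#   original -> transcript in the spoken language (/v1/audio/transcriptions)
#   english  -> English translation (/v1/audio/translations)
audio_endpoint() {
  case $1 in
    original) echo "https://tokmodel.com/v1/audio/transcriptions" ;;
    english) echo "https://tokmodel.com/v1/audio/translations" ;;
    *) echo "expected 'original' or 'english', got: $1" >&2; return 1 ;;
  esac
}

# speech_to_text FILE original|english -> prints the JSON response to stdout.
speech_to_text() {
  local file="$1" url
  url=$(audio_endpoint "$2") || return 1
  if [ -z "${TOKMODEL_API_KEY:-}" ]; then
    echo "TOKMODEL_API_KEY is not set" >&2
    return 1
  fi
  curl --fail --silent --show-error "$url" \
    -H "Authorization: Bearer $TOKMODEL_API_KEY" \
    -F "model=openai/whisper-1" \
    -F "file=@$file"
}
```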