Text-to-speech, transcription, and audio translation

TokModel provides three audio endpoints that follow the OpenAI Audio API shape: synthesize speech from text, transcribe audio in its original spoken language, and translate spoken audio into English. To switch between audio model providers, change the model parameter.

Authentication

Include your API key in every request:

Authorization: Bearer YOUR_API_KEY

Convert text to speech

POST /v1/audio/speech generates an audio file from a text string. The response body is raw binary audio, so write it directly to a file.

curl

curl https://tokmodel.com/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Welcome to TokModel. Your unified gateway to over thirty AI model providers.",
    "voice": "nova"
  }' \
  --output speech.mp3

The --output flag tells curl to save the binary response to speech.mp3 instead of printing it to the terminal.

Key parameters

Parameter	Type	Description
`model`	string	The TTS model to use, e.g. `openai/tts-1` or `openai/tts-1-hd`.
`input`	string	The text to synthesize. Maximum 4096 characters.
`voice`	string	Voice style. Options include `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`.
`response_format`	string	Audio format: `mp3` (default), `opus`, `aac`, or `flac`.
`speed`	number	Playback speed from `0.25` to `4.0`. Default is `1.0`.

Use openai/tts-1-hd for higher audio fidelity. It costs more per character but produces noticeably cleaner output, especially for longer texts.
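The same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_speech_request` and `synthesize_to_file` are hypothetical, the base URL is copied from the curl example above, and the local checks simply mirror the limits listed in the parameter table.

```python
import json
import urllib.request

API_BASE = "https://tokmodel.com/v1"  # taken from the curl example; adjust if yours differs


def build_speech_request(api_key, text, model="openai/tts-1", voice="nova",
                         response_format="mp3", speed=1.0):
    """Validate parameters locally and return an unsent urllib Request (hypothetical helper)."""
    if len(text) > 4096:
        raise ValueError("input exceeds the documented 4096-character limit")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{API_BASE}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key, text, path, **options):
    """Send the request and write the raw audio bytes to `path`."""
    request = build_speech_request(api_key, text, **options)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `synthesize_to_file("YOUR_API_KEY", "Welcome to TokModel.", "speech.mp3")` would be the rough equivalent of the curl command. Keeping request construction separate from sending makes the validation easy to test without network access.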

Transcribe audio to text

POST /v1/audio/transcriptions takes an audio file and returns a transcript of the spoken content in the original language. The request uses multipart/form-data.

curl

curl https://tokmodel.com/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@recording.mp3"

Example response

{
  "text": "The transformer architecture was introduced in 2017 and has since become the foundation for most modern language models."
}

Key parameters

Parameter	Type	Description
`model`	string	The transcription model to use, e.g. `openai/whisper-1`.
`file`	file	The audio file to transcribe. Supported formats: `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, `webm`. Max 25 MB.
`language`	string	ISO 639-1 code of the spoken language (e.g. `en`, `fr`). Providing it can improve accuracy.
`prompt`	string	Optional context text to guide the model, e.g. proper nouns or acronyms.
`response_format`	string	`json` (default), `text`, `srt`, `verbose_json`, or `vtt`.
`temperature`	number	Sampling temperature between `0` and `1`. Lower values are more deterministic.
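Because this endpoint expects multipart/form-data rather than JSON, a client without a helper library has to encode the form itself. The following is a minimal standard-library sketch of that encoding. The helper names `encode_multipart` and `build_transcription_request` are hypothetical, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption.

```python
import mimetypes
import urllib.request
import uuid

API_BASE = "https://tokmodel.com/v1"  # taken from the curl example
MAX_FILE_BYTES = 25 * 1024 * 1024     # assumption: "25 MB" read as 25 MiB


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part named "file"; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(api_key, filename, audio_bytes,
                                model="openai/whisper-1", **params):
    """Return an unsent Request; extra keyword arguments become form fields."""
    if len(audio_bytes) > MAX_FILE_BYTES:
        raise ValueError("audio file exceeds the documented 25 MB limit")
    body, content_type = encode_multipart({"model": model, **params},
                                          filename, audio_bytes)
    return urllib.request.Request(
        f"{API_BASE}/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": content_type},
        method="POST",
    )
```

Optional parameters from the table, such as `language="fr"` or `response_format="verbose_json"`, can be passed as keyword arguments. Pass the result to `urllib.request.urlopen` to send it.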

Request a subtitle format

To get subtitles instead of a JSON transcript, set response_format to srt or vtt and redirect the output to a file:

curl

curl https://tokmodel.com/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@interview.mp3" \
  -F "response_format=srt" \
  > subtitles.srt
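If you want to post-process the subtitles, the saved file can be parsed into timed cues. The sketch below assumes the response is standard SubRip text (a numeric index, a `start --> end` timestamp line, then one or more text lines, with blank lines between cues). The `Cue` class and `parse_srt` function are hypothetical names, and the sample content in the usage note is made up for illustration.

```python
import re
from dataclasses import dataclass

# Matches HH:MM:SS,mmm (SubRip); a dot separator is also accepted.
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def _to_seconds(stamp):
    """Convert one timestamp string to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text):
    """Split SubRip text into a list of Cue objects."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2:
            continue  # skip empty or malformed blocks
        start, end = lines[1].split("-->")
        cues.append(Cue(int(lines[0]), _to_seconds(start), _to_seconds(end),
                        "\n".join(lines[2:])))
    return cues
```

For example, `parse_srt(open("subtitles.srt", encoding="utf-8").read())` would return one `Cue` per subtitle, which you could then filter by time range or merge into paragraphs.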

Translate audio to English

POST /v1/audio/translations transcribes audio and translates the result into English in one step, regardless of the spoken language. The request shape is the same as transcriptions, but without the language parameter.

curl

curl https://tokmodel.com/v1/audio/translations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@french-interview.mp3"

Example response

{
  "text": "Good morning, I would like to talk about the latest developments in artificial intelligence."
}

Key parameters

Parameter	Type	Description
`model`	string	The translation model to use, e.g. `openai/whisper-1`.
`file`	file	The audio file to translate. Same format and size limits as transcriptions.
`prompt`	string	Optional English-language context text to guide output style.
`response_format`	string	`json` (default), `text`, `srt`, `verbose_json`, or `vtt`.
`temperature`	number	Sampling temperature between `0` and `1`.

The translations endpoint always outputs English, regardless of the source language. If you need a transcript in the original language, use the transcriptions endpoint instead.
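The choice between the two speech-to-text endpoints can be captured in a small helper. This is an illustrative sketch only: `audio_text_request` is a hypothetical function, and it encodes nothing beyond the rule stated above, namely that translations always produce English and take no language parameter.

```python
def audio_text_request(params, to_english=False):
    """Return (endpoint_path, form_fields) for a speech-to-text request."""
    fields = dict(params)  # copy so the caller's dict is left untouched
    if to_english:
        # Translations always output English, so a language hint does not apply.
        fields.pop("language", None)
        return "/v1/audio/translations", fields
    # Transcriptions keep the original language and accept an optional hint.
    return "/v1/audio/transcriptions", fields
```

With `{"model": "openai/whisper-1", "language": "fr"}` as input, the helper keeps `language` for a transcription and drops it for a translation.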
