Speech-to-Text
Transcribe audio files using Whisper, GPT-4o Transcribe, or ElevenLabs Scribe
STT uses a dedicated multipart endpoint, POST /api/v2/stt, not the standard /generate endpoint.

Request

Header: Authorization: Bearer nb_YOUR_API_KEY

Form fields:
- Audio file. Supported formats: mp3, mp4, wav, m4a, ogg, flac, webm. Maximum size: 25 MB.
- STT model slug. See the table below.
- Language code (e.g. en, ru, es). Optional; auto-detected if omitted.
- Speaker diarization. Supported only by elevenlabs-scribe.

Models
| Slug | Provider | Tier | Cost | Notes |
|---|---|---|---|---|
| whisper | OpenAI | Starter | 2 tkn | Fast, 99 languages |
| gpt-4o-transcribe | OpenAI | Basic+ | 2 tkn | Highest accuracy |
| elevenlabs-scribe | ElevenLabs | Basic+ | 2 tkn | Best for meetings, supports diarization |
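The multipart request described above can be built with the Python standard library alone. This is a minimal sketch and rests on several assumptions. The host `https://api.example.com` is a placeholder, because this page gives only the path. The form field names `file`, `model` and `language` are hypothetical, because this page does not list them. The 25 MB limit is read as 25 MiB. The sketch checks inputs against the documented formats, size limit and model slugs before it builds the request.

```python
import mimetypes
import os
import urllib.request
import uuid

# Hypothetical host: this page documents only the path /api/v2/stt.
BASE_URL = "https://api.example.com"
STT_PATH = "/api/v2/stt"

# Limits and model slugs taken from this page.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".ogg", ".flac", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB
STT_MODELS = {"whisper", "gpt-4o-transcribe", "elevenlabs-scribe"}


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file part as multipart/form-data bytes."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    # Fall back to a generic type when the extension is not in the mimetypes table.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        (
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
        ).encode()
    )
    parts.append(f"Content-Type: {content_type}\r\n\r\n".encode())
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return boundary, b"".join(parts)


def build_stt_request(api_key, filename, file_bytes, model, language=None):
    """Validate inputs against the documented limits and build the POST request."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or filename!r}")
    if len(file_bytes) > MAX_BYTES:
        raise ValueError("audio file exceeds the 25 MB limit")
    if model not in STT_MODELS:
        raise ValueError(f"unknown STT model slug: {model!r}")

    # Hypothetical form field names; check the API reference for the real ones.
    fields = {"model": model}
    if language is not None:
        fields["language"] = language  # omit to let the service auto-detect
    boundary, body = encode_multipart(fields, "file", filename, file_bytes)

    return urllib.request.Request(
        BASE_URL + STT_PATH,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Example usage: `req = build_stt_request("nb_YOUR_API_KEY", "meeting.mp3", open("meeting.mp3", "rb").read(), "whisper")`, then `urllib.request.urlopen(req)`. The sketch does not parse the response because this page does not document its shape.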
Response
Diarization (who said what)
Available only with elevenlabs-scribe:
result_text:
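The diarization restriction can be enforced on the client before any request is sent. The minimal sketch below assumes that the form field names `model`, `language` and `diarize` and the string value `"true"` for the flag are correct. All four are hypothetical, because this page does not list the field names or how the flag is encoded.

```python
# Only elevenlabs-scribe supports diarization, per the model table.
DIARIZATION_MODELS = {"elevenlabs-scribe"}


def stt_form_fields(model, language=None, diarize=False):
    """Build the non-file form fields for POST /api/v2/stt.

    Field names ("model", "language", "diarize") and the "true" value are
    hypothetical placeholders, not confirmed by the API reference.
    """
    if diarize and model not in DIARIZATION_MODELS:
        raise ValueError(
            f"diarization is not supported by {model!r}; use elevenlabs-scribe"
        )
    fields = {"model": model}
    if language is not None:
        fields["language"] = language  # omit to let the service auto-detect
    if diarize:
        fields["diarize"] = "true"  # hypothetical encoding of the flag
    return fields
```

The guard raises early rather than silently dropping the flag. This way a caller who asks for speaker labels from whisper or gpt-4o-transcribe finds out immediately, instead of receiving an undiarized transcript.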

