---
title: whisper-large-v3-turbo · Cloudflare Workers AI docs
description: "Whisper is a pre-trained model for automatic speech recognition
  (ASR) and speech translation. "
chatbotDeprioritize: false
source_url:
  html: https://developers.cloudflare.com/workers-ai/models/whisper-large-v3-turbo/
  md: https://developers.cloudflare.com/workers-ai/models/whisper-large-v3-turbo/index.md
---

![OpenAI logo](https://developers.cloudflare.com/_astro/openai.ChTKThcR.svg)

# whisper-large-v3-turbo

Automatic Speech Recognition • OpenAI

`@cf/openai/whisper-large-v3-turbo`

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

| Model Info | |
| - | - |
| Unit Pricing | $0.00051 per audio minute |

## Usage

Workers - TypeScript

```ts
import { Buffer } from 'node:buffer';

export interface Env {
    AI: Ai;
}

// Sample audio file to transcribe. Named AUDIO_URL to avoid shadowing the global URL class.
const AUDIO_URL = "https://pub-dbcf9f0bd3af47ca9d40971179ee62de.r2.dev/02f6edc0-1f7b-4272-bd17-f05335104725/audio.mp3";

export default {
    async fetch(request, env, ctx): Promise<Response> {
        // Download the audio file.
        const mp3 = await fetch(AUDIO_URL);
        if (!mp3.ok) {
            return Response.json({ error: `Failed to fetch MP3: ${mp3.status}` }, { status: 502 });
        }
        // The model expects the audio as a base64-encoded string.
        const mp3Buffer = await mp3.arrayBuffer();
        const base64 = Buffer.from(mp3Buffer).toString("base64");
        try {
            const res = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
                audio: base64,
            });
            return Response.json(res);
        } catch (e) {
            console.error(e);
            return Response.json({ error: "An unexpected error occurred" }, { status: 500 });
        }
    },
} satisfies ExportedHandler<Env>;
```

Note

To enable built-in Node.js APIs and polyfills, add the `nodejs_compat` compatibility flag to your [Wrangler configuration file](https://developers.cloudflare.com/workers/wrangler/configuration/). This also enables `nodejs_compat_v2` as long as your compatibility date is 2024-09-23 or later. [Learn more about the Node.js compatibility flag and v2](https://developers.cloudflare.com/workers/configuration/compatibility-flags/#nodejs-compatibility-flag).
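
If enabling the Node.js compatibility flag is not an option, the base64 step can be done without `Buffer`. The following is a minimal sketch rather than part of the official example: `toBase64` and `BASE64_ALPHABET` are hypothetical names, and the encoder uses a hand-written lookup table so that it depends only on standard typed arrays.

```ts
// Hypothetical helper (not part of the Workers AI API): base64-encode raw
// bytes without relying on Node's Buffer.
const BASE64_ALPHABET =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function toBase64(data: ArrayBuffer | Uint8Array): string {
    const bytes = data instanceof Uint8Array ? data : new Uint8Array(data);
    let out = "";
    // Process the input three bytes (24 bits) at a time, emitting four
    // 6-bit characters per group.
    for (let i = 0; i < bytes.length; i += 3) {
        const hasSecond = i + 1 < bytes.length;
        const hasThird = i + 2 < bytes.length;
        const b0 = bytes[i];
        const b1 = hasSecond ? bytes[i + 1] : 0;
        const b2 = hasThird ? bytes[i + 2] : 0;
        const triple = (b0 << 16) | (b1 << 8) | b2;
        out += BASE64_ALPHABET.charAt((triple >> 18) & 63);
        out += BASE64_ALPHABET.charAt((triple >> 12) & 63);
        // Pad with "=" when the final group has fewer than three bytes.
        out += hasSecond ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : "=";
        out += hasThird ? BASE64_ALPHABET.charAt(triple & 63) : "=";
    }
    return out;
}
```

In the Worker above, `toBase64(await mp3.arrayBuffer())` would take the place of the `Buffer.from(...)` call. Building the string character by character is simple but not tuned for very large files.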

## Parameters

Fields marked `required` must be provided. All other fields are optional.

### Input

* `audio` string required

  Base64 encoded value of the audio data.

* `task` string default transcribe

  Supported tasks are 'translate' or 'transcribe'.

* `language` string

  The language of the audio being transcribed or translated.

* `vad_filter` boolean default false

  Preprocess the audio with a voice activity detection model.

* `initial_prompt` string

  A text prompt to help provide context to the model on the contents of the audio.

* `prefix` string

  The prefix is appended to the beginning of the output of the transcription and can guide the transcription result.
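
The optional fields can be combined in a single request. The sketch below assumes a locally declared `WhisperInput` type that mirrors the fields listed above and a hypothetical `buildWhisperInput` helper; neither is exported by Workers AI. The `"fr"` language value assumes short language codes are accepted, which this page does not state.

```ts
// Local type mirroring the documented input fields (an assumption for illustration).
interface WhisperInput {
    audio: string;
    task?: "transcribe" | "translate";
    language?: string;
    vad_filter?: boolean;
    initial_prompt?: string;
    prefix?: string;
}

// Hypothetical helper: combine the required base64 audio with optional settings,
// copying only the options that were actually provided.
function buildWhisperInput(
    audioBase64: string,
    options: Omit<WhisperInput, "audio"> = {},
): WhisperInput {
    if (audioBase64.length === 0) {
        throw new Error("audio is required and must be a non-empty base64 string");
    }
    const input: WhisperInput = { audio: audioBase64 };
    if (options.task !== undefined) input.task = options.task;
    if (options.language !== undefined) input.language = options.language;
    if (options.vad_filter !== undefined) input.vad_filter = options.vad_filter;
    if (options.initial_prompt !== undefined) input.initial_prompt = options.initial_prompt;
    if (options.prefix !== undefined) input.prefix = options.prefix;
    return input;
}

// Example: request a translation and enable the voice activity detection filter.
// The base64 string is a placeholder, not real audio.
const exampleInput = buildWhisperInput("UExBQ0VIT0xERVI=", {
    task: "translate",
    language: "fr", // assumed code format
    vad_filter: true,
});
```

The resulting object can be passed as the second argument to `env.AI.run`, in place of the object literal in the Usage example.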

### Output

* `transcription_info` object

  * `language` string

    The language of the audio being transcribed or translated.

  * `language_probability` number

    The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1.

  * `duration` number

    The total duration of the original audio file, in seconds.

  * `duration_after_vad` number

    The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds.

* `text` string required

  The complete transcription of the audio.

* `word_count` number

  The total number of words in the transcription.

* `segments` array

  * `items` object

    * `start` number

      The starting time of the segment within the audio, in seconds.

    * `end` number

      The ending time of the segment within the audio, in seconds.

    * `text` string

      The transcription of the segment.

    * `temperature` number

      The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs.

    * `avg_logprob` number

      The average log probability of the predictions for the words in this segment, indicating overall confidence.

    * `compression_ratio` number

      The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process.

    * `no_speech_prob` number

      The probability that the segment contains no speech, represented as a decimal between 0 and 1.

    * `words` array

      * `items` object

        * `word` string

          The individual word transcribed from the audio.

        * `start` number

          The starting time of the word within the audio, in seconds.

        * `end` number

          The ending time of the word within the audio, in seconds.

* `vtt` string

  The transcription in WebVTT format, which includes timing and text information for use in subtitles.
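
As a rough illustration of how the nested output might be consumed, the sketch below declares local types for a subset of the fields above and adds two hypothetical helpers, `speechSegments` and `allWords`. The type names, helper names, the 0.5 threshold, and the sample response are all made up for this example. Because only `text` is required, the helpers treat every other field as possibly absent.

```ts
// Local types mirroring part of the documented output (assumed, for illustration).
interface WhisperWord {
    word: string;
    start: number;
    end: number;
}
interface WhisperSegment {
    start: number;
    end: number;
    text: string;
    no_speech_prob?: number;
    words?: WhisperWord[];
}
interface WhisperOutput {
    text: string;
    segments?: WhisperSegment[];
}

// Hypothetical helper: keep segments whose no_speech_prob is at or below a threshold.
// The 0.5 default is an arbitrary choice, not a documented value.
function speechSegments(output: WhisperOutput, maxNoSpeechProb = 0.5): WhisperSegment[] {
    return (output.segments ?? []).filter(
        (segment) => (segment.no_speech_prob ?? 0) <= maxNoSpeechProb,
    );
}

// Hypothetical helper: collect word-level timestamps across all segments.
function allWords(output: WhisperOutput): WhisperWord[] {
    const words: WhisperWord[] = [];
    for (const segment of output.segments ?? []) {
        for (const word of segment.words ?? []) {
            words.push(word);
        }
    }
    return words;
}

// Made-up response used only to exercise the helpers.
const sampleOutput: WhisperOutput = {
    text: "Hello world.",
    segments: [
        {
            start: 0,
            end: 1.2,
            text: "Hello world.",
            no_speech_prob: 0.02,
            words: [
                { word: "Hello", start: 0, end: 0.5 },
                { word: "world.", start: 0.6, end: 1.2 },
            ],
        },
        { start: 1.2, end: 3.0, text: "", no_speech_prob: 0.97 },
    ],
};

const spoken = speechSegments(sampleOutput); // keeps only the first segment
const words = allWords(sampleOutput);        // two words, both from the first segment
```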

## API Schemas

The following schemas are based on JSON Schema.

* Input

  ```json
  {
      "type": "object",
      "properties": {
          "audio": {
              "type": "string",
              "description": "Base64 encoded value of the audio data."
          },
          "task": {
              "type": "string",
              "default": "transcribe",
              "description": "Supported tasks are 'translate' or 'transcribe'."
          },
          "language": {
              "type": "string",
              "description": "The language of the audio being transcribed or translated."
          },
          "vad_filter": {
              "type": "boolean",
              "default": false,
              "description": "Preprocess the audio with a voice activity detection model."
          },
          "initial_prompt": {
              "type": "string",
              "description": "A text prompt to help provide context to the model on the contents of the audio."
          },
          "prefix": {
              "type": "string",
              "description": "The prefix is appended to the beginning of the output of the transcription and can guide the transcription result."
          }
      },
      "required": [
          "audio"
      ]
  }
  ```

* Output

  ```json
  {
      "type": "object",
      "contentType": "application/json",
      "properties": {
          "transcription_info": {
              "type": "object",
              "properties": {
                  "language": {
                      "type": "string",
                      "description": "The language of the audio being transcribed or translated."
                  },
                  "language_probability": {
                      "type": "number",
                      "description": "The confidence level or probability of the detected language being accurate, represented as a decimal between 0 and 1."
                  },
                  "duration": {
                      "type": "number",
                      "description": "The total duration of the original audio file, in seconds."
                  },
                  "duration_after_vad": {
                      "type": "number",
                      "description": "The duration of the audio after applying Voice Activity Detection (VAD) to remove silent or irrelevant sections, in seconds."
                  }
              }
          },
          "text": {
              "type": "string",
              "description": "The complete transcription of the audio."
          },
          "word_count": {
              "type": "number",
              "description": "The total number of words in the transcription."
          },
          "segments": {
              "type": "array",
              "items": {
                  "type": "object",
                  "properties": {
                      "start": {
                          "type": "number",
                          "description": "The starting time of the segment within the audio, in seconds."
                      },
                      "end": {
                          "type": "number",
                          "description": "The ending time of the segment within the audio, in seconds."
                      },
                      "text": {
                          "type": "string",
                          "description": "The transcription of the segment."
                      },
                      "temperature": {
                          "type": "number",
                          "description": "The temperature used in the decoding process, controlling randomness in predictions. Lower values result in more deterministic outputs."
                      },
                      "avg_logprob": {
                          "type": "number",
                          "description": "The average log probability of the predictions for the words in this segment, indicating overall confidence."
                      },
                      "compression_ratio": {
                          "type": "number",
                          "description": "The compression ratio of the input to the output, measuring how much the text was compressed during the transcription process."
                      },
                      "no_speech_prob": {
                          "type": "number",
                          "description": "The probability that the segment contains no speech, represented as a decimal between 0 and 1."
                      },
                      "words": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "word": {
                                      "type": "string",
                                      "description": "The individual word transcribed from the audio."
                                  },
                                  "start": {
                                      "type": "number",
                                      "description": "The starting time of the word within the audio, in seconds."
                                  },
                                  "end": {
                                      "type": "number",
                                      "description": "The ending time of the word within the audio, in seconds."
                                  }
                              }
                          }
                      }
                  }
              }
          },
          "vtt": {
              "type": "string",
              "description": "The transcription in WebVTT format, which includes timing and text information for use in subtitles."
          }
      },
      "required": [
          "text"
      ]
  }
  ```
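
The input schema can be checked by hand before calling the model. The following sketch assumes a hypothetical `validateWhisperInput` function that covers only what the schema states: `audio` is required, and each property has a primitive type. It is not a general JSON Schema validator, and it does not restrict `task` to specific values because the schema declares no `enum`.

```ts
// Hypothetical validator derived from the input schema; returns a list of problems
// (an empty list means the value passed these checks).
function validateWhisperInput(value: unknown): string[] {
    if (typeof value !== "object" || value === null || Array.isArray(value)) {
        return ["input must be an object"];
    }
    const input = value as Record<string, unknown>;
    const errors: string[] = [];

    // "audio" is the only entry in the schema's "required" array.
    if (typeof input["audio"] !== "string") {
        errors.push("audio is required and must be a string");
    }

    // Optional properties and the primitive types the schema declares for them.
    const optionalTypes: Record<string, string> = {
        task: "string",
        language: "string",
        vad_filter: "boolean",
        initial_prompt: "string",
        prefix: "string",
    };
    for (const key of Object.keys(optionalTypes)) {
        const fieldValue = input[key];
        if (fieldValue !== undefined && typeof fieldValue !== optionalTypes[key]) {
            errors.push(`${key} must be a ${optionalTypes[key]}`);
        }
    }
    return errors;
}
```

Returning a list of messages rather than throwing lets a Worker report every problem in one response, for example with a 400 status.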
