Audio generation
In addition to generating text and images, some models enable you to generate a spoken audio response to a prompt, and to use audio inputs to prompt the model. Audio inputs can contain richer data than text alone, allowing the model to detect tone, inflection, and other nuances within the input.
You can use these audio capabilities to:
- Generate a spoken audio summary of a body of text (text in, audio out)
- Perform sentiment analysis on a recording (audio in, text out)
- Perform asynchronous speech-to-speech interactions with a model (audio in, audio out)
Quickstart
To generate audio or use audio as an input, you can use the chat completions endpoint in the REST API, as seen in the examples below. You can either use the REST API from the HTTP client of your choice, or use one of OpenAI's official SDKs for your preferred programming language.
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?"
    }
  ]
});

// Inspect returned data
console.log(response.choices[0]);

// Write the base64-encoded audio data to a file
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64")
);
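You can also send audio as an input. Below is a minimal sketch of the sentiment-analysis use case (audio in, text out); it assumes you have a local file named recording.wav, and passes the base64-encoded audio as an input_audio content part on a user message.

import { readFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Read a local audio file and encode it as base64
// ("recording.wav" is a placeholder for your own file)
const audioData = readFileSync("recording.wav").toString("base64");

// Ask for a text-only analysis of the recording
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text"],
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is the sentiment of this recording?" },
        { type: "input_audio", input_audio: { data: audioData, format: "wav" } }
      ]
    }
  ]
});

console.log(response.choices[0].message.content);

Leaving audio out of modalities keeps the response text-only, which corresponds to the audio in → text out combination listed in the FAQ below.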
Multi-turn conversations
Using audio outputs from the model as inputs to multi-turn conversations requires a generated ID that appears in the response data for an audio generation. Below is an example JSON data structure for a message you might receive from /chat/completions:
{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null,
    "refusal": null,
    "audio": {
      "id": "audio_abc123",
      "expires_at": 1729018505,
      "data": "<bytes omitted>",
      "transcript": "Yes, golden retrievers are known to be ..."
    }
  },
  "finish_reason": "stop"
}
The value of message.audio.id above provides an identifier you can use in an assistant message for a new /chat/completions request, as in the example below.
curl "https://api.openai.com/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-4o-audio-preview",
"modalities": ["text", "audio"],
"audio": { "voice": "alloy", "format": "wav" },
"messages": [
{
"role": "user",
"content": "Is a golden retriever a good family dog?"
},
{
"role": "assistant",
"audio": {
"id": "audio_abc123"
}
},
{
"role": "user",
"content": "Why do you say they are loyal?"
}
]
}'
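If you are using the Node SDK from the quickstart, the equivalent multi-turn request is sketched below. It assumes audio_abc123 is the id taken from a previous response, as in the JSON above.

import OpenAI from "openai";

const openai = new OpenAI();

// Reference the audio from a previous turn by id instead of resending the bytes
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    { role: "user", content: "Is a golden retriever a good family dog?" },
    { role: "assistant", audio: { id: "audio_abc123" } },
    { role: "user", content: "Why do you say they are loyal?" }
  ]
});

console.log(response.choices[0].message.audio.transcript);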
FAQ
What modalities are supported by gpt-4o-audio-preview?
gpt-4o-audio-preview requires either audio output or audio input to be used at this time. Acceptable combinations of input and output are:
- text in → text + audio out
- audio in → text + audio out
- audio in → text out
- text + audio in → text + audio out
- text + audio in → text out
How is audio in Chat Completions different from the Realtime API?
The underlying GPT-4o audio model is exactly the same. The Realtime API operates the same model at lower latency.
How do I think about audio input to the model in terms of tokens?
We are working on better tooling to expose this, but roughly one hour of audio input is equal to 128k tokens, the max context window currently supported by this model.
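By that figure, one minute of audio works out to roughly 128,000 / 60 ≈ 2,100 tokens.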
How do I control which output modalities I receive?
Currently, the model programmatically allows only modalities = ["text", "audio"]. In the future, this parameter will offer more control.
How does tool/function calling work?
Tool (and function) calling works the same as it does for other models in Chat Completions. See the function calling guide to learn more.
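As a rough sketch, a tool definition is passed alongside an audio request the same way as for any other Chat Completions model; the get_breed_info function below is a hypothetical example, not part of the API.

import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [{ role: "user", content: "Tell me about golden retrievers." }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_breed_info", // hypothetical tool for illustration
        description: "Look up facts about a dog breed",
        parameters: {
          type: "object",
          properties: { breed: { type: "string" } },
          required: ["breed"]
        }
      }
    }
  ]
});

// If the model decides to call the tool, the call appears on the message as usual
console.log(response.choices[0].message.tool_calls ?? response.choices[0].message);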
Next steps
Now that you know how to generate audio outputs and send audio inputs, there are a few other techniques you might want to master.
- Use a specialized model to turn text into speech.
- Use a specialized model to turn audio files with speech into text.
- Learn to use the Realtime API to prompt a model over a WebSocket.
- Check out all the options for audio generation in the API reference.