Model distillation
Model distillation lets you leverage the outputs of a large model to fine-tune a smaller model, enabling it to achieve similar performance on a specific task. This process can significantly reduce both cost and latency, since smaller models are typically more efficient.
Here's how it works:
- Store high-quality outputs of a large model using the store parameter in the Chat Completions API.
- Evaluate the stored completions with both the large and the small model to establish a baseline.
- Select the stored completions that you'd like to use for distillation and use them to fine-tune the smaller model.
- Evaluate the performance of the fine-tuned model to see how it compares to the large model.
Let's go through these steps to see how it's done.
Store high-quality outputs of a large model
The first step in the distillation process is to generate good results with a large model like o1-preview or gpt-4o that meet your bar. As you generate these results, you can store them using the store: true option in the Chat Completions API. We also recommend you use the metadata property to tag these completions for easy filtering later.
These stored completions can then be viewed and filtered in the dashboard.
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a corporate IT support expert." },
    { role: "user", content: "How can I hide the dock on my Mac?" },
  ],
  store: true,
  metadata: {
    role: "manager",
    department: "accounting",
    source: "homepage",
  },
});

console.log(response.choices[0]);
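Stored completions can also be retrieved programmatically rather than through the dashboard. Below is a minimal sketch, assuming an SDK version that exposes the stored-completions list endpoint; the metadata filter mirrors the tags set above.

import OpenAI from "openai";

const openai = new OpenAI();

// List stored completions, filtered by the metadata tags set at creation
// time. Assumes an SDK version that supports listing stored completions.
for await (const completion of openai.chat.completions.list({
  metadata: { department: "accounting" },
  limit: 20,
})) {
  console.log(completion.id, completion.model);
}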
Evaluate to establish a baseline
You can use your stored completions to evaluate the performance of both the large model and the small model on your task to establish a baseline. This can be done using the Evals product.
Typically, the large model will outperform the smaller model on your evaluations. Establishing this baseline allows you to measure the improvement gained through distillation and fine-tuning.
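If you want to sanity-check the baseline in code rather than in the dashboard, the following sketch replays one prompt against both models and uses the large model as a simple LLM grader. The grading prompt and the 1-to-5 scale are illustrative assumptions, not the Evals product's grading scheme.

import OpenAI from "openai";

const openai = new OpenAI();

// Replay the same prompt against both models to compare answers side by side.
const messages = [
  { role: "system", content: "You are a corporate IT support expert." },
  { role: "user", content: "How can I hide the dock on my Mac?" },
];

const [large, small] = await Promise.all([
  openai.chat.completions.create({ model: "gpt-4o", messages }),
  openai.chat.completions.create({ model: "gpt-4o-mini", messages }),
]);

// Use the large model as a simple LLM grader (an illustrative stand-in for
// the Evals product): score the small model's answer from 1 to 5.
const grade = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content:
        "Rate this IT support answer from 1 (poor) to 5 (excellent). " +
        "Reply with only the number.\n\n" +
        small.choices[0].message.content,
    },
  ],
});

console.log("Small model score:", grade.choices[0].message.content);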
Create training dataset to fine-tune smaller model
Next, you can select a subset of your stored completions to use as training data for fine-tuning a smaller model like gpt-4o-mini. Filter your stored completions to those you would like to use to train the small model, and click the "Distill" button. A few hundred samples might be sufficient, but a more diverse set of thousands of samples can sometimes yield better results.
This action will open a dialog to begin a fine-tuning job, with your selected completions as the training dataset. Configure the parameters as needed, choosing the base model you wish to fine-tune. In this example, we're going to choose the latest snapshot of GPT-4o-mini.
After configuring, click "Run" to start the fine-tuning job. The process may take 15 minutes or longer, depending on the size of your training dataset.
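The "Distill" dialog starts a standard fine-tuning job under the hood, so the same step can also be scripted against the fine-tuning API. Here is a sketch assuming you have exported your selected completions to a local JSONL file in the chat fine-tuning format; the filename is hypothetical.

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Upload the training data exported from your stored completions.
// "distillation-training.jsonl" is a hypothetical filename; each line should
// hold one {"messages": [...]} example in the chat fine-tuning format.
const file = await openai.files.create({
  file: fs.createReadStream("distillation-training.jsonl"),
  purpose: "fine-tune",
});

// Start a fine-tuning job on the small model.
const job = await openai.fineTuning.jobs.create({
  model: "gpt-4o-mini-2024-07-18",
  training_file: file.id,
});

// Check on the job; it typically takes 15 minutes or longer to complete.
const status = (await openai.fineTuning.jobs.retrieve(job.id)).status;
console.log("Job status:", status);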
Evaluate the fine-tuned small model
When your fine-tuning job is complete, you can run evals against it to see how it stacks up against the base small and large models. You can select fine-tuned models in the Evals product to generate new completions with the fine-tuned small model.
Alternatively, you can store new chat completions generated by the fine-tuned model and use them to evaluate performance (a sketch follows below). By continually tweaking and improving:
- The diversity of the training data
- Your prompts and outputs on the large model
- The accuracy of your eval graders
you can bring the performance of the smaller model up to the same level as the large model, for a specific subset of tasks.
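As a sketch of that second approach, the request below points at the fine-tuned model and stores the result for later evaluation; the ft: model name is a placeholder for the identifier returned by your fine-tuning job.

import OpenAI from "openai";

const openai = new OpenAI();

// Generate and store a completion from the fine-tuned model so it can be
// evaluated alongside the base models. The "ft:..." name is a placeholder;
// use the fine_tuned_model value from your completed job.
const response = await openai.chat.completions.create({
  model: "ft:gpt-4o-mini-2024-07-18:your-org::abc123",
  messages: [
    { role: "system", content: "You are a corporate IT support expert." },
    { role: "user", content: "How can I hide the dock on my Mac?" },
  ],
  store: true,
  metadata: { source: "fine-tuned-eval" },
});

console.log(response.choices[0].message.content);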
Next steps
Distilling the results of a large model into a small model is one powerful way to improve the results you generate from your models, but not the only one. Check out these resources to learn more about optimizing your outputs.
- Fine-tuning: Improve a model's ability to generate responses tailored to your use case.
- Evals: Run tests on your model outputs to ensure you're getting the right results.