Evaluating model performance
Beta

Test and improve model outputs through evaluations.

When developing with AI models, it's essential to continuously test their outputs to ensure they are accurate and useful. Regularly running evaluations (often called evals) on your model's outputs using test data helps you build and maintain high-quality and reliable AI applications.

OpenAI provides built-in tools in the OpenAI dashboard to create and run evals on test datasets. Here's how the process works:

  1. Generate a test dataset
  2. Define and run evals against your dataset
  3. Tweak your prompt and/or fine-tune your model to improve performance
  4. Repeat until satisfied 🚀

Let's see how this is done!

Generate a test dataset

In software development, you often have to create test data (sometimes called fixtures) that your program needs in order to validate that your software works properly. Your unit tests execute your code against fixture data and check that the output is what you expect.

Similarly, your evals require a set of test inputs that your model should be able to respond to correctly. Good test data is essential to optimizing LLM accuracy: if your model is tested with data that isn't representative of the types of requests it will actually receive, you can't be confident in how it will perform on new, unseen inputs.
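
For example, a small test dataset for the IT support chatbot used in the example below might just be a handful of representative questions paired with notes on what a good answer should cover. This is a hypothetical sketch; the field names are illustrative, not a required schema:

// Hand-written test cases for an IT support chatbot.
// The field names (input, expected) are illustrative, not a required schema.
const testCases = [
  {
    input: "How can I hide the dock on my Mac?",
    expected: "Explains how to automatically hide and show the Dock in System Settings",
  },
  {
    input: "My VPN keeps disconnecting on hotel Wi-Fi.",
    expected: "Suggests checking the captive portal sign-in and trying a different VPN protocol",
  },
  {
    input: "How do I request admin rights on my laptop?",
    expected: "Points to the internal access request process instead of granting rights directly",
  },
];

Each test case pairs a realistic user input with a short description of the expected behavior, which you can later turn into grading criteria.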

Generate datasets from real traffic

One of the best ways to generate a representative test dataset is to use real production requests from your users. This is possible using Stored Completions. In the code that generates your LLM responses, set the store: true parameter and include metadata tags that you can later use to filter your completions, as in the example below for an IT support chatbot:

Store completions in the API with metadata
import OpenAI from "openai";
const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a corporate IT support expert." },
    { role: "user", content: "How can I hide the dock on my Mac?"},
  ],
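  // Persist this completion as a Stored Completion, and tag it with
  // metadata so it can be filtered into an eval dataset later.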
  store: true,
  metadata: {
    role: "manager",
    department: "accounting",
    source: "homepage"
  }
});

console.log(response.choices[0]);

This will make the completion show up on the Stored Completions page in the dashboard.
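
You can also pull stored completions back out of the API programmatically. Recent SDK versions expose a list endpoint for stored Chat Completions; the sketch below assumes the metadata filter can be passed as an object, so treat the exact parameter shape as an assumption and check your SDK version:

List stored completions filtered by metadata

import OpenAI from "openai";
const openai = new OpenAI();

// Sketch: list stored completions tagged for the accounting department.
// Assumes chat.completions.list() accepts a metadata filter in this shape.
const stored = await openai.chat.completions.list({
  metadata: { department: "accounting" },
  limit: 10,
});

for (const completion of stored.data) {
  console.log(completion.id, completion.metadata);
}

This is a convenient way to spot-check that your metadata tags are being applied before you build an eval on top of them.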

Please note that Stored Completions contain unfiltered content from API prompts and completions. When you use content from Stored Completions for fine-tuning, you are responsible for making sure you have the appropriate permissions to use this content and that it does not include any personal information or other sensitive data.

From here, you can define an eval to judge the output of the model.

Define and run an eval against your test data

Once you have created a test dataset, either manually or via Stored Completions as described above, you can define the parameters of your eval run. If you generated test data from production traffic, you won't need to run completions again; you can go straight to defining the criteria for your eval.

There are a number of evaluation criteria (sometimes called graders) to choose from; these tests assess the quality of your model's responses. One flexible option is a model grader, which you can prompt to grade model outputs however you see fit.

Once you have defined your criteria, you can run your eval!
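
If you prefer to set things up in code rather than in the dashboard, the Evals API exposes the same concepts: a data source (here, the stored completions tagged earlier) plus one or more testing criteria. The following is a minimal sketch, not a definitive implementation; the grader prompt, labels, names, and the {{item.input}} template variable are illustrative assumptions you should adapt to your own data:

Define an eval with a model grader via the API

import OpenAI from "openai";
const openai = new OpenAI();

// Sketch: an eval over stored completions filtered by metadata,
// graded by a model grader. Prompt, labels, and names are illustrative.
const itEval = await openai.evals.create({
  name: "IT support answer quality",
  data_source_config: {
    type: "stored_completions",
    metadata: { department: "accounting" },
  },
  testing_criteria: [
    {
      type: "label_model",
      name: "Helpfulness grader",
      model: "gpt-4o",
      input: [
        {
          role: "developer",
          content:
            "Label the assistant's answer as helpful or unhelpful for a corporate IT support question.",
        },
        // Template variables like {{item.input}} depend on the data source
        // schema; treat this as an assumption.
        { role: "user", content: "{{item.input}}" },
      ],
      labels: ["helpful", "unhelpful"],
      passing_labels: ["helpful"],
    },
  ],
});

console.log(itEval.id);

Whether you configure the eval in the dashboard or through the API, the important part is the same: a clearly defined dataset plus graders that encode what a good response looks like.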

Iterate and improve

After your eval runs, you will see the resulting scores in the dashboard. By iterating on your prompts and criteria, you can improve your model outputs over time. With good evals and good test data in place, you can iterate on prompts and try new models with more confidence that your results remain high quality.

Fine-tuning

Improve a model's ability to generate responses tailored to your use case.

Model distillation

Learn how to distill large model results to smaller, cheaper, and faster models.