Evaluations
Automatically assess the quality of your AI outputs using Basalt Evaluations.
Evaluations allow you to automatically assess the quality and characteristics of your AI outputs. By integrating evaluators into your workflows, you can monitor for issues, gather metrics, and ensure your AI-generated content meets your standards.
Overview
Basalt’s evaluation system works by attaching evaluators to your traces and generations. These evaluators run automatically when generations complete, analyzing the input and output to produce objective metrics about the content.
Evaluations help you answer questions like: “Is this content factually accurate?”, “Does it contain harmful content?”, “Is it relevant to the original query?”, or any other quality metrics important to your application.
Creating Evaluators
To use evaluations in Basalt, you’ll need to create your own evaluators through the Basalt application interface. These evaluators can be designed to measure specific aspects of your AI-generated content that are important to your use case.
Evaluators are created and managed through the Basalt app. Once created, they can be referenced in your code by their slug.
Adding Evaluators to Traces
You can add evaluators at the trace level to have them automatically apply to all generations within that trace:
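Here is a minimal sketch of what this can look like with the TypeScript SDK. The client setup, the createTrace call, and the evaluators option shown here are assumptions based on the pattern described above, and the evaluator slugs are placeholders for slugs you create in the Basalt app:

```typescript
import { Basalt } from '@basalt-ai/sdk'

// Assumed client setup; adapt to however you configure your API key.
const basalt = new Basalt({ apiKey: process.env.BASALT_API_KEY })

// Attaching evaluators when the trace is created means every generation
// recorded under this trace is evaluated with them (subject to sampling).
const trace = basalt.monitor.createTrace('support-workflow', {
  evaluators: [
    { slug: 'factual-accuracy' },  // placeholder slug from the Basalt app
    { slug: 'harmful-content' },   // placeholder slug from the Basalt app
  ],
})
```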
Adding evaluators at the trace level is efficient when you want consistent evaluation across all generations within a workflow.
Adding Evaluators to Generations
For more targeted evaluation, you can add evaluators directly to specific generations:
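For instance, a sketch under the same assumptions as above (the createGeneration signature and its evaluators option are illustrative, not verified against the SDK):

```typescript
// Generation-level evaluators run only for this generation, in addition to
// any evaluators already attached at the trace level.
const generation = trace.createGeneration({
  name: 'answer-draft',
  input: 'How do I reset my password?',
  evaluators: [
    { slug: 'answer-relevance' },  // placeholder slug from the Basalt app
  ],
})

// ... call your model, then record its output on the generation ...
generation.end('To reset your password, open Settings and ...')
```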
Adding evaluators at the generation level allows you to apply specific evaluations to particular types of content.
Evaluation Configuration
The evaluationConfig parameter gives you control over how evaluations are applied:
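For example, here is a sketch of passing a sample rate alongside trace-level evaluators, using the same assumed option names as in the earlier sketches:

```typescript
// Evaluate roughly half of the traces produced by this workflow.
const trace = basalt.monitor.createTrace('newsletter-generation', {
  evaluators: [{ slug: 'brand-compliance' }],  // placeholder slug
  evaluationConfig: {
    sampleRate: 0.5,  // range 0.0 to 1.0; defaults to 0.1 (10%) when omitted
  },
})
```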
Sample Rate
The sampleRate parameter controls how often evaluations are run:
- Value range: 0.0 to 1.0 (0% to 100%)
- Default value: 0.1 (10%)
Sample rates allow you to balance evaluation coverage with cost efficiency. For example:
- sampleRate: 1.0 - Evaluate every trace (100%)
- sampleRate: 0.5 - Evaluate approximately half of all traces (50%)
- sampleRate: 0.1 - Evaluate approximately one in ten traces (10%)
- sampleRate: 0.01 - Evaluate approximately one in a hundred traces (1%)
Sampling is applied at the trace level. When a trace is selected for evaluation, all evaluators assigned to that trace and its generations will run. This ensures you get complete evaluation data for the sampled traces rather than patchy data across all traces.
In experiment mode (when a trace is attached to an experiment), the sample rate is always set to 1.0 (100%) regardless of the configured value, ensuring complete evaluation coverage for experimental workflows.
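Conceptually, the per-trace sampling decision works roughly like this (illustrative logic only, not the SDK's actual implementation):

```typescript
// One draw per trace: either all evaluators attached to the trace and its
// generations run, or none do, so sampled traces carry complete data.
function shouldEvaluateTrace(sampleRate: number, isExperiment: boolean): boolean {
  if (isExperiment) return true       // experiment mode forces 100% coverage
  return Math.random() < sampleRate   // otherwise sample at the configured rate
}
```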
Example: Multi-Faceted Evaluation
Here’s an example of using multiple evaluators to comprehensively assess content:
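Below is a sketch of what such a setup could look like, using placeholder evaluator slugs and the same assumed createTrace/createGeneration API as in the earlier sketches:

```typescript
// Trace-level evaluators apply to every generation in the workflow.
const trace = basalt.monitor.createTrace('product-announcement', {
  evaluators: [
    { slug: 'brand-compliance' },
    { slug: 'message-clarity' },
  ],
  evaluationConfig: {
    sampleRate: 0.25,  // evaluate roughly a quarter of traces
  },
})

// Generation-level evaluators target this specific piece of content.
const generation = trace.createGeneration({
  name: 'announcement-copy',
  input: 'Write a launch announcement for the new analytics dashboard',
  evaluators: [
    { slug: 'tone-of-voice' },
    { slug: 'feature-coverage' },
  ],
})

// ... run your model, then close out the generation and the trace ...
generation.end('Introducing our new analytics dashboard ...')
trace.end()
```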
In this example:
- We create a trace with evaluators for brand compliance and message clarity
- We add content-specific evaluators to the generation for tone and feature coverage
- The evaluations will run on approximately 25% of traces due to the sampleRate setting
Best Practices
To get the most value from evaluations:
- Start with a baseline: Begin with a modest set of evaluators to establish baseline metrics before adding more.
- Choose appropriate sample rates: High-volume applications may only need a small sample rate (1-10%), while critical applications might warrant higher rates (50-100%).
- Combine evaluators strategically: Different evaluators measure different aspects of quality; use combinations that address your specific concerns.
- Use contextual evaluators: Different content types may need different evaluations; tailor your evaluator selection to the specific content.
Related Topics
To further enhance your understanding and use of evaluations:
- Experiments - Compare evaluation results between different workflow versions
- Tracing - Add evaluators to more complex workflows
- Prompt Management - Improve prompts based on evaluation feedback