Run and compare different variants of your AI workflows with Basalt Experiments.
Experiments allow you to compare different versions of your AI workflows by running multiple traces under controlled conditions. This feature is particularly useful for A/B testing prompt changes, evaluating different model configurations, or optimizing your AI application’s performance.
To create an experiment, use the createExperiment method:
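For example (a minimal sketch in TypeScript, assuming the SDK is installed as @basalt-ai/sdk and that createExperiment lives on the monitoring client with the feature slug as its first argument; the package name and call shape are assumptions):

```typescript
import { Basalt } from '@basalt-ai/sdk' // package name assumed

const basalt = new Basalt({ apiKey: process.env.BASALT_API_KEY! })

// Create an experiment tied to a feature by its slug.
const experiment = await basalt.monitor.createExperiment(
  'my-feature-slug',                // feature-slug of the feature under test
  { name: 'Prompt v2 comparison' }, // name shown in the Basalt UI
)
```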
Parameters:
feature-slug: The slug of the feature this experiment is associated with
name: A descriptive name for the experiment (visible in the UI)

The createExperiment method returns an experiment object with the following properties:
id: Unique identifier for the experiment
name: The name you provided
featureSlug: The feature slug this experiment is associated with
createdAt: Timestamp when the experiment was created

Once you have created an experiment, you can attach traces to it by passing the experiment object to the createTrace method:
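For example (continuing the sketch above; the exact createTrace options shape is an assumption):

```typescript
// Create a trace that belongs to the experiment.
// The feature slug must match the experiment's featureSlug.
const trace = basalt.monitor.createTrace('my-feature-slug', {
  name: 'variant-a',
  experiment, // the object returned by createExperiment
})
```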
You can also set an experiment on an existing trace:
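A hypothetical sketch; the exact method for attaching an experiment to an already-created trace may differ in your SDK version:

```typescript
// Attach the experiment to a trace created earlier.
// trace.update() and the fields it accepts are assumptions here.
trace.update({ experiment })
```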
The feature slug of the experiment must match the feature slug of the trace. If they don’t match, the experiment will be ignored and the trace will go to regular monitoring instead.
When you attach evaluators to traces in an experiment, they run with special behavior:
When traces are attached to an experiment, evaluators will run on 100% of the traces regardless of the sampleRate setting. This ensures you get complete evaluation data for all experimental runs, which is essential for meaningful comparisons.
This behavior lets you compare variants on complete evaluation data rather than a sampled subset, so differences between runs are not obscured by sampling.
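For example, an evaluator attached to a trace inside an experiment runs on every trace, even if it would normally be sampled (a sketch; the evaluator-attachment shape is an assumption):

```typescript
const trace = basalt.monitor.createTrace('my-feature-slug', {
  experiment,
  // Assumed shape for attaching an evaluator at trace creation.
  // Outside an experiment this evaluator might be sampled; inside an
  // experiment it runs on 100% of traces regardless of sampleRate.
  evaluators: [{ slug: 'answer-relevance' }],
})
```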
Using experiments for testing provides several advantages:
Centralized Results: All test runs are grouped together in a single experiment, making it easy to analyze the results.
Workflow Validation: You can validate that your AI workflows produce the expected outputs for a variety of inputs.
Quality Regression Detection: By running experiments regularly (e.g., on every PR or nightly), you can detect regressions in your AI workflows.
Evaluation Integration: Automatic evaluators provide objective metrics about the quality of your AI outputs.
Mock Data Support: You can run experiments with mock data or synthetic inputs to test specific scenarios.
Historical Comparisons: Compare current performance against previous runs to ensure your changes improve quality.
To get the most out of experiments:
Run enough samples: Aim for at least 50-100 runs per variant to get statistically significant results.
Control your variables: Change only one thing at a time between variants to isolate the impact of that change.
Add detailed metadata: Include information that will help you analyze the results later, such as variant identifiers and relevant parameters.
Include evaluators: Add evaluators to automatically assess the quality of outputs from different variants.
Use the same feature slug: Make sure the experiment and all traces use the same feature slug.
Add error handling: Ensure all traces are ended properly, even if errors occur (see the sketch after this list).
Name your variants: Consistently name your variants to make it easy to identify them in the results.
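A sketch that combines several of these practices: consistent variant naming, variant metadata, an evaluator, and a finally block so the trace is always ended. It continues the earlier sketches, and the trace methods and field names shown are assumptions:

```typescript
async function runVariant(input: string) {
  const trace = basalt.monitor.createTrace('my-feature-slug', {
    experiment,
    name: 'prompt-v2',                  // consistent variant naming
    metadata: { variant: 'prompt-v2' }, // metadata for later analysis
    evaluators: [{ slug: 'answer-relevance' }],
  })

  try {
    const output = await generateAnswer(input) // your own workflow function (hypothetical)
    trace.update({ output })
    return output
  } catch (err) {
    trace.update({ metadata: { variant: 'prompt-v2', error: String(err) } })
    throw err
  } finally {
    trace.end() // always end the trace, even on failure
  }
}
```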
After running an experiment, you can view and analyze the results in the Basalt dashboard. The dashboard provides tools to compare metrics between different variants, visualize performance differences, and make data-driven decisions about which approach works best.