Learn how to leverage Basalt Experiments for automated testing and quality assurance in AI workflows.
Experiments in Basalt are not just for comparing different variants of your AI workflows — they also provide powerful capabilities for automated testing, continuous integration/continuous deployment (CI/CD), and quality assurance. This guide explores how to effectively use experiments in your testing and deployment pipelines.
Traditional software testing methods don’t always translate well to AI applications. LLMs can produce variable outputs for the same input, making deterministic testing challenging. Basalt Experiments solve this problem by enabling:
- Systematic testing of AI workflows against known inputs
- Statistical evaluation of output quality rather than exact matching (see the short sketch below)
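To make that contrast concrete, here is a minimal sketch in plain TypeScript (not Basalt API): an exact-match assertion breaks whenever the wording shifts, while a threshold over evaluator scores tolerates run-to-run variation. The `scores` array is a hypothetical stand-in for whatever per-run quality scores your evaluators produce.

```typescript
// Deterministic testing: brittle for LLM outputs, since wording varies run to run
const exactMatchPasses = (output: string) =>
  output === 'The exact expected answer'

// Statistical evaluation: judge aggregate quality across runs instead of exact text.
// `scores` is a hypothetical array of per-run evaluator scores in [0, 1].
const statisticalPasses = (scores: number[], threshold = 0.8) => {
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length
  return mean >= threshold
}

console.log(exactMatchPasses('The exact expected answer.')) // false: one extra character
console.log(statisticalPasses([0.82, 0.91, 0.78, 0.88]))    // true: mean quality is acceptable
```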
The fundamental pattern for using experiments in testing is:
```typescript
async function testAIWorkflow() {
  // 1. Create a single experiment for this test run
  const { value: experiment, error } = await basalt.monitor.createExperiment(
    'feature-slug',
    { name: `Test Run - ${new Date().toISOString()}` }
  )

  if (error) {
    throw new Error(`Failed to create experiment: ${error.message}`)
  }

  // 2. Define test cases
  const testCases = [
    { input: 'Test input 1', expectedProperties: ['clarity', 'conciseness'] },
    { input: 'Test input 2', expectedProperties: ['detail', 'accuracy'] },
    // Add more test cases...
  ]

  // 3. Run each test case with the same experiment
  for (const testCase of testCases) {
    await runTestCase(testCase, experiment)
  }

  return experiment.id
}

async function runTestCase(testCase, experiment) {
  // Create a trace for this test case
  const trace = basalt.monitor.createTrace('feature-slug', {
    name: `Test: ${testCase.input.substring(0, 20)}...`,
    experiment: experiment,
    metadata: {
      testCase: testCase,
      executionTime: new Date().toISOString()
    },
    evaluators: [
      { slug: 'quality-score' }
    ]
  })

  try {
    // Run the actual AI workflow
    const result = await yourAIWorkflow(testCase.input)

    // End the trace with the result
    trace.end(result)
  } catch (error) {
    trace.update({
      metadata: { error: error.message }
    })
    trace.end(`Error: ${error.message}`)
  }
}
```
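As a rough sketch of how this might be wired into a pipeline (the script name and gating behaviour here are assumptions, not part of the Basalt SDK), a CI step can simply run `testAIWorkflow` and fail the job if experiment creation or any test case throws; reviewing the evaluator scores themselves still happens in Basalt.

```typescript
// Hypothetical CI entry point, e.g. invoked by `node run-ai-tests.js` in your pipeline
testAIWorkflow()
  .then((experimentId) => {
    console.log(`Test run recorded under experiment ${experimentId}`)
  })
  .catch((error) => {
    console.error(`AI test run failed: ${error.message}`)
    process.exit(1) // fail the CI job
  })
```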
For controlled testing environments, you can use mock data to ensure consistent inputs:
```typescript
async function runMockTests() {
  // Load mock data from your test fixtures
  const mockData = loadMockData()

  // Create an experiment for this test run
  const { value: experiment } = await basalt.monitor.createExperiment(
    'content-generation',
    { name: `Mock Test - ${new Date().toISOString()}` }
  )

  for (const mockItem of mockData) {
    const trace = basalt.monitor.createTrace('content-generation', {
      experiment: experiment,
      metadata: {
        mockId: mockItem.id,
        category: mockItem.category
      }
    })

    // Use mock data instead of real API calls
    const mockPrompt = createMockPrompt(mockItem)
    const mockLLMResponse = getMockLLMResponse(mockItem)

    // Create generations with the mock data
    const generation = trace.createGeneration({
      name: 'mock-generation',
      input: mockPrompt,
      metadata: { mockResponseId: mockLLMResponse.id }
    })

    // Record the mock response
    generation.end(mockLLMResponse.content)

    // Process the response with your real business logic
    const processedResult = yourBusinessLogic(mockLLMResponse.content)

    // End the trace with the final result
    trace.end(processedResult)
  }
}
```
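The helpers used above (`loadMockData`, `createMockPrompt`, `getMockLLMResponse`) are not part of the Basalt SDK; they are placeholders for your own fixtures. A minimal sketch, assuming JSON fixtures committed alongside your tests, might look like this:

```typescript
import { readFileSync } from 'node:fs'

// Hypothetical fixture shape — adapt it to whatever your test suite needs
interface MockItem {
  id: string
  category: string
  topic: string
  cannedResponse: string
}

function loadMockData(): MockItem[] {
  // Read canned fixtures from a file in your repo (path is illustrative)
  return JSON.parse(readFileSync('./fixtures/content-generation.json', 'utf-8'))
}

function createMockPrompt(mockItem: MockItem): string {
  return `Write a short article about ${mockItem.topic}`
}

function getMockLLMResponse(mockItem: MockItem) {
  // Return a canned response instead of calling a real model
  return { id: `mock-${mockItem.id}`, content: mockItem.cannedResponse }
}
```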
Regression testing ensures that improvements or changes to your AI workflows don’t negatively impact performance or quality:
```typescript
async function runRegressionTest() {
  // Create an experiment for this regression test
  const { value: experiment } = await basalt.monitor.createExperiment(
    'customer-support',
    { name: `Regression Test - ${new Date().toISOString()}` }
  )

  // Load a set of benchmark queries that should produce good results
  const benchmarkQueries = loadBenchmarkQueries()

  // Run each benchmark query
  for (const query of benchmarkQueries) {
    const trace = basalt.monitor.createTrace('customer-support', {
      name: `Regression: ${query.id}`,
      experiment: experiment,
      evaluators: [
        { slug: 'accuracy' },
        { slug: 'helpfulness' }
      ]
    })

    try {
      // Run the current version of your workflow
      const result = await customerSupportWorkflow(query.text)

      // Record the result
      trace.end(result)
    } catch (error) {
      trace.update({
        metadata: { error: error.message }
      })
      trace.end(`Error: ${error.message}`)
    }
  }

  // You would implement your own logic to compare
  // current results with your baseline results

  return { experimentId: experiment.id }
}
```
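The baseline comparison itself is up to you. One deliberately simple sketch, assuming you also collect each query's output into a `Map` keyed by query ID during the loop above and keep a baseline file of required phrases per query, could look like this (`findRegressions` and `BaselineEntry` are hypothetical names, not Basalt API):

```typescript
// Hypothetical baseline entry: phrases a known-good answer must still contain
interface BaselineEntry {
  queryId: string
  requiredPhrases: string[]
}

function findRegressions(
  currentResults: Map<string, string>, // query id -> output from this run
  baseline: BaselineEntry[]
): string[] {
  const regressions: string[] = []
  for (const entry of baseline) {
    const output = currentResults.get(entry.queryId) ?? ''
    // Flag queries that errored or no longer contain a required phrase
    const failed =
      output.startsWith('Error:') ||
      entry.requiredPhrases.some((phrase) => !output.includes(phrase))
    if (failed) regressions.push(entry.queryId)
  }
  return regressions
}
```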
Before deploying new AI workflows to production, you can validate them with experiments:
```typescript
async function validateForProduction() {
  // Create an experiment for pre-production validation
  const { value: experiment } = await basalt.monitor.createExperiment(
    'product-recommendation',
    { name: `Pre-Production Validation - ${new Date().toISOString()}` }
  )

  // Load production-like test cases
  const validationCases = loadProductionValidationCases()

  // Define minimum acceptable scores for each evaluator
  const minimumScores = {
    'relevance': 0.7,
    'safety': 0.95,
    'helpfulness': 0.8
  }

  let results = []

  // Run each validation case
  for (const testCase of validationCases) {
    const trace = basalt.monitor.createTrace('product-recommendation', {
      name: `Validation: ${testCase.category}`,
      experiment: experiment,
      metadata: {
        category: testCase.category,
        difficulty: testCase.difficulty
      },
      evaluators: [
        { slug: 'relevance' },
        { slug: 'safety' },
        { slug: 'helpfulness' }
      ]
    })

    try {
      // Run the workflow that's about to be deployed
      const result = await productRecommendationWorkflow(testCase.input)

      // Store metadata about this test case before closing the trace
      trace.update({
        metadata: { input: testCase.input, output: result }
      })

      // Record the result
      trace.end(result)
      results.push(result)
    } catch (error) {
      trace.update({
        metadata: { error: error.message }
      })
      trace.end(`Error: ${error.message}`)
    }
  }

  // Your own logic to determine if the workflow is ready for production,
  // e.g. comparing evaluator scores against minimumScores

  return { experimentId: experiment.id, results: results }
}
```
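How you turn the results into a go/no-go decision is also up to you. If you gather the experiment's average score per evaluator into a plain object (for example by exporting them from the Basalt dashboard once the run has been evaluated), a hypothetical gate against `minimumScores` could be as small as this:

```typescript
// Hypothetical gate: averageScores maps evaluator slugs to their mean score
// for the experiment, collected however you export them from Basalt.
function readyForProduction(
  averageScores: Record<string, number>,
  minimumScores: Record<string, number>
): boolean {
  return Object.entries(minimumScores).every(
    ([evaluator, minimum]) => (averageScores[evaluator] ?? 0) >= minimum
  )
}

// Example: passes because every average clears its minimum
readyForProduction(
  { relevance: 0.82, safety: 0.97, helpfulness: 0.85 },
  { relevance: 0.7, safety: 0.95, helpfulness: 0.8 }
) // true
```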