Experiments in Basalt are not just for comparing different variants of your AI workflows — they also provide powerful capabilities for automated testing, continuous integration/continuous deployment (CI/CD), and quality assurance. This guide explores how to effectively use experiments in your testing and deployment pipelines.

Automating Quality Assurance for AI

Traditional software testing methods don’t always translate well to AI applications. LLMs can produce variable outputs for the same input, making deterministic testing challenging. Basalt Experiments solve this problem by enabling:

  • Systematic testing of AI workflows against known inputs
  • Statistical evaluation of output quality rather than exact matching
  • Detection of regressions in AI performance
  • Objective quality measurements through evaluators

Setting Up Experiments for Testing

Basic Test Setup

The fundamental pattern for using experiments in testing is:

async function testAIWorkflow() {
  // 1. Create a single experiment for this test run
  const { value: experiment, error } = await basalt.monitor.createExperiment(
    'feature-slug',
    { name: `Test Run - ${new Date().toISOString()}` }
  )
  
  if (error) {
    throw new Error(`Failed to create experiment: ${error.message}`)
  }
  
  // 2. Define test cases
  const testCases = [
    { input: 'Test input 1', expectedProperties: ['clarity', 'conciseness'] },
    { input: 'Test input 2', expectedProperties: ['detail', 'accuracy'] },
    // Add more test cases...
  ]
  
  // 3. Run each test case with the same experiment
  for (const testCase of testCases) {
    await runTestCase(testCase, experiment)
  }
  
  return experiment.id
}

async function runTestCase(testCase, experiment) {
  // Create a trace for this test case
  const trace = basalt.monitor.createTrace('feature-slug', {
    name: `Test: ${testCase.input.substring(0, 20)}...`,
    experiment: experiment,
    metadata: {
      testCase: testCase,
      executionTime: new Date().toISOString()
    },
    evaluators: [
      { slug: 'quality-score' }
    ]
  })
  
  try {
    // Run the actual AI workflow
    const result = await yourAIWorkflow(testCase.input)
    
    // End the trace with the result
    trace.end(result)
  } catch (error) {
    trace.update({
      metadata: { error: error.message }
    })
    trace.end(`Error: ${error.message}`)
  }
}

Mock Data Testing

For controlled testing environments, you can use mock data to ensure consistent inputs:

async function runMockTests() {
  // Load mock data from your test fixtures
  const mockData = loadMockData()
  
  // Create an experiment for this test run
  const { value: experiment } = await basalt.monitor.createExperiment(
    'content-generation',
    { name: `Mock Test - ${new Date().toISOString()}` }
  )
  
  for (const mockItem of mockData) {
    const trace = basalt.monitor.createTrace('content-generation', {
      experiment: experiment,
      metadata: { 
        mockId: mockItem.id,
        category: mockItem.category
      }
    })
    
    // Use mock data instead of real API calls
    const mockPrompt = createMockPrompt(mockItem)
    const mockLLMResponse = getMockLLMResponse(mockItem)
    
    // Create generations with the mock data
    const generation = trace.createGeneration({
      name: 'mock-generation',
      input: mockPrompt,
      metadata: { mockResponseId: mockLLMResponse.id }
    })
    
    // Record the mock response
    generation.end(mockLLMResponse.content)
    
    // Process the response with your real business logic
    const processedResult = yourBusinessLogic(mockLLMResponse.content)
    
    // End the trace with the final result
    trace.end(processedResult)
  }
}

Integrating with CI/CD Pipelines

GitHub Actions Example

Here’s how to integrate Basalt Experiments into a GitHub Actions workflow:

name: AI Workflow Tests

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test-ai-workflows:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          
      - name: Install dependencies
        run: npm ci
      
      - name: Run AI workflow tests
        env:
          BASALT_API_KEY: ${{ secrets.BASALT_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npm run test:ai-workflows
        
      - name: Post experiment results
        run: node scripts/post-experiment-results.js

Your test script would create an experiment, run the tests, and write the experiment ID to an output file that the post-results script could use.
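
For example, a minimal test entry point could persist the experiment ID for the later pipeline step (the file name, the entry point, and the post-results script path are assumptions to adapt to your repository):

// test-ai-workflows.js (hypothetical entry point behind `npm run test:ai-workflows`)
import { writeFileSync } from 'node:fs'

async function main() {
  // testAIWorkflow() is the test function from the basic setup above; it returns the experiment ID
  const experimentId = await testAIWorkflow()
  
  // Persist the ID so scripts/post-experiment-results.js can read it in the next step
  writeFileSync('experiment-id.txt', experimentId)
  console.log(`Created experiment ${experimentId}`)
}

main().catch(error => {
  console.error(error)
  process.exit(1)
})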

Regression Testing

Regression testing ensures that improvements or changes to your AI workflows don’t negatively impact performance or quality:

async function runRegressionTest() {
  // Create an experiment for this regression test
  const { value: experiment } = await basalt.monitor.createExperiment(
    'customer-support',
    { name: `Regression Test - ${new Date().toISOString()}` }
  )
  
  // Load a set of benchmark queries that should produce good results
  const benchmarkQueries = loadBenchmarkQueries()
  
  // Run each benchmark query
  for (const query of benchmarkQueries) {
    const trace = basalt.monitor.createTrace('customer-support', {
      name: `Regression: ${query.id}`,
      experiment: experiment,
      evaluators: [
        { slug: 'accuracy' },
        { slug: 'helpfulness' }
      ]
    })
    
    try {
      // Run the current version of your workflow
      const result = await customerSupportWorkflow(query.text)
      
      // Record the result
      trace.end(result)
    } catch (error) {
      trace.update({
        metadata: { error: error.message }
      })
      trace.end(`Error: ${error.message}`)
    }
  }
  
  // You would implement your own logic to compare 
  // current results with your baseline results
  
  return {
    experimentId: experiment.id
  }
}
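
The comparison itself is up to you. As a minimal sketch, assuming you collect a score per benchmark query for the current run and keep baseline scores from a previous run, a regression check could look like this (findRegressions and the sample scores are illustrative, not part of the Basalt SDK):

// Illustrative only: compare per-query scores you collected yourself against stored baselines
function findRegressions(baselineScores, currentScores, tolerance = 0.05) {
  const regressions = []
  
  for (const [queryId, baseline] of Object.entries(baselineScores)) {
    const current = currentScores[queryId]
    if (current === undefined) continue
    
    // Flag any benchmark query whose score dropped by more than the tolerance
    if (current < baseline - tolerance) {
      regressions.push({ queryId, baseline, current })
    }
  }
  
  return regressions
}

// Example: fail the CI job if any benchmark query regressed
const regressions = findRegressions(
  { 'faq-refund': 0.88, 'faq-shipping': 0.91 },  // baseline scores from a previous run
  { 'faq-refund': 0.90, 'faq-shipping': 0.78 }   // scores collected for the current run
)

if (regressions.length > 0) {
  console.error('Regressions detected:', regressions)
  process.exit(1)
}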

Production Validation

Before deploying new AI workflows to production, you can validate them with experiments:

async function validateForProduction() {
  // Create an experiment for pre-production validation
  const { value: experiment } = await basalt.monitor.createExperiment(
    'product-recommendation',
    { name: `Pre-Production Validation - ${new Date().toISOString()}` }
  )
  
  // Load production-like test cases
  const validationCases = loadProductionValidationCases()
  
  // Define minimum acceptable scores for each evaluator
  const minimumScores = {
    'relevance': 0.7,
    'safety': 0.95,
    'helpfulness': 0.8
  }

  let results = []
  
  // Run each validation case
  for (const testCase of validationCases) {
    const trace = basalt.monitor.createTrace('product-recommendation', {
      name: `Validation: ${testCase.category}`,
      experiment: experiment,
      metadata: {
        category: testCase.category,
        difficulty: testCase.difficulty
      },
      evaluators: [
        { slug: 'relevance' },
        { slug: 'safety' },
        { slug: 'helpfulness' }
      ]
    })
    
    try {
      // Run the workflow that's about to be deployed
      const result = await productRecommendationWorkflow(testCase.input)
      
      // Store metadata about this test case before closing the trace
      trace.update({
        metadata: {
          input: testCase.input,
          output: result
        }
      })
      
      // Record the result
      trace.end(result)
      results.push(result)
    } catch (error) {
      trace.update({
        metadata: {
          error: error.message,
        }
      })
      trace.end(`Error: ${error.message}`)
    }
  }
  
  // Your own logic to determine if the workflow is ready for production
  // based on the results
  
  return {
    experimentId: experiment.id,
    results: results
  }
}
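
As one possible gate, assuming you compute a score per criterion for each validation result yourself, you can compare averages against the minimumScores thresholds defined above (isReadyForProduction and the score shape are illustrative):

// Illustrative gate: scoredResults is an array of { scores: { relevance, safety, helpfulness } }
function isReadyForProduction(scoredResults, minimumScores) {
  for (const [criterion, minimum] of Object.entries(minimumScores)) {
    const scores = scoredResults.map(result => result.scores[criterion])
    const average = scores.reduce((sum, score) => sum + score, 0) / scores.length
    
    // Block the deployment if any criterion falls below its threshold on average
    if (average < minimum) {
      console.error(`"${criterion}" averaged ${average.toFixed(2)}, below the minimum of ${minimum}`)
      return false
    }
  }
  
  return true
}

// Example call, using the same thresholds as minimumScores above
const ready = isReadyForProduction(
  [
    { scores: { relevance: 0.82, safety: 0.98, helpfulness: 0.86 } },
    { scores: { relevance: 0.74, safety: 0.99, helpfulness: 0.91 } }
  ],
  { relevance: 0.7, safety: 0.95, helpfulness: 0.8 }
)

console.log(ready ? 'Ready for production' : 'Not ready for production')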

Best Practices for CI/CD and Testing

1. Experiment Naming Conventions

Establish clear naming conventions for your experiments (a small helper is sketched after this list):

  • Include build numbers or commit hashes in experiment names for traceability
  • Use prefixes like test-, regression-, or validation- to indicate the experiment purpose
  • Add timestamps to make experiments easily identifiable in chronological order
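
A small helper can keep those conventions consistent across test scripts; GITHUB_SHA and GITHUB_RUN_NUMBER are default GitHub Actions environment variables and only serve as an example:

// Builds names like "regression-128-a1b2c3d-2025-01-01T12:00:00.000Z"
function buildExperimentName(purpose) {
  // Both variables are set automatically in GitHub Actions; fall back to local values elsewhere
  const commit = (process.env.GITHUB_SHA || 'local').slice(0, 7)
  const build = process.env.GITHUB_RUN_NUMBER || 'dev'
  
  return `${purpose}-${build}-${commit}-${new Date().toISOString()}`
}

// Usage when creating an experiment:
// basalt.monitor.createExperiment('customer-support', { name: buildExperimentName('regression') })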

2. Metadata Standardization

Standardize metadata across your traces to enable consistent analysis (an example shape follows this list):

  • Include relevant build information in all traces (e.g., version, environment)
  • Define expected outcomes in metadata for easier comparison
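
For example, a shared helper can attach the same fields to every trace; the specific fields and environment variables shown here are only suggestions:

// Suggested shape only: adapt the fields to the build information available in your pipeline
function buildStandardMetadata(testCase) {
  return {
    version: process.env.APP_VERSION || 'unknown',  // e.g. injected by your CI pipeline
    environment: process.env.NODE_ENV || 'test',
    commit: process.env.GITHUB_SHA || 'local',
    expectedOutcome: testCase.expectedOutcome,      // what this test case should produce
    executionTime: new Date().toISOString()
  }
}

// Usage: pass the standard fields when creating a trace
// basalt.monitor.createTrace('feature-slug', { metadata: buildStandardMetadata(testCase) })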

3. Automation Best Practices

  • Create dedicated CI/CD jobs for different types of AI tests (unit, integration, regression)
  • Run critical tests on every PR, but reserve extensive testing for nightly builds
  • Archive experiment IDs in your CI/CD system for future reference
  • Set up automated notifications for test failures or regressions

4. Statistical Considerations

  • Ensure sample sizes are large enough to draw meaningful conclusions
  • Account for LLM variability by using thresholds rather than exact matching (see the sketch after this list)
  • Track trends over time rather than focusing on individual test runs
  • Consider statistical significance when comparing experiments
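
For instance, instead of asserting on a single generation, you can run the same input several times and gate on the pass rate over the sample; meetsExpectations below is a placeholder for your own per-output check:

// Illustrative: run one input several times and require a minimum pass rate over the sample
async function passesThreshold(input, runs = 10, threshold = 0.9) {
  let passes = 0
  
  for (let i = 0; i < runs; i++) {
    const output = await yourAIWorkflow(input)
    // meetsExpectations is a placeholder for whatever per-output check you use
    if (meetsExpectations(output)) passes++
  }
  
  const rate = passes / runs
  console.log(`Pass rate: ${rate.toFixed(2)} over ${runs} runs`)
  return rate >= threshold
}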