Datasets
In Basalt, a dataset is a structured collection of examples you use to evaluate and improve your AI systems. Instead of testing prompts or models with ad‑hoc inputs, you store representative examples—inputs, expected outputs, and metadata—in datasets and reuse them across experiments, environments, and teams.

Typical uses include:
- Building evaluation suites for prompts, RAG pipelines, and agents
- Capturing high‑quality production examples for regression testing
- Sharing canonical test cases between teams and environments
This page focuses on the concept of datasets: what they are, when to use them, and how they fit into your evaluation workflow.
For concrete SDK usage, see the Python pages in this section. TypeScript v1 docs are not available yet (use the v0 archive).
What datasets contain
A dataset usually describes a specific evaluation surface—for example “customer support Q&A” or “RAG search quality”. Each dataset is made of:
- Columns – the schema describing what each row contains (e.g. question, context, category)
- Rows – individual test cases with:
  - Values: the actual inputs or fields (e.g. a user question, retrieved context)
  - Ideal output: the expected or “golden” answer, when available
  - Metadata: additional context like difficulty, source, tags, or ratings
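
As a rough mental model, a dataset is a schema plus a list of rows. The sketch below uses plain Python dictionaries (not Basalt SDK objects) to illustrate the shape of a small customer‑support dataset; the field names and values are purely illustrative.

```python
# Illustrative only: plain Python structures, not Basalt SDK objects.
dataset = {
    "name": "customer-support-qa",
    # Columns define the schema every row follows.
    "columns": ["question", "context", "category"],
    "rows": [
        {
            # Values: the actual inputs for this test case.
            "values": {
                "question": "How do I reset my password?",
                "context": "Help center article: Account & security",
                "category": "account",
            },
            # Ideal output: the expected ("golden") answer, when available.
            "ideal_output": "Go to Settings > Security and click 'Reset password'.",
            # Metadata: extra context such as difficulty, source, or tags.
            "metadata": {"difficulty": "easy", "source": "production", "tags": ["auth"]},
        },
    ],
}
```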
Why use datasets
Datasets help you:
- Ensure consistency – run the same tests across versions, models, and environments
- Improve reproducibility – reproduce issues and fixes using stable, versioned examples
- Streamline experimentation – plug datasets into evaluators and experiments instead of hand‑crafting inputs
- Track progress over time – measure quality improvements on a fixed benchmark
- Organize real‑world examples – turn production traffic into structured, reusable test cases
How datasets fit with other Basalt features
Datasets work best alongside:
- Prompts – use dataset rows as inputs to prompts and compare actual vs. ideal outputs
- Evaluators – automate scoring of model behavior on dataset rows
- Experiments – run A/B tests or model comparisons over a given dataset
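
In practice, an experiment over a dataset loops through the rows, runs your prompt or pipeline on each row's values, and scores the actual output against the ideal output. The sketch below is a minimal, self‑contained illustration: run_prompt is a placeholder for your prompt, RAG pipeline, or agent, and the exact‑match scorer stands in for whichever evaluator you use.

```python
# Illustrative only: run_prompt is a placeholder for your prompt, pipeline, or agent.
def run_prompt(values: dict) -> str:
    return "Go to Settings > Security and click 'Reset password'."

# A minimal evaluator: 1.0 if the output matches the golden answer exactly.
def exact_match(actual: str, ideal: str) -> float:
    return 1.0 if actual.strip() == ideal.strip() else 0.0

rows = [
    {
        "values": {"question": "How do I reset my password?"},
        "ideal_output": "Go to Settings > Security and click 'Reset password'.",
    },
]

# Score every row and report an aggregate metric for this run.
scores = [exact_match(run_prompt(row["values"]), row["ideal_output"]) for row in rows]
print(f"Average score: {sum(scores) / len(scores):.2f}")
```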
Using the SDK, you can:
- List and inspect datasets for your workspace
- Retrieve full datasets (columns and rows) for testing
- Add new rows programmatically from scripts, experiments, or production systems
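
The snippet below sketches what these operations might look like in Python. The import path, client setup, and method names (datasets.list, datasets.get, datasets.add_row) are assumptions made for illustration, not the documented SDK API; see the Python pages in this section for actual usage.

```python
# Hypothetical sketch: the import path and method names below are assumptions,
# not the documented Basalt SDK API. See the Python pages for actual usage.
from basalt import Basalt  # assumed import path

client = Basalt(api_key="YOUR_API_KEY")

# List and inspect datasets for your workspace.
for ds in client.datasets.list():
    print(ds.slug, ds.name)

# Retrieve a full dataset (columns and rows) for testing.
dataset = client.datasets.get("customer-support-qa")
print(dataset.columns)

# Add a new row programmatically, e.g. from a production trace.
client.datasets.add_row(
    slug="customer-support-qa",
    values={"question": "How do I export my data?", "category": "account"},
    ideal_output="Go to Settings > Data and click 'Export'.",
    metadata={"source": "production"},
)
```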