Core Concepts
Understanding the key concepts in SupaEval will help you effectively evaluate and improve your AI agents.
Datasets
A dataset is a collection of test cases used to evaluate your agent. Each dataset contains:
- Test Cases - Individual evaluation scenarios
- Prompts - Input queries or instructions
- Expected Outputs - Reference answers (optional)
- Metadata - Additional context like difficulty, domain, or tags
Dataset Structure
Datasets can be created from CSV or JSON files, programmatically via the SDK, or through our web interface.
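For example, a minimal dataset written out as JSON might look like the script below. The field names (`prompt`, `expected_output`, `metadata`) mirror the components listed above but are illustrative; check the dataset import reference for the exact schema SupaEval expects.

```python
import json

# Illustrative only: the exact field names SupaEval expects may differ.
# Each test case pairs a prompt with an optional expected output and metadata.
dataset = [
    {
        "prompt": "What is the refund policy for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days of purchase.",
        "metadata": {"difficulty": "easy", "domain": "billing", "tags": ["policy"]},
    },
    {
        "prompt": "Summarize the key changes in the v2 API.",
        "expected_output": None,  # reference answers are optional
        "metadata": {"difficulty": "medium", "domain": "developer-docs", "tags": ["api"]},
    },
]

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```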
Evaluations
An evaluation is a run that executes your agent against a dataset and measures performance using specified metrics.
Types of Evaluations
- Ad-hoc Evaluations - Quick one-off tests during development
- Continuous Evaluations - Automated runs on code changes
- Benchmark Evaluations - Standardized comparisons over time
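As a rough sketch of what an evaluation run looks like in code, the snippet below assumes a hypothetical `supaeval` Python SDK; the client, method names, and parameters are placeholders for illustration rather than the documented API.

```python
# Hypothetical SDK usage: client name, methods, and parameters are illustrative.
from supaeval import SupaEval  # assumed import path

client = SupaEval(api_key="YOUR_API_KEY")

# Run the agent against an existing dataset with a chosen set of metrics.
run = client.evaluations.create(
    dataset="customer-support-v1",
    agent="support-agent@2024-06",
    metrics=["relevance", "faithfulness", "latency"],
)

# Results include a score per metric, computed automatically by SupaEval.
results = run.wait()
for metric, score in results.scores.items():
    print(f"{metric}: {score:.3f}")
```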
Metrics
Metrics are quantitative measures computed by SupaEval to assess different aspects of agent performance.
Retrieval Metrics
- Precision@K - Fraction of the top K retrieved documents that are relevant (see the sketch after this list)
- Recall@K - Fraction of all relevant documents that appear in the top K results
- nDCG - Normalized Discounted Cumulative Gain, which rewards ranking relevant documents higher
- MRR - Mean Reciprocal Rank, based on the position of the first relevant document
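For intuition only (SupaEval computes these automatically during a run), the sketch below shows how these retrieval metrics are typically defined over a ranked list of retrieved document IDs and a set of relevant IDs, using binary relevance for nDCG.

```python
import math

# Illustration only: SupaEval computes these metrics for you during a run.
# `retrieved` is the ranked list of document IDs returned by the agent;
# `relevant` is the set of IDs judged relevant for the query.

def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # MRR averages this value over all queries in the dataset.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    # Binary relevance: gain is 1 for relevant documents, 0 otherwise.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3"}
print(precision_at_k(retrieved, relevant, 3))   # 2 of the top 3 are relevant -> 0.667
print(recall_at_k(retrieved, relevant, 3))      # both relevant docs found -> 1.0
print(reciprocal_rank(retrieved, relevant))     # first relevant doc at rank 1 -> 1.0
print(ndcg_at_k(retrieved, relevant, 3))        # rewards relevant docs ranked higher
```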
Generation Metrics
- Relevance - How well the output addresses the query
- Faithfulness - Accuracy relative to source documents
- Hallucination Detection - Identifies fabricated information
- Answer Similarity - How closely the output matches the expected output (a simplified sketch follows this list)
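As a simplified illustration of answer similarity, the sketch below scores token-level F1 overlap between an output and the expected output; SupaEval's actual similarity scoring may rely on embeddings or model-based judges rather than this proxy.

```python
from collections import Counter

# Simplified illustration of "answer similarity": token-level F1 overlap.
# SupaEval's own scoring may use embeddings or model-based judges instead.
def token_f1(output: str, expected: str) -> float:
    out_tokens = Counter(output.lower().split())
    exp_tokens = Counter(expected.lower().split())
    overlap = sum((out_tokens & exp_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())
    recall = overlap / sum(exp_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("Refunds are available within 30 days of purchase.",
               "Annual plans can be refunded within 30 days of purchase."))
```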
Task Metrics
- Success Rate - Percentage of tasks completed correctly (see the roll-up sketch after this list)
- Latency - Response time per query
- Cost - Token usage and API costs
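To make the roll-up concrete, the sketch below aggregates a few per-query records into success rate, average latency, and total cost; the numbers and field names are made up for illustration, and SupaEval reports these figures automatically.

```python
# Illustration of how per-query results roll up into task metrics.
# Field names and values are illustrative; SupaEval reports these automatically.
runs = [
    {"success": True,  "latency_ms": 820,  "cost_usd": 0.0031},
    {"success": True,  "latency_ms": 1130, "cost_usd": 0.0044},
    {"success": False, "latency_ms": 640,  "cost_usd": 0.0021},
]

success_rate = sum(r["success"] for r in runs) / len(runs)    # 2 of 3 tasks succeeded
avg_latency = sum(r["latency_ms"] for r in runs) / len(runs)  # mean response time
total_cost = sum(r["cost_usd"] for r in runs)                 # summed API spend

print(f"success rate: {success_rate:.0%}, "
      f"avg latency: {avg_latency:.0f} ms, cost: ${total_cost:.4f}")
```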
Automatic Metric Computation
SupaEval computes all metrics automatically during evaluation runs. You don't need to implement scoring logic yourself.
Benchmarks
Benchmarks are named evaluation runs against fixed datasets that enable consistent comparisons across:
- Different agent versions
- Prompt variations
- Model changes
- Configuration updates
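As with the earlier SDK example, the comparison sketch below uses a hypothetical `supaeval` client; the names and signatures are assumptions, but it shows the idea: run the same fixed benchmark against two configurations and diff the scores.

```python
# Hypothetical SDK usage: names and signatures are illustrative.
from supaeval import SupaEval  # assumed import path

client = SupaEval(api_key="YOUR_API_KEY")

# Run the same fixed benchmark dataset against two agent versions.
baseline = client.benchmarks.run(name="support-bench", agent="support-agent@1.4")
candidate = client.benchmarks.run(name="support-bench", agent="support-agent@1.5")

# Because the dataset and metrics are fixed, scores are directly comparable.
for metric in baseline.scores:
    delta = candidate.scores[metric] - baseline.scores[metric]
    print(f"{metric}: {baseline.scores[metric]:.3f} -> "
          f"{candidate.scores[metric]:.3f} ({delta:+.3f})")
```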