Core Concepts

Understanding the key concepts in SupaEval will help you effectively evaluate and improve your AI agents.

Datasets

A dataset is a collection of test cases used to evaluate your agent. Each dataset contains:

  • Test Cases - Individual evaluation scenarios
  • Prompts - Input queries or instructions
  • Expected Outputs - Reference answers (optional)
  • Metadata - Additional context like difficulty, domain, or tags

Dataset Structure

Datasets can be created from CSV or JSON files, programmatically via the SDK, or through the web interface.
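
As a rough illustration, here is a minimal sketch of defining a dataset programmatically. The `supaeval` import, the `Client` object, the `create_dataset` call, and the field names are assumptions made for this example, not the documented SDK surface; check the SDK reference for the actual API.

```python
# Hypothetical SDK usage -- the import path, client, method, and field names
# below are illustrative assumptions, not the documented SupaEval API.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Each test case mirrors the structure above: a prompt, an optional
# expected output, and free-form metadata.
test_cases = [
    {
        "prompt": "What is the refund window for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days.",
        "metadata": {"difficulty": "easy", "domain": "billing", "tags": ["faq"]},
    },
    {
        "prompt": "Summarize the attached incident report.",
        "expected_output": None,  # reference answers are optional
        "metadata": {"difficulty": "hard", "domain": "support"},
    },
]

dataset = client.create_dataset(name="support-questions-v1", test_cases=test_cases)
```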

Evaluations

An evaluation is a run that executes your agent against a dataset and measures performance using specified metrics.

Types of Evaluations

  • Ad-hoc Evaluations - Quick one-off tests during development
  • Continuous Evaluations - Automated runs on code changes
  • Benchmark Evaluations - Standardized comparisons over time
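
The sketch below shows how an ad-hoc evaluation might be kicked off from code. As with the dataset example, the client, method, and parameter names are hypothetical; see the SDK reference for the real interface.

```python
# Hypothetical SDK usage -- names are illustrative assumptions.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Run the agent against a dataset and request a set of metrics.
evaluation = client.run_evaluation(
    dataset="support-questions-v1",
    agent="support-agent@2.4.0",
    metrics=["relevance", "faithfulness", "latency"],
)

# Block until the run finishes, then inspect aggregate scores.
results = evaluation.wait()
print(results.summary())
```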

Metrics

Metrics are quantitative measures computed by SupaEval to assess different aspects of agent performance.

Retrieval Metrics

  • Precision@K - Fraction of the top K retrieved documents that are relevant
  • Recall@K - Fraction of all relevant documents that appear in the top K results
  • nDCG - Normalized discounted cumulative gain
  • MRR - Mean reciprocal rank
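
These retrieval metrics follow their standard definitions. The snippet below is illustrative only (SupaEval computes them for you, as noted further down); it shows how they are typically calculated given a ranked list of retrieved document IDs and the set of relevant IDs.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """Reciprocal rank of the first relevant document for one query.

    MRR is this value averaged over all queries in the dataset.
    """
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: two of the three relevant documents appear in the top 3.
retrieved = ["d7", "d1", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(retrieved, relevant, 3))   # 2/3
print(recall_at_k(retrieved, relevant, 3))      # 2/3
print(reciprocal_rank(retrieved, relevant))     # 0.5
print(ndcg_at_k(retrieved, relevant, 3))        # ~0.531
```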

Generation Metrics

  • Relevance - How well the output addresses the query
  • Faithfulness - Accuracy relative to source documents
  • Hallucination Detection - Flags information in the output that is not supported by the source documents
  • Answer Similarity - Similarity between the generated answer and the expected output
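
Generation metrics are scored by SupaEval during the run, so there is nothing to implement. Purely as an intuition aid, a crude token-overlap F1 is one common proxy for answer similarity; the function below is illustrative and is not how SupaEval necessarily scores it.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Crude token-overlap F1 between a generated answer and the expected output.

    Illustrative only -- a rough proxy for answer similarity, not SupaEval's
    actual scoring method.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```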

Task Metrics

  • Success Rate - Percentage of correctly completed tasks
  • Latency - Response time per query
  • Cost - Token usage and API costs
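
A sketch of how these aggregates are typically derived from per-task results is shown below; the result field names (`success`, `latency_ms`, `prompt_tokens`, `completion_tokens`) are hypothetical, since SupaEval reports these metrics automatically.

```python
# Illustrative aggregation over per-task results; the field names are
# hypothetical assumptions, not SupaEval's actual result schema.
def aggregate_task_metrics(results, input_price_per_1k, output_price_per_1k):
    total = len(results)
    success_rate = sum(1 for r in results if r["success"]) / total
    avg_latency_ms = sum(r["latency_ms"] for r in results) / total
    cost_usd = sum(
        r["prompt_tokens"] / 1000 * input_price_per_1k
        + r["completion_tokens"] / 1000 * output_price_per_1k
        for r in results
    )
    return {
        "success_rate": success_rate,   # fraction of tasks completed correctly
        "avg_latency_ms": avg_latency_ms,
        "cost_usd": cost_usd,
    }
```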

Automatic Metric Computation

SupaEval computes all metrics automatically during evaluation runs. You don't need to implement scoring logic yourself.

Benchmarks

Benchmarks are named evaluation runs against fixed datasets that enable consistent comparisons across:

  • Different agent versions
  • Prompt variations
  • Model changes
  • Configuration updates
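
As with the other SDK snippets on this page, the following is a hypothetical sketch of comparing two benchmark runs; the method and field names are assumptions, not the documented API.

```python
# Hypothetical SDK usage -- method and field names are illustrative assumptions.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Run the same fixed dataset against two agent versions under one benchmark name.
baseline = client.run_benchmark(name="rag-weekly", dataset="support-questions-v1",
                                agent="support-agent@2.3.0")
candidate = client.run_benchmark(name="rag-weekly", dataset="support-questions-v1",
                                 agent="support-agent@2.4.0")

# Compare aggregate scores between the two runs.
for metric in ("relevance", "faithfulness", "latency"):
    old, new = baseline.scores[metric], candidate.scores[metric]
    print(f"{metric}: {old:.3f} -> {new:.3f} ({new - old:+.3f})")
```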

Next Steps