Core Concepts

Understanding the key concepts in SupaEval will help you effectively evaluate and improve your AI agents.

Datasets

A dataset is a collection of test cases used to evaluate your agent. Each dataset contains:

  • Test Cases - Individual evaluation scenarios
  • Prompts - Input queries or instructions
  • Expected Outputs - Reference answers (optional)
  • Metadata - Additional context like difficulty, domain, or tags

Dataset Structure

Datasets can be created from CSV or JSON files, programmatically via the SDK, or through the web interface.
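
As a rough illustration, here is a minimal sketch of defining a dataset programmatically. The `supaeval` import, the `Client` object, the `create_dataset` call, and the field names are assumptions made for this example, not the documented SDK surface; check the SDK reference for the actual API.

```python
# Hypothetical SDK usage -- the import path, client, method, and field names
# below are illustrative assumptions, not the documented SupaEval API.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Each test case mirrors the structure above: a prompt, an optional
# expected output, and free-form metadata.
test_cases = [
    {
        "prompt": "What is the refund window for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days.",
        "metadata": {"difficulty": "easy", "domain": "billing", "tags": ["faq"]},
    },
    {
        "prompt": "Summarize the attached incident report.",
        "expected_output": None,  # reference answers are optional
        "metadata": {"difficulty": "hard", "domain": "support"},
    },
]

dataset = client.create_dataset(name="support-questions-v1", test_cases=test_cases)
```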

Evaluations

An evaluation is a run that executes your agent against a dataset and measures performance using specified metrics.

Types of Evaluations

  • Ad-hoc Evaluations - Quick one-off tests during development
  • Continuous Evaluations - Automated runs on code changes
  • Benchmark Evaluations - Standardized comparisons over time
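
The sketch below shows how an ad-hoc evaluation might be kicked off from code. As with the dataset example, the client, method, and parameter names are hypothetical; see the SDK reference for the real interface.

```python
# Hypothetical SDK usage -- names are illustrative assumptions.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Run the agent against a dataset and request a set of metrics.
evaluation = client.run_evaluation(
    dataset="support-questions-v1",
    agent="support-agent@2.4.0",
    metrics=["relevance", "faithfulness", "latency"],
)

# Block until the run finishes, then inspect aggregate scores.
results = evaluation.wait()
print(results.summary())
```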

Metrics

Metrics are quantitative measures computed by SupaEval to assess different aspects of agent performance.

Retrieval Metrics

  • Precision@K - Fraction of the top K retrieved documents that are relevant
  • Recall@K - Fraction of all relevant documents that appear in the top K results
  • nDCG - Normalized discounted cumulative gain
  • MRR - Mean reciprocal rank
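
These retrieval metrics follow their standard definitions. The snippet below is illustrative only (SupaEval computes them for you, as noted further down); it shows how they are typically calculated given a ranked list of retrieved document IDs and the set of relevant IDs.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """Reciprocal rank of the first relevant document for one query.

    MRR is this value averaged over all queries in the dataset.
    """
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: two of the three relevant documents appear in the top 3.
retrieved = ["d7", "d1", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(retrieved, relevant, 3))   # 2/3
print(recall_at_k(retrieved, relevant, 3))      # 2/3
print(reciprocal_rank(retrieved, relevant))     # 0.5
print(ndcg_at_k(retrieved, relevant, 3))        # ~0.531
```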

Generation Metrics

  • Relevance - How well the output addresses the query
  • Faithfulness - Accuracy relative to source documents
  • Hallucination Detection - Flags information in the output that is not supported by the source documents
  • Answer Similarity - Similarity between the generated answer and the expected output
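
Generation metrics are scored by SupaEval during the run, so there is nothing to implement. Purely as an intuition aid, a crude token-overlap F1 is one common proxy for answer similarity; the function below is illustrative and is not how SupaEval necessarily scores it.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Crude token-overlap F1 between a generated answer and the expected output.

    Illustrative only -- a rough proxy for answer similarity, not SupaEval's
    actual scoring method.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```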

Task Metrics

  • Success Rate - Percentage of correctly completed tasks
  • Latency - Response time per query
  • Cost - Token usage and API costs
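
A sketch of how these aggregates are typically derived from per-task results is shown below; the result field names (`success`, `latency_ms`, `prompt_tokens`, `completion_tokens`) are hypothetical, since SupaEval reports these metrics automatically.

```python
# Illustrative aggregation over per-task results; the field names are
# hypothetical assumptions, not SupaEval's actual result schema.
def aggregate_task_metrics(results, input_price_per_1k, output_price_per_1k):
    total = len(results)
    success_rate = sum(1 for r in results if r["success"]) / total
    avg_latency_ms = sum(r["latency_ms"] for r in results) / total
    cost_usd = sum(
        r["prompt_tokens"] / 1000 * input_price_per_1k
        + r["completion_tokens"] / 1000 * output_price_per_1k
        for r in results
    )
    return {
        "success_rate": success_rate,   # fraction of tasks completed correctly
        "avg_latency_ms": avg_latency_ms,
        "cost_usd": cost_usd,
    }
```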

Automatic Metric Computation

SupaEval computes all metrics automatically during evaluation runs. You don't need to implement scoring logic yourself.

Benchmarks

Benchmarks are named evaluation runs against fixed datasets that enable consistent comparisons across:

  • Different agent versions
  • Prompt variations
  • Model changes
  • Configuration updates
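
As with the other SDK snippets on this page, the following is a hypothetical sketch of comparing two benchmark runs; the method and field names are assumptions, not the documented API.

```python
# Hypothetical SDK usage -- method and field names are illustrative assumptions.
from supaeval import Client

client = Client(api_key="YOUR_API_KEY")

# Run the same fixed dataset against two agent versions under one benchmark name.
baseline = client.run_benchmark(name="rag-weekly", dataset="support-questions-v1",
                                agent="support-agent@2.3.0")
candidate = client.run_benchmark(name="rag-weekly", dataset="support-questions-v1",
                                 agent="support-agent@2.4.0")

# Compare aggregate scores between the two runs.
for metric in ("relevance", "faithfulness", "latency"):
    old, new = baseline.scores[metric], candidate.scores[metric]
    print(f"{metric}: {old:.3f} -> {new:.3f} ({new - old:+.3f})")
```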

Next Steps