Introduction

Welcome to SupaEval! This guide will help you understand what SupaEval is, how it works, and why it's essential for building reliable AI agents.

What is SupaEval?

SupaEval is a quality intelligence platform for AI agents. Unlike traditional evaluation tools that only measure final outputs, SupaEval provides deep insights across every layer of your agent system:

  • Retrieval Quality - How well your agent finds relevant information
  • Reasoning Accuracy - How well it processes and interprets data
  • Tool Usage - How effectively it uses available tools
  • Generation Quality - How well it produces final outputs

Why Agent-Specific Evaluation Matters

Traditional LLM evaluation tools focus on final answers. But agents can fail at many points along the way: bad retrieval, incorrect tool calls, broken reasoning chains. SupaEval helps you find and fix these issues at the step where they occur.

Core Concepts

Datasets

Collections of test cases used to evaluate your agent. Datasets can include prompts, multi-turn conversations, expected outputs, and metadata like difficulty or domain.
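
For instance, a single test case might look like the sketch below. The field names are illustrative assumptions, not SupaEval's actual dataset schema:

    # A hypothetical test case; field names are illustrative, not the actual schema.
    test_case = {
        "input": "What was our Q3 revenue?",
        "conversation": [  # optional multi-turn context
            {"role": "user", "content": "Pull up the quarterly report."},
        ],
        "expected_output": "Q3 revenue was $4.2M.",
        "metadata": {"difficulty": "easy", "domain": "finance"},
    }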

Evaluations

Runs that execute your agent against a dataset and measure performance using specified metrics. Evaluations are reproducible and can be compared over time.
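
A minimal sketch of kicking off a run, assuming a hypothetical supaeval Python client (the package, class, and method names are assumptions, not the documented SDK surface):

    # Hypothetical client; package and method names are assumed for illustration.
    from supaeval import Client

    client = Client(api_key="...")  # authentication details are assumed

    evaluation = client.evaluations.run(
        agent_endpoint="https://agent.example.com/invoke",  # your agent's API
        dataset="support-questions-v2",                     # a fixed dataset
        metrics=["faithfulness", "task_success", "latency"],
    )
    print(evaluation.id, evaluation.status)

Because the dataset and metric list are fixed in the run configuration, the same evaluation can be re-run later and its results compared against earlier runs.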

Metrics

Quantitative measures of agent performance, computed automatically by SupaEval. They include:

  • Retrieval metrics (Precision@K, Recall@K, nDCG; see the worked example after this list)
  • Generation metrics (relevance, faithfulness, hallucination detection)
  • Task success rates
  • Latency and cost tracking
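
To make the retrieval metrics concrete, here is how Precision@K, Recall@K, and nDCG@K are conventionally computed for binary relevance labels. SupaEval computes these for you; this plain-Python sketch only illustrates the definitions:

    import math

    def precision_at_k(retrieved, relevant, k):
        # Fraction of the top-k retrieved items that are relevant.
        return sum(1 for doc in retrieved[:k] if doc in relevant) / k

    def recall_at_k(retrieved, relevant, k):
        # Fraction of all relevant items that appear in the top k.
        return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

    def ndcg_at_k(retrieved, relevant, k):
        # Discounted cumulative gain in the top k, normalized by the ideal ordering.
        dcg = sum(1 / math.log2(i + 2)
                  for i, doc in enumerate(retrieved[:k]) if doc in relevant)
        idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
        return dcg / idcg if idcg > 0 else 0.0

    retrieved = ["doc3", "doc1", "doc7", "doc2"]   # agent's ranked results
    relevant = {"doc1", "doc2", "doc5"}            # ground-truth relevant set
    print(precision_at_k(retrieved, relevant, 3))  # 1/3: one hit in the top 3
    print(recall_at_k(retrieved, relevant, 3))     # 1/3 of all relevant found
    print(ndcg_at_k(retrieved, relevant, 3))       # ~0.30: the hit sits at rank 2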

Benchmarks

Named evaluation runs against fixed datasets that enable comparison between agent versions, prompt changes, or model swaps.
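
Comparing two named runs then reduces to diffing their metric summaries. The result shape below is an assumption for illustration:

    # Hypothetical metric summaries from two benchmark runs on the same dataset.
    baseline = {"faithfulness": 0.81, "task_success": 0.72, "latency_ms": 2400}
    candidate = {"faithfulness": 0.84, "task_success": 0.78, "latency_ms": 2650}

    for metric, old in baseline.items():
        new = candidate[metric]
        print(f"{metric}: {old} -> {new} ({new - old:+.2f})")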

How It Works

1. Connect Your Agent

Provide an API endpoint or integrate using our SDK. No changes to your agent code required.
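
On the endpoint route, your agent only needs to accept HTTP requests. Here is a minimal sketch using FastAPI; the request and response shapes are assumptions, not SupaEval's documented contract:

    # Minimal agent endpoint; the /invoke path and field names are assumed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class AgentRequest(BaseModel):
        input: str

    class AgentResponse(BaseModel):
        output: str

    def run_agent(text: str) -> str:
        # Stand-in for your existing agent; its internals stay unchanged.
        return f"echo: {text}"

    @app.post("/invoke")
    def invoke(request: AgentRequest) -> AgentResponse:
        return AgentResponse(output=run_agent(request.input))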

2. Upload or Select Dataset

Use our benchmark datasets or upload your own custom test cases.
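
Either path might be a one-liner with the hypothetical client sketched earlier (method and dataset names are illustrative):

    # Hypothetical calls; method and dataset names are assumptions.
    dataset = client.datasets.get("support-questions-v2")    # built-in benchmark
    dataset = client.datasets.upload("my_test_cases.jsonl")  # or your own cases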

3. Run Evaluation

SupaEval executes your agent, captures intermediate steps, and computes metrics.
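
The captured intermediate steps might resemble the trace below; the structure is an assumption for illustration:

    # A hypothetical per-test-case trace with one entry per agent step.
    trace = {
        "input": "What was our Q3 revenue?",
        "steps": [
            {"type": "retrieval", "query": "Q3 revenue", "documents": ["doc1", "doc7"]},
            {"type": "tool_call", "tool": "calculator", "arguments": {"expr": "3.9 + 0.3"}},
            {"type": "generation", "output": "Q3 revenue was $4.2M."},
        ],
        "metrics": {"faithfulness": 0.92, "latency_ms": 1840},
    }

Each step can then be scored on its own terms: retrieval metrics on the retrieval step, call correctness on the tool step, and generation metrics on the final output.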

4. Analyze Results

View detailed dashboards showing where your agent succeeds and fails, with actionable insights.

Key Features

Model Agnostic

Works with any LLM, framework, or agent architecture

No Code Changes

Evaluate existing agents without modifying internals

Reproducible

Deterministic reruns for consistent benchmarking

Enterprise Ready

SOC 2 compliant with advanced security features

Ready to get started?

Follow our Quickstart Guide to run your first evaluation in under 5 minutes.

Next Steps