Introduction

Welcome to SupaEval! This guide will help you understand what SupaEval is, how it works, and why it's essential for building reliable AI agents.

What is SupaEval?

SupaEval is a quality intelligence platform for AI agents. Unlike traditional evaluation tools that only measure final outputs, SupaEval provides deep insights across every layer of your agent system:

  • Retrieval Quality - How well your agent finds relevant information
  • Reasoning Accuracy - How well it processes and interprets data
  • Tool Usage - How effectively it uses available tools
  • Generation Quality - How well it produces final outputs

Why Agent-Specific Evaluation Matters

Traditional LLM evaluation tools focus on final answers. But agents can fail at many points along the way: bad retrieval, incorrect tool calls, broken reasoning chains. SupaEval helps you find and fix these issues at the step where they occur.

Core Concepts

Datasets

Collections of test cases used to evaluate your agent. Datasets can include prompts, multi-turn conversations, expected outputs, and metadata like difficulty or domain.
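
For instance, a single test case might look like the sketch below. The field names are illustrative assumptions, not SupaEval's actual dataset schema:

    # A hypothetical test case; field names are illustrative, not the actual schema.
    test_case = {
        "input": "What was our Q3 revenue?",
        "conversation": [  # optional multi-turn context
            {"role": "user", "content": "Pull up the quarterly report."},
        ],
        "expected_output": "Q3 revenue was $4.2M.",
        "metadata": {"difficulty": "easy", "domain": "finance"},
    }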

Evaluations

Runs that execute your agent against a dataset and measure performance using specified metrics. Evaluations are reproducible and can be compared over time.
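
A minimal sketch of kicking off a run, assuming a hypothetical supaeval Python client (the package, class, and method names are assumptions, not the documented SDK surface):

    # Hypothetical client; package and method names are assumed for illustration.
    from supaeval import Client

    client = Client(api_key="...")  # authentication details are assumed

    evaluation = client.evaluations.run(
        agent_endpoint="https://agent.example.com/invoke",  # your agent's API
        dataset="support-questions-v2",                     # a fixed dataset
        metrics=["faithfulness", "task_success", "latency"],
    )
    print(evaluation.id, evaluation.status)

Because the dataset and metric list are fixed in the run configuration, the same evaluation can be re-run later and its results compared against earlier runs.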

Metrics

Quantitative measures of agent performance, computed automatically by SupaEval. They include:

  • Retrieval metrics (Precision@K, Recall@K, nDCG; see the worked example after this list)
  • Generation metrics (relevance, faithfulness, hallucination detection)
  • Task success rates
  • Latency and cost tracking
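
To make the retrieval metrics concrete, here is how Precision@K, Recall@K, and nDCG@K are conventionally computed for binary relevance labels. SupaEval computes these for you; this plain-Python sketch only illustrates the definitions:

    import math

    def precision_at_k(retrieved, relevant, k):
        # Fraction of the top-k retrieved items that are relevant.
        return sum(1 for doc in retrieved[:k] if doc in relevant) / k

    def recall_at_k(retrieved, relevant, k):
        # Fraction of all relevant items that appear in the top k.
        return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

    def ndcg_at_k(retrieved, relevant, k):
        # Discounted cumulative gain in the top k, normalized by the ideal ordering.
        dcg = sum(1 / math.log2(i + 2)
                  for i, doc in enumerate(retrieved[:k]) if doc in relevant)
        idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
        return dcg / idcg if idcg > 0 else 0.0

    retrieved = ["doc3", "doc1", "doc7", "doc2"]   # agent's ranked results
    relevant = {"doc1", "doc2", "doc5"}            # ground-truth relevant set
    print(precision_at_k(retrieved, relevant, 3))  # 1/3: one hit in the top 3
    print(recall_at_k(retrieved, relevant, 3))     # 1/3 of all relevant found
    print(ndcg_at_k(retrieved, relevant, 3))       # ~0.30: the hit sits at rank 2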

Benchmarks

Named evaluation runs against fixed datasets that enable comparison between agent versions, prompt changes, or model swaps.
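
Comparing two named runs then reduces to diffing their metric summaries. The result shape below is an assumption for illustration:

    # Hypothetical metric summaries from two benchmark runs on the same dataset.
    baseline = {"faithfulness": 0.81, "task_success": 0.72, "latency_ms": 2400}
    candidate = {"faithfulness": 0.84, "task_success": 0.78, "latency_ms": 2650}

    for metric, old in baseline.items():
        new = candidate[metric]
        print(f"{metric}: {old} -> {new} ({new - old:+.2f})")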

How It Works

1. Connect Your Agent

Provide an API endpoint or integrate using our SDK. No changes to your agent code required.
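
On the endpoint route, your agent only needs to accept HTTP requests. Here is a minimal sketch using FastAPI; the request and response shapes are assumptions, not SupaEval's documented contract:

    # Minimal agent endpoint; the /invoke path and field names are assumed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class AgentRequest(BaseModel):
        input: str

    class AgentResponse(BaseModel):
        output: str

    def run_agent(text: str) -> str:
        # Stand-in for your existing agent; its internals stay unchanged.
        return f"echo: {text}"

    @app.post("/invoke")
    def invoke(request: AgentRequest) -> AgentResponse:
        return AgentResponse(output=run_agent(request.input))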

2. Upload or Select Dataset

Use our benchmark datasets or upload your own custom test cases.
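
Either path might be a one-liner with the hypothetical client sketched earlier (method and dataset names are illustrative):

    # Hypothetical calls; method and dataset names are assumptions.
    dataset = client.datasets.get("support-questions-v2")    # built-in benchmark
    dataset = client.datasets.upload("my_test_cases.jsonl")  # or your own cases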

3. Run Evaluation

SupaEval executes your agent, captures intermediate steps, and computes metrics.
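
The captured intermediate steps might resemble the trace below; the structure is an assumption for illustration:

    # A hypothetical per-test-case trace with one entry per agent step.
    trace = {
        "input": "What was our Q3 revenue?",
        "steps": [
            {"type": "retrieval", "query": "Q3 revenue", "documents": ["doc1", "doc7"]},
            {"type": "tool_call", "tool": "calculator", "arguments": {"expr": "3.9 + 0.3"}},
            {"type": "generation", "output": "Q3 revenue was $4.2M."},
        ],
        "metrics": {"faithfulness": 0.92, "latency_ms": 1840},
    }

Each step can then be scored on its own terms: retrieval metrics on the retrieval step, call correctness on the tool step, and generation metrics on the final output.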

4. Analyze Results

View detailed dashboards showing where your agent succeeds and fails, with actionable insights.

Key Features

Model Agnostic

Works with any LLM, framework, or agent architecture

No Code Changes

Evaluate existing agents without modifying internals

Reproducible

Deterministic reruns for consistent benchmarking

Enterprise Ready

SOC 2 compliant with advanced security features

Ready to get started?

Follow our Quickstart Guide to run your first evaluation in under 5 minutes.

Next Steps