Benchmarks

Benchmarks are standardized evaluation configurations that enable consistent comparison of agent performance across versions, configurations, and time.

What Are Benchmarks?

A benchmark combines:

  • Fixed Dataset - Same test cases for each run
  • Standard Metrics - Consistent measurement criteria
  • Version Tracking - Historical performance records
  • Baseline Comparison - Measure improvement over time

Why Benchmarks?
Benchmarks provide objective, reproducible measurements. Use them to track progress, validate releases, and compare different approaches.

Creating Benchmarks

python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Create a benchmark
benchmark = client.benchmarks.create(
    name="production_qa_v1",
    dataset_id="dataset_123",
    metrics=["relevance", "faithfulness", "precision_at_5"],
    description="Standard QA benchmark for production releases"
)

# Run benchmark
result = client.benchmarks.run(
    benchmark_id=benchmark.id,
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0"
)

print(f"Benchmark score: {result.overall_score}")
print(f"vs baseline: {result.vs_baseline}%")

Benchmark Types

Internal Benchmarks

Custom datasets specific to your domain and use case. Use them to track internal progress across releases.

Public Benchmarks

Industry-standard datasets (coming soon). Use them to compare your agent against others.

Version Tracking

Each benchmark run is tagged with a version identifier:

  • Git commit SHA
  • Semantic version (v1.0.0)
  • Custom labels (production, staging, experimental)

This enables tracking performance across releases and identifying regressions.
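
For example, you can derive the tag from the current git commit so every score maps back to an exact revision. This sketch reuses the `client.benchmarks.run` call shown above; the benchmark ID is a placeholder:

python
import subprocess

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Resolve the current commit SHA so the run is traceable to a revision
commit_sha = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder ID
    agent_endpoint="https://your-agent.com/api/query",
    version_tag=commit_sha,
)

print(f"{commit_sha}: {result.overall_score}")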

Benchmark Leaderboards

View all runs of a benchmark in a leaderboard showing:

  • Overall score and ranking
  • Per-metric breakdown
  • Version and timestamp
  • Configuration differences

Best Practice
Run benchmarks before each production deployment. Set minimum score thresholds as deployment gates to prevent regressions.
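
The SDK calls for reading leaderboard data programmatically are not shown in this guide. As a rough sketch, assuming a hypothetical listing method such as `client.benchmarks.list_runs` that returns run records with the fields above (the method name and record fields are assumptions, not documented API), you could build the table in a script:

python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Hypothetical listing call -- the method name and record fields are
# assumptions; check the SupaEval API reference for the real names.
runs = client.benchmarks.list_runs(benchmark_id="benchmark_123")

# Rank runs by overall score, highest first
ranked = sorted(runs, key=lambda r: r.overall_score, reverse=True)
for rank, run in enumerate(ranked, start=1):
    print(f"{rank:>2}. {run.version_tag:<12} {run.overall_score:.3f}")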

Comparing Results

The comparison view shows:

  • Delta Metrics - Percentage change from baseline
  • Statistical Significance - Whether improvements are meaningful
  • Per-Case Differences - Which test cases improved or regressed
  • Confidence Intervals - Uncertainty in measurements
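
Programmatically, the `vs_baseline` delta returned by a run (see the first example) already gives you the headline comparison. A minimal sketch for interpreting it, where the 2% tolerance is an arbitrary example rather than a SupaEval default:

python
def check_against_baseline(result, tolerance=2.0):
    """Interpret the vs_baseline percentage delta from a benchmark run.

    `result` is the object returned by client.benchmarks.run(...);
    the tolerance is an arbitrary example, not a SupaEval default.
    """
    delta = result.vs_baseline
    if delta < -tolerance:
        return f"Regression: {abs(delta):.1f}% below baseline"
    if delta > tolerance:
        return f"Improvement: +{delta:.1f}% over baseline"
    return "Within tolerance of baseline"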

Continuous Benchmarking

Integrate benchmarks into your CI/CD pipeline (a minimal gate script is sketched after this list):

  • Run automatically on pull requests
  • Block merges if scores drop below threshold
  • Generate reports in PR comments
  • Track trends in your monitoring dashboard
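
A minimal gate script, assuming the API key, agent endpoint, and commit SHA come from CI environment variables (the variable names and the 0.85 threshold are illustrative, not SupaEval defaults):

python
import os
import sys

from supaeval import SupaEval

# Credentials and the commit under test come from the CI environment
client = SupaEval(api_key=os.environ["SUPAEVAL_API_KEY"])
commit = os.environ.get("CI_COMMIT_SHA", "local")

result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder benchmark ID
    agent_endpoint=os.environ["AGENT_ENDPOINT"],
    version_tag=commit,
)

THRESHOLD = 0.85  # example gate; tune to your own baseline

print(f"Benchmark score for {commit}: {result.overall_score}")
if result.overall_score < THRESHOLD:
    print(f"Score below gate ({THRESHOLD}); failing the build")
    sys.exit(1)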

Benchmark Insights

Beyond raw scores, benchmarks reveal:

  • Which types of queries your agent handles well
  • Common failure modes to address
  • Cost vs. quality tradeoffs
  • Latency improvements or regressions

Next Steps