Benchmarks

Benchmarks are standardized evaluation configurations that enable consistent comparison of agent performance across versions, configurations, and time.

What Are Benchmarks?

A benchmark combines:

  • Fixed Dataset - Same test cases for each run
  • Standard Metrics - Consistent measurement criteria
  • Version Tracking - Historical performance records
  • Baseline Comparison - Measure improvement over time

Why Benchmarks?
Benchmarks provide objective, reproducible measurements. Use them to track progress, validate releases, and compare different approaches.

Creating Benchmarks

python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Create a benchmark
benchmark = client.benchmarks.create(
    name="production_qa_v1",
    dataset_id="dataset_123",
    metrics=["relevance", "faithfulness", "precision_at_5"],
    description="Standard QA benchmark for production releases"
)

# Run benchmark
result = client.benchmarks.run(
    benchmark_id=benchmark.id,
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0"
)

print(f"Benchmark score: {result.overall_score}")
print(f"vs baseline: {result.vs_baseline}%")

Benchmark Types

Internal Benchmarks

Custom datasets specific to your domain and use case. Use them to track internal progress across releases.

Public Benchmarks

Industry-standard datasets (coming soon). Use them to compare your agent against others.

Version Tracking

Each benchmark run is tagged with a version identifier:

  • Git commit SHA
  • Semantic version (v1.0.0)
  • Custom labels (production, staging, experimental)

This enables tracking performance across releases and identifying regressions.
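
For example, you can derive the tag from the current git commit so every score maps back to an exact revision. This sketch reuses the `client.benchmarks.run` call shown above; the benchmark ID is a placeholder:

python
import subprocess

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Resolve the current commit SHA so the run is traceable to a revision
commit_sha = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder ID
    agent_endpoint="https://your-agent.com/api/query",
    version_tag=commit_sha,
)

print(f"{commit_sha}: {result.overall_score}")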

Benchmark Leaderboards

View all runs of a benchmark in a leaderboard showing:

  • Overall score and ranking
  • Per-metric breakdown
  • Version and timestamp
  • Configuration differences

Best Practice
Run benchmarks before each production deployment. Set minimum score thresholds as deployment gates to prevent regressions.
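
The SDK calls for reading leaderboard data programmatically are not shown in this guide. As a rough sketch, assuming a hypothetical listing method such as `client.benchmarks.list_runs` that returns run records with the fields above (the method name and record fields are assumptions, not documented API), you could build the table in a script:

python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Hypothetical listing call -- the method name and record fields are
# assumptions; check the SupaEval API reference for the real names.
runs = client.benchmarks.list_runs(benchmark_id="benchmark_123")

# Rank runs by overall score, highest first
ranked = sorted(runs, key=lambda r: r.overall_score, reverse=True)
for rank, run in enumerate(ranked, start=1):
    print(f"{rank:>2}. {run.version_tag:<12} {run.overall_score:.3f}")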

Comparing Results

The comparison view shows:

  • Delta Metrics - Percentage change from baseline
  • Statistical Significance - Whether improvements are meaningful
  • Per-Case Differences - Which test cases improved or regressed
  • Confidence Intervals - Uncertainty in measurements
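
Programmatically, the `vs_baseline` delta returned by a run (see the first example) already gives you the headline comparison. A minimal sketch for interpreting it, where the 2% tolerance is an arbitrary example rather than a SupaEval default:

python
def check_against_baseline(result, tolerance=2.0):
    """Interpret the vs_baseline percentage delta from a benchmark run.

    `result` is the object returned by client.benchmarks.run(...);
    the tolerance is an arbitrary example, not a SupaEval default.
    """
    delta = result.vs_baseline
    if delta < -tolerance:
        return f"Regression: {abs(delta):.1f}% below baseline"
    if delta > tolerance:
        return f"Improvement: +{delta:.1f}% over baseline"
    return "Within tolerance of baseline"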

Continuous Benchmarking

Integrate benchmarks into your CI/CD pipeline (a minimal gate script is sketched after this list):

  • Run automatically on pull requests
  • Block merges if scores drop below threshold
  • Generate reports in PR comments
  • Track trends in your monitoring dashboard
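
A minimal gate script, assuming the API key, agent endpoint, and commit SHA come from CI environment variables (the variable names and the 0.85 threshold are illustrative, not SupaEval defaults):

python
import os
import sys

from supaeval import SupaEval

# Credentials and the commit under test come from the CI environment
client = SupaEval(api_key=os.environ["SUPAEVAL_API_KEY"])
commit = os.environ.get("CI_COMMIT_SHA", "local")

result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder benchmark ID
    agent_endpoint=os.environ["AGENT_ENDPOINT"],
    version_tag=commit,
)

THRESHOLD = 0.85  # example gate; tune to your own baseline

print(f"Benchmark score for {commit}: {result.overall_score}")
if result.overall_score < THRESHOLD:
    print(f"Score below gate ({THRESHOLD}); failing the build")
    sys.exit(1)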

Benchmark Insights

Beyond raw scores, benchmarks reveal:

  • Which types of queries your agent handles well
  • Common failure modes to address
  • Cost vs. quality tradeoffs
  • Latency improvements or regressions

Next Steps