Benchmarks
Benchmarks are standardized evaluation configurations that enable consistent comparison of agent performance across versions, configurations, and time.
What Are Benchmarks?
A benchmark combines:
- Fixed Dataset - Same test cases for each run
- Standard Metrics - Consistent measurement criteria
- Version Tracking - Historical performance records
- Baseline Comparison - Measure improvement over time
Why Benchmarks?
Benchmarks provide objective, reproducible measurements. Use them to track progress, validate releases, and compare different approaches.
Creating Benchmarks
```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Create a benchmark
benchmark = client.benchmarks.create(
    name="production_qa_v1",
    dataset_id="dataset_123",
    metrics=["relevance", "faithfulness", "precision_at_5"],
    description="Standard QA benchmark for production releases"
)

# Run benchmark
result = client.benchmarks.run(
    benchmark_id=benchmark.id,
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0"
)

print(f"Benchmark score: {result.overall_score}")
print(f"vs baseline: {result.vs_baseline}%")
```
Benchmark Types
Internal Benchmarks
Custom datasets specific to your domain and use case; use them to track internal progress.
Public Benchmarks
Industry-standard datasets (coming soon) for comparing your agent against others.
Version Tracking
Each benchmark run is tagged with a version identifier:
- Git commit SHA
- Semantic version (v1.0.0)
- Custom labels (production, staging, experimental)
This enables tracking performance across releases and identifying regressions.
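For example, a release pipeline can tag each run with the current commit so a regression maps back to a specific change. The sketch below reuses the `run` call shown earlier; the git lookup and the placeholder `benchmark_123` id are assumptions for illustration.
```python
import subprocess

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Resolve the commit being evaluated (assumes this runs inside a git checkout).
commit_sha = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Tag the run with the commit so its score maps back to a specific change.
result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder id
    agent_endpoint="https://your-agent.com/api/query",
    version_tag=commit_sha,
)
```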
Benchmark Leaderboards
View all runs of a benchmark in a leaderboard showing:
- Overall score and ranking
- Per-metric breakdown
- Version and timestamp
- Configuration differences
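The leaderboard can also be assembled programmatically. The sketch below assumes a hypothetical `list_runs` method and the run fields shown above (`overall_score`, `version_tag`, a timestamp); check the SDK reference for the actual listing call.
```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# NOTE: `list_runs` is a hypothetical listing call used for illustration.
runs = client.benchmarks.list_runs(benchmark_id="benchmark_123")

# Rank runs by overall score, best first.
ranked = sorted(runs, key=lambda run: run.overall_score, reverse=True)
for rank, run in enumerate(ranked, start=1):
    print(f"{rank:>2}. {run.version_tag:<16} score={run.overall_score:.3f} ({run.created_at})")
```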
Best Practice
Run benchmarks before each production deployment. Set minimum score thresholds as deployment gates to prevent regressions.
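A minimal gate might look like the sketch below: run the benchmark, compare the score against a threshold you choose, and exit non-zero to stop the deploy. The threshold value and exit-code convention belong to your pipeline, not the SDK.
```python
import sys

from supaeval import SupaEval

MIN_SCORE = 0.85  # example threshold; set it from your own baseline

client = SupaEval(api_key="your_api_key")

result = client.benchmarks.run(
    benchmark_id="benchmark_123",  # placeholder id
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0-rc1",
)

if result.overall_score < MIN_SCORE:
    print(f"Benchmark gate FAILED: {result.overall_score:.3f} < {MIN_SCORE}")
    sys.exit(1)  # non-zero exit blocks the deployment step

print(f"Benchmark gate passed: {result.overall_score:.3f}")
```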
Comparing Results
The comparison view shows:
- Delta Metrics - Percentage change from baseline
- Statistical Significance - Whether improvements are meaningful
- Per-Case Differences - Which test cases improved or regressed
- Confidence Intervals - Uncertainty in measurements
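Because every run uses the same fixed dataset, a paired test is a reasonable way to check significance when per-case scores are available. The sketch below assumes a `case_scores` field aligned by test case; adapt it to the actual result schema.
```python
from scipy import stats

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Two runs over the same fixed dataset, one per version under comparison.
baseline = client.benchmarks.run(
    benchmark_id="benchmark_123",
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.0.0",
)
candidate = client.benchmarks.run(
    benchmark_id="benchmark_123",
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0",
)

delta = candidate.overall_score - baseline.overall_score

# NOTE: `case_scores` (per-test-case scores, aligned by case) is an assumed
# field; the same cases appear in both runs, so a paired t-test applies.
t_stat, p_value = stats.ttest_rel(candidate.case_scores, baseline.case_scores)

print(f"Overall delta: {delta:+.3f} (paired t-test p = {p_value:.3f})")
```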
Continuous Benchmarking
Integrate benchmarks into your CI/CD pipeline:
- Run automatically on pull requests
- Block merges if scores drop below threshold
- Generate reports in PR comments
- Track trends in your monitoring dashboard
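As a sketch, a CI job can run the benchmark and format the result as a Markdown comment; posting it (for example via the GitHub API) is left to your CI tooling, and the `pr-1234` label is a placeholder.
```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

result = client.benchmarks.run(
    benchmark_id="benchmark_123",
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="pr-1234",  # placeholder label for the pull request build
)

# Format a Markdown report; your CI job can post this string as a PR comment.
report = "\n".join([
    "### Benchmark: production_qa_v1",
    f"- Overall score: **{result.overall_score:.3f}**",
    f"- vs baseline: {result.vs_baseline}%",
])
print(report)
```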
Benchmark Insights
Beyond raw scores, benchmarks reveal:
- Which types of queries your agent handles well
- Common failure modes to address
- Cost vs. quality tradeoffs
- Latency improvements or regressions
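One way to surface these insights is to group per-case results by tag or query type. The sketch below assumes per-case fields (`cases`, `tags`, `score`) that may differ from the actual result schema.
```python
from collections import defaultdict

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

result = client.benchmarks.run(
    benchmark_id="benchmark_123",
    agent_endpoint="https://your-agent.com/api/query",
    version_tag="v2.1.0",
)

# NOTE: per-case fields (`cases`, `tags`, `score`) are assumptions for
# illustration; adapt them to the actual result schema.
scores_by_tag = defaultdict(list)
for case in result.cases:
    for tag in case.tags:
        scores_by_tag[tag].append(case.score)

# Mean score per query type shows where the agent is strong or weak.
for tag, scores in sorted(scores_by_tag.items()):
    print(f"{tag:<20} mean = {sum(scores) / len(scores):.3f} ({len(scores)} cases)")
```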