Running Evaluations
Learn how to configure and execute evaluations to measure your AI agent's performance.
Evaluation Configuration
When creating an evaluation, you need to specify:
- Dataset - The test cases to run
- Agent Endpoint - URL or SDK reference to your agent
- Metrics - Which measurements to compute
- Configuration - Timeout, retries, and other settings
Creating an Evaluation
Python

```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Run evaluation
evaluation = client.evaluations.create(
    dataset_id="dataset_123",
    agent_endpoint="https://your-agent.com/api/query",
    metrics=["relevance", "faithfulness", "retrieval_precision"],
    config={
        "timeout": 30,
        "max_retries": 3
    }
)

# Check status
print(f"Evaluation ID: {evaluation.id}")
print(f"Status: {evaluation.status}")

# Wait for completion
result = client.evaluations.wait_for_completion(evaluation.id)
print(f"Average relevance: {result.metrics.relevance.mean}")
```
JavaScript

```javascript
import { SupaEval } from '@supaeval/js-sdk';

const client = new SupaEval({ apiKey: 'your_api_key' });

// Run evaluation
const evaluation = await client.evaluations.create({
  datasetId: 'dataset_123',
  agentEndpoint: 'https://your-agent.com/api/query',
  metrics: ['relevance', 'faithfulness', 'retrieval_precision'],
  config: {
    timeout: 30,
    maxRetries: 3
  }
});

console.log(`Evaluation ID: ${evaluation.id}`);

// Poll for results
const result = await client.evaluations.waitForCompletion(evaluation.id);
console.log(`Average relevance: ${result.metrics.relevance.mean}`);
```

Async Execution
Evaluations run asynchronously. Use `wait_for_completion` to poll for results, or configure webhooks to be notified when an evaluation completes.
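If you prefer push notifications over polling, you can have SupaEval call your service when a run finishes. The sketch below is a minimal example, assuming a hypothetical `webhook_url` key in `config`; check the SDK reference for the exact option name supported by your version.

```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Submit the evaluation and return immediately; SupaEval notifies the webhook
# when the run completes instead of requiring wait_for_completion().
evaluation = client.evaluations.create(
    dataset_id="dataset_123",
    agent_endpoint="https://your-agent.com/api/query",
    metrics=["relevance", "faithfulness"],
    config={
        "timeout": 30,
        "max_retries": 3,
        # Assumed option name -- verify against your SDK version.
        "webhook_url": "https://your-service.com/hooks/supaeval",
    },
)
print(f"Submitted evaluation {evaluation.id}; waiting for webhook notification")
```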
Available Metrics
Choose from SupaEval's comprehensive metric library:
Retrieval Metrics
- `precision_at_k`
- `recall_at_k`
- `ndcg`
- `mrr`
Generation Metrics
- `relevance`
- `faithfulness`
- `hallucination_score`
- `answer_similarity`
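Retrieval and generation metrics can be combined in a single run by listing their names in the `metrics` argument, exactly as in the example above. A minimal sketch (the comments give the usual meaning of each metric; availability may vary by plan):

```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

evaluation = client.evaluations.create(
    dataset_id="dataset_123",
    agent_endpoint="https://your-agent.com/api/query",
    metrics=[
        "precision_at_k",       # retrieval: share of top-k retrieved chunks that are relevant
        "ndcg",                 # retrieval: rank-weighted relevance of the retrieved list
        "faithfulness",         # generation: answer is grounded in the retrieved context
        "hallucination_score",  # generation: degree of unsupported claims in the answer
    ],
)
print(f"Evaluation ID: {evaluation.id}")
```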
Monitoring Progress
Track evaluation progress in real-time:
- Dashboard View - Visual progress indicators
- SDK Polling - Programmatic status checks (see the sketch after this list)
- Webhooks - Automated notifications
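For programmatic status checks without blocking on `wait_for_completion`, a simple polling loop looks like the sketch below. It assumes a hypothetical `client.evaluations.get()` method that returns the same object shape as `create`, plus assumed terminal status values; consult the SDK reference for the exact names.

```python
import time

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

def poll_until_done(evaluation_id: str, interval_seconds: int = 10):
    """Poll an evaluation's status until it reaches a terminal state."""
    while True:
        # Hypothetical method name -- your SDK may expose get(), retrieve(), etc.
        evaluation = client.evaluations.get(evaluation_id)
        print(f"Status: {evaluation.status}")
        # Assumed terminal statuses; check the API docs for the actual values.
        if evaluation.status in ("completed", "failed"):
            return evaluation
        time.sleep(interval_seconds)

evaluation = poll_until_done("eval_123")
```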
Understanding Results
After completion, evaluation results include the following (see the sketch after this list):
- Aggregate metrics (mean, median, std dev)
- Per-test-case scores
- Failure analysis
- Latency and cost breakdown
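As a rough illustration of reading these results, the sketch below prints an aggregate statistic and scans per-test-case results. Only `result.metrics.<name>.mean` appears in the examples above; the `test_cases` field and the per-case attributes are assumed names, so verify them against your SDK version.

```python
from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")
result = client.evaluations.wait_for_completion("eval_123")

# Aggregate statistics per metric. Only `.mean` is documented above.
relevance = result.metrics.relevance
print(f"relevance mean: {relevance.mean}")

# Per-test-case scores and failure analysis (assumed `test_cases` field
# and per-case attributes).
for case in result.test_cases:
    if case.status == "failed":
        print(f"Test case {case.id} failed: {case.error}")
```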
Optimization Tip
Run small evaluations (10-20 test cases) during development for quick feedback. Use larger datasets for final validation before deployment.