Running Evaluations

Learn how to configure and execute evaluations to measure your AI agent's performance.

Evaluation Configuration

When creating an evaluation, you need to specify:

  • Dataset - The test cases to run
  • Agent Endpoint - URL or SDK reference to your agent
  • Metrics - Which measurements to compute
  • Configuration - Timeout, retries, and other settings

Creating an Evaluation

Python

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Run evaluation
evaluation = client.evaluations.create(
    dataset_id="dataset_123",
    agent_endpoint="https://your-agent.com/api/query",
    metrics=["relevance", "faithfulness", "retrieval_precision"],
    config={
        "timeout": 30,
        "max_retries": 3
    }
)

# Check status
print(f"Evaluation ID: {evaluation.id}")
print(f"Status: {evaluation.status}")

# Wait for completion
result = client.evaluations.wait_for_completion(evaluation.id)
print(f"Average relevance: {result.metrics.relevance.mean}")

JavaScript

import { SupaEval } from '@supaeval/js-sdk';

const client = new SupaEval({ apiKey: 'your_api_key' });

// Run evaluation
const evaluation = await client.evaluations.create({
  datasetId: 'dataset_123',
  agentEndpoint: 'https://your-agent.com/api/query',
  metrics: ['relevance', 'faithfulness', 'retrieval_precision'],
  config: {
    timeout: 30,
    maxRetries: 3
  }
});

console.log(`Evaluation ID: ${evaluation.id}`);

// Poll for results
const result = await client.evaluations.waitForCompletion(evaluation.id);
console.log(`Average relevance: ${result.metrics.relevance.mean}`);

Async Execution
Evaluations run asynchronously. Use wait_for_completion to poll for results, or configure webhooks for notifications when evaluations complete.
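
If you prefer webhooks over polling, your endpoint only needs to accept the completion event and trigger whatever follow-up you need. The sketch below uses Flask purely for illustration; the payload fields (evaluation_id, status) and the status value are assumptions, not a confirmed schema, so check your webhook settings for the actual shape.

# Minimal webhook receiver sketch (Flask chosen for illustration).
# The payload fields (evaluation_id, status) and the "completed" value
# are assumptions; verify the real schema in your SupaEval webhook settings.
from flask import Flask, request

app = Flask(__name__)

@app.route("/supaeval/webhook", methods=["POST"])
def handle_evaluation_event():
    event = request.get_json()
    if event.get("status") == "completed":          # assumed status value
        evaluation_id = event.get("evaluation_id")  # assumed field name
        print(f"Evaluation {evaluation_id} finished, fetching results...")
        # e.g. client.evaluations.wait_for_completion(evaluation_id)
        # returns promptly once the evaluation is already done
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)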

Available Metrics

Choose from SupaEval's comprehensive metric library. Metrics from both groups can be combined in a single run, as sketched after the lists below:

Retrieval Metrics

  • precision_at_k
  • recall_at_k
  • ndcg
  • mrr

Generation Metrics

  • relevance
  • faithfulness
  • hallucination_score
  • answer_similarity
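
As a rough sketch, a mixed run simply lists names from both groups in the metrics argument; the dataset and endpoint values below are the same placeholders used in the earlier example.

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Combine retrieval and generation metrics in a single evaluation run.
# Metric names come from the lists above; dataset_id and agent_endpoint
# are placeholders, as in the earlier create example.
evaluation = client.evaluations.create(
    dataset_id="dataset_123",
    agent_endpoint="https://your-agent.com/api/query",
    metrics=[
        "precision_at_k", "ndcg",           # retrieval metrics
        "relevance", "hallucination_score"  # generation metrics
    ],
)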

Monitoring Progress

Track evaluation progress in real-time:

  • Dashboard View - Visual progress indicators
  • SDK Polling - Programmatic status checks (see the sketch after this list)
  • Webhooks - Automated notifications
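
For SDK polling, a manual loop might look like the sketch below. The evaluations.get call and the exact status strings are assumptions for illustration and are not confirmed by the examples above; wait_for_completion (shown earlier) wraps this loop for you.

import time

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Manual polling sketch. `client.evaluations.get` and the status strings
# are assumptions for illustration; prefer wait_for_completion when it fits.
def poll_until_done(evaluation_id: str, interval_seconds: int = 10):
    while True:
        evaluation = client.evaluations.get(evaluation_id)  # assumed getter
        print(f"Status: {evaluation.status}")
        if evaluation.status in ("completed", "failed"):     # assumed values
            return evaluation
        time.sleep(interval_seconds)

result = poll_until_done("evaluation_123")  # placeholder ID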

Understanding Results

After completion, evaluation results include the following (a sketch of reading these fields follows the list):

  • Aggregate metrics (mean, median, std dev)
  • Per-test-case scores
  • Failure analysis
  • Latency and cost breakdown
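
The snippet below is a rough sketch of reading these pieces from a completed result. Only result.metrics.<name>.mean appears in the examples above; the median/std_dev accessors and the test-case, latency, and cost fields are assumptions to verify against the actual response shape.

from supaeval import SupaEval

client = SupaEval(api_key="your_api_key")

# Sketch of inspecting a completed result. Only result.metrics.<name>.mean
# is confirmed by the examples above; the other accessors are assumptions.
result = client.evaluations.wait_for_completion("evaluation_123")  # placeholder ID

# Aggregate metrics
print(f"Relevance mean:   {result.metrics.relevance.mean}")
print(f"Relevance median: {result.metrics.relevance.median}")   # assumed accessor
print(f"Relevance stddev: {result.metrics.relevance.std_dev}")  # assumed accessor

# Per-test-case scores and failure analysis (assumed fields)
for case in result.test_cases:
    if case.status == "failed":
        print(f"Failed case {case.id}: {case.error}")

# Latency and cost breakdown (assumed fields)
print(f"Mean latency: {result.latency.mean_ms} ms")
print(f"Total cost:   ${result.cost.total}")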

Optimization Tip
Run small evaluations (10-20 test cases) during development for quick feedback. Use larger datasets for final validation before deployment.

Next Steps