Metrics & Scoring

SupaEval computes comprehensive metrics to evaluate different aspects of your AI agent's performance. All metrics are calculated automatically during evaluation runs.

Retrieval Metrics

Measure how well your agent finds and ranks relevant information.

Precision@K

Measures what fraction of the top K retrieved documents are relevant.

Precision@K = (# relevant docs in top K) / K
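As a minimal sketch (not the SupaEval implementation), the formula can be computed from a ranked list of retrieved document IDs and a set of known-relevant IDs:

  def precision_at_k(retrieved_ids, relevant_ids, k):
      # Fraction of the top-k retrieved documents that are relevant.
      top_k = retrieved_ids[:k]
      hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
      return hits / k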

Recall@K

Measures what fraction of all relevant documents are in the top K results.

Recall@K = (# relevant docs in top K) / (total relevant docs)
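The same inputs give recall; again an illustrative sketch rather than the library's own code:

  def recall_at_k(retrieved_ids, relevant_ids, k):
      # Fraction of all relevant documents that appear in the top-k results.
      top_k = retrieved_ids[:k]
      hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
      return hits / len(relevant_ids) if relevant_ids else 0.0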

nDCG (Normalized Discounted Cumulative Gain)

Measures ranking quality, giving higher weight to relevant documents at top positions. Ranges from 0 to 1, where 1 is perfect ranking.
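A simplified sketch using graded relevance scores for the retrieved documents (standard nDCG normalizes against the ideal ranking over all relevant documents; this version normalizes against the ideal ordering of the retrieved list only):

  import math

  def ndcg_at_k(relevances, k):
      # relevances: graded relevance of each retrieved doc, in ranked order.
      dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
      ideal = sorted(relevances, reverse=True)[:k]
      idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
      return dcg / idcg if idcg > 0 else 0.0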

MRR (Mean Reciprocal Rank)

The mean, across queries, of the reciprocal rank of the first relevant document: each query contributes 1 / (rank of its first relevant result). Higher is better (1.0 = the first result is always relevant).
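A sketch over a batch of queries (function and argument names are illustrative):

  def mean_reciprocal_rank(relevance_per_query):
      # Each element is a list of booleans, one per ranked result for that query.
      if not relevance_per_query:
          return 0.0
      total = 0.0
      for ranking in relevance_per_query:
          for rank, is_relevant in enumerate(ranking, start=1):
              if is_relevant:
                  total += 1.0 / rank
                  break
      return total / len(relevance_per_query)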

Generation Metrics

Evaluate the quality of generated text outputs.

Relevance

How well the generated output addresses the input query. Scored 0-1 using LLM-as-judge.

Faithfulness

Measures factual accuracy relative to source documents. Detects contradictions and unsupported claims.

Hallucination Score

Identifies statements not grounded in provided context. Lower is better.

Answer Similarity

Semantic similarity to expected output using embedding-based comparison.
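This typically reduces to a cosine similarity between embedding vectors; the sketch below assumes you already have the two vectors (the embedding model is not specified here):

  def cosine_similarity(a, b):
      # a, b: embedding vectors for the generated and expected answers.
      dot = sum(x * y for x, y in zip(a, b))
      norm_a = sum(x * x for x in a) ** 0.5
      norm_b = sum(y * y for y in b) ** 0.5
      return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0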

LLM-as-Judge

SupaEval uses advanced LLM-based evaluation for metrics like relevance and faithfulness. This provides nuanced scoring beyond simple string matching.
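The exact prompts and judge models are not documented here; a generic judge loop looks roughly like the sketch below, where call_llm is a hypothetical callable you supply (prompt string in, completion string out):

  JUDGE_PROMPT = (
      "Rate how well the ANSWER addresses the QUESTION on a scale from 0 to 1.\n"
      "Reply with only the number.\n\n"
      "QUESTION: {question}\n"
      "ANSWER: {answer}"
  )

  def judge_relevance(question, answer, call_llm):
      # call_llm is a hypothetical callable: prompt string -> completion string.
      reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
      try:
          return max(0.0, min(1.0, float(reply.strip())))
      except ValueError:
          return None  # judge returned something unparseable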

Task Metrics

Operational metrics for production monitoring.

  • Success Rate - Percentage of queries successfully completed
  • Latency - Response time percentiles (p50, p95, p99); see the sketch after this list
  • Token Usage - Input and output tokens per query
  • Cost - Estimated API costs based on token usage
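For reference, latency percentiles and cost estimates can be derived from per-query logs along these lines (the per-1K-token prices are values you supply, not SupaEval defaults):

  import statistics

  def latency_summary(latencies_ms):
      # Percentile summary of per-query latencies in milliseconds.
      cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
      return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

  def estimated_cost(input_tokens, output_tokens, usd_per_1k_in, usd_per_1k_out):
      # Cost estimate from token counts and your provider's prices.
      return (input_tokens / 1000) * usd_per_1k_in + (output_tokens / 1000) * usd_per_1k_out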

Custom Metrics

Need domain-specific evaluation? Contact us about custom metric development for your use case.

Metric Selection

Choose metrics aligned with your goals:

  • RAG systems: Precision@K, relevance, faithfulness
  • Chatbots: Relevance, answer similarity, task success
  • Production: Latency, cost, success rate

Interpreting Scores

Most metrics are normalized to 0-1, where higher is better (the hallucination score is the exception: lower is better). A small helper that maps a score to these bands is sketched after the list.

  • 0.9-1.0 - Excellent performance
  • 0.7-0.9 - Good performance
  • 0.5-0.7 - Acceptable, room for improvement
  • <0.5 - Needs significant improvement
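A tiny illustrative helper for higher-is-better metrics:

  def score_band(score):
      # Map a 0-1 metric score to the interpretation bands above.
      if score >= 0.9:
          return "excellent"
      if score >= 0.7:
          return "good"
      if score >= 0.5:
          return "acceptable"
      return "needs significant improvement"

Boundary values (0.7, 0.9) are assigned to the higher band here.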

Next Steps