Metrics & Scoring
SupaEval computes comprehensive metrics to evaluate different aspects of your AI agent's performance. All metrics are calculated automatically during evaluation runs.
Retrieval Metrics
Measure how well your agent finds and ranks relevant information.
Precision@K
Measures what fraction of the top K retrieved documents are relevant.
Precision@K = (# relevant docs in top K) / K
Recall@K
Measures what fraction of all relevant documents are in the top K results.
Recall@K = (# relevant docs in top K) / (total relevant docs)
nDCG (Normalized Discounted Cumulative Gain)
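SupaEval computes these automatically during evaluation runs, but the definitions above are standard and easy to verify by hand. A minimal sketch in Python (the document names, `retrieved` and `relevant`, are illustrative, not part of the SupaEval API):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d2"]  # ranked results
relevant = {"d1", "d2", "d4"}               # ground-truth relevant set

precision_at_k(retrieved, relevant, 5)  # 2 of the top 5 are relevant -> 0.4
recall_at_k(retrieved, relevant, 5)     # 2 of 3 relevant docs found -> 0.667
```

Note that Precision@K divides by K even when fewer than K relevant documents exist, so a perfect retriever can still score below 1.0 on precision when the relevant set is small.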
Measures ranking quality, giving higher weight to relevant documents at top positions. Ranges from 0 to 1, where 1 is perfect ranking.
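The standard nDCG formulation divides the discounted gain of the actual ranking by that of the ideal (sorted) ranking. A sketch assuming graded relevance labels (SupaEval's exact gain function is not specified here):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1), positions from 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideally sorted ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ndcg([3, 2, 3, 0, 1])  # < 1.0: a rel-3 doc is ranked below a rel-2 doc
```

Because the ideal ordering scores exactly 1.0, nDCG is comparable across queries with different numbers of relevant documents.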
MRR (Mean Reciprocal Rank)
The mean, across queries, of the reciprocal rank of the first relevant document. Higher is better (1.0 = the first result is always relevant).
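As an illustration, MRR can be sketched in a few lines of Python (the per-query boolean-list input format is an assumption for this example, not SupaEval's schema):

```python
def mean_reciprocal_rank(rankings):
    """rankings: one list per query of booleans in ranked order,
    True meaning the document at that position is relevant."""
    total = 0.0
    for results in rankings:
        for rank, is_relevant in enumerate(results, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# Query 1: first result relevant (RR = 1); query 2: third result (RR = 1/3)
mean_reciprocal_rank([[True, False], [False, False, True]])  # (1 + 1/3) / 2
```

A query with no relevant results contributes 0, which pulls the mean down, so MRR penalizes complete misses as well as deep rankings.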
Generation Metrics
Evaluate the quality of generated text outputs.
Relevance
How well the generated output addresses the input query. Scored 0-1 using LLM-as-judge.
Faithfulness
Measures factual accuracy relative to source documents. Detects contradictions and unsupported claims.
Hallucination Score
Identifies statements not grounded in provided context. Lower is better.
Answer Similarity
Semantic similarity to expected output using embedding-based comparison.
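Embedding-based comparison typically means cosine similarity between embedding vectors of the generated and expected answers. A minimal sketch of the similarity step, assuming an embedding model is available separately (the `embed` name below is a placeholder, not a SupaEval function):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical usage with some embedding model:
# score = cosine_similarity(embed(generated_answer), embed(expected_answer))
```

Cosine similarity ranges from -1 to 1; for typical text embeddings scores cluster in the positive range, which is why the metric reads naturally on a 0-1 scale.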
Task Metrics
Operational metrics for production monitoring.
- Success Rate - Percentage of queries successfully completed
- Latency - Average response time (p50, p95, p99)
- Token Usage - Input and output tokens per query
- Cost - Estimated API costs based on token usage
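The latency percentiles listed above (p50, p95, p99) can be computed from raw per-query timings with a simple nearest-rank method. A sketch (SupaEval's internal interpolation method may differ; the sample data is made up):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 310, 150, 880, 132, 101, 240, 175, 98]
p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # tail latency
p99 = percentile(latencies_ms, 99)  # worst-case tail
```

Tail percentiles matter more than the average in production: a single slow outlier (880 ms above) barely moves the mean but dominates p95/p99.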
Custom Metrics
Need domain-specific evaluation? Contact us about custom metric development for your use case.
Common metric combinations by use case:
- RAG systems: Precision@K, relevance, faithfulness
- Chatbots: Relevance, answer similarity, task success
- Production: Latency, cost, success rate
Interpreting Scores
Most metrics are normalized to 0-1, where higher is better (the exception is hallucination score, where lower is better).
- 0.9-1.0 - Excellent performance
- 0.7-0.9 - Good performance
- 0.5-0.7 - Acceptable, room for improvement
- <0.5 - Needs significant improvement
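The bands above map directly to a simple threshold function, useful for dashboards or alerting on evaluation runs (the function name and labels are illustrative, not part of SupaEval):

```python
def interpret_score(score):
    """Map a normalized 0-1 score to the interpretation bands above.
    Applies to higher-is-better metrics; invert hallucination score first."""
    if score >= 0.9:
        return "Excellent performance"
    if score >= 0.7:
        return "Good performance"
    if score >= 0.5:
        return "Acceptable, room for improvement"
    return "Needs significant improvement"

interpret_score(0.82)  # "Good performance"
```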