By 2026, eval coverage is no longer the optional 'maturity' layer for production AI — it's the baseline operational requirement. Both because regressions are otherwise invisible (the agent just gets quietly worse over weeks) and because the compliance frameworks now expect it. The Dubai agentic AI mandate's monitoring requirements, ISO/IEC 42001's post-market monitoring controls, and the EU AI Act's continuous-conformity expectations all converge on the same answer: you need an eval pipeline.
This post is the practical comparison we use when scoping the observability stack on Agentic Pilot engagements. It's written for the engineering lead choosing tooling, not the executive deciding strategy.
What 'eval' actually means in production
Three distinct layers, often conflated:
- Tracing. Capturing every LLM call — prompts, responses, latency, cost, token counts — into a queryable store. The "what is this thing actually doing" layer.
- Eval runs. Scoring outputs against a golden set or rubric, on demand or in CI. The "is it good enough" layer.
- Drift detection. Continuous monitoring of production behaviour — pass rates, score distributions, failure modes — with alerts when behaviour changes meaningfully. The "is it still good enough" layer.
A mature stack has all three. A common mistake is to ship layer 1 only and call it observability — you can see what's happening, but you can't tell if it's working.
The three contenders
Langfuse
Open-source, MIT-licensed, with a hosted offering and a strong self-host story. Originally tracing-first, now full-stack with eval runs and dataset management.
Strengths: self-hosting maturity (Docker Compose to production deploys cleanly), data sovereignty (we run it on UAE infrastructure for Sovereign AI clients), low cost at scale (the self-hosted version is essentially free past licensing your own infrastructure), strong tracing for multi-step agents.
Weaknesses: experimentation tooling is less polished than Braintrust, fewer pre-built scorers, smaller ecosystem of integrations.
When to pick it: you need data sovereignty, you're running regulated workloads, you have engineering capacity to self-host, or your budget is tight and you have moderate scale.
Braintrust
Hosted-first, with a focus on developer experience and rapid iteration. Strong dataset management, side-by-side comparisons, and a polished UI for the iteration loop.
Strengths: best developer experience by a wide margin, fastest path from "I have an idea" to "I see if it worked", deep experimentation tooling, generous free tier for small projects.
Weaknesses: hosted-only (no self-host path for sovereign deployments), pricing scales aggressively past the free tier, less mature for very-large-scale ML workloads.
When to pick it: your team will live in evals daily, you need to iterate fast on prompts and models, data residency isn't a hard constraint, and you're at startup-to-mid-market scale.
Arize
Enterprise-positioned, strongest at large scale and on the ML-platform end of the spectrum. Originally an ML observability tool that added LLM capabilities; brings enterprise features (RBAC, audit, advanced cohort analysis) that the others are still maturing.
Strengths: enterprise feature set, strong at very large scale, integrates with broader ML observability if you're running classical ML alongside LLMs, deepest drift-detection tooling.
Weaknesses: higher cost, more setup complexity, overkill for projects that are LLM-only at modest scale.
When to pick it: large enterprise context, broader ML platform integration needed, or you have a heavy compliance overlay that requires the enterprise feature set.
What we deploy by default
-
Sovereign deployments and regulated industries: Langfuse, self-hosted on the same infrastructure as the inference cluster. Data never leaves jurisdiction. Same pattern we run on Hisabi.ai.
-
Standard cloud deployments where iteration speed matters: Braintrust, hosted. The DX advantage compounds when the team is iterating weekly on prompts.
-
Large enterprise with existing ML platforms: Arize. The integration with their existing ML observability is the deciding factor.
How to wire it into CI/CD
Evals belong in your deployment pipeline, not as a manual quarterly review.
The pattern that works:
- On every PR: run the golden set and the adversarial set. If pass rate drops below threshold, block merge.
- On every deploy: smoke-test against a small representative sample. If anything fails, rollback automatically.
- In production, continuously: sample 5–10% of real traffic, score in the background, alert on score-distribution drift.
GitHub Actions or your CI of choice handles (1) and (2). For (3), the eval tool's webhook or scheduled-run features handle the production sampling — Langfuse, Braintrust, and Arize all support this pattern, with varying ergonomics.
The eval-gated CI pattern is the cultural shift that takes longest. Engineers initially resent the build time and the false alarms; after the eval catches the first real regression, they convert. Plan for 4–8 weeks of culture work alongside the tooling.
What goes in the golden set
Building the golden set is the work that scopes the timeline of an eval setup. The set should:
- Cover the happy path (representative typical inputs)
- Cover known edge cases
- Cover adversarial inputs (prompt injection attempts, ambiguous queries, attempts to make the model misbehave)
- Be large enough to be statistically meaningful but small enough to run cheaply (50–500 cases is typical)
- Include known-good outputs or scoring rubrics
- Be versioned alongside your code
Curation is mostly human work. We typically sample real production traffic, hand-curate the most representative cases, and supplement with synthetic adversarial cases. For a single-agent project, expect 1–2 weeks of curation work to reach a useful golden set.
What good looks like at maturity
A mature eval setup, observed in our engagements past month 4:
- Pass rate against golden set is a tracked KPI alongside latency and cost
- Eval runs are sub-5 minutes; engineers run them locally before pushing
- Drift alerts have a defined response procedure (not "investigate someday")
- Adversarial set grows organically — production failures get added back as cases
- The eval set is an auditable artefact for ISO 42001 conformity
What goes wrong
The three failure modes we see most:
-
Static golden set. Built once, never updated. Production reality drifts away from the set; the eval becomes a rubber stamp. Fix: rotate cases quarterly, sample new ones from production.
-
LLM-as-judge without validation. Use a frontier model to grade outputs without ever validating that the judge agrees with humans. The judge's biases become the eval's biases silently. Fix: hand-label 100 cases, measure judge agreement, iterate on the rubric until agreement is acceptable.
-
Pass rate optimised for the eval, not for the user. Engineers tune prompts to pass the golden set; production behaviour doesn't actually improve. Fix: rotate the eval set, add held-out cases, monitor production directly via sampling.
Where Codenovai fits
We deploy the eval and observability stack on every Agentic Pilot engagement. The default is the Langfuse self-hosted pattern for regulated workloads, Braintrust for cloud-native iteration-heavy projects. Either way the eval harness is in CI from week 9 of the 90-day program — not bolted on afterward.
For existing workloads without eval coverage, we add it as part of Fractional AI Team retainers — typically 4–8 weeks to backfill a useful harness on a system that didn't have one. Book a scoping call.
