Why are AI evals different from regular software testing?

Three differences. First, AI outputs are non-deterministic and graded on quality not pass/fail — your test framework needs to score similarity, faithfulness, or rubric adherence rather than checking equality. Second, the evaluation set itself is a curated artefact (the 'golden set') that takes meaningful work to build and maintain — you can't auto-generate it from code. Third, evals catch behaviour regressions, not just correctness regressions — a model swap that 'works' in the unit-test sense can degrade output quality in ways pytest will never detect.

What's the minimum viable eval setup for a production agent?

Three things. (1) A golden set of 50–200 representative cases with known-good outputs. (2) A scoring function — could be string match, embedding similarity, LLM-as-judge with a rubric, or a hybrid. (3) A pipeline that runs the harness on every PR or scheduled cadence and surfaces failures. The simplest viable implementation is a Python script and a JSON file; the production-grade version uses a tool like Langfuse, Braintrust, or Arize. Start simple, upgrade when the simple version cracks under volume.

How do Langfuse, Braintrust, and Arize differ in practical terms?

Langfuse is open-source, self-hostable, strongest on tracing and lowest cost — best when you need data sovereignty or on-prem deployment. Braintrust is hosted-first, strongest developer experience, deepest experimentation tooling — best when you need fast iteration and the team will live in evals daily. Arize is enterprise-positioned, strongest at scale and ML model monitoring — best when AI evals integrate into a broader ML platform. All three handle LLM tracing, eval runs, and dashboarding; the differences show up in self-hosting maturity, pricing model, and tooling depth.

What does LLM-as-judge mean and when is it appropriate?

LLM-as-judge uses an LLM (usually a frontier model like Claude or GPT-4) to grade the output of another LLM against a rubric. It's appropriate for outputs where structural correctness matters (was the answer faithful to the source? did it follow the format? did it stay on-topic?) and inappropriate where you have ground truth available (in which case use exact match or embedding similarity). Costs more than mechanical scoring but captures qualitative dimensions you can't otherwise measure. Validate the judge against human-labelled cases before trusting it at scale.

How does eval coverage interact with ISO 42001 compliance?

Closely. ISO/IEC 42001 controls around post-market monitoring (Annex A.6.2.4 territory) effectively require an eval pipeline — you need documented evidence that AI systems perform within tolerance over time. The eval harness, drift detection, and review cadence become part of your AI Management System artefacts. Auditors increasingly ask 'show me your eval results from last quarter' as part of certification. A weak eval setup is a weak compliance posture for regulated workloads.

The AI Evals & Observability Stack in 2026: Langfuse vs Braintrust vs Arize

By 2026, eval coverage is no longer the optional 'maturity' layer for production AI — it's the baseline operational requirement. Both because regressions are otherwise invisible (the agent just gets quietly worse over weeks) and because the compliance frameworks now expect it. The Dubai agentic AI mandate's monitoring requirements, ISO/IEC 42001's post-market monitoring controls, and the EU AI Act's continuous-conformity expectations all converge on the same answer: you need an eval pipeline.

This post is the practical comparison we use when scoping the observability stack on Agentic Pilot engagements. It's written for the engineering lead choosing tooling, not the executive deciding strategy.

What 'eval' actually means in production

Three distinct layers, often conflated:

Tracing. Capturing every LLM call — prompts, responses, latency, cost, token counts — into a queryable store. The "what is this thing actually doing" layer.
Eval runs. Scoring outputs against a golden set or rubric, on demand or in CI. The "is it good enough" layer.
Drift detection. Continuous monitoring of production behaviour — pass rates, score distributions, failure modes — with alerts when behaviour changes meaningfully. The "is it still good enough" layer.

A mature stack has all three. A common mistake is to ship layer 1 only and call it observability — you can see what's happening, but you can't tell if it's working.

The three contenders

Langfuse

Open-source, MIT-licensed, with a hosted offering and a strong self-host story. Originally tracing-first, now full-stack with eval runs and dataset management.

Strengths: self-hosting maturity (Docker Compose to production deploys cleanly), data sovereignty (we run it on UAE infrastructure for Sovereign AI clients), low cost at scale (the self-hosted version is essentially free past licensing your own infrastructure), strong tracing for multi-step agents.

Weaknesses: experimentation tooling is less polished than Braintrust, fewer pre-built scorers, smaller ecosystem of integrations.

When to pick it: you need data sovereignty, you're running regulated workloads, you have engineering capacity to self-host, or your budget is tight and you have moderate scale.

Braintrust

Hosted-first, with a focus on developer experience and rapid iteration. Strong dataset management, side-by-side comparisons, and a polished UI for the iteration loop.

Strengths: best developer experience by a wide margin, fastest path from "I have an idea" to "I see if it worked", deep experimentation tooling, generous free tier for small projects.

Weaknesses: hosted-only (no self-host path for sovereign deployments), pricing scales aggressively past the free tier, less mature for very-large-scale ML workloads.

When to pick it: your team will live in evals daily, you need to iterate fast on prompts and models, data residency isn't a hard constraint, and you're at startup-to-mid-market scale.

Arize

Enterprise-positioned, strongest at large scale and on the ML-platform end of the spectrum. Originally an ML observability tool that added LLM capabilities; brings enterprise features (RBAC, audit, advanced cohort analysis) that the others are still maturing.

Strengths: enterprise feature set, strong at very large scale, integrates with broader ML observability if you're running classical ML alongside LLMs, deepest drift-detection tooling.

Weaknesses: higher cost, more setup complexity, overkill for projects that are LLM-only at modest scale.

When to pick it: large enterprise context, broader ML platform integration needed, or you have a heavy compliance overlay that requires the enterprise feature set.

What we deploy by default

For Codenovai engagements:

Sovereign deployments and regulated industries: Langfuse, self-hosted on the same infrastructure as the inference cluster. Data never leaves jurisdiction. Same pattern we run on Hisabi.ai.
Standard cloud deployments where iteration speed matters: Braintrust, hosted. The DX advantage compounds when the team is iterating weekly on prompts.
Large enterprise with existing ML platforms: Arize. The integration with their existing ML observability is the deciding factor.

How to wire it into CI/CD

Evals belong in your deployment pipeline, not as a manual quarterly review.

The pattern that works:

On every PR: run the golden set and the adversarial set. If pass rate drops below threshold, block merge.
On every deploy: smoke-test against a small representative sample. If anything fails, rollback automatically.
In production, continuously: sample 5–10% of real traffic, score in the background, alert on score-distribution drift.

GitHub Actions or your CI of choice handles (1) and (2). For (3), the eval tool's webhook or scheduled-run features handle the production sampling — Langfuse, Braintrust, and Arize all support this pattern, with varying ergonomics.

The eval-gated CI pattern is the cultural shift that takes longest. Engineers initially resent the build time and the false alarms; after the eval catches the first real regression, they convert. Plan for 4–8 weeks of culture work alongside the tooling.

What goes in the golden set

Building the golden set is the work that scopes the timeline of an eval setup. The set should:

Cover the happy path (representative typical inputs)
Cover known edge cases
Cover adversarial inputs (prompt injection attempts, ambiguous queries, attempts to make the model misbehave)
Be large enough to be statistically meaningful but small enough to run cheaply (50–500 cases is typical)
Include known-good outputs or scoring rubrics
Be versioned alongside your code

Curation is mostly human work. We typically sample real production traffic, hand-curate the most representative cases, and supplement with synthetic adversarial cases. For a single-agent project, expect 1–2 weeks of curation work to reach a useful golden set.

What good looks like at maturity

A mature eval setup, observed in our engagements past month 4:

Pass rate against golden set is a tracked KPI alongside latency and cost
Eval runs are sub-5 minutes; engineers run them locally before pushing
Drift alerts have a defined response procedure (not "investigate someday")
Adversarial set grows organically — production failures get added back as cases
The eval set is an auditable artefact for ISO 42001 conformity

What goes wrong

The three failure modes we see most:

Static golden set. Built once, never updated. Production reality drifts away from the set; the eval becomes a rubber stamp. Fix: rotate cases quarterly, sample new ones from production.
LLM-as-judge without validation. Use a frontier model to grade outputs without ever validating that the judge agrees with humans. The judge's biases become the eval's biases silently. Fix: hand-label 100 cases, measure judge agreement, iterate on the rubric until agreement is acceptable.
Pass rate optimised for the eval, not for the user. Engineers tune prompts to pass the golden set; production behaviour doesn't actually improve. Fix: rotate the eval set, add held-out cases, monitor production directly via sampling.

Where Codenovai fits

We deploy the eval and observability stack on every Agentic Pilot engagement. The default is the Langfuse self-hosted pattern for regulated workloads, Braintrust for cloud-native iteration-heavy projects. Either way the eval harness is in CI from week 9 of the 90-day program — not bolted on afterward.

For existing workloads without eval coverage, we add it as part of Fractional AI Team retainers — typically 4–8 weeks to backfill a useful harness on a system that didn't have one. Book a scoping call.